* [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration
@ 2024-11-04 10:21 Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 1/7] virtio_pci: Introduce device parts access commands Yishai Hadas
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

This series enhances the vfio-virtio driver to support live migration
for virtio-net Virtual Functions (VFs) that are migration-capable.
 
This series follows the Virtio 1.4 specification to implement the
necessary device parts commands, enabling a device to participate in the
live migration process.

The key VFIO features implemented include: VFIO_MIGRATION_STOP_COPY,
VFIO_MIGRATION_P2P, VFIO_MIGRATION_PRE_COPY.
 
The implementation integrates with the VFIO subsystem via vfio_pci_core
and incorporates Virtio-specific logic to handle the migration process.
 
Migration functionality follows the definitions in uapi/vfio.h and uses
the Virtio VF-to-PF admin queue command channel for executing the device
parts related commands.
 
Patch Overview:
The first four patches focus on the Virtio layer and address the
following:
- Define the layout of the device parts commands required as part of the
  migration process.
- Provide APIs to enable upper layers (e.g., VFIO, net) to execute the
  related device parts commands.
 
The last three patches focus on the VFIO layer:
- Extend the vfio-virtio driver to support live migration for Virtio-net
  VFs.
- Move legacy I/O operations to a separate file, which is compiled only
  when VIRTIO_PCI_ADMIN_LEGACY is configured, ensuring that live
  migration depends solely on VIRTIO_PCI.
 
Additional Notes:
- The kernel protocol between the source and target devices includes a
  header containing metadata such as record size, tag, and flags.
  The record size allows the target to read a complete image from the
  source before passing the device part data. This follows the Virtio
  specification, which mandates that partial device parts are not
  supplied. The tag and flags serve as placeholders for future extensions
  to the kernel protocol between the source and target, ensuring backward
  and forward compatibility.
 
- Both the source and target comply with the Virtio specification by
  using a device part object with a unique ID during the migration
  process. As this resource is limited to a maximum of 255, its lifecycle
  is confined to periods when live migration is active.

- According to the Virtio specification, a device has only two states:
  RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g.,
  RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When
  transitioning to RUNNING_P2P, the device state is set to STOP and
  remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at
  which point it resumes its RUNNING state. During transition to STOP,
  the virtio device only stops initiating outgoing requests (e.g., DMA,
  MSI-X), but it must still accept incoming operations.

- Furthermore, the Virtio specification does not support reading partial
  or incremental device contexts. This means that during the PRE_COPY
  state, the vfio-virtio driver reads the full device state. This step is
  beneficial because it allows the device to send some "initial data"
  before moving to the STOP_COPY state, thus reducing downtime by
  preparing and warming up early. Since the device state can change, and
  the benefit is greatest when the PRE_COPY data closely matches the
  final data, we read it in a rate-limited manner, reporting no data
  available for some time interval after the previous call (see the
  sketch following these notes). With PRE_COPY enabled,
  we observed a downtime reduction of approximately 70-75% in various
  scenarios compared to when PRE_COPY was disabled, while keeping the
  total migration time nearly the same.

- Support for dirty page tracking during migration will be provided via
  the IOMMUFD framework.
 
- This series has been successfully tested on Virtio-net VF devices.
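
As a rough illustration of the PRE_COPY rate limiting mentioned above
(a sketch only: the helper name and the interval value are hypothetical;
the real logic is part of patch #6):

/*
 * Sketch: read a fresh full device image at most once per interval,
 * otherwise report that no new pre-copy data is available yet.
 */
#define VIRTIOVF_PRE_COPY_INTERVAL_MS 100	/* assumed value */

static bool virtiovf_pre_copy_data_ready(unsigned long *last_read_jiffies)
{
	if (time_before(jiffies, *last_read_jiffies +
			msecs_to_jiffies(VIRTIOVF_PRE_COPY_INTERVAL_MS)))
		return false;	/* too soon: report no data available */

	*last_read_jiffies = jiffies;
	return true;		/* read the full device state again */
}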

Changes from V0:
https://lore.kernel.org/kvm/20241101102518.1bf2c6e6.alex.williamson@redhat.com/T/

Vfio:
Patch #5:
- Enhance the commit log to provide a clearer explanation of P2P
  behavior over Virtio devices, as discussed on the mailing list.
Patch #6:
- Implement the rate limiter mechanism as part of the PRE_COPY state,
  following Alex’s suggestion.
- Update the commit log to include actual data demonstrating the impact of
  PRE_COPY, as requested by Alex.
Patch #7:
- Update the default driver operations (i.e., vfio_device_ops) to use
  the live migration set, and expand it to include the legacy I/O
  operations if they are compiled and supported.

Yishai

Yishai Hadas (7):
  virtio_pci: Introduce device parts access commands
  virtio: Extend the admin command to include the result size
  virtio: Manage device and driver capabilities via the admin commands
  virtio-pci: Introduce APIs to execute device parts admin commands
  vfio/virtio: Add support for the basic live migration functionality
  vfio/virtio: Add PRE_COPY support for live migration
  vfio/virtio: Enable live migration once VIRTIO_PCI was configured

 drivers/vfio/pci/virtio/Kconfig     |    4 +-
 drivers/vfio/pci/virtio/Makefile    |    3 +-
 drivers/vfio/pci/virtio/common.h    |  127 +++
 drivers/vfio/pci/virtio/legacy_io.c |  420 +++++++++
 drivers/vfio/pci/virtio/main.c      |  500 ++--------
 drivers/vfio/pci/virtio/migrate.c   | 1336 +++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.h  |   19 +-
 drivers/virtio/virtio_pci_modern.c  |  457 ++++++++-
 include/linux/virtio.h              |    1 +
 include/linux/virtio_pci_admin.h    |   11 +
 include/uapi/linux/virtio_pci.h     |  131 +++
 11 files changed, 2594 insertions(+), 415 deletions(-)
 create mode 100644 drivers/vfio/pci/virtio/common.h
 create mode 100644 drivers/vfio/pci/virtio/legacy_io.c
 create mode 100644 drivers/vfio/pci/virtio/migrate.c

-- 
2.27.0



* [PATCH V1 vfio 1/7] virtio_pci: Introduce device parts access commands
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 2/7] virtio: Extend the admin command to include the result size Yishai Hadas
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Introduce device parts access commands via the admin queue.

These commands and their structure adhere to the Virtio 1.4
specification.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/uapi/linux/virtio_pci.h | 131 ++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index a8208492e822..1beb317df1b9 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -40,6 +40,7 @@
 #define _LINUX_VIRTIO_PCI_H
 
 #include <linux/types.h>
+#include <linux/kernel.h>
 
 #ifndef VIRTIO_PCI_NO_LEGACY
 
@@ -240,6 +241,17 @@ struct virtio_pci_cfg_cap {
 #define VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ		0x5
 #define VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO		0x6
 
+/* Device parts access commands. */
+#define VIRTIO_ADMIN_CMD_CAP_ID_LIST_QUERY		0x7
+#define VIRTIO_ADMIN_CMD_DEVICE_CAP_GET			0x8
+#define VIRTIO_ADMIN_CMD_DRIVER_CAP_SET			0x9
+#define VIRTIO_ADMIN_CMD_RESOURCE_OBJ_CREATE		0xa
+#define VIRTIO_ADMIN_CMD_RESOURCE_OBJ_DESTROY		0xd
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_GET		0xe
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_GET			0xf
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_SET			0x10
+#define VIRTIO_ADMIN_CMD_DEV_MODE_SET			0x11
+
 struct virtio_admin_cmd_hdr {
 	__le16 opcode;
 	/*
@@ -286,4 +298,123 @@ struct virtio_admin_cmd_notify_info_result {
 	struct virtio_admin_cmd_notify_info_data entries[VIRTIO_ADMIN_CMD_MAX_NOTIFY_INFO];
 };
 
+#define VIRTIO_DEV_PARTS_CAP 0x0000
+
+struct virtio_dev_parts_cap {
+	__u8 get_parts_resource_objects_limit;
+	__u8 set_parts_resource_objects_limit;
+};
+
+#define MAX_CAP_ID __KERNEL_DIV_ROUND_UP(VIRTIO_DEV_PARTS_CAP + 1, 64)
+
+struct virtio_admin_cmd_query_cap_id_result {
+	__le64 supported_caps[MAX_CAP_ID];
+};
+
+struct virtio_admin_cmd_cap_get_data {
+	__le16 id;
+	__u8 reserved[6];
+};
+
+struct virtio_admin_cmd_cap_set_data {
+	__le16 id;
+	__u8 reserved[6];
+	__u8 cap_specific_data[];
+};
+
+struct virtio_admin_cmd_resource_obj_cmd_hdr {
+	__le16 type;
+	__u8 reserved[2];
+	__le32 id; /* Indicates unique resource object id per resource object type */
+};
+
+struct virtio_admin_cmd_resource_obj_create_data {
+	struct virtio_admin_cmd_resource_obj_cmd_hdr hdr;
+	__le64 flags;
+	__u8 resource_obj_specific_data[];
+};
+
+#define VIRTIO_RESOURCE_OBJ_DEV_PARTS 0
+
+#define VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET 0
+#define VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_SET 1
+
+struct virtio_resource_obj_dev_parts {
+	__u8 type;
+	__u8 reserved[7];
+};
+
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE 0
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_COUNT 1
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_LIST 2
+
+struct virtio_admin_cmd_dev_parts_metadata_data {
+	struct virtio_admin_cmd_resource_obj_cmd_hdr hdr;
+	__u8 type;
+	__u8 reserved[7];
+};
+
+#define VIRTIO_DEV_PART_F_OPTIONAL 0
+
+struct virtio_dev_part_hdr {
+	__le16 part_type;
+	__u8 flags;
+	__u8 reserved;
+	union {
+		struct {
+			__le32 offset;
+			__le32 reserved;
+		} pci_common_cfg;
+		struct {
+			__le16 index;
+			__u8 reserved[6];
+		} vq_index;
+	} selector;
+	__le32 length;
+};
+
+struct virtio_dev_part {
+	struct virtio_dev_part_hdr hdr;
+	__u8 value[];
+};
+
+struct virtio_admin_cmd_dev_parts_metadata_result {
+	union {
+		struct {
+			__le32 size;
+			__le32 reserved;
+		} parts_size;
+		struct {
+			__le32 count;
+			__le32 reserved;
+		} hdr_list_count;
+		struct {
+			__le32 count;
+			__le32 reserved;
+			struct virtio_dev_part_hdr hdrs[];
+		} hdr_list;
+	};
+};
+
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_GET_TYPE_SELECTED 0
+#define VIRTIO_ADMIN_CMD_DEV_PARTS_GET_TYPE_ALL 1
+
+struct virtio_admin_cmd_dev_parts_get_data {
+	struct virtio_admin_cmd_resource_obj_cmd_hdr hdr;
+	__u8 type;
+	__u8 reserved[7];
+	struct virtio_dev_part_hdr hdr_list[];
+};
+
+struct virtio_admin_cmd_dev_parts_set_data {
+	struct virtio_admin_cmd_resource_obj_cmd_hdr hdr;
+	struct virtio_dev_part parts[];
+};
+
+#define VIRTIO_ADMIN_CMD_DEV_MODE_F_STOPPED 0
+
+struct virtio_admin_cmd_dev_mode_set_data {
+	__u8 flags;
+};
+
 #endif
-- 
2.27.0



* [PATCH V1 vfio 2/7] virtio: Extend the admin command to include the result size
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 1/7] virtio_pci: Introduce device parts access commands Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 3/7] virtio: Manage device and driver capabilities via the admin commands Yishai Hadas
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Extend the admin command by incorporating a result size field.

This allows higher layers to determine the actual result size from the
backend when this information is not included in the result_sg.

The additional information introduced here will be used in subsequent
patches of this series.
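
A minimal sketch of how an upper layer consumes the new field (this is
the pattern virtio_pci_admin_dev_parts_get() follows later in this
series; the wrapper name below is illustrative and most of the command
setup is omitted):

static int example_get_result_size(struct virtio_device *virtio_dev,
				   struct scatterlist *res_sg,
				   u32 *res_size)
{
	struct virtio_admin_cmd cmd = {};
	int ret;

	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DEV_PARTS_GET);
	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
	cmd.result_sg = res_sg;
	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
	if (!ret)
		/* Bytes actually written by the device into result_sg */
		*res_size = cmd.result_sg_size;
	return ret;
}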

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/virtio/virtio_pci_modern.c | 4 +++-
 include/linux/virtio.h             | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 9193c30d640a..487d04610ecb 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -64,8 +64,10 @@ void vp_modern_avq_done(struct virtqueue *vq)
 	spin_lock_irqsave(&admin_vq->lock, flags);
 	do {
 		virtqueue_disable_cb(vq);
-		while ((cmd = virtqueue_get_buf(vq, &len)))
+		while ((cmd = virtqueue_get_buf(vq, &len))) {
+			cmd->result_sg_size = len;
 			complete(&cmd->completion);
+		}
 	} while (!virtqueue_enable_cb(vq));
 	spin_unlock_irqrestore(&admin_vq->lock, flags);
 }
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 306137a15d07..b5f7a611715a 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -111,6 +111,7 @@ struct virtio_admin_cmd {
 	struct scatterlist *data_sg;
 	struct scatterlist *result_sg;
 	struct completion completion;
+	u32 result_sg_size;
 	int ret;
 };
 
-- 
2.27.0



* [PATCH V1 vfio 3/7] virtio: Manage device and driver capabilities via the admin commands
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 1/7] virtio_pci: Introduce device parts access commands Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 2/7] virtio: Extend the admin command to include the result size Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 4/7] virtio-pci: Introduce APIs to execute device parts " Yishai Hadas
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Manage device and driver capabilities via the admin commands.

The device exposes its supported features and resource object limits via
an administrative command called VIRTIO_ADMIN_CMD_CAP_ID_LIST_QUERY,
using the 'self' group type.

Each capability is identified by a unique ID, and the driver
communicates the functionality and resource limits it plans to utilize.

The capability VIRTIO_DEV_PARTS_CAP specifically represents the device's
parts resource object limit.

Manage the device's parts resource object ID using a common IDA for both
get and set operations.
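
For example, the presence of the device parts capability is checked
against the returned bitmap as follows (the same test
virtio_pci_admin_cmd_cap_init() performs below):

	if (le64_to_cpu(data->supported_caps[0]) & (1 << VIRTIO_DEV_PARTS_CAP))
		/* Negotiate the device parts resource object limits */
		virtio_pci_admin_cmd_dev_parts_objects_enable(virtio_dev);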

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/virtio/virtio_pci_common.h |  13 +++-
 drivers/virtio/virtio_pci_modern.c | 105 +++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 1d9c49947f52..04b1d17663b3 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -48,6 +48,9 @@ struct virtio_pci_admin_vq {
 	/* Protects virtqueue access. */
 	spinlock_t lock;
 	u64 supported_cmds;
+	u64 supported_caps;
+	u8 max_dev_parts_objects;
+	struct ida dev_parts_ida;
 	/* Name of the admin queue: avq.$vq_index. */
 	char name[10];
 	u16 vq_index;
@@ -167,15 +170,21 @@ struct virtio_device *virtio_pci_vf_get_pf_dev(struct pci_dev *pdev);
 	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
 	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
 
+#define VIRTIO_DEV_PARTS_ADMIN_CMD_BITMAP \
+	(BIT_ULL(VIRTIO_ADMIN_CMD_CAP_ID_LIST_QUERY) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DRIVER_CAP_SET) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DEVICE_CAP_GET))
+
 /* Unlike modern drivers which support hardware virtio devices, legacy drivers
  * assume software-based devices: e.g. they don't use proper memory barriers
  * on ARM, use big endian on PPC, etc. X86 drivers are mostly ok though, more
  * or less by chance. For now, only support legacy IO on X86.
  */
 #ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
-#define VIRTIO_ADMIN_CMD_BITMAP VIRTIO_LEGACY_ADMIN_CMD_BITMAP
+#define VIRTIO_ADMIN_CMD_BITMAP (VIRTIO_LEGACY_ADMIN_CMD_BITMAP | \
+				 VIRTIO_DEV_PARTS_ADMIN_CMD_BITMAP)
 #else
-#define VIRTIO_ADMIN_CMD_BITMAP 0
+#define VIRTIO_ADMIN_CMD_BITMAP VIRTIO_DEV_PARTS_ADMIN_CMD_BITMAP
 #endif
 
 void vp_modern_avq_done(struct virtqueue *vq);
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 487d04610ecb..8ddac2829bc8 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -230,12 +230,117 @@ static void virtio_pci_admin_cmd_list_init(struct virtio_device *virtio_dev)
 	kfree(data);
 }
 
+static void
+virtio_pci_admin_cmd_dev_parts_objects_enable(struct virtio_device *virtio_dev)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(virtio_dev);
+	struct virtio_admin_cmd_cap_get_data *get_data;
+	struct virtio_admin_cmd_cap_set_data *set_data;
+	struct virtio_dev_parts_cap *result;
+	struct virtio_admin_cmd cmd = {};
+	struct scatterlist result_sg;
+	struct scatterlist data_sg;
+	u8 resource_objects_limit;
+	u16 set_data_size;
+	int ret;
+
+	get_data = kzalloc(sizeof(*get_data), GFP_KERNEL);
+	if (!get_data)
+		return;
+
+	result = kzalloc(sizeof(*result), GFP_KERNEL);
+	if (!result)
+		goto end;
+
+	get_data->id = cpu_to_le16(VIRTIO_DEV_PARTS_CAP);
+	sg_init_one(&data_sg, get_data, sizeof(*get_data));
+	sg_init_one(&result_sg, result, sizeof(*result));
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DEVICE_CAP_GET);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.data_sg = &data_sg;
+	cmd.result_sg = &result_sg;
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+	if (ret)
+		goto err_get;
+
+	set_data_size = sizeof(*set_data) + sizeof(*result);
+	set_data = kzalloc(set_data_size, GFP_KERNEL);
+	if (!set_data)
+		goto err_get;
+
+	set_data->id = cpu_to_le16(VIRTIO_DEV_PARTS_CAP);
+
+	/* Set the limit to the minimum value between the GET and SET values
+	 * supported by the device. Since the obj_id for VIRTIO_DEV_PARTS_CAP
+	 * is a globally unique value per PF, there is no possibility of
+	 * overlap between GET and SET operations.
+	 */
+	resource_objects_limit = min(result->get_parts_resource_objects_limit,
+				     result->set_parts_resource_objects_limit);
+	result->get_parts_resource_objects_limit = resource_objects_limit;
+	result->set_parts_resource_objects_limit = resource_objects_limit;
+	memcpy(set_data->cap_specific_data, result, sizeof(*result));
+	sg_init_one(&data_sg, set_data, set_data_size);
+	cmd.data_sg = &data_sg;
+	cmd.result_sg = NULL;
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DRIVER_CAP_SET);
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+	if (ret)
+		goto err_set;
+
+	/* Use an IDA to manage the dev parts object ids */
+	ida_init(&vp_dev->admin_vq.dev_parts_ida);
+	vp_dev->admin_vq.max_dev_parts_objects = resource_objects_limit;
+
+err_set:
+	kfree(set_data);
+err_get:
+	kfree(result);
+end:
+	kfree(get_data);
+}
+
+static void virtio_pci_admin_cmd_cap_init(struct virtio_device *virtio_dev)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(virtio_dev);
+	struct virtio_admin_cmd_query_cap_id_result *data;
+	struct virtio_admin_cmd cmd = {};
+	struct scatterlist result_sg;
+	int ret;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return;
+
+	sg_init_one(&result_sg, data, sizeof(*data));
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_CAP_ID_LIST_QUERY);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.result_sg = &result_sg;
+
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+	if (ret)
+		goto end;
+
+	/* Max number of caps fits into a single u64 */
+	BUILD_BUG_ON(sizeof(data->supported_caps) > sizeof(u64));
+
+	vp_dev->admin_vq.supported_caps = le64_to_cpu(data->supported_caps[0]);
+
+	if (!(vp_dev->admin_vq.supported_caps & (1 << VIRTIO_DEV_PARTS_CAP)))
+		goto end;
+
+	virtio_pci_admin_cmd_dev_parts_objects_enable(virtio_dev);
+end:
+	kfree(data);
+}
+
 static void vp_modern_avq_activate(struct virtio_device *vdev)
 {
 	if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ))
 		return;
 
 	virtio_pci_admin_cmd_list_init(vdev);
+	virtio_pci_admin_cmd_cap_init(vdev);
 }
 
 static void vp_modern_avq_cleanup(struct virtio_device *vdev)
-- 
2.27.0



* [PATCH V1 vfio 4/7] virtio-pci: Introduce APIs to execute device parts admin commands
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
                   ` (2 preceding siblings ...)
  2024-11-04 10:21 ` [PATCH V1 vfio 3/7] virtio: Manage device and driver capabilities via the admin commands Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-04 10:21 ` [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality Yishai Hadas
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Introduce APIs to handle the execution of device parts admin commands.

These APIs cover functionalities such as mode setting, object creation
and destruction, and operations like parts get/set and metadata
retrieval.

These APIs will be utilized in upcoming patches within this series.
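
A rough usage sketch of the new APIs (the vfio-virtio driver in the
following patches is the real consumer; the function name below is
illustrative and error handling is trimmed):

static int example_save_device_parts(struct pci_dev *vf_pdev,
				     struct scatterlist *sg, u32 *res_size)
{
	u32 obj_id, parts_size;
	int ret;

	ret = virtio_pci_admin_obj_create(vf_pdev,
				VIRTIO_RESOURCE_OBJ_DEV_PARTS,
				VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET,
				&obj_id);
	if (ret)
		return ret;

	/*
	 * How many bytes the full device parts image will take; a real
	 * caller would size 'sg' accordingly before reading.
	 */
	ret = virtio_pci_admin_dev_parts_metadata_get(vf_pdev,
				VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id,
				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
				&parts_size);
	if (!ret)
		/* Read the full image into the caller's scatterlist */
		ret = virtio_pci_admin_dev_parts_get(vf_pdev,
				VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id,
				VIRTIO_ADMIN_CMD_DEV_PARTS_GET_TYPE_ALL,
				sg, res_size);

	virtio_pci_admin_obj_destroy(vf_pdev,
				     VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id);
	return ret;
}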

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/virtio/virtio_pci_common.h |   8 +-
 drivers/virtio/virtio_pci_modern.c | 348 +++++++++++++++++++++++++++++
 include/linux/virtio_pci_admin.h   |  11 +
 3 files changed, 366 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 04b1d17663b3..0d00740cca07 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -173,7 +173,13 @@ struct virtio_device *virtio_pci_vf_get_pf_dev(struct pci_dev *pdev);
 #define VIRTIO_DEV_PARTS_ADMIN_CMD_BITMAP \
 	(BIT_ULL(VIRTIO_ADMIN_CMD_CAP_ID_LIST_QUERY) | \
 	 BIT_ULL(VIRTIO_ADMIN_CMD_DRIVER_CAP_SET) | \
-	 BIT_ULL(VIRTIO_ADMIN_CMD_DEVICE_CAP_GET))
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DEVICE_CAP_GET) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_RESOURCE_OBJ_CREATE) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_RESOURCE_OBJ_DESTROY) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_GET) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DEV_PARTS_GET) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DEV_PARTS_SET) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_DEV_MODE_SET))
 
 /* Unlike modern drivers which support hardware virtio devices, legacy drivers
  * assume software-based devices: e.g. they don't use proper memory barriers
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 8ddac2829bc8..3f5aba71cfde 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -15,6 +15,7 @@
  */
 
 #include <linux/delay.h>
+#include <linux/virtio_pci_admin.h>
 #define VIRTIO_PCI_NO_LEGACY
 #define VIRTIO_RING_NO_LEGACY
 #include "virtio_pci_common.h"
@@ -875,6 +876,353 @@ static bool vp_get_shm_region(struct virtio_device *vdev,
 	return true;
 }
 
+/*
+ * virtio_pci_admin_has_dev_parts - Checks whether the device parts
+ * functionality is supported
+ * @pdev: VF pci_dev
+ *
+ * Returns true if the device parts functionality is supported.
+ */
+bool virtio_pci_admin_has_dev_parts(struct pci_dev *pdev)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	struct virtio_pci_device *vp_dev;
+
+	if (!virtio_dev)
+		return false;
+
+	if (!virtio_has_feature(virtio_dev, VIRTIO_F_ADMIN_VQ))
+		return false;
+
+	vp_dev = to_vp_device(virtio_dev);
+
+	if (!((vp_dev->admin_vq.supported_cmds & VIRTIO_DEV_PARTS_ADMIN_CMD_BITMAP) ==
+		VIRTIO_DEV_PARTS_ADMIN_CMD_BITMAP))
+		return false;
+
+	return vp_dev->admin_vq.max_dev_parts_objects;
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_has_dev_parts);
+
+/*
+ * virtio_pci_admin_mode_set - Sets the mode of a member device
+ * @pdev: VF pci_dev
+ * @flags: device mode's flags
+ *
+ * Note: caller must serialize access for the given device.
+ * Returns 0 on success, or negative on failure.
+ */
+int virtio_pci_admin_mode_set(struct pci_dev *pdev, u8 flags)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	struct virtio_admin_cmd_dev_mode_set_data *data;
+	struct virtio_admin_cmd cmd = {};
+	struct scatterlist data_sg;
+	int vf_id;
+	int ret;
+
+	if (!virtio_dev)
+		return -ENODEV;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return vf_id;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->flags = flags;
+	sg_init_one(&data_sg, data, sizeof(*data));
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DEV_MODE_SET);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.group_member_id = cpu_to_le64(vf_id + 1);
+	cmd.data_sg = &data_sg;
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+
+	kfree(data);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_mode_set);
+
+/*
+ * virtio_pci_admin_obj_create - Creates an object for a given type and operation,
+ * following the max objects that can be created for that request.
+ * @pdev: VF pci_dev
+ * @obj_type: Object type
+ * @operation_type: Operation type
+ * @obj_id: Output unique object id
+ *
+ * Note: caller must serialize access for the given device.
+ * Returns 0 on success, or negative on failure.
+ */
+int virtio_pci_admin_obj_create(struct pci_dev *pdev, u16 obj_type, u8 operation_type,
+				u32 *obj_id)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	u16 data_size = sizeof(struct virtio_admin_cmd_resource_obj_create_data);
+	struct virtio_admin_cmd_resource_obj_create_data *obj_create_data;
+	struct virtio_resource_obj_dev_parts obj_dev_parts = {};
+	struct virtio_pci_admin_vq *avq;
+	struct virtio_admin_cmd cmd = {};
+	struct scatterlist data_sg;
+	void *data;
+	int id = -1;
+	int vf_id;
+	int ret;
+
+	if (!virtio_dev)
+		return -ENODEV;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return vf_id;
+
+	if (obj_type != VIRTIO_RESOURCE_OBJ_DEV_PARTS)
+		return -EOPNOTSUPP;
+
+	if (operation_type != VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET &&
+	    operation_type != VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_SET)
+		return -EINVAL;
+
+	avq = &to_vp_device(virtio_dev)->admin_vq;
+	if (!avq->max_dev_parts_objects)
+		return -EOPNOTSUPP;
+
+	id = ida_alloc_range(&avq->dev_parts_ida, 0,
+			     avq->max_dev_parts_objects - 1, GFP_KERNEL);
+	if (id < 0)
+		return id;
+
+	*obj_id = id;
+	data_size += sizeof(obj_dev_parts);
+	data = kzalloc(data_size, GFP_KERNEL);
+	if (!data) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	obj_create_data = data;
+	obj_create_data->hdr.type = cpu_to_le16(obj_type);
+	obj_create_data->hdr.id = cpu_to_le32(*obj_id);
+	obj_dev_parts.type = operation_type;
+	memcpy(obj_create_data->resource_obj_specific_data, &obj_dev_parts,
+	       sizeof(obj_dev_parts));
+	sg_init_one(&data_sg, data, data_size);
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_RESOURCE_OBJ_CREATE);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.group_member_id = cpu_to_le64(vf_id + 1);
+	cmd.data_sg = &data_sg;
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+
+	kfree(data);
+end:
+	if (ret)
+		ida_free(&avq->dev_parts_ida, id);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_obj_create);
+
+/*
+ * virtio_pci_admin_obj_destroy - Destroys an object of a given type and id
+ * @pdev: VF pci_dev
+ * @obj_type: Object type
+ * @id: Object id
+ *
+ * Note: caller must serialize access for the given device.
+ * Returns 0 on success, or negative on failure.
+ */
+int virtio_pci_admin_obj_destroy(struct pci_dev *pdev, u16 obj_type, u32 id)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	struct virtio_admin_cmd_resource_obj_cmd_hdr *data;
+	struct virtio_pci_device *vp_dev;
+	struct virtio_admin_cmd cmd = {};
+	struct scatterlist data_sg;
+	int vf_id;
+	int ret;
+
+	if (!virtio_dev)
+		return -ENODEV;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return vf_id;
+
+	if (obj_type != VIRTIO_RESOURCE_OBJ_DEV_PARTS)
+		return -EINVAL;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->type = cpu_to_le16(obj_type);
+	data->id = cpu_to_le32(id);
+	sg_init_one(&data_sg, data, sizeof(*data));
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_RESOURCE_OBJ_DESTROY);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.group_member_id = cpu_to_le64(vf_id + 1);
+	cmd.data_sg = &data_sg;
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+	if (!ret) {
+		vp_dev = to_vp_device(virtio_dev);
+		ida_free(&vp_dev->admin_vq.dev_parts_ida, id);
+	}
+
+	kfree(data);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_obj_destroy);
+
+/*
+ * virtio_pci_admin_dev_parts_metadata_get - Gets the metadata of the device parts
+ * identified by the below attributes.
+ * @pdev: VF pci_dev
+ * @obj_type: Object type
+ * @id: Object id
+ * @metadata_type: Metadata type
+ * @out: Upon success holds the output for 'metadata type size'
+ *
+ * Note: caller must serialize access for the given device.
+ * Returns 0 on success, or negative on failure.
+ */
+int virtio_pci_admin_dev_parts_metadata_get(struct pci_dev *pdev, u16 obj_type,
+					    u32 id, u8 metadata_type, u32 *out)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	struct virtio_admin_cmd_dev_parts_metadata_result *result;
+	struct virtio_admin_cmd_dev_parts_metadata_data *data;
+	struct scatterlist data_sg, result_sg;
+	struct virtio_admin_cmd cmd = {};
+	int vf_id;
+	int ret;
+
+	if (!virtio_dev)
+		return -ENODEV;
+
+	if (metadata_type != VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE)
+		return -EOPNOTSUPP;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return vf_id;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	result = kzalloc(sizeof(*result), GFP_KERNEL);
+	if (!result) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	data->hdr.type = cpu_to_le16(obj_type);
+	data->hdr.id = cpu_to_le32(id);
+	data->type = metadata_type;
+	sg_init_one(&data_sg, data, sizeof(*data));
+	sg_init_one(&result_sg, result, sizeof(*result));
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_GET);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.group_member_id = cpu_to_le64(vf_id + 1);
+	cmd.data_sg = &data_sg;
+	cmd.result_sg = &result_sg;
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+	if (!ret)
+		*out = le32_to_cpu(result->parts_size.size);
+
+	kfree(result);
+end:
+	kfree(data);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_dev_parts_metadata_get);
+
+/*
+ * virtio_pci_admin_dev_parts_get - Gets the device parts identified by the below attributes.
+ * @pdev: VF pci_dev
+ * @obj_type: Object type
+ * @id: Object id
+ * @get_type: Get type
+ * @res_sg: Upon success holds the output result data
+ * @res_size: Upon success holds the output result size
+ *
+ * Note: caller must serialize access for the given device.
+ * Returns 0 on success, or negative on failure.
+ */
+int virtio_pci_admin_dev_parts_get(struct pci_dev *pdev, u16 obj_type, u32 id,
+				   u8 get_type, struct scatterlist *res_sg,
+				   u32 *res_size)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	struct virtio_admin_cmd_dev_parts_get_data *data;
+	struct scatterlist data_sg;
+	struct virtio_admin_cmd cmd = {};
+	int vf_id;
+	int ret;
+
+	if (!virtio_dev)
+		return -ENODEV;
+
+	if (get_type != VIRTIO_ADMIN_CMD_DEV_PARTS_GET_TYPE_ALL)
+		return -EOPNOTSUPP;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return vf_id;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->hdr.type = cpu_to_le16(obj_type);
+	data->hdr.id = cpu_to_le32(id);
+	data->type = get_type;
+	sg_init_one(&data_sg, data, sizeof(*data));
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DEV_PARTS_GET);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.group_member_id = cpu_to_le64(vf_id + 1);
+	cmd.data_sg = &data_sg;
+	cmd.result_sg = res_sg;
+	ret = vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+	if (!ret)
+		*res_size = cmd.result_sg_size;
+
+	kfree(data);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_dev_parts_get);
+
+/*
+ * virtio_pci_admin_dev_parts_set - Sets the device parts identified by the below attributes.
+ * @pdev: VF pci_dev
+ * @data_sg: The device parts data, its layout follows struct virtio_admin_cmd_dev_parts_set_data
+ *
+ * Note: caller must serialize access for the given device.
+ * Returns 0 on success, or negative on failure.
+ */
+int virtio_pci_admin_dev_parts_set(struct pci_dev *pdev, struct scatterlist *data_sg)
+{
+	struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
+	struct virtio_admin_cmd cmd = {};
+	int vf_id;
+
+	if (!virtio_dev)
+		return -ENODEV;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return vf_id;
+
+	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_DEV_PARTS_SET);
+	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
+	cmd.group_member_id = cpu_to_le64(vf_id + 1);
+	cmd.data_sg = data_sg;
+	return vp_modern_admin_cmd_exec(virtio_dev, &cmd);
+}
+EXPORT_SYMBOL_GPL(virtio_pci_admin_dev_parts_set);
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.get		= NULL,
 	.set		= NULL,
diff --git a/include/linux/virtio_pci_admin.h b/include/linux/virtio_pci_admin.h
index f4a100a0fe2e..dffc92c17ad2 100644
--- a/include/linux/virtio_pci_admin.h
+++ b/include/linux/virtio_pci_admin.h
@@ -20,4 +20,15 @@ int virtio_pci_admin_legacy_io_notify_info(struct pci_dev *pdev,
 					   u64 *bar_offset);
 #endif
 
+bool virtio_pci_admin_has_dev_parts(struct pci_dev *pdev);
+int virtio_pci_admin_mode_set(struct pci_dev *pdev, u8 mode);
+int virtio_pci_admin_obj_create(struct pci_dev *pdev, u16 obj_type, u8 operation_type,
+				u32 *obj_id);
+int virtio_pci_admin_obj_destroy(struct pci_dev *pdev, u16 obj_type, u32 id);
+int virtio_pci_admin_dev_parts_metadata_get(struct pci_dev *pdev, u16 obj_type,
+					    u32 id, u8 metadata_type, u32 *out);
+int virtio_pci_admin_dev_parts_get(struct pci_dev *pdev, u16 obj_type, u32 id,
+				   u8 get_type, struct scatterlist *res_sg, u32 *res_size);
+int virtio_pci_admin_dev_parts_set(struct pci_dev *pdev, struct scatterlist *data_sg);
+
 #endif /* _LINUX_VIRTIO_PCI_ADMIN_H */
-- 
2.27.0



* [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
                   ` (3 preceding siblings ...)
  2024-11-04 10:21 ` [PATCH V1 vfio 4/7] virtio-pci: Introduce APIs to execute device parts " Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-05 22:47   ` Alex Williamson
  2024-11-04 10:21 ` [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration Yishai Hadas
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Add support for basic live migration functionality in VFIO over
virtio-net devices, aligned with the virtio device specification 1.4.

This includes the following VFIO features:
VFIO_MIGRATION_STOP_COPY, VFIO_MIGRATION_P2P.

The implementation registers with the VFIO subsystem using vfio_pci_core
and then incorporates the virtio-specific logic for the migration
process.

The migration follows the definitions in uapi/vfio.h and leverages the
virtio VF-to-PF admin queue command channel for executing the device
parts related commands.

Additional Notes:
-----------------
The kernel protocol between the source and target devices contains a
header with metadata, including record size, tag, and flags.

The record size allows the target to recognize and read a complete image
from the source before passing the device part data. This adheres to the
virtio device specification, which mandates that partial device parts
cannot be supplied.

The tag and flags serve as placeholders for future extensions of the
kernel protocol between the source and target, ensuring backward and
forward compatibility.

Both the source and target comply with the virtio device specification
by using a device part object with a unique ID as part of the migration
process. Since this resource is limited to a maximum of 255, its
lifecycle is confined to periods with an active live migration flow.

According to the virtio specification, a device has only two modes:
RUNNING and STOPPED. As a result, certain VFIO transitions (i.e.,
RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When
transitioning to RUNNING_P2P, the device state is set to STOP, and it
will remain STOPPED until the transition out of RUNNING_P2P->RUNNING, at
which point it returns to RUNNING. During transition to STOP, the virtio
device only stops initiating outgoing requests (e.g., DMA, MSI-X), but it
still must accept incoming operations.
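
As an illustration of the resulting mapping (a sketch only; the helper
name is hypothetical and the mapping simply restates the description
above):

/* Map the relevant VFIO states onto the two virtio device modes */
static u8 virtio_mode_flags_for_vfio_state(enum vfio_device_mig_state state)
{
	switch (state) {
	case VFIO_DEVICE_STATE_RUNNING:
		return 0;	/* virtio RUNNING */
	case VFIO_DEVICE_STATE_RUNNING_P2P:
	case VFIO_DEVICE_STATE_STOP:
	case VFIO_DEVICE_STATE_STOP_COPY:
		/* virtio STOPPED: no outgoing requests, incoming still served */
		return BIT(VIRTIO_ADMIN_CMD_DEV_MODE_F_STOPPED);
	default:
		return 0;
	}
}

Since RUNNING_P2P, STOP and STOP_COPY all map to the same STOPPED mode,
the RUNNING_P2P->STOP and STOP->RUNNING_P2P transitions above become
no-ops.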

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/virtio/Makefile  |    2 +-
 drivers/vfio/pci/virtio/common.h  |  104 +++
 drivers/vfio/pci/virtio/main.c    |  144 ++--
 drivers/vfio/pci/virtio/migrate.c | 1119 +++++++++++++++++++++++++++++
 4 files changed, 1318 insertions(+), 51 deletions(-)
 create mode 100644 drivers/vfio/pci/virtio/common.h
 create mode 100644 drivers/vfio/pci/virtio/migrate.c

diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
index 7171105baf33..bf0ccde6a91a 100644
--- a/drivers/vfio/pci/virtio/Makefile
+++ b/drivers/vfio/pci/virtio/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
-virtio-vfio-pci-y := main.o
+virtio-vfio-pci-y := main.o migrate.o
diff --git a/drivers/vfio/pci/virtio/common.h b/drivers/vfio/pci/virtio/common.h
new file mode 100644
index 000000000000..3bdfb3ea1174
--- /dev/null
+++ b/drivers/vfio/pci/virtio/common.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef VIRTIO_VFIO_COMMON_H
+#define VIRTIO_VFIO_COMMON_H
+
+#include <linux/kernel.h>
+#include <linux/virtio.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/virtio_pci.h>
+
+enum virtiovf_migf_state {
+	VIRTIOVF_MIGF_STATE_ERROR = 1,
+};
+
+enum virtiovf_load_state {
+	VIRTIOVF_LOAD_STATE_READ_HEADER,
+	VIRTIOVF_LOAD_STATE_PREP_HEADER_DATA,
+	VIRTIOVF_LOAD_STATE_READ_HEADER_DATA,
+	VIRTIOVF_LOAD_STATE_PREP_CHUNK,
+	VIRTIOVF_LOAD_STATE_READ_CHUNK,
+	VIRTIOVF_LOAD_STATE_LOAD_CHUNK,
+};
+
+struct virtiovf_data_buffer {
+	struct sg_append_table table;
+	loff_t start_pos;
+	u64 length;
+	u64 allocated_length;
+	struct list_head buf_elm;
+	u8 include_header_object:1;
+	struct virtiovf_migration_file *migf;
+	/* Optimize virtiovf_get_migration_page() for sequential access */
+	struct scatterlist *last_offset_sg;
+	unsigned int sg_last_entry;
+	unsigned long last_offset;
+};
+
+enum virtiovf_migf_header_flags {
+	VIRTIOVF_MIGF_HEADER_FLAGS_TAG_MANDATORY = 0,
+	VIRTIOVF_MIGF_HEADER_FLAGS_TAG_OPTIONAL = 1 << 0,
+};
+
+enum virtiovf_migf_header_tag {
+	VIRTIOVF_MIGF_HEADER_TAG_DEVICE_DATA = 0,
+};
+
+struct virtiovf_migration_header {
+	__le64 record_size;
+	/* For future use in case we may need to change the kernel protocol */
+	__le32 flags; /* Use virtiovf_migf_header_flags */
+	__le32 tag; /* Use virtiovf_migf_header_tag */
+	__u8 data[]; /* Its size is given in the record_size */
+};
+
+struct virtiovf_migration_file {
+	struct file *filp;
+	/* synchronize access to the file state */
+	struct mutex lock;
+	loff_t max_pos;
+	u64 record_size;
+	u32 record_tag;
+	u8 has_obj_id:1;
+	u32 obj_id;
+	enum virtiovf_migf_state state;
+	enum virtiovf_load_state load_state;
+	/* synchronize access to the lists */
+	spinlock_t list_lock;
+	struct list_head buf_list;
+	struct list_head avail_list;
+	struct virtiovf_data_buffer *buf;
+	struct virtiovf_data_buffer *buf_header;
+	struct virtiovf_pci_core_device *virtvdev;
+};
+
+struct virtiovf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 *bar0_virtual_buf;
+	/* synchronize access to the virtual buf */
+	struct mutex bar_mutex;
+	void __iomem *notify_addr;
+	u64 notify_offset;
+	__le32 pci_base_addr_0;
+	__le16 pci_cmd;
+	u8 bar0_virtual_buf_size;
+	u8 notify_bar;
+
+	/* LM related */
+	u8 migrate_cap:1;
+	u8 deferred_reset:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
+	struct virtiovf_migration_file *resuming_migf;
+	struct virtiovf_migration_file *saving_migf;
+};
+
+void virtiovf_set_migratable(struct virtiovf_pci_core_device *virtvdev);
+void virtiovf_open_migration(struct virtiovf_pci_core_device *virtvdev);
+void virtiovf_close_migration(struct virtiovf_pci_core_device *virtvdev);
+void virtiovf_migration_reset_done(struct pci_dev *pdev);
+
+#endif /* VIRTIO_VFIO_COMMON_H */
diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
index b5d3a8c5bbc9..e2cdf2d48200 100644
--- a/drivers/vfio/pci/virtio/main.c
+++ b/drivers/vfio/pci/virtio/main.c
@@ -16,18 +16,9 @@
 #include <linux/virtio_net.h>
 #include <linux/virtio_pci_admin.h>
 
-struct virtiovf_pci_core_device {
-	struct vfio_pci_core_device core_device;
-	u8 *bar0_virtual_buf;
-	/* synchronize access to the virtual buf */
-	struct mutex bar_mutex;
-	void __iomem *notify_addr;
-	u64 notify_offset;
-	__le32 pci_base_addr_0;
-	__le16 pci_cmd;
-	u8 bar0_virtual_buf_size;
-	u8 notify_bar;
-};
+#include "common.h"
+
+static int virtiovf_pci_init_device(struct vfio_device *core_vdev);
 
 static int
 virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
@@ -355,8 +346,8 @@ virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
 
 static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
 {
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
+			struct virtiovf_pci_core_device, core_device.vdev);
 	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
 	int ret;
 
@@ -377,10 +368,20 @@ static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
 		}
 	}
 
+	virtiovf_open_migration(virtvdev);
 	vfio_pci_core_finish_enable(vdev);
 	return 0;
 }
 
+static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
+			struct virtiovf_pci_core_device, core_device.vdev);
+
+	virtiovf_close_migration(virtvdev);
+	vfio_pci_core_close_device(core_vdev);
+}
+
 static int virtiovf_get_device_config_size(unsigned short device)
 {
 	/* Network card */
@@ -404,48 +405,40 @@ static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
 	return 0;
 }
 
-static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
-{
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
-	struct pci_dev *pdev;
-	int ret;
-
-	ret = vfio_pci_core_init_dev(core_vdev);
-	if (ret)
-		return ret;
-
-	pdev = virtvdev->core_device.pdev;
-	ret = virtiovf_read_notify_info(virtvdev);
-	if (ret)
-		return ret;
-
-	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
-				virtiovf_get_device_config_size(pdev->device);
-	BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
-	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
-					     GFP_KERNEL);
-	if (!virtvdev->bar0_virtual_buf)
-		return -ENOMEM;
-	mutex_init(&virtvdev->bar_mutex);
-	return 0;
-}
-
 static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
 {
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
+			struct virtiovf_pci_core_device, core_device.vdev);
 
 	kfree(virtvdev->bar0_virtual_buf);
 	vfio_pci_core_release_dev(core_vdev);
 }
 
-static const struct vfio_device_ops virtiovf_vfio_pci_tran_ops = {
-	.name = "virtio-vfio-pci-trans",
+static const struct vfio_device_ops virtiovf_vfio_pci_lm_ops = {
+	.name = "virtio-vfio-pci-lm",
 	.init = virtiovf_pci_init_device,
 	.release = virtiovf_pci_core_release_dev,
 	.open_device = virtiovf_pci_open_device,
-	.close_device = vfio_pci_core_close_device,
+	.close_device = virtiovf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+	.detach_ioas = vfio_iommufd_physical_detach_ioas,
+};
+
+static const struct vfio_device_ops virtiovf_vfio_pci_tran_lm_ops = {
+	.name = "virtio-vfio-pci-trans-lm",
+	.init = virtiovf_pci_init_device,
+	.release = virtiovf_pci_core_release_dev,
+	.open_device = virtiovf_pci_open_device,
+	.close_device = virtiovf_pci_close_device,
 	.ioctl = virtiovf_vfio_pci_core_ioctl,
 	.device_feature = vfio_pci_core_ioctl_feature,
 	.read = virtiovf_pci_core_read,
@@ -485,16 +478,66 @@ static bool virtiovf_bar0_exists(struct pci_dev *pdev)
 	return res->flags;
 }
 
+static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
+			struct virtiovf_pci_core_device, core_device.vdev);
+	struct pci_dev *pdev;
+	bool sup_legacy_io;
+	bool sup_lm;
+	int ret;
+
+	ret = vfio_pci_core_init_dev(core_vdev);
+	if (ret)
+		return ret;
+
+	pdev = virtvdev->core_device.pdev;
+	sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
+				!virtiovf_bar0_exists(pdev);
+	sup_lm = virtio_pci_admin_has_dev_parts(pdev);
+
+	/*
+	 * If the device is not capable of this driver's extra functionality,
+	 * fall back to the default vfio-pci ops.
+	 */
+	if (!sup_legacy_io && !sup_lm) {
+		core_vdev->ops = &virtiovf_vfio_pci_ops;
+		return 0;
+	}
+
+	if (sup_legacy_io) {
+		ret = virtiovf_read_notify_info(virtvdev);
+		if (ret)
+			return ret;
+
+		virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
+					virtiovf_get_device_config_size(pdev->device);
+		BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
+		virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
+						     GFP_KERNEL);
+		if (!virtvdev->bar0_virtual_buf)
+			return -ENOMEM;
+		mutex_init(&virtvdev->bar_mutex);
+	}
+
+	if (sup_lm)
+		virtiovf_set_migratable(virtvdev);
+
+	if (sup_lm && !sup_legacy_io)
+		core_vdev->ops = &virtiovf_vfio_pci_lm_ops;
+
+	return 0;
+}
+
 static int virtiovf_pci_probe(struct pci_dev *pdev,
 			      const struct pci_device_id *id)
 {
-	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
 	struct virtiovf_pci_core_device *virtvdev;
+	const struct vfio_device_ops *ops;
 	int ret;
 
-	if (pdev->is_virtfn && virtio_pci_admin_has_legacy_io(pdev) &&
-	    !virtiovf_bar0_exists(pdev))
-		ops = &virtiovf_vfio_pci_tran_ops;
+	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_tran_lm_ops :
+				  &virtiovf_vfio_pci_ops;
 
 	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
 				     &pdev->dev, ops);
@@ -532,6 +575,7 @@ static void virtiovf_pci_aer_reset_done(struct pci_dev *pdev)
 	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
 
 	virtvdev->pci_cmd = 0;
+	virtiovf_migration_reset_done(pdev);
 }
 
 static const struct pci_error_handlers virtiovf_err_handlers = {
diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
new file mode 100644
index 000000000000..2a9614c2ef07
--- /dev/null
+++ b/drivers/vfio/pci/virtio/migrate.c
@@ -0,0 +1,1119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/virtio_pci.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_pci_admin.h>
+#include <linux/anon_inodes.h>
+
+#include "common.h"
+
+/* Device specification max parts size */
+#define MAX_LOAD_SIZE (BIT_ULL(BITS_PER_TYPE \
+	(((struct virtio_admin_cmd_dev_parts_metadata_result *)0)->parts_size.size)) - 1)
+
+/* Initial target buffer size */
+#define VIRTIOVF_TARGET_INITIAL_BUF_SIZE SZ_1M
+
+static struct page *
+virtiovf_get_migration_page(struct virtiovf_data_buffer *buf,
+			    unsigned long offset)
+{
+	unsigned long cur_offset = 0;
+	struct scatterlist *sg;
+	unsigned int i;
+
+	/* All accesses are sequential */
+	if (offset < buf->last_offset || !buf->last_offset_sg) {
+		buf->last_offset = 0;
+		buf->last_offset_sg = buf->table.sgt.sgl;
+		buf->sg_last_entry = 0;
+	}
+
+	cur_offset = buf->last_offset;
+
+	for_each_sg(buf->last_offset_sg, sg,
+		    buf->table.sgt.orig_nents - buf->sg_last_entry, i) {
+		if (offset < sg->length + cur_offset) {
+			buf->last_offset_sg = sg;
+			buf->sg_last_entry += i;
+			buf->last_offset = cur_offset;
+			return nth_page(sg_page(sg),
+					(offset - cur_offset) / PAGE_SIZE);
+		}
+		cur_offset += sg->length;
+	}
+	return NULL;
+}
+
+static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
+					unsigned int npages)
+{
+	unsigned int to_alloc = npages;
+	struct page **page_list;
+	unsigned long filled;
+	unsigned int to_fill;
+	int ret;
+
+	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
+	if (!page_list)
+		return -ENOMEM;
+
+	do {
+		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
+						page_list);
+		if (!filled) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		to_alloc -= filled;
+		ret = sg_alloc_append_table_from_pages(&buf->table, page_list,
+			filled, 0, filled << PAGE_SHIFT, UINT_MAX,
+			SG_MAX_SINGLE_ALLOC, GFP_KERNEL_ACCOUNT);
+
+		if (ret)
+			goto err;
+		buf->allocated_length += filled * PAGE_SIZE;
+		/* clean input for another bulk allocation */
+		memset(page_list, 0, filled * sizeof(*page_list));
+		to_fill = min_t(unsigned int, to_alloc,
+				PAGE_SIZE / sizeof(*page_list));
+	} while (to_alloc > 0);
+
+	kvfree(page_list);
+	return 0;
+
+err:
+	kvfree(page_list);
+	return ret;
+}
+
+static void virtiovf_free_data_buffer(struct virtiovf_data_buffer *buf)
+{
+	struct sg_page_iter sg_iter;
+
+	/* Undo alloc_pages_bulk_array() */
+	for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
+		__free_page(sg_page_iter_page(&sg_iter));
+	sg_free_append_table(&buf->table);
+	kfree(buf);
+}
+
+static struct virtiovf_data_buffer *
+virtiovf_alloc_data_buffer(struct virtiovf_migration_file *migf, size_t length)
+{
+	struct virtiovf_data_buffer *buf;
+	int ret;
+
+	buf = kzalloc(sizeof(*buf), GFP_KERNEL_ACCOUNT);
+	if (!buf)
+		return ERR_PTR(-ENOMEM);
+
+	ret = virtiovf_add_migration_pages(buf,
+				DIV_ROUND_UP_ULL(length, PAGE_SIZE));
+	if (ret)
+		goto end;
+
+	buf->migf = migf;
+	return buf;
+end:
+	virtiovf_free_data_buffer(buf);
+	return ERR_PTR(ret);
+}
+
+static void virtiovf_put_data_buffer(struct virtiovf_data_buffer *buf)
+{
+	spin_lock_irq(&buf->migf->list_lock);
+	list_add_tail(&buf->buf_elm, &buf->migf->avail_list);
+	spin_unlock_irq(&buf->migf->list_lock);
+}
+
+static int
+virtiovf_pci_alloc_obj_id(struct virtiovf_pci_core_device *virtvdev, u8 type,
+			  u32 *obj_id)
+{
+	return virtio_pci_admin_obj_create(virtvdev->core_device.pdev,
+					   VIRTIO_RESOURCE_OBJ_DEV_PARTS, type, obj_id);
+}
+
+static void
+virtiovf_pci_free_obj_id(struct virtiovf_pci_core_device *virtvdev, u32 obj_id)
+{
+	virtio_pci_admin_obj_destroy(virtvdev->core_device.pdev,
+			VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id);
+}
+
+static void virtiovf_clean_migf_resources(struct virtiovf_migration_file *migf)
+{
+	struct virtiovf_data_buffer *entry;
+
+	if (migf->buf) {
+		virtiovf_free_data_buffer(migf->buf);
+		migf->buf = NULL;
+	}
+
+	if (migf->buf_header) {
+		virtiovf_free_data_buffer(migf->buf_header);
+		migf->buf_header = NULL;
+	}
+
+	list_splice(&migf->avail_list, &migf->buf_list);
+
+	while ((entry = list_first_entry_or_null(&migf->buf_list,
+				struct virtiovf_data_buffer, buf_elm))) {
+		list_del(&entry->buf_elm);
+		virtiovf_free_data_buffer(entry);
+	}
+
+	if (migf->has_obj_id)
+		virtiovf_pci_free_obj_id(migf->virtvdev, migf->obj_id);
+}
+
+static void virtiovf_disable_fd(struct virtiovf_migration_file *migf)
+{
+	mutex_lock(&migf->lock);
+	migf->state = VIRTIOVF_MIGF_STATE_ERROR;
+	migf->filp->f_pos = 0;
+	mutex_unlock(&migf->lock);
+}
+
+static void virtiovf_disable_fds(struct virtiovf_pci_core_device *virtvdev)
+{
+	if (virtvdev->resuming_migf) {
+		virtiovf_disable_fd(virtvdev->resuming_migf);
+		virtiovf_clean_migf_resources(virtvdev->resuming_migf);
+		fput(virtvdev->resuming_migf->filp);
+		virtvdev->resuming_migf = NULL;
+	}
+	if (virtvdev->saving_migf) {
+		virtiovf_disable_fd(virtvdev->saving_migf);
+		virtiovf_clean_migf_resources(virtvdev->saving_migf);
+		fput(virtvdev->saving_migf->filp);
+		virtvdev->saving_migf = NULL;
+	}
+}
+
+/*
+ * This function is called in all state_mutex unlock cases to
+ * handle a 'deferred_reset' if exists.
+ */
+static void virtiovf_state_mutex_unlock(struct virtiovf_pci_core_device *virtvdev)
+{
+again:
+	spin_lock(&virtvdev->reset_lock);
+	if (virtvdev->deferred_reset) {
+		virtvdev->deferred_reset = false;
+		spin_unlock(&virtvdev->reset_lock);
+		virtvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		virtiovf_disable_fds(virtvdev);
+		goto again;
+	}
+	mutex_unlock(&virtvdev->state_mutex);
+	spin_unlock(&virtvdev->reset_lock);
+}
+
+void virtiovf_migration_reset_done(struct pci_dev *pdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
+
+	if (!virtvdev->migrate_cap)
+		return;
+
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&virtvdev->reset_lock);
+	virtvdev->deferred_reset = true;
+	if (!mutex_trylock(&virtvdev->state_mutex)) {
+		spin_unlock(&virtvdev->reset_lock);
+		return;
+	}
+	spin_unlock(&virtvdev->reset_lock);
+	virtiovf_state_mutex_unlock(virtvdev);
+}
+
+static int virtiovf_release_file(struct inode *inode, struct file *filp)
+{
+	struct virtiovf_migration_file *migf = filp->private_data;
+
+	virtiovf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+static struct virtiovf_data_buffer *
+virtiovf_get_data_buff_from_pos(struct virtiovf_migration_file *migf,
+				loff_t pos, bool *end_of_data)
+{
+	struct virtiovf_data_buffer *buf;
+	bool found = false;
+
+	*end_of_data = false;
+	spin_lock_irq(&migf->list_lock);
+	if (list_empty(&migf->buf_list)) {
+		*end_of_data = true;
+		goto end;
+	}
+
+	buf = list_first_entry(&migf->buf_list, struct virtiovf_data_buffer,
+			       buf_elm);
+	if (pos >= buf->start_pos &&
+	    pos < buf->start_pos + buf->length) {
+		found = true;
+		goto end;
+	}
+
+	/*
+	 * As we use a stream based FD we may expect having the data always
+	 * on first chunk
+	 */
+	migf->state = VIRTIOVF_MIGF_STATE_ERROR;
+
+end:
+	spin_unlock_irq(&migf->list_lock);
+	return found ? buf : NULL;
+}
+
+static ssize_t virtiovf_buf_read(struct virtiovf_data_buffer *vhca_buf,
+				 char __user **buf, size_t *len, loff_t *pos)
+{
+	unsigned long offset;
+	ssize_t done = 0;
+	size_t copy_len;
+
+	copy_len = min_t(size_t,
+			 vhca_buf->start_pos + vhca_buf->length - *pos, *len);
+	while (copy_len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *from_buff;
+		int ret;
+
+		offset = *pos - vhca_buf->start_pos;
+		page_offset = offset % PAGE_SIZE;
+		offset -= page_offset;
+		page = virtiovf_get_migration_page(vhca_buf, offset);
+		if (!page)
+			return -EINVAL;
+		page_len = min_t(size_t, copy_len, PAGE_SIZE - page_offset);
+		from_buff = kmap_local_page(page);
+		ret = copy_to_user(*buf, from_buff + page_offset, page_len);
+		kunmap_local(from_buff);
+		if (ret)
+			return -EFAULT;
+		*pos += page_len;
+		*len -= page_len;
+		*buf += page_len;
+		done += page_len;
+		copy_len -= page_len;
+	}
+
+	if (*pos >= vhca_buf->start_pos + vhca_buf->length) {
+		spin_lock_irq(&vhca_buf->migf->list_lock);
+		list_del_init(&vhca_buf->buf_elm);
+		list_add_tail(&vhca_buf->buf_elm, &vhca_buf->migf->avail_list);
+		spin_unlock_irq(&vhca_buf->migf->list_lock);
+	}
+
+	return done;
+}
+
+static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t len,
+				  loff_t *pos)
+{
+	struct virtiovf_migration_file *migf = filp->private_data;
+	struct virtiovf_data_buffer *vhca_buf;
+	bool end_of_data;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (migf->state == VIRTIOVF_MIGF_STATE_ERROR) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	while (len) {
+		ssize_t count;
+
+		vhca_buf = virtiovf_get_data_buff_from_pos(migf, *pos, &end_of_data);
+		if (end_of_data)
+			goto out_unlock;
+
+		if (!vhca_buf) {
+			done = -EINVAL;
+			goto out_unlock;
+		}
+
+		count = virtiovf_buf_read(vhca_buf, &buf, &len, pos);
+		if (count < 0) {
+			done = count;
+			goto out_unlock;
+		}
+		done += count;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations virtiovf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = virtiovf_save_read,
+	.release = virtiovf_release_file,
+};
+
+static int
+virtiovf_add_buf_header(struct virtiovf_data_buffer *header_buf,
+			u32 data_size)
+{
+	struct virtiovf_migration_file *migf = header_buf->migf;
+	struct virtiovf_migration_header header = {};
+	struct page *page;
+	u8 *to_buff;
+
+	header.record_size = cpu_to_le64(data_size);
+	header.flags = cpu_to_le32(VIRTIOVF_MIGF_HEADER_FLAGS_TAG_MANDATORY);
+	header.tag = cpu_to_le32(VIRTIOVF_MIGF_HEADER_TAG_DEVICE_DATA);
+	page = virtiovf_get_migration_page(header_buf, 0);
+	if (!page)
+		return -EINVAL;
+	to_buff = kmap_local_page(page);
+	memcpy(to_buff, &header, sizeof(header));
+	kunmap_local(to_buff);
+	header_buf->length = sizeof(header);
+	header_buf->start_pos = header_buf->migf->max_pos;
+	migf->max_pos += header_buf->length;
+	spin_lock_irq(&migf->list_lock);
+	list_add_tail(&header_buf->buf_elm, &migf->buf_list);
+	spin_unlock_irq(&migf->list_lock);
+	return 0;
+}
+
+static int
+virtiovf_read_device_context_chunk(struct virtiovf_migration_file *migf,
+				   u32 ctx_size)
+{
+	struct virtiovf_data_buffer *header_buf;
+	struct virtiovf_data_buffer *buf;
+	bool unmark_end = false;
+	struct scatterlist *sg;
+	unsigned int i;
+	u32 res_size;
+	int nent;
+	int ret;
+
+	buf = virtiovf_alloc_data_buffer(migf, ctx_size);
+	if (IS_ERR(buf))
+		return PTR_ERR(buf);
+
+	/* Find the total count of SG entries which satisfy the size */
+	nent = sg_nents_for_len(buf->table.sgt.sgl, ctx_size);
+	if (nent <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Iterate to that SG entry and mark it as last (if it's not already)
+	 * so the underlying layers iterate only up to that entry.
+	 */
+	for_each_sg(buf->table.sgt.sgl, sg, nent - 1, i)
+		;
+
+	if (!sg_is_last(sg)) {
+		unmark_end = true;
+		sg_mark_end(sg);
+	}
+
+	ret = virtio_pci_admin_dev_parts_get(migf->virtvdev->core_device.pdev,
+					     VIRTIO_RESOURCE_OBJ_DEV_PARTS,
+					     migf->obj_id,
+					     VIRTIO_ADMIN_CMD_DEV_PARTS_GET_TYPE_ALL,
+					     buf->table.sgt.sgl, &res_size);
+	/* Restore the original SG mark end */
+	if (unmark_end)
+		sg_unmark_end(sg);
+	if (ret)
+		goto out;
+
+	buf->length = res_size;
+	header_buf = virtiovf_alloc_data_buffer(migf,
+				sizeof(struct virtiovf_migration_header));
+	if (IS_ERR(header_buf)) {
+		ret = PTR_ERR(header_buf);
+		goto out;
+	}
+
+	ret = virtiovf_add_buf_header(header_buf, res_size);
+	if (ret)
+		goto out_header;
+
+	buf->start_pos = buf->migf->max_pos;
+	migf->max_pos += buf->length;
+	spin_lock_irq(&migf->list_lock);
+	list_add_tail(&buf->buf_elm, &migf->buf_list);
+	spin_unlock_irq(&migf->list_lock);
+	return 0;
+
+out_header:
+	virtiovf_put_data_buffer(header_buf);
+out:
+	virtiovf_put_data_buffer(buf);
+	return ret;
+}
+
+static struct virtiovf_migration_file *
+virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev)
+{
+	struct virtiovf_migration_file *migf;
+	u32 ctx_size;
+	u32 obj_id;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL_ACCOUNT);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("virtiovf_mig", &virtiovf_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		ret = PTR_ERR(migf->filp);
+		goto end;
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+	INIT_LIST_HEAD(&migf->buf_list);
+	INIT_LIST_HEAD(&migf->avail_list);
+	spin_lock_init(&migf->list_lock);
+	migf->virtvdev = virtvdev;
+
+	lockdep_assert_held(&virtvdev->state_mutex);
+	ret = virtiovf_pci_alloc_obj_id(virtvdev, VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET,
+					&obj_id);
+	if (ret)
+		goto out;
+
+	migf->obj_id = obj_id;
+	/* Mark as having a valid obj_id, which may legitimately be 0 */
+	migf->has_obj_id = true;
+	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
+				VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id,
+				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
+				&ctx_size);
+	if (ret)
+		goto out_clean;
+
+	if (!ctx_size) {
+		ret = -EINVAL;
+		goto out_clean;
+	}
+
+	ret = virtiovf_read_device_context_chunk(migf, ctx_size);
+	if (ret)
+		goto out_clean;
+
+	return migf;
+
+out_clean:
+	virtiovf_clean_migf_resources(migf);
+out:
+	fput(migf->filp);
+end:
+	kfree(migf);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Set the required object header at the beginning of the buffer.
+ * The actual device parts data will be written after the header.
+ */
+static int virtiovf_set_obj_cmd_header(struct virtiovf_data_buffer *vhca_buf)
+{
+	struct virtio_admin_cmd_resource_obj_cmd_hdr obj_hdr = {};
+	struct page *page;
+	u8 *to_buff;
+
+	obj_hdr.type = cpu_to_le16(VIRTIO_RESOURCE_OBJ_DEV_PARTS);
+	obj_hdr.id = cpu_to_le32(vhca_buf->migf->obj_id);
+	page = virtiovf_get_migration_page(vhca_buf, 0);
+	if (!page)
+		return -EINVAL;
+	to_buff = kmap_local_page(page);
+	memcpy(to_buff, &obj_hdr, sizeof(obj_hdr));
+	kunmap_local(to_buff);
+
+	/* Mark the buffer as including the header object data */
+	vhca_buf->include_header_object = 1;
+	return 0;
+}
+
+static int
+virtiovf_append_page_to_mig_buf(struct virtiovf_data_buffer *vhca_buf,
+				const char __user **buf, size_t *len,
+				loff_t *pos, ssize_t *done)
+{
+	unsigned long offset;
+	size_t page_offset;
+	struct page *page;
+	size_t page_len;
+	u8 *to_buff;
+	int ret;
+
+	offset = *pos - vhca_buf->start_pos;
+
+	if (vhca_buf->include_header_object)
+		/* The buffer holds the object header, update the offset accordingly */
+		offset += sizeof(struct virtio_admin_cmd_resource_obj_cmd_hdr);
+
+	page_offset = offset % PAGE_SIZE;
+
+	page = virtiovf_get_migration_page(vhca_buf, offset - page_offset);
+	if (!page)
+		return -EINVAL;
+
+	page_len = min_t(size_t, *len, PAGE_SIZE - page_offset);
+	to_buff = kmap_local_page(page);
+	ret = copy_from_user(to_buff + page_offset, *buf, page_len);
+	kunmap_local(to_buff);
+	if (ret)
+		return -EFAULT;
+
+	*pos += page_len;
+	*done += page_len;
+	*buf += page_len;
+	*len -= page_len;
+	vhca_buf->length += page_len;
+	return 0;
+}
+
+static ssize_t
+virtiovf_resume_read_chunk(struct virtiovf_migration_file *migf,
+			   struct virtiovf_data_buffer *vhca_buf,
+			   size_t chunk_size, const char __user **buf,
+			   size_t *len, loff_t *pos, ssize_t *done,
+			   bool *has_work)
+{
+	size_t copy_len, to_copy;
+	int ret;
+
+	to_copy = min_t(size_t, *len, chunk_size - vhca_buf->length);
+	copy_len = to_copy;
+	while (to_copy) {
+		ret = virtiovf_append_page_to_mig_buf(vhca_buf, buf, &to_copy,
+						      pos, done);
+		if (ret)
+			return ret;
+	}
+
+	*len -= copy_len;
+	if (vhca_buf->length == chunk_size) {
+		migf->load_state = VIRTIOVF_LOAD_STATE_LOAD_CHUNK;
+		migf->max_pos += chunk_size;
+		*has_work = true;
+	}
+
+	return 0;
+}
+
+static int
+virtiovf_resume_read_header_data(struct virtiovf_migration_file *migf,
+				 struct virtiovf_data_buffer *vhca_buf,
+				 const char __user **buf, size_t *len,
+				 loff_t *pos, ssize_t *done)
+{
+	size_t copy_len, to_copy;
+	size_t required_data;
+	int ret;
+
+	required_data = migf->record_size - vhca_buf->length;
+	to_copy = min_t(size_t, *len, required_data);
+	copy_len = to_copy;
+	while (to_copy) {
+		ret = virtiovf_append_page_to_mig_buf(vhca_buf, buf, &to_copy,
+						      pos, done);
+		if (ret)
+			return ret;
+	}
+
+	*len -= copy_len;
+	if (vhca_buf->length == migf->record_size) {
+		switch (migf->record_tag) {
+		default:
+			/* Optional tag */
+			break;
+		}
+
+		migf->load_state = VIRTIOVF_LOAD_STATE_READ_HEADER;
+		migf->max_pos += migf->record_size;
+		vhca_buf->length = 0;
+	}
+
+	return 0;
+}
+
+static int
+virtiovf_resume_read_header(struct virtiovf_migration_file *migf,
+			    struct virtiovf_data_buffer *vhca_buf,
+			    const char __user **buf,
+			    size_t *len, loff_t *pos,
+			    ssize_t *done, bool *has_work)
+{
+	struct page *page;
+	size_t copy_len;
+	u8 *to_buff;
+	int ret;
+
+	copy_len = min_t(size_t, *len,
+		sizeof(struct virtiovf_migration_header) - vhca_buf->length);
+	page = virtiovf_get_migration_page(vhca_buf, 0);
+	if (!page)
+		return -EINVAL;
+	to_buff = kmap_local_page(page);
+	ret = copy_from_user(to_buff + vhca_buf->length, *buf, copy_len);
+	if (ret) {
+		ret = -EFAULT;
+		goto end;
+	}
+
+	*buf += copy_len;
+	*pos += copy_len;
+	*done += copy_len;
+	*len -= copy_len;
+	vhca_buf->length += copy_len;
+	if (vhca_buf->length == sizeof(struct virtiovf_migration_header)) {
+		u64 record_size;
+		u32 flags;
+
+		record_size = le64_to_cpup((__le64 *)to_buff);
+		if (record_size > MAX_LOAD_SIZE) {
+			ret = -ENOMEM;
+			goto end;
+		}
+
+		migf->record_size = record_size;
+		flags = le32_to_cpup((__le32 *)(to_buff +
+			    offsetof(struct virtiovf_migration_header, flags)));
+		migf->record_tag = le32_to_cpup((__le32 *)(to_buff +
+			    offsetof(struct virtiovf_migration_header, tag)));
+		switch (migf->record_tag) {
+		case VIRTIOVF_MIGF_HEADER_TAG_DEVICE_DATA:
+			migf->load_state = VIRTIOVF_LOAD_STATE_PREP_CHUNK;
+			break;
+		default:
+			if (!(flags & VIRTIOVF_MIGF_HEADER_FLAGS_TAG_OPTIONAL)) {
+				ret = -EOPNOTSUPP;
+				goto end;
+			}
+			/* We may read and skip this optional record data */
+			migf->load_state = VIRTIOVF_LOAD_STATE_PREP_HEADER_DATA;
+		}
+
+		migf->max_pos += vhca_buf->length;
+		vhca_buf->length = 0;
+		*has_work = true;
+	}
+end:
+	kunmap_local(to_buff);
+	return ret;
+}
+
+static ssize_t virtiovf_resume_write(struct file *filp, const char __user *buf,
+				     size_t len, loff_t *pos)
+{
+	struct virtiovf_migration_file *migf = filp->private_data;
+	struct virtiovf_data_buffer *vhca_buf = migf->buf;
+	struct virtiovf_data_buffer *vhca_buf_header = migf->buf_header;
+	unsigned int orig_length;
+	bool has_work = false;
+	ssize_t done = 0;
+	int ret = 0;
+
+	if (pos)
+		return -ESPIPE;
+
+	pos = &filp->f_pos;
+	if (*pos < vhca_buf->start_pos)
+		return -EINVAL;
+
+	mutex_lock(&migf->virtvdev->state_mutex);
+	mutex_lock(&migf->lock);
+	if (migf->state == VIRTIOVF_MIGF_STATE_ERROR) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	while (len || has_work) {
+		has_work = false;
+		switch (migf->load_state) {
+		case VIRTIOVF_LOAD_STATE_READ_HEADER:
+			ret = virtiovf_resume_read_header(migf, vhca_buf_header, &buf,
+							  &len, pos, &done, &has_work);
+			if (ret)
+				goto out_unlock;
+			break;
+		case VIRTIOVF_LOAD_STATE_PREP_HEADER_DATA:
+			if (vhca_buf_header->allocated_length < migf->record_size) {
+				virtiovf_free_data_buffer(vhca_buf_header);
+
+				migf->buf_header = virtiovf_alloc_data_buffer(migf,
+						migf->record_size);
+				if (IS_ERR(migf->buf_header)) {
+					ret = PTR_ERR(migf->buf_header);
+					migf->buf_header = NULL;
+					goto out_unlock;
+				}
+
+				vhca_buf_header = migf->buf_header;
+			}
+
+			vhca_buf_header->start_pos = migf->max_pos;
+			migf->load_state = VIRTIOVF_LOAD_STATE_READ_HEADER_DATA;
+			break;
+		case VIRTIOVF_LOAD_STATE_READ_HEADER_DATA:
+			ret = virtiovf_resume_read_header_data(migf, vhca_buf_header,
+							       &buf, &len, pos, &done);
+			if (ret)
+				goto out_unlock;
+			break;
+		case VIRTIOVF_LOAD_STATE_PREP_CHUNK:
+		{
+			u32 cmd_size = migf->record_size +
+				sizeof(struct virtio_admin_cmd_resource_obj_cmd_hdr);
+
+			/*
+			 * The DMA map/unmap is managed in the virtio layer; we just need to extend
+			 * the SG pages to hold the extra required chunk data.
+			 */
+			if (vhca_buf->allocated_length < cmd_size) {
+				ret = virtiovf_add_migration_pages(vhca_buf,
+					DIV_ROUND_UP_ULL(cmd_size - vhca_buf->allocated_length,
+							 PAGE_SIZE));
+				if (ret)
+					goto out_unlock;
+			}
+
+			vhca_buf->start_pos = migf->max_pos;
+			migf->load_state = VIRTIOVF_LOAD_STATE_READ_CHUNK;
+			break;
+		}
+		case VIRTIOVF_LOAD_STATE_READ_CHUNK:
+			ret = virtiovf_resume_read_chunk(migf, vhca_buf, migf->record_size,
+							 &buf, &len, pos, &done, &has_work);
+			if (ret)
+				goto out_unlock;
+			break;
+		case VIRTIOVF_LOAD_STATE_LOAD_CHUNK:
+			/* Mark the last SG entry and set its length */
+			sg_mark_end(vhca_buf->last_offset_sg);
+			orig_length = vhca_buf->last_offset_sg->length;
+			/* Length should include the resource object command header */
+			vhca_buf->last_offset_sg->length = vhca_buf->length +
+					sizeof(struct virtio_admin_cmd_resource_obj_cmd_hdr) -
+					vhca_buf->last_offset;
+			ret = virtio_pci_admin_dev_parts_set(migf->virtvdev->core_device.pdev,
+							     vhca_buf->table.sgt.sgl);
+			/* Restore the original SG data */
+			vhca_buf->last_offset_sg->length = orig_length;
+			sg_unmark_end(vhca_buf->last_offset_sg);
+			if (ret)
+				goto out_unlock;
+			migf->load_state = VIRTIOVF_LOAD_STATE_READ_HEADER;
+			/* be ready for reading the next chunk */
+			vhca_buf->length = 0;
+			break;
+		default:
+			break;
+		}
+	}
+
+out_unlock:
+	if (ret)
+		migf->state = VIRTIOVF_MIGF_STATE_ERROR;
+	mutex_unlock(&migf->lock);
+	virtiovf_state_mutex_unlock(migf->virtvdev);
+	return ret ? ret : done;
+}
+
+static const struct file_operations virtiovf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = virtiovf_resume_write,
+	.release = virtiovf_release_file,
+};
+
+static struct virtiovf_migration_file *
+virtiovf_pci_resume_device_data(struct virtiovf_pci_core_device *virtvdev)
+{
+	struct virtiovf_migration_file *migf;
+	struct virtiovf_data_buffer *buf;
+	u32 obj_id;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL_ACCOUNT);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("virtiovf_mig", &virtiovf_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		ret = PTR_ERR(migf->filp);
+		goto end;
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+	INIT_LIST_HEAD(&migf->buf_list);
+	INIT_LIST_HEAD(&migf->avail_list);
+	spin_lock_init(&migf->list_lock);
+
+	buf = virtiovf_alloc_data_buffer(migf, VIRTIOVF_TARGET_INITIAL_BUF_SIZE);
+	if (IS_ERR(buf)) {
+		ret = PTR_ERR(buf);
+		goto out_free;
+	}
+
+	migf->buf = buf;
+
+	buf = virtiovf_alloc_data_buffer(migf,
+		sizeof(struct virtiovf_migration_header));
+	if (IS_ERR(buf)) {
+		ret = PTR_ERR(buf);
+		goto out_clean;
+	}
+
+	migf->buf_header = buf;
+	migf->load_state = VIRTIOVF_LOAD_STATE_READ_HEADER;
+
+	migf->virtvdev = virtvdev;
+	ret = virtiovf_pci_alloc_obj_id(virtvdev, VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_SET,
+					&obj_id);
+	if (ret)
+		goto out_clean;
+
+	migf->obj_id = obj_id;
+	/* Mark as having a valid obj_id, which may legitimately be 0 */
+	migf->has_obj_id = true;
+	ret = virtiovf_set_obj_cmd_header(migf->buf);
+	if (ret)
+		goto out_clean;
+
+	return migf;
+
+out_clean:
+	virtiovf_clean_migf_resources(migf);
+out_free:
+	fput(migf->filp);
+end:
+	kfree(migf);
+	return ERR_PTR(ret);
+}
+
+static struct file *
+virtiovf_pci_step_device_state_locked(struct virtiovf_pci_core_device *virtvdev,
+				      u32 new)
+{
+	u32 cur = virtvdev->mig_state;
+	int ret;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_STOP) {
+		/* NOP */
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		/* NOP */
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		ret = virtio_pci_admin_mode_set(virtvdev->core_device.pdev,
+						BIT(VIRTIO_ADMIN_CMD_DEV_MODE_F_STOPPED));
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) {
+		ret = virtio_pci_admin_mode_set(virtvdev->core_device.pdev, 0);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct virtiovf_migration_file *migf;
+
+		migf = virtiovf_pci_save_device_data(virtvdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		virtvdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+		virtiovf_disable_fds(virtvdev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct virtiovf_migration_file *migf;
+
+		migf = virtiovf_pci_resume_device_data(virtvdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		virtvdev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		virtiovf_disable_fds(virtvdev);
+		return NULL;
+	}
+
+	/*
+	 * vfio_mig_get_next_state() does not use arcs other than the above
+	 */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct file *
+virtiovf_pci_set_device_state(struct vfio_device *vdev,
+			      enum vfio_device_mig_state new_state)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&virtvdev->state_mutex);
+	while (new_state != virtvdev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, virtvdev->mig_state,
+					      new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(ret);
+			break;
+		}
+		res = virtiovf_pci_step_device_state_locked(virtvdev, next_state);
+		if (IS_ERR(res))
+			break;
+		virtvdev->mig_state = next_state;
+		if (WARN_ON(res && new_state != virtvdev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	virtiovf_state_mutex_unlock(virtvdev);
+	return res;
+}
+
+static int virtiovf_pci_get_device_state(struct vfio_device *vdev,
+				       enum vfio_device_mig_state *curr_state)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		vdev, struct virtiovf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&virtvdev->state_mutex);
+	*curr_state = virtvdev->mig_state;
+	virtiovf_state_mutex_unlock(virtvdev);
+	return 0;
+}
+
+static int virtiovf_pci_get_data_size(struct vfio_device *vdev,
+				      unsigned long *stop_copy_length)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	bool obj_id_exists;
+	u32 res_size;
+	u32 obj_id;
+	int ret;
+
+	mutex_lock(&virtvdev->state_mutex);
+	obj_id_exists = virtvdev->saving_migf && virtvdev->saving_migf->has_obj_id;
+	if (!obj_id_exists) {
+		ret = virtiovf_pci_alloc_obj_id(virtvdev,
+						VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET,
+						&obj_id);
+		if (ret)
+			goto end;
+	} else {
+		obj_id = virtvdev->saving_migf->obj_id;
+	}
+
+	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
+				VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id,
+				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
+				&res_size);
+	if (!ret)
+		*stop_copy_length = res_size;
+
+	/* We can't leave this obj_id alive if it didn't exist before; otherwise it might
+	 * stay alive even without an active migration flow (e.g. the migration was cancelled)
+	 */
+	if (!obj_id_exists)
+		virtiovf_pci_free_obj_id(virtvdev, obj_id);
+end:
+	virtiovf_state_mutex_unlock(virtvdev);
+	return ret;
+}
+
+static const struct vfio_migration_ops virtvdev_pci_mig_ops = {
+	.migration_set_state = virtiovf_pci_set_device_state,
+	.migration_get_state = virtiovf_pci_get_device_state,
+	.migration_get_data_size = virtiovf_pci_get_data_size,
+};
+
+void virtiovf_set_migratable(struct virtiovf_pci_core_device *virtvdev)
+{
+	virtvdev->migrate_cap = 1;
+	mutex_init(&virtvdev->state_mutex);
+	spin_lock_init(&virtvdev->reset_lock);
+	virtvdev->core_device.vdev.migration_flags =
+		VFIO_MIGRATION_STOP_COPY |
+		VFIO_MIGRATION_P2P;
+	virtvdev->core_device.vdev.mig_ops = &virtvdev_pci_mig_ops;
+}
+
+void virtiovf_open_migration(struct virtiovf_pci_core_device *virtvdev)
+{
+	if (!virtvdev->migrate_cap)
+		return;
+
+	virtvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+}
+
+void virtiovf_close_migration(struct virtiovf_pci_core_device *virtvdev)
+{
+	if (!virtvdev->migrate_cap)
+		return;
+
+	virtiovf_disable_fds(virtvdev);
+}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
                   ` (4 preceding siblings ...)
  2024-11-04 10:21 ` [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-05 23:18   ` Alex Williamson
  2024-11-04 10:21 ` [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured Yishai Hadas
  2024-11-06  9:32 ` [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Michael S. Tsirkin
  7 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Add PRE_COPY support for live migration.

This functionality may reduce the downtime upon STOP_COPY by letting the
target machine receive some 'initial data' from the source while the
source machine is still in its RUNNING state, so the target can prepare
itself ahead of time for the final STOP_COPY data.

The Virtio specification does not support reading partial or incremental
device contexts. This means that during the PRE_COPY state, the
vfio-virtio driver reads the full device state.

As the device state can change, and the benefit is highest when the
pre-copy data closely matches the final data, the driver reads it in a
rate-limited mode, reporting no data available for some time interval
after the previous call.

With PRE_COPY enabled, we observed a downtime reduction of approximately
70-75% in various scenarios compared to when PRE_COPY was disabled,
while keeping the total migration time nearly the same.
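
For illustration, a minimal user-space sketch of how a VMM might poll the
PRE_COPY progress on the saving migration FD (poll_precopy() is a
hypothetical helper; obtaining the FD through the device-state transition
and the error handling are assumed to be in place):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Poll the saving FD while the device is in PRE_COPY */
	static int poll_precopy(int saving_fd, __u64 *initial, __u64 *dirty)
	{
		struct vfio_precopy_info info;

		memset(&info, 0, sizeof(info));
		info.argsz = sizeof(info);

		/* The driver reads new device data at a limited rate and may
		 * report no data available in between.
		 */
		if (ioctl(saving_fd, VFIO_MIG_GET_PRECOPY_INFO, &info))
			return -1;

		*initial = info.initial_bytes;
		*dirty = info.dirty_bytes;
		/* Keep read()-ing the FD while bytes remain; -ENOMSG from
		 * read() means "no data for now" during PRE_COPY.
		 */
		return 0;
	}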

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/virtio/common.h  |   4 +
 drivers/vfio/pci/virtio/migrate.c | 233 +++++++++++++++++++++++++++++-
 2 files changed, 229 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/virtio/common.h b/drivers/vfio/pci/virtio/common.h
index 3bdfb3ea1174..5704603f0f9d 100644
--- a/drivers/vfio/pci/virtio/common.h
+++ b/drivers/vfio/pci/virtio/common.h
@@ -10,6 +10,8 @@
 
 enum virtiovf_migf_state {
 	VIRTIOVF_MIGF_STATE_ERROR = 1,
+	VIRTIOVF_MIGF_STATE_PRECOPY = 2,
+	VIRTIOVF_MIGF_STATE_COMPLETE = 3,
 };
 
 enum virtiovf_load_state {
@@ -57,6 +59,8 @@ struct virtiovf_migration_file {
 	/* synchronize access to the file state */
 	struct mutex lock;
 	loff_t max_pos;
+	u64 pre_copy_initial_bytes;
+	struct ratelimit_state pre_copy_rl_state;
 	u64 record_size;
 	u32 record_tag;
 	u8 has_obj_id:1;
diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
index 2a9614c2ef07..cdb252f6fd80 100644
--- a/drivers/vfio/pci/virtio/migrate.c
+++ b/drivers/vfio/pci/virtio/migrate.c
@@ -26,6 +26,10 @@
 /* Initial target buffer size */
 #define VIRTIOVF_TARGET_INITIAL_BUF_SIZE SZ_1M
 
+static int
+virtiovf_read_device_context_chunk(struct virtiovf_migration_file *migf,
+				   u32 ctx_size);
+
 static struct page *
 virtiovf_get_migration_page(struct virtiovf_data_buffer *buf,
 			    unsigned long offset)
@@ -155,6 +159,41 @@ virtiovf_pci_free_obj_id(struct virtiovf_pci_core_device *virtvdev, u32 obj_id)
 			VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id);
 }
 
+static struct virtiovf_data_buffer *
+virtiovf_get_data_buffer(struct virtiovf_migration_file *migf, size_t length)
+{
+	struct virtiovf_data_buffer *buf, *temp_buf;
+	struct list_head free_list;
+
+	INIT_LIST_HEAD(&free_list);
+
+	spin_lock_irq(&migf->list_lock);
+	list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
+		list_del_init(&buf->buf_elm);
+		if (buf->allocated_length >= length) {
+			spin_unlock_irq(&migf->list_lock);
+			goto found;
+		}
+		/*
+		 * Prevent holding redundant buffers: put them on a local free
+		 * list and release them at the end, outside the spin lock
+		 * (&migf->list_lock), to keep its critical section short.
+		 */
+		list_add(&buf->buf_elm, &free_list);
+	}
+	spin_unlock_irq(&migf->list_lock);
+	buf = virtiovf_alloc_data_buffer(migf, length);
+
+found:
+	while ((temp_buf = list_first_entry_or_null(&free_list,
+				struct virtiovf_data_buffer, buf_elm))) {
+		list_del(&temp_buf->buf_elm);
+		virtiovf_free_data_buffer(temp_buf);
+	}
+
+	return buf;
+}
+
 static void virtiovf_clean_migf_resources(struct virtiovf_migration_file *migf)
 {
 	struct virtiovf_data_buffer *entry;
@@ -341,6 +380,7 @@ static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t le
 {
 	struct virtiovf_migration_file *migf = filp->private_data;
 	struct virtiovf_data_buffer *vhca_buf;
+	bool first_loop_call = true;
 	bool end_of_data;
 	ssize_t done = 0;
 
@@ -358,6 +398,19 @@ static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t le
 		ssize_t count;
 
 		vhca_buf = virtiovf_get_data_buff_from_pos(migf, *pos, &end_of_data);
+		if (first_loop_call) {
+			first_loop_call = false;
+			/* Temporary end of file as part of PRE_COPY */
+			if (end_of_data && migf->state == VIRTIOVF_MIGF_STATE_PRECOPY) {
+				done = -ENOMSG;
+				goto out_unlock;
+			}
+			if (end_of_data && migf->state != VIRTIOVF_MIGF_STATE_COMPLETE) {
+				done = -EINVAL;
+				goto out_unlock;
+			}
+		}
+
 		if (end_of_data)
 			goto out_unlock;
 
@@ -379,9 +432,104 @@ static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t le
 	return done;
 }
 
+static long virtiovf_precopy_ioctl(struct file *filp, unsigned int cmd,
+				   unsigned long arg)
+{
+	struct virtiovf_migration_file *migf = filp->private_data;
+	struct virtiovf_pci_core_device *virtvdev = migf->virtvdev;
+	struct vfio_precopy_info info = {};
+	loff_t *pos = &filp->f_pos;
+	bool end_of_data = false;
+	unsigned long minsz;
+	u32 ctx_size;
+	int ret;
+
+	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
+		return -ENOTTY;
+
+	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	mutex_lock(&virtvdev->state_mutex);
+	if (virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
+	    virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY_P2P) {
+		ret = -EINVAL;
+		goto err_state_unlock;
+	}
+
+	/*
+	 * The virtio specification does not include a PRE_COPY concept.
+	 * Since we can expect the data to remain the same for a certain period,
+	 * we use a rate limiter mechanism before making a call to the device.
+	 */
+	if (!__ratelimit(&migf->pre_copy_rl_state)) {
+		/* Reporting no data available */
+		ret = 0;
+		goto done;
+	}
+
+	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
+				VIRTIO_RESOURCE_OBJ_DEV_PARTS, migf->obj_id,
+				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
+				&ctx_size);
+	if (ret)
+		goto err_state_unlock;
+
+	mutex_lock(&migf->lock);
+	if (migf->state == VIRTIOVF_MIGF_STATE_ERROR) {
+		ret = -ENODEV;
+		goto err_migf_unlock;
+	}
+
+	if (migf->pre_copy_initial_bytes > *pos) {
+		info.initial_bytes = migf->pre_copy_initial_bytes - *pos;
+	} else {
+		info.dirty_bytes = migf->max_pos - *pos;
+		if (!info.dirty_bytes)
+			end_of_data = true;
+		info.dirty_bytes += ctx_size;
+	}
+
+	if (!end_of_data || !ctx_size) {
+		mutex_unlock(&migf->lock);
+		goto done;
+	}
+
+	mutex_unlock(&migf->lock);
+	/*
+	 * We finished transferring the current state and the device has a
+	 * dirty state, read a new state.
+	 */
+	ret = virtiovf_read_device_context_chunk(migf, ctx_size);
+	if (ret)
+		/*
+		 * The machine is running and the context size could grow, so there is no reason to mark
+		 * the device state as VIRTIOVF_MIGF_STATE_ERROR.
+		 */
+		goto err_state_unlock;
+
+done:
+	virtiovf_state_mutex_unlock(virtvdev);
+	if (copy_to_user((void __user *)arg, &info, minsz))
+		return -EFAULT;
+	return 0;
+
+err_migf_unlock:
+	mutex_unlock(&migf->lock);
+err_state_unlock:
+	virtiovf_state_mutex_unlock(virtvdev);
+	return ret;
+}
+
 static const struct file_operations virtiovf_save_fops = {
 	.owner = THIS_MODULE,
 	.read = virtiovf_save_read,
+	.unlocked_ioctl = virtiovf_precopy_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.release = virtiovf_release_file,
 };
 
@@ -425,7 +573,7 @@ virtiovf_read_device_context_chunk(struct virtiovf_migration_file *migf,
 	int nent;
 	int ret;
 
-	buf = virtiovf_alloc_data_buffer(migf, ctx_size);
+	buf = virtiovf_get_data_buffer(migf, ctx_size);
 	if (IS_ERR(buf))
 		return PTR_ERR(buf);
 
@@ -460,7 +608,7 @@ virtiovf_read_device_context_chunk(struct virtiovf_migration_file *migf,
 		goto out;
 
 	buf->length = res_size;
-	header_buf = virtiovf_alloc_data_buffer(migf,
+	header_buf = virtiovf_get_data_buffer(migf,
 				sizeof(struct virtiovf_migration_header));
 	if (IS_ERR(header_buf)) {
 		ret = PTR_ERR(header_buf);
@@ -485,8 +633,43 @@ virtiovf_read_device_context_chunk(struct virtiovf_migration_file *migf,
 	return ret;
 }
 
+static int
+virtiovf_pci_save_device_final_data(struct virtiovf_pci_core_device *virtvdev)
+{
+	struct virtiovf_migration_file *migf = virtvdev->saving_migf;
+	u32 ctx_size;
+	int ret;
+
+	if (migf->state == VIRTIOVF_MIGF_STATE_ERROR)
+		return -ENODEV;
+
+	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
+				VIRTIO_RESOURCE_OBJ_DEV_PARTS, migf->obj_id,
+				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
+				&ctx_size);
+	if (ret)
+		goto err;
+
+	if (!ctx_size) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = virtiovf_read_device_context_chunk(migf, ctx_size);
+	if (ret)
+		goto err;
+
+	migf->state = VIRTIOVF_MIGF_STATE_COMPLETE;
+	return 0;
+
+err:
+	migf->state = VIRTIOVF_MIGF_STATE_ERROR;
+	return ret;
+}
+
 static struct virtiovf_migration_file *
-virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev)
+virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev,
+			      bool pre_copy)
 {
 	struct virtiovf_migration_file *migf;
 	u32 ctx_size;
@@ -536,6 +719,17 @@ virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev)
 	if (ret)
 		goto out_clean;
 
+	if (pre_copy) {
+		migf->pre_copy_initial_bytes = migf->max_pos;
+		ratelimit_state_init(&migf->pre_copy_rl_state, 1 * HZ, 1);
+		/* Prevent 'callbacks suppressed' messages from this rate limiter */
+		ratelimit_set_flags(&migf->pre_copy_rl_state,
+				    RATELIMIT_MSG_ON_RELEASE);
+		migf->state = VIRTIOVF_MIGF_STATE_PRECOPY;
+	} else {
+		migf->state = VIRTIOVF_MIGF_STATE_COMPLETE;
+	}
+
 	return migf;
 
 out_clean:
@@ -948,7 +1142,8 @@ virtiovf_pci_step_device_state_locked(struct virtiovf_pci_core_device *virtvdev,
 		return NULL;
 	}
 
-	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+	if ((cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY && new == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
 		ret = virtio_pci_admin_mode_set(virtvdev->core_device.pdev,
 						BIT(VIRTIO_ADMIN_CMD_DEV_MODE_F_STOPPED));
 		if (ret)
@@ -956,7 +1151,8 @@ virtiovf_pci_step_device_state_locked(struct virtiovf_pci_core_device *virtvdev,
 		return NULL;
 	}
 
-	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) {
+	if ((cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY_P2P && new == VFIO_DEVICE_STATE_PRE_COPY)) {
 		ret = virtio_pci_admin_mode_set(virtvdev->core_device.pdev, 0);
 		if (ret)
 			return ERR_PTR(ret);
@@ -966,7 +1162,7 @@ virtiovf_pci_step_device_state_locked(struct virtiovf_pci_core_device *virtvdev,
 	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
 		struct virtiovf_migration_file *migf;
 
-		migf = virtiovf_pci_save_device_data(virtvdev);
+		migf = virtiovf_pci_save_device_data(virtvdev, false);
 		if (IS_ERR(migf))
 			return ERR_CAST(migf);
 		get_file(migf->filp);
@@ -974,7 +1170,9 @@ virtiovf_pci_step_device_state_locked(struct virtiovf_pci_core_device *virtvdev,
 		return migf->filp;
 	}
 
-	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+	if ((cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY && new == VFIO_DEVICE_STATE_RUNNING) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY_P2P && new == VFIO_DEVICE_STATE_RUNNING_P2P)) {
 		virtiovf_disable_fds(virtvdev);
 		return NULL;
 	}
@@ -995,6 +1193,24 @@ virtiovf_pci_step_device_state_locked(struct virtiovf_pci_core_device *virtvdev,
 		return NULL;
 	}
 
+	if ((cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_PRE_COPY) ||
+	    (cur == VFIO_DEVICE_STATE_RUNNING_P2P &&
+	     new == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+		struct virtiovf_migration_file *migf;
+
+		migf = virtiovf_pci_save_device_data(virtvdev, true);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		virtvdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_PRE_COPY_P2P && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		ret = virtiovf_pci_save_device_final_data(virtvdev);
+		return ret ? ERR_PTR(ret) : NULL;
+	}
+
 	/*
 	 * vfio_mig_get_next_state() does not use arcs other than the above
 	 */
@@ -1098,7 +1314,8 @@ void virtiovf_set_migratable(struct virtiovf_pci_core_device *virtvdev)
 	spin_lock_init(&virtvdev->reset_lock);
 	virtvdev->core_device.vdev.migration_flags =
 		VFIO_MIGRATION_STOP_COPY |
-		VFIO_MIGRATION_P2P;
+		VFIO_MIGRATION_P2P |
+		VFIO_MIGRATION_PRE_COPY;
 	virtvdev->core_device.vdev.mig_ops = &virtvdev_pci_mig_ops;
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
                   ` (5 preceding siblings ...)
  2024-11-04 10:21 ` [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration Yishai Hadas
@ 2024-11-04 10:21 ` Yishai Hadas
  2024-11-05 23:29   ` Alex Williamson
  2024-11-06  9:32 ` [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Michael S. Tsirkin
  7 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-04 10:21 UTC (permalink / raw)
  To: alex.williamson, mst, jasowang, jgg
  Cc: kvm, virtualization, parav, feliu, kevin.tian, joao.m.martins,
	leonro, yishaih, maorg

Now that the driver supports live migration, only the legacy IO
functionality depends on config VIRTIO_PCI_ADMIN_LEGACY.

Move the legacy IO into a separate file that is compiled only when
VIRTIO_PCI_ADMIN_LEGACY is configured, and let live migration depend
only on VIRTIO_PCI.

As part of this, modify the default driver operations (i.e.,
vfio_device_ops) to use the live migration set, and extend it to include
legacy I/O operations if they are compiled and supported.
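
For reference, a minimal sketch of how main.c can wire the two ops sets
together using the helpers declared in common.h (the ops variable names
and the virtiovf_select_ops() helper are illustrative, not the literal
patch hunks):

	/* Migration only: plain vfio-pci-core read/write/ioctl */
	static const struct vfio_device_ops virtiovf_vfio_pci_lm_ops = {
		.name = "virtio-vfio-pci",
		.read = vfio_pci_core_read,
		.write = vfio_pci_core_write,
		.ioctl = vfio_pci_core_ioctl,
		/* remaining callbacks stay on the vfio-pci-core defaults */
	};

	#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
	/* Migration + legacy I/O: the wrappers from legacy_io.c */
	static const struct vfio_device_ops virtiovf_vfio_pci_tran_lm_ops = {
		.name = "virtio-vfio-pci",
		.read = virtiovf_pci_core_read,
		.write = virtiovf_pci_core_write,
		.ioctl = virtiovf_vfio_pci_core_ioctl,
		/* remaining callbacks stay on the vfio-pci-core defaults */
	};
	#endif

	static const struct vfio_device_ops *
	virtiovf_select_ops(struct pci_dev *pdev)
	{
	#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
		if (virtio_pci_admin_has_legacy_io(pdev))
			return &virtiovf_vfio_pci_tran_lm_ops;
	#endif
		return &virtiovf_vfio_pci_lm_ops;
	}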

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/virtio/Kconfig     |   4 +-
 drivers/vfio/pci/virtio/Makefile    |   1 +
 drivers/vfio/pci/virtio/common.h    |  19 ++
 drivers/vfio/pci/virtio/legacy_io.c | 420 ++++++++++++++++++++++++++++
 drivers/vfio/pci/virtio/main.c      | 416 ++-------------------------
 5 files changed, 469 insertions(+), 391 deletions(-)
 create mode 100644 drivers/vfio/pci/virtio/legacy_io.c

diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
index bd80eca4a196..af1dd9e84a5c 100644
--- a/drivers/vfio/pci/virtio/Kconfig
+++ b/drivers/vfio/pci/virtio/Kconfig
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config VIRTIO_VFIO_PCI
         tristate "VFIO support for VIRTIO NET PCI devices"
-        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
+        depends on VIRTIO_PCI
         select VFIO_PCI_CORE
         help
           This provides support for exposing VIRTIO NET VF devices which support
@@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
           As of that this driver emulates I/O BAR in software to let a VF be
           seen as a transitional device by its users and let it work with
           a legacy driver.
+          In addition, it provides migration support for VIRTIO NET VF devices
+          using the VFIO framework.
 
           If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
index bf0ccde6a91a..0032e6db4636 100644
--- a/drivers/vfio/pci/virtio/Makefile
+++ b/drivers/vfio/pci/virtio/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
 virtio-vfio-pci-y := main.o migrate.o
+virtio-vfio-pci-$(CONFIG_VIRTIO_PCI_ADMIN_LEGACY) += legacy_io.o
diff --git a/drivers/vfio/pci/virtio/common.h b/drivers/vfio/pci/virtio/common.h
index 5704603f0f9d..bca797b82f01 100644
--- a/drivers/vfio/pci/virtio/common.h
+++ b/drivers/vfio/pci/virtio/common.h
@@ -78,6 +78,7 @@ struct virtiovf_migration_file {
 
 struct virtiovf_pci_core_device {
 	struct vfio_pci_core_device core_device;
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
 	u8 *bar0_virtual_buf;
 	/* synchronize access to the virtual buf */
 	struct mutex bar_mutex;
@@ -87,6 +88,7 @@ struct virtiovf_pci_core_device {
 	__le16 pci_cmd;
 	u8 bar0_virtual_buf_size;
 	u8 notify_bar;
+#endif
 
 	/* LM related */
 	u8 migrate_cap:1;
@@ -105,4 +107,21 @@ void virtiovf_open_migration(struct virtiovf_pci_core_device *virtvdev);
 void virtiovf_close_migration(struct virtiovf_pci_core_device *virtvdev);
 void virtiovf_migration_reset_done(struct pci_dev *pdev);
 
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
+int virtiovf_open_legacy_io(struct virtiovf_pci_core_device *virtvdev);
+long virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev,
+				  unsigned int cmd, unsigned long arg);
+int virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
+				       unsigned int cmd, unsigned long arg);
+ssize_t virtiovf_pci_core_write(struct vfio_device *core_vdev,
+				const char __user *buf, size_t count,
+				loff_t *ppos);
+ssize_t virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
+			       size_t count, loff_t *ppos);
+int virtiovf_init_legacy_io(struct virtiovf_pci_core_device *virtvdev,
+			    bool *sup_legacy_io);
+void virtiovf_release_legacy_io(struct virtiovf_pci_core_device *virtvdev);
+void virtiovf_legacy_io_reset_done(struct pci_dev *pdev);
+#endif
+
 #endif /* VIRTIO_VFIO_COMMON_H */
diff --git a/drivers/vfio/pci/virtio/legacy_io.c b/drivers/vfio/pci/virtio/legacy_io.c
new file mode 100644
index 000000000000..52c7515ff020
--- /dev/null
+++ b/drivers/vfio/pci/virtio/legacy_io.c
@@ -0,0 +1,420 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/virtio_pci.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_pci_admin.h>
+
+#include "common.h"
+
+static int
+virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
+			     loff_t pos, char __user *buf,
+			     size_t count, bool read)
+{
+	bool msix_enabled =
+		(virtvdev->core_device.irq_type == VFIO_PCI_MSIX_IRQ_INDEX);
+	struct pci_dev *pdev = virtvdev->core_device.pdev;
+	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
+	bool common;
+	u8 offset;
+	int ret;
+
+	common = pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled);
+	/* offset within the relevant configuration area */
+	offset = common ? pos : pos - VIRTIO_PCI_CONFIG_OFF(msix_enabled);
+	mutex_lock(&virtvdev->bar_mutex);
+	if (read) {
+		if (common)
+			ret = virtio_pci_admin_legacy_common_io_read(pdev, offset,
+					count, bar0_buf + pos);
+		else
+			ret = virtio_pci_admin_legacy_device_io_read(pdev, offset,
+					count, bar0_buf + pos);
+		if (ret)
+			goto out;
+		if (copy_to_user(buf, bar0_buf + pos, count))
+			ret = -EFAULT;
+	} else {
+		if (copy_from_user(bar0_buf + pos, buf, count)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		if (common)
+			ret = virtio_pci_admin_legacy_common_io_write(pdev, offset,
+					count, bar0_buf + pos);
+		else
+			ret = virtio_pci_admin_legacy_device_io_write(pdev, offset,
+					count, bar0_buf + pos);
+	}
+out:
+	mutex_unlock(&virtvdev->bar_mutex);
+	return ret;
+}
+
+static int
+virtiovf_pci_bar0_rw(struct virtiovf_pci_core_device *virtvdev,
+		     loff_t pos, char __user *buf,
+		     size_t count, bool read)
+{
+	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
+	struct pci_dev *pdev = core_device->pdev;
+	u16 queue_notify;
+	int ret;
+
+	if (!(le16_to_cpu(virtvdev->pci_cmd) & PCI_COMMAND_IO))
+		return -EIO;
+
+	if (pos + count > virtvdev->bar0_virtual_buf_size)
+		return -EINVAL;
+
+	ret = pm_runtime_resume_and_get(&pdev->dev);
+	if (ret) {
+		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
+		return -EIO;
+	}
+
+	switch (pos) {
+	case VIRTIO_PCI_QUEUE_NOTIFY:
+		if (count != sizeof(queue_notify)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		if (read) {
+			ret = vfio_pci_core_ioread16(core_device, true, &queue_notify,
+						     virtvdev->notify_addr);
+			if (ret)
+				goto end;
+			if (copy_to_user(buf, &queue_notify,
+					 sizeof(queue_notify))) {
+				ret = -EFAULT;
+				goto end;
+			}
+		} else {
+			if (copy_from_user(&queue_notify, buf, count)) {
+				ret = -EFAULT;
+				goto end;
+			}
+			ret = vfio_pci_core_iowrite16(core_device, true, queue_notify,
+						      virtvdev->notify_addr);
+		}
+		break;
+	default:
+		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
+						   read);
+	}
+
+end:
+	pm_runtime_put(&pdev->dev);
+	return ret ? ret : count;
+}
+
+static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
+					char __user *buf, size_t count,
+					loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	size_t register_offset;
+	loff_t copy_offset;
+	size_t copy_count;
+	__le32 val32;
+	__le16 val16;
+	u8 val8;
+	int ret;
+
+	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
+	if (ret < 0)
+		return ret;
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_DEVICE_ID,
+						sizeof(val16), &copy_offset,
+						&copy_count, &register_offset)) {
+		val16 = cpu_to_le16(VIRTIO_TRANS_ID_NET);
+		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset, copy_count))
+			return -EFAULT;
+	}
+
+	if ((le16_to_cpu(virtvdev->pci_cmd) & PCI_COMMAND_IO) &&
+	    vfio_pci_core_range_intersect_range(pos, count, PCI_COMMAND,
+						sizeof(val16), &copy_offset,
+						&copy_count, &register_offset)) {
+		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
+				   copy_count))
+			return -EFAULT;
+		val16 |= cpu_to_le16(PCI_COMMAND_IO);
+		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
+				 copy_count))
+			return -EFAULT;
+	}
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_REVISION_ID,
+						sizeof(val8), &copy_offset,
+						&copy_count, &register_offset)) {
+		/* Transitional devices need to have revision 0 */
+		val8 = 0;
+		if (copy_to_user(buf + copy_offset, &val8, copy_count))
+			return -EFAULT;
+	}
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_0,
+						sizeof(val32), &copy_offset,
+						&copy_count, &register_offset)) {
+		u32 bar_mask = ~(virtvdev->bar0_virtual_buf_size - 1);
+		u32 pci_base_addr_0 = le32_to_cpu(virtvdev->pci_base_addr_0);
+
+		val32 = cpu_to_le32((pci_base_addr_0 & bar_mask) | PCI_BASE_ADDRESS_SPACE_IO);
+		if (copy_to_user(buf + copy_offset, (void *)&val32 + register_offset, copy_count))
+			return -EFAULT;
+	}
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_SUBSYSTEM_ID,
+						sizeof(val16), &copy_offset,
+						&copy_count, &register_offset)) {
+		/*
+		 * Transitional devices use the PCI subsystem device id as
+		 * virtio device id, same as legacy driver always did.
+		 */
+		val16 = cpu_to_le16(VIRTIO_ID_NET);
+		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
+				 copy_count))
+			return -EFAULT;
+	}
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_SUBSYSTEM_VENDOR_ID,
+						sizeof(val16), &copy_offset,
+						&copy_count, &register_offset)) {
+		val16 = cpu_to_le16(PCI_VENDOR_ID_REDHAT_QUMRANET);
+		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
+				 copy_count))
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+ssize_t virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (!count)
+		return 0;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
+
+	if (index == VFIO_PCI_BAR0_REGION_INDEX)
+		return virtiovf_pci_bar0_rw(virtvdev, pos, buf, count, true);
+
+	return vfio_pci_core_read(core_vdev, buf, count, ppos);
+}
+
+static ssize_t virtiovf_pci_write_config(struct vfio_device *core_vdev,
+					 const char __user *buf, size_t count,
+					 loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	size_t register_offset;
+	loff_t copy_offset;
+	size_t copy_count;
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_COMMAND,
+						sizeof(virtvdev->pci_cmd),
+						&copy_offset, &copy_count,
+						&register_offset)) {
+		if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
+				   buf + copy_offset,
+				   copy_count))
+			return -EFAULT;
+	}
+
+	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_0,
+						sizeof(virtvdev->pci_base_addr_0),
+						&copy_offset, &copy_count,
+						&register_offset)) {
+		if (copy_from_user((void *)&virtvdev->pci_base_addr_0 + register_offset,
+				   buf + copy_offset,
+				   copy_count))
+			return -EFAULT;
+	}
+
+	return vfio_pci_core_write(core_vdev, buf, count, ppos);
+}
+
+ssize_t virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (!count)
+		return 0;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return virtiovf_pci_write_config(core_vdev, buf, count, ppos);
+
+	if (index == VFIO_PCI_BAR0_REGION_INDEX)
+		return virtiovf_pci_bar0_rw(virtvdev, pos, (char __user *)buf, count, false);
+
+	return vfio_pci_core_write(core_vdev, buf, count, ppos);
+}
+
+int virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
+				       unsigned int cmd, unsigned long arg)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+	void __user *uarg = (void __user *)arg;
+	struct vfio_region_info info = {};
+
+	if (copy_from_user(&info, uarg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	switch (info.index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+		info.size = virtvdev->bar0_virtual_buf_size;
+		info.flags = VFIO_REGION_INFO_FLAG_READ |
+			     VFIO_REGION_INFO_FLAG_WRITE;
+		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
+	default:
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+	}
+}
+
+long virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+				  unsigned long arg)
+{
+	switch (cmd) {
+	case VFIO_DEVICE_GET_REGION_INFO:
+		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
+	default:
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+	}
+}
+
+static int virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
+{
+	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
+	int ret;
+
+	/*
+	 * Set up the BAR where the 'notify' area lives so that vfio can use it as well.
+	 * This lets us mmap it only once and use it when needed.
+	 */
+	ret = vfio_pci_core_setup_barmap(core_device,
+					 virtvdev->notify_bar);
+	if (ret)
+		return ret;
+
+	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
+			virtvdev->notify_offset;
+	return 0;
+}
+
+int virtiovf_open_legacy_io(struct virtiovf_pci_core_device *virtvdev)
+{
+	if (!virtvdev->bar0_virtual_buf)
+		return 0;
+
+	/*
+	 * Upon close_device(), vfio_pci_core_disable() is called and closes
+	 * all the previous mmaps, so the valid life cycle for the 'notify'
+	 * addr is per open/close.
+	 */
+	return virtiovf_set_notify_addr(virtvdev);
+}
+
+static int virtiovf_get_device_config_size(unsigned short device)
+{
+	/* Network card */
+	return offsetofend(struct virtio_net_config, status);
+}
+
+static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
+{
+	u64 offset;
+	int ret;
+	u8 bar;
+
+	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
+				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
+				&bar, &offset);
+	if (ret)
+		return ret;
+
+	virtvdev->notify_bar = bar;
+	virtvdev->notify_offset = offset;
+	return 0;
+}
+
+static bool virtiovf_bar0_exists(struct pci_dev *pdev)
+{
+	struct resource *res = pdev->resource;
+
+	return res->flags;
+}
+
+int virtiovf_init_legacy_io(struct virtiovf_pci_core_device *virtvdev,
+			    bool *sup_legacy_io)
+{
+	struct pci_dev *pdev = virtvdev->core_device.pdev;
+	int ret;
+
+	*sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
+			!virtiovf_bar0_exists(pdev);
+
+	if (!*sup_legacy_io)
+		return 0;
+
+	ret = virtiovf_read_notify_info(virtvdev);
+	if (ret)
+		return ret;
+
+	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
+				virtiovf_get_device_config_size(pdev->device);
+	BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
+	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
+					     GFP_KERNEL);
+	if (!virtvdev->bar0_virtual_buf)
+		return -ENOMEM;
+	mutex_init(&virtvdev->bar_mutex);
+	return 0;
+}
+
+void virtiovf_release_legacy_io(struct virtiovf_pci_core_device *virtvdev)
+{
+	kfree(virtvdev->bar0_virtual_buf);
+}
+
+void virtiovf_legacy_io_reset_done(struct pci_dev *pdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
+
+	virtvdev->pci_cmd = 0;
+}
diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
index e2cdf2d48200..01b237908e7a 100644
--- a/drivers/vfio/pci/virtio/main.c
+++ b/drivers/vfio/pci/virtio/main.c
@@ -20,330 +20,6 @@
 
 static int virtiovf_pci_init_device(struct vfio_device *core_vdev);
 
-static int
-virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
-			     loff_t pos, char __user *buf,
-			     size_t count, bool read)
-{
-	bool msix_enabled =
-		(virtvdev->core_device.irq_type == VFIO_PCI_MSIX_IRQ_INDEX);
-	struct pci_dev *pdev = virtvdev->core_device.pdev;
-	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
-	bool common;
-	u8 offset;
-	int ret;
-
-	common = pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled);
-	/* offset within the relevant configuration area */
-	offset = common ? pos : pos - VIRTIO_PCI_CONFIG_OFF(msix_enabled);
-	mutex_lock(&virtvdev->bar_mutex);
-	if (read) {
-		if (common)
-			ret = virtio_pci_admin_legacy_common_io_read(pdev, offset,
-					count, bar0_buf + pos);
-		else
-			ret = virtio_pci_admin_legacy_device_io_read(pdev, offset,
-					count, bar0_buf + pos);
-		if (ret)
-			goto out;
-		if (copy_to_user(buf, bar0_buf + pos, count))
-			ret = -EFAULT;
-	} else {
-		if (copy_from_user(bar0_buf + pos, buf, count)) {
-			ret = -EFAULT;
-			goto out;
-		}
-
-		if (common)
-			ret = virtio_pci_admin_legacy_common_io_write(pdev, offset,
-					count, bar0_buf + pos);
-		else
-			ret = virtio_pci_admin_legacy_device_io_write(pdev, offset,
-					count, bar0_buf + pos);
-	}
-out:
-	mutex_unlock(&virtvdev->bar_mutex);
-	return ret;
-}
-
-static int
-virtiovf_pci_bar0_rw(struct virtiovf_pci_core_device *virtvdev,
-		     loff_t pos, char __user *buf,
-		     size_t count, bool read)
-{
-	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
-	struct pci_dev *pdev = core_device->pdev;
-	u16 queue_notify;
-	int ret;
-
-	if (!(le16_to_cpu(virtvdev->pci_cmd) & PCI_COMMAND_IO))
-		return -EIO;
-
-	if (pos + count > virtvdev->bar0_virtual_buf_size)
-		return -EINVAL;
-
-	ret = pm_runtime_resume_and_get(&pdev->dev);
-	if (ret) {
-		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
-		return -EIO;
-	}
-
-	switch (pos) {
-	case VIRTIO_PCI_QUEUE_NOTIFY:
-		if (count != sizeof(queue_notify)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		if (read) {
-			ret = vfio_pci_core_ioread16(core_device, true, &queue_notify,
-						     virtvdev->notify_addr);
-			if (ret)
-				goto end;
-			if (copy_to_user(buf, &queue_notify,
-					 sizeof(queue_notify))) {
-				ret = -EFAULT;
-				goto end;
-			}
-		} else {
-			if (copy_from_user(&queue_notify, buf, count)) {
-				ret = -EFAULT;
-				goto end;
-			}
-			ret = vfio_pci_core_iowrite16(core_device, true, queue_notify,
-						      virtvdev->notify_addr);
-		}
-		break;
-	default:
-		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
-						   read);
-	}
-
-end:
-	pm_runtime_put(&pdev->dev);
-	return ret ? ret : count;
-}
-
-static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
-					char __user *buf, size_t count,
-					loff_t *ppos)
-{
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
-	size_t register_offset;
-	loff_t copy_offset;
-	size_t copy_count;
-	__le32 val32;
-	__le16 val16;
-	u8 val8;
-	int ret;
-
-	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
-	if (ret < 0)
-		return ret;
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_DEVICE_ID,
-						sizeof(val16), &copy_offset,
-						&copy_count, &register_offset)) {
-		val16 = cpu_to_le16(VIRTIO_TRANS_ID_NET);
-		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset, copy_count))
-			return -EFAULT;
-	}
-
-	if ((le16_to_cpu(virtvdev->pci_cmd) & PCI_COMMAND_IO) &&
-	    vfio_pci_core_range_intersect_range(pos, count, PCI_COMMAND,
-						sizeof(val16), &copy_offset,
-						&copy_count, &register_offset)) {
-		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
-				   copy_count))
-			return -EFAULT;
-		val16 |= cpu_to_le16(PCI_COMMAND_IO);
-		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
-				 copy_count))
-			return -EFAULT;
-	}
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_REVISION_ID,
-						sizeof(val8), &copy_offset,
-						&copy_count, &register_offset)) {
-		/* Transional needs to have revision 0 */
-		val8 = 0;
-		if (copy_to_user(buf + copy_offset, &val8, copy_count))
-			return -EFAULT;
-	}
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_0,
-						sizeof(val32), &copy_offset,
-						&copy_count, &register_offset)) {
-		u32 bar_mask = ~(virtvdev->bar0_virtual_buf_size - 1);
-		u32 pci_base_addr_0 = le32_to_cpu(virtvdev->pci_base_addr_0);
-
-		val32 = cpu_to_le32((pci_base_addr_0 & bar_mask) | PCI_BASE_ADDRESS_SPACE_IO);
-		if (copy_to_user(buf + copy_offset, (void *)&val32 + register_offset, copy_count))
-			return -EFAULT;
-	}
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_SUBSYSTEM_ID,
-						sizeof(val16), &copy_offset,
-						&copy_count, &register_offset)) {
-		/*
-		 * Transitional devices use the PCI subsystem device id as
-		 * virtio device id, same as legacy driver always did.
-		 */
-		val16 = cpu_to_le16(VIRTIO_ID_NET);
-		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
-				 copy_count))
-			return -EFAULT;
-	}
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_SUBSYSTEM_VENDOR_ID,
-						sizeof(val16), &copy_offset,
-						&copy_count, &register_offset)) {
-		val16 = cpu_to_le16(PCI_VENDOR_ID_REDHAT_QUMRANET);
-		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
-				 copy_count))
-			return -EFAULT;
-	}
-
-	return count;
-}
-
-static ssize_t
-virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
-		       size_t count, loff_t *ppos)
-{
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
-	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
-
-	if (!count)
-		return 0;
-
-	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
-		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
-
-	if (index == VFIO_PCI_BAR0_REGION_INDEX)
-		return virtiovf_pci_bar0_rw(virtvdev, pos, buf, count, true);
-
-	return vfio_pci_core_read(core_vdev, buf, count, ppos);
-}
-
-static ssize_t virtiovf_pci_write_config(struct vfio_device *core_vdev,
-					 const char __user *buf, size_t count,
-					 loff_t *ppos)
-{
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
-	size_t register_offset;
-	loff_t copy_offset;
-	size_t copy_count;
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_COMMAND,
-						sizeof(virtvdev->pci_cmd),
-						&copy_offset, &copy_count,
-						&register_offset)) {
-		if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
-				   buf + copy_offset,
-				   copy_count))
-			return -EFAULT;
-	}
-
-	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_0,
-						sizeof(virtvdev->pci_base_addr_0),
-						&copy_offset, &copy_count,
-						&register_offset)) {
-		if (copy_from_user((void *)&virtvdev->pci_base_addr_0 + register_offset,
-				   buf + copy_offset,
-				   copy_count))
-			return -EFAULT;
-	}
-
-	return vfio_pci_core_write(core_vdev, buf, count, ppos);
-}
-
-static ssize_t
-virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
-			size_t count, loff_t *ppos)
-{
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
-	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
-
-	if (!count)
-		return 0;
-
-	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
-		return virtiovf_pci_write_config(core_vdev, buf, count, ppos);
-
-	if (index == VFIO_PCI_BAR0_REGION_INDEX)
-		return virtiovf_pci_bar0_rw(virtvdev, pos, (char __user *)buf, count, false);
-
-	return vfio_pci_core_write(core_vdev, buf, count, ppos);
-}
-
-static int
-virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
-				   unsigned int cmd, unsigned long arg)
-{
-	struct virtiovf_pci_core_device *virtvdev = container_of(
-		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
-	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
-	void __user *uarg = (void __user *)arg;
-	struct vfio_region_info info = {};
-
-	if (copy_from_user(&info, uarg, minsz))
-		return -EFAULT;
-
-	if (info.argsz < minsz)
-		return -EINVAL;
-
-	switch (info.index) {
-	case VFIO_PCI_BAR0_REGION_INDEX:
-		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-		info.size = virtvdev->bar0_virtual_buf_size;
-		info.flags = VFIO_REGION_INFO_FLAG_READ |
-			     VFIO_REGION_INFO_FLAG_WRITE;
-		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
-	default:
-		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
-	}
-}
-
-static long
-virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
-			     unsigned long arg)
-{
-	switch (cmd) {
-	case VFIO_DEVICE_GET_REGION_INFO:
-		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
-	default:
-		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
-	}
-}
-
-static int
-virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
-{
-	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
-	int ret;
-
-	/*
-	 * Setup the BAR where the 'notify' exists to be used by vfio as well
-	 * This will let us mmap it only once and use it when needed.
-	 */
-	ret = vfio_pci_core_setup_barmap(core_device,
-					 virtvdev->notify_bar);
-	if (ret)
-		return ret;
-
-	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
-			virtvdev->notify_offset;
-	return 0;
-}
-
 static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
 {
 	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
@@ -355,18 +31,13 @@ static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
 	if (ret)
 		return ret;
 
-	if (virtvdev->bar0_virtual_buf) {
-		/*
-		 * Upon close_device() the vfio_pci_core_disable() is called
-		 * and will close all the previous mmaps, so it seems that the
-		 * valid life cycle for the 'notify' addr is per open/close.
-		 */
-		ret = virtiovf_set_notify_addr(virtvdev);
-		if (ret) {
-			vfio_pci_core_disable(vdev);
-			return ret;
-		}
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
+	ret = virtiovf_open_legacy_io(virtvdev);
+	if (ret) {
+		vfio_pci_core_disable(vdev);
+		return ret;
 	}
+#endif
 
 	virtiovf_open_migration(virtvdev);
 	vfio_pci_core_finish_enable(vdev);
@@ -382,35 +53,14 @@ static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
 	vfio_pci_core_close_device(core_vdev);
 }
 
-static int virtiovf_get_device_config_size(unsigned short device)
-{
-	/* Network card */
-	return offsetofend(struct virtio_net_config, status);
-}
-
-static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
-{
-	u64 offset;
-	int ret;
-	u8 bar;
-
-	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
-				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
-				&bar, &offset);
-	if (ret)
-		return ret;
-
-	virtvdev->notify_bar = bar;
-	virtvdev->notify_offset = offset;
-	return 0;
-}
-
 static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
 {
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
 	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
 			struct virtiovf_pci_core_device, core_device.vdev);
 
-	kfree(virtvdev->bar0_virtual_buf);
+	virtiovf_release_legacy_io(virtvdev);
+#endif
 	vfio_pci_core_release_dev(core_vdev);
 }
 
@@ -433,6 +83,7 @@ static const struct vfio_device_ops virtiovf_vfio_pci_lm_ops = {
 	.detach_ioas = vfio_iommufd_physical_detach_ioas,
 };
 
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
 static const struct vfio_device_ops virtiovf_vfio_pci_tran_lm_ops = {
 	.name = "virtio-vfio-pci-trans-lm",
 	.init = virtiovf_pci_init_device,
@@ -451,6 +102,7 @@ static const struct vfio_device_ops virtiovf_vfio_pci_tran_lm_ops = {
 	.attach_ioas = vfio_iommufd_physical_attach_ioas,
 	.detach_ioas = vfio_iommufd_physical_detach_ioas,
 };
+#endif
 
 static const struct vfio_device_ops virtiovf_vfio_pci_ops = {
 	.name = "virtio-vfio-pci",
@@ -471,19 +123,12 @@ static const struct vfio_device_ops virtiovf_vfio_pci_ops = {
 	.detach_ioas = vfio_iommufd_physical_detach_ioas,
 };
 
-static bool virtiovf_bar0_exists(struct pci_dev *pdev)
-{
-	struct resource *res = pdev->resource;
-
-	return res->flags;
-}
-
 static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
 {
 	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
 			struct virtiovf_pci_core_device, core_device.vdev);
 	struct pci_dev *pdev;
-	bool sup_legacy_io;
+	bool sup_legacy_io = false;
 	bool sup_lm;
 	int ret;
 
@@ -492,8 +137,12 @@ static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
 		return ret;
 
 	pdev = virtvdev->core_device.pdev;
-	sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
-				!virtiovf_bar0_exists(pdev);
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
+	ret = virtiovf_init_legacy_io(virtvdev, &sup_legacy_io);
+	if (ret)
+		return ret;
+#endif
+
 	sup_lm = virtio_pci_admin_has_dev_parts(pdev);
 
 	/*
@@ -505,26 +154,13 @@ static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
 		return 0;
 	}
 
-	if (sup_legacy_io) {
-		ret = virtiovf_read_notify_info(virtvdev);
-		if (ret)
-			return ret;
-
-		virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
-					virtiovf_get_device_config_size(pdev->device);
-		BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
-		virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
-						     GFP_KERNEL);
-		if (!virtvdev->bar0_virtual_buf)
-			return -ENOMEM;
-		mutex_init(&virtvdev->bar_mutex);
-	}
-
 	if (sup_lm)
 		virtiovf_set_migratable(virtvdev);
 
-	if (sup_lm && !sup_legacy_io)
-		core_vdev->ops = &virtiovf_vfio_pci_lm_ops;
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
+	if (sup_legacy_io)
+		core_vdev->ops = &virtiovf_vfio_pci_tran_lm_ops;
+#endif
 
 	return 0;
 }
@@ -536,7 +172,7 @@ static int virtiovf_pci_probe(struct pci_dev *pdev,
 	const struct vfio_device_ops *ops;
 	int ret;
 
-	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_tran_lm_ops :
+	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_lm_ops :
 				  &virtiovf_vfio_pci_ops;
 
 	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
@@ -572,9 +208,9 @@ MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
 
 static void virtiovf_pci_aer_reset_done(struct pci_dev *pdev)
 {
-	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
-
-	virtvdev->pci_cmd = 0;
+#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
+	virtiovf_legacy_io_reset_done(pdev);
+#endif
 	virtiovf_migration_reset_done(pdev);
 }
 
-- 
2.27.0


* Re: [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality
  2024-11-04 10:21 ` [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality Yishai Hadas
@ 2024-11-05 22:47   ` Alex Williamson
  2024-11-06 10:21     ` Yishai Hadas
  2024-11-06 15:48     ` Jason Gunthorpe
  0 siblings, 2 replies; 28+ messages in thread
From: Alex Williamson @ 2024-11-05 22:47 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On Mon, 4 Nov 2024 12:21:29 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> index b5d3a8c5bbc9..e2cdf2d48200 100644
> --- a/drivers/vfio/pci/virtio/main.c
> +++ b/drivers/vfio/pci/virtio/main.c
...
> @@ -485,16 +478,66 @@ static bool virtiovf_bar0_exists(struct pci_dev *pdev)
>  	return res->flags;
>  }
>  
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
> +			struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev;
> +	bool sup_legacy_io;
> +	bool sup_lm;
> +	int ret;
> +
> +	ret = vfio_pci_core_init_dev(core_vdev);
> +	if (ret)
> +		return ret;
> +
> +	pdev = virtvdev->core_device.pdev;
> +	sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
> +				!virtiovf_bar0_exists(pdev);
> +	sup_lm = virtio_pci_admin_has_dev_parts(pdev);
> +
> +	/*
> +	 * If the device is not capable to this driver functionality, fallback
> +	 * to the default vfio-pci ops
> +	 */
> +	if (!sup_legacy_io && !sup_lm) {
> +		core_vdev->ops = &virtiovf_vfio_pci_ops;
> +		return 0;
> +	}
> +
> +	if (sup_legacy_io) {
> +		ret = virtiovf_read_notify_info(virtvdev);
> +		if (ret)
> +			return ret;
> +
> +		virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
> +					virtiovf_get_device_config_size(pdev->device);
> +		BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
> +		virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> +						     GFP_KERNEL);
> +		if (!virtvdev->bar0_virtual_buf)
> +			return -ENOMEM;
> +		mutex_init(&virtvdev->bar_mutex);
> +	}
> +
> +	if (sup_lm)
> +		virtiovf_set_migratable(virtvdev);
> +
> +	if (sup_lm && !sup_legacy_io)
> +		core_vdev->ops = &virtiovf_vfio_pci_lm_ops;
> +
> +	return 0;
> +}
> +
>  static int virtiovf_pci_probe(struct pci_dev *pdev,
>  			      const struct pci_device_id *id)
>  {
> -	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
>  	struct virtiovf_pci_core_device *virtvdev;
> +	const struct vfio_device_ops *ops;
>  	int ret;
>  
> -	if (pdev->is_virtfn && virtio_pci_admin_has_legacy_io(pdev) &&
> -	    !virtiovf_bar0_exists(pdev))
> -		ops = &virtiovf_vfio_pci_tran_ops;
> +	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_tran_lm_ops :
> +				  &virtiovf_vfio_pci_ops;

I can't figure out why we moved the more thorough ops setup to the
.init() callback of the ops themselves.  Clearly we can do the legacy
IO and BAR0 tests here, and the dev parts test uses the same mechanisms
as the legacy IO test, so it seems we could know sup_legacy_io and
sup_lm here.  I think we can even do virtiovf_set_migratable() here
after virtvdev is allocated below.

I think the API to vfio core also suggests we shouldn't be modifying the
ops pointer after the core device is allocated.
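
To make that concrete, here is a rough, untested sketch of what probe()
could look like with the ops chosen up front (the exact helper usage and
error handling below are illustrative only, not a drop-in replacement):

static int virtiovf_pci_probe(struct pci_dev *pdev,
			      const struct pci_device_id *id)
{
	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
	struct virtiovf_pci_core_device *virtvdev;
	bool sup_legacy_io = false;
	bool sup_lm = false;
	int ret;

	if (pdev->is_virtfn) {
		/* Both checks go through the same PF admin queue mechanism */
		sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
				!virtiovf_bar0_exists(pdev);
		sup_lm = virtio_pci_admin_has_dev_parts(pdev);

		if (sup_legacy_io)
			ops = &virtiovf_vfio_pci_tran_lm_ops;
		else if (sup_lm)
			ops = &virtiovf_vfio_pci_lm_ops;
	}

	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
				     &pdev->dev, ops);
	if (IS_ERR(virtvdev))
		return PTR_ERR(virtvdev);

	/* Migration support can be flagged here, right after allocation */
	if (sup_lm)
		virtiovf_set_migratable(virtvdev);

	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
	ret = vfio_pci_core_register_device(&virtvdev->core_device);
	if (ret) {
		vfio_put_device(&virtvdev->core_device.vdev);
		return ret;
	}
	return 0;
}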

>  
>  	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
>  				     &pdev->dev, ops);
> @@ -532,6 +575,7 @@ static void virtiovf_pci_aer_reset_done(struct pci_dev *pdev)
>  	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
>  
>  	virtvdev->pci_cmd = 0;
> +	virtiovf_migration_reset_done(pdev);
>  }
>  
>  static const struct pci_error_handlers virtiovf_err_handlers = {
> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
> new file mode 100644
> index 000000000000..2a9614c2ef07
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/migrate.c
...
> +static int virtiovf_pci_get_data_size(struct vfio_device *vdev,
> +				      unsigned long *stop_copy_length)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	bool obj_id_exists;
> +	u32 res_size;
> +	u32 obj_id;
> +	int ret;
> +
> +	mutex_lock(&virtvdev->state_mutex);
> +	obj_id_exists = virtvdev->saving_migf && virtvdev->saving_migf->has_obj_id;
> +	if (!obj_id_exists) {
> +		ret = virtiovf_pci_alloc_obj_id(virtvdev,
> +						VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET,
> +						&obj_id);
> +		if (ret)
> +			goto end;
> +	} else {
> +		obj_id = virtvdev->saving_migf->obj_id;
> +	}
> +
> +	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
> +				VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id,
> +				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
> +				&res_size);
> +	if (!ret)
> +		*stop_copy_length = res_size;
> +
> +	/* We can't leave this obj_id alive if didn't exist before, otherwise, it might
> +	 * stay alive, even without an active migration flow (e.g. migration was cancelled)
> +	 */

Nit, multi-line comment style.
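
i.e., the usual kernel multi-line style:

	/*
	 * We can't leave this obj_id alive if it didn't exist before.
	 * Otherwise it might stay alive even without an active migration
	 * flow (e.g. the migration was cancelled).
	 */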

Thanks,
Alex


* Re: [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration
  2024-11-04 10:21 ` [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration Yishai Hadas
@ 2024-11-05 23:18   ` Alex Williamson
  2024-11-06 11:16     ` Yishai Hadas
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2024-11-05 23:18 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On Mon, 4 Nov 2024 12:21:30 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> Add PRE_COPY support for live migration.
> 
> This functionality may reduce the downtime upon STOP_COPY by letting
> the target machine get some 'initial data' from the source while the
> machine is still in its RUNNING state, and prepare itself ahead of time
> to receive the final STOP_COPY data.
> 
> The Virtio specification does not support reading partial or
> incremental device contexts. This means that during the PRE_COPY state,
> the vfio-virtio driver reads the full device state.
> 
> As the device state can change, and the benefit is highest when the
> pre-copy data closely matches the final data, we read it in a rate
> limited mode, reporting no data available for some time interval
> after the previous call.
> 
> With PRE_COPY enabled, we observed a downtime reduction of approximately
> 70-75% in various scenarios compared to when PRE_COPY was disabled,
> while keeping the total migration time nearly the same.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/pci/virtio/common.h  |   4 +
>  drivers/vfio/pci/virtio/migrate.c | 233 +++++++++++++++++++++++++++++-
>  2 files changed, 229 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/virtio/common.h b/drivers/vfio/pci/virtio/common.h
> index 3bdfb3ea1174..5704603f0f9d 100644
> --- a/drivers/vfio/pci/virtio/common.h
> +++ b/drivers/vfio/pci/virtio/common.h
> @@ -10,6 +10,8 @@
>  
>  enum virtiovf_migf_state {
>  	VIRTIOVF_MIGF_STATE_ERROR = 1,
> +	VIRTIOVF_MIGF_STATE_PRECOPY = 2,
> +	VIRTIOVF_MIGF_STATE_COMPLETE = 3,
>  };
>  
>  enum virtiovf_load_state {
> @@ -57,6 +59,8 @@ struct virtiovf_migration_file {
>  	/* synchronize access to the file state */
>  	struct mutex lock;
>  	loff_t max_pos;
> +	u64 pre_copy_initial_bytes;
> +	struct ratelimit_state pre_copy_rl_state;
>  	u64 record_size;
>  	u32 record_tag;
>  	u8 has_obj_id:1;
> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
> index 2a9614c2ef07..cdb252f6fd80 100644
> --- a/drivers/vfio/pci/virtio/migrate.c
> +++ b/drivers/vfio/pci/virtio/migrate.c
...
> @@ -379,9 +432,104 @@ static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t le
>  	return done;
>  }
>  
> +static long virtiovf_precopy_ioctl(struct file *filp, unsigned int cmd,
> +				   unsigned long arg)
> +{
> +	struct virtiovf_migration_file *migf = filp->private_data;
> +	struct virtiovf_pci_core_device *virtvdev = migf->virtvdev;
> +	struct vfio_precopy_info info = {};
> +	loff_t *pos = &filp->f_pos;
> +	bool end_of_data = false;
> +	unsigned long minsz;
> +	u32 ctx_size;
> +	int ret;
> +
> +	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
> +		return -ENOTTY;
> +
> +	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
> +	if (copy_from_user(&info, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz)
> +		return -EINVAL;
> +
> +	mutex_lock(&virtvdev->state_mutex);
> +	if (virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
> +	    virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY_P2P) {
> +		ret = -EINVAL;
> +		goto err_state_unlock;
> +	}
> +
> +	/*
> +	 * The virtio specification does not include a PRE_COPY concept.
> +	 * Since we can expect the data to remain the same for a certain period,
> +	 * we use a rate limiter mechanism before making a call to the device.
> +	 */
> +	if (!__ratelimit(&migf->pre_copy_rl_state)) {
> +		/* Reporting no data available */
> +		ret = 0;
> +		goto done;

@ret is not used by the done: goto target.  Don't we need to zero dirty
bytes, or account for initial bytes not being fully read yet?

> +	}
> +
> +	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
> +				VIRTIO_RESOURCE_OBJ_DEV_PARTS, migf->obj_id,
> +				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
> +				&ctx_size);
> +	if (ret)
> +		goto err_state_unlock;
> +
> +	mutex_lock(&migf->lock);
> +	if (migf->state == VIRTIOVF_MIGF_STATE_ERROR) {
> +		ret = -ENODEV;
> +		goto err_migf_unlock;
> +	}
> +
> +	if (migf->pre_copy_initial_bytes > *pos) {
> +		info.initial_bytes = migf->pre_copy_initial_bytes - *pos;
> +	} else {
> +		info.dirty_bytes = migf->max_pos - *pos;
> +		if (!info.dirty_bytes)
> +			end_of_data = true;
> +		info.dirty_bytes += ctx_size;
> +	}
> +
> +	if (!end_of_data || !ctx_size) {
> +		mutex_unlock(&migf->lock);
> +		goto done;
> +	}
> +
> +	mutex_unlock(&migf->lock);
> +	/*
> +	 * We finished transferring the current state and the device has a
> +	 * dirty state, read a new state.
> +	 */
> +	ret = virtiovf_read_device_context_chunk(migf, ctx_size);
> +	if (ret)
> +		/*
> +		 * The machine is running, and context size could be grow, so no reason to mark
> +		 * the device state as VIRTIOVF_MIGF_STATE_ERROR.
> +		 */
> +		goto err_state_unlock;
> +
> +done:
> +	virtiovf_state_mutex_unlock(virtvdev);
> +	if (copy_to_user((void __user *)arg, &info, minsz))
> +		return -EFAULT;
> +	return 0;
> +
> +err_migf_unlock:
> +	mutex_unlock(&migf->lock);
> +err_state_unlock:
> +	virtiovf_state_mutex_unlock(virtvdev);
> +	return ret;
> +}
> +
...
> @@ -536,6 +719,17 @@ virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev)
>  	if (ret)
>  		goto out_clean;
>  
> +	if (pre_copy) {
> +		migf->pre_copy_initial_bytes = migf->max_pos;
> +		ratelimit_state_init(&migf->pre_copy_rl_state, 1 * HZ, 1);

A comment describing the choice of this heuristic would be useful for
future maintenance, even if the comment is "Arbitrarily rate limit
pre-copy to 1s intervals."  Thanks,

Alex

> +		/* Prevent any rate messages upon its usage */
> +		ratelimit_set_flags(&migf->pre_copy_rl_state,
> +				    RATELIMIT_MSG_ON_RELEASE);
> +		migf->state = VIRTIOVF_MIGF_STATE_PRECOPY;
> +	} else {
> +		migf->state = VIRTIOVF_MIGF_STATE_COMPLETE;
> +	}
> +
>  	return migf;
>  
>  out_clean:


* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-04 10:21 ` [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured Yishai Hadas
@ 2024-11-05 23:29   ` Alex Williamson
  2024-11-06 13:59     ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2024-11-05 23:29 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On Mon, 4 Nov 2024 12:21:31 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> Now that the driver supports live migration, only the legacy IO
> functionality depends on config VIRTIO_PCI_ADMIN_LEGACY.
> 
> Move the legacy IO into a separate file that is compiled only when
> VIRTIO_PCI_ADMIN_LEGACY is configured, and let the live migration
> depend only on VIRTIO_PCI.
> 
> As part of this, modify the default driver operations (i.e.,
> vfio_device_ops) to use the live migration set, and extend it to include
> legacy I/O operations if they are compiled and supported.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/pci/virtio/Kconfig     |   4 +-
>  drivers/vfio/pci/virtio/Makefile    |   1 +
>  drivers/vfio/pci/virtio/common.h    |  19 ++
>  drivers/vfio/pci/virtio/legacy_io.c | 420 ++++++++++++++++++++++++++++
>  drivers/vfio/pci/virtio/main.c      | 416 ++-------------------------
>  5 files changed, 469 insertions(+), 391 deletions(-)
>  create mode 100644 drivers/vfio/pci/virtio/legacy_io.c
> 
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> index bd80eca4a196..af1dd9e84a5c 100644
> --- a/drivers/vfio/pci/virtio/Kconfig
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -1,7 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  config VIRTIO_VFIO_PCI
>          tristate "VFIO support for VIRTIO NET PCI devices"
> -        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
> +        depends on VIRTIO_PCI
>          select VFIO_PCI_CORE
>          help
>            This provides support for exposing VIRTIO NET VF devices which support
> @@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
>            As of that this driver emulates I/O BAR in software to let a VF be
>            seen as a transitional device by its users and let it work with
>            a legacy driver.
> +          In addition, it provides migration support for VIRTIO NET VF devices
> +          using the VFIO framework.

The first half of this now describes something that may or may not be
enabled by this config option, and the additional help text for
migration is vague enough, relative to the PF requirements, to get user
reports that the driver doesn't work as intended.

For the former, maybe we still want a separate config item that's
optionally enabled if VIRTIO_VFIO_PCI && VIRTIO_PCI_ADMIN_LEGACY.

Thanks,
Alex


* Re: [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration
  2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
                   ` (6 preceding siblings ...)
  2024-11-04 10:21 ` [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured Yishai Hadas
@ 2024-11-06  9:32 ` Michael S. Tsirkin
  2024-11-06 22:30   ` Alex Williamson
  7 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2024-11-06  9:32 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: alex.williamson, jasowang, jgg, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On Mon, Nov 04, 2024 at 12:21:24PM +0200, Yishai Hadas wrote:
> This series enhances the vfio-virtio driver to support live migration
> for virtio-net Virtual Functions (VFs) that are migration-capable.
>  
> This series follows the Virtio 1.4 specification to implement the
> necessary device parts commands, enabling a device to participate in the
> live migration process.
> 
> The key VFIO features implemented include: VFIO_MIGRATION_STOP_COPY,
> VFIO_MIGRATION_P2P, VFIO_MIGRATION_PRE_COPY.
>  
> The implementation integrates with the VFIO subsystem via vfio_pci_core
> and incorporates Virtio-specific logic to handle the migration process.
>  
> Migration functionality follows the definitions in uapi/vfio.h and uses
> the Virtio VF-to-PF admin queue command channel for executing the device
> parts related commands.


virtio things here:

Acked-by: Michael S. Tsirkin <mst@redhat.com>

Alex, your tree I presume? I hope the virtio changes do not
cause conflicts.


> Patch Overview:
> The first four patches focus on the Virtio layer and address the
> following:
> - Define the layout of the device parts commands required as part of the
>   migration process.
> - Provide APIs to enable upper layers (e.g., VFIO, net) to execute the
>   related device parts commands.
>  
> The last three patches focus on the VFIO layer:
> - Extend the vfio-virtio driver to support live migration for Virtio-net
>   VFs.
> - Move legacy I/O operations to a separate file, which is compiled only
>   when VIRTIO_PCI_ADMIN_LEGACY is configured, ensuring that live
>   migration depends solely on VIRTIO_PCI.
>  
> Additional Notes:
> - The kernel protocol between the source and target devices includes a
>   header containing metadata such as record size, tag, and flags.
>   The record size allows the target to read a complete image from the
>   source before passing device part data. This follows the Virtio
>   specification, which mandates that partial device parts are not
>   supplied. The tag and flags serve as placeholders for future extensions
>   to the kernel protocol between the source and target, ensuring backward
>   and forward compatibility.
>  
> - Both the source and target comply with the Virtio specification by
>   using a device part object with a unique ID during the migration
>   process. As this resource is limited to a maximum of 255, its lifecycle
>   is confined to periods when live migration is active.
> 
> - According to the Virtio specification, a device has only two states:
>   RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g.,
>   RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When
>   transitioning to RUNNING_P2P, the device state is set to STOP and
>   remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at
>   which point it resumes its RUNNING state. During transition to STOP,
>   the virtio device only stops initiating outgoing requests(e.g. DMA,
>   MSIx, etc.) but still must accept incoming operations.
> 
> - Furthermore, the Virtio specification does not support reading partial
>   or incremental device contexts. This means that during the PRE_COPY
>   state, the vfio-virtio driver reads the full device state. This step is
>   beneficial because it allows the device to send some "initial data"
>   before moving to the STOP_COPY state, thus reducing downtime by
>   preparing early and warming-up. As the device state can be changed and
>   the benefit is highest when the pre copy data closely matches the final
>   data we read it in a rate limiter mode and reporting no data available
>   for some time interval after the previous call. With PRE_COPY enabled,
>   we observed a downtime reduction of approximately 70-75% in various
>   scenarios compared to when PRE_COPY was disabled, while keeping the
>   total migration time nearly the same.
> 
> - Support for dirty page tracking during migration will be provided via
>   the IOMMUFD framework.
>  
> - This series has been successfully tested on Virtio-net VF devices.
> 
> Changes from V0:
> https://lore.kernel.org/kvm/20241101102518.1bf2c6e6.alex.williamson@redhat.com/T/
> 
> Vfio:
> Patch #5:
> - Enhance the commit log to provide a clearer explanation of P2P
>   behavior over Virtio devices, as discussed on the mailing list.
> Patch #6:
> - Implement the rate limiter mechanism as part of the PRE_COPY state,
>   following Alex’s suggestion.
> - Update the commit log to include actual data demonstrating the impact of
>   PRE_COPY, as requested by Alex.
> Patch #7:
> - Update the default driver operations (i.e., vfio_device_ops) to use
>   the live migration set, and expand it to include the legacy I/O
>   operations if they are compiled and supported.
> 
> Yishai
> 
> Yishai Hadas (7):
>   virtio_pci: Introduce device parts access commands
>   virtio: Extend the admin command to include the result size
>   virtio: Manage device and driver capabilities via the admin commands
>   virtio-pci: Introduce APIs to execute device parts admin commands
>   vfio/virtio: Add support for the basic live migration functionality
>   vfio/virtio: Add PRE_COPY support for live migration
>   vfio/virtio: Enable live migration once VIRTIO_PCI was configured
> 
>  drivers/vfio/pci/virtio/Kconfig     |    4 +-
>  drivers/vfio/pci/virtio/Makefile    |    3 +-
>  drivers/vfio/pci/virtio/common.h    |  127 +++
>  drivers/vfio/pci/virtio/legacy_io.c |  420 +++++++++
>  drivers/vfio/pci/virtio/main.c      |  500 ++--------
>  drivers/vfio/pci/virtio/migrate.c   | 1336 +++++++++++++++++++++++++++
>  drivers/virtio/virtio_pci_common.h  |   19 +-
>  drivers/virtio/virtio_pci_modern.c  |  457 ++++++++-
>  include/linux/virtio.h              |    1 +
>  include/linux/virtio_pci_admin.h    |   11 +
>  include/uapi/linux/virtio_pci.h     |  131 +++
>  11 files changed, 2594 insertions(+), 415 deletions(-)
>  create mode 100644 drivers/vfio/pci/virtio/common.h
>  create mode 100644 drivers/vfio/pci/virtio/legacy_io.c
>  create mode 100644 drivers/vfio/pci/virtio/migrate.c
> 
> -- 
> 2.27.0


* Re: [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality
  2024-11-05 22:47   ` Alex Williamson
@ 2024-11-06 10:21     ` Yishai Hadas
  2024-11-06 21:33       ` Alex Williamson
  2024-11-06 15:48     ` Jason Gunthorpe
  1 sibling, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-06 10:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On 06/11/2024 0:47, Alex Williamson wrote:
> On Mon, 4 Nov 2024 12:21:29 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
>> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
>> index b5d3a8c5bbc9..e2cdf2d48200 100644
>> --- a/drivers/vfio/pci/virtio/main.c
>> +++ b/drivers/vfio/pci/virtio/main.c
> ...
>> @@ -485,16 +478,66 @@ static bool virtiovf_bar0_exists(struct pci_dev *pdev)
>>   	return res->flags;
>>   }
>>   
>> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
>> +			struct virtiovf_pci_core_device, core_device.vdev);
>> +	struct pci_dev *pdev;
>> +	bool sup_legacy_io;
>> +	bool sup_lm;
>> +	int ret;
>> +
>> +	ret = vfio_pci_core_init_dev(core_vdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	pdev = virtvdev->core_device.pdev;
>> +	sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
>> +				!virtiovf_bar0_exists(pdev);
>> +	sup_lm = virtio_pci_admin_has_dev_parts(pdev);
>> +
>> +	/*
>> +	 * If the device is not capable to this driver functionality, fallback
>> +	 * to the default vfio-pci ops
>> +	 */
>> +	if (!sup_legacy_io && !sup_lm) {
>> +		core_vdev->ops = &virtiovf_vfio_pci_ops;
>> +		return 0;
>> +	}
>> +
>> +	if (sup_legacy_io) {
>> +		ret = virtiovf_read_notify_info(virtvdev);
>> +		if (ret)
>> +			return ret;
>> +
>> +		virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
>> +					virtiovf_get_device_config_size(pdev->device);
>> +		BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
>> +		virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
>> +						     GFP_KERNEL);
>> +		if (!virtvdev->bar0_virtual_buf)
>> +			return -ENOMEM;
>> +		mutex_init(&virtvdev->bar_mutex);
>> +	}
>> +
>> +	if (sup_lm)
>> +		virtiovf_set_migratable(virtvdev);
>> +
>> +	if (sup_lm && !sup_legacy_io)
>> +		core_vdev->ops = &virtiovf_vfio_pci_lm_ops;
>> +
>> +	return 0;
>> +}
>> +
>>   static int virtiovf_pci_probe(struct pci_dev *pdev,
>>   			      const struct pci_device_id *id)
>>   {
>> -	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
>>   	struct virtiovf_pci_core_device *virtvdev;
>> +	const struct vfio_device_ops *ops;
>>   	int ret;
>>   
>> -	if (pdev->is_virtfn && virtio_pci_admin_has_legacy_io(pdev) &&
>> -	    !virtiovf_bar0_exists(pdev))
>> -		ops = &virtiovf_vfio_pci_tran_ops;
>> +	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_tran_lm_ops :
>> +				  &virtiovf_vfio_pci_ops;
> 
> I can't figure out why we moved the more thorough ops setup to the
> .init() callback of the ops themselves.  Clearly we can do the legacy
> IO and BAR0 test here and the dev parts test uses the same mechanisms
> as the legacy IO test, so it seems we could know sup_legacy_io and
> sup_lm here.  I think we can even do virtiovf_set_migratable() here
> after virtvdev is allocated below.
> 

Setting the 'ops' as part of the probe() actually seems doable,
including calling virtiovf_set_migratable() following the virtvdev
allocation below.

The main issue with that approach will be the init part of the legacy IO 
(i.e. virtiovf_init_legacy_io()) as part of virtiovf_pci_init_device().

Assuming that we don't want to repeat calling
virtiovf_support_legacy_io() as part of virtiovf_pci_init_device() to
know whether legacy IO is supported, we could consider calling
virtiovf_init_legacy_io() as part of the probe() as well, which IMO
doesn't look clean, as it actually seems to belong to the init flow.

Alternatively, we can consider checking inside
virtiovf_pci_init_device() whether the 'ops' actually equal the 'tran'
ones and only then call it.

Something like the below.

static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
{
	...

#ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
	if (core_vdev->ops == &virtiovf_vfio_pci_tran_lm_ops)
		return virtiovf_init_legacy_io(virtvdev);
#endif

	return 0;
}

Do you prefer the above approach over the current V1 code, which has a
single check as part of virtiovf_init_legacy_io()?

> I think the API to vfio core also suggests we shouldn't be modifying the
> ops pointer after the core device is allocated.

Any pointer for that?
Do we actually see a problem with replacing the 'ops' as part of the
init flow?

> 
>>   
>>   	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
>>   				     &pdev->dev, ops);
>> @@ -532,6 +575,7 @@ static void virtiovf_pci_aer_reset_done(struct pci_dev *pdev)
>>   	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
>>   
>>   	virtvdev->pci_cmd = 0;
>> +	virtiovf_migration_reset_done(pdev);
>>   }
>>   
>>   static const struct pci_error_handlers virtiovf_err_handlers = {
>> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
>> new file mode 100644
>> index 000000000000..2a9614c2ef07
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/migrate.c
> ...
>> +static int virtiovf_pci_get_data_size(struct vfio_device *vdev,
>> +				      unsigned long *stop_copy_length)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	bool obj_id_exists;
>> +	u32 res_size;
>> +	u32 obj_id;
>> +	int ret;
>> +
>> +	mutex_lock(&virtvdev->state_mutex);
>> +	obj_id_exists = virtvdev->saving_migf && virtvdev->saving_migf->has_obj_id;
>> +	if (!obj_id_exists) {
>> +		ret = virtiovf_pci_alloc_obj_id(virtvdev,
>> +						VIRTIO_RESOURCE_OBJ_DEV_PARTS_TYPE_GET,
>> +						&obj_id);
>> +		if (ret)
>> +			goto end;
>> +	} else {
>> +		obj_id = virtvdev->saving_migf->obj_id;
>> +	}
>> +
>> +	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
>> +				VIRTIO_RESOURCE_OBJ_DEV_PARTS, obj_id,
>> +				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
>> +				&res_size);
>> +	if (!ret)
>> +		*stop_copy_length = res_size;
>> +
>> +	/* We can't leave this obj_id alive if didn't exist before, otherwise, it might
>> +	 * stay alive, even without an active migration flow (e.g. migration was cancelled)
>> +	 */
> 
> Nit, multi-line comment style.

Sure, will change.

Thanks,
Yishai


* Re: [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration
  2024-11-05 23:18   ` Alex Williamson
@ 2024-11-06 11:16     ` Yishai Hadas
  2024-11-06 21:40       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-06 11:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On 06/11/2024 1:18, Alex Williamson wrote:
> On Mon, 4 Nov 2024 12:21:30 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
>> Add PRE_COPY support for live migration.
>>
>> This functionality may reduce the downtime upon STOP_COPY as of letting
>> the target machine to get some 'initial data' from the source once the
>> machine is still in its RUNNING state and let it prepares itself
>> pre-ahead to get the final STOP_COPY data.
>>
>> As the Virtio specification does not support reading partial or
>> incremental device contexts. This means that during the PRE_COPY state,
>> the vfio-virtio driver reads the full device state.
>>
>> As the device state can be changed and the benefit is highest when the
>> pre copy data closely matches the final data we read it in a rate
>> limiter mode and reporting no data available for some time interval
>> after the previous call.
>>
>> With PRE_COPY enabled, we observed a downtime reduction of approximately
>> 70-75% in various scenarios compared to when PRE_COPY was disabled,
>> while keeping the total migration time nearly the same.
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   drivers/vfio/pci/virtio/common.h  |   4 +
>>   drivers/vfio/pci/virtio/migrate.c | 233 +++++++++++++++++++++++++++++-
>>   2 files changed, 229 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/virtio/common.h b/drivers/vfio/pci/virtio/common.h
>> index 3bdfb3ea1174..5704603f0f9d 100644
>> --- a/drivers/vfio/pci/virtio/common.h
>> +++ b/drivers/vfio/pci/virtio/common.h
>> @@ -10,6 +10,8 @@
>>   
>>   enum virtiovf_migf_state {
>>   	VIRTIOVF_MIGF_STATE_ERROR = 1,
>> +	VIRTIOVF_MIGF_STATE_PRECOPY = 2,
>> +	VIRTIOVF_MIGF_STATE_COMPLETE = 3,
>>   };
>>   
>>   enum virtiovf_load_state {
>> @@ -57,6 +59,8 @@ struct virtiovf_migration_file {
>>   	/* synchronize access to the file state */
>>   	struct mutex lock;
>>   	loff_t max_pos;
>> +	u64 pre_copy_initial_bytes;
>> +	struct ratelimit_state pre_copy_rl_state;
>>   	u64 record_size;
>>   	u32 record_tag;
>>   	u8 has_obj_id:1;
>> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
>> index 2a9614c2ef07..cdb252f6fd80 100644
>> --- a/drivers/vfio/pci/virtio/migrate.c
>> +++ b/drivers/vfio/pci/virtio/migrate.c
> ...
>> @@ -379,9 +432,104 @@ static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t le
>>   	return done;
>>   }
>>   
>> +static long virtiovf_precopy_ioctl(struct file *filp, unsigned int cmd,
>> +				   unsigned long arg)
>> +{
>> +	struct virtiovf_migration_file *migf = filp->private_data;
>> +	struct virtiovf_pci_core_device *virtvdev = migf->virtvdev;
>> +	struct vfio_precopy_info info = {};
>> +	loff_t *pos = &filp->f_pos;
>> +	bool end_of_data = false;
>> +	unsigned long minsz;
>> +	u32 ctx_size;
>> +	int ret;
>> +
>> +	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
>> +		return -ENOTTY;
>> +
>> +	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
>> +	if (copy_from_user(&info, (void __user *)arg, minsz))
>> +		return -EFAULT;
>> +
>> +	if (info.argsz < minsz)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&virtvdev->state_mutex);
>> +	if (virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
>> +	    virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY_P2P) {
>> +		ret = -EINVAL;
>> +		goto err_state_unlock;
>> +	}
>> +
>> +	/*
>> +	 * The virtio specification does not include a PRE_COPY concept.
>> +	 * Since we can expect the data to remain the same for a certain period,
>> +	 * we use a rate limiter mechanism before making a call to the device.
>> +	 */
>> +	if (!__ratelimit(&migf->pre_copy_rl_state)) {
>> +		/* Reporting no data available */
>> +		ret = 0;
>> +		goto done;
> 
> @ret is not used by the done: goto target.  Don't we need to zero dirty
> bytes, or account for initial bytes not being fully read yet?

The 'dirty bytes' are actually zero, since the code above initializes
the struct at line [1] below.

[1] struct vfio_precopy_info info = {};

However, I agree, we should better account for the 'initial bytes',
which potentially might not have been fully read yet.
The same can be true for returning the actual 'dirty bytes' that we may
still have from the previous call.

So, in V2 I'll change the logic to initially set:
u32 ctx_size = 0;

Then, we will call the device to get its 'ctx_size' only if enough time
has passed according to the rate limiter.

Something like the below.

if (__ratelimit(&migf->pre_copy_rl_state)) {
	ret = virtio_pci_admin_dev_parts_metadata_get(.., &ctx_size);
	if (ret)
		goto err_state_unlock;
}

From that point, the function will proceed with its V1 flow, returning
the actual 'initial bytes' and 'dirty_bytes' while considering the extra
context size from the device to be 0.
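
In other words, roughly (an untested sketch on top of the V1 code):

	u32 ctx_size = 0;
	...
	/*
	 * The virtio specification does not include a PRE_COPY concept.
	 * Query the device for a fresh context size only when the rate
	 * limiter allows it; otherwise keep ctx_size at 0 and just report
	 * the initial/dirty bytes still pending from the previous read.
	 */
	if (__ratelimit(&migf->pre_copy_rl_state)) {
		ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
					VIRTIO_RESOURCE_OBJ_DEV_PARTS, migf->obj_id,
					VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
					&ctx_size);
		if (ret)
			goto err_state_unlock;
	}

	mutex_lock(&migf->lock);
	/*
	 * Continue as in V1: report initial_bytes/dirty_bytes based on *pos,
	 * and read a new device context only when end_of_data && ctx_size.
	 */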

> 
>> +	}
>> +
>> +	ret = virtio_pci_admin_dev_parts_metadata_get(virtvdev->core_device.pdev,
>> +				VIRTIO_RESOURCE_OBJ_DEV_PARTS, migf->obj_id,
>> +				VIRTIO_ADMIN_CMD_DEV_PARTS_METADATA_TYPE_SIZE,
>> +				&ctx_size);
>> +	if (ret)
>> +		goto err_state_unlock;
>> +
>> +	mutex_lock(&migf->lock);
>> +	if (migf->state == VIRTIOVF_MIGF_STATE_ERROR) {
>> +		ret = -ENODEV;
>> +		goto err_migf_unlock;
>> +	}
>> +
>> +	if (migf->pre_copy_initial_bytes > *pos) {
>> +		info.initial_bytes = migf->pre_copy_initial_bytes - *pos;
>> +	} else {
>> +		info.dirty_bytes = migf->max_pos - *pos;
>> +		if (!info.dirty_bytes)
>> +			end_of_data = true;
>> +		info.dirty_bytes += ctx_size;
>> +	}
>> +
>> +	if (!end_of_data || !ctx_size) {
>> +		mutex_unlock(&migf->lock);
>> +		goto done;
>> +	}
>> +
>> +	mutex_unlock(&migf->lock);
>> +	/*
>> +	 * We finished transferring the current state and the device has a
>> +	 * dirty state, read a new state.
>> +	 */
>> +	ret = virtiovf_read_device_context_chunk(migf, ctx_size);
>> +	if (ret)
>> +		/*
>> +		 * The machine is running, and context size could be grow, so no reason to mark
>> +		 * the device state as VIRTIOVF_MIGF_STATE_ERROR.
>> +		 */
>> +		goto err_state_unlock;
>> +
>> +done:
>> +	virtiovf_state_mutex_unlock(virtvdev);
>> +	if (copy_to_user((void __user *)arg, &info, minsz))
>> +		return -EFAULT;
>> +	return 0;
>> +
>> +err_migf_unlock:
>> +	mutex_unlock(&migf->lock);
>> +err_state_unlock:
>> +	virtiovf_state_mutex_unlock(virtvdev);
>> +	return ret;
>> +}
>> +
> ...
>> @@ -536,6 +719,17 @@ virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev)
>>   	if (ret)
>>   		goto out_clean;
>>   
>> +	if (pre_copy) {
>> +		migf->pre_copy_initial_bytes = migf->max_pos;
>> +		ratelimit_state_init(&migf->pre_copy_rl_state, 1 * HZ, 1);
> 
> A comment describing the choice of this heuristic would be useful for
> future maintenance, even if the comment is "Arbitrarily rate limit
> pre-copy to 1s intervals."  Thanks,

Sure, I believe that we can start with your suggested comment above.

Thanks,
Yishai

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-05 23:29   ` Alex Williamson
@ 2024-11-06 13:59     ` Jason Gunthorpe
  2024-11-06 22:27       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-11-06 13:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On Tue, Nov 05, 2024 at 04:29:04PM -0700, Alex Williamson wrote:
> > @@ -1,7 +1,7 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> >  config VIRTIO_VFIO_PCI
> >          tristate "VFIO support for VIRTIO NET PCI devices"
> > -        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
> > +        depends on VIRTIO_PCI
> >          select VFIO_PCI_CORE
> >          help
> >            This provides support for exposing VIRTIO NET VF devices which support
> > @@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
> >            As of that this driver emulates I/O BAR in software to let a VF be
> >            seen as a transitional device by its users and let it work with
> >            a legacy driver.
> > +          In addition, it provides migration support for VIRTIO NET VF devices
> > +          using the VFIO framework.
> 
> The first half of this now describes something that may or may not be
> enabled by this config option and the additional help text for
> migration is vague enough relative to PF requirements to get user
> reports that the driver doesn't work as intended.

Yes, I think the help text should be clearer.

> For the former, maybe we still want a separate config item that's
> optionally enabled if VIRTIO_VFIO_PCI && VFIO_PCI_ADMIN_LEGACY.

If we are going to add a bunch of #ifdefs etc. for ADMIN_LEGACY, we
may as well just use VIRTIO_PCI_ADMIN_LEGACY directly and not
introduce another kconfig for it?

Is there any reason to compile out the migration support for virtio?
No other drivers were doing this?

Kconfig combinations are painful; it would be nice to not make too
many.

Jason

* Re: [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality
  2024-11-05 22:47   ` Alex Williamson
  2024-11-06 10:21     ` Yishai Hadas
@ 2024-11-06 15:48     ` Jason Gunthorpe
  1 sibling, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2024-11-06 15:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On Tue, Nov 05, 2024 at 03:47:46PM -0700, Alex Williamson wrote:

> I think the API to vfio core also suggests we shouldn't be modifying the
> ops pointer after the core device is allocated.

Yeah, that is a dangerous pattern we should avoid

Jason

* Re: [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality
  2024-11-06 10:21     ` Yishai Hadas
@ 2024-11-06 21:33       ` Alex Williamson
  2024-11-07  9:39         ` Yishai Hadas
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2024-11-06 21:33 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On Wed, 6 Nov 2024 12:21:03 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 06/11/2024 0:47, Alex Williamson wrote:
> > On Mon, 4 Nov 2024 12:21:29 +0200
> > Yishai Hadas <yishaih@nvidia.com> wrote:  
> >> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> >> index b5d3a8c5bbc9..e2cdf2d48200 100644
> >> --- a/drivers/vfio/pci/virtio/main.c
> >> +++ b/drivers/vfio/pci/virtio/main.c  
> > ...  
> >> @@ -485,16 +478,66 @@ static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> >>   	return res->flags;
> >>   }
> >>   
> >> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
> >> +			struct virtiovf_pci_core_device, core_device.vdev);
> >> +	struct pci_dev *pdev;
> >> +	bool sup_legacy_io;
> >> +	bool sup_lm;
> >> +	int ret;
> >> +
> >> +	ret = vfio_pci_core_init_dev(core_vdev);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	pdev = virtvdev->core_device.pdev;
> >> +	sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
> >> +				!virtiovf_bar0_exists(pdev);
> >> +	sup_lm = virtio_pci_admin_has_dev_parts(pdev);
> >> +
> >> +	/*
> >> +	 * If the device is not capable to this driver functionality, fallback
> >> +	 * to the default vfio-pci ops
> >> +	 */
> >> +	if (!sup_legacy_io && !sup_lm) {
> >> +		core_vdev->ops = &virtiovf_vfio_pci_ops;
> >> +		return 0;
> >> +	}
> >> +
> >> +	if (sup_legacy_io) {
> >> +		ret = virtiovf_read_notify_info(virtvdev);
> >> +		if (ret)
> >> +			return ret;
> >> +
> >> +		virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
> >> +					virtiovf_get_device_config_size(pdev->device);
> >> +		BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
> >> +		virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> >> +						     GFP_KERNEL);
> >> +		if (!virtvdev->bar0_virtual_buf)
> >> +			return -ENOMEM;
> >> +		mutex_init(&virtvdev->bar_mutex);
> >> +	}
> >> +
> >> +	if (sup_lm)
> >> +		virtiovf_set_migratable(virtvdev);
> >> +
> >> +	if (sup_lm && !sup_legacy_io)
> >> +		core_vdev->ops = &virtiovf_vfio_pci_lm_ops;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>   static int virtiovf_pci_probe(struct pci_dev *pdev,
> >>   			      const struct pci_device_id *id)
> >>   {
> >> -	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
> >>   	struct virtiovf_pci_core_device *virtvdev;
> >> +	const struct vfio_device_ops *ops;
> >>   	int ret;
> >>   
> >> -	if (pdev->is_virtfn && virtio_pci_admin_has_legacy_io(pdev) &&
> >> -	    !virtiovf_bar0_exists(pdev))
> >> -		ops = &virtiovf_vfio_pci_tran_ops;
> >> +	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_tran_lm_ops :
> >> +				  &virtiovf_vfio_pci_ops;  
> > 
> > I can't figure out why we moved the more thorough ops setup to the
> > .init() callback of the ops themselves.  Clearly we can do the legacy
> > IO and BAR0 test here and the dev parts test uses the same mechanisms
> > as the legacy IO test, so it seems we could know sup_legacy_io and
> > sup_lm here.  I think we can even do virtiovf_set_migratable() here
> > after virtvdev is allocated below.
> >   
> 
> Setting the 'ops' as part of the probe() seems actually doable, 
> including calling virtiovf_set_migratable() following the virtiodev 
> allocation below.
> 
> The main issue with that approach will be the init part of the legacy IO 
> (i.e. virtiovf_init_legacy_io()) as part of virtiovf_pci_init_device().
> 
> Assuming that we don't want to repeat calling 
> virtiovf_support_legacy_io() as part of virtiovf_pci_init_device() to 
> know whether legacy IO is supported, we can consider calling 
> virtiovf_init_legacy_io() as part of the probe() as well, which IMO 
> doesn't look clean as it's actually seems to match the init flow.
> 
> Alternatively, we can consider checking inside 
> virtiovf_pci_init_device() whether the 'ops' actually equals the 'tran' 
> ones and then call it.
> 
> Something like the below.
> 
> static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> {
> 	...
> 
> #ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
> 	if (core_vdev->ops == &virtiovf_vfio_pci_tran_lm_ops)
> 		return virtiovf_init_legacy_io(virtvdev);
> #endif
> 
> 	return 0;
> }
> 
> Do you prefer the above approach rather than current V1 code which has a 
>   single check as part of virtiovf_init_legacy_io() ?

If ops is properly configured and set-migratable is done in probe,
then doesn't only the legacy ops .init callback need to init the legacy
setup?  The non-legacy, migration ops structure would just use
vfio_pci_core_init_dev.
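
Something along these lines (sketch, other callbacks omitted):

static const struct vfio_device_ops virtiovf_vfio_pci_lm_ops = {
	.name = "virtio-vfio-pci-lm",
	.init = vfio_pci_core_init_dev,		/* no legacy IO state to set up */
	...
};

static const struct vfio_device_ops virtiovf_vfio_pci_tran_lm_ops = {
	.name = "virtio-vfio-pci-trans-lm",
	.init = virtiovf_pci_init_device,	/* sets up the legacy IO emulation */
	...
};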

> 
> > I think the API to vfio core also suggests we shouldn't be modifying the
> > ops pointer after the core device is allocated.  
> 
> Any pointer for that ?
> Do we actually see a problem with replacing the 'ops' as part of the 
> init flow ?

What makes it that way to me is that it's an argument to and set by the
object constructor.  The ops callbacks should be considered live once
set.  It's probably safe to do as you've done here because the
constructor calls the init callback directly, so we don't have any
races.  However, as Jason agreed, it's generally a pattern to avoid, and
I think we can rather easily do so here.  Thanks,

Alex


* Re: [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration
  2024-11-06 11:16     ` Yishai Hadas
@ 2024-11-06 21:40       ` Alex Williamson
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Williamson @ 2024-11-06 21:40 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On Wed, 6 Nov 2024 13:16:34 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 06/11/2024 1:18, Alex Williamson wrote:
> > On Mon, 4 Nov 2024 12:21:30 +0200
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >   
> >> Add PRE_COPY support for live migration.
> >>
> >> This functionality may reduce the downtime upon STOP_COPY as of letting
> >> the target machine to get some 'initial data' from the source once the
> >> machine is still in its RUNNING state and let it prepares itself
> >> pre-ahead to get the final STOP_COPY data.
> >>
> >> As the Virtio specification does not support reading partial or
> >> incremental device contexts. This means that during the PRE_COPY state,
> >> the vfio-virtio driver reads the full device state.
> >>
> >> As the device state can be changed and the benefit is highest when the
> >> pre copy data closely matches the final data we read it in a rate
> >> limiter mode and reporting no data available for some time interval
> >> after the previous call.
> >>
> >> With PRE_COPY enabled, we observed a downtime reduction of approximately
> >> 70-75% in various scenarios compared to when PRE_COPY was disabled,
> >> while keeping the total migration time nearly the same.
> >>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> ---
> >>   drivers/vfio/pci/virtio/common.h  |   4 +
> >>   drivers/vfio/pci/virtio/migrate.c | 233 +++++++++++++++++++++++++++++-
> >>   2 files changed, 229 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/virtio/common.h b/drivers/vfio/pci/virtio/common.h
> >> index 3bdfb3ea1174..5704603f0f9d 100644
> >> --- a/drivers/vfio/pci/virtio/common.h
> >> +++ b/drivers/vfio/pci/virtio/common.h
> >> @@ -10,6 +10,8 @@
> >>   
> >>   enum virtiovf_migf_state {
> >>   	VIRTIOVF_MIGF_STATE_ERROR = 1,
> >> +	VIRTIOVF_MIGF_STATE_PRECOPY = 2,
> >> +	VIRTIOVF_MIGF_STATE_COMPLETE = 3,
> >>   };
> >>   
> >>   enum virtiovf_load_state {
> >> @@ -57,6 +59,8 @@ struct virtiovf_migration_file {
> >>   	/* synchronize access to the file state */
> >>   	struct mutex lock;
> >>   	loff_t max_pos;
> >> +	u64 pre_copy_initial_bytes;
> >> +	struct ratelimit_state pre_copy_rl_state;
> >>   	u64 record_size;
> >>   	u32 record_tag;
> >>   	u8 has_obj_id:1;
> >> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
> >> index 2a9614c2ef07..cdb252f6fd80 100644
> >> --- a/drivers/vfio/pci/virtio/migrate.c
> >> +++ b/drivers/vfio/pci/virtio/migrate.c  
> > ...  
> >> @@ -379,9 +432,104 @@ static ssize_t virtiovf_save_read(struct file *filp, char __user *buf, size_t le
> >>   	return done;
> >>   }
> >>   
> >> +static long virtiovf_precopy_ioctl(struct file *filp, unsigned int cmd,
> >> +				   unsigned long arg)
> >> +{
> >> +	struct virtiovf_migration_file *migf = filp->private_data;
> >> +	struct virtiovf_pci_core_device *virtvdev = migf->virtvdev;
> >> +	struct vfio_precopy_info info = {};
> >> +	loff_t *pos = &filp->f_pos;
> >> +	bool end_of_data = false;
> >> +	unsigned long minsz;
> >> +	u32 ctx_size;
> >> +	int ret;
> >> +
> >> +	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
> >> +		return -ENOTTY;
> >> +
> >> +	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
> >> +	if (copy_from_user(&info, (void __user *)arg, minsz))
> >> +		return -EFAULT;
> >> +
> >> +	if (info.argsz < minsz)
> >> +		return -EINVAL;
> >> +
> >> +	mutex_lock(&virtvdev->state_mutex);
> >> +	if (virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
> >> +	    virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY_P2P) {
> >> +		ret = -EINVAL;
> >> +		goto err_state_unlock;
> >> +	}
> >> +
> >> +	/*
> >> +	 * The virtio specification does not include a PRE_COPY concept.
> >> +	 * Since we can expect the data to remain the same for a certain period,
> >> +	 * we use a rate limiter mechanism before making a call to the device.
> >> +	 */
> >> +	if (!__ratelimit(&migf->pre_copy_rl_state)) {
> >> +		/* Reporting no data available */
> >> +		ret = 0;
> >> +		goto done;  
> > 
> > @ret is not used by the done: goto target.  Don't we need to zero dirty
> > bytes, or account for initial bytes not being fully read yet?  
> 
> The 'dirty bytes' are actually zero, thanks to line [1] below, which
> appears earlier in the function.
> 
> [1] struct vfio_precopy_info info = {};

Ah, missed that.
 
> However, I agree, we should better account for the 'initial bytes' which
> potentially might not have been fully read yet.
> The same can be true for returning the actual 'dirty bytes' that we may
> have from the previous call.
> 
> So, in V2 I'll change the logic to initially set:
> u32 ctx_size = 0;
> 
> Then, I'll call the device to get its 'ctx_size' only if enough time has
> passed according to the rate limiter.
> 
> Something like the below:
> 
> if (__ratelimit(&migf->pre_copy_rl_state)) {
> 	ret = virtio_pci_admin_dev_parts_metadata_get(.., &ctx_size);
> 	if (ret)
> 		goto err_state_unlock;
> }
> 
> From that point the function will proceed with its V1 flow to return
> the actual 'initial bytes' and 'dirty_bytes' while considering the extra 
> context size from the device to be 0.

That appears correct to me.  Thanks,

Alex
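
As a hedged, user-space illustration of what the above serves (not part
of the patch): a VMM polls pre-copy progress on the migration data fd
through the standard VFIO UAPI, and the driver's rate limiter simply
makes polls within the interval report no new device context.

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* data_fd: the migration data fd returned when entering PRE_COPY */
static int query_precopy(int data_fd)
{
	struct vfio_precopy_info info = {
		.argsz = sizeof(info),
	};

	if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info))
		return -1;

	printf("initial_bytes=%llu dirty_bytes=%llu\n",
	       (unsigned long long)info.initial_bytes,
	       (unsigned long long)info.dirty_bytes);
	return 0;
}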


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-06 13:59     ` Jason Gunthorpe
@ 2024-11-06 22:27       ` Alex Williamson
  2024-11-07 12:57         ` Yishai Hadas
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2024-11-06 22:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On Wed, 6 Nov 2024 09:59:09 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Nov 05, 2024 at 04:29:04PM -0700, Alex Williamson wrote:
> > > @@ -1,7 +1,7 @@
> > >  # SPDX-License-Identifier: GPL-2.0-only
> > >  config VIRTIO_VFIO_PCI
> > >          tristate "VFIO support for VIRTIO NET PCI devices"
> > > -        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
> > > +        depends on VIRTIO_PCI
> > >          select VFIO_PCI_CORE
> > >          help
> > >            This provides support for exposing VIRTIO NET VF devices which support
> > > @@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
> > >            As of that this driver emulates I/O BAR in software to let a VF be
> > >            seen as a transitional device by its users and let it work with
> > >            a legacy driver.
> > > +          In addition, it provides migration support for VIRTIO NET VF devices
> > > +          using the VFIO framework.  
> > 
> > The first half of this now describes something that may or may not be
> > enabled by this config option and the additional help text for
> > migration is vague enough relative to PF requirements to get user
> > reports that the driver doesn't work as intended.  
> 
> Yes, I think the help should be clearer
> 
> > For the former, maybe we still want a separate config item that's
> > optionally enabled if VIRTIO_VFIO_PCI && VFIO_PCI_ADMIN_LEGACY.  
> 
> If we are going to add a bunch of  #ifdefs/etc for ADMIN_LEGACY we
> may as well just use VIRTIO_PCI_ADMIN_LEGACY directly and not
> introduce another kconfig for it?

I think that's what Yishai is proposing, but as we're adding a whole
new feature to the driver I'm concerned how the person configuring the
kernel knows which features from the description might be available in
the resulting driver.

We could maybe solve that with a completely re-written help text that
describes the legacy feature as X86-only and migration as a separate
architecture independent feature, but people aren't great at reading
and part of the audience is going to see "X86" in their peripheral
vision and disable it, and maybe even complain that the text was
presented to them.

OR, we can just add an optional sub-config bool that makes it easier to
describe the (new) main feature of the driver as supporting live
migration (on supported hardware) and the sub-config option as
providing legacy support (on supported hardware), and that sub-config
is only presented on X86, ie. ADMIN_LEGACY.

Ultimately the code already needs to support #ifdefs for the latter and
I think it's more user friendly and versatile to have the separate
config option.

NB. The sub-config should be default on for upgrade compatibility.

> Is there any reason to compile out the migration support for virtio?
> No other drivers were doing this?

No other vfio-pci variant driver provides multiple, independent
features, so for instance to compile out migration support from the
vfio-pci-mlx5 driver is to simply disable the driver altogether.

> kconfig combinations are painful, it would be nice to not make too
> many..

I'm not arguing for a legacy-only, non-migration version (please speak
up if someone wants that).  The code already needs to support the
#ifdefs and I think reflecting that in a sub-config option helps
clarify what the driver is providing and conveniently makes it possible
to have a driver with exactly the same feature set across archs, if
desired.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration
  2024-11-06  9:32 ` [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Michael S. Tsirkin
@ 2024-11-06 22:30   ` Alex Williamson
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Williamson @ 2024-11-06 22:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yishai Hadas, jasowang, jgg, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On Wed, 6 Nov 2024 04:32:31 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Nov 04, 2024 at 12:21:24PM +0200, Yishai Hadas wrote:
> > This series enhances the vfio-virtio driver to support live migration
> > for virtio-net Virtual Functions (VFs) that are migration-capable.
> >  
> > This series follows the Virtio 1.4 specification to implement the
> > necessary device parts commands, enabling a device to participate in the
> > live migration process.
> > 
> > The key VFIO features implemented include: VFIO_MIGRATION_STOP_COPY,
> > VFIO_MIGRATION_P2P, VFIO_MIGRATION_PRE_COPY.
> >  
> > The implementation integrates with the VFIO subsystem via vfio_pci_core
> > and incorporates Virtio-specific logic to handle the migration process.
> >  
> > Migration functionality follows the definitions in uapi/vfio.h and uses
> > the Virtio VF-to-PF admin queue command channel for executing the device
> > parts related commands.  
> 
> 
> virtio things here:
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Alex, your tree I presume? I hope the virtio changes do not
> cause conflicts.

Sure, I can ultimately take it through my tree once we have consensus.

Yishai, please add Michael's ack to 1-4 on the next round.  Thanks,

Alex

> > Patch Overview:
> > The first four patches focus on the Virtio layer and address the
> > following:
> > - Define the layout of the device parts commands required as part of the
> >   migration process.
> > - Provide APIs to enable upper layers (e.g., VFIO, net) to execute the
> >   related device parts commands.
> >  
> > The last three patches focus on the VFIO layer:
> > - Extend the vfio-virtio driver to support live migration for Virtio-net
> >   VFs.
> > - Move legacy I/O operations to a separate file, which is compiled only
> >   when VIRTIO_PCI_ADMIN_LEGACY is configured, ensuring that live
> >   migration depends solely on VIRTIO_PCI.
> >  
> > Additional Notes:
> > - The kernel protocol between the source and target devices includes a
> >   header containing metadata such as record size, tag, and flags.
> >   The record size allows the target to read a complete image from the
> >   source before passing device part data. This follows the Virtio
> >   specification, which mandates that partial device parts are not
> >   supplied. The tag and flags serve as placeholders for future extensions
> >   to the kernel protocol between the source and target, ensuring backward
> >   and forward compatibility.
> >  
> > - Both the source and target comply with the Virtio specification by
> >   using a device part object with a unique ID during the migration
> >   process. As this resource is limited to a maximum of 255, its lifecycle
> >   is confined to periods when live migration is active.
> > 
> > - According to the Virtio specification, a device has only two states:
> >   RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g.,
> >   RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When
> >   transitioning to RUNNING_P2P, the device state is set to STOP and
> >   remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at
> >   which point it resumes its RUNNING state. During transition to STOP,
> >   the virtio device only stops initiating outgoing requests(e.g. DMA,
> >   MSIx, etc.) but still must accept incoming operations.
> > 
> > - Furthermore, the Virtio specification does not support reading partial
> >   or incremental device contexts. This means that during the PRE_COPY
> >   state, the vfio-virtio driver reads the full device state. This step is
> >   beneficial because it allows the device to send some "initial data"
> >   before moving to the STOP_COPY state, thus reducing downtime by
> >   preparing early and warming-up. As the device state can be changed and
> >   the benefit is highest when the pre copy data closely matches the final
> >   data we read it in a rate limiter mode and reporting no data available
> >   for some time interval after the previous call. With PRE_COPY enabled,
> >   we observed a downtime reduction of approximately 70-75% in various
> >   scenarios compared to when PRE_COPY was disabled, while keeping the
> >   total migration time nearly the same.
> > 
> > - Support for dirty page tracking during migration will be provided via
> >   the IOMMUFD framework.
> >  
> > - This series has been successfully tested on Virtio-net VF devices.
> > 
> > Changes from V0:
> > https://lore.kernel.org/kvm/20241101102518.1bf2c6e6.alex.williamson@redhat.com/T/
> > 
> > Vfio:
> > Patch #5:
> > - Enhance the commit log to provide a clearer explanation of P2P
> >   behavior over Virtio devices, as discussed on the mailing list.
> > Patch #6:
> > - Implement the rate limiter mechanism as part of the PRE_COPY state,
> >   following Alex’s suggestion.
> > - Update the commit log to include actual data demonstrating the impact of
> >   PRE_COPY, as requested by Alex.
> > Patch #7:
> > - Update the default driver operations (i.e., vfio_device_ops) to use
> >   the live migration set, and expand it to include the legacy I/O
> >   operations if they are compiled and supported.
> > 
> > Yishai
> > 
> > Yishai Hadas (7):
> >   virtio_pci: Introduce device parts access commands
> >   virtio: Extend the admin command to include the result size
> >   virtio: Manage device and driver capabilities via the admin commands
> >   virtio-pci: Introduce APIs to execute device parts admin commands
> >   vfio/virtio: Add support for the basic live migration functionality
> >   vfio/virtio: Add PRE_COPY support for live migration
> >   vfio/virtio: Enable live migration once VIRTIO_PCI was configured
> > 
> >  drivers/vfio/pci/virtio/Kconfig     |    4 +-
> >  drivers/vfio/pci/virtio/Makefile    |    3 +-
> >  drivers/vfio/pci/virtio/common.h    |  127 +++
> >  drivers/vfio/pci/virtio/legacy_io.c |  420 +++++++++
> >  drivers/vfio/pci/virtio/main.c      |  500 ++--------
> >  drivers/vfio/pci/virtio/migrate.c   | 1336 +++++++++++++++++++++++++++
> >  drivers/virtio/virtio_pci_common.h  |   19 +-
> >  drivers/virtio/virtio_pci_modern.c  |  457 ++++++++-
> >  include/linux/virtio.h              |    1 +
> >  include/linux/virtio_pci_admin.h    |   11 +
> >  include/uapi/linux/virtio_pci.h     |  131 +++
> >  11 files changed, 2594 insertions(+), 415 deletions(-)
> >  create mode 100644 drivers/vfio/pci/virtio/common.h
> >  create mode 100644 drivers/vfio/pci/virtio/legacy_io.c
> >  create mode 100644 drivers/vfio/pci/virtio/migrate.c
> > 
> > -- 
> > 2.27.0  
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality
  2024-11-06 21:33       ` Alex Williamson
@ 2024-11-07  9:39         ` Yishai Hadas
  0 siblings, 0 replies; 28+ messages in thread
From: Yishai Hadas @ 2024-11-07  9:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On 06/11/2024 23:33, Alex Williamson wrote:
> On Wed, 6 Nov 2024 12:21:03 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
>> On 06/11/2024 0:47, Alex Williamson wrote:
>>> On Mon, 4 Nov 2024 12:21:29 +0200
>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
>>>> index b5d3a8c5bbc9..e2cdf2d48200 100644
>>>> --- a/drivers/vfio/pci/virtio/main.c
>>>> +++ b/drivers/vfio/pci/virtio/main.c
>>> ...
>>>> @@ -485,16 +478,66 @@ static bool virtiovf_bar0_exists(struct pci_dev *pdev)
>>>>    	return res->flags;
>>>>    }
>>>>    
>>>> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(core_vdev,
>>>> +			struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	struct pci_dev *pdev;
>>>> +	bool sup_legacy_io;
>>>> +	bool sup_lm;
>>>> +	int ret;
>>>> +
>>>> +	ret = vfio_pci_core_init_dev(core_vdev);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	pdev = virtvdev->core_device.pdev;
>>>> +	sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
>>>> +				!virtiovf_bar0_exists(pdev);
>>>> +	sup_lm = virtio_pci_admin_has_dev_parts(pdev);
>>>> +
>>>> +	/*
>>>> +	 * If the device is not capable to this driver functionality, fallback
>>>> +	 * to the default vfio-pci ops
>>>> +	 */
>>>> +	if (!sup_legacy_io && !sup_lm) {
>>>> +		core_vdev->ops = &virtiovf_vfio_pci_ops;
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	if (sup_legacy_io) {
>>>> +		ret = virtiovf_read_notify_info(virtvdev);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +
>>>> +		virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
>>>> +					virtiovf_get_device_config_size(pdev->device);
>>>> +		BUILD_BUG_ON(!is_power_of_2(virtvdev->bar0_virtual_buf_size));
>>>> +		virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
>>>> +						     GFP_KERNEL);
>>>> +		if (!virtvdev->bar0_virtual_buf)
>>>> +			return -ENOMEM;
>>>> +		mutex_init(&virtvdev->bar_mutex);
>>>> +	}
>>>> +
>>>> +	if (sup_lm)
>>>> +		virtiovf_set_migratable(virtvdev);
>>>> +
>>>> +	if (sup_lm && !sup_legacy_io)
>>>> +		core_vdev->ops = &virtiovf_vfio_pci_lm_ops;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>    static int virtiovf_pci_probe(struct pci_dev *pdev,
>>>>    			      const struct pci_device_id *id)
>>>>    {
>>>> -	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
>>>>    	struct virtiovf_pci_core_device *virtvdev;
>>>> +	const struct vfio_device_ops *ops;
>>>>    	int ret;
>>>>    
>>>> -	if (pdev->is_virtfn && virtio_pci_admin_has_legacy_io(pdev) &&
>>>> -	    !virtiovf_bar0_exists(pdev))
>>>> -		ops = &virtiovf_vfio_pci_tran_ops;
>>>> +	ops = (pdev->is_virtfn) ? &virtiovf_vfio_pci_tran_lm_ops :
>>>> +				  &virtiovf_vfio_pci_ops;
>>>
>>> I can't figure out why we moved the more thorough ops setup to the
>>> .init() callback of the ops themselves.  Clearly we can do the legacy
>>> IO and BAR0 test here and the dev parts test uses the same mechanisms
>>> as the legacy IO test, so it seems we could know sup_legacy_io and
>>> sup_lm here.  I think we can even do virtiovf_set_migratable() here
>>> after virtvdev is allocated below.
>>>    
>>
>> Setting the 'ops' as part of the probe() seems actually doable,
>> including calling virtiovf_set_migratable() following the virtiodev
>> allocation below.
>>
>> The main issue with that approach will be the init part of the legacy IO
>> (i.e. virtiovf_init_legacy_io()) as part of virtiovf_pci_init_device().
>>
>> Assuming that we don't want to repeat calling
>> virtiovf_support_legacy_io() as part of virtiovf_pci_init_device() to
>> know whether legacy IO is supported, we can consider calling
>> virtiovf_init_legacy_io() as part of the probe() as well, which IMO
>> doesn't look clean as it's actually seems to match the init flow.
>>
>> Alternatively, we can consider checking inside
>> virtiovf_pci_init_device() whether the 'ops' actually equals the 'tran'
>> ones and then call it.
>>
>> Something like the below.
>>
>> static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
>> {
>> 	...
>>
>> #ifdef CONFIG_VIRTIO_PCI_ADMIN_LEGACY
>> 	if (core_vdev->ops == &virtiovf_vfio_pci_tran_lm_ops)
>> 		return virtiovf_init_legacy_io(virtvdev);
>> #endif
>>
>> 	return 0;
>> }
>>
>> Do you prefer the above approach rather than current V1 code which has a
>>    single check as part of virtiovf_init_legacy_io() ?
> 
> If ops is properly configured and set-migratable is done in probe,
> then doesn't only the legacy ops .init callback need to init the legacy
> setup?  The non-legacy, migration ops structure would just use
> vfio_pci_core_init_dev.

Correct, this seems as a clean solution, will use it as part of V2.

> 
>>
>>> I think the API to vfio core also suggests we shouldn't be modifying the
>>> ops pointer after the core device is allocated.
>>
>> Any pointer for that ?
>> Do we actually see a problem with replacing the 'ops' as part of the
>> init flow ?
> 
> What makes it that way to me is that it's an argument to and set by the
> object constructor.  The ops callbacks should be considered live once
> set.  It's probably safe to do as you've done here because the
> constructor calls the init callback directly, so we don't have any
> races.  However as Jason agreed, it's generally a pattern to avoid and I
> think we can rather easily do so here.  Thanks,
> 

Yes, makes sense.

Thanks,
Yishai
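
A rough sketch of the agreed V2 direction, moving the capability checks
and ops selection into probe and reusing the helper names from V1
(virtio_pci_admin_has_legacy_io, virtio_pci_admin_has_dev_parts,
virtiovf_bar0_exists, virtiovf_set_migratable); an outline only, not the
actual patch:

static int virtiovf_pci_probe(struct pci_dev *pdev,
			      const struct pci_device_id *id)
{
	const struct vfio_device_ops *ops = &virtiovf_vfio_pci_ops;
	struct virtiovf_pci_core_device *virtvdev;
	bool sup_legacy_io = false;
	bool sup_lm = false;

	if (pdev->is_virtfn) {
		sup_legacy_io = virtio_pci_admin_has_legacy_io(pdev) &&
				!virtiovf_bar0_exists(pdev);
		sup_lm = virtio_pci_admin_has_dev_parts(pdev);
		if (sup_legacy_io)
			ops = &virtiovf_vfio_pci_tran_lm_ops;
		else if (sup_lm)
			ops = &virtiovf_vfio_pci_lm_ops;
	}

	virtvdev = vfio_alloc_device(virtiovf_pci_core_device,
				     core_device.vdev, &pdev->dev, ops);
	if (IS_ERR(virtvdev))
		return PTR_ERR(virtvdev);

	if (sup_lm)
		virtiovf_set_migratable(virtvdev);

	/* remaining registration flow unchanged from V1 */
}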

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-06 22:27       ` Alex Williamson
@ 2024-11-07 12:57         ` Yishai Hadas
  2024-11-07 21:25           ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-07 12:57 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: mst, jasowang, kvm, virtualization, parav, feliu, kevin.tian,
	joao.m.martins, leonro, maorg

On 07/11/2024 0:27, Alex Williamson wrote:
> On Wed, 6 Nov 2024 09:59:09 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Nov 05, 2024 at 04:29:04PM -0700, Alex Williamson wrote:
>>>> @@ -1,7 +1,7 @@
>>>>   # SPDX-License-Identifier: GPL-2.0-only
>>>>   config VIRTIO_VFIO_PCI
>>>>           tristate "VFIO support for VIRTIO NET PCI devices"
>>>> -        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
>>>> +        depends on VIRTIO_PCI
>>>>           select VFIO_PCI_CORE
>>>>           help
>>>>             This provides support for exposing VIRTIO NET VF devices which support
>>>> @@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
>>>>             As of that this driver emulates I/O BAR in software to let a VF be
>>>>             seen as a transitional device by its users and let it work with
>>>>             a legacy driver.
>>>> +          In addition, it provides migration support for VIRTIO NET VF devices
>>>> +          using the VFIO framework.
>>>
>>> The first half of this now describes something that may or may not be
>>> enabled by this config option and the additional help text for
>>> migration is vague enough relative to PF requirements to get user
>>> reports that the driver doesn't work as intended.
>>
>> Yes, I think the help should be clearer
>>
>>> For the former, maybe we still want a separate config item that's
>>> optionally enabled if VIRTIO_VFIO_PCI && VFIO_PCI_ADMIN_LEGACY.
>>
>> If we are going to add a bunch of  #ifdefs/etc for ADMIN_LEGACY we
>> may as well just use VIRTIO_PCI_ADMIN_LEGACY directly and not
>> introduce another kconfig for it?
> 
> I think that's what Yishai is proposing, but as we're adding a whole
> new feature to the driver I'm concerned how the person configuring the
> kernel knows which features from the description might be available in
> the resulting driver.
> 
> We could maybe solve that with a completely re-written help text that
> describes the legacy feature as X86-only and migration as a separate
> architecture independent feature, but people aren't great at reading
> and part of the audience is going to see "X86" in their peripheral
> vision and disable it, and maybe even complain that the text was
> presented to them.
> 
> OR, we can just add an optional sub-config bool that makes it easier to
> describe the (new) main feature of the driver as supporting live
> migration (on supported hardware) and the sub-config option as
> providing legacy support (on supported hardware), and that sub-config
> is only presented on X86, ie. ADMIN_LEGACY.
> 
> Ultimately the code already needs to support #ifdefs for the latter and
> I think it's more user friendly and versatile to have the separate
> config option.
> 
> NB. The sub-config should be default on for upgrade compatibility.
> 
>> Is there any reason to compile out the migration support for virtio?
>> No other drivers were doing this?
> 
> No other vfio-pci variant driver provides multiple, independent
> features, so for instance to compile out migration support from the
> vfio-pci-mlx5 driver is to simply disable the driver altogether.
> 
>> kconfig combinations are painful, it woudl be nice to not make too
>> many..
> 
> I'm not arguing for a legacy-only, non-migration version (please speak
> up if someone wants that).  The code already needs to support the
> #ifdefs and I think reflecting that in a sub-config option helps
> clarify what the driver is providing and conveniently makes it possible
> to have a driver with exactly the same feature set across archs, if
> desired.  Thanks,
> 

Since the live migration functionality is not architecture-dependent 
(unlike legacy access, which requires X86) and is likely to be the 
primary use of the driver, I would suggest keeping it outside of any 
#ifdef directives, as initially introduced in V1.

To address the description issue and provide control for customers who 
may need the legacy access functionality, we could introduce a bool 
configuration option as a submenu under the driver’s main live migration 
feature.

This approach will keep things simple and align with the typical use 
case of the driver.

Something like the below [1] can do the job for that.

Alex,
Can that work for you ?

By the way, you have suggested calling the config entry
VFIO_PCI_ADMIN_LEGACY; shouldn't we also add VIRTIO as a prefix here
(i.e. VIRTIO_VFIO_PCI_ADMIN_LEGACY)?

[1]
# SPDX-License-Identifier: GPL-2.0-only

config VIRTIO_VFIO_PCI
         tristate "VFIO support for live migration over VIRTIO NET PCI
                   devices"
         depends on VIRTIO_PCI
         select VFIO_PCI_CORE
         select IOMMUFD_DRIVER
         help
           This provides migration support for VIRTIO NET PCI VF devices
           using the VFIO framework.

           If you don't know what to do here, say N.

config VFIO_PCI_ADMIN_LEGACY
         bool "VFIO support for legacy I/O access for VIRTIO NET PCI
               devices"
         depends on VIRTIO_VFIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
         default y
         help
           This provides support for exposing VIRTIO NET VF devices which
           support legacy IO access, using the VFIO framework that can
           work with a legacy virtio driver in the guest.
           Based on PCIe spec, VFs do not support I/O Space.
           As of that this driver emulates I/O BAR in software to let a
           VF be seen as a transitional device by its users and let it
           work with a legacy driver.

           If you don't know what to do here, say N.

Yishai

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-07 12:57         ` Yishai Hadas
@ 2024-11-07 21:25           ` Alex Williamson
  2024-11-11  8:22             ` Yishai Hadas
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2024-11-07 21:25 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Jason Gunthorpe, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On Thu, 7 Nov 2024 14:57:39 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 07/11/2024 0:27, Alex Williamson wrote:
> > On Wed, 6 Nov 2024 09:59:09 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, Nov 05, 2024 at 04:29:04PM -0700, Alex Williamson wrote:  
> >>>> @@ -1,7 +1,7 @@
> >>>>   # SPDX-License-Identifier: GPL-2.0-only
> >>>>   config VIRTIO_VFIO_PCI
> >>>>           tristate "VFIO support for VIRTIO NET PCI devices"
> >>>> -        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
> >>>> +        depends on VIRTIO_PCI
> >>>>           select VFIO_PCI_CORE
> >>>>           help
> >>>>             This provides support for exposing VIRTIO NET VF devices which support
> >>>> @@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
> >>>>             As of that this driver emulates I/O BAR in software to let a VF be
> >>>>             seen as a transitional device by its users and let it work with
> >>>>             a legacy driver.
> >>>> +          In addition, it provides migration support for VIRTIO NET VF devices
> >>>> +          using the VFIO framework.  
> >>>
> >>> The first half of this now describes something that may or may not be
> >>> enabled by this config option and the additional help text for
> >>> migration is vague enough relative to PF requirements to get user
> >>> reports that the driver doesn't work as intended.  
> >>
> >> Yes, I think the help should be clearer
> >>  
> >>> For the former, maybe we still want a separate config item that's
> >>> optionally enabled if VIRTIO_VFIO_PCI && VFIO_PCI_ADMIN_LEGACY.  
> >>
> >> If we are going to add a bunch of  #ifdefs/etc for ADMIN_LEGACY we
> >> may as well just use VIRTIO_PCI_ADMIN_LEGACY directly and not
> >> introduce another kconfig for it?  
> > 
> > I think that's what Yishai is proposing, but as we're adding a whole
> > new feature to the driver I'm concerned how the person configuring the
> > kernel knows which features from the description might be available in
> > the resulting driver.
> > 
> > We could maybe solve that with a completely re-written help text that
> > describes the legacy feature as X86-only and migration as a separate
> > architecture independent feature, but people aren't great at reading
> > and part of the audience is going to see "X86" in their peripheral
> > vision and disable it, and maybe even complain that the text was
> > presented to them.
> > 
> > OR, we can just add an optional sub-config bool that makes it easier to
> > describe the (new) main feature of the driver as supporting live
> > migration (on supported hardware) and the sub-config option as
> > providing legacy support (on supported hardware), and that sub-config
> > is only presented on X86, ie. ADMIN_LEGACY.
> > 
> > Ultimately the code already needs to support #ifdefs for the latter and
> > I think it's more user friendly and versatile to have the separate
> > config option.
> > 
> > NB. The sub-config should be default on for upgrade compatibility.
> >   
> >> Is there any reason to compile out the migration support for virtio?
> >> No other drivers were doing this?  
> > 
> > No other vfio-pci variant driver provides multiple, independent
> > features, so for instance to compile out migration support from the
> > vfio-pci-mlx5 driver is to simply disable the driver altogether.
> >   
> >> kconfig combinations are painful, it woudl be nice to not make too
> >> many..  
> > 
> > I'm not arguing for a legacy-only, non-migration version (please speak
> > up if someone wants that).  The code already needs to support the
> > #ifdefs and I think reflecting that in a sub-config option helps
> > clarify what the driver is providing and conveniently makes it possible
> > to have a driver with exactly the same feature set across archs, if
> > desired.  Thanks,
> >   
> 
> Since the live migration functionality is not architecture-dependent 
> (unlike legacy access, which requires X86) and is likely to be the 
> primary use of the driver, I would suggest keeping it outside of any 
> #ifdef directives, as initially introduced in V1.
> 
> To address the description issue and provide control for customers who 
> may need the legacy access functionality, we could introduce a bool 
> configuration option as a submenu under the driver’s main live migration 
> feature.
> 
> This approach will keep things simple and align with the typical use 
> case of the driver.
> 
> Something like the below [1] can do the job for that.
> 
> Alex,
> Can that work for you ?
> 
> By the way, you have suggested calling the config entry 
> VFIO_PCI_ADMIN_LEGACY, don't we need to add here also the VIRTIO as a 
> prefix ? (i.e. VIRTIO_VFIO_PCI_ADMIN_LEGACY)

I think that was just a typo referring to VIRTIO_PCI_ADMIN_LEGACY.
Yes, appending _ADMIN_LEGACY to the main config option is fine.

> [1]
> # SPDX-License-Identifier: GPL-2.0-only
> 
> config VIRTIO_VFIO_PCI
>          tristate "VFIO support for live migration over VIRTIO NET PCI
>                    devices"

Looking at other variant drivers, I think this should just be:

	"VFIO support for VIRTIO NET PCI VF devices"

>          depends on VIRTIO_PCI
>          select VFIO_PCI_CORE
>          select IOMMUFD_DRIVER

IIUC, this is not a dependency, the device will just lack dirty page
tracking with either the type1 backend or when using iommufd when the
IOMMU hardware doesn't have dirty page tracking, therefore all VM
memory is perpetually dirty.  Do I have that right?

>          help
>            This provides migration support for VIRTIO NET PCI VF devices
>            using the VFIO framework.

This is still too open-ended for me; there is specific PF support
required in the device to make this work.  Maybe...

	This provides migration support for VIRTIO NET PCI VF devices
	using the VFIO framework.  Migration support requires the
	SR-IOV PF device to support specific VIRTIO extensions,
	otherwise this driver provides no additional functionality
	beyond vfio-pci.

	Migration support in this driver relies on dirty page tracking
	provided by the IOMMU hardware and exposed through IOMMUFD; any
	other use cases are dis-recommended.

>            If you don't know what to do here, say N.
> 
> config VFIO_PCI_ADMIN_LEGACY

VIRTIO_VFIO_PCI_ADMIN_LEGACY

>          bool "VFIO support for legacy I/O access for VIRTIO NET PCI
>                devices"

Maybe:

	"Legacy I/O support for VIRTIO NET PCI VF devices"

>          depends on VIRTIO_VFIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
>          default y
>          help
>            This provides support for exposing VIRTIO NET VF devices which
>            support legacy IO access, using the VFIO framework that can
>            work with a legacy virtio driver in the guest.
>            Based on PCIe spec, VFs do not support I/O Space.
>            As of that this driver emulates I/O BAR in software to let a
>            VF be seen as a transitional device by its users and let it
>            work with a legacy driver.

Maybe:

	This extends the virtio-vfio-pci driver to support legacy I/O
	access, allowing use of legacy virtio drivers with VIRTIO NET
	PCI VF devices.  Legacy I/O support requires the SR-IOV PF
	device to support and enable specific VIRTIO extensions,
	otherwise this driver provides no additional functionality
	beyond vfio-pci.

IMO, noting the PF requirement in each is important (feel free to
elaborate on specific VIRTIO extension requirements).  It doesn't seem
necessary to explain how the legacy compatibility works, only that this
driver makes the VF compatible with the legacy driver.

Are both of these options configurable at the PF in either firmware or
software?  I used "support and enable" in the legacy section assuming
that there is such a knob, but for migration it seems less necessary
that there's an enable step.  Please correct based on the actual
device behavior.  Thanks,

Alex
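
For illustration only, assuming the VIRTIO_VFIO_PCI_ADMIN_LEGACY name
discussed above: inside the driver, the legacy I/O pieces would then be
guarded by the new sub-config rather than by VIRTIO_PCI_ADMIN_LEGACY
directly, e.g. with a stub in a shared header when legacy support is
compiled out:

#ifdef CONFIG_VIRTIO_VFIO_PCI_ADMIN_LEGACY
int virtiovf_init_legacy_io(struct virtiovf_pci_core_device *virtvdev);
#else
static inline int
virtiovf_init_legacy_io(struct virtiovf_pci_core_device *virtvdev)
{
	return 0;	/* legacy I/O support compiled out */
}
#endif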


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-07 21:25           ` Alex Williamson
@ 2024-11-11  8:22             ` Yishai Hadas
  2024-11-11 10:32               ` Joao Martins
  0 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-11  8:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, joao.m.martins, leonro, maorg

On 07/11/2024 23:25, Alex Williamson wrote:
> On Thu, 7 Nov 2024 14:57:39 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
>> On 07/11/2024 0:27, Alex Williamson wrote:
>>> On Wed, 6 Nov 2024 09:59:09 -0400
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>    
>>>> On Tue, Nov 05, 2024 at 04:29:04PM -0700, Alex Williamson wrote:
>>>>>> @@ -1,7 +1,7 @@
>>>>>>    # SPDX-License-Identifier: GPL-2.0-only
>>>>>>    config VIRTIO_VFIO_PCI
>>>>>>            tristate "VFIO support for VIRTIO NET PCI devices"
>>>>>> -        depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
>>>>>> +        depends on VIRTIO_PCI
>>>>>>            select VFIO_PCI_CORE
>>>>>>            help
>>>>>>              This provides support for exposing VIRTIO NET VF devices which support
>>>>>> @@ -11,5 +11,7 @@ config VIRTIO_VFIO_PCI
>>>>>>              As of that this driver emulates I/O BAR in software to let a VF be
>>>>>>              seen as a transitional device by its users and let it work with
>>>>>>              a legacy driver.
>>>>>> +          In addition, it provides migration support for VIRTIO NET VF devices
>>>>>> +          using the VFIO framework.
>>>>>
>>>>> The first half of this now describes something that may or may not be
>>>>> enabled by this config option and the additional help text for
>>>>> migration is vague enough relative to PF requirements to get user
>>>>> reports that the driver doesn't work as intended.
>>>>
>>>> Yes, I think the help should be clearer
>>>>   
>>>>> For the former, maybe we still want a separate config item that's
>>>>> optionally enabled if VIRTIO_VFIO_PCI && VFIO_PCI_ADMIN_LEGACY.
>>>>
>>>> If we are going to add a bunch of  #ifdefs/etc for ADMIN_LEGACY we
>>>> may as well just use VIRTIO_PCI_ADMIN_LEGACY directly and not
>>>> introduce another kconfig for it?
>>>
>>> I think that's what Yishai is proposing, but as we're adding a whole
>>> new feature to the driver I'm concerned how the person configuring the
>>> kernel knows which features from the description might be available in
>>> the resulting driver.
>>>
>>> We could maybe solve that with a completely re-written help text that
>>> describes the legacy feature as X86-only and migration as a separate
>>> architecture independent feature, but people aren't great at reading
>>> and part of the audience is going to see "X86" in their peripheral
>>> vision and disable it, and maybe even complain that the text was
>>> presented to them.
>>>
>>> OR, we can just add an optional sub-config bool that makes it easier to
>>> describe the (new) main feature of the driver as supporting live
>>> migration (on supported hardware) and the sub-config option as
>>> providing legacy support (on supported hardware), and that sub-config
>>> is only presented on X86, ie. ADMIN_LEGACY.
>>>
>>> Ultimately the code already needs to support #ifdefs for the latter and
>>> I think it's more user friendly and versatile to have the separate
>>> config option.
>>>
>>> NB. The sub-config should be default on for upgrade compatibility.
>>>    
>>>> Is there any reason to compile out the migration support for virtio?
>>>> No other drivers were doing this?
>>>
>>> No other vfio-pci variant driver provides multiple, independent
>>> features, so for instance to compile out migration support from the
>>> vfio-pci-mlx5 driver is to simply disable the driver altogether.
>>>    
>>>> kconfig combinations are painful, it woudl be nice to not make too
>>>> many..
>>>
>>> I'm not arguing for a legacy-only, non-migration version (please speak
>>> up if someone wants that).  The code already needs to support the
>>> #ifdefs and I think reflecting that in a sub-config option helps
>>> clarify what the driver is providing and conveniently makes it possible
>>> to have a driver with exactly the same feature set across archs, if
>>> desired.  Thanks,
>>>    
>>
>> Since the live migration functionality is not architecture-dependent
>> (unlike legacy access, which requires X86) and is likely to be the
>> primary use of the driver, I would suggest keeping it outside of any
>> #ifdef directives, as initially introduced in V1.
>>
>> To address the description issue and provide control for customers who
>> may need the legacy access functionality, we could introduce a bool
>> configuration option as a submenu under the driver’s main live migration
>> feature.
>>
>> This approach will keep things simple and align with the typical use
>> case of the driver.
>>
>> Something like the below [1] can do the job for that.
>>
>> Alex,
>> Can that work for you ?
>>
>> By the way, you have suggested calling the config entry
>> VFIO_PCI_ADMIN_LEGACY, don't we need to add here also the VIRTIO as a
>> prefix ? (i.e. VIRTIO_VFIO_PCI_ADMIN_LEGACY)
> 
> I think that was just a typo referring to VIRTIO_PCI_ADMIN_LEGACY.
> Yes, appending _ADMIN_LEGACY to the main config option is fine.
> 
>> [1]
>> # SPDX-License-Identifier: GPL-2.0-only
>>
>> config VIRTIO_VFIO_PCI
>>           tristate "VFIO support for live migration over VIRTIO NET PCI
>>                     devices"
> 
> Looking at other variant drivers, I think this should just be:
> 
> 	"VFIO support for VIRTIO NET PCI VF devices"
> 
>>           depends on VIRTIO_PCI
>>           select VFIO_PCI_CORE
>>           select IOMMUFD_DRIVER
> 
> IIUC, this is not a dependency, the device will just lack dirty page
> tracking with either the type1 backend or when using iommufd when the
> IOMMU hardware doesn't have dirty page tracking, therefore all VM
> memory is perpetually dirty.  Do I have that right?

IOMMUFD_DRIVER is selected to utilize the dirty tracking functionality 
of IOMMU. Therefore, this is a select option rather than a dependency, 
similar to how the pds and mlx5 VFIO drivers handle it in their Kconfig 
files.

> 
>>           help
>>             This provides migration support for VIRTIO NET PCI VF devices
>>             using the VFIO framework.
> 
> This is still too open ended for me, there is specific PF support
> required in the device to make this work.  Maybe...
> 
> 	This provides migration support for VIRTIO NET PCI VF devices
> 	using the VFIO framework.  Migration support requires the
> 	SR-IOV PF device to support specific VIRTIO extensions,
> 	otherwise this driver provides no additional functionality
> 	beyond vfio-pci.
> 
> 	Migration support in this driver relies on dirty page tracking
> 	provided by the IOMMU hardware and exposed through IOMMUFD, any
> 	other use cases are dis-recommended.
> 
>>             If you don't know what to do here, say N.

Looks good.

>>
>> config VFIO_PCI_ADMIN_LEGACY
> 
> VIRTIO_VFIO_PCI_ADMIN_LEGACY
> 
>>           bool "VFIO support for legacy I/O access for VIRTIO NET PCI
>>                 devices"
> 
> Maybe:
> 
> 	"Legacy I/O support for VIRTIO NET PCI VF devices"
> 
>>           depends on VIRTIO_VFIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
>>           default y
>>           help
>>             This provides support for exposing VIRTIO NET VF devices which
>>             support legacy IO access, using the VFIO framework that can
>>             work with a legacy virtio driver in the guest.
>>             Based on PCIe spec, VFs do not support I/O Space.
>>             As of that this driver emulates I/O BAR in software to let a
>>             VF be seen as a transitional device by its users and let it
>>             work with a legacy driver.
> 
> Maybe:
> 
> 	This extends the virtio-vfio-pci driver to support legacy I/O
> 	access, allowing use of legacy virtio drivers with VIRTIO NET
> 	PCI VF devices.  Legacy I/O support requires the SR-IOV PF
> 	device to support and enable specific VIRTIO extensions,
> 	otherwise this driver provides no additional functionality
> 	beyond vfio-pci.
> 

Looks good.

> IMO, noting the PF requirement in each is important (feel free to
> elaborate on specific VIRTIO extension requirements).  It doesn't seem
> necessary to explain how the legacy compatibility works, only that this
> driver makes the VF compatible with the legacy driver.
> 
> Are both of these options configurable at the PF in either firmware or
> software? 

These options are configured in the firmware.

  I used "support and enable" in the legacy section assuming
> that there is such a knob, but for migration it seems less necessary
> that there's an enable step.  Please correct based on the actual
> device behavior.  Thanks,
> 

Migration is such a basic functionality that we may expect it to be 
enabled by default, so your suggestion seems reasonable. Let’s proceed 
with it.


Thanks,
Yishai

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-11  8:22             ` Yishai Hadas
@ 2024-11-11 10:32               ` Joao Martins
  2024-11-11 14:17                 ` Yishai Hadas
  0 siblings, 1 reply; 28+ messages in thread
From: Joao Martins @ 2024-11-11 10:32 UTC (permalink / raw)
  To: Yishai Hadas, Alex Williamson
  Cc: Jason Gunthorpe, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, leonro, maorg

>>>           depends on VIRTIO_PCI
>>>           select VFIO_PCI_CORE
>>>           select IOMMUFD_DRIVER
>>
>> IIUC, this is not a dependency, the device will just lack dirty page
>> tracking with either the type1 backend or when using iommufd when the
>> IOMMU hardware doesn't have dirty page tracking, therefore all VM
>> memory is perpetually dirty.  Do I have that right?
> 
> IOMMUFD_DRIVER is selected to utilize the dirty tracking functionality of IOMMU.
> Therefore, this is a select option rather than a dependency, similar to how the
> pds and mlx5 VFIO drivers handle it in their Kconfig files.
> 

Yishai, I think Alex is right here.

'select IOMMUFD_DRIVER' is more for VF dirty trackers where it uses the same
helpers as IOMMUFD does for dirty tracking. But it's definitely not signaling
intent for 'IOMMU dirty tracking' but rather 'VF dirty tracking'

If you want to tie in to IOMMU dirty tracking you probably want to do:

	'select IOMMUFD'

But that is a big hammer, as you also need the VFIO_DEVICE_CDEV kconfig selected
as well and probably more.

Perhaps best to do like the qat/hisilicon drivers and let the user optionally
pick it. Migration is anyway disabled when using type1 (unless you force it,
and then it does the perpetual dirty trick).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-11 10:32               ` Joao Martins
@ 2024-11-11 14:17                 ` Yishai Hadas
  2024-11-11 15:30                   ` Joao Martins
  0 siblings, 1 reply; 28+ messages in thread
From: Yishai Hadas @ 2024-11-11 14:17 UTC (permalink / raw)
  To: Joao Martins, Alex Williamson
  Cc: Jason Gunthorpe, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, leonro, maorg

On 11/11/2024 12:32, Joao Martins wrote:
>>>>            depends on VIRTIO_PCI
>>>>            select VFIO_PCI_CORE
>>>>            select IOMMUFD_DRIVER
>>>
>>> IIUC, this is not a dependency, the device will just lack dirty page
>>> tracking with either the type1 backend or when using iommufd when the
>>> IOMMU hardware doesn't have dirty page tracking, therefore all VM
>>> memory is perpetually dirty.  Do I have that right?
>>
>> IOMMUFD_DRIVER is selected to utilize the dirty tracking functionality of IOMMU.
>> Therefore, this is a select option rather than a dependency, similar to how the
>> pds and mlx5 VFIO drivers handle it in their Kconfig files.
>>
> 
> Yishai, I think Alex is right here.
> 
> 'select IOMMUFD_DRIVER' is more for VF dirty trackers where it uses the same
> helpers as IOMMUFD does for dirty tracking. But it's definitely not signaling
> intent for 'IOMMU dirty tracking' but rather 'VF dirty tracking'

I see, please see below.

> 
> If you want to tie in to IOMMU dirty tracking you probably want to do:
> 
> 	'select IOMMUFD'
> 

Looking at the below Kconfig(s) for AMD/INTEL_IOMMU [1], we can see that
if IOMMUFD is set, IOMMUFD_DRIVER is selected.

From that we can assume that to have 'IOMMU dirty tracking' the
IOMMUFD_DRIVER is finally needed/selected, right?

So you are saying that it's redundant in the vfio/virtio driver, as it
will be selected down the road once needed?

[1]
https://elixir.bootlin.com/linux/v6.12-rc6/source/drivers/iommu/intel/Kconfig#L17
https://elixir.bootlin.com/linux/v6.12-rc6/source/drivers/iommu/amd/Kconfig#L16

> But that is a big hammer, as you also need the VFIO_DEVICE_CDEV kconfig selected
> as well and probably more.

I agree, we didn't plan to add those dependencies.

> 
> Perhaps best to do like qat/hisilicon drivers and letting the user optionally
> pick it. Migration is anyways disabled when using type1 (unless you force it,
> and it then it does the perpectual dirty trick).

I can drop it in V3 if we all agree that it's not really needed here.

Thanks,
Yishai

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-11 14:17                 ` Yishai Hadas
@ 2024-11-11 15:30                   ` Joao Martins
  2024-11-11 21:27                     ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Joao Martins @ 2024-11-11 15:30 UTC (permalink / raw)
  To: Yishai Hadas, Alex Williamson
  Cc: Jason Gunthorpe, mst, jasowang, kvm, virtualization, parav, feliu,
	kevin.tian, leonro, maorg

On 11/11/2024 14:17, Yishai Hadas wrote:
> On 11/11/2024 12:32, Joao Martins wrote:
>>>>>            depends on VIRTIO_PCI
>>>>>            select VFIO_PCI_CORE
>>>>>            select IOMMUFD_DRIVER
>>>>
>>>> IIUC, this is not a dependency, the device will just lack dirty page
>>>> tracking with either the type1 backend or when using iommufd when the
>>>> IOMMU hardware doesn't have dirty page tracking, therefore all VM
>>>> memory is perpetually dirty.  Do I have that right?
>>>
>>> IOMMUFD_DRIVER is selected to utilize the dirty tracking functionality of IOMMU.
>>> Therefore, this is a select option rather than a dependency, similar to how the
>>> pds and mlx5 VFIO drivers handle it in their Kconfig files.
>>>
>>
>> Yishai, I think Alex is right here.
>>
>> 'select IOMMUFD_DRIVER' is more for VF dirty trackers where it uses the same
>> helpers as IOMMUFD does for dirty tracking. But it's definitely not signaling
>> intent for 'IOMMU dirty tracking' but rather 'VF dirty tracking'
> 
> I see, please see below.
> 
I should have said that in the context of VFIO drivers selecting it, not broadly
across all drivers that select it.

>>
>> If you want to tie in to IOMMU dirty tracking you probably want to do:
>>
>>     'select IOMMUFD'
>>
> 
> Looking at the below Kconfig(s) for AMD/INTEL_IOMMU [1], we can see that if
> IOMMUFD is set, IOMMUFD_DRIVER is selected.
> 
Correct.

> From that we can assume that to have 'IOMMU dirty tracking' the IOMMUFD_DRIVER is
> finally needed/selected, right?
> 

Right, if you have CONFIG_IOMMUFD then the IOMMU drivers will in the end
auto-select IOMMUFD_DRIVER. But standalone, at best you can assume that
'something does dirty tracking'. The context (i.e. who selects it) is what
tells you if it's VF or IOMMU.

In your example above, that option is meant to be selected by *a* dirty tracker,
and it's because AMD/Intel select it that you would have IOMMU dirty
tracking. The option essentially selects the IOVA bitmap helpers (a zerocopy
scheme to set bits in a bitmap), which are used by both VF dirty trackers and
IOMMU dirty trackers. Originally this started in VFIO and later got moved into
IOMMUFD, which is why we have this kconfig to allow independent use.

> So you are saying that it's redundant in the vfio/virtio driver as it will be
> selected down the road once needed ?
> 
Right.

Of course, it will always depend on the user enabling the right kconfigs and such.
But that would be no different from the other drivers that don't support VF
dirty tracking.

> [1]
> https://elixir.bootlin.com/linux/v6.12-rc6/source/drivers/iommu/intel/Kconfig#L17
> https://elixir.bootlin.com/linux/v6.12-rc6/source/drivers/iommu/amd/Kconfig#L16
> 
>> But that is a big hammer, as you also need the VFIO_DEVICE_CDEV kconfig selected
>> as well and probably more.
> 
> I agree, we didn't plan to add those dependencies.
> 
>>
>> Perhaps best to do like qat/hisilicon drivers and letting the user optionally
>> pick it. Migration is anyways disabled when using type1 (unless you force it,
>> and it then it does the perpectual dirty trick).
> 
> I can drop it in V3 if we all agree that it's not really needed here.
> 

Sure

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured
  2024-11-11 15:30                   ` Joao Martins
@ 2024-11-11 21:27                     ` Alex Williamson
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Williamson @ 2024-11-11 21:27 UTC (permalink / raw)
  To: Joao Martins
  Cc: Yishai Hadas, Jason Gunthorpe, mst, jasowang, kvm, virtualization,
	parav, feliu, kevin.tian, leonro, maorg

On Mon, 11 Nov 2024 15:30:47 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 11/11/2024 14:17, Yishai Hadas wrote:
> > On 11/11/2024 12:32, Joao Martins wrote:  
> >>>>>            depends on VIRTIO_PCI
> >>>>>            select VFIO_PCI_CORE
> >>>>>            select IOMMUFD_DRIVER  
> >>>>
> >>>> IIUC, this is not a dependency, the device will just lack dirty page
> >>>> tracking with either the type1 backend or when using iommufd when the
> >>>> IOMMU hardware doesn't have dirty page tracking, therefore all VM
> >>>> memory is perpetually dirty.  Do I have that right?  
> >>>
> >>> IOMMUFD_DRIVER is selected to utilize the dirty tracking functionality of IOMMU.
> >>> Therefore, this is a select option rather than a dependency, similar to how the
> >>> pds and mlx5 VFIO drivers handle it in their Kconfig files.
> >>>  
> >>
> >> Yishai, I think Alex is right here.
> >>
> >> 'select IOMMUFD_DRIVER' is more for VF dirty trackers where it uses the same
> >> helpers as IOMMUFD does for dirty tracking. But it's definitely not signaling
> >> intent for 'IOMMU dirty tracking' but rather 'VF dirty tracking'  
> > 
> > I see, please see below.
> >   
> I should have said that in the context of VFIO drivers selecting it, not broadly
> across all drivers that select it.
> 
> >>
> >> If you want to tie in to IOMMU dirty tracking you probably want to do:
> >>
> >>     'select IOMMUFD'
> >>  
> > 
> > Looking at the below Kconfig(s) for AMD/INTEL_IOMMU [1], we can see that if
> > IOMMUFD is set, IOMMUFD_DRIVER is selected.
> >   
> Correct.
> 
> > From that we can assume that to have 'IOMMU dirty tracking' the IOMMUFD_DRIVER is
> > finally needed/selected, right?
> >   
> 
> Right, if you have CONFIG_IOMMUFD then the IOMMU will in the end auto-select
> IOMMU_DRIVER. But standalone at best you can assume that 'something does dirty
> tracking'. The context (i.e. who selects it) is what tells you if it's VF or IOMMU.
> 
> In your example above, that option is meant to be selected by *a* dirty tracker,
> and it's because AMD/Intel was selecting that you would have IOMMU dirty
> tracking. The option essentially selects IOVA bitmaps helpers (zerocopy scheme
> to set bits in a bitmap) which is both used by VF dirty trackers and IOMMU dirty
> trackers. Originally this started in VFIO and later got moved into IOMMUFD which
> is why we have this kconfig to allow independent use.

Yeah, I agree.  IOMMUFD_DRIVER only configures in iova_bitmap support,
independent of IOMMUFD, which mlx5 requires but this driver does not.

> > So you are saying that it's redundant in the vfio/virtio driver as it will be
> > selected down the road once needed ?
> >   
> Right.
> 
> Of course, it will always depend on user enabling the right kconfigs and such.
> But that would be no different than the other drivers than don't support VF
> dirty tracking.
> 
> > [1]
> > https://elixir.bootlin.com/linux/v6.12-rc6/source/drivers/iommu/intel/Kconfig#L17
> > https://elixir.bootlin.com/linux/v6.12-rc6/source/drivers/iommu/amd/Kconfig#L16
> >   
> >> But that is a big hammer, as you also need the VFIO_DEVICE_CDEV kconfig selected
> >> as well and probably more.  
> > 
> > I agree, we didn't plan to add those dependencies.
> >   
> >>
> >> Perhaps best to do like qat/hisilicon drivers and letting the user optionally
> >> pick it. Migration is anyways disabled when using type1 (unless you force it,
> >> and it then it does the perpectual dirty trick).  

Yes, at least for QEMU, unless the user forces the device to support
migration we'll add a migration blocker rather than use the perpetually
dirty trick.  Ultimately the support depends on the underlying IOMMU
capabilities where we're running anyway, so it makes sense to me to
leave this to the user rather than trying to force a kernel config that
can support IOMMU dirty page tracking.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-11-11 21:27 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-04 10:21 [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Yishai Hadas
2024-11-04 10:21 ` [PATCH V1 vfio 1/7] virtio_pci: Introduce device parts access commands Yishai Hadas
2024-11-04 10:21 ` [PATCH V1 vfio 2/7] virtio: Extend the admin command to include the result size Yishai Hadas
2024-11-04 10:21 ` [PATCH V1 vfio 3/7] virtio: Manage device and driver capabilities via the admin commands Yishai Hadas
2024-11-04 10:21 ` [PATCH V1 vfio 4/7] virtio-pci: Introduce APIs to execute device parts " Yishai Hadas
2024-11-04 10:21 ` [PATCH V1 vfio 5/7] vfio/virtio: Add support for the basic live migration functionality Yishai Hadas
2024-11-05 22:47   ` Alex Williamson
2024-11-06 10:21     ` Yishai Hadas
2024-11-06 21:33       ` Alex Williamson
2024-11-07  9:39         ` Yishai Hadas
2024-11-06 15:48     ` Jason Gunthorpe
2024-11-04 10:21 ` [PATCH V1 vfio 6/7] vfio/virtio: Add PRE_COPY support for live migration Yishai Hadas
2024-11-05 23:18   ` Alex Williamson
2024-11-06 11:16     ` Yishai Hadas
2024-11-06 21:40       ` Alex Williamson
2024-11-04 10:21 ` [PATCH V1 vfio 7/7] vfio/virtio: Enable live migration once VIRTIO_PCI was configured Yishai Hadas
2024-11-05 23:29   ` Alex Williamson
2024-11-06 13:59     ` Jason Gunthorpe
2024-11-06 22:27       ` Alex Williamson
2024-11-07 12:57         ` Yishai Hadas
2024-11-07 21:25           ` Alex Williamson
2024-11-11  8:22             ` Yishai Hadas
2024-11-11 10:32               ` Joao Martins
2024-11-11 14:17                 ` Yishai Hadas
2024-11-11 15:30                   ` Joao Martins
2024-11-11 21:27                     ` Alex Williamson
2024-11-06  9:32 ` [PATCH V1 vfio 0/7] Enhance the vfio-virtio driver to support live migration Michael S. Tsirkin
2024-11-06 22:30   ` Alex Williamson
