[RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices
@ 2025-08-03  2:47 Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2025-08-03  2:47 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy
  Cc: linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Chaitanya Kulkarni

Hi,

Some devices, such as Infrastructure Processing Units (IPUs),
Data Processing Units (DPUs), and  SSDs expose SR-IOV-capable NVMe
devices to the host. These virtual function (VF) devices support live 
migration via specific NVMe admin commands issued through the parent
PF's admin queue.

NVMe TP4159 defines support for basic live migration operations,
including Suspend, Resume, Get Controller State, and Set Controller
State. While TP4159 standardizes the command interface, it does not
yet define a fixed layout for controller state, NVIDIA and others
in NVMe TWG is actively working on defining this layout.

This series introduces a vfio-pci driver to enable live migration of
SR-IOV NVMe devices. It also adds interface hooks to the core NVMe
driver to allow VF command submission through the PF's admin queue.
Adding support for migration of non SR-IOV devices can be added
incrementally.

This RFC complies with the TP4159 specification and is derived from
initial submission of Intel and NVIDIA’s vendor-specific
implementation.

Objective for this RFC
----------------------

Our initial submission received feedback encouraging standardization
of live migration support for NVMe. In response, NVIDIA and Intel
collaborated to merge architectural elements from TP4173 into TP4159.

Now that TP4159 has been ratified with core live migration commands,
we aim to resume discussion with the upstream community and solicit
feedback on what remains to support NVMe live migration in mainline.

What is implemented in this RFC?
--------------------------------

1. Patch 0001 introduces the core vfio-nvme driver infrastructure
   including helper routines and basic driver registration.

2. Patch 0002 adds TP4159-specific command definitions and updates
   existing NVMe data structures, such as `nvme_id_ctrl`.

3. Patch 0003 exports helpers from pci. (Needs a discussion)

4. Patch 0004 implements the TP4159 commands: Suspend, Resume,
   Get Controller State, and Set Controller State. It also includes
   debug helpers and command parsing logic.

Open Issues and Discussion Points
---------------------------------

1. This RFC exposes two new interfaces from the nvme-pci driver to
   submit admin commands for VF devices through the PF. We welcome
   input on the correct or preferred upstream approach for this.

2. Are there any gaps between the current VFIO live migration
   architecture and what is required to fully support NVMe VF
   migration?

3. TP4193 is under development in NVMe TWG, it will define subsystem 
   state and missing configuration functionality. Are there additional
   capabilities or architecture changes needed beyond what TP4193 will
   cover to upstream the VFIO NVMe Live Migration support from spec or 
   from linux kernel point of view ?

NVIDIA and Intel has started the NVMe Live Migration upstreaming work
and fully committed to upstreaming NVMe live migration support, we are
also eager to align ongoing development with community expectations and
bring the feedback to the standards representing the kernel community.

This RFC is compiles and generated on linux-nvme tree branch nvme-6.17
HEAD :-

commit 70d12a283303b1241884b04f77dc1b07fdbbc90e (origin/nvme-6.17)
Author: Maurizio Lombardi <mlombard@redhat.com>
Date:   Wed Jul 2 16:06:29 2025 +0200

    nvme-tcp: log TLS handshake failures at error level

We greatly appreciate your feedback and comments on this work.

-ck

Chaitanya Kulkarni (4):
  vfio-nvme: add vfio-nvme lm driver infrastructure
  nvme: add live migration TP 4159 definitions
  nvme: export helpers to implement vfio-nvme lm
  vfio-nvme: implement TP4159 live migration cmds

 drivers/nvme/host/core.c       |    5 +-
 drivers/nvme/host/nvme.h       |    5 +
 drivers/nvme/host/pci.c        |   34 ++
 drivers/vfio/pci/Kconfig       |    2 +
 drivers/vfio/pci/Makefile      |    2 +
 drivers/vfio/pci/nvme/Kconfig  |   10 +
 drivers/vfio/pci/nvme/Makefile |    6 +
 drivers/vfio/pci/nvme/nvme.c   | 1036 ++++++++++++++++++++++++++++++++
 drivers/vfio/pci/nvme/nvme.h   |   39 ++
 include/linux/nvme.h           |  334 +++++++++-
 10 files changed, 1471 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/nvme/Kconfig
 create mode 100644 drivers/vfio/pci/nvme/Makefile
 create mode 100644 drivers/vfio/pci/nvme/nvme.c
 create mode 100644 drivers/vfio/pci/nvme/nvme.h

-- 
2.40.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure
  2025-08-03  2:47 [RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices Chaitanya Kulkarni
@ 2025-08-03  2:47 ` Chaitanya Kulkarni
  2025-08-04 15:20   ` Shameerali Kolothum Thodi
                     ` (2 more replies)
  2025-08-03  2:47 ` [RFC PATCH 2/4] nvme: add live migration TP 4159 definitions Chaitanya Kulkarni
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2025-08-03  2:47 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy
  Cc: linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Chaitanya Kulkarni, Lei Rao

Add foundational infrastructure for vfio-nvme, enabling support for live
migration of NVMe devices via the VFIO framework. The following
components are included:

- Core driver skeleton for vfio-nvme support under drivers/vfio/pci/nvme/
- Definitions of basic data structures used in live migration
  (e.g., nvmevf_pci_core_device and nvmevf_migration_file)
- Implementation of helper routines for managing migration file state
- Integration of PCI driver callbacks and error handling logic
- Registration with vfio-pci-core through nvmevf_pci_ops
- Initial support for VFIO migration states and device open/close flows

Subsequent patches will build upon this base to implement actual live
migration commands and complete the vfio device state handling logic.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/vfio/pci/Kconfig       |   2 +
 drivers/vfio/pci/Makefile      |   2 +
 drivers/vfio/pci/nvme/Kconfig  |  10 ++
 drivers/vfio/pci/nvme/Makefile |   3 +
 drivers/vfio/pci/nvme/nvme.c   | 196 +++++++++++++++++++++++++++++++++
 drivers/vfio/pci/nvme/nvme.h   |  36 ++++++
 6 files changed, 249 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/Kconfig
 create mode 100644 drivers/vfio/pci/nvme/Makefile
 create mode 100644 drivers/vfio/pci/nvme/nvme.c
 create mode 100644 drivers/vfio/pci/nvme/nvme.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2b0172f54665..8f94429e7adc 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -67,4 +67,6 @@ source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
 
 source "drivers/vfio/pci/qat/Kconfig"
 
+source "drivers/vfio/pci/nvme/Kconfig"
+
 endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..be8c4b5ee0ba 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -10,6 +10,8 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme/
+
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
 
 obj-$(CONFIG_PDS_VFIO_PCI) += pds/
diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
new file mode 100644
index 000000000000..12e0eaba0de1
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVME_VFIO_PCI
+	tristate "VFIO support for NVMe PCI devices"
+	depends on NVME_CORE
+	depends on VFIO_PCI_CORE
+	help
+	  This provides migration support for NVMe devices using the
+	  VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
new file mode 100644
index 000000000000..2f4a0ad3d9cf
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
+nvme-vfio-pci-y := nvme.o
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
new file mode 100644
index 000000000000..08bee3274207
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
+ * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+#include <linux/anon_inodes.h>
+#include <linux/kernel.h>
+#include <linux/vfio_pci_core.h>
+
+#include "nvme.h"
+
+static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
+{
+	mutex_lock(&migf->lock);
+
+	/* release the device states buffer */
+	kvfree(migf->vf_data);
+	migf->vf_data = NULL;
+	migf->disabled = true;
+	migf->total_length = 0;
+	migf->filp->f_pos = 0;
+	mutex_unlock(&migf->lock);
+}
+
+static void nvmevf_disable_fds(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	if (nvmevf_dev->resuming_migf) {
+		nvmevf_disable_fd(nvmevf_dev->resuming_migf);
+		fput(nvmevf_dev->resuming_migf->filp);
+		nvmevf_dev->resuming_migf = NULL;
+	}
+
+	if (nvmevf_dev->saving_migf) {
+		nvmevf_disable_fd(nvmevf_dev->saving_migf);
+		fput(nvmevf_dev->saving_migf->filp);
+		nvmevf_dev->saving_migf = NULL;
+	}
+}
+
+static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	lockdep_assert_held(&nvmevf_dev->state_mutex);
+again:
+	spin_lock(&nvmevf_dev->reset_lock);
+	if (nvmevf_dev->deferred_reset) {
+		nvmevf_dev->deferred_reset = false;
+		spin_unlock(&nvmevf_dev->reset_lock);
+		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		nvmevf_disable_fds(nvmevf_dev);
+		goto again;
+	}
+	mutex_unlock(&nvmevf_dev->state_mutex);
+	spin_unlock(&nvmevf_dev->reset_lock);
+}
+
+static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct nvmevf_pci_core_device,
+			    core_device);
+}
+
+static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev;
+	struct vfio_pci_core_device *vdev;
+	int ret;
+
+	nvmevf_dev = container_of(core_vdev, struct nvmevf_pci_core_device,
+			core_device.vdev);
+	vdev = &nvmevf_dev->core_device;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	if (nvmevf_dev->migrate_cap)
+		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+}
+
+static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev;
+
+	nvmevf_dev = container_of(core_vdev, struct nvmevf_pci_core_device,
+			core_device.vdev);
+
+	if (nvmevf_dev->migrate_cap) {
+		mutex_lock(&nvmevf_dev->state_mutex);
+		nvmevf_disable_fds(nvmevf_dev);
+		nvmevf_state_mutex_unlock(nvmevf_dev);
+	}
+
+	vfio_pci_core_close_device(core_vdev);
+}
+
+static const struct vfio_device_ops nvmevf_pci_ops = {
+	.name = "nvme-vfio-pci",
+	.release = vfio_pci_core_release_dev,
+	.open_device = nvmevf_pci_open_device,
+	.close_device = nvmevf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+};
+
+static int nvmevf_pci_probe(struct pci_dev *pdev,
+			    const struct pci_device_id *id)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev;
+	int ret;
+
+	nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device, core_device.vdev,
+				       &pdev->dev, &nvmevf_pci_ops);
+	if (IS_ERR(nvmevf_dev))
+		return PTR_ERR(nvmevf_dev);
+
+	dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
+	ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
+	if (ret)
+		goto out_put_dev;
+
+	return 0;
+
+out_put_dev:
+	vfio_put_device(&nvmevf_dev->core_device.vdev);
+	return ret;
+}
+
+static void nvmevf_pci_remove(struct pci_dev *pdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
+
+	vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
+	vfio_put_device(&nvmevf_dev->core_device.vdev);
+}
+
+static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
+
+	if (!nvmevf_dev->migrate_cap)
+		return;
+
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&nvmevf_dev->reset_lock);
+	nvmevf_dev->deferred_reset = true;
+	if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
+		spin_unlock(&nvmevf_dev->reset_lock);
+		return;
+	}
+	spin_unlock(&nvmevf_dev->reset_lock);
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+}
+
+static const struct pci_error_handlers nvmevf_err_handlers = {
+	.reset_done = nvmevf_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
+static struct pci_driver nvmevf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.probe = nvmevf_pci_probe,
+	.remove = nvmevf_pci_remove,
+	.err_handler = &nvmevf_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(nvmevf_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Chaitanya Kulkarni <kch@nvidia.com>");
+MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live migration support for NVMe");
diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
new file mode 100644
index 000000000000..ee602254679e
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
+ * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved
+ */
+
+#ifndef NVME_VFIO_PCI_H
+#define NVME_VFIO_PCI_H
+
+#include <linux/kernel.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/nvme.h>
+
+struct nvmevf_migration_file {
+	struct file *filp;
+	struct mutex lock;
+	bool disabled;
+	u8 *vf_data;
+	size_t total_length;
+};
+
+struct nvmevf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	int vf_id;
+	u8 migrate_cap:1;
+	u8 deferred_reset:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
+	struct nvmevf_migration_file *resuming_migf;
+	struct nvmevf_migration_file *saving_migf;
+};
+
+#endif /* NVME_VFIO_PCI_H */
-- 
2.40.0



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/4] nvme: add live migration TP 4159 definitions
  2025-08-03  2:47 [RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
@ 2025-08-03  2:47 ` Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 3/4] nvme: export helpers to implement vfio-nvme lm Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 4/4] vfio-nvme: implement TP4159 live migration cmds Chaitanya Kulkarni
  3 siblings, 0 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2025-08-03  2:47 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy
  Cc: linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Chaitanya Kulkarni, Lei Rao

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="y", Size: 16022 bytes --]

nvme: add TP4159 live migration definitions and command support

Introduce core definitions and data structures required to support
Live Migration operations as described in TP4159. These updates
enable controller state extraction, transfer, and restoration.

Key changes:

- Extend nvme_id_ctrl with TP4159 migration capability fields:
  - CMMRTD/NMMRTD: Migration tracking granularity
  - MCUDMQ/MNSUDMQ: Migration controller suspend queue depth
  - TRATTR: Tracking attribute bitfield (e.g., THMCS, TUDCS)

- Define controller state format discovery (CNS=0x20):
  - struct nvme_lm_supported_ctrl_state_fmts: layout for reporting
    NVMe-defined and vendor-defined controller state formats
  - struct nvme_lm_ctrl_state_fmts_info: internal parsing view

- Add live migration controller state format v0 support:
  - struct nvme_lm_nvme_cs_v0_state: encapsulates suspendable state
    including I/O submission and completion queues
  - struct nvme_lm_iosq_state / iocq_state: PRP, QID, head/tail, etc.

- Introduce Migration Send (Opcode 0x43) and Receive (0x42):
  - struct nvme_lm_send_cmd: supports suspend, resume, set state
  - struct nvme_lm_recv_cmd: supports get controller state

- Support for sequence indicators (SEQIND) to allow multi-part
  transfer of controller state buffers during suspend/resume

- Add new admin opcodes to enum nvme_admin_opcode:
  - nvme_admin_lm_send = 0x41
  - nvme_admin_lm_recv = 0x42

- Extend union nvme_command to include struct nvme_lm_command,
  enabling transport of send/receive commands through common paths

- Add new status code definitions for migration error handling:
  - NVME_SC_NOT_ENOUGH_RESOURCE
  - NVME_SC_CTRL_SUSPENDED
  - NVME_SC_CTRL_NOT_SUSPENDED
  - NVME_SC_CTRL_DATA_QUEUE_FAIL

- Include migration-related size checks in _nvme_check_size() to
  ensure live migration command structures align to spec (64 bytes)

These additions form the low-level protocol and data contract
supporting live migration of NVMe controllers, and are a prerequisite
for implementing suspend/resume and controller state transfer flows.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/host/core.c |   2 +
 include/linux/nvme.h     | 334 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 335 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5d8638086cba..2445862ac7d4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5314,6 +5314,8 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_rotational_media_log) != 512);
 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_lm_send_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_lm_recv_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_feat_host_behavior) != 512);
 }
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 655d194f8e72..69a8c48faa6c 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -362,7 +362,19 @@ struct nvme_id_ctrl {
 	__u8			anacap;
 	__le32			anagrpmax;
 	__le32			nanagrpid;
-	__u8			rsvd352[160];
+	/* --- Live Migration support (TP4159 additions) --- */
+	__le16			cmmrtd;
+	__le16			nmmrtd;
+	__u8			minmrtg;
+	__u8			maxmrtg;
+	__u8			trattr;
+	__u8			rsvd577;
+	__le16			mcudmq;
+	__le16			mnsudmq;
+	__le16			mcmr;
+	__le16			nmcmr;
+	__le16			mcdqpc;
+	__u8			rsvd352[160 - 20]; /* pad to offset 512 */
 	__u8			sqes;
 	__u8			cqes;
 	__le16			maxcmd;
@@ -394,6 +406,14 @@ struct nvme_id_ctrl {
 	__u8			vs[1024];
 };
 
+/* Tracking Attributes (TRATTR) Bit Definitions */
+/* Track Host Memory Changes Support */
+#define NVME_TRATTR_THMCS         (1 << 0)
+/* Track User Data Changes Support */
+#define NVME_TRATTR_TUDCS         (1 << 1)
+ /* Memory Range Tracking Length Limit */
+#define NVME_TRATTR_MRTLL         (1 << 2)
+
 enum {
 	NVME_CTRL_CMIC_MULTI_PORT		= 1 << 0,
 	NVME_CTRL_CMIC_MULTI_CTRL		= 1 << 1,
@@ -409,6 +429,7 @@ enum {
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
 	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 8,
+	NVME_CTRL_OACS_HMLMS			= 1 << 11,
 	NVME_CTRL_LPA_CMD_EFFECTS_LOG		= 1 << 1,
 	NVME_CTRL_CTRATT_128_ID			= 1 << 0,
 	NVME_CTRL_CTRATT_NON_OP_PSP		= 1 << 1,
@@ -567,6 +588,7 @@ enum {
 	NVME_ID_CNS_NS_GRANULARITY	= 0x16,
 	NVME_ID_CNS_UUID_LIST		= 0x17,
 	NVME_ID_CNS_ENDGRP_LIST		= 0x19,
+	NVME_ID_CNS_LM_CTRL_STATE_FMT	= 0x20,
 };
 
 enum {
@@ -1290,6 +1312,300 @@ enum {
 	NVME_ENABLE_LBAFEE	= 1,
 };
 
+/* Figure SCSF-FIG1: Supported Controller State Formats Data Structure */
+/**
+ * Supported Controller State Formats (SCSF-FIG1)
+ *
+ * This describes the Identify CNS=0x20 response layout, which contains:
+ *  - NV    : Number of NVMe-defined controller state format versions
+ *  - NUUID : Number of vendor-specific format UUIDs
+ *  - VERS[NV]  : Array of 2-byte version IDs
+ *  - UUID[NUUID] : Array of 16-byte UUIDs
+ *
+ * Memory layout (variable-sized structure):
+ *
+ *  +------------------+-------------------------------+
+ *  | Offset (Bytes)   | Field                         |
+ *  +------------------+-------------------------------+
+ *  | 0                | NV (Number of Versions)       |
+ *  | 1                | NUUID (Number of UUIDs)       |
+ *  +------------------+-------------------------------+
+ *  | 2                | VERS[0] (2 bytes)             |
+ *  | 4                | VERS[1] (2 bytes)             |
+ *  | ..               | ...                           |
+ *  | 2 + NV*2 - 2     | VERS[NV-1] (2 bytes)          |
+ *  +------------------+-------------------------------+
+ *  | ...              | UUID[0] (16 bytes)            |
+ *  | ...              | UUID[1] (16 bytes)            |
+ *  | ..               | ...                           |
+ *  | ...              | UUID[NUUID-1] (16 bytes)      |
+ *  +------------------+-------------------------------+
+ *
+ * Total size = 2 + NV * 2 + NUUID * 16 bytes.
+ */
+
+#define NVME_LM_CTRL_STATE_HDR_SIZE	2
+#define NVME_LM_VERSION_ENTRY_SIZE	2
+#define NVME_LM_UUID_ENTRY_SIZE		16
+
+struct nvme_lm_supported_ctrl_state_fmts {
+	__u8		nv;
+	__u8		nuuid;
+	__le16		vers[]; /* Followed by uuid[NUUID][16] */
+} __packed;
+
+struct nvme_lm_ctrl_state_fmts_info {
+	__u8		nv;
+	__u8		nuuid;
+	const __le16	*vers;
+	const __u8	(*uuids)[16];
+	void		*ctrl_state_raw_buf;
+	size_t		raw_len;
+};
+
+/**
+ * Controller State Buffer (SCS-FIG5)
+ *
+ * This describes the Migration Receive (Opcode = 0x42, Select = 0h)
+ * response layout, which contains:
+ *
+ *  - version : Structure version (e.g. 0x0000)
+ *  - csattr  : Controller state attributes
+ *  - nvmecss : Length of NVMECS block (in DWORDs)
+ *  - vss     : Length of VSD block (in DWORDs)
+ *  - data[]  : Contiguous NVMECS + VSD blocks
+ *
+ * Memory layout (variable-sized structure):
+ *
+ *  +------------------+-------------------------------------------+
+ *  | Offset (Bytes)   | Field                                     |
+ *  +------------------+-------------------------------------------+
+ *  | 0x00             | version (2 bytes)                         |
+ *  | 0x02             | csattr (1 byte)                           |
+ *  | 0x03             | rsvd[13] (13 bytes)                       |
+ *  | 0x10             | nvmecss (2 bytes)                         |
+ *  | 0x12             | vss (2 bytes)                             |
+ *  +------------------+-------------------------------------------+
+ *  | 0x14             | NVMECS[0] (nvmecss * 4 bytes)             |
+ *  | ...              |                                           |
+ *  | 0x14 + NVMECS    | VSD[0] (vss * 4 bytes)                    |
+ *  +------------------+-------------------------------------------+
+ *
+ * Total size = 0x14 + (nvmecss + vss) * 4 bytes.
+ */
+
+struct nvme_lm_ctrl_state {
+	__le16	version;
+	__u8	csattr;
+	__u8	rsvd[13];
+	__le16	nvmecss;
+	__le16	vss;
+	__u8	data[]; /* NVMECS + VSD */
+} __packed;
+
+struct nvme_lm_ctrl_state_info {
+	struct nvme_lm_ctrl_state *raw;
+	size_t total_len;
+
+	const __u8 *nvme_cs;
+	const __u8 *vsd;
+
+	__le16 version;
+	__u8   csattr;
+	__le16 nvmecss;
+	__le16 vss;
+};
+
+/**
+ * struct nvme_lm_iosq_state - I/O Submission Queue State (SCS-FIG7)
+ */
+struct nvme_lm_iosq_state {
+	__le64 prp1;
+	__le16 qsize;
+	__le16 qid;
+	__le16 cqid;
+	__le16 attr;
+	__le16 head;
+	__le16 tail;
+	__u8   rsvd[4];
+} __packed;
+
+/**
+ * struct nvme_lm_iocq_state - I/O Completion Queue State (SCS-FIG8)
+ */
+struct nvme_lm_iocq_state {
+	__le64 prp1;
+	__le16 qsize;
+	__le16 qid;
+	__le16 head;
+	__le16 tail;
+	__le16 attr;
+	__u8   rsvd[4];
+} __packed;
+
+/**
+ * struct nvme_lm_nvme_cs_v0_state - NVMe Controller State v0 (SCS-FIG6)
+ *
+ * Memory layout:
+ *
+ *  +---------+--------+-----------------------------------------+
+ *  | Offset  | Size   | Field                                   |
+ *  +---------+--------+-----------------------------------------+
+ *  | 0x00    | 2 B    | VER - version of NVMECS block           |
+ *  | 0x02    | 2 B    | NIOSQ - number of I/O submission queues |
+ *  | 0x04    | 2 B    | NIOCQ - number of I/O completion queues |
+ *  | 0x06    | 2 B    | Reserved                                |
+ *  | 0x08    | ...    | IOSQ[NIOSQ] (24 bytes each)             |
+ *  | ...     | ...    | IOCQ[NIOCQ] (24 bytes each)             |
+ *  +---------+--------+-----------------------------------------+
+ */
+struct nvme_lm_nvme_cs_v0_state {
+	__le16 ver;
+	__le16 niosq;
+	__le16 niocq;
+	__u8   rsvd[2];
+	struct nvme_lm_iosq_state iosq[]; /* Followed by IOCQ */
+} __packed;
+
+
+/* Suspend Type field (cdw11[17:16]) per SUSPEND-FIG1 */
+enum nvme_lm_suspend_type {
+	NVME_LM_SUSPEND_TYPE_NOTIFY	= 0x00,
+	NVME_LM_SUSPEND_TYPE_SUSPEND	= 0x01,
+};
+
+/* Migration Send Select field values (cdw10[7:0]) */
+enum nvme_lm_send_select {
+	NVME_LM_SEND_SEL_SUSPEND	= 0x00,
+	NVME_LM_SEND_SEL_RESUME		= 0x02,
+	NVME_LM_SEND_SEL_SET_CTRL_STATE	= 0x03,
+};
+
+/* Migration Send Sequence Indicator (SEQIND) field values (cdw11[17:16]) */
+enum nvme_lm_send_seqind {
+	NVME_LM_SEQIND_MIDDLE = 0x00,
+	NVME_LM_SEQIND_FIRST  = 0x01,
+	NVME_LM_SEQIND_LAST   = 0x02,
+	NVME_LM_SEQIND_ONLY   = 0x03,
+};
+
+/**
+ * struct nvme_lm_send_cmd - Migration Send Command (Opcode 0x43)
+ *
+ * Command fields correspond to:
+ * - MSC-FIG1: Command Dword 10
+ * - SUSPEND-FIG1: Command Dword 11
+ * - MSC-FIG2: Command Dword 14 and 15
+ *
+ * Layout:
+ * @opcode:     Opcode = 0x43 (Migration Send)
+ * @flags:      Command flags
+ * @command_id: Unique command identifier
+ * @nsid:       Namespace ID (typically 0)
+ * @cdw2:       Reserved (CDW2–CDW3)
+ * @metadata:   Metadata pointer (unused)
+ * @dptr:       PRP/SGL pointer to controller state buffer
+ * @cdw10:      CDW10:
+ *              - [07:00] Select (SEL): migration operation (e.g., 0x01 = Suspend)
+ *              - [15:08] Reserved
+ *              - [31:16] Management Operation Specific (MOS)
+ * @cdw11:      CDW11:
+ *              - [15:00] Controller Identifier (CNTLID)
+ *              - [17:16] Sequence Indicator (SEQIND):
+ *                  01b = First, 00b = Middle, 10b = Last, 11b = Only
+ *              - [31:18] Reserved
+ * @cdw12:      Reserved or vendor-specific
+ * @cdw13:      CDW13:
+ *              - [07:00] UUID Index
+ *              - [15:08] UUID Parameter
+ *              - [31:16] Reserved
+ * @cdw14:      Offset Lower (used with Send Controller State)
+ * @cdw15:      Offset Upper
+ */
+struct nvme_lm_send_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le32			cdw2[2];
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le32			cdw10;
+	__le32			cdw11;
+	__le32			cdw12;
+	__le32			cdw13;
+	__le32			cdw14;
+	__le32			cdw15;
+} __packed;
+
+
+enum nvme_lm_recv_sel {
+	NVME_LM_RECV_GET_CTRL_STATE	= 0x00,
+};
+
+enum nvme_lm_recv_mos {
+	/* NVMe-defined Controller State Format v1 */
+	 NVME_LM_RECV_CSVI_NVME_V1	= 0x0000,
+	/* Additional indices may be defined by future specs or vendors */
+};
+
+#define NVME_LM_CTRL_STATE_VER   0x0000  /** Expected version value */
+
+/**
+ * struct nvme_lm_recv_cmd - NVMe Migration Receive Command
+ *
+ * This structure defines the NVMe admin command used to receive
+ * controller state or resume a controller as part of a live migration
+ * operation (Opcode 0x42), as described in TP4159.
+ *
+ * Fields:
+ * @opcode:     Opcode for Migration Receive command (0x42).
+ * @flags:      Command flags (e.g., fused operations, SGLs).
+ * @command_id: Unique identifier for this command issued by the host.
+ * @nsid:       Namespace ID (typically 0 for admin commands).
+ * @cdw2:       Reserved (Command Dwords 2–3).
+ * @metadata:   Metadata pointer (typically unused for this command).
+ * @dptr:       PRP/SGL pointer to the host buffer receiving the controller
+ *              state.
+ * @cdw10:      Command Dword 10:
+ *              - [07:00] Select (operation: e.g., Get Controller State,
+ *                Resume Controller).
+ *              - [31:08] Management Operation Specific (MOS) —
+ *                defined per Select operation.
+ * @cdw11:      Command Dword 11:
+ *              - [15:00] Controller Identifier (CNTLID) — identifies the
+ *                target controller.
+ *              - [17:16] Sequence Indicator (SEQIND) — position in
+ *                multi-part transfer sequence:
+ *                01b = First, 00b = Middle, 10b = Last, 11b = Only.
+ * @cdw12:       Offset into the controller state data (in dwords).
+ * @cdw13:       Number of dwords to transfer (0-based).
+ * @cdw14:       Reserved or vendor-specific.
+ * @cdw15:       Optional UUID Index (if vendor-specific controller state
+ *               format is used), or reserved.
+ */
+struct nvme_lm_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le32			rsvd2[2];
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__u8			sel;
+	__u8			rsvd10_1;
+	__le16			mos;
+	__le16			cntlid;
+	__u8			csuuidi;
+	__u8			csuidxp;
+	__le32			offset_lower;
+	__le32			offset_upper;
+	__u8			uuid_index;
+	__u8			rsvd14[3];
+	__le32			numd;
+};
+
+
 /* Admin commands */
 
 enum nvme_admin_opcode {
@@ -1314,6 +1630,8 @@ enum nvme_admin_opcode {
 	nvme_admin_virtual_mgmt		= 0x1c,
 	nvme_admin_nvme_mi_send		= 0x1d,
 	nvme_admin_nvme_mi_recv		= 0x1e,
+	nvme_admin_lm_send		= 0x41,
+	nvme_admin_lm_recv		= 0x42,
 	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
@@ -1347,6 +1665,8 @@ enum nvme_admin_opcode {
 		nvme_admin_opcode_name(nvme_admin_virtual_mgmt),	\
 		nvme_admin_opcode_name(nvme_admin_nvme_mi_send),	\
 		nvme_admin_opcode_name(nvme_admin_nvme_mi_recv),	\
+		nvme_admin_opcode_name(nvme_admin_lm_send),	\
+		nvme_admin_opcode_name(nvme_admin_lm_recv),	\
 		nvme_admin_opcode_name(nvme_admin_dbbuf),		\
 		nvme_admin_opcode_name(nvme_admin_format_nvm),		\
 		nvme_admin_opcode_name(nvme_admin_security_send),	\
@@ -1973,6 +2293,13 @@ struct streams_directive_params {
 	__u8	rsvd2[6];
 };
 
+struct nvme_lm_command {
+	union {
+		struct nvme_lm_recv_cmd		recv;
+		struct nvme_lm_send_cmd		send;
+	};
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -1999,6 +2326,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_lm_command lm;
 		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
@@ -2116,6 +2444,10 @@ enum {
 	NVME_SC_TRANSIENT_TR_ERR	= 0x22,
 	NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY = 0x24,
 	NVME_SC_INVALID_IO_CMD_SET	= 0x2C,
+	NVME_SC_NOT_ENOUGH_RESOURCE	= 0x38,
+	NVME_SC_CTRL_SUSPENDED		= 0x39,
+	NVME_SC_CTRL_NOT_SUSPENDED	= 0x3A,
+	NVME_SC_CTRL_DATA_QUEUE_FAIL	= 0x3B,
 
 	NVME_SC_LBA_RANGE		= 0x80,
 	NVME_SC_CAP_EXCEEDED		= 0x81,
-- 
2.40.0



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/4] nvme: export helpers to implement vfio-nvme lm
  2025-08-03  2:47 [RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 2/4] nvme: add live migration TP 4159 definitions Chaitanya Kulkarni
@ 2025-08-03  2:47 ` Chaitanya Kulkarni
  2025-08-03  2:47 ` [RFC PATCH 4/4] vfio-nvme: implement TP4159 live migration cmds Chaitanya Kulkarni
  3 siblings, 0 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2025-08-03  2:47 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy
  Cc: linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Chaitanya Kulkarni, Lei Rao

Export the nvme_error_status() function to allow error status
translation from external modules such as VFIO-based drivers.

Add helper functions in nvme-pci to support virtual function (VF)
command submission and controller ID retrieval:

  - nvme_submit_vf_cmd() submits a synchronous admin command
    from a VF context using the PF's admin queue.
  - nvme_get_ctrl_id() returns the controller ID associated
    with a PCI device.

These changes support VFIO NVMe Live Migration workflows and
infrastructure by enabling necessary low-level admin command access.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  5 +++++
 drivers/nvme/host/pci.c  | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2445862ac7d4..3620e7cb21d1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -278,7 +278,7 @@ void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
 	nvme_put_ctrl(ctrl);
 }
 
-static blk_status_t nvme_error_status(u16 status)
+blk_status_t nvme_error_status(u16 status)
 {
 	switch (status & NVME_SCT_SC_MASK) {
 	case NVME_SC_SUCCESS:
@@ -318,6 +318,7 @@ static blk_status_t nvme_error_status(u16 status)
 		return BLK_STS_IOERR;
 	}
 }
+EXPORT_SYMBOL_GPL(nvme_error_status);
 
 static void nvme_retry_req(struct request *req)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cfd2b5b90b91..5549c7e3bcd3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1218,4 +1218,9 @@ static inline bool nvme_multi_css(struct nvme_ctrl *ctrl)
 	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
 }
 
+blk_status_t nvme_error_status(u16 status);
+int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+		       size_t *result, void *buffer, unsigned int bufflen);
+u16 nvme_get_ctrl_id(struct pci_dev *dev);
+
 #endif /* _NVME_H */
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4cf87fb5d857..b239d38485ee 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3936,6 +3936,40 @@ static struct pci_driver nvme_driver = {
 	.err_handler	= &nvme_err_handler,
 };
 
+u16 nvme_get_ctrl_id(struct pci_dev *dev)
+{
+	struct nvme_dev *ndev = pci_iov_get_pf_drvdata(dev, &nvme_driver);
+
+	return ndev->ctrl.cntlid;
+}
+EXPORT_SYMBOL_GPL(nvme_get_ctrl_id);
+
+int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+		size_t *result, void *buffer, unsigned int bufflen)
+{
+	struct nvme_dev *ndev = NULL;
+	union nvme_result res = { };
+	int ret;
+
+	ndev = pci_iov_get_pf_drvdata(dev, &nvme_driver);
+	if (IS_ERR(ndev))
+		return PTR_ERR(ndev);
+	ret = __nvme_submit_sync_cmd(ndev->ctrl.admin_q, cmd, &res, buffer,
+					bufflen, NVME_QID_ANY, 0);
+	if (ret < 0)
+		return ret;
+
+	if (ret > 0) {
+		ret = blk_status_to_errno(nvme_error_status(ret));
+		return ret;
+	}
+
+	if (result)
+		*result = le32_to_cpu(res.u32);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_submit_vf_cmd);
+
 static int __init nvme_init(void)
 {
 	BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64);
-- 
2.40.0



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/4] vfio-nvme: implement TP4159 live migration cmds
  2025-08-03  2:47 [RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices Chaitanya Kulkarni
                   ` (2 preceding siblings ...)
  2025-08-03  2:47 ` [RFC PATCH 3/4] nvme: export helpers to implement vfio-nvme lm Chaitanya Kulkarni
@ 2025-08-03  2:47 ` Chaitanya Kulkarni
  2025-08-04 16:41   ` Bjorn Helgaas
  3 siblings, 1 reply; 9+ messages in thread
From: Chaitanya Kulkarni @ 2025-08-03  2:47 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy
  Cc: linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Chaitanya Kulkarni, Lei Rao

Implements TP4159-based live migration support in vfio-nvme
driver by integrating command execution, controller state handling,
and vfio migration state transitions.

Key features:

- Use nvme_submit_vf_cmd() and nvme_get_ctrl_id() helpers
  in the NVMe core PCI driver for submitting admin commands on VFs.

- Implements Migration Send (opcode 0x43) and Receive (opcode 0x42)
  command handling for suspend, resume, get/set controller state.

  _Remark_:-
  We are currently in the process of defining the state in TP4193, 
  so the current state management code will be replaced with TP4193.
  However, in this patch we include TP4159-compatible state management
  code for the sake of completeness.

- Adds parsing and serialization of controller state including:
  - NVMeCS v0 controller state format (SCS-FIG6, FIG7, FIG8)
  - Supported Controller State Formats (CNS=0x20 response)
  - Migration file abstraction with read/write fileops

- Adds debug decoders to log IOSQ/IOCQ state during migration save

- Allocates anon inodes to handle save and resume file interfaces
  exposed via VFIO migration file descriptors

- Adds vfio migration state machine transitions:
  - RUNNING → STOP: sends suspend command
  - STOP → STOP_COPY: extracts controller state (save)
  - STOP_COPY → STOP: disables file and frees buffer
  - STOP → RESUMING: allocates resume file buffer
  - RESUMING → STOP: loads controller state via set state
  - STOP → RUNNING: resumes controller via resume command

- Hooks vfio_migration_ops into vfio_pci_ops using:
  - `migration_set_state()` and `migration_get_state()`
  - Uses state_mutex + reset_lock for proper concurrency

- Queries Identify Controller (CNS=01h) to check for HMLMS bit
  in OACS field, indicating controller migration capability

- Applies runtime checks for buffer alignment, format support,
  and state size bounds to ensure spec compliance

With this patch, vfio-nvme enables live migration of VF-based
NVMe devices by implementing TP4159 migration command flows
and vfio device state transitions required by QEMU/VMM.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/vfio/pci/nvme/Makefile |   3 +
 drivers/vfio/pci/nvme/nvme.c   | 840 +++++++++++++++++++++++++++++++++
 drivers/vfio/pci/nvme/nvme.h   |   3 +
 3 files changed, 846 insertions(+)

diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
index 2f4a0ad3d9cf..d434c943436b 100644
--- a/drivers/vfio/pci/nvme/Makefile
+++ b/drivers/vfio/pci/nvme/Makefile
@@ -1,3 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
+
+KBUILD_EXTRA_SYMBOLS := $(srctree)/drivers/nvme/Module.symvers
+
 obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
 nvme-vfio-pci-y := nvme.o
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
index 08bee3274207..5283d6b606dc 100644
--- a/drivers/vfio/pci/nvme/nvme.c
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -19,6 +19,8 @@
 
 #include "nvme.h"
 
+#define MAX_MIGRATION_SIZE (256 * 1024)
+
 static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
 {
 	mutex_lock(&migf->lock);
@@ -71,6 +73,842 @@ static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev *pdev)
 			    core_device);
 }
 
+/*
+ * Convert byte length to nvme's 0-based num dwords
+ */
+static inline u32 bytes_to_nvme_numd(size_t len)
+{
+	if (len < 4)
+		return 0;
+	return (len >> 2) - 1;
+}
+
+static int nvmevf_cmd_suspend_device(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_command c = { };
+	u32 cdw11 = NVME_LM_SUSPEND_TYPE_SUSPEND << 16 | nvme_get_ctrl_id(dev);
+	int ret;
+
+	c.lm.send.opcode = nvme_admin_lm_send;
+	c.lm.send.cdw10 = cpu_to_le32(NVME_LM_SEND_SEL_SUSPEND);
+	c.lm.send.cdw11 = cpu_to_le32(cdw11);
+
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev,
+			 "Suspend virtual function failed (ret=0x%x)\n",
+			 ret);
+		return ret;
+	}
+
+	dev_dbg(&dev->dev, "Suspend command successful\n");
+	return 0;
+}
+
+static int nvmevf_cmd_resume_device(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_command c = { };
+	int ret;
+
+	c.lm.send.opcode = nvme_admin_lm_send;
+	c.lm.send.cdw10 = cpu_to_le32(NVME_LM_SEND_SEL_RESUME);
+	c.lm.send.cdw11 = cpu_to_le32(nvme_get_ctrl_id(dev));
+
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev,
+			 "Resume virtual function failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	dev_dbg(&dev->dev, "Resume command successful\n");
+	return 0;
+}
+
+/**
+ * Figure SCSF-FIG1: Supported Controller State Formats Data Structure
+ * nvme_lm_get_ctrl_state_fmts - Query and parse CNS=0x20 format list
+ * @dev:  Controller pci device
+ * @fmt:  Output struct populated with NV, NUUID, and pointers
+ *
+ * Issues Identify CNS=0x20 (Supported Controller State Formats),
+ * allocates a buffer, and parses the result into the provided struct.
+ *
+ * The caller must free fmt->ctrl_state_raw_buf using kfree().
+ *
+ * Returns 0 on success, or a negative errno on failure.
+ */
+static int nvme_lm_id_ctrl_state(struct pci_dev *dev,
+				 struct nvme_lm_ctrl_state_fmts_info *fmt)
+{
+	struct nvme_command c = { };
+	void *buf;
+	int ret;
+	__u8 nv, nuuid;
+	size_t len;
+
+	if (!fmt)
+		return -EINVAL;
+
+	/* Step 1: Read first 2 bytes to get NV and NUUID */
+	buf = kzalloc(2, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = NVME_ID_CNS_LM_CTRL_STATE_FMT;
+	c.identify.nsid = cpu_to_le32(0);
+
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, buf, 2);
+	if (ret)
+		goto out_free;
+
+	nv = ((__u8 *)buf)[0];
+	nuuid = ((__u8 *)buf)[1];
+
+	kfree(buf);
+
+	/*
+	 * Compute total buffer length for the full Identify CNS=0x20 response:
+	 *
+	 * - The first 2 bytes hold the header:
+	 *     * Byte 0: NV     — number of NVMe-defined format versions
+	 *     * Byte 1: NUUID  — number of vendor-specific UUID entries
+	 *
+	 * - Each version entry is 2 bytes (VERSION_ENTRY_SIZE)
+	 * - Each UUID entry is 16 bytes (UUID_ENTRY_SIZE)
+	 *
+	 * Therefore:
+	 *   Total length = 2 + (NV * 2) + (NUUID * 16)
+	 */
+	len = NVME_LM_CTRL_STATE_HDR_SIZE +
+	nv * NVME_LM_VERSION_ENTRY_SIZE + nuuid * NVME_LM_UUID_ENTRY_SIZE;
+
+	buf = kzalloc(len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	memset(&c, 0, sizeof(c));
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = NVME_ID_CNS_LM_CTRL_STATE_FMT;
+	c.identify.nsid = cpu_to_le32(0);
+
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, buf, len);
+	if (ret)
+		goto out_free;
+
+	/* Parse the result in-place */
+	fmt->nv = nv;
+	fmt->nuuid = nuuid;
+	fmt->vers = ((struct nvme_lm_supported_ctrl_state_fmts *)buf)->vers;
+	fmt->uuids = (const void *)(fmt->vers + nv);
+	fmt->ctrl_state_raw_buf = buf;
+	fmt->raw_len = len;
+
+	return 0;
+
+out_free:
+	kfree(buf);
+	return ret;
+}
+
+static int nvme_lm_get_ctrl_state_fmt(struct pci_dev *dev, bool debug,
+				      struct nvme_lm_ctrl_state_fmts_info *fmt)
+{
+	__u8 i;
+	int ret;
+
+	ret = nvme_lm_id_ctrl_state(dev, fmt);
+	if (ret) {
+		pr_err("Failed to get ctrl state formats (ret=%d)\n", ret);
+		return ret;
+	}
+
+	if (debug)
+		pr_info("NV = %u, NUUID = %u\n", fmt->nv, fmt->nuuid);
+
+	if (debug) {
+		for (i = 0; i < fmt->nv; i++) {
+			pr_info("  Format[%d] Version = 0x%04x\n",
+					i, le16_to_cpu(fmt->vers[i]));
+		}
+
+		for (i = 0; i < fmt->nuuid; i++) {
+			char uuid_str[37]; /* 36 chars + null */
+
+			snprintf(uuid_str, sizeof(uuid_str),
+					"%02x%02x%02x%02x-%02x%02x-%02x%02x-"
+					"%02x%02x-%02x%02x%02x%02x%02x%02x",
+					fmt->uuids[i][0], fmt->uuids[i][1],
+					fmt->uuids[i][2], fmt->uuids[i][3],
+					fmt->uuids[i][4], fmt->uuids[i][5],
+					fmt->uuids[i][6], fmt->uuids[i][7],
+					fmt->uuids[i][8], fmt->uuids[i][9],
+					fmt->uuids[i][10], fmt->uuids[i][11],
+					fmt->uuids[i][12], fmt->uuids[i][13],
+					fmt->uuids[i][14], fmt->uuids[i][15]);
+
+			pr_info("  UUID[%d] = %s\n", i, uuid_str);
+		}
+	}
+
+	return ret;
+}
+
+static void nvmevf_init_get_ctrl_state_cmd(struct nvme_command *c, __u16 cntlid,
+					   __u8 csvi, __u8 csuuidi,
+					   __u8 csuidxp, size_t buf_len)
+{
+	c->lm.recv.opcode = nvme_admin_lm_recv;
+	c->lm.recv.sel = NVME_LM_RECV_GET_CTRL_STATE;
+	/*
+	 * MOS fields treated as ctrl state version index, Use NVME V1 state.
+	 */
+	/*
+	 * For upstream read the supported controller state formats using
+	 * identify command with cns value 0x20 and make sure NVME_LM_CSVI
+	 * matches the on of the reported formats for NVMe states.
+	 */
+	c->lm.recv.mos = cpu_to_le16(csvi);
+	/* Target Controller is this a right way to get the controller ID */
+	c->lm.recv.cntlid = cpu_to_le16(cntlid);
+
+	/*
+	 * For upstream read the supported controller state formats using
+	 * identify command with cns value 0x20 and make sure NVME_LM_CSVI
+	 * matches the on of the reported formats for Vender specific states.
+	 */
+	/* adjust the state as per needed by setting the macro values */
+	c->lm.recv.csuuidi = cpu_to_le32(csuuidi);
+	c->lm.recv.csuidxp = cpu_to_le32(csuidxp);
+
+	/*
+	 * Associates the Migration Receive command with the correct migration
+	 * session UUID currently we set to 0. For now asssume that initiaor
+	 * and target has agreed on the UUIDX 0 for all the live migration
+	 * sessions.
+	 */
+	c->lm.recv.uuid_index = cpu_to_le32(0);
+
+	/*
+	 * Assume that data buffer is big enoough to hold the state,
+	 * 0-based dword count.
+	 */
+	c->lm.recv.numd = cpu_to_le32(bytes_to_nvme_numd(buf_len));
+}
+
+#define NVME_LM_MAX_NVMECS	1024
+#define NVME_LM_MAX_VSD		1024
+
+static int nvmevf_get_ctrl_state(struct pci_dev *dev,
+				__u8 csvi, __u8 csuuidi, __u8 csuidxp,
+				struct nvmevf_migration_file *migf,
+				struct nvme_lm_ctrl_state_info *state)
+{
+	struct nvme_command c = { };
+	struct nvme_lm_ctrl_state *hdr;
+	/* Make sure hdr_len is a multiple of 4 */
+	size_t hdr_len = ALIGN(sizeof(*hdr), 4);
+	__u16 id = nvme_get_ctrl_id(dev);
+	void *local_buf;
+	size_t len;
+	int ret;
+
+	/* Step 1: Issue Migration Receive (Select = 0) to get header */
+	local_buf = kzalloc(hdr_len, GFP_KERNEL);
+	if (!local_buf)
+		return -ENOMEM;
+
+	nvmevf_init_get_ctrl_state_cmd(&c, id, csvi, csuuidi, csuidxp, hdr_len);
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, local_buf, hdr_len);
+	if (ret) {
+		dev_warn(&dev->dev,
+			"nvme_admin_lm_recv failed (ret=0x%x)\n", ret);
+		kfree(local_buf);
+		return ret;
+	}
+
+	if (le16_to_cpu(hdr->nvmecss) > NVME_LM_MAX_NVMECS ||
+	    le16_to_cpu(hdr->vss) > NVME_LM_MAX_VSD) {
+		kfree(local_buf);
+		return -EINVAL;
+	}
+
+	hdr = local_buf;
+	len = hdr_len + 4 * (le16_to_cpu(hdr->nvmecss) + le16_to_cpu(hdr->vss));
+
+	kfree(local_buf);
+
+	if (len == hdr_len)
+		dev_warn(&dev->dev, "nvmecss == 0 or vss = 0\n");
+
+	/* Step 2: Allocate full buffer */
+	migf->total_length = len;
+	migf->vf_data = kvzalloc(migf->total_length, GFP_KERNEL);
+	if (!migf->vf_data)
+		return -ENOMEM;
+
+	memset(&c, 0, sizeof(c));
+	nvmevf_init_get_ctrl_state_cmd(&c, id, csvi, csuuidi, csuidxp, len);
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, migf->vf_data, len);
+	if (ret)
+		goto free_big;
+
+	/* Populate state struct */
+	hdr = (struct nvme_lm_ctrl_state *)migf->vf_data;
+	state->raw = hdr;
+	state->total_len = len;
+	state->version = hdr->version;
+	state->csattr = hdr->csattr;
+	state->nvmecss = hdr->nvmecss;
+	state->vss = hdr->vss;
+	state->nvme_cs = hdr->data;
+	state->vsd = hdr->data + le16_to_cpu(hdr->nvmecss) * 4;
+
+	return ret;
+
+free_big:
+	kvfree(migf->vf_data);
+	return ret;
+}
+
+static const struct nvme_lm_nvme_cs_v0_state *
+nvme_lm_parse_nvme_cs_v0_state(const void *data, size_t len, u16 *niosq,
+			       u16 *niocq)
+{
+	const struct nvme_lm_nvme_cs_v0_state *hdr = data;
+	size_t hdr_len = sizeof(*hdr);
+	size_t iosq_sz, iocq_sz, total;
+	u16 sq, cq;
+
+	if (!data || len < hdr_len)
+		return NULL;
+
+	sq = le16_to_cpu(hdr->niosq);
+	cq = le16_to_cpu(hdr->niocq);
+
+	iosq_sz = sq * sizeof(struct nvme_lm_iosq_state);
+	iocq_sz = cq * sizeof(struct nvme_lm_iocq_state);
+	total = hdr_len + iosq_sz + iocq_sz;
+
+	if (len < total)
+		return NULL;
+
+	if (niosq)
+		*niosq = sq;
+	if (niocq)
+		*niocq = cq;
+
+	return hdr;
+}
+
+static void nvme_lm_debug_ctrl_state(struct nvme_lm_ctrl_state_info *state)
+{
+	const struct nvme_lm_nvme_cs_v0_state *cs;
+	const struct nvme_lm_iosq_state *iosq;
+	const struct nvme_lm_iocq_state *iocq;
+	u16 niosq, niocq;
+	int i;
+
+	pr_info("Controller State:\n");
+	pr_info("Version    : 0x%04x\n", le16_to_cpu(state->version));
+	pr_info("CSATTR     : 0x%02x\n", state->csattr);
+	pr_info("NVMECS Len : %u bytes\n", le16_to_cpu(state->nvmecss) * 4);
+	pr_info("VSD Len    : %u bytes\n", le16_to_cpu(state->vss) * 4);
+
+	cs = nvme_lm_parse_nvme_cs_v0_state(state->nvme_cs,
+					    le16_to_cpu(state->nvmecss) * 4,
+					    &niosq, &niocq);
+	if (!cs) {
+		pr_warn("Failed to parse NVMECS\n");
+		return;
+	}
+
+	iosq = cs->iosq;
+	iocq = (const void *)(iosq + niosq);
+
+	for (i = 0; i < niosq; i++) {
+		pr_info("IOSQ[%d]: SIZE=%u QID=%u CQID=%u ATTR=0x%x Head=%u "
+			"Tail=%u\n", i,
+			le16_to_cpu(iosq[i].qsize),
+			le16_to_cpu(iosq[i].qid),
+			le16_to_cpu(iosq[i].cqid),
+			le16_to_cpu(iosq[i].attr),
+			le16_to_cpu(iosq[i].head),
+			le16_to_cpu(iosq[i].tail));
+	}
+
+	for (i = 0; i < niocq; i++) {
+		pr_info("IOCQ[%d]: SIZE=%u QID=%u ATTR=%u Head=%u Tail=%u\n", i,
+			le16_to_cpu(iocq[i].qsize),
+			le16_to_cpu(iocq[i].qid),
+			le16_to_cpu(iocq[i].attr),
+			le16_to_cpu(iocq[i].head),
+			le16_to_cpu(iocq[i].tail));
+	}
+}
+
+#define NVME_LM_CSUUIDI	0
+#define NVME_LM_CSVI	NVME_LM_RECV_CSVI_NVME_V1
+
+static int nvmevf_cmd_get_ctrl_state(struct nvmevf_pci_core_device *nvmevf_dev,
+				     struct nvmevf_migration_file *migf)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_lm_ctrl_state_fmts_info fmt = { };
+	struct nvme_lm_ctrl_state_info state = { };
+	__u8 csvi = NVME_LM_CSVI;
+	__u8 csuuidi = NVME_LM_CSUUIDI;
+	__u8 csuidxp = 0;
+	int ret;
+
+	/*
+	 * Read the supported controller state formats to make sure they match
+	 * csvi value specified in vfio-nvme without this check we'd not know
+	 * which controller state format we are working with.
+	 */
+	ret = nvme_lm_get_ctrl_state_fmt(dev, true, &fmt);
+	if (ret)
+		return ret;
+	/*
+	 * Number of versions NV cannot be less than controller state version
+	 * index we are using, it's an error. Please note that CSVI is
+	 * a configurable value user can define this macro at the compile time
+	 * to select the required NVMe controller state version index from
+	 * Supported Controller State Formats Data Structure.
+	 */
+	if (fmt.nv < csvi) {
+		dev_warn(&dev->dev,
+			 "required ctrl state format not found\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = nvmevf_get_ctrl_state(dev, csvi, csuuidi, csuidxp, migf, &state);
+	if (ret)
+		goto out;
+
+	if (le16_to_cpu(state.version) != csvi) {
+		dev_warn(&dev->dev,
+			 "Unexpected controller state version: 0x%04x\n",
+			 le16_to_cpu(state.version));
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Now that we have received the controller state decode the state
+	 * properly for debugging purpose
+	 */
+
+	nvme_lm_debug_ctrl_state(&state);
+
+	dev_info(&dev->dev, "Get controller state successful\n");
+
+out:
+	kfree(fmt.ctrl_state_raw_buf);
+	return ret;
+}
+
+static int nvmevf_cmd_set_ctrl_state(struct nvmevf_pci_core_device *nvmevf_dev,
+				     struct nvmevf_migration_file *migf)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_command c = { };
+	u32 sel = NVME_LM_SEND_SEL_SET_CTRL_STATE;
+	/* assume that data buffer is big enough to hold state in one cmd */
+	u32 mos = NVME_LM_SEQIND_ONLY;
+	u32 cntlid = nvme_get_ctrl_id(dev);
+	u32 csvi = NVME_LM_CSVI;
+	u32 csuuidi = NVME_LM_CSUUIDI;
+	int ret;
+
+	c.lm.send.opcode = nvme_admin_lm_send;
+	/* mos = SEQIND = 0b11 (Only) in MOS bits [17:16] */
+	c.lm.send.cdw10 = cpu_to_le32((mos << 16) | sel);
+	/*
+	 * Assume that we are only working on NVMe state and not on vendor
+	 * specific state.
+	 */
+	c.lm.send.cdw11 = cpu_to_le32(csuuidi << 24 | csvi << 16 | cntlid);
+
+	/*
+	 * Associates the Migration Send command with the correct migration
+	 * session UUID currently we set to 0. For now asssume that initiaor
+	 * and target has agreed on the UUIDX 0 for all the live migration
+	 * sessions.
+	 */
+	c.lm.send.cdw14 = cpu_to_le32(0);
+	/*
+	 * Assume that data buffer is big enoough to hold the state,
+	 * 0-based dword count.
+	 */
+	c.lm.send.cdw15 = cpu_to_le32(bytes_to_nvme_numd(migf->total_length));
+
+	ret = nvme_submit_vf_cmd(dev, &c, NULL, migf->vf_data,
+				 migf->total_length);
+	if (ret) {
+		dev_warn(&dev->dev,
+			 "Load the device states failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+
+	dev_info(&dev->dev, "Set controller state successful\n");
+	return 0;
+}
+
+static int nvmevf_release_file(struct inode *inode, struct file *filp)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+
+	nvmevf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+static ssize_t nvmevf_resume_write(struct file *filp, const char __user *buf,
+				   size_t len, loff_t *pos)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+	loff_t requested_length;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	if (*pos < 0 ||
+	    check_add_overflow((loff_t)len, *pos, &requested_length))
+		return -EINVAL;
+
+	if (requested_length > MAX_MIGRATION_SIZE)
+		return -ENOMEM;
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	ret = copy_from_user(migf->vf_data + *pos, buf, len);
+	if (ret) {
+		done = -EFAULT;
+		goto out_unlock;
+	}
+	*pos += len;
+	done = len;
+	migf->total_length += len;
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations nvmevf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = nvmevf_resume_write,
+	.release = nvmevf_release_file,
+	.llseek = noop_llseek,
+};
+
+static struct nvmevf_migration_file *
+nvmevf_pci_resume_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct nvmevf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	/* Allocate buffer to load the device states and max states is 256K */
+	migf->vf_data = kvzalloc(MAX_MIGRATION_SIZE, GFP_KERNEL);
+	if (!migf->vf_data) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	return migf;
+
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static ssize_t nvmevf_save_read(struct file *filp, char __user *buf,
+				size_t len, loff_t *pos)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (*pos > migf->total_length) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (migf->disabled) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->total_length - *pos, len);
+	if (len) {
+		ret = copy_to_user(buf, migf->vf_data + *pos, len);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += len;
+		done = len;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations nvmevf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = nvmevf_save_read,
+	.release = nvmevf_release_file,
+	.llseek = noop_llseek,
+};
+
+static struct nvmevf_migration_file *
+nvmevf_pci_save_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct nvmevf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	ret = nvmevf_cmd_get_ctrl_state(nvmevf_dev, migf);
+	if (ret)
+		goto out_free;
+
+	return migf;
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static struct file *
+nvmevf_pci_step_device_state_locked(struct nvmevf_pci_core_device *nvmevf_dev,
+				    u32 new)
+{
+	u32 cur = nvmevf_dev->mig_state;
+	int ret;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = nvmevf_cmd_suspend_device(nvmevf_dev);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP &&
+	    new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct nvmevf_migration_file *migf;
+
+		migf = nvmevf_pci_save_device_data(nvmevf_dev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		nvmevf_dev->saving_migf = migf;
+		return migf->filp;
+	}
+
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY &&
+	    new == VFIO_DEVICE_STATE_STOP) {
+		nvmevf_disable_fds(nvmevf_dev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP &&
+	    new == VFIO_DEVICE_STATE_RESUMING) {
+		struct nvmevf_migration_file *migf;
+
+		migf = nvmevf_pci_resume_device_data(nvmevf_dev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		nvmevf_dev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING &&
+	    new == VFIO_DEVICE_STATE_STOP) {
+		ret = nvmevf_cmd_set_ctrl_state(nvmevf_dev,
+						nvmevf_dev->resuming_migf);
+		if (ret)
+			return ERR_PTR(ret);
+		nvmevf_disable_fds(nvmevf_dev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP &&
+	    new == VFIO_DEVICE_STATE_RUNNING) {
+		nvmevf_cmd_resume_device(nvmevf_dev);
+		return NULL;
+	}
+
+	/* vfio_mig_get_next_state() does not use arcs other than the above */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct file *
+nvmevf_pci_set_device_state(struct vfio_device *vdev,
+			    enum vfio_device_mig_state new_state)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(vdev,
+			struct nvmevf_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	while (new_state != nvmevf_dev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, nvmevf_dev->mig_state,
+					      new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+
+		res = nvmevf_pci_step_device_state_locked(nvmevf_dev,
+							  next_state);
+		if (IS_ERR(res))
+			break;
+		nvmevf_dev->mig_state = next_state;
+		if (WARN_ON(res && new_state != nvmevf_dev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+	return res;
+}
+
+static int nvmevf_pci_get_device_state(struct vfio_device *vdev,
+				       enum vfio_device_mig_state *curr_state)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			vdev, struct nvmevf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	*curr_state = nvmevf_dev->mig_state;
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+	return 0;
+}
+
+static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
+	.migration_set_state = nvmevf_pci_set_device_state,
+	.migration_get_state = nvmevf_pci_get_device_state,
+};
+
+static bool nvmevf_migration_supp(struct pci_dev *pdev)
+{
+	struct nvme_command c = { };
+	u8 lm_supported = false;
+	struct nvme_id_ctrl *id;
+	__u16 oacs;
+	int ret;
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = NVME_ID_CNS_CTRL;
+
+	id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
+	if (!id)
+		return false;
+
+	ret = nvme_submit_vf_cmd(pdev, &c, NULL, id,
+				 sizeof(struct nvme_id_ctrl));
+	if (ret) {
+		dev_warn(&pdev->dev, "Get identify ctrl failed (ret=0x%x)\n",
+			 ret);
+		lm_supported = false;
+		goto out;
+	}
+
+	oacs = le16_to_cpu(id->oacs);
+	lm_supported = oacs & NVME_CTRL_OACS_HMLMS ? true : false;
+out:
+	kfree(id);
+	return lm_supported;
+}
+
+static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev;
+	struct pci_dev *pdev;
+	int vf_id;
+	int ret = -1;
+
+	nvmevf_dev = container_of(core_vdev, struct nvmevf_pci_core_device,
+				  core_device.vdev);
+	pdev = to_pci_dev(core_vdev->dev);
+
+	if (!pdev->is_virtfn)
+		return ret;
+
+	/*
+	 * Get the identify controller data structure to check the live
+	 * migration support.
+	 */
+	if (!nvmevf_migration_supp(pdev))
+		return ret;
+
+	nvmevf_dev->migrate_cap = 1;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return ret;
+	nvmevf_dev->vf_id = vf_id + 1;
+	core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
+
+	mutex_init(&nvmevf_dev->state_mutex);
+	spin_lock_init(&nvmevf_dev->reset_lock);
+	core_vdev->mig_ops = &nvmevf_pci_mig_ops;
+
+	return vfio_pci_core_init_dev(core_vdev);
+}
+
 static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
 {
 	struct nvmevf_pci_core_device *nvmevf_dev;
@@ -109,6 +947,7 @@ static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
 
 static const struct vfio_device_ops nvmevf_pci_ops = {
 	.name = "nvme-vfio-pci",
+	.init = nvmevf_migration_init_dev,
 	.release = vfio_pci_core_release_dev,
 	.open_device = nvmevf_pci_open_device,
 	.close_device = nvmevf_pci_close_device,
@@ -193,4 +1032,5 @@ module_pci_driver(nvmevf_pci_driver);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Chaitanya Kulkarni <kch@nvidia.com>");
+MODULE_AUTHOR("Lei Rao <lei.rao@intel.com>");
 MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live migration support for NVMe");
diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
index ee602254679e..80dd75d33762 100644
--- a/drivers/vfio/pci/nvme/nvme.h
+++ b/drivers/vfio/pci/nvme/nvme.h
@@ -33,4 +33,7 @@ struct nvmevf_pci_core_device {
 	struct nvmevf_migration_file *saving_migf;
 };
 
+extern int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+			size_t *result, void *buffer, unsigned int bufflen);
+extern u16 nvme_get_ctrl_id(struct pci_dev *dev);
 #endif /* NVME_VFIO_PCI_H */
-- 
2.40.0



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* RE: [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure
  2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
@ 2025-08-04 15:20   ` Shameerali Kolothum Thodi
  2025-08-04 16:43   ` Bjorn Helgaas
  2025-08-04 17:15   ` Alex Williamson
  2 siblings, 0 replies; 9+ messages in thread
From: Shameerali Kolothum Thodi @ 2025-08-04 15:20 UTC (permalink / raw)
  To: Chaitanya Kulkarni, kbusch@kernel.org, axboe@fb.com, hch@lst.de,
	sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com,
	jgg@ziepe.ca, yishaih@nvidia.com, kevin.tian@intel.com,
	mjrosato@linux.ibm.com, mgurtovoy@nvidia.com
  Cc: linux-nvme@lists.infradead.org, kvm@vger.kernel.org,
	Konrad.wilk@oracle.com, martin.petersen@oracle.com,
	jmeneghi@redhat.com, arnd@arndb.de, schnelle@linux.ibm.com,
	bhelgaas@google.com, joao.m.martins@oracle.com, Lei Rao



> -----Original Message-----
> From: Chaitanya Kulkarni <kch@nvidia.com>
> Sent: Sunday, August 3, 2025 3:47 AM
> To: kbusch@kernel.org; axboe@fb.com; hch@lst.de; sagi@grimberg.me;
> alex.williamson@redhat.com; cohuck@redhat.com; jgg@ziepe.ca;
> yishaih@nvidia.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; kevin.tian@intel.com;
> mjrosato@linux.ibm.com; mgurtovoy@nvidia.com
> Cc: linux-nvme@lists.infradead.org; kvm@vger.kernel.org;
> Konrad.wilk@oracle.com; martin.petersen@oracle.com;
> jmeneghi@redhat.com; arnd@arndb.de; schnelle@linux.ibm.com;
> bhelgaas@google.com; joao.m.martins@oracle.com; Chaitanya Kulkarni
> <kch@nvidia.com>; Lei Rao <lei.rao@intel.com>
> Subject: [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure
> 
> Add foundational infrastructure for vfio-nvme, enabling support for live
> migration of NVMe devices via the VFIO framework. The following
> components are included:
> 
> - Core driver skeleton for vfio-nvme support under drivers/vfio/pci/nvme/
> - Definitions of basic data structures used in live migration
>   (e.g., nvmevf_pci_core_device and nvmevf_migration_file)
> - Implementation of helper routines for managing migration file state
> - Integration of PCI driver callbacks and error handling logic
> - Registration with vfio-pci-core through nvmevf_pci_ops
> - Initial support for VFIO migration states and device open/close flows
> 
> Subsequent patches will build upon this base to implement actual live
> migration commands and complete the vfio device state handling logic.
> 
> Signed-off-by: Lei Rao <lei.rao@intel.com>
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  drivers/vfio/pci/Kconfig       |   2 +
>  drivers/vfio/pci/Makefile      |   2 +
>  drivers/vfio/pci/nvme/Kconfig  |  10 ++
>  drivers/vfio/pci/nvme/Makefile |   3 +
>  drivers/vfio/pci/nvme/nvme.c   | 196
> +++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/nvme/nvme.h   |  36 ++++++
>  6 files changed, 249 insertions(+)
>  create mode 100644 drivers/vfio/pci/nvme/Kconfig
>  create mode 100644 drivers/vfio/pci/nvme/Makefile
>  create mode 100644 drivers/vfio/pci/nvme/nvme.c
>  create mode 100644 drivers/vfio/pci/nvme/nvme.h
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 2b0172f54665..8f94429e7adc 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -67,4 +67,6 @@ source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
> 
>  source "drivers/vfio/pci/qat/Kconfig"
> 
> +source "drivers/vfio/pci/nvme/Kconfig"
> +
>  endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index cf00c0a7e55c..be8c4b5ee0ba 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -10,6 +10,8 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> 
>  obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
> 
> +obj-$(CONFIG_NVME_VFIO_PCI) += nvme/
> +
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> 
>  obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
> new file mode 100644
> index 000000000000..12e0eaba0de1
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NVME_VFIO_PCI
> +	tristate "VFIO support for NVMe PCI devices"
> +	depends on NVME_CORE
> +	depends on VFIO_PCI_CORE
> +	help
> +	  This provides migration support for NVMe devices using the
> +	  VFIO framework.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
> new file mode 100644
> index 000000000000..2f4a0ad3d9cf
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
> +nvme-vfio-pci-y := nvme.o
> diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
> new file mode 100644
> index 000000000000..08bee3274207
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/nvme.c
> @@ -0,0 +1,196 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
> + * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/types.h>
> +#include <linux/vfio.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/kernel.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "nvme.h"
> +
> +static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
> +{
> +	mutex_lock(&migf->lock);
> +
> +	/* release the device states buffer */
> +	kvfree(migf->vf_data);
> +	migf->vf_data = NULL;
> +	migf->disabled = true;
> +	migf->total_length = 0;
> +	migf->filp->f_pos = 0;
> +	mutex_unlock(&migf->lock);

May be we can use guard(mutex) here and elsewhere for this driver now?

> +}
> +
> +static void nvmevf_disable_fds(struct nvmevf_pci_core_device
> *nvmevf_dev)
> +{
> +	if (nvmevf_dev->resuming_migf) {
> +		nvmevf_disable_fd(nvmevf_dev->resuming_migf);
> +		fput(nvmevf_dev->resuming_migf->filp);
> +		nvmevf_dev->resuming_migf = NULL;
> +	}
> +
> +	if (nvmevf_dev->saving_migf) {
> +		nvmevf_disable_fd(nvmevf_dev->saving_migf);
> +		fput(nvmevf_dev->saving_migf->filp);
> +		nvmevf_dev->saving_migf = NULL;
> +	}
> +}
> +
> +static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device
> *nvmevf_dev)
> +{
> +	lockdep_assert_held(&nvmevf_dev->state_mutex);
> +again:
> +	spin_lock(&nvmevf_dev->reset_lock);
> +	if (nvmevf_dev->deferred_reset) {
> +		nvmevf_dev->deferred_reset = false;
> +		spin_unlock(&nvmevf_dev->reset_lock);
> +		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
> +		nvmevf_disable_fds(nvmevf_dev);
> +		goto again;
> +	}
> +	mutex_unlock(&nvmevf_dev->state_mutex);
> +	spin_unlock(&nvmevf_dev->reset_lock);
> +}
> +
> +static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev
> *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev-
> >dev);
> +
> +	return container_of(core_device, struct nvmevf_pci_core_device,
> +			    core_device);
> +}
> +
> +static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +	struct vfio_pci_core_device *vdev;
> +	int ret;
> +
> +	nvmevf_dev = container_of(core_vdev, struct
> nvmevf_pci_core_device,
> +			core_device.vdev);
> +	vdev = &nvmevf_dev->core_device;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	if (nvmevf_dev->migrate_cap)
> +		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
> +	vfio_pci_core_finish_enable(vdev);
> +	return 0;
> +}
> +
> +static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +
> +	nvmevf_dev = container_of(core_vdev, struct
> nvmevf_pci_core_device,
> +			core_device.vdev);
> +
> +	if (nvmevf_dev->migrate_cap) {
> +		mutex_lock(&nvmevf_dev->state_mutex);
> +		nvmevf_disable_fds(nvmevf_dev);
> +		nvmevf_state_mutex_unlock(nvmevf_dev);
> +	}
> +
> +	vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static const struct vfio_device_ops nvmevf_pci_ops = {
> +	.name = "nvme-vfio-pci",
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = nvmevf_pci_open_device,
> +	.close_device = nvmevf_pci_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,

Any reason why this driver don't need the iommufd related callbacks?
bind_iommufd/unbind_iommufd etc?


> +};
> +
> +static int nvmevf_pci_probe(struct pci_dev *pdev,
> +			    const struct pci_device_id *id)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +	int ret;
> +
> +	nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device,
> core_device.vdev,
> +				       &pdev->dev, &nvmevf_pci_ops);
> +	if (IS_ERR(nvmevf_dev))
> +		return PTR_ERR(nvmevf_dev);
> +
> +	dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
> +	ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
> +	if (ret)
> +		goto out_put_dev;
> +
> +	return 0;
> +
> +out_put_dev:
> +	vfio_put_device(&nvmevf_dev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void nvmevf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev =
> nvmevf_drvdata(pdev);
> +
> +	vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
> +	vfio_put_device(&nvmevf_dev->core_device.vdev);
> +}
> +
> +static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev =
> nvmevf_drvdata(pdev);
> +
> +	if (!nvmevf_dev->migrate_cap)
> +		return;
> +
> +	/*
> +	 * As the higher VFIO layers are holding locks across reset and using
> +	 * those same locks with the mm_lock we need to prevent ABBA
> deadlock
> +	 * with the state_mutex and mm_lock.
> +	 * In case the state_mutex was taken already we defer the cleanup
> work
> +	 * to the unlock flow of the other running context.
> +	 */

I am not sure this is relevant for this driver. This deferred logic was added to avoid
a circular locking issue w.r.t copy_to/from_user() functions holding mm_lock 
under state mutex. 

See this,
https://lore.kernel.org/all/20240229091152.56664-1-shameerali.kolothum.thodi@huawei.com/

Looking at patch #4 it doesn't look like this has that issue. May be I missed it.
Please double check.
 
Thanks,
Shameer

> +	spin_lock(&nvmevf_dev->reset_lock);
> +	nvmevf_dev->deferred_reset = true;
> +	if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
> +		spin_unlock(&nvmevf_dev->reset_lock);
> +		return;
> +	}
> +	spin_unlock(&nvmevf_dev->reset_lock);
> +	nvmevf_state_mutex_unlock(nvmevf_dev);
> +}
> +
> +static const struct pci_error_handlers nvmevf_err_handlers = {
> +	.reset_done = nvmevf_pci_aer_reset_done,
> +	.error_detected = vfio_pci_core_aer_err_detected,
> +};
> +
> +static struct pci_driver nvmevf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.probe = nvmevf_pci_probe,
> +	.remove = nvmevf_pci_remove,
> +	.err_handler = &nvmevf_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(nvmevf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Chaitanya Kulkarni <kch@nvidia.com>");
> +MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live
> migration support for NVMe");
> diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
> new file mode 100644
> index 000000000000..ee602254679e
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/nvme.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
> + * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#ifndef NVME_VFIO_PCI_H
> +#define NVME_VFIO_PCI_H
> +
> +#include <linux/kernel.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/nvme.h>
> +
> +struct nvmevf_migration_file {
> +	struct file *filp;
> +	struct mutex lock;
> +	bool disabled;
> +	u8 *vf_data;
> +	size_t total_length;
> +};
> +
> +struct nvmevf_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	int vf_id;
> +	u8 migrate_cap:1;
> +	u8 deferred_reset:1;
> +	/* protect migration state */
> +	struct mutex state_mutex;
> +	enum vfio_device_mig_state mig_state;
> +	/* protect the reset_done flow */
> +	spinlock_t reset_lock;
> +	struct nvmevf_migration_file *resuming_migf;
> +	struct nvmevf_migration_file *saving_migf;
> +};
> +
> +#endif /* NVME_VFIO_PCI_H */
> --
> 2.40.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 4/4] vfio-nvme: implement TP4159 live migration cmds
  2025-08-03  2:47 ` [RFC PATCH 4/4] vfio-nvme: implement TP4159 live migration cmds Chaitanya Kulkarni
@ 2025-08-04 16:41   ` Bjorn Helgaas
  0 siblings, 0 replies; 9+ messages in thread
From: Bjorn Helgaas @ 2025-08-04 16:41 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy,
	linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Lei Rao

On Sat, Aug 02, 2025 at 07:47:05PM -0700, Chaitanya Kulkarni wrote:
> Implements TP4159-based live migration support in vfio-nvme
> driver by integrating command execution, controller state handling,
> and vfio migration state transitions.
> 
> Key features:
> 
> - Use nvme_submit_vf_cmd() and nvme_get_ctrl_id() helpers
>   in the NVMe core PCI driver for submitting admin commands on VFs.
> 
> - Implements Migration Send (opcode 0x43) and Receive (opcode 0x42)
>   command handling for suspend, resume, get/set controller state.
> 
>   _Remark_:-
>   We are currently in the process of defining the state in TP4193, 
>   so the current state management code will be replaced with TP4193.
>   However, in this patch we include TP4159-compatible state management
>   code for the sake of completeness.
> 
> - Adds parsing and serialization of controller state including:
>   - NVMeCS v0 controller state format (SCS-FIG6, FIG7, FIG8)
>   - Supported Controller State Formats (CNS=0x20 response)
>   - Migration file abstraction with read/write fileops
> 
> - Adds debug decoders to log IOSQ/IOCQ state during migration save
> 
> - Allocates anon inodes to handle save and resume file interfaces
>   exposed via VFIO migration file descriptors
> 
> - Adds vfio migration state machine transitions:
>   - RUNNING → STOP: sends suspend command
>   - STOP → STOP_COPY: extracts controller state (save)
>   - STOP_COPY → STOP: disables file and frees buffer
>   - STOP → RESUMING: allocates resume file buffer
>   - RESUMING → STOP: loads controller state via set state
>   - STOP → RUNNING: resumes controller via resume command
> 
> - Hooks vfio_migration_ops into vfio_pci_ops using:
>   - `migration_set_state()` and `migration_get_state()`
>   - Uses state_mutex + reset_lock for proper concurrency
> 
> - Queries Identify Controller (CNS=01h) to check for HMLMS bit
>   in OACS field, indicating controller migration capability
> 
> - Applies runtime checks for buffer alignment, format support,
>   and state size bounds to ensure spec compliance

Above mixes different verb forms: "Implements", "Adds", "Allocates",
etc. vs "Use".  I think there's some trend toward using the
imperative, e.g., "Implement", "Add", "Allocate", etc.

Similar for "sends", "extracts", "disables", ...

> +static int nvme_lm_get_ctrl_state_fmt(struct pci_dev *dev, bool debug,
> +				      struct nvme_lm_ctrl_state_fmts_info *fmt)
> +{
> +	__u8 i;
> +	int ret;
> +
> +	ret = nvme_lm_id_ctrl_state(dev, fmt);
> +	if (ret) {
> +		pr_err("Failed to get ctrl state formats (ret=%d)\n", ret);
> +		return ret;
> +	}
> +
> +	if (debug)
> +		pr_info("NV = %u, NUUID = %u\n", fmt->nv, fmt->nuuid);
> +
> +	if (debug) {
> +		for (i = 0; i < fmt->nv; i++) {
> +			pr_info("  Format[%d] Version = 0x%04x\n",
> +					i, le16_to_cpu(fmt->vers[i]));
> +		}
> +
> +		for (i = 0; i < fmt->nuuid; i++) {
> +			char uuid_str[37]; /* 36 chars + null */
> +
> +			snprintf(uuid_str, sizeof(uuid_str),
> +					"%02x%02x%02x%02x-%02x%02x-%02x%02x-"
> +					"%02x%02x-%02x%02x%02x%02x%02x%02x",
> +					fmt->uuids[i][0], fmt->uuids[i][1],
> +					fmt->uuids[i][2], fmt->uuids[i][3],
> +					fmt->uuids[i][4], fmt->uuids[i][5],
> +					fmt->uuids[i][6], fmt->uuids[i][7],
> +					fmt->uuids[i][8], fmt->uuids[i][9],
> +					fmt->uuids[i][10], fmt->uuids[i][11],
> +					fmt->uuids[i][12], fmt->uuids[i][13],
> +					fmt->uuids[i][14], fmt->uuids[i][15]);
> +
> +			pr_info("  UUID[%d] = %s\n", i, uuid_str);
> +		}
> +	}
> +
> +	return ret;

I think we know "ret == 0" here; if so, just use "return 0".

> +}
> +
> +static void nvmevf_init_get_ctrl_state_cmd(struct nvme_command *c, __u16 cntlid,
> +					   __u8 csvi, __u8 csuuidi,
> +					   __u8 csuidxp, size_t buf_len)
> +{
> +	c->lm.recv.opcode = nvme_admin_lm_recv;
> +	c->lm.recv.sel = NVME_LM_RECV_GET_CTRL_STATE;
> +	/*
> +	 * MOS fields treated as ctrl state version index, Use NVME V1 state.

s/, Use/; use/ ?

> +	 */
> +	/*
> +	 * For upstream read the supported controller state formats using
> +	 * identify command with cns value 0x20 and make sure NVME_LM_CSVI
> +	 * matches the on of the reported formats for NVMe states.

Above you capitalized "CNS".

"matches one of the"?  Not sure of your intent.

> +	 */
> +	c->lm.recv.mos = cpu_to_le16(csvi);
> +	/* Target Controller is this a right way to get the controller ID */

Is this a TODO item, i.e., a question that needs to be resolved?

> +	c->lm.recv.cntlid = cpu_to_le16(cntlid);
> +
> +	/*
> +	 * For upstream read the supported controller state formats using
> +	 * identify command with cns value 0x20 and make sure NVME_LM_CSVI
> +	 * matches the on of the reported formats for Vender specific states.

Above you capitalized "CNS".

"matches one of the"?  Not sure of your intent.

s/Vender/Vendor/

> +	 */
> +	/* adjust the state as per needed by setting the macro values */
> +	c->lm.recv.csuuidi = cpu_to_le32(csuuidi);
> +	c->lm.recv.csuidxp = cpu_to_le32(csuidxp);
> +
> +	/*
> +	 * Associates the Migration Receive command with the correct migration
> +	 * session UUID currently we set to 0. For now asssume that initiaor
> +	 * and target has agreed on the UUIDX 0 for all the live migration
> +	 * sessions.

s/Associates/Associate/
s/asssume/assume/
s/initiaor/initiator/
s/UUIDX/UUID/ ?

> +	 */
> +	c->lm.recv.uuid_index = cpu_to_le32(0);
> +
> +	/*
> +	 * Assume that data buffer is big enoough to hold the state,
> +	 * 0-based dword count.

s/enoough/enough/

> +	 */
> +	c->lm.recv.numd = cpu_to_le32(bytes_to_nvme_numd(buf_len));
> +}
> +
> +#define NVME_LM_MAX_NVMECS	1024
> +#define NVME_LM_MAX_VSD		1024
> +
> +static int nvmevf_get_ctrl_state(struct pci_dev *dev,
> +				__u8 csvi, __u8 csuuidi, __u8 csuidxp,
> +				struct nvmevf_migration_file *migf,
> +				struct nvme_lm_ctrl_state_info *state)
> +{
> +	struct nvme_command c = { };
> +	struct nvme_lm_ctrl_state *hdr;
> +	/* Make sure hdr_len is a multiple of 4 */
> +	size_t hdr_len = ALIGN(sizeof(*hdr), 4);
> +	__u16 id = nvme_get_ctrl_id(dev);
> +	void *local_buf;
> +	size_t len;
> +	int ret;
> +
> +	/* Step 1: Issue Migration Receive (Select = 0) to get header */
> +	local_buf = kzalloc(hdr_len, GFP_KERNEL);
> +	if (!local_buf)
> +		return -ENOMEM;
> +
> +	nvmevf_init_get_ctrl_state_cmd(&c, id, csvi, csuuidi, csuidxp, hdr_len);
> +	ret = nvme_submit_vf_cmd(dev, &c, NULL, local_buf, hdr_len);
> +	if (ret) {
> +		dev_warn(&dev->dev,
> +			"nvme_admin_lm_recv failed (ret=0x%x)\n", ret);
> +		kfree(local_buf);
> +		return ret;
> +	}
> +
> +	if (le16_to_cpu(hdr->nvmecss) > NVME_LM_MAX_NVMECS ||
> +	    le16_to_cpu(hdr->vss) > NVME_LM_MAX_VSD) {
> +		kfree(local_buf);
> +		return -EINVAL;
> +	}
> +
> +	hdr = local_buf;
> +	len = hdr_len + 4 * (le16_to_cpu(hdr->nvmecss) + le16_to_cpu(hdr->vss));
> +
> +	kfree(local_buf);
> +
> +	if (len == hdr_len)
> +		dev_warn(&dev->dev, "nvmecss == 0 or vss = 0\n");
> +
> +	/* Step 2: Allocate full buffer */
> +	migf->total_length = len;
> +	migf->vf_data = kvzalloc(migf->total_length, GFP_KERNEL);
> +	if (!migf->vf_data)
> +		return -ENOMEM;
> +
> +	memset(&c, 0, sizeof(c));
> +	nvmevf_init_get_ctrl_state_cmd(&c, id, csvi, csuuidi, csuidxp, len);
> +	ret = nvme_submit_vf_cmd(dev, &c, NULL, migf->vf_data, len);
> +	if (ret)
> +		goto free_big;
> +
> +	/* Populate state struct */
> +	hdr = (struct nvme_lm_ctrl_state *)migf->vf_data;
> +	state->raw = hdr;
> +	state->total_len = len;
> +	state->version = hdr->version;
> +	state->csattr = hdr->csattr;
> +	state->nvmecss = hdr->nvmecss;
> +	state->vss = hdr->vss;
> +	state->nvme_cs = hdr->data;
> +	state->vsd = hdr->data + le16_to_cpu(hdr->nvmecss) * 4;
> +
> +	return ret;

I think we know "ret == 0" here; if so, just use "return 0".

> +free_big:
> +	kvfree(migf->vf_data);
> +	return ret;
> +}

> +static void nvme_lm_debug_ctrl_state(struct nvme_lm_ctrl_state_info *state)
> +{
> +	const struct nvme_lm_nvme_cs_v0_state *cs;
> +	const struct nvme_lm_iosq_state *iosq;
> +	const struct nvme_lm_iocq_state *iocq;
> +	u16 niosq, niocq;
> +	int i;
> +
> +	pr_info("Controller State:\n");
> +	pr_info("Version    : 0x%04x\n", le16_to_cpu(state->version));
> +	pr_info("CSATTR     : 0x%02x\n", state->csattr);
> +	pr_info("NVMECS Len : %u bytes\n", le16_to_cpu(state->nvmecss) * 4);
> +	pr_info("VSD Len    : %u bytes\n", le16_to_cpu(state->vss) * 4);

I always wish messages like this were associated with a device, e.g.,
some flavor of dev_info().

> +
> +	cs = nvme_lm_parse_nvme_cs_v0_state(state->nvme_cs,
> +					    le16_to_cpu(state->nvmecss) * 4,
> +					    &niosq, &niocq);
> +	if (!cs) {
> +		pr_warn("Failed to parse NVMECS\n");
> +		return;
> +	}
> +
> +	iosq = cs->iosq;
> +	iocq = (const void *)(iosq + niosq);
> +
> +	for (i = 0; i < niosq; i++) {
> +		pr_info("IOSQ[%d]: SIZE=%u QID=%u CQID=%u ATTR=0x%x Head=%u "
> +			"Tail=%u\n", i,
> +			le16_to_cpu(iosq[i].qsize),
> +			le16_to_cpu(iosq[i].qid),
> +			le16_to_cpu(iosq[i].cqid),
> +			le16_to_cpu(iosq[i].attr),
> +			le16_to_cpu(iosq[i].head),
> +			le16_to_cpu(iosq[i].tail));
> +	}
> +
> +	for (i = 0; i < niocq; i++) {
> +		pr_info("IOCQ[%d]: SIZE=%u QID=%u ATTR=%u Head=%u Tail=%u\n", i,
> +			le16_to_cpu(iocq[i].qsize),
> +			le16_to_cpu(iocq[i].qid),
> +			le16_to_cpu(iocq[i].attr),
> +			le16_to_cpu(iocq[i].head),
> +			le16_to_cpu(iocq[i].tail));
> +	}
> +}

> +static int nvmevf_cmd_get_ctrl_state(struct nvmevf_pci_core_device *nvmevf_dev,
> +				     struct nvmevf_migration_file *migf)
> +{
> +	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +	struct nvme_lm_ctrl_state_fmts_info fmt = { };
> +	struct nvme_lm_ctrl_state_info state = { };
> +	__u8 csvi = NVME_LM_CSVI;
> +	__u8 csuuidi = NVME_LM_CSUUIDI;
> +	__u8 csuidxp = 0;
> +	int ret;
> +
> +	/*
> +	 * Read the supported controller state formats to make sure they match
> +	 * csvi value specified in vfio-nvme without this check we'd not know
> +	 * which controller state format we are working with.

Run-on sentence.

> +	 */
> +	ret = nvme_lm_get_ctrl_state_fmt(dev, true, &fmt);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * Number of versions NV cannot be less than controller state version
> +	 * index we are using, it's an error. Please note that CSVI is
> +	 * a configurable value user can define this macro at the compile time
> +	 * to select the required NVMe controller state version index from
> +	 * Supported Controller State Formats Data Structure.

Capitalize "CSVI" (or "csvi") consistently.  It's lower-case above and
upper-case here.  Also another run-on sentence.

> +	 */
> +	if (fmt.nv < csvi) {
> +		dev_warn(&dev->dev,
> +			 "required ctrl state format not found\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = nvmevf_get_ctrl_state(dev, csvi, csuuidi, csuidxp, migf, &state);
> +	if (ret)
> +		goto out;
> +
> +	if (le16_to_cpu(state.version) != csvi) {
> +		dev_warn(&dev->dev,
> +			 "Unexpected controller state version: 0x%04x\n",
> +			 le16_to_cpu(state.version));
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Now that we have received the controller state decode the state
> +	 * properly for debugging purpose
> +	 */
> +
> +	nvme_lm_debug_ctrl_state(&state);
> +
> +	dev_info(&dev->dev, "Get controller state successful\n");
> +
> +out:
> +	kfree(fmt.ctrl_state_raw_buf);
> +	return ret;
> +}
> +
> +static int nvmevf_cmd_set_ctrl_state(struct nvmevf_pci_core_device *nvmevf_dev,
> +				     struct nvmevf_migration_file *migf)
> +{
> +	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +	struct nvme_command c = { };
> +	u32 sel = NVME_LM_SEND_SEL_SET_CTRL_STATE;
> +	/* assume that data buffer is big enough to hold state in one cmd */
> +	u32 mos = NVME_LM_SEQIND_ONLY;
> +	u32 cntlid = nvme_get_ctrl_id(dev);
> +	u32 csvi = NVME_LM_CSVI;
> +	u32 csuuidi = NVME_LM_CSUUIDI;
> +	int ret;
> +
> +	c.lm.send.opcode = nvme_admin_lm_send;
> +	/* mos = SEQIND = 0b11 (Only) in MOS bits [17:16] */
> +	c.lm.send.cdw10 = cpu_to_le32((mos << 16) | sel);
> +	/*
> +	 * Assume that we are only working on NVMe state and not on vendor
> +	 * specific state.
> +	 */
> +	c.lm.send.cdw11 = cpu_to_le32(csuuidi << 24 | csvi << 16 | cntlid);
> +
> +	/*
> +	 * Associates the Migration Send command with the correct migration
> +	 * session UUID currently we set to 0. For now asssume that initiaor
> +	 * and target has agreed on the UUIDX 0 for all the live migration
> +	 * sessions.

s/Associates/Associate/
s/asssume/assume/
s/initiaor/initiator/
s/UUIDX/UUID/ ?

> +	 */
> +	c.lm.send.cdw14 = cpu_to_le32(0);
> +	/*
> +	 * Assume that data buffer is big enoough to hold the state,
> +	 * 0-based dword count.

s/enoough/enough/

> +	 */
> +	c.lm.send.cdw15 = cpu_to_le32(bytes_to_nvme_numd(migf->total_length));
> +
> +	ret = nvme_submit_vf_cmd(dev, &c, NULL, migf->vf_data,
> +				 migf->total_length);
> +	if (ret) {
> +		dev_warn(&dev->dev,
> +			 "Load the device states failed (ret=0x%x)\n", ret);
> +		return ret;
> +	}
> +
> +	dev_info(&dev->dev, "Set controller state successful\n");
> +	return 0;
> +}

> +static struct nvmevf_migration_file *
> +nvmevf_pci_resume_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
> +{
> +	struct nvmevf_migration_file *migf;
> +	int ret;
> +
> +	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
> +	if (!migf)
> +		return ERR_PTR(-ENOMEM);
> +
> +	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_resume_fops, migf,
> +					O_WRONLY);
> +	if (IS_ERR(migf->filp)) {
> +		int err = PTR_ERR(migf->filp);
> +
> +		kfree(migf);
> +		return ERR_PTR(err);
> +	}
> +	stream_open(migf->filp->f_inode, migf->filp);
> +	mutex_init(&migf->lock);
> +
> +	/* Allocate buffer to load the device states and max states is 256K */

Why repeat the "#define MAX_MIGRATION_SIZE (256 * 1024)" value in the
comment?

> +	migf->vf_data = kvzalloc(MAX_MIGRATION_SIZE, GFP_KERNEL);
> +	if (!migf->vf_data) {
> +		ret = -ENOMEM;
> +		goto out_free;
> +	}
> +
> +	return migf;
> +
> +out_free:
> +	fput(migf->filp);
> +	return ERR_PTR(ret);
> +}

> +static struct file *
> +nvmevf_pci_set_device_state(struct vfio_device *vdev,
> +			    enum vfio_device_mig_state new_state)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev = container_of(vdev,
> +			struct nvmevf_pci_core_device, core_device.vdev);
> +	enum vfio_device_mig_state next_state;
> +	struct file *res = NULL;
> +	int ret;
> +
> +	mutex_lock(&nvmevf_dev->state_mutex);
> +	while (new_state != nvmevf_dev->mig_state) {
> +		ret = vfio_mig_get_next_state(vdev, nvmevf_dev->mig_state,
> +					      new_state, &next_state);
> +		if (ret) {
> +			res = ERR_PTR(-EINVAL);
> +			break;
> +		}
> +
> +		res = nvmevf_pci_step_device_state_locked(nvmevf_dev,
> +							  next_state);
> +		if (IS_ERR(res))
> +			break;
> +		nvmevf_dev->mig_state = next_state;
> +		if (WARN_ON(res && new_state != nvmevf_dev->mig_state)) {
> +			fput(res);
> +			res = ERR_PTR(-EINVAL);
> +			break;
> +		}
> +	}
> +	nvmevf_state_mutex_unlock(nvmevf_dev);
> +	return res;
> +}
> +
> +static int nvmevf_pci_get_device_state(struct vfio_device *vdev,
> +				       enum vfio_device_mig_state *curr_state)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
> +			vdev, struct nvmevf_pci_core_device, core_device.vdev);
> +
> +	mutex_lock(&nvmevf_dev->state_mutex);
> +	*curr_state = nvmevf_dev->mig_state;
> +	nvmevf_state_mutex_unlock(nvmevf_dev);

I see that you added nvmevf_state_mutex_unlock() because you want to
deal with deferred reset things at the same time as the
mutex_unlock(), but it does make it harder to read when we have to
know and verify that

  mutex_lock(&nvmevf_dev->state_mutex)
  nvmevf_state_mutex_unlock(nvmevf_dev)

is actually a lock/unlock of the same mutex.  Maybe there's no better
way and this is the best we can do.

> +	return 0;
> +}

> +static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +	struct pci_dev *pdev;
> +	int vf_id;
> +	int ret = -1;
> +
> +	nvmevf_dev = container_of(core_vdev, struct nvmevf_pci_core_device,
> +				  core_device.vdev);
> +	pdev = to_pci_dev(core_vdev->dev);
> +
> +	if (!pdev->is_virtfn)
> +		return ret;

"ret" seems pointless here since it's always -1.  You could just
"return -1" directly, although it looks like this is supposed to be a
negative errno, e.g., -EINVAL or whatever.

> +
> +	/*
> +	 * Get the identify controller data structure to check the live
> +	 * migration support.
> +	 */
> +	if (!nvmevf_migration_supp(pdev))
> +		return ret;
> +
> +	nvmevf_dev->migrate_cap = 1;
> +
> +	vf_id = pci_iov_vf_id(pdev);
> +	if (vf_id < 0)
> +		return ret;
> +	nvmevf_dev->vf_id = vf_id + 1;
> +	core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
> +
> +	mutex_init(&nvmevf_dev->state_mutex);
> +	spin_lock_init(&nvmevf_dev->reset_lock);
> +	core_vdev->mig_ops = &nvmevf_pci_mig_ops;
> +
> +	return vfio_pci_core_init_dev(core_vdev);
> +}

> +++ b/drivers/vfio/pci/nvme/nvme.h
> @@ -33,4 +33,7 @@ struct nvmevf_pci_core_device {
>  	struct nvmevf_migration_file *saving_migf;
>  };
>  
> +extern int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
> +			size_t *result, void *buffer, unsigned int bufflen);
> +extern u16 nvme_get_ctrl_id(struct pci_dev *dev);

Typical style in drivers/vfio/ omits the "extern" on function
declarations.

>  #endif /* NVME_VFIO_PCI_H */
> -- 
> 2.40.0
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure
  2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
  2025-08-04 15:20   ` Shameerali Kolothum Thodi
@ 2025-08-04 16:43   ` Bjorn Helgaas
  2025-08-04 17:15   ` Alex Williamson
  2 siblings, 0 replies; 9+ messages in thread
From: Bjorn Helgaas @ 2025-08-04 16:43 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: kbusch, axboe, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy,
	linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Lei Rao

On Sat, Aug 02, 2025 at 07:47:02PM -0700, Chaitanya Kulkarni wrote:
> Add foundational infrastructure for vfio-nvme, enabling support for live
> migration of NVMe devices via the VFIO framework. The following
> components are included:

> +static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
> +
> +	if (!nvmevf_dev->migrate_cap)
> +		return;
> +
> +	/*
> +	 * As the higher VFIO layers are holding locks across reset and using
> +	 * those same locks with the mm_lock we need to prevent ABBA deadlock
> +	 * with the state_mutex and mm_lock.

Add blank line between paragraphs.

> +	 * In case the state_mutex was taken already we defer the cleanup work
> +	 * to the unlock flow of the other running context.
> +	 */
> +	spin_lock(&nvmevf_dev->reset_lock);
> +	nvmevf_dev->deferred_reset = true;
> +	if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
> +		spin_unlock(&nvmevf_dev->reset_lock);
> +		return;
> +	}
> +	spin_unlock(&nvmevf_dev->reset_lock);
> +	nvmevf_state_mutex_unlock(nvmevf_dev);
> +}


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure
  2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
  2025-08-04 15:20   ` Shameerali Kolothum Thodi
  2025-08-04 16:43   ` Bjorn Helgaas
@ 2025-08-04 17:15   ` Alex Williamson
  2 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2025-08-04 17:15 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: kbusch, axboe, hch, sagi, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, mgurtovoy,
	linux-nvme, kvm, Konrad.wilk, martin.petersen, jmeneghi, arnd,
	schnelle, bhelgaas, joao.m.martins, Lei Rao

On Sat, 2 Aug 2025 19:47:02 -0700
Chaitanya Kulkarni <kch@nvidia.com> wrote:

> Add foundational infrastructure for vfio-nvme, enabling support for live
> migration of NVMe devices via the VFIO framework. The following
> components are included:
> 
> - Core driver skeleton for vfio-nvme support under drivers/vfio/pci/nvme/
> - Definitions of basic data structures used in live migration
>   (e.g., nvmevf_pci_core_device and nvmevf_migration_file)
> - Implementation of helper routines for managing migration file state
> - Integration of PCI driver callbacks and error handling logic
> - Registration with vfio-pci-core through nvmevf_pci_ops
> - Initial support for VFIO migration states and device open/close flows
> 
> Subsequent patches will build upon this base to implement actual live
> migration commands and complete the vfio device state handling logic.
> 
> Signed-off-by: Lei Rao <lei.rao@intel.com>
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  drivers/vfio/pci/Kconfig       |   2 +
>  drivers/vfio/pci/Makefile      |   2 +
>  drivers/vfio/pci/nvme/Kconfig  |  10 ++
>  drivers/vfio/pci/nvme/Makefile |   3 +
>  drivers/vfio/pci/nvme/nvme.c   | 196 +++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/nvme/nvme.h   |  36 ++++++
>  6 files changed, 249 insertions(+)
>  create mode 100644 drivers/vfio/pci/nvme/Kconfig
>  create mode 100644 drivers/vfio/pci/nvme/Makefile
>  create mode 100644 drivers/vfio/pci/nvme/nvme.c
>  create mode 100644 drivers/vfio/pci/nvme/nvme.h
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 2b0172f54665..8f94429e7adc 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -67,4 +67,6 @@ source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
>  
>  source "drivers/vfio/pci/qat/Kconfig"
>  
> +source "drivers/vfio/pci/nvme/Kconfig"
> +
>  endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index cf00c0a7e55c..be8c4b5ee0ba 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -10,6 +10,8 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>  
>  obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>  
> +obj-$(CONFIG_NVME_VFIO_PCI) += nvme/
> +
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>  
>  obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
> new file mode 100644
> index 000000000000..12e0eaba0de1
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NVME_VFIO_PCI
> +	tristate "VFIO support for NVMe PCI devices"
> +	depends on NVME_CORE
> +	depends on VFIO_PCI_CORE
> +	help
> +	  This provides migration support for NVMe devices using the
> +	  VFIO framework.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
> new file mode 100644
> index 000000000000..2f4a0ad3d9cf
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
> +nvme-vfio-pci-y := nvme.o
> diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
> new file mode 100644
> index 000000000000..08bee3274207
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/nvme.c
> @@ -0,0 +1,196 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
> + * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/types.h>
> +#include <linux/vfio.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/kernel.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "nvme.h"
> +
> +static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
> +{
> +	mutex_lock(&migf->lock);
> +
> +	/* release the device states buffer */
> +	kvfree(migf->vf_data);
> +	migf->vf_data = NULL;
> +	migf->disabled = true;
> +	migf->total_length = 0;
> +	migf->filp->f_pos = 0;
> +	mutex_unlock(&migf->lock);
> +}
> +
> +static void nvmevf_disable_fds(struct nvmevf_pci_core_device *nvmevf_dev)
> +{
> +	if (nvmevf_dev->resuming_migf) {
> +		nvmevf_disable_fd(nvmevf_dev->resuming_migf);
> +		fput(nvmevf_dev->resuming_migf->filp);
> +		nvmevf_dev->resuming_migf = NULL;
> +	}
> +
> +	if (nvmevf_dev->saving_migf) {
> +		nvmevf_disable_fd(nvmevf_dev->saving_migf);
> +		fput(nvmevf_dev->saving_migf->filp);
> +		nvmevf_dev->saving_migf = NULL;
> +	}
> +}
> +
> +static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device *nvmevf_dev)
> +{
> +	lockdep_assert_held(&nvmevf_dev->state_mutex);
> +again:
> +	spin_lock(&nvmevf_dev->reset_lock);
> +	if (nvmevf_dev->deferred_reset) {
> +		nvmevf_dev->deferred_reset = false;
> +		spin_unlock(&nvmevf_dev->reset_lock);
> +		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
> +		nvmevf_disable_fds(nvmevf_dev);
> +		goto again;
> +	}
> +	mutex_unlock(&nvmevf_dev->state_mutex);
> +	spin_unlock(&nvmevf_dev->reset_lock);
> +}
> +
> +static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> +	return container_of(core_device, struct nvmevf_pci_core_device,
> +			    core_device);
> +}
> +
> +static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +	struct vfio_pci_core_device *vdev;
> +	int ret;
> +
> +	nvmevf_dev = container_of(core_vdev, struct nvmevf_pci_core_device,
> +			core_device.vdev);
> +	vdev = &nvmevf_dev->core_device;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	if (nvmevf_dev->migrate_cap)
> +		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
> +	vfio_pci_core_finish_enable(vdev);
> +	return 0;
> +}
> +
> +static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +
> +	nvmevf_dev = container_of(core_vdev, struct nvmevf_pci_core_device,
> +			core_device.vdev);
> +
> +	if (nvmevf_dev->migrate_cap) {
> +		mutex_lock(&nvmevf_dev->state_mutex);
> +		nvmevf_disable_fds(nvmevf_dev);
> +		nvmevf_state_mutex_unlock(nvmevf_dev);
> +	}
> +
> +	vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static const struct vfio_device_ops nvmevf_pci_ops = {
> +	.name = "nvme-vfio-pci",
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = nvmevf_pci_open_device,
> +	.close_device = nvmevf_pci_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +};
> +
> +static int nvmevf_pci_probe(struct pci_dev *pdev,
> +			    const struct pci_device_id *id)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev;
> +	int ret;
> +
> +	nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device, core_device.vdev,
> +				       &pdev->dev, &nvmevf_pci_ops);
> +	if (IS_ERR(nvmevf_dev))
> +		return PTR_ERR(nvmevf_dev);
> +
> +	dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
> +	ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
> +	if (ret)
> +		goto out_put_dev;
> +
> +	return 0;
> +
> +out_put_dev:
> +	vfio_put_device(&nvmevf_dev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void nvmevf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
> +
> +	vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
> +	vfio_put_device(&nvmevf_dev->core_device.vdev);
> +}
> +
> +static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
> +{
> +	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
> +
> +	if (!nvmevf_dev->migrate_cap)
> +		return;
> +
> +	/*
> +	 * As the higher VFIO layers are holding locks across reset and using
> +	 * those same locks with the mm_lock we need to prevent ABBA deadlock
> +	 * with the state_mutex and mm_lock.
> +	 * In case the state_mutex was taken already we defer the cleanup work
> +	 * to the unlock flow of the other running context.
> +	 */
> +	spin_lock(&nvmevf_dev->reset_lock);
> +	nvmevf_dev->deferred_reset = true;
> +	if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
> +		spin_unlock(&nvmevf_dev->reset_lock);
> +		return;
> +	}
> +	spin_unlock(&nvmevf_dev->reset_lock);
> +	nvmevf_state_mutex_unlock(nvmevf_dev);
> +}
> +
> +static const struct pci_error_handlers nvmevf_err_handlers = {
> +	.reset_done = nvmevf_pci_aer_reset_done,
> +	.error_detected = vfio_pci_core_aer_err_detected,
> +};
> +
> +static struct pci_driver nvmevf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.probe = nvmevf_pci_probe,
> +	.remove = nvmevf_pci_remove,
> +	.err_handler = &nvmevf_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(nvmevf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Chaitanya Kulkarni <kch@nvidia.com>");
> +MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live migration support for NVMe");

Without a MODULE_DEVICE_TABLE, what devices are ever going to use this
driver?  Userspace needs to be given a clue when to use this driver vs
vfio-pci.  We also don't have a fallback mechanism to try a driver
until it fails, so this driver likely needs to take over defacto
support for all NVMe devices from vfio-pci, rather that later rejecting
those that don't support migration as patch 4/ implements in the .init
callback.  Thanks,

Alex

> diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
> new file mode 100644
> index 000000000000..ee602254679e
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/nvme.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
> + * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#ifndef NVME_VFIO_PCI_H
> +#define NVME_VFIO_PCI_H
> +
> +#include <linux/kernel.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/nvme.h>
> +
> +struct nvmevf_migration_file {
> +	struct file *filp;
> +	struct mutex lock;
> +	bool disabled;
> +	u8 *vf_data;
> +	size_t total_length;
> +};
> +
> +struct nvmevf_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	int vf_id;
> +	u8 migrate_cap:1;
> +	u8 deferred_reset:1;
> +	/* protect migration state */
> +	struct mutex state_mutex;
> +	enum vfio_device_mig_state mig_state;
> +	/* protect the reset_done flow */
> +	spinlock_t reset_lock;
> +	struct nvmevf_migration_file *resuming_migf;
> +	struct nvmevf_migration_file *saving_migf;
> +};
> +
> +#endif /* NVME_VFIO_PCI_H */



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-08-04 17:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-03  2:47 [RFC PATCH 0/4] Add new VFIO PCI driver for NVMe devices Chaitanya Kulkarni
2025-08-03  2:47 ` [RFC PATCH 1/4] vfio-nvme: add vfio-nvme lm driver infrastructure Chaitanya Kulkarni
2025-08-04 15:20   ` Shameerali Kolothum Thodi
2025-08-04 16:43   ` Bjorn Helgaas
2025-08-04 17:15   ` Alex Williamson
2025-08-03  2:47 ` [RFC PATCH 2/4] nvme: add live migration TP 4159 definitions Chaitanya Kulkarni
2025-08-03  2:47 ` [RFC PATCH 3/4] nvme: export helpers to implement vfio-nvme lm Chaitanya Kulkarni
2025-08-03  2:47 ` [RFC PATCH 4/4] vfio-nvme: implement TP4159 live migration cmds Chaitanya Kulkarni
2025-08-04 16:41   ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).