public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support
@ 2026-04-24 11:37 Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 1/5] nvme: Add CDQ data structures to nvme spec header Joel Granados
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-24 11:37 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Jason Gunthorpe
  Cc: linux-nvme, linux-kernel, Joel Granados

This RFC implements Controller Data Queue (CDQ) support in the NVMe
driver, a variation of my original RFC sent last July [2]. It exposes an
ioctl interface for userspace to create, configure, and delete CDQs
backed by DMA-mapped user memory with eventfd notification. In this
version I explore how the CDQ protocol logic might live outside the
kernel; the ioctl serves as a testing tool but is not necessarily the
final interface.

This RFC exists within a broader goal: enabling NVMe namespace
migration. The timing feels right, as hardware with CDQ capability
exists, NVMe fully specifies the feature, and there is growing interest
in Live Migration, which by extension includes CDQ.

There is, however, no clear consensus on how NVMe Live Migration (LM)
should land in the Linux kernel. The 2022 discussion [1] explored a
VFIO-based approach but reached no conclusion, likely because the
specification was not yet mature.

To move CDQ forward, I would like to understand where the LM logic belongs. I
currently see two options, with no particular preference between them:

1. VFIO: Implement NVMe LM following the VFIO state machine, similar to what
   was proposed in 2022.
2. VM manager interface: Bypass VFIO and implement LM logic in the interface
   between the VM manager (e.g., QEMU) and the NVMe driver.

One aspect that has not received much attention in previous discussions
is namespace migration, as prior work focused on migrating state rather
than the actual data. Migrating potentially terabytes is IMO a distinct
use case worth considering. LSF/MM/BPF is in a week. I hope this series
encourages folks to revisit their positions, give their opinions, and
set the stage for face-to-face discussions.

Best

PS: I'm including the regular NVMe contacts and the folks that seemed to
have strong opinions in [2]. I always find it difficult to decide who to
include in these so let me know if you want to be removed in the future
or if I have missed someone.

[1] https://lore.kernel.org/20221206055816.292304-1-lei.rao@intel.com
[2] https://lore.kernel.org/20250714-jag-cdq-v1-0-01e027d256d5@kernel.org

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
Joel Granados (5):
      nvme: Add CDQ data structures to nvme spec header
      nvme: Add CDQ data structures to host driver
      nvme: Add NVME_AER_ONE_SHOT callback handler
      nvme: Implement CDQ core functionality
      nvme: Add CDQ ioctl interface

 drivers/nvme/host/core.c        | 312 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/ioctl.c       |  53 ++++++-
 drivers/nvme/host/nvme.h        |  20 +++
 include/linux/nvme.h            |  50 ++++++-
 include/uapi/linux/nvme_ioctl.h |  29 ++++
 5 files changed, 462 insertions(+), 2 deletions(-)
---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: 20260424-jag-cdq-lkml-cd9b7c79983d

Best regards,
-- 
Joel Granados <joel.granados@kernel.org>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH RFC 1/5] nvme: Add CDQ data structures to nvme spec header
  2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
@ 2026-04-24 11:37 ` Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 2/5] nvme: Add CDQ data structures to host driver Joel Granados
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-24 11:37 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Jason Gunthorpe
  Cc: linux-nvme, linux-kernel, Joel Granados

Add Controller Data Queue (CDQ) related data structures and definitions
to include/linux/nvme.h. These are just the data structures. No
functional implementation yet.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
 include/linux/nvme.h | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 655d194f8e722c3400ac00f76841e1af0281f38f..4a42f1614de962b9d448193193f68fe1968dfb6f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -1314,6 +1314,7 @@ enum nvme_admin_opcode {
 	nvme_admin_virtual_mgmt		= 0x1c,
 	nvme_admin_nvme_mi_send		= 0x1d,
 	nvme_admin_nvme_mi_recv		= 0x1e,
+	nvme_admin_cdq			= 0x45,
 	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
@@ -1352,7 +1353,8 @@ enum nvme_admin_opcode {
 		nvme_admin_opcode_name(nvme_admin_security_send),	\
 		nvme_admin_opcode_name(nvme_admin_security_recv),	\
 		nvme_admin_opcode_name(nvme_admin_sanitize_nvm),	\
-		nvme_admin_opcode_name(nvme_admin_get_lba_status))
+		nvme_admin_opcode_name(nvme_admin_get_lba_status),	\
+		nvme_admin_opcode_name(nvme_admin_cdq))
 
 enum {
 	NVME_QUEUE_PHYS_CONTIG	= (1 << 0),
@@ -1412,6 +1414,10 @@ enum {
 	NVME_FWACT_ACTV		= (2 << 3),
 };
 
+enum {
+	NVME_FEAT_CDQ_ID_MASK = GENMASK(15, 0),
+};
+
 struct nvme_supported_log {
 	__le32	lids[256];
 };
@@ -1590,6 +1596,42 @@ struct nvme_directive_cmd {
 	__u32			rsvd16[3];
 };
 
+enum {
+	NVME_CDQ_SEL_CREATE_CDQ = 0x0,
+	NVME_CDQ_SEL_DELETE_CDQ = 0x1
+};
+
+enum {
+	NVME_CDQ_CFG_PC_DISCONT = 0x0,
+	NVME_CDQ_CFG_PC_CONT = 0x1
+};
+
+union nvme_cdq_dw11 {
+	struct {
+		__le16	flags;
+		__le16	cqs;
+	};
+	struct {
+		__le16	cdqid;
+		__le16	rsvd;
+	};
+};
+
+struct nvme_cdq {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le32			rsvd8[2];
+	__u8			sel;
+	__u8			rsvd10;
+	__le16			mos;
+	union nvme_cdq_dw11	dw11;
+	__le32			cdqsize;
+	__u32			rsvd13[3];
+};
+
 /*
  * Fabrics subcommands.
  */
@@ -2000,6 +2042,7 @@ struct nvme_command {
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
 		struct nvme_io_mgmt_recv_cmd imr;
+		struct nvme_cdq cdq;
 	};
 };
 

-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 2/5] nvme: Add CDQ data structures to host driver
  2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 1/5] nvme: Add CDQ data structures to nvme spec header Joel Granados
@ 2026-04-24 11:37 ` Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 3/5] nvme: Add NVME_AER_ONE_SHOT callback handler Joel Granados
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-24 11:37 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Jason Gunthorpe
  Cc: linux-nvme, linux-kernel, Joel Granados

Add host-side Controller Data Queue (CDQ) data structures and function
declarations to drivers/nvme/host/nvme.h:
- Add cdqs xarray to nvme_ctrl for managing CDQ instances
- Add cdq_nvme_queue structure containing:
  - DMA mapping state
  - PRP list management
  - eventfd context for tail pointer event notifications

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
 drivers/nvme/host/nvme.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9971045dbc05e9bb9d7fa32ad540fd107d8c8b83..30d5052c7728c0d5c5e8772ff531bc672e96940f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -466,6 +466,7 @@ struct nvme_ctrl {
 	enum nvme_dctype dctype;
 
 	u16			awupf; /* 0's based value. */
+	struct xarray cdqs; /* Controller Data Queue */
 };
 
 static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
@@ -619,6 +620,20 @@ static inline unsigned long nvme_get_virt_boundary(struct nvme_ctrl *ctrl,
 	return NVME_CTRL_PAGE_SIZE - 1;
 }
 
+#define MAX_NR_CDQ_PRPS		20
+struct cdq_nvme_queue {
+	struct nvme_ctrl *ctrl;
+	__u32	size_nbyte;
+	u16 cdq_id;
+	struct eventfd_ctx *tpt_efd_ctx;
+	struct sg_table sgt;
+	struct page **pages;
+	unsigned long nr_pages;
+	void *prp_lists[MAX_NR_CDQ_PRPS];
+	dma_addr_t prp_lists_dma[MAX_NR_CDQ_PRPS];
+	u32 nr_prp_lists; /* number of PRP lists */
+};
+
 struct nvme_ctrl_ops {
 	const char *name;
 	struct module *module;

-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 3/5] nvme: Add NVME_AER_ONE_SHOT callback handler
  2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 1/5] nvme: Add CDQ data structures to nvme spec header Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 2/5] nvme: Add CDQ data structures to host driver Joel Granados
@ 2026-04-24 11:37 ` Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 4/5] nvme: Implement CDQ core functionality Joel Granados
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-24 11:37 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Jason Gunthorpe
  Cc: linux-nvme, linux-kernel, Joel Granados

Add support for handling NVME_AER_ONE_SHOT asynchronous event
notifications. One-shot events, unlike traditional AERs, are not
requeued and include additional event parameters in the upper 32 bits of
the result.

Add nvme_handle_aen_oneshot() stub to dispatch one-shot events based on
subtype. This will be extended by subsequent patches to handle specific
one-shot event types like CDQ tail pointer events.

Extend nvme_complete_async_event() to handle the NVME_AER_ONE_SHOT case,
extracting the event parameter from the 64-bit result and passing it to
the handler.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
 drivers/nvme/host/core.c | 20 ++++++++++++++++++++
 include/linux/nvme.h     |  5 +++++
 2 files changed, 25 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 766e9cc4ffca5e269c5e85dd4a0323dc99e5658c..be4807591d2d80d228c10e3c78b6b7dc371b3865 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4747,6 +4747,16 @@ static u32 nvme_aer_subtype(u32 result)
 	return (result & 0xff00) >> 8;
 }
 
+static bool nvme_handle_aen_oneshot(struct nvme_ctrl *ctrl, u32 result, u32 event_param)
+{
+	u32 aer_subtype = nvme_aer_subtype(result);
+
+	/* Will be extended to handle specific one-shot event types */
+	if (aer_subtype == NVME_AER_ONE_SHOT_CDQ_TAIL_PTR)
+		return -ENOSYS;
+	return false;
+}
+
 static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 {
 	u32 aer_notice_type = nvme_aer_subtype(result);
@@ -4795,6 +4805,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		volatile union nvme_result *res)
 {
 	u32 result = le32_to_cpu(res->u32);
+	u32 event_param = 0;
 	u32 aer_type = nvme_aer_type(result);
 	u32 aer_subtype = nvme_aer_subtype(result);
 	bool requeue = true;
@@ -4807,6 +4818,15 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 	case NVME_AER_NOTICE:
 		requeue = nvme_handle_aen_notice(ctrl, result);
 		break;
+	case NVME_AER_ONE_SHOT:
+		/*
+		 * One-shot events like CDQ tail pointer events.
+		 * Extract event parameter from upper 32 bits.
+		 */
+		event_param = le64_to_cpu(res->u64) >> 32;
+		requeue = nvme_handle_aen_oneshot(ctrl, result, event_param);
+		trace_nvme_async_event(ctrl, result);
+		break;
 	case NVME_AER_ERROR:
 		/*
 		 * For a persistent internal error, don't run async_event_work
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 4a42f1614de962b9d448193193f68fe1968dfb6f..6948e39842d48dc9974579ea1f9c4d5330238275 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -840,6 +840,7 @@ enum {
 	NVME_AER_ERROR			= 0,
 	NVME_AER_SMART			= 1,
 	NVME_AER_NOTICE			= 2,
+	NVME_AER_ONE_SHOT		= 4,
 	NVME_AER_CSS			= 6,
 	NVME_AER_VS			= 7,
 };
@@ -855,6 +856,10 @@ enum {
 	NVME_AER_NOTICE_DISC_CHANGED	= 0xf0,
 };
 
+enum {
+	NVME_AER_ONE_SHOT_CDQ_TAIL_PTR	= 0x00,
+};
+
 enum {
 	NVME_AEN_BIT_NS_ATTR		= 8,
 	NVME_AEN_BIT_FW_ACT		= 9,

-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 4/5] nvme: Implement CDQ core functionality
  2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
                   ` (2 preceding siblings ...)
  2026-04-24 11:37 ` [PATCH RFC 3/5] nvme: Add NVME_AER_ONE_SHOT callback handler Joel Granados
@ 2026-04-24 11:37 ` Joel Granados
  2026-04-24 11:37 ` [PATCH RFC 5/5] nvme: Add CDQ ioctl interface Joel Granados
  2026-04-24 13:06 ` [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Jason Gunthorpe
  5 siblings, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-24 11:37 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Jason Gunthorpe
  Cc: linux-nvme, linux-kernel, Joel Granados

Add Controller Data Queue (CDQ) support to the NVMe driver. CDQs enable
efficient device-to-host data transfer through dedicated queues with
DMA-mapped user memory.

Implement:
- DMA mapping with user page pinning (nvme_cdq_map_dma_usr)
- PRP list allocation for discontiguous memory (nvme_cdq_alloc_prp_list)
- CDQ create/delete commands (nvme_cdq_create, nvme_cdq_delete)
- Tail pointer event notification via eventfd (nvme_cdq_set_tpt)
- Async event notification handling for CDQ events
- xarray-based CDQ instance management
- Integration into controller init/free paths
- Function declarations for CDQ lifecycle management:
  nvme_cdq_create(), nvme_cdq_delete(), nvme_cdq_set_tpt()

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
 drivers/nvme/host/core.c | 306 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h |   5 +
 2 files changed, 304 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index be4807591d2d80d228c10e3c78b6b7dc371b3865..1bcdf328b0edf0ede7a799a965fd0b539404e3c6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -23,6 +23,7 @@
 #include <linux/pm_qos.h>
 #include <linux/ratelimit.h>
 #include <linux/unaligned.h>
+#include <linux/eventfd.h>
 
 #include "nvme.h"
 #include "fabrics.h"
@@ -1252,6 +1253,298 @@ u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode)
 }
 EXPORT_SYMBOL_NS_GPL(nvme_passthru_start, "NVME_TARGET_PASSTHRU");
 
+static void nvme_cdq_unmap_dma_usr(struct nvme_ctrl *ctrl, struct cdq_nvme_queue *cdq)
+{
+	dma_unmap_sgtable(ctrl->dev, &cdq->sgt, DMA_BIDIRECTIONAL, 0);
+	sg_free_table(&cdq->sgt);
+	unpin_user_pages(cdq->pages, cdq->nr_pages);
+	kvfree(cdq->pages);
+}
+
+/* nvme_cdq_map_dma_usr - make user virtual memory DMA-able */
+static int nvme_cdq_map_dma_usr(struct nvme_ctrl *ctrl, struct cdq_nvme_queue *cdq,
+				const u32 size_nbytes, unsigned long uaddr)
+{
+	int ret = -ENOMEM;
+	struct page **pages;
+
+	if (!PAGE_ALIGNED(uaddr))
+		return -EINVAL;
+
+	cdq->nr_pages = (size_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pages = kvmalloc_array(cdq->nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	mmap_read_lock(current->mm);
+	ret = pin_user_pages(uaddr, cdq->nr_pages, FOLL_WRITE | FOLL_LONGTERM, pages);
+	if (ret != cdq->nr_pages) {
+		if (ret > 0)
+			unpin_user_pages(pages, ret);
+		ret = -EFAULT;
+		mmap_read_unlock(current->mm);
+		goto free_pages;
+	}
+	mmap_read_unlock(current->mm);
+
+	ret = sg_alloc_table_from_pages_segment(&cdq->sgt, pages, cdq->nr_pages,
+						0, size_nbytes, PAGE_SIZE, GFP_KERNEL);
+	if (ret)
+		goto unpin_pages;
+
+	ret = dma_map_sgtable(ctrl->dev, &cdq->sgt, DMA_BIDIRECTIONAL, 0);
+	if (ret)
+		goto free_sgt;
+
+	cdq->pages = pages;
+
+	return 0;
+
+free_sgt:
+	sg_free_table(&cdq->sgt);
+
+unpin_pages:
+	unpin_user_pages(pages, cdq->nr_pages);
+
+free_pages:
+	kvfree(pages);
+
+	return ret;
+}
+
+static void nvme_cdq_free_prp_lists(struct nvme_ctrl *ctrl,
+				    struct cdq_nvme_queue *cdq)
+{
+	for (int i = 0; i < cdq->nr_prp_lists; ++i) {
+		if (cdq->prp_lists[i])
+			dma_free_coherent(ctrl->dev, PAGE_SIZE,
+					  cdq->prp_lists[i],
+					  cdq->prp_lists_dma[i]);
+	}
+}
+static int nvme_cdq_alloc_prp_single(struct nvme_ctrl *ctrl, struct cdq_nvme_queue *cdq)
+{
+	cdq->nr_prp_lists = 0;
+	memset(cdq->prp_lists, 0, sizeof(cdq->prp_lists));
+	cdq->prp_lists_dma[0] = sg_dma_address(cdq->sgt.sgl);
+	cdq->prp_lists_dma[1] = 0;
+	return 0;
+}
+
+static int nvme_cdq_alloc_prp_list(struct nvme_ctrl *ctrl, struct cdq_nvme_queue *cdq)
+{
+	unsigned int i, prp_list_idx = 0;
+	struct scatterlist *sg;
+	u64 *prp_list, *prp_list_tmp;
+	dma_addr_t prp_list_tmp_dma;
+
+	prp_list = dma_alloc_coherent(ctrl->dev, PAGE_SIZE, &prp_list_tmp_dma, GFP_KERNEL);
+	if (!prp_list)
+		return -ENOMEM;
+
+	cdq->prp_lists[0] = prp_list;
+	cdq->prp_lists_dma[0] = prp_list_tmp_dma;
+	cdq->nr_prp_lists = 1;
+
+	for_each_sgtable_dma_sg(&cdq->sgt, sg, i) {
+		if (prp_list_idx == PAGE_SIZE >> 3) {
+			if (cdq->nr_prp_lists == MAX_NR_CDQ_PRPS)
+				goto prps_err;
+
+			prp_list_tmp = dma_alloc_coherent(ctrl->dev,
+					PAGE_SIZE, &prp_list_tmp_dma, GFP_KERNEL);
+			if (!prp_list_tmp)
+				goto prps_err;
+
+			cdq->prp_lists_dma[cdq->nr_prp_lists] = prp_list_tmp_dma;
+			cdq->prp_lists[cdq->nr_prp_lists++] = prp_list_tmp;
+
+			prp_list = prp_list_tmp;
+			prp_list_idx = 0;
+		}
+		prp_list[prp_list_idx++] = sg_dma_address(sg);
+	}
+
+	return 0;
+
+prps_err:
+	nvme_cdq_free_prp_lists(ctrl, cdq);
+
+	return -EFAULT;
+}
+
+static int nvme_cdq_cmd_delete(struct nvme_ctrl *ctrl, const u16 cdq_id)
+{
+	struct nvme_command c = {
+		.cdq.opcode = nvme_admin_cdq,
+		.cdq.sel = NVME_CDQ_SEL_DELETE_CDQ,
+		.cdq.dw11.cdqid = cpu_to_le16(cdq_id)
+	};
+
+	return __nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, NULL, 0, NVME_QID_ANY, 0);
+}
+
+static int nvme_cdq_cmd_create(struct cdq_nvme_queue *cdq, const u16 mos, const u16 cqs,
+			       const u16 dw11_flags)
+{
+	int ret;
+	union nvme_result result = { };
+	struct nvme_command c = {
+		.cdq.opcode = nvme_admin_cdq,
+		.cdq.sel = NVME_CDQ_SEL_CREATE_CDQ,
+		.cdq.mos = cpu_to_le16(mos),
+		.cdq.dw11.cqs = cpu_to_le16(cqs),
+		.cdq.cdqsize = cpu_to_le32(cdq->size_nbyte >> 2), /* size is in dwords */
+		.cdq.dw11.flags = cpu_to_le16(dw11_flags),
+		.cdq.prp1 = cpu_to_le64(cdq->prp_lists_dma[0])
+	};
+
+	ret = __nvme_submit_sync_cmd(cdq->ctrl->admin_q, &c, &result, NULL, 0, NVME_QID_ANY, 0);
+	if (ret)
+		return ret;
+
+	cdq->cdq_id = le16_to_cpu(result.u16);
+
+	return ret;
+}
+
+int nvme_cdq_set_tpt(struct nvme_ctrl *ctrl, u16 cdq_id, const int tpt_fd)
+{
+
+	struct cdq_nvme_queue *cdq;
+
+	if (tpt_fd < 0)
+		return -EINVAL;
+
+	cdq = xa_load(&ctrl->cdqs, cdq_id);
+	if (!cdq)
+		return -EINVAL;
+
+	if (cdq->tpt_efd_ctx)
+		eventfd_ctx_put(cdq->tpt_efd_ctx);
+
+	cdq->tpt_efd_ctx = eventfd_ctx_fdget(tpt_fd);
+	if (IS_ERR(cdq->tpt_efd_ctx)) {
+		cdq->tpt_efd_ctx = NULL;
+		return -EINVAL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_cdq_set_tpt);
+
+int nvme_cdq_create(struct nvme_ctrl *ctrl, const u16 mos, const u16 cqs,
+		    unsigned long uaddr, const u32 size_nbyte, u16 *cdq_id)
+{
+	int ret;
+	u16 dw11_flags;
+	struct cdq_nvme_queue *cdq, *xa_ret;
+
+	cdq = kzalloc(sizeof(*cdq), GFP_KERNEL);
+	if (!cdq)
+		return -ENOMEM;
+
+	cdq->ctrl = ctrl;
+	cdq->size_nbyte = size_nbyte;
+
+	ret = nvme_cdq_map_dma_usr(ctrl, cdq, size_nbyte, uaddr);
+	if (ret)
+		goto err_cdq_free;
+
+	if (cdq->sgt.nents > 1) {
+		dw11_flags = NVME_CDQ_CFG_PC_DISCONT;
+		ret = nvme_cdq_alloc_prp_list(ctrl, cdq);
+	} else {
+		dw11_flags = NVME_CDQ_CFG_PC_CONT;
+		ret = nvme_cdq_alloc_prp_single(ctrl, cdq);
+	}
+
+	if (ret)
+		goto err_cdq_unmap_dma;
+
+	ret = nvme_cdq_cmd_create(cdq, mos, cqs, dw11_flags);
+	if (ret)
+		goto err_cdq_free_prp;
+
+	xa_ret = xa_store(&ctrl->cdqs, cdq->cdq_id, cdq, GFP_KERNEL);
+	if (xa_is_err(xa_ret)) {
+		ret = xa_err(xa_ret);
+		goto err_cmd_del;
+	}
+
+	*cdq_id = cdq->cdq_id;
+
+	return 0;
+
+err_cmd_del:
+	nvme_cdq_cmd_delete(ctrl, cdq->cdq_id);
+
+err_cdq_free_prp:
+	nvme_cdq_free_prp_lists(ctrl, cdq);
+
+err_cdq_unmap_dma:
+	nvme_cdq_unmap_dma_usr(ctrl, cdq);
+
+err_cdq_free:
+	kfree(cdq);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_cdq_create);
+
+int nvme_cdq_delete(struct nvme_ctrl *ctrl, const u16 cdq_id)
+{
+	int ret;
+	struct cdq_nvme_queue *cdq;
+
+	cdq = xa_load(&ctrl->cdqs, cdq_id);
+	if (!cdq)
+		return -EINVAL;
+
+	if (cdq->tpt_efd_ctx)
+		eventfd_ctx_put(cdq->tpt_efd_ctx);
+
+	ret = nvme_cdq_cmd_delete(ctrl, cdq_id);
+	if (ret)
+		return ret;
+
+	cdq = xa_erase(&ctrl->cdqs, cdq_id);
+	if (!cdq)
+		return -EINVAL;
+
+	nvme_cdq_free_prp_lists(ctrl, cdq);
+	nvme_cdq_unmap_dma_usr(ctrl, cdq);
+
+	kfree(cdq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_cdq_delete);
+
+static bool nvme_cdq_handle_aen_tpevent(struct nvme_ctrl *ctrl, u32 event_param)
+{
+	u16 cdq_id = event_param & NVME_FEAT_CDQ_ID_MASK;
+	struct cdq_nvme_queue *cdq;
+
+	cdq = xa_load(&ctrl->cdqs, cdq_id);
+	if (!cdq || xa_is_err(cdq) || !cdq->tpt_efd_ctx)
+		return false;
+
+	eventfd_signal(cdq->tpt_efd_ctx);
+
+	return true;
+}
+
+static void nvme_free_cdqs(struct nvme_ctrl *ctrl)
+{
+	struct cdq_nvme_queue *cdq;
+	unsigned long i;
+
+	xa_for_each(&ctrl->cdqs, i, cdq)
+		nvme_cdq_delete(ctrl, i);
+
+	xa_destroy(&ctrl->cdqs);
+}
+
 void nvme_passthru_end(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u32 effects,
 		       struct nvme_command *cmd, int status)
 {
@@ -4751,9 +5044,9 @@ static bool nvme_handle_aen_oneshot(struct nvme_ctrl *ctrl, u32 result, u32 even
 {
 	u32 aer_subtype = nvme_aer_subtype(result);
 
-	/* Will be extended to handle specific one-shot event types */
 	if (aer_subtype == NVME_AER_ONE_SHOT_CDQ_TAIL_PTR)
-		return -ENOSYS;
+		return nvme_cdq_handle_aen_tpevent(ctrl, event_param);
+
 	return false;
 }
 
@@ -4819,13 +5112,9 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		requeue = nvme_handle_aen_notice(ctrl, result);
 		break;
 	case NVME_AER_ONE_SHOT:
-		/*
-		 * One-shot events like CDQ tail pointer events.
-		 * Extract event parameter from upper 32 bits.
-		 */
+		/* One-shot events like CDQ tail pointer events. */
 		event_param = le64_to_cpu(res->u64) >> 32;
 		requeue = nvme_handle_aen_oneshot(ctrl, result, event_param);
-		trace_nvme_async_event(ctrl, result);
 		break;
 	case NVME_AER_ERROR:
 		/*
@@ -5064,6 +5353,7 @@ static void nvme_free_ctrl(struct device *dev)
 	if (!subsys || ctrl->instance != subsys->instance)
 		ida_free(&nvme_instance_ida, ctrl->instance);
 	nvme_free_cels(ctrl);
+	nvme_free_cdqs(ctrl);
 	nvme_mpath_uninit(ctrl);
 	cleanup_srcu_struct(&ctrl->srcu);
 	nvme_auth_stop(ctrl);
@@ -5110,6 +5400,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
 	xa_init(&ctrl->cels);
+	xa_init(&ctrl->cdqs);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
 	ctrl->quirks = quirks;
@@ -5375,6 +5666,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_rotational_media_log) != 512);
 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_cdq) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_feat_host_behavior) != 512);
 }
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 30d5052c7728c0d5c5e8772ff531bc672e96940f..2e8bbd3a7394303f6c803b0d5a457abb6d1b485d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1285,6 +1285,11 @@ u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode);
 int nvme_execute_rq(struct request *rq, bool at_head);
 void nvme_passthru_end(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u32 effects,
 		       struct nvme_command *cmd, int status);
+int nvme_cdq_create(struct nvme_ctrl *ctrl, const u16 mos, const u16 cqs,
+		    unsigned long uaddr, const u32 size_nbyte,
+		    u16 *cdq_id);
+int nvme_cdq_delete(struct nvme_ctrl *ctrl, const u16 cdq_id);
+int nvme_cdq_set_tpt(struct nvme_ctrl *ctrl, u16 cdq_id, const int tpt_fd);
 struct nvme_ctrl *nvme_ctrl_from_file(struct file *file);
 struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid);
 bool nvme_get_ns(struct nvme_ns *ns);

-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 5/5] nvme: Add CDQ ioctl interface
  2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
                   ` (3 preceding siblings ...)
  2026-04-24 11:37 ` [PATCH RFC 4/5] nvme: Implement CDQ core functionality Joel Granados
@ 2026-04-24 11:37 ` Joel Granados
  2026-04-24 13:06 ` [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Jason Gunthorpe
  5 siblings, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-24 11:37 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Jason Gunthorpe
  Cc: linux-nvme, linux-kernel, Joel Granados

Add userspace ioctl interface for CDQ (Controller Data Queue)
management. This allows userspace applications to create, configure,
and delete CDQs.

The interface includes:
- struct nvme_cdq_cmd for passing CDQ parameters
- NVME_IOCTL_CDQ ioctl command (0x50)
- Support for both controller and device ioctls

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
 drivers/nvme/host/ioctl.c       | 53 ++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/nvme_ioctl.h | 29 ++++++++++++++++++++++
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 8844bbd395159e544218db413e066cae6c24b2f1..98441439bd6be67e20717fed4ffc4d32c9b37725 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -8,6 +8,7 @@
 #include <linux/nvme_ioctl.h>
 #include <linux/io_uring/cmd.h>
 #include "nvme.h"
+#include "trace.h"
 
 enum {
 	NVME_IOCTL_VEC		= (1 << 0),
@@ -373,6 +374,51 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	return status;
 }
 
+static int nvme_user_cdq(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+		struct nvme_cdq_cmd __user *ucmd, unsigned int flags,
+		bool open_for_write)
+{
+	int status;
+	u16 cdq_id = 0;
+	struct nvme_cdq_cmd cmd = {};
+
+	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+		return -EFAULT;
+
+	/* 21 = 12 (PAGE_SHIFT) + 9 (ilog2(PAGE_SIZE / sizeof(u64))) */
+	if (cmd.size_nbyte > MAX_NR_CDQ_PRPS << 21)
+		return -EINVAL;
+
+	if (cmd.size_nbyte == 0) {
+		status = nvme_cdq_delete(ctrl, cmd.id);
+	} else {
+		status = nvme_cdq_create(ctrl, cmd.mos, cmd.cqs, cmd.entries,
+					 cmd.size_nbyte, &cdq_id);
+		if (status)
+			return status;
+
+		if (cmd.tpt_fd >= 0) {
+			status = nvme_cdq_set_tpt(ctrl, cdq_id, cmd.tpt_fd);
+			if (status)
+				goto del_cdq;
+		}
+
+		cmd.id = cdq_id;
+
+		if (copy_to_user(ucmd, &cmd, sizeof(cmd))) {
+			status = -EFAULT;
+			goto del_cdq;
+		}
+	}
+
+	return status;
+
+del_cdq:
+	/* ignore the return value; we are already on an error path */
+	nvme_cdq_delete(ctrl, cdq_id);
+	return status;
+}
+
 struct nvme_uring_data {
 	__u64	metadata;
 	__u64	addr;
@@ -540,7 +586,8 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 
 static bool is_ctrl_ioctl(unsigned int cmd)
 {
-	if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD)
+	if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD ||
+	    cmd == NVME_IOCTL_CDQ)
 		return true;
 	if (is_sed_ioctl(cmd))
 		return true;
@@ -555,6 +602,8 @@ static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd,
 		return nvme_user_cmd(ctrl, NULL, argp, 0, open_for_write);
 	case NVME_IOCTL_ADMIN64_CMD:
 		return nvme_user_cmd64(ctrl, NULL, argp, 0, open_for_write);
+	case NVME_IOCTL_CDQ:
+		return nvme_user_cdq(ctrl, NULL, argp, 0, open_for_write);
 	default:
 		return sed_ioctl(ctrl->opal_dev, cmd, argp);
 	}
@@ -873,6 +922,8 @@ long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 			return -EACCES;
 		nvme_queue_scan(ctrl);
 		return 0;
+	case NVME_IOCTL_CDQ:
+		return nvme_user_cdq(ctrl, NULL, argp, 0, open_for_write);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h
index 2f76cba6716637baff53e167a6141b68420d75c3..8d220c276d959dbd45f224d8ed300fe02dea2f20 100644
--- a/include/uapi/linux/nvme_ioctl.h
+++ b/include/uapi/linux/nvme_ioctl.h
@@ -92,6 +92,34 @@ struct nvme_uring_cmd {
 	__u32   rsvd2;
 };
 
+struct nvme_cdq_cmd {
+	/*
+	 * CDQ size in bytes:
+	 * (Number of entries) * (entry size in bytes)
+	 */
+	__u32	size_nbyte;
+
+	/*
+	 * User virtual address (e.g. returned by mmap()): start of the entries buffer.
+	 */
+	__u64	entries;
+
+	/*
+	 * Tail Pointer Trigger eventfd File Descriptor
+	 * Passed when creating the cdq.
+	 * -1 means that there is no FD and AER should not be forwarded.
+	 */
+	__s32	tpt_fd;
+
+	/*
+	 * Returned by controller; CDQ ID
+	 */
+	__u16	id;
+
+	__u16	cqs;
+	__u16	mos;
+};
+
 #define nvme_admin_cmd nvme_passthru_cmd
 
 #define NVME_IOCTL_ID		_IO('N', 0x40)
@@ -104,6 +132,7 @@ struct nvme_uring_cmd {
 #define NVME_IOCTL_ADMIN64_CMD	_IOWR('N', 0x47, struct nvme_passthru_cmd64)
 #define NVME_IOCTL_IO64_CMD	_IOWR('N', 0x48, struct nvme_passthru_cmd64)
 #define NVME_IOCTL_IO64_CMD_VEC	_IOWR('N', 0x49, struct nvme_passthru_cmd64)
+#define NVME_IOCTL_CDQ		_IOWR('N', 0x50, struct nvme_cdq_cmd)
 
 /* io_uring async commands: */
 #define NVME_URING_CMD_IO	_IOWR('N', 0x80, struct nvme_uring_cmd)

-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support
  2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
                   ` (4 preceding siblings ...)
  2026-04-24 11:37 ` [PATCH RFC 5/5] nvme: Add CDQ ioctl interface Joel Granados
@ 2026-04-24 13:06 ` Jason Gunthorpe
  2026-04-24 13:24   ` Christoph Hellwig
  2026-04-27 18:59   ` Joel Granados
  5 siblings, 2 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2026-04-24 13:06 UTC (permalink / raw)
  To: Joel Granados
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel

On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:

> There is however, no clear consensus on how NVMe Live Migration should
> land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> approach but reached no conclusion, likely because the specification was
> not yet mature.

Yes it was paused until the spec matures, then I expect it to go
forward.
 
> To move CDQ forward, I would like to understand where the LM logic belongs. I
> currently see two options (of which I have no particular preference):
> 
> 1. VFIO: Implement NVMe LM following the VFIO state machine, similar to what
>    was proposed in 2022.
> 2. VM manager interface: Bypass VFIO and implement LM logic in the interface
>    between the VM manager (e.g., QEMU) and the NVMe driver.

I imagined it to be split between VFIO for the PCI and volatile guest
state, and something else for the namespace setup and media migration.
Media migration is only needed for a local drive, so there are use
cases that don't need this component.

We have many drivers fitting into the VFIO scheme now and good VMM
coverage; I don't see a reason to throw it out.

> One aspect that has not received much attention in previous discussions
> is namespace migration as prior work focused on migrating state and not
> the actual data. Migrating potential terabytes is IMO a distinct use
> case worth considering. 

Yes

Though IDK if just plumbing the entire CDQ to userspace is the right
choice for NVMe. We don't know what future specs will add to CDQ; it
may not be appropriate to treat it so insecurely.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support
  2026-04-24 13:06 ` [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Jason Gunthorpe
@ 2026-04-24 13:24   ` Christoph Hellwig
  2026-04-27 18:24     ` Joel Granados
  2026-04-27 18:59   ` Joel Granados
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2026-04-24 13:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joel Granados, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, linux-nvme, linux-kernel

On Fri, Apr 24, 2026 at 10:06:15AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:
> 
> > There is however, no clear consensus on how NVMe Live Migration should
> > land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> > approach but reached no conclusion, likely because the specification was
> > not yet mature.
> 
> Yes it was paused until the spec matures, then I expect it to go
> forward.

And it will happen in the nvme software working group, which would be
up and running if Samsung hadn't done everything in its power to torpedo
it.  Because of that I do not expect Samsung to have any major impact on
how this will be implemented in Linux.

Note that we also can't discuss any of this at LSF/MM in public, so
Joel's side-channel loading of it onto the schedule should be removed as well.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support
  2026-04-24 13:24   ` Christoph Hellwig
@ 2026-04-27 18:24     ` Joel Granados
  0 siblings, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-27 18:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Keith Busch, Jens Axboe, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1471 bytes --]

On Fri, Apr 24, 2026 at 03:24:23PM +0200, Christoph Hellwig wrote:
> On Fri, Apr 24, 2026 at 10:06:15AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:
> > 
> > > There is however, no clear consensus on how NVMe Live Migration should
> > > land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> > > approach but reached no conclusion, likely because the specification was
> > > not yet mature.
> > 
> > Yes it was paused until the spec matures, then I expect it to go
> > forward.
> 
> And it will happen in the nvme software working group, which would be
> up and running if Samsung hadn't done everything in its power to torpedo
There is nothing that indicates to me that Samsung "torpedoed" the
creation of the nvme SW working group.

> it.  Because of that I do not expect Samsung to have any major impact on
> how this will be implemented in Linux.
> 
> Note that we also can't discuss any of this at LSF/MM in public, so
I see no reason not to have the discussion at LSF/MM. It is the perfect
venue to unpack the (potential) interaction between the vfio and nvme
drivers. Regardless of what is currently brewing in NVMe, there is no
reason why the fundamental architecture cannot be discussed. Of course,
we need to be careful of what is mentioned in public, but I see that as
a detail that does not prevent us from having the conversation.

Best

-- 

Joel Granados

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support
  2026-04-24 13:06 ` [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Jason Gunthorpe
  2026-04-24 13:24   ` Christoph Hellwig
@ 2026-04-27 18:59   ` Joel Granados
  1 sibling, 0 replies; 10+ messages in thread
From: Joel Granados @ 2026-04-27 18:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

On Fri, Apr 24, 2026 at 10:06:15AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:
> 
> > There is however, no clear consensus on how NVMe Live Migration should
> > land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> > approach but reached no conclusion, likely because the specification was
> > not yet mature.
> 
> Yes it was paused until the spec matures, then I expect it to go
> forward.
>  
> > To move CDQ forward, I would like to understand where the LM logic belongs. I
> > currently see two options (of which I have no particular preference):
> > 
> > 1. VFIO: Implement NVMe LM following the VFIO state machine, similar to what
> >    was proposed in 2022.
> > 2. VM manager interface: Bypass VFIO and implement LM logic in the interface
> >    between the VM manager (e.g., QEMU) and the NVMe driver.
> 
> I imagined it to be split between VFIO for the pci and volatile guest
> state and something else for the namespace setup and media migration.
That is an option. If we end up with an approach that supports namespace
migration, I'm happy :)

> Media migration is only needed for local drives, so there are use cases
> that don't need this component.
Indeed, and there should be an approach that supports those use cases as
well.

> 
> We have many drivers fitting into the VFIO scheme now and good VMM
> coverage, I don't see a reason to throw it out.
And this is why I included it as one of the ways to implement it.

> 
> > One aspect that has not received much attention in previous discussions
> > is namespace migration as prior work focused on migrating state and not
> > the actual data. Migrating potential terabytes is IMO a distinct use
> > case worth considering. 
> 
> Yes
> 
> Though IDK if just plumbing the entire CDQ to userspace is the right
> choice for NVMe.. We don't know what future specs will add to CDQ, it
> may not be appropriate to treat it so insecurely.
Agreed, this RFC is just one of many ways of doing it. My original one is
fully contained inside the NVMe driver. One thing that is clear from the
tests I have made is that it is easy to move the CDQ logic in and out of
user space (depending on what is needed).

Best

-- 

Joel Granados

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-04-27 18:59 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-24 11:37 [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Joel Granados
2026-04-24 11:37 ` [PATCH RFC 1/5] nvme: Add CDQ data structures to nvme spec header Joel Granados
2026-04-24 11:37 ` [PATCH RFC 2/5] nvme: Add CDQ data structures to host driver Joel Granados
2026-04-24 11:37 ` [PATCH RFC 3/5] nvme: Add NVME_AER_ONE_SHOT callback handler Joel Granados
2026-04-24 11:37 ` [PATCH RFC 4/5] nvme: Implement CDQ core functionality Joel Granados
2026-04-24 11:37 ` [PATCH RFC 5/5] nvme: Add CDQ ioctl interface Joel Granados
2026-04-24 13:06 ` [PATCH RFC 0/5] nvme: Controller Data Queue (CDQ) support Jason Gunthorpe
2026-04-24 13:24   ` Christoph Hellwig
2026-04-27 18:24     ` Joel Granados
2026-04-27 18:59   ` Joel Granados

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox