* [PATCH 0/4] [RFC] hw/nvme: add basic live migration support
@ 2026-02-17 15:25 Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 1/4] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-02-17 15:25 UTC (permalink / raw)
To: qemu-devel
Cc: Jesper Devantier, Peter Xu, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Dear friends,
This patchset adds basic live migration support for
the QEMU-emulated NVMe device.
The implementation has some limitations:
- only one NVMe namespace is supported
- SMART counters are not preserved
- CMB is not supported
- PMR is not supported
- SPDM is not supported
- SR-IOV is not supported
- AERs are not fully supported
I believe these are things I can support in later patchset versions or
separately on demand (when a use case appears). But I wanted to share this
first version as an RFC to get feedback in case I'm approaching it wrong.
Kind regards,
Alex
Alexander Mikhalitsyn (4):
hw/nvme: add migration blockers for non-supported cases
hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC
hw/nvme: add basic live migration support
hw/nvme/ctrl.c | 505 +++++++++++++++++++++++++++++++++---
hw/nvme/nvme.h | 5 +
hw/nvme/trace-events | 9 +
include/migration/vmstate.h | 21 ++
migration/vmstate-types.c | 88 +++++++
5 files changed, 598 insertions(+), 30 deletions(-)
--
2.47.3
* [PATCH 1/4] hw/nvme: add migration blockers for non-supported cases
2026-02-17 15:25 [PATCH 0/4] [RFC] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-02-17 15:25 ` Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 2/4] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-02-17 15:25 UTC (permalink / raw)
To: qemu-devel
Cc: Jesper Devantier, Peter Xu, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Let's block migration for cases we don't support:
- SR-IOV
- CMB
- PMR
- SPDM
No functional changes here, because NVMe migration is
not supported at all as of this commit.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
hw/nvme/ctrl.c | 35 +++++++++++++++++++++++++++++++++++
hw/nvme/nvme.h | 3 +++
2 files changed, 38 insertions(+)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index cc4593cd427..4694bdb4d02 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -207,6 +207,7 @@
#include "hw/pci/msix.h"
#include "hw/pci/pcie_sriov.h"
#include "system/spdm-socket.h"
+#include "migration/blocker.h"
#include "migration/vmstate.h"
#include "nvme.h"
@@ -8962,6 +8963,14 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
pcie_endpoint_cap_init(pci_dev, 0x80);
pcie_cap_flr_init(pci_dev);
if (n->params.sriov_max_vfs) {
+ if (n->migration_blocker == NULL) {
+ error_setg(&n->migration_blocker,
+ "Migration is disabled when SR-IOV capability is set");
+ if (migrate_add_blocker(&n->migration_blocker, errp) < 0) {
+ return false;
+ }
+ }
+
pcie_ari_init(pci_dev, 0x100);
}
@@ -9025,6 +9034,14 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
if (pci_dev->spdm_port) {
uint16_t doe_offset = PCI_CONFIG_SPACE_SIZE;
+ if (n->migration_blocker == NULL) {
+ error_setg(&n->migration_blocker,
+ "Migration is disabled when SPDM responder is used");
+ if (migrate_add_blocker(&n->migration_blocker, errp) < 0) {
+ return false;
+ }
+ }
+
switch (pci_dev->spdm_trans) {
case SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE:
if (n->params.sriov_max_vfs) {
@@ -9053,10 +9070,26 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
}
if (n->params.cmb_size_mb) {
+ if (n->migration_blocker == NULL) {
+ error_setg(&n->migration_blocker,
+ "Migration is disabled when CMB feature is used");
+ if (migrate_add_blocker(&n->migration_blocker, errp) < 0) {
+ return false;
+ }
+ }
+
nvme_init_cmb(n, pci_dev);
}
if (n->pmr.dev) {
+ if (n->migration_blocker == NULL) {
+ error_setg(&n->migration_blocker,
+ "Migration is disabled when PMR feature is used");
+ if (migrate_add_blocker(&n->migration_blocker, errp) < 0) {
+ return false;
+ }
+ }
+
if (!nvme_init_pmr(n, pci_dev, errp)) {
return false;
}
@@ -9365,6 +9398,8 @@ static void nvme_exit(PCIDevice *pci_dev)
}
memory_region_del_subregion(&n->bar0, &n->iomem);
+
+ migrate_del_blocker(&n->migration_blocker);
}
static const Property nvme_props[] = {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index d66f7dc82d5..457b6637249 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -666,6 +666,9 @@ typedef struct NvmeCtrl {
/* Socket mapping to SPDM over NVMe Security In/Out commands */
int spdm_socket;
+
+ /* Migration-related stuff */
+ Error *migration_blocker;
} NvmeCtrl;
typedef enum NvmeResetType {
--
2.47.3
* [PATCH 2/4] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
2026-02-17 15:25 [PATCH 0/4] [RFC] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 1/4] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
@ 2026-02-17 15:25 ` Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC Alexander Mikhalitsyn
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-02-17 15:25 UTC (permalink / raw)
To: qemu-devel
Cc: Jesper Devantier, Peter Xu, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
hw/nvme/ctrl.c | 57 +++++++++++++++++++++++++++++++-------------------
1 file changed, 36 insertions(+), 21 deletions(-)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 4694bdb4d02..89cc26d745b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4852,18 +4852,14 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
return NVME_SUCCESS;
}
-static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
- uint16_t sqid, uint16_t cqid, uint16_t size)
+static void __nvme_init_sq(NvmeSQueue *sq)
{
+ NvmeCtrl *n = sq->ctrl;
+ uint16_t sqid = sq->sqid;
+ uint16_t cqid = sq->cqid;
int i;
NvmeCQueue *cq;
- sq->ctrl = n;
- sq->dma_addr = dma_addr;
- sq->sqid = sqid;
- sq->size = size;
- sq->cqid = cqid;
- sq->head = sq->tail = 0;
sq->io_req = g_new0(NvmeRequest, sq->size);
QTAILQ_INIT(&sq->req_list);
@@ -4893,6 +4889,18 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
n->sq[sqid] = sq;
}
+static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
+ uint16_t sqid, uint16_t cqid, uint16_t size)
+{
+ sq->ctrl = n;
+ sq->dma_addr = dma_addr;
+ sq->sqid = sqid;
+ sq->size = size;
+ sq->cqid = cqid;
+ sq->head = sq->tail = 0;
+ __nvme_init_sq(sq);
+}
+
static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
{
NvmeSQueue *sq;
@@ -5553,24 +5561,16 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
return NVME_SUCCESS;
}
-static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
- uint16_t cqid, uint16_t vector, uint16_t size,
- uint16_t irq_enabled)
+static void __nvme_init_cq(NvmeCQueue *cq)
{
+ NvmeCtrl *n = cq->ctrl;
PCIDevice *pci = PCI_DEVICE(n);
+ uint16_t cqid = cq->cqid;
- if (msix_enabled(pci) && irq_enabled) {
- msix_vector_use(pci, vector);
+ if (msix_enabled(pci) && cq->irq_enabled) {
+ msix_vector_use(pci, cq->vector);
}
- cq->ctrl = n;
- cq->cqid = cqid;
- cq->size = size;
- cq->dma_addr = dma_addr;
- cq->phase = 1;
- cq->irq_enabled = irq_enabled;
- cq->vector = vector;
- cq->head = cq->tail = 0;
QTAILQ_INIT(&cq->req_list);
QTAILQ_INIT(&cq->sq_list);
if (n->dbbuf_enabled) {
@@ -5588,6 +5588,21 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
&DEVICE(cq->ctrl)->mem_reentrancy_guard);
}
+static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
+ uint16_t cqid, uint16_t vector, uint16_t size,
+ uint16_t irq_enabled)
+{
+ cq->ctrl = n;
+ cq->cqid = cqid;
+ cq->size = size;
+ cq->dma_addr = dma_addr;
+ cq->phase = 1;
+ cq->irq_enabled = irq_enabled;
+ cq->vector = vector;
+ cq->head = cq->tail = 0;
+ __nvme_init_cq(cq);
+}
+
static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
{
NvmeCQueue *cq;
--
2.47.3
* [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC
2026-02-17 15:25 [PATCH 0/4] [RFC] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 1/4] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 2/4] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
@ 2026-02-17 15:25 ` Alexander Mikhalitsyn
2026-03-02 21:47 ` Peter Xu
2026-02-17 15:25 ` [PATCH 4/4] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-02-27 9:59 ` [PATCH 0/4] [RFC] " Klaus Jensen
4 siblings, 1 reply; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-02-17 15:25 UTC (permalink / raw)
To: qemu-devel
Cc: Jesper Devantier, Peter Xu, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC, which
helps save and restore a dynamically allocated array of
pointers to structures.
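The wire format used by the new helper can be sketched standalone: each array slot is preceded by a presence byte, and NULL slots carry no payload, so the loader only allocates the slots that were populated on save. The following is a toy model of that scheme; ToyFile and ToyQueue are illustrative stand-ins, not the actual QEMUFile API or NVMe types:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a QEMUFile, backed by a flat buffer. */
typedef struct { uint8_t buf[256]; size_t pos; } ToyFile;

static void put_byte(ToyFile *f, uint8_t b) { f->buf[f->pos++] = b; }
static uint8_t get_byte(ToyFile *f) { return f->buf[f->pos++]; }

typedef struct { uint32_t head, tail; } ToyQueue;

/* Save: one presence byte per slot; struct payload only if non-NULL. */
static void save_ptr_array(ToyFile *f, ToyQueue **arr, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        if (!arr[i]) {
            put_byte(f, 0);
            continue;
        }
        put_byte(f, 1);
        memcpy(&f->buf[f->pos], arr[i], sizeof(ToyQueue));
        f->pos += sizeof(ToyQueue);
    }
}

/* Load: allocate only the slots that were present on save. */
static void load_ptr_array(ToyFile *f, ToyQueue **arr, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        if (!get_byte(f)) {
            arr[i] = NULL;
            continue;
        }
        arr[i] = calloc(1, sizeof(ToyQueue));
        memcpy(arr[i], &f->buf[f->pos], sizeof(ToyQueue));
        f->pos += sizeof(ToyQueue);
    }
}
```

In the real patch the payload is serialized field by field via the entry's vmsd rather than memcpy'd, but the presence-byte framing is the same.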
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
include/migration/vmstate.h | 21 +++++++++
migration/vmstate-types.c | 88 +++++++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+)
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 89f9f49d20a..bc6495a7f67 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -265,6 +265,7 @@ extern const VMStateInfo vmstate_info_bitmap;
extern const VMStateInfo vmstate_info_qtailq;
extern const VMStateInfo vmstate_info_gtree;
extern const VMStateInfo vmstate_info_qlist;
+extern const VMStateInfo vmstate_info_ptrs_array_entry;
#define type_check_2darray(t1,t2,n,m) ((t1(*)[n][m])0 - (t2*)0)
/*
@@ -537,6 +538,26 @@ extern const VMStateInfo vmstate_info_qlist;
.offset = vmstate_offset_array(_s, _f, _type*, _n), \
}
+/*
+ * For migrating a dynamically allocated uint32-indexed array of pointers
+ * to structures (NULL entries are allowed; memory is allocated
+ * automatically on load).
+ *
+ * _type: type of the structure pointed to
+ * _vmsd: VMSD for the structure
+ * .start: size of the pointed-to structure (used for allocation on load)
+ */
+#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
+ .name = (stringify(_field)), \
+ .version_id = (_version), \
+ .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
+ .info = &vmstate_info_ptrs_array_entry, \
+ .vmsd = &(_vmsd), \
+ .start = sizeof(_type), \
+ .size = sizeof(_type *), \
+ .flags = VMS_VARRAY_UINT32|VMS_POINTER, \
+ .offset = vmstate_offset_pointer(_state, _field, _type *), \
+}
+
#define VMSTATE_VARRAY_OF_POINTER_UINT32(_field, _state, _field_num, _version, _info, _type) { \
.name = (stringify(_field)), \
.version_id = (_version), \
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index 89cb2114721..3335377cd07 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -942,3 +942,91 @@ const VMStateInfo vmstate_info_qlist = {
.get = get_qlist,
.put = put_qlist,
};
+
+static int put_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
+ const VMStateField *field, JSONWriter *vmdesc)
+{
+ const VMStateDescription *vmsd = field->vmsd;
+ int ret;
+ Error *local_err = NULL;
+ void *pv;
+
+ /*
+ * (ppv) is an address of an i-th element of a dynamic array.
+ *
+ * (ppv) cannot be NULL unless there is a regression/bug in
+ * vmstate_save_state_v(), because it is the result of pointer arithmetic:
+ * first_elem + size * i.
+ */
+ if (ppv == NULL) {
+ error_report("vmstate: put_ptrs_array_entry must be called with ppv != NULL");
+ return -EINVAL;
+ }
+
+ /* get a pointer to a structure */
+ pv = *(void **)ppv;
+
+ if (pv == NULL) {
+ /* write a mark telling that there was a NULL pointer */
+ qemu_put_byte(f, false);
+ return 0;
+ }
+
+ /* the pointer is not NULL; dump the structure contents using the vmsd */
+ qemu_put_byte(f, true);
+ ret = vmstate_save_state(f, vmsd, pv, vmdesc, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int get_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
+ const VMStateField *field)
+{
+ int ret = 0;
+ Error *local_err = NULL;
+ const VMStateDescription *vmsd = field->vmsd;
+ /* size of the structure pointed to by the array elements */
+ size_t size = field->start;
+
+ if (ppv == NULL) {
+ error_report("vmstate: get_ptrs_array_entry must be called with ppv != NULL");
+ return -EINVAL;
+ }
+
+ /*
+ * We start from a clean array: all elements must be NULL unless
+ * something we haven't prepared for has changed in vmstate_save_state_v().
+ * Let's check for this just in case.
+ */
+ if (*(void **)ppv != NULL) {
+ error_report("vmstate: get_ptrs_array_entry must be called with *ppv == NULL");
+ return -EINVAL;
+ }
+
+ if (qemu_get_byte(f)) {
+ void *pv;
+
+ /* allocate memory for structure */
+ pv = g_malloc0(size);
+ ret = vmstate_load_state(f, vmsd, pv, vmsd->version_id, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+ g_free(pv);
+ return ret;
+ }
+
+ *(void **)ppv = pv;
+ }
+
+ return ret;
+}
+
+const VMStateInfo vmstate_info_ptrs_array_entry = {
+ .name = "ptrs_array_entry",
+ .get = get_ptrs_array_entry,
+ .put = put_ptrs_array_entry,
+};
--
2.47.3
* [PATCH 4/4] hw/nvme: add basic live migration support
2026-02-17 15:25 [PATCH 0/4] [RFC] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (2 preceding siblings ...)
2026-02-17 15:25 ` [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC Alexander Mikhalitsyn
@ 2026-02-17 15:25 ` Alexander Mikhalitsyn
2026-02-27 9:59 ` [PATCH 0/4] [RFC] " Klaus Jensen
4 siblings, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-02-17 15:25 UTC (permalink / raw)
To: qemu-devel
Cc: Jesper Devantier, Peter Xu, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
The implementation has some limitations:
- only one NVMe namespace is supported
- SMART counters are not preserved
- CMB is not supported
- PMR is not supported
- SPDM is not supported
- SR-IOV is not supported
- AERs are not fully supported
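One subtle part of the restore path below is the admin queue: the generic loader allocates every queue on the heap, but the admin queue must end up in the controller-embedded struct, so its state is copied over and the temporary allocation is freed while the queue-array slot stays valid. A standalone sketch of that pattern (ToyQueue/ToyCtrl are toy types, not the QEMU ones):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy queue and controller; field names are illustrative only. */
typedef struct { uint16_t qid; uint32_t head, tail; } ToyQueue;

typedef struct {
    ToyQueue admin_cq;   /* admin queue embedded in the controller */
    ToyQueue *cq[4];     /* cq[0] must point at admin_cq after restore */
} ToyCtrl;

/*
 * Mirror of the restore pattern: if the loaded queue is the admin
 * queue (qid 0), copy its state into the embedded struct, repoint
 * the array slot, and free the loader's heap allocation.
 */
static void toy_restore_cq(ToyCtrl *n, int i)
{
    ToyQueue *loaded = n->cq[i];

    if (loaded->qid == 0) {
        n->admin_cq = *loaded;    /* copy state into embedded struct */
        n->cq[i] = &n->admin_cq;  /* keep cq[0] pointing at valid memory */
        free(loaded);             /* drop the temporary allocation */
    }
}
```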
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
hw/nvme/ctrl.c | 413 ++++++++++++++++++++++++++++++++++++++++++-
hw/nvme/nvme.h | 2 +
hw/nvme/trace-events | 9 +
3 files changed, 415 insertions(+), 9 deletions(-)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 89cc26d745b..a92837844df 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -208,6 +208,7 @@
#include "hw/pci/pcie_sriov.h"
#include "system/spdm-socket.h"
#include "migration/blocker.h"
+#include "migration/qemu-file-types.h"
#include "migration/vmstate.h"
#include "nvme.h"
@@ -4901,6 +4902,25 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
__nvme_init_sq(sq);
}
+static void nvme_restore_sq(NvmeSQueue *sq_from)
+{
+ NvmeCtrl *n = sq_from->ctrl;
+ NvmeSQueue *sq = sq_from;
+
+ if (sq_from->sqid == 0) {
+ sq = &n->admin_sq;
+ sq->ctrl = n;
+ sq->dma_addr = sq_from->dma_addr;
+ sq->sqid = sq_from->sqid;
+ sq->size = sq_from->size;
+ sq->cqid = sq_from->cqid;
+ sq->head = sq_from->head;
+ sq->tail = sq_from->tail;
+ }
+
+ __nvme_init_sq(sq);
+}
+
static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
{
NvmeSQueue *sq;
@@ -5603,6 +5623,27 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
__nvme_init_cq(cq);
}
+static void nvme_restore_cq(NvmeCQueue *cq_from)
+{
+ NvmeCtrl *n = cq_from->ctrl;
+ NvmeCQueue *cq = cq_from;
+
+ if (cq_from->cqid == 0) {
+ cq = &n->admin_cq;
+ cq->ctrl = n;
+ cq->cqid = cq_from->cqid;
+ cq->size = cq_from->size;
+ cq->dma_addr = cq_from->dma_addr;
+ cq->phase = cq_from->phase;
+ cq->irq_enabled = cq_from->irq_enabled;
+ cq->vector = cq_from->vector;
+ cq->head = cq_from->head;
+ cq->tail = cq_from->tail;
+ }
+
+ __nvme_init_cq(cq);
+}
+
static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
{
NvmeCQueue *cq;
@@ -7291,7 +7332,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
n->dbbuf_eis = eis_addr;
n->dbbuf_enabled = true;
- for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+ for (i = 0; i < n->num_queues; i++) {
NvmeSQueue *sq = n->sq[i];
NvmeCQueue *cq = n->cq[i];
@@ -7731,7 +7772,7 @@ static int nvme_atomic_write_check(NvmeCtrl *n, NvmeCmd *cmd,
/*
* Walk the queues to see if there are any atomic conflicts.
*/
- for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
+ for (i = 1; i < n->num_queues; i++) {
NvmeSQueue *sq;
NvmeRequest *req;
NvmeRwCmd *req_rw;
@@ -7801,6 +7842,10 @@ static void nvme_process_sq(void *opaque)
NvmeCmd cmd;
NvmeRequest *req;
+ if (qatomic_read(&n->stop_processing_sq)) {
+ return;
+ }
+
if (n->dbbuf_enabled) {
nvme_update_sq_tail(sq);
}
@@ -7809,6 +7854,10 @@ static void nvme_process_sq(void *opaque)
NvmeAtomic *atomic;
bool cmd_is_atomic;
+ if (qatomic_read(&n->stop_processing_sq)) {
+ return;
+ }
+
addr = sq->dma_addr + (sq->head << NVME_SQES);
if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
trace_pci_nvme_err_addr_read(addr);
@@ -7917,12 +7966,12 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
nvme_ns_drain(ns);
}
- for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+ for (i = 0; i < n->num_queues; i++) {
if (n->sq[i] != NULL) {
nvme_free_sq(n->sq[i], n);
}
}
- for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+ for (i = 0; i < n->num_queues; i++) {
if (n->cq[i] != NULL) {
nvme_free_cq(n->cq[i], n);
}
@@ -8592,6 +8641,8 @@ static bool nvme_check_params(NvmeCtrl *n, Error **errp)
params->max_ioqpairs = params->num_queues - 1;
}
+ n->num_queues = params->max_ioqpairs + 1;
+
if (n->namespace.blkconf.blk && n->subsys) {
error_setg(errp, "subsystem support is unavailable with legacy "
"namespace ('drive' property)");
@@ -8746,8 +8797,8 @@ static void nvme_init_state(NvmeCtrl *n)
n->conf_msix_qsize = n->params.msix_qsize;
}
- n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
- n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
+ n->sq = g_new0(NvmeSQueue *, n->num_queues);
+ n->cq = g_new0(NvmeCQueue *, n->num_queues);
n->temperature = NVME_TEMPERATURE;
n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
@@ -8990,7 +9041,7 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
}
if (n->params.msix_exclusive_bar && !pci_is_vf(pci_dev)) {
- bar_size = nvme_mbar_size(n->params.max_ioqpairs + 1, 0, NULL, NULL);
+ bar_size = nvme_mbar_size(n->num_queues, 0, NULL, NULL);
memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
bar_size);
pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
@@ -9002,7 +9053,7 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
/* add one to max_ioqpairs to account for the admin queue pair */
if (!pci_is_vf(pci_dev)) {
nr_vectors = n->params.msix_qsize;
- bar_size = nvme_mbar_size(n->params.max_ioqpairs + 1,
+ bar_size = nvme_mbar_size(n->num_queues,
nr_vectors, &msix_table_offset,
&msix_pba_offset);
} else {
@@ -9552,9 +9603,353 @@ static uint32_t nvme_pci_read_config(PCIDevice *dev, uint32_t address, int len)
return pci_default_read_config(dev, address, len);
}
+static int nvme_ctrl_pre_save(void *opaque)
+{
+ NvmeCtrl *n = opaque;
+ int i;
+
+ trace_pci_nvme_pre_save_enter(n);
+
+ /* ask SQ processing code not to take new requests */
+ qatomic_set(&n->stop_processing_sq, true);
+
+ /* prevent new in-flight IO from appearing */
+ for (i = 0; i < n->num_queues; i++) {
+ NvmeSQueue *sq = n->sq[i];
+
+ if (!sq)
+ continue;
+
+ qemu_bh_cancel(sq->bh);
+ }
+
+ /* drain all IO */
+ for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ NvmeNamespace *ns;
+
+ ns = nvme_ns(n, i);
+ if (!ns) {
+ continue;
+ }
+
+ trace_pci_nvme_pre_save_ns_drain(n, i);
+ nvme_ns_drain(ns);
+ }
+
+ /*
+ * Now we should take care of AERs.
+ * This is a bit tricky, because an AER can be queued
+ * (added to n->aer_queue) when something happens,
+ * but then we need to wait until the guest submits
+ * NVME_ADM_CMD_ASYNC_EV_REQ; only after that can we
+ * remove it from aer_queue and produce a CQE for
+ * that NVME_ADM_CMD_ASYNC_EV_REQ command.
+ *
+ * If we are unlucky and the guest hasn't submitted
+ * NVME_ADM_CMD_ASYNC_EV_REQ recently, but there
+ * are a few events in aer_queue, then nvme_process_aers()
+ * is useless. But we should at least try.
+ */
+ nvme_process_aers(n);
+
+ /*
+ * Now we take the hard way:
+ * 1. Remove all queued events.
+ * 2. Abort all NVME_ADM_CMD_ASYNC_EV_REQ requests.
+ *
+ * TODO: dump/restore this stuff?
+ */
+ while (!QTAILQ_EMPTY(&n->aer_queue)) {
+ NvmeAsyncEvent *event = QTAILQ_FIRST(&n->aer_queue);
+ QTAILQ_REMOVE(&n->aer_queue, event, entry);
+ n->aer_queued--;
+ g_free(event);
+ }
+
+ for (i = 0; i < n->outstanding_aers; i++) {
+ NvmeRequest *re = n->aer_reqs[i];
+ memmove(n->aer_reqs + i, n->aer_reqs + i + 1,
+ (n->outstanding_aers - i - 1) * sizeof(NvmeRequest *));
+ n->outstanding_aers--;
+ re->status = NVME_CMD_ABORT_REQ;
+ nvme_enqueue_req_completion(&n->admin_cq, re);
+ }
+
+ /*
+ * nvme_enqueue_req_completion() will schedule BH for Admin CQ,
+ * but we are under BQL and this scheduled BH won't be executed.
+ * Let's manually call nvme_post_cqes().
+ */
+ qemu_bh_cancel(n->admin_cq.bh);
+ nvme_post_cqes(&n->admin_cq);
+
+ if (n->aer_queued != 0 || n->outstanding_aers != 0 || !QTAILQ_EMPTY(&n->aer_queue)) {
+ error_report("%s: AER migration is not supported: aer_queued=%d outstanding_aers=%d qtailq_empty=%d",
+ __func__, n->aer_queued, n->outstanding_aers, QTAILQ_EMPTY(&n->aer_queue));
+ goto err;
+ }
+
+ /* wait until all in-flight IO requests are processed */
+ for (i = 0; i < n->num_queues; i++) {
+ NvmeSQueue *sq = n->sq[i];
+
+ if (!sq)
+ continue;
+
+ trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
+
+ while (!QTAILQ_EMPTY(&sq->out_req_list)) {
+ cpu_relax();
+ }
+
+ trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
+ }
+
+ /* wait until all IO request completions are written to guest memory */
+ for (i = 0; i < n->num_queues; i++) {
+ NvmeCQueue *cq = n->cq[i];
+
+ if (!cq)
+ continue;
+
+ trace_pci_nvme_pre_save_cq_req_drain_wait(n, i, cq->head, cq->tail, cq->size);
+
+ while (!QTAILQ_EMPTY(&cq->req_list)) {
+ /*
+ * nvme_post_cqes() can't do its job of cleaning cq->req_list
+ * when the CQ is full, which means we would need to save what is
+ * in cq->req_list and restore it on VM resume.
+ *
+ * The good thing is that this can only happen when the guest
+ * hasn't processed the CQ for a long time while many SQEs
+ * are in flight.
+ *
+ * For now, let's just block migration in this rare case.
+ */
+ if (nvme_cq_full(cq)) {
+ error_report("%s: no free space in CQ (not supported)", __func__);
+ goto err;
+ }
+
+ cpu_relax();
+ }
+
+ trace_pci_nvme_pre_save_cq_req_drain_wait_end(n, i, cq->head, cq->tail);
+ }
+
+ for (uint32_t nsid = 0; nsid <= NVME_MAX_NAMESPACES; nsid++) {
+ NvmeNamespace *ns = n->namespaces[nsid];
+
+ if (!ns)
+ continue;
+
+ if (ns != &n->namespace) {
+ error_report("%s: only one NVMe namespace is supported for migration", __func__);
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ /* restore sq processing back to normal */
+ qatomic_set(&n->stop_processing_sq, false);
+ return -1;
+}
+
+static bool nvme_ctrl_post_load(void *opaque, int version_id, Error **errp)
+{
+ NvmeCtrl *n = opaque;
+ int i;
+
+ trace_pci_nvme_post_load_enter(n);
+
+ /* restore CQs first */
+ for (i = 0; i < n->num_queues; i++) {
+ NvmeCQueue *cq = n->cq[i];
+
+ if (!cq)
+ continue;
+
+ cq->ctrl = n;
+ nvme_restore_cq(cq);
+ trace_pci_nvme_post_load_restore_cq(n, i, cq->head, cq->tail, cq->size);
+
+ if (i == 0) {
+ /*
+ * Admin CQ lives in n->admin_cq, we don't need
+ * memory allocated for it in get_ptrs_array_entry() anymore.
+ *
+ * nvme_restore_cq() also takes care of:
+ * n->cq[0] = &n->admin_cq;
+ * so n->cq[0] remains valid.
+ */
+ g_free(cq);
+ }
+ }
+
+ for (i = 0; i < n->num_queues; i++) {
+ NvmeSQueue *sq = n->sq[i];
+
+ if (!sq)
+ continue;
+
+ sq->ctrl = n;
+ nvme_restore_sq(sq);
+ trace_pci_nvme_post_load_restore_sq(n, i, sq->head, sq->tail, sq->size);
+
+ if (i == 0) {
+ /* same as for CQ */
+ g_free(sq);
+ }
+ }
+
+ /*
+ * We need to attach namespaces (currently, only one namespace is
+ * supported for migration).
+ * This logic comes from nvme_start_ctrl().
+ */
+ for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i);
+
+ if (!ns || (!ns->params.shared && ns->ctrl != n)) {
+ continue;
+ }
+
+ if (nvme_csi_supported(n, ns->csi) && !ns->params.detached) {
+ if (!ns->attached || ns->params.shared) {
+ nvme_attach_ns(n, ns);
+ }
+ }
+ }
+
+ /* schedule SQ processing */
+ for (i = 0; i < n->num_queues; i++) {
+ NvmeSQueue *sq = n->sq[i];
+
+ if (!sq)
+ continue;
+
+ qemu_bh_schedule(sq->bh);
+ }
+
+ /*
+ * We ensured in pre_save() that cq->req_list was empty,
+ * so we don't need to schedule BH for CQ processing.
+ */
+
+ return true;
+}
+
+static const VMStateDescription nvme_vmstate_bar = {
+ .name = "nvme-bar",
+ .minimum_version_id = 1,
+ .version_id = 1,
+ .fields = (const VMStateField[]) {
+ VMSTATE_UINT64(cap, NvmeBar),
+ VMSTATE_UINT32(vs, NvmeBar),
+ VMSTATE_UINT32(intms, NvmeBar),
+ VMSTATE_UINT32(intmc, NvmeBar),
+ VMSTATE_UINT32(cc, NvmeBar),
+ VMSTATE_UINT8_ARRAY(rsvd24, NvmeBar, 4),
+ VMSTATE_UINT32(csts, NvmeBar),
+ VMSTATE_UINT32(nssr, NvmeBar),
+ VMSTATE_UINT32(aqa, NvmeBar),
+ VMSTATE_UINT64(asq, NvmeBar),
+ VMSTATE_UINT64(acq, NvmeBar),
+ VMSTATE_UINT32(cmbloc, NvmeBar),
+ VMSTATE_UINT32(cmbsz, NvmeBar),
+ VMSTATE_UINT32(bpinfo, NvmeBar),
+ VMSTATE_UINT32(bprsel, NvmeBar),
+ VMSTATE_UINT64(bpmbl, NvmeBar),
+ VMSTATE_UINT64(cmbmsc, NvmeBar),
+ VMSTATE_UINT32(cmbsts, NvmeBar),
+ VMSTATE_UINT8_ARRAY(rsvd92, NvmeBar, 3492),
+ VMSTATE_UINT32(pmrcap, NvmeBar),
+ VMSTATE_UINT32(pmrctl, NvmeBar),
+ VMSTATE_UINT32(pmrsts, NvmeBar),
+ VMSTATE_UINT32(pmrebs, NvmeBar),
+ VMSTATE_UINT32(pmrswtp, NvmeBar),
+ VMSTATE_UINT32(pmrmscl, NvmeBar),
+ VMSTATE_UINT32(pmrmscu, NvmeBar),
+ VMSTATE_UINT8_ARRAY(css, NvmeBar, 484),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static const VMStateDescription nvme_vmstate_cqueue = {
+ .name = "nvme-cq",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (const VMStateField[]) {
+ VMSTATE_UINT8(phase, NvmeCQueue),
+ VMSTATE_UINT16(cqid, NvmeCQueue),
+ VMSTATE_UINT16(irq_enabled, NvmeCQueue),
+ VMSTATE_UINT32(head, NvmeCQueue),
+ VMSTATE_UINT32(tail, NvmeCQueue),
+ VMSTATE_UINT32(vector, NvmeCQueue),
+ VMSTATE_UINT32(size, NvmeCQueue),
+ VMSTATE_UINT64(dma_addr, NvmeCQueue),
+ /* db_addr, ei_addr, etc will be recalculated */
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+static const VMStateDescription nvme_vmstate_squeue = {
+ .name = "nvme-sq",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (const VMStateField[]) {
+ VMSTATE_UINT16(sqid, NvmeSQueue),
+ VMSTATE_UINT16(cqid, NvmeSQueue),
+ VMSTATE_UINT32(head, NvmeSQueue),
+ VMSTATE_UINT32(tail, NvmeSQueue),
+ VMSTATE_UINT32(size, NvmeSQueue),
+ VMSTATE_UINT64(dma_addr, NvmeSQueue),
+ /* db_addr, ei_addr, etc will be recalculated */
+ VMSTATE_END_OF_LIST()
+ }
+};
+
static const VMStateDescription nvme_vmstate = {
.name = "nvme",
- .unmigratable = 1,
+ .minimum_version_id = 1,
+ .version_id = 1,
+ .pre_save = nvme_ctrl_pre_save,
+ .post_load_errp = nvme_ctrl_post_load,
+ .fields = (const VMStateField[]) {
+ VMSTATE_PCI_DEVICE(parent_obj, NvmeCtrl),
+ VMSTATE_MSIX(parent_obj, NvmeCtrl),
+ VMSTATE_STRUCT(bar, NvmeCtrl, 0, nvme_vmstate_bar, NvmeBar),
+
+ VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(
+ sq, NvmeCtrl, num_queues, 0, nvme_vmstate_squeue, NvmeSQueue),
+ VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(
+ cq, NvmeCtrl, num_queues, 0, nvme_vmstate_cqueue, NvmeCQueue),
+
+ VMSTATE_BOOL(qs_created, NvmeCtrl),
+ VMSTATE_UINT32(page_size, NvmeCtrl),
+ VMSTATE_UINT16(page_bits, NvmeCtrl),
+ VMSTATE_UINT16(max_prp_ents, NvmeCtrl),
+ VMSTATE_UINT32(max_q_ents, NvmeCtrl),
+ VMSTATE_UINT8(outstanding_aers, NvmeCtrl),
+ VMSTATE_UINT32(irq_status, NvmeCtrl),
+ VMSTATE_INT32(cq_pending, NvmeCtrl),
+
+ VMSTATE_UINT64(host_timestamp, NvmeCtrl),
+ VMSTATE_UINT64(timestamp_set_qemu_clock_ms, NvmeCtrl),
+ VMSTATE_UINT64(starttime_ms, NvmeCtrl),
+ VMSTATE_UINT16(temperature, NvmeCtrl),
+ VMSTATE_UINT8(smart_critical_warning, NvmeCtrl),
+
+ VMSTATE_UINT32(conf_msix_qsize, NvmeCtrl),
+ VMSTATE_UINT32(conf_ioqpairs, NvmeCtrl),
+ VMSTATE_UINT64(dbbuf_dbs, NvmeCtrl),
+ VMSTATE_UINT64(dbbuf_eis, NvmeCtrl),
+ VMSTATE_BOOL(dbbuf_enabled, NvmeCtrl),
+
+ VMSTATE_END_OF_LIST()
+ },
};
static void nvme_class_init(ObjectClass *oc, const void *data)
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 457b6637249..9c5f53c688c 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -638,6 +638,7 @@ typedef struct NvmeCtrl {
NvmeNamespace namespace;
NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1];
+ uint32_t num_queues;
NvmeSQueue **sq;
NvmeCQueue **cq;
NvmeSQueue admin_sq;
@@ -669,6 +670,7 @@ typedef struct NvmeCtrl {
/* Migration-related stuff */
Error *migration_blocker;
+ bool stop_processing_sq;
} NvmeCtrl;
typedef enum NvmeResetType {
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 6be0bfa1c1f..b9c5868a942 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -7,6 +7,15 @@ pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64
pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
+pci_nvme_pre_save_enter(void *n) "n=%p"
+pci_nvme_pre_save_ns_drain(void *n, int i) "n=%p i=%d"
+pci_nvme_pre_save_sq_out_req_drain_wait(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_pre_save_sq_out_req_drain_wait_end(void *n, int i, uint32_t head, uint32_t tail) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32""
+pci_nvme_pre_save_cq_req_drain_wait(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_pre_save_cq_req_drain_wait_end(void *n, int i, uint32_t head, uint32_t tail) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32""
+pci_nvme_post_load_enter(void *n) "n=%p"
+pci_nvme_post_load_restore_cq(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_post_load_restore_sq(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
pci_nvme_map_sgl(uint8_t typ, uint64_t len) "type 0x%"PRIx8" len %"PRIu64""
pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" nsid 0x%"PRIx32" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
--
2.47.3
* Re: [PATCH 0/4] [RFC] hw/nvme: add basic live migration support
2026-02-17 15:25 [PATCH 0/4] [RFC] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (3 preceding siblings ...)
2026-02-17 15:25 ` [PATCH 4/4] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-02-27 9:59 ` Klaus Jensen
2026-03-02 8:43 ` Alexander Mikhalitsyn
2026-03-13 14:24 ` Alexander Mikhalitsyn
4 siblings, 2 replies; 12+ messages in thread
From: Klaus Jensen @ 2026-02-27 9:59 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Jesper Devantier, Peter Xu, Fabiano Rosas, qemu-block,
Keith Busch, Alexander Mikhalitsyn
On Feb 17 16:25, Alexander Mikhalitsyn wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
>
> Dear friends,
>
> This patchset adds basic live migration support for
> QEMU emulated NVMe device.
>
> Implementation has some limitations:
> - only one NVMe namespace is supported
> - SMART counters are not preserved
> - CMB is not supported
> - PMR is not supported
> - SPDM is not supported
> - SR-IOV is not supported
> - AERs are not fully supported
>
> I believe this is something I can support in next patchset versions or
> separately on-demand (when usecase appears). But I wanted to share this
> first version as RFC to get some feedback on this in case if I'm approaching
> it wrong.
>
Hi Alex,
Nice work!
As you have already identified, there are a lot of features that are
non-trivial to implement migration for. I am completely in favor of only
supporting migration on a very limited feature set (i.e., don't worry
about CMB, PMR, SPDM, SR-IOV, ZNS/FDP and so on). Focus on the bare
mandatory requirements. It would be preferable if the "is migration
possible?" test is an allowlist instead of a denylist. That makes sure
we don't add a feature down the road and forget to add it to the
denylist. I'm not 100% sure how to go about that at this point.
AERs are something we need to deal with. We should not drop events. I
don't think I have a problem with aborting enqueued AERs, but not the
events.
Finally, this at a minimum needs some kind of simple smoke test to catch
regressions. Preferably as part of the QEMU test suite itself, but if
that is hard to achieve, then I may be OK with an out-of-tree test that
maintainers can use.
Cheers,
Klaus
* Re: [PATCH 0/4] [RFC] hw/nvme: add basic live migration support
2026-02-27 9:59 ` [PATCH 0/4] [RFC] " Klaus Jensen
@ 2026-03-02 8:43 ` Alexander Mikhalitsyn
2026-03-13 14:24 ` Alexander Mikhalitsyn
1 sibling, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-02 8:43 UTC (permalink / raw)
To: Klaus Jensen
Cc: qemu-devel, Jesper Devantier, Peter Xu, Fabiano Rosas, qemu-block,
Keith Busch, Alexander Mikhalitsyn, Stéphane Graber
Am Fr., 27. Feb. 2026 um 10:59 Uhr schrieb Klaus Jensen <its@irrelevant.dk>:
>
> On Feb 17 16:25, Alexander Mikhalitsyn wrote:
> > From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> >
> > Dear friends,
> >
> > This patchset adds basic live migration support for
> > QEMU emulated NVMe device.
> >
> > Implementation has some limitations:
> > - only one NVMe namespace is supported
> > - SMART counters are not preserved
> > - CMB is not supported
> > - PMR is not supported
> > - SPDM is not supported
> > - SR-IOV is not supported
> > - AERs are not fully supported
> >
> > I believe this is something I can support in next patchset versions or
> > separately on-demand (when usecase appears). But I wanted to share this
> > first version as RFC to get some feedback on this in case if I'm approaching
> > it wrong.
> >
>
> Hi Alex,
Hi Klaus,
>
> Nice work!
Thank you ;-)
>
> As you have already identified, there are a lot of features that are
> non-trivial to implement migration for. I am completely in favor of only
> supporting migration on a very limited feature set (i.e., don't worry
> about CMB, PMR, SPDM, SR-IOV, ZNS/FDP and so on). Focus on the bare
> mandatory requirements. It would be preferable if the "is migration
> possible?" test is an allowlist instead of a denylist. That makes sure
> we don't add a feature down the road and forget to add it to the
> denylist. I'm not 100% sure how to go about that at this point.
Yeah, I agree. I'll think about it.
>
> AERs are something we need to deal with. We should not drop events. I
> don't think I have a problem with aborting enqueued AERs, but not the
> events.
Yes, this is something I'm working on right now; this week I'll send a
v2 with graceful AER handling.
>
> Finally, this at a minimum needs somekind of simple smoke test to catch
> regressions. Preferably as part of the QEMU test suite itself, but if
> that is hard to achieve, then I may be ok with an out-of-tree test that
> maintainers can use.
Sure, my current tests were simple: I was running fio in a screen
session and then doing a manual suspend/resume:
time fio --name=nvme-verify \
--filename=/dev/nvme0n1 \
--size=5G \
--rw=randwrite \
--bs=4k \
--iodepth=16 \
--numjobs=1 \
--direct=0 \
--ioengine=io_uring \
--verify=crc32c \
--verify_fatal=1
Also, I had an idea to verify that a Windows VM also survives
suspend/resume.
I'll take a look at how the QEMU test suite is organized and try to
come up with something.
Thank you very much for looking into this stuff, Klaus!
Kind regards,
Alex
>
>
> Cheers,
> Klaus
* Re: [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC
2026-02-17 15:25 ` [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC Alexander Mikhalitsyn
@ 2026-03-02 21:47 ` Peter Xu
2026-03-04 9:43 ` Alexander Mikhalitsyn
0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2026-03-02 21:47 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Jesper Devantier, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn
On Tue, Feb 17, 2026 at 04:25:16PM +0100, Alexander Mikhalitsyn wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
>
> Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC, which
> helps to save/restore a dynamic array of pointers to
> structures.
Sorry for the late response.
>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> ---
> include/migration/vmstate.h | 21 +++++++++
> migration/vmstate-types.c | 88 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 109 insertions(+)
>
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 89f9f49d20a..bc6495a7f67 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -265,6 +265,7 @@ extern const VMStateInfo vmstate_info_bitmap;
> extern const VMStateInfo vmstate_info_qtailq;
> extern const VMStateInfo vmstate_info_gtree;
> extern const VMStateInfo vmstate_info_qlist;
> +extern const VMStateInfo vmstate_info_ptrs_array_entry;
>
> #define type_check_2darray(t1,t2,n,m) ((t1(*)[n][m])0 - (t2*)0)
> /*
> @@ -537,6 +538,26 @@ extern const VMStateInfo vmstate_info_qlist;
> .offset = vmstate_offset_array(_s, _f, _type*, _n), \
> }
>
> +/*
> + * For migrating a dynamically allocated uint32-indexed array
> + * of pointers to structures (with NULL entries and with auto memory allocation).
> + *
> + * _type: type of structure pointed to
> + * _vmsd: VMSD for structure
> + * start: size of structure pointed to (for auto memory allocation)
> + */
> +#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
> + .name = (stringify(_field)), \
> + .version_id = (_version), \
> + .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
> + .info = &vmstate_info_ptrs_array_entry, \
I wonder if we need this hard-coded new info structure for this, more
below.
> + .vmsd = &(_vmsd), \
> + .start = sizeof(_type), \
> + .size = sizeof(_type *), \
> + .flags = VMS_VARRAY_UINT32|VMS_POINTER, \
> + .offset = vmstate_offset_pointer(_state, _field, _type *), \
> +}
> +
> #define VMSTATE_VARRAY_OF_POINTER_UINT32(_field, _state, _field_num, _version, _info, _type) { \
> .name = (stringify(_field)), \
> .version_id = (_version), \
> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> index 89cb2114721..3335377cd07 100644
> --- a/migration/vmstate-types.c
> +++ b/migration/vmstate-types.c
> @@ -942,3 +942,91 @@ const VMStateInfo vmstate_info_qlist = {
> .get = get_qlist,
> .put = put_qlist,
> };
> +
> +static int put_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
> + const VMStateField *field, JSONWriter *vmdesc)
> +{
> + const VMStateDescription *vmsd = field->vmsd;
> + int ret;
> + Error *local_err = NULL;
> + void *pv;
> +
> + /*
> + * (ppv) is an address of an i-th element of a dynamic array.
> + *
> + * (ppv) can not be NULL unless we have some regression/bug in
> > + * vmstate_save_state_v(), because it is the result of pointer arithmetic like:
> + * first_elem + size * i.
> + */
> + if (ppv == NULL) {
> + error_report("vmstate: put_ptrs_array_entry must be called with ppv != NULL");
> + return -EINVAL;
> + }
> +
> + /* get a pointer to a structure */
> + pv = *(void **)ppv;
Here IIUC it's almost what VMS_ARRAY_OF_POINTER should do. Could we
declare the vmsd with VMS_ARRAY_OF_POINTER so as to save this open-coded
dereference of pointer?
> +
> + if (pv == NULL) {
> + /* write a mark telling that there was a NULL pointer */
> + qemu_put_byte(f, false);
> + return 0;
> + }
> +
> + /* if pointer is not NULL, dump the structure contents with help of vmsd */
> + qemu_put_byte(f, true);
> + ret = vmstate_save_state(f, vmsd, pv, vmdesc, &local_err);
This looks really like VMS_STRUCT, except that it also tries to detect if
the pointer is NULL, then save some dumps.
Is it feasible to make both queues to always be available so as to reuse
VMS_STRUCT? Or is it intentional here to avoid dumping queues where the
pointer is NULL?
Even if for the latter, I wonder if we can have better way to do it. For
example, if we can integrate this directly into vmstate_save/load_state()
to support some new flag, VMS_ARRAY_OF_POINTER_DYNAMIC, then we can declare
some more "official" way to say some pointer of an dynamic array is NULL.
Thanks,
> + if (ret) {
> + error_report_err(local_err);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int get_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
> + const VMStateField *field)
> +{
> + int ret = 0;
> + Error *local_err = NULL;
> + const VMStateDescription *vmsd = field->vmsd;
> + /* size of structure pointed to by elements of array */
> + size_t size = field->start;
> +
> + if (ppv == NULL) {
> + error_report("vmstate: get_ptrs_array_entry must be called with ppv != NULL");
> + return -EINVAL;
> + }
> +
> + /*
> + * We start from a clean array, all elements must be NULL, unless
> + * something we haven't prepared for has changed in vmstate_save_state_v().
> + * Let's check for this just in case.
> + */
> + if (*(void **)ppv != NULL) {
> + error_report("vmstate: get_ptrs_array_entry must be called with *ppv == NULL");
> + return -EINVAL;
> + }
> +
> + if (qemu_get_byte(f)) {
> + void *pv;
> +
> + /* allocate memory for structure */
> + pv = g_malloc0(size);
> + ret = vmstate_load_state(f, vmsd, pv, vmsd->version_id, &local_err);
> + if (ret) {
> + error_report_err(local_err);
> + g_free(pv);
> + return ret;
> + }
> +
> + *(void **)ppv = pv;
> + }
> +
> + return ret;
> +}
> +
> +const VMStateInfo vmstate_info_ptrs_array_entry = {
> + .name = "ptrs_array_entry",
> + .get = get_ptrs_array_entry,
> + .put = put_ptrs_array_entry,
> +};
> --
> 2.47.3
>
--
Peter Xu
* Re: [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC
2026-03-02 21:47 ` Peter Xu
@ 2026-03-04 9:43 ` Alexander Mikhalitsyn
2026-03-04 16:40 ` Peter Xu
0 siblings, 1 reply; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-04 9:43 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Jesper Devantier, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn,
Stéphane Graber
Am Mo., 2. März 2026 um 22:47 Uhr schrieb Peter Xu <peterx@redhat.com>:
>
> On Tue, Feb 17, 2026 at 04:25:16PM +0100, Alexander Mikhalitsyn wrote:
> > From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> >
> > Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC, which
> > helps to save/restore a dynamic array of pointers to
> > structures.
Dear Peter,
>
> Sorry for the late response.
No problem at all, thanks for reviewing ;)
>
> >
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> > ---
> > include/migration/vmstate.h | 21 +++++++++
> > migration/vmstate-types.c | 88 +++++++++++++++++++++++++++++++++++++
> > 2 files changed, 109 insertions(+)
> >
> > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > index 89f9f49d20a..bc6495a7f67 100644
> > --- a/include/migration/vmstate.h
> > +++ b/include/migration/vmstate.h
> > @@ -265,6 +265,7 @@ extern const VMStateInfo vmstate_info_bitmap;
> > extern const VMStateInfo vmstate_info_qtailq;
> > extern const VMStateInfo vmstate_info_gtree;
> > extern const VMStateInfo vmstate_info_qlist;
> > +extern const VMStateInfo vmstate_info_ptrs_array_entry;
> >
> > #define type_check_2darray(t1,t2,n,m) ((t1(*)[n][m])0 - (t2*)0)
> > /*
> > @@ -537,6 +538,26 @@ extern const VMStateInfo vmstate_info_qlist;
> > .offset = vmstate_offset_array(_s, _f, _type*, _n), \
> > }
> >
> > +/*
> > + * For migrating a dynamically allocated uint32-indexed array
> > + * of pointers to structures (with NULL entries and with auto memory allocation).
> > + *
> > + * _type: type of structure pointed to
> > + * _vmsd: VMSD for structure
> > + * start: size of structure pointed to (for auto memory allocation)
> > + */
> > +#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
> > + .name = (stringify(_field)), \
> > + .version_id = (_version), \
> > + .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
> > + .info = &vmstate_info_ptrs_array_entry, \
>
> I wonder if we need this hard-coded new info structure for this, more
> below.
>
> > + .vmsd = &(_vmsd), \
> > + .start = sizeof(_type), \
> > + .size = sizeof(_type *), \
> > + .flags = VMS_VARRAY_UINT32|VMS_POINTER, \
> > + .offset = vmstate_offset_pointer(_state, _field, _type *), \
> > +}
> > +
> > #define VMSTATE_VARRAY_OF_POINTER_UINT32(_field, _state, _field_num, _version, _info, _type) { \
> > .name = (stringify(_field)), \
> > .version_id = (_version), \
> > diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> > index 89cb2114721..3335377cd07 100644
> > --- a/migration/vmstate-types.c
> > +++ b/migration/vmstate-types.c
> > @@ -942,3 +942,91 @@ const VMStateInfo vmstate_info_qlist = {
> > .get = get_qlist,
> > .put = put_qlist,
> > };
> > +
> > +static int put_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
> > + const VMStateField *field, JSONWriter *vmdesc)
> > +{
> > + const VMStateDescription *vmsd = field->vmsd;
> > + int ret;
> > + Error *local_err = NULL;
> > + void *pv;
> > +
> > + /*
> > + * (ppv) is an address of an i-th element of a dynamic array.
> > + *
> > + * (ppv) can not be NULL unless we have some regression/bug in
> > > + * vmstate_save_state_v(), because it is the result of pointer arithmetic like:
> > + * first_elem + size * i.
> > + */
> > + if (ppv == NULL) {
> > + error_report("vmstate: put_ptrs_array_entry must be called with ppv != NULL");
> > + return -EINVAL;
> > + }
> > +
> > + /* get a pointer to a structure */
> > + pv = *(void **)ppv;
>
> Here IIUC it's almost what VMS_ARRAY_OF_POINTER should do. Could we
> declare the vmsd with VMS_ARRAY_OF_POINTER so as to save this open-coded
> dereference of pointer?
Yes, this is something I tried to do in the first place, but this
didn't work for my use-case.
Let me explain what I'm doing in detail. I have the following structure:
typedef struct NvmeCtrl {
...
uint32_t num_queues; // < MAX number of elements in sq
NvmeSQueue **sq;
...
} NvmeCtrl;
Suppose we have variable n with type NvmeCtrl *.
// somewhere early during initialization of QEMU device:
n->sq = g_new0(NvmeSQueue *, n->num_queues);
// then at runtime (depending on what the *guest* wants) we might
create some sq[i], e.g.:
NvmeSQueue *sq;
sq = g_malloc0(sizeof(*sq));
n->sq[3] = sq;
Please note that n->sq[0], n->sq[1], n->sq[2] can still be NULL. So
we allow holes in that array.
What I found is that if I raise VMS_ARRAY_OF_POINTER flag on VMStateField, then
migration code in vmstate_load_state() will expect to see so-called
"fake nullptr field" (vmsd_create_fake_nullptr_field() function call).
But this is a problem, because our array at the time of
vmstate_load_state() looks like (N == NULL):
0 1 2 3 4 5
N N N N N N
while after vmstate_load_state() I expect it to be:
0 1 2 3 4 5
N N N SQ N N
so, somebody should allocate memory for sq[3] and load data from image
into that memory.
Probably it makes sense to implement some combination of
VMS_ARRAY_OF_POINTER and VMS_ALLOC,
but I tried to be as minimally invasive as possible and decided to
introduce a new kind of VMStateField instead.
>
> > +
> > + if (pv == NULL) {
> > + /* write a mark telling that there was a NULL pointer */
> > + qemu_put_byte(f, false);
> > + return 0;
> > + }
> > +
> > + /* if pointer is not NULL, dump the structure contents with help of vmsd */
> > + qemu_put_byte(f, true);
> > + ret = vmstate_save_state(f, vmsd, pv, vmdesc, &local_err);
>
> This looks really like VMS_STRUCT, except that it also tries to detect if
> the pointer is NULL, then save some dumps.
Precisely, I'm mimicking behavior of VMS_STRUCT | VMS_ALLOC and
VMS_ARRAY_OF_POINTER here.
>
> Is it feasible to make both queues to always be available so as to reuse
> VMS_STRUCT? Or is it intentional here to avoid dumping queues where the
> pointer is NULL?
By "available" you mean to always allocate memory for all n->sq[i] elements?
>
> Even if for the latter, I wonder if we can have better way to do it. For
> example, if we can integrate this directly into vmstate_save/load_state()
> to support some new flag, VMS_ARRAY_OF_POINTER_DYNAMIC, then we can declare
> some more "official" way to say some pointer of an dynamic array is NULL.
>
Yeah, I thought about this too; I believe this is a great idea. But
taking into account that this is my first contribution to QEMU and
I'm still learning the code, I tried to be very conservative and not
change any generic code ;-)
I'm happy to do this work in v3 if you would like me to and can guide
me a bit (I have just sent a v2 series, but there were no changes in
the migration API part, only NVMe specifics).
Kind regards,
Alex
> Thanks,
>
> > + if (ret) {
> > + error_report_err(local_err);
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int get_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
> > + const VMStateField *field)
> > +{
> > + int ret = 0;
> > + Error *local_err = NULL;
> > + const VMStateDescription *vmsd = field->vmsd;
> > + /* size of structure pointed to by elements of array */
> > + size_t size = field->start;
> > +
> > + if (ppv == NULL) {
> > + error_report("vmstate: get_ptrs_array_entry must be called with ppv != NULL");
> > + return -EINVAL;
> > + }
> > +
> > + /*
> > + * We start from a clean array, all elements must be NULL, unless
> > + * something we haven't prepared for has changed in vmstate_save_state_v().
> > + * Let's check for this just in case.
> > + */
> > + if (*(void **)ppv != NULL) {
> > + error_report("vmstate: get_ptrs_array_entry must be called with *ppv == NULL");
> > + return -EINVAL;
> > + }
> > +
> > + if (qemu_get_byte(f)) {
> > + void *pv;
> > +
> > + /* allocate memory for structure */
> > + pv = g_malloc0(size);
> > + ret = vmstate_load_state(f, vmsd, pv, vmsd->version_id, &local_err);
> > + if (ret) {
> > + error_report_err(local_err);
> > + g_free(pv);
> > + return ret;
> > + }
> > +
> > + *(void **)ppv = pv;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +const VMStateInfo vmstate_info_ptrs_array_entry = {
> > + .name = "ptrs_array_entry",
> > + .get = get_ptrs_array_entry,
> > + .put = put_ptrs_array_entry,
> > +};
> > --
> > 2.47.3
> >
>
> --
> Peter Xu
>
* Re: [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC
2026-03-04 9:43 ` Alexander Mikhalitsyn
@ 2026-03-04 16:40 ` Peter Xu
2026-03-05 8:55 ` Alexander Mikhalitsyn
0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2026-03-04 16:40 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Jesper Devantier, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn,
Stéphane Graber
On Wed, Mar 04, 2026 at 10:43:53AM +0100, Alexander Mikhalitsyn wrote:
> Am Mo., 2. März 2026 um 22:47 Uhr schrieb Peter Xu <peterx@redhat.com>:
> >
> > On Tue, Feb 17, 2026 at 04:25:16PM +0100, Alexander Mikhalitsyn wrote:
> > > From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> > >
> > > Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC, which
> > > helps to save/restore a dynamic array of pointers to
> > > structures.
>
> Dear Peter,
>
> >
> > Sorry for the late response.
>
> No problem at all, thanks for reviewing ;)
>
> >
> > >
> > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> > > ---
> > > include/migration/vmstate.h | 21 +++++++++
> > > migration/vmstate-types.c | 88 +++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 109 insertions(+)
> > >
> > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > > index 89f9f49d20a..bc6495a7f67 100644
> > > --- a/include/migration/vmstate.h
> > > +++ b/include/migration/vmstate.h
> > > @@ -265,6 +265,7 @@ extern const VMStateInfo vmstate_info_bitmap;
> > > extern const VMStateInfo vmstate_info_qtailq;
> > > extern const VMStateInfo vmstate_info_gtree;
> > > extern const VMStateInfo vmstate_info_qlist;
> > > +extern const VMStateInfo vmstate_info_ptrs_array_entry;
> > >
> > > #define type_check_2darray(t1,t2,n,m) ((t1(*)[n][m])0 - (t2*)0)
> > > /*
> > > @@ -537,6 +538,26 @@ extern const VMStateInfo vmstate_info_qlist;
> > > .offset = vmstate_offset_array(_s, _f, _type*, _n), \
> > > }
> > >
> > > +/*
> > > + * For migrating a dynamically allocated uint32-indexed array
> > > + * of pointers to structures (with NULL entries and with auto memory allocation).
> > > + *
> > > + * _type: type of structure pointed to
> > > + * _vmsd: VMSD for structure
> > > + * start: size of structure pointed to (for auto memory allocation)
> > > + */
> > > +#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
> > > + .name = (stringify(_field)), \
> > > + .version_id = (_version), \
> > > + .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
> > > + .info = &vmstate_info_ptrs_array_entry, \
> >
> > I wonder if we need this hard-coded new info structure for this, more
> > below.
> >
> > > + .vmsd = &(_vmsd), \
> > > + .start = sizeof(_type), \
> > > + .size = sizeof(_type *), \
> > > + .flags = VMS_VARRAY_UINT32|VMS_POINTER, \
> > > + .offset = vmstate_offset_pointer(_state, _field, _type *), \
> > > +}
> > > +
> > > #define VMSTATE_VARRAY_OF_POINTER_UINT32(_field, _state, _field_num, _version, _info, _type) { \
> > > .name = (stringify(_field)), \
> > > .version_id = (_version), \
> > > diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> > > index 89cb2114721..3335377cd07 100644
> > > --- a/migration/vmstate-types.c
> > > +++ b/migration/vmstate-types.c
> > > @@ -942,3 +942,91 @@ const VMStateInfo vmstate_info_qlist = {
> > > .get = get_qlist,
> > > .put = put_qlist,
> > > };
> > > +
> > > +static int put_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
> > > + const VMStateField *field, JSONWriter *vmdesc)
> > > +{
> > > + const VMStateDescription *vmsd = field->vmsd;
> > > + int ret;
> > > + Error *local_err = NULL;
> > > + void *pv;
> > > +
> > > + /*
> > > + * (ppv) is an address of an i-th element of a dynamic array.
> > > + *
> > > + * (ppv) can not be NULL unless we have some regression/bug in
> > > > + * vmstate_save_state_v(), because it is the result of pointer arithmetic like:
> > > + * first_elem + size * i.
> > > + */
> > > + if (ppv == NULL) {
> > > + error_report("vmstate: put_ptrs_array_entry must be called with ppv != NULL");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + /* get a pointer to a structure */
> > > + pv = *(void **)ppv;
> >
> > Here IIUC it's almost what VMS_ARRAY_OF_POINTER should do. Could we
> > declare the vmsd with VMS_ARRAY_OF_POINTER so as to save this open-coded
> > dereference of pointer?
>
> Yes, this is something I tried to do in the first place, but this
> didn't work for my use-case.
>
> Let me explain what I'm doing in detail. I have the following structure:
>
> typedef struct NvmeCtrl {
> ...
> uint32_t num_queues; // < MAX number of elements in sq
> NvmeSQueue **sq;
> ...
> } NvmeCtrl;
>
> Suppose we have variable n with type NvmeCtrl *.
>
> // somewhere early during initialization of QEMU device:
> n->sq = g_new0(NvmeSQueue *, n->num_queues);
>
> // then in runtime (depending on what *guest* wants) we might create
> some sq[i], e.g:
> NvmeSQueue *sq;
> sq = g_malloc0(sizeof(*sq));
> n->sq[3] = sq;
>
> please, note that n->sq[0], n->sq[1], n->sq[2] can still be NULL. So,
> we allow holes in that array.
>
> What I found is that if I raise VMS_ARRAY_OF_POINTER flag on VMStateField, then
> migration code in vmstate_load_state() will expect to see so-called
> "fake nullptr field" (vmsd_create_fake_nullptr_field() function call).
>
> But this is a problem, because our array at the time of
> vmstate_load_state() looks like (N == NULL):
> 0 1 2 3 4 5
> N N N N N N
Yes, I actually just noticed that the old nullptr trick only works if the
dest QEMU already knows whether the pointer is NULL... that makes the
interface pretty much useless in most cases.. :(
So it's not the source telling the dest that "there's a null pointer in
the array"; it's just that both sides know that. Checking the original
patch, it indeed mentions it's used in ChannelSubSys.css in hw/s390x/css.c:
https://lore.kernel.org/all/20170222160119.52771-4-pasic@linux.vnet.ibm.com/
So I guess the qemu cmdline for that s390 guest makes sure the pointers
are pre-initialized already.. So yes, it won't work for you. You're
looking for a fully dynamic array of pointers.
>
> while after vmstate_load_state() I expect it to be:
> 0 1 2 3 4 5
> N N N SQ N N
>
> so, somebody should allocate memory for sq[3] and load data from image
> into that memory.
>
> Probably, it makes sense to somehow implement a combination of
> VMS_ARRAY_OF_POINTER and VMS_ALLOC,
> but I tried to be as less invasive as I can and decided to introduce a
> new kind of VMStateField instead.
Yes. We already use VMS_ALLOC for top-level allocation. We may need
another flag for per-array-element allocation.
>
> >
> > > +
> > > + if (pv == NULL) {
> > > + /* write a mark telling that there was a NULL pointer */
> > > + qemu_put_byte(f, false);
> > > + return 0;
> > > + }
> > > +
> > > + /* if pointer is not NULL, dump the structure contents with help of vmsd */
> > > + qemu_put_byte(f, true);
> > > + ret = vmstate_save_state(f, vmsd, pv, vmdesc, &local_err);
> >
> > This looks really like VMS_STRUCT, except that it also tries to detect if
> > the pointer is NULL, then save some dumps.
>
> Precisely, I'm mimicking behavior of VMS_STRUCT | VMS_ALLOC and
> VMS_ARRAY_OF_POINTER here.
>
> >
> > Is it feasible to make both queues to always be available so as to reuse
> > VMS_STRUCT? Or is it intentional here to avoid dumping queues where the
> > pointer is NULL?
>
> By "available" you mean to always allocate memory for all n->sq[i] elements?
>
> >
> > Even if for the latter, I wonder if we can have better way to do it. For
> > example, if we can integrate this directly into vmstate_save/load_state()
> > to support some new flag, VMS_ARRAY_OF_POINTER_DYNAMIC, then we can declare
> > some more "official" way to say some pointer of an dynamic array is NULL.
> >
>
> Yeah, I thought about this too, I believe this is a great idea. But
> taking into account that
> this is my first contribution to QEMU and I'm learning code around I tried to be
> very conservative and don't change any generic code ;-)
>
> I'm happy to do this work in -v3 if you would like me to do it and
> guide me a bit (I have just send a v2 series,
> but there was no changes in migration API part, only NVMe specifics).
Yes, it would be nice if we can build a good base for this part of the
work. The problem with your proposal is that even if it doesn't touch the
vmstate core, we will then need to maintain that special .info struct,
which is also a burden. So maybe it's better to do the best we can come up with.
I changed the name of the new flag slightly; DYNAMIC might be fine, but I
wonder if we can choose something that's clearer about what it does.
Maybe VMS_ARRAY_OF_POINTER_ALLOW_NULL? It should at least imply two things:
(1) The array being migrated can contain NULL pointers anywhere in the
elements. We will need to change the nullptr trick a bit for this: we
must send a byte (even if it's non-NULL) beforehand saying if this
element is NULL. We could reuse 0x30 for NULL, but we need e.g. 0x31
to say it's non-NULL. Then this will work even if dest QEMU doesn't
know which pointer is NULL (unlike s390's css).
(2) If it's non-NULL, dest QEMU allocates the array element memory while
loading the vmsd. It plays a role similar to VMS_ALLOC, but for the
array elements. We could use a separate flag but I don't see a case
where this flag can be used without being able to migrate NULL
elements, so maybe we can stick with one flag so far until we see
another use case.
The new flag must only be used with VMS_ARRAY_OF_POINTER, for sure. It
should likely also verify that each pointer on dest is NULL before
allocating that array element, in case we leak things.
We will also need some coverage in test-vmstate.c for this.
It would be great if you would volunteer for this piece of work; I will
try to help in whatever way I can with reviews. You can also consider
doing it separately, so that migration can collect it in advance on its
own when ready (given that it'll be 100% useful here for nvme very soon).
But I know it might be too much to ask, so if you want I can also take a
look at this problem. Let me know.
Thanks,
--
Peter Xu
* Re: [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC
2026-03-04 16:40 ` Peter Xu
@ 2026-03-05 8:55 ` Alexander Mikhalitsyn
0 siblings, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-05 8:55 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Jesper Devantier, Klaus Jensen, Fabiano Rosas,
qemu-block, Keith Busch, Alexander Mikhalitsyn,
Stéphane Graber
Am Mi., 4. März 2026 um 17:40 Uhr schrieb Peter Xu <peterx@redhat.com>:
>
> On Wed, Mar 04, 2026 at 10:43:53AM +0100, Alexander Mikhalitsyn wrote:
> > Am Mo., 2. März 2026 um 22:47 Uhr schrieb Peter Xu <peterx@redhat.com>:
> > >
> > > On Tue, Feb 17, 2026 at 04:25:16PM +0100, Alexander Mikhalitsyn wrote:
> > > > From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> > > >
> > > > Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC, which
> > > > helps to save/restore a dynamic array of pointers to
> > > > structures.
> >
> > Dear Peter,
> >
> > >
> > > Sorry for the late response.
> >
> > No problem at all, thanks for reviewing ;)
> >
> > >
> > > >
> > > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> > > > ---
> > > > include/migration/vmstate.h | 21 +++++++++
> > > > migration/vmstate-types.c | 88 +++++++++++++++++++++++++++++++++++++
> > > > 2 files changed, 109 insertions(+)
> > > >
> > > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > > > index 89f9f49d20a..bc6495a7f67 100644
> > > > --- a/include/migration/vmstate.h
> > > > +++ b/include/migration/vmstate.h
> > > > @@ -265,6 +265,7 @@ extern const VMStateInfo vmstate_info_bitmap;
> > > > extern const VMStateInfo vmstate_info_qtailq;
> > > > extern const VMStateInfo vmstate_info_gtree;
> > > > extern const VMStateInfo vmstate_info_qlist;
> > > > +extern const VMStateInfo vmstate_info_ptrs_array_entry;
> > > >
> > > > #define type_check_2darray(t1,t2,n,m) ((t1(*)[n][m])0 - (t2*)0)
> > > > /*
> > > > @@ -537,6 +538,26 @@ extern const VMStateInfo vmstate_info_qlist;
> > > > .offset = vmstate_offset_array(_s, _f, _type*, _n), \
> > > > }
> > > >
> > > > +/*
> > > > + * For migrating a dynamically allocated uint32-indexed array
> > > > + * of pointers to structures (with NULL entries and with auto memory allocation).
> > > > + *
> > > > + * _type: type of structure pointed to
> > > > + * _vmsd: VMSD for structure
> > > > + * start: size of structure pointed to (for auto memory allocation)
> > > > + */
> > > > +#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
> > > > + .name = (stringify(_field)), \
> > > > + .version_id = (_version), \
> > > > + .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
> > > > + .info = &vmstate_info_ptrs_array_entry, \
> > >
> > > I wonder if we need this hard-coded new info structure for this, more
> > > below.
> > >
> > > > + .vmsd = &(_vmsd), \
> > > > + .start = sizeof(_type), \
> > > > + .size = sizeof(_type *), \
> > > > + .flags = VMS_VARRAY_UINT32|VMS_POINTER, \
> > > > + .offset = vmstate_offset_pointer(_state, _field, _type *), \
> > > > +}
> > > > +
> > > > #define VMSTATE_VARRAY_OF_POINTER_UINT32(_field, _state, _field_num, _version, _info, _type) { \
> > > > .name = (stringify(_field)), \
> > > > .version_id = (_version), \
> > > > diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> > > > index 89cb2114721..3335377cd07 100644
> > > > --- a/migration/vmstate-types.c
> > > > +++ b/migration/vmstate-types.c
> > > > @@ -942,3 +942,91 @@ const VMStateInfo vmstate_info_qlist = {
> > > > .get = get_qlist,
> > > > .put = put_qlist,
> > > > };
> > > > +
> > > > +static int put_ptrs_array_entry(QEMUFile *f, void *ppv, size_t unused_size,
> > > > + const VMStateField *field, JSONWriter *vmdesc)
> > > > +{
> > > > + const VMStateDescription *vmsd = field->vmsd;
> > > > + int ret;
> > > > + Error *local_err = NULL;
> > > > + void *pv;
> > > > +
> > > > + /*
> > > > + * (ppv) is an address of an i-th element of a dynamic array.
> > > > + *
> > > > + * (ppv) cannot be NULL unless we have some regression/bug in
> > > > + * vmstate_save_state_v(), because it is the result of pointer arithmetic like:
> > > > + * first_elem + size * i.
> > > > + */
> > > > + if (ppv == NULL) {
> > > > + error_report("vmstate: put_ptrs_array_entry must be called with ppv != NULL");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + /* get a pointer to a structure */
> > > > + pv = *(void **)ppv;
> > >
> > > Here IIUC it's almost what VMS_ARRAY_OF_POINTER should do. Could we
> > > declare the vmsd with VMS_ARRAY_OF_POINTER so as to save this open-coded
> > > dereference of pointer?
> >
> > Yes, this is something I tried to do in the first place, but it
> > didn't work for my use case.
> >
> > Let me explain what I'm doing in detail. I have the following structure:
> >
> > typedef struct NvmeCtrl {
> > ...
> > uint32_t num_queues; // < MAX number of elements in sq
> > NvmeSQueue **sq;
> > ...
> > } NvmeCtrl;
> >
> > Suppose we have variable n with type NvmeCtrl *.
> >
> > // somewhere early during initialization of QEMU device:
> > n->sq = g_new0(NvmeSQueue *, n->num_queues);
> >
> > // then in runtime (depending on what *guest* wants) we might create
> > some sq[i], e.g:
> > NvmeSQueue *sq;
> > sq = g_malloc0(sizeof(*sq));
> > n->sq[3] = sq;
> >
> > please, note that n->sq[0], n->sq[1], n->sq[2] can still be NULL. So,
> > we allow holes in that array.
> >
> > What I found is that if I raise the VMS_ARRAY_OF_POINTER flag on a VMStateField, then
> > the migration code in vmstate_load_state() will expect to see a so-called
> > "fake nullptr field" (the vmsd_create_fake_nullptr_field() call).
> >
> > But this is a problem, because our array at the time of
> > vmstate_load_state() looks like (N == NULL):
> > 0 1 2 3 4 5
> > N N N N N N
Hi Peter,
>
> Yes, I actually just noticed that the old nullptr trick only works if the
> dest QEMU already knows whether the pointer is NULL... that makes that
> interface pretty much useless in most cases.. :(
>
> So it's not source telling dest that "there's a null pointer in the array",
> it's just that both sides know that. Checking the original patch of that,
> indeed it mentioned it's used in ChannelSubSys.css in hw/s390x/css.c:
Precisely.
>
> https://lore.kernel.org/all/20170222160119.52771-4-pasic@linux.vnet.ibm.com/
>
> So I guess the qemu cmdline for that s390 guest will make sure the pointers
> will be pre-initialized already.. So yes, it won't work for you. You're
> looking for a fully dynamic array of pointers.
>
> >
> > while after vmstate_load_state() I expect it to be:
> > 0 1 2 3 4 5
> > N N N SQ N N
> >
> > so, somebody should allocate memory for sq[3] and load data from image
> > into that memory.
> >
> > Probably, it makes sense to somehow implement a combination of
> > VMS_ARRAY_OF_POINTER and VMS_ALLOC,
> > but I tried to be as non-invasive as I could and decided to introduce a
> > new kind of VMStateField instead.
>
> Yes. We already used VMS_ALLOC for top level allocation. We may need
> another flag for per-array allocation.
>
Yep.
> >
> > >
> > > > +
> > > > + if (pv == NULL) {
> > > > + /* write a mark telling that there was a NULL pointer */
> > > > + qemu_put_byte(f, false);
> > > > + return 0;
> > > > + }
> > > > +
> > > > + /* if pointer is not NULL, dump the structure contents with help of vmsd */
> > > > + qemu_put_byte(f, true);
> > > > + ret = vmstate_save_state(f, vmsd, pv, vmdesc, &local_err);
> > >
> > > This looks really like VMS_STRUCT, except that it also tries to detect if
> > > the pointer is NULL, then save some dumps.
> >
> > Precisely, I'm mimicking the behavior of VMS_STRUCT | VMS_ALLOC and
> > VMS_ARRAY_OF_POINTER here.
> >
> > >
> > > Is it feasible to make both queues to always be available so as to reuse
> > > VMS_STRUCT? Or is it intentional here to avoid dumping queues where the
> > > pointer is NULL?
> >
> > By "available" you mean to always allocate memory for all n->sq[i] elements?
> >
> > >
> > > Even if for the latter, I wonder if we can have better way to do it. For
> > > example, if we can integrate this directly into vmstate_save/load_state()
> > > to support some new flag, VMS_ARRAY_OF_POINTER_DYNAMIC, then we can declare
> > > some more "official" way to say that some pointer of a dynamic array is NULL.
> > >
> >
> > Yeah, I thought about this too, and I believe this is a great idea. But
> > taking into account that this is my first contribution to QEMU and I'm
> > still learning the code around it, I tried to be very conservative and
> > not change any generic code ;-)
> >
> > I'm happy to do this work in -v3 if you would like me to do it and
> > guide me a bit (I have just sent a v2 series,
> > but there were no changes in the migration API part, only NVMe specifics).
>
> Yes, it would be nice if we can build on a good base for this part of the
> work. The problem with your proposal is that even if it doesn't touch the
> vmstate core, we will then need to maintain that special .info struct, which
> is also a burden. So maybe it's better we do the best we can come up with.
Yes, I agree.
>
> I changed the name of the new flag slightly; DYNAMIC might be fine, but I
> wonder if we can choose something clearer about what it does.
>
> Maybe VMS_ARRAY_OF_POINTER_ALLOW_NULL? It should at least imply two things:
>
> (1) The array being migrated can contain NULL pointers anywhere in the
> elements. We will need to change the nullptr trick a bit for this: we
> must send a byte (even if it's non-NULL) beforehand saying if this
> element is NULL. We could reuse 0x30 for NULL, but we need e.g. 0x31
> to say it's non-NULL. Then this will work even if dest QEMU doesn't
> know which pointer is NULL (unlike s390's css).
>
> (2) If it's non-NULL, dest QEMU allocates the array element memory during
> loading the vmsd. It plays similarly a role of VMS_ALLOC but for the
> array elements. We could use a separate flag but I don't see a case
> where this flag can be used without being able to migrate NULL
> elements, so maybe we can stick with one flag so far until we see
> another use case.
>
> The new flag must only be used with VMS_ARRAY_OF_POINTER, for sure. It
> should likely verify that each pointer on dest is NULL before
> allocating that array element, too, in case we leak things.
Yes, this is precisely what I'm currently doing in get_ptrs_array_entry()
>
> We will also need some coverage in test-vmstate.c for this.
Sure!
>
> It would be great if you would volunteer for this piece of work; I will try
> to help whatever way I can on reviews. You can also consider doing it
> separately, so that migration can collect it in advance when ready, on its own
> (provided that it'll be 100% useful here for nvme very soon).
I'm in.
>
> But I know it might be too much to ask, so if you want I can also take a
> look at this problem. Let me know.
No problem at all. I'm happy to pick this up and try to implement it.
And huge thanks for your guidance and advice ;-)
Kind regards,
Alex
>
> Thanks,
>
> --
> Peter Xu
>
* Re: [PATCH 0/4] [RFC] hw/nvme: add basic live migration support
2026-02-27 9:59 ` [PATCH 0/4] [RFC] " Klaus Jensen
2026-03-02 8:43 ` Alexander Mikhalitsyn
@ 2026-03-13 14:24 ` Alexander Mikhalitsyn
1 sibling, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-13 14:24 UTC (permalink / raw)
To: Klaus Jensen
Cc: qemu-devel, Jesper Devantier, Peter Xu, Fabiano Rosas, qemu-block,
Keith Busch, Alexander Mikhalitsyn
On Fri, Feb 27, 2026 at 10:59 AM Klaus Jensen <its@irrelevant.dk> wrote:
>
> On Feb 17 16:25, Alexander Mikhalitsyn wrote:
> > From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> >
> > Dear friends,
> >
> > This patchset adds basic live migration support for
> > QEMU emulated NVMe device.
> >
> > Implementation has some limitations:
> > - only one NVMe namespace is supported
> > - SMART counters are not preserved
> > - CMB is not supported
> > - PMR is not supported
> > - SPDM is not supported
> > - SR-IOV is not supported
> > - AERs are not fully supported
> >
> > I believe this is something I can support in future patchset versions or
> > separately on demand (when a use case appears). But I wanted to share this
> > first version as an RFC to get some feedback on it, in case I'm approaching
> > it wrong.
> >
>
> Hi Alex,
>
> Nice work!
>
> As you have already identified, there are a lot of features that are
> non-trivial to implement migration for. I am completely in favor of only
> supporting migration on a very limited feature set (i.e., don't worry
> about CMB, PMR, SPDM, SR-IOV, ZNS/FDP and so on). Focus on the bare
> mandatory requirements. It would be preferable if the "is migration
> possible?" test is an allowlist instead of a denylist. That makes sure
> we don't add a feature down the road and forget to add it to the
> denylist. I'm not 100% sure how to go about that at this point.
>
> AERs are something we need to deal with. We should not drop events. I
> don't think I have a problem with aborting enqueued AERs, but not the
> events.
>
> Finally, this at a minimum needs some kind of simple smoke test to catch
> regressions. Preferably as part of the QEMU test suite itself, but if
> that is hard to achieve, then I may be ok with an out-of-tree test that
> maintainers can use.
Hi Klaus,
JFYI: I've just sent version 4 of this patchset; it has all the requested
changes:
- AERs handling
- new autotest
- better feature filtering
Kind regards,
Alex
>
>
> Cheers,
> Klaus
Thread overview: 12+ messages
2026-02-17 15:25 [PATCH 0/4] [RFC] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 1/4] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 2/4] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 3/4] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_ALLOC Alexander Mikhalitsyn
2026-03-02 21:47 ` Peter Xu
2026-03-04 9:43 ` Alexander Mikhalitsyn
2026-03-04 16:40 ` Peter Xu
2026-03-05 8:55 ` Alexander Mikhalitsyn
2026-02-17 15:25 ` [PATCH 4/4] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-02-27 9:59 ` [PATCH 0/4] [RFC] " Klaus Jensen
2026-03-02 8:43 ` Alexander Mikhalitsyn
2026-03-13 14:24 ` Alexander Mikhalitsyn