* [PATCH v5 0/8] hw/nvme: add basic live migration support
@ 2026-03-17 10:27 Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 1/8] migration/vmstate: export vmstate_{load, save}_field helpers Alexander Mikhalitsyn
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Dear friends,

This patchset adds basic live migration support for the
QEMU-emulated NVMe device.

The implementation has some limitations:
- only one NVMe namespace is supported
- SMART counters are not preserved
- CMB is not supported
- PMR is not supported
- SPDM is not supported
- SR-IOV is not supported

I believe these are things I can support in future patchset versions or
separately on demand (when a use case appears).

Testing:

This patch series was manually tested on:
- a Debian 13.3 VM (kernel 6.12.69+deb13-amd64) using fio on a *non-root* NVMe
  disk (the root disk was virtio-scsi):

time fio --name=nvme-verify \
    --filename=/dev/nvme0n1 \
    --size=5G \
    --rw=randwrite \
    --bs=4k \
    --iodepth=16 \
    --numjobs=1 \
    --direct=0 \
    --ioengine=io_uring \
    --verify=crc32c \
    --verify_fatal=1

- a Windows Server 2022 VM (the NVMe drive was the *root* disk) with a browser
  open and playing a video.

No defects were found.

Git tree:
https://github.com/mihalicyn/qemu/commits/nvme-live-migration

Changelog for version 5:
- rebased on top of https://lore.kernel.org/all/20260304212303.667141-1-vsementsov@yandex-team.ru/
  (as Peter requested)

Changelog for version 4:
- vmstate dynamic array support was reworked as suggested by Peter Xu:
  the VMS_ARRAY_OF_POINTER_ALLOW_NULL flag was introduced
  and qtests were added
- NVMe migration blockers were reworked as Klaus requested earlier.
  Instead of a "deny list" approach, we now have a stricter pattern of
  NVMe feature filtering, and it should be harder to break migration
  when adding new NVMe features.

Changelog for version 3:
- rebased
- a simple functional test was added (in accordance with Klaus Jensen's review comment)
  $ meson test 'func-x86_64-nvme_migration' --setup thorough -C build

Changelog for version 2:
- full support for AERs (in-flight requests and queued events too)

Kind regards,
Alex

Alexander Mikhalitsyn (8):
  migration/vmstate: export vmstate_{load, save}_field helpers
  migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
  tests/unit/test-vmstate: add tests for VMS_ARRAY_OF_POINTER_ALLOW_NULL
  tests/functional/migration: add VM launch/configure hooks
  hw/nvme: add migration blockers for non-supported cases
  hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
  hw/nvme: add basic live migration support
  tests/functional/x86_64: add migration test for NVMe device

 hw/nvme/ctrl.c                                | 874 +++++++++++++++++-
 hw/nvme/ns.c                                  | 160 ++++
 hw/nvme/nvme.h                                |   8 +
 hw/nvme/trace-events                          |  11 +
 include/block/nvme.h                          |  12 +
 include/migration/vmstate.h                   |  83 +-
 migration/savevm.c                            |  26 +
 migration/vmstate-types.c                     |  91 ++
 migration/vmstate.c                           |  64 +-
 tests/functional/migration.py                 |  22 +-
 tests/functional/x86_64/meson.build           |   1 +
 .../functional/x86_64/test_nvme_migration.py  | 159 ++++
 tests/unit/test-vmstate.c                     | 157 ++++
 13 files changed, 1618 insertions(+), 50 deletions(-)
 create mode 100755 tests/functional/x86_64/test_nvme_migration.py

-- 
2.47.3



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v5 1/8] migration/vmstate: export vmstate_{load, save}_field helpers
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC Alexander Mikhalitsyn
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Let's export the vmstate_{load,save}_field() helpers; they will
be used in subsequent patches to support fully dynamic arrays with NULLs.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 include/migration/vmstate.h |  6 ++++++
 migration/vmstate.c         | 10 +++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index d4a39aa7944..7ed4a0742b2 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1255,6 +1255,12 @@ extern const VMStateInfo vmstate_info_qlist;
         .flags = VMS_END, \
     }
 
+bool vmstate_load_field(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field, Error **errp);
+bool vmstate_save_field(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field,
+                        JSONWriter *vmdesc, Error **errp);
+
 /*
  * vmstate_load_state() and vmstate_save_state() are
  * depreacated, use vmstate_load_vmsd() and vmstate_save_vmsd()
diff --git a/migration/vmstate.c b/migration/vmstate.c
index e98b5f5346c..616eb310e61 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -161,8 +161,8 @@ static bool vmstate_pre_load(const VMStateDescription *vmsd, void *opaque,
     return true;
 }
 
-static bool vmstate_load_field(QEMUFile *f, void *pv, size_t size,
-                               const VMStateField *field, Error **errp)
+bool vmstate_load_field(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field, Error **errp)
 {
     if (field->flags & VMS_STRUCT) {
         return vmstate_load_vmsd(f, field->vmsd, pv, field->vmsd->version_id,
@@ -485,9 +485,9 @@ static bool vmstate_pre_save(const VMStateDescription *vmsd, void *opaque,
     return true;
 }
 
-static bool vmstate_save_field(QEMUFile *f, void *pv, size_t size,
-                               const VMStateField *field,
-                               JSONWriter *vmdesc, Error **errp)
+bool vmstate_save_field(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field,
+                        JSONWriter *vmdesc, Error **errp)
 {
     if (field->flags & VMS_STRUCT) {
         return vmstate_save_vmsd(f, field->vmsd, pv, vmdesc, errp);
-- 
2.47.3




* [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 1/8] migration/vmstate: export vmstate_{load, save}_field helpers Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-03-17 21:39   ` Peter Xu
  2026-03-17 10:27 ` [PATCH v5 3/8] tests/unit/test-vmstate: add tests for VMS_ARRAY_OF_POINTER_ALLOW_NULL Alexander Mikhalitsyn
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC, which
help to save/restore dynamic arrays of pointers to structures.

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
v2:
- added VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT8_ALLOC
v4:
- almost completely reworked, new flag VMS_ARRAY_OF_POINTER_ALLOW_NULL
  was introduced as suggested by Peter
v5:
- rebased on top of https://lore.kernel.org/all/20260304212303.667141-1-vsementsov@yandex-team.ru/
---
 include/migration/vmstate.h | 77 ++++++++++++++++++++++++++++++-
 migration/savevm.c          | 26 +++++++++++
 migration/vmstate-types.c   | 91 +++++++++++++++++++++++++++++++++++++
 migration/vmstate.c         | 54 ++++++++++++++++++----
 4 files changed, 236 insertions(+), 12 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 7ed4a0742b2..0a409700598 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -162,7 +162,19 @@ enum VMStateFlags {
     VMS_VSTRUCT           = 0x8000,
 
     /* Marker for end of list */
-    VMS_END = 0x10000
+    VMS_END                         = 0x10000,
+
+    /* The field is a (fixed-size or variable-size) array of pointers
+     * (e.g. struct a { uint8_t **b; }) that can contain NULL values.
+     * This instructs the vmstate engine to:
+     * - Dereference each array entry before using it.
+     * - Assume that the array is initialized with NULLs during load.
+     * - Automatically allocate memory for array entries (with the size
+     *   specified in (VMStateField).start) during load.
+     * - Produce NULL/not-NULL markers in the migration stream.
+     *
+     * Note: does not imply VMS_ARRAY_OF_POINTER; it must be set explicitly. */
+    VMS_ARRAY_OF_POINTER_ALLOW_NULL = 0x20000,
 };
 
 typedef enum {
@@ -194,6 +206,7 @@ struct VMStateField {
     int version_id;
     int struct_version_id;
     bool (*field_exists)(void *opaque, int version_id);
+    const struct VMStateField *real_field;
 };
 
 struct VMStateDescription {
@@ -268,8 +281,10 @@ extern const VMStateInfo vmstate_info_uint64;
 extern const VMStateInfo vmstate_info_fd;
 
 /** Put this in the stream when migrating a null pointer.*/
-#define VMS_NULLPTR_MARKER (0x30U) /* '0' */
+#define VMS_NULLPTR_MARKER    (0x30U) /* '0' */
+#define VMS_NOTNULLPTR_MARKER (0x31U) /* '1' */
 extern const VMStateInfo vmstate_info_nullptr;
+extern const VMStateInfo vmstate_info_maybeptr;
 
 extern const VMStateInfo vmstate_info_cpudouble;
 
@@ -281,6 +296,7 @@ extern const VMStateInfo vmstate_info_bitmap;
 extern const VMStateInfo vmstate_info_qtailq;
 extern const VMStateInfo vmstate_info_gtree;
 extern const VMStateInfo vmstate_info_qlist;
+extern const VMStateInfo vmstate_info_ptrs_array_entry;
 
 #define type_check_2darray(t1,t2,n,m) ((t1(*)[n][m])0 - (t2*)0)
 /*
@@ -562,6 +578,63 @@ extern const VMStateInfo vmstate_info_qlist;
     .offset     = vmstate_offset_array(_s, _f, _type*, _n),          \
 }
 
+/*
+ * For migrating a dynamically allocated uint{8,32}-indexed array
+ * of pointers to structures (with NULL entries and with auto memory allocation).
+ *
+ * _type: type of structure pointed to
+ * _vmsd: VMSD for structure _type (when VMS_STRUCT is set)
+ * _info: VMStateInfo for _type (when VMS_STRUCT is not set)
+ * start: size of (_type) pointed to (for auto memory allocation)
+ */
+#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT8_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint8_t), \
+    .vmsd       = &(_vmsd),                                          \
+    .start      = sizeof(_type),                                     \
+    .size       = sizeof(_type *),                                   \
+    .flags      = VMS_POINTER|VMS_VARRAY_UINT8|VMS_ARRAY_OF_POINTER| \
+                  VMS_ARRAY_OF_POINTER_ALLOW_NULL|VMS_STRUCT,        \
+    .offset     = vmstate_offset_pointer(_state, _field, _type *),   \
+}
+
+#define VMSTATE_VARRAY_OF_POINTER_UINT8_ALLOC(_field, _state, _field_num, _version, _info, _type) { \
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint8_t), \
+    .info       = &(_info),                                          \
+    .start      = sizeof(_type),                                     \
+    .size       = sizeof(_type *),                                   \
+    .flags      = VMS_POINTER|VMS_VARRAY_UINT8|VMS_ARRAY_OF_POINTER| \
+                  VMS_ARRAY_OF_POINTER_ALLOW_NULL,                   \
+    .offset     = vmstate_offset_pointer(_state, _field, _type *),   \
+}
+
+#define VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT32_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) { \
+    .name       = (stringify(_field)),                                \
+    .version_id = (_version),                                         \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
+    .vmsd       = &(_vmsd),                                           \
+    .start      = sizeof(_type),                                      \
+    .size       = sizeof(_type *),                                    \
+    .flags      = VMS_POINTER|VMS_VARRAY_UINT32|VMS_ARRAY_OF_POINTER| \
+                  VMS_ARRAY_OF_POINTER_ALLOW_NULL|VMS_STRUCT,         \
+    .offset     = vmstate_offset_pointer(_state, _field, _type *),    \
+}
+
+#define VMSTATE_VARRAY_OF_POINTER_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) { \
+    .name       = (stringify(_field)),                                \
+    .version_id = (_version),                                         \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t), \
+    .info       = &(_info),                                           \
+    .start      = sizeof(_type),                                      \
+    .size       = sizeof(_type *),                                    \
+    .flags      = VMS_POINTER|VMS_VARRAY_UINT32|VMS_ARRAY_OF_POINTER| \
+                  VMS_ARRAY_OF_POINTER_ALLOW_NULL,                    \
+    .offset     = vmstate_offset_pointer(_state, _field, _type *),    \
+}
+
 #define VMSTATE_VARRAY_OF_POINTER_UINT32(_field, _state, _field_num, _version, _info, _type) { \
     .name       = (stringify(_field)),                                    \
     .version_id = (_version),                                             \
diff --git a/migration/savevm.c b/migration/savevm.c
index 8115203b518..882c882f684 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -868,6 +868,32 @@ static void vmstate_check(const VMStateDescription *vmsd)
 
     if (field) {
         while (field->name) {
+            /*
+             * VMS_ARRAY_OF_POINTER must be used only together
+             * with one of VMS_(V)ARRAY* flags.
+             */
+            assert(!(field->flags & VMS_ARRAY_OF_POINTER) ||
+                   ((field->flags & (VMS_ARRAY | VMS_VARRAY_INT32 |
+                     VMS_VARRAY_UINT16 | VMS_VARRAY_UINT8 | VMS_VARRAY_UINT32))));
+
+            /*
+             * When VMS_ARRAY_OF_POINTER_ALLOW_NULL is used, we must:
+             * 1. have VMS_ARRAY_OF_POINTER set too;
+             * 2. have the ->start field set; it tells us the size of the
+             *    memory chunk to allocate for every array member.
+             */
+            assert(!(field->flags & VMS_ARRAY_OF_POINTER_ALLOW_NULL) ||
+                   (field->flags & VMS_ARRAY_OF_POINTER));
+            assert(!(field->flags & VMS_ARRAY_OF_POINTER_ALLOW_NULL) ||
+                   field->start);
+
+            /*
+             * (VMStateField).real_field is only for internal purposes
+             * and should never be used by any user-defined VMStateField.
+             * Currently, it is only used by vmsd_create_fake_nullptr_field().
+             */
+            assert(!field->real_field);
+
             if (field->flags & (VMS_STRUCT | VMS_VSTRUCT)) {
                 /* Recurse to sub structures */
                 vmstate_check(field->vmsd);
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index 23f34336964..a55f4d51f4b 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -391,6 +391,97 @@ const VMStateInfo vmstate_info_nullptr = {
     .save = save_nullptr,
 };
 
+static bool load_maybeptr(QEMUFile *f, void *ppv, size_t unused_size,
+                          const VMStateField *field, Error **errp)
+{
+    bool ok = false;
+    const VMStateField *real_field = field->real_field;
+    /* size of structure pointed to by elements of array */
+    size_t size = real_field->start;
+    int marker;
+
+    assert(size);
+
+    if (ppv == NULL) {
+        error_setg(errp, "vmstate: load_maybeptr must be called with ppv != NULL");
+        return false;
+    }
+
+    /*
+     * We start from a clean array: all elements must be NULL, unless
+     * something we haven't prepared for has changed in vmstate_load_vmsd().
+     * Let's check for this just in case.
+     */
+    if (*(void **)ppv != NULL) {
+        error_setg(errp, "vmstate: load_maybeptr must be called with *ppv == NULL");
+        return false;
+    }
+
+    marker = qemu_get_byte(f);
+    assert(marker == VMS_NULLPTR_MARKER || marker == VMS_NOTNULLPTR_MARKER);
+
+    if (marker == VMS_NOTNULLPTR_MARKER) {
+        void *pv;
+
+        /* allocate memory for structure */
+        pv = g_malloc0(size);
+
+        ok = vmstate_load_field(f, pv, size, real_field, errp);
+        if (!ok) {
+            g_free(pv);
+            return false;
+        }
+
+        *(void **)ppv = pv;
+    }
+
+    return true;
+}
+
+static bool save_maybeptr(QEMUFile *f, void *ppv, size_t unused_size,
+                          const VMStateField *field, JSONWriter *vmdesc,
+                          Error **errp)
+{
+    const VMStateField *real_field = field->real_field;
+    /* size of structure pointed to by elements of array */
+    size_t size = real_field->start;
+    void *pv;
+
+    assert(size);
+
+    /*
+     * (ppv) is the address of the i-th element of a dynamic array.
+     *
+     * (ppv) cannot be NULL unless we have some regression/bug in
+     * vmstate_save_vmsd_v(), because it is the result of pointer
+     * arithmetic like first_elem + size * i.
+     */
+    if (ppv == NULL) {
+        error_setg(errp, "vmstate: save_maybeptr must be called with ppv != NULL");
+        return false;
+    }
+
+    /* get a pointer to a structure */
+    pv = *(void **)ppv;
+
+    if (pv == NULL) {
+        /* write a mark telling that there was a NULL pointer */
+        qemu_put_byte(f, VMS_NULLPTR_MARKER);
+        return true;
+    }
+
+    /* if pv is not NULL, write a marker and save field using vmstate_save_field() */
+    qemu_put_byte(f, VMS_NOTNULLPTR_MARKER);
+
+    return vmstate_save_field(f, pv, size, real_field, vmdesc, errp);
+}
+
+const VMStateInfo vmstate_info_maybeptr = {
+    .name = "maybeptr",
+    .load  = load_maybeptr,
+    .save  = save_maybeptr,
+};
+
 /* 64 bit unsigned int. See that the received value is the same than the one
    in the field */
 
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 616eb310e61..29e63751105 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -74,10 +74,15 @@ vmsd_create_fake_nullptr_field(const VMStateField *field)
     /* Do not need "field_exists" check as it always exists (which is null) */
     fake->field_exists = NULL;
 
-    /* See vmstate_info_nullptr - use 1 byte to represent nullptr */
-    fake->size = 1;
-    fake->info = &vmstate_info_nullptr;
-    fake->flags = VMS_SINGLE;
+    if (!(field->flags & VMS_ARRAY_OF_POINTER_ALLOW_NULL)) {
+        /* See vmstate_info_nullptr - use 1 byte to represent nullptr */
+        fake->size = 1;
+        fake->info = &vmstate_info_nullptr;
+        fake->flags = VMS_SINGLE;
+    } else {
+        fake->real_field = field;
+        fake->info = &vmstate_info_maybeptr;
+    }
 
     /* All the rest fields shouldn't matter.. */
 
@@ -258,13 +263,28 @@ bool vmstate_load_vmsd(QEMUFile *f, const VMStateDescription *vmsd,
             for (i = 0; i < n_elems; i++) {
                 bool ok;
                 void *curr_elem = first_elem + size * i;
+                bool need_fake_field = false;
                 const VMStateField *inner_field;
 
                 if (field->flags & VMS_ARRAY_OF_POINTER) {
-                    curr_elem = *(void **)curr_elem;
+                    if (!(field->flags & VMS_ARRAY_OF_POINTER_ALLOW_NULL)) {
+                        assert(curr_elem);
+                        curr_elem = *(void **)curr_elem;
+                        need_fake_field = !curr_elem;
+                    } else {
+                        /*
+                         * We expect the array of pointers to be initialized.
+                         * We don't want to overwrite curr_elem with its
+                         * dereferenced value, because we may need to
+                         * allocate memory (depending on what is in the
+                         * migration stream) and write to it later.
+                         */
+                        assert(!*(void **)curr_elem);
+                        need_fake_field = true;
+                    }
                 }
 
-                if (!curr_elem && size) {
+                if (need_fake_field && size) {
                     /*
                      * If null pointer found (which should only happen in
                      * an array of pointers), use null placeholder and do
@@ -272,6 +292,7 @@ bool vmstate_load_vmsd(QEMUFile *f, const VMStateDescription *vmsd,
                      */
                     inner_field = vmsd_create_fake_nullptr_field(field);
                 } else {
+                    assert(curr_elem || !size);
                     inner_field = field;
                 }
 
@@ -546,25 +567,38 @@ static bool vmstate_save_vmsd_v(QEMUFile *f, const VMStateDescription *vmsd,
 
             for (i = 0; i < n_elems; i++) {
                 void *curr_elem = first_elem + size * i;
+                bool need_fake_field = false;
                 const VMStateField *inner_field;
                 bool is_null;
                 int max_elems = n_elems - i;
 
                 old_offset = qemu_file_transferred(f);
                 if (field->flags & VMS_ARRAY_OF_POINTER) {
-                    assert(curr_elem);
-                    curr_elem = *(void **)curr_elem;
+                    if (!(field->flags & VMS_ARRAY_OF_POINTER_ALLOW_NULL)) {
+                        assert(curr_elem);
+                        curr_elem = *(void **)curr_elem;
+                        need_fake_field = !curr_elem;
+                    } else {
+                        /*
+                         * We always need a fake field to properly handle
+                         * the VMS_ARRAY_OF_POINTER_ALLOW_NULL case, because
+                         * even if the pointer is not NULL, we still want to
+                         * write a marker in the migration stream.
+                         */
+                        need_fake_field = true;
+                    }
                 }
 
-                if (!curr_elem && size) {
+                if (need_fake_field && size) {
                     /*
                      * If null pointer found (which should only happen in
                      * an array of pointers), use null placeholder and do
                      * not follow.
                      */
                     inner_field = vmsd_create_fake_nullptr_field(field);
-                    is_null = true;
+                    is_null = !curr_elem;
                 } else {
+                    assert(curr_elem || !size);
                     inner_field = field;
                     is_null = false;
                 }
-- 
2.47.3




* [PATCH v5 3/8] tests/unit/test-vmstate: add tests for VMS_ARRAY_OF_POINTER_ALLOW_NULL
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 1/8] migration/vmstate: export vmstate_{load, save}_field helpers Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 4/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Add tests for VMSTATE_VARRAY_OF_POINTER{,_TO_STRUCT}_UINT32_ALLOC.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 tests/unit/test-vmstate.c | 157 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)

diff --git a/tests/unit/test-vmstate.c b/tests/unit/test-vmstate.c
index cadbab3c5e2..d10cf34fc75 100644
--- a/tests/unit/test-vmstate.c
+++ b/tests/unit/test-vmstate.c
@@ -702,6 +702,155 @@ static void test_arr_ptr_prim_0_load(void)
     }
 }
 
+static uint8_t wire_arr_ptr_with_nulls[] = {
+    VMS_NOTNULLPTR_MARKER,
+    0x00, 0x00, 0x00, 0x00,
+    VMS_NULLPTR_MARKER,
+    VMS_NOTNULLPTR_MARKER,
+    0x00, 0x00, 0x00, 0x02,
+    VMS_NOTNULLPTR_MARKER,
+    0x00, 0x00, 0x00, 0x03,
+    QEMU_VM_EOF
+};
+
+typedef struct {
+    uint32_t       ar_items_num;
+    TestStructTriv **ar;
+} TestVArrayOfPtrToStuctWithNULLs;
+
+const VMStateDescription vmsd_arps_with_nulls = {
+    .name = "test/arps_with_nulls",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT32_ALLOC(
+            ar, TestVArrayOfPtrToStuctWithNULLs, ar_items_num, 0, vmsd_tst, TestStructTriv),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void test_arr_ptr_nulls_str_save(void)
+{
+    TestStructTriv ar[AR_SIZE] = { {.i = 0}, {.i = 1}, {.i = 2}, {.i = 3} };
+    TestVArrayOfPtrToStuctWithNULLs sample = {};
+    int idx;
+
+    sample.ar_items_num = AR_SIZE;
+    sample.ar = g_new0(TestStructTriv*, sample.ar_items_num);
+    sample.ar[0] = g_new0(TestStructTriv, 1);
+    *sample.ar[0] = ar[0];
+    /* note, sample.ar[1] remains NULL */
+    sample.ar[2] = g_new0(TestStructTriv, 1);
+    *sample.ar[2] = ar[2];
+    sample.ar[3] = g_new0(TestStructTriv, 1);
+    *sample.ar[3] = ar[3];
+
+    save_vmstate(&vmsd_arps_with_nulls, &sample);
+    compare_vmstate(wire_arr_ptr_with_nulls, sizeof(wire_arr_ptr_with_nulls));
+
+    for (idx = 0; idx < AR_SIZE; ++idx) {
+        g_free(sample.ar[idx]);
+    }
+    g_free(sample.ar);
+}
+
+static void test_arr_ptr_nulls_str_load(void)
+{
+    TestStructTriv ar_gt[AR_SIZE] = {{.i = 0}, {.i = 0}, {.i = 2}, {.i = 3} };
+    TestVArrayOfPtrToStuctWithNULLs obj = {};
+    int idx;
+
+    obj.ar_items_num = AR_SIZE;
+    obj.ar = g_new0(TestStructTriv*, obj.ar_items_num);
+
+    save_buffer(wire_arr_ptr_with_nulls, sizeof(wire_arr_ptr_with_nulls));
+    SUCCESS(load_vmstate_one(&vmsd_arps_with_nulls, &obj, 1,
+                          wire_arr_ptr_with_nulls, sizeof(wire_arr_ptr_with_nulls)));
+
+    for (idx = 0; idx < AR_SIZE; ++idx) {
+        if (idx == 1) {
+            g_assert_cmpint((uintptr_t)(obj.ar[idx]), ==, 0);
+        } else {
+            /* compare the target array ar with the ground truth array ar_gt */
+            g_assert_cmpint(ar_gt[idx].i, ==, obj.ar[idx]->i);
+        }
+    }
+
+    for (idx = 0; idx < AR_SIZE; ++idx) {
+        g_free(obj.ar[idx]);
+    }
+    g_free(obj.ar);
+}
+
+typedef struct {
+    uint32_t ar_items_num;
+    int32_t  **ar;
+} TestVArrayOfPtrToIntWithNULLs;
+
+const VMStateDescription vmsd_arpp_with_nulls = {
+    .name = "test/arpp_with_nulls",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_VARRAY_OF_POINTER_UINT32_ALLOC(
+            ar, TestVArrayOfPtrToIntWithNULLs, ar_items_num, 0, vmstate_info_int32, int32_t),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void test_arr_ptr_nulls_prim_save(void)
+{
+    int32_t ar[AR_SIZE] = { 0, 1, 2, 3 };
+    TestVArrayOfPtrToIntWithNULLs sample = {};
+    int idx;
+
+    sample.ar_items_num = AR_SIZE;
+    sample.ar = g_new0(int32_t*, sample.ar_items_num);
+    sample.ar[0] = g_new0(int32_t, 1);
+    *sample.ar[0] = ar[0];
+    /* note, sample.ar[1] remains NULL */
+    sample.ar[2] = g_new0(int32_t, 1);
+    *sample.ar[2] = ar[2];
+    sample.ar[3] = g_new0(int32_t, 1);
+    *sample.ar[3] = ar[3];
+
+    save_vmstate(&vmsd_arpp_with_nulls, &sample);
+    compare_vmstate(wire_arr_ptr_with_nulls, sizeof(wire_arr_ptr_with_nulls));
+
+    for (idx = 0; idx < AR_SIZE; ++idx) {
+        g_free(sample.ar[idx]);
+    }
+    g_free(sample.ar);
+}
+
+static void test_arr_ptr_nulls_prim_load(void)
+{
+    int32_t ar_gt[AR_SIZE] = { 0, 0, 2, 3 };
+    TestVArrayOfPtrToIntWithNULLs obj = {};
+    int idx;
+
+    obj.ar_items_num = AR_SIZE;
+    obj.ar = g_new0(int32_t*, obj.ar_items_num);
+
+    save_buffer(wire_arr_ptr_with_nulls, sizeof(wire_arr_ptr_with_nulls));
+    SUCCESS(load_vmstate_one(&vmsd_arpp_with_nulls, &obj, 1,
+                          wire_arr_ptr_with_nulls, sizeof(wire_arr_ptr_with_nulls)));
+
+    for (idx = 0; idx < AR_SIZE; ++idx) {
+        if (idx == 1) {
+            g_assert_cmpint((uintptr_t)(obj.ar[idx]), ==, 0);
+        } else {
+            /* compare the target array ar with the ground truth array ar_gt */
+            g_assert_cmpint(ar_gt[idx], ==, *obj.ar[idx]);
+        }
+    }
+
+    for (idx = 0; idx < AR_SIZE; ++idx) {
+        g_free(obj.ar[idx]);
+    }
+    g_free(obj.ar);
+}
+
 /* test QTAILQ migration */
 typedef struct TestQtailqElement TestQtailqElement;
 
@@ -1568,6 +1717,14 @@ int main(int argc, char **argv)
                     test_arr_ptr_prim_0_save);
     g_test_add_func("/vmstate/array/ptr/prim/0/load",
                     test_arr_ptr_prim_0_load);
+    g_test_add_func("/vmstate/array/ptr-nulls/str/save",
+                    test_arr_ptr_nulls_str_save);
+    g_test_add_func("/vmstate/array/ptr-nulls/str/load",
+                    test_arr_ptr_nulls_str_load);
+    g_test_add_func("/vmstate/array/ptr-nulls/prim/save",
+                    test_arr_ptr_nulls_prim_save);
+    g_test_add_func("/vmstate/array/ptr-nulls/prim/load",
+                    test_arr_ptr_nulls_prim_load);
     g_test_add_func("/vmstate/qtailq/save/saveq", test_save_q);
     g_test_add_func("/vmstate/qtailq/load/loadq", test_load_q);
     g_test_add_func("/vmstate/gtree/save/savedomain", test_gtree_save_domain);
-- 
2.47.3




* [PATCH v5 4/8] tests/functional/migration: add VM launch/configure hooks
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (2 preceding siblings ...)
  2026-03-17 10:27 ` [PATCH v5 3/8] tests/unit/test-vmstate: add tests for VMS_ARRAY_OF_POINTER_ALLOW_NULL Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Introduce configure_machine(), launch_source_vm() and assert_dest_vm()
methods to allow subclasses to override parts of the source/destination
VM creation, startup and verification logic.
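The hook flow above can be sketched in plain Python. `MigrationSketch` is a toy stand-in for the test base class in tests/functional/migration.py (the subclass name and device argument are illustrative; plain lists stand in for VM handles so the sketch stays self-contained):

```python
# Toy stand-in for the migration test base class; lists stand in for VM
# handles so the hook ordering can be shown without QEMU.
class MigrationSketch:
    def configure_machine(self, vm):    # hook: applied to both VMs
        vm.append('-nodefaults')

    def launch_source_vm(self, vm):     # hook: source VM only
        vm.append('launched')

    def assert_dest_vm(self, vm):       # hook: post-migration check
        pass

    def migrate(self):
        dst_vm, src_vm = [], []
        self.configure_machine(dst_vm)
        dst_vm.append('launched')       # destination is launched directly
        self.configure_machine(src_vm)
        self.launch_source_vm(src_vm)
        # ... the migration itself is elided ...
        self.assert_dest_vm(dst_vm)
        return dst_vm, src_vm

# Hypothetical subclass adding an NVMe device and a post-migration check.
class NvmeMigrationSketch(MigrationSketch):
    def configure_machine(self, vm):
        super().configure_machine(vm)
        vm.append('-device nvme')       # illustrative device argument

    def assert_dest_vm(self, vm):
        assert '-device nvme' in vm
```

A subclass only overrides the hooks it needs; the base `migrate()` flow stays untouched, which is the point of the patch.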

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 tests/functional/migration.py | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tests/functional/migration.py b/tests/functional/migration.py
index e995328e833..40e9c094c46 100644
--- a/tests/functional/migration.py
+++ b/tests/functional/migration.py
@@ -40,19 +40,35 @@ def assert_migration(self, src_vm, dst_vm):
         self.assertEqual(dst_vm.cmd('query-status')['status'], 'running')
         self.assertEqual(src_vm.cmd('query-status')['status'],'postmigrate')
 
+    # Can be overridden by subclasses to configure both source/dest VMs.
+    def configure_machine(self, vm):
+        vm.add_args('-nodefaults')
+
+    # Can be overridden by subclasses to prepare the source VM before migration,
+    # e.g. by running some workload inside the source VM to see if it continues
+    # to run properly after migration.
+    def launch_source_vm(self, vm):
+        vm.launch()
+
+    # Can be overridden by subclasses to check the destination VM after migration,
+    # e.g. by checking if the workload is still running after migration.
+    def assert_dest_vm(self, vm):
+        pass
+
     def migrate_vms(self, dst_uri, src_uri, dst_vm, src_vm):
         dst_vm.qmp('migrate-incoming', uri=dst_uri)
         src_vm.qmp('migrate', uri=src_uri)
         self.assert_migration(src_vm, dst_vm)
+        self.assert_dest_vm(dst_vm)
 
     def migrate(self, dst_uri, src_uri=None):
         dst_vm = self.get_vm('-incoming', 'defer', name="dst-qemu")
-        dst_vm.add_args('-nodefaults')
+        self.configure_machine(dst_vm)
         dst_vm.launch()
 
         src_vm = self.get_vm(name="src-qemu")
-        src_vm.add_args('-nodefaults')
-        src_vm.launch()
+        self.configure_machine(src_vm)
+        self.launch_source_vm(src_vm)
 
         if src_uri is None:
             src_uri = dst_uri
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (3 preceding siblings ...)
  2026-03-17 10:27 ` [PATCH v5 4/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-04-07 15:34   ` Stefan Hajnoczi
  2026-03-17 10:27 ` [PATCH v5 6/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Let's block migration for cases we don't support:
- SR-IOV
- CMB
- PMR
- SPDM

No functional changes here, because NVMe migration is
not supported at all as of this commit.
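A rough Python sketch of the blocker accumulation this patch implements; `add_blocker_feature()` mirrors `nvme_add_blocker_feature()`, while the boolean parameters are illustrative stand-ins for the real device-state checks (sriov_max_vfs, CAP.CMBS, CAP.PMRS, spdm_port):

```python
def add_blocker_feature(features, feature):
    """Mirror of nvme_add_blocker_feature(): comma-join detected features."""
    return feature if not features else features + ", " + feature

def blocker_message(sriov=False, cmb=False, pmr=False, spdm=False):
    """Build the migration-blocker message; None means no blocker is set."""
    features = ""
    if sriov:
        features = add_blocker_feature(features, "SR-IOV")
    if cmb:
        features = add_blocker_feature(features, "CMB")
    if pmr:
        features = add_blocker_feature(features, "PMR")
    if spdm:
        features = add_blocker_feature(features, "SPDM")
    return f"Migration is not supported for {features}" if features else None
```

When no unsupported feature is detected, no blocker is registered at all, so migration proceeds normally.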

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 hw/nvme/ctrl.c       | 206 +++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/nvme.h       |   3 +
 include/block/nvme.h |  12 +++
 3 files changed, 221 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index cc4593cd427..9f9c9bcbead 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -207,6 +207,7 @@
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
 #include "system/spdm-socket.h"
+#include "migration/blocker.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -250,6 +251,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
     [NVME_COMMAND_SET_PROFILE]      = true,
     [NVME_FDP_MODE]                 = true,
     [NVME_FDP_EVENTS]               = true,
+    /* if you add something here, please update nvme_set_migration_blockers() */
 };
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
@@ -4601,6 +4603,7 @@ static uint16_t nvme_io_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
         return 0;
     case NVME_IOMS_MO_RUH_UPDATE:
         return nvme_io_mgmt_send_ruh_update(n, req);
+    /* if you add something here, please update nvme_set_migration_blockers() */
     default:
         return NVME_INVALID_FIELD | NVME_DNR;
     };
@@ -7518,6 +7521,10 @@ static uint16_t nvme_security_receive(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_directive_send(NvmeCtrl *n, NvmeRequest *req)
 {
+    /*
+     * When adding handling for a new dtype here,
+     * please also update nvme_set_migration_blockers().
+     */
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
@@ -9208,6 +9215,199 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     }
 }
 
+#define BLOCKER_FEATURES_MAX_LEN 256
+
+static inline void nvme_add_blocker_feature(char *blocker_features, const char *feature)
+{
+    if (strlen(blocker_features) > 0) {
+        g_strlcat(blocker_features, ", ", BLOCKER_FEATURES_MAX_LEN);
+    }
+    g_strlcat(blocker_features, feature, BLOCKER_FEATURES_MAX_LEN);
+}
+
+static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
+{
+    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
+    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
+    bool adm_cmd_security_checked = false;
+    bool cmd_io_mgmt_checked = false;
+    bool cmd_zone_checked = false;
+
+    /*
+     * The idea of this function is simple: iterate over all Command Sets
+     * and, for each supported command, apply special handling logic to
+     * determine whether migration should be blocked.
+     *
+     * For instance, NVME_ADM_CMD_NS_ATTACHMENT is always available to
+     * the guest; with only one namespace it is safe to allow migration,
+     * but with more than one we must block migration because the
+     * migration code does not handle that case yet.
+     */
+    for (int opcode = 0; opcode < ARRAY_SIZE(n->cse.acs); opcode++) {
+        /* Is command supported? */
+        if (!n->cse.acs[opcode]) {
+            continue;
+        }
+
+        switch (opcode) {
+        case NVME_ADM_CMD_DELETE_SQ:
+        case NVME_ADM_CMD_CREATE_SQ:
+        case NVME_ADM_CMD_GET_LOG_PAGE:
+        case NVME_ADM_CMD_DELETE_CQ:
+        case NVME_ADM_CMD_CREATE_CQ:
+        case NVME_ADM_CMD_IDENTIFY:
+        case NVME_ADM_CMD_ABORT:
+        case NVME_ADM_CMD_SET_FEATURES:
+        case NVME_ADM_CMD_GET_FEATURES:
+        case NVME_ADM_CMD_ASYNC_EV_REQ:
+        case NVME_ADM_CMD_DBBUF_CONFIG:
+        case NVME_ADM_CMD_FORMAT_NVM:
+        case NVME_ADM_CMD_DIRECTIVE_SEND:
+        case NVME_ADM_CMD_DIRECTIVE_RECV:
+            break;
+        case NVME_ADM_CMD_NS_ATTACHMENT: {
+            int namespaces_num = 0;
+            for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+                NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i);
+                if (!ns) {
+                    continue;
+                }
+
+                namespaces_num++;
+            }
+
+            if (namespaces_num > 1) {
+                nvme_add_blocker_feature(blocker_features, "Namespace Attachment");
+            }
+            break;
+        }
+        case NVME_ADM_CMD_VIRT_MNGMT:
+            if (n->params.sriov_max_vfs) {
+                nvme_add_blocker_feature(blocker_features, "SR-IOV");
+            }
+
+            break;
+        case NVME_ADM_CMD_SECURITY_SEND:
+        case NVME_ADM_CMD_SECURITY_RECV:
+            if (adm_cmd_security_checked) {
+                break;
+            }
+
+            if (pci_dev->spdm_port) {
+                nvme_add_blocker_feature(blocker_features, "SPDM");
+            }
+
+            adm_cmd_security_checked = true;
+
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    for (int opcode = 0; opcode < ARRAY_SIZE(n->cse.iocs.nvm); opcode++) {
+        if (!n->cse.iocs.nvm[opcode]) {
+            continue;
+        }
+
+        switch (opcode) {
+        case NVME_CMD_FLUSH:
+        case NVME_CMD_WRITE:
+        case NVME_CMD_READ:
+        case NVME_CMD_COMPARE:
+        case NVME_CMD_WRITE_ZEROES:
+        case NVME_CMD_DSM:
+        case NVME_CMD_VERIFY:
+        case NVME_CMD_COPY:
+            break;
+        case NVME_CMD_IO_MGMT_RECV:
+        case NVME_CMD_IO_MGMT_SEND:
+            if (cmd_io_mgmt_checked) {
+                break;
+            }
+
+            /* check for NVME_IOMS_MO_RUH_UPDATE */
+            if (n->subsys->params.fdp.enabled) {
+                nvme_add_blocker_feature(blocker_features, "FDP");
+            }
+
+            cmd_io_mgmt_checked = true;
+
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    for (int opcode = 0; opcode < ARRAY_SIZE(n->cse.iocs.zoned); opcode++) {
+        /*
+         * If the command isn't supported, or the same command exists
+         * in n->cse.iocs.nvm, we can skip it here.
+         */
+        if (!n->cse.iocs.zoned[opcode] || n->cse.iocs.nvm[opcode]) {
+            continue;
+        }
+
+        switch (opcode) {
+        case NVME_CMD_ZONE_APPEND:
+        case NVME_CMD_ZONE_MGMT_SEND:
+        case NVME_CMD_ZONE_MGMT_RECV:
+            if (cmd_zone_checked) {
+                break;
+            }
+
+            for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+                NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i);
+                if (!ns) {
+                    continue;
+                }
+
+                if (ns->params.zoned) {
+                    nvme_add_blocker_feature(blocker_features, "Zoned Namespace");
+                    break;
+                }
+            }
+
+            cmd_zone_checked = true;
+
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    /*
+     * Try our best to explicitly detect all unsupported caps, so that
+     * users know which features cause migration to be blocked; anything
+     * we miss handling here is still covered by the unsupported_cap
+     * check below.
+     */
+    if (NVME_CAP_CMBS(cap)) {
+        nvme_add_blocker_feature(blocker_features, "CMB");
+        cap &= ~((uint64_t)CAP_CMBS_MASK << CAP_CMBS_SHIFT);
+    }
+
+    if (NVME_CAP_PMRS(cap)) {
+        nvme_add_blocker_feature(blocker_features, "PMR");
+        cap &= ~((uint64_t)CAP_PMRS_MASK << CAP_PMRS_SHIFT);
+    }
+
+    unsupported_cap = cap & ~NVME_MIGRATION_SUPPORTED_CAP_BITS;
+    if (unsupported_cap) {
+        nvme_add_blocker_feature(blocker_features, "unknown capability");
+    }
+
+    assert(n->migration_blocker == NULL);
+    if (strlen(blocker_features) > 0) {
+        error_setg(&n->migration_blocker, "Migration is not supported for %s", blocker_features);
+        if (migrate_add_blocker(&n->migration_blocker, errp) < 0) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
 static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
 {
     int cntlid;
@@ -9313,6 +9513,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 
         n->subsys->namespaces[ns->params.nsid] = ns;
     }
+
+    if (!nvme_set_migration_blockers(n, pci_dev, errp)) {
+        return;
+    }
 }
 
 static void nvme_exit(PCIDevice *pci_dev)
@@ -9365,6 +9569,8 @@ static void nvme_exit(PCIDevice *pci_dev)
     }
 
     memory_region_del_subregion(&n->bar0, &n->iomem);
+
+    migrate_del_blocker(&n->migration_blocker);
 }
 
 static const Property nvme_props[] = {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index d66f7dc82d5..457b6637249 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -666,6 +666,9 @@ typedef struct NvmeCtrl {
 
     /* Socket mapping to SPDM over NVMe Security In/Out commands */
     int spdm_socket;
+
+    /* Migration-related stuff */
+    Error *migration_blocker;
 } NvmeCtrl;
 
 typedef enum NvmeResetType {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9d7159ed7a7..a7f586fc801 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -141,6 +141,18 @@ enum NvmeCapMask {
 #define NVME_CAP_SET_CMBS(cap, val)   \
     ((cap) |= (uint64_t)((val) & CAP_CMBS_MASK)   << CAP_CMBS_SHIFT)
 
+#define NVME_MIGRATION_SUPPORTED_CAP_BITS ( \
+      ((uint64_t)CAP_MQES_MASK   << CAP_MQES_SHIFT)   \
+    | ((uint64_t)CAP_CQR_MASK    << CAP_CQR_SHIFT)    \
+    | ((uint64_t)CAP_AMS_MASK    << CAP_AMS_SHIFT)    \
+    | ((uint64_t)CAP_TO_MASK     << CAP_TO_SHIFT)     \
+    | ((uint64_t)CAP_DSTRD_MASK  << CAP_DSTRD_SHIFT)  \
+    | ((uint64_t)CAP_NSSRS_MASK  << CAP_NSSRS_SHIFT)  \
+    | ((uint64_t)CAP_CSS_MASK    << CAP_CSS_SHIFT)    \
+    | ((uint64_t)CAP_MPSMIN_MASK << CAP_MPSMIN_SHIFT) \
+    | ((uint64_t)CAP_MPSMAX_MASK << CAP_MPSMAX_SHIFT) \
+)
+
 enum NvmeCapCss {
     NVME_CAP_CSS_NCSS    = 1 << 0,
     NVME_CAP_CSS_IOCSS   = 1 << 6,
-- 
2.47.3
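The CAP-based fallback in the patch above reduces to a mask test. A hedged Python sketch follows; the bit positions are taken from the NVMe specification and assumed to match the shift/mask values in include/block/nvme.h (CAP.PMRS at bit 56, CAP.CMBS at bit 57):

```python
# Assumed NVMe CAP bit positions (per spec): PMRS = bit 56, CMBS = bit 57.
CAP_PMRS_SHIFT, CAP_CMBS_SHIFT = 56, 57

# CAP bits whose state the migration code understands, mirroring
# NVME_MIGRATION_SUPPORTED_CAP_BITS in the patch.
SUPPORTED_CAP_BITS = (
    (0xFFFF << 0)    # MQES
    | (0x1 << 16)    # CQR
    | (0x3 << 17)    # AMS
    | (0xFF << 24)   # TO
    | (0xF << 32)    # DSTRD
    | (0x1 << 36)    # NSSRS
    | (0xFF << 37)   # CSS
    | (0xF << 48)    # MPSMIN
    | (0xF << 52)    # MPSMAX
)

def unsupported_caps(cap):
    """Return the list of blocking features, mirroring the patch's CAP checks."""
    features = []
    if cap & (1 << CAP_CMBS_SHIFT):
        features.append("CMB")
        cap &= ~(1 << CAP_CMBS_SHIFT)   # named, so clear before the fallback
    if cap & (1 << CAP_PMRS_SHIFT):
        features.append("PMR")
        cap &= ~(1 << CAP_PMRS_SHIFT)
    if cap & ~SUPPORTED_CAP_BITS:       # catch-all for anything unrecognized
        features.append("unknown capability")
    return features
```

The design point is that CMB and PMR are detected by name so the error message is actionable, while the final mask test catches any future CAP bit nobody remembered to handle.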



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 6/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (4 preceding siblings ...)
  2026-03-17 10:27 ` [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-03-17 10:27 ` [PATCH v5 7/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 hw/nvme/ctrl.c | 57 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 9f9c9bcbead..68ce9802505 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4854,18 +4854,14 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
-static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
-                         uint16_t sqid, uint16_t cqid, uint16_t size)
+static void __nvme_init_sq(NvmeSQueue *sq)
 {
+    NvmeCtrl *n = sq->ctrl;
+    uint16_t sqid = sq->sqid;
+    uint16_t cqid = sq->cqid;
     int i;
     NvmeCQueue *cq;
 
-    sq->ctrl = n;
-    sq->dma_addr = dma_addr;
-    sq->sqid = sqid;
-    sq->size = size;
-    sq->cqid = cqid;
-    sq->head = sq->tail = 0;
     sq->io_req = g_new0(NvmeRequest, sq->size);
 
     QTAILQ_INIT(&sq->req_list);
@@ -4895,6 +4891,18 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     n->sq[sqid] = sq;
 }
 
+static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
+                         uint16_t sqid, uint16_t cqid, uint16_t size)
+{
+    sq->ctrl = n;
+    sq->dma_addr = dma_addr;
+    sq->sqid = sqid;
+    sq->size = size;
+    sq->cqid = cqid;
+    sq->head = sq->tail = 0;
+    __nvme_init_sq(sq);
+}
+
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeSQueue *sq;
@@ -5555,24 +5563,16 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
-static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
-                         uint16_t cqid, uint16_t vector, uint16_t size,
-                         uint16_t irq_enabled)
+static void __nvme_init_cq(NvmeCQueue *cq)
 {
+    NvmeCtrl *n = cq->ctrl;
     PCIDevice *pci = PCI_DEVICE(n);
+    uint16_t cqid = cq->cqid;
 
-    if (msix_enabled(pci) && irq_enabled) {
-        msix_vector_use(pci, vector);
+    if (msix_enabled(pci) && cq->irq_enabled) {
+        msix_vector_use(pci, cq->vector);
     }
 
-    cq->ctrl = n;
-    cq->cqid = cqid;
-    cq->size = size;
-    cq->dma_addr = dma_addr;
-    cq->phase = 1;
-    cq->irq_enabled = irq_enabled;
-    cq->vector = vector;
-    cq->head = cq->tail = 0;
     QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
     if (n->dbbuf_enabled) {
@@ -5590,6 +5590,21 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
                                  &DEVICE(cq->ctrl)->mem_reentrancy_guard);
 }
 
+static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
+                         uint16_t cqid, uint16_t vector, uint16_t size,
+                         uint16_t irq_enabled)
+{
+    cq->ctrl = n;
+    cq->cqid = cqid;
+    cq->size = size;
+    cq->dma_addr = dma_addr;
+    cq->phase = 1;
+    cq->irq_enabled = irq_enabled;
+    cq->vector = vector;
+    cq->head = cq->tail = 0;
+    __nvme_init_cq(cq);
+}
+
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCQueue *cq;
-- 
2.47.3
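The split can be sketched as a generic two-phase pattern (Python, names illustrative): plain field assignment, which migration later restores via vmstate, is separated from the side-effectful setup that must run both on fresh queue creation and on post-migration restore:

```python
# Toy sketch of the init/restore split for a submission queue.
class SQueue:
    def _init_common(self):                # stands in for __nvme_init_sq()
        self.io_req = [None] * self.size   # allocate the request pool
        self.registered = True             # i.e. n->sq[sqid] = sq

    @classmethod
    def create(cls, sqid, cqid, size):     # stands in for nvme_init_sq()
        sq = cls.__new__(cls)
        sq.sqid, sq.cqid, sq.size = sqid, cqid, size
        sq.head = sq.tail = 0
        sq._init_common()                  # finish with the shared setup
        return sq

    @classmethod
    def restore(cls, state):               # stands in for a restore path
        sq = cls.__new__(cls)
        sq.__dict__.update(state)          # fields arrived via vmstate
        sq._init_common()                  # redo the non-migrated setup
        return sq
```

The payoff of the split is exactly this: the restore path can reuse the shared helper without re-zeroing head/tail, which the migrated state already carries.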



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (5 preceding siblings ...)
  2026-03-17 10:27 ` [PATCH v5 6/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-04-07 15:48   ` Stefan Hajnoczi
  2026-03-17 10:27 ` [PATCH v5 8/8] tests/functional/x86_64: add migration test for NVMe device Alexander Mikhalitsyn
  2026-03-30 11:38 ` [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
  8 siblings, 1 reply; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

The implementation has some limitations:
- only one NVMe namespace is supported
- SMART counters are not preserved
- CMB is not supported
- PMR is not supported
- SPDM is not supported
- SR-IOV is not supported
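The pre-save hook in this patch waits for each completion queue to drain and blocks migration if a CQ is full. A sketch of the head/tail ring arithmetic involved (the exact nvme_cq_full() definition is an assumption here, following the usual one-slot-free convention):

```python
def cq_full(head, tail, size):
    """Ring is full when advancing tail would collide with head."""
    return (tail + 1) % size == head   # one slot is always kept free

def cq_empty(head, tail):
    """Ring is empty when producer and consumer indices coincide."""
    return head == tail
```

With this convention a full CQ means the guest has not consumed completions for a while; rather than migrating the pending cq->req_list entries, the patch simply refuses to migrate in that rare state.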

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
v2:
- AERs are now fully supported
 hw/nvme/ctrl.c       | 611 ++++++++++++++++++++++++++++++++++++++++++-
 hw/nvme/ns.c         | 160 +++++++++++
 hw/nvme/nvme.h       |   5 +
 hw/nvme/trace-events |  11 +
 4 files changed, 778 insertions(+), 9 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 68ce9802505..e3030bfadcb 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -208,6 +208,7 @@
 #include "hw/pci/pcie_sriov.h"
 #include "system/spdm-socket.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -4903,6 +4904,25 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     __nvme_init_sq(sq);
 }
 
+static void nvme_restore_sq(NvmeSQueue *sq_from)
+{
+    NvmeCtrl *n = sq_from->ctrl;
+    NvmeSQueue *sq = sq_from;
+
+    if (sq_from->sqid == 0) {
+        sq = &n->admin_sq;
+        sq->ctrl = n;
+        sq->dma_addr = sq_from->dma_addr;
+        sq->sqid = sq_from->sqid;
+        sq->size = sq_from->size;
+        sq->cqid = sq_from->cqid;
+        sq->head = sq_from->head;
+        sq->tail = sq_from->tail;
+    }
+
+    __nvme_init_sq(sq);
+}
+
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeSQueue *sq;
@@ -5605,6 +5625,27 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     __nvme_init_cq(cq);
 }
 
+static void nvme_restore_cq(NvmeCQueue *cq_from)
+{
+    NvmeCtrl *n = cq_from->ctrl;
+    NvmeCQueue *cq = cq_from;
+
+    if (cq_from->cqid == 0) {
+        cq = &n->admin_cq;
+        cq->ctrl = n;
+        cq->cqid = cq_from->cqid;
+        cq->size = cq_from->size;
+        cq->dma_addr = cq_from->dma_addr;
+        cq->phase = cq_from->phase;
+        cq->irq_enabled = cq_from->irq_enabled;
+        cq->vector = cq_from->vector;
+        cq->head = cq_from->head;
+        cq->tail = cq_from->tail;
+    }
+
+    __nvme_init_cq(cq);
+}
+
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCQueue *cq;
@@ -7293,7 +7334,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
     n->dbbuf_eis = eis_addr;
     n->dbbuf_enabled = true;
 
-    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->num_queues; i++) {
         NvmeSQueue *sq = n->sq[i];
         NvmeCQueue *cq = n->cq[i];
 
@@ -7737,7 +7778,7 @@ static int nvme_atomic_write_check(NvmeCtrl *n, NvmeCmd *cmd,
     /*
      * Walk the queues to see if there are any atomic conflicts.
      */
-    for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 1; i < n->num_queues; i++) {
         NvmeSQueue *sq;
         NvmeRequest *req;
         NvmeRwCmd *req_rw;
@@ -7807,6 +7848,10 @@ static void nvme_process_sq(void *opaque)
     NvmeCmd cmd;
     NvmeRequest *req;
 
+    if (qatomic_read(&n->stop_processing_sq)) {
+        return;
+    }
+
     if (n->dbbuf_enabled) {
         nvme_update_sq_tail(sq);
     }
@@ -7815,6 +7860,10 @@ static void nvme_process_sq(void *opaque)
         NvmeAtomic *atomic;
         bool cmd_is_atomic;
 
+        if (qatomic_read(&n->stop_processing_sq)) {
+            return;
+        }
+
         addr = sq->dma_addr + (sq->head << NVME_SQES);
         if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
             trace_pci_nvme_err_addr_read(addr);
@@ -7923,12 +7972,12 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
         nvme_ns_drain(ns);
     }
 
-    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->num_queues; i++) {
         if (n->sq[i] != NULL) {
             nvme_free_sq(n->sq[i], n);
         }
     }
-    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->num_queues; i++) {
         if (n->cq[i] != NULL) {
             nvme_free_cq(n->cq[i], n);
         }
@@ -8598,6 +8647,8 @@ static bool nvme_check_params(NvmeCtrl *n, Error **errp)
         params->max_ioqpairs = params->num_queues - 1;
     }
 
+    n->num_queues = params->max_ioqpairs + 1;
+
     if (n->namespace.blkconf.blk && n->subsys) {
         error_setg(errp, "subsystem support is unavailable with legacy "
                    "namespace ('drive' property)");
@@ -8752,8 +8803,8 @@ static void nvme_init_state(NvmeCtrl *n)
         n->conf_msix_qsize = n->params.msix_qsize;
     }
 
-    n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
-    n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
+    n->sq = g_new0(NvmeSQueue *, n->num_queues);
+    n->cq = g_new0(NvmeCQueue *, n->num_queues);
     n->temperature = NVME_TEMPERATURE;
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
@@ -8988,7 +9039,7 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     if (n->params.msix_exclusive_bar && !pci_is_vf(pci_dev)) {
-        bar_size = nvme_mbar_size(n->params.max_ioqpairs + 1, 0, NULL, NULL);
+        bar_size = nvme_mbar_size(n->num_queues, 0, NULL, NULL);
         memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
                               bar_size);
         pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
@@ -9000,7 +9051,7 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         /* add one to max_ioqpairs to account for the admin queue pair */
         if (!pci_is_vf(pci_dev)) {
             nr_vectors = n->params.msix_qsize;
-            bar_size = nvme_mbar_size(n->params.max_ioqpairs + 1,
+            bar_size = nvme_mbar_size(n->num_queues,
                                       nr_vectors, &msix_table_offset,
                                       &msix_pba_offset);
         } else {
@@ -9723,9 +9774,551 @@ static uint32_t nvme_pci_read_config(PCIDevice *dev, uint32_t address, int len)
     return pci_default_read_config(dev, address, len);
 }
 
+static bool nvme_ctrl_pre_save(void *opaque, Error **errp)
+{
+    NvmeCtrl *n = opaque;
+    int i;
+
+    trace_pci_nvme_pre_save_enter(n);
+
+    /* ask SQ processing code not to take new requests */
+    qatomic_set(&n->stop_processing_sq, true);
+
+    /* prevent new in-flight IO from appearing */
+    for (i = 0; i < n->num_queues; i++) {
+        NvmeSQueue *sq = n->sq[i];
+
+        if (!sq) {
+            continue;
+        }
+
+        qemu_bh_cancel(sq->bh);
+    }
+
+    /* drain all IO */
+    for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+        NvmeNamespace *ns;
+
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+
+        trace_pci_nvme_pre_save_ns_drain(n, i);
+        nvme_ns_drain(ns);
+    }
+
+    /*
+     * Now, take care of AERs.
+     *
+     * 1. Save all queued events (n->aer_queue).
+     *    This is done automatically, see the nvme_vmstate VMStateDescription.
+     *    Here we only need to print them for debugging purposes.
+     * 2. Go over the outstanding AER requests (n->aer_reqs) and check that they
+     *    all have the expected opcode (NVME_ADM_CMD_ASYNC_EV_REQ) and field values.
+     *
+     * We must be really careful here: with further QEMU NVMe changes we could
+     * break migration without noticing it, or worse, introduce silent data
+     * corruption during migration.
+     */
+    if (n->aer_queued) {
+        NvmeAsyncEvent *event;
+
+        QTAILQ_FOREACH(event, &n->aer_queue, entry) {
+            trace_pci_nvme_pre_save_aer(event->result.event_type, event->result.event_info,
+                                        event->result.log_page);
+        }
+    }
+
+    for (i = 0; i < n->outstanding_aers; i++) {
+        NvmeRequest *re = n->aer_reqs[i];
+
+        /*
+         * Can't use assert() here, because we don't want to
+         * crash QEMU just because the user requested a migration.
+         */
+        if (!(re->cmd.opcode == NVME_ADM_CMD_ASYNC_EV_REQ)) {
+            error_setg(errp, "re->cmd.opcode (%u) != NVME_ADM_CMD_ASYNC_EV_REQ", re->cmd.opcode);
+            goto err;
+        }
+
+        if (!(re->ns == NULL)) {
+            error_setg(errp, "re->ns != NULL");
+            goto err;
+        }
+
+        if (!(re->sq == &n->admin_sq)) {
+            error_setg(errp, "re->sq != &n->admin_sq");
+            goto err;
+        }
+
+        if (!(re->aiocb == NULL)) {
+            error_setg(errp, "re->aiocb != NULL");
+            goto err;
+        }
+
+        if (!(re->opaque == NULL)) {
+            error_setg(errp, "re->opaque != NULL");
+            goto err;
+        }
+
+        if (!(re->atomic_write == false)) {
+            error_setg(errp, "re->atomic_write != false");
+            goto err;
+        }
+
+        if (re->sg.flags & NVME_SG_ALLOC) {
+            error_setg(errp, "unexpected NVME_SG_ALLOC flag in re->sg.flags");
+            goto err;
+        }
+    }
+
+    /* wait until all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
+    for (i = 0; i < n->num_queues; i++) {
+        NvmeRequest *req;
+        NvmeSQueue *sq = n->sq[i];
+
+        if (!sq) {
+            continue;
+        }
+
+        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
+
+wait_out_reqs:
+        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
+            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
+                cpu_relax();
+                goto wait_out_reqs;
+            }
+        }
+
+        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
+    }
+
+    /* wait until all IO request completions are written to guest memory */
+    for (i = 0; i < n->num_queues; i++) {
+        NvmeCQueue *cq = n->cq[i];
+
+        if (!cq) {
+            continue;
+        }
+
+        trace_pci_nvme_pre_save_cq_req_drain_wait(n, i, cq->head, cq->tail, cq->size);
+
+        while (!QTAILQ_EMPTY(&cq->req_list)) {
+            /*
+             * nvme_post_cqes() can't do its job of cleaning cq->req_list
+             * when the CQ is full, which means we would need to save what
+             * is in cq->req_list and restore it on VM resume.
+             *
+             * The good thing is that this can only happen when the guest
+             * hasn't processed the CQ for a long time while many SQEs are
+             * in flight.
+             *
+             * For now, let's just block migration in this rare case.
+             */
+            if (nvme_cq_full(cq)) {
+                error_setg(errp, "no free space in CQ (not supported)");
+                goto err;
+            }
+
+            cpu_relax();
+        }
+
+        trace_pci_nvme_pre_save_cq_req_drain_wait_end(n, i, cq->head, cq->tail);
+    }
+
+    for (uint32_t nsid = 0; nsid <= NVME_MAX_NAMESPACES; nsid++) {
+        NvmeNamespace *ns = n->namespaces[nsid];
+
+        if (!ns) {
+            continue;
+        }
+
+        if (ns != &n->namespace) {
+            error_setg(errp, "only one NVMe namespace is supported for migration");
+            goto err;
+        }
+    }
+
+    return true;
+
+err:
+    /* restore sq processing back to normal */
+    qatomic_set(&n->stop_processing_sq, false);
+    return false;
+}
+
+static bool nvme_ctrl_post_load(void *opaque, int version_id, Error **errp)
+{
+    NvmeCtrl *n = opaque;
+    int i;
+
+    trace_pci_nvme_post_load_enter(n);
+
+    /* restore CQs first */
+    for (i = 0; i < n->num_queues; i++) {
+        NvmeCQueue *cq = n->cq[i];
+
+        if (!cq) {
+            continue;
+        }
+
+        cq->ctrl = n;
+        nvme_restore_cq(cq);
+        trace_pci_nvme_post_load_restore_cq(n, i, cq->head, cq->tail, cq->size);
+
+        if (i == 0) {
+            /*
+             * The admin CQ lives in n->admin_cq, so the memory allocated
+             * for it in get_ptrs_array_entry() is no longer needed.
+             *
+             * nvme_restore_cq() also takes care of:
+             * n->cq[0] = &n->admin_cq;
+             * so n->cq[0] remains valid.
+             */
+            g_free(cq);
+        }
+    }
+
+    for (i = 0; i < n->num_queues; i++) {
+        NvmeSQueue *sq = n->sq[i];
+
+        if (!sq) {
+            continue;
+        }
+
+        sq->ctrl = n;
+        nvme_restore_sq(sq);
+        trace_pci_nvme_post_load_restore_sq(n, i, sq->head, sq->tail, sq->size);
+
+        if (i == 0) {
+            /* same as for CQ */
+            g_free(sq);
+        }
+    }
+
+    if (n->aer_queued) {
+        NvmeAsyncEvent *event;
+
+        QTAILQ_FOREACH(event, &n->aer_queue, entry) {
+            trace_pci_nvme_post_load_aer(event->result.event_type, event->result.event_info,
+                                         event->result.log_page);
+        }
+    }
+
+    for (i = 0; i < n->outstanding_aers; i++) {
+        NvmeSQueue *sq = &n->admin_sq;
+        NvmeRequest *req_from = n->aer_reqs[i];
+        NvmeRequest *req;
+
+        /*
+         * We use the nvme_vmstate VMStateDescription to save/restore
+         * NvmeRequest structures, but the tricky part is that memory
+         * for each n->aer_reqs[i] is allocated separately during
+         * restore, which is not what we want here. Instead, we take
+         * an existing NvmeRequest structure from the SQ's req_list
+         * and fill it with data from the newly allocated one (req_from).
+         * Then we can safely release the memory allocated for req_from.
+         */
+
+        /* take an NvmeRequest struct from SQ */
+        req = QTAILQ_FIRST(&sq->req_list);
+        QTAILQ_REMOVE(&sq->req_list, req, entry);
+        QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
+        nvme_req_clear(req);
+
+        /* copy data from the source NvmeRequest */
+        req->status = req_from->status;
+        memcpy(&req->cqe, &req_from->cqe, sizeof(NvmeCqe));
+        memcpy(&req->cmd, &req_from->cmd, sizeof(NvmeCmd));
+
+        n->aer_reqs[i] = req;
+        g_free(req_from);
+    }
+
+    /*
+     * We need to attach namespaces (currently, only one namespace is
+     * supported for migration).
+     * This logic comes from nvme_start_ctrl().
+     */
+    for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+        NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i);
+
+        if (!ns || (!ns->params.shared && ns->ctrl != n)) {
+            continue;
+        }
+
+        if (nvme_csi_supported(n, ns->csi) && !ns->params.detached) {
+            if (!ns->attached || ns->params.shared) {
+                nvme_attach_ns(n, ns);
+            }
+        }
+    }
+
+    /* schedule SQ processing */
+    for (i = 0; i < n->num_queues; i++) {
+        NvmeSQueue *sq = n->sq[i];
+
+        if (!sq) {
+            continue;
+        }
+
+        qemu_bh_schedule(sq->bh);
+    }
+
+    /*
+     * We ensured in pre_save() that cq->req_list was empty,
+     * so we don't need to schedule BH for CQ processing.
+     */
+
+    return true;
+}
+
+static const VMStateDescription nvme_vmstate_bar = {
+    .name = "nvme-bar",
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT64(cap, NvmeBar),
+        VMSTATE_UINT32(vs, NvmeBar),
+        VMSTATE_UINT32(intms, NvmeBar),
+        VMSTATE_UINT32(intmc, NvmeBar),
+        VMSTATE_UINT32(cc, NvmeBar),
+        VMSTATE_UINT8_ARRAY(rsvd24, NvmeBar, 4),
+        VMSTATE_UINT32(csts, NvmeBar),
+        VMSTATE_UINT32(nssr, NvmeBar),
+        VMSTATE_UINT32(aqa, NvmeBar),
+        VMSTATE_UINT64(asq, NvmeBar),
+        VMSTATE_UINT64(acq, NvmeBar),
+        VMSTATE_UINT32(cmbloc, NvmeBar),
+        VMSTATE_UINT32(cmbsz, NvmeBar),
+        VMSTATE_UINT32(bpinfo, NvmeBar),
+        VMSTATE_UINT32(bprsel, NvmeBar),
+        VMSTATE_UINT64(bpmbl, NvmeBar),
+        VMSTATE_UINT64(cmbmsc, NvmeBar),
+        VMSTATE_UINT32(cmbsts, NvmeBar),
+        VMSTATE_UINT8_ARRAY(rsvd92, NvmeBar, 3492),
+        VMSTATE_UINT32(pmrcap, NvmeBar),
+        VMSTATE_UINT32(pmrctl, NvmeBar),
+        VMSTATE_UINT32(pmrsts, NvmeBar),
+        VMSTATE_UINT32(pmrebs, NvmeBar),
+        VMSTATE_UINT32(pmrswtp, NvmeBar),
+        VMSTATE_UINT32(pmrmscl, NvmeBar),
+        VMSTATE_UINT32(pmrmscu, NvmeBar),
+        VMSTATE_UINT8_ARRAY(css, NvmeBar, 484),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static const VMStateDescription nvme_vmstate_cqueue = {
+    .name = "nvme-cq",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT8(phase, NvmeCQueue),
+        VMSTATE_UINT16(cqid, NvmeCQueue),
+        VMSTATE_UINT16(irq_enabled, NvmeCQueue),
+        VMSTATE_UINT32(head, NvmeCQueue),
+        VMSTATE_UINT32(tail, NvmeCQueue),
+        VMSTATE_UINT32(vector, NvmeCQueue),
+        VMSTATE_UINT32(size, NvmeCQueue),
+        VMSTATE_UINT64(dma_addr, NvmeCQueue),
+        /* db_addr, ei_addr, etc will be recalculated */
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_squeue = {
+    .name = "nvme-sq",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT16(sqid, NvmeSQueue),
+        VMSTATE_UINT16(cqid, NvmeSQueue),
+        VMSTATE_UINT32(head, NvmeSQueue),
+        VMSTATE_UINT32(tail, NvmeSQueue),
+        VMSTATE_UINT32(size, NvmeSQueue),
+        VMSTATE_UINT64(dma_addr, NvmeSQueue),
+        /* db_addr, ei_addr, etc will be recalculated */
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_async_event_result = {
+    .name = "nvme-async-event-result",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT8(event_type, NvmeAerResult),
+        VMSTATE_UINT8(event_info, NvmeAerResult),
+        VMSTATE_UINT8(log_page, NvmeAerResult),
+        VMSTATE_UINT8(resv, NvmeAerResult),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_async_event = {
+    .name = "nvme-async-event",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_STRUCT(result, NvmeAsyncEvent, 0, nvme_vmstate_async_event_result, NvmeAerResult),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_cqe = {
+    .name = "nvme-cqe",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT32(result, NvmeCqe),
+        VMSTATE_UINT32(dw1, NvmeCqe),
+        VMSTATE_UINT16(sq_head, NvmeCqe),
+        VMSTATE_UINT16(sq_id, NvmeCqe),
+        VMSTATE_UINT16(cid, NvmeCqe),
+        VMSTATE_UINT16(status, NvmeCqe),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_cmd_dptr_sgl = {
+    .name = "nvme-request-cmd-dptr-sgl",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT64(addr, NvmeSglDescriptor),
+        VMSTATE_UINT32(len, NvmeSglDescriptor),
+        VMSTATE_UINT8_ARRAY(rsvd, NvmeSglDescriptor, 3),
+        VMSTATE_UINT8(type, NvmeSglDescriptor),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_cmd_dptr = {
+    .name = "nvme-request-cmd-dptr",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT64(prp1, NvmeCmdDptr),
+        VMSTATE_UINT64(prp2, NvmeCmdDptr),
+        VMSTATE_STRUCT(sgl, NvmeCmdDptr, 0, nvme_vmstate_cmd_dptr_sgl, NvmeSglDescriptor),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_cmd = {
+    .name = "nvme-request-cmd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT8(opcode, NvmeCmd),
+        VMSTATE_UINT8(flags, NvmeCmd),
+        VMSTATE_UINT16(cid, NvmeCmd),
+        VMSTATE_UINT32(nsid, NvmeCmd),
+        VMSTATE_UINT64(res1, NvmeCmd),
+        VMSTATE_UINT64(mptr, NvmeCmd),
+        VMSTATE_STRUCT(dptr, NvmeCmd, 0, nvme_vmstate_cmd_dptr, NvmeCmdDptr),
+        VMSTATE_UINT32(cdw10, NvmeCmd),
+        VMSTATE_UINT32(cdw11, NvmeCmd),
+        VMSTATE_UINT32(cdw12, NvmeCmd),
+        VMSTATE_UINT32(cdw13, NvmeCmd),
+        VMSTATE_UINT32(cdw14, NvmeCmd),
+        VMSTATE_UINT32(cdw15, NvmeCmd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_request = {
+    .name = "nvme-request",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT16(status, NvmeRequest),
+        VMSTATE_STRUCT(cqe, NvmeRequest, 0, nvme_vmstate_cqe, NvmeCqe),
+        VMSTATE_STRUCT(cmd, NvmeRequest, 0, nvme_vmstate_cmd, NvmeCmd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_hbs = {
+    .name = "nvme-hbs",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT8(acre, NvmeHostBehaviorSupport),
+        VMSTATE_UINT8(etdas, NvmeHostBehaviorSupport),
+        VMSTATE_UINT8(lbafee, NvmeHostBehaviorSupport),
+        VMSTATE_UINT8(rsvd3, NvmeHostBehaviorSupport),
+        VMSTATE_UINT16(cdfe, NvmeHostBehaviorSupport),
+        VMSTATE_UINT8_ARRAY(rsvd6, NvmeHostBehaviorSupport, 506),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+const VMStateDescription nvme_vmstate_atomic = {
+    .name = "nvme-atomic",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT32(atomic_max_write_size, NvmeAtomic),
+        VMSTATE_UINT64(atomic_boundary, NvmeAtomic),
+        VMSTATE_UINT64(atomic_nabo, NvmeAtomic),
+        VMSTATE_BOOL(atomic_writes, NvmeAtomic),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription nvme_vmstate = {
     .name = "nvme",
-    .unmigratable = 1,
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .pre_save_errp = nvme_ctrl_pre_save,
+    .post_load_errp = nvme_ctrl_post_load,
+    .fields = (const VMStateField[]) {
+        VMSTATE_PCI_DEVICE(parent_obj, NvmeCtrl),
+        VMSTATE_MSIX(parent_obj, NvmeCtrl),
+        VMSTATE_STRUCT(bar, NvmeCtrl, 0, nvme_vmstate_bar, NvmeBar),
+
+        VMSTATE_BOOL(qs_created, NvmeCtrl),
+        VMSTATE_UINT32(page_size, NvmeCtrl),
+        VMSTATE_UINT16(page_bits, NvmeCtrl),
+        VMSTATE_UINT16(max_prp_ents, NvmeCtrl),
+        VMSTATE_UINT32(max_q_ents, NvmeCtrl),
+        VMSTATE_UINT8(outstanding_aers, NvmeCtrl),
+        VMSTATE_UINT32(irq_status, NvmeCtrl),
+        VMSTATE_INT32(cq_pending, NvmeCtrl),
+
+        VMSTATE_UINT64(host_timestamp, NvmeCtrl),
+        VMSTATE_UINT64(timestamp_set_qemu_clock_ms, NvmeCtrl),
+        VMSTATE_UINT64(starttime_ms, NvmeCtrl),
+        VMSTATE_UINT16(temperature, NvmeCtrl),
+        VMSTATE_UINT8(smart_critical_warning, NvmeCtrl),
+
+        VMSTATE_UINT32(conf_msix_qsize, NvmeCtrl),
+        VMSTATE_UINT32(conf_ioqpairs, NvmeCtrl),
+        VMSTATE_UINT64(dbbuf_dbs, NvmeCtrl),
+        VMSTATE_UINT64(dbbuf_eis, NvmeCtrl),
+        VMSTATE_BOOL(dbbuf_enabled, NvmeCtrl),
+
+        VMSTATE_UINT8(aer_mask, NvmeCtrl),
+        VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT8_ALLOC(
+            aer_reqs, NvmeCtrl, outstanding_aers, 0, nvme_vmstate_request, NvmeRequest),
+        VMSTATE_QTAILQ_V(aer_queue, NvmeCtrl, 1, nvme_vmstate_async_event,
+                         NvmeAsyncEvent, entry),
+        VMSTATE_INT32(aer_queued, NvmeCtrl),
+
+        VMSTATE_STRUCT(namespace, NvmeCtrl, 0, nvme_vmstate_ns, NvmeNamespace),
+
+        VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT32_ALLOC(
+            sq, NvmeCtrl, num_queues, 0, nvme_vmstate_squeue, NvmeSQueue),
+        VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT32_ALLOC(
+            cq, NvmeCtrl, num_queues, 0, nvme_vmstate_cqueue, NvmeCQueue),
+
+        VMSTATE_UINT16(features.temp_thresh_hi, NvmeCtrl),
+        VMSTATE_UINT16(features.temp_thresh_low, NvmeCtrl),
+        VMSTATE_UINT32(features.async_config, NvmeCtrl),
+        VMSTATE_STRUCT(features.hbs, NvmeCtrl, 0, nvme_vmstate_hbs, NvmeHostBehaviorSupport),
+
+        VMSTATE_UINT32(dn, NvmeCtrl),
+        VMSTATE_STRUCT(atomic, NvmeCtrl, 0, nvme_vmstate_atomic, NvmeAtomic),
+
+        VMSTATE_END_OF_LIST()
+    },
 };
 
 static void nvme_class_init(ObjectClass *oc, const void *data)
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 38f86a17268..dd374677078 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -20,6 +20,7 @@
 #include "qemu/bitops.h"
 #include "system/system.h"
 #include "system/block-backend.h"
+#include "migration/vmstate.h"
 
 #include "nvme.h"
 #include "trace.h"
@@ -886,6 +887,164 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
     }
 }
 
+static const VMStateDescription nvme_vmstate_lbaf = {
+    .name = "nvme_lbaf",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT16(ms, NvmeLBAF),
+        VMSTATE_UINT8(ds, NvmeLBAF),
+        VMSTATE_UINT8(rp, NvmeLBAF),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_id_ns = {
+    .name = "nvme_id_ns",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT64(nsze, NvmeIdNs),
+        VMSTATE_UINT64(ncap, NvmeIdNs),
+        VMSTATE_UINT64(nuse, NvmeIdNs),
+        VMSTATE_UINT8(nsfeat, NvmeIdNs),
+        VMSTATE_UINT8(nlbaf, NvmeIdNs),
+        VMSTATE_UINT8(flbas, NvmeIdNs),
+        VMSTATE_UINT8(mc, NvmeIdNs),
+        VMSTATE_UINT8(dpc, NvmeIdNs),
+        VMSTATE_UINT8(dps, NvmeIdNs),
+        VMSTATE_UINT8(nmic, NvmeIdNs),
+        VMSTATE_UINT8(rescap, NvmeIdNs),
+        VMSTATE_UINT8(fpi, NvmeIdNs),
+        VMSTATE_UINT8(dlfeat, NvmeIdNs),
+        VMSTATE_UINT16(nawun, NvmeIdNs),
+        VMSTATE_UINT16(nawupf, NvmeIdNs),
+        VMSTATE_UINT16(nacwu, NvmeIdNs),
+        VMSTATE_UINT16(nabsn, NvmeIdNs),
+        VMSTATE_UINT16(nabo, NvmeIdNs),
+        VMSTATE_UINT16(nabspf, NvmeIdNs),
+        VMSTATE_UINT16(noiob, NvmeIdNs),
+        VMSTATE_UINT8_ARRAY(nvmcap, NvmeIdNs, 16),
+        VMSTATE_UINT16(npwg, NvmeIdNs),
+        VMSTATE_UINT16(npwa, NvmeIdNs),
+        VMSTATE_UINT16(npdg, NvmeIdNs),
+        VMSTATE_UINT16(npda, NvmeIdNs),
+        VMSTATE_UINT16(nows, NvmeIdNs),
+        VMSTATE_UINT16(mssrl, NvmeIdNs),
+        VMSTATE_UINT32(mcl, NvmeIdNs),
+        VMSTATE_UINT8(msrc, NvmeIdNs),
+        VMSTATE_UINT8_ARRAY(rsvd81, NvmeIdNs, 18),
+        VMSTATE_UINT8(nsattr, NvmeIdNs),
+        VMSTATE_UINT16(nvmsetid, NvmeIdNs),
+        VMSTATE_UINT16(endgid, NvmeIdNs),
+        VMSTATE_UINT8_ARRAY(nguid, NvmeIdNs, 16),
+        VMSTATE_UINT64(eui64, NvmeIdNs),
+        VMSTATE_STRUCT_ARRAY(lbaf, NvmeIdNs, NVME_MAX_NLBAF, 1,
+                             nvme_vmstate_lbaf, NvmeLBAF),
+        VMSTATE_UINT8_ARRAY(vs, NvmeIdNs, 3712),
+
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_id_ns_nvm = {
+    .name = "nvme_id_ns_nvm",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT64(lbstm, NvmeIdNsNvm),
+        VMSTATE_UINT8(pic, NvmeIdNsNvm),
+        VMSTATE_UINT8_ARRAY(rsvd9, NvmeIdNsNvm, 3),
+        VMSTATE_UINT32_ARRAY(elbaf, NvmeIdNsNvm, NVME_MAX_NLBAF),
+        VMSTATE_UINT32(npdgl, NvmeIdNsNvm),
+        VMSTATE_UINT32(nprg, NvmeIdNsNvm),
+        VMSTATE_UINT32(npra, NvmeIdNsNvm),
+        VMSTATE_UINT32(nors, NvmeIdNsNvm),
+        VMSTATE_UINT32(npdal, NvmeIdNsNvm),
+        VMSTATE_UINT8_ARRAY(rsvd288, NvmeIdNsNvm, 3808),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription nvme_vmstate_id_ns_ind = {
+    .name = "nvme_id_ns_ind",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT8(nsfeat, NvmeIdNsInd),
+        VMSTATE_UINT8(nmic, NvmeIdNsInd),
+        VMSTATE_UINT8(rescap, NvmeIdNsInd),
+        VMSTATE_UINT8(fpi, NvmeIdNsInd),
+        VMSTATE_UINT32(anagrpid, NvmeIdNsInd),
+        VMSTATE_UINT8(nsattr, NvmeIdNsInd),
+        VMSTATE_UINT8(rsvd9, NvmeIdNsInd),
+        VMSTATE_UINT16(nvmsetid, NvmeIdNsInd),
+        VMSTATE_UINT16(endgrpid, NvmeIdNsInd),
+        VMSTATE_UINT8(nstat, NvmeIdNsInd),
+        VMSTATE_UINT8_ARRAY(rsvd15, NvmeIdNsInd, 4081),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+typedef struct TmpNvmeNamespace {
+    NvmeNamespace *parent;
+    bool enable_write_cache;
+} TmpNvmeNamespace;
+
+static bool nvme_ns_tmp_pre_save(void *opaque, Error **errp)
+{
+    TmpNvmeNamespace *tns = opaque;
+
+    tns->enable_write_cache = blk_enable_write_cache(tns->parent->blkconf.blk);
+
+    return true;
+}
+
+static bool nvme_ns_tmp_post_load(void *opaque, int version_id, Error **errp)
+{
+    TmpNvmeNamespace *tns = opaque;
+
+    blk_set_enable_write_cache(tns->parent->blkconf.blk, tns->enable_write_cache);
+
+    return true;
+}
+
+static const VMStateDescription nvme_vmstate_ns_tmp = {
+    .name = "nvme_ns_tmp",
+    .pre_save_errp = nvme_ns_tmp_pre_save,
+    .post_load_errp = nvme_ns_tmp_post_load,
+    .fields = (const VMStateField[]) {
+        VMSTATE_BOOL(enable_write_cache, TmpNvmeNamespace),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+const VMStateDescription nvme_vmstate_ns = {
+    .name = "nvme_ns",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_WITH_TMP(NvmeNamespace, TmpNvmeNamespace, nvme_vmstate_ns_tmp),
+
+        VMSTATE_STRUCT(id_ns, NvmeNamespace, 0, nvme_vmstate_id_ns, NvmeIdNs),
+        VMSTATE_STRUCT(id_ns_nvm, NvmeNamespace, 0, nvme_vmstate_id_ns_nvm, NvmeIdNsNvm),
+        VMSTATE_STRUCT(id_ns_ind, NvmeNamespace, 0, nvme_vmstate_id_ns_ind, NvmeIdNsInd),
+        VMSTATE_STRUCT(lbaf, NvmeNamespace, 0, nvme_vmstate_lbaf, NvmeLBAF),
+        VMSTATE_UINT32(nlbaf, NvmeNamespace),
+        VMSTATE_UINT8(csi, NvmeNamespace),
+        VMSTATE_UINT16(status, NvmeNamespace),
+        VMSTATE_UINT8(pif, NvmeNamespace),
+
+        VMSTATE_UINT16(zns.zrwas, NvmeNamespace),
+        VMSTATE_UINT16(zns.zrwafg, NvmeNamespace),
+        VMSTATE_UINT32(zns.numzrwa, NvmeNamespace),
+
+        VMSTATE_UINT32(features.err_rec, NvmeNamespace),
+        VMSTATE_STRUCT(atomic, NvmeNamespace, 0, nvme_vmstate_atomic, NvmeAtomic),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const Property nvme_ns_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
     DEFINE_PROP_BOOL("detached", NvmeNamespace, params.detached, false),
@@ -937,6 +1096,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
     dc->bus_type = TYPE_NVME_BUS;
     dc->realize = nvme_ns_realize;
     dc->unrealize = nvme_ns_unrealize;
+    dc->vmsd = &nvme_vmstate_ns;
     device_class_set_props(dc, nvme_ns_props);
     dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 457b6637249..03d6aecb7d7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -638,6 +638,7 @@ typedef struct NvmeCtrl {
 
     NvmeNamespace   namespace;
     NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES + 1];
+    uint32_t        num_queues;
     NvmeSQueue      **sq;
     NvmeCQueue      **cq;
     NvmeSQueue      admin_sq;
@@ -669,6 +670,7 @@ typedef struct NvmeCtrl {
 
     /* Migration-related stuff */
     Error *migration_blocker;
+    bool stop_processing_sq;
 } NvmeCtrl;
 
 typedef enum NvmeResetType {
@@ -749,4 +751,7 @@ void nvme_atomic_configure_max_write_size(bool dn, uint16_t awun,
 void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn,
                                        uint16_t nabspf, NvmeAtomic *atomic);
 
+extern const VMStateDescription nvme_vmstate_atomic;
+extern const VMStateDescription nvme_vmstate_ns;
+
 #endif /* HW_NVME_NVME_H */
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 6be0bfa1c1f..8e5544e0008 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -7,6 +7,17 @@ pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64
 pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
 pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
 pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
+pci_nvme_pre_save_enter(void *n) "n=%p"
+pci_nvme_pre_save_ns_drain(void *n, int i) "n=%p i=%d"
+pci_nvme_pre_save_sq_out_req_drain_wait(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_pre_save_sq_out_req_drain_wait_end(void *n, int i, uint32_t head, uint32_t tail) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32""
+pci_nvme_pre_save_cq_req_drain_wait(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_pre_save_cq_req_drain_wait_end(void *n, int i, uint32_t head, uint32_t tail) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32""
+pci_nvme_pre_save_aer(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
+pci_nvme_post_load_enter(void *n) "n=%p"
+pci_nvme_post_load_restore_cq(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_post_load_restore_sq(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32""
+pci_nvme_post_load_aer(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
 pci_nvme_map_sgl(uint8_t typ, uint64_t len) "type 0x%"PRIx8" len %"PRIu64""
 pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" nsid 0x%"PRIx32" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
 pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 28+ messages in thread
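[Editor's note] The AER handling in nvme_ctrl_post_load() above relies on a small pool-recycling trick: rather than keeping the freshly deserialized NvmeRequest, an existing request is borrowed from the queue's free list, the migrated fields are copied into it, and the deserialized copy is freed. A rough sketch of that idea — Python for brevity; the names are illustrative, not QEMU APIs:

```python
from collections import deque

class Request:
    """Stand-in for NvmeRequest."""
    def __init__(self, status=0, cmd=None):
        self.status = status
        self.cmd = cmd

def repool(free_list, out_list, loaded):
    """Borrow a pre-allocated request from the free list, move it to the
    outstanding list, and fill it with the deserialized fields; the
    caller can then drop 'loaded' (g_free() in the real code)."""
    req = free_list.popleft()
    out_list.append(req)
    req.status = loaded.status
    req.cmd = loaded.cmd
    return req

# pool of pre-allocated requests, like sq->req_list in the patch
free_list = deque(Request() for _ in range(4))
out_list = []
restored = repool(free_list, out_list, Request(status=7, cmd='aer'))
```

The point is that pointers held elsewhere (n->aer_reqs[i]) end up referring to pool-owned objects again, exactly as before migration.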

* [PATCH v5 8/8] tests/functional/x86_64: add migration test for NVMe device
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (6 preceding siblings ...)
  2026-03-17 10:27 ` [PATCH v5 7/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-03-17 10:27 ` Alexander Mikhalitsyn
  2026-03-30 11:38 ` [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
  8 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-17 10:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Mikhalitsyn, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Stefan Hajnoczi,
	Hanna Reitz, Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Introduce a very simple test to ensure that NVMe device
migration works fine.

The test plan is simple:
1. prepare VM with NVMe device
2. run workload that produces relatively heavy IO on the device
3. migrate VM
4. ensure that workload is alive and finishes without errors

The test can be run as simply as:
$ meson test 'func-x86_64-nvme_migration' --setup thorough -C build

In the future we can extend this approach and introduce some
fio-based tests. It probably also makes sense to make this test
apply not only to the NVMe device, but also to virtio-{blk,scsi},
IDE, SATA and other migratable devices.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 tests/functional/x86_64/meson.build           |   1 +
 .../functional/x86_64/test_nvme_migration.py  | 159 ++++++++++++++++++
 2 files changed, 160 insertions(+)
 create mode 100755 tests/functional/x86_64/test_nvme_migration.py

diff --git a/tests/functional/x86_64/meson.build b/tests/functional/x86_64/meson.build
index 1ed10ad6c29..fd77f19d726 100644
--- a/tests/functional/x86_64/meson.build
+++ b/tests/functional/x86_64/meson.build
@@ -37,6 +37,7 @@ tests_x86_64_system_thorough = [
   'linux_initrd',
   'multiprocess',
   'netdev_ethtool',
+  'nvme_migration',
   'replay',
   'reverse_debug',
   'tuxrun',
diff --git a/tests/functional/x86_64/test_nvme_migration.py b/tests/functional/x86_64/test_nvme_migration.py
new file mode 100755
index 00000000000..3788a8e3473
--- /dev/null
+++ b/tests/functional/x86_64/test_nvme_migration.py
@@ -0,0 +1,159 @@
+#!/usr/bin/env python3
+#
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# x86_64 NVMe migration test
+
+from migration import MigrationTest
+from qemu_test import QemuSystemTest, Asset
+from qemu_test import wait_for_console_pattern
+from qemu_test import exec_command, exec_command_and_wait_for_pattern
+
+
+class X8664NVMeMigrationTest(MigrationTest):
+    ASSET_KERNEL = Asset(
+        ('https://archives.fedoraproject.org/pub/archive/fedora/linux/releases'
+         '/31/Server/x86_64/os/images/pxeboot/vmlinuz'),
+        'd4738d03dbbe083ca610d0821d0a8f1488bebbdccef54ce33e3adb35fda00129')
+
+    ASSET_INITRD = Asset(
+        ('https://archives.fedoraproject.org/pub/archive/fedora/linux/releases'
+         '/31/Server/x86_64/os/images/pxeboot/initrd.img'),
+        '277cd6c7adf77c7e63d73bbb2cded8ef9e2d3a2f100000e92ff1f8396513cd8b')
+
+    ASSET_DISKIMAGE = Asset(
+        ('https://archives.fedoraproject.org/pub/archive/fedora/linux/releases'
+         '/31/Cloud/x86_64/images/Fedora-Cloud-Base-31-1.9.x86_64.qcow2'),
+        'e3c1b309d9203604922d6e255c2c5d098a309c2d46215d8fc026954f3c5c27a0')
+
+    DEFAULT_KERNEL_PARAMS = ('root=/dev/nvme0n1p1 console=ttyS0 net.ifnames=0 '
+                             'rd.rescue quiet')
+
+    def wait_for_console_pattern(self, success_message, vm):
+        wait_for_console_pattern(
+            self,
+            success_message,
+            failure_message="Kernel panic - not syncing",
+            vm=vm,
+        )
+
+    def exec_command_and_check(self, command, vm):
+        prompt = '# '
+        # Note that commands sent to the console are echoed back, so the word
+        # "FAIL" in the command itself is expected to appear exactly once; the
+        # first wait below consumes that echoed occurrence.
+        exec_command_and_wait_for_pattern(self,
+                                        f"{command} && echo OK || echo FAIL",
+                                        'FAIL', vm=vm)
+        wait_for_console_pattern(self, 'OK', failure_message="FAIL", vm=vm)
+        self.wait_for_console_pattern(prompt, vm)
+
+    def configure_machine(self, vm):
+        kernel_path = self.ASSET_KERNEL.fetch()
+        initrd_path = self.ASSET_INITRD.fetch()
+        diskimage_path = self.ASSET_DISKIMAGE.fetch()
+
+        vm.set_console()
+        vm.add_args("-cpu", "max")
+        vm.add_args("-m", "2G")
+        vm.add_args("-accel", "kvm")
+
+        vm.add_args('-drive',
+                         f'file={diskimage_path},if=none,id=drv0,snapshot=on')
+        vm.add_args('-device', 'nvme,bus=pcie.0,' +
+                         'drive=drv0,id=nvme-disk0,serial=nvmemigratetest,bootindex=1')
+
+        vm.add_args(
+            "-kernel",
+            kernel_path,
+            "-initrd",
+            initrd_path,
+            "-append",
+            self.DEFAULT_KERNEL_PARAMS
+        )
+
+    def launch_source_vm(self, vm):
+        vm.launch()
+
+        self.wait_for_console_pattern('Entering emergency mode.', vm)
+        prompt = '# '
+        self.wait_for_console_pattern(prompt, vm)
+
+        # Synchronize on NVMe driver creating the root device
+        exec_command_and_wait_for_pattern(self,
+                        "while ! (dmesg -c | grep nvme0n1:) ; do sleep 1 ; done",
+                        "nvme0n1", vm=vm)
+        self.wait_for_console_pattern(prompt, vm)
+
+        # prepare system
+        exec_command_and_wait_for_pattern(self, 'mount /dev/nvme0n1p1 /sysroot',
+                                          prompt, vm=vm)
+        exec_command_and_wait_for_pattern(self, 'chroot /sysroot',
+                                          prompt, vm=vm)
+        exec_command_and_wait_for_pattern(self, 'mount -t proc proc /proc',
+                                          prompt, vm=vm)
+        exec_command_and_wait_for_pattern(self, 'mount -t sysfs sysfs /sys',
+                                          prompt, vm=vm)
+
+        # Run the workload before migration to check that it continues to run properly afterwards.
+        #
+        # The workload is simple: it continuously calculates checksums of all files in /usr/bin
+        # to generate some I/O load on the NVMe disk, and at the same time it drops caches to
+        # make sure that some read I/O actually reaches the disk as well.
+        # If there are any issues with the migration of the NVMe device, we should see errors
+        # in dmesg and consequently in the workload log.
+        exec_command_and_wait_for_pattern(self,
+                                        "(while [ ! -f /tmp/test_nvme_migration_workload.stop ]; do \
+                                            rm -f /tmp/test_nvme_migration_workload.iteration_finished; \
+                                            echo 3 > /proc/sys/vm/drop_caches; \
+                                            find /usr/bin -type f -exec cksum {} \\;; \
+                                            touch /tmp/test_nvme_migration_workload.iteration_finished; \
+                                        done) > /dev/null 2> /tmp/test_nvme_migration_workload.errors &",
+                                        prompt, vm=vm)
+        exec_command_and_wait_for_pattern(self, 'echo $! > /tmp/test_nvme_migration_workload.pid',
+                                          prompt, vm=vm)
+
+        # check if process is alive and running
+        self.exec_command_and_check("kill -0 $(cat /tmp/test_nvme_migration_workload.pid)", vm)
+
+    def assert_dest_vm(self, vm):
+        prompt = '# '
+
+        # check that the process is still alive and running after migration; fail the test if not
+        self.exec_command_and_check("kill -0 $(cat /tmp/test_nvme_migration_workload.pid)", vm)
+
+        # signal workload to stop
+        exec_command_and_wait_for_pattern(self, 'touch /tmp/test_nvme_migration_workload.stop',
+                                          prompt, vm=vm)
+
+        # wait for the workload to finish, because we want to examine its log for errors
+        exec_command_and_wait_for_pattern(self,
+                                        "while [ ! -f /tmp/test_nvme_migration_workload.iteration_finished ]; do sleep 1; done;",
+                                        prompt, vm=vm)
+
+        exec_command_and_wait_for_pattern(self, 'cat /tmp/test_nvme_migration_workload.errors',
+                                          prompt, vm=vm)
+
+        # fail the test if non-empty
+        self.exec_command_and_check("[ ! -s /tmp/test_nvme_migration_workload.errors ]", vm)
+
+    def test_migration_with_tcp_localhost(self):
+        self.set_machine('q35')
+        self.require_accelerator("kvm")
+
+        self.migration_with_tcp_localhost()
+
+    def test_migration_with_unix(self):
+        self.set_machine('q35')
+        self.require_accelerator("kvm")
+
+        self.migration_with_unix()
+
+    def test_migration_with_exec(self):
+        self.set_machine('q35')
+        self.require_accelerator("kvm")
+
+        self.migration_with_exec()
+
+
+if __name__ == '__main__':
+    MigrationTest.main()
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 28+ messages in thread
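[Editor's note] The exec_command_and_check() helper in the test above depends on a subtle console property: the guest echoes the command line (which itself contains the word "FAIL") before printing the real OK/FAIL outcome, so the first pattern wait deliberately consumes that echo. A simplified model of the matching logic — illustrative only, not the qemu_test API:

```python
def check_command_result(console_lines):
    """First consume the echoed command line (it contains 'FAIL' exactly
    once), then treat the next OK/FAIL occurrence as the real result."""
    lines = iter(console_lines)
    for line in lines:
        if 'FAIL' in line:
            break                 # echoed command consumed
    for line in lines:
        if 'OK' in line:
            return True           # command succeeded
        if 'FAIL' in line:
            return False          # command failed
    raise RuntimeError('no result pattern seen on the console')

echo = 'true && echo OK || echo FAIL'  # what the guest echoes back
```

Without the first wait, the echoed "FAIL" would be mistaken for a command failure.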

* Re: [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
  2026-03-17 10:27 ` [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC Alexander Mikhalitsyn
@ 2026-03-17 21:39   ` Peter Xu
  2026-03-17 23:30     ` Peter Xu
  2026-03-18 10:05     ` Alexander Mikhalitsyn
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-17 21:39 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: qemu-devel, Fabiano Rosas, Jesper Devantier, Klaus Jensen,
	Stéphane Graber, qemu-block, Stefan Hajnoczi, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

On Tue, Mar 17, 2026 at 11:27:02AM +0100, Alexander Mikhalitsyn wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> 
> Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC, which
> helps to save/restore a dynamic array of pointers to
> structures.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> v2:
> - added VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT8_ALLOC
> v4:
> - almost completely reworked, new flag VMS_ARRAY_OF_POINTER_ALLOW_NULL
>   was introduced as suggested by Peter
> v5:
> - rebased on top of https://lore.kernel.org/all/20260304212303.667141-1-vsementsov@yandex-team.ru/
> ---
>  include/migration/vmstate.h | 77 ++++++++++++++++++++++++++++++-
>  migration/savevm.c          | 26 +++++++++++
>  migration/vmstate-types.c   | 91 +++++++++++++++++++++++++++++++++++++
>  migration/vmstate.c         | 54 ++++++++++++++++++----
>  4 files changed, 236 insertions(+), 12 deletions(-)

Hi, Alexander,

Sorry for the late reply; I didn't respond to patch 2 of your previous
version because I hadn't yet had time to really review it. It took me
some time to figure out the best way to do this.  I still think it
would be good to avoid introducing yet another info like this patch
does.  In general, I want to see if we can reduce the number of
get()/put()/save()/load() implementations rather than add more.

Meanwhile, I saw small issues here and there.

E.g. reusing ->start to cache the object size sounds too tricky.  I am
trying to figure out whether we can always stick with ->size to store
the object size; I'll need some prerequisite patches, though, to
convert the existing VMS_ARRAY_OF_POINTER users so that the ->size
field is reserved for that.

The other thing is that I have a gut feeling this patch may have an
issue with the JSON blob dumped into the migration stream, via the
JSONWriter* object passed into vmstate_save_vmsd_v().

All of this made me feel I'd better go prepare some RFC patches,
because commenting alone would be too slow.

I'll likely send something very soon just to show what I meant, it'll be
good if you can have a look at that when it comes.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
  2026-03-17 21:39   ` Peter Xu
@ 2026-03-17 23:30     ` Peter Xu
  2026-03-18 10:06       ` Alexander Mikhalitsyn
  2026-03-18 10:05     ` Alexander Mikhalitsyn
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-17 23:30 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: qemu-devel, Fabiano Rosas, Jesper Devantier, Klaus Jensen,
	Stéphane Graber, qemu-block, Stefan Hajnoczi, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

On Tue, Mar 17, 2026 at 05:39:54PM -0400, Peter Xu wrote:
> I'll likely send something very soon just to show what I meant, it'll be
> good if you can have a look at that when it comes.

I've posted the series as RFC here:

https://lore.kernel.org/r/20260317232332.15209-1-peterx@redhat.com

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
  2026-03-17 21:39   ` Peter Xu
  2026-03-17 23:30     ` Peter Xu
@ 2026-03-18 10:05     ` Alexander Mikhalitsyn
  1 sibling, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-18 10:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Jesper Devantier, Klaus Jensen,
	Stéphane Graber, qemu-block, Stefan Hajnoczi, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Di., 17. März 2026 um 22:40 Uhr schrieb Peter Xu <peterx@redhat.com>:
>
> On Tue, Mar 17, 2026 at 11:27:02AM +0100, Alexander Mikhalitsyn wrote:
> > From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> >
> > Add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC, which
> > helps to save/restore a dynamic array of pointers to
> > structures.
> >
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
> > v2:
> > - added VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT8_ALLOC
> > v4:
> > - almost completely reworked, new flag VMS_ARRAY_OF_POINTER_ALLOW_NULL
> >   was introduced as suggested by Peter
> > v5:
> > - rebased on top of https://lore.kernel.org/all/20260304212303.667141-1-vsementsov@yandex-team.ru/
> > ---
> >  include/migration/vmstate.h | 77 ++++++++++++++++++++++++++++++-
> >  migration/savevm.c          | 26 +++++++++++
> >  migration/vmstate-types.c   | 91 +++++++++++++++++++++++++++++++++++++
> >  migration/vmstate.c         | 54 ++++++++++++++++++----
> >  4 files changed, 236 insertions(+), 12 deletions(-)
>
> Hi, Alexander,

Hi Peter,

>
> Sorry for a late reply, I didn't reply to your previous version on patch 2
> because I didn't yet get time to really review this patch. It took me some
> time to figure out what's the best way to do this.  I still think it'll be
> good to avoid introducing yet another info like what this patch did.  In
> general, I want to see if we can reduce get()/put()/save()/load() rather
> than adding more.

It took a while for me too, and I've failed :D

>
> Meanwhile, there're small issues here and there I saw.
>
> E.g. reusing ->start to cache the object size sounds too tricky.  I am
> trying to figure out if we can always stick with ->size to store the object
> size; I'll need some prerequisite patches, though, to convert some
> existing VMS_ARRAY_OF_POINTER users to reserve the ->size field for
> that.

Yep, that makes complete sense. I used the same approach as we have
for QLIST/QTAILQ/GTREE, but I agree it's a bit hacky.

>
> The other thing is I had a gut feeling this patch may have issue on the
> JSON blob dumped in the migration stream, via the JSONWriter* object passed
> into vmstate_save_vmsd_v().
>
> All these made me feel like I'd better go prepare some RFC patches, because
> commenting will be too slow.

sure, thank you!

>
> I'll likely send something very soon just to show what I meant, it'll be
> good if you can have a look at that when it comes.

Kind regards,
Alex

>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
  2026-03-17 23:30     ` Peter Xu
@ 2026-03-18 10:06       ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-18 10:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Jesper Devantier, Klaus Jensen,
	Stéphane Graber, qemu-block, Stefan Hajnoczi, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Mi., 18. März 2026 um 00:30 Uhr schrieb Peter Xu <peterx@redhat.com>:
>
> On Tue, Mar 17, 2026 at 05:39:54PM -0400, Peter Xu wrote:
> > I'll likely send something very soon just to show what I meant, it'll be
> > good if you can have a look at that when it comes.
>
> I've posted the series as RFC here:
>
> https://lore.kernel.org/r/20260317232332.15209-1-peterx@redhat.com

HUGE thanks for this, Peter!

I've left my RWB and Tested-by tags on that series.

Kind regards,
Alex

>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 0/8] hw/nvme: add basic live migration support
  2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
                   ` (7 preceding siblings ...)
  2026-03-17 10:27 ` [PATCH v5 8/8] tests/functional/x86_64: add migration test for NVMe device Alexander Mikhalitsyn
@ 2026-03-30 11:38 ` Alexander Mikhalitsyn
  8 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-03-30 11:38 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Peter Xu, Fabiano Rosas, Jesper Devantier, Stéphane Graber,
	qemu-block, Stefan Hajnoczi, Hanna Reitz, Paolo Bonzini,
	Keith Busch, Fam Zheng, Philippe Mathieu-Daudé, Zhao Liu,
	Kevin Wolf, Alexander Mikhalitsyn, qemu-devel

Hi Klaus,

sorry for bothering you with this.

I've fixed my patchset in accordance with all your suggestions from
https://lore.kernel.org/qemu-devel/aaFrB7fMMUPfMP_5@AALNPWKJENSEN.aal.scsc.local/
Could you please take a look at it once again?

Kind regards,
Alex

Am Di., 17. März 2026 um 11:27 Uhr schrieb Alexander Mikhalitsyn
<alexander@mihalicyn.com>:
>
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
>
> Dear friends,
>
> This patchset adds basic live migration support for
> QEMU emulated NVMe device.
>
> Implementation has some limitations:
> - only one NVMe namespace is supported
> - SMART counters are not preserved
> - CMB is not supported
> - PMR is not supported
> - SPDM is not supported
> - SR-IOV is not supported
>
> I believe this is something I can support in next patchset versions or
> separately on-demand (when usecase appears).
>
> Testing.
>
> This patch series was manually tested on:
> - Debian 13.3 VM (kernel 6.12.69+deb13-amd64) using fio on *non-root* NVMe disk
>   (root disk was virtio-scsi):
>
> time fio --name=nvme-verify \
>     --filename=/dev/nvme0n1 \
>     --size=5G \
>     --rw=randwrite \
>     --bs=4k \
>     --iodepth=16 \
>     --numjobs=1 \
>     --direct=0 \
>     --ioengine=io_uring \
>     --verify=crc32c \
>     --verify_fatal=1
>
> - Windows Server 2022 VM (NVMe drive was a *root* disk) with opened browser
>   playing video.
>
> No defects were found.
>
> Git tree:
> https://github.com/mihalicyn/qemu/commits/nvme-live-migration
>
> Changelog for version 5:
> - rebased on top of https://lore.kernel.org/all/20260304212303.667141-1-vsementsov@yandex-team.ru/
>   (as Peter has requested)
>
> Changelog for version 4:
> - vmstate dynamic array support reworked as suggested by Peter Xu
>   VMS_ARRAY_OF_POINTER_ALLOW_NULL flag was introduced
>   qtests were added
> - NVMe migration blockers were reworked as Klaus has requested earlier
>   Now, instead of a "deny list" approach, we have a stricter pattern of
>   NVMe feature filtering, and it should be harder to break migration when
>   adding new NVMe features.
>
> Changelog for version 3:
> - rebased
> - simple functional test was added (in accordance with Klaus Jensen's review comment)
>   $ meson test 'func-x86_64-nvme_migration' --setup thorough -C build
>
> Changelog for version 2:
> - full support for AERs (in-flight requests and queued events too)
>
> Kind regards,
> Alex
>
> Alexander Mikhalitsyn (8):
>   migration/vmstate: export vmstate_{load, save}_field helpers
>   migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC
>   tests/unit/test-vmstate: add tests for VMS_ARRAY_OF_POINTER_ALLOW_NULL
>   tests/functional/migration: add VM launch/configure hooks
>   hw/nvme: add migration blockers for non-supported cases
>   hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
>   hw/nvme: add basic live migration support
>   tests/functional/x86_64: add migration test for NVMe device
>
>  hw/nvme/ctrl.c                                | 874 +++++++++++++++++-
>  hw/nvme/ns.c                                  | 160 ++++
>  hw/nvme/nvme.h                                |   8 +
>  hw/nvme/trace-events                          |  11 +
>  include/block/nvme.h                          |  12 +
>  include/migration/vmstate.h                   |  83 +-
>  migration/savevm.c                            |  26 +
>  migration/vmstate-types.c                     |  91 ++
>  migration/vmstate.c                           |  64 +-
>  tests/functional/migration.py                 |  22 +-
>  tests/functional/x86_64/meson.build           |   1 +
>  .../functional/x86_64/test_nvme_migration.py  | 159 ++++
>  tests/unit/test-vmstate.c                     | 157 ++++
>  13 files changed, 1618 insertions(+), 50 deletions(-)
>  create mode 100755 tests/functional/x86_64/test_nvme_migration.py
>
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases
  2026-03-17 10:27 ` [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
@ 2026-04-07 15:34   ` Stefan Hajnoczi
  2026-04-07 18:39     ` Alexander Mikhalitsyn
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Hajnoczi @ 2026-04-07 15:34 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 1030 bytes --]

On Tue, Mar 17, 2026 at 11:27:05AM +0100, Alexander Mikhalitsyn wrote:
> +static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> +{
> +    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
> +    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
> +    bool adm_cmd_security_checked = false;
> +    bool cmd_io_mgmt_checked = false;
> +    bool cmd_zone_checked = false;
> +
> +    /*
> +     * Idea of this function is simple, we iterate over all Command Sets and
> +     * for each supported command we provide a special handling logic to
> +     * determine if we should block migration or not.
> +     *
> +     * For instance, we have NVME_ADM_CMD_NS_ATTACHMENT and it is always
> +     * available to the guest, but if there is only 1 namespace, then it is
> +     * safe to allow migration, but if there are more, then we need to block
> +     * migration because we don't handle this in migration code yet.

What would be required to migrate multiple namespaces?

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-03-17 10:27 ` [PATCH v5 7/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-04-07 15:48   ` Stefan Hajnoczi
  2026-04-07 19:02     ` Alexander Mikhalitsyn
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Hajnoczi @ 2026-04-07 15:48 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 1030 bytes --]

On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> +    for (i = 0; i < n->num_queues; i++) {
> +        NvmeRequest *req;
> +        NvmeSQueue *sq = n->sq[i];
> +
> +        if (!sq)
> +            continue;
> +
> +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> +
> +wait_out_reqs:
> +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> +                cpu_relax();
> +                goto wait_out_reqs;
> +            }
> +        }
> +
> +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> +    }

Emulated storage controllers usually do not drain requests themselves.
They rely on core migration code (e.g. migration_completion_precopy())
to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
does NVMe busy wait for requests here?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases
  2026-04-07 15:34   ` Stefan Hajnoczi
@ 2026-04-07 18:39     ` Alexander Mikhalitsyn
  2026-04-08  6:32       ` Klaus Jensen
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-07 18:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Di., 7. Apr. 2026 um 17:34 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
>
> On Tue, Mar 17, 2026 at 11:27:05AM +0100, Alexander Mikhalitsyn wrote:
> > +static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > +{
> > +    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
> > +    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
> > +    bool adm_cmd_security_checked = false;
> > +    bool cmd_io_mgmt_checked = false;
> > +    bool cmd_zone_checked = false;
> > +
> > +    /*
> > +     * Idea of this function is simple, we iterate over all Command Sets and
> > +     * for each supported command we provide a special handling logic to
> > +     * determine if we should block migration or not.
> > +     *
> > +     * For instance, we have NVME_ADM_CMD_NS_ATTACHMENT and it is always
> > +     * available to the guest, but if there is only 1 namespace, then it is
> > +     * safe to allow migration, but if there are more, then we need to block
> > +     * migration because we don't handle this in migration code yet.
>

Hi Stefan,

first of all thank you for looking into my patchset.

> What would be required to migrate multiple namespaces?

It should not be very complex, and starting from v4 I've added a
VMStateDescription for nvme_ns, but I'm trying to be extremely careful
here and keep the patchset as small as possible (and easy to review),
because it is my first contribution to QEMU and I'm a newbie with the
NVMe protocol. I wanted to avoid any features that are not absolutely
required in the first approach, and then add the extras later, on
demand.

My current understanding is that to support multiple namespaces we need to:
- add a VMStateDescription for nvme_ns [ done ]
- add logic to handle the (NvmeCtrl)->namespaces field and save info
  about attached namespaces to some temporary field (use
  VMSTATE_WITH_TMP?)
- on restore, iterate over the attached namespaces and call
  nvme_attach_ns(n, .)

Feels like that's it, but I may be missing something. Please let me
know if you believe implementing this is absolutely required, and I'll
look closer and add it in v6.
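
As a rough illustration of the restore step in that list, a post_load
hook could walk the saved attachment state and re-attach each
namespace. Everything below is a simplified stand-in for illustration
only: attach_ns() stands in for the real nvme_attach_ns(), and the
Ctrl struct is not the actual NvmeCtrl:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NS 8

/* Hypothetical stand-in for NvmeCtrl; NSIDs are 1-based. */
typedef struct Ctrl {
    bool attached[MAX_NS + 1];
} Ctrl;

/* Stand-in for nvme_attach_ns(n, ns). */
static void attach_ns(Ctrl *n, uint32_t nsid)
{
    n->attached[nsid] = true;
}

/*
 * post_load sketch: walk the attachment bitmap that pre_save recorded
 * into a temporary field (e.g. via VMSTATE_WITH_TMP) and re-attach
 * each namespace on the destination.
 */
static int post_load_reattach(Ctrl *n, const bool *saved, uint32_t num)
{
    for (uint32_t nsid = 1; nsid <= num; nsid++) {
        if (saved[nsid]) {
            attach_ns(n, nsid);
        }
    }
    return 0;
}
```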

Kind regards,
Alex

>
> Thanks,
> Stefan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-07 15:48   ` Stefan Hajnoczi
@ 2026-04-07 19:02     ` Alexander Mikhalitsyn
  2026-04-08  6:41       ` Klaus Jensen
  2026-04-08 18:27       ` Stefan Hajnoczi
  0 siblings, 2 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-07 19:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
>
> On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > +    for (i = 0; i < n->num_queues; i++) {
> > +        NvmeRequest *req;
> > +        NvmeSQueue *sq = n->sq[i];
> > +
> > +        if (!sq)
> > +            continue;
> > +
> > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > +
> > +wait_out_reqs:
> > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > +                cpu_relax();
> > +                goto wait_out_reqs;
> > +            }
> > +        }
> > +
> > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > +    }
>

Hi Stefan,

> Emulated storage controllers usually do not drain requests themselves.
> They rely on core migration code (e.g. migration_completion_precopy())
> to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> does NVMe busy wait for requests here?

I rely on the core migration code to stop vCPUs and drain requests,
*but* a challenge here is that the concept of an "in-flight" request
in NVMe is not that simple; we have a few different types of in-flight
requests:
- the request was written to the SQ (sq->head != sq->tail) -> this I
  don't even consider in-flight, because we just stop SQ processing,
  and these requests don't require any special handling during
  migration
- the request was taken from the SQ by nvme_process_sq() and now lives
  in sq->out_req_list -> this means we have also initialized
  req->aiocb and submitted I/O for processing in the QEMU block layer.
  After the request is processed, the completion callback is called
  (for read/write requests it is nvme_rw_complete_cb()), then
  nvme_enqueue_req_completion() removes the NvmeRequest from
  sq->out_req_list and puts it into cq->req_list.
  I expect that by the time we enter nvme_ctrl_pre_save(),
  bdrv_drain_all_begin/end() have already been called, all AIO has
  finished, and sq->out_req_list is empty (except for AERs). *But* to
  be on the safe side I also added a busy loop on sq->out_req_list.

So, I tend to agree that this busy wait is probably not required, but
I believe we still need to verify that sq->out_req_list is in fact
empty, because if we messed up, it's better to crash on an assert()
than to have silent data corruption.

After that I have a loop over cq->req_list, and this one is absolutely
required, because we need to write all NvmeRequest results to the CQ
and free the NvmeRequest structures; I didn't want to deal with
NvmeRequest serialization.
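
The first of those two checks (only AERs may remain on a submission
queue's outstanding-request list once draining is done) can be
sketched roughly like this; the struct and helper names are simplified
stand-ins for illustration, not the actual QEMU types from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in opcode value for NVME_ADM_CMD_ASYNC_EV_REQ. */
#define ADM_CMD_ASYNC_EV_REQ 0x0c

/* Simplified stand-ins for NvmeRequest/NvmeSQueue. */
typedef struct Req {
    unsigned char opcode;
    struct Req *next;   /* stand-in for the QTAILQ linkage */
} Req;

typedef struct SQ {
    Req *out_req_list;  /* requests submitted to the block layer */
} SQ;

/*
 * After bdrv_drain_all_begin(), every outstanding request except
 * Asynchronous Event Requests must have completed.  Returns true when
 * the queue is in the state pre_save expects; asserting on the result
 * implements the crash-instead-of-silent-corruption policy.
 */
static bool sq_only_aers_outstanding(const SQ *sq)
{
    for (const Req *req = sq->out_req_list; req; req = req->next) {
        if (req->opcode != ADM_CMD_ASYNC_EV_REQ) {
            return false;
        }
    }
    return true;
}
```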

Kind regards,
Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases
  2026-04-07 18:39     ` Alexander Mikhalitsyn
@ 2026-04-08  6:32       ` Klaus Jensen
  2026-04-08 11:47         ` Stefan Hajnoczi
  0 siblings, 1 reply; 28+ messages in thread
From: Klaus Jensen @ 2026-04-08  6:32 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: Stefan Hajnoczi, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 2490 bytes --]

On Apr  7 20:39, Alexander Mikhalitsyn wrote:
> Am Di., 7. Apr. 2026 um 17:34 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> >
> > On Tue, Mar 17, 2026 at 11:27:05AM +0100, Alexander Mikhalitsyn wrote:
> > > +static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > > +{
> > > +    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
> > > +    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
> > > +    bool adm_cmd_security_checked = false;
> > > +    bool cmd_io_mgmt_checked = false;
> > > +    bool cmd_zone_checked = false;
> > > +
> > > +    /*
> > > +     * Idea of this function is simple, we iterate over all Command Sets and
> > > +     * for each supported command we provide a special handling logic to
> > > +     * determine if we should block migration or not.
> > > +     *
> > > +     * For instance, we have NVME_ADM_CMD_NS_ATTACHMENT and it is always
> > > +     * available to the guest, but if there is only 1 namespace, then it is
> > > +     * safe to allow migration, but if there are more, then we need to block
> > > +     * migration because we don't handle this in migration code yet.
> >
> 
> Hi Stefan,
> 
> first of all thank you for looking into my patchset.
> 
> > What would be required to migrate multiple namespaces?
> 
> It should not be very complex, and starting from v4 I've added
> VMStateDescription for nvme_ns, but I'm trying
> to be extremely careful here, and keep patchset as small as possible
> (and easy to review), because it is my first
> contribution to QEMU and I'm newbie with NVMe protocol. I wanted to
> avoid any non-absolutely-required features
> on the first approach and then add all extras after and on-demand.
> 
> My current understanding is that to support multiple namespaces we need:
> - VMStateDescription for nvme_ns [ done ]
> - add logic to handle (NvmeCtrl)->namespaces field and save info about
> attached namespaces
>   to some temporary field (use VMSTATE_WITH_TMP?)
> - on restore we should iterate over namespaces attached and call
> nvme_attach_ns(n, .)
> 
> Feels like this is it, but I can miss something. Please, let me know
> if you believe that implementing this is absolutely
> required and I'll look closer and add this in v6.
> 

I do not think multi-ns support is a hard requirement. Multi-namespace
support is among the features that are mainly used for driver/host
test and verification purposes.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-07 19:02     ` Alexander Mikhalitsyn
@ 2026-04-08  6:41       ` Klaus Jensen
  2026-04-08 11:31         ` Alexander Mikhalitsyn
  2026-04-08 18:27       ` Stefan Hajnoczi
  1 sibling, 1 reply; 28+ messages in thread
From: Klaus Jensen @ 2026-04-08  6:41 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: Stefan Hajnoczi, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 3538 bytes --]

On Apr  7 21:02, Alexander Mikhalitsyn wrote:
> Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> >
> > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > +    for (i = 0; i < n->num_queues; i++) {
> > > +        NvmeRequest *req;
> > > +        NvmeSQueue *sq = n->sq[i];
> > > +
> > > +        if (!sq)
> > > +            continue;
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > +
> > > +wait_out_reqs:
> > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > +                cpu_relax();
> > > +                goto wait_out_reqs;
> > > +            }
> > > +        }
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > +    }
> >
> 
> Hi Stefan,
> 
> > Emulated storage controllers usually do not drain requests themselves.
> > They rely on core migration code (e.g. migration_completion_precopy())
> > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > does NVMe busy wait for requests here?
> 
> I rely on core migration code to stop vCPUs and drain requests, *but*
> a challenge here is that
> a concept of "in-flight" request in NVMe is not that simple and we
> have a few different types of in-flight requests:
> - request was written in SQ (sq->head != sq->tail) -> this I don't
> even consider as in-flight, because we just stop SQ processing
>   and these requests don't require any special handling during migration
> - request was taken from SQ by nvme_process_sq() and it now lives in
> sq->out_req_list - this means that
>   we have also initialized req->aiocb and submitted IO for processing
> in QEMU block layer. After request is processed, completion callback
>   will be called (for read/write requests it is
> nvme_rw_complete_cb()), then nvme_enqueue_req_completion() will be
> called and remove
>   NvmeRequest from sq->out_req_list and put it into cq->req_list.
>   I expect, that by the time when we enter nvme_ctrl_pre_save(),
> bdrv_drain_all_begin/end() were called and
>   all AIO is finished and sq->out_req_list is empty (except AERs).
> *But* to be on a safe side I also added busy loop on
>   sq->out_req_list.
> 

Correct. When nvme_rw_complete_cb() (and friends) returns, we still
have the completion lying around, not yet posted to the CQ. It is
queued up for a bottom half (BH) to handle, which coalesces CQE
posting.

> So, I tend to agree that this busy wait is probably not required, but
> I believe that we still need to verify that sq->out_req_list
> is in fact empty. Because if we messed up, then it's better to crash
> on assert() than to have silent data corruption.
> 
> Then after I have a loop cq->req_list, and this time it is absolutely
> required because we need to write all NvmeRequest
> results to CQ and free NvmeRequest structure, cause I didn't want to
> deal with NvmeRequest serialization.
> 

There is a subtle catch here: there may not be room in the CQ to post
all CQEs. For example, in the extreme case, the host has allocated a
CQ with room for one entry (size 2) and several deep SQs all
associated with the same CQ. If the controller has nowhere to post
CQEs, then we need to either abort migration (and try again) or
migrate the CQEs.
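
The capacity problem described here reduces to counting free slots in
a circular completion queue where one slot is always left empty to
distinguish full from empty. A minimal sketch, with a hypothetical
helper name rather than the actual hw/nvme code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Free entries in a circular CQ of `size` slots.  One slot stays
 * empty so that head == tail always means "empty", so a size-2 CQ
 * holds at most one CQE -- the extreme case from the discussion.
 * (Hypothetical helper for illustration only.)
 */
static uint32_t cq_free_slots(uint32_t head, uint32_t tail, uint32_t size)
{
    uint32_t used = (tail >= head) ? tail - head : size - (head - tail);
    return size - 1 - used;
}
```

Before flushing pending completions in pre_save, a check like this
would tell the controller whether it must instead abort migration or
migrate the queued CQEs.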

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-08  6:41       ` Klaus Jensen
@ 2026-04-08 11:31         ` Alexander Mikhalitsyn
  2026-04-08 18:35           ` Stefan Hajnoczi
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-08 11:31 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Stefan Hajnoczi, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Mi., 8. Apr. 2026 um 08:41 Uhr schrieb Klaus Jensen <its@irrelevant.dk>:
>
> On Apr  7 21:02, Alexander Mikhalitsyn wrote:
> > Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > >
> > > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > > +    for (i = 0; i < n->num_queues; i++) {
> > > > +        NvmeRequest *req;
> > > > +        NvmeSQueue *sq = n->sq[i];
> > > > +
> > > > +        if (!sq)
> > > > +            continue;
> > > > +
> > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > > +
> > > > +wait_out_reqs:
> > > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > > +                cpu_relax();
> > > > +                goto wait_out_reqs;
> > > > +            }
> > > > +        }
> > > > +
> > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > > +    }
> > >
> >
> > Hi Stefan,
> >
> > > Emulated storage controllers usually do not drain requests themselves.
> > > They rely on core migration code (e.g. migration_completion_precopy())
> > > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > > does NVMe busy wait for requests here?
> >
> > I rely on core migration code to stop vCPUs and drain requests, *but*
> > a challenge here is that
> > a concept of "in-flight" request in NVMe is not that simple and we
> > have a few different types of in-flight requests:
> > - request was written in SQ (sq->head != sq->tail) -> this I don't
> > even consider as in-flight, because we just stop SQ processing
> >   and these requests don't require any special handling during migration
> > - request was taken from SQ by nvme_process_sq() and it now lives in
> > sq->out_req_list - this means that
> >   we have also initialized req->aiocb and submitted IO for processing
> > in QEMU block layer. After request is processed, completion callback
> >   will be called (for read/write requests it is
> > nvme_rw_complete_cb()), then nvme_enqueue_req_completion() will be
> > called and remove
> >   NvmeRequest from sq->out_req_list and put it into cq->req_list.
> >   I expect, that by the time when we enter nvme_ctrl_pre_save(),
> > bdrv_drain_all_begin/end() were called and
> >   all AIO is finished and sq->out_req_list is empty (except AERs).
> > *But* to be on a safe side I also added busy loop on
> >   sq->out_req_list.
> >
>
> Correct. When nvme_rw_complete_cb (and friends) returns, we still have
> the completion laying around, not posted on the CQ. That is queued up
> for a BH to handle to coalesce CQE posting.
>
> > So, I tend to agree that this busy wait is probably not required, but
> > I believe that we still need to verify that sq->out_req_list
> > is in fact empty. Because if we messed up, then it's better to crash
> > on assert() than to have silent data corruption.
> >
> > Then after I have a loop cq->req_list, and this time it is absolutely
> > required because we need to write all NvmeRequest
> > results to CQ and free NvmeRequest structure, cause I didn't want to
> > deal with NvmeRequest serialization.
> >

Hi Klaus,

>
> There is a subtle catch here. There may not be room in the CQ to post
> all CQEs. For example, in the extreme, the host has allocated a CQ with
> room for 1 entry (size 2) and several deep SQs all associated with the
> same CQ. If the controller has nowhere to post CQEs, then we need to
> either abort migration (and try again) or migrate the CQEs.

Exactly. This is why I have a second loop, where I ensure that
cq->req_list is empty, and I also check nvme_cq_full(cq) to make sure
that all CQEs can be posted.

Kind regards,
Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases
  2026-04-08  6:32       ` Klaus Jensen
@ 2026-04-08 11:47         ` Stefan Hajnoczi
  2026-04-08 11:50           ` Alexander Mikhalitsyn
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Hajnoczi @ 2026-04-08 11:47 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Alexander Mikhalitsyn, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 3015 bytes --]

On Wed, Apr 08, 2026 at 08:32:38AM +0200, Klaus Jensen wrote:
> On Apr  7 20:39, Alexander Mikhalitsyn wrote:
> > Am Di., 7. Apr. 2026 um 17:34 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > >
> > > On Tue, Mar 17, 2026 at 11:27:05AM +0100, Alexander Mikhalitsyn wrote:
> > > > +static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > > > +{
> > > > +    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
> > > > +    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
> > > > +    bool adm_cmd_security_checked = false;
> > > > +    bool cmd_io_mgmt_checked = false;
> > > > +    bool cmd_zone_checked = false;
> > > > +
> > > > +    /*
> > > > +     * Idea of this function is simple, we iterate over all Command Sets and
> > > > +     * for each supported command we provide a special handling logic to
> > > > +     * determine if we should block migration or not.
> > > > +     *
> > > > +     * For instance, we have NVME_ADM_CMD_NS_ATTACHMENT and it is always
> > > > +     * available to the guest, but if there is only 1 namespace, then it is
> > > > +     * safe to allow migration, but if there are more, then we need to block
> > > > +     * migration because we don't handle this in migration code yet.
> > >
> > 
> > Hi Stefan,
> > 
> > first of all thank you for looking into my patchset.
> > 
> > > What would be required to migrate multiple namespaces?
> > 
> > It should not be very complex, and starting from v4 I've added
> > VMStateDescription for nvme_ns, but I'm trying
> > to be extremely careful here, and keep patchset as small as possible
> > (and easy to review), because it is my first
> > contribution to QEMU and I'm a newbie with the NVMe protocol. I wanted to
> > avoid any non-absolutely-required features
> > on the first approach and then add all extras after and on-demand.
> > 
> > My current understanding is that to support multiple namespaces we need:
> > - VMStateDescription for nvme_ns [ done ]
> > - add logic to handle (NvmeCtrl)->namespaces field and save info about
> > attached namespaces
> >   to some temporary field (use VMSTATE_WITH_TMP?)
> > - on restore we should iterate over namespaces attached and call
> > nvme_attach_ns(n, .)
> > 
> > Feels like this is it, but I can miss something. Please, let me know
> > if you believe that implementing this is absolutely
> > required and I'll look closer and add this in v6.
> > 
> 
> I do not think multi-ns support is a hard requirement. Multi namespace
> support is among the features that are mainly used for driver/host
> test and verification purposes.

Adding it incrementally is fine as long as it can be done in a
compatible way. Most of the time that is possible, although adding
something in a compatible way can be more complicated than implementing
it from the start.

In this case it sounds like the extra state can be added as a subsection
and it won't be too complicated.

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases
  2026-04-08 11:47         ` Stefan Hajnoczi
@ 2026-04-08 11:50           ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-08 11:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Klaus Jensen, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Mi., 8. Apr. 2026 um 13:47 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
>
> On Wed, Apr 08, 2026 at 08:32:38AM +0200, Klaus Jensen wrote:
> > On Apr  7 20:39, Alexander Mikhalitsyn wrote:
> > > Am Di., 7. Apr. 2026 um 17:34 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > > >
> > > > On Tue, Mar 17, 2026 at 11:27:05AM +0100, Alexander Mikhalitsyn wrote:
> > > > > +static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > > > > +{
> > > > > +    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
> > > > > +    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
> > > > > +    bool adm_cmd_security_checked = false;
> > > > > +    bool cmd_io_mgmt_checked = false;
> > > > > +    bool cmd_zone_checked = false;
> > > > > +
> > > > > +    /*
> > > > > +     * Idea of this function is simple, we iterate over all Command Sets and
> > > > > +     * for each supported command we provide a special handling logic to
> > > > > +     * determine if we should block migration or not.
> > > > > +     *
> > > > > +     * For instance, we have NVME_ADM_CMD_NS_ATTACHMENT and it is always
> > > > > +     * available to the guest, but if there is only 1 namespace, then it is
> > > > > +     * safe to allow migration, but if there are more, then we need to block
> > > > > +     * migration because we don't handle this in migration code yet.
> > > >
> > >
> > > Hi Stefan,
> > >
> > > first of all thank you for looking into my patchset.
> > >
> > > > What would be required to migrate multiple namespaces?
> > >
> > > It should not be very complex, and starting from v4 I've added
> > > VMStateDescription for nvme_ns, but I'm trying
> > > to be extremely careful here, and keep patchset as small as possible
> > > (and easy to review), because it is my first
> > > contribution to QEMU and I'm a newbie with the NVMe protocol. I wanted to
> > > avoid any non-absolutely-required features
> > > on the first approach and then add all extras after and on-demand.
> > >
> > > My current understanding is that to support multiple namespaces we need:
> > > - VMStateDescription for nvme_ns [ done ]
> > > - add logic to handle (NvmeCtrl)->namespaces field and save info about
> > > attached namespaces
> > >   to some temporary field (use VMSTATE_WITH_TMP?)
> > > - on restore we should iterate over namespaces attached and call
> > > nvme_attach_ns(n, .)
> > >
> > > Feels like this is it, but I can miss something. Please, let me know
> > > if you believe that implementing this is absolutely
> > > required and I'll look closer and add this in v6.
> > >
> >
> > I do not think multi-ns support is a hard requirement. Multi namespace
> > support is among the features that are mainly used for driver/host
> > test and verification purposes.
>

> Adding it incrementally is fine as long as it can be done in a
> compatible way. Most of the time that is possible, although adding
> something in a compatible way can be more complicated than implementing
> it from the start.

yeah, I understand.

>
> In this case it sounds like the extra state can be added as a subsection
> and it won't be too complicated.

Kind regards,
Alex

>
> Thanks,
> Stefan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-07 19:02     ` Alexander Mikhalitsyn
  2026-04-08  6:41       ` Klaus Jensen
@ 2026-04-08 18:27       ` Stefan Hajnoczi
  2026-04-08 19:55         ` Alexander Mikhalitsyn
  1 sibling, 1 reply; 28+ messages in thread
From: Stefan Hajnoczi @ 2026-04-08 18:27 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 4037 bytes --]

On Tue, Apr 07, 2026 at 09:02:26PM +0200, Alexander Mikhalitsyn wrote:
> Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> >
> > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > +    for (i = 0; i < n->num_queues; i++) {
> > > +        NvmeRequest *req;
> > > +        NvmeSQueue *sq = n->sq[i];
> > > +
> > > +        if (!sq)
> > > +            continue;
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > +
> > > +wait_out_reqs:
> > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > +                cpu_relax();
> > > +                goto wait_out_reqs;
> > > +            }
> > > +        }
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > +    }
> >
> 
> Hi Stefan,
> 
> > Emulated storage controllers usually do not drain requests themselves.
> > They rely on core migration code (e.g. migration_completion_precopy())
> > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > does NVMe busy wait for requests here?
> 
> I rely on core migration code to stop vCPUs and drain requests, *but*
> a challenge here is that
> a concept of "in-flight" request in NVMe is not that simple and we
> have a few different types of in-flight requests:
> - request was written in SQ (sq->head != sq->tail) -> this I don't
> even consider as in-flight, because we just stop SQ processing
>   and these requests don't require any special handling during migration
> - request was taken from SQ by nvme_process_sq() and it now lives in
> sq->out_req_list - this means that
>   we have also initialized req->aiocb and submitted IO for processing
> in QEMU block layer. After request is processed, completion callback
>   will be called (for read/write requests it is
> nvme_rw_complete_cb()), then nvme_enqueue_req_completion() will be
> called and remove
>   NvmeRequest from sq->out_req_list and put it into cq->req_list.
>   I expect, that by the time when we enter nvme_ctrl_pre_save(),
> bdrv_drain_all_begin/end() were called and
>   all AIO is finished and sq->out_req_list is empty (except AERs).
> *But* to be on a safe side I also added busy loop on
>   sq->out_req_list.
> 
> So, I tend to agree that this busy wait is probably not required, but
> I believe that we still need to verify that sq->out_req_list
> is in fact empty. Because if we messed up, then it's better to crash
> on assert() than to have silent data corruption.
> 
> Then after I have a loop cq->req_list, and this time it is absolutely
> required because we need to write all NvmeRequest
> results to CQ and free NvmeRequest structure, cause I didn't want to
> deal with NvmeRequest serialization.

I don't see how the busy wait approach can work since
migration_completion_precopy() holds the Big QEMU Lock
(bql_lock()/bql_unlock()) while .pre_save() is called. The main loop
thread's event loop will not be able to make progress while .pre_save()
is busy waiting.

Based on what you and Klaus have said, how about:
1. assert(QTAILQ_EMPTY(&sq->out_req_list))
2. Call nvme_post_cqes() from .pre_save() to write CQEs back to RAM.
   Whether or not cq->bh is cancelled too doesn't really matter.
3. Full CQs need to be handled. It will be necessary to serialize the
   remainder of cq->req_list. Failing live migration for this
   reason is not acceptable from a user perspective.
   (virtio-scsi and virtio-blk have a mechanism for serializing failed
   I/O requests that need to be restarted after migration. There might
   be a nicer way to do this with vmstate nowadays, but they are at
   least two examples of how to serialize a storage controller's
   internal request state.)

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-08 11:31         ` Alexander Mikhalitsyn
@ 2026-04-08 18:35           ` Stefan Hajnoczi
  2026-04-08 19:59             ` Alexander Mikhalitsyn
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Hajnoczi @ 2026-04-08 18:35 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: Klaus Jensen, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 4720 bytes --]

On Wed, Apr 08, 2026 at 01:31:33PM +0200, Alexander Mikhalitsyn wrote:
> Am Mi., 8. Apr. 2026 um 08:41 Uhr schrieb Klaus Jensen <its@irrelevant.dk>:
> >
> > On Apr  7 21:02, Alexander Mikhalitsyn wrote:
> > > Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > > >
> > > > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > > > +    for (i = 0; i < n->num_queues; i++) {
> > > > > +        NvmeRequest *req;
> > > > > +        NvmeSQueue *sq = n->sq[i];
> > > > > +
> > > > > +        if (!sq)
> > > > > +            continue;
> > > > > +
> > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > > > +
> > > > > +wait_out_reqs:
> > > > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > > > +                cpu_relax();
> > > > > +                goto wait_out_reqs;
> > > > > +            }
> > > > > +        }
> > > > > +
> > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > > > +    }
> > > >
> > >
> > > Hi Stefan,
> > >
> > > > Emulated storage controllers usually do not drain requests themselves.
> > > > They rely on core migration code (e.g. migration_completion_precopy())
> > > > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > > > does NVMe busy wait for requests here?
> > >
> > > I rely on core migration code to stop vCPUs and drain requests, *but*
> > > a challenge here is that
> > > a concept of "in-flight" request in NVMe is not that simple and we
> > > have a few different types of in-flight requests:
> > > - request was written in SQ (sq->head != sq->tail) -> this I don't
> > > even consider as in-flight, because we just stop SQ processing
> > >   and these requests don't require any special handling during migration
> > > - request was taken from SQ by nvme_process_sq() and it now lives in
> > > sq->out_req_list - this means that
> > >   we have also initialized req->aiocb and submitted IO for processing
> > > in QEMU block layer. After request is processed, completion callback
> > >   will be called (for read/write requests it is
> > > nvme_rw_complete_cb()), then nvme_enqueue_req_completion() will be
> > > called and remove
> > >   NvmeRequest from sq->out_req_list and put it into cq->req_list.
> > >   I expect, that by the time when we enter nvme_ctrl_pre_save(),
> > > bdrv_drain_all_begin/end() were called and
> > >   all AIO is finished and sq->out_req_list is empty (except AERs).
> > > *But* to be on a safe side I also added busy loop on
> > >   sq->out_req_list.
> > >
> >
> > Correct. When nvme_rw_complete_cb (and friends) returns, we still have
> > the completion laying around, not posted on the CQ. That is queued up
> > for a BH to handle to coalesce CQE posting.
> >
> > > So, I tend to agree that this busy wait is probably not required, but
> > > I believe that we still need to verify that sq->out_req_list
> > > is in fact empty. Because if we messed up, then it's better to crash
> > > on assert() than to have silent data corruption.
> > >
> > > Then after I have a loop cq->req_list, and this time it is absolutely
> > > required because we need to write all NvmeRequest
> > > results to CQ and free NvmeRequest structure, cause I didn't want to
> > > deal with NvmeRequest serialization.
> > >
> 
> Hi Klaus,
> 
> >
> > There is a subtle catch here. There may not be room in the CQ to post
> > all CQEs. For example, in the extreme, the host has allocated a CQ with
> > room for 1 entry (size 2) and several deep SQs all associated with the
> > same CQ. If the controller has nowhere to post CQEs, then we need to
> > either abort migration (and try again) or migrate the CQEs.
> 
> Exactly. This is why I have a next loop, where I ensure that
> cq->req_list is empty
> and also check nvme_cq_full(cq) to ensure that all CQEs can be posted.

Hi Alexander,
nvme_cq_full(cq) will never become false because the migration thread
holds the Big QEMU Lock while calling .pre_save(). The busy loop will
hang forever. See my reply to an earlier email in this thread for
details.

A qtest test case would be useful here (see tests/qtest/nvme-test.c) to
fill the CQ and live migrate. It might be tricky to deterministically
run .pre_save() (via the `migrate` QMP command) before nvme_post_cqes()
gets called, but then you'd have a test case that covers the CQ full
code path.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-08 18:27       ` Stefan Hajnoczi
@ 2026-04-08 19:55         ` Alexander Mikhalitsyn
  2026-04-09 13:36           ` Stefan Hajnoczi
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-08 19:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Mi., 8. Apr. 2026 um 20:27 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
>
> On Tue, Apr 07, 2026 at 09:02:26PM +0200, Alexander Mikhalitsyn wrote:
> > Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > >
> > > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > > +    for (i = 0; i < n->num_queues; i++) {
> > > > +        NvmeRequest *req;
> > > > +        NvmeSQueue *sq = n->sq[i];
> > > > +
> > > > +        if (!sq)
> > > > +            continue;
> > > > +
> > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > > +
> > > > +wait_out_reqs:
> > > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > > +                cpu_relax();
> > > > +                goto wait_out_reqs;
> > > > +            }
> > > > +        }
> > > > +
> > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > > +    }
> > >
> >
> > Hi Stefan,
> >
> > > Emulated storage controllers usually do not drain requests themselves.
> > > They rely on core migration code (e.g. migration_completion_precopy())
> > > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > > does NVMe busy wait for requests here?
> >
> > I rely on core migration code to stop vCPUs and drain requests, *but*
> > a challenge here is that
> > a concept of "in-flight" request in NVMe is not that simple and we
> > have a few different types of in-flight requests:
> > - request was written in SQ (sq->head != sq->tail) -> this I don't
> > even consider as in-flight, because we just stop SQ processing
> >   and these requests don't require any special handling during migration
> > - request was taken from SQ by nvme_process_sq() and it now lives in
> > sq->out_req_list - this means that
> >   we have also initialized req->aiocb and submitted IO for processing
> > in QEMU block layer. After request is processed, completion callback
> >   will be called (for read/write requests it is
> > nvme_rw_complete_cb()), then nvme_enqueue_req_completion() will be
> > called and remove
> >   NvmeRequest from sq->out_req_list and put it into cq->req_list.
> >   I expect, that by the time when we enter nvme_ctrl_pre_save(),
> > bdrv_drain_all_begin/end() were called and
> >   all AIO is finished and sq->out_req_list is empty (except AERs).
> > *But* to be on a safe side I also added busy loop on
> >   sq->out_req_list.
> >
> > So, I tend to agree that this busy wait is probably not required, but
> > I believe that we still need to verify that sq->out_req_list
> > is in fact empty. Because if we messed up, then it's better to crash
> > on assert() than to have silent data corruption.
> >
> > Then after I have a loop cq->req_list, and this time it is absolutely
> > required because we need to write all NvmeRequest
> > results to CQ and free NvmeRequest structure, cause I didn't want to
> > deal with NvmeRequest serialization.
>
> I don't see how the busy wait approach can work since
> migration_completion_precopy() holds the Big QEMU Lock
> (bql_lock()/bql_unlock()) while .pre_save() is called. The main loop
> thread's event loop will not be able to make progress while .pre_save()
> is busy waiting.

Yes, my bad. Good catch! You are absolutely right.

This is an especially stupid mistake on my side, given that I *knew*
that we hold the BQL in this context, because in the first version of
this patchset I *was* taking it into account:
https://lore.kernel.org/qemu-devel/20260217152517.271422-5-alexander@mihalicyn.com/

see comment before qemu_bh_cancel(n->admin_cq.bh).

Thanks, Stefan! ;)

>
> Based on what you and Klaus have said, how about:
> 1. assert(QTAILQ_EMPTY(&sq->out_req_list))
> 2. Call nvme_post_cqes() from .pre_save() to write CQEs back to RAM.
>    Whether or not cq->bh is cancelled too doesn't really matter.
> 3. Full CQs need to be handled. It will be necessary to serialize the
>    remainder of cq->req_list. Failing live migration for this
>    reason is not acceptable from a user perspective.
>    (virtio-scsi and virtio-blk have a mechanism for serializing failed
>    I/O requests that need to be restarted after migration. There might
>    be a nicer way to do this with vmstate nowadays, but they are at
>    least two examples of how to serialize a storage controller's
>    internal request state.)

Ok, I understand. So your point is that we should find a way to
serialize information about I/O request completions if they can't be
posted to the CQ. Let me take a closer look into this matter.
I did something similar for AERs.

Kind regards,
Alex

>
> Stefan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-08 18:35           ` Stefan Hajnoczi
@ 2026-04-08 19:59             ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-08 19:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Klaus Jensen, qemu-devel, Peter Xu, Fabiano Rosas,
	Jesper Devantier, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

Am Mi., 8. Apr. 2026 um 20:35 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
>
> On Wed, Apr 08, 2026 at 01:31:33PM +0200, Alexander Mikhalitsyn wrote:
> > Am Mi., 8. Apr. 2026 um 08:41 Uhr schrieb Klaus Jensen <its@irrelevant.dk>:
> > >
> > > On Apr  7 21:02, Alexander Mikhalitsyn wrote:
> > > > Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > > > >
> > > > > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > > > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > > > > +    for (i = 0; i < n->num_queues; i++) {
> > > > > > +        NvmeRequest *req;
> > > > > > +        NvmeSQueue *sq = n->sq[i];
> > > > > > +
> > > > > > +        if (!sq)
> > > > > > +            continue;
> > > > > > +
> > > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > > > > +
> > > > > > +wait_out_reqs:
> > > > > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > > > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > > > > +                cpu_relax();
> > > > > > +                goto wait_out_reqs;
> > > > > > +            }
> > > > > > +        }
> > > > > > +
> > > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > > > > +    }
> > > > >
> > > >
> > > > Hi Stefan,
> > > >
> > > > > Emulated storage controllers usually do not drain requests themselves.
> > > > > They rely on core migration code (e.g. migration_completion_precopy())
> > > > > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > > > > does NVMe busy wait for requests here?
> > > >
> > > > I rely on core migration code to stop vCPUs and drain requests, *but*
> > > > a challenge here is that
> > > > a concept of "in-flight" request in NVMe is not that simple and we
> > > > have a few different types of in-flight requests:
> > > > - request was written in SQ (sq->head != sq->tail) -> this I don't
> > > > even consider as in-flight, because we just stop SQ processing
> > > >   and these requests don't require any special handling during migration
> > > > - request was taken from SQ by nvme_process_sq() and it now lives in
> > > > sq->out_req_list - this means that
> > > >   we have also initialized req->aiocb and submitted IO for processing
> > > > in QEMU block layer. After request is processed, completion callback
> > > >   will be called (for read/write requests it is
> > > > nvme_rw_complete_cb()), then nvme_enqueue_req_completion() will be
> > > > called and remove
> > > >   NvmeRequest from sq->out_req_list and put it into cq->req_list.
> > > >   I expect, that by the time when we enter nvme_ctrl_pre_save(),
> > > > bdrv_drain_all_begin/end() were called and
> > > >   all AIO is finished and sq->out_req_list is empty (except AERs).
> > > > *But* to be on a safe side I also added busy loop on
> > > >   sq->out_req_list.
> > > >
> > >
> > > Correct. When nvme_rw_complete_cb (and friends) returns, we still have
> > > the completion laying around, not posted on the CQ. That is queued up
> > > for a BH to handle to coalesce CQE posting.
> > >
> > > > So, I tend to agree that this busy wait is probably not required, but
> > > > I believe that we still need to verify that sq->out_req_list
> > > > is in fact empty. Because if we messed up, then it's better to crash
> > > > on assert() than to have silent data corruption.
> > > >
> > > > Then after I have a loop cq->req_list, and this time it is absolutely
> > > > required because we need to write all NvmeRequest
> > > > results to CQ and free NvmeRequest structure, cause I didn't want to
> > > > deal with NvmeRequest serialization.
> > > >
> >
> > Hi Klaus,
> >
> > >
> > > There is a subtle catch here. There may not be room in the CQ to post
> > > all CQEs. For example, in the extreme, the host has allocated a CQ with
> > > room for 1 entry (size 2) and several deep SQs all associated with the
> > > same CQ. If the controller has nowhere to post CQEs, then we need to
> > > either abort migration (and try again) or migrate the CQEs.
> >
> > Exactly. This is why I have a next loop, where I ensure that
> > cq->req_list is empty
> > and also check nvme_cq_full(cq) to ensure that all CQEs can be posted.
>
> Hi Alexander,

Hi Stefan,

> nvme_cq_full(cq) will never become false because the migration thread
> holds the Big QEMU Lock while calling .pre_save(). The busy loop will
> hang forever. See my reply to an earlier email in this thread for
> details.

Yep. That was a really stupid bug. Sorry again.

>
> A qtest test case would be useful here (see tests/qtest/nvme-test.c) to
> fill the CQ and live migrate. It might be tricky to deterministically
> run .pre_save() (via the `migrate` QMP command) before nvme_post_cqes()
> gets called, but then you'd have a test case that covers the CQ full
> code path.

Sure, no problem, I can add a qtest here. Thanks for pointing me to
the particular file, as I'm just taking my first steps with QEMU
development :-)

Kind regards,
Alex

>
> Stefan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 7/8] hw/nvme: add basic live migration support
  2026-04-08 19:55         ` Alexander Mikhalitsyn
@ 2026-04-09 13:36           ` Stefan Hajnoczi
  0 siblings, 0 replies; 28+ messages in thread
From: Stefan Hajnoczi @ 2026-04-09 13:36 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, Jesper Devantier,
	Klaus Jensen, Stéphane Graber, qemu-block, Hanna Reitz,
	Paolo Bonzini, Keith Busch, Fam Zheng,
	Philippe Mathieu-Daudé, Zhao Liu, Kevin Wolf,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 4316 bytes --]

On Wed, Apr 08, 2026 at 09:55:10PM +0200, Alexander Mikhalitsyn wrote:
> Am Mi., 8. Apr. 2026 um 20:27 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> >
> > On Tue, Apr 07, 2026 at 09:02:26PM +0200, Alexander Mikhalitsyn wrote:
> > > Am Di., 7. Apr. 2026 um 17:48 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> > > >
> > > > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > > > +    for (i = 0; i < n->num_queues; i++) {
> > > > > +        NvmeRequest *req;
> > > > > +        NvmeSQueue *sq = n->sq[i];
> > > > > +
> > > > > +        if (!sq)
> > > > > +            continue;
> > > > > +
> > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > > > +
> > > > > +wait_out_reqs:
> > > > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > > > +                cpu_relax();
> > > > > +                goto wait_out_reqs;
> > > > > +            }
> > > > > +        }
> > > > > +
> > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > > > +    }
> > > >
> > >
> > > Hi Stefan,
> > >
> > > > Emulated storage controllers usually do not drain requests themselves.
> > > > They rely on core migration code (e.g. migration_completion_precopy())
> > > > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > > > does NVMe busy wait for requests here?
> > >
> > > I rely on the core migration code to stop vCPUs and drain requests,
> > > *but* the challenge is that the concept of an "in-flight" request in
> > > NVMe is not that simple; we have a few different types of in-flight
> > > requests:
> > > - the request was written to the SQ (sq->head != sq->tail) -> I don't
> > >   even consider this in-flight, because we just stop SQ processing and
> > >   these requests don't require any special handling during migration
> > > - the request was taken from the SQ by nvme_process_sq() and now lives
> > >   in sq->out_req_list - this means we have also initialized req->aiocb
> > >   and submitted I/O to the QEMU block layer. After the request is
> > >   processed, the completion callback is called (for read/write requests
> > >   it is nvme_rw_complete_cb()), then nvme_enqueue_req_completion()
> > >   removes the NvmeRequest from sq->out_req_list and puts it into
> > >   cq->req_list.
> > >   I expect that by the time we enter nvme_ctrl_pre_save(),
> > >   bdrv_drain_all_begin/end() has been called, all AIO is finished, and
> > >   sq->out_req_list is empty (except for AERs). *But* to be on the safe
> > >   side I also added a busy loop on sq->out_req_list.
> > >
> > > So, I tend to agree that this busy wait is probably not required, but I
> > > believe we still need to verify that sq->out_req_list is in fact empty.
> > > If we messed up, it's better to crash on an assert() than to have
> > > silent data corruption.
> > >
> > > Then I have a second loop over cq->req_list, and this one is absolutely
> > > required, because we need to write all NvmeRequest results to the CQ
> > > and free the NvmeRequest structures; I didn't want to deal with
> > > NvmeRequest serialization.
> >
> > I don't see how the busy wait approach can work since
> > migration_completion_precopy() holds the Big QEMU Lock
> > (bql_lock()/bql_unlock()) while .pre_save() is called. The main loop
> > thread's event loop will not be able to make progress while .pre_save()
> > is busy waiting.
> 
> Yes, my bad. Good catch! You are absolutely right.
> 
> This is especially stupid mistake from my side, taking into account
> that I *knew* about that we hold BQL
> in this context, cause in my first version of this patchset I *was*
> taking this into account:
> https://lore.kernel.org/qemu-devel/20260217152517.271422-5-alexander@mihalicyn.com/
> 
> see comment before qemu_bh_cancel(n->admin_cq.bh).
> 
> Thanks, Stefan! ;)

No worries. The good news is that we don't need to worry about race
conditions due to device state changing while .pre_save() runs :).

Stefan



end of thread, other threads:[~2026-04-09 13:37 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-17 10:27 [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 1/8] migration/vmstate: export vmstate_{load, save}_field helpers Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 2/8] migration: add VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT{8, 32}_ALLOC Alexander Mikhalitsyn
2026-03-17 21:39   ` Peter Xu
2026-03-17 23:30     ` Peter Xu
2026-03-18 10:06       ` Alexander Mikhalitsyn
2026-03-18 10:05     ` Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 3/8] tests/unit/test-vmstate: add tests for VMS_ARRAY_OF_POINTER_ALLOW_NULL Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 4/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 5/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
2026-04-07 15:34   ` Stefan Hajnoczi
2026-04-07 18:39     ` Alexander Mikhalitsyn
2026-04-08  6:32       ` Klaus Jensen
2026-04-08 11:47         ` Stefan Hajnoczi
2026-04-08 11:50           ` Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 6/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
2026-03-17 10:27 ` [PATCH v5 7/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-04-07 15:48   ` Stefan Hajnoczi
2026-04-07 19:02     ` Alexander Mikhalitsyn
2026-04-08  6:41       ` Klaus Jensen
2026-04-08 11:31         ` Alexander Mikhalitsyn
2026-04-08 18:35           ` Stefan Hajnoczi
2026-04-08 19:59             ` Alexander Mikhalitsyn
2026-04-08 18:27       ` Stefan Hajnoczi
2026-04-08 19:55         ` Alexander Mikhalitsyn
2026-04-09 13:36           ` Stefan Hajnoczi
2026-03-17 10:27 ` [PATCH v5 8/8] tests/functional/x86_64: add migration test for NVMe device Alexander Mikhalitsyn
2026-03-30 11:38 ` [PATCH v5 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
