* [PATCH v6 0/8] hw/nvme: add basic live migration support
@ 2026-04-19 13:01 Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 1/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
` (7 more replies)
0 siblings, 8 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Dear friends,
This patchset adds basic live migration support for
the QEMU emulated NVMe device.
The implementation has some limitations:
- only one NVMe namespace is supported
- SMART counters are not preserved
- CMB is not supported
- PMR is not supported
- SPDM is not supported
- SR-IOV is not supported
I believe these can be supported in later versions of this patchset, or
separately on demand (when a use case appears).
Testing.
This patch series was manually tested on:
- Debian 13.3 VM (kernel 6.12.69+deb13-amd64) using fio on *non-root* NVMe disk
(root disk was virtio-scsi):
time fio --name=nvme-verify \
--filename=/dev/nvme0n1 \
--size=5G \
--rw=randwrite \
--bs=4k \
--iodepth=16 \
--numjobs=1 \
--direct=0 \
--ioengine=io_uring \
--verify=crc32c \
--verify_fatal=1
- Windows Server 2022 VM (NVMe drive was a *root* disk) with an open browser
playing video.
No defects were found.
Git tree:
https://github.com/mihalicyn/qemu/commits/nvme-live-migration
Changelog for version 6:
- rebased on top of:
https://gitlab.com/peterx/qemu/-/tree/vmstate-array-null
(see also https://lore.kernel.org/all/20260401202844.673494-1-peterx@redhat.com)
- addressed review comments from Stefan Hajnoczi:
- supported "full CQ" case by serializing NvmeRequest state
- added qtest for NVMe device migration with full CQ
Changelog for version 5:
- rebased on top of https://lore.kernel.org/all/20260304212303.667141-1-vsementsov@yandex-team.ru/
(as Peter has requested)
Changelog for version 4:
- vmstate dynamic array support reworked as suggested by Peter Xu
VMS_ARRAY_OF_POINTER_ALLOW_NULL flag was introduced
qtests were added
- NVMe migration blockers were reworked as Klaus has requested earlier
Now, instead of a "deny list" approach, we have a stricter pattern
of NVMe feature filtering, and it should be harder to break migration when
adding new NVMe features.
Changelog for version 3:
- rebased
- simple functional test was added (in accordance with Klaus Jensen's review comment)
$ meson test 'func-x86_64-nvme_migration' --setup thorough -C build
Changelog for version 2:
- full support for AERs (in-flight requests and queued events too)
Kind regards,
Alex
Alexander Mikhalitsyn (8):
tests/functional/migration: add VM launch/configure hooks
hw/nvme: add migration blockers for non-supported cases
hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
hw/nvme: set CQE.sq_id earlier in nvme_process_sq
hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion
hw/nvme: add basic live migration support
tests/functional/x86_64: add migration test for NVMe device
tests/qtest/nvme-test: add migration test with full CQ
hw/nvme/ctrl.c | 992 +++++++++++++++++-
hw/nvme/ns.c | 160 +++
hw/nvme/nvme.h | 12 +
hw/nvme/trace-events | 10 +
include/block/nvme.h | 12 +
tests/functional/migration.py | 22 +-
tests/functional/x86_64/meson.build | 1 +
.../functional/x86_64/test_nvme_migration.py | 159 +++
tests/qtest/nvme-test.c | 393 +++++++
9 files changed, 1725 insertions(+), 36 deletions(-)
create mode 100755 tests/functional/x86_64/test_nvme_migration.py
--
2.47.3
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v6 1/8] tests/functional/migration: add VM launch/configure hooks
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 2/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
` (6 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Introduce configure_machine, launch_source_vm and assert_dest_vm
methods to allow child classes to override some pieces of
source/dest VMs creation, start and check logic.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 tests/functional/migration.py | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tests/functional/migration.py b/tests/functional/migration.py
index 2395119d6c6..9b5b31efe3e 100644
--- a/tests/functional/migration.py
+++ b/tests/functional/migration.py
@@ -40,19 +40,35 @@ def assert_migration(self, src_vm, dst_vm):
         self.assertEqual(dst_vm.cmd('query-status')['status'], 'running')
         self.assertEqual(src_vm.cmd('query-status')['status'],'postmigrate')

+    # Can be overridden by subclasses to configure both source/dest VMs.
+    def configure_machine(self, vm):
+        vm.add_args('-nodefaults')
+
+    # Can be overridden by subclasses to prepare the source VM before migration,
+    # e.g. by running some workload inside the source VM to see if it continues
+    # to run properly after migration.
+    def launch_source_vm(self, vm):
+        vm.launch()
+
+    # Can be overridden by subclasses to check the destination VM after migration,
+    # e.g. by checking if the workload is still running after migration.
+    def assert_dest_vm(self, vm):
+        pass
+
     def migrate_vms(self, dst_uri, src_uri, dst_vm, src_vm):
         dst_vm.qmp('migrate-incoming', uri=dst_uri)
         src_vm.qmp('migrate', uri=src_uri)
         self.assert_migration(src_vm, dst_vm)
+        self.assert_dest_vm(dst_vm)

     def migrate(self, dst_uri, src_uri=None):
         dst_vm = self.get_vm('-incoming', 'defer', name="dst-qemu")
-        dst_vm.add_args('-nodefaults')
+        self.configure_machine(dst_vm)
         dst_vm.launch()

         src_vm = self.get_vm(name="src-qemu")
-        src_vm.add_args('-nodefaults')
-        src_vm.launch()
+        self.configure_machine(src_vm)
+        self.launch_source_vm(src_vm)

         if src_uri is None:
             src_uri = dst_uri
--
2.47.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
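The three hooks added by this patch could be used roughly as follows. This is a minimal, self-contained sketch rather than the real test: `FakeVM` and `MigrationSketch` are hypothetical stand-ins that only mirror the hook names from the diff (`configure_machine`, `launch_source_vm`, `assert_dest_vm`), the QMP migration step is skipped, and the `-device nvme,...` arguments are illustrative.

```python
class FakeVM:
    """Hypothetical stand-in for a QEMU VM handle; it only records calls."""

    def __init__(self, name):
        self.name = name
        self.args = []
        self.launched = False

    def add_args(self, *args):
        self.args.extend(args)

    def launch(self):
        self.launched = True


class MigrationSketch:
    """Mirrors only the hook structure introduced by the patch."""

    def configure_machine(self, vm):
        vm.add_args('-nodefaults')

    def launch_source_vm(self, vm):
        vm.launch()

    def assert_dest_vm(self, vm):
        pass

    def migrate(self):
        dst_vm = FakeVM('dst-qemu')
        self.configure_machine(dst_vm)
        dst_vm.launch()

        src_vm = FakeVM('src-qemu')
        self.configure_machine(src_vm)
        self.launch_source_vm(src_vm)

        # A real test would drive the QMP 'migrate' command here.
        self.assert_dest_vm(dst_vm)
        return src_vm, dst_vm


class NvmeMigrationSketch(MigrationSketch):
    """A subclass that overrides all three hooks."""

    def __init__(self):
        self.workload_started = False
        self.dest_checked = False

    def configure_machine(self, vm):
        super().configure_machine(vm)
        # Illustrative device arguments, applied to both source and dest.
        vm.add_args('-device', 'nvme,serial=sketch,drive=drv0')

    def launch_source_vm(self, vm):
        super().launch_source_vm(vm)
        self.workload_started = True  # e.g. start I/O inside the guest

    def assert_dest_vm(self, vm):
        assert vm.launched
        self.dest_checked = True


test = NvmeMigrationSketch()
src, dst = test.migrate()
```

Because `configure_machine` runs for both VMs, the source and destination end up with the same command line, which is what a migration test needs.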
* [PATCH v6 2/8] hw/nvme: add migration blockers for non-supported cases
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 1/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 3/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
` (5 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Let's block migration for cases we don't support:
- SR-IOV
- CMB
- PMR
- SPDM

No functional changes here, because NVMe migration is
not supported at all as of this commit.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 hw/nvme/ctrl.c       | 206 +++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/nvme.h       |   3 +
 include/block/nvme.h |  12 +++
 3 files changed, 221 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index cc4593cd427..9f9c9bcbead 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -207,6 +207,7 @@
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
 #include "system/spdm-socket.h"
+#include "migration/blocker.h"
 #include "migration/vmstate.h"

 #include "nvme.h"
@@ -250,6 +251,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
     [NVME_COMMAND_SET_PROFILE] = true,
     [NVME_FDP_MODE] = true,
     [NVME_FDP_EVENTS] = true,
+    /* if you add something here, please update nvme_set_migration_blockers() */
 };

 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
@@ -4601,6 +4603,7 @@ static uint16_t nvme_io_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
         return 0;
     case NVME_IOMS_MO_RUH_UPDATE:
         return nvme_io_mgmt_send_ruh_update(n, req);
+    /* if you add something here, please update nvme_set_migration_blockers() */
     default:
         return NVME_INVALID_FIELD | NVME_DNR;
     };
@@ -7518,6 +7521,10 @@ static uint16_t nvme_security_receive(NvmeCtrl *n, NvmeRequest *req)

 static uint16_t nvme_directive_send(NvmeCtrl *n, NvmeRequest *req)
 {
+    /*
+     * When adding a new dtype handling here,
+     * please also update nvme_set_migration_blockers().
+     */
     return NVME_INVALID_FIELD | NVME_DNR;
 }

@@ -9208,6 +9215,199 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     }
 }

+#define BLOCKER_FEATURES_MAX_LEN 256
+
+static inline void nvme_add_blocker_feature(char *blocker_features, const char *feature)
+{
+    if (strlen(blocker_features) > 0) {
+        g_strlcat(blocker_features, ", ", BLOCKER_FEATURES_MAX_LEN);
+    }
+    g_strlcat(blocker_features, feature, BLOCKER_FEATURES_MAX_LEN);
+}
+
+static bool nvme_set_migration_blockers(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
+{
+    uint64_t unsupported_cap, cap = ldq_le_p(&n->bar.cap);
+    char blocker_features[BLOCKER_FEATURES_MAX_LEN] = "";
+    bool adm_cmd_security_checked = false;
+    bool cmd_io_mgmt_checked = false;
+    bool cmd_zone_checked = false;
+
+    /*
+     * Idea of this function is simple, we iterate over all Command Sets and
+     * for each supported command we provide a special handling logic to
+     * determine if we should block migration or not.
+     *
+     * For instance, we have NVME_ADM_CMD_NS_ATTACHMENT and it is always
+     * available to the guest, but if there is only 1 namespace, then it is
+     * safe to allow migration, but if there are more, then we need to block
+     * migration because we don't handle this in migration code yet.
+     */
+    for (int opcode = 0; opcode < sizeof(n->cse.acs) / sizeof(n->cse.acs[0]); opcode++) {
+        /* Is command supported? */
+        if (!n->cse.acs[opcode]) {
+            continue;
+        }
+
+        switch (opcode) {
+        case NVME_ADM_CMD_DELETE_SQ:
+        case NVME_ADM_CMD_CREATE_SQ:
+        case NVME_ADM_CMD_GET_LOG_PAGE:
+        case NVME_ADM_CMD_DELETE_CQ:
+        case NVME_ADM_CMD_CREATE_CQ:
+        case NVME_ADM_CMD_IDENTIFY:
+        case NVME_ADM_CMD_ABORT:
+        case NVME_ADM_CMD_SET_FEATURES:
+        case NVME_ADM_CMD_GET_FEATURES:
+        case NVME_ADM_CMD_ASYNC_EV_REQ:
+        case NVME_ADM_CMD_DBBUF_CONFIG:
+        case NVME_ADM_CMD_FORMAT_NVM:
+        case NVME_ADM_CMD_DIRECTIVE_SEND:
+        case NVME_ADM_CMD_DIRECTIVE_RECV:
+            break;
+        case NVME_ADM_CMD_NS_ATTACHMENT:
+            int namespaces_num = 0;
+            for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+                NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i);
+                if (!ns) {
+                    continue;
+                }
+
+                namespaces_num++;
+            }
+
+            if (namespaces_num > 1) {
+                nvme_add_blocker_feature(blocker_features, "Namespace Attachment");
+            }
+
+            break;
+        case NVME_ADM_CMD_VIRT_MNGMT:
+            if (n->params.sriov_max_vfs) {
+                nvme_add_blocker_feature(blocker_features, "SR-IOV");
+            }
+
+            break;
+        case NVME_ADM_CMD_SECURITY_SEND:
+        case NVME_ADM_CMD_SECURITY_RECV:
+            if (adm_cmd_security_checked) {
+                break;
+            }
+
+            if (pci_dev->spdm_port) {
+                nvme_add_blocker_feature(blocker_features, "SPDM");
+            }
+
+            adm_cmd_security_checked = true;
+
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    for (int opcode = 0; opcode < sizeof(n->cse.iocs.nvm) / sizeof(n->cse.iocs.nvm[0]); opcode++) {
+        if (!n->cse.iocs.nvm[opcode]) {
+            continue;
+        }
+
+        switch (opcode) {
+        case NVME_CMD_FLUSH:
+        case NVME_CMD_WRITE:
+        case NVME_CMD_READ:
+        case NVME_CMD_COMPARE:
+        case NVME_CMD_WRITE_ZEROES:
+        case NVME_CMD_DSM:
+        case NVME_CMD_VERIFY:
+        case NVME_CMD_COPY:
+            break;
+        case NVME_CMD_IO_MGMT_RECV:
+        case NVME_CMD_IO_MGMT_SEND:
+            if (cmd_io_mgmt_checked) {
+                break;
+            }
+
+            /* check for NVME_IOMS_MO_RUH_UPDATE */
+            if (n->subsys->params.fdp.enabled) {
+                nvme_add_blocker_feature(blocker_features, "FDP");
+            }
+
+            cmd_io_mgmt_checked = true;
+
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    for (int opcode = 0; opcode < sizeof(n->cse.iocs.zoned) / sizeof(n->cse.iocs.zoned[0]); opcode++) {
+        /*
+         * If command isn't supported or we have the same command
+         * in n->cse.iocs.nvm, then we can skip it here.
+         */
+        if (!n->cse.iocs.zoned[opcode] || n->cse.iocs.nvm[opcode]) {
+            continue;
+        }
+
+        switch (opcode) {
+        case NVME_CMD_ZONE_APPEND:
+        case NVME_CMD_ZONE_MGMT_SEND:
+        case NVME_CMD_ZONE_MGMT_RECV:
+            if (cmd_zone_checked) {
+                break;
+            }
+
+            for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+                NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i);
+                if (!ns) {
+                    continue;
+                }
+
+                if (ns->params.zoned) {
+                    nvme_add_blocker_feature(blocker_features, "Zoned Namespace");
+                    break;
+                }
+            }
+
+            cmd_zone_checked = true;
+
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    /*
+     * Try our best to explicitly detect all not supported caps,
+     * to let users know what features cause migration to be blocked,
+     * but in case we miss handling here, everything else will be
+     * covered by unsupported_cap check.
+     */
+    if (NVME_CAP_CMBS(cap)) {
+        nvme_add_blocker_feature(blocker_features, "CMB");
+        cap &= ~((uint64_t)CAP_CMBS_MASK << CAP_CMBS_SHIFT);
+    }
+
+    if (NVME_CAP_PMRS(cap)) {
+        nvme_add_blocker_feature(blocker_features, "PMR");
+        cap &= ~((uint64_t)CAP_PMRS_MASK << CAP_PMRS_SHIFT);
+    }
+
+    unsupported_cap = cap & ~NVME_MIGRATION_SUPPORTED_CAP_BITS;
+    if (unsupported_cap) {
+        nvme_add_blocker_feature(blocker_features, "unknown capability");
+    }
+
+    assert(n->migration_blocker == NULL);
+    if (strlen(blocker_features) > 0) {
+        error_setg(&n->migration_blocker, "Migration is not supported for %s", blocker_features);
+        if (migrate_add_blocker(&n->migration_blocker, errp) < 0) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
 static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
 {
     int cntlid;
@@ -9313,6 +9513,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)

         n->subsys->namespaces[ns->params.nsid] = ns;
     }
+
+    if (!nvme_set_migration_blockers(n, pci_dev, errp)) {
+        return;
+    }
 }

 static void nvme_exit(PCIDevice *pci_dev)
@@ -9365,6 +9569,8 @@ static void nvme_exit(PCIDevice *pci_dev)
     }

     memory_region_del_subregion(&n->bar0, &n->iomem);
+
+    migrate_del_blocker(&n->migration_blocker);
 }

 static const Property nvme_props[] = {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index d66f7dc82d5..457b6637249 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -666,6 +666,9 @@ typedef struct NvmeCtrl {

     /* Socket mapping to SPDM over NVMe Security In/Out commands */
     int spdm_socket;
+
+    /* Migration-related stuff */
+    Error *migration_blocker;
 } NvmeCtrl;

 typedef enum NvmeResetType {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9d7159ed7a7..a7f586fc801 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -141,6 +141,18 @@ enum NvmeCapMask {
 #define NVME_CAP_SET_CMBS(cap, val) \
     ((cap) |= (uint64_t)((val) & CAP_CMBS_MASK) << CAP_CMBS_SHIFT)

+#define NVME_MIGRATION_SUPPORTED_CAP_BITS ( \
+      ((uint64_t)CAP_MQES_MASK << CAP_MQES_SHIFT) \
+    | ((uint64_t)CAP_CQR_MASK << CAP_CQR_SHIFT) \
+    | ((uint64_t)CAP_AMS_MASK << CAP_AMS_SHIFT) \
+    | ((uint64_t)CAP_TO_MASK << CAP_TO_SHIFT) \
+    | ((uint64_t)CAP_DSTRD_MASK << CAP_DSTRD_SHIFT) \
+    | ((uint64_t)CAP_NSSRS_MASK << CAP_NSSRS_SHIFT) \
+    | ((uint64_t)CAP_CSS_MASK << CAP_CSS_SHIFT) \
+    | ((uint64_t)CAP_MPSMIN_MASK << CAP_MPSMIN_SHIFT) \
+    | ((uint64_t)CAP_MPSMAX_MASK << CAP_MPSMAX_SHIFT) \
+)
+
 enum NvmeCapCss {
     NVME_CAP_CSS_NCSS = 1 << 0,
     NVME_CAP_CSS_IOCSS = 1 << 6,
--
2.47.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
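The capability check at the end of nvme_set_migration_blockers() follows an allow-list pattern: known-unsupported features are named explicitly, and any remaining bit outside the supported mask is reported generically. The sketch below models that logic in Python for illustration only. The shift and mask values are my reading of the NVMe CAP register layout and should be treated as assumptions; `CAP_FIELDS`, `field_bits` and `migration_blockers` are hypothetical names, not QEMU API.

```python
# (shift, mask) for fields of the 64-bit CAP register. These positions are
# assumed from the NVMe register layout, not copied from the patch.
CAP_FIELDS = {
    'MQES':   (0,  0xffff),
    'CQR':    (16, 0x1),
    'AMS':    (17, 0x3),
    'TO':     (24, 0xff),
    'DSTRD':  (32, 0xf),
    'NSSRS':  (36, 0x1),
    'CSS':    (37, 0xff),
    'MPSMIN': (48, 0xf),
    'MPSMAX': (52, 0xf),
    'PMRS':   (56, 0x1),
    'CMBS':   (57, 0x1),
}

# Fields the patch lists in NVME_MIGRATION_SUPPORTED_CAP_BITS.
SUPPORTED = ('MQES', 'CQR', 'AMS', 'TO', 'DSTRD',
             'NSSRS', 'CSS', 'MPSMIN', 'MPSMAX')


def field_bits(name):
    """Return the field's mask shifted into its register position."""
    shift, mask = CAP_FIELDS[name]
    return mask << shift


SUPPORTED_BITS = 0
for _name in SUPPORTED:
    SUPPORTED_BITS |= field_bits(_name)


def migration_blockers(cap):
    """Return the feature names that would block migration for this CAP."""
    blockers = []
    # Name the known-unsupported features explicitly, then clear their bits
    # so they are not reported a second time by the generic check.
    for name, label in (('CMBS', 'CMB'), ('PMRS', 'PMR')):
        if cap & field_bits(name):
            blockers.append(label)
            cap &= ~field_bits(name)
    # Anything still set outside the allow-list is reported generically.
    if cap & ~SUPPORTED_BITS:
        blockers.append('unknown capability')
    return blockers
```

The point of the design is that a capability bit added to the device later is blocked by default until someone extends the allow-list, instead of silently migrating state the code does not know about.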
* [PATCH v6 3/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 1/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 2/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-29 6:20 ` Klaus Jensen
2026-04-19 13:01 ` [PATCH v6 4/8] hw/nvme: set CQE.sq_id earlier in nvme_process_sq Alexander Mikhalitsyn
` (4 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 hw/nvme/ctrl.c | 59 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 9f9c9bcbead..191398e700f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4854,18 +4854,14 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }

-static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
-                         uint16_t sqid, uint16_t cqid, uint16_t size)
+static void __nvme_init_sq(NvmeSQueue *sq)
 {
+    NvmeCtrl *n = sq->ctrl;
+    uint16_t sqid = sq->sqid;
+    uint16_t cqid = sq->cqid;
     int i;
     NvmeCQueue *cq;

-    sq->ctrl = n;
-    sq->dma_addr = dma_addr;
-    sq->sqid = sqid;
-    sq->size = size;
-    sq->cqid = cqid;
-    sq->head = sq->tail = 0;
     sq->io_req = g_new0(NvmeRequest, sq->size);

     QTAILQ_INIT(&sq->req_list);
@@ -4895,6 +4891,18 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     n->sq[sqid] = sq;
 }

+static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
+                         uint16_t sqid, uint16_t cqid, uint16_t size)
+{
+    sq->ctrl = n;
+    sq->dma_addr = dma_addr;
+    sq->sqid = sqid;
+    sq->size = size;
+    sq->cqid = cqid;
+    sq->head = sq->tail = 0;
+    __nvme_init_sq(sq);
+}
+
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeSQueue *sq;
@@ -5555,25 +5563,16 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }

-static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
-                         uint16_t cqid, uint16_t vector, uint16_t size,
-                         uint16_t irq_enabled)
+static void __nvme_init_cq(NvmeCQueue *cq)
 {
+    NvmeCtrl *n = cq->ctrl;
     PCIDevice *pci = PCI_DEVICE(n);
+    uint16_t cqid = cq->cqid;

-    if (msix_enabled(pci) && irq_enabled) {
-        msix_vector_use(pci, vector);
+    if (msix_enabled(pci) && cq->irq_enabled) {
+        msix_vector_use(pci, cq->vector);
     }

-    cq->ctrl = n;
-    cq->cqid = cqid;
-    cq->size = size;
-    cq->dma_addr = dma_addr;
-    cq->phase = 1;
-    cq->irq_enabled = irq_enabled;
-    cq->vector = vector;
-    cq->head = cq->tail = 0;
-    QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
     if (n->dbbuf_enabled) {
         cq->db_addr = n->dbbuf_dbs + (cqid << 3) + (1 << 2);
@@ -5590,6 +5589,22 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
                                  &DEVICE(cq->ctrl)->mem_reentrancy_guard);
 }

+static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
+                         uint16_t cqid, uint16_t vector, uint16_t size,
+                         uint16_t irq_enabled)
+{
+    cq->ctrl = n;
+    cq->cqid = cqid;
+    cq->size = size;
+    cq->dma_addr = dma_addr;
+    cq->phase = 1;
+    cq->irq_enabled = irq_enabled;
+    cq->vector = vector;
+    cq->head = cq->tail = 0;
+    QTAILQ_INIT(&cq->req_list);
+    __nvme_init_cq(cq);
+}
+
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCQueue *cq;
--
2.47.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
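The patch carries no description, so the following is only one plausible reading of why the split is useful: the outer function assigns the persistent fields and resets head/tail, while the inner helper builds runtime state from fields that are already set. A restore path could then fill the fields from saved state and call only the helper. The sketch models that two-stage shape in Python; `SubmissionQueue`, `finish_sq_init`, `init_sq` and `restore_sq` are hypothetical names, and `restore_sq` in particular is an assumption about how a later patch might use the helper.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionQueue:
    """Hypothetical model: persistent fields plus state derived from them."""
    sqid: int = 0
    cqid: int = 0
    size: int = 0
    dma_addr: int = 0
    head: int = 0
    tail: int = 0
    # Derived at init time; not part of the saved state in this sketch.
    io_req: list = field(default_factory=list)


def finish_sq_init(sq, registry):
    """Second stage: build runtime state from fields that are already set."""
    sq.io_req = [None] * sq.size
    registry[sq.sqid] = sq


def init_sq(sq, registry, dma_addr, sqid, cqid, size):
    """Fresh queue: assign fields, reset the pointers, then finish."""
    sq.dma_addr = dma_addr
    sq.sqid = sqid
    sq.cqid = cqid
    sq.size = size
    sq.head = sq.tail = 0
    finish_sq_init(sq, registry)


def restore_sq(saved, registry):
    """Assumed restore path: keep saved head/tail, reuse the second stage."""
    sq = SubmissionQueue(**saved)
    finish_sq_init(sq, registry)
    return sq


registry = {}
fresh = SubmissionQueue()
init_sq(fresh, registry, 0x1000, 1, 1, 8)
restored = restore_sq({'sqid': 2, 'cqid': 1, 'size': 4,
                       'dma_addr': 0x2000, 'head': 3, 'tail': 1}, registry)
```

In this model, calling `init_sq` on the restored queue would have reset `head` and `tail` to zero, which is the behaviour the split avoids.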
* Re: [PATCH v6 3/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers
2026-04-19 13:01 ` [PATCH v6 3/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
@ 2026-04-29 6:20 ` Klaus Jensen
0 siblings, 0 replies; 17+ messages in thread
From: Klaus Jensen @ 2026-04-29 6:20 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber,
Philippe Mathieu-Daudé, Paolo Bonzini, Stefan Hajnoczi,
Laurent Vivier, Jesper Devantier, Fabiano Rosas, Zhao Liu,
Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn
[-- Attachment #1: Type: text/plain, Size: 3876 bytes --]

On Apr 19 15:01, Alexander Mikhalitsyn wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

LGTM, but a short description on why this is needed in preparation for
later patches would be nice :)

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  hw/nvme/ctrl.c | 59 +++++++++++++++++++++++++++++++-------------------
>  1 file changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 9f9c9bcbead..191398e700f 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -4854,18 +4854,14 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>
> -static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
> -                         uint16_t sqid, uint16_t cqid, uint16_t size)
> +static void __nvme_init_sq(NvmeSQueue *sq)
>  {
> +    NvmeCtrl *n = sq->ctrl;
> +    uint16_t sqid = sq->sqid;
> +    uint16_t cqid = sq->cqid;
>      int i;
>      NvmeCQueue *cq;
>
> -    sq->ctrl = n;
> -    sq->dma_addr = dma_addr;
> -    sq->sqid = sqid;
> -    sq->size = size;
> -    sq->cqid = cqid;
> -    sq->head = sq->tail = 0;
>      sq->io_req = g_new0(NvmeRequest, sq->size);
>
>      QTAILQ_INIT(&sq->req_list);
> @@ -4895,6 +4891,18 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>      n->sq[sqid] = sq;
>  }
>
> +static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
> +                         uint16_t sqid, uint16_t cqid, uint16_t size)
> +{
> +    sq->ctrl = n;
> +    sq->dma_addr = dma_addr;
> +    sq->sqid = sqid;
> +    sq->size = size;
> +    sq->cqid = cqid;
> +    sq->head = sq->tail = 0;
> +    __nvme_init_sq(sq);
> +}
> +
>  static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeSQueue *sq;
> @@ -5555,25 +5563,16 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>
> -static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
> -                         uint16_t cqid, uint16_t vector, uint16_t size,
> -                         uint16_t irq_enabled)
> +static void __nvme_init_cq(NvmeCQueue *cq)
>  {
> +    NvmeCtrl *n = cq->ctrl;
>      PCIDevice *pci = PCI_DEVICE(n);
> +    uint16_t cqid = cq->cqid;
>
> -    if (msix_enabled(pci) && irq_enabled) {
> -        msix_vector_use(pci, vector);
> +    if (msix_enabled(pci) && cq->irq_enabled) {
> +        msix_vector_use(pci, cq->vector);
>      }
>
> -    cq->ctrl = n;
> -    cq->cqid = cqid;
> -    cq->size = size;
> -    cq->dma_addr = dma_addr;
> -    cq->phase = 1;
> -    cq->irq_enabled = irq_enabled;
> -    cq->vector = vector;
> -    cq->head = cq->tail = 0;
> -    QTAILQ_INIT(&cq->req_list);
>      QTAILQ_INIT(&cq->sq_list);
>      if (n->dbbuf_enabled) {
>          cq->db_addr = n->dbbuf_dbs + (cqid << 3) + (1 << 2);
> @@ -5590,6 +5589,22 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>                                   &DEVICE(cq->ctrl)->mem_reentrancy_guard);
>  }
>
> +static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
> +                         uint16_t cqid, uint16_t vector, uint16_t size,
> +                         uint16_t irq_enabled)
> +{
> +    cq->ctrl = n;
> +    cq->cqid = cqid;
> +    cq->size = size;
> +    cq->dma_addr = dma_addr;
> +    cq->phase = 1;
> +    cq->irq_enabled = irq_enabled;
> +    cq->vector = vector;
> +    cq->head = cq->tail = 0;
> +    QTAILQ_INIT(&cq->req_list);
> +    __nvme_init_cq(cq);
> +}
> +
>  static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeCQueue *cq;
> --
> 2.47.3
>
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v6 4/8] hw/nvme: set CQE.sq_id earlier in nvme_process_sq
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (2 preceding siblings ...)
2026-04-19 13:01 ` [PATCH v6 3/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-29 6:18 ` Klaus Jensen
2026-04-19 13:01 ` [PATCH v6 5/8] hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion Alexander Mikhalitsyn
` (3 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Instead of filling req->cqe.sq_id in nvme_post_cqes, let's set it earlier
in nvme_process_sq.

This shouldn't cause any issues, because req->cqe.sq_id never changes
during lifetime of req.

This will help us for migration support.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 hw/nvme/ctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 191398e700f..d9bf32bff2c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1520,7 +1520,6 @@ static void nvme_post_cqes(void *opaque)

         sq = req->sq;
         req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
-        req->cqe.sq_id = cpu_to_le16(sq->sqid);
         req->cqe.sq_head = cpu_to_le16(sq->head);
         addr = cq->dma_addr + (cq->tail << NVME_CQES);
         ret = pci_dma_write(PCI_DEVICE(n), addr, (void *)&req->cqe,
@@ -7848,6 +7847,7 @@ static void nvme_process_sq(void *opaque)
         QTAILQ_REMOVE(&sq->req_list, req, entry);
         QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
         nvme_req_clear(req);
+        req->cqe.sq_id = cpu_to_le16(sq->sqid);
         req->cqe.cid = cmd.cid;
         memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));

--
2.47.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
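The change above separates completion-entry fields by when they become known: `sq_id` and `cid` are fixed as soon as the command is fetched, while `sq_head` and the status/phase word are only known when the entry is posted. The sketch below illustrates that split in Python. The 16-byte little-endian layout (result, dw1, sq_head, sq_id, cid, status) is my understanding of the NVMe completion entry and is an assumption here; `Request`, `fetch` and `post` are hypothetical names.

```python
import struct

# Assumed 16-byte completion queue entry, little-endian:
# result (u32), dw1 (u32), sq_head (u16), sq_id (u16), cid (u16), status (u16)
CQE_FORMAT = '<IIHHHH'


class Request:
    """Hypothetical stand-in for an in-flight request and its CQE fields."""

    def __init__(self):
        self.result = 0
        self.sq_id = 0
        self.cid = 0
        self.status = 0


def fetch(req, sqid, cid):
    # Known as soon as the command is taken off the submission queue and
    # unchanged for the rest of the request's lifetime.
    req.sq_id = sqid
    req.cid = cid


def post(req, sq_head, phase):
    # Only known when the entry is actually written to the completion queue.
    status_field = (req.status << 1) | phase
    return struct.pack(CQE_FORMAT, req.result, 0, sq_head, req.sq_id,
                       req.cid, status_field)


req = Request()
fetch(req, 3, 0x42)
req.status = 0x2
raw = post(req, 5, 1)
```

With `sq_id` filled at fetch time, a request that is waiting in a completion list already carries everything that identifies it, which seems to be what makes it easier to serialize.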
* Re: [PATCH v6 4/8] hw/nvme: set CQE.sq_id earlier in nvme_process_sq
2026-04-19 13:01 ` [PATCH v6 4/8] hw/nvme: set CQE.sq_id earlier in nvme_process_sq Alexander Mikhalitsyn
@ 2026-04-29 6:18 ` Klaus Jensen
0 siblings, 0 replies; 17+ messages in thread
From: Klaus Jensen @ 2026-04-29 6:18 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber,
Philippe Mathieu-Daudé, Paolo Bonzini, Stefan Hajnoczi,
Laurent Vivier, Jesper Devantier, Fabiano Rosas, Zhao Liu,
Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn
[-- Attachment #1: Type: text/plain, Size: 1532 bytes --]

On Apr 19 15:01, Alexander Mikhalitsyn wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
>
> Instead of filling req->cqe.sq_id in nvme_post_cqes, let's set it earlier
> in nvme_process_sq.
>
> This shouldn't cause any issues, because req->cqe.sq_id never changes
> during lifetime of req.
>
> This will help us for migration support.
>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  hw/nvme/ctrl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 191398e700f..d9bf32bff2c 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,7 +1520,6 @@ static void nvme_post_cqes(void *opaque)
>
>          sq = req->sq;
>          req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
> -        req->cqe.sq_id = cpu_to_le16(sq->sqid);
>          req->cqe.sq_head = cpu_to_le16(sq->head);
>          addr = cq->dma_addr + (cq->tail << NVME_CQES);
>          ret = pci_dma_write(PCI_DEVICE(n), addr, (void *)&req->cqe,
> @@ -7848,6 +7847,7 @@ static void nvme_process_sq(void *opaque)
>          QTAILQ_REMOVE(&sq->req_list, req, entry);
>          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
>          nvme_req_clear(req);
> +        req->cqe.sq_id = cpu_to_le16(sq->sqid);
>          req->cqe.cid = cmd.cid;
>          memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
>
> --
> 2.47.3
>
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v6 5/8] hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (3 preceding siblings ...)
2026-04-19 13:01 ` [PATCH v6 4/8] hw/nvme: set CQE.sq_id earlier in nvme_process_sq Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-29 6:19 ` Klaus Jensen
2026-04-19 13:01 ` [PATCH v6 6/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (2 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Instead of unmapping req->sg in nvme_post_cqes(), we can do it earlier in
nvme_enqueue_req_completion(). When req completion is enqueued we don't
need to access req->sg anymore. We only care about req->sq, req->cqe and
req->status.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
---
 hw/nvme/ctrl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index d9bf32bff2c..1ff91493b60 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1534,7 +1534,6 @@ static void nvme_post_cqes(void *opaque)
         QTAILQ_REMOVE(&cq->req_list, req, entry);

         nvme_inc_cq_tail(cq);
-        nvme_sg_unmap(&req->sg);

         if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
             qemu_bh_schedule(sq->bh);
@@ -1564,6 +1563,8 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
                                           req->status, req->cmd.opcode);
     }

+    nvme_sg_unmap(&req->sg);
+
     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);

--
2.47.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v6 5/8] hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion
2026-04-19 13:01 ` [PATCH v6 5/8] hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion Alexander Mikhalitsyn
@ 2026-04-29 6:19 ` Klaus Jensen
0 siblings, 0 replies; 17+ messages in thread
From: Klaus Jensen @ 2026-04-29 6:19 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber,
Philippe Mathieu-Daudé, Paolo Bonzini, Stefan Hajnoczi,
Laurent Vivier, Jesper Devantier, Fabiano Rosas, Zhao Liu,
Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn
[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

On Apr 19 15:01, Alexander Mikhalitsyn wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
>
> Instead of unmapping req->sg in nvme_post_cqes(), we can do it earlier in
> nvme_enqueue_req_completion(). When req completion is enqueued we don't
> need to access req->sg anymore. We only care about req->sq, req->cqe and
> req->status.
>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Nice cleanup,

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  hw/nvme/ctrl.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index d9bf32bff2c..1ff91493b60 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1534,7 +1534,6 @@ static void nvme_post_cqes(void *opaque)
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
>
>          nvme_inc_cq_tail(cq);
> -        nvme_sg_unmap(&req->sg);
>
>          if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
>              qemu_bh_schedule(sq->bh);
> @@ -1564,6 +1563,8 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>                                            req->status, req->cmd.opcode);
>      }
>
> +    nvme_sg_unmap(&req->sg);
> +
>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>
> --
> 2.47.3
>
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v6 6/8] hw/nvme: add basic live migration support
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (4 preceding siblings ...)
2026-04-19 13:01 ` [PATCH v6 5/8] hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-27 21:03 ` Stefan Hajnoczi
2026-04-19 13:01 ` [PATCH v6 7/8] tests/functional/x86_64: add migration test for NVMe device Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ Alexander Mikhalitsyn
7 siblings, 1 reply; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
It has some limitations:
- only one NVMe namespace is supported
- SMART counters are not preserved
- CMB is not supported
- PMR is not supported
- SPDM is not supported
- SR-IOV is not supported
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
v2:
- AERs are now fully supported
v6:
- handle full CQ case
---
 hw/nvme/ctrl.c       | 722 ++++++++++++++++++++++++++++++++++++++++++-
 hw/nvme/ns.c         | 160 ++++++++++
 hw/nvme/nvme.h       |   9 +
 hw/nvme/trace-events |  10 +
 4 files changed, 892 insertions(+), 9 deletions(-)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 1ff91493b60..5157c7fd5a4 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -208,6 +208,7 @@
 #include "hw/pci/pcie_sriov.h"
 #include "system/spdm-socket.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -1518,6 +1519,18 @@ static void nvme_post_cqes(void *opaque)
             break;
         }
+ /* + * Here we take the following fields from NvmeRequest structure + * and write cqe to the guest RAM based on them: + * - req->sq + * - req->status + * - req->cqe + * + * If you change this code and more fields from NvmeRequest are + * used, please make sure that you have handled this in: + * nvme_vmstate_request and nvme_ctrl_pre_save(). + */ + sq = req->sq; req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase); req->cqe.sq_head = cpu_to_le16(sq->head); @@ -4903,6 +4916,25 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr, __nvme_init_sq(sq); } +static void nvme_restore_sq(NvmeSQueue *sq_from) +{ + NvmeCtrl *n = sq_from->ctrl; + NvmeSQueue *sq = sq_from; + + if (sq_from->sqid == 0) { + sq = &n->admin_sq; + sq->ctrl = n; + sq->dma_addr = sq_from->dma_addr; + sq->sqid = sq_from->sqid; + sq->size = sq_from->size; + sq->cqid = sq_from->cqid; + sq->head = sq_from->head; + sq->tail = sq_from->tail; + } + + __nvme_init_sq(sq); +} + static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req) { NvmeSQueue *sq; @@ -5605,6 +5637,39 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr, __nvme_init_cq(cq); } +static void copy_cq_req_list(NvmeCQueue *cq_to, NvmeCQueue *cq_from) +{ + NvmeRequest *req, *next; + + QTAILQ_FOREACH_SAFE(req, &cq_from->req_list, entry, next) { + QTAILQ_REMOVE(&cq_from->req_list, req, entry); + QTAILQ_INSERT_TAIL(&cq_to->req_list, req, entry); + } +} + +static void nvme_restore_cq(NvmeCQueue *cq_from) +{ + NvmeCtrl *n = cq_from->ctrl; + NvmeCQueue *cq = cq_from; + + if (cq_from->cqid == 0) { + cq = &n->admin_cq; + cq->ctrl = n; + cq->cqid = cq_from->cqid; + cq->size = cq_from->size; + cq->dma_addr = cq_from->dma_addr; + cq->phase = cq_from->phase; + cq->irq_enabled = cq_from->irq_enabled; + cq->vector = cq_from->vector; + cq->head = cq_from->head; + cq->tail = cq_from->tail; + QTAILQ_INIT(&cq->req_list); + copy_cq_req_list(cq, cq_from); + } + + __nvme_init_cq(cq); +} + static uint16_t 
nvme_create_cq(NvmeCtrl *n, NvmeRequest *req) { NvmeCQueue *cq; @@ -7293,7 +7358,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req) n->dbbuf_eis = eis_addr; n->dbbuf_enabled = true; - for (i = 0; i < n->params.max_ioqpairs + 1; i++) { + for (i = 0; i < n->num_queues; i++) { NvmeSQueue *sq = n->sq[i]; NvmeCQueue *cq = n->cq[i]; @@ -7737,7 +7802,7 @@ static int nvme_atomic_write_check(NvmeCtrl *n, NvmeCmd *cmd, /* * Walk the queues to see if there are any atomic conflicts. */ - for (i = 1; i < n->params.max_ioqpairs + 1; i++) { + for (i = 1; i < n->num_queues; i++) { NvmeSQueue *sq; NvmeRequest *req; NvmeRwCmd *req_rw; @@ -7807,6 +7872,12 @@ static void nvme_process_sq(void *opaque) NvmeCmd cmd; NvmeRequest *req; + /* + * We don't want to have a race with nvme_ctrl_pre_save(). + * What implicitly protects us from this is BQL. + */ + assert(bql_locked()); + if (n->dbbuf_enabled) { nvme_update_sq_tail(sq); } @@ -7924,12 +7995,12 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst) nvme_ns_drain(ns); } - for (i = 0; i < n->params.max_ioqpairs + 1; i++) { + for (i = 0; i < n->num_queues; i++) { if (n->sq[i] != NULL) { nvme_free_sq(n->sq[i], n); } } - for (i = 0; i < n->params.max_ioqpairs + 1; i++) { + for (i = 0; i < n->num_queues; i++) { if (n->cq[i] != NULL) { nvme_free_cq(n->cq[i], n); } @@ -8599,6 +8670,8 @@ static bool nvme_check_params(NvmeCtrl *n, Error **errp) params->max_ioqpairs = params->num_queues - 1; } + n->num_queues = params->max_ioqpairs + 1; + if (n->namespace.blkconf.blk && n->subsys) { error_setg(errp, "subsystem support is unavailable with legacy " "namespace ('drive' property)"); @@ -8753,8 +8826,8 @@ static void nvme_init_state(NvmeCtrl *n) n->conf_msix_qsize = n->params.msix_qsize; } - n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1); - n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1); + n->sq = g_new0(NvmeSQueue *, n->num_queues); + n->cq = g_new0(NvmeCQueue *, n->num_queues); n->temperature = 
NVME_TEMPERATURE; n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING; n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL); @@ -8989,7 +9062,7 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) } if (n->params.msix_exclusive_bar && !pci_is_vf(pci_dev)) { - bar_size = nvme_mbar_size(n->params.max_ioqpairs + 1, 0, NULL, NULL); + bar_size = nvme_mbar_size(n->num_queues, 0, NULL, NULL); memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme", bar_size); pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY | @@ -9001,7 +9074,7 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) /* add one to max_ioqpairs to account for the admin queue pair */ if (!pci_is_vf(pci_dev)) { nr_vectors = n->params.msix_qsize; - bar_size = nvme_mbar_size(n->params.max_ioqpairs + 1, + bar_size = nvme_mbar_size(n->num_queues, nr_vectors, &msix_table_offset, &msix_pba_offset); } else { @@ -9724,9 +9797,640 @@ static uint32_t nvme_pci_read_config(PCIDevice *dev, uint32_t address, int len) return pci_default_read_config(dev, address, len); } +static const VMStateDescription nvme_vmstate_cqe = { + .name = "nvme-cqe", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT32(result, NvmeCqe), + VMSTATE_UINT32(dw1, NvmeCqe), + VMSTATE_UINT16(sq_head, NvmeCqe), + VMSTATE_UINT16(sq_id, NvmeCqe), + VMSTATE_UINT16(cid, NvmeCqe), + VMSTATE_UINT16(status, NvmeCqe), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_cmd_dptr_sgl = { + .name = "nvme-request-cmd-dptr-sgl", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT64(addr, NvmeSglDescriptor), + VMSTATE_UINT32(len, NvmeSglDescriptor), + VMSTATE_UINT8_ARRAY(rsvd, NvmeSglDescriptor, 3), + VMSTATE_UINT8(type, NvmeSglDescriptor), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_cmd_dptr = { + .name = "nvme-request-cmd-dptr", + 
.version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT64(prp1, NvmeCmdDptr), + VMSTATE_UINT64(prp2, NvmeCmdDptr), + VMSTATE_STRUCT(sgl, NvmeCmdDptr, 0, nvme_vmstate_cmd_dptr_sgl, NvmeSglDescriptor), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_cmd = { + .name = "nvme-request-cmd", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT8(opcode, NvmeCmd), + VMSTATE_UINT8(flags, NvmeCmd), + VMSTATE_UINT16(cid, NvmeCmd), + VMSTATE_UINT32(nsid, NvmeCmd), + VMSTATE_UINT64(res1, NvmeCmd), + VMSTATE_UINT64(mptr, NvmeCmd), + VMSTATE_STRUCT(dptr, NvmeCmd, 0, nvme_vmstate_cmd_dptr, NvmeCmdDptr), + VMSTATE_UINT32(cdw10, NvmeCmd), + VMSTATE_UINT32(cdw11, NvmeCmd), + VMSTATE_UINT32(cdw12, NvmeCmd), + VMSTATE_UINT32(cdw13, NvmeCmd), + VMSTATE_UINT32(cdw14, NvmeCmd), + VMSTATE_UINT32(cdw15, NvmeCmd), + VMSTATE_END_OF_LIST() + } +}; + +static bool nvme_req_pre_load(void *opaque, Error **errp) +{ + memset(opaque, 0x0, sizeof(NvmeRequest)); + return true; +} + +static const VMStateDescription nvme_vmstate_request = { + .name = "nvme-request", + .version_id = 1, + .minimum_version_id = 1, + .pre_load_errp = nvme_req_pre_load, + .fields = (const VMStateField[]) { + VMSTATE_UINT16(status, NvmeRequest), + VMSTATE_STRUCT(cqe, NvmeRequest, 0, nvme_vmstate_cqe, NvmeCqe), + VMSTATE_STRUCT(cmd, NvmeRequest, 0, nvme_vmstate_cmd, NvmeCmd), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_bar = { + .name = "nvme-bar", + .minimum_version_id = 1, + .version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT64(cap, NvmeBar), + VMSTATE_UINT32(vs, NvmeBar), + VMSTATE_UINT32(intms, NvmeBar), + VMSTATE_UINT32(intmc, NvmeBar), + VMSTATE_UINT32(cc, NvmeBar), + VMSTATE_UINT8_ARRAY(rsvd24, NvmeBar, 4), + VMSTATE_UINT32(csts, NvmeBar), + VMSTATE_UINT32(nssr, NvmeBar), + VMSTATE_UINT32(aqa, NvmeBar), + VMSTATE_UINT64(asq, NvmeBar), + 
VMSTATE_UINT64(acq, NvmeBar), + VMSTATE_UINT32(cmbloc, NvmeBar), + VMSTATE_UINT32(cmbsz, NvmeBar), + VMSTATE_UINT32(bpinfo, NvmeBar), + VMSTATE_UINT32(bprsel, NvmeBar), + VMSTATE_UINT64(bpmbl, NvmeBar), + VMSTATE_UINT64(cmbmsc, NvmeBar), + VMSTATE_UINT32(cmbsts, NvmeBar), + VMSTATE_UINT8_ARRAY(rsvd92, NvmeBar, 3492), + VMSTATE_UINT32(pmrcap, NvmeBar), + VMSTATE_UINT32(pmrctl, NvmeBar), + VMSTATE_UINT32(pmrsts, NvmeBar), + VMSTATE_UINT32(pmrebs, NvmeBar), + VMSTATE_UINT32(pmrswtp, NvmeBar), + VMSTATE_UINT32(pmrmscl, NvmeBar), + VMSTATE_UINT32(pmrmscu, NvmeBar), + VMSTATE_UINT8_ARRAY(css, NvmeBar, 484), + VMSTATE_END_OF_LIST() + }, +}; + +static bool nvme_cqueue_pre_load(void *opaque, Error **errp) +{ + NvmeCQueue *cq = opaque; + + QTAILQ_INIT(&cq->req_list); + return true; +} + +static const VMStateDescription nvme_vmstate_cqueue = { + .name = "nvme-cq", + .version_id = 1, + .minimum_version_id = 1, + .pre_load_errp = nvme_cqueue_pre_load, + .fields = (const VMStateField[]) { + VMSTATE_UINT8(phase, NvmeCQueue), + VMSTATE_UINT16(cqid, NvmeCQueue), + VMSTATE_UINT16(irq_enabled, NvmeCQueue), + VMSTATE_UINT32(head, NvmeCQueue), + VMSTATE_UINT32(tail, NvmeCQueue), + VMSTATE_UINT32(vector, NvmeCQueue), + VMSTATE_UINT32(size, NvmeCQueue), + VMSTATE_UINT64(dma_addr, NvmeCQueue), + + VMSTATE_QTAILQ_V(req_list, NvmeCQueue, 1, nvme_vmstate_request, + NvmeRequest, entry), + + /* db_addr, ei_addr, etc will be recalculated */ + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_squeue = { + .name = "nvme-sq", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT16(sqid, NvmeSQueue), + VMSTATE_UINT16(cqid, NvmeSQueue), + VMSTATE_UINT32(head, NvmeSQueue), + VMSTATE_UINT32(tail, NvmeSQueue), + VMSTATE_UINT32(size, NvmeSQueue), + VMSTATE_UINT64(dma_addr, NvmeSQueue), + /* db_addr, ei_addr, etc will be recalculated */ + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_async_event_result 
= { + .name = "nvme-async-event-result", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT8(event_type, NvmeAerResult), + VMSTATE_UINT8(event_info, NvmeAerResult), + VMSTATE_UINT8(log_page, NvmeAerResult), + VMSTATE_UINT8(resv, NvmeAerResult), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_async_event = { + .name = "nvme-async-event", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_STRUCT(result, NvmeAsyncEvent, 0, nvme_vmstate_async_event_result, NvmeAerResult), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_hbs = { + .name = "nvme-hbs", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT8(acre, NvmeHostBehaviorSupport), + VMSTATE_UINT8(etdas, NvmeHostBehaviorSupport), + VMSTATE_UINT8(lbafee, NvmeHostBehaviorSupport), + VMSTATE_UINT8(rsvd3, NvmeHostBehaviorSupport), + VMSTATE_UINT16(cdfe, NvmeHostBehaviorSupport), + VMSTATE_UINT8_ARRAY(rsvd6, NvmeHostBehaviorSupport, 506), + VMSTATE_END_OF_LIST() + } +}; + +const VMStateDescription nvme_vmstate_atomic = { + .name = "nvme-atomic", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT32(atomic_max_write_size, NvmeAtomic), + VMSTATE_UINT64(atomic_boundary, NvmeAtomic), + VMSTATE_UINT64(atomic_nabo, NvmeAtomic), + VMSTATE_BOOL(atomic_writes, NvmeAtomic), + VMSTATE_END_OF_LIST() + } +}; + +static bool pre_save_validate_aer_req(NvmeRequest *req, Error **errp) +{ + /* + * Can't use assert() here, because we don't want + * to just crash QEMU when user requests a migration. 
+ */ + if (!(req->cmd.opcode == NVME_ADM_CMD_ASYNC_EV_REQ)) { + error_setg(errp, "req->cmd.opcode (%u) != NVME_ADM_CMD_ASYNC_EV_REQ", req->cmd.opcode); + return false; + } + + if (!(req->ns == NULL)) { + error_setg(errp, "req->ns != NULL"); + return false; + } + + if (!(req->sq == &req->sq->ctrl->admin_sq)) { + error_setg(errp, "req->sq != &req->sq->ctrl->admin_sq"); + return false; + } + + if (!(req->aiocb == NULL)) { + error_setg(errp, "req->aiocb != NULL"); + return false; + } + + if (!(req->opaque == NULL)) { + error_setg(errp, "req->opaque != NULL"); + return false; + } + + if (!(req->atomic_write == false)) { + error_setg(errp, "req->atomic_write != false"); + return false; + } + + if (req->sg.flags & NVME_SG_ALLOC) { + error_setg(errp, "unexpected NVME_SG_ALLOC flag in req->sg.flags"); + return false; + } + + return true; +} + +static bool pre_save_validate_cq_req(NvmeRequest *req, Error **errp) +{ + if (!(req->ns == NULL)) { + error_setg(errp, "req->ns != NULL"); + return false; + } + + if (!(req->aiocb == NULL)) { + error_setg(errp, "req->aiocb != NULL"); + return false; + } + + if (!(req->opaque == NULL)) { + error_setg(errp, "req->opaque != NULL"); + return false; + } + + if (!(req->atomic_write == false)) { + error_setg(errp, "req->atomic_write != false"); + return false; + } + + if (req->sg.flags & NVME_SG_ALLOC) { + error_setg(errp, "unexpected NVME_SG_ALLOC flag in req->sg.flags"); + return false; + } + + return true; +} + +static bool nvme_ctrl_pre_save(void *opaque, Error **errp) +{ + NvmeCtrl *n = opaque; + int i; + + trace_pci_nvme_pre_save_enter(n); + + /* + * We don't want to have a race with nvme_process_sq(). + * What implicitly protects us from this is BQL. 
+ */ + assert(bql_locked()); + + /* cancel all SQ processing BHs */ + for (i = 0; i < n->num_queues; i++) { + NvmeSQueue *sq = n->sq[i]; + + if (!sq) + continue; + + qemu_bh_cancel(sq->bh); + } + + /* drain all IO */ + for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { + NvmeNamespace *ns; + + ns = nvme_ns(n, i); + if (!ns) { + continue; + } + + trace_pci_nvme_pre_save_ns_drain(n, i); + nvme_ns_drain(ns); + } + + /* + * Now, we should take care of AERs. + * + * 1. Save all queued events (n->aer_queue). + * This is done automatically, see nvme_vmstate VMStateDescription. + * Here we only need to print them for debugging purpose. + * 2. Go over outstanding AER requests (n->aer_reqs) and check they are + * all have expected opcode (NVME_ADM_CMD_ASYNC_EV_REQ) and other fields. + * + * We must be really careful here, because in case of further QEMU NVMe changes, + * we may break migration without noticing it, or worse, introduce silent + * data corruption during migration. + */ + if (n->aer_queued) { + NvmeAsyncEvent *event; + + QTAILQ_FOREACH(event, &n->aer_queue, entry) { + trace_pci_nvme_pre_save_aer(event->result.event_type, event->result.event_info, + event->result.log_page); + } + } + + for (i = 0; i < n->outstanding_aers; i++) { + NvmeRequest *req = n->aer_reqs[i]; + + if (!pre_save_validate_aer_req(req, errp)) { + return false; + } + } + + /* make sure that all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */ + for (i = 0; i < n->num_queues; i++) { + NvmeRequest *req; + NvmeSQueue *sq = n->sq[i]; + + if (!sq) + continue; + + trace_pci_nvme_pre_save_sq_out_req_check(n, i, sq->head, sq->tail, sq->size); + + QTAILQ_FOREACH(req, &sq->out_req_list, entry) { + assert(req->cmd.opcode == NVME_ADM_CMD_ASYNC_EV_REQ); + } + } + + /* wait when all IO requests completions are written to guest memory */ + for (i = 0; i < n->num_queues; i++) { + NvmeCQueue *cq = n->cq[i]; + + if (!cq) + continue; + + qemu_bh_cancel(cq->bh); + /* this should empty 
cq->req_list unless CQ is full */ + nvme_post_cqes(cq); + + trace_pci_nvme_pre_save_cq_req_check(n, i, cq->head, cq->tail, cq->size); + + if (!QTAILQ_EMPTY(&cq->req_list)) { + NvmeRequest *req; + + assert(nvme_cq_full(cq)); + + QTAILQ_FOREACH(req, &cq->req_list, entry) { + trace_pci_nvme_pre_save_cq_unposted_cqe(n, i, nvme_cid(req), + nvme_nsid(req->ns), + le32_to_cpu(req->cqe.result), + le32_to_cpu(req->cqe.dw1), + req->status, req->cmd.opcode); + if (!pre_save_validate_cq_req(req, errp)) { + return false; + } + } + } + } + + for (uint32_t nsid = 0; nsid <= NVME_MAX_NAMESPACES; nsid++) { + NvmeNamespace *ns = n->namespaces[nsid]; + + if (!ns) + continue; + + if (ns != &n->namespace) { + error_setg(errp, "only one NVMe namespace is supported for migration"); + return false; + } + } + + return true; +} + +static bool nvme_ctrl_post_load(void *opaque, int version_id, Error **errp) +{ + NvmeCtrl *n = opaque; + int i; + + trace_pci_nvme_post_load_enter(n); + + /* restore CQs first */ + for (i = 0; i < n->num_queues; i++) { + NvmeCQueue *cq = n->cq[i]; + + if (!cq) + continue; + + cq->ctrl = n; + nvme_restore_cq(cq); + trace_pci_nvme_post_load_restore_cq(n, i, cq->head, cq->tail, cq->size); + + if (i == 0) { + /* + * Admin CQ lives in n->admin_cq, we don't need + * memory allocated for it in get_ptrs_array_entry() anymore. + * + * nvme_restore_cq() also takes care of: + * n->cq[0] = &n->admin_cq; + * so n->cq[0] remains valid. 
+ */ + g_free(cq); + } + } + + for (i = 0; i < n->num_queues; i++) { + NvmeSQueue *sq = n->sq[i]; + + if (!sq) + continue; + + sq->ctrl = n; + nvme_restore_sq(sq); + trace_pci_nvme_post_load_restore_sq(n, i, sq->head, sq->tail, sq->size); + + if (i == 0) { + /* same as for CQ */ + g_free(sq); + } + } + + /* restore cq->req_list-s */ + for (i = 0; i < n->num_queues; i++) { + NvmeRequest *req_from, *next; + typeof_field(NvmeCQueue, req_list) req_list; + NvmeCQueue *cq = n->cq[i]; + + if (!cq || QTAILQ_EMPTY(&cq->req_list)) + continue; + + /* + * We use nvme_vmstate_request VMStateDescription to save/restore + * NvmeRequest structures, but tricky thing here is that + * memory for each cq->req_list item is allocated separately + * during restore. It doesn't work for us. We need to take + * an existing NvmeRequest structure from SQ's req_list pool + * and fill it with data from the newly allocated one (req_from). + * Then, we can safely release allocated memory for it. + */ + + /* make a copy of cq->req_list (QTAILQ head) and clean cq->req_list */ + QTAILQ_INIT(&req_list); + QTAILQ_FOREACH_SAFE(req_from, &cq->req_list, entry, next) { + QTAILQ_REMOVE(&cq->req_list, req_from, entry); + QTAILQ_INSERT_TAIL(&req_list, req_from, entry); + } + QTAILQ_INIT(&cq->req_list); + + QTAILQ_FOREACH_SAFE(req_from, &req_list, entry, next) { + uint16_t sqid = le16_to_cpu(req_from->cqe.sq_id); + NvmeRequest *req; + NvmeSQueue *sq; + + assert(!nvme_check_sqid(n, sqid)); + sq = n->sq[sqid]; + + req = QTAILQ_FIRST(&sq->req_list); + QTAILQ_REMOVE(&sq->req_list, req, entry); + QTAILQ_INSERT_TAIL(&cq->req_list, req, entry); + nvme_req_clear(req); + + /* copy data from the source NvmeRequest */ + req->status = req_from->status; + memcpy(&req->cqe, &req_from->cqe, sizeof(NvmeCqe)); + memcpy(&req->cmd, &req_from->cmd, sizeof(NvmeCmd)); + + QTAILQ_REMOVE(&req_list, req_from, entry); + g_free(req_from); + } + + qemu_bh_schedule(cq->bh); + } + + if (n->aer_queued) { + NvmeAsyncEvent *event; + + 
QTAILQ_FOREACH(event, &n->aer_queue, entry) { + trace_pci_nvme_post_load_aer(event->result.event_type, event->result.event_info, + event->result.log_page); + } + } + + for (i = 0; i < n->outstanding_aers; i++) { + NvmeSQueue *sq = &n->admin_sq; + NvmeRequest *req_from = n->aer_reqs[i]; + NvmeRequest *req; + + /* Idea here is the same as for "restore cq->req_list-s" step */ + + /* take an NvmeRequest struct from SQ */ + req = QTAILQ_FIRST(&sq->req_list); + QTAILQ_REMOVE(&sq->req_list, req, entry); + QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry); + nvme_req_clear(req); + + /* copy data from the source NvmeRequest */ + req->status = req_from->status; + memcpy(&req->cqe, &req_from->cqe, sizeof(NvmeCqe)); + memcpy(&req->cmd, &req_from->cmd, sizeof(NvmeCmd)); + + n->aer_reqs[i] = req; + g_free(req_from); + } + + /* + * We need to attach namespaces (currently, only one namespace is + * supported for migration). + * This logic comes from nvme_start_ctrl(). + */ + for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { + NvmeNamespace *ns = nvme_subsys_ns(n->subsys, i); + + if (!ns || (!ns->params.shared && ns->ctrl != n)) { + continue; + } + + if (nvme_csi_supported(n, ns->csi) && !ns->params.detached) { + if (!ns->attached || ns->params.shared) { + nvme_attach_ns(n, ns); + } + } + } + + /* schedule SQ processing */ + for (i = 0; i < n->num_queues; i++) { + NvmeSQueue *sq = n->sq[i]; + + if (!sq) + continue; + + qemu_bh_schedule(sq->bh); + } + + return true; +} + static const VMStateDescription nvme_vmstate = { .name = "nvme", - .unmigratable = 1, + .minimum_version_id = 1, + .version_id = 1, + .pre_save_errp = nvme_ctrl_pre_save, + .post_load_errp = nvme_ctrl_post_load, + .fields = (const VMStateField[]) { + VMSTATE_PCI_DEVICE(parent_obj, NvmeCtrl), + VMSTATE_MSIX(parent_obj, NvmeCtrl), + VMSTATE_STRUCT(bar, NvmeCtrl, 0, nvme_vmstate_bar, NvmeBar), + + VMSTATE_BOOL(qs_created, NvmeCtrl), + VMSTATE_UINT32(page_size, NvmeCtrl), + VMSTATE_UINT16(page_bits, NvmeCtrl), + 
VMSTATE_UINT16(max_prp_ents, NvmeCtrl), + VMSTATE_UINT32(max_q_ents, NvmeCtrl), + VMSTATE_UINT8(outstanding_aers, NvmeCtrl), + VMSTATE_UINT32(irq_status, NvmeCtrl), + VMSTATE_INT32(cq_pending, NvmeCtrl), + + VMSTATE_UINT64(host_timestamp, NvmeCtrl), + VMSTATE_UINT64(timestamp_set_qemu_clock_ms, NvmeCtrl), + VMSTATE_UINT64(starttime_ms, NvmeCtrl), + VMSTATE_UINT16(temperature, NvmeCtrl), + VMSTATE_UINT8(smart_critical_warning, NvmeCtrl), + + VMSTATE_UINT32(conf_msix_qsize, NvmeCtrl), + VMSTATE_UINT32(conf_ioqpairs, NvmeCtrl), + VMSTATE_UINT64(dbbuf_dbs, NvmeCtrl), + VMSTATE_UINT64(dbbuf_eis, NvmeCtrl), + VMSTATE_BOOL(dbbuf_enabled, NvmeCtrl), + + VMSTATE_UINT8(aer_mask, NvmeCtrl), + VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT8_ALLOC( + aer_reqs, NvmeCtrl, outstanding_aers, 0, nvme_vmstate_request, NvmeRequest), + VMSTATE_QTAILQ_V(aer_queue, NvmeCtrl, 1, nvme_vmstate_async_event, + NvmeAsyncEvent, entry), + VMSTATE_INT32(aer_queued, NvmeCtrl), + + VMSTATE_STRUCT(namespace, NvmeCtrl, 0, nvme_vmstate_ns, NvmeNamespace), + + VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT32_ALLOC( + sq, NvmeCtrl, num_queues, 0, nvme_vmstate_squeue, NvmeSQueue), + VMSTATE_VARRAY_OF_POINTER_TO_STRUCT_UINT32_ALLOC( + cq, NvmeCtrl, num_queues, 0, nvme_vmstate_cqueue, NvmeCQueue), + + VMSTATE_UINT16(features.temp_thresh_hi, NvmeCtrl), + VMSTATE_UINT16(features.temp_thresh_low, NvmeCtrl), + VMSTATE_UINT32(features.async_config, NvmeCtrl), + VMSTATE_STRUCT(features.hbs, NvmeCtrl, 0, nvme_vmstate_hbs, NvmeHostBehaviorSupport), + + VMSTATE_UINT32(dn, NvmeCtrl), + VMSTATE_STRUCT(atomic, NvmeCtrl, 0, nvme_vmstate_atomic, NvmeAtomic), + + VMSTATE_END_OF_LIST() + }, }; static void nvme_class_init(ObjectClass *oc, const void *data) diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c index 38f86a17268..dd374677078 100644 --- a/hw/nvme/ns.c +++ b/hw/nvme/ns.c @@ -20,6 +20,7 @@ #include "qemu/bitops.h" #include "system/system.h" #include "system/block-backend.h" +#include "migration/vmstate.h" #include "nvme.h" #include 
"trace.h" @@ -886,6 +887,164 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp) } } +static const VMStateDescription nvme_vmstate_lbaf = { + .name = "nvme_lbaf", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT16(ms, NvmeLBAF), + VMSTATE_UINT8(ds, NvmeLBAF), + VMSTATE_UINT8(rp, NvmeLBAF), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_id_ns = { + .name = "nvme_id_ns", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT64(nsze, NvmeIdNs), + VMSTATE_UINT64(ncap, NvmeIdNs), + VMSTATE_UINT64(nuse, NvmeIdNs), + VMSTATE_UINT8(nsfeat, NvmeIdNs), + VMSTATE_UINT8(nlbaf, NvmeIdNs), + VMSTATE_UINT8(flbas, NvmeIdNs), + VMSTATE_UINT8(mc, NvmeIdNs), + VMSTATE_UINT8(dpc, NvmeIdNs), + VMSTATE_UINT8(dps, NvmeIdNs), + VMSTATE_UINT8(nmic, NvmeIdNs), + VMSTATE_UINT8(rescap, NvmeIdNs), + VMSTATE_UINT8(fpi, NvmeIdNs), + VMSTATE_UINT8(dlfeat, NvmeIdNs), + VMSTATE_UINT16(nawun, NvmeIdNs), + VMSTATE_UINT16(nawupf, NvmeIdNs), + VMSTATE_UINT16(nacwu, NvmeIdNs), + VMSTATE_UINT16(nabsn, NvmeIdNs), + VMSTATE_UINT16(nabo, NvmeIdNs), + VMSTATE_UINT16(nabspf, NvmeIdNs), + VMSTATE_UINT16(noiob, NvmeIdNs), + VMSTATE_UINT8_ARRAY(nvmcap, NvmeIdNs, 16), + VMSTATE_UINT16(npwg, NvmeIdNs), + VMSTATE_UINT16(npwa, NvmeIdNs), + VMSTATE_UINT16(npdg, NvmeIdNs), + VMSTATE_UINT16(npda, NvmeIdNs), + VMSTATE_UINT16(nows, NvmeIdNs), + VMSTATE_UINT16(mssrl, NvmeIdNs), + VMSTATE_UINT32(mcl, NvmeIdNs), + VMSTATE_UINT8(msrc, NvmeIdNs), + VMSTATE_UINT8_ARRAY(rsvd81, NvmeIdNs, 18), + VMSTATE_UINT8(nsattr, NvmeIdNs), + VMSTATE_UINT16(nvmsetid, NvmeIdNs), + VMSTATE_UINT16(endgid, NvmeIdNs), + VMSTATE_UINT8_ARRAY(nguid, NvmeIdNs, 16), + VMSTATE_UINT64(eui64, NvmeIdNs), + VMSTATE_STRUCT_ARRAY(lbaf, NvmeIdNs, NVME_MAX_NLBAF, 1, + nvme_vmstate_lbaf, NvmeLBAF), + VMSTATE_UINT8_ARRAY(vs, NvmeIdNs, 3712), + + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription 
nvme_vmstate_id_ns_nvm = { + .name = "nvme_id_ns_nvm", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT64(lbstm, NvmeIdNsNvm), + VMSTATE_UINT8(pic, NvmeIdNsNvm), + VMSTATE_UINT8_ARRAY(rsvd9, NvmeIdNsNvm, 3), + VMSTATE_UINT32_ARRAY(elbaf, NvmeIdNsNvm, NVME_MAX_NLBAF), + VMSTATE_UINT32(npdgl, NvmeIdNsNvm), + VMSTATE_UINT32(nprg, NvmeIdNsNvm), + VMSTATE_UINT32(npra, NvmeIdNsNvm), + VMSTATE_UINT32(nors, NvmeIdNsNvm), + VMSTATE_UINT32(npdal, NvmeIdNsNvm), + VMSTATE_UINT8_ARRAY(rsvd288, NvmeIdNsNvm, 3808), + VMSTATE_END_OF_LIST() + } +}; + +static const VMStateDescription nvme_vmstate_id_ns_ind = { + .name = "nvme_id_ns_ind", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_UINT8(nsfeat, NvmeIdNsInd), + VMSTATE_UINT8(nmic, NvmeIdNsInd), + VMSTATE_UINT8(rescap, NvmeIdNsInd), + VMSTATE_UINT8(fpi, NvmeIdNsInd), + VMSTATE_UINT32(anagrpid, NvmeIdNsInd), + VMSTATE_UINT8(nsattr, NvmeIdNsInd), + VMSTATE_UINT8(rsvd9, NvmeIdNsInd), + VMSTATE_UINT16(nvmsetid, NvmeIdNsInd), + VMSTATE_UINT16(endgrpid, NvmeIdNsInd), + VMSTATE_UINT8(nstat, NvmeIdNsInd), + VMSTATE_UINT8_ARRAY(rsvd15, NvmeIdNsInd, 4081), + VMSTATE_END_OF_LIST() + } +}; + +typedef struct TmpNvmeNamespace { + NvmeNamespace *parent; + bool enable_write_cache; +} TmpNvmeNamespace; + +static bool nvme_ns_tmp_pre_save(void *opaque, Error **errp) +{ + struct TmpNvmeNamespace *tns = opaque; + + tns->enable_write_cache = blk_enable_write_cache(tns->parent->blkconf.blk); + + return true; +} + +static bool nvme_ns_tmp_post_load(void *opaque, int version_id, Error **errp) +{ + struct TmpNvmeNamespace *tns = opaque; + + blk_set_enable_write_cache(tns->parent->blkconf.blk, tns->enable_write_cache); + + return true; +} + +static const VMStateDescription nvme_vmstate_ns_tmp = { + .name = "nvme_ns_tmp", + .pre_save_errp = nvme_ns_tmp_pre_save, + .post_load_errp = nvme_ns_tmp_post_load, + .fields = (const VMStateField[]) { + 
VMSTATE_BOOL(enable_write_cache, TmpNvmeNamespace), + VMSTATE_END_OF_LIST() + } +}; + +const VMStateDescription nvme_vmstate_ns = { + .name = "nvme_ns", + .version_id = 1, + .minimum_version_id = 1, + .fields = (const VMStateField[]) { + VMSTATE_WITH_TMP(NvmeNamespace, TmpNvmeNamespace, nvme_vmstate_ns_tmp), + + VMSTATE_STRUCT(id_ns, NvmeNamespace, 0, nvme_vmstate_id_ns, NvmeIdNs), + VMSTATE_STRUCT(id_ns_nvm, NvmeNamespace, 0, nvme_vmstate_id_ns_nvm, NvmeIdNsNvm), + VMSTATE_STRUCT(id_ns_ind, NvmeNamespace, 0, nvme_vmstate_id_ns_ind, NvmeIdNsInd), + VMSTATE_STRUCT(lbaf, NvmeNamespace, 0, nvme_vmstate_lbaf, NvmeLBAF), + VMSTATE_UINT32(nlbaf, NvmeNamespace), + VMSTATE_UINT8(csi, NvmeNamespace), + VMSTATE_UINT16(status, NvmeNamespace), + VMSTATE_UINT8(pif, NvmeNamespace), + + VMSTATE_UINT16(zns.zrwas, NvmeNamespace), + VMSTATE_UINT16(zns.zrwafg, NvmeNamespace), + VMSTATE_UINT32(zns.numzrwa, NvmeNamespace), + + VMSTATE_UINT32(features.err_rec, NvmeNamespace), + VMSTATE_STRUCT(atomic, NvmeNamespace, 0, nvme_vmstate_atomic, NvmeAtomic), + VMSTATE_END_OF_LIST() + } +}; + static const Property nvme_ns_props[] = { DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf), DEFINE_PROP_BOOL("detached", NvmeNamespace, params.detached, false), @@ -937,6 +1096,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data) dc->bus_type = TYPE_NVME_BUS; dc->realize = nvme_ns_realize; dc->unrealize = nvme_ns_unrealize; + dc->vmsd = &nvme_vmstate_ns; device_class_set_props(dc, nvme_ns_props); dc->desc = "Virtual NVMe namespace"; } diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 457b6637249..2e7597cded3 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -444,6 +444,11 @@ typedef struct NvmeRequest { NvmeSg sg; bool atomic_write; QTAILQ_ENTRY(NvmeRequest)entry; + /* + * If you add a new field here, please make sure to update + * nvme_vmstate_request, pre_save_validate_aer_req() and + * pre_save_validate_cq_req(). 
+ */ } NvmeRequest; typedef struct NvmeBounceContext { @@ -638,6 +643,7 @@ typedef struct NvmeCtrl { NvmeNamespace namespace; NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1]; + uint32_t num_queues; NvmeSQueue **sq; NvmeCQueue **cq; NvmeSQueue admin_sq; @@ -749,4 +755,7 @@ void nvme_atomic_configure_max_write_size(bool dn, uint16_t awun, void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn, uint16_t nabspf, NvmeAtomic *atomic); +extern const VMStateDescription nvme_vmstate_atomic; +extern const VMStateDescription nvme_vmstate_ns; + #endif /* HW_NVME_NVME_H */ diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events index 6be0bfa1c1f..f97a6a11f36 100644 --- a/hw/nvme/trace-events +++ b/hw/nvme/trace-events @@ -7,6 +7,16 @@ pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64 pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64"" pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64"" pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d" +pci_nvme_pre_save_enter(void *n) "n=%p" +pci_nvme_pre_save_ns_drain(void *n, int i) "n=%p i=%d" +pci_nvme_pre_save_sq_out_req_check(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32"" +pci_nvme_pre_save_cq_req_check(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32"" +pci_nvme_pre_save_cq_unposted_cqe(void *n, int i, uint16_t cid, uint32_t nsid, uint32_t dw0, uint32_t dw1, uint16_t status, uint8_t opc) "n=%p i=%d cid %"PRIu16" nsid %"PRIu32" dw0 0x%"PRIx32" dw1 0x%"PRIx32" status 0x%"PRIx16" opc 0x%"PRIx8"" +pci_nvme_pre_save_aer(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8"" +pci_nvme_post_load_enter(void *n) "n=%p" 
+pci_nvme_post_load_restore_cq(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32"" +pci_nvme_post_load_restore_sq(void *n, int i, uint32_t head, uint32_t tail, uint32_t size) "n=%p i=%d head=0x%"PRIx32" tail=0x%"PRIx32" size=0x%"PRIx32"" +pci_nvme_post_load_aer(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8"" pci_nvme_map_sgl(uint8_t typ, uint64_t len) "type 0x%"PRIx8" len %"PRIu64"" pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" nsid 0x%"PRIx32" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'" pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'" -- 2.47.3 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v6 6/8] hw/nvme: add basic live migration support 2026-04-19 13:01 ` [PATCH v6 6/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn @ 2026-04-27 21:03 ` Stefan Hajnoczi 2026-04-28 15:38 ` Alexander Mikhalitsyn 0 siblings, 1 reply; 17+ messages in thread From: Stefan Hajnoczi @ 2026-04-27 21:03 UTC (permalink / raw) To: Alexander Mikhalitsyn Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini, Laurent Vivier, Jesper Devantier, Klaus Jensen, Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn [-- Attachment #1: Type: text/plain, Size: 2474 bytes --] On Sun, Apr 19, 2026 at 03:01:37PM +0200, Alexander Mikhalitsyn wrote: > @@ -4903,6 +4916,25 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr, > __nvme_init_sq(sq); > } > > +static void nvme_restore_sq(NvmeSQueue *sq_from) > +{ > + NvmeCtrl *n = sq_from->ctrl; > + NvmeSQueue *sq = sq_from; > + > + if (sq_from->sqid == 0) { > + sq = &n->admin_sq; docs/devel/migration/main.rst says: - The destination should treat an incoming migration stream as hostile (which we do to varying degrees in the existing code). Check that offsets into buffers and the like can't cause overruns. Fail the incoming migration in the case of a corrupted stream like this. Can a corrupt/malicious device state reach this point multiple times (i.e. several sqid 0 queues are stored in the device state)? If yes, then input validation would be good here to avoid undefined behavior later in the sq code. The same issue may apply to duplicate sqids in general. It seems safest to reject them during restore. 
> + sq->ctrl = n; > + sq->dma_addr = sq_from->dma_addr; > + sq->sqid = sq_from->sqid; > + sq->size = sq_from->size; > + sq->cqid = sq_from->cqid; > + sq->head = sq_from->head; > + sq->tail = sq_from->tail; > + } > + > + __nvme_init_sq(sq); > +} > + > static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req) > { > NvmeSQueue *sq; > @@ -5605,6 +5637,39 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr, > __nvme_init_cq(cq); > } > > +static void copy_cq_req_list(NvmeCQueue *cq_to, NvmeCQueue *cq_from) This moves the NvmeRequests rather than copying them. Rename to move_cq_req_list()? > +{ > + NvmeRequest *req, *next; > + > + QTAILQ_FOREACH_SAFE(req, &cq_from->req_list, entry, next) { > + QTAILQ_REMOVE(&cq_from->req_list, req, entry); > + QTAILQ_INSERT_TAIL(&cq_to->req_list, req, entry); > + } > +} > + > +static void nvme_restore_cq(NvmeCQueue *cq_from) > +{ > + NvmeCtrl *n = cq_from->ctrl; > + NvmeCQueue *cq = cq_from; > + > + if (cq_from->cqid == 0) { > + cq = &n->admin_cq; Same question about duplicate CQs in corrupt/malicious device states. I reviewed the new draining (busy wait removal). I'll leave the rest for Klaus because I'm not that familiar with the NVMe emulation code. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 6/8] hw/nvme: add basic live migration support 2026-04-27 21:03 ` Stefan Hajnoczi @ 2026-04-28 15:38 ` Alexander Mikhalitsyn 0 siblings, 0 replies; 17+ messages in thread From: Alexander Mikhalitsyn @ 2026-04-28 15:38 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini, Laurent Vivier, Jesper Devantier, Klaus Jensen, Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn Am Mo., 27. Apr. 2026 um 23:03 Uhr schrieb Stefan Hajnoczi <stefanha@redhat.com>: > > On Sun, Apr 19, 2026 at 03:01:37PM +0200, Alexander Mikhalitsyn wrote: > > @@ -4903,6 +4916,25 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr, > > __nvme_init_sq(sq); > > } > > > > +static void nvme_restore_sq(NvmeSQueue *sq_from) > > +{ > > + NvmeCtrl *n = sq_from->ctrl; > > + NvmeSQueue *sq = sq_from; > > + > > + if (sq_from->sqid == 0) { > > + sq = &n->admin_sq; > > docs/devel/migration/main.rst says: > > - The destination should treat an incoming migration stream as hostile > (which we do to varying degrees in the existing code). Check that offsets > into buffers and the like can't cause overruns. Fail the incoming migration > in the case of a corrupted stream like this. Hi Stefan, > > Can a corrupt/malicious device state reach this point multiple times > (i.e. several sqid 0 queues are stored in the device state)? If yes, > then input validation would be good here to avoid undefined behavior > later in the sq code. > > The same issue may apply to duplicate sqids in general. It seems safest > to reject them during restore. Definitely. I'll take care of this in the next version. Thank you! 
>
> > +        sq->ctrl = n;
> > +        sq->dma_addr = sq_from->dma_addr;
> > +        sq->sqid = sq_from->sqid;
> > +        sq->size = sq_from->size;
> > +        sq->cqid = sq_from->cqid;
> > +        sq->head = sq_from->head;
> > +        sq->tail = sq_from->tail;
> > +    }
> > +
> > +    __nvme_init_sq(sq);
> > +}
> > +
> >  static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      NvmeSQueue *sq;
> > @@ -5605,6 +5637,39 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
> >      __nvme_init_cq(cq);
> >  }
> >
> > +static void copy_cq_req_list(NvmeCQueue *cq_to, NvmeCQueue *cq_from)
>
> This moves the NvmeRequests rather than copying them. Rename to
> move_cq_req_list()?

Sure!

>
> > +{
> > +    NvmeRequest *req, *next;
> > +
> > +    QTAILQ_FOREACH_SAFE(req, &cq_from->req_list, entry, next) {
> > +        QTAILQ_REMOVE(&cq_from->req_list, req, entry);
> > +        QTAILQ_INSERT_TAIL(&cq_to->req_list, req, entry);
> > +    }
> > +}
> > +
> > +static void nvme_restore_cq(NvmeCQueue *cq_from)
> > +{
> > +    NvmeCtrl *n = cq_from->ctrl;
> > +    NvmeCQueue *cq = cq_from;
> > +
> > +    if (cq_from->cqid == 0) {
> > +        cq = &n->admin_cq;
>
> Same question about duplicate CQs in corrupt/malicious device states.
>
> I reviewed the new draining (busy wait removal). I'll leave the rest for
> Klaus because I'm not that familiar with the NVMe emulation code.

Thanks for reviewing this, Stefan. I'll fix everything and send the next
version in the second half of next week, because this week I'm on vacation
and next week I'll be at LSF/MM/BPF.

Kind regards,
Alex

>
> Stefan

^ permalink raw reply [flat|nested] 17+ messages in thread
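One way to act on the duplicate-queue concern discussed in this sub-thread is to validate the whole set of restored queue descriptors before any controller state is touched. The following is only a minimal sketch under stated assumptions: SketchQueueState, SKETCH_MAX_QUEUES and sketch_validate_queues() are hypothetical names that do not exist in hw/nvme, and the fix in the next version of the series may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound for this sketch; not the hw/nvme limit. */
#define SKETCH_MAX_QUEUES 64

/* Hypothetical, reduced view of one queue as read from the stream. */
typedef struct SketchQueueState {
    uint16_t qid;   /* queue identifier (0 is the admin queue) */
    uint32_t size;  /* number of entries */
    uint32_t head;
    uint32_t tail;
} SketchQueueState;

/*
 * Return true only if every queue ID is in range and appears at most
 * once, and head/tail are valid indexes for the declared size.
 * Nothing is modified, so a caller can reject the stream early.
 */
static bool sketch_validate_queues(const SketchQueueState *qs, size_t count)
{
    bool seen[SKETCH_MAX_QUEUES] = { false };

    if (count > SKETCH_MAX_QUEUES) {
        return false;
    }

    for (size_t i = 0; i < count; i++) {
        const SketchQueueState *q = &qs[i];

        if (q->qid >= SKETCH_MAX_QUEUES) {
            return false;   /* ID would index past the queue table */
        }
        if (seen[q->qid]) {
            return false;   /* duplicate ID, e.g. two "admin" queues */
        }
        if (q->size == 0 || q->head >= q->size || q->tail >= q->size) {
            return false;   /* indexes would overrun the ring */
        }
        seen[q->qid] = true;
    }

    return true;
}
```

A post_load hook could run a check of this kind and return a negative errno value to fail the incoming migration, which is what the docs/devel/migration/main.rst guidance quoted in the review asks for.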
* [PATCH v6 7/8] tests/functional/x86_64: add migration test for NVMe device
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
` (5 preceding siblings ...)
2026-04-19 13:01 ` [PATCH v6 6/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
@ 2026-04-19 13:01 ` Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ Alexander Mikhalitsyn
7 siblings, 0 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw)
To: qemu-devel
Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng,
Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini,
Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen,
Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz,
Alexander Mikhalitsyn
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>

Introduce a very simple test to ensure that NVMe device migration
works.

The test plan is simple:
1. prepare a VM with an NVMe device
2. run a workload that produces relatively heavy I/O on the device
3. migrate the VM
4. ensure that the workload is still alive and finishes without errors

The test can be run with:

$ meson test 'func-x86_64-nvme_migration' --setup thorough -C build

In the future we can extend this approach and introduce some fio-based
tests. It probably also makes sense to apply this test not only to the
NVMe device, but also to virtio-{blk,scsi}, IDE, SATA and other
migratable devices.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io> --- tests/functional/x86_64/meson.build | 1 + .../functional/x86_64/test_nvme_migration.py | 159 ++++++++++++++++++ 2 files changed, 160 insertions(+) create mode 100755 tests/functional/x86_64/test_nvme_migration.py diff --git a/tests/functional/x86_64/meson.build b/tests/functional/x86_64/meson.build index 1ed10ad6c29..fd77f19d726 100644 --- a/tests/functional/x86_64/meson.build +++ b/tests/functional/x86_64/meson.build @@ -37,6 +37,7 @@ tests_x86_64_system_thorough = [ 'linux_initrd', 'multiprocess', 'netdev_ethtool', + 'nvme_migration', 'replay', 'reverse_debug', 'tuxrun', diff --git a/tests/functional/x86_64/test_nvme_migration.py b/tests/functional/x86_64/test_nvme_migration.py new file mode 100755 index 00000000000..3788a8e3473 --- /dev/null +++ b/tests/functional/x86_64/test_nvme_migration.py @@ -0,0 +1,159 @@ +#!/usr/bin/env python3 +# +# SPDX-License-Identifier: GPL-2.0-or-later +# +# x86_64 NVMe migration test + +from migration import MigrationTest +from qemu_test import QemuSystemTest, Asset +from qemu_test import wait_for_console_pattern +from qemu_test import exec_command, exec_command_and_wait_for_pattern + + +class X8664NVMeMigrationTest(MigrationTest): + ASSET_KERNEL = Asset( + ('https://archives.fedoraproject.org/pub/archive/fedora/linux/releases' + '/31/Server/x86_64/os/images/pxeboot/vmlinuz'), + 'd4738d03dbbe083ca610d0821d0a8f1488bebbdccef54ce33e3adb35fda00129') + + ASSET_INITRD = Asset( + ('https://archives.fedoraproject.org/pub/archive/fedora/linux/releases' + '/31/Server/x86_64/os/images/pxeboot/initrd.img'), + '277cd6c7adf77c7e63d73bbb2cded8ef9e2d3a2f100000e92ff1f8396513cd8b') + + ASSET_DISKIMAGE = Asset( + ('https://archives.fedoraproject.org/pub/archive/fedora/linux/releases' + '/31/Cloud/x86_64/images/Fedora-Cloud-Base-31-1.9.x86_64.qcow2'), + 'e3c1b309d9203604922d6e255c2c5d098a309c2d46215d8fc026954f3c5c27a0') + + DEFAULT_KERNEL_PARAMS = ('root=/dev/nvme0n1p1 
console=ttyS0 net.ifnames=0 ' + 'rd.rescue quiet') + + def wait_for_console_pattern(self, success_message, vm): + wait_for_console_pattern( + self, + success_message, + failure_message="Kernel panic - not syncing", + vm=vm, + ) + + def exec_command_and_check(self, command, vm): + prompt = '# ' + exec_command_and_wait_for_pattern(self, + f"{command} && echo OK || echo FAIL", + 'FAIL', vm=vm) + # Note, that commands we send to the console are echo-ed back, so if we have a word "FAIL" + # in the command itself, we should expect to see it once. + wait_for_console_pattern(self, 'OK', failure_message="FAIL", vm=vm) + self.wait_for_console_pattern(prompt, vm) + + def configure_machine(self, vm): + kernel_path = self.ASSET_KERNEL.fetch() + initrd_path = self.ASSET_INITRD.fetch() + diskimage_path = self.ASSET_DISKIMAGE.fetch() + + vm.set_console() + vm.add_args("-cpu", "max") + vm.add_args("-m", "2G") + vm.add_args("-accel", "kvm") + + vm.add_args('-drive', + f'file={diskimage_path},if=none,id=drv0,snapshot=on') + vm.add_args('-device', 'nvme,bus=pcie.0,' + + 'drive=drv0,id=nvme-disk0,serial=nvmemigratetest,bootindex=1') + + vm.add_args( + "-kernel", + kernel_path, + "-initrd", + initrd_path, + "-append", + self.DEFAULT_KERNEL_PARAMS + ) + + def launch_source_vm(self, vm): + vm.launch() + + self.wait_for_console_pattern('Entering emergency mode.', vm) + prompt = '# ' + self.wait_for_console_pattern(prompt, vm) + + # Synchronize on NVMe driver creating the root device + exec_command_and_wait_for_pattern(self, + "while ! 
(dmesg -c | grep nvme0n1:) ; do sleep 1 ; done", + "nvme0n1", vm=vm) + self.wait_for_console_pattern(prompt, vm) + + # prepare system + exec_command_and_wait_for_pattern(self, 'mount /dev/nvme0n1p1 /sysroot', + prompt, vm=vm) + exec_command_and_wait_for_pattern(self, 'chroot /sysroot', + prompt, vm=vm) + exec_command_and_wait_for_pattern(self, 'mount -t proc proc /proc', + prompt, vm=vm) + exec_command_and_wait_for_pattern(self, 'mount -t sysfs sysfs /sys', + prompt, vm=vm) + + # Run workload before migration to check if it continues to run properly after migration + # + # Workload is simple: it continuously calculates checksums of all files in /usr/bin + # to generate some I/O load on the NVMe disk and at the same time it drops caches to + # make sure that we have some read I/O on the disk as well. + # If there are any issues with the migration of the NVMe device, we should see errors + # in dmesg and consequently in the workload log. + exec_command_and_wait_for_pattern(self, + "(while [ ! -f /tmp/test_nvme_migration_workload.stop ]; do \ + rm -f /tmp/test_nvme_migration_workload.iteration_finished; \ + echo 3 > /proc/sys/vm/drop_caches; \ + find /usr/bin -type f -exec cksum {} \\;; \ + touch /tmp/test_nvme_migration_workload.iteration_finished; \ + done) > /dev/null 2> /tmp/test_nvme_migration_workload.errors &", + prompt, vm=vm) + exec_command_and_wait_for_pattern(self, 'echo $! 
> /tmp/test_nvme_migration_workload.pid', + prompt, vm=vm) + + # check if process is alive and running + self.exec_command_and_check("kill -0 $(cat /tmp/test_nvme_migration_workload.pid)", vm) + + def assert_dest_vm(self, vm): + prompt = '# ' + + # check if process is alive and running after migration, if not - fail the test + self.exec_command_and_check("kill -0 $(cat /tmp/test_nvme_migration_workload.pid)", vm) + + # signal workload to stop + exec_command_and_wait_for_pattern(self, 'touch /tmp/test_nvme_migration_workload.stop', + prompt, vm=vm) + + # wait workload to finish, because we want to examine log to see if there are any errors + exec_command_and_wait_for_pattern(self, + "while [ ! -f /tmp/test_nvme_migration_workload.iteration_finished ]; do sleep 1; done;", + prompt, vm=vm) + + exec_command_and_wait_for_pattern(self, 'cat /tmp/test_nvme_migration_workload.errors', + prompt, vm=vm) + + # fail the test if non-empty + self.exec_command_and_check("[ ! -s /tmp/test_nvme_migration_workload.errors ]", vm) + + def test_migration_with_tcp_localhost(self): + self.set_machine('q35') + self.require_accelerator("kvm") + + self.migration_with_tcp_localhost() + + def test_migration_with_unix(self): + self.set_machine('q35') + self.require_accelerator("kvm") + + self.migration_with_unix() + + def test_migration_with_exec(self): + self.set_machine('q35') + self.require_accelerator("kvm") + + self.migration_with_exec() + + +if __name__ == '__main__': + MigrationTest.main() -- 2.47.3 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ 2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn ` (6 preceding siblings ...) 2026-04-19 13:01 ` [PATCH v6 7/8] tests/functional/x86_64: add migration test for NVMe device Alexander Mikhalitsyn @ 2026-04-19 13:01 ` Alexander Mikhalitsyn 2026-04-27 20:00 ` Stefan Hajnoczi 7 siblings, 1 reply; 17+ messages in thread From: Alexander Mikhalitsyn @ 2026-04-19 13:01 UTC (permalink / raw) To: qemu-devel Cc: Alexander Mikhalitsyn, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini, Stefan Hajnoczi, Laurent Vivier, Jesper Devantier, Klaus Jensen, Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io> As suggested by Stefan [1], let's add a migration test to cover rare scenario when CQ is full of non-processed CQEs and migration happens. To run this test: $ meson test -C build 'qtest-x86_64/qos-test' Link: https://lore.kernel.org/qemu-devel/20260408183529.GB319710@fedora/ [1] Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io> --- tests/qtest/nvme-test.c | 393 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 393 insertions(+) diff --git a/tests/qtest/nvme-test.c b/tests/qtest/nvme-test.c index 4aec1651e6e..1ba2fa6943f 100644 --- a/tests/qtest/nvme-test.c +++ b/tests/qtest/nvme-test.c @@ -8,9 +8,11 @@ */ #include "qemu/osdep.h" +#include "qemu/bswap.h" #include "qemu/module.h" #include "qemu/units.h" #include "libqtest.h" +#include "libqtest-single.h" #include "libqos/qgraph.h" #include "libqos/pci.h" #include "block/nvme.h" @@ -142,6 +144,395 @@ static void nvmetest_pmr_reg_test(void *obj, void *data, QGuestAllocator *alloc) qpci_iounmap(pdev, pmr_bar); } +#define PAGE_SIZE 4096 + +typedef struct nvme_ctrl nvme_ctrl; + +typedef struct 
nvme_queue { + nvme_ctrl *ctrl; + uint64_t doorbell; + uint32_t size; +} nvme_queue; + +typedef struct nvme_cq { + nvme_queue common; + NvmeCqe *phys_cqe; + uint16_t head; + uint8_t phase; +} nvme_cq; + +typedef struct nvme_sq { + nvme_queue common; + NvmeCmd *phys_sqe; + nvme_cq *cq; + uint16_t head; + uint16_t tail; +} nvme_sq; + +struct nvme_ctrl { + QGuestAllocator *alloc; + QPCIDevice *pdev; + QPCIBar bar; + + uint32_t db_stride; + + nvme_sq admin_sq; + nvme_cq admin_cq; +}; + +static void nvme_init_queue_common(nvme_ctrl *ctrl, nvme_queue *q, + uint16_t db_idx, uint32_t size) +{ + q->ctrl = ctrl; + q->doorbell = (sizeof(NvmeBar) + db_idx * ctrl->db_stride); + g_test_message(" q %p db_idx %u doorbell %lx", q, db_idx, q->doorbell); + q->size = size; +} + +static void nvme_init_sq(nvme_ctrl *ctrl, nvme_sq *sq, uint16_t db_idx, + uint32_t size, nvme_cq *cq) +{ + nvme_init_queue_common(ctrl, &sq->common, db_idx, size); + + sq->phys_sqe = (typeof(sq->phys_sqe))guest_alloc(ctrl->alloc, + PAGE_SIZE); + g_assert(sq->phys_sqe); + + g_test_message("sq %p db_idx %u sqe %p", sq, db_idx, sq->phys_sqe); + sq->cq = cq; + sq->head = 0; + sq->tail = 0; +} + +static void nvme_init_cq(nvme_ctrl *ctrl, nvme_cq *cq, uint16_t db_idx, + uint32_t size) +{ + nvme_init_queue_common(ctrl, &cq->common, db_idx, size); + + cq->phys_cqe = (typeof(cq->phys_cqe))guest_alloc(ctrl->alloc, + PAGE_SIZE); + g_assert(cq->phys_cqe); + + g_test_message("cq %p db_idx %u cqe %p", cq, db_idx, cq->phys_cqe); + cq->head = 0; + cq->phase = 1; +} + +static int nvme_cqe_pending(nvme_cq *cq) +{ + uint16_t status = qtest_readw(cq->common.ctrl->pdev->bus->qts, + (uint64_t)&cq->phys_cqe[cq->head].status); + return (le16_to_cpu(status) & 1) == cq->phase; +} + +static int nvme_is_cqe_success(NvmeCqe *cqe) +{ + return (le16_to_cpu(cqe->status) >> 1) == NVME_SUCCESS; +} + +static NvmeCqe nvme_handle_cqe(nvme_sq *sq) +{ + nvme_cq *cq = sq->cq; + NvmeCqe *phys_cqe = &cq->phys_cqe[cq->head]; + NvmeCqe cqe; + uint16_t 
cq_next_head; + + g_assert(nvme_cqe_pending(cq)); + + qtest_memread(sq->common.ctrl->pdev->bus->qts, (uint64_t)phys_cqe, &cqe, sizeof(cqe)); + + cq_next_head = (cq->head + 1) % cq->common.size; + g_test_message("cq %p head %u -> %u", cq, cq->head, cq_next_head); + if (cq_next_head < cq->head) { + cq->phase ^= 1; + } + cq->head = cq_next_head; + + if (cqe.sq_head != sq->head) { + sq->head = cqe.sq_head; + g_test_message("sq %p head = %u", sq, sq->head); + } + + qpci_io_writel(cq->common.ctrl->pdev, cq->common.ctrl->bar, cq->common.doorbell, cq->head); + + return cqe; +} + +static NvmeCqe nvme_wait(nvme_sq *sq) +{ + int i; + bool ready = false; + + for (i = 0; i < 10; i++) { + if (nvme_cqe_pending(sq->cq)) { + ready = true; + break; + } + + g_usleep(1000); + } + + g_assert(ready); + + return nvme_handle_cqe(sq); +} + +static NvmeCmd *nvme_get_next_sqe(nvme_sq *sq, uint8_t opcode, uint16_t cid, void *prp1) +{ + NvmeCmd *phys_sqe = &sq->phys_sqe[sq->tail]; + + if (((sq->tail + 1) % sq->common.size) == sq->head) { + /* no space in SQ */ + g_test_message("%s head %d tail %d", __func__, sq->head, sq->tail); + g_assert(false); + return NULL; + } + + qtest_memset(sq->common.ctrl->pdev->bus->qts, + (uint64_t)phys_sqe, 0, sizeof(*phys_sqe)); + + #define GUEST_MEM_WRITE(fn, field, val) \ + fn(sq->common.ctrl->pdev->bus->qts, (uint64_t)&(field), (val)) + + GUEST_MEM_WRITE(qtest_writeb, phys_sqe->opcode, opcode); + GUEST_MEM_WRITE(qtest_writew, phys_sqe->cid, cid); + GUEST_MEM_WRITE(qtest_writeq, phys_sqe->dptr.prp1, (uint32_t)(uint64_t)prp1); + + #undef GUEST_MEM_WRITE + + g_test_message("sq %p next_sqe %u sqe %p", sq, sq->tail, phys_sqe); + return phys_sqe; +} + +static void nvme_commit_sqe(nvme_sq *sq) +{ + g_test_message("sq %p commit sqe tail %u", sq, sq->tail); + sq->tail = (sq->tail + 1) % sq->common.size; + qpci_io_writel(sq->common.ctrl->pdev, sq->common.ctrl->bar, sq->common.doorbell, sq->tail); +} + +static NvmeIdCtrl *nvme_admin_identify_ctrl(nvme_ctrl *ctrl, 
uint16_t cid, bool no_wait) +{ + NvmeCmd *phys_cmd_identify; + NvmeIdCtrl *phys_identify; + NvmeCqe cqe; + + g_test_message("sending req cid %u no_wait %d", cid, no_wait); + + phys_identify = (typeof(phys_identify))guest_alloc(ctrl->alloc, PAGE_SIZE); + g_assert(phys_identify); + + phys_cmd_identify = nvme_get_next_sqe(&ctrl->admin_sq, + NVME_ADM_CMD_IDENTIFY, cid, + phys_identify); + g_assert(phys_cmd_identify); + + #define GUEST_MEM_WRITE(fn, field, val) \ + fn(ctrl->pdev->bus->qts, (uint64_t)&(field), (val)) + + GUEST_MEM_WRITE(qtest_writel, phys_cmd_identify->nsid, 0); + GUEST_MEM_WRITE(qtest_writel, ((NvmeIdentify *)phys_cmd_identify)->cns, NVME_ID_CNS_CTRL); + + #undef GUEST_MEM_WRITE + + nvme_commit_sqe(&ctrl->admin_sq); + + if (no_wait) { + return phys_identify; + } + + cqe = nvme_wait(&ctrl->admin_sq); + g_assert(nvme_is_cqe_success(&cqe)); + g_assert(cqe.cid == cid); + + return phys_identify; +} + +static void nvme_wait_ready(nvme_ctrl *ctrl, int val) +{ + int i; + + for (i = 0; i < 10; i++) { + uint32_t csts = qpci_io_readl(ctrl->pdev, ctrl->bar, NVME_REG_CSTS); + g_test_message("%s: csts %x", __func__, csts); + + if (NVME_CSTS_RDY(csts) == val) { + return; + } + + g_usleep(1000); + } + + g_assert(false); +} + +static void test_migrate_setup_nvme_ctrl(nvme_ctrl *ctrl) +{ + uint64_t cap; + + /* disable controller */ + qpci_io_writel(ctrl->pdev, ctrl->bar, NVME_REG_CC, 0); + nvme_wait_ready(ctrl, 0); + + cap = qpci_io_readq(ctrl->pdev, ctrl->bar, NVME_REG_CAP); + ctrl->db_stride = 4 << NVME_CAP_DSTRD(cap); + + nvme_init_cq(ctrl, &ctrl->admin_cq, 1, 2 /* CQEs num */); + nvme_init_sq(ctrl, &ctrl->admin_sq, 0, 4 /* SQEs num */, &ctrl->admin_cq); + + qpci_io_writel(ctrl->pdev, ctrl->bar, NVME_REG_AQA, + ((ctrl->admin_cq.common.size - 1) << AQA_ACQS_SHIFT) | + ((ctrl->admin_sq.common.size - 1) << AQA_ASQS_SHIFT) + ); + + qpci_io_writeq(ctrl->pdev, ctrl->bar, + NVME_REG_ASQ, (uint64_t)ctrl->admin_sq.phys_sqe); + qpci_io_writeq(ctrl->pdev, ctrl->bar, + 
NVME_REG_ACQ, (uint64_t)ctrl->admin_cq.phys_cqe); + + /* enable controller */ + { + uint32_t cc = 0; + NVME_SET_CC_EN(cc, 1); + qpci_io_writel(ctrl->pdev, ctrl->bar, NVME_REG_CC, cc); + } + + nvme_wait_ready(ctrl, 1); +} + +typedef struct test_migrate_req { + uint16_t cid; + bool handle_cqe; + NvmeIdCtrl *phys_identify; +} test_migrate_req; + +static void test_migrate_send_nvme_reqs(nvme_ctrl *ctrl, test_migrate_req *reqs, + int num) +{ + int i; + + for (i = 0; i < num; i++) { + reqs[i].phys_identify = nvme_admin_identify_ctrl(ctrl, reqs[i].cid, + !reqs[i].handle_cqe); + g_assert(reqs[i].phys_identify); + + if (reqs[i].handle_cqe) { + guest_free(ctrl->alloc, (uint64_t)reqs[i].phys_identify); + } + } +} + +static void test_migrate_check_nvme(nvme_ctrl *ctrl, test_migrate_req *reqs, int num) +{ + int i; + + for (i = 0; i < num; i++) { + NvmeCqe cqe; + + if (reqs[i].handle_cqe) { + continue; + } + + cqe = nvme_wait(&ctrl->admin_sq); + g_assert(nvme_is_cqe_success(&cqe)); + + g_assert_cmpint(cqe.cid, ==, reqs[i].cid); + + #define GUEST_MEM_READB(field) \ + qtest_readb(ctrl->pdev->bus->qts, (uint64_t)&(field)) + + g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[0]), ==, 0x0); + g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[1]), ==, 0x54); + g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[2]), ==, 0x52); + + #undef GUEST_MEM_READB + + guest_free(ctrl->alloc, (uint64_t)reqs[i].phys_identify); + } +} + +static void test_migrate(void *obj, void *data, QGuestAllocator *alloc) +{ + g_autofree gchar *tmpfs = NULL; + GError *err = NULL; + g_autofree gchar *mig_path; + g_autofree gchar *uri; + GString *dest_cmdline; + QTestState *to; + QDict *rsp; + QNvme *nvme = obj; + QPCIDevice *pdev = &nvme->dev; + nvme_ctrl *ctrl; + test_migrate_req test_reqs[] = { + { 123, true }, + { 456, false }, + { 300, false }, + { 333, false } + }; + + /* create temporary dir and prepare unix socket path for migration */ + tmpfs = 
g_dir_make_tmp("nvme-test-XXXXXX", &err); + if (!tmpfs) { + g_test_message("Can't create temporary directory in %s: %s", + g_get_tmp_dir(), err->message); + g_error_free(err); + } + g_assert(tmpfs); + + mig_path = g_strdup_printf("%s/socket.mig", tmpfs); + uri = g_strdup_printf("unix:%s", mig_path); + + /* enable NVMe PCI device */ + qpci_device_enable(pdev); + + ctrl = g_malloc0(sizeof(*ctrl)); + ctrl->alloc = alloc; + ctrl->pdev = pdev; + ctrl->bar = qpci_iomap(ctrl->pdev, 0, NULL); + g_assert(pdev->bus->qts == global_qtest); + + test_migrate_setup_nvme_ctrl(ctrl); + test_migrate_send_nvme_reqs(ctrl, test_reqs, ARRAY_SIZE(test_reqs)); + + qpci_iounmap(ctrl->pdev, ctrl->bar); + + dest_cmdline = g_string_new(qos_get_current_command_line()); + g_string_append_printf(dest_cmdline, " -incoming %s", uri); + + /* Create destination VM */ + to = qtest_init(dest_cmdline->str); + + /* Get access to PCI device from destination VM */ + nvme = qos_allocate_objects(to, &ctrl->alloc); + pdev = &nvme->dev; + ctrl->pdev = pdev; + ctrl->bar = qpci_iomap(ctrl->pdev, 0, NULL); + g_assert(pdev->bus->qts == to); + + /* Migrate VM */ + rsp = qmp("{ 'execute': 'migrate', 'arguments': { 'uri': %s } }", uri); + g_assert(qdict_haskey(rsp, "return")); + qobject_unref(rsp); + + /* Wait when source VM is stopped */ + qmp_eventwait("STOP"); + + /* Copy guest physical memory allocator state */ + migrate_allocator(alloc, ctrl->alloc); + + /* Wait for destination VM to become alive */ + qtest_qmp_eventwait(to, "RESUME"); + + test_migrate_check_nvme(ctrl, test_reqs, ARRAY_SIZE(test_reqs)); + + qpci_iounmap(ctrl->pdev, ctrl->bar); + + qtest_quit(to); + g_unlink(mig_path); + g_rmdir(tmpfs); + g_string_free(dest_cmdline, true); +} + static void nvme_register_nodes(void) { QOSGraphEdgeOptions opts = { @@ -168,6 +559,8 @@ static void nvme_register_nodes(void) }); qos_add_test("reg-read", "nvme", nvmetest_reg_read_test, NULL); + + qos_add_test("migrate", "nvme", test_migrate, NULL); } 
libqos_init(nvme_register_nodes); -- 2.47.3 ^ permalink raw reply related [flat|nested] 17+ messages in thread
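For readers unfamiliar with the full-CQ case this test targets, the phase-tag and full-queue bookkeeping can be modelled in a few lines of standalone C. This is a simplified sketch, not the hw/nvme or qtest code: all Sketch* names are hypothetical, and the queue size of 2 simply mirrors the admin CQ size the test configures. With two slots only one CQE can be outstanding at a time, so further completions have to wait inside the controller until the host advances the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CQ_SIZE 2

/* Hypothetical model of one completion queue shared by both sides. */
typedef struct SketchCq {
    uint16_t status[SKETCH_CQ_SIZE]; /* bit 0 of each entry: phase tag */
    uint16_t cid[SKETCH_CQ_SIZE];
    uint32_t head;       /* consumer index, advanced by the host */
    uint32_t tail;       /* producer index, advanced by the controller */
    uint8_t prod_phase;  /* phase the controller currently writes */
    uint8_t cons_phase;  /* phase the host currently expects */
} SketchCq;

static void sketch_cq_init(SketchCq *cq)
{
    memset(cq, 0, sizeof(*cq));
    cq->prod_phase = 1;
    cq->cons_phase = 1;
}

/* One slot always stays empty, so "full" means tail is right behind head. */
static bool sketch_cq_full(const SketchCq *cq)
{
    return (cq->tail + 1) % SKETCH_CQ_SIZE == cq->head;
}

/* Controller side: post a successful completion, or refuse if full. */
static bool sketch_cq_post(SketchCq *cq, uint16_t cid)
{
    if (sketch_cq_full(cq)) {
        return false; /* completion stays queued inside the controller */
    }
    cq->cid[cq->tail] = cid;
    cq->status[cq->tail] = cq->prod_phase; /* status code 0, phase in bit 0 */
    cq->tail = (cq->tail + 1) % SKETCH_CQ_SIZE;
    if (cq->tail == 0) {
        cq->prod_phase ^= 1; /* wrapped: flip the phase written next */
    }
    return true;
}

/* Host side: a new entry is visible when its phase matches ours. */
static bool sketch_cq_pending(const SketchCq *cq)
{
    return (cq->status[cq->head] & 1) == cq->cons_phase;
}

/* Host side: consume the entry at head; caller checks pending first. */
static uint16_t sketch_cq_consume(SketchCq *cq)
{
    uint16_t cid = cq->cid[cq->head];

    cq->head = (cq->head + 1) % SKETCH_CQ_SIZE;
    if (cq->head == 0) {
        cq->cons_phase ^= 1; /* wrapped: expect the opposite phase */
    }
    return cid;
}
```

In this model a second sketch_cq_post() fails until the first entry has been consumed, which is the state the migration test tries to capture at the moment the source VM stops.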
* Re: [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ 2026-04-19 13:01 ` [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ Alexander Mikhalitsyn @ 2026-04-27 20:00 ` Stefan Hajnoczi 2026-04-28 15:55 ` Alexander Mikhalitsyn 0 siblings, 1 reply; 17+ messages in thread From: Stefan Hajnoczi @ 2026-04-27 20:00 UTC (permalink / raw) To: Alexander Mikhalitsyn Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber, Philippe Mathieu-Daudé, Paolo Bonzini, Laurent Vivier, Jesper Devantier, Klaus Jensen, Fabiano Rosas, Zhao Liu, Keith Busch, Peter Xu, Hanna Reitz, Alexander Mikhalitsyn [-- Attachment #1: Type: text/plain, Size: 3018 bytes --] On Sun, Apr 19, 2026 at 03:01:39PM +0200, Alexander Mikhalitsyn wrote: > +static void test_migrate_check_nvme(nvme_ctrl *ctrl, test_migrate_req *reqs, int num) > +{ > + int i; > + > + for (i = 0; i < num; i++) { > + NvmeCqe cqe; > + > + if (reqs[i].handle_cqe) { > + continue; > + } > + > + cqe = nvme_wait(&ctrl->admin_sq); > + g_assert(nvme_is_cqe_success(&cqe)); > + > + g_assert_cmpint(cqe.cid, ==, reqs[i].cid); Please check the endianness in this patch. I don't see an le16_to_cpu() call here. Maybe other NVMe DMA structures or registers are also not converted to/from little endian. This test must pass on both big-endian and little-endian hosts. > + > + #define GUEST_MEM_READB(field) \ > + qtest_readb(ctrl->pdev->bus->qts, (uint64_t)&(field)) > + > + g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[0]), ==, 0x0); I was trying to figure out why phys_identify has a pointer type although it is a guest memory address that cannot be dereferenced in C code. I guess the reason is so that the GUEST_MEM_READB() macro can use the address-of operator instead of explicitly calculating the offset of ieee[0], ieee[1], etc? I found this confusing and potentially dangerous (the compiler won't stop the pointer from being dereferenced if someone does it by mistake). 
It took more time to review the pointer trick than the straightforward approach would have taken. I'd avoid it, but it's a question of coding style and up to you. > + g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[1]), ==, 0x54); > + g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[2]), ==, 0x52); > + > + #undef GUEST_MEM_READB > + > + guest_free(ctrl->alloc, (uint64_t)reqs[i].phys_identify); > + } > +} > + > +static void test_migrate(void *obj, void *data, QGuestAllocator *alloc) > +{ > + g_autofree gchar *tmpfs = NULL; > + GError *err = NULL; > + g_autofree gchar *mig_path; > + g_autofree gchar *uri; > + GString *dest_cmdline; > + QTestState *to; > + QDict *rsp; > + QNvme *nvme = obj; > + QPCIDevice *pdev = &nvme->dev; > + nvme_ctrl *ctrl; > + test_migrate_req test_reqs[] = { > + { 123, true }, > + { 456, false }, > + { 300, false }, > + { 333, false } > + }; > + > + /* create temporary dir and prepare unix socket path for migration */ > + tmpfs = g_dir_make_tmp("nvme-test-XXXXXX", &err); > + if (!tmpfs) { > + g_test_message("Can't create temporary directory in %s: %s", > + g_get_tmp_dir(), err->message); > + g_error_free(err); > + } > + g_assert(tmpfs); > + > + mig_path = g_strdup_printf("%s/socket.mig", tmpfs); > + uri = g_strdup_printf("unix:%s", mig_path); > + > + /* enable NVMe PCI device */ > + qpci_device_enable(pdev); > + > + ctrl = g_malloc0(sizeof(*ctrl)); I don't see a g_free() call for ctrl. Use g_autofree? [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
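The endianness concern can be illustrated by decoding the CQE fields byte by byte, which gives the same result on big-endian and little-endian hosts. This is a minimal sketch, assuming the 16-byte completion queue entry layout from the NVMe specification (SQ head at byte offset 8, CID at byte offset 12, status with the phase tag in bit 0 at byte offset 14). The sketch_* helpers and SketchCqeView are hypothetical and stand in for le16_to_cpu() applied to data read from guest memory.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CQE_SIZE       16
#define SKETCH_CQE_OFF_SQHEAD 8
#define SKETCH_CQE_OFF_CID    12
#define SKETCH_CQE_OFF_STATUS 14

/* Host-endian view of the fields the test cares about (hypothetical). */
typedef struct SketchCqeView {
    uint16_t sq_head;
    uint16_t cid;
    uint16_t status_code; /* status field without the phase tag */
    uint8_t phase;        /* phase tag, bit 0 of the status word */
} SketchCqeView;

/* Assemble a little-endian 16-bit value without relying on host order. */
static uint16_t sketch_load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Decode a raw CQE as it sits in guest memory. */
static SketchCqeView sketch_decode_cqe(const uint8_t raw[SKETCH_CQE_SIZE])
{
    SketchCqeView v;
    uint16_t status = sketch_load_le16(raw + SKETCH_CQE_OFF_STATUS);

    v.sq_head = sketch_load_le16(raw + SKETCH_CQE_OFF_SQHEAD);
    v.cid = sketch_load_le16(raw + SKETCH_CQE_OFF_CID);
    v.status_code = status >> 1;
    v.phase = status & 1;
    return v;
}
```

Comparing v.cid against the expected command identifier then behaves identically regardless of the host byte order, which is the property the review asks for.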
* Re: [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ
2026-04-27 20:00 ` Stefan Hajnoczi
@ 2026-04-28 15:55 ` Alexander Mikhalitsyn
2026-04-28 16:13 ` Stefan Hajnoczi
0 siblings, 1 reply; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2026-04-28 15:55 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber,
Philippe Mathieu-Daudé, Paolo Bonzini, Laurent Vivier,
Jesper Devantier, Klaus Jensen, Fabiano Rosas, Zhao Liu, Keith Busch,
Peter Xu, Hanna Reitz, Alexander Mikhalitsyn

On Mon, 27 Apr 2026 at 22:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Sun, Apr 19, 2026 at 03:01:39PM +0200, Alexander Mikhalitsyn wrote:
> > +static void test_migrate_check_nvme(nvme_ctrl *ctrl, test_migrate_req *reqs, int num)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < num; i++) {
> > +        NvmeCqe cqe;
> > +
> > +        if (reqs[i].handle_cqe) {
> > +            continue;
> > +        }
> > +
> > +        cqe = nvme_wait(&ctrl->admin_sq);
> > +        g_assert(nvme_is_cqe_success(&cqe));
> > +
> > +        g_assert_cmpint(cqe.cid, ==, reqs[i].cid);
>
> Please check the endianness in this patch. I don't see an le16_to_cpu()
> call here.
>
> Maybe other NVMe DMA structures or registers are also not converted
> to/from little endian. This test must pass on both big-endian and
> little-endian hosts.

Ah. I cared about endianness conversions but haven't tested this on a
BE machine. Will fix this (and test on arm64) in the next version,
thanks, Stefan!

>
> > +
> > +        #define GUEST_MEM_READB(field) \
> > +            qtest_readb(ctrl->pdev->bus->qts, (uint64_t)&(field))
> > +
> > +        g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[0]), ==, 0x0);
>
> I was trying to figure out why phys_identify has a pointer type although
> it is a guest memory address that cannot be dereferenced in C code. I
> guess the reason is so that the GUEST_MEM_READB() macro can use the
> address-of operator instead of explicitly calculating the offset of
> ieee[0], ieee[1], etc? I found this confusing and potentially dangerous
> (the compiler won't stop the pointer from being dereferenced if someone
> does it by mistake).

Exactly, that was my intention: to be able to use the address-of
operator in conjunction with the structure-dereference operator. To
prevent accidental mistakes, I used a "phys_" prefix in structure member
names (like "cq->phys_cqe" or "reqs[i].phys_identify").

>
> It took more time to review the pointer trick than the straightforward
> approach would have taken. I'd avoid it, but it's a question of coding
> style and up to you.

No problem. I think the fact that this code confused you means that my
idea wasn't as good as I thought ;-) I think I should then change the
type of phys_identify and phys_cqe to u64 and either add explicit
address arithmetic or continue to use the "arrow" operator, but inside a
special macro like GUEST_MEM_READ_SOMETHING(...).
WDYT?

>
> > +        g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[1]), ==, 0x54);
> > +        g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[2]), ==, 0x52);
> > +
> > +        #undef GUEST_MEM_READB
> > +
> > +        guest_free(ctrl->alloc, (uint64_t)reqs[i].phys_identify);
> > +    }
> > +}
> > +
> > +static void test_migrate(void *obj, void *data, QGuestAllocator *alloc)
> > +{
> > +    g_autofree gchar *tmpfs = NULL;
> > +    GError *err = NULL;
> > +    g_autofree gchar *mig_path;
> > +    g_autofree gchar *uri;
> > +    GString *dest_cmdline;
> > +    QTestState *to;
> > +    QDict *rsp;
> > +    QNvme *nvme = obj;
> > +    QPCIDevice *pdev = &nvme->dev;
> > +    nvme_ctrl *ctrl;
> > +    test_migrate_req test_reqs[] = {
> > +        { 123, true },
> > +        { 456, false },
> > +        { 300, false },
> > +        { 333, false }
> > +    };
> > +
> > +    /* create temporary dir and prepare unix socket path for migration */
> > +    tmpfs = g_dir_make_tmp("nvme-test-XXXXXX", &err);
> > +    if (!tmpfs) {
> > +        g_test_message("Can't create temporary directory in %s: %s",
> > +                       g_get_tmp_dir(), err->message);
> > +        g_error_free(err);
> > +    }
> > +    g_assert(tmpfs);
> > +
> > +    mig_path = g_strdup_printf("%s/socket.mig", tmpfs);
> > +    uri = g_strdup_printf("unix:%s", mig_path);
> > +
> > +    /* enable NVMe PCI device */
> > +    qpci_device_enable(pdev);
> > +
> > +    ctrl = g_malloc0(sizeof(*ctrl));
>
> I don't see a g_free() call for ctrl. Use g_autofree?

Sure.

Thank you very much for your reviews!

Kind regards,
Alex
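A minimal sketch of the endianness concern from the review, assuming the standard 16-byte NVMe completion queue entry with the command identifier stored little-endian in bytes 12 and 13. Decoding the field byte by byte yields the same value on big-endian and little-endian hosts. demo_le16_decode is a hypothetical stand-in for what QEMU's le16_to_cpu() achieves, and demo_cqe_cid and the DEMO_ constants are likewise illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_CQE_SIZE       = 16,   /* size of one completion queue entry */
    DEMO_CQE_CID_OFFSET = 12,   /* DW3 bits 15:0, command identifier */
};

/*
 * Assemble a little-endian 16-bit value from its two bytes. Because the
 * bytes are combined arithmetically, host byte order does not matter.
 */
static uint16_t demo_le16_decode(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Extract the command identifier from a raw CQE copied out of guest RAM. */
static uint16_t demo_cqe_cid(const uint8_t *cqe, size_t len)
{
    assert(len >= DEMO_CQE_SIZE);
    return demo_le16_decode(cqe + DEMO_CQE_CID_OFFSET);
}
```

Comparing a raw uint16_t struct member against a host integer, as in the quoted g_assert_cmpint(cqe.cid, ...), only works when the host happens to be little-endian; a decode step like the one above (or le16_to_cpu() in QEMU code) removes that dependency.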
* Re: [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ
2026-04-28 15:55 ` Alexander Mikhalitsyn
@ 2026-04-28 16:13 ` Stefan Hajnoczi
0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2026-04-28 16:13 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: qemu-devel, Kevin Wolf, qemu-block, Fam Zheng, Stéphane Graber,
Philippe Mathieu-Daudé, Paolo Bonzini, Laurent Vivier,
Jesper Devantier, Klaus Jensen, Fabiano Rosas, Zhao Liu, Keith Busch,
Peter Xu, Hanna Reitz, Alexander Mikhalitsyn

On Tue, Apr 28, 2026 at 05:55:12PM +0200, Alexander Mikhalitsyn wrote:
> On Mon, 27 Apr 2026 at 22:00, Stefan Hajnoczi wrote:
> > On Sun, Apr 19, 2026 at 03:01:39PM +0200, Alexander Mikhalitsyn wrote:
> > > +
> > > +        #define GUEST_MEM_READB(field) \
> > > +            qtest_readb(ctrl->pdev->bus->qts, (uint64_t)&(field))
> > > +
> > > +        g_assert_cmpint(GUEST_MEM_READB(reqs[i].phys_identify->ieee[0]), ==, 0x0);
> >
> > I was trying to figure out why phys_identify has a pointer type although
> > it is a guest memory address that cannot be dereferenced in C code. I
> > guess the reason is so that the GUEST_MEM_READB() macro can use the
> > address-of operator instead of explicitly calculating the offset of
> > ieee[0], ieee[1], etc? I found this confusing and potentially dangerous
> > (the compiler won't stop the pointer from being dereferenced if someone
> > does it by mistake).
>
> Exactly, that was my intention: to be able to use the address-of
> operator in conjunction with the structure-dereference operator. To
> prevent accidental mistakes, I used a "phys_" prefix in structure member
> names (like "cq->phys_cqe" or "reqs[i].phys_identify").
>
> >
> > It took more time to review the pointer trick than the straightforward
> > approach would have taken. I'd avoid it, but it's a question of coding
> > style and up to you.
>
> No problem. I think the fact that this code confused you means that my
> idea wasn't as good as I thought ;-) I think I should then change the
> type of phys_identify and phys_cqe to u64 and either add explicit
> address arithmetic or continue to use the "arrow" operator, but inside a
> special macro like GUEST_MEM_READ_SOMETHING(...).
> WDYT?

Encapsulating the address-of operator trick inside a macro would be a
nice solution. By localizing the pointer type just within the macro,
it's more obvious what the intention is and the rest of the program uses
the uint64_t type.
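One possible shape for such a macro, as a rough sketch rather than what the next patch version will contain. It keeps guest addresses as uint64_t and confines the structure type to the macro, using offsetof (the standard spelling of the address-of idea) instead of casting a guest address to a host pointer. DemoIdCtrl is a trimmed, hypothetical view of the Identify Controller layout; DemoGuestMem and demo_readb are a fake guest RAM standing in for the VM, where a real qtest would call qtest_readb(); DEMO_GUEST_READB and demo_check_oui are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Leading fields of the Identify Controller data, enough to reach ieee[]. */
typedef struct {
    uint16_t vid;
    uint16_t ssvid;
    uint8_t  sn[20];
    uint8_t  mn[40];
    uint8_t  fr[8];
    uint8_t  rab;
    uint8_t  ieee[3];
} DemoIdCtrl;

/* Fake guest RAM: bytes[0] lives at guest physical address "base". */
typedef struct {
    uint64_t base;
    uint8_t  bytes[128];
} DemoGuestMem;

/* Stand-in for qtest_readb(): read one byte at a guest physical address. */
static uint8_t demo_readb(const DemoGuestMem *mem, uint64_t gpa)
{
    assert(gpa >= mem->base && gpa - mem->base < sizeof(mem->bytes));
    return mem->bytes[gpa - mem->base];
}

/* The struct type is named only here; callers pass a plain uint64_t. */
#define DEMO_GUEST_READB(mem, gpa, type, field) \
    demo_readb((mem), (uint64_t)(gpa) + offsetof(type, field))

/* Check the three OUI bytes the quoted test asserts on. */
static bool demo_check_oui(const DemoGuestMem *mem, uint64_t phys_identify)
{
    return DEMO_GUEST_READB(mem, phys_identify, DemoIdCtrl, ieee[0]) == 0x00 &&
           DEMO_GUEST_READB(mem, phys_identify, DemoIdCtrl, ieee[1]) == 0x54 &&
           DEMO_GUEST_READB(mem, phys_identify, DemoIdCtrl, ieee[2]) == 0x52;
}
```

Because phys_identify is an integer in this variant, an accidental phys_identify->ieee[0] fails to compile, which addresses the safety concern raised earlier in the thread.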
end of thread, other threads:[~2026-04-29 6:23 UTC | newest]

Thread overview: 17+ messages
2026-04-19 13:01 [PATCH v6 0/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 1/8] tests/functional/migration: add VM launch/configure hooks Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 2/8] hw/nvme: add migration blockers for non-supported cases Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 3/8] hw/nvme: split nvme_init_sq/nvme_init_cq into helpers Alexander Mikhalitsyn
2026-04-29 6:20   ` Klaus Jensen
2026-04-19 13:01 ` [PATCH v6 4/8] hw/nvme: set CQE.sq_id earlier in nvme_process_sq Alexander Mikhalitsyn
2026-04-29 6:18   ` Klaus Jensen
2026-04-19 13:01 ` [PATCH v6 5/8] hw/nvme: unmap req->sg earlier in nvme_enqueue_req_completion Alexander Mikhalitsyn
2026-04-29 6:19   ` Klaus Jensen
2026-04-19 13:01 ` [PATCH v6 6/8] hw/nvme: add basic live migration support Alexander Mikhalitsyn
2026-04-27 21:03   ` Stefan Hajnoczi
2026-04-28 15:38     ` Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 7/8] tests/functional/x86_64: add migration test for NVMe device Alexander Mikhalitsyn
2026-04-19 13:01 ` [PATCH v6 8/8] tests/qtest/nvme-test: add migration test with full CQ Alexander Mikhalitsyn
2026-04-27 20:00   ` Stefan Hajnoczi
2026-04-28 15:55     ` Alexander Mikhalitsyn
2026-04-28 16:13       ` Stefan Hajnoczi