[PATCH 0/2] NVMe namespace hotplug and drive reconnection support

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
@ 2026-04-09  6:01 mr-083
  2026-04-09  6:01 ` [PATCH 1/2] hw/nvme: add namespace hotplug support mr-083
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: mr-083 @ 2026-04-09  6:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

This series adds two features that together enable transparent NVMe disk
hot-swap simulation in QEMU, matching the behavior of physical NVMe
drives being pulled and reinserted in the same PCIe slot.

Problem:
Currently, hot-swapping an NVMe disk in QEMU requires removing the
entire NVMe controller via device_del, which causes the Linux guest to
assign a new controller number on re-add (e.g. nvme2 becomes nvme4).
This breaks storage software that tracks drives by device name.

Solution:
Patch 1 adds hotplug support for nvme-ns devices on the NvmeBus, with
proper Asynchronous Event Notification (AEN) so the guest kernel detects
namespace changes. This allows namespace-level hot-swap without removing
the NVMe controller.

Patch 2 adds a drive_insert HMP command that reconnects a host block
device file to an existing guest device after drive_del. This is the
counterpart to drive_del for non-removable devices where
blockdev-change-medium cannot be used.

The recommended hot-swap sequence is:
  1. drive_del <drive-id>          # disconnect backing store
  2. drive_insert <device> <file>  # reconnect backing store
  3. pcie_aer_inject_error <port> SDN  # trigger controller reset

After this sequence, the guest sees the same controller and namespace
names (e.g. /dev/nvme2n1 remains /dev/nvme2n1), and the NVMe driver
recovers transparently via the standard AER recovery path.

Tested with:
- Linux 6.1 guest on QEMU aarch64 with HVF (macOS)
- NVMe subsystem model with multipath disabled
- DirectPV and MinIO AIStor storage stack

mr-083 (2):
  hw/nvme: add namespace hotplug support
  block/monitor: add drive_insert HMP command

 block/monitor/block-hmp-cmds.c | 59 +++++++++++++++++++++++
 hmp-commands.hx                | 18 +++++++
 hw/nvme/ctrl.c                 | 85 ++++++++++++++++++++++++++++++++++
 hw/nvme/ns.c                   |  1 +
 hw/nvme/subsys.c               |  2 +
 include/block/block-hmp-cmds.h |  1 +
 6 files changed, 166 insertions(+)

--
2.50.1 (Apple Git-155)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/2] hw/nvme: add namespace hotplug support
  2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
@ 2026-04-09  6:01 ` mr-083
  2026-04-09  6:01 ` [PATCH 2/2] block/monitor: add drive_insert HMP command mr-083
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: mr-083 @ 2026-04-09  6:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

Add hotplug support for nvme-ns devices on the NvmeBus. This enables
namespace-level hot-swap without removing the NVMe controller, which
is how physical NVMe drives behave when hot-swapped in the same PCIe
slot.

Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
hotplug handler with proper plug and unplug callbacks:

- plug: attach namespace to all started controllers and send an
  Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
  the guest kernel rescans namespaces
- unplug: detach from all controllers, send AEN, remove from
  subsystem, then unrealize the device

The plug handler skips controllers that haven't started yet
(qs_created == false) to avoid interfering with boot-time namespace
attachment in nvme_start_ctrl().

Both the controller bus and subsystem bus are configured as hotplug
handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
may reparent to the subsystem bus during realize.

Signed-off-by: Matthieu Receveur <matthieu@min.io>
---
 hw/nvme/ctrl.c   | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/ns.c     |  1 +
 hw/nvme/subsys.c |  2 ++
 3 files changed, 88 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index be6c7028cb..5502e4ea2b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -206,6 +206,7 @@
 #include "system/hostmem.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
+#include "hw/core/qdev.h"
 #include "system/spdm-socket.h"
 #include "migration/vmstate.h"
 
@@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     }
 
     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&n->bus));
 
     if (nvme_init_subsys(n, errp)) {
         return;
@@ -9553,10 +9555,93 @@ static const TypeInfo nvme_info = {
     },
 };
 
+static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                              Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Attach to all started controllers and notify via AEN.
+     * Skip controllers that haven't started yet (boot-time realize) —
+     * nvme_start_ctrl() will attach namespaces during controller init.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !ctrl->qs_created) {
+            continue;
+        }
+
+        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
+            nvme_attach_ns(ctrl, ns);
+            nvme_update_dsm_limits(ctrl, ns);
+
+            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                                   NVME_LOG_CHANGED_NSLIST);
+            }
+        }
+    }
+}
+
+static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                               Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Detach from all controllers and notify the guest via AEN.
+     * Must happen before unrealize to avoid use-after-free when the
+     * guest sends I/O to a freed namespace.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !nvme_ns(ctrl, nsid)) {
+            continue;
+        }
+
+        nvme_detach_ns(ctrl, ns);
+        nvme_update_dsm_limits(ctrl, NULL);
+
+        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                               NVME_LOG_CHANGED_NSLIST);
+        }
+    }
+
+    /* Remove from subsystem namespace list. */
+    subsys->namespaces[nsid] = NULL;
+
+    /*
+     * Unrealize: drain I/O, flush, cleanup structures, remove from QOM.
+     * nvme_ns_unrealize() handles drain/shutdown/cleanup internally.
+     */
+    qdev_unrealize(dev);
+}
+
+static void nvme_bus_class_init(ObjectClass *klass, const void *data)
+{
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
+    hc->plug = nvme_ns_hot_plug;
+    hc->unplug = nvme_ns_hot_unplug;
+}
+
 static const TypeInfo nvme_bus_info = {
     .name = TYPE_NVME_BUS,
     .parent = TYPE_BUS,
     .instance_size = sizeof(NvmeBus),
+    .class_init = nvme_bus_class_init,
+    .interfaces = (const InterfaceInfo[]) {
+        { TYPE_HOTPLUG_HANDLER },
+        { }
+    },
 };
 
 static void nvme_register_types(void)
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index b0106eaa5c..eb628c0734 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -937,6 +937,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
     dc->bus_type = TYPE_NVME_BUS;
     dc->realize = nvme_ns_realize;
     dc->unrealize = nvme_ns_unrealize;
+    dc->hotpluggable = true;
     device_class_set_props(dc, nvme_ns_props);
     dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 777e1c620f..fa35055d3c 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -9,6 +9,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qapi/error.h"
+#include "hw/core/qdev.h"
 
 #include "nvme.h"
 
@@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
     NvmeSubsystem *subsys = NVME_SUBSYS(dev);
 
     qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
 
     nvme_subsys_setup(subsys, errp);
 }
-- 
2.50.1 (Apple Git-155)



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
  2026-04-09  6:01 ` [PATCH 1/2] hw/nvme: add namespace hotplug support mr-083
@ 2026-04-09  6:01 ` mr-083
  2026-04-14 17:57   ` Stefan Hajnoczi
  2026-04-15 10:48   ` Daniel P. Berrangé
  2026-04-09 21:34 ` [PATCH v2] hw/nvme: add namespace hotplug support mr-083
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: mr-083 @ 2026-04-09  6:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

Add a drive_insert HMP command that reconnects a host block device file
to an existing guest device whose backing store was previously removed
with drive_del.

After drive_del, the BlockBackend remains attached to the guest device
but has no BlockDriverState (shown as "[not inserted]" in info block).
drive_insert opens the specified file, finds the device's BlockBackend
by iterating all backends and matching the attached device ID, then
calls blk_insert_bs() to reconnect the backing store.

This complements drive_del for non-removable devices (such as NVMe
namespaces) where blockdev-change-medium cannot be used. Combined with
PCIe AER Surprise Down error injection to trigger a controller reset,
this enables complete NVMe disk hot-swap simulation where the guest
sees the same device names throughout.

Example usage:
  drive_del drv0             # remove backing store
  drive_insert ns0 disk.qcow2  # reconnect backing
  pcie_aer_inject_error rp0 SDN  # trigger controller reset

Signed-off-by: Matthieu Receveur <matthieu@min.io>
---
 block/monitor/block-hmp-cmds.c | 59 ++++++++++++++++++++++++++++++++++
 hmp-commands.hx                | 18 +++++++++++
 include/block/block-hmp-cmds.h |  1 +
 3 files changed, 78 insertions(+)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 1fd28d59eb..77e9662ead 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -38,7 +38,9 @@
 #include "qemu/osdep.h"
 #include "hw/core/boards.h"
 #include "system/block-backend.h"
+#include "system/block-backend-global-state.h"
 #include "system/blockdev.h"
+#include "block/block-global-state.h"
 #include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-block-export.h"
 #include "qobject/qdict.h"
@@ -195,6 +197,63 @@ unlock:
     hmp_handle_error(mon, err);
 }
 
+void hmp_drive_insert(Monitor *mon, const QDict *qdict)
+{
+    const char *id = qdict_get_str(qdict, "id");
+    const char *filename = qdict_get_str(qdict, "filename");
+    BlockBackend *blk = NULL;
+    BlockBackend *iter;
+    BlockDriverState *bs;
+    Error *err = NULL;
+
+    GLOBAL_STATE_CODE();
+
+    /*
+     * After drive_del, the BlockBackend is removed from the monitor name
+     * registry but still attached to the device. Find it by iterating all
+     * BlockBackends and matching by the device ID shown in "info block".
+     */
+    for (iter = blk_all_next(NULL); iter; iter = blk_all_next(iter)) {
+        DeviceState *dev = blk_get_attached_dev(iter);
+        if (dev && dev->id && strcmp(dev->id, id) == 0) {
+            blk = iter;
+            break;
+        }
+    }
+
+    if (!blk) {
+        /* Fallback: try by block backend name */
+        blk = blk_by_name(id);
+    }
+
+    if (!blk) {
+        error_setg(&err, "Device '%s' not found", id);
+        goto out;
+    }
+
+    if (blk_bs(blk)) {
+        error_setg(&err, "Device '%s' already has a medium inserted", id);
+        goto out;
+    }
+
+    bs = bdrv_open(filename, NULL, NULL, BDRV_O_RDWR, &err);
+    if (!bs) {
+        goto out;
+    }
+
+    if (blk_insert_bs(blk, bs, &err) < 0) {
+        bdrv_unref(bs);
+        goto out;
+    }
+
+    bdrv_unref(bs);
+    monitor_printf(mon, "OK\n");
+    return;
+
+out:
+    hmp_handle_error(mon, err);
+}
+
 void hmp_commit(Monitor *mon, const QDict *qdict)
 {
     const char *device = qdict_get_str(qdict, "device");
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 5cc4788f12..79af8e8988 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -207,6 +207,24 @@ SRST
   actions (drive options rerror, werror).
 ERST
 
+    {
+        .name       = "drive_insert",
+        .args_type  = "id:B,filename:F",
+        .params     = "device filename",
+        .help       = "insert a host block device into an empty drive",
+        .cmd        = hmp_drive_insert,
+    },
+
+SRST
+``drive_insert`` *device* *filename*
+  Insert a host block device file into a drive that has been emptied by
+  ``drive_del``.  This reconnects the backing store without removing the
+  guest device, enabling transparent disk hot-swap for non-removable devices
+  such as NVMe namespaces.  Combined with PCIe AER Surprise Down error
+  injection (``pcie_aer_inject_error`` *device* ``SDN``), this enables
+  complete NVMe disk hot-swap simulation.
+ERST
+
     {
         .name       = "change",
         .args_type  = "device:B,force:-f,target:F,arg:s?,read-only-mode:s?",
diff --git a/include/block/block-hmp-cmds.h b/include/block/block-hmp-cmds.h
index 71113cd7ef..73c9607402 100644
--- a/include/block/block-hmp-cmds.h
+++ b/include/block/block-hmp-cmds.h
@@ -21,6 +21,7 @@ void hmp_drive_add(Monitor *mon, const QDict *qdict);
 
 void hmp_commit(Monitor *mon, const QDict *qdict);
 void hmp_drive_del(Monitor *mon, const QDict *qdict);
+void hmp_drive_insert(Monitor *mon, const QDict *qdict);
 
 void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
 void hmp_drive_backup(Monitor *mon, const QDict *qdict);
-- 
2.50.1 (Apple Git-155)



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2] hw/nvme: add namespace hotplug support
  2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
  2026-04-09  6:01 ` [PATCH 1/2] hw/nvme: add namespace hotplug support mr-083
  2026-04-09  6:01 ` [PATCH 2/2] block/monitor: add drive_insert HMP command mr-083
@ 2026-04-09 21:34 ` mr-083
  2026-04-10 12:41   ` Stefan Hajnoczi
  2026-04-10 14:30 ` [PATCH v3] " mr-083
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: mr-083 @ 2026-04-09 21:34 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

Add hotplug support for nvme-ns devices on the NvmeBus. This enables
namespace-level hot-swap without removing the NVMe controller, matching
the behavior of physical NVMe drives hot-swapped in the same PCIe slot.

Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
hotplug handler with proper plug and unplug callbacks:

- plug: attach namespace to all started controllers and send an
  Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
  the guest kernel rescans namespaces and adds the block device
- unplug: detach from all controllers, send AEN, remove from
  subsystem, then unrealize the device. The guest kernel rescans
  and removes the block device.

The plug handler skips controllers that haven't started yet
(qs_created == false) to avoid interfering with boot-time namespace
attachment in nvme_start_ctrl().

Both the controller bus and subsystem bus are configured as hotplug
handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
may reparent to the subsystem bus during realize.

Example hot-swap sequence using the NVMe subsystem model:

  # Boot with: -device nvme-subsys,id=subsys0
  #            -device nvme,id=ctrl0,subsys=subsys0
  #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1

  device_del ns0             # guest receives AEN, removes /dev/nvme0n1
  drive_del drv0
  drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
  device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
                              # guest receives AEN, adds /dev/nvme0n1

Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
namespace list automatically).

Signed-off-by: Matthieu Receveur <matthieu@min.io>
---
 hw/nvme/ctrl.c   | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/ns.c     |  1 +
 hw/nvme/subsys.c |  2 ++
 3 files changed, 88 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index be6c7028cb..5502e4ea2b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -206,6 +206,7 @@
 #include "system/hostmem.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
+#include "hw/core/qdev.h"
 #include "system/spdm-socket.h"
 #include "migration/vmstate.h"
 
@@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     }
 
     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&n->bus));
 
     if (nvme_init_subsys(n, errp)) {
         return;
@@ -9553,10 +9555,93 @@ static const TypeInfo nvme_info = {
     },
 };
 
+static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                              Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Attach to all started controllers and notify via AEN.
+     * Skip controllers that haven't started yet (boot-time realize) —
+     * nvme_start_ctrl() will attach namespaces during controller init.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !ctrl->qs_created) {
+            continue;
+        }
+
+        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
+            nvme_attach_ns(ctrl, ns);
+            nvme_update_dsm_limits(ctrl, ns);
+
+            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                                   NVME_LOG_CHANGED_NSLIST);
+            }
+        }
+    }
+}
+
+static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                               Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Detach from all controllers and notify the guest via AEN.
+     * Must happen before unrealize to avoid use-after-free when the
+     * guest sends I/O to a freed namespace.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !nvme_ns(ctrl, nsid)) {
+            continue;
+        }
+
+        nvme_detach_ns(ctrl, ns);
+        nvme_update_dsm_limits(ctrl, NULL);
+
+        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                               NVME_LOG_CHANGED_NSLIST);
+        }
+    }
+
+    /* Remove from subsystem namespace list. */
+    subsys->namespaces[nsid] = NULL;
+
+    /*
+     * Unrealize: drain I/O, flush, cleanup structures, remove from QOM.
+     * nvme_ns_unrealize() handles drain/shutdown/cleanup internally.
+     */
+    qdev_unrealize(dev);
+}
+
+static void nvme_bus_class_init(ObjectClass *klass, const void *data)
+{
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
+    hc->plug = nvme_ns_hot_plug;
+    hc->unplug = nvme_ns_hot_unplug;
+}
+
 static const TypeInfo nvme_bus_info = {
     .name = TYPE_NVME_BUS,
     .parent = TYPE_BUS,
     .instance_size = sizeof(NvmeBus),
+    .class_init = nvme_bus_class_init,
+    .interfaces = (const InterfaceInfo[]) {
+        { TYPE_HOTPLUG_HANDLER },
+        { }
+    },
 };
 
 static void nvme_register_types(void)
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index b0106eaa5c..eb628c0734 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -937,6 +937,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
     dc->bus_type = TYPE_NVME_BUS;
     dc->realize = nvme_ns_realize;
     dc->unrealize = nvme_ns_unrealize;
+    dc->hotpluggable = true;
     device_class_set_props(dc, nvme_ns_props);
     dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 777e1c620f..fa35055d3c 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -9,6 +9,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qapi/error.h"
+#include "hw/core/qdev.h"
 
 #include "nvme.h"
 
@@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
     NvmeSubsystem *subsys = NVME_SUBSYS(dev);
 
     qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
 
     nvme_subsys_setup(subsys, errp);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] hw/nvme: add namespace hotplug support
  2026-04-09 21:34 ` [PATCH v2] hw/nvme: add namespace hotplug support mr-083
@ 2026-04-10 12:41   ` Stefan Hajnoczi
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-10 12:41 UTC (permalink / raw)
  To: mr-083; +Cc: qemu-devel, qemu-block, its, kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 8180 bytes --]

On Thu, Apr 09, 2026 at 11:34:51PM +0200, mr-083 wrote:
> Add hotplug support for nvme-ns devices on the NvmeBus. This enables
> namespace-level hot-swap without removing the NVMe controller, matching
> the behavior of physical NVMe drives hot-swapped in the same PCIe slot.

If we rely purely on NVMe's AEN then this is not equivalent to swapping
physical drives in the same PCIe slot. Maybe adjust the wording to
reflect that this is NVMe-level Namespace hotplug?

> Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
> hotplug handler with proper plug and unplug callbacks:
> 
> - plug: attach namespace to all started controllers and send an
>   Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
>   the guest kernel rescans namespaces and adds the block device
> - unplug: detach from all controllers, send AEN, remove from
>   subsystem, then unrealize the device. The guest kernel rescans
>   and removes the block device.
> 
> The plug handler skips controllers that haven't started yet
> (qs_created == false) to avoid interfering with boot-time namespace
> attachment in nvme_start_ctrl().
> 
> Both the controller bus and subsystem bus are configured as hotplug
> handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
> may reparent to the subsystem bus during realize.
> 
> Example hot-swap sequence using the NVMe subsystem model:
> 
>   # Boot with: -device nvme-subsys,id=subsys0
>   #            -device nvme,id=ctrl0,subsys=subsys0
>   #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
> 
>   device_del ns0             # guest receives AEN, removes /dev/nvme0n1
>   drive_del drv0
>   drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
>   device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>                               # guest receives AEN, adds /dev/nvme0n1
> 
> Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
> namespace list automatically).

Did you test a Windows Server guest? If not, I can try that next week in
case there are any surprises.

> 
> Signed-off-by: Matthieu Receveur <matthieu@min.io>
> ---
>  hw/nvme/ctrl.c   | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/nvme/ns.c     |  1 +
>  hw/nvme/subsys.c |  2 ++
>  3 files changed, 88 insertions(+)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index be6c7028cb..5502e4ea2b 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -206,6 +206,7 @@
>  #include "system/hostmem.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pcie_sriov.h"
> +#include "hw/core/qdev.h"
>  #include "system/spdm-socket.h"
>  #include "migration/vmstate.h"
>  
> @@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      }
>  
>      qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
> +    qbus_set_bus_hotplug_handler(BUS(&n->bus));
>  
>      if (nvme_init_subsys(n, errp)) {
>          return;
> @@ -9553,10 +9555,93 @@ static const TypeInfo nvme_info = {
>      },
>  };
>  
> +static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> +                              Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
> +    int i;
> +
> +    /*
> +     * Attach to all started controllers and notify via AEN.
> +     * Skip controllers that haven't started yet (boot-time realize) —
> +     * nvme_start_ctrl() will attach namespaces during controller init.
> +     */
> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
> +        if (!ctrl || !ctrl->qs_created) {
> +            continue;
> +        }
> +
> +        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
> +            nvme_attach_ns(ctrl, ns);
> +            nvme_update_dsm_limits(ctrl, ns);
> +
> +            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
> +                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
> +                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
> +                                   NVME_LOG_CHANGED_NSLIST);
> +            }
> +        }
> +    }
> +}
> +
> +static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
> +                               Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
> +    int i;

While there is qdev_unrealize -> nvme_ns_unrealize -> nvme_ns_drain ->
blk_drain to quiesce I/O requests at the end of this function, I wonder
whether it's safe to start removing the namespace before I/O has been
drained.

Did you test hot unplug while the Namespace is under heavy I/O (e.g. fio
job running inside the guest with lots of queued I/O requests)?

It might be necessary to stop I/O first before tearing down the
namespace.

> +
> +    /*
> +     * Detach from all controllers and notify the guest via AEN.
> +     * Must happen before unrealize to avoid use-after-free when the
> +     * guest sends I/O to a freed namespace.
> +     */
> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
> +        if (!ctrl || !nvme_ns(ctrl, nsid)) {
> +            continue;
> +        }
> +
> +        nvme_detach_ns(ctrl, ns);
> +        nvme_update_dsm_limits(ctrl, NULL);
> +
> +        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
> +            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
> +                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
> +                               NVME_LOG_CHANGED_NSLIST);
> +        }
> +    }
> +
> +    /* Remove from subsystem namespace list. */
> +    subsys->namespaces[nsid] = NULL;

The dual of this operation is done in nvme_ns_realize():

  subsys->namespaces[nsid] = ns;

Maybe nvme_ns_unrealize() should remove the namespace from the
subsystem for consistency? I guess the lack of removal was never an
issue before hot unplug, but now it would be nice to implement the
lifecycle.

> +
> +    /*
> +     * Unrealize: drain I/O, flush, cleanup structures, remove from QOM.
> +     * nvme_ns_unrealize() handles drain/shutdown/cleanup internally.
> +     */
> +    qdev_unrealize(dev);
> +}
> +
> +static void nvme_bus_class_init(ObjectClass *klass, const void *data)
> +{
> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
> +    hc->plug = nvme_ns_hot_plug;
> +    hc->unplug = nvme_ns_hot_unplug;
> +}
> +
>  static const TypeInfo nvme_bus_info = {
>      .name = TYPE_NVME_BUS,
>      .parent = TYPE_BUS,
>      .instance_size = sizeof(NvmeBus),
> +    .class_init = nvme_bus_class_init,
> +    .interfaces = (const InterfaceInfo[]) {
> +        { TYPE_HOTPLUG_HANDLER },
> +        { }
> +    },
>  };
>  
>  static void nvme_register_types(void)
> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index b0106eaa5c..eb628c0734 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -937,6 +937,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
>      dc->bus_type = TYPE_NVME_BUS;
>      dc->realize = nvme_ns_realize;
>      dc->unrealize = nvme_ns_unrealize;
> +    dc->hotpluggable = true;
>      device_class_set_props(dc, nvme_ns_props);
>      dc->desc = "Virtual NVMe namespace";
>  }
> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> index 777e1c620f..fa35055d3c 100644
> --- a/hw/nvme/subsys.c
> +++ b/hw/nvme/subsys.c
> @@ -9,6 +9,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qapi/error.h"
> +#include "hw/core/qdev.h"
>  
>  #include "nvme.h"
>  
> @@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
>      NvmeSubsystem *subsys = NVME_SUBSYS(dev);
>  
>      qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
> +    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
>  
>      nvme_subsys_setup(subsys, errp);
>  }
> -- 
> 2.53.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3] hw/nvme: add namespace hotplug support
  2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
                   ` (2 preceding siblings ...)
  2026-04-09 21:34 ` [PATCH v2] hw/nvme: add namespace hotplug support mr-083
@ 2026-04-10 14:30 ` mr-083
  2026-04-10 14:33   ` Matthieu Rolla
  2026-04-13 17:17 ` [PATCH 0/2] NVMe namespace hotplug and drive reconnection support Klaus Jensen
  2026-04-15 17:38 ` [PATCH v4] hw/nvme: add namespace hotplug support mr-083
  5 siblings, 1 reply; 34+ messages in thread
From: mr-083 @ 2026-04-10 14:30 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

Add hotplug support for nvme-ns devices on the NvmeBus. This enables
NVMe namespace-level hot-add and hot-remove via device_add and
device_del with proper Asynchronous Event Notification (AEN), so the
guest kernel can react to namespace topology changes.

Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
hotplug handler with proper plug and unplug callbacks:

- plug: attach namespace to all started controllers and send an
  Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
  the guest kernel rescans namespaces and adds the block device
- unplug: drain in-flight I/O, detach from all controllers, send
  AEN, then unrealize the device. The guest kernel rescans and
  removes the block device.

The plug handler skips controllers that haven't started yet
(qs_created == false) to avoid interfering with boot-time namespace
attachment in nvme_start_ctrl().

The unplug handler drains in-flight I/O via nvme_ns_drain() before
detaching the namespace from controllers, so pending requests can
complete normally without touching freed state.

For symmetry with nvme_ns_realize() which sets subsys->namespaces[nsid],
nvme_ns_unrealize() now clears that slot too — making the namespace
lifecycle complete.

Both the controller bus and subsystem bus are configured as hotplug
handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
may reparent to the subsystem bus during realize.

Example hot-swap sequence using the NVMe subsystem model:

  # Boot with: -device nvme-subsys,id=subsys0
  #            -device nvme,id=ctrl0,subsys=subsys0
  #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1

  device_del ns0             # guest receives AEN, removes /dev/nvme0n1
  drive_del drv0
  drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
  device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
                              # guest receives AEN, adds /dev/nvme0n1

Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
namespace list automatically).

Signed-off-by: Matthieu <matthieu@min.io>
---
 hw/nvme/ctrl.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/ns.c     |  8 +++++
 hw/nvme/subsys.c |  2 ++
 3 files changed, 98 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index be6c7028cb..2024b0ff75 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -206,6 +206,7 @@
 #include "system/hostmem.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
+#include "hw/core/qdev.h"
 #include "system/spdm-socket.h"
 #include "migration/vmstate.h"
 
@@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     }
 
     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&n->bus));
 
     if (nvme_init_subsys(n, errp)) {
         return;
@@ -9553,10 +9555,96 @@ static const TypeInfo nvme_info = {
     },
 };
 
+static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                              Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Attach to all started controllers and notify via AEN.
+     * Skip controllers that haven't started yet (boot-time realize) —
+     * nvme_start_ctrl() will attach namespaces during controller init.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !ctrl->qs_created) {
+            continue;
+        }
+
+        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
+            nvme_attach_ns(ctrl, ns);
+            nvme_update_dsm_limits(ctrl, ns);
+
+            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                                   NVME_LOG_CHANGED_NSLIST);
+            }
+        }
+    }
+}
+
+static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                               Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Drain in-flight I/O before tearing down the namespace.
+     * This must happen while the namespace is still attached to the
+     * controllers so any pending requests can complete normally.
+     */
+    nvme_ns_drain(ns);
+
+    /*
+     * Detach from all controllers and notify the guest via AEN.
+     * The guest kernel will rescan namespaces and remove the block device.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !nvme_ns(ctrl, nsid)) {
+            continue;
+        }
+
+        nvme_detach_ns(ctrl, ns);
+        nvme_update_dsm_limits(ctrl, NULL);
+
+        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                               NVME_LOG_CHANGED_NSLIST);
+        }
+    }
+
+    /*
+     * Unrealize: removes from subsystem (in nvme_ns_unrealize), flushes,
+     * cleans up structures, and removes from QOM.
+     */
+    qdev_unrealize(dev);
+}
+
+static void nvme_bus_class_init(ObjectClass *klass, const void *data)
+{
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
+    hc->plug = nvme_ns_hot_plug;
+    hc->unplug = nvme_ns_hot_unplug;
+}
+
 static const TypeInfo nvme_bus_info = {
     .name = TYPE_NVME_BUS,
     .parent = TYPE_BUS,
     .instance_size = sizeof(NvmeBus),
+    .class_init = nvme_bus_class_init,
+    .interfaces = (const InterfaceInfo[]) {
+        { TYPE_HOTPLUG_HANDLER },
+        { }
+    },
 };
 
 static void nvme_register_types(void)
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index b0106eaa5c..f4f755c6fc 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -719,10 +719,17 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_unrealize(DeviceState *dev)
 {
     NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
 
     nvme_ns_drain(ns);
     nvme_ns_shutdown(ns);
     nvme_ns_cleanup(ns);
+
+    /* Symmetric with nvme_ns_realize() which sets subsys->namespaces[nsid]. */
+    if (subsys && nsid && subsys->namespaces[nsid] == ns) {
+        subsys->namespaces[nsid] = NULL;
+    }
 }
 
 void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn,
@@ -937,6 +944,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
     dc->bus_type = TYPE_NVME_BUS;
     dc->realize = nvme_ns_realize;
     dc->unrealize = nvme_ns_unrealize;
+    dc->hotpluggable = true;
     device_class_set_props(dc, nvme_ns_props);
     dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 777e1c620f..fa35055d3c 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -9,6 +9,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qapi/error.h"
+#include "hw/core/qdev.h"
 
 #include "nvme.h"
 
@@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
     NvmeSubsystem *subsys = NVME_SUBSYS(dev);
 
     qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
 
     nvme_subsys_setup(subsys, errp);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hw/nvme: add namespace hotplug support
  2026-04-10 14:30 ` [PATCH v3] " mr-083
@ 2026-04-10 14:33   ` Matthieu Rolla
  2026-04-10 20:14     ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-10 14:33 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

[-- Attachment #1: Type: text/plain, Size: 9028 bytes --]

Hello @Stefan Hajnoczi <stefanha@redhat.com> ,


Thanks for the review! v3 sent.


Wording: Fixed in v3, no more "physical PCIe slot" claim. Now describes it
as NVMe namespace-level hotplug.


I/O drain: Moved nvme_ns_drain() to the start of the unplug handler so
in-flight I/O completes before detach. Tested under warp load (16
concurrent 1MiB uploads via MinIO/DirectPV)  device_del returns in ~400ms
with clean removal, no use-after-free.


Symmetry:  moved subsys->namespaces[nsid] = NULL into nvme_ns_unrealize()
so the namespace lifecycle is complete (mirrors what nvme_ns_realize() sets
up).


I don't have a working Windows test setup, I'd really appreciate if you
could test it next week as you offered.

Thanks again for your time

On Fri, Apr 10, 2026 at 4:29 PM mr-083 <matthieu@minio.io> wrote:

> Add hotplug support for nvme-ns devices on the NvmeBus. This enables
> NVMe namespace-level hot-add and hot-remove via device_add and
> device_del with proper Asynchronous Event Notification (AEN), so the
> guest kernel can react to namespace topology changes.
>
> Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
> hotplug handler with proper plug and unplug callbacks:
>
> - plug: attach namespace to all started controllers and send an
>   Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
>   the guest kernel rescans namespaces and adds the block device
> - unplug: drain in-flight I/O, detach from all controllers, send
>   AEN, then unrealize the device. The guest kernel rescans and
>   removes the block device.
>
> The plug handler skips controllers that haven't started yet
> (qs_created == false) to avoid interfering with boot-time namespace
> attachment in nvme_start_ctrl().
>
> The unplug handler drains in-flight I/O via nvme_ns_drain() before
> detaching the namespace from controllers, so pending requests can
> complete normally without touching freed state.
>
> For symmetry with nvme_ns_realize() which sets subsys->namespaces[nsid],
> nvme_ns_unrealize() now clears that slot too — making the namespace
> lifecycle complete.
>
> Both the controller bus and subsystem bus are configured as hotplug
> handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
> may reparent to the subsystem bus during realize.
>
> Example hot-swap sequence using the NVMe subsystem model:
>
>   # Boot with: -device nvme-subsys,id=subsys0
>   #            -device nvme,id=ctrl0,subsys=subsys0
>   #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>
>   device_del ns0             # guest receives AEN, removes /dev/nvme0n1
>   drive_del drv0
>   drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
>   device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>                               # guest receives AEN, adds /dev/nvme0n1
>
> Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
> namespace list automatically).
>
> Signed-off-by: Matthieu <matthieu@min.io>
> ---
>  hw/nvme/ctrl.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/nvme/ns.c     |  8 +++++
>  hw/nvme/subsys.c |  2 ++
>  3 files changed, 98 insertions(+)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index be6c7028cb..2024b0ff75 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -206,6 +206,7 @@
>  #include "system/hostmem.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pcie_sriov.h"
> +#include "hw/core/qdev.h"
>  #include "system/spdm-socket.h"
>  #include "migration/vmstate.h"
>
> @@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error
> **errp)
>      }
>
>      qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
> +    qbus_set_bus_hotplug_handler(BUS(&n->bus));
>
>      if (nvme_init_subsys(n, errp)) {
>          return;
> @@ -9553,10 +9555,96 @@ static const TypeInfo nvme_info = {
>      },
>  };
>
> +static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState
> *dev,
> +                              Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
> +    int i;
> +
> +    /*
> +     * Attach to all started controllers and notify via AEN.
> +     * Skip controllers that haven't started yet (boot-time realize) —
> +     * nvme_start_ctrl() will attach namespaces during controller init.
> +     */
> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
> +        if (!ctrl || !ctrl->qs_created) {
> +            continue;
> +        }
> +
> +        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
> +            nvme_attach_ns(ctrl, ns);
> +            nvme_update_dsm_limits(ctrl, ns);
> +
> +            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
> +                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
> +                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
> +                                   NVME_LOG_CHANGED_NSLIST);
> +            }
> +        }
> +    }
> +}
> +
> +static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState
> *dev,
> +                               Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
> +    int i;
> +
> +    /*
> +     * Drain in-flight I/O before tearing down the namespace.
> +     * This must happen while the namespace is still attached to the
> +     * controllers so any pending requests can complete normally.
> +     */
> +    nvme_ns_drain(ns);
> +
> +    /*
> +     * Detach from all controllers and notify the guest via AEN.
> +     * The guest kernel will rescan namespaces and remove the block
> device.
> +     */
> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
> +        if (!ctrl || !nvme_ns(ctrl, nsid)) {
> +            continue;
> +        }
> +
> +        nvme_detach_ns(ctrl, ns);
> +        nvme_update_dsm_limits(ctrl, NULL);
> +
> +        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
> +            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
> +                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
> +                               NVME_LOG_CHANGED_NSLIST);
> +        }
> +    }
> +
> +    /*
> +     * Unrealize: removes from subsystem (in nvme_ns_unrealize), flushes,
> +     * cleans up structures, and removes from QOM.
> +     */
> +    qdev_unrealize(dev);
> +}
> +
> +static void nvme_bus_class_init(ObjectClass *klass, const void *data)
> +{
> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
> +    hc->plug = nvme_ns_hot_plug;
> +    hc->unplug = nvme_ns_hot_unplug;
> +}
> +
>  static const TypeInfo nvme_bus_info = {
>      .name = TYPE_NVME_BUS,
>      .parent = TYPE_BUS,
>      .instance_size = sizeof(NvmeBus),
> +    .class_init = nvme_bus_class_init,
> +    .interfaces = (const InterfaceInfo[]) {
> +        { TYPE_HOTPLUG_HANDLER },
> +        { }
> +    },
>  };
>
>  static void nvme_register_types(void)
> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index b0106eaa5c..f4f755c6fc 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -719,10 +719,17 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
>  static void nvme_ns_unrealize(DeviceState *dev)
>  {
>      NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
>
>      nvme_ns_drain(ns);
>      nvme_ns_shutdown(ns);
>      nvme_ns_cleanup(ns);
> +
> +    /* Symmetric with nvme_ns_realize() which sets
> subsys->namespaces[nsid]. */
> +    if (subsys && nsid && subsys->namespaces[nsid] == ns) {
> +        subsys->namespaces[nsid] = NULL;
> +    }
>  }
>
>  void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn,
> @@ -937,6 +944,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const
> void *data)
>      dc->bus_type = TYPE_NVME_BUS;
>      dc->realize = nvme_ns_realize;
>      dc->unrealize = nvme_ns_unrealize;
> +    dc->hotpluggable = true;
>      device_class_set_props(dc, nvme_ns_props);
>      dc->desc = "Virtual NVMe namespace";
>  }
> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> index 777e1c620f..fa35055d3c 100644
> --- a/hw/nvme/subsys.c
> +++ b/hw/nvme/subsys.c
> @@ -9,6 +9,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qapi/error.h"
> +#include "hw/core/qdev.h"
>
>  #include "nvme.h"
>
> @@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev,
> Error **errp)
>      NvmeSubsystem *subsys = NVME_SUBSYS(dev);
>
>      qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
> +    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
>
>      nvme_subsys_setup(subsys, errp);
>  }
> --
> 2.53.0
>
>

[-- Attachment #2: Type: text/html, Size: 13689 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hw/nvme: add namespace hotplug support
  2026-04-10 14:33   ` Matthieu Rolla
@ 2026-04-10 20:14     ` Stefan Hajnoczi
  2026-04-13 15:24       ` Matthieu Rolla
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-10 20:14 UTC (permalink / raw)
  To: Matthieu Rolla; +Cc: qemu-devel, qemu-block, its, kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]

On Fri, Apr 10, 2026 at 04:33:47PM +0200, Matthieu Rolla wrote:
> Hello @Stefan Hajnoczi <stefanha@redhat.com> ,
> 
> 
> Thanks for the review! v3 sent.
> 
> 
> Wording: Fixed in v3, no more "physical PCIe slot" claim. Now describes it
> as NVMe namespace-level hotplug.
> 
> 
> I/O drain: Moved nvme_ns_drain() to the start of the unplug handler so
> in-flight I/O completes before detach. Tested under warp load (16
> concurrent 1MiB uploads via MinIO/DirectPV)  device_del returns in ~400ms
> with clean removal, no use-after-free.
> 
> 
> Symmetry:  moved subsys->namespaces[nsid] = NULL into nvme_ns_unrealize()
> so the namespace lifecycle is complete (mirrors what nvme_ns_realize() sets
> up).
> 
> 
> I don't have a working Windows test setup, I'd really appreciate if you
> could test it next week as you offered.
> 
> Thanks again for your time

Awesome, thanks!

I will give Windows Server a spin next week. I'm not an expert in
hw/nvme/ but the patch looks good to me:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hw/nvme: add namespace hotplug support
  2026-04-10 20:14     ` Stefan Hajnoczi
@ 2026-04-13 15:24       ` Matthieu Rolla
  0 siblings, 0 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-13 15:24 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, its, kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 2349 bytes --]

Hello,

I tested further more the device_del + device_add and the AEN works
correctly, causing the guest to rescan namespaces.

However, when XFS filesystems are mounted on the namespace (via
DirectPV/CSI in our case), the kernel doesn't reuse the block device number
after re-add.

For example, nvme2n1 becomes nvme2n2 because the stale XFS mount holds a
reference to the old nvme_ns_head, preventing ida_free() in
nvme_free_ns_head() from releasing the instance number before the new
namespace is allocated.

This causes XFS "duplicate UUID" errors since the same filesystem appears
under a different device name while the old stale mount still exists.

This is a Linux kernel behavior (ida_alloc in nvme_alloc_ns_head), not a
QEMU issue. The namespace hotplug patch itself is correct. But for
practical use with mounted filesystems, I found that the drive_del +
drive_insert approach (keeping the namespace device alive) avoids the
rename entirely since no ida allocation/free cycle occurs.

Would it make sense to include drive_insert as a companion patch, or is
that better handled separately ?

Thanks

On Fri, Apr 10, 2026 at 10:14 PM Stefan Hajnoczi <stefanha@redhat.com>
wrote:

> On Fri, Apr 10, 2026 at 04:33:47PM +0200, Matthieu Rolla wrote:
> > Hello @Stefan Hajnoczi <stefanha@redhat.com> ,
> >
> >
> > Thanks for the review! v3 sent.
> >
> >
> > Wording: Fixed in v3, no more "physical PCIe slot" claim. Now describes
> it
> > as NVMe namespace-level hotplug.
> >
> >
> > I/O drain: Moved nvme_ns_drain() to the start of the unplug handler so
> > in-flight I/O completes before detach. Tested under warp load (16
> > concurrent 1MiB uploads via MinIO/DirectPV)  device_del returns in ~400ms
> > with clean removal, no use-after-free.
> >
> >
> > Symmetry:  moved subsys->namespaces[nsid] = NULL into nvme_ns_unrealize()
> > so the namespace lifecycle is complete (mirrors what nvme_ns_realize()
> sets
> > up).
> >
> >
> > I don't have a working Windows test setup, I'd really appreciate if you
> > could test it next week as you offered.
> >
> > Thanks again for your time
>
> Awesome, thanks!
>
> I will give Windows Server a spin next week. I'm not an expert in
> hw/nvme/ but the patch looks good to me:
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>

[-- Attachment #2: Type: text/html, Size: 2975 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
                   ` (3 preceding siblings ...)
  2026-04-10 14:30 ` [PATCH v3] " mr-083
@ 2026-04-13 17:17 ` Klaus Jensen
  2026-04-14 12:42   ` Stefan Hajnoczi
  2026-04-15 17:38 ` [PATCH v4] hw/nvme: add namespace hotplug support mr-083
  5 siblings, 1 reply; 34+ messages in thread
From: Klaus Jensen @ 2026-04-13 17:17 UTC (permalink / raw)
  To: mr-083; +Cc: qemu-devel, qemu-block, kbusch, stefanha, mr-083

[-- Attachment #1: Type: text/plain, Size: 373 bytes --]

On Apr  9 08:01, mr-083 wrote:
> This series adds two features that together enable transparent NVMe disk
> hot-swap simulation in QEMU, matching the behavior of physical NVMe
> drives being pulled and reinserted in the same PCIe slot.
> 

I don't understand this. From an NVMe perspective you can't hotplug a
namespace. You can hotplug a PCIe-based NVM Subsystem.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-13 17:17 ` [PATCH 0/2] NVMe namespace hotplug and drive reconnection support Klaus Jensen
@ 2026-04-14 12:42   ` Stefan Hajnoczi
  2026-04-14 13:36     ` Matthieu Rolla
                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-14 12:42 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: mr-083, qemu-devel, qemu-block, kbusch, mr-083, John Meneghini

[-- Attachment #1: Type: text/plain, Size: 1684 bytes --]

On Mon, Apr 13, 2026 at 07:17:37PM +0200, Klaus Jensen wrote:
> On Apr  9 08:01, mr-083 wrote:
> > This series adds two features that together enable transparent NVMe disk
> > hot-swap simulation in QEMU, matching the behavior of physical NVMe
> > drives being pulled and reinserted in the same PCIe slot.
> > 
> 
> I don't understand this. From an NVMe perspective you can't hotplug a
> namespace. You can hotplug a PCIe-based NVM Subsystem.

Hi Klaus,
It would be great if someone with more NVMe experience than myself can
find a definite answer, but I think the Namespace List can change
asynchronously even on a NVMe PCIe controller as long as it supports
Namespace Management commands.

There are instances in the NVMe Express Base Specification 2.0b like:
- 8.3.1 Capacity Management Overview
  "a Namespace Attribute Changed event is generated for hosts other than
  the host which issued the Capacity Management command"
- 8.11 Namespace Management
  "If Namespace Attribute Notices are enabled, any controller(s) not
  processing the Namespace Management command that was attached to the
  namespace reports a Namespace Attribute Changed asynchronous event to
  the host."

I imagine this functionality would be useful in storage offload cards
(IPUs/DPUs) that present as NVMe PCIe controllers instead of as
NVMe-over-Fabrics. This makes sense when the host is not supposed to
manage the storage itself. When the card's control plane configures a
new volume, the NVMe Namespace List changes and the host is notified.

Linux and Windows NVMe PCI drivers support this according to the testing
that Matthieu and I have done.

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 12:42   ` Stefan Hajnoczi
@ 2026-04-14 13:36     ` Matthieu Rolla
  2026-04-14 18:09       ` Keith Busch
  2026-04-14 18:10       ` Stefan Hajnoczi
  2026-04-14 14:04     ` John Meneghini
  2026-04-14 14:42     ` Keith Busch
  2 siblings, 2 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-14 13:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Klaus Jensen, qemu-devel, qemu-block, kbusch, mr-083,
	John Meneghini

[-- Attachment #1: Type: text/plain, Size: 2552 bytes --]

Thanks for testing Windows, Stefan! 

Great to have confirmation on both Linux and Windows.


Regarding `drive_insert`,  I found that `device_del` + `device_add` works well when no filesystem is mounted on the namespace. 

However, when XFS is mounted (e.g. via DirectPV/CSI), the Linux kernel doesn't reuse the block device number (nvme0n1 becomes nvme0n2) because the stale mount holds a reference to the old `nvme_ns_head`, preventing `ida_free()`. 

This causes XFS "duplicate UUID" errors on remount.

`drive_insert` avoids this by keeping the namespace device alive which means no ida cycle, same block device name. 

Should I send it as a separate follow-up patch, or keep it in this series?

Matthieu

> On Apr 14, 2026, at 2:42 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Apr 13, 2026 at 07:17:37PM +0200, Klaus Jensen wrote:
>> On Apr  9 08:01, mr-083 wrote:
>>> This series adds two features that together enable transparent NVMe disk
>>> hot-swap simulation in QEMU, matching the behavior of physical NVMe
>>> drives being pulled and reinserted in the same PCIe slot.
>>> 
>> 
>> I don't understand this. From an NVMe perspective you can't hotplug a
>> namespace. You can hotplug a PCIe-based NVM Subsystem.
> 
> Hi Klaus,
> It would be great if someone with more NVMe experience than myself can
> find a definite answer, but I think the Namespace List can change
> asynchronously even on a NVMe PCIe controller as long as it supports
> Namespace Management commands.
> 
> There are instances in the NVMe Express Base Specification 2.0b like:
> - 8.3.1 Capacity Management Overview
>  "a Namespace Attribute Changed event is generated for hosts other than
>  the host which issued the Capacity Management command"
> - 8.11 Namespace Management
>  "If Namespace Attribute Notices are enabled, any controller(s) not
>  processing the Namespace Management command that was attached to the
>  namespace reports a Namespace Attribute Changed asynchronous event to
>  the host."
> 
> I imagine this functionality would be useful in storage offload cards
> (IPUs/DPUs) that present as NVMe PCIe controllers instead of as
> NVMe-over-Fabrics. This makes sense when the host is not supposed to
> manage the storage itself. When the card's control plane configures a
> new volume, the NVMe Namespace List changes and the host is notified.
> 
> Linux and Windows NVMe PCI drivers support this according to the testing
> that Matthieu and I have done.
> 
> Thanks,
> Stefan


[-- Attachment #2: Type: text/html, Size: 6306 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 12:42   ` Stefan Hajnoczi
  2026-04-14 13:36     ` Matthieu Rolla
@ 2026-04-14 14:04     ` John Meneghini
  2026-04-16 10:11       ` Nilay Shroff
  2026-04-14 14:42     ` Keith Busch
  2 siblings, 1 reply; 34+ messages in thread
From: John Meneghini @ 2026-04-14 14:04 UTC (permalink / raw)
  To: Stefan Hajnoczi, Klaus Jensen, Nilay Shroff
  Cc: mr-083, qemu-devel, qemu-block, kbusch, mr-083

Adding Nilay who has done a lot of work on nvme hot plug.

Nilay please take look at these patches and let us know if they can work on powerpc

I'll set up a test bed and try this out with x86_64.

John A. Meneghini
Senior Principal Platform Storage Engineer
RHEL SST - Platform Storage Group
jmeneghi@redhat.com

On 4/14/26 8:42 AM, Stefan Hajnoczi wrote:
> On Mon, Apr 13, 2026 at 07:17:37PM +0200, Klaus Jensen wrote:
>> On Apr  9 08:01, mr-083 wrote:
>>> This series adds two features that together enable transparent NVMe disk
>>> hot-swap simulation in QEMU, matching the behavior of physical NVMe
>>> drives being pulled and reinserted in the same PCIe slot.
>>>
>>
>> I don't understand this. From an NVMe perspective you can't hotplug a
>> namespace. You can hotplug a PCIe-based NVM Subsystem.
> 
> Hi Klaus,
> It would be great if someone with more NVMe experience than myself can
> find a definite answer, but I think the Namespace List can change
> asynchronously even on a NVMe PCIe controller as long as it supports
> Namespace Management commands.
> 
> There are instances in the NVMe Express Base Specification 2.0b like:
> - 8.3.1 Capacity Management Overview
>    "a Namespace Attribute Changed event is generated for hosts other than
>    the host which issued the Capacity Management command"
> - 8.11 Namespace Management
>    "If Namespace Attribute Notices are enabled, any controller(s) not
>    processing the Namespace Management command that was attached to the
>    namespace reports a Namespace Attribute Changed asynchronous event to
>    the host."
> 
> I imagine this functionality would be useful in storage offload cards
> (IPUs/DPUs) that present as NVMe PCIe controllers instead of as
> NVMe-over-Fabrics. This makes sense when the host is not supposed to
> manage the storage itself. When the card's control plane configures a
> new volume, the NVMe Namespace List changes and the host is notified.
> 
> Linux and Windows NVMe PCI drivers support this according to the testing
> that Matthieu and I have done.
> 
> Thanks,
> Stefan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 12:42   ` Stefan Hajnoczi
  2026-04-14 13:36     ` Matthieu Rolla
  2026-04-14 14:04     ` John Meneghini
@ 2026-04-14 14:42     ` Keith Busch
  2 siblings, 0 replies; 34+ messages in thread
From: Keith Busch @ 2026-04-14 14:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Klaus Jensen, mr-083, qemu-devel, qemu-block, mr-083,
	John Meneghini

On Tue, Apr 14, 2026 at 08:42:21AM -0400, Stefan Hajnoczi wrote:
> On Mon, Apr 13, 2026 at 07:17:37PM +0200, Klaus Jensen wrote:
> > On Apr  9 08:01, mr-083 wrote:
> > > This series adds two features that together enable transparent NVMe disk
> > > hot-swap simulation in QEMU, matching the behavior of physical NVMe
> > > drives being pulled and reinserted in the same PCIe slot.
> > > 
> > 
> > I don't understand this. From an NVMe perspective you can't hotplug a
> > namespace. You can hotplug a PCIe-based NVM Subsystem.
> 
> Hi Klaus,
> It would be great if someone with more NVMe experience than myself can
> find a definite answer, but I think the Namespace List can change
> asynchronously even on a NVMe PCIe controller as long as it supports
> Namespace Management commands.

I think there's some clash in terminology. From nvme protocol side,
hotplug refers to bus events detected by the host, so something like
PCIe slot capabilities defines how that works. This series is doing
something behind the scenes from the host-controller interface
visibility, so it's just coincidence that framework is also called
"hotplug". From nvme protocol perspective, this patch looks like a qemu
specific out-of-band method for namespace "attach/detach" via the QMP
interface. Sounds fine to me: the nvme namespace events are not strictly
tied to the spec defined in-band attachment status.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-09  6:01 ` [PATCH 2/2] block/monitor: add drive_insert HMP command mr-083
@ 2026-04-14 17:57   ` Stefan Hajnoczi
  2026-04-14 18:02     ` Matthieu Rolla
  2026-04-15 10:48   ` Daniel P. Berrangé
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-14 17:57 UTC (permalink / raw)
  To: mr-083; +Cc: qemu-devel, qemu-block, its, kbusch, mr-083, kwolf

[-- Attachment #1: Type: text/plain, Size: 1576 bytes --]

On Thu, Apr 09, 2026 at 08:01:54AM +0200, mr-083 wrote:
> Add a drive_insert HMP command that reconnects a host block device file
> to an existing guest device whose backing store was previously removed
> with drive_del.
> 
> After drive_del, the BlockBackend remains attached to the guest device
> but has no BlockDriverState (shown as "[not inserted]" in info block).
> drive_insert opens the specified file, finds the device's BlockBackend
> by iterating all backends and matching the attached device ID, then
> calls blk_insert_bs() to reconnect the backing store.
> 
> This complements drive_del for non-removable devices (such as NVMe
> namespaces) where blockdev-change-medium cannot be used. Combined with
> PCIe AER Surprise Down error injection to trigger a controller reset,
> this enables complete NVMe disk hot-swap simulation where the guest
> sees the same device names throughout.
> 
> Example usage:
>   drive_del drv0             # remove backing store
>   drive_insert ns0 disk.qcow2  # reconnect backing
>   pcie_aer_inject_error rp0 SDN  # trigger controller reset

This approach does not delete the `--device nvme-ns` but instead
replaces the BlockBackend's root node so that the existing NVMe
Namespace changes the underlying storage.

Questions about this approach:
1. NVMe AEN is not involved at all?
2. Does the guest have access to the new storage as soon as drive_insert
   completes and the PCIe AER is just to kick the guest driver (e.g.
   getting it to update the size of the guest Linux block device)?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-14 17:57   ` Stefan Hajnoczi
@ 2026-04-14 18:02     ` Matthieu Rolla
  2026-04-14 19:05       ` Warner Losh
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-14 18:02 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, its, kbusch, mr-083, kwolf

[-- Attachment #1: Type: text/plain, Size: 2764 bytes --]

Hello Stefan,

1. Correct, no AEN involved.The namespace stays attached to the controller throughout. Only the BlockBackend's root node changes, so from the NVMe protocol perspective nothing happened to the namespace topology.

2. Yes, the new storage is accessible immediately after drive_insert at the QEMU block layer level. The SDN is needed to reset the NVMe controller which may have cached an error state from I/O failures during the period when no backing was present. Without the reset, the guest driver continues to see the controller as faulted. The SDN triggers the standard PCIe AER recovery path (freeze → slot reset → restart) which clears the error state and resumes normal I/O.

One thing to note: since the namespace device stays alive, the guest kernel's block device number is preserved (e.g. /dev/nvme0n1 stays nvme0n1). This avoids the ida_alloc renaming issue that occurs with device_del + device_add when filesystems hold references to the old namespace head.

Thanks

	.	
Matthieu
www.min.io <>
matthieu@min.io <> 

> On Apr 14, 2026, at 7:57 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Apr 09, 2026 at 08:01:54AM +0200, mr-083 wrote:
>> Add a drive_insert HMP command that reconnects a host block device file
>> to an existing guest device whose backing store was previously removed
>> with drive_del.
>> 
>> After drive_del, the BlockBackend remains attached to the guest device
>> but has no BlockDriverState (shown as "[not inserted]" in info block).
>> drive_insert opens the specified file, finds the device's BlockBackend
>> by iterating all backends and matching the attached device ID, then
>> calls blk_insert_bs() to reconnect the backing store.
>> 
>> This complements drive_del for non-removable devices (such as NVMe
>> namespaces) where blockdev-change-medium cannot be used. Combined with
>> PCIe AER Surprise Down error injection to trigger a controller reset,
>> this enables complete NVMe disk hot-swap simulation where the guest
>> sees the same device names throughout.
>> 
>> Example usage:
>>  drive_del drv0             # remove backing store
>>  drive_insert ns0 disk.qcow2  # reconnect backing
>>  pcie_aer_inject_error rp0 SDN  # trigger controller reset
> 
> This approach does not delete the `--device nvme-ns` but instead
> replaces the BlockBackend's root node so that the existing NVMe
> Namespace changes the underlying storage.
> 
> Questions about this approach:
> 1. NVMe AEN is not involved at all?
> 2. Does the guest have access to the new storage as soon as drive_insert
>   completes and the PCIe AER is just to kick the guest driver (e.g.
>   getting it to update the size of the guest Linux block device)?
> 
> Stefan

[-- Attachment #2: Type: text/html, Size: 6580 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 13:36     ` Matthieu Rolla
@ 2026-04-14 18:09       ` Keith Busch
  2026-04-14 18:10       ` Stefan Hajnoczi
  1 sibling, 0 replies; 34+ messages in thread
From: Keith Busch @ 2026-04-14 18:09 UTC (permalink / raw)
  To: Matthieu Rolla
  Cc: Stefan Hajnoczi, Klaus Jensen, qemu-devel, qemu-block, mr-083,
	John Meneghini

On Tue, Apr 14, 2026 at 03:36:19PM +0200, Matthieu Rolla wrote:
> Regarding `drive_insert`,  I found that `device_del` + `device_add` works well when no filesystem is mounted on the namespace. 
> 
> However, when XFS is mounted (e.g. via DirectPV/CSI), the Linux kernel doesn't reuse the block device number (nvme0n1 becomes nvme0n2) because the stale mount holds a reference to the old `nvme_ns_head`, preventing `ida_free()`. 
> 
> This causes XFS "duplicate UUID" errors on remount.
> 
> `drive_insert` avoids this by keeping the namespace device alive which means no ida cycle, same block device name. 

Are you attempting some kind of covert way to swap out the backend
without the host knowing you did that? Isn't that just going to confuse
the filesystem that's actively using the previous backend when it's
in-memory context no longer aligns with the on-disk format?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 13:36     ` Matthieu Rolla
  2026-04-14 18:09       ` Keith Busch
@ 2026-04-14 18:10       ` Stefan Hajnoczi
  2026-04-14 18:14         ` Matthieu Rolla
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-14 18:10 UTC (permalink / raw)
  To: Matthieu Rolla
  Cc: Klaus Jensen, qemu-devel, qemu-block, kbusch, mr-083,
	John Meneghini

[-- Attachment #1: Type: text/plain, Size: 1494 bytes --]

On Tue, Apr 14, 2026 at 03:36:19PM +0200, Matthieu Rolla wrote:
> Regarding `drive_insert`,  I found that `device_del` + `device_add` works well when no filesystem is mounted on the namespace. 
> 
> However, when XFS is mounted (e.g. via DirectPV/CSI), the Linux kernel doesn't reuse the block device number (nvme0n1 becomes nvme0n2) because the stale mount holds a reference to the old `nvme_ns_head`, preventing `ida_free()`. 

Can you use the stable device names in /dev/disk/by-*/ instead of the
/dev/nvmeCnN names to access the new namespace? Then it won't matter
that ida_free() hasn't been called yet.

> This causes XFS "duplicate UUID" errors on remount.

(I have to admit that using stable device names doesn't solve this
because the guest kernel still potentially has multiple XFS mounts for
the file system.)

> `drive_insert` avoids this by keeping the namespace device alive which means no ida cycle, same block device name. 

Are you sure this is safe? Even if PCIe AER somehow kills the old XFS
mount, then there is still a race condition between drive_insert and
PCIe AER injection when the guest kernel sees the new underlying storage
through the old XFS mount.

Getting this wrong could cause data corruption, so it needs to be well
understood. I don't really understand and would need to look at the
guest kernel code path. Can you describe what happens to the guest
kernel blkdev and the XFS mount in the drive_insert workflow?

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 18:10       ` Stefan Hajnoczi
@ 2026-04-14 18:14         ` Matthieu Rolla
  2026-04-15 12:45           ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-14 18:14 UTC (permalink / raw)
  To: kbusch
  Cc: Klaus Jensen, qemu-devel, qemu-block, mr-083, John Meneghini,
	Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 2632 bytes --]

Hello Keith, 

To clarify,  we're not swapping to a different backend. It's the same disk file being disconnected and reconnected, simulating a physical drive being pulled and reinserted.

The sequence is:
drive_del -> disconnect the backing (simulates drive pull)

User does whatever they need (test failure handling, etc.)

drive_insert -> reconnect the same backing file (simulates drive reinsertion)
SDN -> reset controller so guest resumes I/O

The filesystem on disk is unchanged, same data, same UUID, same format. The guest's in-memory state realigns with the on-disk state after the controller reset, just like it would after a physical drive reinsertion on real hardware.

The use case is a storage integration lab where we need to simulate disk failures and recoveries without the guest block device being renamed, which is what happens with device_del + device_add due to the kernel's ida_alloc behavior.

Thank you


	.	
Matthieu
www.min.io <>
matthieu@min.io <> 

> On Apr 14, 2026, at 8:10 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Tue, Apr 14, 2026 at 03:36:19PM +0200, Matthieu Rolla wrote:
>> Regarding `drive_insert`,  I found that `device_del` + `device_add` works well when no filesystem is mounted on the namespace. 
>> 
>> However, when XFS is mounted (e.g. via DirectPV/CSI), the Linux kernel doesn't reuse the block device number (nvme0n1 becomes nvme0n2) because the stale mount holds a reference to the old `nvme_ns_head`, preventing `ida_free()`.
> 
> Can you use the stable device names in /dev/disk/by-*/ instead of the
> /dev/nvmeCnN names to access the new namespace? Then it won't matter
> that ida_free() hasn't been called yet.
> 
>> This causes XFS "duplicate UUID" errors on remount.
> 
> (I have to admit that using stable device names doesn't solve this
> because the guest kernel still potentially has multiple XFS mounts for
> the file system.)
> 
>> `drive_insert` avoids this by keeping the namespace device alive which means no ida cycle, same block device name.
> 
> Are you sure this is safe? Even if PCIe AER somehow kills the old XFS
> mount, then there is still a race condition between drive_insert and
> PCIe AER injection when the guest kernel sees the new underlying storage
> through the old XFS mount.
> 
> Getting this wrong could cause data corruption, so it needs to be well
> understood. I don't really understand and would need to look at the
> guest kernel code path. Can you describe what happens to the guest
> kernel blkdev and the XFS mount in the drive_insert workflow?
> 
> Thanks,
> Stefan


[-- Attachment #2: Type: text/html, Size: 4397 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-14 18:02     ` Matthieu Rolla
@ 2026-04-14 19:05       ` Warner Losh
  2026-04-14 21:01         ` Matthieu Rolla
  0 siblings, 1 reply; 34+ messages in thread
From: Warner Losh @ 2026-04-14 19:05 UTC (permalink / raw)
  To: Matthieu Rolla
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block, its, kbusch, mr-083,
	kwolf

[-- Attachment #1: Type: text/plain, Size: 3406 bytes --]

On Tue, Apr 14, 2026 at 12:02 PM Matthieu Rolla <matthieu@minio.io> wrote:

> Hello Stefan,
>
> 1. Correct, no AEN involved.The namespace stays attached to the controller
> throughout. Only the BlockBackend's root node changes, so from the NVMe
> protocol perspective nothing happened to the namespace topology.
>
> 2. Yes, the new storage is accessible immediately after drive_insert at
> the QEMU block layer level. The SDN is needed to reset the NVMe controller
> which may have cached an error state from I/O failures during the period
> when no backing was present. Without the reset, the guest driver continues
> to see the controller as faulted. The SDN triggers the standard PCIe AER
> recovery path (freeze → slot reset → restart) which clears the error state
> and resumes normal I/O.
>
> One thing to note: since the namespace device stays alive, the guest
> kernel's block device number is preserved (e.g. /dev/nvme0n1 stays
> nvme0n1). This avoids the ida_alloc renaming issue that occurs with
> device_del + device_add when filesystems hold references to the old
> namespace head.
>

OK. I had hoped this would let me test my FreeBSD namespace arrival and
departure messages correctly. I just added the code for AWS (and other VM)
support for attaching / detaching drives and was hoping I could turn this
into a regression test, but sounds like no.

I'll have to go look at what the Linux driver does here. Right now Surprise
Down events just increment a counter on FreeBSD...  Will thta work for a
DPC system too? Or is that not relevant for the current state of the qemu
art?

Warner


>
> Thanks
>
>
> .
> *Matthieu*
> www.min.io
> matthieu@min.io
>
> On Apr 14, 2026, at 7:57 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 09, 2026 at 08:01:54AM +0200, mr-083 wrote:
>
> Add a drive_insert HMP command that reconnects a host block device file
> to an existing guest device whose backing store was previously removed
> with drive_del.
>
> After drive_del, the BlockBackend remains attached to the guest device
> but has no BlockDriverState (shown as "[not inserted]" in info block).
> drive_insert opens the specified file, finds the device's BlockBackend
> by iterating all backends and matching the attached device ID, then
> calls blk_insert_bs() to reconnect the backing store.
>
> This complements drive_del for non-removable devices (such as NVMe
> namespaces) where blockdev-change-medium cannot be used. Combined with
> PCIe AER Surprise Down error injection to trigger a controller reset,
> this enables complete NVMe disk hot-swap simulation where the guest
> sees the same device names throughout.
>
> Example usage:
>  drive_del drv0             # remove backing store
>  drive_insert ns0 disk.qcow2  # reconnect backing
>  pcie_aer_inject_error rp0 SDN  # trigger controller reset
>
>
> This approach does not delete the `--device nvme-ns` but instead
> replaces the BlockBackend's root node so that the existing NVMe
> Namespace changes the underlying storage.
>
> Questions about this approach:
> 1. NVMe AEN is not involved at all?
> 2. Does the guest have access to the new storage as soon as drive_insert
>   completes and the PCIe AER is just to kick the guest driver (e.g.
>   getting it to update the size of the guest Linux block device)?
>
> Stefan
>
>
>

[-- Attachment #2: Type: text/html, Size: 7092 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-14 19:05       ` Warner Losh
@ 2026-04-14 21:01         ` Matthieu Rolla
  0 siblings, 0 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-14 21:01 UTC (permalink / raw)
  To: Warner Losh
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block, its, kbusch, mr-083,
	kwolf

[-- Attachment #1: Type: text/plain, Size: 4735 bytes --]

Hello Warner,

For testing namespace arrival and departure, Patch 1 (nvme-ns hotplug with
AEN) is what you want. That properly executes device_del + device_add which
triggers a Namespace Attribute Changed AEN, the guest sees the namespace
disappear and reappear.

drive_insert is specifically for reconnecting a backing store after
drive_del without the guest seeing a namespace topology change. This is
useful for simulating disk removal and reinsertion, or replacing a disk,
while preserving the guest's block device name, as on a real system.
Different use cases, expected behavior.

Regarding FreeBSD and SDN, the SDN triggers PCIe AER recovery at the root
port level. If FreeBSD just increments a counter for Surprise Down events
without initiating a slot reset + controller restart, then the controller
won't recover.

Linux handles SDN by freezing the device, resetting the link, and
restarting the driver. FreeBSD would need similar handling for the
drive_insert + SDN flow to work. But again, for namespace arrival/departure
testing, you don't need SDN at all, you can use device_del/device_add and
the AEN from Patch 1.

On Tue, Apr 14, 2026 at 9:06 PM Warner Losh <imp@bsdimp.com> wrote:

>
>
> On Tue, Apr 14, 2026 at 12:02 PM Matthieu Rolla <matthieu@minio.io> wrote:
>
>> Hello Stefan,
>>
>> 1. Correct, no AEN involved.The namespace stays attached to the
>> controller throughout. Only the BlockBackend's root node changes, so from
>> the NVMe protocol perspective nothing happened to the namespace topology.
>>
>> 2. Yes, the new storage is accessible immediately after drive_insert at
>> the QEMU block layer level. The SDN is needed to reset the NVMe controller
>> which may have cached an error state from I/O failures during the period
>> when no backing was present. Without the reset, the guest driver continues
>> to see the controller as faulted. The SDN triggers the standard PCIe AER
>> recovery path (freeze → slot reset → restart) which clears the error state
>> and resumes normal I/O.
>>
>> One thing to note: since the namespace device stays alive, the guest
>> kernel's block device number is preserved (e.g. /dev/nvme0n1 stays
>> nvme0n1). This avoids the ida_alloc renaming issue that occurs with
>> device_del + device_add when filesystems hold references to the old
>> namespace head.
>>
>
> OK. I had hoped this would let me test my FreeBSD namespace arrival and
> departure messages correctly. I just added the code for AWS (and other VM)
> support for attaching / detaching drives and was hoping I could turn this
> into a regression test, but sounds like no.
>
> I'll have to go look at what the Linux driver does here. Right now
> Surprise Down events just increment a counter on FreeBSD...  Will thta work
> for a DPC system too? Or is that not relevant for the current state of the
> qemu art?
>
> Warner
>
>
>>
>> Thanks
>>
>>
>> .
>> *Matthieu*
>> www.min.io
>> matthieu@min.io
>>
>> On Apr 14, 2026, at 7:57 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>> On Thu, Apr 09, 2026 at 08:01:54AM +0200, mr-083 wrote:
>>
>> Add a drive_insert HMP command that reconnects a host block device file
>> to an existing guest device whose backing store was previously removed
>> with drive_del.
>>
>> After drive_del, the BlockBackend remains attached to the guest device
>> but has no BlockDriverState (shown as "[not inserted]" in info block).
>> drive_insert opens the specified file, finds the device's BlockBackend
>> by iterating all backends and matching the attached device ID, then
>> calls blk_insert_bs() to reconnect the backing store.
>>
>> This complements drive_del for non-removable devices (such as NVMe
>> namespaces) where blockdev-change-medium cannot be used. Combined with
>> PCIe AER Surprise Down error injection to trigger a controller reset,
>> this enables complete NVMe disk hot-swap simulation where the guest
>> sees the same device names throughout.
>>
>> Example usage:
>>  drive_del drv0             # remove backing store
>>  drive_insert ns0 disk.qcow2  # reconnect backing
>>  pcie_aer_inject_error rp0 SDN  # trigger controller reset
>>
>>
>> This approach does not delete the `--device nvme-ns` but instead
>> replaces the BlockBackend's root node so that the existing NVMe
>> Namespace changes the underlying storage.
>>
>> Questions about this approach:
>> 1. NVMe AEN is not involved at all?
>> 2. Does the guest have access to the new storage as soon as drive_insert
>>   completes and the PCIe AER is just to kick the guest driver (e.g.
>>   getting it to update the size of the guest Linux block device)?
>>
>> Stefan
>>
>>
>>

[-- Attachment #2: Type: text/html, Size: 8642 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-09  6:01 ` [PATCH 2/2] block/monitor: add drive_insert HMP command mr-083
  2026-04-14 17:57   ` Stefan Hajnoczi
@ 2026-04-15 10:48   ` Daniel P. Berrangé
  2026-04-15 12:32     ` Matthieu Rolla
  2026-04-15 12:33     ` Stefan Hajnoczi
  1 sibling, 2 replies; 34+ messages in thread
From: Daniel P. Berrangé @ 2026-04-15 10:48 UTC (permalink / raw)
  To: mr-083; +Cc: qemu-devel, qemu-block, its, kbusch, stefanha, mr-083

On Thu, Apr 09, 2026 at 08:01:54AM +0200, mr-083 wrote:
> Add a drive_insert HMP command that reconnects a host block device file
> to an existing guest device whose backing store was previously removed
> with drive_del.
> 
> After drive_del, the BlockBackend remains attached to the guest device
> but has no BlockDriverState (shown as "[not inserted]" in info block).
> drive_insert opens the specified file, finds the device's BlockBackend
> by iterating all backends and matching the attached device ID, then
> calls blk_insert_bs() to reconnect the backing store.
> 
> This complements drive_del for non-removable devices (such as NVMe
> namespaces) where blockdev-change-medium cannot be used. Combined with
> PCIe AER Surprise Down error injection to trigger a controller reset,
> this enables complete NVMe disk hot-swap simulation where the guest
> sees the same device names throughout.
> 
> Example usage:
>   drive_del drv0             # remove backing store
>   drive_insert ns0 disk.qcow2  # reconnect backing
>   pcie_aer_inject_error rp0 SDN  # trigger controller reset
> 
> Signed-off-by: Matthieu Receveur <matthieu@min.io>
> ---
>  block/monitor/block-hmp-cmds.c | 59 ++++++++++++++++++++++++++++++++++
>  hmp-commands.hx                | 18 +++++++++++

I see v3 has dropped this new command, but in case you have plans
to re-introduce it..

First no new HMP-only commands please. Anything new must be implemented
in QMP, and the HMP must be a shim to the QMP.

drive_insert semantics are somewhat overloaded - it is doing two jobs,
first creating the block backend, and then associating the backend with
a device.  IMHO those tasks should be separate.  "drive_add" or
"blockdev-add" can already do the first task, so we shouldn't replicate
that. AFAICT, it would only need a way to do the backend/frontend
association, and possibly the PCI error injection should be done as
a standard part of that re-association of backends, which would avoid
the need for pcie_aer_inject_error which has no QMP impl currently.



>  include/block/block-hmp-cmds.h |  1 +
>  3 files changed, 78 insertions(+)
> 
> diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
> index 1fd28d59eb..77e9662ead 100644
> --- a/block/monitor/block-hmp-cmds.c
> +++ b/block/monitor/block-hmp-cmds.c
> @@ -38,7 +38,9 @@
>  #include "qemu/osdep.h"
>  #include "hw/core/boards.h"
>  #include "system/block-backend.h"
> +#include "system/block-backend-global-state.h"
>  #include "system/blockdev.h"
> +#include "block/block-global-state.h"
>  #include "qapi/qapi-commands-block.h"
>  #include "qapi/qapi-commands-block-export.h"
>  #include "qobject/qdict.h"
> @@ -195,6 +197,63 @@ unlock:
>      hmp_handle_error(mon, err);
>  }
>  
> +void hmp_drive_insert(Monitor *mon, const QDict *qdict)
> +{
> +    const char *id = qdict_get_str(qdict, "id");
> +    const char *filename = qdict_get_str(qdict, "filename");
> +    BlockBackend *blk = NULL;
> +    BlockBackend *iter;
> +    BlockDriverState *bs;
> +    Error *err = NULL;
> +
> +    GLOBAL_STATE_CODE();
> +
> +    /*
> +     * After drive_del, the BlockBackend is removed from the monitor name
> +     * registry but still attached to the device. Find it by iterating all
> +     * BlockBackends and matching by the device ID shown in "info block".
> +     */
> +    for (iter = blk_all_next(NULL); iter; iter = blk_all_next(iter)) {
> +        DeviceState *dev = blk_get_attached_dev(iter);
> +        if (dev && dev->id && strcmp(dev->id, id) == 0) {
> +            blk = iter;
> +            break;
> +        }
> +    }
> +
> +    if (!blk) {
> +        /* Fallback: try by block backend name */
> +        blk = blk_by_name(id);
> +    }
> +
> +    if (!blk) {
> +        error_setg(&err, "Device '%s' not found", id);
> +        goto out;
> +    }
> +
> +    if (blk_bs(blk)) {
> +        error_setg(&err, "Device '%s' already has a medium inserted", id);
> +        goto out;
> +    }
> +
> +    bs = bdrv_open(filename, NULL, NULL, BDRV_O_RDWR, &err);
> +    if (!bs) {
> +        goto out;
> +    }
> +
> +    if (blk_insert_bs(blk, bs, &err) < 0) {
> +        bdrv_unref(bs);
> +        goto out;
> +    }
> +
> +    bdrv_unref(bs);
> +    monitor_printf(mon, "OK\n");
> +    return;
> +
> +out:
> +    hmp_handle_error(mon, err);
> +}
> +
>  void hmp_commit(Monitor *mon, const QDict *qdict)
>  {
>      const char *device = qdict_get_str(qdict, "device");
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 5cc4788f12..79af8e8988 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -207,6 +207,24 @@ SRST
>    actions (drive options rerror, werror).
>  ERST
>  
> +    {
> +        .name       = "drive_insert",
> +        .args_type  = "id:B,filename:F",
> +        .params     = "device filename",
> +        .help       = "insert a host block device into an empty drive",
> +        .cmd        = hmp_drive_insert,
> +    },
> +
> +SRST
> +``drive_insert`` *device* *filename*
> +  Insert a host block device file into a drive that has been emptied by
> +  ``drive_del``.  This reconnects the backing store without removing the
> +  guest device, enabling transparent disk hot-swap for non-removable devices
> +  such as NVMe namespaces.  Combined with PCIe AER Surprise Down error
> +  injection (``pcie_aer_inject_error`` *device* ``SDN``), this enables
> +  complete NVMe disk hot-swap simulation.
> +ERST
> +
>      {
>          .name       = "change",
>          .args_type  = "device:B,force:-f,target:F,arg:s?,read-only-mode:s?",
> diff --git a/include/block/block-hmp-cmds.h b/include/block/block-hmp-cmds.h
> index 71113cd7ef..73c9607402 100644
> --- a/include/block/block-hmp-cmds.h
> +++ b/include/block/block-hmp-cmds.h
> @@ -21,6 +21,7 @@ void hmp_drive_add(Monitor *mon, const QDict *qdict);
>  
>  void hmp_commit(Monitor *mon, const QDict *qdict);
>  void hmp_drive_del(Monitor *mon, const QDict *qdict);
> +void hmp_drive_insert(Monitor *mon, const QDict *qdict);
>  
>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>  void hmp_drive_backup(Monitor *mon, const QDict *qdict);
> -- 
> 2.50.1 (Apple Git-155)
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com       ~~        https://hachyderm.io/@berrange :|
|: https://libvirt.org          ~~          https://entangle-photo.org :|
|: https://pixelfed.art/berrange   ~~    https://fstop138.berrange.com :|



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-15 10:48   ` Daniel P. Berrangé
@ 2026-04-15 12:32     ` Matthieu Rolla
  2026-04-16 19:52       ` Stefan Hajnoczi
  2026-04-15 12:33     ` Stefan Hajnoczi
  1 sibling, 1 reply; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-15 12:32 UTC (permalink / raw)
  To: "Daniel P. Berrangé"
  Cc: qemu-devel, qemu-block, its, kbusch, stefanha, mr-083

[-- Attachment #1: Type: text/plain, Size: 1359 bytes --]

Thanks Daniel, 

It makes sense Thanks. 

Looking at the existing code, blockdev-insert-medium already does the backend/frontend association via blk_insert_bs(), but is restricted to removable devices. 
A new QMP command like blockdev-attach could reuse the same logic without the removable restriction, paired with blockdev-add for creating the block node.

Would that be a better approach ? 

Thanks


	.	
Matthieu
www.min.io <>
matthieu@min.io <> 

> On Apr 15, 2026, at 12:48 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
>  see v3 has dropped this new command, but in case you have plans
> to re-introduce it..
> 
> First no new HMP-only commands please. Anything new must be implemented
> in QMP, and the HMP must be a shim to the QMP.
> 
> drive_insert semantics are somewhat overloaded - it is doing two jobs,
> first creating the block backend, and then associating the backend with
> a device.  IMHO those tasks should be separate.  "drive_add" or
> "blockdev-add" can already do the first task, so we shouldn't replicate
> that. AFAICT, it would only need a way to do the backend/frontend
> association, and possibly the PCI error injection should be done as
> a standard part of that re-association of backends, which would avoid
> the need for pcie_aer_inject_error which has no QMP impl currently.


[-- Attachment #2: Type: text/html, Size: 13084 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-15 10:48   ` Daniel P. Berrangé
  2026-04-15 12:32     ` Matthieu Rolla
@ 2026-04-15 12:33     ` Stefan Hajnoczi
  1 sibling, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-15 12:33 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: mr-083, qemu-devel, qemu-block, its, kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 1885 bytes --]

On Wed, Apr 15, 2026 at 11:48:20AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 09, 2026 at 08:01:54AM +0200, mr-083 wrote:
> > Add a drive_insert HMP command that reconnects a host block device file
> > to an existing guest device whose backing store was previously removed
> > with drive_del.
> > 
> > After drive_del, the BlockBackend remains attached to the guest device
> > but has no BlockDriverState (shown as "[not inserted]" in info block).
> > drive_insert opens the specified file, finds the device's BlockBackend
> > by iterating all backends and matching the attached device ID, then
> > calls blk_insert_bs() to reconnect the backing store.
> > 
> > This complements drive_del for non-removable devices (such as NVMe
> > namespaces) where blockdev-change-medium cannot be used. Combined with
> > PCIe AER Surprise Down error injection to trigger a controller reset,
> > this enables complete NVMe disk hot-swap simulation where the guest
> > sees the same device names throughout.
> > 
> > Example usage:
> >   drive_del drv0             # remove backing store
> >   drive_insert ns0 disk.qcow2  # reconnect backing
> >   pcie_aer_inject_error rp0 SDN  # trigger controller reset
> > 
> > Signed-off-by: Matthieu Receveur <matthieu@min.io>
> > ---
> >  block/monitor/block-hmp-cmds.c | 59 ++++++++++++++++++++++++++++++++++
> >  hmp-commands.hx                | 18 +++++++++++
> 
> I see v3 has dropped this new command, but in case you have plans
> to re-introduce it..

I suggest splitting this into two separate patch series since two
different use cases are being addressed:

1. --device nvme-ns hotplug using NVMe AEN. Allows users to attach and
   detach storage to the NVMe controller at runtime (without PCI
   hotplug).

2. PCIe Surprise Down and AER. Allows testing of PCI error recovery in
   guest drivers.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 18:14         ` Matthieu Rolla
@ 2026-04-15 12:45           ` Stefan Hajnoczi
  2026-04-15 17:39             ` Matthieu Rolla
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-15 12:45 UTC (permalink / raw)
  To: Matthieu Rolla
  Cc: kbusch, Klaus Jensen, qemu-devel, qemu-block, mr-083,
	John Meneghini

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

On Tue, Apr 14, 2026 at 08:14:16PM +0200, Matthieu Rolla wrote:
> To clarify,  we're not swapping to a different backend. It's the same disk file being disconnected and reconnected, simulating a physical drive being pulled and reinserted.

Is it necessary to drive_del to simulate PCIe Surprise Down? Can you
perform just the PCIe actions without removing the drive from the NVMe
device? That way the drive_insert command is not necessary.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4] hw/nvme: add namespace hotplug support
  2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
                   ` (4 preceding siblings ...)
  2026-04-13 17:17 ` [PATCH 0/2] NVMe namespace hotplug and drive reconnection support Klaus Jensen
@ 2026-04-15 17:38 ` mr-083
  2026-04-16 19:42   ` Stefan Hajnoczi
  2026-04-17  9:29   ` Klaus Jensen
  5 siblings, 2 replies; 34+ messages in thread
From: mr-083 @ 2026-04-15 17:38 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: its, kbusch, stefanha, mr-083

Add hotplug support for nvme-ns devices on the NvmeBus. This enables
NVMe namespace-level hot-add and hot-remove via device_add and
device_del with proper Asynchronous Event Notification (AEN), so the
guest kernel can react to namespace topology changes.

Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
hotplug handler with proper plug and unplug callbacks:

- plug: attach namespace to all started controllers and send an
  Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
  the guest kernel rescans namespaces and adds the block device
- unplug: drain in-flight I/O, detach from all controllers, send
  AEN, then unrealize the device. The guest kernel rescans and
  removes the block device.

The plug handler skips controllers that haven't started yet
(qs_created == false) to avoid interfering with boot-time namespace
attachment in nvme_start_ctrl().

The unplug handler drains in-flight I/O via nvme_ns_drain() before
detaching the namespace from controllers, so pending requests can
complete normally without touching freed state.

For symmetry with nvme_ns_realize() which sets subsys->namespaces[nsid],
nvme_ns_unrealize() now clears that slot too making the namespace
lifecycle complete.

Both the controller bus and subsystem bus are configured as hotplug
handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
may reparent to the subsystem bus during realize.

Example hot-swap sequence using the NVMe subsystem model:

  # Boot with: -device nvme-subsys,id=subsys0
  #            -device nvme,id=ctrl0,subsys=subsys0
  #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1

  device_del ns0             # guest receives AEN, removes /dev/nvme0n1
  drive_del drv0
  drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
  device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
                              # guest receives AEN, adds /dev/nvme0n1

Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
namespace list automatically).

Signed-off-by: Matthieu <matthieu@min.io>
---
 hw/nvme/ctrl.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/ns.c     |  8 +++++
 hw/nvme/subsys.c |  2 ++
 3 files changed, 98 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index be6c7028cb..2024b0ff75 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -206,6 +206,7 @@
 #include "system/hostmem.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
+#include "hw/core/qdev.h"
 #include "system/spdm-socket.h"
 #include "migration/vmstate.h"
 
@@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     }
 
     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&n->bus));
 
     if (nvme_init_subsys(n, errp)) {
         return;
@@ -9553,10 +9555,96 @@ static const TypeInfo nvme_info = {
     },
 };
 
+static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                              Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Attach to all started controllers and notify via AEN.
+     * Skip controllers that haven't started yet (boot-time realize) —
+     * nvme_start_ctrl() will attach namespaces during controller init.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !ctrl->qs_created) {
+            continue;
+        }
+
+        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
+            nvme_attach_ns(ctrl, ns);
+            nvme_update_dsm_limits(ctrl, ns);
+
+            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                                   NVME_LOG_CHANGED_NSLIST);
+            }
+        }
+    }
+}
+
+static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
+                               Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
+    int i;
+
+    /*
+     * Drain in-flight I/O before tearing down the namespace.
+     * This must happen while the namespace is still attached to the
+     * controllers so any pending requests can complete normally.
+     */
+    nvme_ns_drain(ns);
+
+    /*
+     * Detach from all controllers and notify the guest via AEN.
+     * The guest kernel will rescan namespaces and remove the block device.
+     */
+    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
+        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
+        if (!ctrl || !nvme_ns(ctrl, nsid)) {
+            continue;
+        }
+
+        nvme_detach_ns(ctrl, ns);
+        nvme_update_dsm_limits(ctrl, NULL);
+
+        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
+            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
+                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
+                               NVME_LOG_CHANGED_NSLIST);
+        }
+    }
+
+    /*
+     * Unrealize: removes from subsystem (in nvme_ns_unrealize), flushes,
+     * cleans up structures, and removes from QOM.
+     */
+    qdev_unrealize(dev);
+}
+
+static void nvme_bus_class_init(ObjectClass *klass, const void *data)
+{
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
+    hc->plug = nvme_ns_hot_plug;
+    hc->unplug = nvme_ns_hot_unplug;
+}
+
 static const TypeInfo nvme_bus_info = {
     .name = TYPE_NVME_BUS,
     .parent = TYPE_BUS,
     .instance_size = sizeof(NvmeBus),
+    .class_init = nvme_bus_class_init,
+    .interfaces = (const InterfaceInfo[]) {
+        { TYPE_HOTPLUG_HANDLER },
+        { }
+    },
 };
 
 static void nvme_register_types(void)
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index b0106eaa5c..f4f755c6fc 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -719,10 +719,17 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_unrealize(DeviceState *dev)
 {
     NvmeNamespace *ns = NVME_NS(dev);
+    NvmeSubsystem *subsys = ns->subsys;
+    uint32_t nsid = ns->params.nsid;
 
     nvme_ns_drain(ns);
     nvme_ns_shutdown(ns);
     nvme_ns_cleanup(ns);
+
+    /* Symmetric with nvme_ns_realize() which sets subsys->namespaces[nsid]. */
+    if (subsys && nsid && subsys->namespaces[nsid] == ns) {
+        subsys->namespaces[nsid] = NULL;
+    }
 }
 
 void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn,
@@ -937,6 +944,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
     dc->bus_type = TYPE_NVME_BUS;
     dc->realize = nvme_ns_realize;
     dc->unrealize = nvme_ns_unrealize;
+    dc->hotpluggable = true;
     device_class_set_props(dc, nvme_ns_props);
     dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 777e1c620f..fa35055d3c 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -9,6 +9,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qapi/error.h"
+#include "hw/core/qdev.h"
 
 #include "nvme.h"
 
@@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
     NvmeSubsystem *subsys = NVME_SUBSYS(dev);
 
     qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
+    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
 
     nvme_subsys_setup(subsys, errp);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-15 12:45           ` Stefan Hajnoczi
@ 2026-04-15 17:39             ` Matthieu Rolla
  0 siblings, 0 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-15 17:39 UTC (permalink / raw)
  To: mr-083
  Cc: Klaus Jensen, qemu-devel, qemu-block, mr-083, John Meneghini,
	kbusch, Stefan Hajnoczi, "Daniel P. Berrangé"

[-- Attachment #1: Type: text/plain, Size: 1825 bytes --]

Hello,

Thanks everyone for the reviews.
I just sent v4 of the namespace hotplug patch (Series 1) with the I/O drain fix and nvme_ns_unrealize symmetry as discussed.
As suggested by Stefan, the backend reassociation is sent as a separate series (Series 2). 

Per Daniel's feedback, it is implemented as a QMP command (blockdev-attach) that pairs with the existing blockdev-add, with an HMP wrapper. This allows reconnecting a block node to a non-removable device's backend after drive_del, without the removable media restriction of blockdev-insert-medium.

Both patches tested with Linux 6.1 guest under DirectPV/MinIO AIStor storage stack. Scenarios covered:
Namespace attach/detach via device_del + device_add (Series 1)
Backend disconnect/reconnect via drive_del + blockdev-add + blockdev-attach + PCIe AER SDN (Series 2)
Same device name preserved across detach/attach cycles
Detach under heavy I/O (warp benchmark, 16 concurrent uploads)
Short disconnect (<3s): XFS mounts intact, DirectPV Ready, MinIO 12/12
Long disconnect (60s+): XFS journal shutdown, recovery via kubectl directpv repair, full 12/12 recovery (minio trigger healing on disk)
Multiple disks across multiple nodes (6 disks, 3 nodes)

	.	
Matthieu

matthieu@min.io <> 

> On Apr 15, 2026, at 2:45 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Tue, Apr 14, 2026 at 08:14:16PM +0200, Matthieu Rolla wrote:
>> To clarify,  we're not swapping to a different backend. It's the same disk file being disconnected and reconnected, simulating a physical drive being pulled and reinserted.
> 
> Is it necessary to drive_del to simulate PCIe Surprise Down? Can you
> perform just the PCIe actions without removing the drive from the NVMe
> device? That way the drive_insert command is not necessary.
> 
> Stefan

[-- Attachment #2: Type: text/html, Size: 5728 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-14 14:04     ` John Meneghini
@ 2026-04-16 10:11       ` Nilay Shroff
  2026-04-16 12:33         ` Matthieu Rolla
  0 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2026-04-16 10:11 UTC (permalink / raw)
  To: John Meneghini, Stefan Hajnoczi, Klaus Jensen
  Cc: mr-083, qemu-devel, qemu-block, kbusch, mr-083

Hi John,

On 4/14/26 7:34 PM, John Meneghini wrote:
> Adding Nilay who has done a lot of work on nvme hot plug.
> 
> Nilay please take look at these patches and let us know if they can work on powerpc
> 
> I'll set up a test bed and try this out with x86_64.
> 

Thanks for looping me in.

I tested this patch series on pseries QEMU, and overall it works as expected.
For the first patch (NVMe namespace hotplug), the functionality behaves correctly
and achieves its intended goal. That said, from an NVMe specification perspective,
the operation appears closer to a namespace attach/detach rather than a traditional
“hotplug.” I understand that in the QEMU device model, this is framed as a hotplug
event, which is likely why the terminology is used here, but it may still be somewhat
confusing when viewed through the NVMe spec lens.

For the second patch (drive_insert), the implementation also works as intended on
pseries. However, I have a concern regarding how the backend is handled. The flow
effectively removes the backing storage using drive_del and later reattaches it
using drive_insert. While the expectation is to reconnect the same backing store,
there is currently no enforcement of this. As a result, it is possible—perhaps
unintentionally—to reattach a different backing file. If this happens, it may lead
to inconsistencies with the in-memory state maintained by the kernel (e.g., page
cache or filesystem metadata), especially if the original device was already in use
or mounted. This may potentially result in data corruption or undefined behavior
from the guest’s perspective. It might be worth considering whether some form of
validation or restriction should be added to ensure that the same backing store
is reattached, or at least to make this behavior more explicit.

Overall, both patches are functional on pseries, but the above points may be worth
addressing.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2] NVMe namespace hotplug and drive reconnection support
  2026-04-16 10:11       ` Nilay Shroff
@ 2026-04-16 12:33         ` Matthieu Rolla
  0 siblings, 0 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-16 12:33 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: John Meneghini, Stefan Hajnoczi, Klaus Jensen, qemu-devel,
	qemu-block, kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 3228 bytes --]

Thanks Nilay for testing on pseries!

On the terminology,  agreed, v4 of the namespace patch uses "out-of-band namespace attach/detach" wording as Klaus suggested.

On the backend concern,  the drive_insert patch has been replaced by a new series implementing a QMP blockdev-attach command (per Daniel's feedback). 
The ability to attach a different backing file is intentional, it allows simulating disk replacement where a failed drive is swapped for a new one. The guest sees the same device name but with fresh storage. 

This mirrors what happens on real hardware when you replace a failed disk in the same slot. The risk you describe (stale page cache / filesystem metadata) is expected and handled at the guest level,  the filesystem detects the inconsistency and the storage stack (e.g. MinIO) heals the data via erasure coding

Link to v4 patch (serie 1): https://lists.nongnu.org/archive/html/qemu-devel/2026-04/msg02612.html
Link to new patch (serie 2): https://lists.nongnu.org/archive/html/qemu-devel/2026-04/msg02613.html

Thanks again for your time.

	.	
Matthieu
www.min.io <>
matthieu@min.io <> 

> On Apr 16, 2026, at 12:11 PM, Nilay Shroff <nilay@linux.ibm.com> wrote:
> 
> Hi John,
> 
> On 4/14/26 7:34 PM, John Meneghini wrote:
>> Adding Nilay who has done a lot of work on nvme hot plug.
>> Nilay please take look at these patches and let us know if they can work on powerpc
>> I'll set up a test bed and try this out with x86_64.
> 
> Thanks for looping me in.
> 
> I tested this patch series on pseries QEMU, and overall it works as expected.
> For the first patch (NVMe namespace hotplug), the functionality behaves correctly
> and achieves its intended goal. That said, from an NVMe specification perspective,
> the operation appears closer to a namespace attach/detach rather than a traditional
> “hotplug.” I understand that in the QEMU device model, this is framed as a hotplug
> event, which is likely why the terminology is used here, but it may still be somewhat
> confusing when viewed through the NVMe spec lens.
> 
> For the second patch (drive_insert), the implementation also works as intended on
> pseries. However, I have a concern regarding how the backend is handled. The flow
> effectively removes the backing storage using drive_del and later reattaches it
> using drive_insert. While the expectation is to reconnect the same backing store,
> there is currently no enforcement of this. As a result, it is possible—perhaps
> unintentionally—to reattach a different backing file. If this happens, it may lead
> to inconsistencies with the in-memory state maintained by the kernel (e.g., page
> cache or filesystem metadata), especially if the original device was already in use
> or mounted. This may potentially result in data corruption or undefined behavior
> from the guest’s perspective. It might be worth considering whether some form of
> validation or restriction should be added to ensure that the same backing store
> is reattached, or at least to make this behavior more explicit.
> 
> Overall, both patches are functional on pseries, but the above points may be worth
> addressing.
> 
> Thanks,
> --Nilay


[-- Attachment #2: Type: text/html, Size: 4877 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4] hw/nvme: add namespace hotplug support
  2026-04-15 17:38 ` [PATCH v4] hw/nvme: add namespace hotplug support mr-083
@ 2026-04-16 19:42   ` Stefan Hajnoczi
  2026-04-17  9:29   ` Klaus Jensen
  1 sibling, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-16 19:42 UTC (permalink / raw)
  To: mr-083; +Cc: qemu-devel, qemu-block, its, kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 2787 bytes --]

On Wed, Apr 15, 2026 at 07:38:52PM +0200, mr-083 wrote:
> Add hotplug support for nvme-ns devices on the NvmeBus. This enables
> NVMe namespace-level hot-add and hot-remove via device_add and
> device_del with proper Asynchronous Event Notification (AEN), so the
> guest kernel can react to namespace topology changes.
> 
> Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
> hotplug handler with proper plug and unplug callbacks:
> 
> - plug: attach namespace to all started controllers and send an
>   Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
>   the guest kernel rescans namespaces and adds the block device
> - unplug: drain in-flight I/O, detach from all controllers, send
>   AEN, then unrealize the device. The guest kernel rescans and
>   removes the block device.
> 
> The plug handler skips controllers that haven't started yet
> (qs_created == false) to avoid interfering with boot-time namespace
> attachment in nvme_start_ctrl().
> 
> The unplug handler drains in-flight I/O via nvme_ns_drain() before
> detaching the namespace from controllers, so pending requests can
> complete normally without touching freed state.
> 
> For symmetry with nvme_ns_realize() which sets subsys->namespaces[nsid],
> nvme_ns_unrealize() now clears that slot too making the namespace
> lifecycle complete.
> 
> Both the controller bus and subsystem bus are configured as hotplug
> handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
> may reparent to the subsystem bus during realize.
> 
> Example hot-swap sequence using the NVMe subsystem model:
> 
>   # Boot with: -device nvme-subsys,id=subsys0
>   #            -device nvme,id=ctrl0,subsys=subsys0
>   #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
> 
>   device_del ns0             # guest receives AEN, removes /dev/nvme0n1
>   drive_del drv0
>   drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
>   device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>                               # guest receives AEN, adds /dev/nvme0n1
> 
> Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
> namespace list automatically).
> 
> Signed-off-by: Matthieu <matthieu@min.io>
> ---
>  hw/nvme/ctrl.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/nvme/ns.c     |  8 +++++
>  hw/nvme/subsys.c |  2 ++
>  3 files changed, 98 insertions(+)

This is useful functionality because PCI hotplug is cumbersome as it
requires preallocating PCIe root ports at guest creation time. Users may
not know how many devices they will eventually hotplug, and so it's
convenient to hotplug multiple namespaces onto an existing NVMe PCI
controller.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-15 12:32     ` Matthieu Rolla
@ 2026-04-16 19:52       ` Stefan Hajnoczi
  2026-04-16 22:00         ` Matthieu Rolla
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2026-04-16 19:52 UTC (permalink / raw)
  To: Matthieu Rolla
  Cc: "Daniel P. Berrangé", qemu-devel, qemu-block, its,
	kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

On Wed, Apr 15, 2026 at 02:32:39PM +0200, Matthieu Rolla wrote:
> Thanks Daniel, 
> 
> It makes sense Thanks. 
> 
> Looking at the existing code, blockdev-insert-medium already does the backend/frontend association via blk_insert_bs(), but is restricted to removable devices. 
> A new QMP command like blockdev-attach could reuse the same logic without the removable restriction, paired with blockdev-add for creating the block node.
> 
> Would that be a better approach ? 

Hi Matthieu,
I was wondering whether the blockdev needs to be changed at all. Since
the disk image remains the same, is it sufficient to inject the PCIe SDN
and then recover the NVMe PCI controller?

I don't understand the test scenario well enough, but it seems like
you're testing at the PCIe level here rather than nothin NVMe- or
blockdev-specific. Therefore blockdev commands may not be necessary.

If the testing can be done completely at the PCIe level then that would
also allow other device types to be tested in the same way, which would
be nice.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 484 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] block/monitor: add drive_insert HMP command
  2026-04-16 19:52       ` Stefan Hajnoczi
@ 2026-04-16 22:00         ` Matthieu Rolla
  0 siblings, 0 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-16 22:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: "Daniel P. Berrangé", qemu-devel, qemu-block, its,
	kbusch, mr-083

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

Thanks Stefan.

I tested SDN alone on 6 disks across 3 nodes,  controllers freeze, auto-recover via AER in ~200ms, and resume I/O. This works well for simulating transient PCIe errors and as you noted, it should be generic across device types.

However, SDN alone can't simulate a prolonged disk outage,  the controller recovers immediately. The test scenarios need to cover:

- Transient disk failure: brief I/O disruption, automatic recovery (SDN alone works here)
- Prolonged disk failure: disk offline for minutes/hours, filesystem corruption, manual recovery required
- Disk replacement: failed disk swapped for a new one, storage stack heals data

For the last two scenarios, we need the disk to actually stay disconnected. This requires drive_del to disconnect the backing store, and blockdev-attach to reconnect it later, preserving the guest device name since drive_del has no existing counterpart for non-removable devices.

So the two series address different scenarios:
- Series 1 (namespace hotplug): namespace topology changes via device_del/device_add with AEN
- Series 2 (blockdev-attach): prolonged disk disconnect/reconnect while preserving the guest device name

Please let me know if you have any questions


	.	
Matthieu
www.min.io <>
matthieu@min.io <> 

> On Apr 16, 2026, at 9:52 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Apr 15, 2026 at 02:32:39PM +0200, Matthieu Rolla wrote:
>> Thanks Daniel, 
>> 
>> It makes sense Thanks. 
>> 
>> Looking at the existing code, blockdev-insert-medium already does the backend/frontend association via blk_insert_bs(), but is restricted to removable devices. 
>> A new QMP command like blockdev-attach could reuse the same logic without the removable restriction, paired with blockdev-add for creating the block node.
>> 
>> Would that be a better approach ?
> 
> Hi Matthieu,
> I was wondering whether the blockdev needs to be changed at all. Since
> the disk image remains the same, is it sufficient to inject the PCIe SDN
> and then recover the NVMe PCI controller?
> 
> I don't understand the test scenario well enough, but it seems like
> you're testing at the PCIe level here rather than nothin NVMe- or
> blockdev-specific. Therefore blockdev commands may not be necessary.
> 
> If the testing can be done completely at the PCIe level then that would
> also allow other device types to be tested in the same way, which would
> be nice.
> 
> Stefan


[-- Attachment #2: Type: text/html, Size: 4169 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4] hw/nvme: add namespace hotplug support
  2026-04-15 17:38 ` [PATCH v4] hw/nvme: add namespace hotplug support mr-083
  2026-04-16 19:42   ` Stefan Hajnoczi
@ 2026-04-17  9:29   ` Klaus Jensen
  2026-04-17  9:45     ` Matthieu Rolla
  1 sibling, 1 reply; 34+ messages in thread
From: Klaus Jensen @ 2026-04-17  9:29 UTC (permalink / raw)
  To: mr-083; +Cc: qemu-devel, qemu-block, kbusch, stefanha, mr-083

[-- Attachment #1: Type: text/plain, Size: 8393 bytes --]

On Apr 15 19:38, mr-083 wrote:
> Add hotplug support for nvme-ns devices on the NvmeBus. This enables
> NVMe namespace-level hot-add and hot-remove via device_add and
> device_del with proper Asynchronous Event Notification (AEN), so the
> guest kernel can react to namespace topology changes.
> 
> Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
> hotplug handler with proper plug and unplug callbacks:
> 
> - plug: attach namespace to all started controllers and send an
>   Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
>   the guest kernel rescans namespaces and adds the block device
> - unplug: drain in-flight I/O, detach from all controllers, send
>   AEN, then unrealize the device. The guest kernel rescans and
>   removes the block device.
> 
> The plug handler skips controllers that haven't started yet
> (qs_created == false) to avoid interfering with boot-time namespace
> attachment in nvme_start_ctrl().
> 
> The unplug handler drains in-flight I/O via nvme_ns_drain() before
> detaching the namespace from controllers, so pending requests can
> complete normally without touching freed state.
> 
> For symmetry with nvme_ns_realize() which sets subsys->namespaces[nsid],
> nvme_ns_unrealize() now clears that slot too making the namespace
> lifecycle complete.
> 
> Both the controller bus and subsystem bus are configured as hotplug
> handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
> may reparent to the subsystem bus during realize.
> 
> Example hot-swap sequence using the NVMe subsystem model:
> 
>   # Boot with: -device nvme-subsys,id=subsys0
>   #            -device nvme,id=ctrl0,subsys=subsys0
>   #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
> 
>   device_del ns0             # guest receives AEN, removes /dev/nvme0n1
>   drive_del drv0
>   drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
>   device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>                               # guest receives AEN, adds /dev/nvme0n1
> 
> Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
> namespace list automatically).
> 
> Signed-off-by: Matthieu <matthieu@min.io>
> ---
>  hw/nvme/ctrl.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/nvme/ns.c     |  8 +++++
>  hw/nvme/subsys.c |  2 ++
>  3 files changed, 98 insertions(+)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index be6c7028cb..2024b0ff75 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -206,6 +206,7 @@
>  #include "system/hostmem.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pcie_sriov.h"
> +#include "hw/core/qdev.h"
>  #include "system/spdm-socket.h"
>  #include "migration/vmstate.h"
>  
> @@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      }
>  
>      qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
> +    qbus_set_bus_hotplug_handler(BUS(&n->bus));
>  
>      if (nvme_init_subsys(n, errp)) {
>          return;
> @@ -9553,10 +9555,96 @@ static const TypeInfo nvme_info = {
>      },
>  };
>  
> +static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> +                              Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
> +    int i;
> +
> +    /*
> +     * Attach to all started controllers and notify via AEN.
> +     * Skip controllers that haven't started yet (boot-time realize) —
> +     * nvme_start_ctrl() will attach namespaces during controller init.
> +     */
> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
> +        if (!ctrl || !ctrl->qs_created) {
> +            continue;
> +        }
> +
> +        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
> +            nvme_attach_ns(ctrl, ns);
> +            nvme_update_dsm_limits(ctrl, ns);
> +
> +            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
> +                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
> +                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
> +                                   NVME_LOG_CHANGED_NSLIST);
> +            }
> +        }
> +    }
> +}
> +
> +static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
> +                               Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
> +    int i;
> +
> +    /*
> +     * Drain in-flight I/O before tearing down the namespace.
> +     * This must happen while the namespace is still attached to the
> +     * controllers so any pending requests can complete normally.
> +     */
> +    nvme_ns_drain(ns);
> +
> +    /*
> +     * Detach from all controllers and notify the guest via AEN.
> +     * The guest kernel will rescan namespaces and remove the block device.
> +     */
> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
> +        if (!ctrl || !nvme_ns(ctrl, nsid)) {
> +            continue;
> +        }
> +
> +        nvme_detach_ns(ctrl, ns);
> +        nvme_update_dsm_limits(ctrl, NULL);
> +
> +        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
> +            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
> +                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
> +                               NVME_LOG_CHANGED_NSLIST);
> +        }
> +    }
> +
> +    /*
> +     * Unrealize: removes from subsystem (in nvme_ns_unrealize), flushes,
> +     * cleans up structures, and removes from QOM.
> +     */
> +    qdev_unrealize(dev);
> +}
> +
> +static void nvme_bus_class_init(ObjectClass *klass, const void *data)
> +{
> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
> +    hc->plug = nvme_ns_hot_plug;
> +    hc->unplug = nvme_ns_hot_unplug;
> +}
> +
>  static const TypeInfo nvme_bus_info = {
>      .name = TYPE_NVME_BUS,
>      .parent = TYPE_BUS,
>      .instance_size = sizeof(NvmeBus),
> +    .class_init = nvme_bus_class_init,
> +    .interfaces = (const InterfaceInfo[]) {
> +        { TYPE_HOTPLUG_HANDLER },
> +        { }
> +    },
>  };
>  
>  static void nvme_register_types(void)
> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index b0106eaa5c..f4f755c6fc 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -719,10 +719,17 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
>  static void nvme_ns_unrealize(DeviceState *dev)
>  {
>      NvmeNamespace *ns = NVME_NS(dev);
> +    NvmeSubsystem *subsys = ns->subsys;
> +    uint32_t nsid = ns->params.nsid;
>  
>      nvme_ns_drain(ns);
>      nvme_ns_shutdown(ns);
>      nvme_ns_cleanup(ns);
> +
> +    /* Symmetric with nvme_ns_realize() which sets subsys->namespaces[nsid]. */
> +    if (subsys && nsid && subsys->namespaces[nsid] == ns) {
> +        subsys->namespaces[nsid] = NULL;
> +    }
>  }
>  
>  void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn,
> @@ -937,6 +944,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
>      dc->bus_type = TYPE_NVME_BUS;
>      dc->realize = nvme_ns_realize;
>      dc->unrealize = nvme_ns_unrealize;
> +    dc->hotpluggable = true;
>      device_class_set_props(dc, nvme_ns_props);
>      dc->desc = "Virtual NVMe namespace";
>  }
> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> index 777e1c620f..fa35055d3c 100644
> --- a/hw/nvme/subsys.c
> +++ b/hw/nvme/subsys.c
> @@ -9,6 +9,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qapi/error.h"
> +#include "hw/core/qdev.h"
>  
>  #include "nvme.h"
>  
> @@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
>      NvmeSubsystem *subsys = NVME_SUBSYS(dev);
>  
>      qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
> +    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
>  
>      nvme_subsys_setup(subsys, errp);
>  }
> -- 
> 2.53.0
> 
> 

As an out-of-band mechanism to hot-plug namespaces, this looks good.

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

I'll pick it up for 11.1. Thanks!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4] hw/nvme: add namespace hotplug support
  2026-04-17  9:29   ` Klaus Jensen
@ 2026-04-17  9:45     ` Matthieu Rolla
  0 siblings, 0 replies; 34+ messages in thread
From: Matthieu Rolla @ 2026-04-17  9:45 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, qemu-block, kbusch, stefanha, mr-083

[-- Attachment #1: Type: text/plain, Size: 8864 bytes --]

Hello Klaus,

Thank you for your feedback. I also created https://lists.nongnu.org/archive/html/qemu-devel/2026-04/msg02613.html if you are interested. 

BR,

	.	
Matthieu
www.min.io <>
matthieu@min.io <> 

> On Apr 17, 2026, at 11:29 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> On Apr 15 19:38, mr-083 wrote:
>> Add hotplug support for nvme-ns devices on the NvmeBus. This enables
>> NVMe namespace-level hot-add and hot-remove via device_add and
>> device_del with proper Asynchronous Event Notification (AEN), so the
>> guest kernel can react to namespace topology changes.
>> 
>> Mark nvme-ns devices as hotpluggable and register the NvmeBus as a
>> hotplug handler with proper plug and unplug callbacks:
>> 
>> - plug: attach namespace to all started controllers and send an
>>  Asynchronous Event Notification (AEN) with NS_ATTR_CHANGED so
>>  the guest kernel rescans namespaces and adds the block device
>> - unplug: drain in-flight I/O, detach from all controllers, send
>>  AEN, then unrealize the device. The guest kernel rescans and
>>  removes the block device.
>> 
>> The plug handler skips controllers that haven't started yet
>> (qs_created == false) to avoid interfering with boot-time namespace
>> attachment in nvme_start_ctrl().
>> 
>> The unplug handler drains in-flight I/O via nvme_ns_drain() before
>> detaching the namespace from controllers, so pending requests can
>> complete normally without touching freed state.
>> 
>> For symmetry with nvme_ns_realize() which sets subsys->namespaces[nsid],
>> nvme_ns_unrealize() now clears that slot too making the namespace
>> lifecycle complete.
>> 
>> Both the controller bus and subsystem bus are configured as hotplug
>> handlers via qbus_set_bus_hotplug_handler() since nvme-ns devices
>> may reparent to the subsystem bus during realize.
>> 
>> Example hot-swap sequence using the NVMe subsystem model:
>> 
>>  # Boot with: -device nvme-subsys,id=subsys0
>>  #            -device nvme,id=ctrl0,subsys=subsys0
>>  #            -device nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>> 
>>  device_del ns0             # guest receives AEN, removes /dev/nvme0n1
>>  drive_del drv0
>>  drive_add 0 file=disk.qcow2,format=qcow2,id=drv0,if=none
>>  device_add nvme-ns,id=ns0,drive=drv0,bus=ctrl0,nsid=1
>>                              # guest receives AEN, adds /dev/nvme0n1
>> 
>> Tested with Linux 6.1 guest (NVMe driver processes AEN and rescans
>> namespace list automatically).
>> 
>> Signed-off-by: Matthieu <matthieu@min.io>
>> ---
>> hw/nvme/ctrl.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++
>> hw/nvme/ns.c     |  8 +++++
>> hw/nvme/subsys.c |  2 ++
>> 3 files changed, 98 insertions(+)
>> 
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index be6c7028cb..2024b0ff75 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -206,6 +206,7 @@
>> #include "system/hostmem.h"
>> #include "hw/pci/msix.h"
>> #include "hw/pci/pcie_sriov.h"
>> +#include "hw/core/qdev.h"
>> #include "system/spdm-socket.h"
>> #include "migration/vmstate.h"
>> 
>> @@ -9293,6 +9294,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>>     }
>> 
>>     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
>> +    qbus_set_bus_hotplug_handler(BUS(&n->bus));
>> 
>>     if (nvme_init_subsys(n, errp)) {
>>         return;
>> @@ -9553,10 +9555,96 @@ static const TypeInfo nvme_info = {
>>     },
>> };
>> 
>> +static void nvme_ns_hot_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>> +                              Error **errp)
>> +{
>> +    NvmeNamespace *ns = NVME_NS(dev);
>> +    NvmeSubsystem *subsys = ns->subsys;
>> +    uint32_t nsid = ns->params.nsid;
>> +    int i;
>> +
>> +    /*
>> +     * Attach to all started controllers and notify via AEN.
>> +     * Skip controllers that haven't started yet (boot-time realize) —
>> +     * nvme_start_ctrl() will attach namespaces during controller init.
>> +     */
>> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
>> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
>> +        if (!ctrl || !ctrl->qs_created) {
>> +            continue;
>> +        }
>> +
>> +        if (nvme_csi_supported(ctrl, ns->csi) && !ns->params.detached) {
>> +            nvme_attach_ns(ctrl, ns);
>> +            nvme_update_dsm_limits(ctrl, ns);
>> +
>> +            if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
>> +                nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
>> +                                   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
>> +                                   NVME_LOG_CHANGED_NSLIST);
>> +            }
>> +        }
>> +    }
>> +}
>> +
>> +static void nvme_ns_hot_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
>> +                               Error **errp)
>> +{
>> +    NvmeNamespace *ns = NVME_NS(dev);
>> +    NvmeSubsystem *subsys = ns->subsys;
>> +    uint32_t nsid = ns->params.nsid;
>> +    int i;
>> +
>> +    /*
>> +     * Drain in-flight I/O before tearing down the namespace.
>> +     * This must happen while the namespace is still attached to the
>> +     * controllers so any pending requests can complete normally.
>> +     */
>> +    nvme_ns_drain(ns);
>> +
>> +    /*
>> +     * Detach from all controllers and notify the guest via AEN.
>> +     * The guest kernel will rescan namespaces and remove the block device.
>> +     */
>> +    for (i = 0; i < NVME_MAX_CONTROLLERS; i++) {
>> +        NvmeCtrl *ctrl = nvme_subsys_ctrl(subsys, i);
>> +        if (!ctrl || !nvme_ns(ctrl, nsid)) {
>> +            continue;
>> +        }
>> +
>> +        nvme_detach_ns(ctrl, ns);
>> +        nvme_update_dsm_limits(ctrl, NULL);
>> +
>> +        if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
>> +            nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
>> +                               NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
>> +                               NVME_LOG_CHANGED_NSLIST);
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Unrealize: removes from subsystem (in nvme_ns_unrealize), flushes,
>> +     * cleans up structures, and removes from QOM.
>> +     */
>> +    qdev_unrealize(dev);
>> +}
>> +
>> +static void nvme_bus_class_init(ObjectClass *klass, const void *data)
>> +{
>> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
>> +    hc->plug = nvme_ns_hot_plug;
>> +    hc->unplug = nvme_ns_hot_unplug;
>> +}
>> +
>> static const TypeInfo nvme_bus_info = {
>>     .name = TYPE_NVME_BUS,
>>     .parent = TYPE_BUS,
>>     .instance_size = sizeof(NvmeBus),
>> +    .class_init = nvme_bus_class_init,
>> +    .interfaces = (const InterfaceInfo[]) {
>> +        { TYPE_HOTPLUG_HANDLER },
>> +        { }
>> +    },
>> };
>> 
>> static void nvme_register_types(void)
>> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
>> index b0106eaa5c..f4f755c6fc 100644
>> --- a/hw/nvme/ns.c
>> +++ b/hw/nvme/ns.c
>> @@ -719,10 +719,17 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
>> static void nvme_ns_unrealize(DeviceState *dev)
>> {
>>     NvmeNamespace *ns = NVME_NS(dev);
>> +    NvmeSubsystem *subsys = ns->subsys;
>> +    uint32_t nsid = ns->params.nsid;
>> 
>>     nvme_ns_drain(ns);
>>     nvme_ns_shutdown(ns);
>>     nvme_ns_cleanup(ns);
>> +
>> +    /* Symmetric with nvme_ns_realize() which sets subsys->namespaces[nsid]. */
>> +    if (subsys && nsid && subsys->namespaces[nsid] == ns) {
>> +        subsys->namespaces[nsid] = NULL;
>> +    }
>> }
>> 
>> void nvme_ns_atomic_configure_boundary(bool dn, uint16_t nabsn,
>> @@ -937,6 +944,7 @@ static void nvme_ns_class_init(ObjectClass *oc, const void *data)
>>     dc->bus_type = TYPE_NVME_BUS;
>>     dc->realize = nvme_ns_realize;
>>     dc->unrealize = nvme_ns_unrealize;
>> +    dc->hotpluggable = true;
>>     device_class_set_props(dc, nvme_ns_props);
>>     dc->desc = "Virtual NVMe namespace";
>> }
>> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
>> index 777e1c620f..fa35055d3c 100644
>> --- a/hw/nvme/subsys.c
>> +++ b/hw/nvme/subsys.c
>> @@ -9,6 +9,7 @@
>> #include "qemu/osdep.h"
>> #include "qemu/units.h"
>> #include "qapi/error.h"
>> +#include "hw/core/qdev.h"
>> 
>> #include "nvme.h"
>> 
>> @@ -205,6 +206,7 @@ static void nvme_subsys_realize(DeviceState *dev, Error **errp)
>>     NvmeSubsystem *subsys = NVME_SUBSYS(dev);
>> 
>>     qbus_init(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev, dev->id);
>> +    qbus_set_bus_hotplug_handler(BUS(&subsys->bus));
>> 
>>     nvme_subsys_setup(subsys, errp);
>> }
>> -- 
>> 2.53.0
>> 
>> 
> 
> As an out-of-band mechanism to hot-plug namespaces, this looks good.
> 
> Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> 
> I'll pick it up for 11.1. Thanks!


[-- Attachment #2: Type: text/html, Size: 13888 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2026-04-17  9:46 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09  6:01 [PATCH 0/2] NVMe namespace hotplug and drive reconnection support mr-083
2026-04-09  6:01 ` [PATCH 1/2] hw/nvme: add namespace hotplug support mr-083
2026-04-09  6:01 ` [PATCH 2/2] block/monitor: add drive_insert HMP command mr-083
2026-04-14 17:57   ` Stefan Hajnoczi
2026-04-14 18:02     ` Matthieu Rolla
2026-04-14 19:05       ` Warner Losh
2026-04-14 21:01         ` Matthieu Rolla
2026-04-15 10:48   ` Daniel P. Berrangé
2026-04-15 12:32     ` Matthieu Rolla
2026-04-16 19:52       ` Stefan Hajnoczi
2026-04-16 22:00         ` Matthieu Rolla
2026-04-15 12:33     ` Stefan Hajnoczi
2026-04-09 21:34 ` [PATCH v2] hw/nvme: add namespace hotplug support mr-083
2026-04-10 12:41   ` Stefan Hajnoczi
2026-04-10 14:30 ` [PATCH v3] " mr-083
2026-04-10 14:33   ` Matthieu Rolla
2026-04-10 20:14     ` Stefan Hajnoczi
2026-04-13 15:24       ` Matthieu Rolla
2026-04-13 17:17 ` [PATCH 0/2] NVMe namespace hotplug and drive reconnection support Klaus Jensen
2026-04-14 12:42   ` Stefan Hajnoczi
2026-04-14 13:36     ` Matthieu Rolla
2026-04-14 18:09       ` Keith Busch
2026-04-14 18:10       ` Stefan Hajnoczi
2026-04-14 18:14         ` Matthieu Rolla
2026-04-15 12:45           ` Stefan Hajnoczi
2026-04-15 17:39             ` Matthieu Rolla
2026-04-14 14:04     ` John Meneghini
2026-04-16 10:11       ` Nilay Shroff
2026-04-16 12:33         ` Matthieu Rolla
2026-04-14 14:42     ` Keith Busch
2026-04-15 17:38 ` [PATCH v4] hw/nvme: add namespace hotplug support mr-083
2026-04-16 19:42   ` Stefan Hajnoczi
2026-04-17  9:29   ` Klaus Jensen
2026-04-17  9:45     ` Matthieu Rolla

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.