[PATCH V3 0/3] Ensure ordered namespace registration during async scan

Linux-NVME Archive on lore.kernel.org

* [PATCH V3 0/3] Ensure ordered namespace registration during async scan
@ 2026-02-25 16:12 Maurizio Lombardi
  2026-02-25 16:12 ` [PATCH V3 1/3] lib: Introduce completion chain helper Maurizio Lombardi
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC
  To: kbusch
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, mlombard, jmeneghi, emilne, bgurney

The NVMe fully asynchronous namespace scanning introduced in
commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
significantly improved discovery times. However, it also introduced
non-deterministic ordering for namespace registration.

While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
across reboots, this unpredictable ordering has caused considerable user
confusion and has been perceived as a regression, leading to multiple bug
reports.

This series introduces a solution to enforce strict sequential
registration based on NSID order, entirely preserving the performance
benefits of the asynchronous scan approach.

Instead of adding an NVMe-specific hack, this series abstracts the
serialization mechanism currently open-coded in the SCSI subsystem
(drivers/scsi/scsi_scan.c) into a generic library helper called the
completion chain (compl_chain).

By enforcing a strict First-In, First-Out (FIFO) completion order for
asynchronous tasks, we can ensure that namespaces are allocated and
registered sequentially without blocking the underlying parallel discovery
processes.

Patch 3 refactors the existing SCSI asynchronous scanning implementation
to use the new compl_chain helper, stripping out the custom, open-coded task
list and reducing code duplication.

Without this series:

$ nvme list
Node                  Generic               Namespace
--------------------- --------------------- ----------
/dev/nvme0n1          /dev/ng0n1            0x2
/dev/nvme0n2          /dev/ng0n2            0x1
/dev/nvme0n3          /dev/ng0n3            0x5
/dev/nvme0n4          /dev/ng0n4            0x3
/dev/nvme0n5          /dev/ng0n5            0x4
[...]
/dev/nvme0n10         /dev/ng0n10           0xa
/dev/nvme0n11         /dev/ng0n11           0x8
/dev/nvme0n12         /dev/ng0n12           0x12
/dev/nvme0n13         /dev/ng0n13           0x17
/dev/nvme0n14         /dev/ng0n14           0xc
/dev/nvme0n15         /dev/ng0n15           0x11
/dev/nvme0n16         /dev/ng0n16           0x14
/dev/nvme0n17         /dev/ng0n17           0x13
/dev/nvme0n18         /dev/ng0n18           0xe
/dev/nvme0n19         /dev/ng0n19           0xf


With this patch:

$ nvme list
Node                  Generic               Namespace
--------------------- --------------------- ----------
/dev/nvme0n1          /dev/ng0n1            0x1
/dev/nvme0n2          /dev/ng0n2            0x2
/dev/nvme0n3          /dev/ng0n3            0x3
/dev/nvme0n4          /dev/ng0n4            0x4
/dev/nvme0n5          /dev/ng0n5            0x5
/dev/nvme0n6          /dev/ng0n6            0x6
[...]
/dev/nvme0n10         /dev/ng0n10           0xa
/dev/nvme0n11         /dev/ng0n11           0xb
/dev/nvme0n12         /dev/ng0n12           0xc
/dev/nvme0n13         /dev/ng0n13           0xd
/dev/nvme0n14         /dev/ng0n14           0xe
/dev/nvme0n15         /dev/ng0n15           0xf
/dev/nvme0n16         /dev/ng0n16           0x10
/dev/nvme0n17         /dev/ng0n17           0x11
/dev/nvme0n18         /dev/ng0n18           0x12
/dev/nvme0n19         /dev/ng0n19           0x13

V3: fixed some comments
    PATCH 3: declare scanning_hosts as static
    remove "extern" keyword from scsi_complete_async_scans()
    prototype declaration

V2: create the compl_chain helper that both SCSI and NVMe can share

Maurizio Lombardi (3):
  lib: Introduce completion chain helper
  nvme-core: register namespaces in order during async scan
  scsi: Convert async scanning to use the completion chain helper

 drivers/nvme/host/core.c    |  94 +++++++++++++++++-----------
 drivers/nvme/host/nvme.h    |   2 +
 drivers/scsi/scsi_priv.h    |   2 +-
 drivers/scsi/scsi_scan.c    |  68 +++------------------
 include/linux/compl_chain.h |  35 +++++++++++
 lib/Makefile                |   2 +-
 lib/compl_chain.c           | 118 ++++++++++++++++++++++++++++++++++++
 7 files changed, 225 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/compl_chain.h
 create mode 100644 lib/compl_chain.c

-- 
2.53.0




* [PATCH V3 1/3] lib: Introduce completion chain helper
  2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
@ 2026-02-25 16:12 ` Maurizio Lombardi
  2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC
  To: kbusch
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, mlombard, jmeneghi, emilne, bgurney

Introduce a new helper library, the completion chain, designed to serialize
asynchronous operations that must complete in a strict First-In, First-Out
(FIFO) order.

Certain workflows, particularly in storage drivers, require operations to
complete in the same order in which they were submitted. This helper
provides a generic mechanism to enforce that ordering.

compl_chain: The main structure representing the queue of operations
compl_chain_entry: An entry embedded in a per-operation structure

The typical usage pattern is:

    * An operation is enqueued by calling compl_chain_add().

    * The worker thread for the operation calls
      compl_chain_wait(), which blocks until the previously
      enqueued operation has finished.

    * After the work is done, the thread calls compl_chain_complete().
      This signals the next operation in the chain that it can now
      proceed and removes the current entry from the list.
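
For illustration only (not part of the patch): the usage pattern above can
be sketched in userspace, assuming POSIX threads and a condition variable
stand in for kernel completions. The names chain_add(), chain_wait(),
chain_complete(), struct task and run_demo() are hypothetical analogues of
the kernel helpers, not the API introduced here. Workers run their first
phase in parallel and then record their id strictly in enqueue order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 16

/* Userspace stand-ins for compl_chain / compl_chain_entry (hypothetical). */
struct chain {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	struct entry *head, *tail;
};

struct entry {
	struct chain *chain;
	struct entry *next;
	bool prev_finished;
};

static void chain_add(struct chain *c, struct entry *e)
{
	pthread_mutex_lock(&c->lock);
	e->chain = c;
	e->next = NULL;
	e->prev_finished = (c->head == NULL);	/* empty chain: run at once */
	if (c->tail)
		c->tail->next = e;
	else
		c->head = e;
	c->tail = e;
	pthread_mutex_unlock(&c->lock);
}

static void chain_wait(struct entry *e)
{
	pthread_mutex_lock(&e->chain->lock);
	while (!e->prev_finished)
		pthread_cond_wait(&e->chain->cond, &e->chain->lock);
	pthread_mutex_unlock(&e->chain->lock);
}

static void chain_complete(struct entry *e)
{
	struct chain *c = e->chain;

	chain_wait(e);		/* as in the patch: wait for our own turn first */
	pthread_mutex_lock(&c->lock);
	c->head = e->next;	/* e is the head once prev_finished is set */
	if (c->head)
		c->head->prev_finished = true;
	else
		c->tail = NULL;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

struct task {
	struct entry entry;
	int id;
	int *order, *count;
};

static void *worker(void *arg)
{
	struct task *t = arg;
	volatile unsigned long spin = (MAX_TASKS - t->id) * 100000UL;

	while (spin--)		/* parallel phase: uneven "discovery" time */
		;
	chain_wait(&t->entry);
	t->order[(*t->count)++] = t->id;	/* serialized phase */
	chain_complete(&t->entry);
	return NULL;
}

/* Run n workers; order[] receives task ids in registration order. */
int run_demo(int n, int *order)
{
	struct chain c = { .head = NULL, .tail = NULL };
	struct task tasks[MAX_TASKS];
	pthread_t tid[MAX_TASKS];
	int count = 0;

	assert(n > 0 && n <= MAX_TASKS);
	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.cond, NULL);
	for (int i = 0; i < n; i++) {
		tasks[i] = (struct task){ .id = i, .order = order, .count = &count };
		chain_add(&c, &tasks[i].entry);
		pthread_create(&tid[i], NULL, worker, &tasks[i]);
	}
	for (int i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
	return count;
}
```

In this sketch the later tasks finish their parallel phase first, yet
order[] still comes out in enqueue order, because each worker blocks in
chain_wait() until its predecessor has called chain_complete(). Error
handling for pthread_create() is omitted for brevity.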

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
 include/linux/compl_chain.h |  35 +++++++++++
 lib/Makefile                |   2 +-
 lib/compl_chain.c           | 118 ++++++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/compl_chain.h
 create mode 100644 lib/compl_chain.c

diff --git a/include/linux/compl_chain.h b/include/linux/compl_chain.h
new file mode 100644
index 000000000000..a2bf271144e0
--- /dev/null
+++ b/include/linux/compl_chain.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_COMPLETION_CHAIN_H
+#define _LINUX_COMPLETION_CHAIN_H
+
+#include <linux/list.h>
+#include <linux/completion.h>
+#include <linux/spinlock.h>
+
+struct compl_chain {
+	spinlock_t lock;
+	struct list_head list;
+};
+
+#define COMPL_CHAIN_INIT(name) \
+	{ .lock = __SPIN_LOCK_UNLOCKED((name).lock), \
+	  .list = LIST_HEAD_INIT((name).list) }
+
+#define DEFINE_COMPL_CHAIN(name) \
+	struct compl_chain name = COMPL_CHAIN_INIT(name)
+
+struct compl_chain_entry {
+	struct compl_chain *chain;
+	struct list_head list;
+	struct completion prev_finished;
+};
+
+void compl_chain_init(struct compl_chain *chain);
+void compl_chain_add(struct compl_chain *chain,
+			struct compl_chain_entry *entry);
+void compl_chain_wait(struct compl_chain_entry *entry);
+void compl_chain_complete(struct compl_chain_entry *entry);
+bool compl_chain_pending(struct compl_chain_entry *entry);
+void compl_chain_flush(struct compl_chain *chain);
+
+#endif /* _LINUX_COMPLETION_CHAIN_H */
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f..c3ccd82bb190 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -58,7 +58,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-	 generic-radix-tree.o bitmap-str.o
+	 generic-radix-tree.o bitmap-str.o compl_chain.o
 obj-y += string_helpers.o
 obj-y += hexdump.o
 obj-$(CONFIG_TEST_HEXDUMP) += test_hexdump.o
diff --git a/lib/compl_chain.c b/lib/compl_chain.c
new file mode 100644
index 000000000000..b1cb43753f52
--- /dev/null
+++ b/lib/compl_chain.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Maurizio Lombardi <mlombard@redhat.com>
+ */
+#include <linux/compl_chain.h>
+
+/**
+ * compl_chain_init - Initialize a completion chain
+ * @chain: The completion chain to be initialized.
+ *
+ * Initializes a compl_chain structure
+ */
+void compl_chain_init(struct compl_chain *chain)
+{
+	spin_lock_init(&chain->lock);
+	INIT_LIST_HEAD(&chain->list);
+}
+EXPORT_SYMBOL_GPL(compl_chain_init);
+
+/**
+ * compl_chain_add - Add a new entry to the tail of the chain
+ * @chain: The completion chain to add the entry to.
+ * @entry: The entry to be enqueued.
+ *
+ * Adds a new entry to the end of the queue.
+ * If the chain is empty when this entry is added, it is immediately marked
+ * as ready to run, as there is no preceding entry to wait for.
+ */
+void compl_chain_add(struct compl_chain *chain,
+			struct compl_chain_entry *entry)
+{
+	init_completion(&entry->prev_finished);
+	INIT_LIST_HEAD(&entry->list);
+
+	WRITE_ONCE(entry->chain, chain);
+
+	spin_lock(&chain->lock);
+	if (list_empty(&chain->list))
+		complete_all(&entry->prev_finished);
+	list_add_tail(&entry->list, &chain->list);
+	spin_unlock(&chain->lock);
+}
+EXPORT_SYMBOL_GPL(compl_chain_add);
+
+/**
+ * compl_chain_wait - Wait for the preceding operation to finish
+ * @entry: The entry for the current operation.
+ *
+ * Blocks the current execution thread until compl_chain_complete()
+ * is executed against the previous entry in the chain.
+ */
+void compl_chain_wait(struct compl_chain_entry *entry)
+{
+	WARN_ON(!entry->chain);
+
+	wait_for_completion(&entry->prev_finished);
+}
+EXPORT_SYMBOL_GPL(compl_chain_wait);
+
+/**
+ * compl_chain_complete - Mark an entry as completed and signal the next one
+ * @entry: The entry to mark as completed.
+ *
+ * Removes the current entry from the chain and signals the next waiting
+ * entry (if one exists) that it is now allowed to proceed.
+ */
+void compl_chain_complete(struct compl_chain_entry *entry)
+{
+	struct compl_chain *chain = entry->chain;
+
+	WARN_ON(!chain);
+
+	wait_for_completion(&entry->prev_finished);
+
+	spin_lock(&chain->lock);
+	list_del(&entry->list);
+	if (!list_empty(&chain->list)) {
+		struct compl_chain_entry *next =
+			list_first_entry(&chain->list,
+					 struct compl_chain_entry, list);
+		complete_all(&next->prev_finished);
+	}
+	spin_unlock(&chain->lock);
+
+	WRITE_ONCE(entry->chain, NULL);
+}
+EXPORT_SYMBOL_GPL(compl_chain_complete);
+
+/**
+ * compl_chain_pending - Check if an entry is pending
+ * @entry: The entry to check.
+ *
+ * Returns true if an entry has been added to a chain and hasn't yet
+ * been completed.
+ */
+bool compl_chain_pending(struct compl_chain_entry *entry)
+{
+	return READ_ONCE(entry->chain) != NULL;
+}
+EXPORT_SYMBOL_GPL(compl_chain_pending);
+
+/**
+ * compl_chain_flush - Wait for all entries currently in the chain to finish
+ * @chain: The completion chain to flush.
+ *
+ * Enqueues a dummy entry into the chain and immediately calls
+ * compl_chain_complete() against it. Because operations execute in strict
+ * FIFO order, this acts as a barrier, blocking the calling thread until
+ * all previously enqueued entries have finished.
+ */
+void compl_chain_flush(struct compl_chain *chain)
+{
+	struct compl_chain_entry dummy_entry;
+
+	compl_chain_add(chain, &dummy_entry);
+	compl_chain_complete(&dummy_entry);
+}
+EXPORT_SYMBOL_GPL(compl_chain_flush);
-- 
2.53.0




* [PATCH V3 2/3] nvme-core: register namespaces in order during async scan
  2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
  2026-02-25 16:12 ` [PATCH V3 1/3] lib: Introduce completion chain helper Maurizio Lombardi
@ 2026-02-25 16:12 ` Maurizio Lombardi
  2026-02-25 21:37   ` kernel test robot
  2026-02-25 16:12 ` [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper Maurizio Lombardi
  2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
  3 siblings, 1 reply; 20+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC
  To: kbusch
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, mlombard, jmeneghi, emilne, bgurney

The fully asynchronous namespace scanning, while fast, can result in
namespaces being allocated and registered out of order. This leads to
unpredictable device naming across reboots which can be confusing
for users.

To solve this, introduce a serialization mechanism for the asynchronous
namespace scan. This is achieved by using the generic compl_chain helper,
which ensures that the initialization of one namespace (nvme_alloc_ns)
completes before the next one begins.

This approach preserves the performance benefits of asynchronous
identification while guaranteeing that the final device registration
occurs in the correct order.

Performance testing shows that this change has no noticeable impact on
scan times compared to the fully asynchronous method.

High latency NVMe/TCP, ~150ms ping, 100 namespaces

Synchronous namespace scan (RHEL-10.1):               32375ms
Fully async namespace scan (7.0-rc1):                  2543ms
Async namespace scan with dependency chain (7.0-rc1):  2431ms

Low latency NVMe/TCP, ~0.2ms ping, 100 namespaces

Synchronous namespace scan (RHEL-10.1):               352ms
Fully async namespace scan (7.0-rc1):                 248ms
Async namespace scan with dependency chain (7.0-rc1): 191ms

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
 drivers/nvme/host/core.c | 94 +++++++++++++++++++++++++---------------
 drivers/nvme/host/nvme.h |  2 +
 2 files changed, 62 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f5ebcaa2f859..d186c0082cc8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4105,13 +4105,27 @@ static void nvme_ns_add_to_ctrl_list(struct nvme_ns *ns)
 	list_add_rcu(&ns->list, &ns->ctrl->namespaces);
 }
 
-static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
+/**
+ * struct async_scan_task - keeps track of controller & NSID to scan
+ * @entry:	link to the completion chain list
+ * @ctrl:	Controller on which namespaces are being scanned
+ * @nsid:	The NSID to scan
+ */
+struct async_scan_task {
+	struct compl_chain_entry chain_entry;
+	struct nvme_ctrl *ctrl;
+	u32 nsid;
+};
+
+static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info,
+				struct compl_chain_entry *cc_entry)
 {
 	struct queue_limits lim = { };
 	struct nvme_ns *ns;
 	struct gendisk *disk;
 	int node = ctrl->numa_node;
 	bool last_path = false;
+	int r;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
@@ -4134,7 +4148,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	ns->ctrl = ctrl;
 	kref_init(&ns->kref);
 
-	if (nvme_init_ns_head(ns, info))
+	/*
+	 * Wait for the previous async task to finish before
+	 * allocating the namespace.
+	 */
+	if (cc_entry)
+		compl_chain_wait(cc_entry);
+
+	r = nvme_init_ns_head(ns, info);
+
+	if (cc_entry)
+		compl_chain_complete(cc_entry);
+
+	if (r)
 		goto out_cleanup_disk;
 
 	/*
@@ -4309,7 +4335,8 @@ static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
 		nvme_ns_remove(ns);
 }
 
-static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned int nsid,
+				struct compl_chain_entry *cc_entry)
 {
 	struct nvme_ns_info info = { .nsid = nsid };
 	struct nvme_ns *ns;
@@ -4348,40 +4375,30 @@ static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	ns = nvme_find_get_ns(ctrl, nsid);
 	if (ns) {
+		/* Release the chain early so the next task can proceed */
+		if (cc_entry)
+			compl_chain_complete(cc_entry);
 		nvme_validate_ns(ns, &info);
 		nvme_put_ns(ns);
 	} else {
-		nvme_alloc_ns(ctrl, &info);
+		nvme_alloc_ns(ctrl, &info, cc_entry);
 	}
 }
 
-/**
- * struct async_scan_info - keeps track of controller & NSIDs to scan
- * @ctrl:	Controller on which namespaces are being scanned
- * @next_nsid:	Index of next NSID to scan in ns_list
- * @ns_list:	Pointer to list of NSIDs to scan
- *
- * Note: There is a single async_scan_info structure shared by all instances
- * of nvme_scan_ns_async() scanning a given controller, so the atomic
- * operations on next_nsid are critical to ensure each instance scans a unique
- * NSID.
- */
-struct async_scan_info {
-	struct nvme_ctrl *ctrl;
-	atomic_t next_nsid;
-	__le32 *ns_list;
-};
-
 static void nvme_scan_ns_async(void *data, async_cookie_t cookie)
 {
-	struct async_scan_info *scan_info = data;
-	int idx;
-	u32 nsid;
+	struct async_scan_task *task = data;
 
-	idx = (u32)atomic_fetch_inc(&scan_info->next_nsid);
-	nsid = le32_to_cpu(scan_info->ns_list[idx]);
+	nvme_scan_ns(task->ctrl, task->nsid, &task->chain_entry);
 
-	nvme_scan_ns(scan_info->ctrl, nsid);
+	/*
+	 * If the task failed early and returned without completing the
+	 * chain entry, ensure the chain progresses safely.
+	 */
+	if (compl_chain_pending(&task->chain_entry))
+		compl_chain_complete(&task->chain_entry);
+
+	kfree(task);
 }
 
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -4411,14 +4428,12 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)
 	u32 prev = 0;
 	int ret = 0, i;
 	ASYNC_DOMAIN(domain);
-	struct async_scan_info scan_info;
+	struct async_scan_task *task;
 
 	ns_list = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
 	if (!ns_list)
 		return -ENOMEM;
 
-	scan_info.ctrl = ctrl;
-	scan_info.ns_list = ns_list;
 	for (;;) {
 		struct nvme_command cmd = {
 			.identify.opcode	= nvme_admin_identify,
@@ -4434,20 +4449,30 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)
 			goto free;
 		}
 
-		atomic_set(&scan_info.next_nsid, 0);
 		for (i = 0; i < nr_entries; i++) {
 			u32 nsid = le32_to_cpu(ns_list[i]);
 
 			if (!nsid)	/* end of the list? */
 				goto out;
-			async_schedule_domain(nvme_scan_ns_async, &scan_info,
+
+			task = kmalloc_obj(*task);
+			if (!task) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			task->nsid = nsid;
+			task->ctrl = ctrl;
+			compl_chain_add(&ctrl->scan_chain, &task->chain_entry);
+
+			async_schedule_domain(nvme_scan_ns_async, task,
 						&domain);
 			while (++prev < nsid)
 				nvme_ns_remove_by_nsid(ctrl, prev);
 		}
-		async_synchronize_full_domain(&domain);
 	}
  out:
+	async_synchronize_full_domain(&domain);
 	nvme_remove_invalid_namespaces(ctrl, prev);
  free:
 	async_synchronize_full_domain(&domain);
@@ -4466,7 +4491,7 @@ static void nvme_scan_ns_sequential(struct nvme_ctrl *ctrl)
 	kfree(id);
 
 	for (i = 1; i <= nn; i++)
-		nvme_scan_ns(ctrl, i);
+		nvme_scan_ns(ctrl, i, NULL);
 
 	nvme_remove_invalid_namespaces(ctrl, nn);
 }
@@ -5094,6 +5119,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
+	compl_chain_init(&ctrl->scan_chain);
 	xa_init(&ctrl->cels);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a5f28c5103c..95f8c40ec86b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -17,6 +17,7 @@
 #include <linux/wait.h>
 #include <linux/t10-pi.h>
 #include <linux/ratelimit_types.h>
+#include <linux/compl_chain.h>
 
 #include <trace/events/block.h>
 
@@ -294,6 +295,7 @@ struct nvme_ctrl {
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
+	struct compl_chain scan_chain;
 	struct mutex namespaces_lock;
 	struct srcu_struct srcu;
 	struct device ctrl_device;
-- 
2.53.0




* Re: [PATCH V3 2/3] nvme-core: register namespaces in order during async scan
  2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
@ 2026-02-25 21:37   ` kernel test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2026-02-25 21:37 UTC
  To: Maurizio Lombardi, kbusch
  Cc: oe-kbuild-all, hch, hare, chaitanyak, bvanassche, linux-scsi,
	linux-nvme, James.Bottomley, mlombard, jmeneghi, emilne, bgurney

Hi Maurizio,

kernel test robot noticed the following build warnings:

[auto build test WARNING on jejb-scsi/for-next]
[also build test WARNING on mkp-scsi/for-next linus/master v7.0-rc1 next-20260225]
[cannot apply to linux-nvme/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Maurizio-Lombardi/lib-Introduce-completion-chain-helper/20260226-001842
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next
patch link:    https://lore.kernel.org/r/20260225161203.76168-3-mlombard%40redhat.com
patch subject: [PATCH V3 2/3] nvme-core: register namespaces in order during async scan
config: x86_64-randconfig-161-20260226 (https://download.01.org/0day-ci/archive/20260226/202602260543.EHcJPG8y-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
smatch version: v0.5.0-8994-gd50c5a4c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260226/202602260543.EHcJPG8y-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602260543.EHcJPG8y-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: drivers/nvme/host/core.c:4117 struct member 'chain_entry' not described in 'async_scan_task'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper
  2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
  2026-02-25 16:12 ` [PATCH V3 1/3] lib: Introduce completion chain helper Maurizio Lombardi
  2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
@ 2026-02-25 16:12 ` Maurizio Lombardi
  2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
  3 siblings, 0 replies; 20+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC
  To: kbusch
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, mlombard, jmeneghi, emilne, bgurney

The asynchronous host scanning logic in scsi_scan.c uses a custom,
open-coded implementation to serialize scans. This involves a manually
managed list of tasks, each with its own completion, to ensure that hosts
are scanned and added to the system in a deterministic order.

Refactor the SCSI async scanning implementation to use the new compl_chain
helper. This simplifies the scsi_scan.c code and makes the serialization
logic more readable.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
 drivers/scsi/scsi_priv.h |  2 +-
 drivers/scsi/scsi_scan.c | 68 +++++-----------------------------------
 2 files changed, 9 insertions(+), 61 deletions(-)

diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 7a193cc04e5b..274fdd7edac4 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -132,7 +132,7 @@ extern void scsi_exit_procfs(void);
 
 /* scsi_scan.c */
 void scsi_enable_async_suspend(struct device *dev);
-extern int scsi_complete_async_scans(void);
+void scsi_complete_async_scans(void);
 extern int scsi_scan_host_selected(struct Scsi_Host *, unsigned int,
 				   unsigned int, u64, enum scsi_scan_mode);
 extern void scsi_forget_host(struct Scsi_Host *);
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 60c06fa4ec32..f19f2c73f042 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -36,6 +36,7 @@
 #include <linux/async.h>
 #include <linux/slab.h>
 #include <linux/unaligned.h>
+#include <linux/compl_chain.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -112,14 +113,11 @@ MODULE_PARM_DESC(inq_timeout,
 		 "Timeout (in seconds) waiting for devices to answer INQUIRY."
 		 " Default is 20. Some devices may need more; most need less.");
 
-/* This lock protects only this list */
-static DEFINE_SPINLOCK(async_scan_lock);
-static LIST_HEAD(scanning_hosts);
+static DEFINE_COMPL_CHAIN(scanning_hosts);
 
 struct async_scan_data {
-	struct list_head list;
+	struct compl_chain_entry chain_entry;
 	struct Scsi_Host *shost;
-	struct completion prev_finished;
 };
 
 /*
@@ -146,48 +144,10 @@ void scsi_enable_async_suspend(struct device *dev)
  * started scanning after this function was called may or may not have
  * finished.
  */
-int scsi_complete_async_scans(void)
+void scsi_complete_async_scans(void)
 {
-	struct async_scan_data *data;
-
-	do {
-		scoped_guard(spinlock, &async_scan_lock)
-			if (list_empty(&scanning_hosts))
-				return 0;
-		/* If we can't get memory immediately, that's OK.  Just
-		 * sleep a little.  Even if we never get memory, the async
-		 * scans will finish eventually.
-		 */
-		data = kmalloc(sizeof(*data), GFP_KERNEL);
-		if (!data)
-			msleep(1);
-	} while (!data);
-
-	data->shost = NULL;
-	init_completion(&data->prev_finished);
-
-	spin_lock(&async_scan_lock);
-	/* Check that there's still somebody else on the list */
-	if (list_empty(&scanning_hosts))
-		goto done;
-	list_add_tail(&data->list, &scanning_hosts);
-	spin_unlock(&async_scan_lock);
-
 	printk(KERN_INFO "scsi: waiting for bus probes to complete ...\n");
-	wait_for_completion(&data->prev_finished);
-
-	spin_lock(&async_scan_lock);
-	list_del(&data->list);
-	if (!list_empty(&scanning_hosts)) {
-		struct async_scan_data *next = list_entry(scanning_hosts.next,
-				struct async_scan_data, list);
-		complete(&next->prev_finished);
-	}
- done:
-	spin_unlock(&async_scan_lock);
-
-	kfree(data);
-	return 0;
+	compl_chain_flush(&scanning_hosts);
 }
 
 /**
@@ -1960,18 +1920,13 @@ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost)
 	data->shost = scsi_host_get(shost);
 	if (!data->shost)
 		goto err;
-	init_completion(&data->prev_finished);
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	shost->async_scan = 1;
 	spin_unlock_irqrestore(shost->host_lock, flags);
 	mutex_unlock(&shost->scan_mutex);
 
-	spin_lock(&async_scan_lock);
-	if (list_empty(&scanning_hosts))
-		complete(&data->prev_finished);
-	list_add_tail(&data->list, &scanning_hosts);
-	spin_unlock(&async_scan_lock);
+	compl_chain_add(&scanning_hosts, &data->chain_entry);
 
 	return data;
 
@@ -2008,7 +1963,7 @@ static void scsi_finish_async_scan(struct async_scan_data *data)
 		return;
 	}
 
-	wait_for_completion(&data->prev_finished);
+	compl_chain_wait(&data->chain_entry);
 
 	scsi_sysfs_add_devices(shost);
 
@@ -2018,14 +1973,7 @@ static void scsi_finish_async_scan(struct async_scan_data *data)
 
 	mutex_unlock(&shost->scan_mutex);
 
-	spin_lock(&async_scan_lock);
-	list_del(&data->list);
-	if (!list_empty(&scanning_hosts)) {
-		struct async_scan_data *next = list_entry(scanning_hosts.next,
-				struct async_scan_data, list);
-		complete(&next->prev_finished);
-	}
-	spin_unlock(&async_scan_lock);
+	compl_chain_complete(&data->chain_entry);
 
 	scsi_autopm_put_host(shost);
 	scsi_host_put(shost);
-- 
2.53.0




* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
                   ` (2 preceding siblings ...)
  2026-02-25 16:12 ` [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper Maurizio Lombardi
@ 2026-02-25 21:41 ` Keith Busch
  2026-02-26  8:07   ` Maurizio Lombardi
  3 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2026-02-25 21:41 UTC
  To: Maurizio Lombardi
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, mlombard, jmeneghi, emilne, bgurney

On Wed, Feb 25, 2026 at 05:12:00PM +0100, Maurizio Lombardi wrote:
> The NVMe fully asynchronous namespace scanning introduced in
> commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
> significantly improved discovery times. However, it also introduced
> non-deterministic ordering for namespace registration.
>
> While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
> across reboots, this unpredictable ordering has caused considerable user
> confusion and has been perceived as a regression, leading to multiple bug
> reports.

The nvme-pci driver also probes the controllers asynchronously, which
can also create non-deterministic names. Is that part not a problem?

Just on the suffix part of the namespace's block handle, I have a
potential alternate suggestion here. The instance names pulled from the
ida guarantee we'll always have unique names for the lifetime of the
backing kobject. I introduced that a while ago, but I'm testing this out
now and it seems kobject_del is sufficient to reuse that name. The
driver already did that to all the objects when deleting the namespace,
so there doesn't appear to be a reason to wait for the final
kobject_put.

What I'm saying is I may have been mistaken about the naming collision
issues and we can just use the head's ns_id to get a consistent and
meaningful name based off the backing namespaces. There's some unlikely
races with multipath at the moment if we did use ns_id, but I think
they're all fixable.


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
@ 2026-02-26  8:07   ` Maurizio Lombardi
  2026-02-26 15:09     ` Keith Busch
  2026-02-26 16:35     ` John Meneghini
  0 siblings, 2 replies; 20+ messages in thread
From: Maurizio Lombardi @ 2026-02-26  8:07 UTC
  To: Keith Busch, Maurizio Lombardi
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, mlombard, jmeneghi, emilne, bgurney

On Wed Feb 25, 2026 at 10:41 PM CET, Keith Busch wrote:
> On Wed, Feb 25, 2026 at 05:12:00PM +0100, Maurizio Lombardi wrote:
>> The NVMe fully asynchronous namespace scanning introduced in
>> commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
>> significantly improved discovery times. However, it also introduced
>> non-deterministic ordering for namespace registration.
>>
>> While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
>> across reboots, this unpredictable ordering has caused considerable user
>> confusion and has been perceived as a regression, leading to multiple bug
>> reports.
>
> The nvme-pci driver also probes the controllers asynchronously, which
> can also create non-deterministic names. Is that part not a problem?

Potentially, it is. The difference is that so far no one has ever complained
about it, while with namespace async scanning we immediately received regression
reports, to the point that we had to revert the changes and restore the
sequential namespace scan in RHEL.

>
> Just on the suffix part of the namespace's block handle, I have a
> potential alternate suggestion here. The instance names pulled from the
> ida guarantee we'll always have unique names for the lifetime of the
> backing kobject. I introduced that a while ago, but I'm testing this out
> now and it seems kobject_del is sufficient to reuse that name. The
> driver already did that to all the objects when deleting the namespace,
> so there doesn't appear to be a reason to wait for the final
> kobject_put.
>
> What I'm saying is I may have been mistaken about the naming collision
> issues and we can just use the head's ns_id to get a consistent and
> meaningful name based off the backing namespaces. There's some unlikely
> races with multipath at the moment if we did use ns_id, but I think
> they're all fixable.

Ok, so you'd like to use the namespace's NSID as the suffix.
I also considered this approach; the reason I didn't implement
it is that I wanted to keep the async namespace scan performance improvements
while preserving the same enumeration we had for years with the sequential scan:

Before the introduction of the async scan, /dev/nvme0n1 always pointed
to the first entry of the NSID list, /dev/nvme0n2 to the second
entry and so on.

With your proposal, if a user has sparse NSIDs (1, 10, 333),
they will get /dev/nvme0n1, /dev/nvme0n10, /dev/nvme0n333.
On one hand, yes, these names are "more stable" and more meaningful too;
on the other hand, this breaks the assumption of contiguous naming.
This might not be a problem for the mainline kernel, but I suspect we
will have people complaining again that the /dev/nvmeXnY enumeration changed.

Maurizio


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-02-26  8:07   ` Maurizio Lombardi
@ 2026-02-26 15:09     ` Keith Busch
  2026-02-26 16:35     ` John Meneghini
  1 sibling, 0 replies; 20+ messages in thread
From: Keith Busch @ 2026-02-26 15:09 UTC (permalink / raw)
  To: Maurizio Lombardi
  Cc: Maurizio Lombardi, hch, hare, chaitanyak, bvanassche, linux-scsi,
	linux-nvme, James.Bottomley, jmeneghi, emilne, bgurney

On Thu, Feb 26, 2026 at 09:07:10AM +0100, Maurizio Lombardi wrote:
> With your proposal, if a user has sparse NSIDs (1, 10, 333)
> then he will get /dev/nvme0n1, /dev/nvme0n10, /dev/nvme0n333.
> On one hand, yes, they are "more stable" and more meaningful too,
> on the other hand this breaks the assumption of contiguous naming.
> This might not be a problem for the mainline kernel, but I suspect we
> will have people complaining again that the /dev/nvmeXnY enumeration changed

The bonus of using the nsid is that it will always enumerate with the
same name even after you alter the other attached namespaces.
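
The trade-off between the two naming schemes can be sketched in a few
lines of C. This is only an illustration, not driver code: the helpers
name_by_index() and name_by_nsid() are hypothetical, and the controller
instance 0 and the NSID list (1, 10, 333) are taken from the example
quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the name follows the position in the sorted NSID
 * list, i.e. the contiguous enumeration of the old sequential scan. */
void name_by_index(char *buf, size_t len, int ctrl, size_t pos)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "nvme%dn%zu", ctrl, pos + 1);
}

/* Hypothetical helper: the name is taken directly from the NSID. */
void name_by_nsid(char *buf, size_t len, int ctrl, unsigned int nsid)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "nvme%dn%u", ctrl, nsid);
}

/* Print both candidate names for every attached namespace. */
void show_names(int ctrl, const unsigned int *nsids, size_t count)
{
	char by_index[32], by_nsid[32];

	for (size_t i = 0; i < count; i++) {
		name_by_index(by_index, sizeof(by_index), ctrl, i);
		name_by_nsid(by_nsid, sizeof(by_nsid), ctrl, nsids[i]);
		printf("NSID %u: index-based %s, nsid-based %s\n",
		       nsids[i], by_index, by_nsid);
	}
}
```

With the list (1, 10, 333), NSID 333 is nvme0n3 under the index-based
scheme and nvme0n333 under the nsid-based one. If NSID 10 is later
detached, the index-based name of NSID 333 shifts to nvme0n2, while the
nsid-based name stays nvme0n333.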



* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-02-26  8:07   ` Maurizio Lombardi
  2026-02-26 15:09     ` Keith Busch
@ 2026-02-26 16:35     ` John Meneghini
  2026-02-26 18:15       ` Keith Busch
  1 sibling, 1 reply; 20+ messages in thread
From: John Meneghini @ 2026-02-26 16:35 UTC (permalink / raw)
  To: Maurizio Lombardi, Keith Busch, Maurizio Lombardi
  Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
	James.Bottomley, emilne, bgurney

On 2/26/26 3:07 AM, Maurizio Lombardi wrote:
> On Wed Feb 25, 2026 at 10:41 PM CET, Keith Busch wrote:
>> On Wed, Feb 25, 2026 at 05:12:00PM +0100, Maurizio Lombardi wrote:
>>> The NVMe fully asynchronous namespace scanning introduced in
>>> commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
>>> significantly improved discovery times. However, it also introduced
>>> non-deterministic ordering for namespace registration.
>>>
>>> While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
>>> across reboots, this unpredictable ordering has caused considerable user
>>> confusion and has been perceived as a regression, leading to multiple bug
>>> reports.
>>
>> The nvme-pci driver also probes the controllers asynchronously, which
>> can also create non-deterministic names. Is that part not a problem?
> 
> Potentially, it is. The difference is that so far no one ever complained
> about it, while with namespace async scanning we immediately received regression
> reports, to the point we had to revert the changes and restore the
> sequential namespaces scan in RHEL.

It's worse than this.  Yes, in RHEL we carry out-of-tree patches to turn off the async scanning with SCSI,
and we reverted this async namespace scanning patch in NVMe.

We had to do this because, as soon as we turned these async scanning mechanisms on, we immediately
received customer escalations. Customers were not able to upgrade their systems. We have customer issues
and complaints open about this and we see this async namespace scanning as a barrier to adoption with NVMe -
especially with NVMe-oF, which tends to have many more namespaces than PCIe.

We've talked about this at LSF/MM - more than once - and several solutions have been proposed in the past,
but nothing ever happened.

And yes, the PCIe async discovery stuff does cause some problems.  The difference is: the PCIe bus configuration does
not change nearly as often as, e.g., the nvme namespace configuration in a fabric, so customers don't notice the changing PCI IDs.
Unless someone is doing lots of hot unplugging and plugging with their PCI bus, the PCI IDs typically don't change at all.

So from boot to boot, PCI IDs don't usually change.  This async namespace scanning causes the namespace IDs to change with every reboot, especially on
a system with hundreds of NVMe-oF namespaces.

So, we really need this change, or something like this, to be accepted upstream.

/John


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-02-26 16:35     ` John Meneghini
@ 2026-02-26 18:15       ` Keith Busch
  2026-03-02  7:16         ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2026-02-26 18:15 UTC (permalink / raw)
  To: John Meneghini
  Cc: Maurizio Lombardi, Maurizio Lombardi, hch, hare, chaitanyak,
	bvanassche, linux-scsi, linux-nvme, James.Bottomley, emilne,
	bgurney

On Thu, Feb 26, 2026 at 11:35:15AM -0500, John Meneghini wrote:
> It's worse than this.  Yes, in RHEL we carry out-of-tree patches to turn off the async scanning with SCSI,
> and we reverted this async namespace scanning patch in NVMe.
>
> We had to do this because, as soon as we turned these async scanning mechanisms on, we immediately
> received customer escalations. Customers were not able to upgrade their systems. We have customer issues
> and complaints open about this and we see this async namespace scanning as a barrier to adoption with NVMe -
> especially with NVMe-oF, which tends to have many more namespaces than PCIe.

Sounds like some people just don't know how to use labels or persistent
names. Relying on /dev/nvmeXnY or /dev/sdX to always be a handle to the
same device is a fragile solution.
 
> And yes, the PCIe async discovery stuff does cause some problems.  The difference is: the PCIe bus configuration does
> not change nearly as often as, e.g., the nvme namespace configuration in a fabric, so customers don't notice the changing pci ids.
> Unless someone is doing lots of hot unplugging and plugging with their PCI bus, the PCI IDs typically don't change at all.

It's not about the PCI topology changing. The async probe makes it
non-deterministic as to which PCI device is going to claim which
instance out of the nvme ida since they all try to run concurrently.
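
A deliberately simplified model of that effect, as a sketch only: it
does not use the kernel's ida API, and assign_instances() is a
hypothetical helper. It assumes that each controller takes the next
free instance number in whatever order its probe happens to run, so the
same set of PCI devices can end up with different nvmeX numbers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: arrival[i] is the index of the PCI device
 * whose probe runs i-th. Each probe claims the next free instance
 * number, so instance_of[dev] depends only on the arrival order.
 */
void assign_instances(const int *arrival, size_t count, int *instance_of)
{
	assert(arrival != NULL && instance_of != NULL);

	for (size_t i = 0; i < count; i++)
		instance_of[arrival[i]] = (int)i;
}
```

If the probes run in the order (0, 1, 2), device 2 becomes instance 2;
if they run in the order (2, 0, 1), the same device becomes instance 0,
even though the PCI topology is unchanged.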



* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-02-26 18:15       ` Keith Busch
@ 2026-03-02  7:16         ` Hannes Reinecke
  2026-03-02 17:12           ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2026-03-02  7:16 UTC (permalink / raw)
  To: Keith Busch, John Meneghini
  Cc: Maurizio Lombardi, Maurizio Lombardi, hch, chaitanyak, bvanassche,
	linux-scsi, linux-nvme, James.Bottomley, emilne, bgurney

On 2/26/26 19:15, Keith Busch wrote:
> On Thu, Feb 26, 2026 at 11:35:15AM -0500, John Meneghini wrote:
>> It's worse than this.  Yes, in RHEL we carry out-of-tree patches to turn off the async scanning with SCSI,
>> and we reverted this async namespace scanning patch in NVMe.
>>
>> We had to do this because, as soon as we turned these async scanning mechanisms on, we immediately
>> received customer escalations. Customers were not able to upgrade their systems. We have customer issues
>> and complaints open about this and we see this async namespace scanning as a barrier to adoption with NVMe -
>> especially with NVMe-oF, which tends to have many more namespaces than PCIe.
> 
> Sounds like some people just don't know how to use labels or persistent
> names. Relying on /dev/nvmeXnY or /dev/sdX to always be a handle to the
> same device is a fragile solution.
>   
Yeah. We went through this (admittedly, rather painful) process quite
some time back for SLES (with the switch from SLES12 to SLES15, if memory
serves correctly). Since then our customers seem to be happy with using
persistent device links.

>> And yes, the PCIe async discovery stuff does cause some problems.  The difference is: the PCIe bus configuration does
>> not change nearly as often as, e.g., the nvme namespace configuration in a fabric, so customers don't notice the changing pci ids.
>> Unless someone is doing lots of hot unplugging and plugging with their PCI bus, the PCI IDs typically don't change at all.
> 
> It's not about the PCI topology changing. The async probe makes it
> non-deterministic as to which PCI device is going to claim which
> instance out of the nvme ida since they all try to run concurrently.

I really would like to go with the nsid-based solution from Keith.
That would avoid quite a bit of cumbersome code here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-03-02  7:16         ` Hannes Reinecke
@ 2026-03-02 17:12           ` Keith Busch
  2026-06-17 17:41             ` Maurizio Lombardi
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2026-03-02 17:12 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: John Meneghini, Maurizio Lombardi, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

On Mon, Mar 02, 2026 at 08:16:19AM +0100, Hannes Reinecke wrote:
> I really would like to go with the nsid based solution from Keith.
> That would avoid quite some cumbersome code here.

I've seen various documentation that assumes the current naming
indicates the nsid, so the scheme follows at least some people's
expectations. I don't know if we can make everyone happy here, though.
:(

I've fixed up most of the multipath races that get us closer to allowing
nsid suffix, but there's one left: nvme_remove_head is called outside
the subsys lock after detaching the head from the subsystem list. That
could cause a subsequent add event to call nvme_alloc_ns() before the
mpath side has completed del_gendisk() for the old nsid.


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-03-02 17:12           ` Keith Busch
@ 2026-06-17 17:41             ` Maurizio Lombardi
  2026-06-18 21:55               ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Maurizio Lombardi @ 2026-06-17 17:41 UTC (permalink / raw)
  To: Keith Busch, Hannes Reinecke
  Cc: John Meneghini, Maurizio Lombardi, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

Hello Keith,

On Mon Mar 2, 2026 at 6:12 PM CET, Keith Busch wrote:
> On Mon, Mar 02, 2026 at 08:16:19AM +0100, Hannes Reinecke wrote:
>> I really would like to go with the nsid based solution from Keith.
>> That would avoid quite some cumbersome code here.
>
> I've seen various documentation that assumes the current naming
> indicates the nsid, so the scheme follows at least some people's
> expectations. I don't know if we can make everyone happy here, though.
> :(
>
> I've fixed up most of the multipath races that get us closer to allowing
> nsid suffix, but there's one left: nvme_remove_head is called outside
> the subsys lock after detaching the head from the subsystem list. That
> could cause a subsequent add event to call nvme_alloc_ns() before the
> mpath side has completed del_gendisk() for the old nsid.

Did you manage to find a solution for this race?
I just wanted to check whether you had any patches ready for testing since
this discussion.

Thanks,
Maurizio



* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-17 17:41             ` Maurizio Lombardi
@ 2026-06-18 21:55               ` Keith Busch
  2026-06-19  5:59                 ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2026-06-18 21:55 UTC (permalink / raw)
  To: Maurizio Lombardi
  Cc: Hannes Reinecke, John Meneghini, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

On Wed, Jun 17, 2026 at 07:41:58PM +0200, Maurizio Lombardi wrote:
> Did you manage to find a solution for this race?
> I just wanted to check whether you had any patches ready for testing since
> this discussion.

At last month's LSFMM, I heard concerns that this would break something.
I don't remember the specifics as I wasn't trying to push the issue. I
think this feedback was from the more fabrics-focused folks; maybe Randy
Jennings, Nilay Shroff or John Meneghini remember?



* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-18 21:55               ` Keith Busch
@ 2026-06-19  5:59                 ` Hannes Reinecke
  2026-06-19 18:45                   ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2026-06-19  5:59 UTC (permalink / raw)
  To: Keith Busch, Maurizio Lombardi
  Cc: John Meneghini, Maurizio Lombardi, hch, chaitanyak, bvanassche,
	linux-scsi, linux-nvme, James.Bottomley, emilne, bgurney

On 6/18/26 23:55, Keith Busch wrote:
> On Wed, Jun 17, 2026 at 07:41:58PM +0200, Maurizio Lombardi wrote:
>> Did you manage to find a solution for this race?
>> I just wanted to check whether you had any patches ready for testing since
>> this discussion.
> 
> At last month's LSFMM, I heard concerns that this would break something.
> I don't remember the specifics as I wasn't trying to push the issue. I
> think this feedback was from the more fabrics focused folks, maybe Randy
> Jennings, Nilay Shroff or John Meneghini remembers?

The problem here is namespace lifetime. The ns_ida is only ever released
at the very last step, so the 'number' of the namespace will only be
freed once all references to the namespace are dropped.
So if you were trying to keep the namespace numbers ordered, you would
have to delay the creation of the namespace until that point, and you
would induce a serialization between deletion and creation,
which will drastically prolong the rescan process.
And of course you need to hope that no one triggers another rescan
while the old one isn't finished, as that would need to wait
for the previous one to finish, too.
Or you need to introduce a mechanism to terminate an already running
rescan process.
So really, not a good idea.
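
The lifetime concern described above can be illustrated with a toy
model. This is a sketch under stated assumptions, not the kernel's ida
API and not the nvme driver's data structures: toy_ida, toy_ns and all
functions below are hypothetical, and the model simply assumes that an
instance number is released only when the last reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IDS 16

/* Toy stand-in for an ID allocator; NOT the kernel's ida API. */
struct toy_ida {
	bool used[TOY_MAX_IDS];
};

/* Return the lowest free ID, or -1 if none is left. */
int toy_ida_alloc(struct toy_ida *ida)
{
	for (int i = 0; i < TOY_MAX_IDS; i++) {
		if (!ida->used[i]) {
			ida->used[i] = true;
			return i;
		}
	}
	return -1;
}

void toy_ida_free(struct toy_ida *ida, int id)
{
	assert(id >= 0 && id < TOY_MAX_IDS);
	ida->used[id] = false;
}

/* Hypothetical namespace object: the instance number lives as long as
 * the last reference does. */
struct toy_ns {
	unsigned int nsid;
	int instance;
	int refs;
};

void toy_ns_create(struct toy_ida *ida, struct toy_ns *ns, unsigned int nsid)
{
	ns->nsid = nsid;
	ns->instance = toy_ida_alloc(ida);
	ns->refs = 1;
}

/* Drop one reference; the instance number is released only at zero. */
void toy_ns_put(struct toy_ida *ida, struct toy_ns *ns)
{
	assert(ns->refs > 0);
	if (--ns->refs == 0)
		toy_ida_free(ida, ns->instance);
}
```

In this model, if a namespace is deleted while something still holds a
reference to it, a namespace created right afterwards gets the next
free instance number rather than the old one. Keeping the numbers
ordered would then mean waiting for the final put before creating the
new namespace, which is the serialization between deletion and creation
mentioned above.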

I fail to see why one cannot use persistent names here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-19  5:59                 ` Hannes Reinecke
@ 2026-06-19 18:45                   ` Keith Busch
  2026-06-22  7:15                     ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2026-06-19 18:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Maurizio Lombardi, John Meneghini, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

On Fri, Jun 19, 2026 at 07:59:43AM +0200, Hannes Reinecke wrote:
> The problem here is namespace lifetime. The ns_ida is only ever released
> at the very last step, so the 'number' of the namespace will only be freed
> once all references to the namespace are dropped.
> So if you were trying to keep the namespace number ordered you would
> have to delay the creation of the namespace until that point, and you
> would induce a serialization between deletion and creation.

Under the proposed scheme, there is no ns_ida. You just use the NSID of
the namespace, and that's it. You have to ensure that del_gendisk
completed on all heads and paths that were using it prior to bringing up
the next one, but that's not really a problem.

The problem I recall has something to do with the nsid not being a
consistent value when migrating a namespace to another array or
something like that. Not that we currently have proper support for such
a thing...


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-19 18:45                   ` Keith Busch
@ 2026-06-22  7:15                     ` Hannes Reinecke
  2026-06-24 16:05                       ` Maurizio Lombardi
  2026-06-24 22:16                       ` Keith Busch
  0 siblings, 2 replies; 20+ messages in thread
From: Hannes Reinecke @ 2026-06-22  7:15 UTC (permalink / raw)
  To: Keith Busch
  Cc: Maurizio Lombardi, John Meneghini, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

On 6/19/26 20:45, Keith Busch wrote:
> On Fri, Jun 19, 2026 at 07:59:43AM +0200, Hannes Reinecke wrote:
>> The problem here is namespace lifetime. The ns_ida is only ever released
>> at the very last step, so the 'number' of the namespace will only be freed
>> once all references to the namespace are dropped.
>> So if you were trying to keep the namespace number ordered you would
>> have to delay the creation of the namespace until that point, and you
>> would induce a serialization between deletion and creation.
> 
> Under the proposed scheme, there is no ns_ida. You just use the NSID of
> the namespace, and that's it. You have to ensure that del_gendisk
> completed on all heads and paths that was using it prior to bringing up
> the next one, but that's not really a problem.
> 
But then you'll have to delay the (re-)scan until the very last 
reference is gone, otherwise the nsid the scan is about to create
will be blocked by the nsid still pending to be deleted.

And we do have blktests nvme/058 as a really nice test case for executing
rapid namespace remapping; it regularly manages to get the 'nsid'
and 'ns_ida' numbers out of sync.

In general I fail to see the issue here.
Any modern distro should be using persistent device links to access
devices, so the actual device name is pretty much irrelevant.
We on our side haven't had any issues here for ages.

And scanning has been one of the most complex operations we are doing
on nvme, and so I'd really think twice before changing that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-22  7:15                     ` Hannes Reinecke
@ 2026-06-24 16:05                       ` Maurizio Lombardi
  2026-06-24 22:16                       ` Keith Busch
  1 sibling, 0 replies; 20+ messages in thread
From: Maurizio Lombardi @ 2026-06-24 16:05 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: Maurizio Lombardi, John Meneghini, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

On Mon Jun 22, 2026 at 9:15 AM CEST, Hannes Reinecke wrote:
>
> In general I fail to see the issue here.
> Any modern distro should be using persistent device links to access
> devices, so the actual device name is pretty much irrelevant.
> We on our side haven't had any issues here since ages.

In principle, I agree with you that persistent links are best practice.
That said, many users still rely on /dev/nvmeXnY names, for example for
some nvme-cli commands. Because kernel 6.11 made these names totally
random across reboots, it's causing some confusion for them.

But yes, I totally understand the reason why you don't perceive this as
an issue.

Maurizio




* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-22  7:15                     ` Hannes Reinecke
  2026-06-24 16:05                       ` Maurizio Lombardi
@ 2026-06-24 22:16                       ` Keith Busch
  2026-06-25  0:53                         ` Randy Jennings
  1 sibling, 1 reply; 20+ messages in thread
From: Keith Busch @ 2026-06-24 22:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Maurizio Lombardi, John Meneghini, Maurizio Lombardi, hch,
	chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
	emilne, bgurney

On Mon, Jun 22, 2026 at 09:15:16AM +0200, Hannes Reinecke wrote:
> But then you'll have to delay the (re-)scan until the very last reference is
> gone, otherwise the nsid the scan is about to create
> will be blocked by the nsid still pending to be deleted.

It's not about the last reference. Either something changed or there was
some previous misunderstanding when that kobj name uniqueness was
introduced to this driver. We just need to wait for del_gendisk to
complete, which is usually already serialized in the same scan_work. It
doesn't appear to matter if a reference is held on a kobj waiting to be
deleted.

> In general I fail to see the issue here.
> Any modern distro should be using persistent device links to access
> devices, so the actual device name is pretty much irrelevant.
> We on our side haven't had any issues here since ages.

I agree there's not a real issue here. The suggestion is purely a
quality-of-life improvement to provide a visual clue that aligns with
people's expectations, reducing any surprises. There are people and
documentation that still think the "n1" in the nvme0n1 means it's NSID
1. If we can easily align to that, then why not? But I'm not exactly
needing this feature either, so if you think there are some "gotchas"
here that may destabilize the current scanning, then I have no
problem shelving this one.


* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
  2026-06-24 22:16                       ` Keith Busch
@ 2026-06-25  0:53                         ` Randy Jennings
  0 siblings, 0 replies; 20+ messages in thread
From: Randy Jennings @ 2026-06-25  0:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Maurizio Lombardi, John Meneghini,
	Maurizio Lombardi, hch, chaitanyak, bvanassche, linux-scsi,
	linux-nvme, James.Bottomley, emilne, bgurney

On Wed, Jun 24, 2026 at 4:02 PM Keith Busch <kbusch@kernel.org> wrote:
>
> > In general I fail to see the issue here.
> > Any modern distro should be using persistent device links to access
> > devices, so the actual device name is pretty much irrelevant.
> > We on our side haven't had any issues here since ages.
>
> I agree there's not a real issue here. The suggestion is purely a
> quality-of-life improvement to provide a visual clue that aligns with
> people's expectations, reducing any surprises. There are people and
> documentation that still think the "n1" in the nvme0n1 means it's NSID
> 1. If we can easily align to that, then why not? But I'm not exactly
> needing this feature either, so if you think there are some "gotchas"
> here that may destabilize the current scanning, then I have no
> problem shelving this one.

There is the issue where the NSID of a specific namespace can change
while none of the hosts it is attached to are connected.  It does not
happen often, but it can need to happen (even without namespace
migration).  This is not speculative.  Over the course of years, our array
has had to do it for specific scenarios a couple of times.

However, the more fundamental problem is this:

> That said, many users still rely on /dev/nvmeXnY links, for example for
> some nvme-cli commands. Because kernel 6.11 made these names totally
> random across reboots, it's causing some confusion for them.

Relying on NSID to fix references to a specific namespace is not a safe
way to operate.  And making the device name predictable leads to people
taking these shortcuts.  To find a specific namespace, the NGUID should
be used.  That guarantees you are on the same storage.  Enabling
shortcuts that are not reliable is not a good practice.
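
As a rough sketch of what looking up a namespace by NGUID instead of by
device name could look like: the struct ns_entry table and
find_by_nguid() below are hypothetical, and the NGUID strings in any
caller would be placeholders. How the table gets filled (for example
from per-namespace identifier attributes exposed by the system) is an
assumption and is not shown; the sketch also assumes the NGUID strings
are formatted consistently, since it compares them byte for byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical snapshot of one namespace: its current kernel name and
 * its NGUID as a hex string. */
struct ns_entry {
	const char *name;
	const char *nguid;
};

/* Return the current device name for the given NGUID, or NULL if no
 * entry in the table carries that NGUID. */
const char *find_by_nguid(const struct ns_entry *tbl, size_t count,
			  const char *nguid)
{
	assert(nguid != NULL);

	for (size_t i = 0; i < count; i++) {
		if (strcmp(tbl[i].nguid, nguid) == 0)
			return tbl[i].name;
	}
	return NULL;
}
```

The point of the design is that the caller never stores a device name:
it stores the NGUID and resolves it on every use, so the result stays
correct whether or not the /dev/nvmeXnY enumeration changed in between.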

Sincerely,
Randy Jennings




Thread overview: 20+ messages
2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
2026-02-25 16:12 ` [PATCH V3 1/3] lib: Introduce completion chain helper Maurizio Lombardi
2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
2026-02-25 21:37   ` kernel test robot
2026-02-25 16:12 ` [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper Maurizio Lombardi
2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
2026-02-26  8:07   ` Maurizio Lombardi
2026-02-26 15:09     ` Keith Busch
2026-02-26 16:35     ` John Meneghini
2026-02-26 18:15       ` Keith Busch
2026-03-02  7:16         ` Hannes Reinecke
2026-03-02 17:12           ` Keith Busch
2026-06-17 17:41             ` Maurizio Lombardi
2026-06-18 21:55               ` Keith Busch
2026-06-19  5:59                 ` Hannes Reinecke
2026-06-19 18:45                   ` Keith Busch
2026-06-22  7:15                     ` Hannes Reinecke
2026-06-24 16:05                       ` Maurizio Lombardi
2026-06-24 22:16                       ` Keith Busch
2026-06-25  0:53                         ` Randy Jennings
