* [PATCH V3 1/3] lib: Introduce completion chain helper
2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
@ 2026-02-25 16:12 ` Maurizio Lombardi
2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC (permalink / raw)
To: kbusch
Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
James.Bottomley, mlombard, jmeneghi, emilne, bgurney
Introduce a new helper library, the completion chain, designed to serialize
asynchronous operations that must complete in a strict First-In, First-Out
(FIFO) order.
Certain workflows, particularly in storage drivers, require operations to
complete in the same sequence they were submitted.
This helper provides a generic mechanism to enforce this ordering.
The helper is built around two structures:
* compl_chain: the main structure representing the queue of operations
* compl_chain_entry: an entry embedded in a per-operation structure
The typical usage pattern is:
* An operation is enqueued by calling compl_chain_add().
* The worker thread for the operation calls
compl_chain_wait(), which blocks until the previously
enqueued operation has finished.
* After the work is done, the thread calls compl_chain_complete().
This signals the next operation in the chain that it can now
proceed and removes the current entry from the list.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
include/linux/compl_chain.h | 35 +++++++++++
lib/Makefile | 2 +-
lib/compl_chain.c | 118 ++++++++++++++++++++++++++++++++++++
3 files changed, 154 insertions(+), 1 deletion(-)
create mode 100644 include/linux/compl_chain.h
create mode 100644 lib/compl_chain.c
diff --git a/include/linux/compl_chain.h b/include/linux/compl_chain.h
new file mode 100644
index 000000000000..a2bf271144e0
--- /dev/null
+++ b/include/linux/compl_chain.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_COMPLETION_CHAIN_H
+#define _LINUX_COMPLETION_CHAIN_H
+
+#include <linux/list.h>
+#include <linux/completion.h>
+#include <linux/spinlock.h>
+
+struct compl_chain {
+ spinlock_t lock;
+ struct list_head list;
+};
+
+#define COMPL_CHAIN_INIT(name) \
+ { .lock = __SPIN_LOCK_UNLOCKED((name).lock), \
+ .list = LIST_HEAD_INIT((name).list) }
+
+#define DEFINE_COMPL_CHAIN(name) \
+ struct compl_chain name = COMPL_CHAIN_INIT(name)
+
+struct compl_chain_entry {
+ struct compl_chain *chain;
+ struct list_head list;
+ struct completion prev_finished;
+};
+
+void compl_chain_init(struct compl_chain *chain);
+void compl_chain_add(struct compl_chain *chain,
+ struct compl_chain_entry *entry);
+void compl_chain_wait(struct compl_chain_entry *entry);
+void compl_chain_complete(struct compl_chain_entry *entry);
+bool compl_chain_pending(struct compl_chain_entry *entry);
+void compl_chain_flush(struct compl_chain *chain);
+
+#endif /* _LINUX_COMPLETION_CHAIN_H */
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f..c3ccd82bb190 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -58,7 +58,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
percpu-refcount.o rhashtable.o base64.o \
once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
- generic-radix-tree.o bitmap-str.o
+ generic-radix-tree.o bitmap-str.o compl_chain.o
obj-y += string_helpers.o
obj-y += hexdump.o
obj-$(CONFIG_TEST_HEXDUMP) += test_hexdump.o
diff --git a/lib/compl_chain.c b/lib/compl_chain.c
new file mode 100644
index 000000000000..b1cb43753f52
--- /dev/null
+++ b/lib/compl_chain.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Maurizio Lombardi <mlombard@redhat.com>
+ */
+#include <linux/compl_chain.h>
+
+/**
+ * compl_chain_init - Initialize a completion chain
+ * @chain: The completion chain to be initialized.
+ *
+ * Initializes a compl_chain structure
+ */
+void compl_chain_init(struct compl_chain *chain)
+{
+ spin_lock_init(&chain->lock);
+ INIT_LIST_HEAD(&chain->list);
+}
+EXPORT_SYMBOL_GPL(compl_chain_init);
+
+/**
+ * compl_chain_add - Add a new entry to the tail of the chain
+ * @chain: The completion chain to add the entry to.
+ * @entry: The entry to be enqueued.
+ *
+ * Adds a new entry to the end of the queue.
+ * If the chain is empty when this entry is added, it is immediately marked
+ * as ready to run, as there is no preceding entry to wait for.
+ */
+void compl_chain_add(struct compl_chain *chain,
+ struct compl_chain_entry *entry)
+{
+ init_completion(&entry->prev_finished);
+ INIT_LIST_HEAD(&entry->list);
+
+ WRITE_ONCE(entry->chain, chain);
+
+ spin_lock(&chain->lock);
+ if (list_empty(&chain->list))
+ complete_all(&entry->prev_finished);
+ list_add_tail(&entry->list, &chain->list);
+ spin_unlock(&chain->lock);
+}
+EXPORT_SYMBOL_GPL(compl_chain_add);
+
+/**
+ * compl_chain_wait - Wait for the preceding operation to finish
+ * @entry: The entry for the current operation.
+ *
+ * Blocks the current execution thread until compl_chain_complete()
+ * is executed against the previous entry in the chain.
+ */
+void compl_chain_wait(struct compl_chain_entry *entry)
+{
+ WARN_ON(!entry->chain);
+
+ wait_for_completion(&entry->prev_finished);
+}
+EXPORT_SYMBOL_GPL(compl_chain_wait);
+
+/**
+ * compl_chain_complete - Mark an entry as completed and signal the next one
+ * @entry: The entry to mark as completed.
+ *
+ * Removes the current entry from the chain and signals the next waiting
+ * entry (if one exists) that it is now allowed to proceed.
+ */
+void compl_chain_complete(struct compl_chain_entry *entry)
+{
+ struct compl_chain *chain = entry->chain;
+
+ WARN_ON(!chain);
+
+ wait_for_completion(&entry->prev_finished);
+
+ spin_lock(&chain->lock);
+ list_del(&entry->list);
+ if (!list_empty(&chain->list)) {
+ struct compl_chain_entry *next =
+ list_first_entry(&chain->list,
+ struct compl_chain_entry, list);
+ complete_all(&next->prev_finished);
+ }
+ spin_unlock(&chain->lock);
+
+ WRITE_ONCE(entry->chain, NULL);
+}
+EXPORT_SYMBOL_GPL(compl_chain_complete);
+
+/**
+ * compl_chain_pending - Check if an entry is pending
+ * @entry: The entry to check.
+ *
+ * Returns true if an entry has been added to a chain and hasn't yet
+ * been completed.
+ */
+bool compl_chain_pending(struct compl_chain_entry *entry)
+{
+ return READ_ONCE(entry->chain) != NULL;
+}
+EXPORT_SYMBOL_GPL(compl_chain_pending);
+
+/**
+ * compl_chain_flush - Wait for all entries currently in the chain to finish
+ * @chain: The completion chain to flush.
+ *
+ * Enqueues a dummy entry into the chain and immediately calls
+ * compl_chain_complete() against it. Because operations execute in strict
+ * FIFO order, this acts as a barrier, blocking the calling thread until
+ * all previously enqueued entries have finished.
+ */
+void compl_chain_flush(struct compl_chain *chain)
+{
+ struct compl_chain_entry dummy_entry;
+
+ compl_chain_add(chain, &dummy_entry);
+ compl_chain_complete(&dummy_entry);
+}
+EXPORT_SYMBOL_GPL(compl_chain_flush);
--
2.53.0
* [PATCH V3 2/3] nvme-core: register namespaces in order during async scan
2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
2026-02-25 16:12 ` [PATCH V3 1/3] lib: Introduce completion chain helper Maurizio Lombardi
@ 2026-02-25 16:12 ` Maurizio Lombardi
2026-02-25 21:37 ` kernel test robot
2026-02-25 16:12 ` [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper Maurizio Lombardi
2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
3 siblings, 1 reply; 12+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC (permalink / raw)
To: kbusch
Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
James.Bottomley, mlombard, jmeneghi, emilne, bgurney
The fully asynchronous namespace scanning, while fast, can result in
namespaces being allocated and registered out of order. This leads to
unpredictable device naming across reboots which can be confusing
for users.
To solve this, introduce a serialization mechanism for the asynchronous
namespace scan. This is achieved by using the generic compl_chain helper,
which ensures that the initialization of one namespace (nvme_alloc_ns)
completes before the next one begins.
This approach preserves the performance benefits of asynchronous
identification while guaranteeing that the final device registration
occurs in the correct order.
Performance testing shows that this change has no noticeable impact on
scan times compared to the fully asynchronous method.
High latency NVMe/TCP, ~150ms ping, 100 namespaces
Synchronous namespace scan (RHEL-10.1): 32375ms
Fully async namespace scan (7.0-rc1): 2543ms
Async namespace scan with dependency chain (7.0-rc1): 2431ms
Low latency NVMe/TCP, ~0.2ms ping, 100 namespaces
Synchronous namespace scan (RHEL-10.1): 352ms
Fully async namespace scan (7.0-rc1): 248ms
Async namespace scan with dependency chain (7.0-rc1): 191ms
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
drivers/nvme/host/core.c | 94 +++++++++++++++++++++++++---------------
drivers/nvme/host/nvme.h | 2 +
2 files changed, 62 insertions(+), 34 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f5ebcaa2f859..d186c0082cc8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4105,13 +4105,27 @@ static void nvme_ns_add_to_ctrl_list(struct nvme_ns *ns)
list_add_rcu(&ns->list, &ns->ctrl->namespaces);
}
-static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
+/**
+ * struct async_scan_task - keeps track of controller & NSID to scan
+ * @entry: link to the completion chain list
+ * @ctrl: Controller on which namespaces are being scanned
+ * @nsid: The NSID to scan
+ */
+struct async_scan_task {
+ struct compl_chain_entry chain_entry;
+ struct nvme_ctrl *ctrl;
+ u32 nsid;
+};
+
+static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info,
+ struct compl_chain_entry *cc_entry)
{
struct queue_limits lim = { };
struct nvme_ns *ns;
struct gendisk *disk;
int node = ctrl->numa_node;
bool last_path = false;
+ int r;
ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
@@ -4134,7 +4148,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
ns->ctrl = ctrl;
kref_init(&ns->kref);
- if (nvme_init_ns_head(ns, info))
+ /*
+ * Wait for the previous async task to finish before
+ * allocating the namespace.
+ */
+ if (cc_entry)
+ compl_chain_wait(cc_entry);
+
+ r = nvme_init_ns_head(ns, info);
+
+ if (cc_entry)
+ compl_chain_complete(cc_entry);
+
+ if (r)
goto out_cleanup_disk;
/*
@@ -4309,7 +4335,8 @@ static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
nvme_ns_remove(ns);
}
-static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned int nsid,
+ struct compl_chain_entry *cc_entry)
{
struct nvme_ns_info info = { .nsid = nsid };
struct nvme_ns *ns;
@@ -4348,40 +4375,30 @@ static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns = nvme_find_get_ns(ctrl, nsid);
if (ns) {
+ /* Release the chain early so the next task can proceed */
+ if (cc_entry)
+ compl_chain_complete(cc_entry);
nvme_validate_ns(ns, &info);
nvme_put_ns(ns);
} else {
- nvme_alloc_ns(ctrl, &info);
+ nvme_alloc_ns(ctrl, &info, cc_entry);
}
}
-/**
- * struct async_scan_info - keeps track of controller & NSIDs to scan
- * @ctrl: Controller on which namespaces are being scanned
- * @next_nsid: Index of next NSID to scan in ns_list
- * @ns_list: Pointer to list of NSIDs to scan
- *
- * Note: There is a single async_scan_info structure shared by all instances
- * of nvme_scan_ns_async() scanning a given controller, so the atomic
- * operations on next_nsid are critical to ensure each instance scans a unique
- * NSID.
- */
-struct async_scan_info {
- struct nvme_ctrl *ctrl;
- atomic_t next_nsid;
- __le32 *ns_list;
-};
-
static void nvme_scan_ns_async(void *data, async_cookie_t cookie)
{
- struct async_scan_info *scan_info = data;
- int idx;
- u32 nsid;
+ struct async_scan_task *task = data;
- idx = (u32)atomic_fetch_inc(&scan_info->next_nsid);
- nsid = le32_to_cpu(scan_info->ns_list[idx]);
+ nvme_scan_ns(task->ctrl, task->nsid, &task->chain_entry);
- nvme_scan_ns(scan_info->ctrl, nsid);
+ /*
+ * If the task failed early and returned without completing the
+ * chain entry, ensure the chain progresses safely.
+ */
+ if (compl_chain_pending(&task->chain_entry))
+ compl_chain_complete(&task->chain_entry);
+
+ kfree(task);
}
static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -4411,14 +4428,12 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)
u32 prev = 0;
int ret = 0, i;
ASYNC_DOMAIN(domain);
- struct async_scan_info scan_info;
+ struct async_scan_task *task;
ns_list = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
if (!ns_list)
return -ENOMEM;
- scan_info.ctrl = ctrl;
- scan_info.ns_list = ns_list;
for (;;) {
struct nvme_command cmd = {
.identify.opcode = nvme_admin_identify,
@@ -4434,20 +4449,30 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)
goto free;
}
- atomic_set(&scan_info.next_nsid, 0);
for (i = 0; i < nr_entries; i++) {
u32 nsid = le32_to_cpu(ns_list[i]);
if (!nsid) /* end of the list? */
goto out;
- async_schedule_domain(nvme_scan_ns_async, &scan_info,
+
+ task = kmalloc_obj(*task);
+ if (!task) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ task->nsid = nsid;
+ task->ctrl = ctrl;
+ compl_chain_add(&ctrl->scan_chain, &task->chain_entry);
+
+ async_schedule_domain(nvme_scan_ns_async, task,
&domain);
while (++prev < nsid)
nvme_ns_remove_by_nsid(ctrl, prev);
}
- async_synchronize_full_domain(&domain);
}
out:
+ async_synchronize_full_domain(&domain);
nvme_remove_invalid_namespaces(ctrl, prev);
free:
async_synchronize_full_domain(&domain);
@@ -4466,7 +4491,7 @@ static void nvme_scan_ns_sequential(struct nvme_ctrl *ctrl)
kfree(id);
for (i = 1; i <= nn; i++)
- nvme_scan_ns(ctrl, i);
+ nvme_scan_ns(ctrl, i, NULL);
nvme_remove_invalid_namespaces(ctrl, nn);
}
@@ -5094,6 +5119,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
mutex_init(&ctrl->scan_lock);
INIT_LIST_HEAD(&ctrl->namespaces);
+ compl_chain_init(&ctrl->scan_chain);
xa_init(&ctrl->cels);
ctrl->dev = dev;
ctrl->ops = ops;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a5f28c5103c..95f8c40ec86b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -17,6 +17,7 @@
#include <linux/wait.h>
#include <linux/t10-pi.h>
#include <linux/ratelimit_types.h>
+#include <linux/compl_chain.h>
#include <trace/events/block.h>
@@ -294,6 +295,7 @@ struct nvme_ctrl {
struct blk_mq_tag_set *tagset;
struct blk_mq_tag_set *admin_tagset;
struct list_head namespaces;
+ struct compl_chain scan_chain;
struct mutex namespaces_lock;
struct srcu_struct srcu;
struct device ctrl_device;
--
2.53.0
* Re: [PATCH V3 2/3] nvme-core: register namespaces in order during async scan
2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
@ 2026-02-25 21:37 ` kernel test robot
0 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2026-02-25 21:37 UTC (permalink / raw)
To: Maurizio Lombardi, kbusch
Cc: oe-kbuild-all, hch, hare, chaitanyak, bvanassche, linux-scsi,
linux-nvme, James.Bottomley, mlombard, jmeneghi, emilne, bgurney
Hi Maurizio,
kernel test robot noticed the following build warnings:
[auto build test WARNING on jejb-scsi/for-next]
[also build test WARNING on mkp-scsi/for-next linus/master v7.0-rc1 next-20260225]
[cannot apply to linux-nvme/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Maurizio-Lombardi/lib-Introduce-completion-chain-helper/20260226-001842
base: https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next
patch link: https://lore.kernel.org/r/20260225161203.76168-3-mlombard%40redhat.com
patch subject: [PATCH V3 2/3] nvme-core: register namespaces in order during async scan
config: x86_64-randconfig-161-20260226 (https://download.01.org/0day-ci/archive/20260226/202602260543.EHcJPG8y-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
smatch version: v0.5.0-8994-gd50c5a4c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260226/202602260543.EHcJPG8y-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602260543.EHcJPG8y-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: drivers/nvme/host/core.c:4117 struct member 'chain_entry' not described in 'async_scan_task'
>> Warning: drivers/nvme/host/core.c:4117 struct member 'chain_entry' not described in 'async_scan_task'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper
2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
2026-02-25 16:12 ` [PATCH V3 1/3] lib: Introduce completion chain helper Maurizio Lombardi
2026-02-25 16:12 ` [PATCH V3 2/3] nvme-core: register namespaces in order during async scan Maurizio Lombardi
@ 2026-02-25 16:12 ` Maurizio Lombardi
2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
3 siblings, 0 replies; 12+ messages in thread
From: Maurizio Lombardi @ 2026-02-25 16:12 UTC (permalink / raw)
To: kbusch
Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
James.Bottomley, mlombard, jmeneghi, emilne, bgurney
The asynchronous host scanning logic in scsi_scan.c uses a custom,
open-coded implementation to serialize scans. This involves a manually
managed list of tasks, each with its own completion, to ensure that hosts
are scanned and added to the system in a deterministic order.
Refactor the SCSI async scanning implementation to use the new compl_chain
helper. This simplifies the scsi_scan.c code and makes the serialization
logic more readable.
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
drivers/scsi/scsi_priv.h | 2 +-
drivers/scsi/scsi_scan.c | 68 +++++-----------------------------------
2 files changed, 9 insertions(+), 61 deletions(-)
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 7a193cc04e5b..274fdd7edac4 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -132,7 +132,7 @@ extern void scsi_exit_procfs(void);
/* scsi_scan.c */
void scsi_enable_async_suspend(struct device *dev);
-extern int scsi_complete_async_scans(void);
+void scsi_complete_async_scans(void);
extern int scsi_scan_host_selected(struct Scsi_Host *, unsigned int,
unsigned int, u64, enum scsi_scan_mode);
extern void scsi_forget_host(struct Scsi_Host *);
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 60c06fa4ec32..f19f2c73f042 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -36,6 +36,7 @@
#include <linux/async.h>
#include <linux/slab.h>
#include <linux/unaligned.h>
+#include <linux/compl_chain.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
@@ -112,14 +113,11 @@ MODULE_PARM_DESC(inq_timeout,
"Timeout (in seconds) waiting for devices to answer INQUIRY."
" Default is 20. Some devices may need more; most need less.");
-/* This lock protects only this list */
-static DEFINE_SPINLOCK(async_scan_lock);
-static LIST_HEAD(scanning_hosts);
+static DEFINE_COMPL_CHAIN(scanning_hosts);
struct async_scan_data {
- struct list_head list;
+ struct compl_chain_entry chain_entry;
struct Scsi_Host *shost;
- struct completion prev_finished;
};
/*
@@ -146,48 +144,10 @@ void scsi_enable_async_suspend(struct device *dev)
* started scanning after this function was called may or may not have
* finished.
*/
-int scsi_complete_async_scans(void)
+void scsi_complete_async_scans(void)
{
- struct async_scan_data *data;
-
- do {
- scoped_guard(spinlock, &async_scan_lock)
- if (list_empty(&scanning_hosts))
- return 0;
- /* If we can't get memory immediately, that's OK. Just
- * sleep a little. Even if we never get memory, the async
- * scans will finish eventually.
- */
- data = kmalloc(sizeof(*data), GFP_KERNEL);
- if (!data)
- msleep(1);
- } while (!data);
-
- data->shost = NULL;
- init_completion(&data->prev_finished);
-
- spin_lock(&async_scan_lock);
- /* Check that there's still somebody else on the list */
- if (list_empty(&scanning_hosts))
- goto done;
- list_add_tail(&data->list, &scanning_hosts);
- spin_unlock(&async_scan_lock);
-
printk(KERN_INFO "scsi: waiting for bus probes to complete ...\n");
- wait_for_completion(&data->prev_finished);
-
- spin_lock(&async_scan_lock);
- list_del(&data->list);
- if (!list_empty(&scanning_hosts)) {
- struct async_scan_data *next = list_entry(scanning_hosts.next,
- struct async_scan_data, list);
- complete(&next->prev_finished);
- }
- done:
- spin_unlock(&async_scan_lock);
-
- kfree(data);
- return 0;
+ compl_chain_flush(&scanning_hosts);
}
/**
@@ -1960,18 +1920,13 @@ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost)
data->shost = scsi_host_get(shost);
if (!data->shost)
goto err;
- init_completion(&data->prev_finished);
spin_lock_irqsave(shost->host_lock, flags);
shost->async_scan = 1;
spin_unlock_irqrestore(shost->host_lock, flags);
mutex_unlock(&shost->scan_mutex);
- spin_lock(&async_scan_lock);
- if (list_empty(&scanning_hosts))
- complete(&data->prev_finished);
- list_add_tail(&data->list, &scanning_hosts);
- spin_unlock(&async_scan_lock);
+ compl_chain_add(&scanning_hosts, &data->chain_entry);
return data;
@@ -2008,7 +1963,7 @@ static void scsi_finish_async_scan(struct async_scan_data *data)
return;
}
- wait_for_completion(&data->prev_finished);
+ compl_chain_wait(&data->chain_entry);
scsi_sysfs_add_devices(shost);
@@ -2018,14 +1973,7 @@ static void scsi_finish_async_scan(struct async_scan_data *data)
mutex_unlock(&shost->scan_mutex);
- spin_lock(&async_scan_lock);
- list_del(&data->list);
- if (!list_empty(&scanning_hosts)) {
- struct async_scan_data *next = list_entry(scanning_hosts.next,
- struct async_scan_data, list);
- complete(&next->prev_finished);
- }
- spin_unlock(&async_scan_lock);
+ compl_chain_complete(&data->chain_entry);
scsi_autopm_put_host(shost);
scsi_host_put(shost);
--
2.53.0
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-02-25 16:12 [PATCH V3 0/3] Ensure ordered namespace registration during async scan Maurizio Lombardi
` (2 preceding siblings ...)
2026-02-25 16:12 ` [PATCH V3 3/3] scsi: Convert async scanning to use the completion chain helper Maurizio Lombardi
@ 2026-02-25 21:41 ` Keith Busch
2026-02-26 8:07 ` Maurizio Lombardi
3 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-02-25 21:41 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
James.Bottomley, mlombard, jmeneghi, emilne, bgurney
On Wed, Feb 25, 2026 at 05:12:00PM +0100, Maurizio Lombardi wrote:
> The NVMe fully asynchronous namespace scanning introduced in
> commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
> significantly improved discovery times. However, it also introduced
> non-deterministic ordering for namespace registration.
>
> While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
> across reboots, this unpredictable ordering has caused considerable user
> confusion and has been perceived as a regression, leading to multiple bug
> reports.
The nvme-pci driver also probes the controllers asynchronously, which
can also create non-deterministic names. Is that part not a problem?
Just on the suffix part of the namespace's block handle, I have a
potential alternate suggestion here. The instance names pulled from the
ida guarantee we'll always have unique names for the lifetime of the
backing kobject. I introduced that a while ago, but I'm testing this out
now and it seems kobject_del is sufficient to reuse that name. The
driver already did that to all the objects when deleting the namespace,
so there doesn't appear to be a reason to wait for the final
kobject_put.
What I'm saying is I may have been mistaken about the naming collision
issues and we can just use the head's ns_id to get a consistent and
meaningful name based off the backing namespaces. There's some unlikely
races with multipath at the moment if we did use ns_id, but I think
they're all fixable.
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-02-25 21:41 ` [PATCH V3 0/3] Ensure ordered namespace registration during async scan Keith Busch
@ 2026-02-26 8:07 ` Maurizio Lombardi
2026-02-26 15:09 ` Keith Busch
2026-02-26 16:35 ` John Meneghini
0 siblings, 2 replies; 12+ messages in thread
From: Maurizio Lombardi @ 2026-02-26 8:07 UTC (permalink / raw)
To: Keith Busch, Maurizio Lombardi
Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
James.Bottomley, mlombard, jmeneghi, emilne, bgurney
On Wed Feb 25, 2026 at 10:41 PM CET, Keith Busch wrote:
> On Wed, Feb 25, 2026 at 05:12:00PM +0100, Maurizio Lombardi wrote:
>> The NVMe fully asynchronous namespace scanning introduced in
>> commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
>> significantly improved discovery times. However, it also introduced
>> non-deterministic ordering for namespace registration.
>>
>> While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
>> across reboots, this unpredictable ordering has caused considerable user
>> confusion and has been perceived as a regression, leading to multiple bug
>> reports.
>
> The nvme-pci driver also probes the controllers asynchronously, which
> can also create non-determinisitic names. Is that part not a problem?
Potentially, it is. The difference is that so far no one ever complained
about it, while with namespace async scanning we immediately received regression
reports, to the point we had to revert the changes and restore the
sequential namespace scan in RHEL.
>
> Just on the suffix part of the namespace's block handle, I have a
> potential alternate suggestion here. The instance names pulled from the
> ida guarantee we'll always have unique names for the lifetime of the
> backing kobject. I introduced that a while ago, but I'm testing this out
> now and it seems kobject_del is sufficient to reuse that name. The
> driver already did that to all the objects when deleting the namespace,
> so there doesn't appear to be a reason to wait for the final
> kobject_put.
>
> What I'm saying is I may have been mistaken about the naming collision
> issues and we can just use the head's ns_id to get a consistent and
> meaningful name based off the backing namespaces. There's some unlikely
> races with multipath at the moment if we did use ns_id, but I think
> they're all fixable.
Ok, so you'd like to use the namespace's NSID as the suffix.
I also considered this approach; the reason I didn't implement
it is that I wanted to keep the async namespace scan performance improvements
while preserving the same enumeration we had for years with the sequential scan:
Before the introduction of the async scan, /dev/nvme0n1 always pointed
to the first entry of the NSID list, /dev/nvme0n2 to the second
entry and so on.
With your proposal, if a user has sparse NSIDs (1, 10, 333)
then he will get /dev/nvme0n1, /dev/nvme0n10, /dev/nvme0n333.
On one hand, yes, they are "more stable" and more meaningful too;
on the other hand, this breaks the assumption of contiguous naming.
This might not be a problem for the mainline kernel, but I suspect we
will have people complaining again that the /dev/nvmeXnY enumeration changed.
Maurizio
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-02-26 8:07 ` Maurizio Lombardi
@ 2026-02-26 15:09 ` Keith Busch
2026-02-26 16:35 ` John Meneghini
1 sibling, 0 replies; 12+ messages in thread
From: Keith Busch @ 2026-02-26 15:09 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: Maurizio Lombardi, hch, hare, chaitanyak, bvanassche, linux-scsi,
linux-nvme, James.Bottomley, jmeneghi, emilne, bgurney
On Thu, Feb 26, 2026 at 09:07:10AM +0100, Maurizio Lombardi wrote:
> With your proposal, if a user has sparse NSIDs (1, 10, 333)
> then he will get /dev/nvme0n1, /dev/nvme0n10, /dev/nvme0n333.
> On one hand, yes, they are "more stable" and more meaningful too,
> on the other hand this breaks the assumption of contiguous naming.
> This might not be a problem for the mainline kernel, but I suspect we
> will have people complaining again that the /dev/nvmeXnY enumeration changed
The bonus of using the nsid is that it will always enumerate with the
same name even after you alter the other attached namespaces.
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-02-26 8:07 ` Maurizio Lombardi
2026-02-26 15:09 ` Keith Busch
@ 2026-02-26 16:35 ` John Meneghini
2026-02-26 18:15 ` Keith Busch
1 sibling, 1 reply; 12+ messages in thread
From: John Meneghini @ 2026-02-26 16:35 UTC (permalink / raw)
To: Maurizio Lombardi, Keith Busch, Maurizio Lombardi
Cc: hch, hare, chaitanyak, bvanassche, linux-scsi, linux-nvme,
James.Bottomley, emilne, bgurney
On 2/26/26 3:07 AM, Maurizio Lombardi wrote:
> On Wed Feb 25, 2026 at 10:41 PM CET, Keith Busch wrote:
>> On Wed, Feb 25, 2026 at 05:12:00PM +0100, Maurizio Lombardi wrote:
>>> The NVMe fully asynchronous namespace scanning introduced in
>>> commit 4e893ca81170 ("nvme-core: scan namespaces asynchronously")
>>> significantly improved discovery times. However, it also introduced
>>> non-deterministic ordering for namespace registration.
>>>
>>> While kernel device names (/dev/nvmeXnY) are not guaranteed to be stable
>>> across reboots, this unpredictable ordering has caused considerable user
>>> confusion and has been perceived as a regression, leading to multiple bug
>>> reports.
>>
>> The nvme-pci driver also probes the controllers asynchronously, which
>> can also create non-determinisitic names. Is that part not a problem?
>
> Potentially, it is. The difference is that so far no one ever complained
> about it, while with namespace async scanning we immediately received regression
> reports, to the point we had to revert the changes and restore the
> sequential namespaces scan in RHEL.
It's worse than this. Yes, in RHEL we carry out-of-tree patches to turn off the async scanning with SCSI,
and we reverted this async namespace scanning patch in NVMe.
We had to do this because, as soon as we turned these async scanning mechanisms on, we immediately
received customer escalations. Customers were not able to upgrade their systems. We have customer issues
and complaints open about this and we see this async namespace scanning as a barrier to adoption with NVMe -
especially with NVMe-oF, which tends to have many more namespaces than PCIe.
We've talked about this at LSF/MM - more than once - and several solutions have been proposed in the past,
but nothing ever happened.
And yes, the PCIe async discovery stuff does cause some problems. The difference is: the PCIe bus configuration does
not change nearly as often as, e.g., the NVMe namespace configuration in a fabric, so customers don't notice the changing PCI IDs.
Unless someone is doing lots of hot-unplugging and plugging on their PCI bus, the PCI IDs typically don't change at all.
So from boot to boot, PCI IDs don't usually change. This async namespace scanning causes the namespace IDs to change with every reboot, especially on
a system with hundreds of NVMe-oF namespaces.
So we really need this change, or something like it, to be accepted upstream.
/John
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-02-26 16:35 ` John Meneghini
@ 2026-02-26 18:15 ` Keith Busch
2026-03-02 7:16 ` Hannes Reinecke
0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-02-26 18:15 UTC (permalink / raw)
To: John Meneghini
Cc: Maurizio Lombardi, Maurizio Lombardi, hch, hare, chaitanyak,
bvanassche, linux-scsi, linux-nvme, James.Bottomley, emilne,
bgurney
On Thu, Feb 26, 2026 at 11:35:15AM -0500, John Meneghini wrote:
> It's worse than this. Yes, in RHEL we carry out-of-tree patches to turn off the async scanning with SCSI,
> and we reverted this async namespace scanning patch in NVMe.
>
> We had to do this because, as soon as we turned these async scanning mechanisms on, we immediately
> received customer escalations. Customers were not able to upgrade their systems. We have customer issues
> and complaints open about this, and we see this async namespace scanning as a barrier to adoption with NVMe -
> especially with NVMe-oF, which tends to have many more namespaces than PCIe.
Sounds like some people just don't know how to use labels or persistent
names. Relying on /dev/nvmeXnY or /dev/sdX to always be a handle to the
same device is a fragile solution.
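In practice that means addressing storage by identity rather than by kernel name, via the stable links udev creates under /dev/disk/. The link names and the UUID below are made-up examples for illustration only; on a real system they come from `ls -l /dev/disk/by-id/` and `blkid`:

```
# udev creates stable links keyed to device identity, not probe order
# (example names; real IDs differ per device):
#   /dev/disk/by-id/nvme-eui.0025388b71234567 -> ../../nvme0n1
#   /dev/disk/by-uuid/2f1d43a8-9c1e-4f52-b1aa-0123456789ab -> ../../nvme0n1p1
#
# An /etc/fstab entry keyed by UUID survives any renumbering:
UUID=2f1d43a8-9c1e-4f52-b1aa-0123456789ab  /data  xfs  defaults  0 0
```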
> And yes, the PCIe async discovery stuff does cause some problems. The difference is: the PCIe bus configuration does
> not change nearly as often as, e.g., the nvme namespace configuration in a fabric, so customers don't notice the changing PCI ids.
> Unless someone is doing lots of hot unplugging and plugging on their PCI bus, the PCI ids typically don't change at all.
It's not about the PCI topology changing. The async probe makes it
non-deterministic as to which PCI device is going to claim which
instance out of the nvme ida since they all try to run concurrently.
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-02-26 18:15 ` Keith Busch
@ 2026-03-02 7:16 ` Hannes Reinecke
2026-03-02 17:12 ` Keith Busch
0 siblings, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2026-03-02 7:16 UTC (permalink / raw)
To: Keith Busch, John Meneghini
Cc: Maurizio Lombardi, Maurizio Lombardi, hch, chaitanyak, bvanassche,
linux-scsi, linux-nvme, James.Bottomley, emilne, bgurney
On 2/26/26 19:15, Keith Busch wrote:
> On Thu, Feb 26, 2026 at 11:35:15AM -0500, John Meneghini wrote:
>> It's worse than this. Yes, in RHEL we carry out-of-tree patches to turn off the async scanning with SCSI,
>> and we reverted this async namespace scanning patch in NVMe.
>>
>> We had to do this because, as soon as we turned these async scanning mechanisms on, we immediately
>> received customer escalations. Customers were not able to upgrade their systems. We have customer issues
>> and complaints open about this, and we see this async namespace scanning as a barrier to adoption with NVMe -
>> especially with NVMe-oF, which tends to have many more namespaces than PCIe.
>
> Sounds like some people just don't know how to use labels or persistent
> names. Relying on /dev/nvmeXnY or /dev/sdX to always be a handle to the
> same device is a fragile solution.
>
Yeah. We went through this (admittedly rather painful) process quite
some time back for SLES (with the switch from SLES12 to SLES15, if memory
serves correctly). Since then our customers seem to be happy with using
persistent device links.
>> And yes, the PCIe async discovery stuff does cause some problems. The difference is: the PCIe bus configuration does
>> not change nearly as often as, e.g., the nvme namespace configuration in a fabric, so customers don't notice the changing PCI ids.
>> Unless someone is doing lots of hot unplugging and plugging on their PCI bus, the PCI ids typically don't change at all.
>
> It's not about the PCI topology changing. The async probe makes it
> non-deterministic as to which PCI device is going to claim which
> instance out of the nvme ida since they all try to run concurrently.
I would really like to go with the nsid-based solution from Keith.
That would avoid quite a bit of cumbersome code here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH V3 0/3] Ensure ordered namespace registration during async scan
2026-03-02 7:16 ` Hannes Reinecke
@ 2026-03-02 17:12 ` Keith Busch
0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2026-03-02 17:12 UTC (permalink / raw)
To: Hannes Reinecke
Cc: John Meneghini, Maurizio Lombardi, Maurizio Lombardi, hch,
chaitanyak, bvanassche, linux-scsi, linux-nvme, James.Bottomley,
emilne, bgurney
On Mon, Mar 02, 2026 at 08:16:19AM +0100, Hannes Reinecke wrote:
> I would really like to go with the nsid-based solution from Keith.
> That would avoid quite a bit of cumbersome code here.
I've seen various documentation that assumes the current naming
indicates the nsid, so the scheme follows at least some people's
expectations. I don't know if we can make everyone happy here, though.
:(
I've fixed up most of the multipath races that get us closer to allowing
nsid suffix, but there's one left: nvme_remove_head is called outside
the subsys lock after detaching the head from the subsystem list. That
could cause a subsequent add event to call nvme_alloc_ns() before the
mpath side has completed del_gendisk() for the old nsid.