* [PATCH 1/3] nvme: failover requests for inactive hctx
2026-02-26 13:40 [PATCH 0/3] block: revert avoid acquiring cpu hotplug lock in group_cpus_evenly Daniel Wagner
@ 2026-02-26 13:40 ` Daniel Wagner
2026-02-26 19:09 ` kernel test robot
2026-02-26 23:55 ` kernel test robot
2026-02-26 13:40 ` [PATCH 2/3] blk-mq: add handshake for offlining hw queues Daniel Wagner
2026-02-26 13:40 ` [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly" Daniel Wagner
2 siblings, 2 replies; 10+ messages in thread
From: Daniel Wagner @ 2026-02-26 13:40 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch, Jens Axboe, Ming Lei
Cc: Guangwu Zhang, Chengming Zhou, Thomas Gleixner, linux-nvme,
linux-kernel, linux-block, Daniel Wagner
When the ctrl is not in the LIVE state, a hardware queue can be in the
INACTIVE state due to CPU hotplug offlining operations. In this case,
the driver freezes and quiesces the request queue and does not expect
new requests to enter via queue_rq. Although such a request will fail
eventually, shortcut the path and fail it earlier.
Check whether a request targets an inactive hardware queue and, if so,
use nvme_failover_req to hand it back to the block layer.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
drivers/nvme/host/core.c | 55 ++++++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/multipath.c | 43 ---------------------------------
drivers/nvme/host/nvme.h | 1 -
3 files changed, 54 insertions(+), 45 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f5ebcaa2f859..e84df1a2d321 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -454,6 +454,51 @@ void nvme_end_req(struct request *req)
blk_mq_end_request(req, status);
}
+static void nvme_failover_req(struct request *req)
+{
+ struct nvme_ns *ns = req->q->queuedata;
+ u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
+ unsigned long flags;
+ struct bio *bio;
+
+ if (nvme_ns_head_multipath(ns->head))
+ nvme_mpath_clear_current_path(ns);
+
+ /*
+ * If we got back an ANA error, we know the controller is alive but not
+ * ready to serve this namespace. Kick off a re-read of the ANA
+ * information page, and just try any other available path for now.
+ */
+ if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
+ set_bit(NVME_NS_ANA_PENDING, &ns->flags);
+ queue_work(nvme_wq, &ns->ctrl->ana_work);
+ }
+
+ spin_lock_irqsave(&ns->head->requeue_lock, flags);
+ for (bio = req->bio; bio; bio = bio->bi_next) {
+ if (nvme_ns_head_multipath(ns->head))
+ bio_set_dev(bio, ns->head->disk->part0);
+ if (bio->bi_opf & REQ_POLLED) {
+ bio->bi_opf &= ~REQ_POLLED;
+ bio->bi_cookie = BLK_QC_T_NONE;
+ }
+ /*
+ * The alternate request queue that we may end up submitting
+ * the bio to may be frozen temporarily, in this case REQ_NOWAIT
+ * will fail the I/O immediately with EAGAIN to the issuer.
+ * We are not in the issuer context which cannot block. Clear
+ * the flag to avoid spurious EAGAIN I/O failures.
+ */
+ bio->bi_opf &= ~REQ_NOWAIT;
+ }
+ blk_steal_bios(&ns->head->requeue_list, req);
+ spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
+
+ nvme_req(req)->status = 0;
+ nvme_end_req(req);
+ kblockd_schedule_work(&ns->head->requeue_work);
+}
+
void nvme_complete_rq(struct request *req)
{
struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
@@ -762,8 +807,13 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
state != NVME_CTRL_DELETING &&
state != NVME_CTRL_DEAD &&
!test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
- !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
+ !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH)) {
+ if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)) {
+ nvme_failover_req(rq);
+ return BLK_STS_OK;
+ }
return BLK_STS_RESOURCE;
+ }
if (!(rq->rq_flags & RQF_DONTPREP))
nvme_clear_nvme_request(rq);
@@ -809,6 +859,9 @@ bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
}
}
+ if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))
+ return false;
+
return queue_live;
}
EXPORT_SYMBOL_GPL(__nvme_check_ready);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 174027d1cc19..cce3a23f6de5 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -134,49 +134,6 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
blk_freeze_queue_start(h->disk->queue);
}
-void nvme_failover_req(struct request *req)
-{
- struct nvme_ns *ns = req->q->queuedata;
- u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
- unsigned long flags;
- struct bio *bio;
-
- nvme_mpath_clear_current_path(ns);
-
- /*
- * If we got back an ANA error, we know the controller is alive but not
- * ready to serve this namespace. Kick of a re-read of the ANA
- * information page, and just try any other available path for now.
- */
- if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
- set_bit(NVME_NS_ANA_PENDING, &ns->flags);
- queue_work(nvme_wq, &ns->ctrl->ana_work);
- }
-
- spin_lock_irqsave(&ns->head->requeue_lock, flags);
- for (bio = req->bio; bio; bio = bio->bi_next) {
- bio_set_dev(bio, ns->head->disk->part0);
- if (bio->bi_opf & REQ_POLLED) {
- bio->bi_opf &= ~REQ_POLLED;
- bio->bi_cookie = BLK_QC_T_NONE;
- }
- /*
- * The alternate request queue that we may end up submitting
- * the bio to may be frozen temporarily, in this case REQ_NOWAIT
- * will fail the I/O immediately with EAGAIN to the issuer.
- * We are not in the issuer context which cannot block. Clear
- * the flag to avoid spurious EAGAIN I/O failures.
- */
- bio->bi_opf &= ~REQ_NOWAIT;
- }
- blk_steal_bios(&ns->head->requeue_list, req);
- spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
-
- nvme_req(req)->status = 0;
- nvme_end_req(req);
- kblockd_schedule_work(&ns->head->requeue_work);
-}
-
void nvme_mpath_start_request(struct request *rq)
{
struct nvme_ns *ns = rq->q->queuedata;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a5f28c5103c..dbd063413da9 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -967,7 +967,6 @@ void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys);
-void nvme_failover_req(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_sysfs_link(struct nvme_ns_head *ns);
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] nvme: failover requests for inactive hctx
2026-02-26 13:40 ` [PATCH 1/3] nvme: failover requests for inactive hctx Daniel Wagner
@ 2026-02-26 19:09 ` kernel test robot
2026-02-26 23:55 ` kernel test robot
1 sibling, 0 replies; 10+ messages in thread
From: kernel test robot @ 2026-02-26 19:09 UTC (permalink / raw)
To: Daniel Wagner, Christoph Hellwig, Keith Busch, Jens Axboe,
Ming Lei
Cc: llvm, oe-kbuild-all, Guangwu Zhang, Chengming Zhou,
Thomas Gleixner, linux-nvme, linux-kernel, linux-block,
Daniel Wagner
Hi Daniel,
kernel test robot noticed the following build errors:
[auto build test ERROR on 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f]
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Wagner/nvme-failover-requests-for-inactive-hctx/20260226-224213
base: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
patch link: https://lore.kernel.org/r/20260226-revert-cpu-read-lock-v1-1-eb005072566e%40kernel.org
patch subject: [PATCH 1/3] nvme: failover requests for inactive hctx
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20260227/202602270348.j0MMNhUj-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9a109fbb6e184ec9bcce10615949f598f4c974a9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260227/202602270348.j0MMNhUj-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602270348.j0MMNhUj-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/nvme/host/core.c:457:13: error: redefinition of 'nvme_failover_req'
457 | static void nvme_failover_req(struct request *req)
| ^
drivers/nvme/host/nvme.h:1020:20: note: previous definition is here
1020 | static inline void nvme_failover_req(struct request *req)
| ^
>> drivers/nvme/host/core.c:472:45: error: no member named 'ana_log_buf' in 'struct nvme_ctrl'
472 | if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
| ~~~~~~~~ ^
>> drivers/nvme/host/core.c:474:34: error: no member named 'ana_work' in 'struct nvme_ctrl'
474 | queue_work(nvme_wq, &ns->ctrl->ana_work);
| ~~~~~~~~ ^
>> drivers/nvme/host/core.c:477:31: error: no member named 'requeue_lock' in 'struct nvme_ns_head'
477 | spin_lock_irqsave(&ns->head->requeue_lock, flags);
| ~~~~~~~~ ^
include/linux/spinlock.h:376:39: note: expanded from macro 'spin_lock_irqsave'
376 | raw_spin_lock_irqsave(spinlock_check(lock), flags); \
| ^~~~
include/linux/spinlock.h:244:34: note: expanded from macro 'raw_spin_lock_irqsave'
244 | flags = _raw_spin_lock_irqsave(lock); \
| ^~~~
>> drivers/nvme/host/core.c:494:28: error: no member named 'requeue_list' in 'struct nvme_ns_head'
494 | blk_steal_bios(&ns->head->requeue_list, req);
| ~~~~~~~~ ^
drivers/nvme/host/core.c:495:36: error: no member named 'requeue_lock' in 'struct nvme_ns_head'
495 | spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
| ~~~~~~~~ ^
>> drivers/nvme/host/core.c:499:35: error: no member named 'requeue_work' in 'struct nvme_ns_head'
499 | kblockd_schedule_work(&ns->head->requeue_work);
| ~~~~~~~~ ^
7 errors generated.
vim +472 drivers/nvme/host/core.c
456
457 static void nvme_failover_req(struct request *req)
458 {
459 struct nvme_ns *ns = req->q->queuedata;
460 u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
461 unsigned long flags;
462 struct bio *bio;
463
464 if (nvme_ns_head_multipath(ns->head))
465 nvme_mpath_clear_current_path(ns);
466
467 /*
468 * If we got back an ANA error, we know the controller is alive but not
469 * ready to serve this namespace. Kick of a re-read of the ANA
470 * information page, and just try any other available path for now.
471 */
> 472 if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
473 set_bit(NVME_NS_ANA_PENDING, &ns->flags);
> 474 queue_work(nvme_wq, &ns->ctrl->ana_work);
475 }
476
> 477 spin_lock_irqsave(&ns->head->requeue_lock, flags);
478 for (bio = req->bio; bio; bio = bio->bi_next) {
479 if (nvme_ns_head_multipath(ns->head))
480 bio_set_dev(bio, ns->head->disk->part0);
481 if (bio->bi_opf & REQ_POLLED) {
482 bio->bi_opf &= ~REQ_POLLED;
483 bio->bi_cookie = BLK_QC_T_NONE;
484 }
485 /*
486 * The alternate request queue that we may end up submitting
487 * the bio to may be frozen temporarily, in this case REQ_NOWAIT
488 * will fail the I/O immediately with EAGAIN to the issuer.
489 * We are not in the issuer context which cannot block. Clear
490 * the flag to avoid spurious EAGAIN I/O failures.
491 */
492 bio->bi_opf &= ~REQ_NOWAIT;
493 }
> 494 blk_steal_bios(&ns->head->requeue_list, req);
495 spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
496
497 nvme_req(req)->status = 0;
498 nvme_end_req(req);
> 499 kblockd_schedule_work(&ns->head->requeue_work);
500 }
501
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] nvme: failover requests for inactive hctx
2026-02-26 13:40 ` [PATCH 1/3] nvme: failover requests for inactive hctx Daniel Wagner
2026-02-26 19:09 ` kernel test robot
@ 2026-02-26 23:55 ` kernel test robot
1 sibling, 0 replies; 10+ messages in thread
From: kernel test robot @ 2026-02-26 23:55 UTC (permalink / raw)
To: Daniel Wagner, Christoph Hellwig, Keith Busch, Jens Axboe,
Ming Lei
Cc: oe-kbuild-all, Guangwu Zhang, Chengming Zhou, Thomas Gleixner,
linux-nvme, linux-kernel, linux-block, Daniel Wagner
Hi Daniel,
kernel test robot noticed the following build errors:
[auto build test ERROR on 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f]
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Wagner/nvme-failover-requests-for-inactive-hctx/20260226-224213
base: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
patch link: https://lore.kernel.org/r/20260226-revert-cpu-read-lock-v1-1-eb005072566e%40kernel.org
patch subject: [PATCH 1/3] nvme: failover requests for inactive hctx
config: x86_64-randconfig-r071-20260227 (https://download.01.org/0day-ci/archive/20260227/202602270720.cugNS3m1-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
smatch version: v0.5.0-8994-gd50c5a4c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260227/202602270720.cugNS3m1-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602270720.cugNS3m1-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/nvme/host/core.c:457:13: error: redefinition of 'nvme_failover_req'
457 | static void nvme_failover_req(struct request *req)
| ^~~~~~~~~~~~~~~~~
In file included from drivers/nvme/host/core.c:27:
drivers/nvme/host/nvme.h:1020:20: note: previous definition of 'nvme_failover_req' with type 'void(struct request *)'
1020 | static inline void nvme_failover_req(struct request *req)
| ^~~~~~~~~~~~~~~~~
drivers/nvme/host/core.c: In function 'nvme_failover_req':
>> drivers/nvme/host/core.c:472:50: error: 'struct nvme_ctrl' has no member named 'ana_log_buf'
472 | if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
| ^~
>> drivers/nvme/host/core.c:474:48: error: 'struct nvme_ctrl' has no member named 'ana_work'; did you mean 'ka_work'?
474 | queue_work(nvme_wq, &ns->ctrl->ana_work);
| ^~~~~~~~
| ka_work
In file included from include/linux/sched.h:37,
from include/linux/ratelimit.h:6,
from include/linux/dev_printk.h:16,
from include/linux/device.h:15,
from include/linux/async.h:14,
from drivers/nvme/host/core.c:7:
>> drivers/nvme/host/core.c:477:36: error: 'struct nvme_ns_head' has no member named 'requeue_lock'
477 | spin_lock_irqsave(&ns->head->requeue_lock, flags);
| ^~
include/linux/spinlock.h:244:48: note: in definition of macro 'raw_spin_lock_irqsave'
244 | flags = _raw_spin_lock_irqsave(lock); \
| ^~~~
drivers/nvme/host/core.c:477:9: note: in expansion of macro 'spin_lock_irqsave'
477 | spin_lock_irqsave(&ns->head->requeue_lock, flags);
| ^~~~~~~~~~~~~~~~~
>> drivers/nvme/host/core.c:494:33: error: 'struct nvme_ns_head' has no member named 'requeue_list'
494 | blk_steal_bios(&ns->head->requeue_list, req);
| ^~
drivers/nvme/host/core.c:495:41: error: 'struct nvme_ns_head' has no member named 'requeue_lock'
495 | spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
| ^~
>> drivers/nvme/host/core.c:499:40: error: 'struct nvme_ns_head' has no member named 'requeue_work'
499 | kblockd_schedule_work(&ns->head->requeue_work);
| ^~
vim +/nvme_failover_req +457 drivers/nvme/host/core.c
456
> 457 static void nvme_failover_req(struct request *req)
458 {
459 struct nvme_ns *ns = req->q->queuedata;
460 u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
461 unsigned long flags;
462 struct bio *bio;
463
464 if (nvme_ns_head_multipath(ns->head))
465 nvme_mpath_clear_current_path(ns);
466
467 /*
468 * If we got back an ANA error, we know the controller is alive but not
469 * ready to serve this namespace. Kick of a re-read of the ANA
470 * information page, and just try any other available path for now.
471 */
> 472 if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
473 set_bit(NVME_NS_ANA_PENDING, &ns->flags);
> 474 queue_work(nvme_wq, &ns->ctrl->ana_work);
475 }
476
> 477 spin_lock_irqsave(&ns->head->requeue_lock, flags);
478 for (bio = req->bio; bio; bio = bio->bi_next) {
479 if (nvme_ns_head_multipath(ns->head))
480 bio_set_dev(bio, ns->head->disk->part0);
481 if (bio->bi_opf & REQ_POLLED) {
482 bio->bi_opf &= ~REQ_POLLED;
483 bio->bi_cookie = BLK_QC_T_NONE;
484 }
485 /*
486 * The alternate request queue that we may end up submitting
487 * the bio to may be frozen temporarily, in this case REQ_NOWAIT
488 * will fail the I/O immediately with EAGAIN to the issuer.
489 * We are not in the issuer context which cannot block. Clear
490 * the flag to avoid spurious EAGAIN I/O failures.
491 */
492 bio->bi_opf &= ~REQ_NOWAIT;
493 }
> 494 blk_steal_bios(&ns->head->requeue_list, req);
495 spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
496
497 nvme_req(req)->status = 0;
498 nvme_end_req(req);
> 499 kblockd_schedule_work(&ns->head->requeue_work);
500 }
501
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 2/3] blk-mq: add handshake for offlining hw queues
2026-02-26 13:40 [PATCH 0/3] block: revert avoid acquiring cpu hotplug lock in group_cpus_evenly Daniel Wagner
2026-02-26 13:40 ` [PATCH 1/3] nvme: failover requests for inactive hctx Daniel Wagner
@ 2026-02-26 13:40 ` Daniel Wagner
2026-02-26 13:40 ` [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly" Daniel Wagner
2 siblings, 0 replies; 10+ messages in thread
From: Daniel Wagner @ 2026-02-26 13:40 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch, Jens Axboe, Ming Lei
Cc: Guangwu Zhang, Chengming Zhou, Thomas Gleixner, linux-nvme,
linux-kernel, linux-block, Daniel Wagner
The CPU hotplug offline handler in the block layer checks for any
in-flight requests on a CPU going offline. It prevents the CPU hotplug
state engine from progressing as long as there are pending requests.
This is done by checking for any allocated requests on the hardware
context that is going offline. The driver is responsible for completing
all in-flight requests.
However, the driver might be performing error recovery simultaneously.
Therefore, the request queue might be in a frozen or quiesced state. In
this case, requests may not make progress (see
blk_mq_sched_dispatch_requests for an example).
Introduce an explicit handshake protocol between the driver and the
block layer. This allows the driver to signal when it is safe to ignore
any remaining pending requests.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
block/blk-mq-debugfs.c | 1 +
block/blk-mq.c | 36 ++++++++++++++++++++++++++++++++++++
drivers/nvme/host/core.c | 28 ++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 2 ++
drivers/nvme/host/pci.c | 3 +++
include/linux/blk-mq.h | 3 +++
6 files changed, 73 insertions(+)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 28167c9baa55..a312cb6b6127 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -162,6 +162,7 @@ static const char *const hctx_state_name[] = {
HCTX_STATE_NAME(TAG_ACTIVE),
HCTX_STATE_NAME(SCHED_RESTART),
HCTX_STATE_NAME(INACTIVE),
+ HCTX_STATE_NAME(IDLE),
};
#undef HCTX_STATE_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9af8c3dec3f6..359f19b8238a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -112,6 +112,31 @@ void blk_mq_in_driver_rw(struct block_device *part, unsigned int inflight[2])
inflight[WRITE] = mi.inflight[WRITE];
}
+static void __blk_update_hw_queue_idle(struct request_queue *q, bool idle)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (idle)
+ set_bit(BLK_MQ_S_IDLE, &hctx->state);
+ else
+ clear_bit(BLK_MQ_S_IDLE, &hctx->state);
+ }
+}
+
+void blk_mq_set_hw_queues_idle(struct request_queue *q)
+{
+ __blk_update_hw_queue_idle(q, true);
+}
+EXPORT_SYMBOL_GPL(blk_mq_set_hw_queues_idle);
+
+void blk_mq_clear_hw_queues_idle(struct request_queue *q)
+{
+ __blk_update_hw_queue_idle(q, false);
+}
+EXPORT_SYMBOL_GPL(blk_mq_clear_hw_queues_idle);
+
#ifdef CONFIG_LOCKDEP
static bool blk_freeze_set_owner(struct request_queue *q,
struct task_struct *owner)
@@ -3679,6 +3704,17 @@ static bool blk_mq_has_request(struct request *rq, void *data)
if (rq->mq_hctx != iter_data->hctx)
return true;
+
+ /*
+ * The driver ensures that all hardware queue resources are freed, even
+ * if a request has a tag allocated to a CPU that is going offline. This
+ * applies to requests not yet handed to the hardware. Essentially those
+ * 'in-flight' between the block layer and the hardware (e.g., a request
+ * blocked because the queue is quiesced).
+ */
+ if (test_bit(BLK_MQ_S_IDLE, &iter_data->hctx->state))
+ return false;
+
iter_data->has_rq = true;
return false;
}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e84df1a2d321..1b736a58e467 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5354,6 +5354,34 @@ void nvme_unquiesce_admin_queue(struct nvme_ctrl *ctrl)
}
EXPORT_SYMBOL_GPL(nvme_unquiesce_admin_queue);
+static void __nvme_set_hw_queues_idle(struct nvme_ctrl *ctrl, bool idle)
+{
+ struct nvme_ns *ns;
+ int srcu_idx;
+
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
+ srcu_read_lock_held(&ctrl->srcu)) {
+ if (idle)
+ blk_mq_set_hw_queues_idle(ns->queue);
+ else
+ blk_mq_clear_hw_queues_idle(ns->queue);
+ }
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
+}
+
+void nvme_set_hw_queues_idle(struct nvme_ctrl *ctrl)
+{
+ __nvme_set_hw_queues_idle(ctrl, true);
+}
+EXPORT_SYMBOL_GPL(nvme_set_hw_queues_idle);
+
+void nvme_clear_hw_queues_idle(struct nvme_ctrl *ctrl)
+{
+ __nvme_set_hw_queues_idle(ctrl, false);
+}
+EXPORT_SYMBOL_GPL(nvme_clear_hw_queues_idle);
+
void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index dbd063413da9..d199009982f1 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -834,6 +834,8 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl);
void nvme_unquiesce_io_queues(struct nvme_ctrl *ctrl);
void nvme_quiesce_admin_queue(struct nvme_ctrl *ctrl);
void nvme_unquiesce_admin_queue(struct nvme_ctrl *ctrl);
+void nvme_set_hw_queues_idle(struct nvme_ctrl *ctrl);
+void nvme_clear_hw_queues_idle(struct nvme_ctrl *ctrl);
void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl);
void nvme_sync_queues(struct nvme_ctrl *ctrl);
void nvme_sync_io_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2f0c05719316..0097a4f71f97 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3211,6 +3211,8 @@ static void nvme_reset_work(struct work_struct *work)
nvme_unquiesce_admin_queue(&dev->ctrl);
mutex_unlock(&dev->shutdown_lock);
+ nvme_set_hw_queues_idle(&dev->ctrl);
+
/*
* Introduce CONNECTING state from nvme-fc/rdma transports to mark the
* initializing procedure here.
@@ -3243,6 +3245,7 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out;
+ nvme_clear_hw_queues_idle(&dev->ctrl);
/*
* Freeze and update the number of I/O queues as those might have
* changed. If there are no I/O queues left after this reset, keep the
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..8885e84a7889 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -721,6 +721,7 @@ enum {
BLK_MQ_S_SCHED_RESTART,
/* hw queue is inactive after all its CPUs become offline */
BLK_MQ_S_INACTIVE,
+ BLK_MQ_S_IDLE,
BLK_MQ_S_MAX
};
@@ -934,6 +935,8 @@ void blk_mq_stop_hw_queues(struct request_queue *q);
void blk_mq_start_hw_queues(struct request_queue *q);
void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
+void blk_mq_set_hw_queues_idle(struct request_queue *q);
+void blk_mq_clear_hw_queues_idle(struct request_queue *q);
void blk_mq_quiesce_queue(struct request_queue *q);
void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set);
void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly"
2026-02-26 13:40 [PATCH 0/3] block: revert avoid acquiring cpu hotplug lock in group_cpus_evenly Daniel Wagner
2026-02-26 13:40 ` [PATCH 1/3] nvme: failover requests for inactive hctx Daniel Wagner
2026-02-26 13:40 ` [PATCH 2/3] blk-mq: add handshake for offlining hw queues Daniel Wagner
@ 2026-02-26 13:40 ` Daniel Wagner
2026-02-26 14:04 ` Ming Lei
2 siblings, 1 reply; 10+ messages in thread
From: Daniel Wagner @ 2026-02-26 13:40 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch, Jens Axboe, Ming Lei
Cc: Guangwu Zhang, Chengming Zhou, Thomas Gleixner, linux-nvme,
linux-kernel, linux-block, Daniel Wagner
This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
The reason the lock was removed was that the nvme-pci driver reset
handler attempted to acquire the CPU read lock during CPU hotplug
offlining (holds the CPU write lock). Consequently, the block layer
offline notifier callback could not progress because in-flight requests
were detected.
Since then, in-flight detection has been improved, and the nvme-pci
driver now explicitly updates the hctx state when it is safe to ignore
detected in-flight requests. As a result, it's possible to reintroduce
the CPU read lock in group_cpus_evenly.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
lib/group_cpus.c | 21 +++++----------------
1 file changed, 5 insertions(+), 16 deletions(-)
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index e6e18d7a49bb..533c722b5c2c 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -510,25 +510,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
if (!masks)
goto fail_node_to_cpumask;
+ /* Stabilize the cpumasks */
+ cpus_read_lock();
build_node_to_cpumask(node_to_cpumask);
- /*
- * Make a local cache of 'cpu_present_mask', so the two stages
- * spread can observe consistent 'cpu_present_mask' without holding
- * cpu hotplug lock, then we can reduce deadlock risk with cpu
- * hotplug code.
- *
- * Here CPU hotplug may happen when reading `cpu_present_mask`, and
- * we can live with the case because it only affects that hotplug
- * CPU is handled in the 1st or 2nd stage, and either way is correct
- * from API user viewpoint since 2-stage spread is sort of
- * optimization.
- */
- cpumask_copy(npresmsk, data_race(cpu_present_mask));
-
/* grouping present CPUs first */
ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
+ cpu_present_mask, nmsk, masks);
if (ret < 0)
goto fail_node_to_cpumask;
nr_present = ret;
@@ -543,13 +531,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
curgrp = 0;
else
curgrp = nr_present;
- cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
+ cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
npresmsk, nmsk, masks);
if (ret >= 0)
nr_others = ret;
fail_node_to_cpumask:
+ cpus_read_unlock();
free_node_to_cpumask(node_to_cpumask);
fail_npresmsk:
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly"
2026-02-26 13:40 ` [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly" Daniel Wagner
@ 2026-02-26 14:04 ` Ming Lei
2026-03-02 14:04 ` Daniel Wagner
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2026-02-26 14:04 UTC (permalink / raw)
To: Daniel Wagner
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Guangwu Zhang,
Chengming Zhou, Thomas Gleixner, linux-nvme, linux-kernel,
linux-block
On Thu, Feb 26, 2026 at 02:40:37PM +0100, Daniel Wagner wrote:
> This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
>
> The reason the lock was removed was that the nvme-pci driver reset
> handler attempted to acquire the CPU read lock during CPU hotplug
> offlining (holds the CPU write lock). Consequently, the block layer
> offline notifier callback could not progress because in-flight requests
> were detected.
>
> Since then, in-flight detection has been improved, and the nvme-pci
> driver now explicitly updates the hctx state when it is safe to ignore
> detected in-flight requests. As a result, it's possible to reintroduce
> the CPU read lock in group_cpus_evenly.
Can you explain your motivation a bit? Especially since adding back the lock
makes the API hard to use. Any benefit?
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly"
2026-02-26 14:04 ` Ming Lei
@ 2026-03-02 14:04 ` Daniel Wagner
2026-03-02 14:12 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Wagner @ 2026-03-02 14:04 UTC (permalink / raw)
To: Ming Lei
Cc: Daniel Wagner, Christoph Hellwig, Keith Busch, Jens Axboe,
Guangwu Zhang, Chengming Zhou, Thomas Gleixner, linux-nvme,
linux-kernel, linux-block
Hi Ming,
Sorry for the late response. Last week the mail server did take a break...
On Thu, Feb 26, 2026 at 10:04:18PM +0800, Ming Lei wrote:
> On Thu, Feb 26, 2026 at 02:40:37PM +0100, Daniel Wagner wrote:
> > This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
> >
> > The reason the lock was removed was that the nvme-pci driver reset
> > handler attempted to acquire the CPU read lock during CPU hotplug
> > offlining (holds the CPU write lock). Consequently, the block layer
> > offline notifier callback could not progress because in-flight requests
> > were detected.
> >
> > Since then, in-flight detection has been improved, and the nvme-pci
> > driver now explicitly updates the hctx state when it is safe to ignore
> > detected in-flight requests. As a result, it's possible to reintroduce
> > the CPU read lock in group_cpus_evenly.
>
> Can you explain your motivation a bit? Especially since adding back the lock
> makes the API hard to use. Any benefit?
Sure, I would like to add the lock back to group_cpus_evenly so it's
possible to add support for the isolcpus use case. For isolcpus,
it's necessary to access the cpu_online_mask when creating a
housekeeping cpu mask. I failed to find a good solution which doesn't
introduce horrible hacks (see Thomas' feedback on this [1]).
Anyway, I am not totally set on this solution, but having a proper
lock in this code path would make the isolcpus extension way cleaner, I
think.
What exactly do you mean by 'API hard to use'? The problem that the
caller/driver has to make sure it doesn't do anything like the nvme-pci
driver does?
[1] https://lore.kernel.org/linux-nvme/87cy7vrbc4.ffs@tglx/
Thanks,
Daniel
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly"
2026-03-02 14:04 ` Daniel Wagner
@ 2026-03-02 14:12 ` Ming Lei
2026-03-02 14:27 ` Daniel Wagner
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2026-03-02 14:12 UTC (permalink / raw)
To: Daniel Wagner
Cc: Daniel Wagner, Christoph Hellwig, Keith Busch, Jens Axboe,
Guangwu Zhang, Chengming Zhou, Thomas Gleixner, linux-nvme,
linux-kernel, linux-block
On Mon, Mar 2, 2026 at 10:05 PM Daniel Wagner <dwagner@suse.de> wrote:
>
> Hi Ming,
>
> Sorry for the late response. Last week the mail server did take a break...
>
> On Thu, Feb 26, 2026 at 10:04:18PM +0800, Ming Lei wrote:
> > On Thu, Feb 26, 2026 at 02:40:37PM +0100, Daniel Wagner wrote:
> > > This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
> > >
> > > The reason the lock was removed was that the nvme-pci driver reset
> > > handler attempted to acquire the CPU read lock during CPU hotplug
> > > offlining (holds the CPU write lock). Consequently, the block layer
> > > offline notifier callback could not progress because in-flight requests
> > > were detected.
> > >
> > > Since then, in-flight detection has been improved, and the nvme-pci
> > > driver now explicitly updates the hctx state when it is safe to ignore
> > > detected in-flight requests. As a result, it's possible to reintroduce
> > > the CPU read lock in group_cpus_evenly.
> >
> > Can you explain your motivation a bit? Especially since adding back the lock
> > makes the API hard to use. Any benefit?
>
> Sure, I would like to add the lock back to group_cpus_evenly so it's
> possible to add support for the isolcpu use case. For the isolcpus case,
> it's necessary to access the cpu_online_mask when creating a
> housekeeping cpu mask. I failed to find a good solution which doesn't
> introduce horrible hacks (see Thomas' feedback on this [1]).
>
> Anyway, I am not totally set on this solution, but having a proper
> lock in this code path would make the isolcpus extension way cleaner, I
> think.
Then please include this patch with an explanation in your isolcpus patch set.
>
> What exactly do you mean by 'API hard to use'? The problem that the
> caller/driver has to make sure it doesn't do anything like the nvme-pci
> driver does?
This API is usually called in the slow path, in which subsystem locks are often
held, so a lock dependency against cpus_read_lock is added.
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] Revert "lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly"
2026-03-02 14:12 ` Ming Lei
@ 2026-03-02 14:27 ` Daniel Wagner
0 siblings, 0 replies; 10+ messages in thread
From: Daniel Wagner @ 2026-03-02 14:27 UTC (permalink / raw)
To: Ming Lei
Cc: Daniel Wagner, Christoph Hellwig, Keith Busch, Jens Axboe,
Guangwu Zhang, Chengming Zhou, Thomas Gleixner, linux-nvme,
linux-kernel, linux-block
On Mon, Mar 02, 2026 at 10:12:49PM +0800, Ming Lei wrote:
> > Sure, I would like to add the lock back to group_cpus_evenly so it's
> > possible to add support for the isolcpu use case. For the isolcpus case,
> > it's necessary to access the cpu_online_mask when creating a
> > housekeeping cpu mask. I failed to find a good solution which doesn't
> > introduce horrible hacks (see Thomas' feedback on this [1]).
> >
> > Anyway, I am not totally set on this solution, but having a proper
> > lock in this code path would make the isolcpus extension way cleaner, I
> > think.
>
> Then please include this patch with an explanation in your isolcpus
> patch set.
I didn't add it in the commit message because the code is not there yet,
and thus only mentioned it in the cover letter. But sure, I'll add this info.
> > What exactly do you mean by 'API hard to use'? The problem that the
> > caller/driver has to make sure it doesn't do anything like the nvme-pci
> > driver does?
>
> This API is usually called in the slow path, in which subsystem locks are often
> held, so a lock dependency against cpus_read_lock is added.
Yes, that's the very reason I came up with this handshake protocol, which
only covers the block layer subsystem. I wonder if it would be possible
to do a lock-free version with a retry check at the end: when a CPU
hotplug event happened during the calculation, start over. For this,
some sort of generation count for CPU hotplug events would be handy. Just
thinking out loud.
^ permalink raw reply [flat|nested] 10+ messages in thread