[PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue

public inbox for linux-arm-kernel@lists.infradead.org

* [PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue
@ 2026-02-09  8:29 Yu Kuai
  2026-02-09  8:29 ` [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze Yu Kuai
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-09  8:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme
  Cc: tj, nilay, ming.lei, neal, asahi, linux-arm-kernel, yukuai

blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
When callers pre-freeze the queue before calling this function, the
freeze depth becomes 2. The internal unfreeze only decrements it to 1,
leaving the queue still frozen when debugfs_create_files() is called.

This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
debugfs_create_files() and risks deadlock.
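
The depth counting above can be illustrated with a small user-space
sketch. This is not kernel code: struct toy_queue and every toy_* name
are hypothetical stand-ins, and only the nesting counter is modelled.
It assumes, as described above, that debugfs file creation happens
right after the internal unfreeze.

```c
#include <assert.h>

/* Toy stand-in for struct request_queue; only the nesting counter is modelled. */
struct toy_queue {
	int freeze_depth;	/* plays the role of q->mq_freeze_depth */
};

static void toy_freeze(struct toy_queue *q)
{
	q->freeze_depth++;
}

static void toy_unfreeze(struct toy_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/*
 * Models only the freeze/unfreeze pair inside blk_mq_update_nr_hw_queues().
 * Returns the depth seen at the point where debugfs files would be created.
 */
static int toy_update_nr_hw_queues(struct toy_queue *q)
{
	toy_freeze(q);
	/* hardware queue remapping would happen here (not modelled) */
	toy_unfreeze(q);
	return q->freeze_depth;
}

/* Old driver ordering: the caller still holds its own freeze. */
int toy_depth_old_order(void)
{
	struct toy_queue q = { 0 };
	int seen;

	toy_freeze(&q);				/* nvme_start_freeze() */
	seen = toy_update_nr_hw_queues(&q);
	toy_unfreeze(&q);			/* nvme_unfreeze() */
	return seen;
}

/* New driver ordering: the caller drops its freeze first. */
int toy_depth_new_order(void)
{
	struct toy_queue q = { 0 };

	toy_freeze(&q);				/* nvme_start_freeze() */
	toy_unfreeze(&q);			/* nvme_unfreeze() */
	return toy_update_nr_hw_queues(&q);
}
```

In this model toy_depth_old_order() observes a depth of 1 at the point
where the files would be created, which is the condition the WARN checks
for, while toy_depth_new_order() observes 0.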

Patches 1-3 fix the nvme drivers by moving nvme_unfreeze() before
blk_mq_update_nr_hw_queues() so the queue is unfrozen before the call.

Patch 4 fixes the block layer debugfs_create_files() itself. I checked
all callers of debugfs_create_files():

  - blk_mq_debugfs_register() from blk_register_queue(): queue not frozen
  - blk_mq_debugfs_register_hctx() from blk_mq_debugfs_register_hctxs():
    called after blk_mq_elv_switch_back() which unfreezes the queue
  - blk_mq_debugfs_register_sched() from elv_register_queue():
    called after blk_mq_unfreeze_queue()
  - blk_mq_debugfs_register_rqos() from wbt paths:
    called after blk_mq_unfreeze_queue()

All callers have the queue unfrozen. However, the queue can be frozen
from another context at any time, so the WARN_ON_ONCE check is racy.
Replace it with blk_queue_enter()/blk_queue_exit() which properly waits
for the queue to be unfrozen and prevents new freezes while creating
debugfs files.
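
A rough single-threaded model of the enter/exit gating may help here.
It is a sketch under stated assumptions, not the block layer
implementation: struct gate_queue and all gate_* names are hypothetical.
The real blk_queue_enter() with flags == 0 sleeps until the queue is
unfrozen; the model returns -EAGAIN instead so it can run without
threads. Returning -ENODEV for a dying queue reflects my reading of the
current code and should be treated as an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct request_queue; every name here is hypothetical. */
struct gate_queue {
	int freeze_depth;	/* > 0 once a freeze has been started */
	int users;		/* callers currently between enter and exit */
	bool dying;		/* queue is being torn down */
};

/*
 * Non-blocking model of blk_queue_enter(). The real call with flags == 0
 * would sleep while the queue is frozen; this model fails with -EAGAIN.
 */
int gate_enter(struct gate_queue *q)
{
	if (q->dying)
		return -ENODEV;
	if (q->freeze_depth > 0)
		return -EAGAIN;
	q->users++;
	return 0;
}

void gate_exit(struct gate_queue *q)
{
	assert(q->users > 0);
	q->users--;
}

void gate_freeze_start(struct gate_queue *q)
{
	q->freeze_depth++;
}

void gate_unfreeze(struct gate_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/* A freeze only completes once every user that entered has exited. */
bool gate_freeze_done(const struct gate_queue *q)
{
	return q->freeze_depth > 0 && q->users == 0;
}

/* Shape of the patched debugfs_create_files(); returns files created. */
int gate_create_files(struct gate_queue *q, int nr_files)
{
	int created = 0;

	if (gate_enter(q))
		return 0;	/* frozen (model only) or dying: skip */
	while (created < nr_files)
		created++;	/* stands in for debugfs_create_file_aux() */
	gate_exit(q);
	return created;
}
```

In the model, a freeze that starts while a caller is between
gate_enter() and gate_exit() cannot complete until that caller exits,
which is the sense in which the enter/exit pair holds off concurrent
freezes.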

Yu Kuai (4):
  nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  nvme-tcp: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  blk-mq: use blk_queue_enter/exit to protect debugfs file creation

 block/blk-mq-debugfs.c    | 15 ++++++++++-----
 drivers/nvme/host/apple.c |  2 +-
 drivers/nvme/host/rdma.c  |  2 +-
 drivers/nvme/host/tcp.c   |  2 +-
 4 files changed, 13 insertions(+), 8 deletions(-)

--
2.51.0

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 [PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue Yu Kuai
@ 2026-02-09  8:29 ` Yu Kuai
  2026-02-09 14:57   ` Christoph Hellwig
  2026-02-11  7:21   ` Nilay Shroff
  2026-02-09  8:29 ` [PATCH 2/4] nvme-tcp: " Yu Kuai
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-09  8:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme
  Cc: tj, nilay, ming.lei, neal, asahi, linux-arm-kernel, yukuai

blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
When the queue is already frozen before this call, the freeze depth
becomes 2. The internal unfreeze only decrements it to 1, leaving the
queue still frozen when debugfs_create_files() is called.

This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
debugfs_create_files() and risks deadlock.

Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
so the queue is unfrozen before the call, allowing the internal
freeze/unfreeze to work correctly.

Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
 drivers/nvme/host/rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 35c0822edb2d..b0253d90ac86 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -898,9 +898,9 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 			nvme_unfreeze(&ctrl->ctrl);
 			goto out_wait_freeze_timed_out;
 		}
+		nvme_unfreeze(&ctrl->ctrl);
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
 			ctrl->ctrl.queue_count - 1);
-		nvme_unfreeze(&ctrl->ctrl);
 	}
 
 	/*
-- 
2.51.0




* [PATCH 2/4] nvme-tcp: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 [PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue Yu Kuai
  2026-02-09  8:29 ` [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze Yu Kuai
@ 2026-02-09  8:29 ` Yu Kuai
  2026-02-09 14:57   ` Christoph Hellwig
  2026-02-11  7:22   ` Nilay Shroff
  2026-02-09  8:29 ` [PATCH 3/4] nvme-apple: " Yu Kuai
  2026-02-09  8:29 ` [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation Yu Kuai
  3 siblings, 2 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-09  8:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme
  Cc: tj, nilay, ming.lei, neal, asahi, linux-arm-kernel, yukuai

blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
When the queue is already frozen before this call, the freeze depth
becomes 2. The internal unfreeze only decrements it to 1, leaving the
queue still frozen when debugfs_create_files() is called.

This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
debugfs_create_files() and risks deadlock.

Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
so the queue is unfrozen before the call, allowing the internal
freeze/unfreeze to work correctly.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
 drivers/nvme/host/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69cb04406b47..daa02afbc9f5 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2203,9 +2203,9 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 			nvme_unfreeze(ctrl);
 			goto out_wait_freeze_timed_out;
 		}
+		nvme_unfreeze(ctrl);
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
 			ctrl->queue_count - 1);
-		nvme_unfreeze(ctrl);
 	}
 
 	/*
-- 
2.51.0




* [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 [PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue Yu Kuai
  2026-02-09  8:29 ` [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze Yu Kuai
  2026-02-09  8:29 ` [PATCH 2/4] nvme-tcp: " Yu Kuai
@ 2026-02-09  8:29 ` Yu Kuai
  2026-02-09 14:58   ` Christoph Hellwig
  2026-02-11  7:23   ` Nilay Shroff
  2026-02-09  8:29 ` [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation Yu Kuai
  3 siblings, 2 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-09  8:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme
  Cc: tj, nilay, ming.lei, neal, asahi, linux-arm-kernel, yukuai

blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
When the queue is already frozen before this call (from nvme_start_freeze
in apple_nvme_disable), the freeze depth becomes 2. The internal unfreeze
only decrements it to 1, leaving the queue still frozen when
debugfs_create_files() is called.

This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
debugfs_create_files() and risks deadlock.

Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
so the queue is unfrozen before the call, allowing the internal
freeze/unfreeze to work correctly.

Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
 drivers/nvme/host/apple.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index 15b3d07f8ccd..1835753ad91a 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1202,8 +1202,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
 
 	nvme_unquiesce_io_queues(&anv->ctrl);
 	nvme_wait_freeze(&anv->ctrl);
-	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
 	nvme_unfreeze(&anv->ctrl);
+	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
 
 	if (!nvme_change_ctrl_state(&anv->ctrl, NVME_CTRL_LIVE)) {
 		dev_warn(anv->ctrl.device,
-- 
2.51.0




* [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09  8:29 [PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue Yu Kuai
                   ` (2 preceding siblings ...)
  2026-02-09  8:29 ` [PATCH 3/4] nvme-apple: " Yu Kuai
@ 2026-02-09  8:29 ` Yu Kuai
  2026-02-09 14:59   ` Christoph Hellwig
                     ` (2 more replies)
  3 siblings, 3 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-09  8:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme
  Cc: tj, nilay, ming.lei, neal, asahi, linux-arm-kernel, yukuai

Replace the freeze depth check with blk_queue_enter()/blk_queue_exit()
which properly waits for the queue to be unfrozen and prevents new
freezes while creating debugfs files. This provides correct
synchronization without false warnings.

If the queue is dying (blk_queue_enter returns error), skip creating
the debugfs files as the queue is being torn down anyway.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
 block/blk-mq-debugfs.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index faeaa1fc86a7..03583d0d3972 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -613,11 +613,6 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
 				 const struct blk_mq_debugfs_attr *attr)
 {
 	lockdep_assert_held(&q->debugfs_mutex);
-	/*
-	 * Creating new debugfs entries with queue freezed has the risk of
-	 * deadlock.
-	 */
-	WARN_ON_ONCE(q->mq_freeze_depth != 0);
 	/*
 	 * debugfs_mutex should not be nested under other locks that can be
 	 * grabbed while queue is frozen.
@@ -628,9 +623,19 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
 	if (IS_ERR_OR_NULL(parent))
 		return;
 
+	/*
+	 * Avoid creating debugfs files while the queue is frozen, wait for
+	 * the queue to be unfrozen and prevent new freeze while creating
+	 * debugfs files.
+	 */
+	if (blk_queue_enter(q, 0))
+		return;
+
 	for (; attr->name; attr++)
 		debugfs_create_file_aux(attr->name, attr->mode, parent,
 				    (void *)attr, data, &blk_mq_debugfs_fops);
+
+	blk_queue_exit(q);
 }
 
 void blk_mq_debugfs_register(struct request_queue *q)
-- 
2.51.0




* Re: [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 ` [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze Yu Kuai
@ 2026-02-09 14:57   ` Christoph Hellwig
  2026-02-11  7:21   ` Nilay Shroff
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2026-02-09 14:57 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme, tj,
	nilay, ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 04:29:50PM +0800, Yu Kuai wrote:
> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> When the queue is already frozen before this call, the freeze depth
> becomes 2. The internal unfreeze only decrements it to 1, leaving the
> queue still frozen when debugfs_create_files() is called.
> 
> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> debugfs_create_files() and risks deadlock.
> 
> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> so the queue is unfrozen before the call, allowing the internal
> freeze/unfreeze to work correctly.

After this nothing is really protected by the frozen queue except
for unquiescing the I/O queues.  So we can probably eventually
drop it, but it might make sense to avoid that for now.

Reviewed-by: Christoph Hellwig <hch@lst.de>




* Re: [PATCH 2/4] nvme-tcp: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 ` [PATCH 2/4] nvme-tcp: " Yu Kuai
@ 2026-02-09 14:57   ` Christoph Hellwig
  2026-02-11  7:22   ` Nilay Shroff
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2026-02-09 14:57 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme, tj,
	nilay, ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 04:29:51PM +0800, Yu Kuai wrote:
> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> When the queue is already frozen before this call, the freeze depth
> becomes 2. The internal unfreeze only decrements it to 1, leaving the
> queue still frozen when debugfs_create_files() is called.
> 
> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> debugfs_create_files() and risks deadlock.
> 
> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> so the queue is unfrozen before the call, allowing the internal
> freeze/unfreeze to work correctly.

Same comment as for rdma, otherwise:

Reviewed-by: Christoph Hellwig <hch@lst.de>




* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 ` [PATCH 3/4] nvme-apple: " Yu Kuai
@ 2026-02-09 14:58   ` Christoph Hellwig
  2026-02-09 15:35     ` Keith Busch
  2026-02-11  7:23   ` Nilay Shroff
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2026-02-09 14:58 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme, tj,
	nilay, ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 04:29:52PM +0800, Yu Kuai wrote:
> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> When the queue is already frozen before this call (from nvme_start_freeze
> in apple_nvme_disable), the freeze depth becomes 2. The internal unfreeze
> only decrements it to 1, leaving the queue still frozen when
> debugfs_create_files() is called.
> 
> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> debugfs_create_files() and risks deadlock.
> 
> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> so the queue is unfrozen before the call, allowing the internal
> freeze/unfreeze to work correctly.
> 
> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
> ---
>  drivers/nvme/host/apple.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> index 15b3d07f8ccd..1835753ad91a 100644
> --- a/drivers/nvme/host/apple.c
> +++ b/drivers/nvme/host/apple.c
> @@ -1202,8 +1202,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
>  
>  	nvme_unquiesce_io_queues(&anv->ctrl);
>  	nvme_wait_freeze(&anv->ctrl);
> -	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>  	nvme_unfreeze(&anv->ctrl);
> +	blk_mq_update_nr_hw_queues(&anv->tagset, 1);

Looks good on its own, but it would also be good to align the
apple driver more closely with the PCI one here.

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09  8:29 ` [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation Yu Kuai
@ 2026-02-09 14:59   ` Christoph Hellwig
  2026-02-09 16:40   ` Bart Van Assche
  2026-02-11  7:20   ` Nilay Shroff
  2 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2026-02-09 14:59 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, kbusch, hch, sagi, sven, j, linux-block, linux-nvme, tj,
	nilay, ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 04:29:53PM +0800, Yu Kuai wrote:
> +	/*
> +	 * Avoid creating debugfs files while the queue is frozen, wait for
> +	 * the queue to be unfrozen and prevent new freeze while creating

I'd say "prevent concurrent freezes" here, but I'm not a native
speaker either.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>




* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09 14:58   ` Christoph Hellwig
@ 2026-02-09 15:35     ` Keith Busch
  2026-02-10  6:47       ` Yu Kuai
  2026-02-10  8:10       ` Nilay Shroff
  0 siblings, 2 replies; 29+ messages in thread
From: Keith Busch @ 2026-02-09 15:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yu Kuai, axboe, sagi, sven, j, linux-block, linux-nvme, tj, nilay,
	ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 03:58:32PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 09, 2026 at 04:29:52PM +0800, Yu Kuai wrote:
> > blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> > When the queue is already frozen before this call (from nvme_start_freeze
> > in apple_nvme_disable), the freeze depth becomes 2. The internal unfreeze
> > only decrements it to 1, leaving the queue still frozen when
> > debugfs_create_files() is called.
> > 
> > This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> > debugfs_create_files() and risks deadlock.
> > 
> > Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> > so the queue is unfrozen before the call, allowing the internal
> > freeze/unfreeze to work correctly.
> > 
> > Signed-off-by: Yu Kuai <yukuai@fnnas.com>
> > ---
> >  drivers/nvme/host/apple.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> > index 15b3d07f8ccd..1835753ad91a 100644
> > --- a/drivers/nvme/host/apple.c
> > +++ b/drivers/nvme/host/apple.c
> > @@ -1202,8 +1202,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
> >  
> >  	nvme_unquiesce_io_queues(&anv->ctrl);
> >  	nvme_wait_freeze(&anv->ctrl);
> > -	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
> >  	nvme_unfreeze(&anv->ctrl);
> > +	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
> 
> > Looks good on its own, but it would also be good to align the
> > apple driver more closely with the PCI one here.

I'm pretty sure this series would deadlock nvme-pci, as that driver
still leaves the queue frozen when calling blk_mq_update_nr_hw_queues.

We've left it frozen on purpose, though. The idea was to prevent new IO
from entering a hw context that's no longer backed by a hardware
resource. Unfreezing beforehand opens that window up again. Maybe it's not a
big deal; I don't often encounter scenarios where the queue count
changes after a reset.



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09  8:29 ` [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation Yu Kuai
  2026-02-09 14:59   ` Christoph Hellwig
@ 2026-02-09 16:40   ` Bart Van Assche
  2026-02-09 17:33     ` Keith Busch
  2026-02-11  7:20   ` Nilay Shroff
  2 siblings, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2026-02-09 16:40 UTC (permalink / raw)
  To: Yu Kuai, axboe, kbusch, hch, sagi, sven, j, linux-block,
	linux-nvme
  Cc: tj, nilay, ming.lei, neal, asahi, linux-arm-kernel

On 2/9/26 12:29 AM, Yu Kuai wrote:
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index faeaa1fc86a7..03583d0d3972 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -613,11 +613,6 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>   				 const struct blk_mq_debugfs_attr *attr)
>   {
>   	lockdep_assert_held(&q->debugfs_mutex);
> -	/*
> -	 * Creating new debugfs entries with queue freezed has the risk of
> -	 * deadlock.
> -	 */
> -	WARN_ON_ONCE(q->mq_freeze_depth != 0);
>   	/*
>   	 * debugfs_mutex should not be nested under other locks that can be
>   	 * grabbed while queue is frozen.

The above looks fine to me.

> @@ -628,9 +623,19 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>   	if (IS_ERR_OR_NULL(parent))
>   		return;
>   
> +	/*
> +	 * Avoid creating debugfs files while the queue is frozen, wait for
> +	 * the queue to be unfrozen and prevent new freeze while creating
> +	 * debugfs files.
> +	 */
> +	if (blk_queue_enter(q, 0))
> +		return;
> +
>   	for (; attr->name; attr++)
>   		debugfs_create_file_aux(attr->name, attr->mode, parent,
>   				    (void *)attr, data, &blk_mq_debugfs_fops);
> +
> +	blk_queue_exit(q);
>   }

This is not clear to me. Why are concurrent queue freezes not allowed
while debugfs attributes are created? I don't see any code in debugfs
that calls back into the block layer while creating debugfs attributes?
Did I perhaps overlook something?

Thanks,

Bart.



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09 16:40   ` Bart Van Assche
@ 2026-02-09 17:33     ` Keith Busch
  2026-02-09 17:51       ` Bart Van Assche
  2026-02-10 15:41       ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Keith Busch @ 2026-02-09 17:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Yu Kuai, axboe, hch, sagi, sven, j, linux-block, linux-nvme, tj,
	nilay, ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 08:40:29AM -0800, Bart Van Assche wrote:
> On 2/9/26 12:29 AM, Yu Kuai wrote:
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index faeaa1fc86a7..03583d0d3972 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -613,11 +613,6 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
> >   				 const struct blk_mq_debugfs_attr *attr)
> >   {
> >   	lockdep_assert_held(&q->debugfs_mutex);
> > -	/*
> > -	 * Creating new debugfs entries with queue freezed has the risk of
> > -	 * deadlock.
> > -	 */
> > -	WARN_ON_ONCE(q->mq_freeze_depth != 0);
> >   	/*
> >   	 * debugfs_mutex should not be nested under other locks that can be
> >   	 * grabbed while queue is frozen.
> 
> The above looks fine to me.
> 
> > @@ -628,9 +623,19 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
> >   	if (IS_ERR_OR_NULL(parent))
> >   		return;
> > +	/*
> > +	 * Avoid creating debugfs files while the queue is frozen, wait for
> > +	 * the queue to be unfrozen and prevent new freeze while creating
> > +	 * debugfs files.
> > +	 */
> > +	if (blk_queue_enter(q, 0))
> > +		return;
> > +
> >   	for (; attr->name; attr++)
> >   		debugfs_create_file_aux(attr->name, attr->mode, parent,
> >   				    (void *)attr, data, &blk_mq_debugfs_fops);
> > +
> > +	blk_queue_exit(q);
> >   }
> 
> This is not clear to me. Why are concurrent queue freezes not allowed
> while debugfs attributes are created? I don't see any code in debugfs
> that calls back into the block layer while creating debugfs attributes?
> Did I perhaps overlook something?

I had to look up the original commit that introduced the WARN,
65d466b629847. The commit message says "Creating new debugfs entries can
trigger fs reclaim", so that must be the path that enters back into the
block layer request_queue.



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09 17:33     ` Keith Busch
@ 2026-02-09 17:51       ` Bart Van Assche
  2026-02-10  6:59         ` Yu Kuai
  2026-02-10 15:41       ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2026-02-09 17:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Yu Kuai, axboe, hch, sagi, sven, j, linux-block, linux-nvme, tj,
	nilay, ming.lei, neal, asahi, linux-arm-kernel

On 2/9/26 9:33 AM, Keith Busch wrote:
> I had to look up the original commit that introduced the WARN,
> 65d466b629847. The commit message says "Creating new debugfs entries can
> trigger fs reclaim", so that must be the path that enters back into the
> block layer request_queue.

Thanks Keith. It's probably a good idea to integrate this information in
the source code comment above the blk_queue_enter() call.

Bart.



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09 15:35     ` Keith Busch
@ 2026-02-10  6:47       ` Yu Kuai
  2026-02-10 15:09         ` Keith Busch
  2026-02-10  8:10       ` Nilay Shroff
  1 sibling, 1 reply; 29+ messages in thread
From: Yu Kuai @ 2026-02-10  6:47 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: axboe, sagi, sven, j, linux-block, linux-nvme, tj, nilay,
	ming.lei, neal, asahi, linux-arm-kernel, yukuai

Hi,

On 2026/2/9 23:35, Keith Busch wrote:
> On Mon, Feb 09, 2026 at 03:58:32PM +0100, Christoph Hellwig wrote:
>> On Mon, Feb 09, 2026 at 04:29:52PM +0800, Yu Kuai wrote:
>>> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
>>> When the queue is already frozen before this call (from nvme_start_freeze
>>> in apple_nvme_disable), the freeze depth becomes 2. The internal unfreeze
>>> only decrements it to 1, leaving the queue still frozen when
>>> debugfs_create_files() is called.
>>>
>>> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
>>> debugfs_create_files() and risks deadlock.
>>>
>>> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
>>> so the queue is unfrozen before the call, allowing the internal
>>> freeze/unfreeze to work correctly.
>>>
>>> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
>>> ---
>>>   drivers/nvme/host/apple.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
>>> index 15b3d07f8ccd..1835753ad91a 100644
>>> --- a/drivers/nvme/host/apple.c
>>> +++ b/drivers/nvme/host/apple.c
>>> @@ -1202,8 +1202,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
>>>   
>>>   	nvme_unquiesce_io_queues(&anv->ctrl);
>>>   	nvme_wait_freeze(&anv->ctrl);
>>> -	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>>>   	nvme_unfreeze(&anv->ctrl);
>>> +	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>> Looks good on its own, but it would also be good to align the
>> apple driver more closely with the PCI one here.
> I'm pretty sure this series would deadlock nvme-pci, as that driver
> still leaves the queue frozen when calling blk_mq_update_nr_hw_queues.

Yeah, thanks for letting me know. I didn't realize that nvme-pci
still leaves the queue frozen.

>
> We've left it frozen on purpose, though. The idea was to prevent new IO
> from entering a hw context that's no longer backed by a hardware
> resource. Unfreezing beforehand opens that window up again. Maybe it's not a
> big deal; I don't often encounter scenarios where the queue count
> changes after a reset.

Do you think new I/O arriving between nvme_unfreeze() and
blk_mq_update_nr_hw_queues() would cause any race problems? If so, would
it help to move nvme_unquiesce_io_queues() after
blk_mq_update_nr_hw_queues(), so that new I/O is not issued to the driver
during the race window?

I'm not very familiar with the nvme drivers, so details would be very helpful. :)

-- 
Thanks,
Kuai



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09 17:51       ` Bart Van Assche
@ 2026-02-10  6:59         ` Yu Kuai
  0 siblings, 0 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-10  6:59 UTC (permalink / raw)
  To: Bart Van Assche, Keith Busch
  Cc: axboe, hch, sagi, sven, j, linux-block, linux-nvme, tj, nilay,
	ming.lei, neal, asahi, linux-arm-kernel, yukuai

Hi,

On 2026/2/10 1:51, Bart Van Assche wrote:
> On 2/9/26 9:33 AM, Keith Busch wrote:
>> I had to look up the original commit that introduced the WARN,
>> 65d466b629847. The commit message says "Creating new debugfs entries can
>> trigger fs reclaim", so that must be the path that enters back into the
>> block layer request_queue.
>
> Thanks Keith. It's probably a good idea to integrate this information in
> the source code comment above the blk_queue_enter() call.

Yeah, I'll add the information in the next version.

BTW, I'd like to add some background. The original lockdep report is:

[REPORT] Possible circular locking dependency on 6.18-rc2 in blkg_conf_open_bdev_frozen+0x80/0xa0 - David Wei
<https://lore.kernel.org/all/63c97224-0e9a-4dd8-8706-38c10a1506e9@davidwei.uk/>

That report showed that some contexts, such as wbt and nbd, can freeze the
queue first and then create debugfs entries. This was fixed by the thread:

[PATCH v9 0/8] blk-mq: fix possible deadlocks - Yu Kuai
<https://lore.kernel.org/all/20260202080523.3947504-1-yukuai@fnnas.com/>

The warning was added in that same thread.

>
> Bart.

-- 
Thanks,
Kuai



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09 15:35     ` Keith Busch
  2026-02-10  6:47       ` Yu Kuai
@ 2026-02-10  8:10       ` Nilay Shroff
  2026-02-10 15:12         ` Keith Busch
  1 sibling, 1 reply; 29+ messages in thread
From: Nilay Shroff @ 2026-02-10  8:10 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Yu Kuai, axboe, sagi, sven, j, linux-block, linux-nvme, tj,
	ming.lei, neal, asahi, linux-arm-kernel



On 2/9/26 9:05 PM, Keith Busch wrote:
> On Mon, Feb 09, 2026 at 03:58:32PM +0100, Christoph Hellwig wrote:
>> On Mon, Feb 09, 2026 at 04:29:52PM +0800, Yu Kuai wrote:
>>> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
>>> When the queue is already frozen before this call (from nvme_start_freeze
>>> in apple_nvme_disable), the freeze depth becomes 2. The internal unfreeze
>>> only decrements it to 1, leaving the queue still frozen when
>>> debugfs_create_files() is called.
>>>
>>> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
>>> debugfs_create_files() and risks deadlock.
>>>
>>> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
>>> so the queue is unfrozen before the call, allowing the internal
>>> freeze/unfreeze to work correctly.
>>>
>>> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
>>> ---
>>>  drivers/nvme/host/apple.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
>>> index 15b3d07f8ccd..1835753ad91a 100644
>>> --- a/drivers/nvme/host/apple.c
>>> +++ b/drivers/nvme/host/apple.c
>>> @@ -1202,8 +1202,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
>>>  
>>>  	nvme_unquiesce_io_queues(&anv->ctrl);
>>>  	nvme_wait_freeze(&anv->ctrl);
>>> -	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>>>  	nvme_unfreeze(&anv->ctrl);
>>> +	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>>
>> Looks good on its own, but it would also be good to align the
>> apple driver more closely with the PCI one here.
> 
> I'm pretty sure this series would deadlock nvme-pci, as that driver
> still leaves the queue frozen when calling blk_mq_update_nr_hw_queues.
> 
> We've left it frozen on purpose, though. The idea was to prevent new IO
> from entering a hw context that's no longer backed by a hardware
> resource. Unfreezing beforehand opens that window up again. Maybe it's not a
> big deal; I don't often encounter scenarios where the queue count
> changes after a reset.

If an I/O were to slip through during the brief window between unfreeze
and the subsequent freeze inside blk_mq_update_nr_hw_queues(), wouldn’t
it still fail because the NVMe queues have already been suspended earlier
in the reset path? My understanding is that when the controller reset
reduces the number of online NVMe queues, the queues that are no longer
backed by hardware remain in the suspended state. As a result, any I/O
that reaches them before nr_hw_queues is updated should be rejected in
nvme_queue_rq(). And if that’s the case, then allowing a small unfreeze
window before updating the nr_hw_queue count shouldn’t result in a deadlock.
What do you think?

Thanks,
--Nilay




* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-10  6:47       ` Yu Kuai
@ 2026-02-10 15:09         ` Keith Busch
  2026-02-10 15:41           ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2026-02-10 15:09 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, axboe, sagi, sven, j, linux-block, linux-nvme,
	tj, nilay, ming.lei, neal, asahi, linux-arm-kernel

On Tue, Feb 10, 2026 at 02:47:00PM +0800, Yu Kuai wrote:
> On 2026/2/9 23:35, Keith Busch wrote:
> > We've left it frozen on purpose, though. The idea was to prevent new IO
> > from entering a hw context that's no longer backed by a hardware
> > resource. Unfreezing prior opens that window up again. Maybe it's not a
> > big deal; I don't often encounter scenarios where the queue count
> > changes after a reset.
> 
> Do you think new IO arriving between nvme_unfreeze() and
> blk_mq_update_nr_hw_queues() would cause any race problems? If so, would
> it help to move nvme_unquiesce_io_queues() after
> blk_mq_update_nr_hw_queues(), so that new IO isn't issued to the driver
> during the race window?

If you leave the queue quiesced, pending IO will form requests that are
entered and waiting in the block layer. You can't freeze a queue with
entered requests.

We unquiesce first to flush any pending IO that had entered during the
prior reset. It's not the best way to handle this situation. It would be
smarter to steal the bio's from all the entered requests, then end those
requests, then resubmit the bios after the hw queues are initialized. We
don't do that because no one's really complained, probably because the
queue counts don't usually change after a reset. But if the queue count
did change, we'd potentially see unexpected IO errors with the current
way we're handling resets.
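
The interaction described above (a quiesced queue keeps entered requests
pending, and a freeze cannot finish while requests are entered) can be
sketched with a small user-space model. This is only an illustration under
stated assumptions: struct model_queue and the model_* helpers are
hypothetical names, not kernel APIs, and the real block layer tracks entered
requests with a per-queue usage reference rather than a plain integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a request queue. */
struct model_queue {
	int entered;	/* requests that hold a queue usage reference */
	bool quiesced;	/* true: nothing is dispatched to the driver */
};

/* A submitter enters the queue and allocates a request. */
static void model_submit(struct model_queue *q)
{
	q->entered++;
}

/*
 * Run the queue. Entered requests only reach the driver, complete and
 * drop their reference when the queue is not quiesced.
 */
static void model_run(struct model_queue *q)
{
	if (!q->quiesced)
		q->entered = 0;
}

/* A freeze can only finish once no entered requests remain. */
static bool model_freeze_would_finish(const struct model_queue *q)
{
	return q->entered == 0;
}
```

In this model, requests submitted while the queue is quiesced stay entered
until the queue is unquiesced and run, which is why the reset path
unquiesces before waiting for the freeze.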


* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-10  8:10       ` Nilay Shroff
@ 2026-02-10 15:12         ` Keith Busch
  0 siblings, 0 replies; 29+ messages in thread
From: Keith Busch @ 2026-02-10 15:12 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Christoph Hellwig, Yu Kuai, axboe, sagi, sven, j, linux-block,
	linux-nvme, tj, ming.lei, neal, asahi, linux-arm-kernel

On Tue, Feb 10, 2026 at 01:40:54PM +0530, Nilay Shroff wrote:
> On 2/9/26 9:05 PM, Keith Busch wrote:
> > 
> > We've left it frozen on purpose, though. The idea was to prevent new IO
> > from entering a hw context that's no longer backed by a hardware
> > resource. Unfreezing prior opens that window up again. Maybe it's not a
> > big deal; I don't often encounter scenarios where the queue count
> > changes after a reset.
> 
> If an I/O were to slip through during the brief window between unfreeze
> and the subsequent freeze inside blk_mq_update_nr_hw_queues(), wouldn't
> it still fail because the NVMe queues have already been suspended earlier
> in the reset path? My understanding is that when the controller reset
> reduces the number of online NVMe queues, the queues that are no longer
> backed by hardware remain in the suspended state. As a result, any I/O
> that reaches them before nr_hw_queues is updated should be rejected in
> nvme_queue_rq(). And if that's the case, then allowing a small unfreeze
> window before updating the nr_hw_queue count shouldn't result in a deadlock.
> What do you think?

Yeah, that wouldn't deadlock. It just increases the time for when you
may see IO failures if the queue count is reduced after the reset.



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-10 15:09         ` Keith Busch
@ 2026-02-10 15:41           ` Christoph Hellwig
  2026-02-10 16:01             ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2026-02-10 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Yu Kuai, Christoph Hellwig, axboe, sagi, sven, j, linux-block,
	linux-nvme, tj, nilay, ming.lei, neal, asahi, linux-arm-kernel,
	Daniel Wagner

On Tue, Feb 10, 2026 at 08:09:08AM -0700, Keith Busch wrote:
> If you leave the queue quiesced, pending IO will form requests that are
> entered and waiting in the block layer. You can't freeze a queue with
> entered requests.
> 
> We unquiesce first to flush any pending IO that had entered during the
> prior reset. It's not the best way to handle this situation. It would be
> smarter to steal the bio's from all the entered requests, then end those
> requests, then resubmit the bios after the hw queues are initialized. We
> don't do that because no one's really complained, probably because the
> queue counts don't usually change after a reset.

FYI, Daniel Wagner had been thinking about doing this reinsert for
something (I forgot what exactly), and this kind of reinserting from
kblockd would also finally make REQ_NOWAIT practically useful for
file system initiated writes.  So I hope we can eventually get to it,
and it should help to sort out various problems.




* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09 17:33     ` Keith Busch
  2026-02-09 17:51       ` Bart Van Assche
@ 2026-02-10 15:41       ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2026-02-10 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Yu Kuai, axboe, hch, sagi, sven, j, linux-block,
	linux-nvme, tj, nilay, ming.lei, neal, asahi, linux-arm-kernel

On Mon, Feb 09, 2026 at 10:33:15AM -0700, Keith Busch wrote:
> > This is not clear to me. Why are concurrent queue freezes not allowed
> > while debugfs attributes are created? I don't see any code in debugfs
> > that calls back into the block layer while creating debugfs attributes?
> > Did I perhaps overlook something?
> 
> I had to look up the original commit that introduced the WARN,
> 65d466b629847. The commit message says "Creating new debugfs entries can
> trigger fs reclaim", so that must be the path that enters back into the
> block layer request_queue.

Another way to solve this would be to scope all the debugfs actions
as NOIO context.



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-10 15:41           ` Christoph Hellwig
@ 2026-02-10 16:01             ` Keith Busch
  2026-02-10 16:28               ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2026-02-10 16:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yu Kuai, axboe, sagi, sven, j, linux-block, linux-nvme, tj, nilay,
	ming.lei, neal, asahi, linux-arm-kernel, Daniel Wagner

On Tue, Feb 10, 2026 at 04:41:08PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 10, 2026 at 08:09:08AM -0700, Keith Busch wrote:
> > If you leave the queue quiesced, pending IO will form requests that are
> > entered and waiting in the block layer. You can't freeze a queue with
> > entered requests.
> > 
> > We unquiesce first to flush any pending IO that had entered during the
> > prior reset. It's not the best way to handle this situation. It would be
> > smarter to steal the bio's from all the entered requests, then end those
> > requests, then resubmit the bios after the hw queues are initialized. We
> > don't do that because no one's really complained, probably because the
> > queue counts don't usually change after a reset.
> 
> FYI, Daniel Wagner had been thinking about doing this reinsert for
> something (I forgot what exactly), and this kind of reinserting from
> kblockd would also finally make REQ_NOWAIT practically useful for
> file system initiated writes.  So I hope we can eventually get to it,
> and it should help to sort out various problems.

I took a moment to craft a test by modifying the poll_queues parameter
at runtime, say from 1 -> 0, and then starting a controller reset. Turns
out the driver is a bit broken here. A hipri job will poll a non-polled
queue after the reset, causing a double completion with the now
irq-driven queue, and the entire blk-mq map is messed up, too. Oops!



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-10 16:01             ` Keith Busch
@ 2026-02-10 16:28               ` Daniel Wagner
  2026-02-11  1:57                 ` Yu Kuai
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2026-02-10 16:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Yu Kuai, axboe, sagi, sven, j, linux-block,
	linux-nvme, tj, nilay, ming.lei, neal, asahi, linux-arm-kernel

On Tue, Feb 10, 2026 at 09:01:14AM -0700, Keith Busch wrote:
> On Tue, Feb 10, 2026 at 04:41:08PM +0100, Christoph Hellwig wrote:
> > On Tue, Feb 10, 2026 at 08:09:08AM -0700, Keith Busch wrote:
> > > If you leave the queue quiesced, pending IO will form requests that are
> > > entered and waiting in the block layer. You can't freeze a queue with
> > > entered requests.
> > > 
> > > We unquiesce first to flush any pending IO that had entered during the
> > > prior reset. It's not the best way to handle this situation. It would be
> > > smarter to steal the bio's from all the entered requests, then end those
> > > requests, then resubmit the bios after the hw queues are initialized. We
> > > don't do that because no one's really complained, probably because the
> > > queue counts don't usually change after a reset.
> > 
> > FYI, Daniel Wagner had been thinking about doing this reinsert for
> > something (I forgot what exactly),

This feature would solve a tricky problem with isolation patches:

https://lore.kernel.org/linux-nvme/87cy7vrbc4.ffs@tglx/

Currently, it's not possible to take the cpu hotplug lock, which would
prevent races in the cpu-queue mapping. The current logic depends on
making progress in the error case. If it were possible to fail the
in-flight requests in the error handler and reinsert them in the block
layer, progress could be guaranteed when holding the cpu hotplug lock.

Unfortunately, I haven't found time to implement and test this idea so
far. Sorry.



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-10 16:28               ` Daniel Wagner
@ 2026-02-11  1:57                 ` Yu Kuai
  2026-02-11 12:57                   ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Yu Kuai @ 2026-02-11  1:57 UTC (permalink / raw)
  To: Daniel Wagner, Keith Busch
  Cc: Christoph Hellwig, axboe, sagi, sven, j, linux-block, linux-nvme,
	tj, nilay, ming.lei, neal, asahi, linux-arm-kernel, yukuai

Hi,

On 2026/2/11 0:28, Daniel Wagner wrote:
> On Tue, Feb 10, 2026 at 09:01:14AM -0700, Keith Busch wrote:
>> On Tue, Feb 10, 2026 at 04:41:08PM +0100, Christoph Hellwig wrote:
>>> On Tue, Feb 10, 2026 at 08:09:08AM -0700, Keith Busch wrote:
>>>> If you leave the queue quiesced, pending IO will form requests that are
>>>> entered and waiting in the block layer. You can't freeze a queue with
>>>> entered requests.
>>>>
>>>> We unquiesce first to flush any pending IO that had entered during the
>>>> prior reset. It's not the best way to handle this situation. It would be
>>>> smarter to steal the bio's from all the entered requests, then end those
>>>> requests, then resubmit the bios after the hw queues are initialized. We
>>>> don't do that because no one's really complained, probably because the
>>>> queue counts don't usually change after a reset.
>>> FYI, Daniel Wagner had been thinking about doing this reinsert for
>>> something (I forgot what exactly),
> This feature would solve a tricky problem with isolation patches:
>
> https://lore.kernel.org/linux-nvme/87cy7vrbc4.ffs@tglx/
>
> Currently, it's not possible to take the cpu hotplug lock which would
> prevent races in the cpu-queue mapping. The current logic depends on
> making progress in the error case. If it were possible to fail the
> in-flight requests in the error handler and reinsert them in the block
> layer, progress could be guaranteed when holding the cpu hotplug lock.
>
> Unfortunately, I haven't found time to implement and test this idea so
> far. Sorry.

Thanks all!

If I understand this correctly, it seems fine to make progress with this
set. Currently, IO can return errors during the race window, and this can
eventually be fixed by the reinsert approach.

-- 
Thanks,
Kuai



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-09  8:29 ` [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation Yu Kuai
  2026-02-09 14:59   ` Christoph Hellwig
  2026-02-09 16:40   ` Bart Van Assche
@ 2026-02-11  7:20   ` Nilay Shroff
  2026-02-11  8:06     ` Yu Kuai
  2 siblings, 1 reply; 29+ messages in thread
From: Nilay Shroff @ 2026-02-11  7:20 UTC (permalink / raw)
  To: Yu Kuai, axboe, kbusch, hch, sagi, sven, j, linux-block,
	linux-nvme
  Cc: tj, ming.lei, neal, asahi, linux-arm-kernel



On 2/9/26 1:59 PM, Yu Kuai wrote:
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index faeaa1fc86a7..03583d0d3972 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -613,11 +613,6 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>  				 const struct blk_mq_debugfs_attr *attr)
>  {
>  	lockdep_assert_held(&q->debugfs_mutex);
> -	/*
> -	 * Creating new debugfs entries with queue freezed has the risk of
> -	 * deadlock.
> -	 */
> -	WARN_ON_ONCE(q->mq_freeze_depth != 0);
>  	/*
>  	 * debugfs_mutex should not be nested under other locks that can be
>  	 * grabbed while queue is frozen.
> @@ -628,9 +623,19 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>  	if (IS_ERR_OR_NULL(parent))
>  		return;
>  
> +	/*
> +	 * Avoid creating debugfs files while the queue is frozen, wait for
> +	 * the queue to be unfrozen and prevent new freeze while creating
> +	 * debugfs files.
> +	 */
> +	if (blk_queue_enter(q, 0))
> +		return;
> +
>  	for (; attr->name; attr++)
>  		debugfs_create_file_aux(attr->name, attr->mode, parent,
>  				    (void *)attr, data, &blk_mq_debugfs_fops);
> +
> +	blk_queue_exit(q);
>  }
>  
>  void blk_mq_debugfs_register(struct request_queue *q)

I also noticed that we use other debugfs helpers such as debugfs_create_dir()
from some block paths, which could run into the same fs-reclaim recursion issue.
So we may need to consider those call sites as well to ensure consistent handling.

On another note, instead of waiting for the queue to be unfrozen, does it make
sense to wrap the debugfs code in a NOIO context (as Christoph pointed out in
another thread)? IMO, if fs-reclaim recursion is the only issue here, we would
be better off using a NOIO context instead of blocking until the queue is
unfrozen from other contexts.

Thanks,
--Nilay
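
The scoped NOIO idea suggested above can be illustrated with a small
user-space model. This is a sketch under stated assumptions, not kernel
code: the MODEL_* bits and model_* helpers are hypothetical stand-ins for the
task flag and allocation flags involved, and the values are made up. The
point it shows is that allocations made inside the scope lose the bits that
allow reclaim to issue IO or re-enter the filesystem, and that the
save/restore pair nests.

```c
#include <assert.h>

/* Hypothetical flag bits; the kernel's real values differ. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_NOIO		0x1u

/* Stands in for the flags word of the current task. */
static unsigned int model_task_flags;

/* Enter a NOIO scope; return the previous state so scopes can nest. */
static unsigned int model_noio_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOIO;

	model_task_flags |= MODEL_PF_NOIO;
	return old;
}

/* Leave a NOIO scope, restoring whatever state the caller had. */
static void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOIO) | old;
}

/* Flags an allocation effectively uses, given the current scope. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}

/* Stands in for a debugfs creation loop wrapped in a NOIO scope. */
static unsigned int model_create_files_noio(void)
{
	unsigned int old = model_noio_save();
	unsigned int used = model_effective_gfp(MODEL_GFP_KERNEL);

	model_noio_restore(old);
	return used;
}
```

With this shape the creation path never waits for the queue to be unfrozen;
it only guarantees that its own allocations cannot recurse into IO.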




* Re: [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 ` [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze Yu Kuai
  2026-02-09 14:57   ` Christoph Hellwig
@ 2026-02-11  7:21   ` Nilay Shroff
  1 sibling, 0 replies; 29+ messages in thread
From: Nilay Shroff @ 2026-02-11  7:21 UTC (permalink / raw)
  To: Yu Kuai, axboe, kbusch, hch, sagi, sven, j, linux-block,
	linux-nvme
  Cc: tj, ming.lei, neal, asahi, linux-arm-kernel



On 2/9/26 1:59 PM, Yu Kuai wrote:
> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> When the queue is already frozen before this call, the freeze depth
> becomes 2. The internal unfreeze only decrements it to 1, leaving the
> queue still frozen when debugfs_create_files() is called.
> 
> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> debugfs_create_files() and risks deadlock.
> 
> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> so the queue is unfrozen before the call, allowing the internal
> freeze/unfreeze to work correctly.
> 
> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
> ---

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
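
The freeze-depth arithmetic in the quoted commit message can be sketched in
user space as follows. This is a minimal model, not kernel code: struct
model_queue and the model_* functions are hypothetical names, the counter
stands in for q->mq_freeze_depth, and the placement of the debugfs step
after the internal unfreeze follows the description in the cover letter.

```c
#include <assert.h>

/* Hypothetical user-space model of a queue's freeze counter. */
struct model_queue {
	int freeze_depth;	/* stands in for q->mq_freeze_depth */
};

static void model_freeze(struct model_queue *q)
{
	q->freeze_depth++;
}

static void model_unfreeze(struct model_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/*
 * Stands in for blk_mq_update_nr_hw_queues(): freeze, remap, unfreeze,
 * then create debugfs files. Returns the depth seen at the debugfs step,
 * which is what the WARN_ON_ONCE() check would look at.
 */
static int model_update_nr_hw_queues(struct model_queue *q)
{
	model_freeze(q);
	/* ... hardware queue remapping would happen here ... */
	model_unfreeze(q);
	return q->freeze_depth;
}

/* Old ordering: update while the caller still holds its own freeze. */
static int model_reset_old_order(struct model_queue *q)
{
	int seen;

	model_freeze(q);	/* caller's freeze taken earlier in reset */
	seen = model_update_nr_hw_queues(q);
	model_unfreeze(q);	/* caller's unfreeze */
	return seen;
}

/* New ordering: drop the caller's freeze before the update. */
static int model_reset_new_order(struct model_queue *q)
{
	model_freeze(q);
	model_unfreeze(q);
	return model_update_nr_hw_queues(q);
}
```

In the old ordering the debugfs step observes a depth of 1; in the new
ordering it observes 0, at the cost of the short unfrozen window discussed
elsewhere in this thread.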



* Re: [PATCH 2/4] nvme-tcp: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 ` [PATCH 2/4] nvme-tcp: " Yu Kuai
  2026-02-09 14:57   ` Christoph Hellwig
@ 2026-02-11  7:22   ` Nilay Shroff
  1 sibling, 0 replies; 29+ messages in thread
From: Nilay Shroff @ 2026-02-11  7:22 UTC (permalink / raw)
  To: Yu Kuai, axboe, kbusch, hch, sagi, sven, j, linux-block,
	linux-nvme
  Cc: tj, ming.lei, neal, asahi, linux-arm-kernel



On 2/9/26 1:59 PM, Yu Kuai wrote:
> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> When the queue is already frozen before this call, the freeze depth
> becomes 2. The internal unfreeze only decrements it to 1, leaving the
> queue still frozen when debugfs_create_files() is called.
> 
> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> debugfs_create_files() and risks deadlock.
> 
> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> so the queue is unfrozen before the call, allowing the internal
> freeze/unfreeze to work correctly.
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
> ---

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-09  8:29 ` [PATCH 3/4] nvme-apple: " Yu Kuai
  2026-02-09 14:58   ` Christoph Hellwig
@ 2026-02-11  7:23   ` Nilay Shroff
  1 sibling, 0 replies; 29+ messages in thread
From: Nilay Shroff @ 2026-02-11  7:23 UTC (permalink / raw)
  To: Yu Kuai, axboe, kbusch, hch, sagi, sven, j, linux-block,
	linux-nvme
  Cc: tj, ming.lei, neal, asahi, linux-arm-kernel



On 2/9/26 1:59 PM, Yu Kuai wrote:
> blk_mq_update_nr_hw_queues() freezes and unfreezes queues internally.
> When the queue is already frozen before this call (from nvme_start_freeze
> in apple_nvme_disable), the freeze depth becomes 2. The internal unfreeze
> only decrements it to 1, leaving the queue still frozen when
> debugfs_create_files() is called.
> 
> This triggers WARN_ON_ONCE(q->mq_freeze_depth != 0) in
> debugfs_create_files() and risks deadlock.
> 
> Fix this by moving nvme_unfreeze() before blk_mq_update_nr_hw_queues()
> so the queue is unfrozen before the call, allowing the internal
> freeze/unfreeze to work correctly.
> 
> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
> ---
>  drivers/nvme/host/apple.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>



* Re: [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation
  2026-02-11  7:20   ` Nilay Shroff
@ 2026-02-11  8:06     ` Yu Kuai
  0 siblings, 0 replies; 29+ messages in thread
From: Yu Kuai @ 2026-02-11  8:06 UTC (permalink / raw)
  To: Nilay Shroff, axboe, kbusch, hch, sagi, sven, j, linux-block,
	linux-nvme, yukuai
  Cc: tj, ming.lei, neal, asahi, linux-arm-kernel

Hi,

On 2026/2/11 15:20, Nilay Shroff wrote:
>
> On 2/9/26 1:59 PM, Yu Kuai wrote:
>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index faeaa1fc86a7..03583d0d3972 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -613,11 +613,6 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>>   				 const struct blk_mq_debugfs_attr *attr)
>>   {
>>   	lockdep_assert_held(&q->debugfs_mutex);
>> -	/*
>> -	 * Creating new debugfs entries with queue freezed has the risk of
>> -	 * deadlock.
>> -	 */
>> -	WARN_ON_ONCE(q->mq_freeze_depth != 0);
>>   	/*
>>   	 * debugfs_mutex should not be nested under other locks that can be
>>   	 * grabbed while queue is frozen.
>> @@ -628,9 +623,19 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>>   	if (IS_ERR_OR_NULL(parent))
>>   		return;
>>   
>> +	/*
>> +	 * Avoid creating debugfs files while the queue is frozen, wait for
>> +	 * the queue to be unfrozen and prevent new freeze while creating
>> +	 * debugfs files.
>> +	 */
>> +	if (blk_queue_enter(q, 0))
>> +		return;
>> +
>>   	for (; attr->name; attr++)
>>   		debugfs_create_file_aux(attr->name, attr->mode, parent,
>>   				    (void *)attr, data, &blk_mq_debugfs_fops);
>> +
>> +	blk_queue_exit(q);
>>   }
>>   
>>   void blk_mq_debugfs_register(struct request_queue *q)
> I also noticed that we use other debugfs helpers such as debugfs_create_dir()
> from some block paths, which could run into the same fs-reclaim recursion issue.
> So we may need to consider those call sites as well to ensure consistent handling.

Yes, sounds correct.

>
> On another note, instead of waiting for the queue to be unfrozen, does it make
> sense to wrap the debugfs code in a NOIO context (as Christoph pointed out in
> another thread)? IMO, if fs-reclaim recursion is the only issue here, we would
> be better off using a NOIO context instead of blocking until the queue is
> unfrozen from other contexts.

I can try this. So far, the debugfs_mutex deadlocks have all been related to
fs-reclaim, although I'm not that confident there won't be any other implicit
dependency. Meanwhile, it looks like the previous patches for nvme targets are
not necessary anymore.

> Thanks,
> --Nilay
>
-- 
Thanks,
Kuai



* Re: [PATCH 3/4] nvme-apple: move blk_mq_update_nr_hw_queues after nvme_unfreeze
  2026-02-11  1:57                 ` Yu Kuai
@ 2026-02-11 12:57                   ` Keith Busch
  0 siblings, 0 replies; 29+ messages in thread
From: Keith Busch @ 2026-02-11 12:57 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Daniel Wagner, Christoph Hellwig, axboe, sagi, sven, j,
	linux-block, linux-nvme, tj, nilay, ming.lei, neal, asahi,
	linux-arm-kernel

On Wed, Feb 11, 2026 at 09:57:26AM +0800, Yu Kuai wrote:
> If I understand this correctly, it seems fine to make progress with this
> set. Currently, IO can return errors during the race window, and this can
> eventually be fixed by the reinsert approach.

Yeah, if you can just make sure to add an equivalent patch for nvme-pci,
otherwise the last patch in this series will deadlock it.

Some experiments I've run over the last day have me convinced no one is
actually hitting queue count changes in nvme-pci, so I'm not concerned
with opening that race to this scenario. You'd get IO errors either way;
your proposal just potentially gives you more of them, but I don't think
encountering that is practical reality right now anyway.


end of thread, other threads:[~2026-02-11 12:57 UTC | newest]

Thread overview: 29+ messages
2026-02-09  8:29 [PATCH 0/4] blk-mq/nvme: fix debugfs creation with frozen queue Yu Kuai
2026-02-09  8:29 ` [PATCH 1/4] nvme-rdma: move blk_mq_update_nr_hw_queues after nvme_unfreeze Yu Kuai
2026-02-09 14:57   ` Christoph Hellwig
2026-02-11  7:21   ` Nilay Shroff
2026-02-09  8:29 ` [PATCH 2/4] nvme-tcp: " Yu Kuai
2026-02-09 14:57   ` Christoph Hellwig
2026-02-11  7:22   ` Nilay Shroff
2026-02-09  8:29 ` [PATCH 3/4] nvme-apple: " Yu Kuai
2026-02-09 14:58   ` Christoph Hellwig
2026-02-09 15:35     ` Keith Busch
2026-02-10  6:47       ` Yu Kuai
2026-02-10 15:09         ` Keith Busch
2026-02-10 15:41           ` Christoph Hellwig
2026-02-10 16:01             ` Keith Busch
2026-02-10 16:28               ` Daniel Wagner
2026-02-11  1:57                 ` Yu Kuai
2026-02-11 12:57                   ` Keith Busch
2026-02-10  8:10       ` Nilay Shroff
2026-02-10 15:12         ` Keith Busch
2026-02-11  7:23   ` Nilay Shroff
2026-02-09  8:29 ` [PATCH 4/4] blk-mq: use blk_queue_enter/exit to protect debugfs file creation Yu Kuai
2026-02-09 14:59   ` Christoph Hellwig
2026-02-09 16:40   ` Bart Van Assche
2026-02-09 17:33     ` Keith Busch
2026-02-09 17:51       ` Bart Van Assche
2026-02-10  6:59         ` Yu Kuai
2026-02-10 15:41       ` Christoph Hellwig
2026-02-11  7:20   ` Nilay Shroff
2026-02-11  8:06     ` Yu Kuai
