* [PATCH 0/2] nvme-tcp: fix two problems found with blktests nvme/063
@ 2025-06-02 4:35 Shin'ichiro Kawasaki
2025-06-02 4:35 ` [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset Shin'ichiro Kawasaki
2025-06-02 4:35 ` [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails Shin'ichiro Kawasaki
0 siblings, 2 replies; 16+ messages in thread
From: Shin'ichiro Kawasaki @ 2025-06-02 4:35 UTC (permalink / raw)
To: linux-nvme
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig, Shin'ichiro Kawasaki
When the test case nvme/063 is repeated a few times, a WARN and a KASAN
slab-use-after-free BUG are observed [1]. This series addresses them: the
first patch fixes the WARN, and the second patch fixes the KASAN BUG.
[1] https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/
Shin'ichiro Kawasaki (2):
nvme-tcp: avoid race between nvme scan and reset
nvme-tcp: remove tag set when second admin queue config fails
drivers/nvme/host/tcp.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--
2.49.0
* [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-02 4:35 [PATCH 0/2] nvme-tcp: fix two problems found with blktests nvme/063 Shin'ichiro Kawasaki
@ 2025-06-02 4:35 ` Shin'ichiro Kawasaki
2025-06-02 6:39 ` Hannes Reinecke
` (2 more replies)
2025-06-02 4:35 ` [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails Shin'ichiro Kawasaki
1 sibling, 3 replies; 16+ messages in thread
From: Shin'ichiro Kawasaki @ 2025-06-02 4:35 UTC (permalink / raw)
To: linux-nvme
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig, Shin'ichiro Kawasaki
When the nvme scan work and the nvme controller reset work race, the
WARN below happens:
WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
The WARN can be recreated by repeating the blktests test case nvme/063 a
few times [1]. Its cause is that the scan work allocates a new request queue
for the tag set between the reset work's blk_mq_quiesce_tagset() and
blk_mq_unquiesce_tagset() calls.
Reset work                              Scan work
------------------------------------------------------------------------
nvme_reset_ctrl_work()
 nvme_tcp_teardown_ctrl()
  nvme_tcp_teardown_io_queues()
   nvme_quiesce_io_queues()
    blk_mq_quiesce_tagset()
     list_for_each_entry() .. N queues
      blk_mq_quiesce_queue()
                                        nvme_scan_work()
                                         nvme_scan_ns_*()
                                          nvme_scan_ns()
                                           nvme_alloc_ns()
                                            blk_mq_alloc_disk()
                                             __blk_mq_alloc_disk()
                                              blk_mq_alloc_queue()
                                               blk_mq_init_allocate_queue()
                                                blk_mq_add_queue_tag_set()
                                                 list_add_tail() .. N+1 queues
 nvme_tcp_setup_ctrl()
  nvme_start_ctrl()
   nvme_unquiesce_io_queues()
    blk_mq_unquiesce_tagset()
     list_for_each_entry() .. N+1 queues
      blk_mq_unquiesce_queue()
       WARN_ON_ONCE(q->quiesce_depth <= 0)
blk_mq_quiesce_queue() is not called for the new queue added by the scan
work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
To suppress the WARN, avoid the race between the reset work and the scan
work by flushing the scan work at the beginning of the reset work.
Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
drivers/nvme/host/tcp.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f6379aa33d77..14b9d67329b3 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
                 container_of(work, struct nvme_ctrl, reset_work);
         int ret;
 
+        /* prevent racing with ns scanning */
+        flush_work(&ctrl->scan_work);
+
         if (nvme_tcp_key_revoke_needed(ctrl))
                 nvme_auth_revoke_tls_key(ctrl);
         nvme_stop_ctrl(ctrl);
--
2.49.0
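For reference, the tagset-wide quiesce/unquiesce helpers referred to in the
call trace above look roughly like the condensed sketch below (simplified from
block/blk-mq.c; the skip-tagset-quiesce check and some locking details are
elided, so this is illustrative rather than verbatim). A request queue that
joins set->tag_list between the two walks is unquiesced without ever having
been quiesced, which is exactly the imbalance the WARN catches:

    /* Condensed sketch, not verbatim kernel code. */
    void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
    {
            struct request_queue *q;

            mutex_lock(&set->tag_list_lock);
            list_for_each_entry(q, &set->tag_list, tag_set_list)
                    blk_mq_quiesce_queue_nowait(q);  /* bumps q->quiesce_depth */
            mutex_unlock(&set->tag_list_lock);
            blk_mq_wait_quiesce_done(set);           /* wait for in-flight dispatch */
    }

    void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
    {
            struct request_queue *q;

            mutex_lock(&set->tag_list_lock);
            list_for_each_entry(q, &set->tag_list, tag_set_list)
                    blk_mq_unquiesce_queue(q);       /* WARNs when quiesce_depth is 0 */
            mutex_unlock(&set->tag_list_lock);
    }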
* [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails
2025-06-02 4:35 [PATCH 0/2] nvme-tcp: fix two problems found with blktests nvme/063 Shin'ichiro Kawasaki
2025-06-02 4:35 ` [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset Shin'ichiro Kawasaki
@ 2025-06-02 4:35 ` Shin'ichiro Kawasaki
2025-06-02 6:39 ` Hannes Reinecke
` (3 more replies)
1 sibling, 4 replies; 16+ messages in thread
From: Shin'ichiro Kawasaki @ 2025-06-02 4:35 UTC (permalink / raw)
To: linux-nvme
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig, Shin'ichiro Kawasaki
Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure
concatenation") modified nvme_tcp_setup_ctrl() to call
nvme_tcp_configure_admin_queue() twice. The first call prepares for
DH-CHAP negotiation, and the second call is required for secure
concatenation. However, this change triggered a KASAN slab-use-after-free
BUG in blk_mq_queue_tag_busy_iter(). The BUG can be recreated by
repeating the blktests test case nvme/063 a few times [1].
When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain
below:
nvme_tcp_create_ctrl()
 nvme_tcp_alloc_ctrl() new=true ... Alloc nvme_tcp_ctrl and admin_tag_set
 nvme_tcp_setup_ctrl() new=true
  nvme_tcp_configure_admin_queue() new=true ... Succeed
   nvme_alloc_admin_tag_set() ... Alloc the tag set for admin_tag_set
  nvme_stop_keep_alive()
  nvme_tcp_teardown_admin_queue() remove=false
  nvme_tcp_configure_admin_queue() new=false
   nvme_tcp_alloc_admin_queue() ... Fail, but do not call nvme_remove_admin_tag_set()
 nvme_uninit_ctrl()
 nvme_put_ctrl() ... Free up the nvme_tcp_ctrl and admin_tag_set
The first call of nvme_tcp_configure_admin_queue() succeeds with the
new=true argument, while the second call fails with new=false. Because of
the new=false argument, the failing call does not invoke
nvme_remove_admin_tag_set(), so the admin tag set is left behind. However,
nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() has already called
nvme_remove_admin_tag_set(), and frees struct nvme_tcp_ctrl, which embeds
the admin_tag_set field. Later on, the timeout handler accesses the freed
admin_tag_set field and causes the KASAN slab-use-after-free BUG.
To avoid leaking the admin tag set, call nvme_remove_admin_tag_set() when
the second nvme_tcp_configure_admin_queue() call fails. Instead of returning
from nvme_tcp_setup_ctrl() on failure, jump to the "destroy_admin" goto
label, which calls nvme_tcp_teardown_admin_queue() and, in turn,
nvme_remove_admin_tag_set().
Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
drivers/nvme/host/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 14b9d67329b3..85acb4c9c977 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2350,7 +2350,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
                 nvme_tcp_teardown_admin_queue(ctrl, false);
                 ret = nvme_tcp_configure_admin_queue(ctrl, false);
                 if (ret)
-                        return ret;
+                        goto destroy_admin;
         }
 
         if (ctrl->icdoff) {
--
2.49.0
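To make the error path concrete, the control flow this patch changes in
nvme_tcp_setup_ctrl() is sketched below. This is a simplified reconstruction
from the commit message, not the verbatim upstream function; the "concat"
parameter stands in for the driver's real secure-concatenation condition:

    /* Simplified sketch only; see drivers/nvme/host/tcp.c for the real code. */
    static int nvme_tcp_setup_ctrl_sketch(struct nvme_ctrl *ctrl, bool new, bool concat)
    {
            int ret;

            ret = nvme_tcp_configure_admin_queue(ctrl, new);
            if (ret)
                    return ret;

            if (concat) {
                    /* Reset the admin connection for secure channel concatenation. */
                    nvme_stop_keep_alive(ctrl);
                    nvme_tcp_teardown_admin_queue(ctrl, false);
                    ret = nvme_tcp_configure_admin_queue(ctrl, false);
                    if (ret)
                            goto destroy_admin;  /* previously: return ret; */
            }

            return 0;

    destroy_admin:
            /* Per the commit message, with new=true this also removes the admin tag set. */
            nvme_tcp_teardown_admin_queue(ctrl, new);
            return ret;
    }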
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-02 4:35 ` [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset Shin'ichiro Kawasaki
@ 2025-06-02 6:39 ` Hannes Reinecke
2025-06-03 0:48 ` Chaitanya Kulkarni
2025-06-03 10:58 ` Sagi Grimberg
2 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2025-06-02 6:39 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig
On 6/2/25 06:35, Shin'ichiro Kawasaki wrote:
> When the nvme scan work and the nvme controller reset work race, the
> WARN below happens:
>
> WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
>
> The WARN can be recreated by repeating the blktests test case nvme/063 a
> few times [1]. Its cause is the new queue allocation for the tag set
> by the scan work between blk_mq_quiesce_tagset() and
> blk_mq_unquiesce_tagset() calls by the reset work.
>
> Reset work Scan work
> ------------------------------------------------------------------------
> nvme_reset_ctrl_work()
> nvme_tcp_teardown_ctrl()
> nvme_tcp_teardown_io_queues()
> nvme_quiesce_io_queues()
> blk_mq_quiesce_tagset()
> list_for_each_entry() .. N queues
> blk_mq_quiesce_queue()
> nvme_scan_work()
> nvme_scan_ns_*()
> nvme_scan_ns()
> nvme_alloc_ns()
> blk_mq_alloc_disk()
> __blk_mq_alloc_disk()
> blk_mq_alloc_queue()
> blk_mq_init_allocate_queue()
> blk_mq_add_queue_tag_set()
> list_add_tail() .. N+1 queues
> nvme_tcp_setup_ctrl()
> nvme_start_ctrl()
> nvme_unquiesce_io_queues()
> blk_mq_unquiesce_tagset()
> list_for_each_entry() .. N+1 queues
> blk_mq_unquiesce_queue()
> WARN_ON_ONCE(q->quiesce_depth <= 0)
>
> blk_mq_quiesce_queue() is not called for the new queue added by the scan
> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>
> To suppress the WARN, avoid the race between the reset work and the scan
> work by flushing the scan work at the beginning of the reset work.
>
> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
> drivers/nvme/host/tcp.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index f6379aa33d77..14b9d67329b3 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
> container_of(work, struct nvme_ctrl, reset_work);
> int ret;
>
> + /* prevent racing with ns scanning */
> + flush_work(&ctrl->scan_work);
> +
> if (nvme_tcp_key_revoke_needed(ctrl))
> nvme_auth_revoke_tls_key(ctrl);
> nvme_stop_ctrl(ctrl);
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails
2025-06-02 4:35 ` [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails Shin'ichiro Kawasaki
@ 2025-06-02 6:39 ` Hannes Reinecke
2025-06-03 0:49 ` Chaitanya Kulkarni
` (2 subsequent siblings)
3 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2025-06-02 6:39 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig
On 6/2/25 06:35, Shin'ichiro Kawasaki wrote:
> Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure
> concatenation") modified nvme_tcp_setup_ctrl() to call
> nvme_tcp_configure_admin_queue() twice. The first call prepares for
> DH-CHAP negotitation, and the second call is required for secure
> concatenation. However, this change triggered BUG KASAN slab-use-after-
> free in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by
> repeating the blktests test case nvme/063 a few times [1].
>
> When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain
> below:
>
> nvme_tcp_create_ctrl()
> nvme_tcp_alloc_ctrl() new=true ... Alloc nvme_tcp_ctrl and admin_tag_set
> nvme_tcp_setup_ctrl() new=true
> nvme_tcp_configure_admin_queue() new=true ... Succeed
> nvme_alloc_admin_tag_set() ... Alloc the tag set for admin_tag_set
> nvme_stop_keep_alive()
> nvme_tcp_teardown_admin_queue() remove=false
> nvme_tcp_configure_admin_queue() new=false
> nvme_tcp_alloc_admin_queue() ... Fail, but do not call nvme_remove_admin_tag_set()
> nvme_uninit_ctrl()
> nvme_put_ctrl() ... Free up the nvme_tcp_ctrl and admin_tag_set
>
> The first call of nvme_tcp_configure_admin_queue() succeeds with
> new=true argument. The second call fails with new=false argument. This
> second call does not call nvme_remove_admin_tag_set() on failure, due to
> the new=false argument. Then the admin tag set is not removed. However,
> nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call
> nvme_remove_admin_tag_set(). Then it frees up struct nvme_tcp_ctrl which
> has admin_tag_set field. Later on, the timeout handler accesses the
> admin_tag_set field and causes the BUG KASAN slab-use-after-free.
>
> To not leave the admin tag set, call nvme_remove_admin_tag_set() when
> the second nvme_tcp_configure_admin_queue() call fails. Do not return
> from nvme_tcp_setup_ctrl() on failure. Instead, jump to "destroy_admin"
> go-to label to call nvme_tcp_teardown_admin_queue() which calls
> nvme_remove_admin_tag_set().
>
> Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation")
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
> drivers/nvme/host/tcp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 14b9d67329b3..85acb4c9c977 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2350,7 +2350,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
> nvme_tcp_teardown_admin_queue(ctrl, false);
> ret = nvme_tcp_configure_admin_queue(ctrl, false);
> if (ret)
> - return ret;
> + goto destroy_admin;
> }
>
> if (ctrl->icdoff) {
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-02 4:35 ` [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset Shin'ichiro Kawasaki
2025-06-02 6:39 ` Hannes Reinecke
@ 2025-06-03 0:48 ` Chaitanya Kulkarni
2025-06-03 10:58 ` Sagi Grimberg
2 siblings, 0 replies; 16+ messages in thread
From: Chaitanya Kulkarni @ 2025-06-03 0:48 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme@lists.infradead.org
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig
On 6/1/25 21:35, Shin'ichiro Kawasaki wrote:
> When the nvme scan work and the nvme controller reset work race, the
> WARN below happens:
>
> WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
>
> The WARN can be recreated by repeating the blktests test case nvme/063 a
> few times [1]. Its cause is the new queue allocation for the tag set
> by the scan work between blk_mq_quiesce_tagset() and
> blk_mq_unquiesce_tagset() calls by the reset work.
>
> Reset work Scan work
> ------------------------------------------------------------------------
> nvme_reset_ctrl_work()
> nvme_tcp_teardown_ctrl()
> nvme_tcp_teardown_io_queues()
> nvme_quiesce_io_queues()
> blk_mq_quiesce_tagset()
> list_for_each_entry() .. N queues
> blk_mq_quiesce_queue()
> nvme_scan_work()
> nvme_scan_ns_*()
> nvme_scan_ns()
> nvme_alloc_ns()
> blk_mq_alloc_disk()
> __blk_mq_alloc_disk()
> blk_mq_alloc_queue()
> blk_mq_init_allocate_queue()
> blk_mq_add_queue_tag_set()
> list_add_tail() .. N+1 queues
> nvme_tcp_setup_ctrl()
> nvme_start_ctrl()
> nvme_unquiesce_io_queues()
> blk_mq_unquiesce_tagset()
> list_for_each_entry() .. N+1 queues
> blk_mq_unquiesce_queue()
> WARN_ON_ONCE(q->quiesce_depth <= 0)
>
> blk_mq_quiesce_queue() is not called for the new queue added by the scan
> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>
> To suppress the WARN, avoid the race between the reset work and the scan
> work by flushing the scan work at the beginning of the reset work.
>
> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Looks good.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* Re: [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails
2025-06-02 4:35 ` [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails Shin'ichiro Kawasaki
2025-06-02 6:39 ` Hannes Reinecke
@ 2025-06-03 0:49 ` Chaitanya Kulkarni
2025-06-04 6:50 ` Sagi Grimberg
2025-06-04 7:53 ` Christoph Hellwig
3 siblings, 0 replies; 16+ messages in thread
From: Chaitanya Kulkarni @ 2025-06-03 0:49 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme@lists.infradead.org
Cc: Hannes Reinecke, Sagi Grimberg, Keith Busch, Jens Axboe,
Christoph Hellwig
On 6/1/25 21:35, Shin'ichiro Kawasaki wrote:
> Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure
> concatenation") modified nvme_tcp_setup_ctrl() to call
> nvme_tcp_configure_admin_queue() twice. The first call prepares for
> DH-CHAP negotitation, and the second call is required for secure
> concatenation. However, this change triggered BUG KASAN slab-use-after-
> free in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by
> repeating the blktests test case nvme/063 a few times [1].
>
> When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain
> below:
>
> nvme_tcp_create_ctrl()
> nvme_tcp_alloc_ctrl() new=true ... Alloc nvme_tcp_ctrl and admin_tag_set
> nvme_tcp_setup_ctrl() new=true
> nvme_tcp_configure_admin_queue() new=true ... Succeed
> nvme_alloc_admin_tag_set() ... Alloc the tag set for admin_tag_set
> nvme_stop_keep_alive()
> nvme_tcp_teardown_admin_queue() remove=false
> nvme_tcp_configure_admin_queue() new=false
> nvme_tcp_alloc_admin_queue() ... Fail, but do not call nvme_remove_admin_tag_set()
> nvme_uninit_ctrl()
> nvme_put_ctrl() ... Free up the nvme_tcp_ctrl and admin_tag_set
>
> The first call of nvme_tcp_configure_admin_queue() succeeds with
> new=true argument. The second call fails with new=false argument. This
> second call does not call nvme_remove_admin_tag_set() on failure, due to
> the new=false argument. Then the admin tag set is not removed. However,
> nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call
> nvme_remove_admin_tag_set(). Then it frees up struct nvme_tcp_ctrl which
> has admin_tag_set field. Later on, the timeout handler accesses the
> admin_tag_set field and causes the BUG KASAN slab-use-after-free.
>
> To not leave the admin tag set, call nvme_remove_admin_tag_set() when
> the second nvme_tcp_configure_admin_queue() call fails. Do not return
> from nvme_tcp_setup_ctrl() on failure. Instead, jump to "destroy_admin"
> go-to label to call nvme_tcp_teardown_admin_queue() which calls
> nvme_remove_admin_tag_set().
>
> Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation")
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Thanks for the really good commit log; it made reviewing both patches
easier. Looks good.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-02 4:35 ` [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset Shin'ichiro Kawasaki
2025-06-02 6:39 ` Hannes Reinecke
2025-06-03 0:48 ` Chaitanya Kulkarni
@ 2025-06-03 10:58 ` Sagi Grimberg
2025-06-04 7:10 ` Sagi Grimberg
2 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2025-06-03 10:58 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme
Cc: Hannes Reinecke, Keith Busch, Jens Axboe, Christoph Hellwig
On 02/06/2025 7:35, Shin'ichiro Kawasaki wrote:
> When the nvme scan work and the nvme controller reset work race, the
> WARN below happens:
>
> WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
>
> The WARN can be recreated by repeating the blktests test case nvme/063 a
> few times [1]. Its cause is the new queue allocation for the tag set
> by the scan work between blk_mq_quiesce_tagset() and
> blk_mq_unquiesce_tagset() calls by the reset work.
>
> Reset work Scan work
> ------------------------------------------------------------------------
> nvme_reset_ctrl_work()
> nvme_tcp_teardown_ctrl()
> nvme_tcp_teardown_io_queues()
> nvme_quiesce_io_queues()
> blk_mq_quiesce_tagset()
> list_for_each_entry() .. N queues
> blk_mq_quiesce_queue()
> nvme_scan_work()
> nvme_scan_ns_*()
> nvme_scan_ns()
> nvme_alloc_ns()
> blk_mq_alloc_disk()
> __blk_mq_alloc_disk()
> blk_mq_alloc_queue()
> blk_mq_init_allocate_queue()
> blk_mq_add_queue_tag_set()
> list_add_tail() .. N+1 queues
> nvme_tcp_setup_ctrl()
> nvme_start_ctrl()
> nvme_unquiesce_io_queues()
> blk_mq_unquiesce_tagset()
> list_for_each_entry() .. N+1 queues
> blk_mq_unquiesce_queue()
> WARN_ON_ONCE(q->quiesce_depth <= 0)
>
> blk_mq_quiesce_queue() is not called for the new queue added by the scan
> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>
> To suppress the WARN, avoid the race between the reset work and the scan
> work by flushing the scan work at the beginning of the reset work.
>
> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
> drivers/nvme/host/tcp.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index f6379aa33d77..14b9d67329b3 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
> container_of(work, struct nvme_ctrl, reset_work);
> int ret;
>
> + /* prevent racing with ns scanning */
> + flush_work(&ctrl->scan_work);
This is a problem. We are flushing a work that is I/O bound, which can take
a long time to complete. Up until now, we have deliberately avoided
introducing a dependency between reset forward progress and the completion
of scan work I/O.
I would like to keep it this way.
BTW, this is not TCP specific.
* Re: [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails
2025-06-02 4:35 ` [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails Shin'ichiro Kawasaki
2025-06-02 6:39 ` Hannes Reinecke
2025-06-03 0:49 ` Chaitanya Kulkarni
@ 2025-06-04 6:50 ` Sagi Grimberg
2025-06-04 7:53 ` Christoph Hellwig
3 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2025-06-04 6:50 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme
Cc: Hannes Reinecke, Keith Busch, Jens Axboe, Christoph Hellwig
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-03 10:58 ` Sagi Grimberg
@ 2025-06-04 7:10 ` Sagi Grimberg
2025-06-04 11:17 ` Shinichiro Kawasaki
0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2025-06-04 7:10 UTC (permalink / raw)
To: Shin'ichiro Kawasaki, linux-nvme, linux-block
Cc: Hannes Reinecke, Keith Busch, Jens Axboe, Christoph Hellwig
On 03/06/2025 13:58, Sagi Grimberg wrote:
>
>
> On 02/06/2025 7:35, Shin'ichiro Kawasaki wrote:
>> When the nvme scan work and the nvme controller reset work race, the
>> WARN below happens:
>>
>> WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
>>
>> The WARN can be recreated by repeating the blktests test case nvme/063 a
>> few times [1]. Its cause is the new queue allocation for the tag set
>> by the scan work between blk_mq_quiesce_tagset() and
>> blk_mq_unquiesce_tagset() calls by the reset work.
>>
>> Reset work Scan work
>> ------------------------------------------------------------------------
>> nvme_reset_ctrl_work()
>> nvme_tcp_teardown_ctrl()
>> nvme_tcp_teardown_io_queues()
>> nvme_quiesce_io_queues()
>> blk_mq_quiesce_tagset()
>> list_for_each_entry() .. N queues
>> blk_mq_quiesce_queue()
>> nvme_scan_work()
>> nvme_scan_ns_*()
>> nvme_scan_ns()
>> nvme_alloc_ns()
>> blk_mq_alloc_disk()
>> __blk_mq_alloc_disk()
>> blk_mq_alloc_queue()
>> blk_mq_init_allocate_queue()
>> blk_mq_add_queue_tag_set()
>> list_add_tail() .. N+1 queues
>> nvme_tcp_setup_ctrl()
>> nvme_start_ctrl()
>> nvme_unquiesce_io_queues()
>> blk_mq_unquiesce_tagset()
>> list_for_each_entry() .. N+1 queues
>> blk_mq_unquiesce_queue()
>> WARN_ON_ONCE(q->quiesce_depth <= 0)
>>
>> blk_mq_quiesce_queue() is not called for the new queue added by the scan
>> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>>
>> To suppress the WARN, avoid the race between the reset work and the scan
>> work by flushing the scan work at the beginning of the reset work.
>>
>> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> ---
>> drivers/nvme/host/tcp.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index f6379aa33d77..14b9d67329b3 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct
>> work_struct *work)
>> container_of(work, struct nvme_ctrl, reset_work);
>> int ret;
>> + /* prevent racing with ns scanning */
>> + flush_work(&ctrl->scan_work);
>
> This is a problem. We are flushing a work that is IO bound, which can
> take a long time to complete.
> Up until now, we have deliberately avoided introducing dependency
> between reset forward progress
> and scan work IO to completely finish.
>
> I would like to keep it this way.
>
> BTW, this is not TCP specific.
blk_mq_unquiesce_queue() is still very much safe to call as many times as we
want. The only thing that gets in the way is this pesky WARN. How about we
make it go away and have a debug print instead?
My preference would be to allow nvme to unquiesce queues that were not
previously quiesced (just like it historically was), instead of having to
block a controller reset until the scan_work is completed (which is admin
I/O dependent, and may get stuck until the admin timeout, which the user can
raise to 60 minutes or something arbitrarily long, btw).
How about something like this patch instead:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..74f3ad16e812 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
         bool run_queue = false;
 
         spin_lock_irqsave(&q->queue_lock, flags);
-        if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
-                ;
+        if (q->quiesce_depth <= 0) {
+                printk(KERN_DEBUG
+                        "dev %s: unquiescing a non-quiesced queue, expected?\n",
+                        q->disk ? q->disk->disk_name : "?");
         } else if (!--q->quiesce_depth) {
                 blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
                 run_queue = true;
--
* Re: [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails
2025-06-02 4:35 ` [PATCH 2/2] nvme-tcp: remove tag set when second admin queue config fails Shin'ichiro Kawasaki
` (2 preceding siblings ...)
2025-06-04 6:50 ` Sagi Grimberg
@ 2025-06-04 7:53 ` Christoph Hellwig
3 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-06-04 7:53 UTC (permalink / raw)
To: Shin'ichiro Kawasaki
Cc: linux-nvme, Hannes Reinecke, Sagi Grimberg, Keith Busch,
Jens Axboe, Christoph Hellwig
Thanks, applied to nvme-6.16.
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-04 7:10 ` Sagi Grimberg
@ 2025-06-04 11:17 ` Shinichiro Kawasaki
2025-06-13 4:10 ` Shinichiro Kawasaki
2025-06-13 4:56 ` Ming Lei
0 siblings, 2 replies; 16+ messages in thread
From: Shinichiro Kawasaki @ 2025-06-04 11:17 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-nvme@lists.infradead.org, linux-block, Hannes Reinecke,
Keith Busch, Jens Axboe, hch@infradead.org, Ming Lei
Cc+: Ming,
Hi Sagi, thanks for the background explanation and the suggestion.
On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
...
> > This is a problem. We are flushing a work that is IO bound, which can
> > take a long time to complete.
> > Up until now, we have deliberately avoided introducing dependency
> > between reset forward progress
> > and scan work IO to completely finish.
> >
> > I would like to keep it this way.
> >
> > BTW, this is not TCP specific.
I see. The blktests test case nvme/063 is dedicated to the TCP transport,
which is why the issue was reported against nvme-tcp.
>
> blk_mq_unquiesce_queue is still very much safe to call as many times as we
> want.
> The only thing that comes in the way is this pesky WARN. How about we make
> it go away and have
> a debug print instead?
>
> My preference would be to allow nvme to unquiesce queues that were not
> previously quiesced (just
> like it historically was) instead of having to block a controller reset
> until the scan_work is completed (which
> is admin I/O dependent, and may get stuck until admin timeout, which can be
> changed by the user for 60
> minutes or something arbitrarily long btw).
>
> How about something like this patch instead:
> --
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c2697db59109..74f3ad16e812 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> bool run_queue = false;
>
> spin_lock_irqsave(&q->queue_lock, flags);
> - if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> - ;
> + if (q->quiesce_depth <= 0) {
> + printk(KERN_DEBUG
> +                "dev %s: unquiescing a non-quiesced queue, expected?\n",
> +                q->disk ? q->disk->disk_name : "?");
> } else if (!--q->quiesce_depth) {
> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> run_queue = true;
> --
The WARN was introduced with commit e70feb8b3e68 ("blk-mq: support
concurrent queue quiesce/unquiesce"), which Ming authored. Ming, may I ask
for your comment on Sagi's suggestion?
In case the WARN is left as it is, blktests can ignore it by adding the line
below to the test case:
DMESG_FILTER="grep --invert-match blk_mq_unquiesce_queue"
That said, I think Sagi's solution would be cleaner.
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-04 11:17 ` Shinichiro Kawasaki
@ 2025-06-13 4:10 ` Shinichiro Kawasaki
2025-06-13 4:56 ` Ming Lei
1 sibling, 0 replies; 16+ messages in thread
From: Shinichiro Kawasaki @ 2025-06-13 4:10 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-nvme@lists.infradead.org, linux-block, Hannes Reinecke,
Keith Busch, Jens Axboe, hch@infradead.org, Ming Lei
On Jun 04, 2025 / 11:17, Shinichiro Kawasaki wrote:
...
> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> ...
> > My preference would be to allow nvme to unquiesce queues that were not
> > previously quiesced (just
> > like it historically was) instead of having to block a controller reset
> > until the scan_work is completed (which
> > is admin I/O dependent, and may get stuck until admin timeout, which can be
> > changed by the user for 60
> > minutes or something arbitrarily long btw).
> >
> > How about something like this patch instead:
> > --
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c2697db59109..74f3ad16e812 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> > bool run_queue = false;
> >
> > spin_lock_irqsave(&q->queue_lock, flags);
> > - if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > - ;
> > + if (q->quiesce_depth <= 0) {
> > + printk(KERN_DEBUG
> > +                "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > +                q->disk ? q->disk->disk_name : "?");
> > } else if (!--q->quiesce_depth) {
> > blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > run_queue = true;
> > --
>
> The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
> concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
> ask your comment on the suggestion by Sagi?
>
> In case the WARN will be left as it is, blktests can ignore it by adding the
> line below to the test case:
>
> DMESG_FILTER="grep --invert-match blk_mq_unquiesce_queue"
>
> Said that, I think Sagi's solution will be cleaner.
FYI, I tried to recreate the WARN in blk_mq_unquiesce_queue() using the
v6.16-rc1 kernel, but it was not recreated. AFAIK, the kernel changes between
v6.15 and v6.16-rc1 do not address the WARN, so I'm guessing it just
disappeared because of timing changes. Anyway, I suggest giving this problem
low priority. Sagi, Hannes, thanks for your actions.
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-04 11:17 ` Shinichiro Kawasaki
2025-06-13 4:10 ` Shinichiro Kawasaki
@ 2025-06-13 4:56 ` Ming Lei
2025-06-26 4:45 ` Shinichiro Kawasaki
2025-06-28 10:24 ` Sagi Grimberg
1 sibling, 2 replies; 16+ messages in thread
From: Ming Lei @ 2025-06-13 4:56 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-block,
Hannes Reinecke, Keith Busch, Jens Axboe, hch@infradead.org
On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
> Cc+: Ming,
>
> Hi Sagi, thanks for the background explanation and the suggestion.
>
> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> ...
> > > This is a problem. We are flushing a work that is IO bound, which can
> > > take a long time to complete.
> > > Up until now, we have deliberately avoided introducing dependency
> > > between reset forward progress
> > > and scan work IO to completely finish.
> > >
> > > I would like to keep it this way.
> > >
> > > BTW, this is not TCP specific.
>
> I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
> why it was reported for the TCP transport.
>
> >
> > blk_mq_unquiesce_queue is still very much safe to call as many times as we
> > want.
> > The only thing that comes in the way is this pesky WARN. How about we make
> > it go away and have
> > a debug print instead?
> >
> > My preference would be to allow nvme to unquiesce queues that were not
> > previously quiesced (just
> > like it historically was) instead of having to block a controller reset
> > until the scan_work is completed (which
> > is admin I/O dependent, and may get stuck until admin timeout, which can be
> > changed by the user for 60
> > minutes or something arbitrarily long btw).
> >
> > How about something like this patch instead:
> > --
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c2697db59109..74f3ad16e812 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> > bool run_queue = false;
> >
> > spin_lock_irqsave(&q->queue_lock, flags);
> > - if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > - ;
> > + if (q->quiesce_depth <= 0) {
> > + printk(KERN_DEBUG
> > +                "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > +                q->disk ? q->disk->disk_name : "?");
> > } else if (!--q->quiesce_depth) {
> > blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > run_queue = true;
> > --
>
> The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
> concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
> ask your comment on the suggestion by Sagi?
I think it is bad to use a standard block layer API in this unbalanced way;
that is why the WARN_ON_ONCE() was added. We shouldn't encourage drivers to
use APIs in this way.
The question is: why does nvme have to unquiesce a non-quiesced queue?
Thanks,
Ming
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-13 4:56 ` Ming Lei
@ 2025-06-26 4:45 ` Shinichiro Kawasaki
2025-06-28 10:24 ` Sagi Grimberg
1 sibling, 0 replies; 16+ messages in thread
From: Shinichiro Kawasaki @ 2025-06-26 4:45 UTC (permalink / raw)
To: Ming Lei
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-block,
Hannes Reinecke, Keith Busch, Jens Axboe, hch@infradead.org
On Jun 13, 2025 / 12:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
> > Cc+: Ming,
> >
> > Hi Sagi, thanks for the background explanation and the suggestion.
> >
> > On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> > ...
> > > > This is a problem. We are flushing a work that is IO bound, which can
> > > > take a long time to complete.
> > > > Up until now, we have deliberately avoided introducing dependency
> > > > between reset forward progress
> > > > and scan work IO to completely finish.
> > > >
> > > > I would like to keep it this way.
> > > >
> > > > BTW, this is not TCP specific.
> >
> > I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
> > why it was reported for the TCP transport.
> >
> > >
> > > blk_mq_unquiesce_queue is still very much safe to call as many times as we
> > > want.
> > > The only thing that comes in the way is this pesky WARN. How about we make
> > > it go away and have
> > > a debug print instead?
> > >
> > > My preference would be to allow nvme to unquiesce queues that were not
> > > previously quiesced (just
> > > like it historically was) instead of having to block a controller reset
> > > until the scan_work is completed (which
> > > is admin I/O dependent, and may get stuck until admin timeout, which can be
> > > changed by the user for 60
> > > minutes or something arbitrarily long btw).
> > >
> > > How about something like this patch instead:
> > > --
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index c2697db59109..74f3ad16e812 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> > > bool run_queue = false;
> > >
> > > spin_lock_irqsave(&q->queue_lock, flags);
> > > - if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > > - ;
> > > + if (q->quiesce_depth <= 0) {
> > > + printk(KERN_DEBUG
> > > +                "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > > +                q->disk ? q->disk->disk_name : "?");
> > > } else if (!--q->quiesce_depth) {
> > > blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > > run_queue = true;
> > > --
> >
> > The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
> > concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
> > ask your comment on the suggestion by Sagi?
>
> I think it is bad to use one standard block layer API in this unbalanced way,
> that is why WARN_ON_ONCE() is added. We shouldn't encourage driver to use
> APIs in this way.
>
> Question is why nvme have to unquiesce one non-quiesced queue?
Ming, thanks for the comment. I think the call trace in the commit message
of my patch answers your question: blk_mq_add_queue_tag_set() is called
between the blk_mq_quiesce_tagset() and blk_mq_unquiesce_tagset() calls.
Let me show the call trace again below.
Anyway, the WARN disappeared with the v6.16-rc1 kernel, so from my point of
view this problem is not high priority at the moment.
Reset work                              Scan work
------------------------------------------------------------------------
nvme_reset_ctrl_work()
 nvme_tcp_teardown_ctrl()
  nvme_tcp_teardown_io_queues()
   nvme_quiesce_io_queues()
    blk_mq_quiesce_tagset()
     list_for_each_entry() .. N queues
      blk_mq_quiesce_queue()
                                        nvme_scan_work()
                                         nvme_scan_ns_*()
                                          nvme_scan_ns()
                                           nvme_alloc_ns()
                                            blk_mq_alloc_disk()
                                             __blk_mq_alloc_disk()
                                              blk_mq_alloc_queue()
                                               blk_mq_init_allocate_queue()
                                                blk_mq_add_queue_tag_set()
                                                 list_add_tail() .. N+1 queues
 nvme_tcp_setup_ctrl()
  nvme_start_ctrl()
   nvme_unquiesce_io_queues()
    blk_mq_unquiesce_tagset()
     list_for_each_entry() .. N+1 queues
      blk_mq_unquiesce_queue()
       WARN_ON_ONCE(q->quiesce_depth <= 0)
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
2025-06-13 4:56 ` Ming Lei
2025-06-26 4:45 ` Shinichiro Kawasaki
@ 2025-06-28 10:24 ` Sagi Grimberg
1 sibling, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2025-06-28 10:24 UTC (permalink / raw)
To: Ming Lei, Shinichiro Kawasaki
Cc: linux-nvme@lists.infradead.org, linux-block, Hannes Reinecke,
Keith Busch, Jens Axboe, hch@infradead.org
On 13/06/2025 7:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
>> Cc+: Ming,
>>
>> Hi Sagi, thanks for the background explanation and the suggestion.
>>
>> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
>> ...
>>>> This is a problem. We are flushing a work that is IO bound, which can
>>>> take a long time to complete.
>>>> Up until now, we have deliberately avoided introducing dependency
>>>> between reset forward progress
>>>> and scan work IO to completely finish.
>>>>
>>>> I would like to keep it this way.
>>>>
>>>> BTW, this is not TCP specific.
>> I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
>> why it was reported for the TCP transport.
>>
>>> blk_mq_unquiesce_queue is still very much safe to call as many times as we
>>> want.
>>> The only thing that comes in the way is this pesky WARN. How about we make
>>> it go away and have
>>> a debug print instead?
>>>
>>> My preference would be to allow nvme to unquiesce queues that were not
>>> previously quiesced (just
>>> like it historically was) instead of having to block a controller reset
>>> until the scan_work is completed (which
>>> is admin I/O dependent, and may get stuck until admin timeout, which can be
>>> changed by the user for 60
>>> minutes or something arbitrarily long btw).
>>>
>>> How about something like this patch instead:
>>> --
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index c2697db59109..74f3ad16e812 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>>> bool run_queue = false;
>>>
>>> spin_lock_irqsave(&q->queue_lock, flags);
>>> - if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
>>> - ;
>>> + if (q->quiesce_depth <= 0) {
>>> + printk(KERN_DEBUG
>>> +                "dev %s: unquiescing a non-quiesced queue, expected?\n",
>>> +                q->disk ? q->disk->disk_name : "?");
>>> } else if (!--q->quiesce_depth) {
>>> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
>>> run_queue = true;
>>> --
>> The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
>> concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
>> ask your comment on the suggestion by Sagi?
> I think it is bad to use one standard block layer API in this unbalanced way,
> that is why WARN_ON_ONCE() is added. We shouldn't encourage driver to use
> APIs in this way.
>
> Question is why nvme have to unquiesce one non-quiesced queue?
It started before quiesce/unquiesce became an API that had to be balanced.
In this case, we are using the tagset quiesce/unquiesce interface, which
iterates over the request queues of the tagset. The problem is that, due to
namespace scanning, we may add new namespaces (and their request queues)
after we quiesced the tagset. Then, when we unquiesce the tagset, we hit a
new request queue that wasn't quiesced (it was added after the quiesce).
I don't mind having a BLK_FEAT_ALLOW_UNBALANCED_QUIESCE for the nvme request
queues if you don't want to blindly remove the WARN_ON.
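A minimal sketch of what such an opt-in might look like, assuming the flag
were added to the queue feature bits (BLK_FEAT_ALLOW_UNBALANCED_QUIESCE is
hypothetical here and does not exist upstream; the surrounding logic mirrors
today's blk_mq_unquiesce_queue()):

    /* Hypothetical sketch: BLK_FEAT_ALLOW_UNBALANCED_QUIESCE is not a real flag. */
    void blk_mq_unquiesce_queue(struct request_queue *q)
    {
            unsigned long flags;
            bool run_queue = false;

            spin_lock_irqsave(&q->queue_lock, flags);
            if (q->quiesce_depth <= 0) {
                    /* Warn only for queues that did not opt in to unbalanced unquiesce. */
                    WARN_ON_ONCE(!(q->limits.features & BLK_FEAT_ALLOW_UNBALANCED_QUIESCE));
            } else if (!--q->quiesce_depth) {
                    blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
                    run_queue = true;
            }
            spin_unlock_irqrestore(&q->queue_lock, flags);

            /* Dispatch requests inserted while the queue was quiesced. */
            if (run_queue)
                    blk_mq_run_hw_queues(q, true);
    }

nvme could then set the feature on its namespace request queues at allocation
time, so queues added in the middle of a reset would not trip the warning.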