* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
From: Sagi Grimberg @ 2025-06-04 7:10 UTC
To: Shin'ichiro Kawasaki, linux-nvme, linux-block
Cc: Hannes Reinecke, Keith Busch, Jens Axboe, Christoph Hellwig
On 03/06/2025 13:58, Sagi Grimberg wrote:
>
>
> On 02/06/2025 7:35, Shin'ichiro Kawasaki wrote:
>> When the nvme scan work and the nvme controller reset work race, the
>> WARN below happens:
>>
>> WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
>>
>> The WARN can be recreated by repeating the blktests test case nvme/063 a
>> few times [1]. Its cause is the new queue allocation for the tag set
>> by the scan work between blk_mq_quiesce_tagset() and
>> blk_mq_unquiesce_tagset() calls by the reset work.
>>
>> Reset work                              Scan work
>> ------------------------------------------------------------------------
>> nvme_reset_ctrl_work()
>>  nvme_tcp_teardown_ctrl()
>>   nvme_tcp_teardown_io_queues()
>>    nvme_quiesce_io_queues()
>>     blk_mq_quiesce_tagset()
>>      list_for_each_entry() .. N queues
>>       blk_mq_quiesce_queue()
>>                                         nvme_scan_work()
>>                                          nvme_scan_ns_*()
>>                                           nvme_scan_ns()
>>                                            nvme_alloc_ns()
>>                                             blk_mq_alloc_disk()
>>                                              __blk_mq_alloc_disk()
>>                                               blk_mq_alloc_queue()
>>                                                blk_mq_init_allocate_queue()
>>                                                 blk_mq_add_queue_tag_set()
>>                                                  list_add_tail() .. N+1 queues
>>  nvme_tcp_setup_ctrl()
>>   nvme_start_ctrl()
>>    nvme_unquiesce_io_queues()
>>     blk_mq_unquiesce_tagset()
>>      list_for_each_entry() .. N+1 queues
>>       blk_mq_unquiesce_queue()
>>        WARN_ON_ONCE(q->quiesce_depth <= 0)
>>
>> blk_mq_quiesce_queue() is not called for the new queue added by the scan
>> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>>
>> To suppress the WARN, avoid the race between the reset work and the scan
>> work by flushing the scan work at the beginning of the reset work.
>>
>> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> ---
>> drivers/nvme/host/tcp.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index f6379aa33d77..14b9d67329b3 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>>  			container_of(work, struct nvme_ctrl, reset_work);
>>  	int ret;
>>
>> +	/* prevent racing with ns scanning */
>> +	flush_work(&ctrl->scan_work);
>
> This is a problem. We are flushing a work that is IO bound, which can
> take a long time to complete.
> Up until now, we have deliberately avoided introducing dependency
> between reset forward progress
> and scan work IO to completely finish.
>
> I would like to keep it this way.
>
> BTW, this is not TCP specific.
blk_mq_unquiesce_queue is still very much safe to call as many times as
we want. The only thing that gets in the way is this pesky WARN. How about
we make it go away and have a debug print instead?

My preference would be to allow nvme to unquiesce queues that were not
previously quiesced (just like it historically was) instead of having to
block a controller reset until the scan_work has completed (which is admin
I/O dependent, and may get stuck until the admin timeout, which the user
can change to 60 minutes or something arbitrarily long, btw).
How about something like this patch instead:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..74f3ad16e812 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 	bool run_queue = false;
 
 	spin_lock_irqsave(&q->queue_lock, flags);
-	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
-		;
+	if (q->quiesce_depth <= 0) {
+		printk(KERN_DEBUG
+		       "dev %s: unquiescing a non-quiesced queue, expected?\n",
+		       q->disk ? q->disk->disk_name : "?");
 	} else if (!--q->quiesce_depth) {
 		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
 		run_queue = true;
--
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
From: Shinichiro Kawasaki @ 2025-06-04 11:17 UTC
To: Sagi Grimberg
Cc: linux-nvme@lists.infradead.org, linux-block, Hannes Reinecke,
Keith Busch, Jens Axboe, hch@infradead.org, Ming Lei
Cc+: Ming,
Hi Sagi, thanks for the background explanation and the suggestion.
On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
...
> > This is a problem. We are flushing a work that is IO bound, which can
> > take a long time to complete.
> > Up until now, we have deliberately avoided introducing dependency
> > between reset forward progress
> > and scan work IO to completely finish.
> >
> > I would like to keep it this way.
> >
> > BTW, this is not TCP specific.
I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
why it was reported for the TCP transport.
>
> blk_mq_unquiesce_queue is still very much safe to call as many times as we
> want.
> The only thing that comes in the way is this pesky WARN. How about we make
> it go away and have
> a debug print instead?
>
> My preference would be to allow nvme to unquiesce queues that were not
> previously quiesced (just
> like it historically was) instead of having to block a controller reset
> until the scan_work is completed (which
> is admin I/O dependent, and may get stuck until admin timeout, which can be
> changed by the user for 60
> minutes or something arbitrarily long btw).
>
> How about something like this patch instead:
> --
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c2697db59109..74f3ad16e812 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>  	bool run_queue = false;
>  
>  	spin_lock_irqsave(&q->queue_lock, flags);
> -	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> -		;
> +	if (q->quiesce_depth <= 0) {
> +		printk(KERN_DEBUG
> +		       "dev %s: unquiescing a non-quiesced queue, expected?\n",
> +		       q->disk ? q->disk->disk_name : "?");
>  	} else if (!--q->quiesce_depth) {
>  		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
>  		run_queue = true;
> --
The WARN was introduced with commit e70feb8b3e68 ("blk-mq: support
concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
ask for your comment on Sagi's suggestion?

In case the WARN is left as it is, blktests can ignore it by adding the
line below to the test case:

DMESG_FILTER="grep --invert-match blk_mq_unquiesce_queue"

That said, I think Sagi's solution would be cleaner.
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
From: Shinichiro Kawasaki @ 2025-06-13 4:10 UTC
To: Sagi Grimberg
Cc: linux-nvme@lists.infradead.org, linux-block, Hannes Reinecke,
Keith Busch, Jens Axboe, hch@infradead.org, Ming Lei
On Jun 04, 2025 / 11:17, Shinichiro Kawasaki wrote:
...
> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> ...
> > My preference would be to allow nvme to unquiesce queues that were not
> > previously quiesced (just
> > like it historically was) instead of having to block a controller reset
> > until the scan_work is completed (which
> > is admin I/O dependent, and may get stuck until admin timeout, which can be
> > changed by the user for 60
> > minutes or something arbitrarily long btw).
> >
> > How about something like this patch instead:
> > --
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c2697db59109..74f3ad16e812 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> >  	bool run_queue = false;
> >  
> >  	spin_lock_irqsave(&q->queue_lock, flags);
> > -	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > -		;
> > +	if (q->quiesce_depth <= 0) {
> > +		printk(KERN_DEBUG
> > +		       "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > +		       q->disk ? q->disk->disk_name : "?");
> >  	} else if (!--q->quiesce_depth) {
> >  		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> >  		run_queue = true;
> > --
>
> The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
> concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
> ask your comment on the suggestion by Sagi?
>
> In case the WARN will be left as it is, blktests can ignore it by adding the
> line below to the test case:
>
> DMESG_FILTER="grep --invert-match blk_mq_unquiesce_queue"
>
> Said that, I think Sagi's solution will be cleaner.
FYI, I tried to recreate the WARN in blk_mq_unquiesce_queue() using the kernel
v6.16-rc1, but it was not recreated. AFAIK, the kernel changes between v6.15 and
v6.16-rc1 do not address the WARN, so I'm guessing it just disappeared because
of timing changes. Anyway, I suggest giving this problem low priority. Sagi,
Hannes, thanks for your actions.
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
From: Ming Lei @ 2025-06-13 4:56 UTC
To: Shinichiro Kawasaki
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-block,
Hannes Reinecke, Keith Busch, Jens Axboe, hch@infradead.org
On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
> Cc+: Ming,
>
> Hi Sagi, thanks for the background explanation and the suggestion.
>
> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> ...
> > > This is a problem. We are flushing a work that is IO bound, which can
> > > take a long time to complete.
> > > Up until now, we have deliberately avoided introducing dependency
> > > between reset forward progress
> > > and scan work IO to completely finish.
> > >
> > > I would like to keep it this way.
> > >
> > > BTW, this is not TCP specific.
>
> I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
> why it was reported for the TCP transport.
>
> >
> > blk_mq_unquiesce_queue is still very much safe to call as many times as we
> > want.
> > The only thing that comes in the way is this pesky WARN. How about we make
> > it go away and have
> > a debug print instead?
> >
> > My preference would be to allow nvme to unquiesce queues that were not
> > previously quiesced (just
> > like it historically was) instead of having to block a controller reset
> > until the scan_work is completed (which
> > is admin I/O dependent, and may get stuck until admin timeout, which can be
> > changed by the user for 60
> > minutes or something arbitrarily long btw).
> >
> > How about something like this patch instead:
> > --
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c2697db59109..74f3ad16e812 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> >  	bool run_queue = false;
> >  
> >  	spin_lock_irqsave(&q->queue_lock, flags);
> > -	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > -		;
> > +	if (q->quiesce_depth <= 0) {
> > +		printk(KERN_DEBUG
> > +		       "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > +		       q->disk ? q->disk->disk_name : "?");
> >  	} else if (!--q->quiesce_depth) {
> >  		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> >  		run_queue = true;
> > --
>
> The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
> concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
> ask your comment on the suggestion by Sagi?
I think it is bad to use a standard block layer API in this unbalanced way;
that is why the WARN_ON_ONCE() was added. We shouldn't encourage drivers to
use APIs in this way.

The question is: why does nvme have to unquiesce a non-quiesced queue?
Thanks,
Ming
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
From: Shinichiro Kawasaki @ 2025-06-26 4:45 UTC
To: Ming Lei
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-block,
Hannes Reinecke, Keith Busch, Jens Axboe, hch@infradead.org
On Jun 13, 2025 / 12:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
> > Cc+: Ming,
> >
> > Hi Sagi, thanks for the background explanation and the suggestion.
> >
> > On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> > ...
> > > > This is a problem. We are flushing a work that is IO bound, which can
> > > > take a long time to complete.
> > > > Up until now, we have deliberately avoided introducing dependency
> > > > between reset forward progress
> > > > and scan work IO to completely finish.
> > > >
> > > > I would like to keep it this way.
> > > >
> > > > BTW, this is not TCP specific.
> >
> > I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
> > why it was reported for the TCP transport.
> >
> > >
> > > blk_mq_unquiesce_queue is still very much safe to call as many times as we
> > > want.
> > > The only thing that comes in the way is this pesky WARN. How about we make
> > > it go away and have
> > > a debug print instead?
> > >
> > > My preference would be to allow nvme to unquiesce queues that were not
> > > previously quiesced (just
> > > like it historically was) instead of having to block a controller reset
> > > until the scan_work is completed (which
> > > is admin I/O dependent, and may get stuck until admin timeout, which can be
> > > changed by the user for 60
> > > minutes or something arbitrarily long btw).
> > >
> > > How about something like this patch instead:
> > > --
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index c2697db59109..74f3ad16e812 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> > >  	bool run_queue = false;
> > >  
> > >  	spin_lock_irqsave(&q->queue_lock, flags);
> > > -	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > > -		;
> > > +	if (q->quiesce_depth <= 0) {
> > > +		printk(KERN_DEBUG
> > > +		       "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > > +		       q->disk ? q->disk->disk_name : "?");
> > >  	} else if (!--q->quiesce_depth) {
> > >  		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > >  		run_queue = true;
> > > --
> >
> > The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
> > concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
> > ask your comment on the suggestion by Sagi?
>
> I think it is bad to use one standard block layer API in this unbalanced way,
> that is why WARN_ON_ONCE() is added. We shouldn't encourage driver to use
> APIs in this way.
>
> Question is why nvme have to unquiesce one non-quiesced queue?
Ming, thanks for the comment. I think the call trace in the commit message of my
patch answers your question: blk_mq_add_queue_tag_set() is called between the
blk_mq_quiesce_tagset() and blk_mq_unquiesce_tagset() calls. Let me show the
call trace again, below.

Anyway, the WARN disappeared with the v6.16-rc1 kernel, so from my point of view
this problem does not have high priority at this moment.
Reset work                              Scan work
------------------------------------------------------------------------
nvme_reset_ctrl_work()
 nvme_tcp_teardown_ctrl()
  nvme_tcp_teardown_io_queues()
   nvme_quiesce_io_queues()
    blk_mq_quiesce_tagset()
     list_for_each_entry() .. N queues
      blk_mq_quiesce_queue()
                                        nvme_scan_work()
                                         nvme_scan_ns_*()
                                          nvme_scan_ns()
                                           nvme_alloc_ns()
                                            blk_mq_alloc_disk()
                                             __blk_mq_alloc_disk()
                                              blk_mq_alloc_queue()
                                               blk_mq_init_allocate_queue()
                                                blk_mq_add_queue_tag_set()
                                                 list_add_tail() .. N+1 queues
 nvme_tcp_setup_ctrl()
  nvme_start_ctrl()
   nvme_unquiesce_io_queues()
    blk_mq_unquiesce_tagset()
     list_for_each_entry() .. N+1 queues
      blk_mq_unquiesce_queue()
       WARN_ON_ONCE(q->quiesce_depth <= 0)
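For reference, the relevant helpers look roughly like this (a condensed
sketch based on block/blk-mq.c, not the verbatim upstream code; details such
as the skip-tagset-quiesce handling are omitted):

--
void blk_mq_quiesce_queue_nowait(struct request_queue *q)
{
	unsigned long flags;

	spin_lock_irqsave(&q->queue_lock, flags);
	if (!q->quiesce_depth++)	/* 0 -> 1: mark the queue quiesced */
		blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
	spin_unlock_irqrestore(&q->queue_lock, flags);
}

void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	/* N request queues are on the list at this point */
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_quiesce_queue_nowait(q);
	mutex_unlock(&set->tag_list_lock);
	blk_mq_wait_quiesce_done(set);
}

void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	/*
	 * If the scan work added a namespace in between, the list now holds
	 * N+1 queues; the new queue still has quiesce_depth == 0, so
	 * blk_mq_unquiesce_queue() hits WARN_ON_ONCE(q->quiesce_depth <= 0).
	 */
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unquiesce_queue(q);
	mutex_unlock(&set->tag_list_lock);
}
--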
* Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
From: Sagi Grimberg @ 2025-06-28 10:24 UTC
To: Ming Lei, Shinichiro Kawasaki
Cc: linux-nvme@lists.infradead.org, linux-block, Hannes Reinecke,
Keith Busch, Jens Axboe, hch@infradead.org
On 13/06/2025 7:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
>> Cc+: Ming,
>>
>> Hi Sagi, thanks for the background explanation and the suggestion.
>>
>> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
>> ...
>>>> This is a problem. We are flushing a work that is IO bound, which can
>>>> take a long time to complete.
>>>> Up until now, we have deliberately avoided introducing dependency
>>>> between reset forward progress
>>>> and scan work IO to completely finish.
>>>>
>>>> I would like to keep it this way.
>>>>
>>>> BTW, this is not TCP specific.
>> I see. The blktests test case nvme/063 is dedicated to tcp transport, so that's
>> why it was reported for the TCP transport.
>>
>>> blk_mq_unquiesce_queue is still very much safe to call as many times as we
>>> want.
>>> The only thing that comes in the way is this pesky WARN. How about we make
>>> it go away and have
>>> a debug print instead?
>>>
>>> My preference would be to allow nvme to unquiesce queues that were not
>>> previously quiesced (just
>>> like it historically was) instead of having to block a controller reset
>>> until the scan_work is completed (which
>>> is admin I/O dependent, and may get stuck until admin timeout, which can be
>>> changed by the user for 60
>>> minutes or something arbitrarily long btw).
>>>
>>> How about something like this patch instead:
>>> --
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index c2697db59109..74f3ad16e812 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>>>  	bool run_queue = false;
>>>  
>>>  	spin_lock_irqsave(&q->queue_lock, flags);
>>> -	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
>>> -		;
>>> +	if (q->quiesce_depth <= 0) {
>>> +		printk(KERN_DEBUG
>>> +		       "dev %s: unquiescing a non-quiesced queue, expected?\n",
>>> +		       q->disk ? q->disk->disk_name : "?");
>>>  	} else if (!--q->quiesce_depth) {
>>>  		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
>>>  		run_queue = true;
>>> --
>> The WARN was introduced with the commit e70feb8b3e68 ("blk-mq: support
>> concurrent queue quiesce/unquiesce") that Ming authored. Ming, may I
>> ask your comment on the suggestion by Sagi?
> I think it is bad to use one standard block layer API in this unbalanced way,
> that is why WARN_ON_ONCE() is added. We shouldn't encourage driver to use
> APIs in this way.
>
> Question is why nvme have to unquiesce one non-quiesced queue?
It started before quiesce/unquiesce became an API that had to be balanced.

In this case, we are using the tagset quiesce/unquiesce interface, which
iterates over the request queues of the tagset. The problem is that due to
namespace scanning, we may add new namespaces (and their request queues)
after we quiesced the tagset. Then when we call tagset unquiesce, we have a
new request queue that wasn't quiesced (because it was added after the
quiesce).

I don't mind having a BLK_FEAT_ALLOW_UNBALANCED_QUIESCE for the nvme request
queues if you don't want to blindly remove the WARN_ON.
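Something like the below, just to illustrate the idea -- the
BLK_FEAT_ALLOW_UNBALANCED_QUIESCE flag and its bit value are made up here
(assuming the queue_limits features bits would be the natural home for it),
and this is an untested sketch, not a proposed patch:

--
/* include/linux/blkdev.h: hypothetical new queue_limits feature bit */
#define BLK_FEAT_ALLOW_UNBALANCED_QUIESCE	BIT(16)

/* block/blk-mq.c: only warn if the owner did not opt in to unbalanced calls */
void blk_mq_unquiesce_queue(struct request_queue *q)
{
	unsigned long flags;
	bool run_queue = false;

	spin_lock_irqsave(&q->queue_lock, flags);
	if (q->quiesce_depth <= 0) {
		WARN_ON_ONCE(!(q->limits.features &
			       BLK_FEAT_ALLOW_UNBALANCED_QUIESCE));
	} else if (!--q->quiesce_depth) {
		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
		run_queue = true;
	}
	spin_unlock_irqrestore(&q->queue_lock, flags);

	/* dispatch requests which were blocked while the queue was quiesced */
	if (run_queue)
		blk_mq_run_hw_queues(q, true);
}

/*
 * nvme would then opt its namespace queues in by setting the flag in the
 * queue_limits passed to blk_mq_alloc_disk(), e.g. in nvme_alloc_ns():
 *
 *	lim.features |= BLK_FEAT_ALLOW_UNBALANCED_QUIESCE;
 */
--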