Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

From: Sagi Grimberg <sagi@grimberg.me>
To: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	linux-nvme@lists.infradead.org,
	linux-block <linux-block@vger.kernel.org>
Cc: Hannes Reinecke <hare@kernel.org>,
	Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
Date: Wed, 4 Jun 2025 10:10:01 +0300	[thread overview]
Message-ID: <86e241dd-9065-4cf0-9c35-8b7502ab2d8a@grimberg.me> (raw)
In-Reply-To: <910b31ba-1982-4365-961e-435f5e7611b2@grimberg.me>



On 03/06/2025 13:58, Sagi Grimberg wrote:
>
>
> On 02/06/2025 7:35, Shin'ichiro Kawasaki wrote:
>> When the nvme scan work and the nvme controller reset work race, the
>> WARN below happens:
>>
>>    WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 
>> blk_mq_unquiesce_queue+0x8f/0xb0
>>
>> The WARN can be recreated by repeating the blktests test case nvme/063 a
>> few times [1]. Its cause is the new queue allocation for the tag set
>> by the scan work between blk_mq_quiesce_tagset() and
>> blk_mq_unquiesce_tagset() calls by the reset work.
>>
>> Reset work                     Scan work
>> ------------------------------------------------------------------------
>> nvme_reset_ctrl_work()
>>   nvme_tcp_teardown_ctrl()
>>    nvme_tcp_teardown_io_queues()
>>     nvme_quiesce_io_queues()
>>      blk_mq_quiesce_tagset()
>>       list_for_each_entry()                                 .. N queues
>>        blk_mq_quiesce_queue()
>>                                 nvme_scan_work()
>>                                  nvme_scan_ns_*()
>>                                   nvme_scan_ns()
>>                                    nvme_alloc_ns()
>>                                     blk_mq_alloc_disk()
>>                                      __blk_mq_alloc_disk()
>>                                       blk_mq_alloc_queue()
>> blk_mq_init_allocate_queue()
>> blk_mq_add_queue_tag_set()
>>                                          list_add_tail()    .. N+1 
>> queues
>>   nvme_tcp_setup_ctrl()
>>    nvme_start_ctrl()
>>     nvme_unquiesce_io_queues()
>>      blk_mq_unquiesce_tagset()
>>       list_for_each_entry()                                 .. N+1 
>> queues
>>        blk_mq_unquiesce_queue()
>>         WARN_ON_ONCE(q->quiesce_depth <= 0)
>>
>> blk_mq_quiesce_queue() is not called for the new queue added by the scan
>> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>>
>> To suppress the WARN, avoid the race between the reset work and the scan
>> work by flushing the scan work at the beginning of the reset work.
>>
>> Link: 
>> https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ 
>> [1]
>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> ---
>>   drivers/nvme/host/tcp.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index f6379aa33d77..14b9d67329b3 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct 
>> work_struct *work)
>>           container_of(work, struct nvme_ctrl, reset_work);
>>       int ret;
>>   +    /* prevent racing with ns scanning */
>> +    flush_work(&ctrl->scan_work);
>
> This is a problem. We are flushing a work that is IO bound, which can 
> take a long time to complete.
> Up until now, we have deliberately avoided introducing dependency 
> between reset forward progress
> and scan work IO to completely finish.
>
> I would like to keep it this way.
>
> BTW, this is not TCP specific.

blk_mq_unquiesce_queue is still very much safe to call as many times as 
we want.
The only thing that comes in the way is this pesky WARN. How about we 
make it go away and have
a debug print instead?

My preference would be to allow nvme to unquiesce queues that were not 
previously quiesced (just
like it historically was) instead of having to block a controller reset 
until the scan_work is completed (which
is admin I/O dependent, and may get stuck until admin timeout, which can 
be changed by the user for 60
minutes or something arbitrarily long btw).

How about something like this patch instead:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..74f3ad16e812 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
         bool run_queue = false;

         spin_lock_irqsave(&q->queue_lock, flags);
-       if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
-               ;
+       if (q->quiesce_depth <= 0) {
+               printk(KERN_DEBUG
+                       "dev %s: unquiescing a non-quiesced queue, 
expected?\n",
+                       q->disk ? q->disk->disk_name : "?", );
         } else if (!--q->quiesce_depth) {
                 blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
                 run_queue = true;
--

next      parent reply	other threads:[~2025-06-04  7:10 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20250602043522.55787-1-shinichiro.kawasaki@wdc.com>
     [not found] ` <20250602043522.55787-2-shinichiro.kawasaki@wdc.com>
     [not found]   ` <910b31ba-1982-4365-961e-435f5e7611b2@grimberg.me>
2025-06-04  7:10     ` Sagi Grimberg [this message]
2025-06-04 11:17       ` [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset Shinichiro Kawasaki
2025-06-13  4:10         ` Shinichiro Kawasaki
2025-06-13  4:56         ` Ming Lei
2025-06-26  4:45           ` Shinichiro Kawasaki
2025-06-28 10:24           ` Sagi Grimberg

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:c2697db5910 dfblob:74f3ad16e81 )
 OR (
bs:"Re: [PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86e241dd-9065-4cf0-9c35-8b7502ab2d8a@grimberg.me \
    --to=sagi@grimberg.me \
    --cc=axboe@kernel.dk \
    --cc=hare@kernel.org \
    --cc=hch@infradead.org \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=shinichiro.kawasaki@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox