* [PATCH] blk-flush: fix possible deadlock when processing nvme_timeout()
@ 2026-06-08 11:39 Ye Bin
  2026-06-29 12:00 ` yebin
From: Ye Bin @ 2026-06-08 11:39 UTC
  To: axboe, linux-block, yebin, yebin10; +Cc: kbusch, hch, sagi, linux-nvme

From: Ye Bin <yebin10@huawei.com>

 The following hang was observed while processing nvme_timeout():
 [  206.734601][ T8184] nvme nvme0: I/O tag 512 (1200) opcode 0x0 (I/O Cmd) QID 3 timeout, aborting req_op:FLUSH(2) size:0
 [  206.736112][    C0] nvme nvme0: Abort status: 0x0
 [  208.094637][ T8184] nvme nvme0: I/O tag 512 (1200) opcode 0x0 (I/O Cmd) QID 3 timeout, reset controller

 [root@localhost ~]# cat /proc/8184/stack
 [<0>] msleep+0x37/0x50
 [<0>] blk_mq_tagset_wait_completed_request+0x6f/0xe0
 [<0>] nvme_cancel_tagset+0x79/0xa0
 [<0>] nvme_dev_disable+0x55c/0x7e0
 [<0>] nvme_timeout+0x25b/0x1530
 [<0>] blk_mq_handle_expired+0x210/0x2c0
 [<0>] bt_iter+0x2bb/0x360
 [<0>] blk_mq_queue_tag_busy_iter+0x9f8/0x1f30
 [<0>] blk_mq_timeout_work+0x5dc/0x7d0
 [<0>] process_one_work+0xa08/0x1d00
 [<0>] worker_thread+0x698/0xeb0
 [<0>] kthread+0x408/0x540
 [<0>] ret_from_fork+0xa4d/0xdd0
 [<0>] ret_from_fork_asm+0x1a/0x30

 The issue may happen as follows:
 nvme_timeout  // first timeout of the flush request (tag 512)
   iod->flags |= IOD_ABORTED;
   abort_req = nvme_alloc_request(dev->ctrl.admin_q, &cmd,
          BLK_MQ_REQ_NOWAIT, NVME_QID_ANY);  // abort the tag 512 flush request
   blk_execute_rq_nowait(abort_req->q, NULL, abort_req, 0, abort_endio);
      // does not wait for the abort request to complete
         ....
  **** 'abort_req' has not completed yet ****
         ....
 nvme_timeout  // second timeout of the flush request (tag 512)
  if (!nvmeq->qid || (iod->flags & IOD_ABORTED))
    nvme_req(req)->flags |= NVME_REQ_CANCELLED;
    goto disable;
      ...
    **** the flush request (tag 512) completes here ****
         nvme_try_complete_req
          blk_mq_complete_request_remote(req);
           WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
            ...
             nvme_end_req(req);
              blk_mq_end_request(req, status);
               __blk_mq_end_request(rq, error);
                if (rq->end_io)
                 rq->end_io(rq, error);
                  flush_end_io(rq, error);
                  // The timeout handler still holds a reference,
                  // so the request stays in the MQ_RQ_COMPLETE state.
                   if (!req_ref_put_and_test(flush_rq))
                    fq->rq_status = error;
                    return;
    **** the tag 512 flush request remains in MQ_RQ_COMPLETE state ****
 disable:
   nvme_dev_disable(dev, false);
     nvme_cancel_tagset(&dev->ctrl);
       blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request,
                               &dev->ctrl);
         nvme_cancel_request
           if (blk_mq_request_completed(req))
             return true;
      blk_mq_tagset_wait_completed_request(&dev->tagset);
        while (true)
          blk_mq_tagset_busy_iter(tagset,
                           blk_mq_tagset_count_completed_rqs, &count);
             blk_mq_tagset_count_completed_rqs();
             // the request is in MQ_RQ_COMPLETE state
                if (blk_mq_request_completed(rq))   // returns true
                  (*count)++;
          if (!count) // 'count' never reaches 0, so the loop never ends
              break;
          msleep(5);

This happens because the timeout handler holds a reference on the
request, and a flush request whose last reference has not been dropped
stays in the MQ_RQ_COMPLETE state. As a result, nvme_dev_disable() spins
forever in blk_mq_tagset_wait_completed_request().

To fix this, when a flush request has timed out and the timeout handler
holds the only remaining reference, change the request state to
MQ_RQ_IDLE in flush_end_io(). It is then safe to call
blk_mq_tagset_wait_completed_request() from the timeout handler.

Fixes: e1569a16180a ("nvme: do not restart the request timeout if we're resetting the controller")
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 block/blk-flush.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 403a46c86411..d12839b1fcb5 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -213,6 +213,18 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
 
 	if (!req_ref_put_and_test(flush_rq)) {
 		fq->rq_status = error;
+
+		/*
+		 * The timeout processing flow holds the reference count
+		 * of flush_rq. If the last reference count is held by the
+		 * timeout processing flow, the status of flush_rq must be
+		 * changed to MQ_RQ_IDLE in advance. Otherwise, a deadlock
+		 * occurs when blk_mq_tagset_wait_completed_request() is
+		 * called in the timeout processing flow.
+		 */
+		if (req_ref_read(flush_rq) == 1 &&
+		    flush_rq->rq_flags & RQF_TIMED_OUT)
+			WRITE_ONCE(flush_rq->state, MQ_RQ_IDLE);
 		spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
 		return RQ_END_IO_NONE;
 	}
-- 
2.34.1



* Re: [PATCH] blk-flush: fix possible deadlock when processing nvme_timeout()
  2026-06-08 11:39 [PATCH] blk-flush: fix possible deadlock when processing nvme_timeout() Ye Bin
@ 2026-06-29 12:00 ` yebin
From: yebin @ 2026-06-29 12:00 UTC
  To: axboe, linux-block, yebin10; +Cc: kbusch, hch, sagi, linux-nvme

Friendly ping ...

This issue occurs once a week in our production environment. The trigger
is certainly a firmware problem, but the kernel should still be hardened
against this scenario so that an endless loop cannot hang the system.

In `blk_mq_tagset_wait_completed_request()`, the reference count is
repeatedly acquired and released, and there is still a chance that the
request is observed in the MQ_RQ_IDLE state. The race condition pointed
out by sashiko [1] therefore exists, but in this scenario it is still
handled correctly in the end.

[1] https://sashiko.dev/#/patchset/20260608113923.3893518-1-yebin%40huaweicloud.com


