public inbox for linux-block@vger.kernel.org
* [PATCH 0/1] Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
@ 2025-12-04 18:11 Mohamed Khalfella
  2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
  0 siblings, 1 reply; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-04 18:11 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Casey Chen, Yuanyuan Zhong, Hannes Reinecke, Ming Lei,
	Waiman Long, Hillf Danton, linux-nvme, linux-block, linux-kernel,
	Mohamed Khalfella

The patch in v2 converted set->tag_list_lock to an rw_semaphore. It was
pointed out that this could result in a different kind of deadlock. This
patch instead uses RCU to protect set->tag_list when it is accessed from
the [un]quiesce code.

v2 - https://lore.kernel.org/all/20251117202414.4071380-1-mkhalfella@purestorage.com/

Mohamed Khalfella (1):
  block: Use RCU in blk_mq_[un]quiesce_tagset() instead of
    set->tag_list_lock

 block/blk-mq.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

-- 
2.51.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 18:11 [PATCH 0/1] Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock Mohamed Khalfella
@ 2025-12-04 18:11 ` Mohamed Khalfella
  2025-12-04 18:22   ` Bart Van Assche
  2025-12-04 19:02   ` Mohamed Khalfella
  0 siblings, 2 replies; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-04 18:11 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Casey Chen, Yuanyuan Zhong, Hannes Reinecke, Ming Lei,
	Waiman Long, Hillf Danton, linux-nvme, linux-block, linux-kernel,
	Mohamed Khalfella

blk_mq_{add,del}_queue_tag_set() add and remove queues from a tagset.
These functions make sure that the tagset and its queues are marked as
shared when two or more queues are attached to the same tagset.
Initially a tagset starts out unshared; when the number of added queues
reaches two, blk_mq_add_queue_tag_set() marks it as shared along with
all the queues attached to it. When the number of attached queues drops
to one, blk_mq_del_queue_tag_set() needs to mark both the tagset and the
remaining queue as unshared.

Both functions need to freeze the current queues in the tagset before
setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
both functions hold the set->tag_list_lock mutex, which makes sense as
we do not want queues to be added or deleted in the process. This used
to work fine until commit 98d81f0df70c ("nvme: use
blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset
instead of quiescing individual queues. blk_mq_quiesce_tagset() quiesces
the queues in set->tag_list while also holding set->tag_list_lock.

This results in a deadlock between two threads with these stacktraces:

  __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  schedule_preempt_disabled+0x11/0x20
  __mutex_lock.constprop.0+0x3cc/0x760
  blk_mq_quiesce_tagset+0x26/0xd0
  nvme_dev_disable_locked+0x77/0x280 [nvme]
  nvme_timeout+0x268/0x320 [nvme]
  blk_mq_handle_expired+0x5d/0x90
  bt_iter+0x7e/0x90
  blk_mq_queue_tag_busy_iter+0x2b2/0x590
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __blk_mq_complete_request_remote+0x10/0x10
  blk_mq_timeout_work+0x15b/0x1a0
  process_one_work+0x133/0x2f0
  ? mod_delayed_work_on+0x90/0x90
  worker_thread+0x2ec/0x400
  ? mod_delayed_work_on+0x90/0x90
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x2d/0x50
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork_asm+0x11/0x20

  __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  blk_mq_freeze_queue_wait+0x62/0x90
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_exit_queue+0x151/0x180
  disk_release+0xe3/0xf0
  device_release+0x31/0x90
  kobject_put+0x6d/0x180
  nvme_scan_ns+0x858/0xc90 [nvme_core]
  ? nvme_scan_work+0x281/0x560 [nvme_core]
  nvme_scan_work+0x281/0x560 [nvme_core]
  process_one_work+0x133/0x2f0
  ? mod_delayed_work_on+0x90/0x90
  worker_thread+0x2ec/0x400
  ? mod_delayed_work_on+0x90/0x90
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x2d/0x50
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork_asm+0x11/0x20

The top stacktrace shows nvme_timeout() called to handle an nvme command
timeout. The timeout handler is trying to disable the controller and, as
a first step, it calls blk_mq_quiesce_tagset() to tell blk-mq not to
call the queue callback handlers. The thread is stuck waiting for
set->tag_list_lock as it tries to walk the queues in set->tag_list.

The lock is held by the second thread, shown in the bottom stacktrace,
which is waiting for one of the queues to be frozen. The queue usage
counter will only drop to zero after nvme_timeout() finishes, which will
never happen because that thread waits for the mutex forever.

Given that [un]quiescing a queue is an operation that does not need to
sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
set->tag_list_lock, and update blk_mq_{add,del}_queue_tag_set() to use
RCU-safe list operations. This avoids the deadlock seen above.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 block/blk-mq.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d626d32f6e57..ceb176ac154b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;
 
-	mutex_lock(&set->tag_list_lock);
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_quiesce_queue_nowait(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	rcu_read_unlock();
 
 	blk_mq_wait_quiesce_done(set);
 }
@@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;
 
-	mutex_lock(&set->tag_list_lock);
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_unquiesce_queue(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
 
@@ -4294,7 +4294,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	struct blk_mq_tag_set *set = q->tag_set;
 
 	mutex_lock(&set->tag_list_lock);
-	list_del(&q->tag_set_list);
+	list_del_rcu(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
@@ -4302,6 +4302,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
+
+	synchronize_rcu();
 	INIT_LIST_HEAD(&q->tag_set_list);
 }
 
@@ -4321,7 +4323,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
-	list_add_tail(&q->tag_set_list, &set->tag_list);
+	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
 }
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
@ 2025-12-04 18:22   ` Bart Van Assche
  2025-12-04 18:42     ` Mohamed Khalfella
  2025-12-04 19:02   ` Mohamed Khalfella
  1 sibling, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-12-04 18:22 UTC (permalink / raw)
  To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Casey Chen, Yuanyuan Zhong, Hannes Reinecke, Ming Lei,
	Waiman Long, Hillf Danton, linux-nvme, linux-block, linux-kernel

On 12/4/25 8:11 AM, Mohamed Khalfella wrote:
> @@ -4302,6 +4302,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>   		blk_mq_update_tag_set_shared(set, false);
>   	}
>   	mutex_unlock(&set->tag_list_lock);
> +
> +	synchronize_rcu();
>   	INIT_LIST_HEAD(&q->tag_set_list);
>   }
Yikes. This change slows down all blk_mq_del_queue_tag_set() callers.
Please fix the reported deadlock by modifying the NVMe code instead of
slowing down the block layer.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 18:22   ` Bart Van Assche
@ 2025-12-04 18:42     ` Mohamed Khalfella
  2025-12-04 19:06       ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-04 18:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Casey Chen, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, Waiman Long, Hillf Danton, linux-nvme, linux-block,
	linux-kernel

On Thu 2025-12-04 08:22:23 -1000, Bart Van Assche wrote:
> On 12/4/25 8:11 AM, Mohamed Khalfella wrote:
> > @@ -4302,6 +4302,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
> >   		blk_mq_update_tag_set_shared(set, false);
> >   	}
> >   	mutex_unlock(&set->tag_list_lock);
> > +
> > +	synchronize_rcu();
> >   	INIT_LIST_HEAD(&q->tag_set_list);
> >   }
> Yikes. This change slows down all blk_mq_del_queue_tag_set() callers.

synchronize_rcu() is necessary before re-initializing q->tag_set_list
because of list_for_each_entry_rcu() in blk_mq_[un]quiesce_tagset().

Is blk_mq_del_queue_tag_set() performance sensitive such that it cannot
take a synchronize_rcu()? It is not in the IO codepath, right?

> Please fix the reported deadlock by modifying the NVMe code instead of
> slowing down the block layer.

I cannot think of an easy way to do that. Suggestions are welcome.

> 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
  2025-12-04 18:22   ` Bart Van Assche
@ 2025-12-04 19:02   ` Mohamed Khalfella
  1 sibling, 0 replies; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-04 19:02 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Casey Chen, Yuanyuan Zhong, Hannes Reinecke, Ming Lei,
	Waiman Long, Hillf Danton, linux-nvme, linux-block, linux-kernel

On Thu 2025-12-04 10:11:53 -0800, Mohamed Khalfella wrote:
> blk_mq_{add,del}_queue_tag_set() add and remove queues from a tagset.
> These functions make sure that the tagset and its queues are marked as
> shared when two or more queues are attached to the same tagset.
> Initially a tagset starts out unshared; when the number of added queues
> reaches two, blk_mq_add_queue_tag_set() marks it as shared along with
> all the queues attached to it. When the number of attached queues drops
> to one, blk_mq_del_queue_tag_set() needs to mark both the tagset and
> the remaining queue as unshared.
> 
> Both functions need to freeze the current queues in the tagset before
> setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
> both functions hold the set->tag_list_lock mutex, which makes sense as
> we do not want queues to be added or deleted in the process. This used
> to work fine until commit 98d81f0df70c ("nvme: use
> blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset
> instead of quiescing individual queues. blk_mq_quiesce_tagset() quiesces
> the queues in set->tag_list while also holding set->tag_list_lock.
> 
> This results in a deadlock between two threads with these stacktraces:
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   schedule_preempt_disabled+0x11/0x20
>   __mutex_lock.constprop.0+0x3cc/0x760
>   blk_mq_quiesce_tagset+0x26/0xd0
>   nvme_dev_disable_locked+0x77/0x280 [nvme]
>   nvme_timeout+0x268/0x320 [nvme]
>   blk_mq_handle_expired+0x5d/0x90
>   bt_iter+0x7e/0x90
>   blk_mq_queue_tag_busy_iter+0x2b2/0x590
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   blk_mq_timeout_work+0x15b/0x1a0
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   blk_mq_freeze_queue_wait+0x62/0x90
>   ? destroy_sched_domains_rcu+0x30/0x30
>   blk_mq_exit_queue+0x151/0x180
>   disk_release+0xe3/0xf0
>   device_release+0x31/0x90
>   kobject_put+0x6d/0x180
>   nvme_scan_ns+0x858/0xc90 [nvme_core]
>   ? nvme_scan_work+0x281/0x560 [nvme_core]
>   nvme_scan_work+0x281/0x560 [nvme_core]
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
> The top stacktrace shows nvme_timeout() called to handle an nvme command
> timeout. The timeout handler is trying to disable the controller and, as
> a first step, it calls blk_mq_quiesce_tagset() to tell blk-mq not to
> call the queue callback handlers. The thread is stuck waiting for
> set->tag_list_lock as it tries to walk the queues in set->tag_list.
> 
> The lock is held by the second thread, shown in the bottom stacktrace,
> which is waiting for one of the queues to be frozen. The queue usage
> counter will only drop to zero after nvme_timeout() finishes, which will
> never happen because that thread waits for the mutex forever.
> 
> Given that [un]quiescing a queue is an operation that does not need to
> sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
> set->tag_list_lock, and update blk_mq_{add,del}_queue_tag_set() to use
> RCU-safe list operations. This avoids the deadlock seen above.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")

Oops, this should be v3, and I also missed the Fixes: tag above.

> ---
>  block/blk-mq.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d626d32f6e57..ceb176ac154b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>  {
>  	struct request_queue *q;
>  
> -	mutex_lock(&set->tag_list_lock);
> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
>  		if (!blk_queue_skip_tagset_quiesce(q))
>  			blk_mq_quiesce_queue_nowait(q);
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +	rcu_read_unlock();
>  
>  	blk_mq_wait_quiesce_done(set);
>  }
> @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>  {
>  	struct request_queue *q;
>  
> -	mutex_lock(&set->tag_list_lock);
> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
>  		if (!blk_queue_skip_tagset_quiesce(q))
>  			blk_mq_unquiesce_queue(q);
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
>  
> @@ -4294,7 +4294,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  	struct blk_mq_tag_set *set = q->tag_set;
>  
>  	mutex_lock(&set->tag_list_lock);
> -	list_del(&q->tag_set_list);
> +	list_del_rcu(&q->tag_set_list);
>  	if (list_is_singular(&set->tag_list)) {
>  		/* just transitioned to unshared */
>  		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> @@ -4302,6 +4302,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  		blk_mq_update_tag_set_shared(set, false);
>  	}
>  	mutex_unlock(&set->tag_list_lock);
> +
> +	synchronize_rcu();
>  	INIT_LIST_HEAD(&q->tag_set_list);
>  }
>  
> @@ -4321,7 +4323,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>  	}
>  	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
>  		queue_set_hctx_shared(q, true);
> -	list_add_tail(&q->tag_set_list, &set->tag_list);
> +	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
>  
>  	mutex_unlock(&set->tag_list_lock);
>  }
> -- 
> 2.51.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 18:42     ` Mohamed Khalfella
@ 2025-12-04 19:06       ` Bart Van Assche
  2025-12-04 19:15         ` Mohamed Khalfella
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-12-04 19:06 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Casey Chen, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, Waiman Long, Hillf Danton, linux-nvme, linux-block,
	linux-kernel

On 12/4/25 8:42 AM, Mohamed Khalfella wrote:
> > Is blk_mq_del_queue_tag_set() performance sensitive such that it cannot
> > take a synchronize_rcu()? It is not in the IO codepath, right?

Introducing a new synchronize_rcu() call almost always slows down some
workload, so it should be avoided if possible.

 > I cannot think of an easy way to do that. Suggestions are welcome.

I can't find the implementation of nvme_dev_disable_locked(). What
kernel tree does your patch apply to?

$ git grep -w nvme_dev_disable_locked axboe-block/for-next | wc -l
0

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 19:06       ` Bart Van Assche
@ 2025-12-04 19:15         ` Mohamed Khalfella
  2025-12-04 19:31           ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-04 19:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Casey Chen, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, Waiman Long, Hillf Danton, linux-nvme, linux-block,
	linux-kernel

On Thu 2025-12-04 09:06:47 -1000, Bart Van Assche wrote:
> On 12/4/25 8:42 AM, Mohamed Khalfella wrote:
> > Is blk_mq_del_queue_tag_set() performance sensitive such that it cannot
> > take a synchronize_rcu()? It is not in the IO codepath, right?
> 
> Introducing a new synchronize_rcu() call almost always slows down some
> workload, so it should be avoided if possible.
> 
>  > I cannot think of an easy way to do that. Suggestions are welcome.
> 
> I can't find the implementation of nvme_dev_disable_locked(). What
> kernel tree does your patch apply to?
> 
> $ git grep -w nvme_dev_disable_locked axboe-block/for-next | wc -l
> 0

The stacktraces are from an old 6.6.9 kernel. However, the issue is
still applicable to recent kernels. Here is an example from a 6.13
kernel.

Oct  1 15:19:30 hostname kernel: INFO: task kworker/151:1H:2442 blocked for more than 122 seconds.
Oct  1 15:19:30 hostname kernel:      Tainted: G            E       6.13.2-ge5f37b497f62 #1
Oct  1 15:19:30 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 15:19:30 hostname kernel: task:kworker/151:1H  state:D stack:0     pid:2442  tgid:2442  ppid:2      flags:0x00004000
Oct  1 15:19:30 hostname kernel: Workqueue: kblockd blk_mq_timeout_work
Oct  1 15:19:30 hostname kernel: Call Trace:
Oct  1 15:19:30 hostname kernel: <TASK>
Oct  1 15:19:30 hostname kernel: __schedule+0x47c/0xbb0
Oct  1 15:19:30 hostname kernel: ? timerqueue_add+0x66/0xb0
Oct  1 15:19:30 hostname kernel: schedule+0x1c/0xa0
Oct  1 15:19:30 hostname kernel: schedule_preempt_disabled+0xa/0x10
Oct  1 15:19:30 hostname kernel: __mutex_lock.constprop.0+0x271/0x600
Oct  1 15:19:30 hostname kernel: blk_mq_quiesce_tagset+0x25/0xc0
Oct  1 15:19:30 hostname kernel: nvme_dev_disable+0x9c/0x250
Oct  1 15:19:30 hostname kernel: nvme_timeout+0x1fc/0x520
Oct  1 15:19:30 hostname kernel: blk_mq_handle_expired+0x5c/0x90
Oct  1 15:19:30 hostname kernel: bt_iter+0x7e/0x90
Oct  1 15:19:30 hostname kernel: blk_mq_queue_tag_busy_iter+0x27e/0x550
Oct  1 15:19:30 hostname kernel: ? __blk_mq_complete_request_remote+0x10/0x10
Oct  1 15:19:30 hostname kernel: ? __blk_mq_complete_request_remote+0x10/0x10
Oct  1 15:19:30 hostname kernel: ? __call_rcu_common.constprop.0+0x1c0/0x210
Oct  1 15:19:30 hostname kernel: blk_mq_timeout_work+0x12d/0x170
Oct  1 15:19:30 hostname kernel: process_one_work+0x12e/0x2d0
Oct  1 15:19:30 hostname kernel: worker_thread+0x288/0x3a0
Oct  1 15:19:30 hostname kernel: ? rescuer_thread+0x480/0x480
Oct  1 15:19:30 hostname kernel: kthread+0xb8/0xe0
Oct  1 15:19:30 hostname kernel: ? kthread_park+0x80/0x80
Oct  1 15:19:30 hostname kernel: ret_from_fork+0x2d/0x50
Oct  1 15:19:30 hostname kernel: ? kthread_park+0x80/0x80
Oct  1 15:19:30 hostname kernel: ret_from_fork_asm+0x11/0x20
Oct  1 15:19:30 hostname kernel: </TASK>
Oct  1 15:19:30 hostname kernel: INFO: task python:37330 blocked for more than 122 seconds.
Oct  1 15:19:30 hostname kernel:      Tainted: G            E       6.13.2-ge5f37b497f62 #1
Oct  1 15:19:30 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 15:19:30 hostname kernel: task:python          state:D stack:0     pid:37330 tgid:37330 ppid:37329  flags:0x00004002
Oct  1 15:19:30 hostname kernel: Call Trace:
Oct  1 15:19:30 hostname kernel: <TASK>
Oct  1 15:19:30 hostname kernel: __schedule+0x47c/0xbb0
Oct  1 15:19:30 hostname kernel: ? xas_find+0x161/0x1a0
Oct  1 15:19:30 hostname kernel: schedule+0x1c/0xa0
Oct  1 15:19:30 hostname kernel: blk_mq_freeze_queue_wait+0x3d/0x70
Oct  1 15:19:30 hostname kernel: ? destroy_sched_domains_rcu+0x30/0x30
Oct  1 15:19:30 hostname kernel: blk_mq_update_tag_set_shared+0x44/0x80
Oct  1 15:19:30 hostname kernel: blk_mq_exit_queue+0x141/0x150
Oct  1 15:19:30 hostname kernel: del_gendisk+0x25a/0x2d0
Oct  1 15:19:30 hostname kernel: nvme_ns_remove+0xc9/0x170
Oct  1 15:19:30 hostname kernel: nvme_remove_namespaces+0xc7/0x100
Oct  1 15:19:30 hostname kernel: nvme_remove+0x62/0x150
Oct  1 15:19:30 hostname kernel: pci_device_remove+0x23/0x60
Oct  1 15:19:30 hostname kernel: device_release_driver_internal+0x159/0x200
Oct  1 15:19:30 hostname kernel: unbind_store+0x99/0xa0
Oct  1 15:19:30 hostname kernel: kernfs_fop_write_iter+0x112/0x1e0
Oct  1 15:19:30 hostname kernel: vfs_write+0x2b1/0x3d0
Oct  1 15:19:30 hostname kernel: ksys_write+0x4e/0xb0
Oct  1 15:19:30 hostname kernel: do_syscall_64+0x5b/0x160
Oct  1 15:19:30 hostname kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
Oct  1 15:19:30 hostname kernel: RIP: 0033:0x7f12cf2fe02f
Oct  1 15:19:30 hostname kernel: RSP: 002b:00007f12311f78e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Oct  1 15:19:30 hostname kernel: RAX: ffffffffffffffda RBX: 00007f12311ff5c8 RCX: 00007f12cf2fe02f
Oct  1 15:19:30 hostname kernel: RDX: 000000000000000c RSI: 00007f12081c19a0 RDI: 000000000000003b
Oct  1 15:19:30 hostname kernel: RBP: 000000000000000c R08: 0000000000000000 R09: 0000000000000002
Oct  1 15:19:30 hostname kernel: R10: 0000000000000002 R11: 0000000000000293 R12: 00007f12cae00700
Oct  1 15:19:30 hostname kernel: R13: 00007f12081c19a0 R14: 000000000000003b R15: 00007f1220219990
Oct  1 15:19:30 hostname kernel: </TASK>

> 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 19:15         ` Mohamed Khalfella
@ 2025-12-04 19:31           ` Bart Van Assche
  2025-12-04 19:57             ` Mohamed Khalfella
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-12-04 19:31 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Casey Chen, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, Waiman Long, Hillf Danton, linux-nvme, linux-block,
	linux-kernel

On 12/4/25 9:15 AM, Mohamed Khalfella wrote:
> The stacktraces are from an old 6.6.9 kernel.

Please always include stack traces from a recent upstream kernel in
patch descriptions.

> However, the issue is still
> applicable to recent kernels. Here is an example from a 6.13 kernel.

Thanks, these stack traces make it clear what is causing the deadlock.

From nvme_timeout():

	/*
	 * Reset immediately if the controller is failed
	 */
	if (nvme_should_reset(dev, csts)) {
		nvme_warn_reset(dev, csts);
		nvme_dev_disable(dev, false);
		nvme_reset_ctrl(&dev->ctrl);
		return BLK_EH_DONE;
	}

Is my understanding correct that the above code is involved in the
reported deadlock? If so, has it been considered to run the code inside
the if-statement asynchronously (queue_work()) instead of calling it
synchronously? Would this be sufficient to fix the deadlock?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 19:31           ` Bart Van Assche
@ 2025-12-04 19:57             ` Mohamed Khalfella
  2025-12-04 20:24               ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-04 19:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Casey Chen, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, Waiman Long, Hillf Danton, linux-nvme, linux-block,
	linux-kernel

On Thu 2025-12-04 09:31:55 -1000, Bart Van Assche wrote:
> On 12/4/25 9:15 AM, Mohamed Khalfella wrote:
> > The stacktraces are from old 6.6.9 kernel.
> 
> Please always include stack traces from a recent upstream kernel in
> patch descriptions.
> 

Good point. Will do that in next version of the patch.

> > However, the issue is still
> > applicable to recent kernels. Here is an example from a 6.13 kernel.
> 
> Thanks, these stack traces make it clear what is causing the deadlock.
> 
>  From nvme_timeout():
> 
> 	/*
> 	 * Reset immediately if the controller is failed
> 	 */
> 	if (nvme_should_reset(dev, csts)) {
> 		nvme_warn_reset(dev, csts);
> 		nvme_dev_disable(dev, false);
> 		nvme_reset_ctrl(&dev->ctrl);
> 		return BLK_EH_DONE;
> 	}
> 
> Is my understanding correct that the above code is involved in the
> reported deadlock? If so, has it been considered to run the code inside
> the if-statement asynchronously (queue_work()) instead of calling it
> synchronously? Would this be sufficient to fix the deadlock?
> 

Yes, the above code is involved in the deadlock, but I do not see how
running it in another thread solves the problem. It will still cause a
deadlock between blk_mq_quiesce_tagset() and
blk_mq_del_queue_tag_set(). The latter holds the mutex while waiting for
the queue to be frozen. The former wants the mutex in order to make
progress and cancel inflight requests so that the queue can be frozen.

> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 19:57             ` Mohamed Khalfella
@ 2025-12-04 20:24               ` Bart Van Assche
  2025-12-04 21:26                 ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-12-04 20:24 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Casey Chen, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, Waiman Long, Hillf Danton, linux-nvme, linux-block,
	linux-kernel


On 12/4/25 9:57 AM, Mohamed Khalfella wrote:
> I do not see how running this code in another thread will solve the
> problem.
blk_mq_freeze_queue_wait() waits forever because nvme_timeout() waits
for blk_mq_freeze_queue_wait() to finish. Hence, the deadlock can be
solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
I/O and does not call blk_mq_quiesce_tagset() then the
blk_mq_freeze_queue_wait() call will finish instead of triggering a
deadlock. However, I do not know whether this proposal seems acceptable
to the NVMe maintainers.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 20:24               ` Bart Van Assche
@ 2025-12-04 21:26                 ` Keith Busch
  2025-12-04 23:22                   ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2025-12-04 21:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel

On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:
> 
> On 12/4/25 9:57 AM, Mohamed Khalfella wrote:
> > I do not see how running this code in another thread will solve the
> > problem.
> blk_mq_freeze_queue_wait() waits forever because nvme_timeout() waits
> for blk_mq_freeze_queue_wait() to finish. 

No, nvme_timeout() does NOT wait for the freeze to finish. It wants the
freeze to finish, but it proceeds whether it finished or not. If the
freeze didn't complete, anything outstanding after that will be
forcefully reclaimed and requeued once we ensure the device is disabled.

> Hence, the deadlock can be
> solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
> and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
> I/O and does not call blk_mq_quiesce_tagset() then the
> blk_mq_freeze_queue_wait() call will finish instead of triggering a
> deadlock. However, I do not know whether this proposal seems acceptable
> to the NVMe maintainers.

You periodically make this suggestion, but there has never been a reason
offered to introduce yet another work queue for the driver to
synchronize with at various points. The whole point of moving the blk-mq
timeout handler into a work queue (it used to be a timer) was so that we
could do blocking actions like this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 21:26                 ` Keith Busch
@ 2025-12-04 23:22                   ` Bart Van Assche
  2025-12-05  1:32                     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-12-04 23:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel


On 12/4/25 11:26 AM, Keith Busch wrote:
> On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:
>> Hence, the deadlock can be
>> solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
>> and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
>> I/O and does not call blk_mq_quiesce_tagset() then the
>> blk_mq_freeze_queue_wait() call will finish instead of triggering a
>> deadlock. However, I do not know whether this proposal seems acceptable
>> to the NVMe maintainers.
> 
> You periodically make this suggestion, but there's never a reason
> offered to introduce yet another work queue for the driver to
> synchronize with at various points. The whole point of making blk-mq
> timeout handler in a work queue (it used to be a timer) was so that we
> could do blocking actions like this.
Hi Keith,

The blk_mq_quiesce_tagset() call from the NVMe timeout handler is
unfortunate because it triggers a deadlock with
blk_mq_update_tag_set_shared().

I proposed to modify the NVMe driver because I think that's a better
approach than introducing a new synchronize_rcu() call in the block
layer core.

However, there may be better approaches for fixing this in the NVMe
driver than what I proposed so far.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-04 23:22                   ` Bart Van Assche
@ 2025-12-05  1:32                     ` Keith Busch
  2025-12-05  2:52                       ` Bart Van Assche
                                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Keith Busch @ 2025-12-05  1:32 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel

On Thu, Dec 04, 2025 at 01:22:49PM -1000, Bart Van Assche wrote:
> 
> On 12/4/25 11:26 AM, Keith Busch wrote:
> > On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:
> > > Hence, the deadlock can be
> > > solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
> > > and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
> > > I/O and does not call blk_mq_quiesce_tagset() then the
> > > blk_mq_freeze_queue_wait() call will finish instead of triggering a
> > > deadlock. However, I do not know whether this proposal seems acceptable
> > > to the NVMe maintainers.
> > 
> > You periodically make this suggestion, but there's never a reason
> > offered to introduce yet another work queue for the driver to
> > synchronize with at various points. The whole point of making blk-mq
> > timeout handler in a work queue (it used to be a timer) was so that we
> > could do blocking actions like this.
> Hi Keith,
> 
> The blk_mq_quiesce_tagset() call from the NVMe timeout handler is
> unfortunate because it triggers a deadlock with
> blk_mq_update_tag_set_shared().

So in this scenario, the driver is modifying a tagset list from two
queues to one, which causes blk-mq to clear the "shared" flags. The
remaining one just so happens to have hit a timeout at the same time,
which runs in a context with an elevated "q_usage_counter". The current
rule, then, is you can not take the tag_list_lock from any context using
any queue in the tag list.

> I proposed to modify the NVMe driver because I think that's a better
> approach than introducing a new synchronize_rcu() call in the block
> layer core.

I'm not interested in introducing rcu synchronize here either. I guess I
would make it so you can quiesce a tagset from a context that entered
the queue. So quick shot at that here:

---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e96bb2462475..20450017b9512 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4262,11 +4262,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
  */
-static void queue_set_hctx_shared(struct request_queue *q, bool shared)
+static void queue_set_hctx_shared_locked(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
+	bool shared;
+
+	lockdep_assert_held(&set->tag_list_lock);
 
+	shared = set->flags & BLK_MQ_F_TAG_QUEUE_SHARED;
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
@@ -4277,24 +4282,22 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
-					 bool shared)
+static void queue_set_hctx_shared(struct request_queue *q)
 {
-	struct request_queue *q;
+	struct blk_mq_tag_set *set = q->tag_set;
 	unsigned int memflags;
 
-	lockdep_assert_held(&set->tag_list_lock);
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		memflags = blk_mq_freeze_queue(q);
-		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q, memflags);
-	}
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&set->tag_list_lock);
+	queue_set_hctx_shared_locked(q);
+	mutex_unlock(&set->tag_list_lock);
+	blk_mq_unfreeze_queue(q, memflags);
 }
 
 static void blk_mq_del_queue_tag_set(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *shared = NULL;
 
 	mutex_lock(&set->tag_list_lock);
 	list_del(&q->tag_set_list);
@@ -4302,15 +4305,25 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, false);
+		shared = list_first_entry(&set->tag_list, struct request_queue,
+					  tag_set_list);
+		if (!blk_get_queue(shared))
+			shared = NULL;
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
+
+	if (shared) {
+		queue_set_hctx_shared(shared);
+		blk_put_queue(shared);
+	}
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
+	struct request_queue *shared = NULL;
+
 	mutex_lock(&set->tag_list_lock);
 
 	/*
@@ -4318,15 +4331,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 */
 	if (!list_empty(&set->tag_list) &&
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, true);
+		shared = list_first_entry(&set->tag_list, struct request_queue,
+					  tag_set_list);
+		if (!blk_get_queue(shared))
+			shared = NULL;
+		else
+			set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		queue_set_hctx_shared(q, true);
+		queue_set_hctx_shared_locked(q);
 	list_add_tail(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
+
+	if (shared) {
+		queue_set_hctx_shared(shared);
+		blk_put_queue(shared);
+	}
 }
 
 /* All allocations will be freed in release handler of q->mq_kobj */
--

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-05  1:32                     ` Keith Busch
@ 2025-12-05  2:52                       ` Bart Van Assche
  2025-12-05 16:39                       ` Mohamed Khalfella
  2025-12-08 19:22                       ` Bart Van Assche
  2 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-12-05  2:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel

On 12/4/25 3:32 PM, Keith Busch wrote:
> I'm not interested in introducing rcu synchronize here either. I guess I
> would make it so you can quiesce a tagset from a context that entered
> the queue. So quick shot at that here: [ ... ]
The approach of this patch looks very interesting to me. I will take a 
closer look at this patch when I have more time.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-05  1:32                     ` Keith Busch
  2025-12-05  2:52                       ` Bart Van Assche
@ 2025-12-05 16:39                       ` Mohamed Khalfella
  2025-12-05 18:11                         ` Keith Busch
  2025-12-08 19:22                       ` Bart Van Assche
  2 siblings, 1 reply; 17+ messages in thread
From: Mohamed Khalfella @ 2025-12-05 16:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel

On Thu 2025-12-04 18:32:31 -0700, Keith Busch wrote:
> On Thu, Dec 04, 2025 at 01:22:49PM -1000, Bart Van Assche wrote:
> > 
> > On 12/4/25 11:26 AM, Keith Busch wrote:
> > > On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:
> > > > Hence, the deadlock can be
> > > > solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
> > > > and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
> > > > I/O and does not call blk_mq_quiesce_tagset() then the
> > > > blk_mq_freeze_queue_wait() call will finish instead of triggering a
> > > > deadlock. However, I do not know whether this proposal seems acceptable
> > > > to the NVMe maintainers.
> > > 
> > > You periodically make this suggestion, but there's never a reason
> > > offered to introduce yet another work queue for the driver to
> > > synchronize with at various points. The whole point of making blk-mq
> > > timeout handler in a work queue (it used to be a timer) was so that we
> > > could do blocking actions like this.
> > Hi Keith,
> > 
> > The blk_mq_quiesce_tagset() call from the NVMe timeout handler is
> > unfortunate because it triggers a deadlock with
> > blk_mq_update_tag_set_shared().
> 
> So in this scenario, the driver is modifying a tagset list from two
> queues to one, which causes blk-mq to clear the "shared" flags. The
> remaining one just so happens to have hit a timeout at the same time,
> which runs in a context with an elevated "q_usage_counter". The current
> rule, then, is you can not take the tag_list_lock from any context using
> any queue in the tag list.
> 
> > I proposed to modify the NVMe driver because I think that's a better
> > approach than introducing a new synchronize_rcu() call in the block
> > layer core.
> 
> I'm not interested in introducing rcu synchronize here either. I guess I
> would make it so you can quiesce a tagset from a context that entered
> the queue. So quick shot at that here:

Why is synchronize_rcu() intolerable in blk_mq_del_queue_tag_set()?
This code is not performance-sensitive, right?

Looking at the code again, I _think_ synchronize_rcu() along with
INIT_LIST_HEAD(&q->tag_set_list) can be deleted. I do not see a use case
where "q" is re-added to a tagset after it is deleted from one.
Also, "q" is freed in blk_free_queue() after the end of an RCU grace period.

Are there any other concerns with this approach other than
synchronize_rcu()? If not, I will delete it and submit v3.

> 
> ---
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4e96bb2462475..20450017b9512 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4262,11 +4262,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>   * Caller needs to ensure that we're either frozen/quiesced, or that
>   * the queue isn't live yet.
>   */
> -static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> +static void queue_set_hctx_shared_locked(struct request_queue *q)
>  {
> +	struct blk_mq_tag_set *set = q->tag_set;
>  	struct blk_mq_hw_ctx *hctx;
>  	unsigned long i;
> +	bool shared;
> +
> +	lockdep_assert_held(&set->tag_list_lock);
>  
> +	shared = set->flags & BLK_MQ_F_TAG_QUEUE_SHARED;
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		if (shared) {
>  			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> @@ -4277,24 +4282,22 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>  	}
>  }
>  
> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> -					 bool shared)
> +static void queue_set_hctx_shared(struct request_queue *q)
>  {
> -	struct request_queue *q;
> +	struct blk_mq_tag_set *set = q->tag_set;
>  	unsigned int memflags;
>  
> -	lockdep_assert_held(&set->tag_list_lock);
> -
> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> -		memflags = blk_mq_freeze_queue(q);
> -		queue_set_hctx_shared(q, shared);
> -		blk_mq_unfreeze_queue(q, memflags);
> -	}
> +	memflags = blk_mq_freeze_queue(q);
> +	mutex_lock(&set->tag_list_lock);
> +	queue_set_hctx_shared_locked(q);
> +	mutex_unlock(&set->tag_list_lock);
> +	blk_mq_unfreeze_queue(q, memflags);
>  }
>  
>  static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *shared = NULL;
>  
>  	mutex_lock(&set->tag_list_lock);
>  	list_del(&q->tag_set_list);
> @@ -4302,15 +4305,25 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  		/* just transitioned to unshared */
>  		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
>  		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, false);
> +		shared = list_first_entry(&set->tag_list, struct request_queue,
> +					  tag_set_list);
> +		if (!blk_get_queue(shared))
> +			shared = NULL;
>  	}
>  	mutex_unlock(&set->tag_list_lock);
>  	INIT_LIST_HEAD(&q->tag_set_list);
> +
> +	if (shared) {
> +		queue_set_hctx_shared(shared);
> +		blk_put_queue(shared);
> +	}
>  }
>  
>  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>  				     struct request_queue *q)
>  {
> +	struct request_queue *shared = NULL;
> +
>  	mutex_lock(&set->tag_list_lock);
>  
>  	/*
> @@ -4318,15 +4331,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>  	 */
>  	if (!list_empty(&set->tag_list) &&
>  	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
>  		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, true);
> +		shared = list_first_entry(&set->tag_list, struct request_queue,
> +					  tag_set_list);
> +		if (!blk_get_queue(shared))
> +			shared = NULL;
> +		else
> +			set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
>  	}
>  	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> -		queue_set_hctx_shared(q, true);
> +		queue_set_hctx_shared_locked(q);
>  	list_add_tail(&q->tag_set_list, &set->tag_list);
>  
>  	mutex_unlock(&set->tag_list_lock);
> +
> +	if (shared) {
> +		queue_set_hctx_shared(shared);
> +		blk_put_queue(shared);
> +	}
>  }
>  
>  /* All allocations will be freed in release handler of q->mq_kobj */
> --

__blk_mq_update_nr_hw_queues() freezes the queues while holding
set->tag_list_lock. Can this cause the same deadlock?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-05 16:39                       ` Mohamed Khalfella
@ 2025-12-05 18:11                         ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2025-12-05 18:11 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Bart Van Assche, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel

On Fri, Dec 05, 2025 at 08:39:26AM -0800, Mohamed Khalfella wrote:
> Why sychronize_rcu() is intolerable in this blk_mq_del_queue_tag_set()?
> This code is not performance sensitive, right?

synchronize_rcu() gets expensive on servers with many CPUs. While this
is not a fast path, I think adding it here would still be harmful, if
only to testing scenarios where devices are frequently created and deleted.

> Looking at the code again, I _think_ synchronize_rcu() along with 
> INIT_LIST_HEAD(&q->tag_set_list) can be deleted. I do not see usecase
> where "q" is re-added to a tagset after it is deleted from one.
> Also, "q" is freed in blk_free_queue() after end of RCU grace period.

I think you could skip the synchronize since the queue's memory free is
done in blk_free_queue_rcu from a call_rcu() callback.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  2025-12-05  1:32                     ` Keith Busch
  2025-12-05  2:52                       ` Bart Van Assche
  2025-12-05 16:39                       ` Mohamed Khalfella
@ 2025-12-08 19:22                       ` Bart Van Assche
  2 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-12-08 19:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Casey Chen, Yuanyuan Zhong,
	Hannes Reinecke, Ming Lei, Waiman Long, Hillf Danton, linux-nvme,
	linux-block, linux-kernel

On 12/4/25 6:32 PM, Keith Busch wrote:
>   static void blk_mq_del_queue_tag_set(struct request_queue *q)
>   {
>   	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *shared = NULL;
>   
>   	mutex_lock(&set->tag_list_lock);
>   	list_del(&q->tag_set_list);
> @@ -4302,15 +4305,25 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>   		/* just transitioned to unshared */
>   		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
>   		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, false);
> +		shared = list_first_entry(&set->tag_list, struct request_queue,
> +					  tag_set_list);
> +		if (!blk_get_queue(shared))
> +			shared = NULL;
>   	}
>   	mutex_unlock(&set->tag_list_lock);
>   	INIT_LIST_HEAD(&q->tag_set_list);
> +
> +	if (shared) {
> +		queue_set_hctx_shared(shared);
> +		blk_put_queue(shared);
> +	}
>   }

Although harmless, with this approach the queue_set_hctx_shared() calls
by blk_mq_del_queue_tag_set() and blk_mq_add_queue_tag_set() can be
reordered. I like Mohamed's approach better because it results in code
that is easier to review and to reason about.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-12-08 19:22 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-04 18:11 [PATCH 0/1] Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock Mohamed Khalfella
2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
2025-12-04 18:22   ` Bart Van Assche
2025-12-04 18:42     ` Mohamed Khalfella
2025-12-04 19:06       ` Bart Van Assche
2025-12-04 19:15         ` Mohamed Khalfella
2025-12-04 19:31           ` Bart Van Assche
2025-12-04 19:57             ` Mohamed Khalfella
2025-12-04 20:24               ` Bart Van Assche
2025-12-04 21:26                 ` Keith Busch
2025-12-04 23:22                   ` Bart Van Assche
2025-12-05  1:32                     ` Keith Busch
2025-12-05  2:52                       ` Bart Van Assche
2025-12-05 16:39                       ` Mohamed Khalfella
2025-12-05 18:11                         ` Keith Busch
2025-12-08 19:22                       ` Bart Van Assche
2025-12-04 19:02   ` Mohamed Khalfella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox