* [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
@ 2025-11-13 20:23 Mohamed Khalfella
  2025-11-14  4:56 ` Chaitanya Kulkarni
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Mohamed Khalfella @ 2025-11-13 20:23 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni
  Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, linux-nvme, linux-block, linux-kernel,
	Mohamed Khalfella

blk_mq_{add,del}_queue_tag_set() add and remove queues from a tagset.
The functions make sure that the tagset and its queues are marked as
shared when two or more queues are attached to the same tagset.
Initially a tagset starts as unshared, and when the number of added
queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
with all the queues attached to it. When the number of attached queues
drops to one, blk_mq_del_queue_tag_set() needs to mark both the tagset
and the remaining queue as unshared.

Both functions need to freeze the current queues in the tagset before
setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing
so, both functions hold the set->tag_list_lock mutex, which makes sense
as we do not want queues to be added or deleted in the process. This
used to work fine until commit 98d81f0df70c ("nvme: use
blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset
instead of quiescing individual queues. blk_mq_quiesce_tagset() does
the job and quiesces the queues in set->tag_list while also holding
set->tag_list_lock.

This results in a deadlock between two threads with these stacktraces:

  __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  schedule_preempt_disabled+0x11/0x20
  __mutex_lock.constprop.0+0x3cc/0x760
  blk_mq_quiesce_tagset+0x26/0xd0
  nvme_dev_disable_locked+0x77/0x280 [nvme]
  nvme_timeout+0x268/0x320 [nvme]
  blk_mq_handle_expired+0x5d/0x90
  bt_iter+0x7e/0x90
  blk_mq_queue_tag_busy_iter+0x2b2/0x590
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __blk_mq_complete_request_remote+0x10/0x10
  blk_mq_timeout_work+0x15b/0x1a0
  process_one_work+0x133/0x2f0
  ? mod_delayed_work_on+0x90/0x90
  worker_thread+0x2ec/0x400
  ? mod_delayed_work_on+0x90/0x90
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x2d/0x50
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork_asm+0x11/0x20

  __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  blk_mq_freeze_queue_wait+0x62/0x90
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_exit_queue+0x151/0x180
  disk_release+0xe3/0xf0
  device_release+0x31/0x90
  kobject_put+0x6d/0x180
  nvme_scan_ns+0x858/0xc90 [nvme_core]
  ? nvme_scan_work+0x281/0x560 [nvme_core]
  nvme_scan_work+0x281/0x560 [nvme_core]
  process_one_work+0x133/0x2f0
  ? mod_delayed_work_on+0x90/0x90
  worker_thread+0x2ec/0x400
  ? mod_delayed_work_on+0x90/0x90
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x2d/0x50
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork_asm+0x11/0x20

The top stacktrace shows nvme_timeout() called to handle an nvme
command timeout. The timeout handler is trying to disable the
controller and, as a first step, it calls blk_mq_quiesce_tagset() to
tell blk-mq not to call queue callback handlers. The thread is stuck
waiting for set->tag_list_lock as it tries to walk the queues in
set->tag_list.

The lock is held by the second thread in the bottom stack, which is
waiting for one of the queues to be frozen. The queue usage counter
will drop to zero only after nvme_timeout() finishes, which will never
happen because that thread waits for this mutex forever.
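
In other words, a simplified sketch of the two paths above (not actual
code; names taken from the stacktraces):

	/* thread A: nvme_scan_work() -> blk_mq_exit_queue() */
	mutex_lock(&set->tag_list_lock);
	blk_mq_freeze_queue_wait(q);	/* waits for the timed-out request */

	/* thread B: blk_mq_timeout_work() -> nvme_timeout() */
	blk_mq_quiesce_tagset(set);
		mutex_lock(&set->tag_list_lock);	/* blocks: held by A */

	/* A's wait only ends when B's nvme_timeout() completes the
	 * request, and B only proceeds when A drops the lock. */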

Convert the set->tag_list_lock mutex to the set->tag_list_rwsem
rw_semaphore to avoid the deadlock. Update blk_mq_[un]quiesce_tagset()
to take the semaphore for read, since this is enough to guarantee that
no queues will be added or removed. Update
blk_mq_{add,del}_queue_tag_set() to take the semaphore for write while
updating set->tag_list, and to downgrade it to read while freezing the
queues. It should be safe to update set->flags and hctx->flags while
holding the semaphore for read since the queues are already frozen.
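
The resulting pattern in blk_mq_{add,del}_queue_tag_set() is roughly
the following (simplified from the diff below):

	down_write(&set->tag_list_rwsem);	/* exclusive: edit tag_list */
	list_add_tail(&q->tag_set_list, &set->tag_list);
	downgrade_write(&set->tag_list_rwsem);	/* new writers stay excluded */
	blk_mq_update_tag_set_shared(set, true);	/* freezes the queues */
	up_read(&set->tag_list_rwsem);

A concurrent blk_mq_quiesce_tagset(), which only needs the read side,
can then make progress while the queues are being frozen.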

Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 block/blk-mq-sysfs.c   | 10 +++----
 block/blk-mq.c         | 63 ++++++++++++++++++++++--------------------
 include/linux/blk-mq.h |  4 +--
 3 files changed, 40 insertions(+), 37 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 58ec293373c6..f474781654fb 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
 
 	kobject_uevent(q->mq_kobj, KOBJ_ADD);
 
-	mutex_lock(&q->tag_set->tag_list_lock);
+	down_read(&q->tag_set->tag_list_rwsem);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
 			goto out_unreg;
 	}
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 	return 0;
 
 out_unreg:
@@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
 		if (j < i)
 			blk_mq_unregister_hctx(hctx);
 	}
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 
 	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
 	kobject_del(q->mq_kobj);
@@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	mutex_lock(&q->tag_set->tag_list_lock);
+	down_read(&q->tag_set->tag_list_rwsem);
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 
 	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
 	kobject_del(q->mq_kobj);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d626d32f6e57..1770277fe453 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;
 
-	mutex_lock(&set->tag_list_lock);
+	down_read(&set->tag_list_rwsem);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_quiesce_queue_nowait(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	up_read(&set->tag_list_rwsem);
 
 	blk_mq_wait_quiesce_done(set);
 }
@@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;
 
-	mutex_lock(&set->tag_list_lock);
+	down_read(&set->tag_list_rwsem);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_unquiesce_queue(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	up_read(&set->tag_list_rwsem);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
 
@@ -4280,7 +4280,7 @@ static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
 	struct request_queue *q;
 	unsigned int memflags;
 
-	lockdep_assert_held(&set->tag_list_lock);
+	lockdep_assert_held(&set->tag_list_rwsem);
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		memflags = blk_mq_freeze_queue(q);
@@ -4293,37 +4293,40 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
 	list_del(&q->tag_set_list);
-	if (list_is_singular(&set->tag_list)) {
-		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
-		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, false);
+	if (!list_is_singular(&set->tag_list)) {
+		up_write(&set->tag_list_rwsem);
+		goto out;
 	}
-	mutex_unlock(&set->tag_list_lock);
+
+	/* Transitioning to unshared. */
+	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
+	downgrade_write(&set->tag_list_rwsem);
+	blk_mq_update_tag_set_shared(set, false);
+	up_read(&set->tag_list_rwsem);
+out:
 	INIT_LIST_HEAD(&q->tag_set_list);
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
+	if (!list_is_singular(&set->tag_list)) {
+		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+			queue_set_hctx_shared(q, true);
+		list_add_tail(&q->tag_set_list, &set->tag_list);
+		up_write(&set->tag_list_rwsem);
+		return;
+	}
 
-	/*
-	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
-	 */
-	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
-		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, true);
-	}
-	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		queue_set_hctx_shared(q, true);
+	/* Transitioning to shared. */
+	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 	list_add_tail(&q->tag_set_list, &set->tag_list);
-
-	mutex_unlock(&set->tag_list_lock);
+	downgrade_write(&set->tag_list_rwsem);
+	blk_mq_update_tag_set_shared(set, true);
+	up_read(&set->tag_list_rwsem);
 }
 
 /* All allocations will be freed in release handler of q->mq_kobj */
@@ -4855,7 +4858,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
-	mutex_init(&set->tag_list_lock);
+	init_rwsem(&set->tag_list_rwsem);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
@@ -5044,7 +5047,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	struct xarray elv_tbl, et_tbl;
 	bool queues_frozen = false;
 
-	lockdep_assert_held(&set->tag_list_lock);
+	lockdep_assert_held_write(&set->tag_list_rwsem);
 
 	if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
@@ -5129,9 +5132,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	down_write(&set->update_nr_hwq_lock);
-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
-	mutex_unlock(&set->tag_list_lock);
+	up_write(&set->tag_list_rwsem);
 	up_write(&set->update_nr_hwq_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46..4c8441671518 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -502,7 +502,7 @@ enum hctx_type {
  * @shared_tags:
  *		   Shared set of tags. Has @nr_hw_queues elements. If set,
  *		   shared by all @tags.
- * @tag_list_lock: Serializes tag_list accesses.
+ * @tag_list_rwsem: Serializes tag_list accesses.
  * @tag_list:	   List of the request queues that use this tag set. See also
  *		   request_queue.tag_set_list.
  * @srcu:	   Use as lock when type of the request queue is blocking
@@ -530,7 +530,7 @@ struct blk_mq_tag_set {
 
 	struct blk_mq_tags	*shared_tags;
 
-	struct mutex		tag_list_lock;
+	struct rw_semaphore	tag_list_rwsem;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
 	struct srcu_struct	tags_srcu;
-- 
2.51.0



* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-13 20:23 [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock Mohamed Khalfella
@ 2025-11-14  4:56 ` Chaitanya Kulkarni
  2025-11-14  5:34   ` Mohamed Khalfella
  2025-11-14 11:41 ` Ming Lei
  2025-11-17  2:30 ` Ming Lei
  2 siblings, 1 reply; 9+ messages in thread
From: Chaitanya Kulkarni @ 2025-11-14  4:56 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, linux-nvme@lists.infradead.org, Sagi Grimberg,
	Jens Axboe, linux-block@vger.kernel.org, Chaitanya Kulkarni,
	linux-kernel@vger.kernel.org, Keith Busch

On 11/13/25 12:23, Mohamed Khalfella wrote:
> [...]
>
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>

I think there is no better way to solve this in the nvme code?

Will it have any impact on existing users, if any, that rely on the
current mutex-based implementation?

BTW, thanks for reporting this and providing a patch.


[...]

>   static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>   				     struct request_queue *q)
>   {
> -	mutex_lock(&set->tag_list_lock);
> +	down_write(&set->tag_list_rwsem);
> +	if (!list_is_singular(&set->tag_list)) {
> +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> +			queue_set_hctx_shared(q, true);
> +		list_add_tail(&q->tag_set_list, &set->tag_list);
> +		up_write(&set->tag_list_rwsem);
> +		return;
> +	}
>   
> -	/*
> -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> -	 */
> -	if (!list_empty(&set->tag_list) &&
> -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, true);
> -	}
> -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> -		queue_set_hctx_shared(q, true);
> +	/* Transitioning to shared. */
> +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
>   	list_add_tail(&q->tag_set_list, &set->tag_list);
> -
> -	mutex_unlock(&set->tag_list_lock);
> +	downgrade_write(&set->tag_list_rwsem);

Do we need a comment here on what to expect, since downgrade_write()
is not as common as mutex_unlock()/down_write(), before merging the
patch?

-ck




* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-14  4:56 ` Chaitanya Kulkarni
@ 2025-11-14  5:34   ` Mohamed Khalfella
  0 siblings, 0 replies; 9+ messages in thread
From: Mohamed Khalfella @ 2025-11-14  5:34 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, linux-nvme@lists.infradead.org, Sagi Grimberg,
	Jens Axboe, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, Keith Busch

On Fri 2025-11-14 04:56:52 +0000, Chaitanya Kulkarni wrote:
> On 11/13/25 12:23, Mohamed Khalfella wrote:
> > [...]
> 
> I think there is no better way to solve this in the nvme code?

I cannot think of a way to fix this issue within the nvme code.

> 
> Will it have any impact on existing users, if any, that rely on the
> current mutex-based implementation?
> 

I audited the code paths that use the mutex to the best of my
knowledge. I think this change should not have any impact on existing
code that uses the mutex.

> BTW, thanks for reporting this and providing a patch.
> 

No problem.

> 
> [...]
> 
> >   static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> >   				     struct request_queue *q)
> >   {
> > -	mutex_lock(&set->tag_list_lock);
> > +	down_write(&set->tag_list_rwsem);
> > +	if (!list_is_singular(&set->tag_list)) {
> > +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > +			queue_set_hctx_shared(q, true);
> > +		list_add_tail(&q->tag_set_list, &set->tag_list);
> > +		up_write(&set->tag_list_rwsem);
> > +		return;
> > +	}
> >   
> > -	/*
> > -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> > -	 */
> > -	if (!list_empty(&set->tag_list) &&
> > -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> > -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > -		/* update existing queue */
> > -		blk_mq_update_tag_set_shared(set, true);
> > -	}
> > -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > -		queue_set_hctx_shared(q, true);
> > +	/* Transitioning to shared. */
> > +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> >   	list_add_tail(&q->tag_set_list, &set->tag_list);
> > -
> > -	mutex_unlock(&set->tag_list_lock);
> > +	downgrade_write(&set->tag_list_rwsem);
> 
> Do we need a comment here on what to expect, since downgrade_write()
> is not as common as mutex_unlock()/down_write(), before merging the
> patch?
> 

	/*
	 * Downgrade the semaphore before freezing the queues to avoid
	 * deadlock with a thread trying to quiesce the tagset before
	 * completing requests.
	 */

Yes, this could use some explanation. How about the three lines above?

> -ck
> 
> 


* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-13 20:23 [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock Mohamed Khalfella
  2025-11-14  4:56 ` Chaitanya Kulkarni
@ 2025-11-14 11:41 ` Ming Lei
  2025-11-14 17:34   ` Mohamed Khalfella
  2025-11-17  2:30 ` Ming Lei
  2 siblings, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-11-14 11:41 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> [...]

It is an AB-BA deadlock; lockdep should have complained about it, but
nvme doesn't support owned queue freeze.

Maybe the following change can avoid it?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c916176bd9f0..9967c4a7e72d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3004,6 +3004,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
        bool dead;

        mutex_lock(&dev->shutdown_lock);
+       nvme_quiesce_io_queues(&dev->ctrl);
        dead = nvme_pci_ctrl_is_dead(dev);
        if (state == NVME_CTRL_LIVE || state == NVME_CTRL_RESETTING) {
                if (pci_is_enabled(pdev))
@@ -3016,8 +3017,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
                        nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
        }

-       nvme_quiesce_io_queues(&dev->ctrl);
-
        if (!dead && dev->ctrl.queue_count > 0) {
                nvme_delete_io_queues(dev);
                nvme_disable_ctrl(&dev->ctrl, shutdown);


Thanks, 
Ming



* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-14 11:41 ` Ming Lei
@ 2025-11-14 17:34   ` Mohamed Khalfella
  2025-11-16 15:15     ` Ming Lei
  0 siblings, 1 reply; 9+ messages in thread
From: Mohamed Khalfella @ 2025-11-14 17:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Fri 2025-11-14 19:41:14 +0800, Ming Lei wrote:
> On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> > [...]
> 
> It is an AB-BA deadlock; lockdep should have complained about it, but
> nvme doesn't support owned queue freeze.
> 
> Maybe the following change can avoid it?
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c916176bd9f0..9967c4a7e72d 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3004,6 +3004,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>         bool dead;
> 
>         mutex_lock(&dev->shutdown_lock);
> +       nvme_quiesce_io_queues(&dev->ctrl);
>         dead = nvme_pci_ctrl_is_dead(dev);
>         if (state == NVME_CTRL_LIVE || state == NVME_CTRL_RESETTING) {
>                 if (pci_is_enabled(pdev))
> @@ -3016,8 +3017,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>                         nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
>         }
> 
> -       nvme_quiesce_io_queues(&dev->ctrl);
> -
>         if (!dead && dev->ctrl.queue_count > 0) {
>                 nvme_delete_io_queues(dev);
>                 nvme_disable_ctrl(&dev->ctrl, shutdown);
> 
> 

Interesting. Can you elaborate more on why this diff can help us avoid
the problem?

If the thread doing nvme_scan_work() is waiting for a queue to be
frozen while holding set->tag_list_lock, then the
nvme_quiesce_io_queues() call moved up will still cause the deadlock,
no?


* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-14 17:34   ` Mohamed Khalfella
@ 2025-11-16 15:15     ` Ming Lei
  2025-11-16 16:46       ` Mohamed Khalfella
  0 siblings, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-11-16 15:15 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Fri, Nov 14, 2025 at 09:34:19AM -0800, Mohamed Khalfella wrote:
> On Fri 2025-11-14 19:41:14 +0800, Ming Lei wrote:
> > On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> > > [...]
> > 
> > It is an AB-BA deadlock; lockdep should have complained about it,
> > but nvme doesn't support owned queue freeze.
> > 
> > Maybe the following change can avoid it?
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index c916176bd9f0..9967c4a7e72d 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -3004,6 +3004,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >         bool dead;
> > 
> >         mutex_lock(&dev->shutdown_lock);
> > +       nvme_quiesce_io_queues(&dev->ctrl);
> >         dead = nvme_pci_ctrl_is_dead(dev);
> >         if (state == NVME_CTRL_LIVE || state == NVME_CTRL_RESETTING) {
> >                 if (pci_is_enabled(pdev))
> > @@ -3016,8 +3017,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >                         nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> >         }
> > 
> > -       nvme_quiesce_io_queues(&dev->ctrl);
> > -
> >         if (!dead && dev->ctrl.queue_count > 0) {
> >                 nvme_delete_io_queues(dev);
> >                 nvme_disable_ctrl(&dev->ctrl, shutdown);
> > 
> > 
> 
> Interesting. Can you elaborate more on why this diff can help us avoid
> the problem?

In nvme_dev_disable(), NS queues are frozen first, and then
nvme_quiesce_io_queues() is called.

A queue freeze can be thought of as a lock, so the order in the
timeout code path is q->freeze_lock -> tag_set->tag_list_lock.

However, in blk_mq_exit_queue(), the lock order becomes
tag_set->tag_list_lock -> q->freeze_lock.

That is why I call it an AB-BA deadlock.
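
Roughly, treating the queue freeze as lock "A" and set->tag_list_lock
as lock "B" (a sketch of the two orders, not actual code):

	/* timeout path: nvme_dev_disable() */
	blk_freeze_queue_start(q);		/* acquire A */
	blk_mq_quiesce_tagset(set);		/* takes B internally */

	/* teardown path: blk_mq_exit_queue() */
	mutex_lock(&set->tag_list_lock);	/* acquire B */
	blk_mq_freeze_queue_wait(q);		/* wait on A */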

However, that does not look like the reason for your deadlock, so my
patch shouldn't help here, since nvme_dev_disable() cannot make forward
progress because of ->tag_list_lock.

> 
> If the thread doing nvme_scan_work() is waiting for a queue to be
> frozen while holding set->tag_list_lock, then the
> nvme_quiesce_io_queues() call moved up will still cause the deadlock,
> no?

 __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  blk_mq_freeze_queue_wait+0x62/0x90
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_exit_queue+0x151/0x180
  disk_release+0xe3/0xf0
  device_release+0x31/0x90
  kobject_put+0x6d/0x180

Here blk_mq_exit_queue() is called from disk release, and the disk was
never actually added, so the question is why blk_mq_freeze_queue_wait()
doesn't return. Who holds the queue usage counter here? Can you
investigate a bit and figure out the reason?

For one released NS/disk, no one should grab the queue usage counter,
right?


Thanks,
Ming



* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-16 15:15     ` Ming Lei
@ 2025-11-16 16:46       ` Mohamed Khalfella
  2025-11-17  2:13         ` Ming Lei
  0 siblings, 1 reply; 9+ messages in thread
From: Mohamed Khalfella @ 2025-11-16 16:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Sun 2025-11-16 23:15:38 +0800, Ming Lei wrote:
> On Fri, Nov 14, 2025 at 09:34:19AM -0800, Mohamed Khalfella wrote:
> > On Fri 2025-11-14 19:41:14 +0800, Ming Lei wrote:
> > > On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> > > > [...]
> > > 
> > > It is an AB-BA deadlock; lockdep should have complained about it,
> > > but nvme doesn't support owned queue freeze.
> > > 
> > > Maybe the following change can avoid it?
> > > 
> > > [...]
> > > 
> > 
> > Interesting. Can you elaborate more on why this diff can help us avoid
> > the problem?
> 
> [...]
> 
> > 
> > If the thread doing nvme_scan_work() is waiting for a queue to be
> > frozen while holding set->tag_list_lock, then the
> > nvme_quiesce_io_queues() call moved up will still cause the deadlock,
> > no?
> 
>  __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   blk_mq_freeze_queue_wait+0x62/0x90
>   ? destroy_sched_domains_rcu+0x30/0x30
>   blk_mq_exit_queue+0x151/0x180
>   disk_release+0xe3/0xf0
>   device_release+0x31/0x90
>   kobject_put+0x6d/0x180
> 
> Here blk_mq_exit_queue() is called from disk release, and the disk was
> never actually added, so the question is why blk_mq_freeze_queue_wait()
> doesn't return. Who holds the queue usage counter here? Can you
> investigate a bit and figure out the reason?

The nvme controller has two namespaces. n1 was created and q1 was added
to the tagset. n2 was also created and q2 was added to the tagset. Now
the tagset is shared because it has two queues on it. Before its disk
was added, n2 was found to be a duplicate of n1. That means the disk
should be released, as noted above, and q2 should be removed from the
tagset. Because the n1 block device is ready to receive IO, it is
possible for q1's usage counter to be greater than zero.

Back to removing q2 from the tagset. This requires q1 to be frozen
before it can be marked as unshared. blk_mq_freeze_queue_wait() was
called for q1 and it does not return because there is a request in
flight on the queue. nvme_timeout() was called for that request on q1.
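
As a rough timeline of the scenario (q1/q2 and n1/n2 as above):

	nvme_scan_ns(n1)	/* q1 added, tagset unshared */
	nvme_scan_ns(n2)	/* q2 added, tagset becomes shared */
	/* n2 turns out to duplicate n1: release its (never added) disk */
	disk_release()
	  blk_mq_exit_queue(q2)
	    blk_mq_del_queue_tag_set(q2)
	      mutex_lock(&set->tag_list_lock)
	      blk_mq_freeze_queue_wait(q1)	/* blocks: q1 has a timed-out
						   request in flight */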

> 
> For one released NS/disk, no one should grab the queue usage counter,
> right?

Right. The disk is not visible yet.

> 
> 
> Thanks,
> Ming
> 


* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-16 16:46       ` Mohamed Khalfella
@ 2025-11-17  2:13         ` Ming Lei
  0 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2025-11-17  2:13 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Sun, Nov 16, 2025 at 08:46:06AM -0800, Mohamed Khalfella wrote:
> On Sun 2025-11-16 23:15:38 +0800, Ming Lei wrote:
> > On Fri, Nov 14, 2025 at 09:34:19AM -0800, Mohamed Khalfella wrote:
> > > On Fri 2025-11-14 19:41:14 +0800, Ming Lei wrote:
> > > > On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> > > > [...]
> > > > 
> > > 
> > > Interesting. Can you elaborate more on why this diff can help us avoid
> > > the problem?
> > 
> > [...]
> > 
> > > 
> > > If the thread doing nvme_scan_work() is waiting for a queue to be
> > > frozen while holding set->tag_list_lock, then the
> > > nvme_quiesce_io_queues() call moved up will still cause the deadlock,
> > > no?
> > 
> >  __schedule+0x48e/0xed0
> >   schedule+0x5a/0xc0
> >   blk_mq_freeze_queue_wait+0x62/0x90
> >   ? destroy_sched_domains_rcu+0x30/0x30
> >   blk_mq_exit_queue+0x151/0x180
> >   disk_release+0xe3/0xf0
> >   device_release+0x31/0x90
> >   kobject_put+0x6d/0x180
> > 
> > Here blk_mq_exit_queue() is called from disk release, and the disk was
> > never actually added, so the question is why blk_mq_freeze_queue_wait()
> > doesn't return. Who holds the queue usage counter here? Can you
> > investigate a bit and figure out the reason?
> 
> The nvme controller has two namespaces. n1 was created and q1 was added
> to the tagset. n2 was also created and q2 was added to the tagset. Now
> the tagset is shared because it has two queues on it. Before its disk
> was added, n2 was found to be a duplicate of n1. That means the disk
> should be released, as noted above, and q2 should be removed from the
> tagset. Because the n1 block device is ready to receive IO, it is
> possible for q1's usage counter to be greater than zero.
> 
> Back to removing q2 from the tagset. This requires q1 to be frozen
> before it can be marked as unshared. blk_mq_freeze_queue_wait() was
> called for q1 and it does not return because there is a request in
> flight on the queue. nvme_timeout() was called for that request on q1.

Yeah, it is actually the other request queue being frozen from
blk_mq_exit_queue(). Your approach looks good then; I will take a
closer look.


Thanks,
Ming



* Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-13 20:23 [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock Mohamed Khalfella
  2025-11-14  4:56 ` Chaitanya Kulkarni
  2025-11-14 11:41 ` Ming Lei
@ 2025-11-17  2:30 ` Ming Lei
  2 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2025-11-17  2:30 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> [...]
> ---
>  block/blk-mq-sysfs.c   | 10 +++----
>  block/blk-mq.c         | 63 ++++++++++++++++++++++--------------------
>  include/linux/blk-mq.h |  4 +--
>  3 files changed, 40 insertions(+), 37 deletions(-)
> 
> [...]
>  
> @@ -4280,7 +4280,7 @@ static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
>  	struct request_queue *q;
>  	unsigned int memflags;
>  
> -	lockdep_assert_held(&set->tag_list_lock);
> +	lockdep_assert_held(&set->tag_list_rwsem);
>  
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		memflags = blk_mq_freeze_queue(q);
> @@ -4293,37 +4293,40 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_write(&set->tag_list_rwsem);
>  	list_del(&q->tag_set_list);
> -	if (list_is_singular(&set->tag_list)) {
> -		/* just transitioned to unshared */
> -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, false);
> +	if (!list_is_singular(&set->tag_list)) {
> +		up_write(&set->tag_list_rwsem);
> +		goto out;
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +
> +	/* Transitioning to unshared. */
> +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> +	downgrade_write(&set->tag_list_rwsem);
> +	blk_mq_update_tag_set_shared(set, false);
> +	up_read(&set->tag_list_rwsem);
> +out:
>  	INIT_LIST_HEAD(&q->tag_set_list);
>  }
>  
>  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>  				     struct request_queue *q)
>  {
> -	mutex_lock(&set->tag_list_lock);
> +	down_write(&set->tag_list_rwsem);
> +	if (!list_is_singular(&set->tag_list)) {
> +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> +			queue_set_hctx_shared(q, true);
> +		list_add_tail(&q->tag_set_list, &set->tag_list);
> +		up_write(&set->tag_list_rwsem);
> +		return;
> +	}

It could be more readable to add documentation:

  /*
   * Three cases:
   * 1. Empty list (first queue): Add without shared flag
   * 2. Singular list (1→2 queues): Transition to shared
   * 3. Multiple queues (2+): Add with existing shared flag
   */

>  
> -	/*
> -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> -	 */
> -	if (!list_empty(&set->tag_list) &&
> -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, true);
> -	}
> -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> -		queue_set_hctx_shared(q, true);
> +	/* Transitioning to shared. */
> +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
>  	list_add_tail(&q->tag_set_list, &set->tag_list);
> -
> -	mutex_unlock(&set->tag_list_lock);
> +	downgrade_write(&set->tag_list_rwsem);
> +	blk_mq_update_tag_set_shared(set, true);
> +	up_read(&set->tag_list_rwsem);
>  }

Here you change the code to freeze both queues, while the in-tree code
just freezes the first added queue. It is probably better to keep the
existing behavior and freeze only the old queue.

Also, blk_mq_update_tag_set_shared() is then only applied to a single
queue, so the helper can be removed and queue_set_hctx_shared()
open-coded, which makes the single-queue case obvious and more
readable.
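
E.g. something like the following untested sketch for the del path,
once only one queue is left in the set:

	struct request_queue *q;
	unsigned int memflags;

	/* sketch: only one queue left, mark it unshared directly */
	q = list_first_entry(&set->tag_list, struct request_queue,
			     tag_set_list);
	memflags = blk_mq_freeze_queue(q);
	queue_set_hctx_shared(q, false);
	blk_mq_unfreeze_queue(q, memflags);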

Otherwise, feel free to add:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming


