[PATCH v2 0/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
@ 2025-11-17 20:23 Mohamed Khalfella
  2025-11-17 20:23 ` [PATCH v2 1/1] " Mohamed Khalfella
  0 siblings, 1 reply; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-17 20:23 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni
  Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, linux-nvme, linux-block, linux-kernel,
	Mohamed Khalfella

Changes from v1:

- Keep existing behavior do not freeze newly added "q" when
  transitioning tagset from unshared to shared.
- Delete blk_mq_update_tag_set_shared() and explicitly freeze "firstq"
  and "q" and set their shared status.
- Add comment explaining why we are downgrading the semaphore before
  freezing "firstq".

v1: https://lore.kernel.org/all/20251113202320.2530531-1-mkhalfella@purestorage.com/

Mohamed Khalfella (1):
  nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock

 block/blk-mq-sysfs.c   | 10 ++---
 block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
 include/linux/blk-mq.h |  4 +-
 3 files changed, 58 insertions(+), 51 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-17 20:23 [PATCH v2 0/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock Mohamed Khalfella
@ 2025-11-17 20:23 ` Mohamed Khalfella
  2025-11-18  1:34   ` Hillf Danton
                     ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-17 20:23 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni
  Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, linux-nvme, linux-block, linux-kernel,
	Mohamed Khalfella

blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
tagset, the functions make sure that tagset and queues are marked as
shared when two or more queues are attached to the same tagset.
Initially a tagset starts as unshared and when the number of added
queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
with all the queues attached to it. When the number of attached queues
drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
the remaining queues as unshared.

Both functions need to freeze current queues in tagset before setting on
unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
hold set->tag_list_lock mutex, which makes sense as we do not want
queues to be added or deleted in the process. This used to work fine
until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
made the nvme driver quiesce tagset instead of quiscing individual
queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
set->tag_list while holding set->tag_list_lock also.

This results in deadlock between two threads with these stacktraces:

  __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  schedule_preempt_disabled+0x11/0x20
  __mutex_lock.constprop.0+0x3cc/0x760
  blk_mq_quiesce_tagset+0x26/0xd0
  nvme_dev_disable_locked+0x77/0x280 [nvme]
  nvme_timeout+0x268/0x320 [nvme]
  blk_mq_handle_expired+0x5d/0x90
  bt_iter+0x7e/0x90
  blk_mq_queue_tag_busy_iter+0x2b2/0x590
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __blk_mq_complete_request_remote+0x10/0x10
  blk_mq_timeout_work+0x15b/0x1a0
  process_one_work+0x133/0x2f0
  ? mod_delayed_work_on+0x90/0x90
  worker_thread+0x2ec/0x400
  ? mod_delayed_work_on+0x90/0x90
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x2d/0x50
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork_asm+0x11/0x20

  __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  blk_mq_freeze_queue_wait+0x62/0x90
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_exit_queue+0x151/0x180
  disk_release+0xe3/0xf0
  device_release+0x31/0x90
  kobject_put+0x6d/0x180
  nvme_scan_ns+0x858/0xc90 [nvme_core]
  ? nvme_scan_work+0x281/0x560 [nvme_core]
  nvme_scan_work+0x281/0x560 [nvme_core]
  process_one_work+0x133/0x2f0
  ? mod_delayed_work_on+0x90/0x90
  worker_thread+0x2ec/0x400
  ? mod_delayed_work_on+0x90/0x90
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x2d/0x50
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork_asm+0x11/0x20

The top stacktrace is showing nvme_timeout() called to handle nvme
command timeout. timeout handler is trying to disable the controller and
as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
to call queue callback handlers. The thread is stuck waiting for
set->tag_list_lock as it tires to walk the queues in set->tag_list.

The lock is held by the second thread in the bottom stack which is
waiting for one of queues to be frozen. The queue usage counter will
drop to zero after nvme_timeout() finishes, and this will not happen
because the thread will wait for this mutex forever.

Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
semaphore for read since this is enough to guarantee no queues will be
added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
semaphore for write while updating set->tag_list and downgrade it to
read while freezing the queues. It should be safe to update set->flags
and hctx->flags while holding the semaphore for read since the queues
are already frozen.

Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 block/blk-mq-sysfs.c   | 10 ++---
 block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
 include/linux/blk-mq.h |  4 +-
 3 files changed, 58 insertions(+), 51 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 58ec293373c6..f474781654fb 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)

 	kobject_uevent(q->mq_kobj, KOBJ_ADD);

-	mutex_lock(&q->tag_set->tag_list_lock);
+	down_read(&q->tag_set->tag_list_rwsem);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
 			goto out_unreg;
 	}
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 	return 0;

 out_unreg:
@@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
 		if (j < i)
 			blk_mq_unregister_hctx(hctx);
 	}
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);

 	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
 	kobject_del(q->mq_kobj);
@@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;

-	mutex_lock(&q->tag_set->tag_list_lock);
+	down_read(&q->tag_set->tag_list_rwsem);
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);

 	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
 	kobject_del(q->mq_kobj);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d626d32f6e57..9211d32ce820 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;

-	mutex_lock(&set->tag_list_lock);
+	down_read(&set->tag_list_rwsem);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_quiesce_queue_nowait(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	up_read(&set->tag_list_rwsem);

 	blk_mq_wait_quiesce_done(set);
 }
@@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;

-	mutex_lock(&set->tag_list_lock);
+	down_read(&set->tag_list_rwsem);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_unquiesce_queue(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	up_read(&set->tag_list_rwsem);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);

@@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }

-static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
-					 bool shared)
-{
-	struct request_queue *q;
-	unsigned int memflags;
-
-	lockdep_assert_held(&set->tag_list_lock);
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		memflags = blk_mq_freeze_queue(q);
-		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q, memflags);
-	}
-}
-
 static void blk_mq_del_queue_tag_set(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *firstq;
+	unsigned int memflags;

-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
 	list_del(&q->tag_set_list);
-	if (list_is_singular(&set->tag_list)) {
-		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
-		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, false);
+	if (!list_is_singular(&set->tag_list)) {
+		up_write(&set->tag_list_rwsem);
+		goto out;
 	}
-	mutex_unlock(&set->tag_list_lock);
+
+	/*
+	 * Transitioning the remaining firstq to unshared.
+	 * Also, downgrade the semaphore to avoid deadlock
+	 * with blk_mq_quiesce_tagset() while waiting for
+	 * firstq to be frozen.
+	 */
+	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
+	downgrade_write(&set->tag_list_rwsem);
+	firstq = list_first_entry(&set->tag_list, struct request_queue,
+				  tag_set_list);
+	memflags = blk_mq_freeze_queue(firstq);
+	queue_set_hctx_shared(firstq, false);
+	blk_mq_unfreeze_queue(firstq, memflags);
+	up_read(&set->tag_list_rwsem);
+out:
 	INIT_LIST_HEAD(&q->tag_set_list);
 }

 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
-	mutex_lock(&set->tag_list_lock);
+	struct request_queue *firstq;
+	unsigned int memflags;

-	/*
-	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
-	 */
-	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
-		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, true);
-	}
-	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		queue_set_hctx_shared(q, true);
-	list_add_tail(&q->tag_set_list, &set->tag_list);
+	down_write(&set->tag_list_rwsem);
+	if (!list_is_singular(&set->tag_list)) {
+		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+			queue_set_hctx_shared(q, true);
+		list_add_tail(&q->tag_set_list, &set->tag_list);
+		up_write(&set->tag_list_rwsem);
+		return;
+	}

-	mutex_unlock(&set->tag_list_lock);
+	/* Transitioning firstq and q to shared. */
+	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
+	list_add_tail(&q->tag_set_list, &set->tag_list);
+	downgrade_write(&set->tag_list_rwsem);
+	queue_set_hctx_shared(q, true);
+	firstq = list_first_entry(&set->tag_list, struct request_queue,
+				  tag_set_list);
+	memflags = blk_mq_freeze_queue(firstq);
+	queue_set_hctx_shared(firstq, true);
+	blk_mq_unfreeze_queue(firstq, memflags);
+	up_read(&set->tag_list_rwsem);
 }

 /* All allocations will be freed in release handler of q->mq_kobj */
@@ -4855,7 +4862,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;

-	mutex_init(&set->tag_list_lock);
+	init_rwsem(&set->tag_list_rwsem);
 	INIT_LIST_HEAD(&set->tag_list);

 	return 0;
@@ -5044,7 +5051,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	struct xarray elv_tbl, et_tbl;
 	bool queues_frozen = false;

-	lockdep_assert_held(&set->tag_list_lock);
+	lockdep_assert_held_write(&set->tag_list_rwsem);

 	if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
@@ -5129,9 +5136,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	down_write(&set->update_nr_hwq_lock);
-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
-	mutex_unlock(&set->tag_list_lock);
+	up_write(&set->tag_list_rwsem);
 	up_write(&set->update_nr_hwq_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46..4c8441671518 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -502,7 +502,7 @@ enum hctx_type {
  * @shared_tags:
  *		   Shared set of tags. Has @nr_hw_queues elements. If set,
  *		   shared by all @tags.
- * @tag_list_lock: Serializes tag_list accesses.
+ * @tag_list_rwsem: Serializes tag_list accesses.
  * @tag_list:	   List of the request queues that use this tag set. See also
  *		   request_queue.tag_set_list.
  * @srcu:	   Use as lock when type of the request queue is blocking
@@ -530,7 +530,7 @@ struct blk_mq_tag_set {

 	struct blk_mq_tags	*shared_tags;

-	struct mutex		tag_list_lock;
+	struct rw_semaphore	tag_list_rwsem;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
 	struct srcu_struct	tags_srcu;
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-17 20:23 ` [PATCH v2 1/1] " Mohamed Khalfella
@ 2025-11-18  1:34   ` Hillf Danton
  2025-11-18  2:07     ` Mohamed Khalfella
  2025-11-18  2:24     ` Waiman Long
  2025-11-18  2:00   ` Ming Lei
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Hillf Danton @ 2025-11-18  1:34 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Ming Lei, Waiman Long, linux-nvme, linux-block,
	linux-kernel

On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> tagset, the functions make sure that tagset and queues are marked as
> shared when two or more queues are attached to the same tagset.
> Initially a tagset starts as unshared and when the number of added
> queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> with all the queues attached to it. When the number of attached queues
> drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> the remaining queues as unshared.
> 
> Both functions need to freeze current queues in tagset before setting on
> unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> hold set->tag_list_lock mutex, which makes sense as we do not want
> queues to be added or deleted in the process. This used to work fine
> until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> made the nvme driver quiesce tagset instead of quiscing individual
> queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> set->tag_list while holding set->tag_list_lock also.
> 
> This results in deadlock between two threads with these stacktraces:
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   schedule_preempt_disabled+0x11/0x20
>   __mutex_lock.constprop.0+0x3cc/0x760
>   blk_mq_quiesce_tagset+0x26/0xd0
>   nvme_dev_disable_locked+0x77/0x280 [nvme]
>   nvme_timeout+0x268/0x320 [nvme]
>   blk_mq_handle_expired+0x5d/0x90
>   bt_iter+0x7e/0x90
>   blk_mq_queue_tag_busy_iter+0x2b2/0x590
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   blk_mq_timeout_work+0x15b/0x1a0
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   blk_mq_freeze_queue_wait+0x62/0x90
>   ? destroy_sched_domains_rcu+0x30/0x30
>   blk_mq_exit_queue+0x151/0x180
>   disk_release+0xe3/0xf0
>   device_release+0x31/0x90
>   kobject_put+0x6d/0x180
>   nvme_scan_ns+0x858/0xc90 [nvme_core]
>   ? nvme_scan_work+0x281/0x560 [nvme_core]
>   nvme_scan_work+0x281/0x560 [nvme_core]
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
> The top stacktrace is showing nvme_timeout() called to handle nvme
> command timeout. timeout handler is trying to disable the controller and
> as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> to call queue callback handlers. The thread is stuck waiting for
> set->tag_list_lock as it tires to walk the queues in set->tag_list.
> 
> The lock is held by the second thread in the bottom stack which is
> waiting for one of queues to be frozen. The queue usage counter will
> drop to zero after nvme_timeout() finishes, and this will not happen
> because the thread will wait for this mutex forever.
> 
> Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> semaphore for read since this is enough to guarantee no queues will be
> added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> semaphore for write while updating set->tag_list and downgrade it to
> read while freezing the queues. It should be safe to update set->flags
> and hctx->flags while holding the semaphore for read since the queues
> are already frozen.
> 
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>  block/blk-mq-sysfs.c   | 10 ++---
>  block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
>  include/linux/blk-mq.h |  4 +-
>  3 files changed, 58 insertions(+), 51 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 58ec293373c6..f474781654fb 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>  
>  	kobject_uevent(q->mq_kobj, KOBJ_ADD);
>  
> -	mutex_lock(&q->tag_set->tag_list_lock);
> +	down_read(&q->tag_set->tag_list_rwsem);
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		ret = blk_mq_register_hctx(hctx);
>  		if (ret)
>  			goto out_unreg;
>  	}
> -	mutex_unlock(&q->tag_set->tag_list_lock);
> +	up_read(&q->tag_set->tag_list_rwsem);
>  	return 0;
>  
>  out_unreg:
> @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>  		if (j < i)
>  			blk_mq_unregister_hctx(hctx);
>  	}
> -	mutex_unlock(&q->tag_set->tag_list_lock);
> +	up_read(&q->tag_set->tag_list_rwsem);
>  
>  	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>  	kobject_del(q->mq_kobj);
> @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
>  	struct blk_mq_hw_ctx *hctx;
>  	unsigned long i;
>  
> -	mutex_lock(&q->tag_set->tag_list_lock);
> +	down_read(&q->tag_set->tag_list_rwsem);
>  	queue_for_each_hw_ctx(q, hctx, i)
>  		blk_mq_unregister_hctx(hctx);
> -	mutex_unlock(&q->tag_set->tag_list_lock);
> +	up_read(&q->tag_set->tag_list_rwsem);
>  
>  	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>  	kobject_del(q->mq_kobj);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d626d32f6e57..9211d32ce820 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>  {
>  	struct request_queue *q;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_read(&set->tag_list_rwsem);
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		if (!blk_queue_skip_tagset_quiesce(q))
>  			blk_mq_quiesce_queue_nowait(q);
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +	up_read(&set->tag_list_rwsem);
>  
>  	blk_mq_wait_quiesce_done(set);
>  }
> @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>  {
>  	struct request_queue *q;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_read(&set->tag_list_rwsem);
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		if (!blk_queue_skip_tagset_quiesce(q))
>  			blk_mq_unquiesce_queue(q);
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +	up_read(&set->tag_list_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
>  
> @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>  	}
>  }
>  
> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> -					 bool shared)
> -{
> -	struct request_queue *q;
> -	unsigned int memflags;
> -
> -	lockdep_assert_held(&set->tag_list_lock);
> -
> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> -		memflags = blk_mq_freeze_queue(q);
> -		queue_set_hctx_shared(q, shared);
> -		blk_mq_unfreeze_queue(q, memflags);
> -	}
> -}
> -
>  static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *firstq;
> +	unsigned int memflags;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_write(&set->tag_list_rwsem);
>  	list_del(&q->tag_set_list);
> -	if (list_is_singular(&set->tag_list)) {
> -		/* just transitioned to unshared */
> -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, false);
> +	if (!list_is_singular(&set->tag_list)) {
> +		up_write(&set->tag_list_rwsem);
> +		goto out;
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +
> +	/*
> +	 * Transitioning the remaining firstq to unshared.
> +	 * Also, downgrade the semaphore to avoid deadlock
> +	 * with blk_mq_quiesce_tagset() while waiting for
> +	 * firstq to be frozen.
> +	 */
> +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> +	downgrade_write(&set->tag_list_rwsem);

If the first lock waiter is for write, it could ruin your downgrade trick.

> +	firstq = list_first_entry(&set->tag_list, struct request_queue,
> +				  tag_set_list);
> +	memflags = blk_mq_freeze_queue(firstq);
> +	queue_set_hctx_shared(firstq, false);
> +	blk_mq_unfreeze_queue(firstq, memflags);
> +	up_read(&set->tag_list_rwsem);
> +out:
>  	INIT_LIST_HEAD(&q->tag_set_list);
>  }
>  
>  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>  				     struct request_queue *q)
>  {
> -	mutex_lock(&set->tag_list_lock);
> +	struct request_queue *firstq;
> +	unsigned int memflags;
>  
> -	/*
> -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> -	 */
> -	if (!list_empty(&set->tag_list) &&
> -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, true);
> -	}
> -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> -		queue_set_hctx_shared(q, true);
> -	list_add_tail(&q->tag_set_list, &set->tag_list);
> +	down_write(&set->tag_list_rwsem);
> +	if (!list_is_singular(&set->tag_list)) {
> +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> +			queue_set_hctx_shared(q, true);
> +		list_add_tail(&q->tag_set_list, &set->tag_list);
> +		up_write(&set->tag_list_rwsem);
> +		return;
> +	}
>  
> -	mutex_unlock(&set->tag_list_lock);
> +	/* Transitioning firstq and q to shared. */
> +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> +	list_add_tail(&q->tag_set_list, &set->tag_list);
> +	downgrade_write(&set->tag_list_rwsem);
> +	queue_set_hctx_shared(q, true);
> +	firstq = list_first_entry(&set->tag_list, struct request_queue,
> +				  tag_set_list);
> +	memflags = blk_mq_freeze_queue(firstq);
> +	queue_set_hctx_shared(firstq, true);
> +	blk_mq_unfreeze_queue(firstq, memflags);
> +	up_read(&set->tag_list_rwsem);
>  }
>  
>  /* All allocations will be freed in release handler of q->mq_kobj */
> @@ -4855,7 +4862,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>  	if (ret)
>  		goto out_free_mq_map;
>  
> -	mutex_init(&set->tag_list_lock);
> +	init_rwsem(&set->tag_list_rwsem);
>  	INIT_LIST_HEAD(&set->tag_list);
>  
>  	return 0;
> @@ -5044,7 +5051,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  	struct xarray elv_tbl, et_tbl;
>  	bool queues_frozen = false;
>  
> -	lockdep_assert_held(&set->tag_list_lock);
> +	lockdep_assert_held_write(&set->tag_list_rwsem);
>  
>  	if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
>  		nr_hw_queues = nr_cpu_ids;
> @@ -5129,9 +5136,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>  {
>  	down_write(&set->update_nr_hwq_lock);
> -	mutex_lock(&set->tag_list_lock);
> +	down_write(&set->tag_list_rwsem);
>  	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
> -	mutex_unlock(&set->tag_list_lock);
> +	up_write(&set->tag_list_rwsem);
>  	up_write(&set->update_nr_hwq_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index b25d12545f46..4c8441671518 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -502,7 +502,7 @@ enum hctx_type {
>   * @shared_tags:
>   *		   Shared set of tags. Has @nr_hw_queues elements. If set,
>   *		   shared by all @tags.
> - * @tag_list_lock: Serializes tag_list accesses.
> + * @tag_list_rwsem: Serializes tag_list accesses.
>   * @tag_list:	   List of the request queues that use this tag set. See also
>   *		   request_queue.tag_set_list.
>   * @srcu:	   Use as lock when type of the request queue is blocking
> @@ -530,7 +530,7 @@ struct blk_mq_tag_set {
>  
>  	struct blk_mq_tags	*shared_tags;
>  
> -	struct mutex		tag_list_lock;
> +	struct rw_semaphore	tag_list_rwsem;
>  	struct list_head	tag_list;
>  	struct srcu_struct	*srcu;
>  	struct srcu_struct	tags_srcu;
> -- 
> 2.51.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-17 20:23 ` [PATCH v2 1/1] " Mohamed Khalfella
  2025-11-18  1:34   ` Hillf Danton
@ 2025-11-18  2:00   ` Ming Lei
  2025-11-18  2:15     ` Mohamed Khalfella
  2025-11-24  4:00   ` Ming Lei
  2025-11-30 22:56   ` Sagi Grimberg
  3 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2025-11-18  2:00 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> tagset, the functions make sure that tagset and queues are marked as
> shared when two or more queues are attached to the same tagset.
> Initially a tagset starts as unshared and when the number of added
> queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> with all the queues attached to it. When the number of attached queues
> drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> the remaining queues as unshared.
> 
> Both functions need to freeze current queues in tagset before setting on
> unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> hold set->tag_list_lock mutex, which makes sense as we do not want
> queues to be added or deleted in the process. This used to work fine
> until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> made the nvme driver quiesce tagset instead of quiscing individual
> queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> set->tag_list while holding set->tag_list_lock also.
> 
> This results in deadlock between two threads with these stacktraces:
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   schedule_preempt_disabled+0x11/0x20
>   __mutex_lock.constprop.0+0x3cc/0x760
>   blk_mq_quiesce_tagset+0x26/0xd0
>   nvme_dev_disable_locked+0x77/0x280 [nvme]
>   nvme_timeout+0x268/0x320 [nvme]
>   blk_mq_handle_expired+0x5d/0x90
>   bt_iter+0x7e/0x90
>   blk_mq_queue_tag_busy_iter+0x2b2/0x590
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   blk_mq_timeout_work+0x15b/0x1a0
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   blk_mq_freeze_queue_wait+0x62/0x90
>   ? destroy_sched_domains_rcu+0x30/0x30
>   blk_mq_exit_queue+0x151/0x180
>   disk_release+0xe3/0xf0
>   device_release+0x31/0x90
>   kobject_put+0x6d/0x180
>   nvme_scan_ns+0x858/0xc90 [nvme_core]
>   ? nvme_scan_work+0x281/0x560 [nvme_core]
>   nvme_scan_work+0x281/0x560 [nvme_core]
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
> The top stacktrace is showing nvme_timeout() called to handle nvme
> command timeout. timeout handler is trying to disable the controller and
> as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> to call queue callback handlers. The thread is stuck waiting for
> set->tag_list_lock as it tires to walk the queues in set->tag_list.
> 
> The lock is held by the second thread in the bottom stack which is
> waiting for one of queues to be frozen. The queue usage counter will
> drop to zero after nvme_timeout() finishes, and this will not happen
> because the thread will wait for this mutex forever.
> 
> Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> semaphore for read since this is enough to guarantee no queues will be
> added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> semaphore for write while updating set->tag_list and downgrade it to
> read while freezing the queues. It should be safe to update set->flags
> and hctx->flags while holding the semaphore for read since the queues
> are already frozen.
> 
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>  block/blk-mq-sysfs.c   | 10 ++---
>  block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
>  include/linux/blk-mq.h |  4 +-
>  3 files changed, 58 insertions(+), 51 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 58ec293373c6..f474781654fb 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>  
>  	kobject_uevent(q->mq_kobj, KOBJ_ADD);
>  
> -	mutex_lock(&q->tag_set->tag_list_lock);
> +	down_read(&q->tag_set->tag_list_rwsem);
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		ret = blk_mq_register_hctx(hctx);
>  		if (ret)
>  			goto out_unreg;
>  	}
> -	mutex_unlock(&q->tag_set->tag_list_lock);
> +	up_read(&q->tag_set->tag_list_rwsem);
>  	return 0;
>  
>  out_unreg:
> @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>  		if (j < i)
>  			blk_mq_unregister_hctx(hctx);
>  	}
> -	mutex_unlock(&q->tag_set->tag_list_lock);
> +	up_read(&q->tag_set->tag_list_rwsem);
>  
>  	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>  	kobject_del(q->mq_kobj);
> @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
>  	struct blk_mq_hw_ctx *hctx;
>  	unsigned long i;
>  
> -	mutex_lock(&q->tag_set->tag_list_lock);
> +	down_read(&q->tag_set->tag_list_rwsem);
>  	queue_for_each_hw_ctx(q, hctx, i)
>  		blk_mq_unregister_hctx(hctx);
> -	mutex_unlock(&q->tag_set->tag_list_lock);
> +	up_read(&q->tag_set->tag_list_rwsem);
>  
>  	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>  	kobject_del(q->mq_kobj);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d626d32f6e57..9211d32ce820 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>  {
>  	struct request_queue *q;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_read(&set->tag_list_rwsem);
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		if (!blk_queue_skip_tagset_quiesce(q))
>  			blk_mq_quiesce_queue_nowait(q);
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +	up_read(&set->tag_list_rwsem);
>  
>  	blk_mq_wait_quiesce_done(set);
>  }
> @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>  {
>  	struct request_queue *q;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_read(&set->tag_list_rwsem);
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		if (!blk_queue_skip_tagset_quiesce(q))
>  			blk_mq_unquiesce_queue(q);
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +	up_read(&set->tag_list_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
>  
> @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>  	}
>  }
>  
> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> -					 bool shared)
> -{
> -	struct request_queue *q;
> -	unsigned int memflags;
> -
> -	lockdep_assert_held(&set->tag_list_lock);
> -
> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> -		memflags = blk_mq_freeze_queue(q);
> -		queue_set_hctx_shared(q, shared);
> -		blk_mq_unfreeze_queue(q, memflags);
> -	}
> -}
> -
>  static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *firstq;
> +	unsigned int memflags;
>  
> -	mutex_lock(&set->tag_list_lock);
> +	down_write(&set->tag_list_rwsem);
>  	list_del(&q->tag_set_list);
> -	if (list_is_singular(&set->tag_list)) {
> -		/* just transitioned to unshared */
> -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, false);
> +	if (!list_is_singular(&set->tag_list)) {
> +		up_write(&set->tag_list_rwsem);
> +		goto out;
>  	}
> -	mutex_unlock(&set->tag_list_lock);
> +
> +	/*
> +	 * Transitioning the remaining firstq to unshared.
> +	 * Also, downgrade the semaphore to avoid deadlock
> +	 * with blk_mq_quiesce_tagset() while waiting for
> +	 * firstq to be frozen.
> +	 */
> +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> +	downgrade_write(&set->tag_list_rwsem);
> +	firstq = list_first_entry(&set->tag_list, struct request_queue,
> +				  tag_set_list);
> +	memflags = blk_mq_freeze_queue(firstq);
> +	queue_set_hctx_shared(firstq, false);
> +	blk_mq_unfreeze_queue(firstq, memflags);
> +	up_read(&set->tag_list_rwsem);
> +out:
>  	INIT_LIST_HEAD(&q->tag_set_list);
>  }
>  
>  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>  				     struct request_queue *q)
>  {
> -	mutex_lock(&set->tag_list_lock);
> +	struct request_queue *firstq;
> +	unsigned int memflags;
>  
> -	/*
> -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> -	 */
> -	if (!list_empty(&set->tag_list) &&
> -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> -		/* update existing queue */
> -		blk_mq_update_tag_set_shared(set, true);
> -	}
> -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> -		queue_set_hctx_shared(q, true);
> -	list_add_tail(&q->tag_set_list, &set->tag_list);
> +	down_write(&set->tag_list_rwsem);
> +	if (!list_is_singular(&set->tag_list)) {
> +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> +			queue_set_hctx_shared(q, true);
> +		list_add_tail(&q->tag_set_list, &set->tag_list);
> +		up_write(&set->tag_list_rwsem);
> +		return;
> +	}
>  
> -	mutex_unlock(&set->tag_list_lock);
> +	/* Transitioning firstq and q to shared. */
> +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> +	list_add_tail(&q->tag_set_list, &set->tag_list);
> +	downgrade_write(&set->tag_list_rwsem);
> +	queue_set_hctx_shared(q, true);

queue_set_hctx_shared(q, true) should be moved into write critical area
because this queue has been added to the list.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  1:34   ` Hillf Danton
@ 2025-11-18  2:07     ` Mohamed Khalfella
  2025-11-18  2:24     ` Waiman Long
  1 sibling, 0 replies; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-18  2:07 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jens Axboe, Ming Lei, Waiman Long, linux-nvme, linux-block,
	linux-kernel

On Tue 2025-11-18 09:34:41 +0800, Hillf Danton wrote:
> On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> >  static void blk_mq_del_queue_tag_set(struct request_queue *q)
> >  {
> >  	struct blk_mq_tag_set *set = q->tag_set;
> > +	struct request_queue *firstq;
> > +	unsigned int memflags;
> >  
> > -	mutex_lock(&set->tag_list_lock);
> > +	down_write(&set->tag_list_rwsem);
> >  	list_del(&q->tag_set_list);
> > -	if (list_is_singular(&set->tag_list)) {
> > -		/* just transitioned to unshared */
> > -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > -		/* update existing queue */
> > -		blk_mq_update_tag_set_shared(set, false);
> > +	if (!list_is_singular(&set->tag_list)) {
> > +		up_write(&set->tag_list_rwsem);
> > +		goto out;
> >  	}
> > -	mutex_unlock(&set->tag_list_lock);
> > +
> > +	/*
> > +	 * Transitioning the remaining firstq to unshared.
> > +	 * Also, downgrade the semaphore to avoid deadlock
> > +	 * with blk_mq_quiesce_tagset() while waiting for
> > +	 * firstq to be frozen.
> > +	 */
> > +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > +	downgrade_write(&set->tag_list_rwsem);
> 
> If the first lock waiter is for write, it could ruin your downgrade trick.

How is that possible? If the first waiter or the only waiter is for
write then they should not take the semaphore because it has not been
fully released yet, right?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  2:00   ` Ming Lei
@ 2025-11-18  2:15     ` Mohamed Khalfella
  2025-11-18  2:30       ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-18  2:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Tue 2025-11-18 10:00:19 +0800, Ming Lei wrote:
> On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> >  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> >  				     struct request_queue *q)
> >  {
> > -	mutex_lock(&set->tag_list_lock);
> > +	struct request_queue *firstq;
> > +	unsigned int memflags;
> >  
> > -	/*
> > -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> > -	 */
> > -	if (!list_empty(&set->tag_list) &&
> > -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> > -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > -		/* update existing queue */
> > -		blk_mq_update_tag_set_shared(set, true);
> > -	}
> > -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > -		queue_set_hctx_shared(q, true);
> > -	list_add_tail(&q->tag_set_list, &set->tag_list);
> > +	down_write(&set->tag_list_rwsem);
> > +	if (!list_is_singular(&set->tag_list)) {
> > +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > +			queue_set_hctx_shared(q, true);
> > +		list_add_tail(&q->tag_set_list, &set->tag_list);
> > +		up_write(&set->tag_list_rwsem);
> > +		return;
> > +	}
> >  
> > -	mutex_unlock(&set->tag_list_lock);
> > +	/* Transitioning firstq and q to shared. */
> > +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > +	list_add_tail(&q->tag_set_list, &set->tag_list);
> > +	downgrade_write(&set->tag_list_rwsem);
> > +	queue_set_hctx_shared(q, true);
> 
> queue_set_hctx_shared(q, true) should be moved into write critical area
> because this queue has been added to the list.
> 

I failed to see why that is the case. What can go wrong by running
queue_set_hctx_shared(q, true) after downgrade_write()?

After the semaphore is downgraded we promise not to change the list
set->tag_list because now we have read-only access. Marking the "q" as
shared should be fine because it is new and we know there will be no
users of the queue yet (that is why we skipped freezing it).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  1:34   ` Hillf Danton
  2025-11-18  2:07     ` Mohamed Khalfella
@ 2025-11-18  2:24     ` Waiman Long
  2025-11-18  3:08       ` Ming Lei
  1 sibling, 1 reply; 21+ messages in thread
From: Waiman Long @ 2025-11-18  2:24 UTC (permalink / raw)
  To: Hillf Danton, Mohamed Khalfella
  Cc: Jens Axboe, Ming Lei, Waiman Long, linux-nvme, linux-block,
	linux-kernel

On 11/17/25 8:34 PM, Hillf Danton wrote:
> On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
>> blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
>> tagset, the functions make sure that tagset and queues are marked as
>> shared when two or more queues are attached to the same tagset.
>> Initially a tagset starts as unshared and when the number of added
>> queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
>> with all the queues attached to it. When the number of attached queues
>> drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
>> the remaining queues as unshared.
>>
>> Both functions need to freeze current queues in tagset before setting on
>> unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
>> hold set->tag_list_lock mutex, which makes sense as we do not want
>> queues to be added or deleted in the process. This used to work fine
>> until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
>> made the nvme driver quiesce tagset instead of quiscing individual
>> queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
>> set->tag_list while holding set->tag_list_lock also.
>>
>> This results in deadlock between two threads with these stacktraces:
>>
>>    __schedule+0x48e/0xed0
>>    schedule+0x5a/0xc0
>>    schedule_preempt_disabled+0x11/0x20
>>    __mutex_lock.constprop.0+0x3cc/0x760
>>    blk_mq_quiesce_tagset+0x26/0xd0
>>    nvme_dev_disable_locked+0x77/0x280 [nvme]
>>    nvme_timeout+0x268/0x320 [nvme]
>>    blk_mq_handle_expired+0x5d/0x90
>>    bt_iter+0x7e/0x90
>>    blk_mq_queue_tag_busy_iter+0x2b2/0x590
>>    ? __blk_mq_complete_request_remote+0x10/0x10
>>    ? __blk_mq_complete_request_remote+0x10/0x10
>>    blk_mq_timeout_work+0x15b/0x1a0
>>    process_one_work+0x133/0x2f0
>>    ? mod_delayed_work_on+0x90/0x90
>>    worker_thread+0x2ec/0x400
>>    ? mod_delayed_work_on+0x90/0x90
>>    kthread+0xe2/0x110
>>    ? kthread_complete_and_exit+0x20/0x20
>>    ret_from_fork+0x2d/0x50
>>    ? kthread_complete_and_exit+0x20/0x20
>>    ret_from_fork_asm+0x11/0x20
>>
>>    __schedule+0x48e/0xed0
>>    schedule+0x5a/0xc0
>>    blk_mq_freeze_queue_wait+0x62/0x90
>>    ? destroy_sched_domains_rcu+0x30/0x30
>>    blk_mq_exit_queue+0x151/0x180
>>    disk_release+0xe3/0xf0
>>    device_release+0x31/0x90
>>    kobject_put+0x6d/0x180
>>    nvme_scan_ns+0x858/0xc90 [nvme_core]
>>    ? nvme_scan_work+0x281/0x560 [nvme_core]
>>    nvme_scan_work+0x281/0x560 [nvme_core]
>>    process_one_work+0x133/0x2f0
>>    ? mod_delayed_work_on+0x90/0x90
>>    worker_thread+0x2ec/0x400
>>    ? mod_delayed_work_on+0x90/0x90
>>    kthread+0xe2/0x110
>>    ? kthread_complete_and_exit+0x20/0x20
>>    ret_from_fork+0x2d/0x50
>>    ? kthread_complete_and_exit+0x20/0x20
>>    ret_from_fork_asm+0x11/0x20
>>
>> The top stacktrace is showing nvme_timeout() called to handle nvme
>> command timeout. timeout handler is trying to disable the controller and
>> as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
>> to call queue callback handlers. The thread is stuck waiting for
>> set->tag_list_lock as it tires to walk the queues in set->tag_list.
>>
>> The lock is held by the second thread in the bottom stack which is
>> waiting for one of queues to be frozen. The queue usage counter will
>> drop to zero after nvme_timeout() finishes, and this will not happen
>> because the thread will wait for this mutex forever.
>>
>> Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
>> avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
>> semaphore for read since this is enough to guarantee no queues will be
>> added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
>> semaphore for write while updating set->tag_list and downgrade it to
>> read while freezing the queues. It should be safe to update set->flags
>> and hctx->flags while holding the semaphore for read since the queues
>> are already frozen.
>>
>> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>> ---
>>   block/blk-mq-sysfs.c   | 10 ++---
>>   block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
>>   include/linux/blk-mq.h |  4 +-
>>   3 files changed, 58 insertions(+), 51 deletions(-)
>>
>> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
>> index 58ec293373c6..f474781654fb 100644
>> --- a/block/blk-mq-sysfs.c
>> +++ b/block/blk-mq-sysfs.c
>> @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>>   
>>   	kobject_uevent(q->mq_kobj, KOBJ_ADD);
>>   
>> -	mutex_lock(&q->tag_set->tag_list_lock);
>> +	down_read(&q->tag_set->tag_list_rwsem);
>>   	queue_for_each_hw_ctx(q, hctx, i) {
>>   		ret = blk_mq_register_hctx(hctx);
>>   		if (ret)
>>   			goto out_unreg;
>>   	}
>> -	mutex_unlock(&q->tag_set->tag_list_lock);
>> +	up_read(&q->tag_set->tag_list_rwsem);
>>   	return 0;
>>   
>>   out_unreg:
>> @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>>   		if (j < i)
>>   			blk_mq_unregister_hctx(hctx);
>>   	}
>> -	mutex_unlock(&q->tag_set->tag_list_lock);
>> +	up_read(&q->tag_set->tag_list_rwsem);
>>   
>>   	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>>   	kobject_del(q->mq_kobj);
>> @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
>>   	struct blk_mq_hw_ctx *hctx;
>>   	unsigned long i;
>>   
>> -	mutex_lock(&q->tag_set->tag_list_lock);
>> +	down_read(&q->tag_set->tag_list_rwsem);
>>   	queue_for_each_hw_ctx(q, hctx, i)
>>   		blk_mq_unregister_hctx(hctx);
>> -	mutex_unlock(&q->tag_set->tag_list_lock);
>> +	up_read(&q->tag_set->tag_list_rwsem);
>>   
>>   	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>>   	kobject_del(q->mq_kobj);
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index d626d32f6e57..9211d32ce820 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>>   {
>>   	struct request_queue *q;
>>   
>> -	mutex_lock(&set->tag_list_lock);
>> +	down_read(&set->tag_list_rwsem);
>>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>   		if (!blk_queue_skip_tagset_quiesce(q))
>>   			blk_mq_quiesce_queue_nowait(q);
>>   	}
>> -	mutex_unlock(&set->tag_list_lock);
>> +	up_read(&set->tag_list_rwsem);
>>   
>>   	blk_mq_wait_quiesce_done(set);
>>   }
>> @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>>   {
>>   	struct request_queue *q;
>>   
>> -	mutex_lock(&set->tag_list_lock);
>> +	down_read(&set->tag_list_rwsem);
>>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>   		if (!blk_queue_skip_tagset_quiesce(q))
>>   			blk_mq_unquiesce_queue(q);
>>   	}
>> -	mutex_unlock(&set->tag_list_lock);
>> +	up_read(&set->tag_list_rwsem);
>>   }
>>   EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
>>   
>> @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>>   	}
>>   }
>>   
>> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
>> -					 bool shared)
>> -{
>> -	struct request_queue *q;
>> -	unsigned int memflags;
>> -
>> -	lockdep_assert_held(&set->tag_list_lock);
>> -
>> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> -		memflags = blk_mq_freeze_queue(q);
>> -		queue_set_hctx_shared(q, shared);
>> -		blk_mq_unfreeze_queue(q, memflags);
>> -	}
>> -}
>> -
>>   static void blk_mq_del_queue_tag_set(struct request_queue *q)
>>   {
>>   	struct blk_mq_tag_set *set = q->tag_set;
>> +	struct request_queue *firstq;
>> +	unsigned int memflags;
>>   
>> -	mutex_lock(&set->tag_list_lock);
>> +	down_write(&set->tag_list_rwsem);
>>   	list_del(&q->tag_set_list);
>> -	if (list_is_singular(&set->tag_list)) {
>> -		/* just transitioned to unshared */
>> -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
>> -		/* update existing queue */
>> -		blk_mq_update_tag_set_shared(set, false);
>> +	if (!list_is_singular(&set->tag_list)) {
>> +		up_write(&set->tag_list_rwsem);
>> +		goto out;
>>   	}
>> -	mutex_unlock(&set->tag_list_lock);
>> +
>> +	/*
>> +	 * Transitioning the remaining firstq to unshared.
>> +	 * Also, downgrade the semaphore to avoid deadlock
>> +	 * with blk_mq_quiesce_tagset() while waiting for
>> +	 * firstq to be frozen.
>> +	 */
>> +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
>> +	downgrade_write(&set->tag_list_rwsem);
> If the first lock waiter is for write, it could ruin your downgrade trick.

That is true. The downgrade will wake up all the waiting readers at the 
front of the wait queue, but if there is one or more writers in the mix. 
The wakeup will stop when the first writer is hit and all the readers 
after that will not be woken up.

We can theoretically provide a downgrade variant that wakes up all the 
readers if it is a useful feature.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  2:15     ` Mohamed Khalfella
@ 2025-11-18  2:30       ` Ming Lei
  2025-11-18  3:44         ` Mohamed Khalfella
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2025-11-18  2:30 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Mon, Nov 17, 2025 at 06:15:04PM -0800, Mohamed Khalfella wrote:
> On Tue 2025-11-18 10:00:19 +0800, Ming Lei wrote:
> > On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> > >  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> > >  				     struct request_queue *q)
> > >  {
> > > -	mutex_lock(&set->tag_list_lock);
> > > +	struct request_queue *firstq;
> > > +	unsigned int memflags;
> > >  
> > > -	/*
> > > -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> > > -	 */
> > > -	if (!list_empty(&set->tag_list) &&
> > > -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> > > -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > > -		/* update existing queue */
> > > -		blk_mq_update_tag_set_shared(set, true);
> > > -	}
> > > -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > > -		queue_set_hctx_shared(q, true);
> > > -	list_add_tail(&q->tag_set_list, &set->tag_list);
> > > +	down_write(&set->tag_list_rwsem);
> > > +	if (!list_is_singular(&set->tag_list)) {
> > > +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > > +			queue_set_hctx_shared(q, true);
> > > +		list_add_tail(&q->tag_set_list, &set->tag_list);
> > > +		up_write(&set->tag_list_rwsem);
> > > +		return;
> > > +	}
> > >  
> > > -	mutex_unlock(&set->tag_list_lock);
> > > +	/* Transitioning firstq and q to shared. */
> > > +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > > +	list_add_tail(&q->tag_set_list, &set->tag_list);
> > > +	downgrade_write(&set->tag_list_rwsem);
> > > +	queue_set_hctx_shared(q, true);
> > 
> > queue_set_hctx_shared(q, true) should be moved into write critical area
> > because this queue has been added to the list.
> > 
> 
> I failed to see why that is the case. What can go wrong by running
> queue_set_hctx_shared(q, true) after downgrade_write()?
> 
> After the semaphore is downgraded we promise not to change the list
> set->tag_list because now we have read-only access. Marking the "q" as
> shared should be fine because it is new and we know there will be no
> users of the queue yet (that is why we skipped freezing it).
 
I think it is read/write lock's use practice. The protected data shouldn't be
written any more when you downgrade to read lock.

In this case, it may not make a difference, because it is one new queue and
the other readers don't use the `shared` flag, but still better to do
correct things from beginning and make code less fragile.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  2:24     ` Waiman Long
@ 2025-11-18  3:08       ` Ming Lei
  2025-11-18  3:42         ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2025-11-18  3:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, Mohamed Khalfella, Jens Axboe, linux-nvme,
	linux-block, linux-kernel

On Mon, Nov 17, 2025 at 09:24:21PM -0500, Waiman Long wrote:
> On 11/17/25 8:34 PM, Hillf Danton wrote:
> > On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> > > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > > tagset, the functions make sure that tagset and queues are marked as
> > > shared when two or more queues are attached to the same tagset.
> > > Initially a tagset starts as unshared and when the number of added
> > > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > > with all the queues attached to it. When the number of attached queues
> > > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > > the remaining queues as unshared.
> > > 
> > > Both functions need to freeze current queues in tagset before setting on
> > > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > > hold set->tag_list_lock mutex, which makes sense as we do not want
> > > queues to be added or deleted in the process. This used to work fine
> > > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > made the nvme driver quiesce tagset instead of quiscing individual
> > > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > > set->tag_list while holding set->tag_list_lock also.
> > > 
> > > This results in deadlock between two threads with these stacktraces:
> > > 
> > >    __schedule+0x48e/0xed0
> > >    schedule+0x5a/0xc0
> > >    schedule_preempt_disabled+0x11/0x20
> > >    __mutex_lock.constprop.0+0x3cc/0x760
> > >    blk_mq_quiesce_tagset+0x26/0xd0
> > >    nvme_dev_disable_locked+0x77/0x280 [nvme]
> > >    nvme_timeout+0x268/0x320 [nvme]
> > >    blk_mq_handle_expired+0x5d/0x90
> > >    bt_iter+0x7e/0x90
> > >    blk_mq_queue_tag_busy_iter+0x2b2/0x590
> > >    ? __blk_mq_complete_request_remote+0x10/0x10
> > >    ? __blk_mq_complete_request_remote+0x10/0x10
> > >    blk_mq_timeout_work+0x15b/0x1a0
> > >    process_one_work+0x133/0x2f0
> > >    ? mod_delayed_work_on+0x90/0x90
> > >    worker_thread+0x2ec/0x400
> > >    ? mod_delayed_work_on+0x90/0x90
> > >    kthread+0xe2/0x110
> > >    ? kthread_complete_and_exit+0x20/0x20
> > >    ret_from_fork+0x2d/0x50
> > >    ? kthread_complete_and_exit+0x20/0x20
> > >    ret_from_fork_asm+0x11/0x20
> > > 
> > >    __schedule+0x48e/0xed0
> > >    schedule+0x5a/0xc0
> > >    blk_mq_freeze_queue_wait+0x62/0x90
> > >    ? destroy_sched_domains_rcu+0x30/0x30
> > >    blk_mq_exit_queue+0x151/0x180
> > >    disk_release+0xe3/0xf0
> > >    device_release+0x31/0x90
> > >    kobject_put+0x6d/0x180
> > >    nvme_scan_ns+0x858/0xc90 [nvme_core]
> > >    ? nvme_scan_work+0x281/0x560 [nvme_core]
> > >    nvme_scan_work+0x281/0x560 [nvme_core]
> > >    process_one_work+0x133/0x2f0
> > >    ? mod_delayed_work_on+0x90/0x90
> > >    worker_thread+0x2ec/0x400
> > >    ? mod_delayed_work_on+0x90/0x90
> > >    kthread+0xe2/0x110
> > >    ? kthread_complete_and_exit+0x20/0x20
> > >    ret_from_fork+0x2d/0x50
> > >    ? kthread_complete_and_exit+0x20/0x20
> > >    ret_from_fork_asm+0x11/0x20
> > > 
> > > The top stacktrace is showing nvme_timeout() called to handle nvme
> > > command timeout. timeout handler is trying to disable the controller and
> > > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > > to call queue callback handlers. The thread is stuck waiting for
> > > set->tag_list_lock as it tires to walk the queues in set->tag_list.
> > > 
> > > The lock is held by the second thread in the bottom stack which is
> > > waiting for one of queues to be frozen. The queue usage counter will
> > > drop to zero after nvme_timeout() finishes, and this will not happen
> > > because the thread will wait for this mutex forever.
> > > 
> > > Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> > > avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> > > semaphore for read since this is enough to guarantee no queues will be
> > > added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> > > semaphore for write while updating set->tag_list and downgrade it to
> > > read while freezing the queues. It should be safe to update set->flags
> > > and hctx->flags while holding the semaphore for read since the queues
> > > are already frozen.
> > > 
> > > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > > ---
> > >   block/blk-mq-sysfs.c   | 10 ++---
> > >   block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
> > >   include/linux/blk-mq.h |  4 +-
> > >   3 files changed, 58 insertions(+), 51 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > > index 58ec293373c6..f474781654fb 100644
> > > --- a/block/blk-mq-sysfs.c
> > > +++ b/block/blk-mq-sysfs.c
> > > @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > >   	kobject_uevent(q->mq_kobj, KOBJ_ADD);
> > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > +	down_read(&q->tag_set->tag_list_rwsem);
> > >   	queue_for_each_hw_ctx(q, hctx, i) {
> > >   		ret = blk_mq_register_hctx(hctx);
> > >   		if (ret)
> > >   			goto out_unreg;
> > >   	}
> > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > +	up_read(&q->tag_set->tag_list_rwsem);
> > >   	return 0;
> > >   out_unreg:
> > > @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > >   		if (j < i)
> > >   			blk_mq_unregister_hctx(hctx);
> > >   	}
> > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > +	up_read(&q->tag_set->tag_list_rwsem);
> > >   	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > >   	kobject_del(q->mq_kobj);
> > > @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
> > >   	struct blk_mq_hw_ctx *hctx;
> > >   	unsigned long i;
> > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > +	down_read(&q->tag_set->tag_list_rwsem);
> > >   	queue_for_each_hw_ctx(q, hctx, i)
> > >   		blk_mq_unregister_hctx(hctx);
> > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > +	up_read(&q->tag_set->tag_list_rwsem);
> > >   	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > >   	kobject_del(q->mq_kobj);
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index d626d32f6e57..9211d32ce820 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> > >   {
> > >   	struct request_queue *q;
> > > -	mutex_lock(&set->tag_list_lock);
> > > +	down_read(&set->tag_list_rwsem);
> > >   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > >   		if (!blk_queue_skip_tagset_quiesce(q))
> > >   			blk_mq_quiesce_queue_nowait(q);
> > >   	}
> > > -	mutex_unlock(&set->tag_list_lock);
> > > +	up_read(&set->tag_list_rwsem);
> > >   	blk_mq_wait_quiesce_done(set);
> > >   }
> > > @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> > >   {
> > >   	struct request_queue *q;
> > > -	mutex_lock(&set->tag_list_lock);
> > > +	down_read(&set->tag_list_rwsem);
> > >   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > >   		if (!blk_queue_skip_tagset_quiesce(q))
> > >   			blk_mq_unquiesce_queue(q);
> > >   	}
> > > -	mutex_unlock(&set->tag_list_lock);
> > > +	up_read(&set->tag_list_rwsem);
> > >   }
> > >   EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> > > @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> > >   	}
> > >   }
> > > -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> > > -					 bool shared)
> > > -{
> > > -	struct request_queue *q;
> > > -	unsigned int memflags;
> > > -
> > > -	lockdep_assert_held(&set->tag_list_lock);
> > > -
> > > -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > -		memflags = blk_mq_freeze_queue(q);
> > > -		queue_set_hctx_shared(q, shared);
> > > -		blk_mq_unfreeze_queue(q, memflags);
> > > -	}
> > > -}
> > > -
> > >   static void blk_mq_del_queue_tag_set(struct request_queue *q)
> > >   {
> > >   	struct blk_mq_tag_set *set = q->tag_set;
> > > +	struct request_queue *firstq;
> > > +	unsigned int memflags;
> > > -	mutex_lock(&set->tag_list_lock);
> > > +	down_write(&set->tag_list_rwsem);
> > >   	list_del(&q->tag_set_list);
> > > -	if (list_is_singular(&set->tag_list)) {
> > > -		/* just transitioned to unshared */
> > > -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > -		/* update existing queue */
> > > -		blk_mq_update_tag_set_shared(set, false);
> > > +	if (!list_is_singular(&set->tag_list)) {
> > > +		up_write(&set->tag_list_rwsem);
> > > +		goto out;
> > >   	}
> > > -	mutex_unlock(&set->tag_list_lock);
> > > +
> > > +	/*
> > > +	 * Transitioning the remaining firstq to unshared.
> > > +	 * Also, downgrade the semaphore to avoid deadlock
> > > +	 * with blk_mq_quiesce_tagset() while waiting for
> > > +	 * firstq to be frozen.
> > > +	 */
> > > +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > +	downgrade_write(&set->tag_list_rwsem);
> > If the first lock waiter is for write, it could ruin your downgrade trick.

If the 1st waiter is for WEITE, rwsem_mark_wake() simply returns and grants
read lock to this caller, meantime wakes up nothing.

That is exactly what this use case expects, so can you explain in detail why
`it could ruin your downgrade trick`?

> 
> That is true. The downgrade will wake up all the waiting readers at the
> front of the wait queue, but if there is one or more writers in the mix. The
> wakeup will stop when the first writer is hit and all the readers after that
> will not be woken up.

So waiters for WRITE won't be waken up by downgrade_write() if I understand correctly,
and rwsem_downgrade_wake() documents this behavior too.

> 
> We can theoretically provide a downgrade variant that wakes up all the
> readers if it is a useful feature.

The following up_read() in this code block will wake up the waiter for
WRITE, which finally wakes up other waiters for READ, then I am confused
what is the problem with this usage?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  3:08       ` Ming Lei
@ 2025-11-18  3:42         ` Waiman Long
  2025-11-18  4:03           ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2025-11-18  3:42 UTC (permalink / raw)
  To: Ming Lei, Waiman Long
  Cc: Hillf Danton, Mohamed Khalfella, Jens Axboe, linux-nvme,
	linux-block, linux-kernel

On 11/17/25 10:08 PM, Ming Lei wrote:
> On Mon, Nov 17, 2025 at 09:24:21PM -0500, Waiman Long wrote:
>> On 11/17/25 8:34 PM, Hillf Danton wrote:
>>> On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
>>>> blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
>>>> tagset, the functions make sure that tagset and queues are marked as
>>>> shared when two or more queues are attached to the same tagset.
>>>> Initially a tagset starts as unshared and when the number of added
>>>> queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
>>>> with all the queues attached to it. When the number of attached queues
>>>> drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
>>>> the remaining queues as unshared.
>>>>
>>>> Both functions need to freeze current queues in tagset before setting on
>>>> unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
>>>> hold set->tag_list_lock mutex, which makes sense as we do not want
>>>> queues to be added or deleted in the process. This used to work fine
>>>> until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
>>>> made the nvme driver quiesce tagset instead of quiscing individual
>>>> queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
>>>> set->tag_list while holding set->tag_list_lock also.
>>>>
>>>> This results in deadlock between two threads with these stacktraces:
>>>>
>>>>     __schedule+0x48e/0xed0
>>>>     schedule+0x5a/0xc0
>>>>     schedule_preempt_disabled+0x11/0x20
>>>>     __mutex_lock.constprop.0+0x3cc/0x760
>>>>     blk_mq_quiesce_tagset+0x26/0xd0
>>>>     nvme_dev_disable_locked+0x77/0x280 [nvme]
>>>>     nvme_timeout+0x268/0x320 [nvme]
>>>>     blk_mq_handle_expired+0x5d/0x90
>>>>     bt_iter+0x7e/0x90
>>>>     blk_mq_queue_tag_busy_iter+0x2b2/0x590
>>>>     ? __blk_mq_complete_request_remote+0x10/0x10
>>>>     ? __blk_mq_complete_request_remote+0x10/0x10
>>>>     blk_mq_timeout_work+0x15b/0x1a0
>>>>     process_one_work+0x133/0x2f0
>>>>     ? mod_delayed_work_on+0x90/0x90
>>>>     worker_thread+0x2ec/0x400
>>>>     ? mod_delayed_work_on+0x90/0x90
>>>>     kthread+0xe2/0x110
>>>>     ? kthread_complete_and_exit+0x20/0x20
>>>>     ret_from_fork+0x2d/0x50
>>>>     ? kthread_complete_and_exit+0x20/0x20
>>>>     ret_from_fork_asm+0x11/0x20
>>>>
>>>>     __schedule+0x48e/0xed0
>>>>     schedule+0x5a/0xc0
>>>>     blk_mq_freeze_queue_wait+0x62/0x90
>>>>     ? destroy_sched_domains_rcu+0x30/0x30
>>>>     blk_mq_exit_queue+0x151/0x180
>>>>     disk_release+0xe3/0xf0
>>>>     device_release+0x31/0x90
>>>>     kobject_put+0x6d/0x180
>>>>     nvme_scan_ns+0x858/0xc90 [nvme_core]
>>>>     ? nvme_scan_work+0x281/0x560 [nvme_core]
>>>>     nvme_scan_work+0x281/0x560 [nvme_core]
>>>>     process_one_work+0x133/0x2f0
>>>>     ? mod_delayed_work_on+0x90/0x90
>>>>     worker_thread+0x2ec/0x400
>>>>     ? mod_delayed_work_on+0x90/0x90
>>>>     kthread+0xe2/0x110
>>>>     ? kthread_complete_and_exit+0x20/0x20
>>>>     ret_from_fork+0x2d/0x50
>>>>     ? kthread_complete_and_exit+0x20/0x20
>>>>     ret_from_fork_asm+0x11/0x20
>>>>
>>>> The top stacktrace is showing nvme_timeout() called to handle nvme
>>>> command timeout. timeout handler is trying to disable the controller and
>>>> as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
>>>> to call queue callback handlers. The thread is stuck waiting for
>>>> set->tag_list_lock as it tires to walk the queues in set->tag_list.
>>>>
>>>> The lock is held by the second thread in the bottom stack which is
>>>> waiting for one of queues to be frozen. The queue usage counter will
>>>> drop to zero after nvme_timeout() finishes, and this will not happen
>>>> because the thread will wait for this mutex forever.
>>>>
>>>> Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
>>>> avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
>>>> semaphore for read since this is enough to guarantee no queues will be
>>>> added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
>>>> semaphore for write while updating set->tag_list and downgrade it to
>>>> read while freezing the queues. It should be safe to update set->flags
>>>> and hctx->flags while holding the semaphore for read since the queues
>>>> are already frozen.
>>>>
>>>> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
>>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>>> ---
>>>>    block/blk-mq-sysfs.c   | 10 ++---
>>>>    block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
>>>>    include/linux/blk-mq.h |  4 +-
>>>>    3 files changed, 58 insertions(+), 51 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
>>>> index 58ec293373c6..f474781654fb 100644
>>>> --- a/block/blk-mq-sysfs.c
>>>> +++ b/block/blk-mq-sysfs.c
>>>> @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>>>>    	kobject_uevent(q->mq_kobj, KOBJ_ADD);
>>>> -	mutex_lock(&q->tag_set->tag_list_lock);
>>>> +	down_read(&q->tag_set->tag_list_rwsem);
>>>>    	queue_for_each_hw_ctx(q, hctx, i) {
>>>>    		ret = blk_mq_register_hctx(hctx);
>>>>    		if (ret)
>>>>    			goto out_unreg;
>>>>    	}
>>>> -	mutex_unlock(&q->tag_set->tag_list_lock);
>>>> +	up_read(&q->tag_set->tag_list_rwsem);
>>>>    	return 0;
>>>>    out_unreg:
>>>> @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>>>>    		if (j < i)
>>>>    			blk_mq_unregister_hctx(hctx);
>>>>    	}
>>>> -	mutex_unlock(&q->tag_set->tag_list_lock);
>>>> +	up_read(&q->tag_set->tag_list_rwsem);
>>>>    	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>>>>    	kobject_del(q->mq_kobj);
>>>> @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
>>>>    	struct blk_mq_hw_ctx *hctx;
>>>>    	unsigned long i;
>>>> -	mutex_lock(&q->tag_set->tag_list_lock);
>>>> +	down_read(&q->tag_set->tag_list_rwsem);
>>>>    	queue_for_each_hw_ctx(q, hctx, i)
>>>>    		blk_mq_unregister_hctx(hctx);
>>>> -	mutex_unlock(&q->tag_set->tag_list_lock);
>>>> +	up_read(&q->tag_set->tag_list_rwsem);
>>>>    	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>>>>    	kobject_del(q->mq_kobj);
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index d626d32f6e57..9211d32ce820 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>>>>    {
>>>>    	struct request_queue *q;
>>>> -	mutex_lock(&set->tag_list_lock);
>>>> +	down_read(&set->tag_list_rwsem);
>>>>    	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>>    		if (!blk_queue_skip_tagset_quiesce(q))
>>>>    			blk_mq_quiesce_queue_nowait(q);
>>>>    	}
>>>> -	mutex_unlock(&set->tag_list_lock);
>>>> +	up_read(&set->tag_list_rwsem);
>>>>    	blk_mq_wait_quiesce_done(set);
>>>>    }
>>>> @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>>>>    {
>>>>    	struct request_queue *q;
>>>> -	mutex_lock(&set->tag_list_lock);
>>>> +	down_read(&set->tag_list_rwsem);
>>>>    	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>>    		if (!blk_queue_skip_tagset_quiesce(q))
>>>>    			blk_mq_unquiesce_queue(q);
>>>>    	}
>>>> -	mutex_unlock(&set->tag_list_lock);
>>>> +	up_read(&set->tag_list_rwsem);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
>>>> @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>>>>    	}
>>>>    }
>>>> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
>>>> -					 bool shared)
>>>> -{
>>>> -	struct request_queue *q;
>>>> -	unsigned int memflags;
>>>> -
>>>> -	lockdep_assert_held(&set->tag_list_lock);
>>>> -
>>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>> -		memflags = blk_mq_freeze_queue(q);
>>>> -		queue_set_hctx_shared(q, shared);
>>>> -		blk_mq_unfreeze_queue(q, memflags);
>>>> -	}
>>>> -}
>>>> -
>>>>    static void blk_mq_del_queue_tag_set(struct request_queue *q)
>>>>    {
>>>>    	struct blk_mq_tag_set *set = q->tag_set;
>>>> +	struct request_queue *firstq;
>>>> +	unsigned int memflags;
>>>> -	mutex_lock(&set->tag_list_lock);
>>>> +	down_write(&set->tag_list_rwsem);
>>>>    	list_del(&q->tag_set_list);
>>>> -	if (list_is_singular(&set->tag_list)) {
>>>> -		/* just transitioned to unshared */
>>>> -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
>>>> -		/* update existing queue */
>>>> -		blk_mq_update_tag_set_shared(set, false);
>>>> +	if (!list_is_singular(&set->tag_list)) {
>>>> +		up_write(&set->tag_list_rwsem);
>>>> +		goto out;
>>>>    	}
>>>> -	mutex_unlock(&set->tag_list_lock);
>>>> +
>>>> +	/*
>>>> +	 * Transitioning the remaining firstq to unshared.
>>>> +	 * Also, downgrade the semaphore to avoid deadlock
>>>> +	 * with blk_mq_quiesce_tagset() while waiting for
>>>> +	 * firstq to be frozen.
>>>> +	 */
>>>> +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
>>>> +	downgrade_write(&set->tag_list_rwsem);
>>> If the first lock waiter is for write, it could ruin your downgrade trick.
> If the 1st waiter is for WEITE, rwsem_mark_wake() simply returns and grants
> read lock to this caller, meantime wakes up nothing.
>
> That is exactly what this use case expects, so can you explain in detail why
> `it could ruin your downgrade trick`?
>
>> That is true. The downgrade will wake up all the waiting readers at the
>> front of the wait queue, but if there is one or more writers in the mix. The
>> wakeup will stop when the first writer is hit and all the readers after that
>> will not be woken up.
> So waiters for WRITE won't be waken up by downgrade_write() if I understand correctly,
> and rwsem_downgrade_wake() documents this behavior too.
>
>> We can theoretically provide a downgrade variant that wakes up all the
>> readers if it is a useful feature.
> The following up_read() in this code block will wake up the waiter for
> WRITE, which finally wakes up other waiters for READ, then I am confused
> what is the problem with this usage?

I am just referring to the fact that not all the readers may be woken 
up. So if the deadlock is caused by one of those readers that is not 
woken up, it can be a problem. I haven't analyzed the deadlock scenario 
in detail to see if that is really the case. It is up to you and others 
who are more familiar with this code base to figure this out.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  2:30       ` Ming Lei
@ 2025-11-18  3:44         ` Mohamed Khalfella
  2025-11-18  4:16           ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-18  3:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Tue 2025-11-18 10:30:52 +0800, Ming Lei wrote:
> On Mon, Nov 17, 2025 at 06:15:04PM -0800, Mohamed Khalfella wrote:
> > On Tue 2025-11-18 10:00:19 +0800, Ming Lei wrote:
> > > On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> > > >  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> > > >  				     struct request_queue *q)
> > > >  {
> > > > -	mutex_lock(&set->tag_list_lock);
> > > > +	struct request_queue *firstq;
> > > > +	unsigned int memflags;
> > > >  
> > > > -	/*
> > > > -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> > > > -	 */
> > > > -	if (!list_empty(&set->tag_list) &&
> > > > -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> > > > -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > -		/* update existing queue */
> > > > -		blk_mq_update_tag_set_shared(set, true);
> > > > -	}
> > > > -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > > > -		queue_set_hctx_shared(q, true);
> > > > -	list_add_tail(&q->tag_set_list, &set->tag_list);
> > > > +	down_write(&set->tag_list_rwsem);
> > > > +	if (!list_is_singular(&set->tag_list)) {
> > > > +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > > > +			queue_set_hctx_shared(q, true);
> > > > +		list_add_tail(&q->tag_set_list, &set->tag_list);
> > > > +		up_write(&set->tag_list_rwsem);
> > > > +		return;
> > > > +	}
> > > >  
> > > > -	mutex_unlock(&set->tag_list_lock);
> > > > +	/* Transitioning firstq and q to shared. */
> > > > +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > +	list_add_tail(&q->tag_set_list, &set->tag_list);
> > > > +	downgrade_write(&set->tag_list_rwsem);
> > > > +	queue_set_hctx_shared(q, true);
> > > 
> > > queue_set_hctx_shared(q, true) should be moved into write critical area
> > > because this queue has been added to the list.
> > > 
> > 
> > I failed to see why that is the case. What can go wrong by running
> > queue_set_hctx_shared(q, true) after downgrade_write()?
> > 
> > After the semaphore is downgraded we promise not to change the list
> > set->tag_list because now we have read-only access. Marking the "q" as
> > shared should be fine because it is new and we know there will be no
> > users of the queue yet (that is why we skipped freezing it).
>  
> I think it is read/write lock's use practice. The protected data shouldn't be
> written any more when you downgrade to read lock.
> 
> In this case, it may not make a difference, because it is one new queue and
> the other readers don't use the `shared` flag, but still better to do
> correct things from beginning and make code less fragile.
> 

set->tag_list_rwsem protects set->tag_list. It does not protect
hctx->flags. hctx->flags is protected by the context. In the case of "q"
it is new and we are not expecting request allocation. In case of
"firstq" the queue is frozen which makes it safe to update hctx->flags.
I prefer to keep the code as it is unless there is a reason to change
it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  3:42         ` Waiman Long
@ 2025-11-18  4:03           ` Ming Lei
       [not found]             ` <702d904c-2d9a-42b4-95b3-0fa43d91e673@redhat.com>
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2025-11-18  4:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, Mohamed Khalfella, Jens Axboe, linux-nvme,
	linux-block, linux-kernel

On Mon, Nov 17, 2025 at 10:42:04PM -0500, Waiman Long wrote:
> On 11/17/25 10:08 PM, Ming Lei wrote:
> > On Mon, Nov 17, 2025 at 09:24:21PM -0500, Waiman Long wrote:
> > > On 11/17/25 8:34 PM, Hillf Danton wrote:
> > > > On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> > > > > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > > > > tagset, the functions make sure that tagset and queues are marked as
> > > > > shared when two or more queues are attached to the same tagset.
> > > > > Initially a tagset starts as unshared and when the number of added
> > > > > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > > > > with all the queues attached to it. When the number of attached queues
> > > > > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > > > > the remaining queues as unshared.
> > > > > 
> > > > > Both functions need to freeze current queues in tagset before setting on
> > > > > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > > > > hold set->tag_list_lock mutex, which makes sense as we do not want
> > > > > queues to be added or deleted in the process. This used to work fine
> > > > > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > > > made the nvme driver quiesce tagset instead of quiscing individual
> > > > > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > > > > set->tag_list while holding set->tag_list_lock also.
> > > > > 
> > > > > This results in deadlock between two threads with these stacktraces:
> > > > > 
> > > > >     __schedule+0x48e/0xed0
> > > > >     schedule+0x5a/0xc0
> > > > >     schedule_preempt_disabled+0x11/0x20
> > > > >     __mutex_lock.constprop.0+0x3cc/0x760
> > > > >     blk_mq_quiesce_tagset+0x26/0xd0
> > > > >     nvme_dev_disable_locked+0x77/0x280 [nvme]
> > > > >     nvme_timeout+0x268/0x320 [nvme]
> > > > >     blk_mq_handle_expired+0x5d/0x90
> > > > >     bt_iter+0x7e/0x90
> > > > >     blk_mq_queue_tag_busy_iter+0x2b2/0x590
> > > > >     ? __blk_mq_complete_request_remote+0x10/0x10
> > > > >     ? __blk_mq_complete_request_remote+0x10/0x10
> > > > >     blk_mq_timeout_work+0x15b/0x1a0
> > > > >     process_one_work+0x133/0x2f0
> > > > >     ? mod_delayed_work_on+0x90/0x90
> > > > >     worker_thread+0x2ec/0x400
> > > > >     ? mod_delayed_work_on+0x90/0x90
> > > > >     kthread+0xe2/0x110
> > > > >     ? kthread_complete_and_exit+0x20/0x20
> > > > >     ret_from_fork+0x2d/0x50
> > > > >     ? kthread_complete_and_exit+0x20/0x20
> > > > >     ret_from_fork_asm+0x11/0x20
> > > > > 
> > > > >     __schedule+0x48e/0xed0
> > > > >     schedule+0x5a/0xc0
> > > > >     blk_mq_freeze_queue_wait+0x62/0x90
> > > > >     ? destroy_sched_domains_rcu+0x30/0x30
> > > > >     blk_mq_exit_queue+0x151/0x180
> > > > >     disk_release+0xe3/0xf0
> > > > >     device_release+0x31/0x90
> > > > >     kobject_put+0x6d/0x180
> > > > >     nvme_scan_ns+0x858/0xc90 [nvme_core]
> > > > >     ? nvme_scan_work+0x281/0x560 [nvme_core]
> > > > >     nvme_scan_work+0x281/0x560 [nvme_core]
> > > > >     process_one_work+0x133/0x2f0
> > > > >     ? mod_delayed_work_on+0x90/0x90
> > > > >     worker_thread+0x2ec/0x400
> > > > >     ? mod_delayed_work_on+0x90/0x90
> > > > >     kthread+0xe2/0x110
> > > > >     ? kthread_complete_and_exit+0x20/0x20
> > > > >     ret_from_fork+0x2d/0x50
> > > > >     ? kthread_complete_and_exit+0x20/0x20
> > > > >     ret_from_fork_asm+0x11/0x20
> > > > > 
> > > > > The top stacktrace is showing nvme_timeout() called to handle nvme
> > > > > command timeout. timeout handler is trying to disable the controller and
> > > > > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > > > > to call queue callback handlers. The thread is stuck waiting for
> > > > > set->tag_list_lock as it tires to walk the queues in set->tag_list.
> > > > > 
> > > > > The lock is held by the second thread in the bottom stack which is
> > > > > waiting for one of queues to be frozen. The queue usage counter will
> > > > > drop to zero after nvme_timeout() finishes, and this will not happen
> > > > > because the thread will wait for this mutex forever.
> > > > > 
> > > > > Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> > > > > avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> > > > > semaphore for read since this is enough to guarantee no queues will be
> > > > > added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> > > > > semaphore for write while updating set->tag_list and downgrade it to
> > > > > read while freezing the queues. It should be safe to update set->flags
> > > > > and hctx->flags while holding the semaphore for read since the queues
> > > > > are already frozen.
> > > > > 
> > > > > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > > > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > > > > ---
> > > > >    block/blk-mq-sysfs.c   | 10 ++---
> > > > >    block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
> > > > >    include/linux/blk-mq.h |  4 +-
> > > > >    3 files changed, 58 insertions(+), 51 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > > > > index 58ec293373c6..f474781654fb 100644
> > > > > --- a/block/blk-mq-sysfs.c
> > > > > +++ b/block/blk-mq-sysfs.c
> > > > > @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > > > >    	kobject_uevent(q->mq_kobj, KOBJ_ADD);
> > > > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > > > +	down_read(&q->tag_set->tag_list_rwsem);
> > > > >    	queue_for_each_hw_ctx(q, hctx, i) {
> > > > >    		ret = blk_mq_register_hctx(hctx);
> > > > >    		if (ret)
> > > > >    			goto out_unreg;
> > > > >    	}
> > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > >    	return 0;
> > > > >    out_unreg:
> > > > > @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > > > >    		if (j < i)
> > > > >    			blk_mq_unregister_hctx(hctx);
> > > > >    	}
> > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > >    	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > > > >    	kobject_del(q->mq_kobj);
> > > > > @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
> > > > >    	struct blk_mq_hw_ctx *hctx;
> > > > >    	unsigned long i;
> > > > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > > > +	down_read(&q->tag_set->tag_list_rwsem);
> > > > >    	queue_for_each_hw_ctx(q, hctx, i)
> > > > >    		blk_mq_unregister_hctx(hctx);
> > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > >    	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > > > >    	kobject_del(q->mq_kobj);
> > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > index d626d32f6e57..9211d32ce820 100644
> > > > > --- a/block/blk-mq.c
> > > > > +++ b/block/blk-mq.c
> > > > > @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> > > > >    {
> > > > >    	struct request_queue *q;
> > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > +	down_read(&set->tag_list_rwsem);
> > > > >    	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > >    		if (!blk_queue_skip_tagset_quiesce(q))
> > > > >    			blk_mq_quiesce_queue_nowait(q);
> > > > >    	}
> > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > +	up_read(&set->tag_list_rwsem);
> > > > >    	blk_mq_wait_quiesce_done(set);
> > > > >    }
> > > > > @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> > > > >    {
> > > > >    	struct request_queue *q;
> > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > +	down_read(&set->tag_list_rwsem);
> > > > >    	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > >    		if (!blk_queue_skip_tagset_quiesce(q))
> > > > >    			blk_mq_unquiesce_queue(q);
> > > > >    	}
> > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > +	up_read(&set->tag_list_rwsem);
> > > > >    }
> > > > >    EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> > > > > @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> > > > >    	}
> > > > >    }
> > > > > -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> > > > > -					 bool shared)
> > > > > -{
> > > > > -	struct request_queue *q;
> > > > > -	unsigned int memflags;
> > > > > -
> > > > > -	lockdep_assert_held(&set->tag_list_lock);
> > > > > -
> > > > > -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > -		memflags = blk_mq_freeze_queue(q);
> > > > > -		queue_set_hctx_shared(q, shared);
> > > > > -		blk_mq_unfreeze_queue(q, memflags);
> > > > > -	}
> > > > > -}
> > > > > -
> > > > >    static void blk_mq_del_queue_tag_set(struct request_queue *q)
> > > > >    {
> > > > >    	struct blk_mq_tag_set *set = q->tag_set;
> > > > > +	struct request_queue *firstq;
> > > > > +	unsigned int memflags;
> > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > +	down_write(&set->tag_list_rwsem);
> > > > >    	list_del(&q->tag_set_list);
> > > > > -	if (list_is_singular(&set->tag_list)) {
> > > > > -		/* just transitioned to unshared */
> > > > > -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > -		/* update existing queue */
> > > > > -		blk_mq_update_tag_set_shared(set, false);
> > > > > +	if (!list_is_singular(&set->tag_list)) {
> > > > > +		up_write(&set->tag_list_rwsem);
> > > > > +		goto out;
> > > > >    	}
> > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > +
> > > > > +	/*
> > > > > +	 * Transitioning the remaining firstq to unshared.
> > > > > +	 * Also, downgrade the semaphore to avoid deadlock
> > > > > +	 * with blk_mq_quiesce_tagset() while waiting for
> > > > > +	 * firstq to be frozen.
> > > > > +	 */
> > > > > +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > +	downgrade_write(&set->tag_list_rwsem);
> > > > If the first lock waiter is for write, it could ruin your downgrade trick.
> > If the 1st waiter is for WEITE, rwsem_mark_wake() simply returns and grants
> > read lock to this caller, meantime wakes up nothing.
> > 
> > That is exactly what this use case expects, so can you explain in detail why
> > `it could ruin your downgrade trick`?
> > 
> > > That is true. The downgrade will wake up all the waiting readers at the
> > > front of the wait queue, but if there is one or more writers in the mix. The
> > > wakeup will stop when the first writer is hit and all the readers after that
> > > will not be woken up.
> > So waiters for WRITE won't be waken up by downgrade_write() if I understand correctly,
> > and rwsem_downgrade_wake() documents this behavior too.
> > 
> > > We can theoretically provide a downgrade variant that wakes up all the
> > > readers if it is a useful feature.
> > The following up_read() in this code block will wake up the waiter for
> > WRITE, which finally wakes up other waiters for READ, then I am confused
> > what is the problem with this usage?
> 
> I am just referring to the fact that not all the readers may be woken up. So
> if the deadlock is caused by one of those readers that is not woken up, it
> can be a problem. I haven't analyzed the deadlock scenario in detail to see
> if that is really the case. It is up to you and others who are more familiar
> with this code base to figure this out.

Follows the code base, which isn't special compared with other
downgrade_write() usages:

blk_mq_del_queue_tag_set()/blk_mq_add_queue_tag_set():

	down_write(&set->tag_list_rwsem);
	...
	downgrade_write(&set->tag_list_rwsem);
	...
	up_read(&set->tag_list_rwsem);

All others are readers:

	down_read(&set->tag_list_rwsem);
	...
	up_read(&set->tag_list_rwsem);


You mentioned reader may not be waken up in case of r/w mixed waiters, but I
don't see it is possible.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  3:44         ` Mohamed Khalfella
@ 2025-11-18  4:16           ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2025-11-18  4:16 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Mon, Nov 17, 2025 at 07:44:05PM -0800, Mohamed Khalfella wrote:
> On Tue 2025-11-18 10:30:52 +0800, Ming Lei wrote:
> > On Mon, Nov 17, 2025 at 06:15:04PM -0800, Mohamed Khalfella wrote:
> > > On Tue 2025-11-18 10:00:19 +0800, Ming Lei wrote:
> > > > On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> > > > >  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> > > > >  				     struct request_queue *q)
> > > > >  {
> > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > +	struct request_queue *firstq;
> > > > > +	unsigned int memflags;
> > > > >  
> > > > > -	/*
> > > > > -	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
> > > > > -	 */
> > > > > -	if (!list_empty(&set->tag_list) &&
> > > > > -	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> > > > > -		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > -		/* update existing queue */
> > > > > -		blk_mq_update_tag_set_shared(set, true);
> > > > > -	}
> > > > > -	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > > > > -		queue_set_hctx_shared(q, true);
> > > > > -	list_add_tail(&q->tag_set_list, &set->tag_list);
> > > > > +	down_write(&set->tag_list_rwsem);
> > > > > +	if (!list_is_singular(&set->tag_list)) {
> > > > > +		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > > > > +			queue_set_hctx_shared(q, true);
> > > > > +		list_add_tail(&q->tag_set_list, &set->tag_list);
> > > > > +		up_write(&set->tag_list_rwsem);
> > > > > +		return;
> > > > > +	}
> > > > >  
> > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > +	/* Transitioning firstq and q to shared. */
> > > > > +	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > +	list_add_tail(&q->tag_set_list, &set->tag_list);
> > > > > +	downgrade_write(&set->tag_list_rwsem);
> > > > > +	queue_set_hctx_shared(q, true);
> > > > 
> > > > queue_set_hctx_shared(q, true) should be moved into write critical area
> > > > because this queue has been added to the list.
> > > > 
> > > 
> > > I failed to see why that is the case. What can go wrong by running
> > > queue_set_hctx_shared(q, true) after downgrade_write()?
> > > 
> > > After the semaphore is downgraded we promise not to change the list
> > > set->tag_list because now we have read-only access. Marking the "q" as
> > > shared should be fine because it is new and we know there will be no
> > > users of the queue yet (that is why we skipped freezing it).
> >  
> > I think it is read/write lock's use practice. The protected data shouldn't be
> > written any more when you downgrade to read lock.
> > 
> > In this case, it may not make a difference, because it is one new queue and
> > the other readers don't use the `shared` flag, but still better to do
> > correct things from beginning and make code less fragile.
> > 
> 
> set->tag_list_rwsem protects set->tag_list. It does not protect
> hctx->flags. hctx->flags is protected by the context. In the case of "q"
> it is new and we are not expecting request allocation. In case of
> "firstq" the queue is frozen which makes it safe to update hctx->flags.
> I prefer to keep the code as it is unless there is a reason to change
> it.

Fair enough, given it is done for `firstrq`.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
       [not found]             ` <702d904c-2d9a-42b4-95b3-0fa43d91e673@redhat.com>
@ 2025-11-18  4:50               ` Ming Lei
  2025-11-18 17:59                 ` Mohamed Khalfella
  2025-11-18  5:02               ` Mohamed Khalfella
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2025-11-18  4:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, Mohamed Khalfella, Jens Axboe, linux-nvme,
	linux-block, linux-kernel

On Mon, Nov 17, 2025 at 11:35:56PM -0500, Waiman Long wrote:
> On 11/17/25 11:03 PM, Ming Lei wrote:
> > On Mon, Nov 17, 2025 at 10:42:04PM -0500, Waiman Long wrote:
> > > On 11/17/25 10:08 PM, Ming Lei wrote:
> > > > On Mon, Nov 17, 2025 at 09:24:21PM -0500, Waiman Long wrote:
> > > > > On 11/17/25 8:34 PM, Hillf Danton wrote:
> > > > > > On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> > > > > > > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > > > > > > tagset, the functions make sure that tagset and queues are marked as
> > > > > > > shared when two or more queues are attached to the same tagset.
> > > > > > > Initially a tagset starts as unshared and when the number of added
> > > > > > > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > > > > > > with all the queues attached to it. When the number of attached queues
> > > > > > > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > > > > > > the remaining queues as unshared.
> > > > > > > 
> > > > > > > Both functions need to freeze current queues in tagset before setting on
> > > > > > > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > > > > > > hold set->tag_list_lock mutex, which makes sense as we do not want
> > > > > > > queues to be added or deleted in the process. This used to work fine
> > > > > > > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > > > > > made the nvme driver quiesce tagset instead of quiscing individual
> > > > > > > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > > > > > > set->tag_list while holding set->tag_list_lock also.
> > > > > > > 
> > > > > > > This results in deadlock between two threads with these stacktraces:
> > > > > > > 
> > > > > > >      __schedule+0x48e/0xed0
> > > > > > >      schedule+0x5a/0xc0
> > > > > > >      schedule_preempt_disabled+0x11/0x20
> > > > > > >      __mutex_lock.constprop.0+0x3cc/0x760
> > > > > > >      blk_mq_quiesce_tagset+0x26/0xd0
> > > > > > >      nvme_dev_disable_locked+0x77/0x280 [nvme]
> > > > > > >      nvme_timeout+0x268/0x320 [nvme]
> > > > > > >      blk_mq_handle_expired+0x5d/0x90
> > > > > > >      bt_iter+0x7e/0x90
> > > > > > >      blk_mq_queue_tag_busy_iter+0x2b2/0x590
> > > > > > >      ? __blk_mq_complete_request_remote+0x10/0x10
> > > > > > >      ? __blk_mq_complete_request_remote+0x10/0x10
> > > > > > >      blk_mq_timeout_work+0x15b/0x1a0
> > > > > > >      process_one_work+0x133/0x2f0
> > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > >      worker_thread+0x2ec/0x400
> > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > >      kthread+0xe2/0x110
> > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > >      ret_from_fork+0x2d/0x50
> > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > >      ret_from_fork_asm+0x11/0x20
> > > > > > > 
> > > > > > >      __schedule+0x48e/0xed0
> > > > > > >      schedule+0x5a/0xc0
> > > > > > >      blk_mq_freeze_queue_wait+0x62/0x90
> > > > > > >      ? destroy_sched_domains_rcu+0x30/0x30
> > > > > > >      blk_mq_exit_queue+0x151/0x180
> > > > > > >      disk_release+0xe3/0xf0
> > > > > > >      device_release+0x31/0x90
> > > > > > >      kobject_put+0x6d/0x180
> > > > > > >      nvme_scan_ns+0x858/0xc90 [nvme_core]
> > > > > > >      ? nvme_scan_work+0x281/0x560 [nvme_core]
> > > > > > >      nvme_scan_work+0x281/0x560 [nvme_core]
> > > > > > >      process_one_work+0x133/0x2f0
> > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > >      worker_thread+0x2ec/0x400
> > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > >      kthread+0xe2/0x110
> > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > >      ret_from_fork+0x2d/0x50
> > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > >      ret_from_fork_asm+0x11/0x20
> > > > > > > 
> > > > > > > The top stacktrace is showing nvme_timeout() called to handle nvme
> > > > > > > command timeout. timeout handler is trying to disable the controller and
> > > > > > > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > > > > > > to call queue callback handlers. The thread is stuck waiting for
> > > > > > > set->tag_list_lock as it tires to walk the queues in set->tag_list.
> > > > > > > 
> > > > > > > The lock is held by the second thread in the bottom stack which is
> > > > > > > waiting for one of queues to be frozen. The queue usage counter will
> > > > > > > drop to zero after nvme_timeout() finishes, and this will not happen
> > > > > > > because the thread will wait for this mutex forever.
> > > > > > > 
> > > > > > > Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> > > > > > > avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> > > > > > > semaphore for read since this is enough to guarantee no queues will be
> > > > > > > added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> > > > > > > semaphore for write while updating set->tag_list and downgrade it to
> > > > > > > read while freezing the queues. It should be safe to update set->flags
> > > > > > > and hctx->flags while holding the semaphore for read since the queues
> > > > > > > are already frozen.
> > > > > > > 
> > > > > > > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > > > > > Signed-off-by: Mohamed Khalfella<mkhalfella@purestorage.com>
> > > > > > > ---
> > > > > > >     block/blk-mq-sysfs.c   | 10 ++---
> > > > > > >     block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
> > > > > > >     include/linux/blk-mq.h |  4 +-
> > > > > > >     3 files changed, 58 insertions(+), 51 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > > > > > > index 58ec293373c6..f474781654fb 100644
> > > > > > > --- a/block/blk-mq-sysfs.c
> > > > > > > +++ b/block/blk-mq-sysfs.c
> > > > > > > @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > > > > > >     	kobject_uevent(q->mq_kobj, KOBJ_ADD);
> > > > > > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > > > > > +	down_read(&q->tag_set->tag_list_rwsem);
> > > > > > >     	queue_for_each_hw_ctx(q, hctx, i) {
> > > > > > >     		ret = blk_mq_register_hctx(hctx);
> > > > > > >     		if (ret)
> > > > > > >     			goto out_unreg;
> > > > > > >     	}
> > > > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > > > >     	return 0;
> > > > > > >     out_unreg:
> > > > > > > @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > > > > > >     		if (j < i)
> > > > > > >     			blk_mq_unregister_hctx(hctx);
> > > > > > >     	}
> > > > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > > > >     	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > > > > > >     	kobject_del(q->mq_kobj);
> > > > > > > @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
> > > > > > >     	struct blk_mq_hw_ctx *hctx;
> > > > > > >     	unsigned long i;
> > > > > > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > > > > > +	down_read(&q->tag_set->tag_list_rwsem);
> > > > > > >     	queue_for_each_hw_ctx(q, hctx, i)
> > > > > > >     		blk_mq_unregister_hctx(hctx);
> > > > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > > > >     	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > > > > > >     	kobject_del(q->mq_kobj);
> > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > index d626d32f6e57..9211d32ce820 100644
> > > > > > > --- a/block/blk-mq.c
> > > > > > > +++ b/block/blk-mq.c
> > > > > > > @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> > > > > > >     {
> > > > > > >     	struct request_queue *q;
> > > > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > > > +	down_read(&set->tag_list_rwsem);
> > > > > > >     	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > > >     		if (!blk_queue_skip_tagset_quiesce(q))
> > > > > > >     			blk_mq_quiesce_queue_nowait(q);
> > > > > > >     	}
> > > > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > > > +	up_read(&set->tag_list_rwsem);
> > > > > > >     	blk_mq_wait_quiesce_done(set);
> > > > > > >     }
> > > > > > > @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> > > > > > >     {
> > > > > > >     	struct request_queue *q;
> > > > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > > > +	down_read(&set->tag_list_rwsem);
> > > > > > >     	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > > >     		if (!blk_queue_skip_tagset_quiesce(q))
> > > > > > >     			blk_mq_unquiesce_queue(q);
> > > > > > >     	}
> > > > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > > > +	up_read(&set->tag_list_rwsem);
> > > > > > >     }
> > > > > > >     EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> > > > > > > @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> > > > > > >     	}
> > > > > > >     }
> > > > > > > -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> > > > > > > -					 bool shared)
> > > > > > > -{
> > > > > > > -	struct request_queue *q;
> > > > > > > -	unsigned int memflags;
> > > > > > > -
> > > > > > > -	lockdep_assert_held(&set->tag_list_lock);
> > > > > > > -
> > > > > > > -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > > > -		memflags = blk_mq_freeze_queue(q);
> > > > > > > -		queue_set_hctx_shared(q, shared);
> > > > > > > -		blk_mq_unfreeze_queue(q, memflags);
> > > > > > > -	}
> > > > > > > -}
> > > > > > > -
> > > > > > >     static void blk_mq_del_queue_tag_set(struct request_queue *q)
> > > > > > >     {
> > > > > > >     	struct blk_mq_tag_set *set = q->tag_set;
> > > > > > > +	struct request_queue *firstq;
> > > > > > > +	unsigned int memflags;
> > > > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > > > +	down_write(&set->tag_list_rwsem);
> > > > > > >     	list_del(&q->tag_set_list);
> > > > > > > -	if (list_is_singular(&set->tag_list)) {
> > > > > > > -		/* just transitioned to unshared */
> > > > > > > -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > > > -		/* update existing queue */
> > > > > > > -		blk_mq_update_tag_set_shared(set, false);
> > > > > > > +	if (!list_is_singular(&set->tag_list)) {
> > > > > > > +		up_write(&set->tag_list_rwsem);
> > > > > > > +		goto out;
> > > > > > >     	}
> > > > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Transitioning the remaining firstq to unshared.
> > > > > > > +	 * Also, downgrade the semaphore to avoid deadlock
> > > > > > > +	 * with blk_mq_quiesce_tagset() while waiting for
> > > > > > > +	 * firstq to be frozen.
> > > > > > > +	 */
> > > > > > > +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > > > +	downgrade_write(&set->tag_list_rwsem);
> > > > > > If the first lock waiter is for write, it could ruin your downgrade trick.
> > > > If the 1st waiter is for WEITE, rwsem_mark_wake() simply returns and grants
> > > > read lock to this caller, meantime wakes up nothing.
> > > > 
> > > > That is exactly what this use case expects, so can you explain in detail why
> > > > `it could ruin your downgrade trick`?
> > > > 
> > > > > That is true. The downgrade will wake up all the waiting readers at the
> > > > > front of the wait queue, but if there is one or more writers in the mix. The
> > > > > wakeup will stop when the first writer is hit and all the readers after that
> > > > > will not be woken up.
> > > > So waiters for WRITE won't be waken up by downgrade_write() if I understand correctly,
> > > > and rwsem_downgrade_wake() documents this behavior too.
> > > > 
> > > > > We can theoretically provide a downgrade variant that wakes up all the
> > > > > readers if it is a useful feature.
> > > > The following up_read() in this code block will wake up the waiter for
> > > > WRITE, which finally wakes up other waiters for READ, then I am confused
> > > > what is the problem with this usage?
> > > I am just referring to the fact that not all the readers may be woken up. So
> > > if the deadlock is caused by one of those readers that is not woken up, it
> > > can be a problem. I haven't analyzed the deadlock scenario in detail to see
> > > if that is really the case. It is up to you and others who are more familiar
> > > with this code base to figure this out.
> > Follows the code base, which isn't special compared with other
> > downgrade_write() usages:
> > 
> > blk_mq_del_queue_tag_set()/blk_mq_add_queue_tag_set():
> > 
> > 	down_write(&set->tag_list_rwsem);
> > 	...
> > 	downgrade_write(&set->tag_list_rwsem);
> > 	...
> > 	up_read(&set->tag_list_rwsem);
> > 
> > All others are readers:
> > 
> > 	down_read(&set->tag_list_rwsem);
> > 	...
> > 	up_read(&set->tag_list_rwsem);
> > 
> > 
> > You mentioned reader may not be waken up in case of r/w mixed waiters, but I
> > don't see it is possible.
> 
> I don't know if concurrent calls to blk_mq_del_queue_tag_set() is possible

Both blk_mq_del_queue_tag_set() and blk_mq_add_queue_tag_set() can be
called concurrently with same `struct blk_mq_tag_set *` and different
'struct request_queue *'.

> or not. There is also the blk_mq_update_nr_hw_queues() function that will
> acquire the write lock.

blk_mq_update_nr_hw_queues() doesn't add/remove queue to set->tag_list, so
I guess read lock may be enough.

Also blk_mq_update_nr_hw_queues() can be run concurrently with the above
two helpers.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
       [not found]             ` <702d904c-2d9a-42b4-95b3-0fa43d91e673@redhat.com>
  2025-11-18  4:50               ` Ming Lei
@ 2025-11-18  5:02               ` Mohamed Khalfella
  1 sibling, 0 replies; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-18  5:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ming Lei, Hillf Danton, Jens Axboe, linux-nvme, linux-block,
	linux-kernel

On Mon 2025-11-17 23:35:56 -0500, Waiman Long wrote:
>    On 11/17/25 11:03 PM, Ming Lei wrote:
> 
> On Mon, Nov 17, 2025 at 10:42:04PM -0500, Waiman Long wrote:
> 
> On 11/17/25 10:08 PM, Ming Lei wrote:
> 
> On Mon, Nov 17, 2025 at 09:24:21PM -0500, Waiman Long wrote:
> 
> On 11/17/25 8:34 PM, Hillf Danton wrote:
> 
> On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> 
> blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> tagset, the functions make sure that tagset and queues are marked as
> shared when two or more queues are attached to the same tagset.
> Initially a tagset starts as unshared and when the number of added
> queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> with all the queues attached to it. When the number of attached queues
> drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> the remaining queues as unshared.
> 
> Both functions need to freeze current queues in tagset before setting on
> unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> hold set->tag_list_lock mutex, which makes sense as we do not want
> queues to be added or deleted in the process. This used to work fine
> until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> made the nvme driver quiesce tagset instead of quiscing individual
> queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> set->tag_list while holding set->tag_list_lock also.
> 
> This results in deadlock between two threads with these stacktraces:
> 
>     __schedule+0x48e/0xed0
>     schedule+0x5a/0xc0
>     schedule_preempt_disabled+0x11/0x20
>     __mutex_lock.constprop.0+0x3cc/0x760
>     blk_mq_quiesce_tagset+0x26/0xd0
>     nvme_dev_disable_locked+0x77/0x280 [nvme]
>     nvme_timeout+0x268/0x320 [nvme]
>     blk_mq_handle_expired+0x5d/0x90
>     bt_iter+0x7e/0x90
>     blk_mq_queue_tag_busy_iter+0x2b2/0x590
>     ? __blk_mq_complete_request_remote+0x10/0x10
>     ? __blk_mq_complete_request_remote+0x10/0x10
>     blk_mq_timeout_work+0x15b/0x1a0
>     process_one_work+0x133/0x2f0
>     ? mod_delayed_work_on+0x90/0x90
>     worker_thread+0x2ec/0x400
>     ? mod_delayed_work_on+0x90/0x90
>     kthread+0xe2/0x110
>     ? kthread_complete_and_exit+0x20/0x20
>     ret_from_fork+0x2d/0x50
>     ? kthread_complete_and_exit+0x20/0x20
>     ret_from_fork_asm+0x11/0x20
> 
>     __schedule+0x48e/0xed0
>     schedule+0x5a/0xc0
>     blk_mq_freeze_queue_wait+0x62/0x90
>     ? destroy_sched_domains_rcu+0x30/0x30
>     blk_mq_exit_queue+0x151/0x180
>     disk_release+0xe3/0xf0
>     device_release+0x31/0x90
>     kobject_put+0x6d/0x180
>     nvme_scan_ns+0x858/0xc90 [nvme_core]
>     ? nvme_scan_work+0x281/0x560 [nvme_core]
>     nvme_scan_work+0x281/0x560 [nvme_core]
>     process_one_work+0x133/0x2f0
>     ? mod_delayed_work_on+0x90/0x90
>     worker_thread+0x2ec/0x400
>     ? mod_delayed_work_on+0x90/0x90
>     kthread+0xe2/0x110
>     ? kthread_complete_and_exit+0x20/0x20
>     ret_from_fork+0x2d/0x50
>     ? kthread_complete_and_exit+0x20/0x20
>     ret_from_fork_asm+0x11/0x20
> 
> The top stacktrace is showing nvme_timeout() called to handle nvme
> command timeout. timeout handler is trying to disable the controller and
> as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> to call queue callback handlers. The thread is stuck waiting for
> set->tag_list_lock as it tires to walk the queues in set->tag_list.
> 
> The lock is held by the second thread in the bottom stack which is
> waiting for one of queues to be frozen. The queue usage counter will
> drop to zero after nvme_timeout() finishes, and this will not happen
> because the thread will wait for this mutex forever.
> 
> Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> semaphore for read since this is enough to guarantee no queues will be
> added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> semaphore for write while updating set->tag_list and downgrade it to
> read while freezing the queues. It should be safe to update set->flags
> and hctx->flags while holding the semaphore for read since the queues
> are already frozen.
> 
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Signed-off-by: Mohamed Khalfella [1]<mkhalfella@purestorage.com>
> ---
>    block/blk-mq-sysfs.c   | 10 ++---
>    block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
>    include/linux/blk-mq.h |  4 +-
>    3 files changed, 58 insertions(+), 51 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 58ec293373c6..f474781654fb 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>         kobject_uevent(q->mq_kobj, KOBJ_ADD);
> -       mutex_lock(&q->tag_set->tag_list_lock);
> +       down_read(&q->tag_set->tag_list_rwsem);
>         queue_for_each_hw_ctx(q, hctx, i) {
>                 ret = blk_mq_register_hctx(hctx);
>                 if (ret)
>                         goto out_unreg;
>         }
> -       mutex_unlock(&q->tag_set->tag_list_lock);
> +       up_read(&q->tag_set->tag_list_rwsem);
>         return 0;
>    out_unreg:
> @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
>                 if (j < i)
>                         blk_mq_unregister_hctx(hctx);
>         }
> -       mutex_unlock(&q->tag_set->tag_list_lock);
> +       up_read(&q->tag_set->tag_list_rwsem);
>         kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>         kobject_del(q->mq_kobj);
> @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
>         struct blk_mq_hw_ctx *hctx;
>         unsigned long i;
> -       mutex_lock(&q->tag_set->tag_list_lock);
> +       down_read(&q->tag_set->tag_list_rwsem);
>         queue_for_each_hw_ctx(q, hctx, i)
>                 blk_mq_unregister_hctx(hctx);
> -       mutex_unlock(&q->tag_set->tag_list_lock);
> +       up_read(&q->tag_set->tag_list_rwsem);
>         kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
>         kobject_del(q->mq_kobj);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d626d32f6e57..9211d32ce820 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>    {
>         struct request_queue *q;
> -       mutex_lock(&set->tag_list_lock);
> +       down_read(&set->tag_list_rwsem);
>         list_for_each_entry(q, &set->tag_list, tag_set_list) {
>                 if (!blk_queue_skip_tagset_quiesce(q))
>                         blk_mq_quiesce_queue_nowait(q);
>         }
> -       mutex_unlock(&set->tag_list_lock);
> +       up_read(&set->tag_list_rwsem);
>         blk_mq_wait_quiesce_done(set);
>    }
> @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>    {
>         struct request_queue *q;
> -       mutex_lock(&set->tag_list_lock);
> +       down_read(&set->tag_list_rwsem);
>         list_for_each_entry(q, &set->tag_list, tag_set_list) {
>                 if (!blk_queue_skip_tagset_quiesce(q))
>                         blk_mq_unquiesce_queue(q);
>         }
> -       mutex_unlock(&set->tag_list_lock);
> +       up_read(&set->tag_list_rwsem);
>    }
>    EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *
> q, bool shared)
>         }
>    }
> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> -                                        bool shared)
> -{
> -       struct request_queue *q;
> -       unsigned int memflags;
> -
> -       lockdep_assert_held(&set->tag_list_lock);
> -
> -       list_for_each_entry(q, &set->tag_list, tag_set_list) {
> -               memflags = blk_mq_freeze_queue(q);
> -               queue_set_hctx_shared(q, shared);
> -               blk_mq_unfreeze_queue(q, memflags);
> -       }
> -}
> -
>    static void blk_mq_del_queue_tag_set(struct request_queue *q)
>    {
>         struct blk_mq_tag_set *set = q->tag_set;
> +       struct request_queue *firstq;
> +       unsigned int memflags;
> -       mutex_lock(&set->tag_list_lock);
> +       down_write(&set->tag_list_rwsem);
>         list_del(&q->tag_set_list);
> -       if (list_is_singular(&set->tag_list)) {
> -               /* just transitioned to unshared */
> -               set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> -               /* update existing queue */
> -               blk_mq_update_tag_set_shared(set, false);
> +       if (!list_is_singular(&set->tag_list)) {
> +               up_write(&set->tag_list_rwsem);
> +               goto out;
>         }
> -       mutex_unlock(&set->tag_list_lock);
> +
> +       /*
> +        * Transitioning the remaining firstq to unshared.
> +        * Also, downgrade the semaphore to avoid deadlock
> +        * with blk_mq_quiesce_tagset() while waiting for
> +        * firstq to be frozen.
> +        */
> +       set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> +       downgrade_write(&set->tag_list_rwsem);
> 
> If the first lock waiter is for write, it could ruin your downgrade trick.
> 
> If the 1st waiter is for WEITE, rwsem_mark_wake() simply returns and grants
> read lock to this caller, meantime wakes up nothing.
> 
> That is exactly what this use case expects, so can you explain in detail why
> `it could ruin your downgrade trick`?
> 
> 
> That is true. The downgrade will wake up all the waiting readers at the
> front of the wait queue, but if there is one or more writers in the mix. The
> wakeup will stop when the first writer is hit and all the readers after that
> will not be woken up.
> 
> So waiters for WRITE won't be waken up by downgrade_write() if I understand corr
> ectly,
> and rwsem_downgrade_wake() documents this behavior too.
> 
> 
> We can theoretically provide a downgrade variant that wakes up all the
> readers if it is a useful feature.
> 
> The following up_read() in this code block will wake up the waiter for
> WRITE, which finally wakes up other waiters for READ, then I am confused
> what is the problem with this usage?
> 
> I am just referring to the fact that not all the readers may be woken up. So
> if the deadlock is caused by one of those readers that is not woken up, it
> can be a problem. I haven't analyzed the deadlock scenario in detail to see
> if that is really the case. It is up to you and others who are more familiar
> with this code base to figure this out.
> 
> Follows the code base, which isn't special compared with other
> downgrade_write() usages:
> 
> blk_mq_del_queue_tag_set()/blk_mq_add_queue_tag_set():
> 
>         down_write(&set->tag_list_rwsem);
>         ...
>         downgrade_write(&set->tag_list_rwsem);
>         ...
>         up_read(&set->tag_list_rwsem);
> 
> All others are readers:
> 
>         down_read(&set->tag_list_rwsem);
>         ...
>         up_read(&set->tag_list_rwsem);
> 
> 
> You mentioned reader may not be waken up in case of r/w mixed waiters, but I
> don't see it is possible.
> 
>    I don't know if concurrent calls to blk_mq_del_queue_tag_set() is
>    possible or not. There is also the blk_mq_update_nr_hw_queues()
>    function that will acquire the write lock.
> 

Assuming blk_mq_del_queue_tag_set(), blk_mq_add_queue_tag_set() and
other readers run in parallel? Are we talking about potential starvation
here? 

If yes, is this reader starvation or writer starvation? 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-18  4:50               ` Ming Lei
@ 2025-11-18 17:59                 ` Mohamed Khalfella
  0 siblings, 0 replies; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-18 17:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Waiman Long, Hillf Danton, Jens Axboe, linux-nvme, linux-block,
	linux-kernel

On Tue 2025-11-18 12:50:52 +0800, Ming Lei wrote:
> On Mon, Nov 17, 2025 at 11:35:56PM -0500, Waiman Long wrote:
> > On 11/17/25 11:03 PM, Ming Lei wrote:
> > > On Mon, Nov 17, 2025 at 10:42:04PM -0500, Waiman Long wrote:
> > > > On 11/17/25 10:08 PM, Ming Lei wrote:
> > > > > On Mon, Nov 17, 2025 at 09:24:21PM -0500, Waiman Long wrote:
> > > > > > On 11/17/25 8:34 PM, Hillf Danton wrote:
> > > > > > > On Mon, 17 Nov 2025 12:23:53 -0800 Mohamed Khalfella wrote:
> > > > > > > > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > > > > > > > tagset, the functions make sure that tagset and queues are marked as
> > > > > > > > shared when two or more queues are attached to the same tagset.
> > > > > > > > Initially a tagset starts as unshared and when the number of added
> > > > > > > > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > > > > > > > with all the queues attached to it. When the number of attached queues
> > > > > > > > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > > > > > > > the remaining queues as unshared.
> > > > > > > > 
> > > > > > > > Both functions need to freeze current queues in tagset before setting on
> > > > > > > > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > > > > > > > hold set->tag_list_lock mutex, which makes sense as we do not want
> > > > > > > > queues to be added or deleted in the process. This used to work fine
> > > > > > > > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > > > > > > made the nvme driver quiesce tagset instead of quiscing individual
> > > > > > > > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > > > > > > > set->tag_list while holding set->tag_list_lock also.
> > > > > > > > 
> > > > > > > > This results in deadlock between two threads with these stacktraces:
> > > > > > > > 
> > > > > > > >      __schedule+0x48e/0xed0
> > > > > > > >      schedule+0x5a/0xc0
> > > > > > > >      schedule_preempt_disabled+0x11/0x20
> > > > > > > >      __mutex_lock.constprop.0+0x3cc/0x760
> > > > > > > >      blk_mq_quiesce_tagset+0x26/0xd0
> > > > > > > >      nvme_dev_disable_locked+0x77/0x280 [nvme]
> > > > > > > >      nvme_timeout+0x268/0x320 [nvme]
> > > > > > > >      blk_mq_handle_expired+0x5d/0x90
> > > > > > > >      bt_iter+0x7e/0x90
> > > > > > > >      blk_mq_queue_tag_busy_iter+0x2b2/0x590
> > > > > > > >      ? __blk_mq_complete_request_remote+0x10/0x10
> > > > > > > >      ? __blk_mq_complete_request_remote+0x10/0x10
> > > > > > > >      blk_mq_timeout_work+0x15b/0x1a0
> > > > > > > >      process_one_work+0x133/0x2f0
> > > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > > >      worker_thread+0x2ec/0x400
> > > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > > >      kthread+0xe2/0x110
> > > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > > >      ret_from_fork+0x2d/0x50
> > > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > > >      ret_from_fork_asm+0x11/0x20
> > > > > > > > 
> > > > > > > >      __schedule+0x48e/0xed0
> > > > > > > >      schedule+0x5a/0xc0
> > > > > > > >      blk_mq_freeze_queue_wait+0x62/0x90
> > > > > > > >      ? destroy_sched_domains_rcu+0x30/0x30
> > > > > > > >      blk_mq_exit_queue+0x151/0x180
> > > > > > > >      disk_release+0xe3/0xf0
> > > > > > > >      device_release+0x31/0x90
> > > > > > > >      kobject_put+0x6d/0x180
> > > > > > > >      nvme_scan_ns+0x858/0xc90 [nvme_core]
> > > > > > > >      ? nvme_scan_work+0x281/0x560 [nvme_core]
> > > > > > > >      nvme_scan_work+0x281/0x560 [nvme_core]
> > > > > > > >      process_one_work+0x133/0x2f0
> > > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > > >      worker_thread+0x2ec/0x400
> > > > > > > >      ? mod_delayed_work_on+0x90/0x90
> > > > > > > >      kthread+0xe2/0x110
> > > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > > >      ret_from_fork+0x2d/0x50
> > > > > > > >      ? kthread_complete_and_exit+0x20/0x20
> > > > > > > >      ret_from_fork_asm+0x11/0x20
> > > > > > > > 
> > > > > > > > The top stacktrace is showing nvme_timeout() called to handle nvme
> > > > > > > > command timeout. timeout handler is trying to disable the controller and
> > > > > > > > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > > > > > > > to call queue callback handlers. The thread is stuck waiting for
> > > > > > > > set->tag_list_lock as it tires to walk the queues in set->tag_list.
> > > > > > > > 
> > > > > > > > The lock is held by the second thread in the bottom stack which is
> > > > > > > > waiting for one of queues to be frozen. The queue usage counter will
> > > > > > > > drop to zero after nvme_timeout() finishes, and this will not happen
> > > > > > > > because the thread will wait for this mutex forever.
> > > > > > > > 
> > > > > > > > Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> > > > > > > > avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> > > > > > > > semaphore for read since this is enough to guarantee no queues will be
> > > > > > > > added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> > > > > > > > semaphore for write while updating set->tag_list and downgrade it to
> > > > > > > > read while freezing the queues. It should be safe to update set->flags
> > > > > > > > and hctx->flags while holding the semaphore for read since the queues
> > > > > > > > are already frozen.
> > > > > > > > 
> > > > > > > > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > > > > > > Signed-off-by: Mohamed Khalfella<mkhalfella@purestorage.com>
> > > > > > > > ---
> > > > > > > >     block/blk-mq-sysfs.c   | 10 ++---
> > > > > > > >     block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
> > > > > > > >     include/linux/blk-mq.h |  4 +-
> > > > > > > >     3 files changed, 58 insertions(+), 51 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > > > > > > > index 58ec293373c6..f474781654fb 100644
> > > > > > > > --- a/block/blk-mq-sysfs.c
> > > > > > > > +++ b/block/blk-mq-sysfs.c
> > > > > > > > @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > > > > > > >     	kobject_uevent(q->mq_kobj, KOBJ_ADD);
> > > > > > > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > > > > > > +	down_read(&q->tag_set->tag_list_rwsem);
> > > > > > > >     	queue_for_each_hw_ctx(q, hctx, i) {
> > > > > > > >     		ret = blk_mq_register_hctx(hctx);
> > > > > > > >     		if (ret)
> > > > > > > >     			goto out_unreg;
> > > > > > > >     	}
> > > > > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > > > > >     	return 0;
> > > > > > > >     out_unreg:
> > > > > > > > @@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> > > > > > > >     		if (j < i)
> > > > > > > >     			blk_mq_unregister_hctx(hctx);
> > > > > > > >     	}
> > > > > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > > > > >     	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > > > > > > >     	kobject_del(q->mq_kobj);
> > > > > > > > @@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
> > > > > > > >     	struct blk_mq_hw_ctx *hctx;
> > > > > > > >     	unsigned long i;
> > > > > > > > -	mutex_lock(&q->tag_set->tag_list_lock);
> > > > > > > > +	down_read(&q->tag_set->tag_list_rwsem);
> > > > > > > >     	queue_for_each_hw_ctx(q, hctx, i)
> > > > > > > >     		blk_mq_unregister_hctx(hctx);
> > > > > > > > -	mutex_unlock(&q->tag_set->tag_list_lock);
> > > > > > > > +	up_read(&q->tag_set->tag_list_rwsem);
> > > > > > > >     	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
> > > > > > > >     	kobject_del(q->mq_kobj);
> > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > > index d626d32f6e57..9211d32ce820 100644
> > > > > > > > --- a/block/blk-mq.c
> > > > > > > > +++ b/block/blk-mq.c
> > > > > > > > @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> > > > > > > >     {
> > > > > > > >     	struct request_queue *q;
> > > > > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > > > > +	down_read(&set->tag_list_rwsem);
> > > > > > > >     	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > > > >     		if (!blk_queue_skip_tagset_quiesce(q))
> > > > > > > >     			blk_mq_quiesce_queue_nowait(q);
> > > > > > > >     	}
> > > > > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > > > > +	up_read(&set->tag_list_rwsem);
> > > > > > > >     	blk_mq_wait_quiesce_done(set);
> > > > > > > >     }
> > > > > > > > @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> > > > > > > >     {
> > > > > > > >     	struct request_queue *q;
> > > > > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > > > > +	down_read(&set->tag_list_rwsem);
> > > > > > > >     	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > > > >     		if (!blk_queue_skip_tagset_quiesce(q))
> > > > > > > >     			blk_mq_unquiesce_queue(q);
> > > > > > > >     	}
> > > > > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > > > > +	up_read(&set->tag_list_rwsem);
> > > > > > > >     }
> > > > > > > >     EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> > > > > > > > @@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> > > > > > > >     	}
> > > > > > > >     }
> > > > > > > > -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> > > > > > > > -					 bool shared)
> > > > > > > > -{
> > > > > > > > -	struct request_queue *q;
> > > > > > > > -	unsigned int memflags;
> > > > > > > > -
> > > > > > > > -	lockdep_assert_held(&set->tag_list_lock);
> > > > > > > > -
> > > > > > > > -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > > > > > > > -		memflags = blk_mq_freeze_queue(q);
> > > > > > > > -		queue_set_hctx_shared(q, shared);
> > > > > > > > -		blk_mq_unfreeze_queue(q, memflags);
> > > > > > > > -	}
> > > > > > > > -}
> > > > > > > > -
> > > > > > > >     static void blk_mq_del_queue_tag_set(struct request_queue *q)
> > > > > > > >     {
> > > > > > > >     	struct blk_mq_tag_set *set = q->tag_set;
> > > > > > > > +	struct request_queue *firstq;
> > > > > > > > +	unsigned int memflags;
> > > > > > > > -	mutex_lock(&set->tag_list_lock);
> > > > > > > > +	down_write(&set->tag_list_rwsem);
> > > > > > > >     	list_del(&q->tag_set_list);
> > > > > > > > -	if (list_is_singular(&set->tag_list)) {
> > > > > > > > -		/* just transitioned to unshared */
> > > > > > > > -		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > > > > -		/* update existing queue */
> > > > > > > > -		blk_mq_update_tag_set_shared(set, false);
> > > > > > > > +	if (!list_is_singular(&set->tag_list)) {
> > > > > > > > +		up_write(&set->tag_list_rwsem);
> > > > > > > > +		goto out;
> > > > > > > >     	}
> > > > > > > > -	mutex_unlock(&set->tag_list_lock);
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * Transitioning the remaining firstq to unshared.
> > > > > > > > +	 * Also, downgrade the semaphore to avoid deadlock
> > > > > > > > +	 * with blk_mq_quiesce_tagset() while waiting for
> > > > > > > > +	 * firstq to be frozen.
> > > > > > > > +	 */
> > > > > > > > +	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > > > > > > > +	downgrade_write(&set->tag_list_rwsem);
> > > > > > > If the first lock waiter is for write, it could ruin your downgrade trick.
> > > > > If the 1st waiter is for WEITE, rwsem_mark_wake() simply returns and grants
> > > > > read lock to this caller, meantime wakes up nothing.
> > > > > 
> > > > > That is exactly what this use case expects, so can you explain in detail why
> > > > > `it could ruin your downgrade trick`?
> > > > > 
> > > > > > That is true. The downgrade will wake up all the waiting readers at the
> > > > > > front of the wait queue, but if there is one or more writers in the mix. The
> > > > > > wakeup will stop when the first writer is hit and all the readers after that
> > > > > > will not be woken up.
> > > > > So waiters for WRITE won't be waken up by downgrade_write() if I understand correctly,
> > > > > and rwsem_downgrade_wake() documents this behavior too.
> > > > > 
> > > > > > We can theoretically provide a downgrade variant that wakes up all the
> > > > > > readers if it is a useful feature.
> > > > > The following up_read() in this code block will wake up the waiter for
> > > > > WRITE, which finally wakes up other waiters for READ, then I am confused
> > > > > what is the problem with this usage?
> > > > I am just referring to the fact that not all the readers may be woken up. So
> > > > if the deadlock is caused by one of those readers that is not woken up, it
> > > > can be a problem. I haven't analyzed the deadlock scenario in detail to see
> > > > if that is really the case. It is up to you and others who are more familiar
> > > > with this code base to figure this out.
> > > Follows the code base, which isn't special compared with other
> > > downgrade_write() usages:
> > > 
> > > blk_mq_del_queue_tag_set()/blk_mq_add_queue_tag_set():
> > > 
> > > 	down_write(&set->tag_list_rwsem);
> > > 	...
> > > 	downgrade_write(&set->tag_list_rwsem);
> > > 	...
> > > 	up_read(&set->tag_list_rwsem);
> > > 
> > > All others are readers:
> > > 
> > > 	down_read(&set->tag_list_rwsem);
> > > 	...
> > > 	up_read(&set->tag_list_rwsem);
> > > 
> > > 
> > > You mentioned reader may not be waken up in case of r/w mixed waiters, but I
> > > don't see it is possible.
> > 
> > I don't know if concurrent calls to blk_mq_del_queue_tag_set() is possible
> 
> Both blk_mq_del_queue_tag_set() and blk_mq_add_queue_tag_set() can be
> called concurrently with same `struct blk_mq_tag_set *` and different
> 'struct request_queue *'.
> 
> > or not. There is also the blk_mq_update_nr_hw_queues() function that will
> > acquire the write lock.
> 
> blk_mq_update_nr_hw_queues() doesn't add/remove queue to set->tag_list, so
> I guess read lock may be enough.

The reason blk_mq_update_nr_hw_queues() takes write lock is that it can
change q->hctx_table. We do not want to miss hcxt in queue_set_hctx_shared()
called from blk_mq_{add,del}_queue_tag_set().

> 
> Also blk_mq_update_nr_hw_queues() can be run concurrently with the above
> two helpers.
> 
> 
> Thanks,
> Ming
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-17 20:23 ` [PATCH v2 1/1] " Mohamed Khalfella
  2025-11-18  1:34   ` Hillf Danton
  2025-11-18  2:00   ` Ming Lei
@ 2025-11-24  4:00   ` Ming Lei
  2025-11-24 17:42     ` Mohamed Khalfella
  2025-11-30 22:56   ` Sagi Grimberg
  3 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2025-11-24  4:00 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> tagset, the functions make sure that tagset and queues are marked as
> shared when two or more queues are attached to the same tagset.
> Initially a tagset starts as unshared and when the number of added
> queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> with all the queues attached to it. When the number of attached queues
> drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> the remaining queues as unshared.
> 
> Both functions need to freeze current queues in tagset before setting on
> unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> hold set->tag_list_lock mutex, which makes sense as we do not want
> queues to be added or deleted in the process. This used to work fine
> until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> made the nvme driver quiesce tagset instead of quiscing individual
> queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> set->tag_list while holding set->tag_list_lock also.
> 
> This results in deadlock between two threads with these stacktraces:
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   schedule_preempt_disabled+0x11/0x20
>   __mutex_lock.constprop.0+0x3cc/0x760
>   blk_mq_quiesce_tagset+0x26/0xd0
>   nvme_dev_disable_locked+0x77/0x280 [nvme]
>   nvme_timeout+0x268/0x320 [nvme]
>   blk_mq_handle_expired+0x5d/0x90
>   bt_iter+0x7e/0x90
>   blk_mq_queue_tag_busy_iter+0x2b2/0x590
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   ? __blk_mq_complete_request_remote+0x10/0x10
>   blk_mq_timeout_work+0x15b/0x1a0
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
>   __schedule+0x48e/0xed0
>   schedule+0x5a/0xc0
>   blk_mq_freeze_queue_wait+0x62/0x90
>   ? destroy_sched_domains_rcu+0x30/0x30
>   blk_mq_exit_queue+0x151/0x180
>   disk_release+0xe3/0xf0
>   device_release+0x31/0x90
>   kobject_put+0x6d/0x180
>   nvme_scan_ns+0x858/0xc90 [nvme_core]
>   ? nvme_scan_work+0x281/0x560 [nvme_core]
>   nvme_scan_work+0x281/0x560 [nvme_core]
>   process_one_work+0x133/0x2f0
>   ? mod_delayed_work_on+0x90/0x90
>   worker_thread+0x2ec/0x400
>   ? mod_delayed_work_on+0x90/0x90
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x2d/0x50
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork_asm+0x11/0x20
> 
> The top stacktrace is showing nvme_timeout() called to handle nvme
> command timeout. timeout handler is trying to disable the controller and
> as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> to call queue callback handlers. The thread is stuck waiting for
> set->tag_list_lock as it tires to walk the queues in set->tag_list.
> 
> The lock is held by the second thread in the bottom stack which is
> waiting for one of queues to be frozen. The queue usage counter will
> drop to zero after nvme_timeout() finishes, and this will not happen
> because the thread will wait for this mutex forever.
> 
> Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> semaphore for read since this is enough to guarantee no queues will be
> added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> semaphore for write while updating set->tag_list and downgrade it to
> read while freezing the queues. It should be safe to update set->flags
> and hctx->flags while holding the semaphore for read since the queues
> are already frozen.
> 
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-24  4:00   ` Ming Lei
@ 2025-11-24 17:42     ` Mohamed Khalfella
  2025-12-01  1:36       ` Hillf Danton
  0 siblings, 1 reply; 21+ messages in thread
From: Mohamed Khalfella @ 2025-11-24 17:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	linux-nvme, linux-block, linux-kernel

On Mon 2025-11-24 12:00:15 +0800, Ming Lei wrote:
> On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > tagset, the functions make sure that tagset and queues are marked as
> > shared when two or more queues are attached to the same tagset.
> > Initially a tagset starts as unshared and when the number of added
> > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > with all the queues attached to it. When the number of attached queues
> > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > the remaining queues as unshared.
> > 
> > Both functions need to freeze current queues in tagset before setting on
> > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > hold set->tag_list_lock mutex, which makes sense as we do not want
> > queues to be added or deleted in the process. This used to work fine
> > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > made the nvme driver quiesce tagset instead of quiscing individual
> > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > set->tag_list while holding set->tag_list_lock also.
> > 
> > This results in deadlock between two threads with these stacktraces:
> > 
> >   __schedule+0x48e/0xed0
> >   schedule+0x5a/0xc0
> >   schedule_preempt_disabled+0x11/0x20
> >   __mutex_lock.constprop.0+0x3cc/0x760
> >   blk_mq_quiesce_tagset+0x26/0xd0
> >   nvme_dev_disable_locked+0x77/0x280 [nvme]
> >   nvme_timeout+0x268/0x320 [nvme]
> >   blk_mq_handle_expired+0x5d/0x90
> >   bt_iter+0x7e/0x90
> >   blk_mq_queue_tag_busy_iter+0x2b2/0x590
> >   ? __blk_mq_complete_request_remote+0x10/0x10
> >   ? __blk_mq_complete_request_remote+0x10/0x10
> >   blk_mq_timeout_work+0x15b/0x1a0
> >   process_one_work+0x133/0x2f0
> >   ? mod_delayed_work_on+0x90/0x90
> >   worker_thread+0x2ec/0x400
> >   ? mod_delayed_work_on+0x90/0x90
> >   kthread+0xe2/0x110
> >   ? kthread_complete_and_exit+0x20/0x20
> >   ret_from_fork+0x2d/0x50
> >   ? kthread_complete_and_exit+0x20/0x20
> >   ret_from_fork_asm+0x11/0x20
> > 
> >   __schedule+0x48e/0xed0
> >   schedule+0x5a/0xc0
> >   blk_mq_freeze_queue_wait+0x62/0x90
> >   ? destroy_sched_domains_rcu+0x30/0x30
> >   blk_mq_exit_queue+0x151/0x180
> >   disk_release+0xe3/0xf0
> >   device_release+0x31/0x90
> >   kobject_put+0x6d/0x180
> >   nvme_scan_ns+0x858/0xc90 [nvme_core]
> >   ? nvme_scan_work+0x281/0x560 [nvme_core]
> >   nvme_scan_work+0x281/0x560 [nvme_core]
> >   process_one_work+0x133/0x2f0
> >   ? mod_delayed_work_on+0x90/0x90
> >   worker_thread+0x2ec/0x400
> >   ? mod_delayed_work_on+0x90/0x90
> >   kthread+0xe2/0x110
> >   ? kthread_complete_and_exit+0x20/0x20
> >   ret_from_fork+0x2d/0x50
> >   ? kthread_complete_and_exit+0x20/0x20
> >   ret_from_fork_asm+0x11/0x20
> > 
> > The top stacktrace is showing nvme_timeout() called to handle nvme
> > command timeout. timeout handler is trying to disable the controller and
> > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > to call queue callback handlers. The thread is stuck waiting for
> > set->tag_list_lock as it tires to walk the queues in set->tag_list.
> > 
> > The lock is held by the second thread in the bottom stack which is
> > waiting for one of queues to be frozen. The queue usage counter will
> > drop to zero after nvme_timeout() finishes, and this will not happen
> > because the thread will wait for this mutex forever.
> > 
> > Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> > avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> > semaphore for read since this is enough to guarantee no queues will be
> > added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> > semaphore for write while updating set->tag_list and downgrade it to
> > read while freezing the queues. It should be safe to update set->flags
> > and hctx->flags while holding the semaphore for read since the queues
> > are already frozen.
> > 
> > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> 

Sorry, I was supposed to reply to this thread eariler. The concern
raised about potential deadlock in set->tag_list_rwsem caused by writer
blocking readers makes this approach buggy. The way I understood it is
that rw_semaphore have this writer starvation prevention mechanism.
If a writer is waiting for the semaphore to be available then readers
that come after the waiting writer will not be able to take the semphore.
Even if it is available for reader. If we rely on the readers to do
something to make the semaphore available for the waiting writer then
this is a deadlock. This change relies on the reader to cancel inflight
requests so that queue usage counter drops to zero and queue is fully
frozen. Only then semphore will be available for the waiting writer.
This results in a deadlock between three threads.

To put it in another way blk_mq_del_queue_tag_set() downgrades the
semaphore and waits for the queue to be frozen. If another call to
blk_mq_del_queue_tag_set() happens from another thread, before
blk_mq_quiesce_tagset() comes in, it will cause a deadlock. The second
call to blk_mq_del_queue_tag_set() is a writer and it will wait
until the semaphore is available. blk_mq_quiesce_tagset() is a reader
that comes after a waiting writer. Eventhough the semaphore is available
for readers blk_mq_quiesce_tagset() will not be able to take it because
of the writer starvation prevention mechanism. The first thread that is
waiting for queue to be frozen in blk_mq_del_queue_tag_set() will not be
able to make progress because of inflight requests. The second writer
thread waiting for the semphore on blk_mq_del_queue_tag_set() will not
be able to make progress because the semaphore is not availble. The
thread calling blk_mq_quiesce_tagset() will not be able to make progress
because it is blocked behind the writer (second thread).

Commit 4e893ca81170 ("nvme_core: scan namespaces asynchronously") makes
this scenario more likely to happen. If a controller has a namespace
that is duplicate three times then it is possible to hit this deadlock.

I was thinking about use RCU to protect set->tag_list but never had a
chance to write the code and test it. I hope I will find time in the coming
few days.

Thanks,
Mohamed Khalfella

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-17 20:23 ` [PATCH v2 1/1] " Mohamed Khalfella
                     ` (2 preceding siblings ...)
  2025-11-24  4:00   ` Ming Lei
@ 2025-11-30 22:56   ` Sagi Grimberg
  3 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2025-11-30 22:56 UTC (permalink / raw)
  To: Mohamed Khalfella, Jens Axboe, Keith Busch, Chaitanya Kulkarni
  Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke,
	Ming Lei, linux-nvme, linux-block, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-11-24 17:42     ` Mohamed Khalfella
@ 2025-12-01  1:36       ` Hillf Danton
  2025-12-01  2:50         ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Hillf Danton @ 2025-12-01  1:36 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Jens Axboe, Keith Busch, Ming Lei, linux-nvme, linux-block,
	linux-kernel

On Mon, 24 Nov 2025 09:42:11 -0800 Mohamed Khalfella wrote:
> On Mon 2025-11-24 12:00:15 +0800, Ming Lei wrote:
> > On Mon, Nov 17, 2025 at 12:23:53PM -0800, Mohamed Khalfella wrote:
> > > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > > tagset, the functions make sure that tagset and queues are marked as
> > > shared when two or more queues are attached to the same tagset.
> > > Initially a tagset starts as unshared and when the number of added
> > > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > > with all the queues attached to it. When the number of attached queues
> > > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > > the remaining queues as unshared.
> > > 
> > > Both functions need to freeze current queues in tagset before setting on
> > > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > > hold set->tag_list_lock mutex, which makes sense as we do not want
> > > queues to be added or deleted in the process. This used to work fine
> > > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > made the nvme driver quiesce tagset instead of quiscing individual
> > > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > > set->tag_list while holding set->tag_list_lock also.
> > > 
> > > This results in deadlock between two threads with these stacktraces:
> > > 
> > >   __schedule+0x48e/0xed0
> > >   schedule+0x5a/0xc0
> > >   schedule_preempt_disabled+0x11/0x20
> > >   __mutex_lock.constprop.0+0x3cc/0x760
> > >   blk_mq_quiesce_tagset+0x26/0xd0
> > >   nvme_dev_disable_locked+0x77/0x280 [nvme]
> > >   nvme_timeout+0x268/0x320 [nvme]
> > >   blk_mq_handle_expired+0x5d/0x90
> > >   bt_iter+0x7e/0x90
> > >   blk_mq_queue_tag_busy_iter+0x2b2/0x590
> > >   ? __blk_mq_complete_request_remote+0x10/0x10
> > >   ? __blk_mq_complete_request_remote+0x10/0x10
> > >   blk_mq_timeout_work+0x15b/0x1a0
> > >   process_one_work+0x133/0x2f0
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   worker_thread+0x2ec/0x400
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   kthread+0xe2/0x110
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork+0x2d/0x50
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork_asm+0x11/0x20
> > > 
> > >   __schedule+0x48e/0xed0
> > >   schedule+0x5a/0xc0
> > >   blk_mq_freeze_queue_wait+0x62/0x90
> > >   ? destroy_sched_domains_rcu+0x30/0x30
> > >   blk_mq_exit_queue+0x151/0x180
> > >   disk_release+0xe3/0xf0
> > >   device_release+0x31/0x90
> > >   kobject_put+0x6d/0x180
> > >   nvme_scan_ns+0x858/0xc90 [nvme_core]
> > >   ? nvme_scan_work+0x281/0x560 [nvme_core]
> > >   nvme_scan_work+0x281/0x560 [nvme_core]
> > >   process_one_work+0x133/0x2f0
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   worker_thread+0x2ec/0x400
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   kthread+0xe2/0x110
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork+0x2d/0x50
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork_asm+0x11/0x20
> > > 
> > > The top stacktrace is showing nvme_timeout() called to handle nvme
> > > command timeout. timeout handler is trying to disable the controller and
> > > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > > to call queue callback handlers. The thread is stuck waiting for
> > > set->tag_list_lock as it tires to walk the queues in set->tag_list.
> > > 
> > > The lock is held by the second thread in the bottom stack which is
> > > waiting for one of queues to be frozen. The queue usage counter will
> > > drop to zero after nvme_timeout() finishes, and this will not happen
> > > because the thread will wait for this mutex forever.
> > > 
> > > Convert set->tag_list_lock mutex to set->tag_list_rwsem rwsemaphore to
> > > avoid the deadlock. Update blk_mq_[un]quiesce_tagset() to take the
> > > semaphore for read since this is enough to guarantee no queues will be
> > > added or removed. Update blk_mq_{add,del}_queue_tag_set() to take the
> > > semaphore for write while updating set->tag_list and downgrade it to
> > > read while freezing the queues. It should be safe to update set->flags
> > > and hctx->flags while holding the semaphore for read since the queues
> > > are already frozen.
> > > 
> > > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > 
> > Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > 
> 
> Sorry, I was supposed to reply to this thread eariler. The concern
> raised about potential deadlock in set->tag_list_rwsem caused by writer
> blocking readers makes this approach buggy. The way I understood it is
> that rw_semaphore have this writer starvation prevention mechanism.
> If a writer is waiting for the semaphore to be available then readers
> that come after the waiting writer will not be able to take the semphore.
> Even if it is available for reader. If we rely on the readers to do
> something to make the semaphore available for the waiting writer then
> this is a deadlock. This change relies on the reader to cancel inflight
> requests so that queue usage counter drops to zero and queue is fully
> frozen. Only then semphore will be available for the waiting writer.
> This results in a deadlock between three threads.
> 
Sounds like you know more about rwsem than the reviewer now.

> To put it in another way blk_mq_del_queue_tag_set() downgrades the
> semaphore and waits for the queue to be frozen. If another call to
> blk_mq_del_queue_tag_set() happens from another thread, before
> blk_mq_quiesce_tagset() comes in, it will cause a deadlock. The second
> call to blk_mq_del_queue_tag_set() is a writer and it will wait
> until the semaphore is available. blk_mq_quiesce_tagset() is a reader
> that comes after a waiting writer. Eventhough the semaphore is available
> for readers blk_mq_quiesce_tagset() will not be able to take it because
> of the writer starvation prevention mechanism. The first thread that is
> waiting for queue to be frozen in blk_mq_del_queue_tag_set() will not be
> able to make progress because of inflight requests. The second writer
> thread waiting for the semphore on blk_mq_del_queue_tag_set() will not
> be able to make progress because the semaphore is not availble. The
> thread calling blk_mq_quiesce_tagset() will not be able to make progress
> because it is blocked behind the writer (second thread).
> 
> Commit 4e893ca81170 ("nvme_core: scan namespaces asynchronously") makes
> this scenario more likely to happen. If a controller has a namespace
> that is duplicate three times then it is possible to hit this deadlock.
> 
> I was thinking about use RCU to protect set->tag_list but never had a
> chance to write the code and test it. I hope I will find time in the coming
> few days.
> 
Break a leg.

	Hillf

> Thanks,
> Mohamed Khalfella

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
  2025-12-01  1:36       ` Hillf Danton
@ 2025-12-01  2:50         ` Jens Axboe
  0 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2025-12-01  2:50 UTC (permalink / raw)
  To: Hillf Danton, Mohamed Khalfella
  Cc: Keith Busch, Ming Lei, linux-nvme, linux-block, linux-kernel

Hillf, if you have nothing constructive to add, please just keep quiet.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-12-01  2:50 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-17 20:23 [PATCH v2 0/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock Mohamed Khalfella
2025-11-17 20:23 ` [PATCH v2 1/1] " Mohamed Khalfella
2025-11-18  1:34   ` Hillf Danton
2025-11-18  2:07     ` Mohamed Khalfella
2025-11-18  2:24     ` Waiman Long
2025-11-18  3:08       ` Ming Lei
2025-11-18  3:42         ` Waiman Long
2025-11-18  4:03           ` Ming Lei
     [not found]             ` <702d904c-2d9a-42b4-95b3-0fa43d91e673@redhat.com>
2025-11-18  4:50               ` Ming Lei
2025-11-18 17:59                 ` Mohamed Khalfella
2025-11-18  5:02               ` Mohamed Khalfella
2025-11-18  2:00   ` Ming Lei
2025-11-18  2:15     ` Mohamed Khalfella
2025-11-18  2:30       ` Ming Lei
2025-11-18  3:44         ` Mohamed Khalfella
2025-11-18  4:16           ` Ming Lei
2025-11-24  4:00   ` Ming Lei
2025-11-24 17:42     ` Mohamed Khalfella
2025-12-01  1:36       ` Hillf Danton
2025-12-01  2:50         ` Jens Axboe
2025-11-30 22:56   ` Sagi Grimberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).