linux-block.vger.kernel.org archive mirror
* [PATCH 0/4] blk-rq-qos: fix possible deadlock
@ 2025-10-14  2:21 Yu Kuai
  2025-10-14  2:21 ` [PATCH 1/4] blk-mq-debugfs: warn about " Yu Kuai
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  2:21 UTC (permalink / raw)
  To: nilay, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun, johnny.chenyi

Currently rq-qos debugfs entries are created from rq_qos_add(), while
rq_qos_add() requires the queue to be frozen. This can deadlock because
creating new entries can trigger fs reclaim.
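
For illustration, a simplified sketch of the problematic chain (the
reclaim leg is abbreviated from the lockdep splat later in this thread;
exact call sites may differ):

	rq_qos_add()
		blk_mq_freeze_queue(q)		/* queue is now frozen */
		...
		blk_mq_debugfs_register_rqos(rqos)
			debugfs_create_dir()/debugfs_create_file()
				alloc_inode()	/* allocation may enter
						 * fs reclaim, which can
						 * wait on I/O that needs
						 * the frozen queue */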

Fix this problem by delaying the creation of rq-qos debugfs entries
until the policy's initialization is complete.

- For wbt, it can be initialized by default or by blk-sysfs; fix it by
  delaying registration until after wbt_init();
- For other policies, they can only be initialized by blkg configuration;
  fix it by delaying registration to blkg_conf_end() (see the sketch
  below).
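
Conceptually the registration just moves out of the frozen section; a
minimal sketch of the change (the full version is in patch 3):

	Before:
		rq_qos_add()
			/* queue frozen */
			blk_mq_debugfs_register_rqos(rqos);

	After:
		blkg_conf_end() / the wbt paths, with the queue unfrozen:
			mutex_lock(&q->debugfs_mutex);
			blk_mq_debugfs_register_rq_qos(q);
			mutex_unlock(&q->debugfs_mutex);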

Note that this set is built on top of my other thread:
https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/

The deadlock can be reproduced with the above thread applied, by running
blktests throtl/001 with wbt enabled by default, though the deadlock
itself is a long-standing problem.

BTW, it's probably better to fix this problem before the big refactor
above; I'm sending this set anyway to find out whether people agree
with the solution.

Yu Kuai (4):
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: factor out a helper blk_mq_debugfs_register_rq_qos()
  blk-rq-qos: fix possible deadlock
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static

 block/blk-cgroup.c     |  6 +++++
 block/blk-mq-debugfs.c | 58 ++++++++++++++++++++++++++++++------------
 block/blk-mq-debugfs.h |  4 +--
 block/blk-rq-qos.c     |  7 -----
 block/blk-sysfs.c      |  4 +++
 block/blk-wbt.c        |  7 ++++-
 6 files changed, 60 insertions(+), 26 deletions(-)

-- 
2.39.2



* [PATCH 1/4] blk-mq-debugfs: warn about possible deadlock
  2025-10-14  2:21 [PATCH 0/4] blk-rq-qos: fix possible deadlock Yu Kuai
@ 2025-10-14  2:21 ` Yu Kuai
  2025-10-14  8:06   ` Ming Lei
  2025-10-14  2:21 ` [PATCH 2/4] blk-mq-debugfs: factor out a helper blk_mq_debugfs_register_rq_qos() Yu Kuai
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  2:21 UTC (permalink / raw)
  To: nilay, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun, johnny.chenyi

Creating new debugfs entries can trigger fs reclaim, hence we can't do
this with the queue frozen; likewise, other locks that can be held while
the queue is frozen must not be held here either.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4896525b1c05..66864ed0b77f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -608,9 +608,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
 	{},
 };
 
-static void debugfs_create_files(struct dentry *parent, void *data,
+static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
+				 void *data,
 				 const struct blk_mq_debugfs_attr *attr)
 {
+	/*
+	 * Creating new debugfs entries with the queue frozen has the risk
+	 * of deadlock.
+	 */
+	WARN_ON_ONCE(q->mq_freeze_depth != 0);
+	/*
+	 * debugfs_mutex should not be nested under other locks that can be
+	 * grabbed while the queue is frozen.
+	 */
+	lockdep_assert_not_held(&q->elevator_lock);
+	lockdep_assert_not_held(&q->rq_qos_mutex);
+	lockdep_assert_not_held(&q->blkcg_mutex);
+
 	if (IS_ERR_OR_NULL(parent))
 		return;
 
@@ -624,7 +638,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
+	debugfs_create_files(q, q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->debugfs_dir)
@@ -650,7 +664,8 @@ static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
 	snprintf(name, sizeof(name), "cpu%u", ctx->cpu);
 	ctx_dir = debugfs_create_dir(name, hctx->debugfs_dir);
 
-	debugfs_create_files(ctx_dir, ctx, blk_mq_debugfs_ctx_attrs);
+	debugfs_create_files(hctx->queue, ctx_dir, ctx,
+			     blk_mq_debugfs_ctx_attrs);
 }
 
 void blk_mq_debugfs_register_hctx(struct request_queue *q,
@@ -666,7 +681,8 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
 	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
 
-	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
+	debugfs_create_files(q, hctx->debugfs_dir, hctx,
+			     blk_mq_debugfs_hctx_attrs);
 
 	hctx_for_each_ctx(hctx, ctx, i)
 		blk_mq_debugfs_register_ctx(hctx, ctx);
@@ -717,7 +733,7 @@ void blk_mq_debugfs_register_sched(struct request_queue *q)
 
 	q->sched_debugfs_dir = debugfs_create_dir("sched", q->debugfs_dir);
 
-	debugfs_create_files(q->sched_debugfs_dir, q, e->queue_debugfs_attrs);
+	debugfs_create_files(q, q->sched_debugfs_dir, q, e->queue_debugfs_attrs);
 }
 
 void blk_mq_debugfs_unregister_sched(struct request_queue *q)
@@ -766,7 +782,8 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 							 q->debugfs_dir);
 
 	rqos->debugfs_dir = debugfs_create_dir(dir_name, q->rqos_debugfs_dir);
-	debugfs_create_files(rqos->debugfs_dir, rqos, rqos->ops->debugfs_attrs);
+	debugfs_create_files(q, rqos->debugfs_dir, rqos,
+			     rqos->ops->debugfs_attrs);
 }
 
 void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
@@ -789,7 +806,7 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 
 	hctx->sched_debugfs_dir = debugfs_create_dir("sched",
 						     hctx->debugfs_dir);
-	debugfs_create_files(hctx->sched_debugfs_dir, hctx,
+	debugfs_create_files(q, hctx->sched_debugfs_dir, hctx,
 			     e->hctx_debugfs_attrs);
 }
 
-- 
2.39.2



* [PATCH 2/4] blk-mq-debugfs: factor out a helper blk_mq_debugfs_register_rq_qos()
  2025-10-14  2:21 [PATCH 0/4] blk-rq-qos: fix possible deadlock Yu Kuai
  2025-10-14  2:21 ` [PATCH 1/4] blk-mq-debugfs: warn about " Yu Kuai
@ 2025-10-14  2:21 ` Yu Kuai
  2025-10-14  2:21 ` [PATCH 3/4] blk-rq-qos: fix possible deadlock Yu Kuai
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  2:21 UTC (permalink / raw)
  To: nilay, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun, johnny.chenyi

Prepare to fix a possible deadlock caused by creating blk-mq debugfs
entries while the queue is still frozen.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c | 23 +++++++++++++++--------
 block/blk-mq-debugfs.h |  5 +++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 66864ed0b77f..1de9bab7ba80 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -633,6 +633,20 @@ static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
 				    (void *)attr, data, &blk_mq_debugfs_fops);
 }
 
+void blk_mq_debugfs_register_rq_qos(struct request_queue *q)
+{
+	lockdep_assert_held(&q->debugfs_mutex);
+
+	if (q->rq_qos) {
+		struct rq_qos *rqos = q->rq_qos;
+
+		while (rqos) {
+			blk_mq_debugfs_register_rqos(rqos);
+			rqos = rqos->next;
+		}
+	}
+}
+
 void blk_mq_debugfs_register(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -645,14 +659,7 @@ void blk_mq_debugfs_register(struct request_queue *q)
 			blk_mq_debugfs_register_hctx(q, hctx);
 	}
 
-	if (q->rq_qos) {
-		struct rq_qos *rqos = q->rq_qos;
-
-		while (rqos) {
-			blk_mq_debugfs_register_rqos(rqos);
-			rqos = rqos->next;
-		}
-	}
+	blk_mq_debugfs_register_rq_qos(q);
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index c80e453e3014..9f76603792fe 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -34,6 +34,7 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
+void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos);
 #else
 static inline void blk_mq_debugfs_register(struct request_queue *q)
@@ -78,6 +79,10 @@ static inline void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 {
 }
 
+static inline void blk_mq_debugfs_register_rq_qos(struct request_queue *q)
+{
+}
+
 static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 {
 }
-- 
2.39.2



* [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  2:21 [PATCH 0/4] blk-rq-qos: fix possible deadlock Yu Kuai
  2025-10-14  2:21 ` [PATCH 1/4] blk-mq-debugfs: warn about " Yu Kuai
  2025-10-14  2:21 ` [PATCH 2/4] blk-mq-debugfs: factor out a helper blk_mq_debugfs_register_rq_qos() Yu Kuai
@ 2025-10-14  2:21 ` Yu Kuai
  2025-10-14  8:13   ` Ming Lei
  2025-10-14  2:21 ` [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static Yu Kuai
  2025-10-14 10:58 ` [PATCH 0/4] blk-rq-qos: fix possible deadlock Nilay Shroff
  4 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  2:21 UTC (permalink / raw)
  To: nilay, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun, johnny.chenyi

Currently rq-qos debugfs entries are created from rq_qos_add(), while
rq_qos_add() requires the queue to be frozen. This can deadlock because
creating new entries can trigger fs reclaim.

Fix this problem by delaying the creation of rq-qos debugfs entries
until the policy's initialization is complete.

- For wbt, it can be initialized by default or by blk-sysfs; fix it by
  calling blk_mq_debugfs_register_rq_qos() after wbt_init();
- For other policies, they can only be initialized by blkg configuration;
  fix it by calling blk_mq_debugfs_register_rq_qos() from
  blkg_conf_end().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-cgroup.c | 6 ++++++
 block/blk-rq-qos.c | 7 -------
 block/blk-sysfs.c  | 4 ++++
 block/blk-wbt.c    | 7 ++++++-
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d93654334854..e4ccabf132c0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -33,6 +33,7 @@
 #include "blk-cgroup.h"
 #include "blk-ioprio.h"
 #include "blk-throttle.h"
+#include "blk-mq-debugfs.h"
 
 static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
 
@@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, ctx->memflags);
 	blkdev_put_no_open(ctx->bdev);
+
+	mutex_lock(&q->debugfs_mutex);
+	blk_mq_debugfs_register_rq_qos(q);
+	mutex_unlock(&q->debugfs_mutex);
+
 }
 EXPORT_SYMBOL_GPL(blkg_conf_end);
 
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 654478dfbc20..d7ce99ce2e80 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
 	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
 
 	blk_mq_unfreeze_queue(q, memflags);
-
-	if (rqos->ops->debugfs_attrs) {
-		mutex_lock(&q->debugfs_mutex);
-		blk_mq_debugfs_register_rqos(rqos);
-		mutex_unlock(&q->debugfs_mutex);
-	}
-
 	return 0;
 ebusy:
 	blk_mq_unfreeze_queue(q, memflags);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 76c47fe9b8d6..52bb4db25cf5 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 	mutex_unlock(&disk->rqos_state_mutex);
 
 	blk_mq_unquiesce_queue(q);
+
+	mutex_lock(&q->debugfs_mutex);
+	blk_mq_debugfs_register_rq_qos(q);
+	mutex_unlock(&q->debugfs_mutex);
 out:
 	blk_mq_unfreeze_queue(q, memflags);
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index eb8037bae0bd..a120b5ba54db 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
 	if (!blk_queue_registered(q))
 		return;
 
-	if (queue_is_mq(q) && enable)
+	if (queue_is_mq(q) && enable) {
 		wbt_init(disk);
+
+		mutex_lock(&q->debugfs_mutex);
+		blk_mq_debugfs_register_rq_qos(q);
+		mutex_unlock(&q->debugfs_mutex);
+	}
 }
 EXPORT_SYMBOL_GPL(wbt_enable_default);
 
-- 
2.39.2



* [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  2025-10-14  2:21 [PATCH 0/4] blk-rq-qos: fix possible deadlock Yu Kuai
                   ` (2 preceding siblings ...)
  2025-10-14  2:21 ` [PATCH 3/4] blk-rq-qos: fix possible deadlock Yu Kuai
@ 2025-10-14  2:21 ` Yu Kuai
  2025-10-14  8:15   ` Ming Lei
  2025-10-14 10:58 ` [PATCH 0/4] blk-rq-qos: fix possible deadlock Nilay Shroff
  4 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  2:21 UTC (permalink / raw)
  To: nilay, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun, johnny.chenyi

Because it's only used inside blk-mq-debugfs.c now.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c | 4 +++-
 block/blk-mq-debugfs.h | 5 -----
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 1de9bab7ba80..919e484aa1b2 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -14,6 +14,8 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
 	return 0;
@@ -774,7 +776,7 @@ void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 	rqos->debugfs_dir = NULL;
 }
 
-void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
+static void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 {
 	struct request_queue *q = rqos->disk->queue;
 	const char *dir_name = rq_qos_id_to_name(rqos->id);
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 9f76603792fe..d94daa66556b 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -33,7 +33,6 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 				       struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
-void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
 void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos);
 #else
@@ -75,10 +74,6 @@ static inline void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hc
 {
 }
 
-static inline void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
-{
-}
-
 static inline void blk_mq_debugfs_register_rq_qos(struct request_queue *q)
 {
 }
-- 
2.39.2



* Re: [PATCH 1/4] blk-mq-debugfs: warn about possible deadlock
  2025-10-14  2:21 ` [PATCH 1/4] blk-mq-debugfs: warn about " Yu Kuai
@ 2025-10-14  8:06   ` Ming Lei
  2025-10-14  8:21     ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-10-14  8:06 UTC (permalink / raw)
  To: Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi

On Tue, Oct 14, 2025 at 10:21:46AM +0800, Yu Kuai wrote:
> Creating new debugfs entries can trigger fs reclaim, hence we can't do
> this with queue freezed, meanwhile, other locks that can be held while
> queue is freezed should not be held as well.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-debugfs.c | 31 ++++++++++++++++++++++++-------
>  1 file changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 4896525b1c05..66864ed0b77f 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -608,9 +608,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
>  	{},
>  };
>  
> -static void debugfs_create_files(struct dentry *parent, void *data,
> +static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
> +				 void *data,
>  				 const struct blk_mq_debugfs_attr *attr)
>  {
> +	/*
> +	 * Creating new debugfs entries with queue freezed has the rist of
> +	 * deadlock.
> +	 */
> +	WARN_ON_ONCE(q->mq_freeze_depth != 0);
> +	/*
> +	 * debugfs_mutex should not be nested under other locks that can be
> +	 * grabbed while queue is freezed.
> +	 */
> +	lockdep_assert_not_held(&q->elevator_lock);
> +	lockdep_assert_not_held(&q->rq_qos_mutex);

->rq_qos_mutex usage looks like a real mess: in blk-cgroup.c it is
grabbed after the queue is frozen, while inside block/blk-rq-qos.c the
two are ordered the other way around. Maybe we need to fix the order
between queue freezing and q->rq_qos_mutex first? Or move on by
removing the above line?
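
For reference, a minimal sketch of the two conflicting orders described
above (abbreviated; the real call sites take more locks than this):

	/* blk-cgroup.c: queue is frozen first */
	memflags = blk_mq_freeze_queue(q);
	mutex_lock(&q->rq_qos_mutex);

	/* block/blk-rq-qos.c: ->rq_qos_mutex is taken first */
	mutex_lock(&q->rq_qos_mutex);
	memflags = blk_mq_freeze_queue(q);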

Otherwise, this patch looks good.


Thanks,
Ming



* Re: [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  2:21 ` [PATCH 3/4] blk-rq-qos: fix possible deadlock Yu Kuai
@ 2025-10-14  8:13   ` Ming Lei
  2025-10-14  8:24     ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-10-14  8:13 UTC (permalink / raw)
  To: Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi

On Tue, Oct 14, 2025 at 10:21:48AM +0800, Yu Kuai wrote:
> Currently rq-qos debugfs entries is created from rq_qos_add(), while
> rq_qos_add() requires queue to be freezed. This can deadlock because
> creating new entries can trigger fs reclaim.
> 
> Fix this problem by delaying creating rq-qos debugfs entries until
> it's initialization is complete.
> 
> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>   calling blk_mq_debugfs_register_rq_qos() after wbt_init;
> - For other policies, they can only be initialized by blkg configuration,
>   fix it by calling blk_mq_debugfs_register_rq_qos() from
>   blkg_conf_end();
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-cgroup.c | 6 ++++++
>  block/blk-rq-qos.c | 7 -------
>  block/blk-sysfs.c  | 4 ++++
>  block/blk-wbt.c    | 7 ++++++-
>  4 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d93654334854..e4ccabf132c0 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -33,6 +33,7 @@
>  #include "blk-cgroup.h"
>  #include "blk-ioprio.h"
>  #include "blk-throttle.h"
> +#include "blk-mq-debugfs.h"
>  
>  static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
>  
> @@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
>  	mutex_unlock(&q->elevator_lock);
>  	blk_mq_unfreeze_queue(q, ctx->memflags);
>  	blkdev_put_no_open(ctx->bdev);
> +
> +	mutex_lock(&q->debugfs_mutex);
> +	blk_mq_debugfs_register_rq_qos(q);
> +	mutex_unlock(&q->debugfs_mutex);
> +
>  }
>  EXPORT_SYMBOL_GPL(blkg_conf_end);
>  
> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> index 654478dfbc20..d7ce99ce2e80 100644
> --- a/block/blk-rq-qos.c
> +++ b/block/blk-rq-qos.c
> @@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
>  	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
>  
>  	blk_mq_unfreeze_queue(q, memflags);
> -
> -	if (rqos->ops->debugfs_attrs) {
> -		mutex_lock(&q->debugfs_mutex);
> -		blk_mq_debugfs_register_rqos(rqos);
> -		mutex_unlock(&q->debugfs_mutex);
> -	}
> -
>  	return 0;
>  ebusy:
>  	blk_mq_unfreeze_queue(q, memflags);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 76c47fe9b8d6..52bb4db25cf5 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
>  	mutex_unlock(&disk->rqos_state_mutex);
>  
>  	blk_mq_unquiesce_queue(q);
> +
> +	mutex_lock(&q->debugfs_mutex);
> +	blk_mq_debugfs_register_rq_qos(q);
> +	mutex_unlock(&q->debugfs_mutex);
>  out:
>  	blk_mq_unfreeze_queue(q, memflags);
>  
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index eb8037bae0bd..a120b5ba54db 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
>  	if (!blk_queue_registered(q))
>  		return;
>  
> -	if (queue_is_mq(q) && enable)
> +	if (queue_is_mq(q) && enable) {
>  		wbt_init(disk);
> +
> +		mutex_lock(&q->debugfs_mutex);
> +		blk_mq_debugfs_register_rq_qos(q);
> +		mutex_unlock(&q->debugfs_mutex);
> +	}

->debugfs_mutex alone may not be enough, because
blk_mq_debugfs_register_rq_qos() has to traverse the rq_qos singly
linked list; you may have to grab q->rq_qos_mutex to protect the list.
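
For reference, the traversal in question (factored out in patch 2) walks
the list pointers directly, so a concurrent rq_qos_add()/rq_qos_del()
could race with it:

	struct rq_qos *rqos = q->rq_qos;

	while (rqos) {
		blk_mq_debugfs_register_rqos(rqos);
		rqos = rqos->next;	/* ->next is updated by add/del */
	}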


Thanks,
Ming



* Re: [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  2025-10-14  2:21 ` [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static Yu Kuai
@ 2025-10-14  8:15   ` Ming Lei
  2025-10-14  8:26     ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-10-14  8:15 UTC (permalink / raw)
  To: Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi

On Tue, Oct 14, 2025 at 10:21:49AM +0800, Yu Kuai wrote:
> Because it's only used inside blk-mq-debugfs.c now.

If that is true, why do you export it in the 2nd patch?

Thanks, 
Ming



* Re: [PATCH 1/4] blk-mq-debugfs: warn about possible deadlock
  2025-10-14  8:06   ` Ming Lei
@ 2025-10-14  8:21     ` Yu Kuai
  2025-10-14  8:34       ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  8:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/10/14 16:06, Ming Lei wrote:
> On Tue, Oct 14, 2025 at 10:21:46AM +0800, Yu Kuai wrote:
>> Creating new debugfs entries can trigger fs reclaim, hence we can't do
>> this with queue freezed, meanwhile, other locks that can be held while
>> queue is freezed should not be held as well.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-mq-debugfs.c | 31 ++++++++++++++++++++++++-------
>>   1 file changed, 24 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index 4896525b1c05..66864ed0b77f 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -608,9 +608,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
>>   	{},
>>   };
>>   
>> -static void debugfs_create_files(struct dentry *parent, void *data,
>> +static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>> +				 void *data,
>>   				 const struct blk_mq_debugfs_attr *attr)
>>   {
>> +	/*
>> +	 * Creating new debugfs entries with queue freezed has the rist of
>> +	 * deadlock.
>> +	 */
>> +	WARN_ON_ONCE(q->mq_freeze_depth != 0);
>> +	/*
>> +	 * debugfs_mutex should not be nested under other locks that can be
>> +	 * grabbed while queue is freezed.
>> +	 */
>> +	lockdep_assert_not_held(&q->elevator_lock);
>> +	lockdep_assert_not_held(&q->rq_qos_mutex);
> 
> ->rq_qos_mutex use looks one real mess, in blk-cgroup.c, it is grabbed after
> queue is frozen. However, inside block/blk-rq-qos.c, the two are re-ordered,
> maybe we need to fix order between queue freeze and q->rq_qos_mutex first?
> Or move on by removing the above line?

Yeah, I see this reordering as well, and I tried to fix it in the other
thread for blkg configuration:

- the queue is frozen by the new helper blkg_conf_start(), and unfrozen
  after blkg_conf_end(); rq_qos_add() is now called between them.

And for wbt, there are two cases:
 - for blk-sysfs, the queue is already frozen before rq_qos_add() as well;
 - for wbt_enable_default(), this still looks problematic; we should fix
   the reordering separately.

Perhaps I should fix this simple problem first, and then rebase the
thread that converts queue_lock to blkcg_mutex?

Thanks,
Kuai
> 
> Otherwise, this patch looks good.
> 
> 
> Thanks,
> Ming
> 
> 
> .
> 



* Re: [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  8:13   ` Ming Lei
@ 2025-10-14  8:24     ` Yu Kuai
  2025-10-14  8:37       ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  8:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/10/14 16:13, Ming Lei wrote:
> On Tue, Oct 14, 2025 at 10:21:48AM +0800, Yu Kuai wrote:
>> Currently rq-qos debugfs entries is created from rq_qos_add(), while
>> rq_qos_add() requires queue to be freezed. This can deadlock because
>> creating new entries can trigger fs reclaim.
>>
>> Fix this problem by delaying creating rq-qos debugfs entries until
>> it's initialization is complete.
>>
>> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>>    calling blk_mq_debugfs_register_rq_qos() after wbt_init;
>> - For other policies, they can only be initialized by blkg configuration,
>>    fix it by calling blk_mq_debugfs_register_rq_qos() from
>>    blkg_conf_end();
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-cgroup.c | 6 ++++++
>>   block/blk-rq-qos.c | 7 -------
>>   block/blk-sysfs.c  | 4 ++++
>>   block/blk-wbt.c    | 7 ++++++-
>>   4 files changed, 16 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index d93654334854..e4ccabf132c0 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -33,6 +33,7 @@
>>   #include "blk-cgroup.h"
>>   #include "blk-ioprio.h"
>>   #include "blk-throttle.h"
>> +#include "blk-mq-debugfs.h"
>>   
>>   static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
>>   
>> @@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
>>   	mutex_unlock(&q->elevator_lock);
>>   	blk_mq_unfreeze_queue(q, ctx->memflags);
>>   	blkdev_put_no_open(ctx->bdev);
>> +
>> +	mutex_lock(&q->debugfs_mutex);
>> +	blk_mq_debugfs_register_rq_qos(q);
>> +	mutex_unlock(&q->debugfs_mutex);
>> +
>>   }
>>   EXPORT_SYMBOL_GPL(blkg_conf_end);
>>   
>> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
>> index 654478dfbc20..d7ce99ce2e80 100644
>> --- a/block/blk-rq-qos.c
>> +++ b/block/blk-rq-qos.c
>> @@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
>>   	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
>>   
>>   	blk_mq_unfreeze_queue(q, memflags);
>> -
>> -	if (rqos->ops->debugfs_attrs) {
>> -		mutex_lock(&q->debugfs_mutex);
>> -		blk_mq_debugfs_register_rqos(rqos);
>> -		mutex_unlock(&q->debugfs_mutex);
>> -	}
>> -
>>   	return 0;
>>   ebusy:
>>   	blk_mq_unfreeze_queue(q, memflags);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 76c47fe9b8d6..52bb4db25cf5 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
>>   	mutex_unlock(&disk->rqos_state_mutex);
>>   
>>   	blk_mq_unquiesce_queue(q);
>> +
>> +	mutex_lock(&q->debugfs_mutex);
>> +	blk_mq_debugfs_register_rq_qos(q);
>> +	mutex_unlock(&q->debugfs_mutex);
>>   out:
>>   	blk_mq_unfreeze_queue(q, memflags);
>>   
>> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
>> index eb8037bae0bd..a120b5ba54db 100644
>> --- a/block/blk-wbt.c
>> +++ b/block/blk-wbt.c
>> @@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
>>   	if (!blk_queue_registered(q))
>>   		return;
>>   
>> -	if (queue_is_mq(q) && enable)
>> +	if (queue_is_mq(q) && enable) {
>>   		wbt_init(disk);
>> +
>> +		mutex_lock(&q->debugfs_mutex);
>> +		blk_mq_debugfs_register_rq_qos(q);
>> +		mutex_unlock(&q->debugfs_mutex);
>> +	}
> 
> ->debugfs_mutex only may be not enough, because blk_mq_debugfs_register_rq_qos()
> has to traverse rq_qos single list list, you may have to grab q->rq_qos_mutex
> for protect the list.
> 

I think we can't grab rq_qos_mutex to create debugfs entries, right?
With respect to this, perhaps we can grab debugfs_mutex to protect
inserting into the rq_qos list instead?

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 



* Re: [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  2025-10-14  8:15   ` Ming Lei
@ 2025-10-14  8:26     ` Yu Kuai
  0 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  8:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/10/14 16:15, Ming Lei wrote:
> On Tue, Oct 14, 2025 at 10:21:49AM +0800, Yu Kuai wrote:
>> Because it's only used inside blk-mq-debugfs.c now.
> 
> If it is true, why do you export it in 2nd patch?

Patch 2 adds a different helper:
- blk_mq_debugfs_register_rqos() is the old one, registering a single rqos;
- blk_mq_debugfs_register_rq_qos() is the new one, iterating over all
  rqos and registering each one that is not registered yet.
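
For reference, the two prototypes from patch 2 side by side:

	/* old: register the debugfs entries of a single rqos */
	void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);

	/* new: walk q->rq_qos and register every policy that is not
	 * registered yet */
	void blk_mq_debugfs_register_rq_qos(struct request_queue *q);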

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 



* Re: [PATCH 1/4] blk-mq-debugfs: warn about possible deadlock
  2025-10-14  8:21     ` Yu Kuai
@ 2025-10-14  8:34       ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2025-10-14  8:34 UTC (permalink / raw)
  To: Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Tue, Oct 14, 2025 at 04:21:30PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/10/14 16:06, Ming Lei wrote:
> > On Tue, Oct 14, 2025 at 10:21:46AM +0800, Yu Kuai wrote:
> > > Creating new debugfs entries can trigger fs reclaim, hence we can't do
> > > this with queue freezed, meanwhile, other locks that can be held while
> > > queue is freezed should not be held as well.
> > > 
> > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > ---
> > >   block/blk-mq-debugfs.c | 31 ++++++++++++++++++++++++-------
> > >   1 file changed, 24 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > > index 4896525b1c05..66864ed0b77f 100644
> > > --- a/block/blk-mq-debugfs.c
> > > +++ b/block/blk-mq-debugfs.c
> > > @@ -608,9 +608,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
> > >   	{},
> > >   };
> > > -static void debugfs_create_files(struct dentry *parent, void *data,
> > > +static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
> > > +				 void *data,
> > >   				 const struct blk_mq_debugfs_attr *attr)
> > >   {
> > > +	/*
> > > +	 * Creating new debugfs entries with queue freezed has the rist of
> > > +	 * deadlock.
> > > +	 */
> > > +	WARN_ON_ONCE(q->mq_freeze_depth != 0);
> > > +	/*
> > > +	 * debugfs_mutex should not be nested under other locks that can be
> > > +	 * grabbed while queue is freezed.
> > > +	 */
> > > +	lockdep_assert_not_held(&q->elevator_lock);
> > > +	lockdep_assert_not_held(&q->rq_qos_mutex);
> > 
> > ->rq_qos_mutex use looks one real mess, in blk-cgroup.c, it is grabbed after
> > queue is frozen. However, inside block/blk-rq-qos.c, the two are re-ordered,
> > maybe we need to fix order between queue freeze and q->rq_qos_mutex first?
> > Or move on by removing the above line?
> 
> Yeah, I see this reoder as well, and I tried to fix this in the other
> thread for blkg configuration.
> 
> - queue is freezed by new helper blkg_conf_start(), and unfreezed after
>   blkg_conf_end(), rq_qos_add() is now called between them.
> 
> And for wbt, there are two cases:
>  - for blk-sysfs, queue is alredy freezed before rq_qos_add() as well;
>  - for wbt_enable_default(), this looks still problemaic, we should fix
>    the reorder seperatly.
> 
> Perhaps, should I fix this simple problem first, and then rebase the
> thread to convert queue_lock to blkcg_mtuex?

As I mentioned, if you want to move on with the patchset first, the line
`lockdep_assert_not_held(&q->rq_qos_mutex);` shouldn't be added until
the ->rq_qos_mutex vs. queue-freeze ordering is finalized.


Thanks,
Ming



* Re: [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  8:24     ` Yu Kuai
@ 2025-10-14  8:37       ` Ming Lei
  2025-10-14  8:42         ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-10-14  8:37 UTC (permalink / raw)
  To: Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Tue, Oct 14, 2025 at 04:24:23PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/10/14 16:13, Ming Lei wrote:
> > On Tue, Oct 14, 2025 at 10:21:48AM +0800, Yu Kuai wrote:
> > > Currently rq-qos debugfs entries is created from rq_qos_add(), while
> > > rq_qos_add() requires queue to be freezed. This can deadlock because
> > > creating new entries can trigger fs reclaim.
> > > 
> > > Fix this problem by delaying creating rq-qos debugfs entries until
> > > it's initialization is complete.
> > > 
> > > - For wbt, it can be initialized by default of by blk-sysfs, fix it by
> > >    calling blk_mq_debugfs_register_rq_qos() after wbt_init;
> > > - For other policies, they can only be initialized by blkg configuration,
> > >    fix it by calling blk_mq_debugfs_register_rq_qos() from
> > >    blkg_conf_end();
> > > 
> > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > ---
> > >   block/blk-cgroup.c | 6 ++++++
> > >   block/blk-rq-qos.c | 7 -------
> > >   block/blk-sysfs.c  | 4 ++++
> > >   block/blk-wbt.c    | 7 ++++++-
> > >   4 files changed, 16 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > index d93654334854..e4ccabf132c0 100644
> > > --- a/block/blk-cgroup.c
> > > +++ b/block/blk-cgroup.c
> > > @@ -33,6 +33,7 @@
> > >   #include "blk-cgroup.h"
> > >   #include "blk-ioprio.h"
> > >   #include "blk-throttle.h"
> > > +#include "blk-mq-debugfs.h"
> > >   static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> > > @@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
> > >   	mutex_unlock(&q->elevator_lock);
> > >   	blk_mq_unfreeze_queue(q, ctx->memflags);
> > >   	blkdev_put_no_open(ctx->bdev);
> > > +
> > > +	mutex_lock(&q->debugfs_mutex);
> > > +	blk_mq_debugfs_register_rq_qos(q);
> > > +	mutex_unlock(&q->debugfs_mutex);
> > > +
> > >   }
> > >   EXPORT_SYMBOL_GPL(blkg_conf_end);
> > > diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> > > index 654478dfbc20..d7ce99ce2e80 100644
> > > --- a/block/blk-rq-qos.c
> > > +++ b/block/blk-rq-qos.c
> > > @@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
> > >   	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
> > >   	blk_mq_unfreeze_queue(q, memflags);
> > > -
> > > -	if (rqos->ops->debugfs_attrs) {
> > > -		mutex_lock(&q->debugfs_mutex);
> > > -		blk_mq_debugfs_register_rqos(rqos);
> > > -		mutex_unlock(&q->debugfs_mutex);
> > > -	}
> > > -
> > >   	return 0;
> > >   ebusy:
> > >   	blk_mq_unfreeze_queue(q, memflags);
> > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > index 76c47fe9b8d6..52bb4db25cf5 100644
> > > --- a/block/blk-sysfs.c
> > > +++ b/block/blk-sysfs.c
> > > @@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
> > >   	mutex_unlock(&disk->rqos_state_mutex);
> > >   	blk_mq_unquiesce_queue(q);
> > > +
> > > +	mutex_lock(&q->debugfs_mutex);
> > > +	blk_mq_debugfs_register_rq_qos(q);
> > > +	mutex_unlock(&q->debugfs_mutex);
> > >   out:
> > >   	blk_mq_unfreeze_queue(q, memflags);
> > > diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> > > index eb8037bae0bd..a120b5ba54db 100644
> > > --- a/block/blk-wbt.c
> > > +++ b/block/blk-wbt.c
> > > @@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
> > >   	if (!blk_queue_registered(q))
> > >   		return;
> > > -	if (queue_is_mq(q) && enable)
> > > +	if (queue_is_mq(q) && enable) {
> > >   		wbt_init(disk);
> > > +
> > > +		mutex_lock(&q->debugfs_mutex);
> > > +		blk_mq_debugfs_register_rq_qos(q);
> > > +		mutex_unlock(&q->debugfs_mutex);
> > > +	}
> > 
> > ->debugfs_mutex only may be not enough, because blk_mq_debugfs_register_rq_qos()
> > has to traverse rq_qos single list list, you may have to grab q->rq_qos_mutex
> > for protect the list.
> > 
> 
> I think we can't grab rq_qos_mutex to create debugfs entries, right?

It depends on the finalized order between rq_qos_mutex and freezing queue.

> With the respect of this, perhaps we can grab debugfs_mutex to protect
> insering rq_qos list instead?

No, debugfs_mutex shouldn't protect rq_qos list, and rq_qos_mutex is
supposed to do the job at least from naming viewpoint.


Thanks,
Ming



* Re: [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  8:37       ` Ming Lei
@ 2025-10-14  8:42         ` Yu Kuai
  2025-10-14  8:55           ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  8:42 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/10/14 16:37, Ming Lei wrote:
> On Tue, Oct 14, 2025 at 04:24:23PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/10/14 16:13, Ming Lei wrote:
>>> On Tue, Oct 14, 2025 at 10:21:48AM +0800, Yu Kuai wrote:
>>>> Currently rq-qos debugfs entries is created from rq_qos_add(), while
>>>> rq_qos_add() requires queue to be freezed. This can deadlock because
>>>> creating new entries can trigger fs reclaim.
>>>>
>>>> Fix this problem by delaying creating rq-qos debugfs entries until
>>>> it's initialization is complete.
>>>>
>>>> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>>>>     calling blk_mq_debugfs_register_rq_qos() after wbt_init;
>>>> - For other policies, they can only be initialized by blkg configuration,
>>>>     fix it by calling blk_mq_debugfs_register_rq_qos() from
>>>>     blkg_conf_end();
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>    block/blk-cgroup.c | 6 ++++++
>>>>    block/blk-rq-qos.c | 7 -------
>>>>    block/blk-sysfs.c  | 4 ++++
>>>>    block/blk-wbt.c    | 7 ++++++-
>>>>    4 files changed, 16 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>>> index d93654334854..e4ccabf132c0 100644
>>>> --- a/block/blk-cgroup.c
>>>> +++ b/block/blk-cgroup.c
>>>> @@ -33,6 +33,7 @@
>>>>    #include "blk-cgroup.h"
>>>>    #include "blk-ioprio.h"
>>>>    #include "blk-throttle.h"
>>>> +#include "blk-mq-debugfs.h"
>>>>    static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
>>>> @@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
>>>>    	mutex_unlock(&q->elevator_lock);
>>>>    	blk_mq_unfreeze_queue(q, ctx->memflags);
>>>>    	blkdev_put_no_open(ctx->bdev);
>>>> +
>>>> +	mutex_lock(&q->debugfs_mutex);
>>>> +	blk_mq_debugfs_register_rq_qos(q);
>>>> +	mutex_unlock(&q->debugfs_mutex);
>>>> +
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(blkg_conf_end);
>>>> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
>>>> index 654478dfbc20..d7ce99ce2e80 100644
>>>> --- a/block/blk-rq-qos.c
>>>> +++ b/block/blk-rq-qos.c
>>>> @@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
>>>>    	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
>>>>    	blk_mq_unfreeze_queue(q, memflags);
>>>> -
>>>> -	if (rqos->ops->debugfs_attrs) {
>>>> -		mutex_lock(&q->debugfs_mutex);
>>>> -		blk_mq_debugfs_register_rqos(rqos);
>>>> -		mutex_unlock(&q->debugfs_mutex);
>>>> -	}
>>>> -
>>>>    	return 0;
>>>>    ebusy:
>>>>    	blk_mq_unfreeze_queue(q, memflags);
>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>> index 76c47fe9b8d6..52bb4db25cf5 100644
>>>> --- a/block/blk-sysfs.c
>>>> +++ b/block/blk-sysfs.c
>>>> @@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
>>>>    	mutex_unlock(&disk->rqos_state_mutex);
>>>>    	blk_mq_unquiesce_queue(q);
>>>> +
>>>> +	mutex_lock(&q->debugfs_mutex);
>>>> +	blk_mq_debugfs_register_rq_qos(q);
>>>> +	mutex_unlock(&q->debugfs_mutex);
>>>>    out:
>>>>    	blk_mq_unfreeze_queue(q, memflags);
>>>> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
>>>> index eb8037bae0bd..a120b5ba54db 100644
>>>> --- a/block/blk-wbt.c
>>>> +++ b/block/blk-wbt.c
>>>> @@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
>>>>    	if (!blk_queue_registered(q))
>>>>    		return;
>>>> -	if (queue_is_mq(q) && enable)
>>>> +	if (queue_is_mq(q) && enable) {
>>>>    		wbt_init(disk);
>>>> +
>>>> +		mutex_lock(&q->debugfs_mutex);
>>>> +		blk_mq_debugfs_register_rq_qos(q);
>>>> +		mutex_unlock(&q->debugfs_mutex);
>>>> +	}
>>>
>>> ->debugfs_mutex only may be not enough, because blk_mq_debugfs_register_rq_qos()
>>> has to traverse rq_qos single list list, you may have to grab q->rq_qos_mutex
>>> for protect the list.
>>>
>>
>> I think we can't grab rq_qos_mutex to create debugfs entries, right?
> 
> It depends on the finalized order between rq_qos_mutex and freezing queue.
> 
>> With the respect of this, perhaps we can grab debugfs_mutex to protect
>> insering rq_qos list instead?
> 
> No, debugfs_mutex shouldn't protect rq_qos list, and rq_qos_mutex is
> supposed to do the job at least from naming viewpoint.

Ok, then we'll have to make sure the order is rq_qos_mutex before
freezing queue, I was thinking the inverse order because of the helper
blkg_conf_open_bdev_frozen().

I'll check first if this is possible.

Thanks,
Kuai
> 
> 
> Thanks,
> Ming
> 
> 
> .
> 



* Re: [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  8:42         ` Yu Kuai
@ 2025-10-14  8:55           ` Ming Lei
  2025-10-14  9:03             ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-10-14  8:55 UTC (permalink / raw)
  To: Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Tue, Oct 14, 2025 at 04:42:30PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/10/14 16:37, Ming Lei wrote:
> > On Tue, Oct 14, 2025 at 04:24:23PM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > On 2025/10/14 16:13, Ming Lei wrote:
> > > > On Tue, Oct 14, 2025 at 10:21:48AM +0800, Yu Kuai wrote:
> > > > > Currently rq-qos debugfs entries is created from rq_qos_add(), while
> > > > > rq_qos_add() requires queue to be freezed. This can deadlock because
> > > > > creating new entries can trigger fs reclaim.
> > > > > 
> > > > > Fix this problem by delaying creating rq-qos debugfs entries until
> > > > > it's initialization is complete.
> > > > > 
> > > > > - For wbt, it can be initialized by default of by blk-sysfs, fix it by
> > > > >     calling blk_mq_debugfs_register_rq_qos() after wbt_init;
> > > > > - For other policies, they can only be initialized by blkg configuration,
> > > > >     fix it by calling blk_mq_debugfs_register_rq_qos() from
> > > > >     blkg_conf_end();
> > > > > 
> > > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > > > ---
> > > > >    block/blk-cgroup.c | 6 ++++++
> > > > >    block/blk-rq-qos.c | 7 -------
> > > > >    block/blk-sysfs.c  | 4 ++++
> > > > >    block/blk-wbt.c    | 7 ++++++-
> > > > >    4 files changed, 16 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > > > index d93654334854..e4ccabf132c0 100644
> > > > > --- a/block/blk-cgroup.c
> > > > > +++ b/block/blk-cgroup.c
> > > > > @@ -33,6 +33,7 @@
> > > > >    #include "blk-cgroup.h"
> > > > >    #include "blk-ioprio.h"
> > > > >    #include "blk-throttle.h"
> > > > > +#include "blk-mq-debugfs.h"
> > > > >    static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> > > > > @@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
> > > > >    	mutex_unlock(&q->elevator_lock);
> > > > >    	blk_mq_unfreeze_queue(q, ctx->memflags);
> > > > >    	blkdev_put_no_open(ctx->bdev);
> > > > > +
> > > > > +	mutex_lock(&q->debugfs_mutex);
> > > > > +	blk_mq_debugfs_register_rq_qos(q);
> > > > > +	mutex_unlock(&q->debugfs_mutex);
> > > > > +
> > > > >    }
> > > > >    EXPORT_SYMBOL_GPL(blkg_conf_end);
> > > > > diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> > > > > index 654478dfbc20..d7ce99ce2e80 100644
> > > > > --- a/block/blk-rq-qos.c
> > > > > +++ b/block/blk-rq-qos.c
> > > > > @@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
> > > > >    	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
> > > > >    	blk_mq_unfreeze_queue(q, memflags);
> > > > > -
> > > > > -	if (rqos->ops->debugfs_attrs) {
> > > > > -		mutex_lock(&q->debugfs_mutex);
> > > > > -		blk_mq_debugfs_register_rqos(rqos);
> > > > > -		mutex_unlock(&q->debugfs_mutex);
> > > > > -	}
> > > > > -
> > > > >    	return 0;
> > > > >    ebusy:
> > > > >    	blk_mq_unfreeze_queue(q, memflags);
> > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > > > index 76c47fe9b8d6..52bb4db25cf5 100644
> > > > > --- a/block/blk-sysfs.c
> > > > > +++ b/block/blk-sysfs.c
> > > > > @@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
> > > > >    	mutex_unlock(&disk->rqos_state_mutex);
> > > > >    	blk_mq_unquiesce_queue(q);
> > > > > +
> > > > > +	mutex_lock(&q->debugfs_mutex);
> > > > > +	blk_mq_debugfs_register_rq_qos(q);
> > > > > +	mutex_unlock(&q->debugfs_mutex);
> > > > >    out:
> > > > >    	blk_mq_unfreeze_queue(q, memflags);
> > > > > diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> > > > > index eb8037bae0bd..a120b5ba54db 100644
> > > > > --- a/block/blk-wbt.c
> > > > > +++ b/block/blk-wbt.c
> > > > > @@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
> > > > >    	if (!blk_queue_registered(q))
> > > > >    		return;
> > > > > -	if (queue_is_mq(q) && enable)
> > > > > +	if (queue_is_mq(q) && enable) {
> > > > >    		wbt_init(disk);
> > > > > +
> > > > > +		mutex_lock(&q->debugfs_mutex);
> > > > > +		blk_mq_debugfs_register_rq_qos(q);
> > > > > +		mutex_unlock(&q->debugfs_mutex);
> > > > > +	}
> > > > 
> > > > ->debugfs_mutex only may be not enough, because blk_mq_debugfs_register_rq_qos()
> > > > has to traverse rq_qos single list list, you may have to grab q->rq_qos_mutex
> > > > for protect the list.
> > > > 
> > > 
> > > I think we can't grab rq_qos_mutex to create debugfs entries, right?
> > 
> > It depends on the finalized order between rq_qos_mutex and freezing queue.
> > 
> > > With the respect of this, perhaps we can grab debugfs_mutex to protect
> > > insering rq_qos list instead?
> > 
> > No, debugfs_mutex shouldn't protect rq_qos list, and rq_qos_mutex is
> > supposed to do the job at least from naming viewpoint.
> 
> Ok, then we'll have to make sure the order is rq_qos_mutex before
> freezing queue, I was thinking the inverse order because of the helper
> blkg_conf_open_bdev_frozen().
> 
> I'll check first if this is possible.

You may have misunderstood my point: I meant `debugfs_mutex` can't be
used to protect the rq_qos list because of its name. The order between
rq_qos_mutex and queue freezing might be fine either way; it just has
to be fixed. I haven't looked into it yet.

Thanks,
Ming



* Re: [PATCH 3/4] blk-rq-qos: fix possible deadlock
  2025-10-14  8:55           ` Ming Lei
@ 2025-10-14  9:03             ` Yu Kuai
  0 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-10-14  9:03 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: nilay, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/10/14 16:55, Ming Lei wrote:
> On Tue, Oct 14, 2025 at 04:42:30PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/10/14 16:37, Ming Lei wrote:
>>> On Tue, Oct 14, 2025 at 04:24:23PM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2025/10/14 16:13, Ming Lei wrote:
>>>>> On Tue, Oct 14, 2025 at 10:21:48AM +0800, Yu Kuai wrote:
>>>>>> Currently rq-qos debugfs entries is created from rq_qos_add(), while
>>>>>> rq_qos_add() requires queue to be freezed. This can deadlock because
>>>>>> creating new entries can trigger fs reclaim.
>>>>>>
>>>>>> Fix this problem by delaying creating rq-qos debugfs entries until
>>>>>> it's initialization is complete.
>>>>>>
>>>>>> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>>>>>>      calling blk_mq_debugfs_register_rq_qos() after wbt_init;
>>>>>> - For other policies, they can only be initialized by blkg configuration,
>>>>>>      fix it by calling blk_mq_debugfs_register_rq_qos() from
>>>>>>      blkg_conf_end();
>>>>>>
>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>>> ---
>>>>>>     block/blk-cgroup.c | 6 ++++++
>>>>>>     block/blk-rq-qos.c | 7 -------
>>>>>>     block/blk-sysfs.c  | 4 ++++
>>>>>>     block/blk-wbt.c    | 7 ++++++-
>>>>>>     4 files changed, 16 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>>>>> index d93654334854..e4ccabf132c0 100644
>>>>>> --- a/block/blk-cgroup.c
>>>>>> +++ b/block/blk-cgroup.c
>>>>>> @@ -33,6 +33,7 @@
>>>>>>     #include "blk-cgroup.h"
>>>>>>     #include "blk-ioprio.h"
>>>>>>     #include "blk-throttle.h"
>>>>>> +#include "blk-mq-debugfs.h"
>>>>>>     static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
>>>>>> @@ -746,6 +747,11 @@ void blkg_conf_end(struct blkg_conf_ctx *ctx)
>>>>>>     	mutex_unlock(&q->elevator_lock);
>>>>>>     	blk_mq_unfreeze_queue(q, ctx->memflags);
>>>>>>     	blkdev_put_no_open(ctx->bdev);
>>>>>> +
>>>>>> +	mutex_lock(&q->debugfs_mutex);
>>>>>> +	blk_mq_debugfs_register_rq_qos(q);
>>>>>> +	mutex_unlock(&q->debugfs_mutex);
>>>>>> +
>>>>>>     }
>>>>>>     EXPORT_SYMBOL_GPL(blkg_conf_end);
>>>>>> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
>>>>>> index 654478dfbc20..d7ce99ce2e80 100644
>>>>>> --- a/block/blk-rq-qos.c
>>>>>> +++ b/block/blk-rq-qos.c
>>>>>> @@ -347,13 +347,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
>>>>>>     	blk_queue_flag_set(QUEUE_FLAG_QOS_ENABLED, q);
>>>>>>     	blk_mq_unfreeze_queue(q, memflags);
>>>>>> -
>>>>>> -	if (rqos->ops->debugfs_attrs) {
>>>>>> -		mutex_lock(&q->debugfs_mutex);
>>>>>> -		blk_mq_debugfs_register_rqos(rqos);
>>>>>> -		mutex_unlock(&q->debugfs_mutex);
>>>>>> -	}
>>>>>> -
>>>>>>     	return 0;
>>>>>>     ebusy:
>>>>>>     	blk_mq_unfreeze_queue(q, memflags);
>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>> index 76c47fe9b8d6..52bb4db25cf5 100644
>>>>>> --- a/block/blk-sysfs.c
>>>>>> +++ b/block/blk-sysfs.c
>>>>>> @@ -688,6 +688,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
>>>>>>     	mutex_unlock(&disk->rqos_state_mutex);
>>>>>>     	blk_mq_unquiesce_queue(q);
>>>>>> +
>>>>>> +	mutex_lock(&q->debugfs_mutex);
>>>>>> +	blk_mq_debugfs_register_rq_qos(q);
>>>>>> +	mutex_unlock(&q->debugfs_mutex);
>>>>>>     out:
>>>>>>     	blk_mq_unfreeze_queue(q, memflags);
>>>>>> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
>>>>>> index eb8037bae0bd..a120b5ba54db 100644
>>>>>> --- a/block/blk-wbt.c
>>>>>> +++ b/block/blk-wbt.c
>>>>>> @@ -724,8 +724,13 @@ void wbt_enable_default(struct gendisk *disk)
>>>>>>     	if (!blk_queue_registered(q))
>>>>>>     		return;
>>>>>> -	if (queue_is_mq(q) && enable)
>>>>>> +	if (queue_is_mq(q) && enable) {
>>>>>>     		wbt_init(disk);
>>>>>> +
>>>>>> +		mutex_lock(&q->debugfs_mutex);
>>>>>> +		blk_mq_debugfs_register_rq_qos(q);
>>>>>> +		mutex_unlock(&q->debugfs_mutex);
>>>>>> +	}
>>>>>
>>>>> ->debugfs_mutex only may be not enough, because blk_mq_debugfs_register_rq_qos()
>>>>> has to traverse rq_qos single list list, you may have to grab q->rq_qos_mutex
>>>>> for protect the list.
>>>>>
>>>>
>>>> I think we can't grab rq_qos_mutex to create debugfs entries, right?
>>>
>>> It depends on the finalized order between rq_qos_mutex and freezing queue.
>>>
>>>> With the respect of this, perhaps we can grab debugfs_mutex to protect
>>>> insering rq_qos list instead?
>>>
>>> No, debugfs_mutex shouldn't protect rq_qos list, and rq_qos_mutex is
>>> supposed to do the job at least from naming viewpoint.
>>
>> Ok, then we'll have to make sure the order is rq_qos_mutex before
>> freezing queue, I was thinking the inverse order because of the helper
>> blkg_conf_open_bdev_frozen().
>>
>> I'll check first if this is possible.
> 
> You may misunderstand my point, I meant `debugfs_mutex` can't be used for
> protecting rq_qos list because of its name. But order between rq_qos_mutex
> and freeze queue might be fine in either way, just it has to be fixed.
> Not look into it yet.

No misunderstanding :) I mean that if we want to fix this by delaying
the creation of debugfs entries until after the queue is unfrozen, and
we have to hold rq_qos_mutex for iterating the rqos list, then
rq_qos_mutex has to be held before freezing the queue.
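
Roughly, the ordering would then have to become (a sketch only, not
tested; lock names as in this series):

	mutex_lock(&q->rq_qos_mutex);
	memflags = blk_mq_freeze_queue(q);
	/* ... add and initialize the rq-qos policy ... */
	blk_mq_unfreeze_queue(q, memflags);
	/* queue unfrozen: now safe to create debugfs entries while
	 * rq_qos_mutex still protects the list walk */
	mutex_lock(&q->debugfs_mutex);
	blk_mq_debugfs_register_rq_qos(q);
	mutex_unlock(&q->debugfs_mutex);
	mutex_unlock(&q->rq_qos_mutex);

(This would also mean dropping the rq_qos_mutex assertion from patch 1,
as you suggested.)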

From a quick look I feel it's OK; I'll try a new version.

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 



* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-14  2:21 [PATCH 0/4] blk-rq-qos: fix possible deadlock Yu Kuai
                   ` (3 preceding siblings ...)
  2025-10-14  2:21 ` [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static Yu Kuai
@ 2025-10-14 10:58 ` Nilay Shroff
  2025-10-14 11:14   ` Yu Kuai
  4 siblings, 1 reply; 23+ messages in thread
From: Nilay Shroff @ 2025-10-14 10:58 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi



On 10/14/25 7:51 AM, Yu Kuai wrote:
> Currently rq-qos debugfs entries is created from rq_qos_add(), while
> rq_qos_add() requires queue to be freezed. This can deadlock because
> 
> creating new entries can trigger fs reclaim.
> 
> Fix this problem by delaying creating rq-qos debugfs entries until
> it's initialization is complete.
> 
> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>   delaying after wbt_init();
> - For other policies, they can only be initialized by blkg configuration,
>   fix it by delaying to blkg_conf_end();
> 
> Noted this set is cooked on the top of my other thread:
> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
> 
> And the deadlock can be reporduced with above thead, by running blktests
> throtl/001 with wbt enabled by default. While the deadlock is really a
> long term problem.
> 
While freezing the queue we also mark the GFP_NOIO scope, so doesn't
that help avoid fs-reclaim? Or maybe you can share the lockdep splat
encountered when running throtl/001?

Thanks,
--Nilay


* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-14 10:58 ` [PATCH 0/4] blk-rq-qos: fix possible deadlock Nilay Shroff
@ 2025-10-14 11:14   ` Yu Kuai
  2025-10-14 17:57     ` Nilay Shroff
  2025-10-15  1:42     ` Ming Lei
  0 siblings, 2 replies; 23+ messages in thread
From: Yu Kuai @ 2025-10-14 11:14 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

Hi,

On 2025/10/14 18:58, Nilay Shroff wrote:
>
> On 10/14/25 7:51 AM, Yu Kuai wrote:
>> Currently rq-qos debugfs entries are created from rq_qos_add(), while
>> rq_qos_add() requires the queue to be frozen. This can deadlock because
>> creating new entries can trigger fs reclaim.
>>
>> Fix this problem by delaying creating rq-qos debugfs entries until
>> their initialization is complete.
>>
>> - For wbt, it can be initialized by default or via blk-sysfs; fix it by
>>   delaying until after wbt_init();
>> - For other policies, they can only be initialized by blkg configuration;
>>   fix it by delaying to blkg_conf_end();
>>
>> Note this set is cooked on top of my other thread:
>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
>>
>> And the deadlock can be reproduced with the above thread, by running
>> blktests throtl/001 with wbt enabled by default. That said, the deadlock
>> is really a long-standing problem.
>>
> While freezing the queue we also enter the GFP_NOIO allocation scope, so
> doesn't that help avoid fs-reclaim? Or maybe you can share the lockdep
> splat encountered running throtl/001?

Yes, we can avoid fs-reclaim from the task that is freezing the queue.
However, debugfs is a generic file system, and we can't avoid fs reclaim
from all contexts; other tasks can still enter it.
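
To illustrate the scope issue (a quick sketch using the current freeze
API, not taken from the actual patches):

```
/*
 * Sketch: blk_mq_freeze_queue() puts the *calling* task into a
 * memalloc_noio scope (returned in memflags), so allocations made by
 * this task while the queue is frozen are implicitly GFP_NOIO. An
 * unrelated task calling debugfs_create_dir() at the same time gets no
 * such protection: its inode allocation may enter fs reclaim and then
 * wait on the frozen queue.
 */
static void frozen_section(struct request_queue *q)
{
        unsigned int memflags = blk_mq_freeze_queue(q);

        /* allocations are NOIO here -- but only in this task */
        blk_mq_unfreeze_queue(q, memflags);
}
```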

Following is the log with the above set applied and wbt enabled by
default; the set acquires locks in the order:

freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex

However, fs-reclaim from another context causes the deadlock report.


[   45.632372][  T531] ======================================================
[   45.633734][  T531] WARNING: possible circular locking dependency detected
[   45.635062][  T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
[   45.636220][  T531] ------------------------------------------------------
[   45.637587][  T531] check/531 is trying to acquire lock:
[   45.638626][  T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_start+0x116/0x190
[   45.640416][  T531]
[   45.640416][  T531] but task is already holding lock:
[   45.641828][  T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_start+0x108/0x190
[   45.643322][  T531]
[   45.643322][  T531] which lock already depends on the new lock.
[   45.643322][  T531]
[   45.644862][  T531]
[   45.644862][  T531] the existing dependency chain (in reverse order) is:
[   45.646046][  T531]
[   45.646046][  T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
[   45.647052][  T531]        __mutex_lock+0xd3/0x8d0
[   45.647716][  T531]        blkg_conf_start+0x108/0x190
[   45.648395][  T531]        tg_set_limit+0x74/0x300
[   45.649046][  T531]        kernfs_fop_write_iter+0x14a/0x210
[   45.649813][  T531]        vfs_write+0x29e/0x550
[   45.650413][  T531]        ksys_write+0x74/0xf0
[   45.651032][  T531]        do_syscall_64+0xbb/0x380
[   45.651730][  T531]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   45.652533][  T531]
[   45.652533][  T531] -> #4 (&q->q_usage_counter(io)#3){++++}-{0:0}:
[   45.653297][  T531]        blk_alloc_queue+0x30b/0x350
[   45.653807][  T531]        blk_mq_alloc_queue+0x5d/0xd0
[   45.654300][  T531]        __blk_mq_alloc_disk+0x13/0x60
[   45.654810][  T531]        null_add_dev+0x2f8/0x660 [null_blk]
[   45.655374][  T531]        nullb_device_power_store+0x88/0x130 [null_blk]
[   45.656009][  T531]        configfs_write_iter+0xcb/0x130 [configfs]
[   45.656614][  T531]        vfs_write+0x29e/0x550
[   45.657045][  T531]        ksys_write+0x74/0xf0
[   45.657497][  T531]        do_syscall_64+0xbb/0x380
[   45.657958][  T531]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   45.658595][  T531]
[   45.658595][  T531] -> #3 (fs_reclaim){+.+.}-{0:0}:
[   45.659266][  T531]        fs_reclaim_acquire+0xa4/0xe0
[   45.659783][  T531]        kmem_cache_alloc_lru_noprof+0x3b/0x2a0
[   45.660369][  T531]        alloc_inode+0x1a/0x60
[   45.660873][  T531]        new_inode+0xd/0x90
[   45.661291][  T531]        debugfs_create_dir+0x38/0x160
[   45.661806][  T531]        component_debug_init+0x12/0x20
[   45.662316][  T531]        do_one_initcall+0x68/0x2b0
[   45.662807][  T531]        kernel_init_freeable+0x238/0x290
[   45.663241][  T531]        kernel_init+0x15/0x130
[   45.663659][  T531]        ret_from_fork+0x182/0x1d0
[   45.664054][  T531]        ret_from_fork_asm+0x1a/0x30
[   45.664601][  T531]
[   45.664601][  T531] -> #2 (&sb->s_type->i_mutex_key#2){+.+.}-{4:4}:
[   45.665274][  T531]        down_write+0x3a/0xb0
[   45.665669][  T531]        simple_start_creating+0x51/0x110
[   45.666097][  T531]        debugfs_start_creating+0x68/0xd0
[   45.666561][  T531]        debugfs_create_dir+0x10/0x160
[   45.666970][  T531]        blk_register_queue+0x8b/0x1e0
[   45.667386][  T531]        __add_disk+0x253/0x3f0
[   45.667804][  T531]        add_disk_fwnode+0x71/0x180
[   45.668205][  T531]        virtblk_probe+0x9c2/0xb50
[   45.668603][  T531]        virtio_dev_probe+0x223/0x380
[   45.669004][  T531]        really_probe+0xc2/0x260
[   45.669369][  T531]        __driver_probe_device+0x6e/0x100
[   45.669856][  T531]        driver_probe_device+0x1a/0xd0
[   45.670263][  T531]        __driver_attach+0x93/0x1a0
[   45.670672][  T531]        bus_for_each_dev+0x87/0xe0
[   45.671063][  T531]        bus_add_driver+0xe0/0x1d0
[   45.671469][  T531]        driver_register+0x70/0xe0
[   45.671856][  T531]        virtio_blk_init+0x53/0x80
[   45.672258][  T531]        do_one_initcall+0x68/0x2b0
[   45.672661][  T531]        kernel_init_freeable+0x238/0x290
[   45.673097][  T531]        kernel_init+0x15/0x130
[   45.673510][  T531]        ret_from_fork+0x182/0x1d0
[   45.673907][  T531]        ret_from_fork_asm+0x1a/0x30
[   45.674319][  T531]
[   45.674319][  T531] -> #1 (&q->debugfs_mutex){+.+.}-{4:4}:
[   45.674929][  T531]        __mutex_lock+0xd3/0x8d0
[   45.675315][  T531]        rq_qos_add+0xe0/0x130
[   45.675717][  T531]        wbt_init+0x183/0x210
[   45.676062][  T531]        blk_register_queue+0xdf/0x1e0
[   45.676490][  T531]        __add_disk+0x253/0x3f0
[   45.676844][  T531]        add_disk_fwnode+0x71/0x180
[   45.677247][  T531]        virtblk_probe+0x9c2/0xb50
[   45.677648][  T531]        virtio_dev_probe+0x223/0x380
[   45.678044][  T531]        really_probe+0xc2/0x260
[   45.678411][  T531]        __driver_probe_device+0x6e/0x100
[   45.678875][  T531]        driver_probe_device+0x1a/0xd0
[   45.679282][  T531]        __driver_attach+0x93/0x1a0
[   45.679678][  T531]        bus_for_each_dev+0x87/0xe0
[   45.680053][  T531]        bus_add_driver+0xe0/0x1d0
[   45.680458][  T531]        driver_register+0x70/0xe0
[   45.680823][  T531]        virtio_blk_init+0x53/0x80
[   45.681208][  T531]        do_one_initcall+0x68/0x2b0
[   45.681611][  T531]        kernel_init_freeable+0x238/0x290
[   45.682027][  T531]        kernel_init+0x15/0x130
[   45.682392][  T531]        ret_from_fork+0x182/0x1d0
[   45.682829][  T531]        ret_from_fork_asm+0x1a/0x30
[   45.683240][  T531]
[   45.683240][  T531] -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
[   45.683867][  T531]        __lock_acquire+0x103d/0x1650
[   45.684411][  T531]        lock_acquire+0xbc/0x2c0
[   45.684798][  T531]        __mutex_lock+0xd3/0x8d0
[   45.685172][  T531]        blkg_conf_start+0x116/0x190
[   45.685623][  T531]        tg_set_limit+0x74/0x300
[   45.685986][  T531]        kernfs_fop_write_iter+0x14a/0x210
[   45.686405][  T531]        vfs_write+0x29e/0x550
[   45.686771][  T531]        ksys_write+0x74/0xf0
[   45.687112][  T531]        do_syscall_64+0xbb/0x380
[   45.687514][  T531]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   45.687983][  T531]
[   45.687983][  T531] other info that might help us debug this:
[   45.687983][  T531]
[   45.688760][  T531] Chain exists of:
[   45.688760][  T531]   &q->rq_qos_mutex --> &q->q_usage_counter(io)#3 --> &q->elevator_lock
[   45.688760][  T531]
[   45.689817][  T531]  Possible unsafe locking scenario:
[   45.689817][  T531]
[   45.690359][  T531]        CPU0                    CPU1
[   45.690764][  T531]        ----                    ----
[   45.691172][  T531]   lock(&q->elevator_lock);
[   45.691544][  T531]                                lock(&q->q_usage_counter(io)#3);
[   45.692108][  T531]                                lock(&q->elevator_lock);
[   45.692659][  T531]   lock(&q->rq_qos_mutex);
[   45.692994][  T531]
[   45.692994][  T531]  *** DEADLOCK ***

Thanks, Kuai




>
> Thanks,
> --Nilay
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-14 11:14   ` Yu Kuai
@ 2025-10-14 17:57     ` Nilay Shroff
  2025-10-15  1:36       ` Yu Kuai
  2025-10-15  1:42     ` Ming Lei
  1 sibling, 1 reply; 23+ messages in thread
From: Nilay Shroff @ 2025-10-14 17:57 UTC (permalink / raw)
  To: Yu Kuai, Yu Kuai, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi



On 10/14/25 4:44 PM, Yu Kuai wrote:
> Hi,
> 
> On 2025/10/14 18:58, Nilay Shroff wrote:
>>
>> On 10/14/25 7:51 AM, Yu Kuai wrote:
>>> Currently rq-qos debugfs entries are created from rq_qos_add(), while
>>> rq_qos_add() requires the queue to be frozen. This can deadlock because
>>> creating new entries can trigger fs reclaim.
>>>
>>> Fix this problem by delaying creating rq-qos debugfs entries until
>>> their initialization is complete.
>>>
>>> - For wbt, it can be initialized by default or via blk-sysfs; fix it by
>>>   delaying until after wbt_init();
>>> - For other policies, they can only be initialized by blkg configuration;
>>>   fix it by delaying to blkg_conf_end();
>>>
>>> Note this set is cooked on top of my other thread:
>>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
>>>
>>> And the deadlock can be reproduced with the above thread, by running
>>> blktests throtl/001 with wbt enabled by default. That said, the deadlock
>>> is really a long-standing problem.
>>>
>> While freezing the queue we also enter the GFP_NOIO allocation scope, so
>> doesn't that help avoid fs-reclaim? Or maybe you can share the lockdep
>> splat encountered running throtl/001?
> 
> Yes, we can avoid fs-reclaim from the task that is freezing the queue.
> However, debugfs is a generic file system, and we can't avoid fs reclaim
> from all contexts; other tasks can still enter it.
> 
> Following is the log with the above set applied and wbt enabled by
> default; the set acquires locks in the order:
> 
> freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
> 
> However, fs-reclaim from another context causes the deadlock report.
> 
> 
> [ ... lockdep splat snipped; quoted in full earlier in the thread ... ]
> 

From the above trace, I see that thread #1 (virtblk device) is
unable to acquire &q->debugfs_mutex and is pending on thread #2
(another virtblk device) because thread #2 has already acquired
the same &q->debugfs_mutex. This sequence of events appears to be
a false positive. The reasoning is as follows:

Each virtio block device (e.g., virtblk0, virtblk1, etc.) has its own
struct request_queue, and therefore its own instance of q->debugfs_mutex.
These mutexes are distinct objects at different memory addresses. Hence,
thread #1 cannot physically block on thread #2’s q->debugfs_mutex, since
the mutex wait queue is per-lock address.

However, lockdep does not track individual mutex addresses. Instead, it
operates at the level of lock classes, identified by a lock class key. 
Because all q->debugfs_mutex instances share the same lock class key,
lockdep treats them as equivalent and reports a potential circular
dependency across devices, even though such a dependency cannot actually
occur at runtime. So IMO, to eliminate this false-positive warning, we
should assign a separate lockdep class/key for each queue’s q->debugfs_mutex.
This will ensure lockdep distinguishes mutex instances belonging to
different queues.
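
A rough sketch of what I mean (the debugfs_key field and the helper are
hypothetical, not existing code):

```
/*
 * Hypothetical sketch: give each queue's debugfs_mutex its own dynamic
 * lockdep key, so lockdep no longer places all instances in a single
 * class. Assumes a new 'struct lock_class_key debugfs_key' member in
 * struct request_queue; the key must be unregistered again with
 * lockdep_unregister_key() before the queue is freed.
 */
static void blk_debugfs_mutex_init(struct request_queue *q)
{
        mutex_init(&q->debugfs_mutex);
        lockdep_register_key(&q->debugfs_key);
        lockdep_set_class(&q->debugfs_mutex, &q->debugfs_key);
}
```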

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-14 17:57     ` Nilay Shroff
@ 2025-10-15  1:36       ` Yu Kuai
  0 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-10-15  1:36 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

On 2025/10/15 1:57, Nilay Shroff wrote:
> 
> 
> On 10/14/25 4:44 PM, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/10/14 18:58, Nilay Shroff wrote:
>>>
>>> On 10/14/25 7:51 AM, Yu Kuai wrote:
>>>> Currently rq-qos debugfs entries are created from rq_qos_add(), while
>>>> rq_qos_add() requires the queue to be frozen. This can deadlock because
>>>> creating new entries can trigger fs reclaim.
>>>>
>>>> Fix this problem by delaying creating rq-qos debugfs entries until
>>>> their initialization is complete.
>>>>
>>>> - For wbt, it can be initialized by default or via blk-sysfs; fix it by
>>>>   delaying until after wbt_init();
>>>> - For other policies, they can only be initialized by blkg configuration;
>>>>   fix it by delaying to blkg_conf_end();
>>>>
>>>> Note this set is cooked on top of my other thread:
>>>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
>>>>
>>>> And the deadlock can be reproduced with the above thread, by running
>>>> blktests throtl/001 with wbt enabled by default. That said, the deadlock
>>>> is really a long-standing problem.
>>>>
>>> While freezing the queue we also enter the GFP_NOIO allocation scope, so
>>> doesn't that help avoid fs-reclaim? Or maybe you can share the lockdep
>>> splat encountered running throtl/001?
>>
>> Yes, we can avoid fs-reclaim from the task that is freezing the queue.
>> However, debugfs is a generic file system, and we can't avoid fs reclaim
>> from all contexts; other tasks can still enter it.
>>
>> Following is the log with the above set applied and wbt enabled by
>> default; the set acquires locks in the order:
>>
>> freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
>>
>> However, fs-reclaim from another context causes the deadlock report.
>>
>>
>> [ ... lockdep splat snipped; quoted in full earlier in the thread ... ]
>>
> 
> From the above trace, I see that thread #1 (virtblk device) is
> unable to acquire &q->debugfs_mutex and is pending on thread #2
> (another virtblk device) because thread #2 has already acquired
> the same &q->debugfs_mutex. This sequence of events appears to be
> a false positive. The reasoning is as follows:
> 

Yes, in this case #1 and #2 are both from __add_disk(), and they can't
run concurrently with each other for the same disk.

> Each virtio block device (e.g., virtblk0, virtblk1, etc.) has its own
> struct request_queue, and therefore its own instance of q->debugfs_mutex.
> These mutexes are distinct objects at different memory addresses. Hence,
> thread #1 cannot physically block on thread #2’s q->debugfs_mutex, since
> the mutex wait queue is per-lock address.
> 
> However, lockdep does not track individual mutex addresses. Instead, it
> operates at the level of lock classes, identified by a lock class key.
> Because all q->debugfs_mutex instances share the same lock class key,
> lockdep treats them as equivalent and reports a potential circular
> dependency across devices, even though such a dependency cannot actually
> occur at runtime. So IMO, to eliminate this false-positive warning, we
> should assign a separate lockdep class/key for each queue’s q->debugfs_mutex.
> This will ensure lockdep distinguishes mutex instances belonging to
> different queues.
>

Lockdep just shows a possible circular locking dependency, and we can
certainly avoid the above warning with a separate lockdep key. However, I
still feel that fixing the circular dependency by simply not creating
debugfs entries while the queue is frozen is better, unless it's too
complicated. There might still be other undiscovered deadlocks
somewhere...

BTW, we should probably document the locking order somewhere.
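
For example, something like this in a comment near the block core locking
helpers (a sketch only, reflecting the order this set currently takes):

```
/*
 * Locking order (outermost first), as taken by this set:
 *
 *      q->q_usage_counter (freeze queue)
 *          q->elevator_lock
 *              q->rq_qos_mutex
 *                  blkcg_mutex
 *
 * Never create debugfs/sysfs entries while any of the above are held
 * with the queue frozen: inode allocation can enter fs reclaim, which
 * may in turn wait on the frozen queue.
 */
```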

Thanks,
Kuai

> Thanks,
> --Nilay
> 
> .
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-14 11:14   ` Yu Kuai
  2025-10-14 17:57     ` Nilay Shroff
@ 2025-10-15  1:42     ` Ming Lei
  2025-10-15  5:16       ` Nilay Shroff
  1 sibling, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-10-15  1:42 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Nilay Shroff, Yu Kuai, tj, josef, axboe, cgroups, linux-block,
	linux-kernel, yukuai1, yi.zhang, yangerkun, johnny.chenyi

On Tue, Oct 14, 2025 at 07:14:16PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/10/14 18:58, Nilay Shroff wrote:
> > 
> > On 10/14/25 7:51 AM, Yu Kuai wrote:
> > > Currently rq-qos debugfs entries are created from rq_qos_add(), while
> > > rq_qos_add() requires the queue to be frozen. This can deadlock because
> > > creating new entries can trigger fs reclaim.
> > > 
> > > Fix this problem by delaying creating rq-qos debugfs entries until
> > > their initialization is complete.
> > > 
> > > - For wbt, it can be initialized by default or via blk-sysfs; fix it by
> > >   delaying until after wbt_init();
> > > - For other policies, they can only be initialized by blkg configuration;
> > >   fix it by delaying to blkg_conf_end();
> > > 
> > > Note this set is cooked on top of my other thread:
> > > https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
> > > 
> > > And the deadlock can be reproduced with the above thread, by running
> > > blktests throtl/001 with wbt enabled by default. That said, the deadlock
> > > is really a long-standing problem.
> > > 
> > While freezing the queue we also enter the GFP_NOIO allocation scope, so
> > doesn't that help avoid fs-reclaim? Or maybe you can share the lockdep
> > splat encountered running throtl/001?
> 
> Yes, we can avoid fs-reclaim from the task that is freezing the queue.
> However, debugfs is a generic file system, and we can't avoid fs reclaim
> from all contexts; other tasks can still enter it.
> 
> Following is the log with the above set applied and wbt enabled by
> default; the set acquires locks in the order:
> 
> freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
> 
> However, fs-reclaim from another context causes the deadlock report.
> 
> 
> [   45.632372][  T531] ======================================================
> [   45.633734][  T531] WARNING: possible circular locking dependency detected
> [   45.635062][  T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
> [   45.636220][  T531] ------------------------------------------------------
> [   45.637587][  T531] check/531 is trying to acquire lock:
> [   45.638626][  T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_start+0x116/0x190
> [   45.640416][  T531]
> [   45.640416][  T531] but task is already holding lock:
> [   45.641828][  T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_start+0x108/0x190
> [   45.643322][  T531]
> [   45.643322][  T531] which lock already depends on the new lock.
> [   45.643322][  T531]
> [   45.644862][  T531]
> [   45.644862][  T531] the existing dependency chain (in reverse order) is:
> [   45.646046][  T531]
> [   45.646046][  T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
> [   45.647052][  T531]        __mutex_lock+0xd3/0x8d0
> [   45.647716][  T531]        blkg_conf_start+0x108/0x190
> [   45.648395][  T531]        tg_set_limit+0x74/0x300
> [   45.649046][  T531]        kernfs_fop_write_iter+0x14a/0x210
> [   45.649813][  T531]        vfs_write+0x29e/0x550
> [   45.650413][  T531]        ksys_write+0x74/0xf0
> [   45.651032][  T531]        do_syscall_64+0xbb/0x380
> [   45.651730][  T531]        entry_SYSCALL_64_after_hwframe+0x77/0x7f

Not sure why the elevator lock is grabbed in the throttle code; this
looks like an elevator lock misuse. What is the elevator lock trying to
protect here?

The code comment doesn't explain the usage either:

```
  /*
   * Similar to blkg_conf_open_bdev, but additionally freezes the queue,
   * acquires q->elevator_lock, and ensures the correct locking order
   * between q->elevator_lock and q->rq_qos_mutex.
   *
   * This function returns negative error on failure. On success it returns
   * memflags which must be saved and later passed to blkg_conf_exit_frozen
   * for restoring the memalloc scope.
   */
```

I think there is still an ordering issue between queue freeze and
q->rq_qos_mutex, which needs to be solved first.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-15  1:42     ` Ming Lei
@ 2025-10-15  5:16       ` Nilay Shroff
  2025-10-15  9:27         ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Nilay Shroff @ 2025-10-15  5:16 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: Yu Kuai, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi



On 10/15/25 7:12 AM, Ming Lei wrote:
> On Tue, Oct 14, 2025 at 07:14:16PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/10/14 18:58, Nilay Shroff wrote:
>>>
>>> On 10/14/25 7:51 AM, Yu Kuai wrote:
>>>> Currently rq-qos debugfs entries are created from rq_qos_add(), while
>>>> rq_qos_add() requires the queue to be frozen. This can deadlock because
>>>> creating new entries can trigger fs reclaim.
>>>>
>>>> Fix this problem by delaying creating rq-qos debugfs entries until
>>>> their initialization is complete.
>>>>
>>>> - For wbt, it can be initialized by default or via blk-sysfs; fix it by
>>>>   delaying until after wbt_init();
>>>> - For other policies, they can only be initialized by blkg configuration;
>>>>   fix it by delaying to blkg_conf_end();
>>>>
>>>> Note this set is cooked on top of my other thread:
>>>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
>>>>
>>>> And the deadlock can be reproduced with the above thread, by running
>>>> blktests throtl/001 with wbt enabled by default. That said, the deadlock
>>>> is really a long-standing problem.
>>>>
>>> While freezing the queue we also enter the GFP_NOIO allocation scope, so
>>> doesn't that help avoid fs-reclaim? Or maybe you can share the lockdep
>>> splat encountered running throtl/001?
>>
>> Yes, we can avoid fs-reclaim from the task that is freezing the queue.
>> However, debugfs is a generic file system, and we can't avoid fs reclaim
>> from all contexts; other tasks can still enter it.
>>
>> Following is the log with the above set applied and wbt enabled by
>> default; the set acquires locks in the order:
>>
>> freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
>>
>> However, fs-reclaim from another context causes the deadlock report.
>>
>>
>> [   45.632372][  T531] ======================================================
>> [   45.633734][  T531] WARNING: possible circular locking dependency detected
>> [   45.635062][  T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
>> [   45.636220][  T531] ------------------------------------------------------
>> [   45.637587][  T531] check/531 is trying to acquire lock:
>> [   45.638626][  T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_start+0x116/0x190
>> [   45.640416][  T531]
>> [   45.640416][  T531] but task is already holding lock:
>> [   45.641828][  T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_start+0x108/0x190
>> [   45.643322][  T531]
>> [   45.643322][  T531] which lock already depends on the new lock.
>> [   45.643322][  T531]
>> [   45.644862][  T531]
>> [   45.644862][  T531] the existing dependency chain (in reverse order) is:
>> [   45.646046][  T531]
>> [   45.646046][  T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
>> [   45.647052][  T531]        __mutex_lock+0xd3/0x8d0
>> [   45.647716][  T531]        blkg_conf_start+0x108/0x190
>> [   45.648395][  T531]        tg_set_limit+0x74/0x300
>> [   45.649046][  T531]        kernfs_fop_write_iter+0x14a/0x210
>> [   45.649813][  T531]        vfs_write+0x29e/0x550
>> [   45.650413][  T531]        ksys_write+0x74/0xf0
>> [   45.651032][  T531]        do_syscall_64+0xbb/0x380
>> [   45.651730][  T531]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Not sure why the elevator lock is grabbed in the throttle code; this
> looks like an elevator lock misuse. What is the elevator lock trying to
> protect here?
> 
> The code comment doesn't explain the usage either:
> 
Let's go back to the history:
The ->elevator_lock was first added in the wbt code path under commit
245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock"). It was
introduced to protect the wbt latency and state updates, which could be
accessed simultaneously from the elevator switch path, from the sysfs
write method (queue_wb_lat_store()), and from cgroup (ioc_qos_write()).

Later, the above change caused a lockdep splat, and we updated the code
to fix the locking order between ->freeze_lock, ->elevator_lock and
->rq_qos_mutex; that was implemented in commit 9730763f4756 ("block:
correct locking order for protecting blk-wbt parameters"). With this
change we set the locking order as follows:
->freeze_lock ->elevator_lock ->rq_qos_mutex

Then later, under commit 78c271344b6f ("block: move wbt_enable_default()
out of queue freezing from sched ->exit()"), we moved the wbt latency/stat
update code out of ->freeze_lock and ->elevator_lock in the elevator
switch path. So essentially, with this commit, in theory we no longer need
to acquire ->elevator_lock while updating wbt latency/stat values. In
fact, we also removed ->elevator_lock from queue_wb_lat_store() in that
commit, but I think we missed removing ->elevator_lock from cgroup
(ioc_qos_write()).

> 
> I think there is still an ordering issue between queue freeze and
> q->rq_qos_mutex, which needs to be solved first.
> 
So yes, we should first target getting rid of the use of ->elevator_lock
from ioc_qos_write(). Later we can decide on the locking order between
->freeze_lock, ->rq_qos_mutex and ->debugfs_mutex.

Thanks,
--Nilay





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
  2025-10-15  5:16       ` Nilay Shroff
@ 2025-10-15  9:27         ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2025-10-15  9:27 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Yu Kuai, Yu Kuai, tj, josef, axboe, cgroups, linux-block,
	linux-kernel, yukuai1, yi.zhang, yangerkun, johnny.chenyi

On Wed, Oct 15, 2025 at 10:46:51AM +0530, Nilay Shroff wrote:
> 
> 
> On 10/15/25 7:12 AM, Ming Lei wrote:
> > On Tue, Oct 14, 2025 at 07:14:16PM +0800, Yu Kuai wrote:
> >> Hi,
> >>
> >> On 2025/10/14 18:58, Nilay Shroff wrote:
> >>>
> >>> On 10/14/25 7:51 AM, Yu Kuai wrote:
> >>>> Currently rq-qos debugfs entries are created from rq_qos_add(), while
> >>>> rq_qos_add() requires the queue to be frozen. This can deadlock because
> >>>> creating new entries can trigger fs reclaim.
> >>>>
> >>>> Fix this problem by delaying creating rq-qos debugfs entries until
> >>>> their initialization is complete.
> >>>>
> >>>> - For wbt, it can be initialized by default or via blk-sysfs; fix it by
> >>>>   delaying until after wbt_init();
> >>>> - For other policies, they can only be initialized by blkg configuration;
> >>>>   fix it by delaying to blkg_conf_end();
> >>>>
> >>>> Note this set is cooked on top of my other thread:
> >>>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
> >>>>
> >>>> And the deadlock can be reproduced with the above thread, by running
> >>>> blktests throtl/001 with wbt enabled by default. That said, the deadlock
> >>>> is really a long-standing problem.
> >>>>
> >>> While freezing the queue we also enter the GFP_NOIO allocation scope, so
> >>> doesn't that help avoid fs-reclaim? Or maybe you can share the lockdep
> >>> splat encountered running throtl/001?
> >>
> >> Yes, we can avoid fs-reclaim from the task that is freezing the queue.
> >> However, debugfs is a generic file system, and we can't avoid fs reclaim
> >> from all contexts; other tasks can still enter it.
> >>
> >> Following is the log with the above set applied and wbt enabled by
> >> default; the set acquires locks in the order:
> >>
> >> freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
> >>
> >> However, fs-reclaim from another context causes the deadlock report.
> >>
> >>
> >> [   45.632372][  T531] ======================================================
> >> [   45.633734][  T531] WARNING: possible circular locking dependency detected
> >> [   45.635062][  T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
> >> [   45.636220][  T531] ------------------------------------------------------
> >> [   45.637587][  T531] check/531 is trying to acquire lock:
> >> [   45.638626][  T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_start+0x116/0x190
> >> [   45.640416][  T531]
> >> [   45.640416][  T531] but task is already holding lock:
> >> [   45.641828][  T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_start+0x108/0x190
> >> [   45.643322][  T531]
> >> [   45.643322][  T531] which lock already depends on the new lock.
> >> [   45.643322][  T531]
> >> [   45.644862][  T531]
> >> [   45.644862][  T531] the existing dependency chain (in reverse order) is:
> >> [   45.646046][  T531]
> >> [   45.646046][  T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
> >> [   45.647052][  T531]        __mutex_lock+0xd3/0x8d0
> >> [   45.647716][  T531]        blkg_conf_start+0x108/0x190
> >> [   45.648395][  T531]        tg_set_limit+0x74/0x300
> >> [   45.649046][  T531]        kernfs_fop_write_iter+0x14a/0x210
> >> [   45.649813][  T531]        vfs_write+0x29e/0x550
> >> [   45.650413][  T531]        ksys_write+0x74/0xf0
> >> [   45.651032][  T531]        do_syscall_64+0xbb/0x380
> >> [   45.651730][  T531]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > 
> > Not sure why the elevator lock is grabbed in the throttle code; this
> > looks like an elevator lock misuse. What is the elevator lock trying to
> > protect here?
> > 
> > The code comment doesn't explain the usage either:
> > 
> Let's go back to the history:
> The ->elevator_lock was first added in the wbt code path under commit
> 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock"). It was
> introduced to protect the wbt latency and state updates, which could be
> accessed simultaneously from the elevator switch path, from the sysfs
> write method (queue_wb_lat_store()), and from cgroup (ioc_qos_write()).
> 
> Later, the above change caused a lockdep splat, and we updated the code
> to fix the locking order between ->freeze_lock, ->elevator_lock and
> ->rq_qos_mutex; that was implemented in commit 9730763f4756 ("block:
> correct locking order for protecting blk-wbt parameters"). With this
> change we set the locking order as follows:
> ->freeze_lock ->elevator_lock ->rq_qos_mutex
> 
> Then later, under commit 78c271344b6f ("block: move wbt_enable_default()
> out of queue freezing from sched ->exit()"), we moved the wbt latency/stat
> update code out of ->freeze_lock and ->elevator_lock in the elevator
> switch path. So essentially, with this commit, in theory we no longer need
> to acquire ->elevator_lock while updating wbt latency/stat values. In
> fact, we also removed ->elevator_lock from queue_wb_lat_store() in that
> commit, but I think we missed removing ->elevator_lock from cgroup
> (ioc_qos_write()).

Nice look back at the history!

I will cook one patch to remove it from ioc_qos_write().

> 
> > 
> > I think there is still an ordering issue between queue freeze and
> > q->rq_qos_mutex, which needs to be solved first.
> > 
> So yes, we should first target getting rid of the use of ->elevator_lock
> from ioc_qos_write(). Later we can decide on the locking order between
> ->freeze_lock, ->rq_qos_mutex and ->debugfs_mutex.

Yu Kuai is working on ordering queue freeze and ->rq_qos_mutex.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-10-15  9:27 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-14  2:21 [PATCH 0/4] blk-rq-qos: fix possible deadlock Yu Kuai
2025-10-14  2:21 ` [PATCH 1/4] blk-mq-debugfs: warn about " Yu Kuai
2025-10-14  8:06   ` Ming Lei
2025-10-14  8:21     ` Yu Kuai
2025-10-14  8:34       ` Ming Lei
2025-10-14  2:21 ` [PATCH 2/4] blk-mq-debugfs: factor out a helper blk_mq_debugfs_register_rq_qos() Yu Kuai
2025-10-14  2:21 ` [PATCH 3/4] blk-rq-qos: fix possible deadlock Yu Kuai
2025-10-14  8:13   ` Ming Lei
2025-10-14  8:24     ` Yu Kuai
2025-10-14  8:37       ` Ming Lei
2025-10-14  8:42         ` Yu Kuai
2025-10-14  8:55           ` Ming Lei
2025-10-14  9:03             ` Yu Kuai
2025-10-14  2:21 ` [PATCH 4/4] blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static Yu Kuai
2025-10-14  8:15   ` Ming Lei
2025-10-14  8:26     ` Yu Kuai
2025-10-14 10:58 ` [PATCH 0/4] blk-rq-qos: fix possible deadlock Nilay Shroff
2025-10-14 11:14   ` Yu Kuai
2025-10-14 17:57     ` Nilay Shroff
2025-10-15  1:36       ` Yu Kuai
2025-10-15  1:42     ` Ming Lei
2025-10-15  5:16       ` Nilay Shroff
2025-10-15  9:27         ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).