Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()

public inbox for linux-block@vger.kernel.org

From: Nilay Shroff <nilay@linux.ibm.com>
To: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org
Cc: "Shinichiro Kawasaki" <shinichiro.kawasaki@wdc.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Christoph Hellwig" <hch@lst.de>
Subject: Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
Date: Tue, 29 Apr 2025 16:29:23 +0530
Message-ID: <e99c8ed8-bfcb-4a4d-aff7-c78506eddc09@linux.ibm.com>
In-Reply-To: <20250424152148.1066220-21-ming.lei@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 7898 bytes --]



On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
> 
> 	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> 	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
> 	#4 (&q->elevator_lock){+.+.}-{4:4}:
> 	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> 	#2 (fs_reclaim){+.+.}-{0:0}:
> 	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> 	#0 (&q->debugfs_mutex){+.+.}-{4:4}:
> 
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
> 
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

While testing this patch with blktests on my machine, I hit the lockdep
splat shown below (I can reproduce it consistently):

run blktests block/005 at 2025-04-28 06:57:51

======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc2+ #174 Not tainted
------------------------------------------------------
check/8088 is trying to acquire lock:
c0000000a0c03538 (&disk->rqos_state_mutex){+.+.}-{4:4}, at: wbt_disable_default+0x9c/0x118

but task is already holding lock:
c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&q->elevator_lock){+.+.}-{4:4}:
       __mutex_lock+0x128/0xdd8
       elevator_change+0x94/0x214
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

-> #2 (&q->q_usage_counter(io)#23){++++}-{0:0}:
       blk_alloc_queue+0x46c/0x4bc
       blk_mq_alloc_queue+0xc0/0x160
       __blk_mq_alloc_disk+0x34/0x128
       nvme_alloc_ns+0x140/0x1804 [nvme_core]
       nvme_scan_ns+0x42c/0x564 [nvme_core]
       async_run_entry_fn+0x9c/0x30c
       process_one_work+0x514/0xd38
       worker_thread+0x390/0x6dc
       kthread+0x230/0x278
       start_kernel_thread+0x14/0x18

-> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire+0x114/0x150
       __kmalloc_cache_noprof+0x70/0x5c0
       wbt_init+0x64/0x2fc
       wbt_enable_default+0x140/0x15c
       elevator_change_done+0x314/0x3a8
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

-> #0 (&disk->rqos_state_mutex){+.+.}-{4:4}:
       __lock_acquire+0x1b5c/0x29f8
       lock_acquire+0x23c/0x3f8
       __mutex_lock+0x128/0xdd8
       wbt_disable_default+0x9c/0x118
       bfq_init_queue+0x7b0/0x8c0
       blk_mq_init_sched+0x29c/0x3a8
       __elevator_change+0x3a4/0x8a4
       elevator_change+0x1a4/0x214
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

other info that might help us debug this:

Chain exists of:
  &disk->rqos_state_mutex --> &q->q_usage_counter(io)#23 --> &q->elevator_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->elevator_lock);
                               lock(&q->q_usage_counter(io)#23);
                               lock(&q->elevator_lock);
  lock(&disk->rqos_state_mutex);

 *** DEADLOCK ***

7 locks held by check/8088:
 #0: c0000000873f2400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xa0/0x180
 #1: c00000008c10c088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1e0/0x384
 #2: c000000085239248 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1f8/0x384
 #3: c0000000f801c190 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: elv_iosched_store+0x13c/0x1f4
 #4: c00000005b8f6718 (&q->q_usage_counter(io)#23){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
 #5: c00000005b8f6750 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
 #6: c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214

stack backtrace:
CPU: 26 UID: 0 PID: 8088 Comm: check Kdump: loaded Not tainted 6.15.0-rc2+ #174 VOLUNTARY
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000d7497240] [c0000000017b9888] dump_stack_lvl+0x100/0x184 (unreliable)
[c0000000d7497270] [c0000000002b546c] print_circular_bug+0x448/0x604
[c0000000d7497320] [c0000000002b5874] check_noncircular+0x24c/0x26c
[c0000000d74973f0] [c0000000002bbb78] __lock_acquire+0x1b5c/0x29f8
[c0000000d7497520] [c0000000002b915c] lock_acquire+0x23c/0x3f8
[c0000000d7497620] [c00000000181277c] __mutex_lock+0x128/0xdd8
[c0000000d7497780] [c000000000c73bf8] wbt_disable_default+0x9c/0x118
[c0000000d74977c0] [c000000000c4c2c0] bfq_init_queue+0x7b0/0x8c0
[c0000000d7497890] [c000000000bff634] blk_mq_init_sched+0x29c/0x3a8
[c0000000d7497910] [c000000000bc2a18] __elevator_change+0x3a4/0x8a4
[c0000000d74979b0] [c000000000bc30bc] elevator_change+0x1a4/0x214
[c0000000d7497a00] [c000000000bc427c] elv_iosched_store+0x14c/0x1f4
[c0000000d7497ae0] [c000000000bd07ec] queue_attr_store+0x194/0x1d0
[c0000000d7497c00] [c000000000a40f00] sysfs_kf_write+0xbc/0x110
[c0000000d7497c50] [c000000000a3cc4c] kernfs_fop_write_iter+0x264/0x384
[c0000000d7497cb0] [c0000000008bb9bc] vfs_write+0x5b0/0x77c
[c0000000d7497d90] [c0000000008bbf88] ksys_write+0xa0/0x180
[c0000000d7497df0] [c000000000039f70] system_call_exception+0x1b0/0x4f0
[c0000000d7497e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffa413b034
NIP:  00007fffa413b034 LR: 00007fffa413b034 CTR: 0000000000000000
REGS: c0000000d7497e80 TRAP: 3000   Not tainted  (6.15.0-rc2+)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44422408  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007ffffd011260 000000010dfa7e00 0000000000000001
GPR04: 000000011c30b720 0000000000000004 0000000000000010 0000000000000001
GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffa43fab60 000000011c3adbc0 000000010dfa87b8
GPR16: 000000010dfa94d8 0000000020000000 0000000000000000 000000010deb9070
GPR20: 000000010df4beb8 00007ffffd011404 000000010df4f8a0 000000010dfa89bc
GPR24: 000000010dfa8a50 0000000000000000 000000011c30b720 0000000000000004
GPR28: 0000000000000004 00007fffa42418e0 000000011c30b720 0000000000000004
NIP [00007fffa413b034] 0x7fffa413b034
LR [00007fffa413b034] 0x7fffa413b034
--- interrupt: 3000


From the changes in this patch, it appears that we now hold the new
disk->rqos_state_mutex while wbt_init() allocates memory, which is
what records the rqos_state_mutex -> fs_reclaim dependency in the splat
above. One option is to define a GFP_NOIO scope while holding this
mutex. Looking further, though, it seems we don't really need to hold
this lock in wbt_enable_default() all the way until wbt_init() returns.
Unlocking the mutex as soon as we're done updating the rqos state
should probably be sufficient.
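
To make the locking idea concrete, here is a minimal userspace sketch, not
kernel code. It assumes pthread mutexes in place of kernel mutexes and
malloc() in place of the allocation in wbt_init(). The fake_* names and
struct fake_disk are hypothetical stand-ins invented for illustration, and
the publish-under-lock step in fake_wbt_init() is an assumption of the
sketch rather than something the patch does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
enum fake_wbt_state { FAKE_OFF_DEFAULT, FAKE_ON_DEFAULT };

struct fake_disk {
	pthread_mutex_t rqos_state_mutex; /* models disk->rqos_state_mutex */
	enum fake_wbt_state *wb;          /* models the wbt rq_qos; NULL if absent */
	int registered;                   /* models blk_queue_registered() */
};

static void fake_disk_setup(struct fake_disk *d, int registered)
{
	assert(d != NULL);
	pthread_mutex_init(&d->rqos_state_mutex, NULL);
	d->wb = NULL;
	d->registered = registered;
}

/* Models wbt_init(): the allocation happens with no state mutex held. */
static int fake_wbt_init(struct fake_disk *d)
{
	enum fake_wbt_state *wb = malloc(sizeof(*wb));

	if (!wb)
		return -1;
	*wb = FAKE_ON_DEFAULT;

	/* Publish under the lock; drop our copy if another caller won the race. */
	pthread_mutex_lock(&d->rqos_state_mutex);
	if (d->wb) {
		pthread_mutex_unlock(&d->rqos_state_mutex);
		free(wb);
		return 0;
	}
	d->wb = wb;
	pthread_mutex_unlock(&d->rqos_state_mutex);
	return 0;
}

/* Models the narrowed wbt_enable_default(): lock only around the state update. */
static void fake_enable_default(struct fake_disk *d)
{
	pthread_mutex_lock(&d->rqos_state_mutex);
	if (d->wb) {
		if (*d->wb == FAKE_OFF_DEFAULT)
			*d->wb = FAKE_ON_DEFAULT;
		pthread_mutex_unlock(&d->rqos_state_mutex);
		return;
	}
	pthread_mutex_unlock(&d->rqos_state_mutex);

	/* "Queue not registered": nothing to initialise. */
	if (!d->registered)
		return;

	fake_wbt_init(d); /* allocates without the mutex held */
}
```

Calling fake_enable_default() on a registered fake_disk allocates and
publishes the state without ever holding the mutex across malloc(), which
is the property the narrowed critical section is meant to provide.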

Similarly, in queue_wb_lat_store() we may need to hold this mutex only
while we invoke wbt_set_min_lat(). I also just realized that
queue_wb_lat_show() needs to take disk->rqos_state_mutex instead of
q->elevator_lock. I have attached a patch with the above changes for
your reference. You may want to address this in your next revision.

Thanks,
--Nilay

[-- Attachment #2: wbt.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fafe9e9e97cc..0248dc9438c3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -560,7 +560,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
 	ssize_t ret;
 	struct request_queue *q = disk->queue;
 
-	mutex_lock(&q->elevator_lock);
+	mutex_lock(&disk->rqos_state_mutex);
 	if (!wbt_rq_qos(q)) {
 		ret = -EINVAL;
 		goto out;
@@ -573,7 +573,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
 
 	ret = sysfs_emit(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
 out:
-	mutex_unlock(&q->elevator_lock);
+	mutex_unlock(&disk->rqos_state_mutex);
 	return ret;
 }
 
@@ -593,7 +593,6 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 		return -EINVAL;
 
 	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&disk->rqos_state_mutex);
 
 	rqos = wbt_rq_qos(q);
 	if (!rqos) {
@@ -618,11 +617,12 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 	 */
 	blk_mq_quiesce_queue(q);
 
+	mutex_lock(&disk->rqos_state_mutex);
 	wbt_set_min_lat(q, val);
+	mutex_unlock(&disk->rqos_state_mutex);
 
 	blk_mq_unquiesce_queue(q);
 out:
-	mutex_unlock(&disk->rqos_state_mutex);
 	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c8588bae1c1b..74ae7131ada9 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -714,17 +714,17 @@ void wbt_enable_default(struct gendisk *disk)
 	if (rqos) {
 		if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
 			RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
-		goto unlock;
+		mutex_unlock(&disk->rqos_state_mutex);
+		return;
 	}
+	mutex_unlock(&disk->rqos_state_mutex);
 
 	/* Queue not registered? Maybe shutting down... */
 	if (!blk_queue_registered(q))
-		goto unlock;
+		return;
 
 	if (queue_is_mq(q) && enable)
 		wbt_init(disk);
-unlock:
-	mutex_unlock(&disk->rqos_state_mutex);
 }
 EXPORT_SYMBOL_GPL(wbt_enable_default);
