From: Nilay Shroff <nilay@linux.ibm.com>
To: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org
Cc: "Shinichiro Kawasaki" <shinichiro.kawasaki@wdc.com>,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Christoph Hellwig" <hch@lst.de>
Subject: Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
Date: Tue, 29 Apr 2025 16:29:23 +0530
Message-ID: <e99c8ed8-bfcb-4a4d-aff7-c78506eddc09@linux.ibm.com>
In-Reply-To: <20250424152148.1066220-21-ming.lei@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 7898 bytes --]
On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
>
> #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> #5 (&eq->sysfs_lock){+.+.}-{4:4}:
> #4 (&q->elevator_lock){+.+.}-{4:4}:
> #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> #2 (fs_reclaim){+.+.}-{0:0}:
> #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> #0 (&q->debugfs_mutex){+.+.}-{4:4}:
>
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
>
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
While testing this patch on my machine with blktests, I hit the lockdep
splat shown below (I can reproduce it consistently):
run blktests block/005 at 2025-04-28 06:57:51
======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc2+ #174 Not tainted
------------------------------------------------------
check/8088 is trying to acquire lock:
c0000000a0c03538 (&disk->rqos_state_mutex){+.+.}-{4:4}, at: wbt_disable_default+0x9c/0x118
but task is already holding lock:
c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock+0x128/0xdd8
elevator_change+0x94/0x214
elv_iosched_store+0x14c/0x1f4
queue_attr_store+0x194/0x1d0
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x264/0x384
vfs_write+0x5b0/0x77c
ksys_write+0xa0/0x180
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
-> #2 (&q->q_usage_counter(io)#23){++++}-{0:0}:
blk_alloc_queue+0x46c/0x4bc
blk_mq_alloc_queue+0xc0/0x160
__blk_mq_alloc_disk+0x34/0x128
nvme_alloc_ns+0x140/0x1804 [nvme_core]
nvme_scan_ns+0x42c/0x564 [nvme_core]
async_run_entry_fn+0x9c/0x30c
process_one_work+0x514/0xd38
worker_thread+0x390/0x6dc
kthread+0x230/0x278
start_kernel_thread+0x14/0x18
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x114/0x150
__kmalloc_cache_noprof+0x70/0x5c0
wbt_init+0x64/0x2fc
wbt_enable_default+0x140/0x15c
elevator_change_done+0x314/0x3a8
elv_iosched_store+0x14c/0x1f4
queue_attr_store+0x194/0x1d0
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x264/0x384
vfs_write+0x5b0/0x77c
ksys_write+0xa0/0x180
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
-> #0 (&disk->rqos_state_mutex){+.+.}-{4:4}:
__lock_acquire+0x1b5c/0x29f8
lock_acquire+0x23c/0x3f8
__mutex_lock+0x128/0xdd8
wbt_disable_default+0x9c/0x118
bfq_init_queue+0x7b0/0x8c0
blk_mq_init_sched+0x29c/0x3a8
__elevator_change+0x3a4/0x8a4
elevator_change+0x1a4/0x214
elv_iosched_store+0x14c/0x1f4
queue_attr_store+0x194/0x1d0
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x264/0x384
vfs_write+0x5b0/0x77c
ksys_write+0xa0/0x180
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
other info that might help us debug this:
Chain exists of:
&disk->rqos_state_mutex --> &q->q_usage_counter(io)#23 --> &q->elevator_lock
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->elevator_lock);
                               lock(&q->q_usage_counter(io)#23);
                               lock(&q->elevator_lock);
  lock(&disk->rqos_state_mutex);
*** DEADLOCK ***
7 locks held by check/8088:
#0: c0000000873f2400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xa0/0x180
#1: c00000008c10c088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1e0/0x384
#2: c000000085239248 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1f8/0x384
#3: c0000000f801c190 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: elv_iosched_store+0x13c/0x1f4
#4: c00000005b8f6718 (&q->q_usage_counter(io)#23){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#5: c00000005b8f6750 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#6: c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214
stack backtrace:
CPU: 26 UID: 0 PID: 8088 Comm: check Kdump: loaded Not tainted 6.15.0-rc2+ #174 VOLUNTARY
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000d7497240] [c0000000017b9888] dump_stack_lvl+0x100/0x184 (unreliable)
[c0000000d7497270] [c0000000002b546c] print_circular_bug+0x448/0x604
[c0000000d7497320] [c0000000002b5874] check_noncircular+0x24c/0x26c
[c0000000d74973f0] [c0000000002bbb78] __lock_acquire+0x1b5c/0x29f8
[c0000000d7497520] [c0000000002b915c] lock_acquire+0x23c/0x3f8
[c0000000d7497620] [c00000000181277c] __mutex_lock+0x128/0xdd8
[c0000000d7497780] [c000000000c73bf8] wbt_disable_default+0x9c/0x118
[c0000000d74977c0] [c000000000c4c2c0] bfq_init_queue+0x7b0/0x8c0
[c0000000d7497890] [c000000000bff634] blk_mq_init_sched+0x29c/0x3a8
[c0000000d7497910] [c000000000bc2a18] __elevator_change+0x3a4/0x8a4
[c0000000d74979b0] [c000000000bc30bc] elevator_change+0x1a4/0x214
[c0000000d7497a00] [c000000000bc427c] elv_iosched_store+0x14c/0x1f4
[c0000000d7497ae0] [c000000000bd07ec] queue_attr_store+0x194/0x1d0
[c0000000d7497c00] [c000000000a40f00] sysfs_kf_write+0xbc/0x110
[c0000000d7497c50] [c000000000a3cc4c] kernfs_fop_write_iter+0x264/0x384
[c0000000d7497cb0] [c0000000008bb9bc] vfs_write+0x5b0/0x77c
[c0000000d7497d90] [c0000000008bbf88] ksys_write+0xa0/0x180
[c0000000d7497df0] [c000000000039f70] system_call_exception+0x1b0/0x4f0
[c0000000d7497e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffa413b034
NIP: 00007fffa413b034 LR: 00007fffa413b034 CTR: 0000000000000000
REGS: c0000000d7497e80 TRAP: 3000 Not tainted (6.15.0-rc2+)
MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44422408 XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007ffffd011260 000000010dfa7e00 0000000000000001
GPR04: 000000011c30b720 0000000000000004 0000000000000010 0000000000000001
GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffa43fab60 000000011c3adbc0 000000010dfa87b8
GPR16: 000000010dfa94d8 0000000020000000 0000000000000000 000000010deb9070
GPR20: 000000010df4beb8 00007ffffd011404 000000010df4f8a0 000000010dfa89bc
GPR24: 000000010dfa8a50 0000000000000000 000000011c30b720 0000000000000004
GPR28: 0000000000000004 00007fffa42418e0 000000011c30b720 0000000000000004
NIP [00007fffa413b034] 0x7fffa413b034
LR [00007fffa413b034] 0x7fffa413b034
--- interrupt: 3000
From the changes in this patch, it appears that the new
"disk->rqos_state_mutex" is held while wbt_init() allocates memory
dynamically. One option is to define a GFP_NOIO scope while holding
this mutex, but on closer inspection we don't really need to keep
holding the lock in wbt_enable_default() all the way until wbt_init()
returns. Unlocking the mutex as soon as we're done updating the rqos
state should probably be sufficient.
Similarly, in queue_wb_lat_store() we may need to hold this mutex only
while we invoke wbt_set_min_lat(). I also just realized that in
queue_wb_lat_show() we need to replace q->elevator_lock with
disk->rqos_state_mutex. I have attached a patch with the above changes
for your reference. You may address this in your next revision.
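To illustrate the idea of dropping the mutex before the allocating init path runs, here is a minimal user-space sketch built on pthreads. It is not kernel code and is not part of the attached patch. All names (model_disk, model_rqos, model_init, model_enable_default) are hypothetical stand-ins, and publishing the new object under the same mutex is an assumption of this model rather than what wbt_init() does.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are kernel symbols. */
enum model_state { MODEL_OFF_DEFAULT, MODEL_ON_DEFAULT };

struct model_rqos {
	enum model_state enable_state;
};

struct model_disk {
	pthread_mutex_t state_mutex;	/* models disk->rqos_state_mutex */
	struct model_rqos *rqos;	/* NULL until model_init() succeeds */
	bool registered;		/* models blk_queue_registered() */
};

/*
 * Models the allocating init path: memory is allocated with no lock
 * held, and the mutex is taken only to publish the result.
 */
static int model_init(struct model_disk *d)
{
	struct model_rqos *r = calloc(1, sizeof(*r));

	if (!r)
		return -1;
	r->enable_state = MODEL_ON_DEFAULT;

	pthread_mutex_lock(&d->state_mutex);
	if (d->rqos) {
		/* Another caller initialized first; keep its object. */
		pthread_mutex_unlock(&d->state_mutex);
		free(r);
		return 0;
	}
	d->rqos = r;
	pthread_mutex_unlock(&d->state_mutex);
	return 0;
}

/*
 * Models the narrowed locking: the mutex covers only the state update
 * and is released on every path before model_init() is called.
 */
static int model_enable_default(struct model_disk *d, bool enable)
{
	pthread_mutex_lock(&d->state_mutex);
	if (d->rqos) {
		if (enable && d->rqos->enable_state == MODEL_OFF_DEFAULT)
			d->rqos->enable_state = MODEL_ON_DEFAULT;
		pthread_mutex_unlock(&d->state_mutex);
		return 0;
	}
	pthread_mutex_unlock(&d->state_mutex);

	/* Not registered, or nothing to enable: no allocation needed. */
	if (!d->registered || !enable)
		return 0;

	return model_init(d);
}
```

A caller would initialize state_mutex with pthread_mutex_init(), set registered, and call model_enable_default(&d, true); building needs -pthread. The point of the model is only that no allocation happens while state_mutex is held, so the mutex never picks up a dependency on whatever the allocator may wait for.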
Thanks,
--Nilay
[-- Attachment #2: wbt.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fafe9e9e97cc..0248dc9438c3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -560,7 +560,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
ssize_t ret;
struct request_queue *q = disk->queue;
- mutex_lock(&q->elevator_lock);
+ mutex_lock(&disk->rqos_state_mutex);
if (!wbt_rq_qos(q)) {
ret = -EINVAL;
goto out;
@@ -573,7 +573,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
ret = sysfs_emit(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
out:
- mutex_unlock(&q->elevator_lock);
+ mutex_unlock(&disk->rqos_state_mutex);
return ret;
}
@@ -593,7 +593,6 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
return -EINVAL;
memflags = blk_mq_freeze_queue(q);
- mutex_lock(&disk->rqos_state_mutex);
rqos = wbt_rq_qos(q);
if (!rqos) {
@@ -618,11 +617,12 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
*/
blk_mq_quiesce_queue(q);
+ mutex_lock(&disk->rqos_state_mutex);
wbt_set_min_lat(q, val);
+ mutex_unlock(&disk->rqos_state_mutex);
blk_mq_unquiesce_queue(q);
out:
- mutex_unlock(&disk->rqos_state_mutex);
blk_mq_unfreeze_queue(q, memflags);
return ret;
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c8588bae1c1b..74ae7131ada9 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -714,17 +714,17 @@ void wbt_enable_default(struct gendisk *disk)
if (rqos) {
if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
- goto unlock;
+ mutex_unlock(&disk->rqos_state_mutex);
+ return;
}
+ mutex_unlock(&disk->rqos_state_mutex);
/* Queue not registered? Maybe shutting down... */
if (!blk_queue_registered(q))
- goto unlock;
+ return;
if (queue_is_mq(q) && enable)
wbt_init(disk);
-unlock:
- mutex_unlock(&disk->rqos_state_mutex);
}
EXPORT_SYMBOL_GPL(wbt_enable_default);