From: Ming Lei <tom.leiming@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Ming Lei <tom.leiming@gmail.com>,
Michael Wu <michael@allwinnertech.com>
Subject: [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work
Date: Thu, 23 Apr 2026 20:55:28 +0800
Message-ID: <20260423125528.2917171-1-tom.leiming@gmail.com>

On preemptible kernels, a three-way deadlock can occur involving
blk_mq_freeze_queue() and blk_mq_dispatch_list():

- Task A holds a filesystem lock (e.g. the f2fs io_rwsem) and enters
  __bio_queue_enter(), where it waits for mq_freeze_depth to drop
  to 0.
- Task B has frozen the queue (mq_freeze_depth == 1, via
  elevator_change()) and waits in blk_mq_freeze_queue_wait() for
  q_usage_counter to reach zero.
- Task C is about to sleep waiting for the filesystem lock. Before
  sleeping, schedule() calls sched_submit_work() -> blk_flush_plug()
  -> blk_mq_dispatch_list(), which takes a q_usage_counter reference
  via percpu_ref_get(). If Task C is preempted before the matching
  percpu_ref_put(), it is not scheduled back in, because its state
  is already TASK_UNINTERRUPTIBLE. It therefore holds the reference
  indefinitely, and the freeze in Task B can never complete.

This is fundamentally an ABBA deadlock between queue freeze and the
filesystem lock, exposed by preemption creating an artificial hold
on q_usage_counter during the plug flush.
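
Fix this by disabling preemption around blk_flush_plug() in
sched_submit_work(), so the task cannot be preempted while
blk_mq_dispatch_list() holds the q_usage_counter reference. The
_notrace variants are used because this runs on the scheduler path.
preempt_enable_no_resched_notrace() is sufficient because schedule()
goes on to call __schedule() and pick the next task immediately after
sched_submit_work() returns, so an extra reschedule check here would
be redundant.
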
Fixes: 73c101011926 ("block: initial patch for on-stack per-task plugging")
Reported-by: Michael Wu <michael@allwinnertech.com>
Tested-by: Michael Wu <michael@allwinnertech.com>
Link: https://lore.kernel.org/linux-block/20260417082744.30124-1-michael@allwinnertech.com/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e..4217aaaa8e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6966,7 +6966,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
+	preempt_disable_notrace();
 	blk_flush_plug(tsk->plug, true);
+	preempt_enable_no_resched_notrace();
 
 	lock_map_release(&sched_map);
 }
--
2.53.0