From: Ming Lei <tom.leiming@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Ming Lei <tom.leiming@gmail.com>,
Michael Wu <michael@allwinnertech.com>
Subject: [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work
Date: Thu, 23 Apr 2026 20:55:28 +0800 [thread overview]
Message-ID: <20260423125528.2917171-1-tom.leiming@gmail.com> (raw)
On preemptible kernels, a three-way deadlock can occur involving
blk_mq_freeze_queue and blk_mq_dispatch_list:
- Task A holds a filesystem lock (e.g., f2fs io_rwsem) and enters
__bio_queue_enter(), waiting for mq_freeze_depth == 0
- Task B holds mq_freeze_depth=1 (elevator_change) and waits for
q_usage_counter to reach zero in blk_mq_freeze_queue_wait()
- Task C is going to sleep waiting for the filesystem lock. Before
sleeping, schedule() calls sched_submit_work() -> blk_flush_plug()
-> blk_mq_dispatch_list(), which acquires q_usage_counter via
percpu_ref_get(). If Task C gets preempted before percpu_ref_put(),
it will not be scheduled back because the task is already in
uninterruptible sleep state (TASK_UNINTERRUPTIBLE). This means it
holds the percpu_ref indefinitely, preventing freeze from completing.
This is fundamentally an ABBA deadlock between queue freeze and the
filesystem lock, exposed by preemption creating an artificial hold
on q_usage_counter during the plug flush.
Fix by disabling preemption around blk_flush_plug() in
sched_submit_work(). The _notrace variants are used since this runs
in scheduler context. preempt_enable_no_resched_notrace() is correct
because we are already inside __schedule() and about to pick the next
task.
Fixes: 73c101011926 ("block: initial patch for on-stack per-task plugging")
Reported-by: Michael Wu <michael@allwinnertech.com>
Tested-by: Michael Wu <michael@allwinnertech.com>
Link: https://lore.kernel.org/linux-block/20260417082744.30124-1-michael@allwinnertech.com/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e..4217aaaa8e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6966,7 +6966,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
*/
+ preempt_disable_notrace();
blk_flush_plug(tsk->plug, true);
+ preempt_enable_no_resched_notrace();
lock_map_release(&sched_map);
}
--
2.53.0
next reply other threads:[~2026-04-23 12:55 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-23 12:55 Ming Lei [this message]
2026-05-12 5:56 ` [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work Xiaosen
2026-05-12 8:07 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260423125528.2917171-1-tom.leiming@gmail.com \
--to=tom.leiming@gmail.com \
--cc=axboe@kernel.dk \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=michael@allwinnertech.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.