* [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work
@ 2026-04-23 12:55 Ming Lei
From: Ming Lei @ 2026-04-23 12:55 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: linux-kernel, Ming Lei, Michael Wu
On preemptible kernels, a three-way deadlock can occur involving
blk_mq_freeze_queue and blk_mq_dispatch_list:
- Task A holds a filesystem lock (e.g., f2fs io_rwsem) and enters
__bio_queue_enter(), waiting for mq_freeze_depth == 0
- Task B holds mq_freeze_depth=1 (elevator_change) and waits for
q_usage_counter to reach zero in blk_mq_freeze_queue_wait()
- Task C is about to sleep waiting for the filesystem lock held by
  Task A. Before sleeping, schedule() calls sched_submit_work() ->
  blk_flush_plug() -> blk_mq_dispatch_list(), which takes a reference
  on q_usage_counter via percpu_ref_get(). If Task C is preempted
  before the matching percpu_ref_put(), it will not be scheduled back
  because it is already in uninterruptible sleep state
  (TASK_UNINTERRUPTIBLE), so it holds the percpu_ref indefinitely and
  the freeze can never complete.
This is fundamentally an ABBA deadlock between queue freeze and the
filesystem lock, exposed by preemption creating an artificial hold
on q_usage_counter during the plug flush.
Fix by disabling preemption around blk_flush_plug() in
sched_submit_work(). The _notrace variants are used because this code
runs in the scheduler path, where traced preempt-count operations
could recurse into the scheduler. preempt_enable_no_resched_notrace()
is correct because we are already inside __schedule() and about to
pick the next task, so no reschedule check is needed.
Fixes: 73c101011926 ("block: initial patch for on-stack per-task plugging")
Reported-by: Michael Wu <michael@allwinnertech.com>
Tested-by: Michael Wu <michael@allwinnertech.com>
Link: https://lore.kernel.org/linux-block/20260417082744.30124-1-michael@allwinnertech.com/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e..4217aaaa8e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6966,7 +6966,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
*/
+ preempt_disable_notrace();
blk_flush_plug(tsk->plug, true);
+ preempt_enable_no_resched_notrace();
lock_map_release(&sched_map);
}
--
2.53.0