public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work
@ 2026-04-23 12:55 Ming Lei
From: Ming Lei @ 2026-04-23 12:55 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: linux-kernel, Ming Lei, Michael Wu

On preemptible kernels, a three-way deadlock can occur involving
blk_mq_freeze_queue and blk_mq_dispatch_list:

- Task A holds a filesystem lock (e.g., f2fs io_rwsem) and enters
  __bio_queue_enter(), waiting for mq_freeze_depth == 0
- Task B holds mq_freeze_depth=1 (elevator_change) and waits for
  q_usage_counter to reach zero in blk_mq_freeze_queue_wait()
- Task C is about to sleep waiting for the filesystem lock. Before
  sleeping, schedule() calls sched_submit_work() -> blk_flush_plug()
  -> blk_mq_dispatch_list(), which takes a reference on
  q_usage_counter via percpu_ref_get(). If Task C is preempted before
  the matching percpu_ref_put(), it will not be scheduled back in:
  its state is already TASK_UNINTERRUPTIBLE, and the wakeup it needs
  can only come from Task A, which is itself blocked. Task C
  therefore holds the percpu_ref indefinitely, preventing the freeze
  from completing.

This is fundamentally an ABBA deadlock between queue freeze and the
filesystem lock, exposed by preemption, which creates an artificial
hold on q_usage_counter during the plug flush.

Fix by disabling preemption around blk_flush_plug() in
sched_submit_work(). The _notrace variants are used since this runs
in scheduler context. preempt_enable_no_resched_notrace() is correct
because we are already inside __schedule() and about to pick the next
task.

Fixes: 73c101011926 ("block: initial patch for on-stack per-task plugging")
Reported-by: Michael Wu <michael@allwinnertech.com>
Tested-by: Michael Wu <michael@allwinnertech.com>
Link: https://lore.kernel.org/linux-block/20260417082744.30124-1-michael@allwinnertech.com/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e..4217aaaa8e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6966,7 +6966,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
+	preempt_disable_notrace();
 	blk_flush_plug(tsk->plug, true);
+	preempt_enable_no_resched_notrace();

 	lock_map_release(&sched_map);
 }
--
2.53.0

