All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <tom.leiming@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Ming Lei <tom.leiming@gmail.com>,
	Michael Wu <michael@allwinnertech.com>
Subject: [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work
Date: Thu, 23 Apr 2026 20:55:28 +0800	[thread overview]
Message-ID: <20260423125528.2917171-1-tom.leiming@gmail.com> (raw)

On preemptible kernels, a three-way deadlock can occur involving
blk_mq_freeze_queue and blk_mq_dispatch_list:

- Task A holds a filesystem lock (e.g., f2fs io_rwsem) and enters
  __bio_queue_enter(), waiting for mq_freeze_depth == 0
- Task B holds mq_freeze_depth=1 (elevator_change) and waits for
  q_usage_counter to reach zero in blk_mq_freeze_queue_wait()
- Task C is going to sleep waiting for the filesystem lock. Before
  sleeping, schedule() calls sched_submit_work() -> blk_flush_plug()
  -> blk_mq_dispatch_list(), which acquires q_usage_counter via
  percpu_ref_get(). If Task C gets preempted before percpu_ref_put(),
  it will not be scheduled back because the task is already in
  uninterruptible sleep state (TASK_UNINTERRUPTIBLE). This means it
  holds the percpu_ref indefinitely, preventing freeze from completing.

This is fundamentally an ABBA deadlock between queue freeze and the
filesystem lock, exposed by preemption creating an artificial hold
on q_usage_counter during the plug flush.

Fix by disabling preemption around blk_flush_plug() in
sched_submit_work(). The _notrace variants are used since this runs
in scheduler context. preempt_enable_no_resched_notrace() is correct
because we are already inside __schedule() and about to pick the next
task.

Fixes: 73c101011926 ("block: initial patch for on-stack per-task plugging")
Reported-by: Michael Wu <michael@allwinnertech.com>
Tested-by: Michael Wu <michael@allwinnertech.com>
Link: https://lore.kernel.org/linux-block/20260417082744.30124-1-michael@allwinnertech.com/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e..4217aaaa8e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6966,7 +6966,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
+	preempt_disable_notrace();
 	blk_flush_plug(tsk->plug, true);
+	preempt_enable_no_resched_notrace();

 	lock_map_release(&sched_map);
 }
--
2.53.0


             reply	other threads:[~2026-04-23 12:55 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-23 12:55 Ming Lei [this message]
2026-05-12  5:56 ` [PATCH] sched: disable preemption around blk_flush_plug in sched_submit_work Xiaosen
2026-05-12  8:07   ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260423125528.2917171-1-tom.leiming@gmail.com \
    --to=tom.leiming@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=michael@allwinnertech.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.