public inbox for linux-kernel@vger.kernel.org
From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <jstultz@google.com>,
	Vineeth Pillai <vineethrp@google.com>,
	 Sonam Sanju <sonam.sanju@intel.com>,
	Sean Christopherson <seanjc@google.com>,
	 Kunwu Chan <kunwu.chan@linux.dev>, Tejun Heo <tj@kernel.org>,
	 Joel Fernandes <joelagnelf@nvidia.com>,
	Qais Yousef <qyousef@layalina.io>,
	 Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	 Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	 Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	 Steven Rostedt <rostedt@goodmis.org>,
	Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>,
	 Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	 Metin Kaya <Metin.Kaya@arm.com>,
	Xuewen Yan <xuewen.yan94@gmail.com>,
	 K Prateek Nayak <kprateek.nayak@amd.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	 Daniel Lezcano <daniel.lezcano@linaro.org>,
	Suleiman Souhlal <suleiman@google.com>,
	 kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
	kernel-team@android.com
Subject: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
Date: Thu, 30 Apr 2026 21:50:46 +0000	[thread overview]
Message-ID: <20260430215103.2978955-2-jstultz@google.com> (raw)
In-Reply-To: <20260430215103.2978955-1-jstultz@google.com>

Vineeth reported seeing a KVM-related deadlock connected to
workqueue lockups using the android17-6.18 tree, which has
Proxy Execution enabled (using the full patch stack). I have
since reproduced it on v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu
  pwq (pool workqueue)
- One of the kvm-irqfd-cleanup workers (say A) takes a mutex and
  then calls synchronize_srcu_expedited()
- Another kvm-irqfd-cleanup worker (say B) tries to acquire the
  lock and blocks
- On the way to blocking, the cpu gets an IPI, and on return
  from the IPI it calls __schedule() without having completed
  the workqueue accounting (worker->sleeping = 0 and
  decrementing pool->nr_running). That accounting is done in
  sched_submit_work() -> wq_worker_sleeping(), called from
  schedule(), but we were preempted before getting there.
- Proxy execution doesn't immediately take B off the runqueue,
  as p->blocked_on was set during __mutex_lock
- The next time B is picked to run, the scheduler notices A (the
  mutex holder) is not on a runqueue, and so blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- Things are then stuck: A is waiting for the workqueue to be
  run, but B can't run the workqueue as it is blocked on A.

The trouble is that, with Proxy Execution,
__mutex_lock_common() sets the task state to
TASK_UNINTERRUPTIBLE and sets blocked_on before calling into
schedule(), where sched_submit_work() would be called.

But if an IPI comes in before we call schedule(), the interrupt
return path will call __schedule(SM_PREEMPT) directly. This
causes the scheduler to see the current task as blocked_on a
mutex and to deactivate it (because the mutex owner is off the
runqueue).

Since it's deactivated, it won't be run, and it won't get to
call sched_submit_work(). We then see workqueue stalls.

Without proxy-execution things work, as the SM_PREEMPT case
prevents the task from being dequeued, so it can be reselected
and run, which allows it to finish calling into schedule() and
to call sched_submit_work() before actually blocking.

Peter didn't like my earlier attempt to solve this by clearing
the blocked_on state and marking the task's __state runnable,
as we shouldn't modify __state from schedule().

So this approach is slightly different. We use the low bit of
the blocked_on pointer as a latch flag. When the task sets the
blocked_on pointer, we don't consider it for proxy execution
until the latch bit is set.

We then only set the latch bit in __schedule() when we are not
in the SM_PREEMPT case and are actually about to block the task.

This makes the blocked_on state machine a little more complex:
  NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL

With additional transitions:
  // only done on current
  ptr:latched -> NULL

  // only done on current or when trying to set waking
  ptr:unlatched -> NULL

NULL and ptr:unlatched are functionally equivalent, except that
only ptr:unlatched can transition to ptr:latched.

Credit for this idea is due to Vineeth and Suleiman, who
proposed something very similar when the issue was first
reported, as well as to Peter for suggesting it, and to
K Prateek, who helped iterate and shared an initial working
version.

Many thanks to Vineeth for figuring this very obscure race out
and for implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Switched to using an extra flag bit to ensure we don't proxy
  early in SM_PREEMPT cases, as suggested by Peter (and Vineeth
  and Suleiman), and developed with help from K Prateek

Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h | 64 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/core.c   | 15 ++++++----
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb51..8b9e971d98f67 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 #ifndef CONFIG_PREEMPT_RT
 
 /*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
+ * The proxy exec blocked_on pointer uses its low bit as a latch
+ * flag, which indicates whether the blocked_on value is to be used
+ * for proxying or not.
+ *
+ * The state machine looks something like
+ *   NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
+ *
+ * With some additional transitions:
+ *   ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
+ *   ptr:latched -> NULL (done only on current)
+ *
+ * 1) NULL and ptr:unlatched are effectively equivalent; no proxying occurs
+ * 2) ptr:latched is the state in which proxying will occur
+ * 3) PROXY_WAKING is used when the task is being woken, to ensure we
+ *    return-migrate proxy-migrated tasks before running them (note it
+ *    has the latch bit set).
  */
-#define PROXY_WAKING ((struct mutex *)(-1L))
+#define PROXY_BLOCKED_LATCH (1UL)
+#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
+#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
+
+static inline bool task_is_blocked_on(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return !!((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
+}
+
+static inline void __set_task_blocked_on_latched(struct task_struct *p)
+{
+	lockdep_assert_held_once(&p->blocked_lock);
+	WARN_ON_ONCE(!p->blocked_on);
+	p->blocked_on = (struct mutex *)((unsigned long)p->blocked_on | PROXY_BLOCKED_LATCH);
+}
+
+static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)
+{
+	if (!task_is_blocked_on(p))
+		return NULL;
+	if (p->blocked_on == PROXY_WAKING)
+		return PROXY_WAKING;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
+}
 
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	if (p->blocked_on == PROXY_WAKING)
+		return NULL;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2215,6 +2253,8 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	struct mutex *bo = p->blocked_on;
+
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
@@ -2222,7 +2262,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && bo && __get_task_blocked_on(p) != m && bo != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2242,15 +2282,17 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 		return;
 	}
 
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
+	/* Don't set PROXY_WAKING if we are not really blocked_on */
+	if (!task_is_blocked_on(p)) {
+		p->blocked_on = NULL;  /* clear if unlatched */
 		return;
+	}
 	/*
 	 * There may be cases where we set PROXY_WAKING on tasks that were
 	 * already set to waking, but make sure we are not changing
 	 * the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && __get_task_blocked_on(p) != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = PROXY_WAKING;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..2f912bf698446 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
-		return false;
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		if (p->blocked_on) {
+			__set_task_blocked_on_latched(p);
+			return false;
+		}
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6833,7 +6838,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; (mutex = __get_task_latched_blocked_on(p)); p = owner) {
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6851,7 +6856,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (mutex != __get_task_latched_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7107,7 +7112,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(task_is_blocked_on(next))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
-- 
2.54.0.545.g6539524ca2-goog



Thread overview: 17+ messages
2026-04-30 21:50 [PATCH v2 0/2] Proxy Execution fixes for v7.1-rc John Stultz
2026-04-30 21:50 ` John Stultz [this message]
2026-04-30 23:53   ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
2026-05-01  6:39   ` K Prateek Nayak
2026-05-01  7:11     ` John Stultz
2026-05-01 13:21   ` Peter Zijlstra
2026-05-01 15:55     ` K Prateek Nayak
2026-05-01 18:59       ` Peter Zijlstra
2026-05-01 22:26         ` John Stultz
2026-05-03 18:42           ` K Prateek Nayak
2026-05-04  5:37             ` K Prateek Nayak
2026-05-05  3:32               ` John Stultz
2026-05-05  4:37                 ` K Prateek Nayak
2026-05-04 21:33             ` John Stultz
2026-04-30 21:50 ` [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
2026-05-01  6:57   ` K Prateek Nayak
2026-05-04 22:30   ` kernel test robot
