From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 30 Apr 2026 21:50:46 +0000
In-Reply-To: <20260430215103.2978955-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References:
 <20260430215103.2978955-1-jstultz@google.com>
X-Mailer: git-send-email 2.54.0.545.g6539524ca2-goog
Message-ID: <20260430215103.2978955-2-jstultz@google.com>
Subject: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
From: John Stultz
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
 Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
 Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
 K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
 kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="UTF-8"

Vineeth reported seeing a KVM-related deadlock connected to workqueue
lockups using the android17-6.18 tree, which has Proxy Execution
enabled (using the full patch stack), and I've subsequently reproduced
it on v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu pwq
  (workqueue pool).
- One of the kvm-irqfd-cleanup workers (say A) takes a mutex and then
  calls synchronize_srcu_expedited().
- Another kvm-irqfd-cleanup worker (say B) tries to acquire the lock
  and gets blocked.
- On the way to blocking, the cpu gets an IPI, and on return from the
  IPI it calls __schedule() without having completed the workqueue
  accounting (setting worker->sleeping = 0 and decrementing
  pool->nr_running). That accounting is done in sched_submit_work() ->
  wq_worker_sleeping(), called from schedule(), but we got preempted
  before reaching it.
- Proxy execution doesn't immediately take B off the runqueue, as
  p->blocked_on is set during __mutex_lock.
- The next time B is picked for running, the scheduler notices that A
  (the mutex holder) is not on a runqueue and then blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- And things are then stuck.
A is waiting for the workqueue work to be run, but B can't run the
workqueue as it is blocked on A.

The trouble is that with Proxy Execution, __mutex_lock_common() sets
the task state to TASK_UNINTERRUPTIBLE and sets blocked_on before
calling into schedule(), where sched_submit_work() will be called.
But if an IPI comes in before we call schedule(), the interrupt calls
__schedule(SM_PREEMPT) directly. This causes the scheduler to see the
current task as blocked_on and deactivate it (because the owner is off
the runqueue). Since it's deactivated, it won't be run, and it won't
get to call sched_submit_work(). And then we see workqueue stalls.

Without proxy execution, things work, as the SM_PREEMPT case prevents
the task from being dequeued; it can be reselected and run, which
allows it to finish calling into schedule() and to call
sched_submit_work() before actually blocking.

Peter didn't like my earlier attempt to solve this by clearing the
blocked_on state and marking the task __state RUNNABLE, as we
shouldn't modify __state from schedule(). So this approach is slightly
different: we use the low bit of the blocked_on pointer as a latch bit
flag. When the task sets the blocked_on pointer, we don't consider it
for use with proxy execution until the latch is set. We then only set
the latch bit in __schedule() when we are not in an SM_PREEMPT case
and are considering blocking the task.

This makes the blocked_on state machine a little more complex:

  NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL

With additional transitions:

  ptr:latched -> NULL    // only done on current
  ptr:unlatched -> NULL  // only done on current or when trying to set waking

NULL and ptr:unlatched are functionally equivalent except for the
ability to transition to ptr:latched.

Credit for this idea is due to Vineeth and Suleiman, who had proposed
something very similar when the issue was first reported.
As well as to Peter for suggesting it, and to K Prateek, who helped
iterate and shared an initial working version.

Many thanks to Vineeth for figuring this very obscure race out and for
implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai
Signed-off-by: John Stultz
---
v2:
* Switch to using an extra flag bit to ensure we don't proxy early in
  SM_PREEMPT cases, as suggested by Peter (and Vineeth and Suleiman),
  and developed with help from K Prateek

Cc: Vineeth Pillai
Cc: Sonam Sanju
Cc: Sean Christopherson
Cc: Kunwu Chan
Cc: Tejun Heo
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 include/linux/sched.h | 64 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/core.c   | 15 ++++++----
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb51..8b9e971d98f67 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 
 #ifndef CONFIG_PREEMPT_RT
 /*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
+ * The proxy exec blocked_on pointer value uses the low bit as a latch
+ * value which clarifies if the blocked_on value is used for proxying or
+ * not.
+ *
+ * The state machine looks something like
+ * NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
+ *
+ * With some additional transitions:
+ * ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
+ * ptr:latched -> NULL (done only on current)
+ *
+ * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
+ * 2) ptr:latched is the state when proxying will occur
+ * 3) PROXY_WAKING is used when the task is being woken to ensure we
+ *    return-migrate proxy-migrated tasks before running them (note it has
+ *    the latch bit set).
  */
-#define PROXY_WAKING ((struct mutex *)(-1L))
+#define PROXY_BLOCKED_LATCH (1UL)
+#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
+#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
+
+static inline struct mutex *task_is_blocked_on(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
+}
+
+static inline void __set_task_blocked_on_latched(struct task_struct *p)
+{
+	lockdep_assert_held_once(&p->blocked_lock);
+	WARN_ON_ONCE(!p->blocked_on);
+	p->blocked_on = (struct mutex *)((unsigned long)p->blocked_on | PROXY_BLOCKED_LATCH);
+}
+
+static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)
+{
+	if (!task_is_blocked_on(p))
+		return NULL;
+	if (p->blocked_on == PROXY_WAKING)
+		return PROXY_WAKING;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
+}
 
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	if (p->blocked_on == PROXY_WAKING)
+		return NULL;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2215,6 +2253,8 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	struct mutex *bo = p->blocked_on;
+
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
@@ -2222,7 +2262,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && bo && __get_task_blocked_on(p) != m && bo != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2242,15 +2282,17 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 		return;
 	}
 
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
+	/* Don't set PROXY_WAKING if we are not really blocked_on */
+	if (!task_is_blocked_on(p)) {
+		p->blocked_on = NULL; /* clear if unlatched */
 		return;
+	}
 
 	/*
 	 * There may be cases where we set PROXY_WAKING on tasks that were
 	 * already set to waking, but make sure we are not changing
 	 * the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && __get_task_blocked_on(p) != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = PROXY_WAKING;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..2f912bf698446 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
-		return false;
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		if (p->blocked_on) {
+			__set_task_blocked_on_latched(p);
+			return false;
+		}
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6833,7 +6838,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; (mutex = __get_task_latched_blocked_on(p)); p = owner) {
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6851,7 +6856,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (mutex != __get_task_latched_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7107,7 +7112,7 @@ static void __sched notrace __schedule(int sched_mode)
 	struct task_struct *prev_donor = rq->donor;
 
 	rq_set_donor(rq, next);
-	if (unlikely(next->blocked_on)) {
+	if (unlikely(task_is_blocked_on(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
 			zap_balance_callbacks(rq);
-- 
2.54.0.545.g6539524ca2-goog