public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Proxy Execution fixes for v7.1-rc
@ 2026-04-30 21:50 John Stultz
  2026-04-30 21:50 ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
  2026-04-30 21:50 ` [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
  0 siblings, 2 replies; 17+ messages in thread
From: John Stultz @ 2026-04-30 21:50 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hey All,
  So in testing with the full Proxy Execution series,
Vineeth Pillai managed to trip some interesting bugs that
initially looked to be KVM or RCU related[1], but which he later
diagnosed as Proxy Execution related, creating a useful test
driver to reproduce them.

I found these same issues could be triggered with the upstream
portions of Proxy Execution, so I wanted to send along these
fixes for v7.1-rc.

Again, a huge thanks to Vineeth for uncovering these issues
that have evaded all my stress testing so far!

New in this version:
* Peter didn't like the change modifying task __state in
  schedule() and so I've reworked this to use an extra bit flag
  in the low bit of the blocked_on pointer. This does make the
  change a little larger than I'd like for a fix, but I think
  minimizing it further would create a bit of a mess.

Anyway, any feedback would be greatly appreciated!

Thanks
-john

[1]: https://lore.kernel.org/lkml/20260320125633.2290675-1-vineeth@bitbyteword.org/

Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (2):
  sched: proxy-exec: Close race causing workqueue work being delayed
  locking: mutex: Fix proxy-exec potentially deactivating tasks marked
    TASK_RUNNING

 include/linux/sched.h  | 64 ++++++++++++++++++++++++++++++++++--------
 kernel/locking/mutex.c |  1 +
 kernel/sched/core.c    | 15 ++++++----
 3 files changed, 64 insertions(+), 16 deletions(-)

-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-30 21:50 [PATCH v2 0/2] Proxy Execution fixes for v7.1-rc John Stultz
@ 2026-04-30 21:50 ` John Stultz
  2026-04-30 23:53   ` John Stultz
                     ` (2 more replies)
  2026-04-30 21:50 ` [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
  1 sibling, 3 replies; 17+ messages in thread
From: John Stultz @ 2026-04-30 21:50 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Vineeth reported seeing a KVM related deadlock connected to
workqueue lockups using the android17-6.18 tree, which has
Proxy Execution enabled (using the full patch stack), but I've
subsequently reproduced it on v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu
  pwq (workqueue pool)
- One kvm-irqfd-cleanup worker (say A) takes a mutex and then
  calls synchronize_srcu_expedited()
- Another kvm-irqfd-cleanup worker (say B) tries to
  acquire the lock and then gets blocked
- On the way to blocking, B's cpu gets an IPI, and on return
  from the IPI it calls __schedule() without getting to complete
  the workqueue accounting (setting worker->sleeping = 0 and
  decrementing pool->nr_running). That accounting is done in
  sched_submit_work() -> wq_worker_sleeping(), called from
  schedule(), and we got preempted before reaching it.
- Proxy execution doesn't immediately take B off the runqueue,
  as p->blocked_on was set during __mutex_lock()
- The next time B is picked to run, the scheduler notices A (the
  mutex holder) is not on a runqueue and blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- Things are then stuck: A is waiting for the workqueue to
  be run, but B can't run the workqueue as it is blocked on A.

The trouble is that with Proxy Execution, in
__mutex_lock_common() we set the task state to
TASK_UNINTERRUPTIBLE, and set blocked_on before calling into
schedule(), where sched_submit_work() will be called.

But if an IPI comes in before we call schedule() the interrupt
will call __schedule(SM_PREEMPT) directly. This causes the
scheduler to see the current task as blocked_on, and deactivate
it (because the owner is off the runqueue).

Since it's deactivated, it won't be run, and it won't get to
call sched_submit_work(). And then we see workqueue stalls.

Without proxy-execution, things work, as the SM_PREEMPT case
will prevent the task from being dequeued, and it can be
reselected again and run, which will allow it to finish calling
into schedule() and calling sched_submit_work() before actually
blocking.

Peter didn't like my earlier attempt to solve this by clearing
the blocked_on state and setting the task's __state to
TASK_RUNNING, as we shouldn't modify __state from schedule().

So this approach is slightly different. We use the low bit of
the blocked_on pointer as a latch bit flag. When the task sets
the blocked_on pointer, we don't consider it for use with proxy
execution until the latch is set.

We then only set the latch bit in __schedule() when we are not
in an SM_PREEMPT case and are considering blocking the task.

This makes the blocked_on state machine a little more complex:
  NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL

With additional transitions:
  // only done on current
  ptr:latched -> NULL

  // only done on current or when trying to set waking
  ptr:unlatched -> NULL

And where NULL and ptr:unlatched are functionally equivalent
except for the ability to transition to ptr:latched.

Credit for this idea is due to Vineeth and Suleiman, who had
proposed something very similar when the issue was first
reported, as well as to Peter for suggesting it and to K Prateek,
who helped iterate and shared an initial working version.

Many thanks to Vineeth for figuring this very obscure race out
and for implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Switch to using an extra flag bit to ensure we don't proxy early
  in SM_PREEMPT cases, as suggested by Peter (and Vineeth and
  Suleiman), and developed with K Prateek's help

Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h | 64 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/core.c   | 15 ++++++----
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb51..8b9e971d98f67 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 #ifndef CONFIG_PREEMPT_RT
 
 /*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
+ * The proxy exec blocked_on pointer value uses the low bit as a latch
+ * value which clarifies if the blocked_on value is used for proxying or
+ * not.
+ *
+ * The state machine looks something like
+ *   NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
+ *
+ * With some additional transitions:
+ *   ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
+ *   ptr:latched -> NULL (done only on current)
+ *
+ * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
+ * 2) ptr:latched is the state when proxying will occur
+ * 3) PROXY_WAKING is used when the task is being woken to  ensure we
+ *    return-migrate proxy-migrated tasks before running them (note it has
+ *    the latch bit set).
  */
-#define PROXY_WAKING ((struct mutex *)(-1L))
+#define PROXY_BLOCKED_LATCH (1UL)
+#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
+#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
+
+static inline struct mutex *task_is_blocked_on(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
+}
+
+static inline void __set_task_blocked_on_latched(struct task_struct *p)
+{
+	lockdep_assert_held_once(&p->blocked_lock);
+	WARN_ON_ONCE(!p->blocked_on);
+	p->blocked_on = (struct mutex *)((unsigned long)p->blocked_on | PROXY_BLOCKED_LATCH);
+}
+
+static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)
+{
+	if (!task_is_blocked_on(p))
+		return NULL;
+	if (p->blocked_on == PROXY_WAKING)
+		return PROXY_WAKING;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
+}
 
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	if (p->blocked_on == PROXY_WAKING)
+		return NULL;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2215,6 +2253,8 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	struct mutex *bo = p->blocked_on;
+
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
@@ -2222,7 +2262,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && bo && __get_task_blocked_on(p) != m && bo != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2242,15 +2282,17 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 		return;
 	}
 
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
+	/* Don't set PROXY_WAKING if we are not really blocked_on  */
+	if (!task_is_blocked_on(p)) {
+		p->blocked_on = NULL;  /* clear if unlatched */
 		return;
+	}
 	/*
 	 * There may be cases where we set PROXY_WAKING on tasks that were
 	 * already set to waking, but make sure we are not changing
 	 * the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && __get_task_blocked_on(p) != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = PROXY_WAKING;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..2f912bf698446 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
-		return false;
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		if (p->blocked_on) {
+			__set_task_blocked_on_latched(p);
+			return false;
+		}
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6833,7 +6838,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; (mutex = __get_task_latched_blocked_on(p)); p = owner) {
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6851,7 +6856,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (mutex != __get_task_latched_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7107,7 +7112,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(task_is_blocked_on(next))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-30 21:50 [PATCH v2 0/2] Proxy Execution fixes for v7.1-rc John Stultz
  2026-04-30 21:50 ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
@ 2026-04-30 21:50 ` John Stultz
  2026-05-01  6:57   ` K Prateek Nayak
  2026-05-04 22:30   ` kernel test robot
  1 sibling, 2 replies; 17+ messages in thread
From: John Stultz @ 2026-04-30 21:50 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Vineeth came up with a test driver that could trip
workqueue stalls. After I fixed one issue this test found,
Vineeth reported the test was still failing.

Greatly simplified: a task that tries to take a mutex already
owned by another, sleeping task can hit an edge case in
__mutex_lock_common().

If the task fails to get the lock, calls into schedule(), and
gets a spurious wakeup, it will find that it is the first waiter
and go into the mutex_optimistic_spin() logic. Before calling
mutex_optimistic_spin(), we clear the task's blocked_on state,
since mutex_optimistic_spin() may call schedule() if
need_resched() is set.

After mutex_optimistic_spin() fails, we set blocked_on again,
restart the main mutex loop, try to take the lock and call into
schedule_preempt_disabled().

From there, with proxy-execution, we'll see the task is
blocked_on, follow the chain, see the owner is sleeping and
dequeue the waiting task from the runqueue.

This all sounds fine and reasonable. But what I had missed is
that in mutex_optimistic_spin(), not only do we call schedule(),
but we set the task state to TASK_RUNNING right before doing so.

This is ok for that invocation of schedule(). But when we come
back, we re-set the blocked_on pointer we had just cleared without
restoring the task state to TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE.

This means we have a task that is blocked_on but TASK_RUNNING,
so when the proxy execution code dequeues the task, we are
in trouble, since future wakeups will be short-circuited by the
ttwu_state_match() check.

Thus, to avoid this, after mutex_optimistic_spin(), set the task
state back when we set blocked_on.

Many thanks again to Vineeth for his very useful test
driver, which uncovered this long-hidden bug that I hadn't
tripped in all my testing! Very impressed with the problems he's
uncovered!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/locking/mutex.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 09534628dc01a..a93d4c6bee1a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			raw_spin_lock_irqsave(&lock->wait_lock, flags);
 			raw_spin_lock(&current->blocked_lock);
 			__set_task_blocked_on(current, lock);
+			set_current_state(state);
 
 			if (opt_acquired)
 				break;
-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-30 21:50 ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
@ 2026-04-30 23:53   ` John Stultz
  2026-05-01  6:39   ` K Prateek Nayak
  2026-05-01 13:21   ` Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: John Stultz @ 2026-04-30 23:53 UTC (permalink / raw)
  To: LKML
  Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Apr 30, 2026 at 2:51 PM John Stultz <jstultz@google.com> wrote:
>
> +static inline struct mutex *task_is_blocked_on(struct task_struct *p)
> +{
> +       if (!sched_proxy_exec())
> +               return false;
> +       return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
> +}

Bah, this function should be a bool.  I'll fix this up for the next
revision along with any other feedback I get.

thanks
-john

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-30 21:50 ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
  2026-04-30 23:53   ` John Stultz
@ 2026-05-01  6:39   ` K Prateek Nayak
  2026-05-01  7:11     ` John Stultz
  2026-05-01 13:21   ` Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: K Prateek Nayak @ 2026-05-01  6:39 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hello John,

Mostly cosmetic nitpicks. The overall idea looks good.

On 5/1/2026 3:20 AM, John Stultz wrote:
> @@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
>  #ifndef CONFIG_PREEMPT_RT
>  
>  /*
> - * With proxy exec, if a task has been proxy-migrated, it may be a donor
> - * on a cpu that it can't actually run on. Thus we need a special state
> - * to denote that the task is being woken, but that it needs to be
> - * evaluated for return-migration before it is run. So if the task is
> - * blocked_on PROXY_WAKING, return migrate it before running it.
> + * The proxy exec blocked_on pointer value uses the low bit as a latch
> + * value which clarifies if the blocked_on value is used for proxying or
> + * not.
> + *
> + * The state machine looks something like
> + *   NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
> + *
> + * With some additional transitions:
> + *   ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
> + *   ptr:latched -> NULL (done only on current)
> + *
> + * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
> + * 2) ptr:latched is the state when proxying will occur
> + * 3) PROXY_WAKING is used when the task is being woken to  ensure we
> + *    return-migrate proxy-migrated tasks before running them (note it has
> + *    the latch bit set).
>   */
> -#define PROXY_WAKING ((struct mutex *)(-1L))
> +#define PROXY_BLOCKED_LATCH (1UL)
> +#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))

nit. I think PROXY_BLOCKED_ON_MUTEX() would be a better name since this
is returning the true mutex pointer back. No strong feelings, I'll defer
to others for more comments.


> +#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
> +
> +static inline struct mutex *task_is_blocked_on(struct task_struct *p)

I think this can take the role of task_is_blocked(), no?

Only the caller in try_to_block_task() will require looking at the
raw blocked_on state, but other than that, it is safe for the
scheduler to move the preempted task around until it has grabbed
the BO latch.

> +{
> +	if (!sched_proxy_exec())
> +		return false;
> +	return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
> +}
> +
> +static inline void __set_task_blocked_on_latched(struct task_struct *p)
> +{

Are you planning to reuse this sometime later in the series? If not,
I think we can convert this to "try_set_task_blocked_on_latch()" and
return false if it finds blocked_on has already been cleared.

That way the lock + check in try_to_block_task() can be moved here.

> +	lockdep_assert_held_once(&p->blocked_lock);
> +	WARN_ON_ONCE(!p->blocked_on);
> +	p->blocked_on = (struct mutex *)((unsigned long)p->blocked_on | PROXY_BLOCKED_LATCH);
> +}
> +
> +static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)

I think this can be __get_task_blocked_on() ...

> +{
> +	if (!task_is_blocked_on(p))
> +		return NULL;
> +	if (p->blocked_on == PROXY_WAKING)
> +		return PROXY_WAKING;
> +	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
> +}
>  
>  static inline struct mutex *__get_task_blocked_on(struct task_struct *p)

... and this can be __get_task_blocked_on_raw() since only one caller in
kernel/locking/mutex.h really cares about the ~PROXY_BLOCKED_LATCH
value outside of this file.

Everything in the sched bits can then simply be __get_task_blocked_on()
and that seems much cleaner.

Thoughts?

>  {
>  	lockdep_assert_held_once(&p->blocked_lock);
> -	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
> +	if (p->blocked_on == PROXY_WAKING)
> +		return NULL;
> +	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
>  }
>  
>  static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
> @@ -2215,6 +2253,8 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
>  
>  static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
>  {
> +	struct mutex *bo = p->blocked_on;
> +
>  	/* Currently we serialize blocked_on under the task::blocked_lock */
>  	lockdep_assert_held_once(&p->blocked_lock);
>  	/*
> @@ -2222,7 +2262,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
>  	 * blocked_on relationships, but make sure we are not
>  	 * clearing the relationship with a different lock.
>  	 */
> -	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
> +	WARN_ON_ONCE(m && bo && __get_task_blocked_on(p) != m && bo != PROXY_WAKING);
>  	p->blocked_on = NULL;
>  }
>  
> @@ -2242,15 +2282,17 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
>  		return;
>  	}
>  
> -	/* Don't set PROXY_WAKING if blocked_on was already cleared */
> -	if (!p->blocked_on)
> +	/* Don't set PROXY_WAKING if we are not really blocked_on  */
> +	if (!task_is_blocked_on(p)) {
> +		p->blocked_on = NULL;  /* clear if unlatched */
>  		return;
> +	}
>  	/*
>  	 * There may be cases where we set PROXY_WAKING on tasks that were
>  	 * already set to waking, but make sure we are not changing
>  	 * the relationship with a different lock.
>  	 */
> -	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
> +	WARN_ON_ONCE(m && __get_task_blocked_on(p) != m && p->blocked_on != PROXY_WAKING);
>  	p->blocked_on = PROXY_WAKING;
>  }
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..2f912bf698446 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  	 * blocked on a mutex, and we want to keep it on the runqueue
>  	 * to be selectable for proxy-execution.
>  	 */
> -	if (!should_block)
> -		return false;
> +	if (!should_block) {
> +		guard(raw_spinlock)(&p->blocked_lock);
> +		if (p->blocked_on) {
> +			__set_task_blocked_on_latched(p);
> +			return false;
> +		}
> +	}

In my head, this as:

    if (!should_block && try_to_latch_task_blocked_on(p))
           return false;

seems much cleaner. I'll defer to others for comments.

>  
>  	p->sched_contributes_to_load =
>  		(task_state & TASK_UNINTERRUPTIBLE) &&
> @@ -6833,7 +6838,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	int owner_cpu;
>  
>  	/* Follow blocked_on chain. */
> -	for (p = donor; (mutex = p->blocked_on); p = owner) {
> +	for (p = donor; (mutex = __get_task_latched_blocked_on(p)); p = owner) {
>  		/* if its PROXY_WAKING, do return migration or run if current */
>  		if (mutex == PROXY_WAKING) {
>  			if (task_current(rq, p)) {
> @@ -6851,7 +6856,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  		guard(raw_spinlock)(&p->blocked_lock);
>  
>  		/* Check again that p is blocked with blocked_lock held */
> -		if (mutex != __get_task_blocked_on(p)) {
> +		if (mutex != __get_task_latched_blocked_on(p)) {
>  			/*
>  			 * Something changed in the blocked_on chain and
>  			 * we don't know if only at this level. So, let's
> @@ -7107,7 +7112,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>  
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(task_is_blocked_on(next))) {
>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
>  				zap_balance_callbacks(rq);

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-30 21:50 ` [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
@ 2026-05-01  6:57   ` K Prateek Nayak
  2026-05-04 22:30   ` kernel test robot
  1 sibling, 0 replies; 17+ messages in thread
From: K Prateek Nayak @ 2026-05-01  6:57 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hello John,

On 5/1/2026 3:20 AM, John Stultz wrote:
> Vineeth found came up with a test driver that could trip up
> workqueue stalls. After fixing one issue this test found,
> Vineeth reported the test was still failing.
> 
> Greatly simplified, a task that tries to take a mutex already
> owned by another task that is sleeping, can hit a edge case in
> the mutex_lock_common() case.
> 
> If the task fails to get the lock, calls into schedule, but gets
> a spurious wakeup, it will find that it is first waiter, and
> go into the mutex_optimistic_spin() logic. Though before calling
> mutex_optimistic_spin(), we clear task blocked_on state, since
> mutex_optimistic_spin() may call schedule() if need_resched() is
> set.
> 
> After mutex_optimistic_spin() fails, we set blocked_on again,
> restart the main mutex loop, try to take the lock and call into
> schedule_preempt_disabled().
> 
> From there, with proxy-execution, we'll see the task is
> blocked_on, follow the chain, see the owner is sleeping and
> dequeue the waiting task from the runqueue.
> 
> This all sounds fine and reasonable.  But what I had missed is
> that in mutex_optimistic_spin(), not only do we call schedule()
> but we set TASK_RUNNABLE right before doing so.
> 
> This is ok for that invocation of schedule(). But when we come
> back, we re-set the blocked_on we had just cleared, yet we do not
> re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.
> 
> This means we have a task that is blocked_on and TASK_RUNNING,
> so when the proxy execution code dequeues the task, we are
> in trouble since future wakeups will be short-circuited by the
> ttwu_state_match() check.

I'm still having a hard time understanding how this happens - when the
task fails to grab the lock during optimistic spinning, we set
blocked_on with TASK_RUNNING and go through another iteration of the
loop.

When the task hits schedule_preempt_disabled(), it is still
TASK_RUNNING and __schedule() skips try_to_block_task() leaving the
task in a preempted (unlatched) state. The task, when selected again,
sets the state back to interruptible/uninterruptible/killable and
then goes to optimistic spinning again since it should still be the
first waiter if it hasn't managed to grab the lock.

I don't see how this can cause a problem now with the latched state.
There is no need for a wakeup since TASK_RUNNING implies the pick
will select it again to run at some point and the blocked_on is
re-evaluated.

The signal_pending_state() checks the "state" based on the parameter
passed to __mutex_lock_common() so it'll still bail out early for
signal delivery.

Do we still need it with the latched state machine?

> 
> Thus, to avoid this, after mutex_optimistic_spin(), set the task
> state back when we set blocked_on.
> 
> Many, many thanks again to Vineeth for his very useful test
> driver that uncovered this long-hidden bug, which I hadn't
> tripped in all my testing! Very impressed with the problems he's
> uncovered!
> 
> Reported-by: Vineeth Pillai <vineethrp@google.com>
> Tested-by: Vineeth Pillai <vineethrp@google.com>
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
>  kernel/locking/mutex.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 09534628dc01a..a93d4c6bee1a3 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  			raw_spin_lock(&current->blocked_lock);
>  			__set_task_blocked_on(current, lock);
> +			set_current_state(state);
>  
>  			if (opt_acquired)
>  				break;

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-01  6:39   ` K Prateek Nayak
@ 2026-05-01  7:11     ` John Stultz
  0 siblings, 0 replies; 17+ messages in thread
From: John Stultz @ 2026-05-01  7:11 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Thu, Apr 30, 2026 at 11:39 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Mostly cosmetic nitpicks. The overall idea looks good.
>
> On 5/1/2026 3:20 AM, John Stultz wrote:
> > @@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
> >  #ifndef CONFIG_PREEMPT_RT
> >
> >  /*
> > - * With proxy exec, if a task has been proxy-migrated, it may be a donor
> > - * on a cpu that it can't actually run on. Thus we need a special state
> > - * to denote that the task is being woken, but that it needs to be
> > - * evaluated for return-migration before it is run. So if the task is
> > - * blocked_on PROXY_WAKING, return migrate it before running it.
> > + * The proxy exec blocked_on pointer value uses the low bit as a latch
> > + * value which clarifies if the blocked_on value is used for proxying or
> > + * not.
> > + *
> > + * The state machine looks something like
> > + *   NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
> > + *
> > + * With some additional transitions:
> > + *   ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
> > + *   ptr:latched -> NULL (done only on current)
> > + *
> > + * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
> > + * 2) ptr:latched is the state when proxying will occur
> > + * 3) PROXY_WAKING is used when the task is being woken to ensure we
> > + *    return-migrate proxy-migrated tasks before running them (note it has
> > + *    the latch bit set).
> >   */
> > -#define PROXY_WAKING ((struct mutex *)(-1L))
> > +#define PROXY_BLOCKED_LATCH (1UL)
> > +#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
>
> nit. I think PROXY_BLOCKED_ON_MUTEX() would be a better name since this
> is returning the true mutex pointer back. No strong feelings, I'll defer
> to others for more comments.

So, in the full series we support other lock types as well (rwsems!).
So I'm being a little generic here so it will be usable without major
renaming later.

> > +#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
> > +
> > +static inline struct mutex *task_is_blocked_on(struct task_struct *p)
>
> I think this can take the role of task_is_blocked() no?
>

Yes. I have a follow-on patch already queued that switches
task_is_blocked() for task_is_blocked_on().
But I didn't include it here because it is more than a bug fix.

Personally, I've been starting to bristle at the task_is_blocked()
name, as I think it gets confusing with block_task() and friends, so
something more explicit seems better (though task_is_blocked_on is
only a smidge better).

> Only one caller for try_to_block_task() will require looking at the
> raw blocked_on state but other than that, it is safe for the scheduler
> to move around the preempted task until it has grabbed the BO latch.

Right. In my cleanup I provide a local proxy_should_block() helper for
that usage. I don't directly use blocked_on because with
!CONFIG_SCHED_PROXY_EXEC it would be good to optimize out the
conditionals with a constant.

> > +{
> > +     if (!sched_proxy_exec())
> > +             return false;
> > +     return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
> > +}
> > +
> > +static inline void __set_task_blocked_on_latched(struct task_struct *p)
> > +{
>
> Are you planning to reuse this sometime later in the series? If not I
> think we can convert this to "try_set_task_blocked_on_latch()" and return
> false if it finds blocked on having been cleared already.
>
> That way the lock + check in try_to_block_task() can be moved here.

Yep. I did some further cleanups in my current tree this afternoon,
and I already adapted this from your patch set, and got rid of the
unlocked __ version.


> > +static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)
>
> I think this can be __get_task_blocked_on() ...
>
> > +{
> > +     if (!task_is_blocked_on(p))
> > +             return NULL;
> > +     if (p->blocked_on == PROXY_WAKING)
> > +             return PROXY_WAKING;
> > +     return PROXY_BLOCKED_ON_MASK(p->blocked_on);
> > +}
> >
> >  static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
>
> ... and this can be __get_task_blocked_on_raw() since only one caller in
> kernel/locking/mutex.h really cares about the ~PROXY_BLOCKED_LATCH
> value outside of this file.
>
> Everything in the sched bits can then simply be __get_task_blocked_on()
> and that seems much cleaner.
>
> Thoughts?

I guess, elsewhere there are a few callers who want the
~PROXY_BLOCKED_LATCH case (both in sched.h and mutex.h). Though you
are right it is mostly for debugging/warnings, so I guess I could go
with your naming suggestion.

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25ae..2f912bf698446 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> >        * blocked on a mutex, and we want to keep it on the runqueue
> >        * to be selectable for proxy-execution.
> >        */
> > -     if (!should_block)
> > -             return false;
> > +     if (!should_block) {
> > +             guard(raw_spinlock)(&p->blocked_lock);
> > +             if (p->blocked_on) {
> > +                     __set_task_blocked_on_latched(p);
> > +                     return false;
> > +             }
> > +     }
>
> In my head, this reads as:
>
>     if (!should_block && try_to_latch_task_blocked_on(p))
>            return false;
>
> seems much cleaner. I'll defer to other for comments.

Heh! In my current tree for v3 it's:
        if (!should_block && set_task_blocked_on_latched(p))
                return false;

I'm glad we're thinking the same thing.

Thanks as always for the great feedback! I'll give some time for
others to comment and try to send v3 out later tomorrow and folks can
review it after the weekend.

thanks
-john


* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-30 21:50 ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
  2026-04-30 23:53   ` John Stultz
  2026-05-01  6:39   ` K Prateek Nayak
@ 2026-05-01 13:21   ` Peter Zijlstra
  2026-05-01 15:55     ` K Prateek Nayak
  2 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 13:21 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team


Sorry for being late, I was unwell for a few days :/

On Thu, Apr 30, 2026 at 09:50:46PM +0000, John Stultz wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 368c7b4d7cb51..8b9e971d98f67 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
>  #ifndef CONFIG_PREEMPT_RT
>  
>  /*
> - * With proxy exec, if a task has been proxy-migrated, it may be a donor
> - * on a cpu that it can't actually run on. Thus we need a special state
> - * to denote that the task is being woken, but that it needs to be
> - * evaluated for return-migration before it is run. So if the task is
> - * blocked_on PROXY_WAKING, return migrate it before running it.
> + * The proxy exec blocked_on pointer value uses the low bit as a latch
> + * value which clarifies if the blocked_on value is used for proxying or
> + * not.
> + *
> + * The state machine looks something like
> + *   NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
> + *
> + * With some additional transitions:
> + *   ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
> + *   ptr:latched -> NULL (done only on current)
> + *
> + * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
> + * 2) ptr:latched is the state when proxying will occur
> + * 3) PROXY_WAKING is used when the task is being woken to ensure we
> + *    return-migrate proxy-migrated tasks before running them (note it has
> + *    the latch bit set).
>   */
> -#define PROXY_WAKING ((struct mutex *)(-1L))
> +#define PROXY_BLOCKED_LATCH (1UL)
> +#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
> +#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */

Urgh, please no.

You're making it needlessly complicated. There really are two separate
states, set by two different chains of logic:

 - the blocked_on link, set by the blocking primitive (mutex)

 - the is_blocked state, set by the scheduler when logically blocking
   the task.

by munging them together like that, you also inherit that blocked_lock
into contexts that really don't need it, and you're also sprinkling
more of that sched_proxy_exec() stuff around.

If we keep them nicely separated, none of that happens, and
additionally, we might be able to get rid of the p->se.sched_delayed
(ab)use in the core code (eventually).

Does something like the below really not work?

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..0bd5da8360f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
 	struct alloc_tag		*alloc_tag;
 #endif
 
-	int				on_cpu;
+	u8				on_cpu;
+	u8				on_rq;
+	u8				is_blocked;
+	u8				__pad;
+
 	struct __call_single_node	wake_entry;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
 	 */
 	int				recent_used_cpu;
 	int				wake_cpu;
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c6..f679d65d98a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -615,6 +615,12 @@ EXPORT_SYMBOL(__trace_set_current_state);
  *   [ The astute reader will observe that it is possible for two tasks on one
  *     CPU to have ->on_cpu = 1 at the same time. ]
  *
+ * p->is_blocked <- { 0, 1 }:
+ *
+ *   is set by try_to_block_task() and cleared by ttwu_do_wakeup() and tracks
+ *   if the task is blocked. Traditionally this would mirror p->on_rq, however
+ *   due to things like DELAY_DEQUEUE and PROXY_EXEC, this can diverge.
+ *
  * task_cpu(p): is changed by set_task_cpu(), the rules are:
  *
  *  - Don't call set_task_cpu() on a blocked task:
@@ -3685,6 +3691,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
  */
 static inline void ttwu_do_wakeup(struct task_struct *p)
 {
+	p->is_blocked = 0;
 	WRITE_ONCE(p->__state, TASK_RUNNING);
 	trace_sched_wakeup(p);
 }
@@ -4173,6 +4180,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 *    it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		WARN_ON_ONCE(p->is_blocked);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4463,6 +4471,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 
 	/* A delayed task cannot be in clone(). */
 	WARN_ON_ONCE(p->se.sched_delayed);
+	WARN_ON_ONCE(p->is_blocked);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
@@ -6593,6 +6602,8 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	p->is_blocked = 1;
+
 	/*
 	 * We check should_block after signal_pending because we
 	 * will want to wake the task in that case. But if
@@ -7108,7 +7119,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(next->is_blocked && next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);


* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-01 13:21   ` Peter Zijlstra
@ 2026-05-01 15:55     ` K Prateek Nayak
  2026-05-01 18:59       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: K Prateek Nayak @ 2026-05-01 15:55 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello Peter,

On 5/1/2026 6:51 PM, Peter Zijlstra wrote:
> 
> Sorry for being late, I was unwell for a few days :/

Hope you are feeling better now.

>> -#define PROXY_WAKING ((struct mutex *)(-1L))
>> +#define PROXY_BLOCKED_LATCH (1UL)
>> +#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
>> +#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
> 
> Urgh, please no.
> 
> You're making it needlessly complicated. There really are two separate
> states, set by two different chains of logic:
> 
>  - the blocked_on link, set by the blocking primitive (mutex)
> 
>  - the is_blocked state, set by the scheduler when logically blocking
>    the task.
> 
> by munging them together like that, you also inherit that blocked_lock
> into contexts that really don't need it, and you're also sprinkling
> more of that sched_proxy_exec() stuff around.
> 
> If we keep them nicely separated, none of that happens, and
> additionally, we might be able to get rid of the p->se.sched_delayed
> (ab)use in the core code (eventually).

So there are cases where we want to traverse the find_proxy_task()
bits even after the task gets a wakeup, in order to do return
migration, and that will break if we start clearing p->is_blocked
at ttwu_do_wakeup().

More on that below ...

> 
> Does something like the below really not work?
> 
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 368c7b4d7cb5..0bd5da8360f3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -846,7 +846,11 @@ struct task_struct {
>  	struct alloc_tag		*alloc_tag;
>  #endif
>  
> -	int				on_cpu;
> +	u8				on_cpu;
> +	u8				on_rq;
> +	u8				is_blocked;
> +	u8				__pad;
> +
>  	struct __call_single_node	wake_entry;
>  	unsigned int			wakee_flips;
>  	unsigned long			wakee_flip_decay_ts;
> @@ -861,7 +865,6 @@ struct task_struct {
>  	 */
>  	int				recent_used_cpu;
>  	int				wake_cpu;
> -	int				on_rq;
>  
>  	int				prio;
>  	int				static_prio;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d3c6..f679d65d98a3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -615,6 +615,12 @@ EXPORT_SYMBOL(__trace_set_current_state);
>   *   [ The astute reader will observe that it is possible for two tasks on one
>   *     CPU to have ->on_cpu = 1 at the same time. ]
>   *
> + * p->is_blocked <- { 0, 1 }:
> + *
> + *   is set by try_to_block_task() and cleared by ttwu_do_wakeup() and tracks
> + *   if the task is blocked. Traditionally this would mirror p->on_rq, however
> + *   due to things like DELAY_DEQUEUE and PROXY_EXEC, this can diverge.
> + *
>   * task_cpu(p): is changed by set_task_cpu(), the rules are:
>   *
>   *  - Don't call set_task_cpu() on a blocked task:
> @@ -3685,6 +3691,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
>   */
>  static inline void ttwu_do_wakeup(struct task_struct *p)
>  {
> +	p->is_blocked = 0;

I don't think it is this simple at the moment because the proxy bits in
__schedule() still have to handle PROXY_WAKING, and once we clear it
here the task will no longer go through the proxy_needs_return() path.

Clearing of ->is_blocked has to be done at the same point where
->blocked_on is cleared although they are set separately.

>  	WRITE_ONCE(p->__state, TASK_RUNNING);
>  	trace_sched_wakeup(p);
>  }
> @@ -4173,6 +4180,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		 *    it disabling IRQs (this allows not taking ->pi_lock).
>  		 */
>  		WARN_ON_ONCE(p->se.sched_delayed);
> +		WARN_ON_ONCE(p->is_blocked);
>  		if (!ttwu_state_match(p, state, &success))
>  			goto out;
>  
> @@ -4463,6 +4471,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>  
>  	/* A delayed task cannot be in clone(). */
>  	WARN_ON_ONCE(p->se.sched_delayed);
> +	WARN_ON_ONCE(p->is_blocked);
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	p->se.cfs_rq			= NULL;
> @@ -6593,6 +6602,8 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  		return false;
>  	}

If we change the set_task_blocked_on_waking() above for the pending
signal case to clear_task_blocked_on(), this should be fine. Since
prev is on_cpu, it doesn't need any return migration, and going via
the PROXY_WAKING path isn't too helpful IMO.

>  
> +	p->is_blocked = 1;
> +
>  	/*
>  	 * We check should_block after signal_pending because we
>  	 * will want to wake the task in that case. But if
> @@ -7108,7 +7119,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>  
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(next->is_blocked && next->blocked_on)) {

There is a race with ttwu_runnable() that happens like:

  mutex_lock_common(mutex)
    set_task_blocked_on(p, mutex)
    set_current_state(state)         mutex_unlock(mutex)
    schedule_preempt_disabled()        set_task_blocked_on_waking(p)
      ...                              try_to_wake_up(p) /* State matches; p->on_rq */
                                         ttwu_runnable(p)
                                           ttwu_do_wakeup(p);
      if (!preempt && prev_state) {
         /*
          * Never happens since
          * ->state == TASK_RUNNING.
          * ->is_blocked is never set.
          */
      }

      next = /* Gets prev again */
      /* proxy bits are skipped since ->is_blocked is 0 */

    /*
     * Exits out of schedule_preempt_disabled()
     * in mutex_lock_common().
     */
    __set_task_blocked_on(current, lock);
      !!! SPLAT: p->blocked_on /* PROXY_WAKING */ && p->blocked_on != lock !!!


So that screams since we fail to clear the ->blocked_on state when
ttwu_runnable() wins over schedule().

John didn't like touching the ->blocked_on state for
(!prev_state && prev->blocked_on) so we resorted to using the lower
bits of ->blocked_on.

A p->se.sched_proxy-like fix is the closest we'll get if we go
down the separate-state-in-task_struct path, and for the most part it
will mirror blocked_on, which is why setting the bottom bits like
MUTEX_FLAGS made some sense when we looked at it.

>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
>  				zap_balance_callbacks(rq);

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-01 15:55     ` K Prateek Nayak
@ 2026-05-01 18:59       ` Peter Zijlstra
  2026-05-01 22:26         ` John Stultz
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 18:59 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Fri, May 01, 2026 at 09:25:29PM +0530, K Prateek Nayak wrote:

> > @@ -3685,6 +3691,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
> >   */
> >  static inline void ttwu_do_wakeup(struct task_struct *p)
> >  {
> > +	p->is_blocked = 0;
> 
> I don't think it is this simple at the moment because the proxy bits in
> __schedule() still have to handle PROXY_WAKING and once we clear it here
> task will no longer go through proxy_needs_return() path.
> 
> Clearing of ->is_blocked has to be done at the same point where
> ->blocked_on is cleared although they are set separately.

Argh. It's all a convoluted mess. AFAICT this all goes away when we make
ttwu() do the return migration properly. And then it does work.

So we're now in the situation that things are a bit of a mess, and we
need to make a bigger mess, only to then instantly remove it all again
when we clean up :/

Can't we simply mark PROXY_EXEC broken for a cycle? It's not like the
upstream version has been very functional anyway.


* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-01 18:59       ` Peter Zijlstra
@ 2026-05-01 22:26         ` John Stultz
  2026-05-03 18:42           ` K Prateek Nayak
  0 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2026-05-01 22:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Fri, May 1, 2026 at 11:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 01, 2026 at 09:25:29PM +0530, K Prateek Nayak wrote:
>
> > > @@ -3685,6 +3691,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
> > >   */
> > >  static inline void ttwu_do_wakeup(struct task_struct *p)
> > >  {
> > > +   p->is_blocked = 0;
> >
> > I don't think it is this simple at the moment because the proxy bits in
> > __schedule() still have to handle PROXY_WAKING and once we clear it here
> > task will no longer go through proxy_needs_return() path.
> >
> > Clearing of ->is_blocked has to be done at the same point where
> > ->blocked_on is cleared although they are set separately.
>
> Argh. Its all a convoluted mess. AFAICT this all goes away when we make
> ttwu() do the return migration properly. And then it does work.
>
> So we're now in the situation that things are a bit of a mess, and we
> need to make a bigger mess, only to then instantly remove it all again
> when we clean up :/

Apologies! I don't want to make you grumpy coming back from being ill
(hope you're feeling better!).

> Can't we simply mark PROXY_EXEC broken for a cycle? It's not like the
> upstream version has been very functional anyway.

This issue has been present for a while (since it is really around the
proxy deactivation path taking action in the preempt case). I just
reproduced it with the early chunk of PROXY_EXEC logic that was in
v6.18. So I don't think it's super urgent as the proxy-exec code
upstream isn't complete (and behind CONFIG_EXPERIMENTAL).

So let me take a swing at integrating your approach into the next
chunk of patches, and hopefully they can be ready for the next merge
window.

If you want to add BROKEN in the Kconfig in the meantime, I'll not object.

thanks
-john


* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-01 22:26         ` John Stultz
@ 2026-05-03 18:42           ` K Prateek Nayak
  2026-05-04  5:37             ` K Prateek Nayak
  2026-05-04 21:33             ` John Stultz
  0 siblings, 2 replies; 17+ messages in thread
From: K Prateek Nayak @ 2026-05-03 18:42 UTC (permalink / raw)
  To: John Stultz, Peter Zijlstra
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello folks,

On 5/2/2026 3:56 AM, John Stultz wrote:
> On Fri, May 1, 2026 at 11:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, May 01, 2026 at 09:25:29PM +0530, K Prateek Nayak wrote:
>>
>>>> @@ -3685,6 +3691,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
>>>>   */
>>>>  static inline void ttwu_do_wakeup(struct task_struct *p)
>>>>  {
>>>> +   p->is_blocked = 0;
>>>
>>> I don't think it is this simple at the moment because the proxy bits in
>>> __schedule() still have to handle PROXY_WAKING and once we clear it here
>>> task will no longer go through proxy_needs_return() path.
>>>
>>> Clearing of ->is_blocked has to be done at the same point where
>>> ->blocked_on is cleared although they are set separately.
>>
>> Argh. Its all a convoluted mess. AFAICT this all goes away when we make
>> ttwu() do the return migration properly. And then it does work.
>>
>> So we're now in the situation that things are a bit of a mess, and we
>> need to make a bigger mess, only to then instantly remove it all again
>> when we clean up :/
> 
> Apologies! I don't want to make you grumpy coming back from being ill
> (hope you're feeling better!).
> 
>> Can't we simply mark PROXY_EXEC broken for a cycle? It's not like the
>> upstream version has been very functional anyway.
> 
> This issue has been present for a while (since it is really around the
> proxy deactivation path taking action in the preempt case). I just
> reproduced it with the early chunk of PROXY_EXEC logic that was in
> v6.18. So I don't think it's super urgent as the proxy-exec code
> upstream isn't complete (and behind CONFIG_EXPERIMENTAL).
> 
> So let me take a swing at integrating your approach into the next
> chunk of patches, and hopefully they can be ready for the next merge
> window.

So when looking at all of this, I realized we probably don't need
PROXY_WAKING anymore if we have the "is_blocked" state in task_struct.
The owner can simply clear the blocked_on and move along and the
waiter's "is_blocked" state will handle the sched bits.

(p->is_blocked && !p->blocked_on) can then be interpreted as
PROXY_WAKING and that task should explore return migration in
find_proxy_task().

Would something like below be more amenable from a backport standpoint
instead of marking the config broken?

  (Lightly tested; Based on tip:sched/core)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718b..7be5e1faf56a1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
 	struct alloc_tag		*alloc_tag;
 #endif
 
-	int				on_cpu;
+	u8				on_cpu;
+	u8				on_rq;
+	u8				is_blocked;
+	u8				__pad;
+
 	struct __call_single_node	wake_entry;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
 	 */
 	int				recent_used_cpu;
 	int				wake_cpu;
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
@@ -2181,19 +2184,10 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 
 #ifndef CONFIG_PREEMPT_RT
 
-/*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
- */
-#define PROXY_WAKING ((struct mutex *)(-1L))
-
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2221,7 +2215,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
 	p->blocked_on = NULL;
 }
 
@@ -2231,34 +2225,6 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	__clear_task_blocked_on(p, m);
 }
 
-static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
-{
-	/* Currently we serialize blocked_on under the task::blocked_lock */
-	lockdep_assert_held_once(&p->blocked_lock);
-
-	if (!sched_proxy_exec()) {
-		__clear_task_blocked_on(p, m);
-		return;
-	}
-
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
-		return;
-	/*
-	 * There may be cases where we set PROXY_WAKING on tasks that were
-	 * already set to waking, but make sure we are not changing
-	 * the relationship with a different lock.
-	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
-	p->blocked_on = PROXY_WAKING;
-}
-
-static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&p->blocked_lock);
-	__set_task_blocked_on_waking(p, m);
-}
-
 #else
 static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
@@ -2267,14 +2233,6 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
 static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
 }
-
-static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
-{
-}
-
-static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
-{
-}
 #endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7d359647156df..4aa79bcab08c7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		set_task_blocked_on_waking(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 5cd9dfa4b31e6..522fe045eb1b2 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to set the
-		 * blocked_on to PROXY_WAKING. Otherwise we can see
-		 * circular blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to clear the
+		 * blocked_on. Otherwise we can see circular blocked_on
+		 * relationships that can't resolve.
 		 */
-		set_task_blocked_on_waking(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -340,14 +340,14 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		if (owner != current) {
 			/*
 			 * When waking up the task to wound, be sure to set the
-			 * blocked_on to PROXY_WAKING. Otherwise we can see
-			 * circular blocked_on relationships that can't resolve.
+			 * clear blocked_on. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
 			 *
 			 * NOTE: We pass NULL here instead of lock, because we
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			set_task_blocked_on_waking(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..d33398a03a1c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6495,6 +6495,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 #endif /* !CONFIG_SCHED_CORE */
 
+static inline void sched_set_task_is_blocked(struct task_struct *p);
+static inline void sched_clear_task_is_blocked(struct task_struct *p);
+
 /*
  * Constants for the sched_mode argument of __schedule().
  *
@@ -6523,7 +6526,18 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
 		*task_state_p = TASK_RUNNING;
-		set_task_blocked_on_waking(p, NULL);
+
+		/*
+		 * Clear blocked_on relation if we were planning to
+		 * retain the task as proxy donor since it is runnable
+		 * again as a result of pending signal.
+		 *
+		 * Since only the running task can set the blocked_on
+		 * relation for itself, do not unnecessarily grab the
+		 * blocked_lock if blocked_on is not set.
+		 */
+		if (!should_block)
+			clear_task_blocked_on(p, NULL);
 
 		return false;
 	}
@@ -6535,8 +6549,10 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
+	if (!should_block) {
+		sched_set_task_is_blocked(p);
 		return false;
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6562,6 +6578,27 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void sched_set_task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return;
+
+	p->is_blocked = 1;
+}
+
+static inline void sched_clear_task_is_blocked(struct task_struct *p)
+{
+	p->is_blocked = 0;
+}
+
+static inline bool task_should_block(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return true;
+
+	return !p->blocked_on;
+}
+
 static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
 {
 	unsigned int wake_cpu;
@@ -6602,6 +6639,7 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	 * need to be changed from next *before* we deactivate.
 	 */
 	proxy_resched_idle(rq);
+	sched_clear_task_is_blocked(donor);
 	return try_to_block_task(rq, donor, &state, true);
 }
 
@@ -6732,7 +6770,7 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
 		set_task_cpu(p, cpu);
 		target_rq = cpu_rq(cpu);
-		clear_task_blocked_on(p, NULL);
+		sched_clear_task_is_blocked(p);
 	}
 
 	if (target_rq)
@@ -6765,15 +6803,16 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
-	struct mutex *mutex;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
-		/* if its PROXY_WAKING, do return migration or run if current */
-		if (mutex == PROXY_WAKING) {
+	for (p = donor; task_is_blocked(p); p = owner) {
+		struct mutex *mutex = p->blocked_on;
+
+		/* If task is no longer blocked, do return migration or run if current */
+		if (!mutex) {
 			if (task_current(rq, p)) {
-				clear_task_blocked_on(p, PROXY_WAKING);
+				sched_clear_task_is_blocked(p);
 				return p;
 			}
 			goto force_return;
@@ -6807,8 +6846,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			 * and return p (if it is current and safe to
 			 * just run on this rq), or return-migrate the task.
 			 */
+			__clear_task_blocked_on(p, mutex);
 			if (task_current(rq, p)) {
-				__clear_task_blocked_on(p, NULL);
+				sched_clear_task_is_blocked(p);
 				return p;
 			}
 			goto force_return;
@@ -6902,6 +6942,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
+static inline void sched_set_task_is_blocked(struct task_struct *p) {}
+static inline void sched_clear_task_is_blocked(struct task_struct *p) {}
+
+static inline bool task_should_block(struct task_struct *p)
+{
+	return true;
+}
+
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
@@ -7044,7 +7092,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(task_is_blocked(next))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c95584191d58f..5c1085f260ad4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2390,7 +2390,7 @@ static inline bool task_is_blocked(struct task_struct *p)
 	if (!sched_proxy_exec())
 		return false;
 
-	return !!p->blocked_on;
+	return !!p->is_blocked;
 }
 
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
---

It could be split as introduction of new state + removal of PROXY_WAKING
for easier review. I'll let you two decide if it is worthwhile or not.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-03 18:42           ` K Prateek Nayak
@ 2026-05-04  5:37             ` K Prateek Nayak
  2026-05-05  3:32               ` John Stultz
  2026-05-04 21:33             ` John Stultz
  1 sibling, 1 reply; 17+ messages in thread
From: K Prateek Nayak @ 2026-05-04  5:37 UTC (permalink / raw)
  To: John Stultz, Peter Zijlstra
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On 5/4/2026 12:12 AM, K Prateek Nayak wrote:
> So when looking at all of this, I realized we probably don't need
> PROXY_WAKING anymore if we have the "is_blocked" state in task_struct.
> The owner can simply clear the blocked_on and move along and the
> waiter's "is_blocked" state will handle the sched bits.
> 
> (p->is_blocked && !p->blocked_on) can then be interpreted as
> PROXY_WAKING and that task should explore return migration in
> find_proxy_task().
> 
> Would something like below be more amenable from a backport standpoint
> instead of marking the config broken?
> 
>   (Lightly tested; Based on tip:sched/core)

... and I missed this hunk for try_to_block_task():

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..ee89d751b9594 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7026,13 +7074,13 @@ static void __sched notrace __schedule(int sched_mode)
 		}
 	} else if (!preempt && prev_state) {
 		/*
-		 * We pass task_is_blocked() as the should_block arg
+		 * We pass task_should_block() as the should_block arg
 		 * in order to keep mutex-blocked tasks on the runqueue
 		 * for slection with proxy-exec (without proxy-exec
-		 * task_is_blocked() will always be false).
+		 * task_should_block() will always be true).
 		 */
 		try_to_block_task(rq, prev, &prev_state,
-				  !task_is_blocked(prev));
+				  task_should_block(prev));
 		switch_count = &prev->nvcsw;
 	}
 
---

Sorry about that and sorry for the noise! Final diffstat looks like:

  include/linux/sched.h     | 56 ++++---------------------
  kernel/locking/mutex.c    |  2 +-
  kernel/locking/ww_mutex.h | 14 +++----
  kernel/sched/core.c       | 72 +++++++++++++++++++++++++++------
  kernel/sched/sched.h      |  2 +-
  5 files changed, 75 insertions(+), 71 deletions(-)

It mostly relocates the PROXY_WAKING bits from linux/sched.h to the
internal "is_blocked" helpers in core.c. Pasting the full diff again with some
cleanups for convenience:

  (Lightly tested with test-ww_mutex and sched-messaging)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718b..7be5e1faf56a1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
 	struct alloc_tag		*alloc_tag;
 #endif
 
-	int				on_cpu;
+	u8				on_cpu;
+	u8				on_rq;
+	u8				is_blocked;
+	u8				__pad;
+
 	struct __call_single_node	wake_entry;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
 	 */
 	int				recent_used_cpu;
 	int				wake_cpu;
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
@@ -2181,19 +2184,10 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 
 #ifndef CONFIG_PREEMPT_RT
 
-/*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
- */
-#define PROXY_WAKING ((struct mutex *)(-1L))
-
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2221,7 +2215,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
 	p->blocked_on = NULL;
 }
 
@@ -2231,34 +2225,6 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	__clear_task_blocked_on(p, m);
 }
 
-static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
-{
-	/* Currently we serialize blocked_on under the task::blocked_lock */
-	lockdep_assert_held_once(&p->blocked_lock);
-
-	if (!sched_proxy_exec()) {
-		__clear_task_blocked_on(p, m);
-		return;
-	}
-
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
-		return;
-	/*
-	 * There may be cases where we set PROXY_WAKING on tasks that were
-	 * already set to waking, but make sure we are not changing
-	 * the relationship with a different lock.
-	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
-	p->blocked_on = PROXY_WAKING;
-}
-
-static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&p->blocked_lock);
-	__set_task_blocked_on_waking(p, m);
-}
-
 #else
 static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
@@ -2267,14 +2233,6 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
 static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
 }
-
-static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
-{
-}
-
-static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
-{
-}
 #endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7d359647156df..4aa79bcab08c7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		set_task_blocked_on_waking(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 5cd9dfa4b31e6..522fe045eb1b2 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to set the
-		 * blocked_on to PROXY_WAKING. Otherwise we can see
-		 * circular blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to clear the
+		 * blocked_on. Otherwise we can see circular blocked_on
+		 * relationships that can't resolve.
 		 */
-		set_task_blocked_on_waking(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -340,14 +340,14 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		if (owner != current) {
 			/*
 			 * When waking up the task to wound, be sure to set the
-			 * blocked_on to PROXY_WAKING. Otherwise we can see
-			 * circular blocked_on relationships that can't resolve.
+			 * clear blocked_on. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
 			 *
 			 * NOTE: We pass NULL here instead of lock, because we
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			set_task_blocked_on_waking(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..30672390e6f99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6495,6 +6495,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 #endif /* !CONFIG_SCHED_CORE */
 
+static inline void sched_set_task_is_blocked(struct task_struct *p);
+
 /*
  * Constants for the sched_mode argument of __schedule().
  *
@@ -6523,7 +6525,18 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
 		*task_state_p = TASK_RUNNING;
-		set_task_blocked_on_waking(p, NULL);
+
+		/*
+		 * Clear blocked_on relation if we were planning to
+		 * retain the task as proxy donor since it is runnable
+		 * again as a result of pending signal.
+		 *
+		 * Since only the running task can set the blocked_on
+		 * relation for itself, do not unnecessarily grab the
+		 * blocked_lock if blocked_on is not set.
+		 */
+		if (!should_block)
+			clear_task_blocked_on(p, NULL);
 
 		return false;
 	}
@@ -6535,8 +6548,10 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
+	if (!should_block) {
+		sched_set_task_is_blocked(p);
 		return false;
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6562,6 +6577,27 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void sched_set_task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return;
+
+	p->is_blocked = 1;
+}
+
+static inline void sched_clear_task_is_blocked(struct task_struct *p)
+{
+	p->is_blocked = 0;
+}
+
+static inline bool task_should_block(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return true;
+
+	return !p->blocked_on;
+}
+
 static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
 {
 	unsigned int wake_cpu;
@@ -6602,6 +6638,7 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	 * need to be changed from next *before* we deactivate.
 	 */
 	proxy_resched_idle(rq);
+	sched_clear_task_is_blocked(donor);
 	return try_to_block_task(rq, donor, &state, true);
 }
 
@@ -6732,7 +6769,7 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
 		set_task_cpu(p, cpu);
 		target_rq = cpu_rq(cpu);
-		clear_task_blocked_on(p, NULL);
+		sched_clear_task_is_blocked(p);
 	}
 
 	if (target_rq)
@@ -6765,15 +6802,16 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
-	struct mutex *mutex;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
-		/* if its PROXY_WAKING, do return migration or run if current */
-		if (mutex == PROXY_WAKING) {
+	for (p = donor; task_is_blocked(p); p = owner) {
+		struct mutex *mutex = p->blocked_on;
+
+		/* If task is no longer blocked, do return migration or run if current */
+		if (!mutex) {
 			if (task_current(rq, p)) {
-				clear_task_blocked_on(p, PROXY_WAKING);
+				sched_clear_task_is_blocked(p);
 				return p;
 			}
 			goto force_return;
@@ -6807,8 +6845,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			 * and return p (if it is current and safe to
 			 * just run on this rq), or return-migrate the task.
 			 */
+			__clear_task_blocked_on(p, mutex);
 			if (task_current(rq, p)) {
-				__clear_task_blocked_on(p, NULL);
+				sched_clear_task_is_blocked(p);
 				return p;
 			}
 			goto force_return;
@@ -6902,6 +6941,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
+static inline void sched_set_task_is_blocked(struct task_struct *p) {}
+
+static inline bool task_should_block(struct task_struct *p)
+{
+	return true;
+}
+
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
@@ -7026,13 +7072,13 @@ static void __sched notrace __schedule(int sched_mode)
 		}
 	} else if (!preempt && prev_state) {
 		/*
-		 * We pass task_is_blocked() as the should_block arg
+		 * We pass task_should_block() as the should_block arg
 		 * in order to keep mutex-blocked tasks on the runqueue
 		 * for slection with proxy-exec (without proxy-exec
-		 * task_is_blocked() will always be false).
+		 * task_should_block() will always be true).
 		 */
 		try_to_block_task(rq, prev, &prev_state,
-				  !task_is_blocked(prev));
+				  task_should_block(prev));
 		switch_count = &prev->nvcsw;
 	}
 
@@ -7044,7 +7090,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(task_is_blocked(next))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c95584191d58f..5c1085f260ad4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2390,7 +2390,7 @@ static inline bool task_is_blocked(struct task_struct *p)
 	if (!sched_proxy_exec())
 		return false;
 
-	return !!p->blocked_on;
+	return !!p->is_blocked;
 }
 
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-03 18:42           ` K Prateek Nayak
  2026-05-04  5:37             ` K Prateek Nayak
@ 2026-05-04 21:33             ` John Stultz
  1 sibling, 0 replies; 17+ messages in thread
From: John Stultz @ 2026-05-04 21:33 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Sun, May 3, 2026 at 11:43 AM 'K Prateek Nayak' via kernel-team
<kernel-team@android.com> wrote:
> On 5/2/2026 3:56 AM, John Stultz wrote:
> > On Fri, May 1, 2026 at 11:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Fri, May 01, 2026 at 09:25:29PM +0530, K Prateek Nayak wrote:
> >>
> >>>> @@ -3685,6 +3691,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
> >>>>   */
> >>>>  static inline void ttwu_do_wakeup(struct task_struct *p)
> >>>>  {
> >>>> +   p->is_blocked = 0;
> >>>
> >>> I don't think it is this simple at the moment because the proxy bits in
> >>> __schedule() still have to handle PROXY_WAKING and once we clear it here
> >>> task will no longer go through proxy_needs_return() path.
> >>>
> >>> Clearing of ->is_blocked has to be done at the same point where
> >>> ->blocked_on is cleared although they are set separately.
> >>
> >> Argh. Its all a convoluted mess. AFAICT this all goes away when we make
> >> ttwu() do the return migration properly. And then it does work.
> >>
> >> So we're now in the situation that things are a bit of a mess, and we
> >> need to make a bigger mess, only to then instantly remove it all again
> >> when we clean up :/
> >
> > Apologies! I don't want to make you grumpy coming back from being ill
> > (hope you're feeling better!).
> >
> >> Can't we simply mark PROXY_EXEc broken for a cycle? Its not like the
> >> upstream version has been very functional anyway.
> >
> > This issue has been present for awhile (since it is really around the
> > proxy deactivation path taking action in the preempt case). I just
> > reproduced it with the early chunk of PROXY_EXEC logic that was in
> > v6.18. So I don't think it's super urgent as the proxy-exec code
> > upstream isn't complete (and behind CONFIG_EXPERIMENTAL).
> >
> > So let me take a swing at integrating your approach into the next
> > chunk of patches, and hopefully they can be ready for the next merge
> > window.
>
> So when looking at all of this, I realized we probably don't need
> PROXY_WAKING anymore if we have the "is_blocked" state in task_struct.
> The owner can simply clear the blocked_on and move along and the
> waiter's "is_blocked" state will handle the sched bits.
>
> (p->is_blocked && !p->blocked_on) can then be interpreted as
> PROXY_WAKING and that task should explore return migration in
> find_proxy_task().

Interesting! Using the follow-on patch you sent here, it doesn't seem
to trip up the issues with the reproducer Vineeth implemented and I've
not hit any troubles from initial testing against 7.1-rc1.

It may take me a little bit to really get my head around the change to
layer the rest of the series on top without PROXY_WAKING. But it is aligned
with Peter's suggestion and gets rid of extra state, and if it can apply
first, that's nicer than having to get the ttwu handling in place
before switching to the is_blocked logic (which is what I was testing
if we were going to just fix the issue in the next chunk), so it looks
attractive!

I'll take a swing at reworking my code on top of your patch here and
let you know how it goes.

thanks
-john

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-30 21:50 ` [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
  2026-05-01  6:57   ` K Prateek Nayak
@ 2026-05-04 22:30   ` kernel test robot
  1 sibling, 0 replies; 17+ messages in thread
From: kernel test robot @ 2026-05-04 22:30 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: oe-kbuild-all, John Stultz, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hi John,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master next-20260430]
[cannot apply to tip/locking/core tip/auto-latest linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Stultz/sched-proxy-exec-Close-race-causing-workqueue-work-being-delayed/20260504-045026
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20260430215103.2978955-3-jstultz%40google.com
patch subject: [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260505/202605050032.Sh1SUziK-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260505/202605050032.Sh1SUziK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605050032.Sh1SUziK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from arch/x86/kernel/asm-offsets.c:9:
   In file included from include/linux/crypto.h:18:
   In file included from include/linux/slab.h:17:
   In file included from include/linux/gfp.h:7:
   In file included from include/linux/mmzone.h:22:
   In file included from include/linux/mm_types.h:11:
   In file included from include/linux/rbtree.h:24:
   In file included from include/linux/rcupdate.h:27:
>> include/linux/sched.h:2209:10: warning: expression which evaluates to zero treated as a null pointer constant of type 'struct mutex *' [-Wnon-literal-null-conversion]
    2209 |                 return false;
         |                        ^~~~~
   1 warning generated.
--
>> clang diag: include/linux/sched.h:2209:10: warning: expression which evaluates to zero treated as a null pointer constant of type 'struct mutex *' [-Wnon-literal-null-conversion]


vim +2209 include/linux/sched.h

  2205	
  2206	static inline struct mutex *task_is_blocked_on(struct task_struct *p)
  2207	{
  2208		if (!sched_proxy_exec())
> 2209			return false;
  2210		return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
  2211	}
  2212	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-04  5:37             ` K Prateek Nayak
@ 2026-05-05  3:32               ` John Stultz
  2026-05-05  4:37                 ` K Prateek Nayak
  0 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2026-05-05  3:32 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Sun, May 3, 2026 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 5/4/2026 12:12 AM, K Prateek Nayak wrote:
> > So when looking at all of this, I realized we probably don't need
> > PROXY_WAKING anymore if we have the "is_blocked" state in task_struct.
> > The owner can simply clear the blocked_on and move along and the
> > waiter's "is_blocked" state will handle the sched bits.
> >
> > (p->is_blocked && !p->blocked_on) can then be interpreted as
> > PROXY_WAKING and that task should explore return migration in
> > find_proxy_task().
> >
> > Would something like below be more amenable from a backport standpoint
> > instead of marking the config broken?
> >
> @@ -6535,8 +6548,10 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>          * blocked on a mutex, and we want to keep it on the runqueue
>          * to be selectable for proxy-execution.
>          */
> -       if (!should_block)
> +       if (!should_block) {
> +               sched_set_task_is_blocked(p);
>                 return false;
> +       }
>

So digging a bit more into this, it seems is_blocked in your patch is
semantically different from what Peter was proposing.

Peter seemed to be suggesting is_blocked would be more generic than
just for proxy-exec, getting set in try_to_block_task() regardless of
whether we actually blocked the task or not, and then clearing it in
ttwu_do_wakeup() when we go RUNNABLE.  Pretty much independent of
blocked_on.

Whereas your patch still has is_blocked very much tied to
blocked_on (since with yours we only set is_blocked if we avoid
blocking the task in try_to_block_task(), and clear it only from
find_proxy_task()).
In a way I can map your approach utilizing is_blocked as conceptually
sort of separating out the latch bit from my last approach (if we also
re-worked PROXY_WAKING to be the value 1 (!blocked_on + latch) instead
of -1).  So your approach seems workable (I've got it about halfway
integrated with my full series - hitting a bit of trouble with
the sleeping owner enqueuing at the moment), but I'm not sure this is
what Peter is looking for.

thanks
-john


* Re: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-05-05  3:32               ` John Stultz
@ 2026-05-05  4:37                 ` K Prateek Nayak
  0 siblings, 0 replies; 17+ messages in thread
From: K Prateek Nayak @ 2026-05-05  4:37 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

[-- Attachment #1: Type: text/plain, Size: 6419 bytes --]

Hello John,

On 5/5/2026 9:02 AM, John Stultz wrote:
> On Sun, May 3, 2026 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 5/4/2026 12:12 AM, K Prateek Nayak wrote:
>>> So when looking at all of this, I realized we probably don't need
>>> PROXY_WAKING anymore if we have the "is_blocked" state in task_struct.
>>> The owner can simply clear the blocked_on and move along and the
>>> waiter's "is_blocked" state will handle the sched bits.
>>>
>>> (p->is_blocked && !p->blocked_on) can then be interpreted as
>>> PROXY_WAKING and that task should explore return migration in
>>> find_proxy_task().
>>>
>>> Would something like below be more amenable from a backport standpoint
>>> instead of marking the config broken?
>>>
>> @@ -6535,8 +6548,10 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>>          * blocked on a mutex, and we want to keep it on the runqueue
>>          * to be selectable for proxy-execution.
>>          */
>> -       if (!should_block)
>> +       if (!should_block) {
>> +               sched_set_task_is_blocked(p);
>>                 return false;
>> +       }
>>
> 
> So digging a bit more into this, it seems is_blocked in your patch is
> semantically different from what Peter was proposing.
> 
> Peter seemed to be suggesting is_blocked would be more generic than
> just for proxy-exec, getting set in try_to_block_task() regardless of
> whether we actually blocked the task or not, and then clearing it in
> ttwu_do_wakeup() when we go RUNNABLE.  Pretty much independent of
> blocked_on.

Something very similar to Peter's suggestion is attached towards the
end, if that is more favorable, but it currently doesn't always clear
"is_blocked" at ttwu_do_wakeup() - that would require the return bits
in ttwu_runnable() before the clearing can be moved to
ttwu_do_wakeup() safely.

> 
> Whereas your patch still has is_blocked very much tied to
> blocked_on (since with yours we only set is_blocked if we avoid
> blocking the task in try_to_block_task(), and clear it only from
> find_proxy_task()).

Ack! That is the main difference - we can clear it during ttwu too
once we have proxy_needs_return() but with the set of changes we
have committed, it is done selectively for blocked tasks in
find_proxy_task(). 

> In a way I can map your approach utilizing is_blocked as conceptually
> sort of separating out the latch bit from my last approach (if we also
> re-worked PROXY_WAKING to be the value 1 (!blocked_on + latch) instead
> of -1).  So your approach seems workable (I've got it about halfway
> integrated with my full series - hitting a bit of trouble with
> the sleeping owner enqueuing at the moment),

So, this new state is synchronized by task's rq_lock() when p->on_rq
(even for the ttwu bits) but from what I can tell, sleeping owner really
depended on the blocked_lock based synchronization so perhaps that is
the difference?

Would grabbing blocked_lock when setting and clearing the "is_blocked"
help in case you've not already explored it?

> but I'm not sure this is what Peter is looking for.

Well this was just an option in case we don't want to backport super
invasive changes.

That said, we can easily do the following on top to fit what Peter
originally suggested (although it'll probably require a bit of effort to
integrate with the sleeping owner bits):

  (Lightly tested as usual :-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30672390e6f99..e88f5b7a02b3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3675,6 +3675,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 	}
 }
 
+static inline void sched_clear_task_is_blocked(struct task_struct *p);
+
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3709,8 +3711,19 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
 		update_rq_clock(rq);
-		if (p->se.sched_delayed)
+		if (p->se.sched_delayed) {
+			/*
+			 * Task was fully blocked (not retained as proxy) and
+			 * is runnable again. Clear "is_blocked" indicator.
+			 * For all other cases, the task has either not set
+			 * "is_blocked" since ttwu_runnable() won against
+			 * schedule(), or the task was retained as proxy and
+			 * expects find_proxy_task() to handle the clearing of
+			 * "is_blocked" state.
+			 */
+			sched_clear_task_is_blocked(p);
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		}
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -4190,6 +4203,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 */
 		WRITE_ONCE(p->__state, TASK_WAKING);
 
+		/*
+		 * If ttwu_runnable() did not win, task is fully blocked (!p->on_rq) and
+		 * requires a full wakeup. Clear task_is_blocked() before attempting
+		 * ttwu_queue_wakelist().
+		 */
+		sched_clear_task_is_blocked(p);
+
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
 		 * this task as prev, considering queueing p on the remote CPUs wake_list
@@ -6541,6 +6561,12 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	/*
+	 * Task is considered fully blocked at this point and requires
+	 * a wakeup to be runnable again including delayed task.
+	 */
+	sched_set_task_is_blocked(p);
+
 	/*
 	 * We check should_block after signal_pending because we
 	 * will want to wake the task in that case. But if
@@ -6548,10 +6574,8 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block) {
-		sched_set_task_is_blocked(p);
+	if (!should_block)
 		return false;
-	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6942,6 +6966,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #else /* SCHED_PROXY_EXEC */
 static inline void sched_set_task_is_blocked(struct task_struct *p) {}
+static inline void sched_clear_task_is_blocked(struct task_struct *p) {}
 
 static inline bool task_should_block(struct task_struct *p)
 {
---

Attached is full diff as proxy.diff on top of tip:sched/core for
convenience. I'll let Peter comment further if he likes this
approach or not :-)

-- 
Thanks and Regards,
Prateek

[-- Attachment #2: proxy.diff --]
[-- Type: text/plain, Size: 12452 bytes --]

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718b..7be5e1faf56a1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
 	struct alloc_tag		*alloc_tag;
 #endif
 
-	int				on_cpu;
+	u8				on_cpu;
+	u8				on_rq;
+	u8				is_blocked;
+	u8				__pad;
+
 	struct __call_single_node	wake_entry;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
 	 */
 	int				recent_used_cpu;
 	int				wake_cpu;
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
@@ -2181,19 +2184,10 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 
 #ifndef CONFIG_PREEMPT_RT
 
-/*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
- */
-#define PROXY_WAKING ((struct mutex *)(-1L))
-
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2221,7 +2215,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
 	p->blocked_on = NULL;
 }
 
@@ -2231,34 +2225,6 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	__clear_task_blocked_on(p, m);
 }
 
-static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
-{
-	/* Currently we serialize blocked_on under the task::blocked_lock */
-	lockdep_assert_held_once(&p->blocked_lock);
-
-	if (!sched_proxy_exec()) {
-		__clear_task_blocked_on(p, m);
-		return;
-	}
-
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
-		return;
-	/*
-	 * There may be cases where we set PROXY_WAKING on tasks that were
-	 * already set to waking, but make sure we are not changing
-	 * the relationship with a different lock.
-	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
-	p->blocked_on = PROXY_WAKING;
-}
-
-static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&p->blocked_lock);
-	__set_task_blocked_on_waking(p, m);
-}
-
 #else
 static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
@@ -2267,14 +2233,6 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
 static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
 }
-
-static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
-{
-}
-
-static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
-{
-}
 #endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7d359647156df..4aa79bcab08c7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		set_task_blocked_on_waking(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 5cd9dfa4b31e6..522fe045eb1b2 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to set the
-		 * blocked_on to PROXY_WAKING. Otherwise we can see
-		 * circular blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to clear the
+		 * blocked_on. Otherwise we can see circular blocked_on
+		 * relationships that can't resolve.
 		 */
-		set_task_blocked_on_waking(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -340,14 +340,14 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		if (owner != current) {
 			/*
 			 * When waking up the task to wound, be sure to set the
-			 * blocked_on to PROXY_WAKING. Otherwise we can see
-			 * circular blocked_on relationships that can't resolve.
+			 * clear blocked_on. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
 			 *
 			 * NOTE: We pass NULL here instead of lock, because we
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			set_task_blocked_on_waking(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..e88f5b7a02b3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3675,6 +3675,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 	}
 }
 
+static inline void sched_clear_task_is_blocked(struct task_struct *p);
+
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3709,8 +3711,19 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
 		update_rq_clock(rq);
-		if (p->se.sched_delayed)
+		if (p->se.sched_delayed) {
+			/*
+			 * Task was fully blocked (not retained as proxy) and
+			 * is runnable again. Clear "is_blocked" indicator.
+			 * For all other cases, the task has either not set
+			 * "is_blocked" since ttwu_runnable() won against
+			 * schedule(), or the task was retained as proxy and
+			 * expects find_proxy_task() to handle the clearing of
+			 * "is_blocked" state.
+			 */
+			sched_clear_task_is_blocked(p);
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		}
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -4190,6 +4203,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 */
 		WRITE_ONCE(p->__state, TASK_WAKING);
 
+		/*
+		 * If ttwu_runnable() did not win, task is fully blocked (!p->on_rq) and
+		 * requires a full wakeup. Clear task_is_blocked() before attempting
+		 * ttwu_queue_wakelist().
+		 */
+		sched_clear_task_is_blocked(p);
+
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
 		 * this task as prev, considering queueing p on the remote CPUs wake_list
@@ -6495,6 +6515,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 #endif /* !CONFIG_SCHED_CORE */
 
+static inline void sched_set_task_is_blocked(struct task_struct *p);
+
 /*
  * Constants for the sched_mode argument of __schedule().
  *
@@ -6523,11 +6545,28 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
 		*task_state_p = TASK_RUNNING;
-		set_task_blocked_on_waking(p, NULL);
+
+		/*
+		 * Clear blocked_on relation if we were planning to
+		 * retain the task as proxy donor since it is runnable
+		 * again as a result of pending signal.
+		 *
+		 * Since only the running task can set the blocked_on
+		 * relation for itself, do not unnecessarily grab the
+		 * blocked_lock if blocked_on is not set.
+		 */
+		if (!should_block)
+			clear_task_blocked_on(p, NULL);
 
 		return false;
 	}
 
+	/*
+	 * Task is considered fully blocked at this point and requires
+	 * a wakeup to be runnable again including the delayed task.
+	 */
+	sched_set_task_is_blocked(p);
+
 	/*
 	 * We check should_block after signal_pending because we
 	 * will want to wake the task in that case. But if
@@ -6562,6 +6601,27 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void sched_set_task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return;
+
+	p->is_blocked = 1;
+}
+
+static inline void sched_clear_task_is_blocked(struct task_struct *p)
+{
+	p->is_blocked = 0;
+}
+
+static inline bool task_should_block(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return true;
+
+	return !p->blocked_on;
+}
+
 static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
 {
 	unsigned int wake_cpu;
@@ -6602,6 +6662,7 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	 * need to be changed from next *before* we deactivate.
 	 */
 	proxy_resched_idle(rq);
+	sched_clear_task_is_blocked(donor);
 	return try_to_block_task(rq, donor, &state, true);
 }
 
@@ -6732,7 +6793,7 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
 		set_task_cpu(p, cpu);
 		target_rq = cpu_rq(cpu);
-		clear_task_blocked_on(p, NULL);
+		sched_clear_task_is_blocked(p);
 	}
 
 	if (target_rq)
@@ -6765,15 +6826,16 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
-	struct mutex *mutex;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
-		/* if its PROXY_WAKING, do return migration or run if current */
-		if (mutex == PROXY_WAKING) {
+	for (p = donor; task_is_blocked(p); p = owner) {
+		struct mutex *mutex = p->blocked_on;
+
+		/* If task is no longer blocked, do return migration or run if current */
+		if (!mutex) {
 			if (task_current(rq, p)) {
-				clear_task_blocked_on(p, PROXY_WAKING);
+				sched_clear_task_is_blocked(p);
 				return p;
 			}
 			goto force_return;
@@ -6807,8 +6869,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			 * and return p (if it is current and safe to
 			 * just run on this rq), or return-migrate the task.
 			 */
+			__clear_task_blocked_on(p, mutex);
 			if (task_current(rq, p)) {
-				__clear_task_blocked_on(p, NULL);
+				sched_clear_task_is_blocked(p);
 				return p;
 			}
 			goto force_return;
@@ -6902,6 +6965,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
+static inline void sched_set_task_is_blocked(struct task_struct *p) {}
+static inline void sched_clear_task_is_blocked(struct task_struct *p) {}
+
+static inline bool task_should_block(struct task_struct *p)
+{
+	return true;
+}
+
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
@@ -7026,13 +7097,13 @@ static void __sched notrace __schedule(int sched_mode)
 		}
 	} else if (!preempt && prev_state) {
 		/*
-		 * We pass task_is_blocked() as the should_block arg
+		 * We pass task_should_block() as the should_block arg
 		 * in order to keep mutex-blocked tasks on the runqueue
 		 * for slection with proxy-exec (without proxy-exec
-		 * task_is_blocked() will always be false).
+		 * task_should_block() will always be true).
 		 */
 		try_to_block_task(rq, prev, &prev_state,
-				  !task_is_blocked(prev));
+				  task_should_block(prev));
 		switch_count = &prev->nvcsw;
 	}
 
@@ -7044,7 +7115,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(task_is_blocked(next))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c95584191d58f..5c1085f260ad4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2390,7 +2390,7 @@ static inline bool task_is_blocked(struct task_struct *p)
 	if (!sched_proxy_exec())
 		return false;
 
-	return !!p->blocked_on;
+	return !!p->is_blocked;
 }
 
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)


end of thread, other threads:[~2026-05-05  4:37 UTC | newest]

Thread overview: 17+ messages
2026-04-30 21:50 [PATCH v2 0/2] Proxy Execution fixes for v7.1-rc John Stultz
2026-04-30 21:50 ` [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
2026-04-30 23:53   ` John Stultz
2026-05-01  6:39   ` K Prateek Nayak
2026-05-01  7:11     ` John Stultz
2026-05-01 13:21   ` Peter Zijlstra
2026-05-01 15:55     ` K Prateek Nayak
2026-05-01 18:59       ` Peter Zijlstra
2026-05-01 22:26         ` John Stultz
2026-05-03 18:42           ` K Prateek Nayak
2026-05-04  5:37             ` K Prateek Nayak
2026-05-05  3:32               ` John Stultz
2026-05-05  4:37                 ` K Prateek Nayak
2026-05-04 21:33             ` John Stultz
2026-04-30 21:50 ` [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
2026-05-01  6:57   ` K Prateek Nayak
2026-05-04 22:30   ` kernel test robot
