public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Proxy Execution fixes for v7.1-rc
@ 2026-04-27 18:38 John Stultz
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
  2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
  0 siblings, 2 replies; 11+ messages in thread
From: John Stultz @ 2026-04-27 18:38 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hey All,
  While testing with the full Proxy Execution series,
Vineeth Pillai managed to trip some interesting bugs which
initially looked to be KVM or RCU related[1]. He later
diagnosed them as Proxy Execution related and created a
useful test driver to reproduce them.

I found these same issues could be triggered with the upstream
portions of Proxy Execution, so I wanted to send along these
fixes for v7.1-rc.

Again, a huge thanks to Vineeth for uncovering these issues
that have evaded all my stress testing so far!

Thanks
-john

[1]: https://lore.kernel.org/lkml/20260320125633.2290675-1-vineeth@bitbyteword.org/

Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (2):
  sched: proxy-exec: Close race causing workqueue work being delayed
  locking: mutex: Fix proxy-exec potentially deactivating tasks marked
    TASK_RUNNING

 kernel/locking/mutex.c |  1 +
 kernel/sched/core.c    | 11 +++++++++++
 2 files changed, 12 insertions(+)

-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-27 18:38 [PATCH 0/2] Proxy Execution fixes for v7.1-rc John Stultz
@ 2026-04-27 18:38 ` John Stultz
  2026-04-28  8:06   ` K Prateek Nayak
  2026-04-28  9:43   ` Peter Zijlstra
  2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
  1 sibling, 2 replies; 11+ messages in thread
From: John Stultz @ 2026-04-27 18:38 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Vineeth reported seeing a KVM-related deadlock connected to
workqueue lockups using the android17-6.18 tree, which has
Proxy Execution enabled (using the full patch stack), but I've
subsequently reproduced it on v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu
  pwq (workqueue pool)
- One kvm-irqfd-cleanup worker (say A) takes a mutex and then
  calls synchronize_srcu_expedited()
- Another kvm-irqfd-cleanup worker (say B) tries to acquire
  the lock and then gets blocked
- On the way to blocking, this cpu gets an IPI, and on return
  from the IPI it calls __schedule() without getting to complete
  the workqueue accounting (updating worker->sleeping and
  decrementing pool->nr_running). This is done in
  sched_submit_work() -> wq_worker_sleeping() called from
  schedule(), and we got preempted before that.
- Proxy execution doesn't immediately take B off the runqueue,
  as p->blocked_on is set during __mutex_lock()
- The next time B is picked for running, the scheduler notices
  A (the mutex holder) is not on a runqueue and then blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- Things are then stuck. A is waiting for the workqueue to
  be run, but B can't run the workqueue as it is blocked on A.
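
Condensed, the interleaving looks roughly like this (paraphrasing
the steps above):

```text
worker B (this cpu)                     worker A (mutex owner)
-------------------                     ----------------------
__mutex_lock_common():
  state = TASK_UNINTERRUPTIBLE          sleeping in
  blocked_on = lock                     synchronize_srcu_expedited(),
<IPI> -> __schedule(SM_PREEMPT)         waiting on B's pwq to run
  sched_submit_work() never ran;        the cleanup work
  pool->nr_running not decremented
...B picked to run again:
  find_proxy_task() sees A off the
  runqueue -> proxy_deactivate()
  -> block_task(B)
B is dequeued; the pool still thinks
a worker is running, so no new worker
is woken -> A and B wait on each other
```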

The trouble is that with Proxy Execution, in
__mutex_lock_common() we set the task state to
TASK_UNINTERRUPTIBLE, and set blocked_on before calling into
schedule(), where sched_submit_work() will be called.

But if an IPI comes in before we call schedule(), the interrupt
return path will call __schedule(SM_PREEMPT) directly. This
causes the scheduler to see the current task as blocked_on and
deactivate it (because the owner is off the runqueue).

Since it's deactivated, it won't be run, and it won't get to
call sched_submit_work().

Without proxy-execution, the SM_PREEMPT case will prevent the
task from being dequeued, and it can be reselected again and
run, which will allow it to finish calling into schedule()
and calling sched_submit_work() before actually blocking.

So in the SM_PREEMPT case, if current is marked as blocked_on,
we should clear the blocked_on state and mark the task
TASK_RUNNING so it can be selected again to complete its call
to schedule() -> sched_submit_work().

Now because we cleared blocked_on and set the task
TASK_RUNNING, the task can be selected and run again, looping
back in __mutex_lock_common() where it can re-set the
blocked_on state and call back into schedule() in order to
properly be chosen as a donor.

Many thanks to Vineeth for figuring this very obscure race out
and for implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..5f684caefd8b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
 		try_to_block_task(rq, prev, &prev_state,
 				  !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
+	} else if (preempt && prev->blocked_on) {
+		/*
+		 * If we are SM_PREEMPT, we may have interrupted
+		 * after blocked_on was set, before schedule()
+		 * was run, preventing workques from running. So
+		 * clear blocked_on and mark task RUNNING so it
+		 * can be reselected to run and complete its
+		 * logic
+		 */
+		WRITE_ONCE(prev->__state, TASK_RUNNING);
+		clear_task_blocked_on(prev, NULL);
 	}
 
 pick_again:
-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-27 18:38 [PATCH 0/2] Proxy Execution fixes for v7.1-rc John Stultz
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
@ 2026-04-27 18:38 ` John Stultz
  2026-04-28  8:16   ` K Prateek Nayak
  1 sibling, 1 reply; 11+ messages in thread
From: John Stultz @ 2026-04-27 18:38 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Vineeth came up with a test driver that could trigger
workqueue stalls. After fixing one issue this test found,
Vineeth reported the test was still failing.

Greatly simplified, a task that tries to take a mutex already
owned by another, sleeping task can hit an edge case in
__mutex_lock_common().

If the task fails to get the lock, calls into schedule(), but
gets a spurious wakeup, it will find that it is the first
waiter, and go into the mutex_optimistic_spin() logic. Though
before calling mutex_optimistic_spin(), we clear the task's
blocked_on state, since mutex_optimistic_spin() may call
schedule() if need_resched() is set.

After mutex_optimistic_spin() fails, we set blocked_on again,
restart the main mutex loop, try to take the lock and call into
schedule_preempt_disabled().

From there, with proxy-execution, we'll see the task is
blocked_on, follow the chain, see the owner is sleeping and
dequeue the waiting task from the runqueue.

This all sounds fine and reasonable.  But what I had missed is
that in mutex_optimistic_spin(), not only do we call schedule(),
but we set the task state to TASK_RUNNING right before doing so.

This is ok for that invocation of schedule(). But when we come
back, we re-set the blocked_on state we had just cleared, but we
do not re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.

This means we have a task that is blocked_on & TASK_RUNNING,
so when the proxy execution code dequeues the task, we are
in trouble, since future wakeups will be short-circuited by
the ttwu_state_match() check.

Thus, to avoid this, after mutex_optimistic_spin(), re-set the
task state at the same time we re-set blocked_on.

Many many thanks again to Vineeth for his very useful test
driver that uncovered this long-hidden bug that I hadn't
tripped in all my testing! Very impressed with the problems
he's uncovered!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/locking/mutex.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 09534628dc01a..a93d4c6bee1a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			raw_spin_lock_irqsave(&lock->wait_lock, flags);
 			raw_spin_lock(&current->blocked_lock);
 			__set_task_blocked_on(current, lock);
+			set_current_state(state);
 
 			if (opt_acquired)
 				break;
-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
@ 2026-04-28  8:06   ` K Prateek Nayak
  2026-04-28  9:43   ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2026-04-28  8:06 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hello John,

On 4/28/2026 12:08 AM, John Stultz wrote:
> Vineeth reported seeing a KVM-related deadlock connected to
> workqueue lockups using the android17-6.18 tree, which has
> Proxy Execution enabled (using the full patch stack), but I've
> subsequently reproduced it on v7.1-rc1.
> 
> On further debugging he found:
> - The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu
>   pwq (workqueue pool)
> - One kvm-irqfd-cleanup worker (say A) takes a mutex and then
>   calls synchronize_srcu_expedited()
> - Another kvm-irqfd-cleanup worker (say B) tries to acquire
>   the lock and then gets blocked
> - On the way to blocking, this cpu gets an IPI, and on return
>   from the IPI it calls __schedule() without getting to complete
>   the workqueue accounting (updating worker->sleeping and
>   decrementing pool->nr_running). This is done in
>   sched_submit_work() -> wq_worker_sleeping() called from
>   schedule(), and we got preempted before that.
> - Proxy execution doesn't immediately take B off the runqueue,
>   as p->blocked_on is set during __mutex_lock()
> - The next time B is picked for running, the scheduler notices
>   A (the mutex holder) is not on a runqueue and then blocks B:
>   find_proxy_task() -> proxy_deactivate() -> block_task()
> - Things are then stuck. A is waiting for the workqueue to
>   be run, but B can't run the workqueue as it is blocked on A.
> 
> The trouble is that with Proxy Execution, in
> __mutex_lock_common() we set the task state to
> TASK_UNINTERRUPTIBLE, and set blocked_on before calling into
> schedule(), where sched_submit_work() will be called.

Geez! That is an interesting race.

> 
> But if an IPI comes in before we call schedule(), the interrupt
> return path will call __schedule(SM_PREEMPT) directly. This
> causes the scheduler to see the current task as blocked_on and
> deactivate it (because the owner is off the runqueue).
> 
> Since it's deactivated, it won't be run, and it won't get to
> call sched_submit_work().
> 
> Without proxy-execution, the SM_PREEMPT case will prevent the
> task from being dequeued, and it can be reselected again and
> run, which will allow it to finish calling into schedule()
> and calling sched_submit_work() before actually blocking.
> 
> So in the SM_PREEMPT case, if current is marked as blocked_on,
> we should clear the blocked_on state and mark the task
> TASK_RUNNING so it can be selected again to complete its call
> to schedule() -> sched_submit_work().
> 
> Now because we cleared blocked_on and set the task
> TASK_RUNNING, the task can be selected and run again, looping
> back in __mutex_lock_common() where it can re-set the
> blocked_on state and call back into schedule() in order to
> properly be chosen as a donor.
> 
> Many thanks to Vineeth for figuring this very obscure race out
> and for implementing a test tool to make it easily reproducible!
> 
> Reported-by: Vineeth Pillai <vineethrp@google.com>
> Tested-by: Vineeth Pillai <vineethrp@google.com>
> Signed-off-by: John Stultz <jstultz@google.com>

I guess it is missing a:

Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")

since that is where we began blocking a task on task_is_blocked(). I
really wish there were a better way to have detected this, but I cannot
think of one at the moment, so feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
> Cc: Vineeth Pillai <vineethrp@google.com>
> Cc: Sonam Sanju <sonam.sanju@intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Kunwu Chan <kunwu.chan@linux.dev>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..5f684caefd8b2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>  		try_to_block_task(rq, prev, &prev_state,
>  				  !task_is_blocked(prev));
>  		switch_count = &prev->nvcsw;
> +	} else if (preempt && prev->blocked_on) {
> +		/*
> +		 * If we are SM_PREEMPT, we may have interrupted
> +		 * after blocked_on was set, before schedule()
> +		 * was run, preventing workques from running. So
> +		 * clear blocked_on and mark task RUNNING so it
> +		 * can be reselected to run and complete its
> +		 * logic
> +		 */
> +		WRITE_ONCE(prev->__state, TASK_RUNNING);

nit.

You probably need to update "prev_state" too for trace_sched_switch() to
capture the right state down below.

Since this is on the way to schedule(), I wonder if it is possible to
just do a "next = prev" and goto picked ... but that adds more latency
on PREEMPT_RT, so that is a no-go, I presume.

> +		clear_task_blocked_on(prev, NULL);
>  	}
>  
>  pick_again:

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
@ 2026-04-28  8:16   ` K Prateek Nayak
  2026-04-28 19:50     ` John Stultz
  0 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2026-04-28  8:16 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hello John,

On 4/28/2026 12:08 AM, John Stultz wrote:
> Vineeth came up with a test driver that could trigger
> workqueue stalls. After fixing one issue this test found,
> Vineeth reported the test was still failing.
> 
> Greatly simplified, a task that tries to take a mutex already
> owned by another, sleeping task can hit an edge case in
> __mutex_lock_common().
> 
> If the task fails to get the lock, calls into schedule(), but
> gets a spurious wakeup, it will find that it is the first
> waiter, and go into the mutex_optimistic_spin() logic. Though
> before calling mutex_optimistic_spin(), we clear the task's
> blocked_on state, since mutex_optimistic_spin() may call
> schedule() if need_resched() is set.
> 
> After mutex_optimistic_spin() fails, we set blocked_on again,
> restart the main mutex loop, try to take the lock and call into
> schedule_preempt_disabled().
> 
> From there, with proxy-execution, we'll see the task is
> blocked_on, follow the chain, see the owner is sleeping and
> dequeue the waiting task from the runqueue.
> 
> This all sounds fine and reasonable.  But what I had missed is
> that in mutex_optimistic_spin(), not only do we call schedule(),
> but we set the task state to TASK_RUNNING right before doing so.
> 
> This is ok for that invocation of schedule(). But when we come
> back, we re-set the blocked_on state we had just cleared, but we
> do not re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.
> 
> This means we have a task that is blocked_on & TASK_RUNNING,
> so when the proxy execution code dequeues the task, we are
> in trouble, since future wakeups will be short-circuited by
> the ttwu_state_match() check.
> 
> Thus, to avoid this, after mutex_optimistic_spin(), re-set the
> task state at the same time we re-set blocked_on.
> 
> Many many thanks again to Vineeth for his very useful test
> driver that uncovered this long-hidden bug that I hadn't
> tripped in all my testing! Very impressed with the problems
> he's uncovered!
> 
> Reported-by: Vineeth Pillai <vineethrp@google.com>
> Tested-by: Vineeth Pillai <vineethrp@google.com>
> Signed-off-by: John Stultz <jstultz@google.com>

I think this too requires a:

Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")

With that, feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
> Cc: Vineeth Pillai <vineethrp@google.com>
> Cc: Sonam Sanju <sonam.sanju@intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Kunwu Chan <kunwu.chan@linux.dev>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>  kernel/locking/mutex.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 09534628dc01a..a93d4c6bee1a3 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  			raw_spin_lock(&current->blocked_lock);
>  			__set_task_blocked_on(current, lock);
> +			set_current_state(state);
>  
>  			if (opt_acquired)
>  				break;

nit.

As a micro-optimization, you can probably move the
__set_task_blocked_on() and set_current_state() to after this break.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
  2026-04-28  8:06   ` K Prateek Nayak
@ 2026-04-28  9:43   ` Peter Zijlstra
  2026-04-28 11:18     ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2026-04-28  9:43 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:

>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..5f684caefd8b2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>  		try_to_block_task(rq, prev, &prev_state,
>  				  !task_is_blocked(prev));
>  		switch_count = &prev->nvcsw;
> +	} else if (preempt && prev->blocked_on) {
> +		/*
> +		 * If we are SM_PREEMPT, we may have interrupted
> +		 * after blocked_on was set, before schedule()
> +		 * was run, preventing workques from running. So

workqueues

> +		 * clear blocked_on and mark task RUNNING so it
> +		 * can be reselected to run and complete its
> +		 * logic
> +		 */
> +		WRITE_ONCE(prev->__state, TASK_RUNNING);
> +		clear_task_blocked_on(prev, NULL);
>  	}
>  
>  pick_again:

*groan*, this feels wrong. Preemption should never touch state. Let me
try and wake up and make sense of this.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28  9:43   ` Peter Zijlstra
@ 2026-04-28 11:18     ` Peter Zijlstra
  2026-04-28 13:15       ` K Prateek Nayak
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2026-04-28 11:18 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
> 
> >  kernel/sched/core.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25ae..5f684caefd8b2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> >  		try_to_block_task(rq, prev, &prev_state,
> >  				  !task_is_blocked(prev));
> >  		switch_count = &prev->nvcsw;
> > +	} else if (preempt && prev->blocked_on) {
> > +		/*
> > +		 * If we are SM_PREEMPT, we may have interrupted
> > +		 * after blocked_on was set, before schedule()
> > +		 * was run, preventing workques from running. So
> 
> workqueues
> 
> > +		 * clear blocked_on and mark task RUNNING so it
> > +		 * can be reselected to run and complete its
> > +		 * logic
> > +		 */
> > +		WRITE_ONCE(prev->__state, TASK_RUNNING);
> > +		clear_task_blocked_on(prev, NULL);
> >  	}
> >  
> >  pick_again:
> 
> *groan*, this feels wrong. Preemption should never touch state. Let me
> try and wake up and make sense of this.

So all non-special block states *SHOULD* be in a loop and handle
spurious wakeups -- I fixed a pile of offenders many years ago, but
there really isn't anything in the kernel that validates this.

[ I suppose someone could try and do a cocci test for this? ]

Any wait for non-special states that is not a loop is fundamentally
broken, since many of the lock wake-up paths are explicitly racy in that
they can cause spurious wakeups (which is the safe side of the race,
since insufficient wakeups are bad etc.).

OTOH special states are special, esp. because they cannot handle
spurious wakeups.

Eg, consider something like:

	set_current_state(TASK_FROZEN)
	<PREEMPT>
	  current->__state = TASK_RUNNING
	</PREEMPT>
	schedule();

is all sorts of broken. Now, obviously special states must never have
blocked_on set, so this can be fudged about. But still, touching __state
from schedule() seems wrong.

Anyway, the historical distinction between a blocked task and a
preempted task is that the blocked task is not on the runqueue, while
the preempted task is kept on the runqueue.

Obviously PE wrecks this, and hence the problem. And yeah, amazing we
never hit this before.

Something like so perhaps?

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..0bd5da8360f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
 	struct alloc_tag		*alloc_tag;
 #endif
 
-	int				on_cpu;
+	u8				on_cpu;
+	u8				on_rq;
+	u8				is_blocked;
+	u8				__pad;
+
 	struct __call_single_node	wake_entry;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
 	 */
 	int				recent_used_cpu;
 	int				wake_cpu;
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..06817ae0cbd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
  *   [ The astute reader will observe that it is possible for two tasks on one
  *     CPU to have ->on_cpu = 1 at the same time. ]
  *
+ *  p->is_blocked <- { 0, 1 }:
+ *
+ *    is set by block_task() and cleared by ttwu_do_activate() and indicates
+ *    this task is blocked, as opposed to runnable. Used to distinguish between
+ *    preempted and blocked tasks for proxy exec, which keeps everything on the
+ *    runqueue.
+ *
  * task_cpu(p): is changed by set_task_cpu(), the rules are:
  *
  *  - Don't call set_task_cpu() on a blocked task:
@@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 
 static void block_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	p->is_blocked = 1;
 	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
 		__block_task(rq, p);
 }
@@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		atomic_dec(&task_rq(p)->nr_iowait);
 	}
 
+	p->is_blocked = 0;
 	activate_task(rq, p, en_flags);
 	wakeup_preempt(rq, p, wake_flags);
 
@@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(next->is_blocked && next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28 11:18     ` Peter Zijlstra
@ 2026-04-28 13:15       ` K Prateek Nayak
  2026-04-28 14:12         ` K Prateek Nayak
  2026-04-28 16:50         ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: K Prateek Nayak @ 2026-04-28 13:15 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello Peter,

On 4/28/2026 4:48 PM, Peter Zijlstra wrote:
> On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
>>
>>>  kernel/sched/core.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index da20fb6ea25ae..5f684caefd8b2 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>>>  		try_to_block_task(rq, prev, &prev_state,
>>>  				  !task_is_blocked(prev));
>>>  		switch_count = &prev->nvcsw;
>>> +	} else if (preempt && prev->blocked_on) {
>>> +		/*
>>> +		 * If we are SM_PREEMPT, we may have interrupted
>>> +		 * after blocked_on was set, before schedule()
>>> +		 * was run, preventing workques from running. So
>>
>> workqueues
>>
>>> +		 * clear blocked_on and mark task RUNNING so it
>>> +		 * can be reselected to run and complete its
>>> +		 * logic
>>> +		 */
>>> +		WRITE_ONCE(prev->__state, TASK_RUNNING);
>>> +		clear_task_blocked_on(prev, NULL);
>>>  	}
>>>  
>>>  pick_again:
>>
>> *groan*, this feels wrong. Preemption should never touch state. Let me
>> try and wake up and make sense of this.
> 
> So all non-special block states *SHOULD* be in a loop and handle
> spurious wakeups -- I fixed a pile of offenders some many years ago, but
> there really isn't anything in the kernel that validates this. 
> 
> [ I suppose someone could try and do a cocci test for this? ]
> 
> Any wait for non-special states that is not a loop is fundamentally
> broken, since many of the lock wake-up paths are explicitly racy in that
> they can cause spurious wakeups (which is the safe side of the race,
> since insufficient wakeups is bad etc.).
> 
> OTOH special states, are special, esp. because they cannot handle
> spurious wakeups.
> 
> Eg, consider something like:
> 
> 	set_current_state(TASK_FROZEN)
> 	<PREEMPT>
> 	  current->__state = TASK_RUNNING
> 	</PREEMPT/
> 	schedule();
> 
> is all sorts of broken. Now, obviously special states must never have
> blocked_on set, so this can be fudged about. But still, touching __state
> from schedule seems wrong.
> 
> Anyway, the historical distinction between a blocked task and a
> preempted task is that the blocked task is not on the runqueue, while
> the preempted task is kept on the runqueue.
> 
> Obviously PE wrecks this, and hence the problem. And yeah, amazing we
> never hit this before.
> 
> Something like so perhaps?
> 
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 368c7b4d7cb5..0bd5da8360f3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -846,7 +846,11 @@ struct task_struct {
>  	struct alloc_tag		*alloc_tag;
>  #endif
>  
> -	int				on_cpu;
> +	u8				on_cpu;
> +	u8				on_rq;
> +	u8				is_blocked;
> +	u8				__pad;
> +
>  	struct __call_single_node	wake_entry;
>  	unsigned int			wakee_flips;
>  	unsigned long			wakee_flip_decay_ts;
> @@ -861,7 +865,6 @@ struct task_struct {
>  	 */
>  	int				recent_used_cpu;
>  	int				wake_cpu;
> -	int				on_rq;
>  
>  	int				prio;
>  	int				static_prio;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25a..06817ae0cbd9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
>   *   [ The astute reader will observe that it is possible for two tasks on one
>   *     CPU to have ->on_cpu = 1 at the same time. ]
>   *
> + *  p->is_blocked <- { 0, 1 }:
> + *
> + *    is set by block_task() and cleared by ttwu_do_activate() and indicates
> + *    this task is blocked, as opposed to runnable. Used to distinguish between
> + *    preempted and blocked tasks for proxy exec, which keeps everything on the
> + *    runqueue.
> + *
>   * task_cpu(p): is changed by set_task_cpu(), the rules are:
>   *
>   *  - Don't call set_task_cpu() on a blocked task:
> @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  static void block_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	p->is_blocked = 1;

We never reach here with PROXY_EXEC. Instead we bail out in the caller
try_to_block_task() ...

>  	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
>  		__block_task(rq, p);
>  }
> @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  		atomic_dec(&task_rq(p)->nr_iowait);
>  	}
>  
> +	p->is_blocked = 0;
>  	activate_task(rq, p, en_flags);
>  	wakeup_preempt(rq, p, wake_flags);
>  
> @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>  
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(next->is_blocked && next->blocked_on)) {

... so ->is_blocked here is always false for proxy tasks retained on
the runqueue.

I was trying something like below, but somewhere I'm missing a
clear_task_blocked_on() for PROXY_WAKING before going back into
mutex_lock_common():

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718b..6ea74aecc5fbd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -586,6 +586,7 @@ struct sched_entity {
 	unsigned char			sched_delayed;
 	unsigned char			rel_deadline;
 	unsigned char			custom_slice;
+	unsigned char			sched_proxy;
 					/* hole */
 
 	u64				exec_start;
@@ -2222,6 +2223,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * clearing the relationship with a different lock.
 	 */
 	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WRITE_ONCE(p->se.sched_proxy, 0);
 	p->blocked_on = NULL;
 }
 
@@ -2250,6 +2252,8 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 	 * the relationship with a different lock.
 	 */
 	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	/* Force the task down proxy_force_return() path. */
+	WRITE_ONCE(p->se.sched_proxy, 1);
 	p->blocked_on = PROXY_WAKING;
 }
 
diff --git a/init/init_task.c b/init/init_task.c
index b5f48ebdc2b6e..8e8fc680fcd21 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -118,6 +118,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	},
 	.se		= {
 		.group_node 	= LIST_HEAD_INIT(init_task.se.group_node),
+		.sched_proxy 	= 0,
 	},
 	.rt		= {
 		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..8142fba59ad94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4395,6 +4395,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.sched_proxy		= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	/* A delayed task cannot be in clone(). */
@@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		/* Stable against race */
+		if (task_is_blocked(p))
+			WRITE_ONCE(p->se.sched_proxy, 1);
 		return false;
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6765,11 +6771,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
-	struct mutex *mutex;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; READ_ONCE(p->se.sched_proxy); p = owner) {
+		struct mutex *mutex = p->blocked_on;
+
+		if (!mutex)
+			return NULL;
+
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6787,7 +6797,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (!p->se.sched_proxy || mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7044,7 +7054,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(READ_ONCE(next->se.sched_proxy))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
---

>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
>  				zap_balance_callbacks(rq);

-- 
Thanks and Regards,
Prateek



* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28 13:15       ` K Prateek Nayak
@ 2026-04-28 14:12         ` K Prateek Nayak
  2026-04-28 16:50         ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2026-04-28 14:12 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On 4/28/2026 6:45 PM, K Prateek Nayak wrote:
> I was trying something like below, but somewhere I'm missing a
> clear_task_blocked_on() for PROXY_WAKING before going back into
> mutex_lock_common():

And I seem to have been missing:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8142fba59ad94..a8679b759398c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7046,6 +7046,9 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	if (!prev_state && task_is_blocked(prev))
+		clear_task_blocked_on(prev, NULL);
+
 pick_again:
 	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
---

With that, it survives test-ww_mutex and a sched-messaging run without
any splats.

> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ec3b6d7d718b..6ea74aecc5fbd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -586,6 +586,7 @@ struct sched_entity {
>  	unsigned char			sched_delayed;
>  	unsigned char			rel_deadline;
>  	unsigned char			custom_slice;
> +	unsigned char			sched_proxy;
>  					/* hole */
>  
>  	u64				exec_start;
> @@ -2222,6 +2223,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
>  	 * clearing the relationship with a different lock.
>  	 */
>  	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
> +	WRITE_ONCE(p->se.sched_proxy, 0);
>  	p->blocked_on = NULL;
>  }
>  
> @@ -2250,6 +2252,8 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
>  	 * the relationship with a different lock.
>  	 */
>  	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
> +	/* Force the task down proxy_force_return() path. */
> +	WRITE_ONCE(p->se.sched_proxy, 1);
>  	p->blocked_on = PROXY_WAKING;
>  }
>  
> diff --git a/init/init_task.c b/init/init_task.c
> index b5f48ebdc2b6e..8e8fc680fcd21 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -118,6 +118,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>  	},
>  	.se		= {
>  		.group_node 	= LIST_HEAD_INIT(init_task.se.group_node),
> +		.sched_proxy 	= 0,
>  	},
>  	.rt		= {
>  		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 49cd5d2171613..8142fba59ad94 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4395,6 +4395,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>  	p->se.nr_migrations		= 0;
>  	p->se.vruntime			= 0;
>  	p->se.vlag			= 0;
> +	p->se.sched_proxy		= 0;
>  	INIT_LIST_HEAD(&p->se.group_node);
>  
>  	/* A delayed task cannot be in clone(). */
> @@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  	 * blocked on a mutex, and we want to keep it on the runqueue
>  	 * to be selectable for proxy-execution.
>  	 */
> -	if (!should_block)
> +	if (!should_block) {
> +		guard(raw_spinlock)(&p->blocked_lock);
> +		/* Stable against race */
> +		if (task_is_blocked(p))
> +			WRITE_ONCE(p->se.sched_proxy, 1);
>  		return false;
> +	}
>  
>  	p->sched_contributes_to_load =
>  		(task_state & TASK_UNINTERRUPTIBLE) &&
> @@ -6765,11 +6771,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	bool curr_in_chain = false;
>  	int this_cpu = cpu_of(rq);
>  	struct task_struct *p;
> -	struct mutex *mutex;
>  	int owner_cpu;
>  
>  	/* Follow blocked_on chain. */
> -	for (p = donor; (mutex = p->blocked_on); p = owner) {
> +	for (p = donor; READ_ONCE(p->se.sched_proxy); p = owner) {
> +		struct mutex *mutex = p->blocked_on;
> +
> +		if (!mutex)
> +			return NULL;
> +
>  		/* if its PROXY_WAKING, do return migration or run if current */
>  		if (mutex == PROXY_WAKING) {
>  			if (task_current(rq, p)) {
> @@ -6787,7 +6797,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  		guard(raw_spinlock)(&p->blocked_lock);
>  
>  		/* Check again that p is blocked with blocked_lock held */
> -		if (mutex != __get_task_blocked_on(p)) {
> +		if (!p->se.sched_proxy || mutex != __get_task_blocked_on(p)) {
>  			/*
>  			 * Something changed in the blocked_on chain and
>  			 * we don't know if only at this level. So, let's
> @@ -7044,7 +7054,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>  
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(READ_ONCE(next->se.sched_proxy))) {
>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
>  				zap_balance_callbacks(rq);
> ---
> 
>>  			next = find_proxy_task(rq, next, &rf);
>>  			if (!next) {
>>  				zap_balance_callbacks(rq);
> 

-- 
Thanks and Regards,
Prateek



* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28 13:15       ` K Prateek Nayak
  2026-04-28 14:12         ` K Prateek Nayak
@ 2026-04-28 16:50         ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2026-04-28 16:50 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Tue, Apr 28, 2026 at 06:45:39PM +0530, K Prateek Nayak wrote:

> > Something like so perhaps?
> > 
> > ---
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 368c7b4d7cb5..0bd5da8360f3 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -846,7 +846,11 @@ struct task_struct {
> >  	struct alloc_tag		*alloc_tag;
> >  #endif
> >  
> > -	int				on_cpu;
> > +	u8				on_cpu;
> > +	u8				on_rq;
> > +	u8				is_blocked;
> > +	u8				__pad;
> > +
> >  	struct __call_single_node	wake_entry;
> >  	unsigned int			wakee_flips;
> >  	unsigned long			wakee_flip_decay_ts;
> > @@ -861,7 +865,6 @@ struct task_struct {
> >  	 */
> >  	int				recent_used_cpu;
> >  	int				wake_cpu;
> > -	int				on_rq;
> >  
> >  	int				prio;
> >  	int				static_prio;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25a..06817ae0cbd9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
> >   *   [ The astute reader will observe that it is possible for two tasks on one
> >   *     CPU to have ->on_cpu = 1 at the same time. ]
> >   *
> > + *  p->is_blocked <- { 0, 1 }:
> > + *
> > + *    is set by block_task() and cleared by ttwu_do_activate() and indicates
> > + *    this task is blocked, as opposed to runnable. Used to distinguish between
> > + *    preempted and blocked tasks for proxy exec, which keeps everything on the
> > + *    runqueue.
> > + *
> >   * task_cpu(p): is changed by set_task_cpu(), the rules are:
> >   *
> >   *  - Don't call set_task_cpu() on a blocked task:
> > @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  static void block_task(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	p->is_blocked = 1;
> 
> We never reach here with PROXY_EXEC. Instead we bail out in the caller
> try_to_block_task() ...
> 
> >  	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
> >  		__block_task(rq, p);
> >  }
> > @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> >  		atomic_dec(&task_rq(p)->nr_iowait);
> >  	}
> >  
> > +	p->is_blocked = 0;
> >  	activate_task(rq, p, en_flags);
> >  	wakeup_preempt(rq, p, wake_flags);
> >  
> > @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
> >  		struct task_struct *prev_donor = rq->donor;
> >  
> >  		rq_set_donor(rq, next);
> > -		if (unlikely(next->blocked_on)) {
> > +		if (unlikely(next->is_blocked && next->blocked_on)) {
> 
> ... so ->is_blocked here is always false for proxy tasks retained on
> the runqueue.

Right. Also, egads, we really should fix that block/ttwu part, this is a
mess.

Anyway, the idea is simple even if the execution turns into a bit of a mess
now: set when the task really is blocked, clear on wakeup.

> I was trying something like below but I'm somewhere missing a
> clear_task_blocked_on() for PROXY_WAKING before going back into
> mutex_lock_common():
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ec3b6d7d718b..6ea74aecc5fbd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -586,6 +586,7 @@ struct sched_entity {
>  	unsigned char			sched_delayed;
>  	unsigned char			rel_deadline;
>  	unsigned char			custom_slice;
> +	unsigned char			sched_proxy;
>  					/* hole */

Should not live in sched_entity I suppose.


* Re: [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-28  8:16   ` K Prateek Nayak
@ 2026-04-28 19:50     ` John Stultz
  0 siblings, 0 replies; 11+ messages in thread
From: John Stultz @ 2026-04-28 19:50 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Tue, Apr 28, 2026 at 1:16 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 4/28/2026 12:08 AM, John Stultz wrote:
> I think this too requires a:
>
> Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")
>
> With that, feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks so much for looking this over!

> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 09534628dc01a..a93d4c6bee1a3 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       raw_spin_lock_irqsave(&lock->wait_lock, flags);
> >                       raw_spin_lock(&current->blocked_lock);
> >                       __set_task_blocked_on(current, lock);
> > +                     set_current_state(state);
> >
> >                       if (opt_acquired)
> >                               break;
>
> nit.
>
> As a micro-optimization, you can probably move the
> __set_task_blocked_on() and set_current_state() after this break.

So my main reason for setting it before the break makes it symmetric
with the clearing that happens just outside the loop. Feels a little
cleaner to me for it to all match.

thanks
-john


end of thread, other threads:[~2026-04-28 19:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-27 18:38 [PATCH 0/2] Proxy Execution fixes for v7.1-rc John Stultz
2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
2026-04-28  8:06   ` K Prateek Nayak
2026-04-28  9:43   ` Peter Zijlstra
2026-04-28 11:18     ` Peter Zijlstra
2026-04-28 13:15       ` K Prateek Nayak
2026-04-28 14:12         ` K Prateek Nayak
2026-04-28 16:50         ` Peter Zijlstra
2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
2026-04-28  8:16   ` K Prateek Nayak
2026-04-28 19:50     ` John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox