* [PATCH 0/2] Proxy Execution fixes for v7.1-rc
@ 2026-04-27 18:38 John Stultz
2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
0 siblings, 2 replies; 12+ messages in thread
From: John Stultz @ 2026-04-27 18:38 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hey All,
While testing with the full Proxy Execution series, Vineeth Pillai
managed to trip some interesting bugs which initially looked to be
KVM or RCU related[1], but which he later diagnosed as Proxy
Execution related, creating a useful test driver to reproduce them.

I found these same issues could be triggered with the upstream
portions of Proxy Execution, so I wanted to send along these
fixes for v7.1-rc.
Again, a huge thanks to Vineeth for uncovering these issues
that have evaded all my stress testing so far!
Thanks
-john
[1]: https://lore.kernel.org/lkml/20260320125633.2290675-1-vineeth@bitbyteword.org/
Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
John Stultz (2):
sched: proxy-exec: Close race causing workqueue work being delayed
locking: mutex: Fix proxy-exec potentially deactivating tasks marked
TASK_RUNNING
kernel/locking/mutex.c | 1 +
kernel/sched/core.c | 11 +++++++++++
2 files changed, 12 insertions(+)
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-27 18:38 [PATCH 0/2] Proxy Execution fixes for v7.1-rc John Stultz
@ 2026-04-27 18:38 ` John Stultz
  2026-04-28  8:06   ` K Prateek Nayak
  2026-04-28  9:43   ` Peter Zijlstra
  2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
  1 sibling, 2 replies; 12+ messages in thread
From: John Stultz @ 2026-04-27 18:38 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Vineeth reported seeing a KVM related deadlock connected to work
queue lockups using the android17-6.18 tree, which has
Proxy Execution enabled (using the full patch stack), but I've
subsequently reproduced it on v7.1-rc1.

On further debugging he found:
- kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu
  pwq (workqueue pool)
- one kvm-irqfd-cleanup worker (say A) takes a mutex and then
  calls synchronize_srcu_expedited()
- another kvm-irqfd-cleanup worker (say B) tries to
  acquire the lock and then gets blocked
- On the way to blocking, this cpu gets an IPI, and on return
  from the IPI it calls __schedule() without getting to complete
  the workqueue accounting (worker->sleeping = 0 and decrementing
  pool->nr_running). This is done in sched_submit_work() ->
  wq_worker_sleeping() called from schedule(), and we got
  preempted before that.
- proxy execution doesn't immediately take B off the run queue,
  as p->blocked_on is set during __mutex_lock
- Next time B is picked for running, the scheduler notices A (the
  mutex holder) is not on a runqueue and then blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- And things are then stuck. A is waiting for the workqueue to
  be run, but B can't run the workqueue as it is blocked on A.

The trouble is that with Proxy Execution, in
__mutex_lock_common() we set the task state to
TASK_UNINTERRUPTIBLE, and set blocked_on before calling into
schedule(), where sched_submit_work() will be called.

But if an IPI comes in before we call schedule(), the interrupt
will call __schedule(SM_PREEMPT) directly. This causes the
scheduler to see the current task as blocked_on, and deactivate
it (because the owner is off the runqueue).

Since it's deactivated, it won't be run, and it won't get to
call sched_submit_work().

Without proxy-execution, the SM_PREEMPT case will prevent the
task from being dequeued, so it can be reselected and run,
which allows it to finish calling into schedule() and calling
sched_submit_work() before actually blocking.

So we need to make sure in the SM_PREEMPT case that, if current
is marked as blocked_on, we clear the blocked_on state and mark
the task RUNNABLE so the task can be selected to complete its
call to schedule() -> sched_submit_work().

Now because we cleared blocked_on and set the task RUNNABLE,
the task will be able to be selected and run again and loop back
in __mutex_lock_common(), where it can re-set the blocked_on
state and call back into schedule() in order to properly be
chosen as a donor.

Many thanks to Vineeth for figuring this very obscure race out
and for implementing a test tool to make it easily reproducible!
Reported-by: Vineeth Pillai <vineethrp@google.com>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..5f684caefd8b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
 			try_to_block_task(rq, prev, &prev_state,
 					  !task_is_blocked(prev));
 			switch_count = &prev->nvcsw;
+		} else if (preempt && prev->blocked_on) {
+			/*
+			 * If we are SM_PREEMPT, we may have interrupted
+			 * after blocked_on was set, before schedule()
+			 * was run, preventing workques from running. So
+			 * clear blocked_on and mark task RUNNING so it
+			 * can be reselected to run and complete its
+			 * logic
+			 */
+			WRITE_ONCE(prev->__state, TASK_RUNNING);
+			clear_task_blocked_on(prev, NULL);
 		}
 
 pick_again:
-- 
2.54.0.545.g6539524ca2-goog

^ permalink raw reply	related [flat|nested] 12+ messages in thread
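[Editorial illustration] The race and fix in the patch above can be modeled in a few lines of plain userspace C. This is a toy sketch, not kernel code: the struct fields and helpers below (state, blocked_on, submitted_work, begin_mutex_wait(), etc.) are hypothetical stand-ins for task->__state, task->blocked_on, and the sched_submit_work() accounting step, under the simplifying assumption that preemption can fire between setting blocked_on and reaching schedule():

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's task state machinery. */
enum toy_state { RUNNING, UNINTERRUPTIBLE };

struct task {
	enum toy_state state;
	void *blocked_on;	/* mutex being waited on, or NULL */
	int submitted_work;	/* did the sched_submit_work() step run? */
};

/* Models __mutex_lock_common(): state and blocked_on are set
 * *before* the task ever calls schedule(). */
static void begin_mutex_wait(struct task *t, void *mutex)
{
	t->state = UNINTERRUPTIBLE;
	t->blocked_on = mutex;
}

/* Models a voluntary schedule(): the workqueue accounting
 * (sched_submit_work()) runs before the task actually blocks. */
static void schedule_voluntary(struct task *t)
{
	t->submitted_work = 1;
}

/* Models the fixed __schedule(SM_PREEMPT) path: a preempted task
 * that set blocked_on but never reached schedule() is made RUNNING
 * again (instead of being deactivated), so it can retry the lock
 * and complete its accounting. */
static void preempt_with_fix(struct task *t)
{
	if (t->blocked_on) {
		t->state = RUNNING;	/* WRITE_ONCE(prev->__state, TASK_RUNNING) */
		t->blocked_on = NULL;	/* clear_task_blocked_on(prev, NULL) */
	}
}
```

With the fix, the preempted waiter is reselected, loops back in the (modeled) __mutex_lock_common(), and this time reaches the voluntary schedule() where the accounting runs; without it, the task would be deactivated with submitted_work still 0, which is exactly the stuck state described in the commit message.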
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
@ 2026-04-28  8:06   ` K Prateek Nayak
  2026-04-28  9:43   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: K Prateek Nayak @ 2026-04-28  8:06 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hello John,

On 4/28/2026 12:08 AM, John Stultz wrote:
> Vineeth reported seeing a KVM related deadlock connected to work
> queue lockups using the android17-6.18 tree, which has
> Proxy Execution enabled (using the full patch stack), but I've
> subsequently reproduced it on v7.1-rc1.
> 
> On further debugging he found:
> - kvm-irqfd-cleanup workqueue and rcu_gp lands in a per-cpu
>   pwq(work queue pool)
> - one of kvm-irqfd-cleanup worker(say A) takes a mutex and then
>   calls synchronize_srcu_expedited()
> - one other kvm-irqfd-cleanup worker worker(Say B) tries to
>   acquire the lock and then gets blocked
> - On the way to blocking, this cpu gets an IPI and on return
>   from IPI, it calls __schedule() and did not get to complete
>   workqueue accounting(worker->sleeping = 0 and decrementing
>   pool->nr_running). This is done in sched_submit_work() ->
>   wq_worker_sleeping() called from schedule() and we got
>   preempted before that.
> - proxy execution doesn't immediately take it off run queue as
>   p->blocked_on is set during __mutex_lock
> - Next time when B is picked for running, it notices A(mutex
>   holder) is not on a runqueue and then blocks B.
>   find_proxy_task() -> proxy_deactivate() -> block_task()
> - And things are then stuck. A is waiting for the workqueue to
>   be run, but B can't run the workqueue as it is blocked on A.
> 
> The trouble is that with Proxy Execution, in
> __mutex_lock_common() we set the task state to
> TASK_UNINTERRUPTIBLE, and set blocked_on before calling into
> schedule(), where sched_submit_work() will be called.

Geez! That is an interesting race.

> 
> But if an IPI comes in before we call schedule() the interrupt
> will call __schedule(SM_PREEMPT) directly. This causes the
> scheduler to see the current task as blocked_on, and deactivate
> it (because the owner is off the runqueue).
> 
> Since its deactivated, it wont' be run, and it won't get to
> call sched_submit_work().
> 
> Without proxy-execution, the SM_PREEMPT case will prevent the
> task from being dequeued, and it can be reselected again and
> run, which will allow it to finish calling into schedule()
> and calling sched_submit_work() before actually blocking.
> 
> So we need to make sure on the SM_PREEMPT case, if current is
> marked as blocked_on, we should clear the blocked_on state and
> mark the task RUNNABLE so the task can be selected to complete
> its call to schedule() -> sched_submit_work().
> 
> Now because we cleared BLOCKED_ON and set the task RUNNABLE,
> the task will be able to be selected and run again and loop back
> in __mutex_lock_common() where it can re-set the blocked_on
> state and call back into schedule() in order to properly be
> chosen as a donor.
> 
> Many thanks to Vineeth for figuring this very obscure race out
> and for implementing a test tool to make it easily reproducible!
> 
> Reported-by: Vineeth Pillai <vineethrp@google.com>
> Tested-by: Vineeth Pillai <vineethrp@google.com>
> Signed-off-by: John Stultz <jstultz@google.com>

I guess it is missing a:

Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")

since that is where we began blocking a task on task_is_blocked().

I really wish there was a better way to have detected this but I
cannot think of any better way at the moment so feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
> Cc: Vineeth Pillai <vineethrp@google.com>
> Cc: Sonam Sanju <sonam.sanju@intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Kunwu Chan <kunwu.chan@linux.dev>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..5f684caefd8b2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>  			try_to_block_task(rq, prev, &prev_state,
>  					  !task_is_blocked(prev));
>  			switch_count = &prev->nvcsw;
> +		} else if (preempt && prev->blocked_on) {
> +			/*
> +			 * If we are SM_PREEMPT, we may have interrupted
> +			 * after blocked_on was set, before schedule()
> +			 * was run, preventing workques from running. So
> +			 * clear blocked_on and mark task RUNNING so it
> +			 * can be reselected to run and complete its
> +			 * logic
> +			 */
> +			WRITE_ONCE(prev->__state, TASK_RUNNING);

nit. You probably need to update "prev_state" too for
trace_sched_switch() to capture the right state down below.

Since this is on the way to schedule(), I wonder if it is possible to
just do a "next = prev" and goto picked ... but that adds more latency
on PREEMPT_RT so that is a no go I presume.

> +			clear_task_blocked_on(prev, NULL);
> 		}
> 
> pick_again:

-- 
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
  2026-04-28  8:06   ` K Prateek Nayak
@ 2026-04-28  9:43   ` Peter Zijlstra
  2026-04-28 11:18     ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-04-28  9:43 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:

>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..5f684caefd8b2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>  			try_to_block_task(rq, prev, &prev_state,
>  					  !task_is_blocked(prev));
>  			switch_count = &prev->nvcsw;
> +		} else if (preempt && prev->blocked_on) {
> +			/*
> +			 * If we are SM_PREEMPT, we may have interrupted
> +			 * after blocked_on was set, before schedule()
> +			 * was run, preventing workques from running. So

workqueues

> +			 * clear blocked_on and mark task RUNNING so it
> +			 * can be reselected to run and complete its
> +			 * logic
> +			 */
> +			WRITE_ONCE(prev->__state, TASK_RUNNING);
> +			clear_task_blocked_on(prev, NULL);
> 		}
> 
> pick_again:

*groan*, this feels wrong. Preemption should never touch state. Let me
try and wake up and make sense of this.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28  9:43   ` Peter Zijlstra
@ 2026-04-28 11:18     ` Peter Zijlstra
  2026-04-28 13:15       ` K Prateek Nayak
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-04-28 11:18 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
> 
> >  kernel/sched/core.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25ae..5f684caefd8b2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> >  			try_to_block_task(rq, prev, &prev_state,
> >  					  !task_is_blocked(prev));
> >  			switch_count = &prev->nvcsw;
> > +		} else if (preempt && prev->blocked_on) {
> > +			/*
> > +			 * If we are SM_PREEMPT, we may have interrupted
> > +			 * after blocked_on was set, before schedule()
> > +			 * was run, preventing workques from running. So
> 
> workqueues
> 
> > +			 * clear blocked_on and mark task RUNNING so it
> > +			 * can be reselected to run and complete its
> > +			 * logic
> > +			 */
> > +			WRITE_ONCE(prev->__state, TASK_RUNNING);
> > +			clear_task_blocked_on(prev, NULL);
> > 		}
> > 
> > pick_again:
> 
> *groan*, this feels wrong. Preemption should never touch state. Let me
> try and wake up and make sense of this.

So all non-special block states *SHOULD* be in a loop and handle
spurious wakeups -- I fixed a pile of offenders some many years ago, but
there really isn't anything in the kernel that validates this.

[ I suppose someone could try and do a cocci test for this? ]

Any wait for non-special states that is not a loop is fundamentally
broken, since many of the lock wake-up paths are explicitly racy in that
they can cause spurious wakeups (which is the safe side of the race,
since insufficient wakeups is bad etc.).

OTOH special states are special, esp. because they cannot handle
spurious wakeups.

Eg, consider something like:

	set_current_state(TASK_FROZEN)
	<PREEMPT>
	  current->__state = TASK_RUNNING
	</PREEMPT>
	schedule();

is all sorts of broken. Now, obviously special states must never have
blocked_on set, so this can be fudged about. But still, touching __state
from schedule seems wrong.

Anyway, the historical distinction between a blocked task and a
preempted task is that the blocked task is not on the runqueue, while
the preempted task is kept on the runqueue.

Obviously PE wrecks this, and hence the problem. And yeah, amazing we
never hit this before.

Something like so perhaps?

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..0bd5da8360f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
 	struct alloc_tag		*alloc_tag;
 #endif
 
-	int				on_cpu;
+	u8				on_cpu;
+	u8				on_rq;
+	u8				is_blocked;
+	u8				__pad;
+
 	struct __call_single_node	wake_entry;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
 	 */
 	int				recent_used_cpu;
 	int				wake_cpu;
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..06817ae0cbd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
  * [ The astute reader will observe that it is possible for two tasks on one
  *   CPU to have ->on_cpu = 1 at the same time. ]
  *
+ * p->is_blocked <- { 0, 1 }:
+ *
+ *   is set by block_task() and cleared by ttwu_do_activate() and indicates
+ *   this task is blocked, as opposed to runnable. Used to distinguish between
+ *   preempted and blocked tasks for proxy exec, which keeps everything on the
+ *   runqueue.
+ *
  * task_cpu(p): is changed by set_task_cpu(), the rules are:
  *
  *  - Don't call set_task_cpu() on a blocked task:
@@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 
 static void block_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	p->is_blocked = 1;
 	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
 		__block_task(rq, p);
 }
@@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		atomic_dec(&task_rq(p)->nr_iowait);
 	}
 
+	p->is_blocked = 0;
 	activate_task(rq, p, en_flags);
 	wakeup_preempt(rq, p, wake_flags);
 
@@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(next->is_blocked && next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);

^ permalink raw reply	related [flat|nested] 12+ messages in thread
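[Editorial illustration] The wait-loop rule Peter describes (every non-special blocking state must re-check its condition in a loop, because lock wakeup paths may wake it spuriously) is the same contract userspace code has with condition variables. The sketch below is a plain pthread analogy, not kernel code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool ready;

/* Correct: the while loop makes a spurious wakeup harmless -- the
 * waiter simply re-checks the condition and sleeps again. A bare
 * "if" here would be the broken non-loop wait being warned about. */
static void wait_for_ready(void)
{
	pthread_mutex_lock(&lock);
	while (!ready)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}

static void set_ready(void)
{
	pthread_mutex_lock(&lock);
	ready = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}
```

Special states like TASK_FROZEN have no such retry loop by design, which is why, as the message above notes, they cannot tolerate a stray wakeup.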
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28 11:18     ` Peter Zijlstra
@ 2026-04-28 13:15       ` K Prateek Nayak
  2026-04-28 14:12         ` K Prateek Nayak
                            ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: K Prateek Nayak @ 2026-04-28 13:15 UTC (permalink / raw)
To: Peter Zijlstra, John Stultz
Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello Peter,

On 4/28/2026 4:48 PM, Peter Zijlstra wrote:
> On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
>>
>>>  kernel/sched/core.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index da20fb6ea25ae..5f684caefd8b2 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>>>  			try_to_block_task(rq, prev, &prev_state,
>>>  					  !task_is_blocked(prev));
>>>  			switch_count = &prev->nvcsw;
>>> +		} else if (preempt && prev->blocked_on) {
>>> +			/*
>>> +			 * If we are SM_PREEMPT, we may have interrupted
>>> +			 * after blocked_on was set, before schedule()
>>> +			 * was run, preventing workques from running. So
>>
>> workqueues
>>
>>> +			 * clear blocked_on and mark task RUNNING so it
>>> +			 * can be reselected to run and complete its
>>> +			 * logic
>>> +			 */
>>> +			WRITE_ONCE(prev->__state, TASK_RUNNING);
>>> +			clear_task_blocked_on(prev, NULL);
>>> 		}
>>>
>>> pick_again:
>>
>> *groan*, this feels wrong. Preemption should never touch state. Let me
>> try and wake up and make sense of this.
> 
> So all non-special block states *SHOULD* be in a loop and handle
> spurious wakeups -- I fixed a pile of offenders some many years ago, but
> there really isn't anything in the kernel that validates this.
> 
> [ I suppose someone could try and do a cocci test for this? ]
> 
> Any wait for non-special states that is not a loop is fundamentally
> broken, since many of the lock wake-up paths are explicitly racy in that
> they can cause spurious wakeups (which is the safe side of the race,
> since insufficient wakeups is bad etc.).
> 
> OTOH special states are special, esp. because they cannot handle
> spurious wakeups.
> 
> Eg, consider something like:
> 
> 	set_current_state(TASK_FROZEN)
> 	<PREEMPT>
> 	  current->__state = TASK_RUNNING
> 	</PREEMPT>
> 	schedule();
> 
> is all sorts of broken. Now, obviously special states must never have
> blocked_on set, so this can be fudged about. But still, touching __state
> from schedule seems wrong.
> 
> Anyway, the historical distinction between a blocked task and a
> preempted task is that the blocked task is not on the runqueue, while
> the preempted task is kept on the runqueue.
> 
> Obviously PE wrecks this, and hence the problem. And yeah, amazing we
> never hit this before.
> 
> Something like so perhaps?
> 
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 368c7b4d7cb5..0bd5da8360f3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -846,7 +846,11 @@ struct task_struct {
>  	struct alloc_tag		*alloc_tag;
>  #endif
>  
> -	int				on_cpu;
> +	u8				on_cpu;
> +	u8				on_rq;
> +	u8				is_blocked;
> +	u8				__pad;
> +
>  	struct __call_single_node	wake_entry;
>  	unsigned int			wakee_flips;
>  	unsigned long			wakee_flip_decay_ts;
> @@ -861,7 +865,6 @@ struct task_struct {
>  	 */
>  	int				recent_used_cpu;
>  	int				wake_cpu;
> -	int				on_rq;
>  
>  	int				prio;
>  	int				static_prio;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25a..06817ae0cbd9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
>   * [ The astute reader will observe that it is possible for two tasks on one
>   *   CPU to have ->on_cpu = 1 at the same time. ]
>   *
> + * p->is_blocked <- { 0, 1 }:
> + *
> + *   is set by block_task() and cleared by ttwu_do_activate() and indicates
> + *   this task is blocked, as opposed to runnable. Used to distinguish between
> + *   preempted and blocked tasks for proxy exec, which keeps everything on the
> + *   runqueue.
> + *
>   * task_cpu(p): is changed by set_task_cpu(), the rules are:
>   *
>   *  - Don't call set_task_cpu() on a blocked task:
> @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  static void block_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	p->is_blocked = 1;

We never reach here with PROXY_EXEC. Instead we bail out in the caller
try_to_block_task() ...

>  	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
>  		__block_task(rq, p);
>  }
> @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  		atomic_dec(&task_rq(p)->nr_iowait);
>  	}
>  
> +	p->is_blocked = 0;
>  	activate_task(rq, p, en_flags);
>  	wakeup_preempt(rq, p, wake_flags);
>  
> @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>  
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(next->is_blocked && next->blocked_on)) {

... so ->is_blocked here is always false for proxy tasks retained on
the runqueue.

I was trying something like below but I'm somewhere missing a
clear_task_blocked_on() for PROXY_WAKING before going back into
mutex_lock_common():

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718b..6ea74aecc5fbd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -586,6 +586,7 @@ struct sched_entity {
 	unsigned char			sched_delayed;
 	unsigned char			rel_deadline;
 	unsigned char			custom_slice;
+	unsigned char			sched_proxy;
 					/* hole */
 
 	u64				exec_start;
@@ -2222,6 +2223,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * clearing the relationship with a different lock.
 	 */
 	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WRITE_ONCE(p->se.sched_proxy, 0);
 	p->blocked_on = NULL;
 }
 
@@ -2250,6 +2252,8 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 	 * the relationship with a different lock.
 	 */
 	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	/* Force the task down proxy_force_return() path. */
+	WRITE_ONCE(p->se.sched_proxy, 1);
 	p->blocked_on = PROXY_WAKING;
 }
 
diff --git a/init/init_task.c b/init/init_task.c
index b5f48ebdc2b6e..8e8fc680fcd21 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -118,6 +118,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	},
 	.se = {
 		.group_node	= LIST_HEAD_INIT(init_task.se.group_node),
+		.sched_proxy	= 0,
 	},
 	.rt = {
 		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..8142fba59ad94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4395,6 +4395,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.sched_proxy		= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	/* A delayed task cannot be in clone(). */
@@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		/* Stable against race */
+		if (task_is_blocked(p))
+			WRITE_ONCE(p->se.sched_proxy, 1);
 		return false;
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6765,11 +6771,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
-	struct mutex *mutex;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; READ_ONCE(p->se.sched_proxy); p = owner) {
+		struct mutex *mutex = p->blocked_on;
+
+		if (!mutex)
+			return NULL;
+
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6787,7 +6797,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (!p->se.sched_proxy || mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7044,7 +7054,7 @@ static void __sched notrace __schedule(int sched_mode)
 	struct task_struct *prev_donor = rq->donor;
 
 	rq_set_donor(rq, next);
-	if (unlikely(next->blocked_on)) {
+	if (unlikely(READ_ONCE(next->se.sched_proxy))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
 			zap_balance_callbacks(rq);
---

> 		next = find_proxy_task(rq, next, &rf);
> 		if (!next) {
> 			zap_balance_callbacks(rq);

-- 
Thanks and Regards,
Prateek

^ permalink raw reply	related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed 2026-04-28 13:15 ` K Prateek Nayak @ 2026-04-28 14:12 ` K Prateek Nayak 2026-04-28 16:50 ` Peter Zijlstra 2026-04-29 2:27 ` John Stultz 2 siblings, 0 replies; 12+ messages in thread From: K Prateek Nayak @ 2026-04-28 14:12 UTC (permalink / raw) To: Peter Zijlstra, John Stultz Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team On 4/28/2026 6:45 PM, K Prateek Nayak wrote: > I was trying something like below but I'm somewhere missing a > clear_task_blocked_on() for PROXY_WAKING before going back into > mutex_lock_common(): And I seem to have been missing: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8142fba59ad94..a8679b759398c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7046,6 +7046,9 @@ static void __sched notrace __schedule(int sched_mode) switch_count = &prev->nvcsw; } + if (!prev_state && task_is_blocked(prev)) + clear_task_blocked_on(prev, NULL); + pick_again: assert_balance_callbacks_empty(rq); next = pick_next_task(rq, rq->donor, &rf); --- With that, it survives test-ww_mutex and a sched-messaging run without any splats. 
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ec3b6d7d718b..6ea74aecc5fbd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -586,6 +586,7 @@ struct sched_entity {
>  	unsigned char			sched_delayed;
>  	unsigned char			rel_deadline;
>  	unsigned char			custom_slice;
> +	unsigned char			sched_proxy;
>  	/* hole */
>  
>  	u64				exec_start;
> @@ -2222,6 +2223,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
>  	 * clearing the relationship with a different lock.
>  	 */
>  	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
> +	WRITE_ONCE(p->se.sched_proxy, 0);
>  	p->blocked_on = NULL;
>  }
>  
> @@ -2250,6 +2252,8 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
>  	 * the relationship with a different lock.
>  	 */
>  	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
> +	/* Force the task down proxy_force_return() path. */
> +	WRITE_ONCE(p->se.sched_proxy, 1);
>  	p->blocked_on = PROXY_WAKING;
>  }
> 
> diff --git a/init/init_task.c b/init/init_task.c
> index b5f48ebdc2b6e..8e8fc680fcd21 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -118,6 +118,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>  	},
>  	.se		= {
>  		.group_node	= LIST_HEAD_INIT(init_task.se.group_node),
> +		.sched_proxy	= 0,
>  	},
>  	.rt		= {
>  		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 49cd5d2171613..8142fba59ad94 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4395,6 +4395,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>  	p->se.nr_migrations		= 0;
>  	p->se.vruntime			= 0;
>  	p->se.vlag			= 0;
> +	p->se.sched_proxy		= 0;
>  	INIT_LIST_HEAD(&p->se.group_node);
>  
>  	/* A delayed task cannot be in clone(). */
> @@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  	 * blocked on a mutex, and we want to keep it on the runqueue
>  	 * to be selectable for proxy-execution.
>  	 */
> -	if (!should_block)
> +	if (!should_block) {
> +		guard(raw_spinlock)(&p->blocked_lock);
> +		/* Stable against race */
> +		if (task_is_blocked(p))
> +			WRITE_ONCE(p->se.sched_proxy, 1);
>  		return false;
> +	}
>  
>  	p->sched_contributes_to_load =
>  		(task_state & TASK_UNINTERRUPTIBLE) &&
> @@ -6765,11 +6771,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	bool curr_in_chain = false;
>  	int this_cpu = cpu_of(rq);
>  	struct task_struct *p;
> -	struct mutex *mutex;
>  	int owner_cpu;
>  
>  	/* Follow blocked_on chain. */
> -	for (p = donor; (mutex = p->blocked_on); p = owner) {
> +	for (p = donor; READ_ONCE(p->se.sched_proxy); p = owner) {
> +		struct mutex *mutex = p->blocked_on;
> +
> +		if (!mutex)
> +			return NULL;
> +
>  		/* if its PROXY_WAKING, do return migration or run if current */
>  		if (mutex == PROXY_WAKING) {
>  			if (task_current(rq, p)) {
> @@ -6787,7 +6797,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  		guard(raw_spinlock)(&p->blocked_lock);
>  
>  		/* Check again that p is blocked with blocked_lock held */
> -		if (mutex != __get_task_blocked_on(p)) {
> +		if (!p->se.sched_proxy || mutex != __get_task_blocked_on(p)) {
>  			/*
>  			 * Something changed in the blocked_on chain and
>  			 * we don't know if only at this level. So, let's
> @@ -7044,7 +7054,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>  
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(READ_ONCE(next->se.sched_proxy))) {
>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
>  				zap_balance_callbacks(rq);
> ---
> 
>> 		next = find_proxy_task(rq, next, &rf);
>> 		if (!next) {
>> 			zap_balance_callbacks(rq);
> 
> -- 
> Thanks and Regards,
> Prateek
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28 13:15         ` K Prateek Nayak
  2026-04-28 14:12           ` K Prateek Nayak
@ 2026-04-28 16:50           ` Peter Zijlstra
  2026-04-29  2:27           ` John Stultz
  2 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2026-04-28 16:50 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Tue, Apr 28, 2026 at 06:45:39PM +0530, K Prateek Nayak wrote:

> > Something like so perhaps?
> >
> > ---
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 368c7b4d7cb5..0bd5da8360f3 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -846,7 +846,11 @@ struct task_struct {
> >  	struct alloc_tag		*alloc_tag;
> >  #endif
> >  
> > -	int				on_cpu;
> > +	u8				on_cpu;
> > +	u8				on_rq;
> > +	u8				is_blocked;
> > +	u8				__pad;
> > +
> >  	struct __call_single_node	wake_entry;
> >  	unsigned int			wakee_flips;
> >  	unsigned long			wakee_flip_decay_ts;
> > @@ -861,7 +865,6 @@ struct task_struct {
> >  	 */
> >  	int				recent_used_cpu;
> >  	int				wake_cpu;
> > -	int				on_rq;
> >  
> >  	int				prio;
> >  	int				static_prio;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25a..06817ae0cbd9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
> >   * [ The astute reader will observe that it is possible for two tasks on one
> >   *   CPU to have ->on_cpu = 1 at the same time. ]
> >   *
> > +* p->is_blocked <- { 0, 1 }:
> > +*
> > +*   is set by block_task() and cleared by ttwu_do_activate() and indicates
> > +*   this task is blocked, as opposed to runnable. Used to distinguish between
> > +*   preempted and blocked tasks for proxy exec, which keeps everything on the
> > +*   runqueue.
> > + *
> >   * task_cpu(p): is changed by set_task_cpu(), the rules are:
> >   *
> >   *  - Don't call set_task_cpu() on a blocked task:
> > @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  static void block_task(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	p->is_blocked = 1;
> 
> We never reach here with PROXY_EXEC. Instead we bail out in the caller
> try_to_block_task() ...
> 
> >  	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
> >  		__block_task(rq, p);
> >  }
> > @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> >  		atomic_dec(&task_rq(p)->nr_iowait);
> >  	}
> >  
> > +	p->is_blocked = 0;
> >  	activate_task(rq, p, en_flags);
> >  	wakeup_preempt(rq, p, wake_flags);
> >  
> > @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
> >  		struct task_struct *prev_donor = rq->donor;
> >  
> >  		rq_set_donor(rq, next);
> > -		if (unlikely(next->blocked_on)) {
> > +		if (unlikely(next->is_blocked && next->blocked_on)) {
> 
> ... so ->is_blocked here is always false for proxy tasks retained on
> the runqueue.

Right. Also, egads, we really should fix that block/ttwu part, this is
a mess.

Anyway, idea is simple even if execution turns into a bit of a mess now,
set when task really is blocked and clear on wakeup.

> I was trying something like below but I'm somewhere missing a
> clear_task_blocked_on() for PROXY_WAKING before going back into
> mutex_lock_common():
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ec3b6d7d718b..6ea74aecc5fbd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -586,6 +586,7 @@ struct sched_entity {
>  	unsigned char			sched_delayed;
>  	unsigned char			rel_deadline;
>  	unsigned char			custom_slice;
> +	unsigned char			sched_proxy;
>  	/* hole */

Should not live in sched_entity I suppose.
* Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
  2026-04-28 13:15         ` K Prateek Nayak
  2026-04-28 14:12           ` K Prateek Nayak
  2026-04-28 16:50           ` Peter Zijlstra
@ 2026-04-29  2:27           ` John Stultz
  2 siblings, 0 replies; 12+ messages in thread
From: John Stultz @ 2026-04-29  2:27 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, LKML, Vineeth Pillai, Sonam Sanju,
	Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon,
	Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Tue, Apr 28, 2026 at 6:15 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 4/28/2026 4:48 PM, Peter Zijlstra wrote:
> > On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> >> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
> >>
> >>>  kernel/sched/core.c | 11 +++++++++++
> >>>  1 file changed, 11 insertions(+)
> >>>
> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>> index da20fb6ea25ae..5f684caefd8b2 100644
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> >>>  			try_to_block_task(rq, prev, &prev_state,
> >>>  					  !task_is_blocked(prev));
> >>>  			switch_count = &prev->nvcsw;
> >>> +		} else if (preempt && prev->blocked_on) {
> >>> +			/*
> >>> +			 * If we are SM_PREEMPT, we may have interrupted
> >>> +			 * after blocked_on was set, before schedule()
> >>> +			 * was run, preventing workques from running. So
> >>
> >> workqueues
> >>
> >>> +			 * clear blocked_on and mark task RUNNING so it
> >>> +			 * can be reselected to run and complete its
> >>> +			 * logic
> >>> +			 */
> >>> +			WRITE_ONCE(prev->__state, TASK_RUNNING);
> >>> +			clear_task_blocked_on(prev, NULL);
> >>>  		}
> >>>
> >>>  pick_again:
> >>
> >> *groan*, this feels wrong. Preemption should never touch state. Let me
> >> try and wake up and make sense of this.
> >
> > So all non-special block states *SHOULD* be in a loop and handle
> > spurious wakeups -- I fixed a pile of offenders some many years ago, but
> > there really isn't anything in the kernel that validates this.
> >
> > [ I suppose someone could try and do a cocci test for this? ]
> >
> > Any wait for non-special states that is not a loop is fundamentally
> > broken, since many of the lock wake-up paths are explicitly racy in that
> > they can cause spurious wakeups (which is the safe side of the race,
> > since insufficient wakeups is bad etc.).
> >
> > OTOH special states, are special, esp. because they cannot handle
> > spurious wakeups.
> >
> > Eg, consider something like:
> >
> >	set_current_state(TASK_FROZEN)
> >	<PREEMPT>
> >	current->__state = TASK_RUNNING
> >	</PREEMPT>
> >	schedule();
> >
> > is all sorts of broken. Now, obviously special states must never have
> > blocked_on set, so this can be fudged about. But still, touching __state
> > from schedule seems wrong.
> >
> > Anyway, the historical distinction between a blocked task and a
> > preempted task is that the blocked task is not on the runqueue, while
> > the preempted task is kept on the runqueue.
> >
> > Obviously PE wrecks this, and hence the problem. And yeah, amazing we
> > never hit this before.
> >
> > Something like so perhaps?
> >
> > ---
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 368c7b4d7cb5..0bd5da8360f3 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -846,7 +846,11 @@ struct task_struct {
> >  	struct alloc_tag		*alloc_tag;
> >  #endif
> >  
> > -	int				on_cpu;
> > +	u8				on_cpu;
> > +	u8				on_rq;
> > +	u8				is_blocked;
> > +	u8				__pad;
> > +
> >  	struct __call_single_node	wake_entry;
> >  	unsigned int			wakee_flips;
> >  	unsigned long			wakee_flip_decay_ts;
> > @@ -861,7 +865,6 @@ struct task_struct {
> >  	 */
> >  	int				recent_used_cpu;
> >  	int				wake_cpu;
> > -	int				on_rq;
> >  
> >  	int				prio;
> >  	int				static_prio;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25a..06817ae0cbd9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
> >   * [ The astute reader will observe that it is possible for two tasks on one
> >   *   CPU to have ->on_cpu = 1 at the same time. ]
> >   *
> > +* p->is_blocked <- { 0, 1 }:
> > +*
> > +*   is set by block_task() and cleared by ttwu_do_activate() and indicates
> > +*   this task is blocked, as opposed to runnable. Used to distinguish between
> > +*   preempted and blocked tasks for proxy exec, which keeps everything on the
> > +*   runqueue.
> > + *
> >   * task_cpu(p): is changed by set_task_cpu(), the rules are:
> >   *
> >   *  - Don't call set_task_cpu() on a blocked task:
> > @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  static void block_task(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	p->is_blocked = 1;
> 
> We never reach here with PROXY_EXEC. Instead we bail out in the caller
> try_to_block_task() ...
> 
> >  	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
> >  		__block_task(rq, p);
> >  }
> > @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> >  		atomic_dec(&task_rq(p)->nr_iowait);
> >  	}
> >  
> > +	p->is_blocked = 0;
> >  	activate_task(rq, p, en_flags);
> >  	wakeup_preempt(rq, p, wake_flags);
> >  
> > @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
> >  		struct task_struct *prev_donor = rq->donor;
> >  
> >  		rq_set_donor(rq, next);
> > -		if (unlikely(next->blocked_on)) {
> > +		if (unlikely(next->is_blocked && next->blocked_on)) {
> 
> ... so ->is_blocked here is always false for proxy tasks retained on
> the runqueue.
> 
> I was trying something like below but I'm somewhere missing a
> clear_task_blocked_on() for PROXY_WAKING before going back into
> mutex_lock_common():
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ec3b6d7d718b..6ea74aecc5fbd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -586,6 +586,7 @@ struct sched_entity {
>  	unsigned char			sched_delayed;
>  	unsigned char			rel_deadline;
>  	unsigned char			custom_slice;
> +	unsigned char			sched_proxy;
>  	/* hole */

I feel like this is so tied to the blocked_on value, I suspect it makes
the most sense to have this flag be the low bit of that pointer?
Sort of a blocked_on latch, to signal it's really in effect?

Plus it gets cleared automatically on set and clear, so it looks a
little cleaner.

> @@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  	 * blocked on a mutex, and we want to keep it on the runqueue
>  	 * to be selectable for proxy-execution.
>  	 */
> -	if (!should_block)
> +	if (!should_block) {
> +		guard(raw_spinlock)(&p->blocked_lock);
> +		/* Stable against race */
> +		if (task_is_blocked(p))
> +			WRITE_ONCE(p->se.sched_proxy, 1);
>  		return false;
> +	}

So if we double check and find the task isn't blocked anymore, we
probably shouldn't return early here, no?

Let me take a stab at the bit flag approach and see how it goes.

thanks
-john
* [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-27 18:38 [PATCH 0/2] Proxy Execution fixes for v7.1-rc John Stultz
  2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
@ 2026-04-27 18:38 ` John Stultz
  2026-04-28  8:16   ` K Prateek Nayak
  1 sibling, 1 reply; 12+ messages in thread
From: John Stultz @ 2026-04-27 18:38 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
	Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Vineeth came up with a test driver that could trip up workqueue
stalls. After fixing one issue this test found, Vineeth reported
the test was still failing.

Greatly simplified, a task that tries to take a mutex already
owned by another task that is sleeping can hit an edge case in
mutex_lock_common().

If the task fails to get the lock and calls into schedule(), but
gets a spurious wakeup, it will find that it is the first waiter,
and go into the mutex_optimistic_spin() logic. Though before
calling mutex_optimistic_spin(), we clear the task's blocked_on
state, since mutex_optimistic_spin() may call schedule() if
need_resched() is set.

After mutex_optimistic_spin() fails, we set blocked_on again,
restart the main mutex loop, try to take the lock and call into
schedule_preempt_disabled().

From there, with proxy-execution, we'll see the task is
blocked_on, follow the chain, see the owner is sleeping and
dequeue the waiting task from the runqueue.

This all sounds fine and reasonable. But what I had missed is
that in mutex_optimistic_spin(), not only do we call schedule(),
but we set TASK_RUNNING right before doing so.

This is ok for that invocation of schedule(). But when we come
back we re-set the blocked_on we had just cleared, but we do not
re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.

This means we have a task that is blocked_on & TASK_RUNNING,
so when the proxy execution code dequeues the task, we are
in trouble since future wakeups will be shortcut by the
ttwu_state_match() check.

Thus, to avoid this, after mutex_optimistic_spin(), set the task
state back when we set blocked_on.

Many many thanks again to Vineeth for his very useful testing
driver that uncovered this long hidden bug, that I hadn't
tripped in all my testing! Very impressed with the problems he's
uncovered!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Vineeth Pillai <vineethrp@google.com>
Cc: Sonam Sanju <sonam.sanju@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kunwu Chan <kunwu.chan@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/locking/mutex.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 09534628dc01a..a93d4c6bee1a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 		raw_spin_lock(&current->blocked_lock);
 		__set_task_blocked_on(current, lock);
+		set_current_state(state);
 
 		if (opt_acquired)
 			break;
-- 
2.54.0.545.g6539524ca2-goog
* Re: [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
@ 2026-04-28  8:16   ` K Prateek Nayak
  2026-04-28 19:50     ` John Stultz
  0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2026-04-28  8:16 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hello John,

On 4/28/2026 12:08 AM, John Stultz wrote:
> Vineeth came up with a test driver that could trip up
> workqueue stalls. After fixing one issue this test found,
> Vineeth reported the test was still failing.
> 
> Greatly simplified, a task that tries to take a mutex already
> owned by another task that is sleeping can hit an edge case in
> mutex_lock_common().
> 
> If the task fails to get the lock and calls into schedule(), but
> gets a spurious wakeup, it will find that it is the first waiter,
> and go into the mutex_optimistic_spin() logic. Though before
> calling mutex_optimistic_spin(), we clear the task's blocked_on
> state, since mutex_optimistic_spin() may call schedule() if
> need_resched() is set.
> 
> After mutex_optimistic_spin() fails, we set blocked_on again,
> restart the main mutex loop, try to take the lock and call into
> schedule_preempt_disabled().
> 
> From there, with proxy-execution, we'll see the task is
> blocked_on, follow the chain, see the owner is sleeping and
> dequeue the waiting task from the runqueue.
> 
> This all sounds fine and reasonable. But what I had missed is
> that in mutex_optimistic_spin(), not only do we call schedule(),
> but we set TASK_RUNNING right before doing so.
> 
> This is ok for that invocation of schedule(). But when we come
> back we re-set the blocked_on we had just cleared, but we do not
> re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.
> 
> This means we have a task that is blocked_on & TASK_RUNNING,
> so when the proxy execution code dequeues the task, we are
> in trouble since future wakeups will be shortcut by the
> ttwu_state_match() check.
> 
> Thus, to avoid this, after mutex_optimistic_spin(), set the task
> state back when we set blocked_on.
> 
> Many many thanks again to Vineeth for his very useful testing
> driver that uncovered this long hidden bug, that I hadn't
> tripped in all my testing! Very impressed with the problems he's
> uncovered!
> 
> Reported-by: Vineeth Pillai <vineethrp@google.com>
> Tested-by: Vineeth Pillai <vineethrp@google.com>
> Signed-off-by: John Stultz <jstultz@google.com>

I think this too requires a:

Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")

With that, feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
> Cc: Vineeth Pillai <vineethrp@google.com>
> Cc: Sonam Sanju <sonam.sanju@intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Kunwu Chan <kunwu.chan@linux.dev>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>  kernel/locking/mutex.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 09534628dc01a..a93d4c6bee1a3 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  		raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  		raw_spin_lock(&current->blocked_lock);
>  		__set_task_blocked_on(current, lock);
> +		set_current_state(state);
> 
>  		if (opt_acquired)
>  			break;

nit.

As a micro-optimization, you can probably move the
__set_task_blocked_on() and set_current_state() after this break.

-- 
Thanks and Regards,
Prateek
* Re: [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
  2026-04-28  8:16   ` K Prateek Nayak
@ 2026-04-28 19:50     ` John Stultz
  0 siblings, 0 replies; 12+ messages in thread
From: John Stultz @ 2026-04-28 19:50 UTC (permalink / raw)
To: K Prateek Nayak
Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

On Tue, Apr 28, 2026 at 1:16 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 4/28/2026 12:08 AM, John Stultz wrote:
> I think this too requires a:
>
> Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")
>
> With that, feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks so much for looking this over!

> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 09534628dc01a..a93d4c6bee1a3 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >  		raw_spin_lock_irqsave(&lock->wait_lock, flags);
> >  		raw_spin_lock(&current->blocked_lock);
> >  		__set_task_blocked_on(current, lock);
> > +		set_current_state(state);
> >
> >  		if (opt_acquired)
> >  			break;
>
> nit.
>
> As a micro-optimization, you can probably move the
> __set_task_blocked_on() and set_current_state() after this break.

So my main reason for setting it before the break is that it is
symmetric with the clearing that happens just outside the loop. Feels
a little cleaner to me for it to all match.

thanks
-john
end of thread, other threads: [~2026-04-29  2:27 UTC | newest]

Thread overview: 12+ messages (links below jump to the message on this page):
2026-04-27 18:38 [PATCH 0/2] Proxy Execution fixes for v7.1-rc John Stultz
2026-04-27 18:38 ` [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed John Stultz
2026-04-28  8:06   ` K Prateek Nayak
2026-04-28  9:43     ` Peter Zijlstra
2026-04-28 11:18       ` Peter Zijlstra
2026-04-28 13:15         ` K Prateek Nayak
2026-04-28 14:12           ` K Prateek Nayak
2026-04-28 16:50           ` Peter Zijlstra
2026-04-29  2:27           ` John Stultz
2026-04-27 18:38 ` [PATCH 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING John Stultz
2026-04-28  8:16   ` K Prateek Nayak
2026-04-28 19:50     ` John Stultz