public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] sched: Fix psi_dequeue for Proxy Execution
@ 2025-11-21 19:01 John Stultz
  2025-11-28  9:35 ` Haiyue Wang
  0 siblings, 1 reply; 3+ messages in thread
From: John Stultz @ 2025-11-21 19:01 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Johannes Weiner, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Suren Baghdasaryan, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Currently, if the sleep flag is set, psi_dequeue() doesn't
change any of the psi_flags.

This is because psi_switch_task() will clear TSK_ONCPU as well
as other potential flags (TSK_RUNNING), and the assumption is
that a voluntary sleep always consists of a task being dequeued
followed shortly there after with a psi_sched_switch() call.

Proxy Execution changes this expectation, as mutex-blocked tasks
that would normally sleep stay on the runqueue. But in the case
where the mutex-owning task goes to sleep, or the owner is on a
remote cpu, we will then deactivate the blocked task shortly
after.

In that situation, the mutex-blocked task will have had its
TSK_ONCPU cleared when it was switched off the cpu, but it will
stay TSK_RUNNING. Then if we later dequeue it (as currently done
if we hit a case find_proxy_task() can't yet handle, such as the
case of the owner being on another rq or a sleeping owner)
psi_dequeue() won't change any state (leaving it TSK_RUNNING),
as it incorrectly expects a psi_task_switch() call to
immediately follow.

Later on when the task get woken/re-enqueued, and psi_flags are
set for TSK_RUNNING, we hit an error as the task is already
TSK_RUNNING:
  psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4

To resolve this, extend the logic in psi_dequeue() so that
if the sleep flag is set, we also check if psi_flags have
TSK_ONCPU set (meaning the psi_task_switch is imminent) before
we do the shortcut return.

If TSK_ONCPU is not set, that means we've already switched away,
and this psi_dequeue call needs to clear the flags.

Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
proxy-v13:
* Reworked for collision
proxy-v15:
* Fixed commit message typo noticed by Todd Kjos
v1 (separate from proxy series):
* Reworded commit message in response to K Prateek pointing
  out this issue can affect us earlier in the full proxy
  series then I had anticipated.
v2 (separate from proxy series):
* Minor tweak to comment suggested by Johannes

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/stats.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 26f3fd4d34cea..73bd6bca4d310 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -180,8 +180,13 @@ static inline void psi_dequeue(struct task_struct *p, int flags)
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
+	 *
+	 * In the SCHED_PROXY_EXECUTION case we may do sleeping
+	 * dequeues that are not followed by a task switch, so check
+	 * TSK_ONCPU is set to ensure the task switch is imminent.
+	 * Otherwise clear the flags as usual.
 	 */
-	if (flags & DEQUEUE_SLEEP)
+	if ((flags & DEQUEUE_SLEEP) && (p->psi_flags & TSK_ONCPU))
 		return;
 
 	/*
-- 
2.52.0.rc2.455.g230fcf2819-goog


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] sched: Fix psi_dequeue for Proxy Execution
  2025-11-21 19:01 [PATCH v2] sched: Fix psi_dequeue for Proxy Execution John Stultz
@ 2025-11-28  9:35 ` Haiyue Wang
  2025-12-05  1:17   ` John Stultz
  0 siblings, 1 reply; 3+ messages in thread
From: Haiyue Wang @ 2025-11-28  9:35 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: K Prateek Nayak, Johannes Weiner, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Suren Baghdasaryan,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team


On 11/22/2025 3:01 AM, John Stultz wrote:
> Currently, if the sleep flag is set, psi_dequeue() doesn't
> change any of the psi_flags.
> 
> This is because psi_switch_task() will clear TSK_ONCPU as well

psi_task_switch()

> as other potential flags (TSK_RUNNING), and the assumption is
> that a voluntary sleep always consists of a task being dequeued
> followed shortly there after with a psi_sched_switch() call.
> 
> Proxy Execution changes this expectation, as mutex-blocked tasks
> that would normally sleep stay on the runqueue. But in the case
> where the mutex-owning task goes to sleep, or the owner is on a
> remote cpu, we will then deactivate the blocked task shortly
> after.
> 
> In that situation, the mutex-blocked task will have had its
> TSK_ONCPU cleared when it was switched off the cpu, but it will
> stay TSK_RUNNING. Then if we later dequeue it (as currently done
> if we hit a case find_proxy_task() can't yet handle, such as the
> case of the owner being on another rq or a sleeping owner)
> psi_dequeue() won't change any state (leaving it TSK_RUNNING),
> as it incorrectly expects a psi_task_switch() call to
> immediately follow.
> 
> Later on when the task get woken/re-enqueued, and psi_flags are
> set for TSK_RUNNING, we hit an error as the task is already
> TSK_RUNNING:
>    psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4
> 
> To resolve this, extend the logic in psi_dequeue() so that
> if the sleep flag is set, we also check if psi_flags have
> TSK_ONCPU set (meaning the psi_task_switch is imminent) before
> we do the shortcut return.
> 
> If TSK_ONCPU is not set, that means we've already switched away,
> and this psi_dequeue call needs to clear the flags.
> 
> Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/
> Signed-off-by: John Stultz <jstultz@google.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> proxy-v13:
> * Reworked for collision
> proxy-v15:
> * Fixed commit message typo noticed by Todd Kjos
> v1 (separate from proxy series):
> * Reworded commit message in response to K Prateek pointing
>    out this issue can affect us earlier in the full proxy
>    series then I had anticipated.
> v2 (separate from proxy series):
> * Minor tweak to comment suggested by Johannes
> 
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>   kernel/sched/stats.h | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 

Tested-by: Haiyue Wang <haiyuewa@163.com>

>   
>   	/*


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] sched: Fix psi_dequeue for Proxy Execution
  2025-11-28  9:35 ` Haiyue Wang
@ 2025-12-05  1:17   ` John Stultz
  0 siblings, 0 replies; 3+ messages in thread
From: John Stultz @ 2025-12-05  1:17 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: LKML, K Prateek Nayak, Johannes Weiner, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Suren Baghdasaryan, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Nov 28, 2025 at 1:35 AM Haiyue Wang <haiyuewa@163.com> wrote:
>
>
> On 11/22/2025 3:01 AM, John Stultz wrote:
> > Currently, if the sleep flag is set, psi_dequeue() doesn't
> > change any of the psi_flags.
> >
> > This is because psi_switch_task() will clear TSK_ONCPU as well
>
> psi_task_switch()
>
>
> Tested-by: Haiyue Wang <haiyuewa@163.com>

Thanks for the testing and the sharp eye on that detail!

I'll fix up and resend shortly here.

thanks
-john

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-12-05  1:17 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-21 19:01 [PATCH v2] sched: Fix psi_dequeue for Proxy Execution John Stultz
2025-11-28  9:35 ` Haiyue Wang
2025-12-05  1:17   ` John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox