* [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
@ 2025-04-07 13:46 hupu
  2025-04-09  6:01 ` Madadi Vineeth Reddy
  2025-04-09 23:35 ` John Stultz
  0 siblings, 2 replies; 8+ messages in thread
From: hupu @ 2025-04-07 13:46 UTC (permalink / raw)
  To: jstultz, linux-kernel
  Cc: juri.lelli, peterz, vschneid, mingo, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, hupu, hupu

The (p->wake_cpu != cpu_of(rq)) check in proxy_needs_return() is unsafe
during the early wakeup phase. When called via the ttwu_runnable() path:

|-- try_to_wake_up
    |-- ttwu_runnable
        |-- proxy_needs_return    //we are here
    |-- select_task_rq
    |-- set_task_cpu              //set p->wake_cpu here
    |-- ttwu_queue

The p->wake_cpu at this point reflects the CPU where the donor last ran
before blocking, not the target migration CPU. During the blocking period:
1. CPU affinity may have been changed by other threads
2. Proxy migrations might have altered the effective wake_cpu
3. set_task_cpu() hasn't updated wake_cpu yet in this code path

This makes the wake_cpu vs current CPU comparison meaningless and potentially
dangerous. Rely on find_proxy_task()'s later migration logic to handle CPU
placement based on up-to-date affinity and scheduler state.

Signed-off-by: hupu <hupu.gm@gmail.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c4ef4c71cfd..ca4ca739eb85 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4047,7 +4047,7 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
 
 	raw_spin_lock(&p->blocked_lock);
 	if (__get_task_blocked_on(p) && p->blocked_on_state & BO_NEEDS_RETURN) {
-		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
+		if (!task_current(rq, p)) {
 			if (task_current_donor(rq, p)) {
 				put_prev_task(rq, p);
 				rq_set_donor(rq, rq->idle);
-- 
2.47.0


* Re: [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-04-07 13:46 [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return hupu
@ 2025-04-09  6:01 ` Madadi Vineeth Reddy
  2025-04-09  6:56   ` hupu
  2025-04-09 23:35 ` John Stultz
  1 sibling, 1 reply; 8+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-09  6:01 UTC (permalink / raw)
  To: hupu
  Cc: jstultz, linux-kernel, juri.lelli, peterz, vschneid, mingo,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	hupu, Madadi Vineeth Reddy

On 07/04/25 19:16, hupu wrote:
> The (p->wake_cpu != cpu_of(rq)) check in proxy_needs_return() is unsafe
> during early wakeup phase. When called via ttwu_runnable() path:
> 
> |-- try_to_wake_up
>     |-- ttwu_runnable
>         |-- proxy_needs_return    //we are here
>     |-- select_task_rq
>     |-- set_task_cpu              //set p->wake_cpu here
>     |-- ttwu_queue
> 
> The p->wake_cpu at this point reflects the CPU where donor last ran before
> blocking, not the target migration CPU. During blocking period:
> 1. CPU affinity may have been changed by other threads
> 2. Proxy migrations might have altered the effective wake_cpu
> 3. set_task_cpu() hasn't updated wake_cpu yet in this code path
> 
> This makes the wake_cpu vs current CPU comparison meaningless and potentially
> dangerous. Rely on find_proxy_task()'s later migration logic to handle CPU
> placement based on up-to-date affinity and scheduler state.
> 
> Signed-off-by: hupu <hupu.gm@gmail.com>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3c4ef4c71cfd..ca4ca739eb85 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4047,7 +4047,7 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
>  
>  	raw_spin_lock(&p->blocked_lock);
>  	if (__get_task_blocked_on(p) && p->blocked_on_state & BO_NEEDS_RETURN) {
> -		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
> +		if (!task_current(rq, p)) {
>  			if (task_current_donor(rq, p)) {
>  				put_prev_task(rq, p);
>  				rq_set_donor(rq, rq->idle);

Which tree is this change based on? I don't see `proxy_needs_return` in tip/sched/core.

Thanks,
Madadi Vineeth Reddy


* Re: [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-04-09  6:01 ` Madadi Vineeth Reddy
@ 2025-04-09  6:56   ` hupu
  0 siblings, 0 replies; 8+ messages in thread
From: hupu @ 2025-04-09  6:56 UTC (permalink / raw)
  To: 20250407134654.92841-1-hupu.gm
  Cc: jstultz, linux-kernel, juri.lelli, peterz, vschneid, mingo,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	hupu, Madadi Vineeth Reddy

Hi Madadi Vineeth Reddy,
Thank you for your prompt review, and my sincere apologies for not
providing the repository and branch details clearly in my initial
submission. This was an oversight on my part.

The patch is based on the following repository:
https://github.com/johnstultz-work/linux-dev.git

Specifically, the changes are present in these Proxy Execution
development branches:
a) proxy-exec-WIP
b) proxy-exec-v15-WIP

The proxy_needs_return function resides in these branches as part of
the ongoing Proxy Execution feature work.
Please let me know if you need additional information or further
clarifications. Thank you again for your time and guidance.

Best regards,
hupu


Madadi Vineeth Reddy <vineethr@linux.ibm.com> wrote on Wed, Apr 9, 2025 at 14:01:
>
> On 07/04/25 19:16, hupu wrote:
> > The (p->wake_cpu != cpu_of(rq)) check in proxy_needs_return() is unsafe
> > during early wakeup phase. When called via ttwu_runnable() path:
> >
> > |-- try_to_wake_up
> >     |-- ttwu_runnable
> >         |-- proxy_needs_return    //we are here
> >     |-- select_task_rq
> >     |-- set_task_cpu              //set p->wake_cpu here
> >     |-- ttwu_queue
> >
> > The p->wake_cpu at this point reflects the CPU where donor last ran before
> > blocking, not the target migration CPU. During blocking period:
> > 1. CPU affinity may have been changed by other threads
> > 2. Proxy migrations might have altered the effective wake_cpu
> > 3. set_task_cpu() hasn't updated wake_cpu yet in this code path
> >
> > This makes the wake_cpu vs current CPU comparison meaningless and potentially
> > dangerous. Rely on find_proxy_task()'s later migration logic to handle CPU
> > placement based on up-to-date affinity and scheduler state.
> >
> > Signed-off-by: hupu <hupu.gm@gmail.com>
> > ---
> >  kernel/sched/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3c4ef4c71cfd..ca4ca739eb85 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4047,7 +4047,7 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> >
> >       raw_spin_lock(&p->blocked_lock);
> >       if (__get_task_blocked_on(p) && p->blocked_on_state & BO_NEEDS_RETURN) {
> > -             if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
> > +             if (!task_current(rq, p)) {
> >                       if (task_current_donor(rq, p)) {
> >                               put_prev_task(rq, p);
> >                               rq_set_donor(rq, rq->idle);
>
> Which tree is this change based on? I don't see `proxy_needs_return` in tip/sched/core.
>
> Thanks,
> Madadi Vineeth Reddy
>

* Re: [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-04-07 13:46 [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return hupu
  2025-04-09  6:01 ` Madadi Vineeth Reddy
@ 2025-04-09 23:35 ` John Stultz
  2025-04-10  7:14   ` hupu
  1 sibling, 1 reply; 8+ messages in thread
From: John Stultz @ 2025-04-09 23:35 UTC (permalink / raw)
  To: hupu
  Cc: linux-kernel, juri.lelli, peterz, vschneid, mingo,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	hupu

Hey hupu!
  Thanks so much for taking the time to review the full proxy patch
series and to submit this change!

On Mon, Apr 7, 2025 at 6:47 AM hupu <hupu.gm@gmail.com> wrote:
>
> The (p->wake_cpu != cpu_of(rq)) check in proxy_needs_return() is unsafe
> during early wakeup phase. When called via ttwu_runnable() path:
>
> |-- try_to_wake_up
>     |-- ttwu_runnable
>         |-- proxy_needs_return    //we are here
>     |-- select_task_rq
>     |-- set_task_cpu              //set p->wake_cpu here
>     |-- ttwu_queue
>
> The p->wake_cpu at this point reflects the CPU where donor last ran before
> blocking, not the target migration CPU. During blocking period:
> 1. CPU affinity may have been changed by other threads
> 2. Proxy migrations might have altered the effective wake_cpu
> 3. set_task_cpu() hasn't updated wake_cpu yet in this code path

A few nits here:
1) We do reassign wake_cpu in __set_cpus_allowed_ptr_locked() if necessary.
2) Proxy migrations use proxy_set_task_cpu(), which preserves the wake_cpu.

And I'm not sure I quite understand the concern of #3 above. Could you
clarify further?
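
For point 2, the rough shape of that helper (a simplified sketch, not
the exact code in the proxy-exec branch) is to save and restore
wake_cpu around the cpu move:

  static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
  {
          /* remember where the task should eventually return to */
          unsigned int wake_cpu = p->wake_cpu;

          __set_task_cpu(p, cpu);
          /* __set_task_cpu() points wake_cpu at the new cpu, so put
           * the remembered return cpu back */
          p->wake_cpu = wake_cpu;
  }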

> This makes the wake_cpu vs current CPU comparison meaningless and potentially
> dangerous. Rely on find_proxy_task()'s later migration logic to handle CPU
> placement based on up-to-date affinity and scheduler state.

The task_cpu serialization rules are a bit more subtle than I'd like,
so I'll grant there likely could be bugs lurking there, but I'm not
totally sure I understand the case you're thinking of here.

As you noted, in proxy_needs_return() we use the wake_cpu to tell us
if the task is still on the rq where it was blocked, with the hope
that we can short-cut the deactivate&wakeup logic and instead just mark
it BO_RUNNABLE and return from ttwu.

I'm not sure how the "find_proxy_task()'s later migration logic" gets
involved here. Instead, it seems your change would just force it so we
always do the deactivate&wakeup (still using the task->wake_cpu as the
guide for select_task_rq()).

This would be a more conservative approach, since select_task_rq would
handle finding an appropriate rq if wake_cpu was not correct, so I
don't disagree with your suggestion. In fact, in getting the v16
series ready, I've reworked find_proxy_task()'s "need_return:"
logic to do the deactivate/wakeup step, because otherwise I was hitting
cases where we would try to do the manual return-migration off to an
offline cpu during cpu-hotplug testing. So a similar approach might
be helpful here to avoid wakeups on offlined cpus (though I need to
look a bit more at the cpu hotplug code to understand how we'd hit the
task->on_rq case and have task_cpu() return an offlined rq).

But I'm a little hesitant to drop the optimization, so if you have a
more specific case that would be problematic, I'd appreciate you
sharing.

thanks
-john

* Re: [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-04-09 23:35 ` John Stultz
@ 2025-04-10  7:14   ` hupu
  2025-05-19  5:13     ` hupu
  0 siblings, 1 reply; 8+ messages in thread
From: hupu @ 2025-04-10  7:14 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, juri.lelli, peterz, vschneid, mingo,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	hupu

Hi John,
Thank you for your feedback, I appreciate you taking the time to
review my proposal.

On Thu, Apr 10, 2025 at 7:35 AM John Stultz <jstultz@google.com> wrote:
>
> Hey hupu!
>   Thanks so much for taking the time to review the full proxy patch
> series and to submit this change!
>
> On Mon, Apr 7, 2025 at 6:47 AM hupu <hupu.gm@gmail.com> wrote:
> >
> > The (p->wake_cpu != cpu_of(rq)) check in proxy_needs_return() is unsafe
> > during early wakeup phase. When called via ttwu_runnable() path:
> >
> > |-- try_to_wake_up
> >     |-- ttwu_runnable
> >         |-- proxy_needs_return    //we are here
> >     |-- select_task_rq
> >     |-- set_task_cpu              //set p->wake_cpu here
> >     |-- ttwu_queue
> >
> > The p->wake_cpu at this point reflects the CPU where donor last ran before
> > blocking, not the target migration CPU. During blocking period:
> > 1. CPU affinity may have been changed by other threads
> > 2. Proxy migrations might have altered the effective wake_cpu
> > 3. set_task_cpu() hasn't updated wake_cpu yet in this code path
>
> A few nits here:
> 1) We do reassign wake_cpu __set_cpus_allowed_ptr_locked() if necessary.
> 2) Proxy migrations use proxy_set_task_cpu() which preserve the wake_cpu
>
> And I'm not sure I quite understand the concern of #3 above. Could you
> clarify further?

Thank you for pointing this out! You are correct that
`__set_cpus_allowed_ptr_locked()` does update `wake_cpu`, which makes
the `p->wake_cpu` check in `proxy_needs_return()` valid. This avoids
placing the donor task on an incorrect CPU, and I sincerely regret my
oversight in failing to account for this behavior in my initial
submission.

However, I still believe this logic could lead to potential issues,
which I will explain below.

In `__set_cpus_allowed_ptr_locked()`, the `wake_cpu` is updated by
using `cpumask_any_and_distribute()` to select the first online CPU
from `ctx->new_mask`. I believe this selection criterion is overly
simplistic, as CPU selection for tasks is inherently complex and
varies depending on the scheduling class of the task. For instance,
CFS tasks require considerations for load balancing across
sched_domains, while RT tasks aim to avoid placing multiple RT threads
on the same CPU to minimize contention. Therefore, I believe it is
insufficient to simply select a CPU from `ctx->new_mask` and assign it
to `p->wake_cpu`. Instead, this task is better handled by delegating
it to `select_task_rq()`.
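
To make the concern concrete, the reassignment I am referring to looks
roughly like this (an approximation for illustration, not the exact
code in the branch):

  /*
   * If the remembered wake_cpu is no longer allowed, re-point it at
   * some CPU picked from the new mask; no scheduling-class-aware
   * placement policy is involved at this point.
   */
  if (!cpumask_test_cpu(p->wake_cpu, ctx->new_mask))
          p->wake_cpu = cpumask_any_and_distribute(ctx->new_mask,
                                                    cpu_online_mask);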

Additionally, on derivative platforms like Android, performance issues
may arise. Vendors often hook into `select_task_rq()` to customize CPU
selection for critical tasks, such as foreground or high-priority
threads. For instance, on ARM big.LITTLE architectures, vendors
prioritize placing high-priority tasks on powerful CPUs to reduce
latency. If we assign `wake_cpu` to the first online CPU in `new_mask`
(often a less powerful CPU due to smaller ID values), donor tasks may
wake up on CPUs that cannot meet performance or load-balancing needs.
Worse, vendors cannot intervene in this logic via their hooks in
`select_task_rq()`.

>
> As you noted, in proxy_needs_return() we use the wake_cpu to tell us
> if the task is still on the rq where it was blocked,  with the hope
> that we can short-cut the deactiave&wakeup logic and instead just mark
> it BO_RUNNABLE and return from ttwu.

I understand and agree that using `wake_cpu` can effectively shortcut
the deactivate and wakeup logic. However, my concern is that it may
not always meet the donor task's performance requirements and may
disrupt load balancing across CPUs.

To address these concerns, I suggest that `proxy_needs_return()`
should always return false for donor tasks unless the task is already
running on a CPU. This would ensure the `set_task_cpu()` logic in
`try_to_wake_up()` is triggered, allowing `select_task_rq()` to make a
more informed CPU selection, and it would let vendors on derivative
platforms retain the ability to customize CPU selection for critical
tasks via their hooks in `select_task_rq()`.

These are my considerations for the current patch proposal. I am
uncertain if this design is reasonable and would appreciate the
opportunity to discuss it further with you.

Best regards,
hupu

* Re: [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-04-10  7:14   ` hupu
@ 2025-05-19  5:13     ` hupu
  2025-05-25  7:47       ` [RESEND RFC " hupu
  2025-06-05 14:54       ` [RFC " hupu
  0 siblings, 2 replies; 8+ messages in thread
From: hupu @ 2025-05-19  5:13 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, juri.lelli, peterz, vschneid, mingo,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	hupu

Hi John,
I’d like to revisit the discussion about this patch.

As mentioned in the previous email, while using wake_cpu directly can
effectively shortcut the deactivate & wakeup logic, it may introduce
performance issues, especially on big.LITTLE architectures. Let me
illustrate this with a common scenario on Android: foreground critical
threads often need to wait for background threads to release a mutex
lock. Under the "Proxy Execution" mechanism, the foreground thread is
migrated to the CPU where the background thread is running. However,
background threads are typically bound to weaker CPUs due to being
part of the background group. If proxy_needs_return() allows the
foreground thread to be placed on wake_cpu, it may result in the
foreground thread being migrated to a little core, causing performance
bottlenecks.

Therefore, I suggest that proxy_needs_return() should always return
false for donor tasks unless the task is already running on a CPU.
This ensures that donor tasks trigger a CPU re-selection process,
which is consistent with the behavior prior to the introduction of
"Proxy Execution" and should not introduce additional overhead.
Furthermore, on Android platforms, this behavior allows vendors to
leverage the hook in select_task_rq() to fine-tune CPU selection logic
for critical threads, enabling better optimization for specific
scenarios.

Additionally, this patch has been successfully validated on an ARM64
platform emulated via QEMU. It has been running for several days
without issues, demonstrating its stability and reliability.

Looking forward to your response and further discussion!
hupu

* Re: [RESEND RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-05-19  5:13     ` hupu
@ 2025-05-25  7:47       ` hupu
  2025-06-05 14:54       ` [RFC " hupu
  1 sibling, 0 replies; 8+ messages in thread
From: hupu @ 2025-05-25  7:47 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Juri Lelli, Peter Zijlstra, Valentin Schneider,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, hupu

Hi John,
It seems this email might have been overlooked. I am resending it to
bring it back to your attention and continue the discussion.

Looking forward to your response!
hupu

* Re: [RFC 1/1] sched: Remove unreliable wake_cpu check in proxy_needs_return
  2025-05-19  5:13     ` hupu
  2025-05-25  7:47       ` [RESEND RFC " hupu
@ 2025-06-05 14:54       ` hupu
  1 sibling, 0 replies; 8+ messages in thread
From: hupu @ 2025-06-05 14:54 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, juri.lelli, peterz, vschneid, mingo,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	hupu

Resending to bring it to your attention.

Thanks.
hupu

On Mon, May 19, 2025 at 1:13 PM hupu <hupu.gm@gmail.com> wrote:
>
> Hi John,
> I’d like to revisit the discussion about this patch.
>
> As mentioned in the previous email, while using wake_cpu directly can
> effectively shortcut the deactivate & wakeup logic, it may introduce
> performance issues, especially on big.LITTLE architectures. Let me
> illustrate this with a common scenario on Android: foreground critical
> threads often need to wait for background threads to release a mutex
> lock. Under the "Proxy Execution" mechanism, the foreground thread is
> migrated to the CPU where the background thread is running. However,
> background threads are typically bound to weaker CPUs due to being
> part of the background group. If proxy_needs_return allows the
> foreground thread to be placed on wake_cpu, it may result in the
> foreground thread being migrated to a little core, causing performance
> bottlenecks.
>
> Therefore, I suggest that proxy_needs_return() should always return
> false for donor tasks unless the task is already running on a CPU.
> This ensures that donor tasks trigger a CPU re-selection process,
> which is consistent with the behavior prior to the introduction of
> "Proxy Execution" and should not introduce additional overhead.
> Furthermore, on Android platforms, this behavior allows vendors to
> leverage the hook in select_task_rq() to fine-tune CPU selection logic
> for critical threads, enabling better optimization for specific
> scenarios.
>
> Additionally, this patch has been successfully validated on an ARM64
> platform emulated via QEMU. It has been running for several days
> without issues, demonstrating its stability and reliability.
>
> Looking forward to your response and further discussion!
> hupu
