public inbox for linux-kernel@vger.kernel.org
* Very high scheduling delay with plenty of idle CPUs
@ 2024-11-08  7:28 Saravana Kannan
  2024-11-08  8:31 ` Peter Zijlstra
  2024-11-08  9:02 ` Vincent Guittot
  0 siblings, 2 replies; 22+ messages in thread
From: Saravana Kannan @ 2024-11-08  7:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, K Prateek Nayak, John Stultz,
	Vincent Palomares

Hi scheduler folks,

I'm running into some weird scheduling issues when testing non-sched
changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
this is an issue in earlier kernel versions or not.

The async suspend/resume code calls async_schedule_dev_nocall() to
queue up a bunch of work. These queued-up work items seem to run in
kworker threads.

However, there have been many times where I see scheduling latency
(runnable, but not running) of 4.5 ms or higher for a kworker thread
when there are plenty of idle CPUs.

Does async_schedule_dev_nocall() have some weird limitations on where
the queued work can run? I know it has some NUMA-related logic, but
the Pixel 6 doesn't have NUMA. This oddity unnecessarily increases
suspend/resume latency as it adds up across kworker threads. So, I'd
appreciate any insights on what might be happening.
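
For reference, the queueing pattern can be sketched standalone. This is
a simplification and an assumption about the internals: the real
async_schedule_dev_nocall() lives in kernel/async.c and queues the
callback onto an unbound workqueue (which is why the work shows up in
kworker threads), while the mock below just invokes the callback inline
so the sketch is self-contained.

```c
/* Standalone mock of the async suspend/resume queueing pattern.
 * Hypothetical simplification of kernel/async.c, for illustration. */
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long async_cookie_t;
typedef void (*async_func_t)(void *data, async_cookie_t cookie);

struct device {
	const char *name;
	int resumed;
};

/* Mock: upstream returns false when the work could not be queued
 * (the "nocall" variant does not fall back to a synchronous call).
 * Here we simply run the callback inline. */
static bool async_schedule_dev_nocall_mock(async_func_t func,
					   struct device *dev)
{
	static async_cookie_t next_cookie = 1;

	func(dev, next_cookie++);
	return true;
}

/* Per-device resume callback, as the PM core would register one. */
static void async_resume_mock(void *data, async_cookie_t cookie)
{
	struct device *dev = data;

	(void)cookie;
	dev->resumed = 1;
}
```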

If you know how to use perfetto (it's really pretty simple, all you
need to know is WASD and clicking), here's an example:
https://ui.perfetto.dev/#!/?s=e20045736e7dfa1e897db6489710061d2495be92

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-08  7:28 Very high scheduling delay with plenty of idle CPUs Saravana Kannan
@ 2024-11-08  8:31 ` Peter Zijlstra
  2024-11-10  5:49   ` Saravana Kannan
  2024-11-08  9:02 ` Vincent Guittot
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2024-11-08  8:31 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider,
	LKML, wuyun.abel, youssefesmat, Thomas Gleixner, efault,
	K Prateek Nayak, John Stultz, Vincent Palomares

On Thu, Nov 07, 2024 at 11:28:07PM -0800, Saravana Kannan wrote:
> Hi scheduler folks,
> 
> I'm running into some weird scheduling issues when testing non-sched
> changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> this is an issue in earlier kernel versions or not.
> 

It's a bit unfortunate you don't have a known good kernel there. Anyway,
one thing that recently came up is that DELAY_DEQUEUE can cause some
delays, specifically it can inhibit wakeup migration.

You can either test with that feature turned off, or apply something
like the following patch:

  https://lkml.kernel.org/r/20241106135346.GL24862@noisy.programming.kicks-ass.net
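
For reference, sched features can be flipped from userspace via
debugfs. A minimal sketch, assuming CONFIG_SCHED_DEBUG is enabled and
the features file sits at /sys/kernel/debug/sched/features (the path
may differ; older kernels exposed /sys/kernel/debug/sched_features):

```c
/* Helper for flipping a scheduler feature, e.g. "NO_DELAY_DEQUEUE"
 * to disable DELAY_DEQUEUE or "DELAY_DEQUEUE" to re-enable it.
 * The debugfs path below is an assumption about this setup. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SCHED_FEATURES "/sys/kernel/debug/sched/features"

/* Write the token to the features file; 0 on success, -1 on error.
 * Takes the path as a parameter so it can be pointed elsewhere. */
static int set_sched_feature(const char *path, const char *token)
{
	ssize_t len = (ssize_t)strlen(token);
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, token, len);
	close(fd);
	return ret == len ? 0 : -1;
}
```

Something like set_sched_feature(SCHED_FEATURES, "NO_DELAY_DEQUEUE"),
run as root, should be enough before re-capturing the trace.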


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-08  7:28 Very high scheduling delay with plenty of idle CPUs Saravana Kannan
  2024-11-08  8:31 ` Peter Zijlstra
@ 2024-11-08  9:02 ` Vincent Guittot
  2024-11-14  6:36   ` Saravana Kannan
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2024-11-08  9:02 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Ingo Molnar, Peter Zijlstra (Intel), Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider,
	LKML, wuyun.abel, youssefesmat, Thomas Gleixner, efault,
	K Prateek Nayak, John Stultz, Vincent Palomares

On Fri, 8 Nov 2024 at 08:28, Saravana Kannan <saravanak@google.com> wrote:
>
> Hi scheduler folks,
>
> I'm running into some weird scheduling issues when testing non-sched
> changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> this is an issue in earlier kernel versions or not.
>
> The async suspend/resume code calls async_schedule_dev_nocall() to
> queue up a bunch of work. These queued-up work items seem to run in
> kworker threads.
>
> However, there have been many times where I see scheduling latency
> (runnable, but not running) of 4.5 ms or higher for a kworker thread
> when there are plenty of idle CPUs.

You are using EAS, aren't you?
So the energy impact drives the CPU selection, not CPU idleness.

There is a proposal to change feec() (find_energy_efficient_cpu()) to
also take such cases into account, in addition to the energy impact:
https://lore.kernel.org/lkml/64ed0fb8-12ea-4452-9ec2-7ad127b65529@arm.com/T/

I still have to finalize v2.

>
> Does async_schedule_dev_nocall() have some weird limitations on where
> the queued work can run? I know it has some NUMA-related logic, but
> the Pixel 6 doesn't have NUMA. This oddity unnecessarily increases
> suspend/resume latency as it adds up across kworker threads. So, I'd
> appreciate any insights on what might be happening.
>
> If you know how to use perfetto (it's really pretty simple, all you
> need to know is WASD and clicking), here's an example:
> https://ui.perfetto.dev/#!/?s=e20045736e7dfa1e897db6489710061d2495be92
>
> Thanks,
> Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-08  8:31 ` Peter Zijlstra
@ 2024-11-10  5:49   ` Saravana Kannan
  2024-11-11  5:17     ` K Prateek Nayak
  0 siblings, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-10  5:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider,
	LKML, wuyun.abel, youssefesmat, Thomas Gleixner, efault,
	K Prateek Nayak, John Stultz, Vincent Palomares

On Fri, Nov 8, 2024 at 12:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 07, 2024 at 11:28:07PM -0800, Saravana Kannan wrote:
> > Hi scheduler folks,
> >
> > I'm running into some weird scheduling issues when testing non-sched
> > changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> > this is an issue in earlier kernel versions or not.
> >
>
> It's a bit unfortunate you don't have a known good kernel there. Anyway,
> one thing that recently came up is that DELAY_DEQUEUE can cause some
> delays, specifically it can inhibit wakeup migration.

I disabled DELAY_DEQUEUE and I'm still seeing preemptions or
scheduling latency (after wakeup) when there are plenty of CPUs even
within the same cluster/frequency domain.

Can we tell the scheduler to just spread out all the tasks during
suspend/resume? Doesn't make a lot of sense to try and save power
during a suspend/resume. It's almost always cheaper/better to do those
quickly.

-Saravana

>
> You can either test with that feature turned off, or apply something
> like the following patch:
>
>   https://lkml.kernel.org/r/20241106135346.GL24862@noisy.programming.kicks-ass.net


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-10  5:49   ` Saravana Kannan
@ 2024-11-11  5:17     ` K Prateek Nayak
  2024-11-11  6:15       ` Saravana Kannan
  0 siblings, 1 reply; 22+ messages in thread
From: K Prateek Nayak @ 2024-11-11  5:17 UTC (permalink / raw)
  To: Saravana Kannan, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider,
	LKML, wuyun.abel, youssefesmat, Thomas Gleixner, efault,
	John Stultz, Vincent Palomares, Tobias Huschle

(+ Tobias)

Hello Saravana,

On 11/10/2024 11:19 AM, Saravana Kannan wrote:
> On Fri, Nov 8, 2024 at 12:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Nov 07, 2024 at 11:28:07PM -0800, Saravana Kannan wrote:
>>> Hi scheduler folks,
>>>
>>> I'm running into some weird scheduling issues when testing non-sched
>>> changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
>>> this is an issue in earlier kernel versions or not.
>>>
>>
>> It's a bit unfortunate you don't have a known good kernel there. Anyway,
>> one thing that recently came up is that DELAY_DEQUEUE can cause some
>> delays, specifically it can inhibit wakeup migration.
> 
> I disabled DELAY_DEQUEUE and I'm still seeing preemptions or
> scheduling latency (after wakeup)

On the scheduling latency front, have you tried running with
RUN_TO_PARITY and/or PLACE_LAG disabled? If the tick granularity on your
system is less than the "base_slice_ns", disabling RUN_TO_PARITY can
help switch to a newly woken up task slightly faster. Disabling
PLACE_LAG makes sure the newly woken task is always eligible for
selection. However, both come with the added disadvantage of a sharp
increase in the number of involuntary context switches for some of the
scenarios we have tested. There is a separate thread from Cristian
making a case for toggling these features via sysfs and keeping them
disabled by default [0].

[0] https://lore.kernel.org/lkml/20241017052000.99200-1-cpru@amazon.com/

> when there are plenty of CPUs even
> within the same cluster/frequency domain.

I'm not aware of any recent EAS-specific changes that could have led to
larger scheduling latencies, but Tobias had reported a similar increase
in kworker scheduling latency, in a different context, when EEVDF was
first introduced [1]. I'm not sure if he is still observing the same
behavior on current upstream, but would it be possible to check whether
you see the large scheduling latency only starting with v6.6 (when
EEVDF was introduced) and not on v6.5 (which ran the older CFS logic)?
I'm also assuming the system / benchmark does change the default
scheduler-related debug tunables, some of which went away in v6.6.

[1] https://lore.kernel.org/lkml/c7b38bc27cc2c480f0c5383366416455@linux.ibm.com/

> 
> Can we tell the scheduler to just spread out all the tasks during
> suspend/resume? Doesn't make a lot of sense to try and save power
> during a suspend/resume. It's almost always cheaper/better to do those
> quickly.

That would increase the resume latency right since each runnable task
needs to go through a full idle CPU selection cycle? Isn't time a
consideration / concern in the resume path? Unless we go through the
slow path, it is very likely we'll end up making the same task
placement decisions again?

> 
> -Saravana
> 
>>
>> You can either test with that feature turned off, or apply something
>> like the following patch:
>>
>>    https://lkml.kernel.org/r/20241106135346.GL24862@noisy.programming.kicks-ass.net

-- 
Thanks and Regards,
Prateek


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11  5:17     ` K Prateek Nayak
@ 2024-11-11  6:15       ` Saravana Kannan
  2024-11-11  8:25         ` Christian Loehle
  2024-11-11 10:40         ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Saravana Kannan @ 2024-11-11  6:15 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Sun, Nov 10, 2024 at 9:17 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> (+ Tobias)
>
> Hello Saravana,
>
> On 11/10/2024 11:19 AM, Saravana Kannan wrote:
> > On Fri, Nov 8, 2024 at 12:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Thu, Nov 07, 2024 at 11:28:07PM -0800, Saravana Kannan wrote:
> >>> Hi scheduler folks,
> >>>
> >>> I'm running into some weird scheduling issues when testing non-sched
> >>> changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> >>> this is an issue in earlier kernel versions or not.
> >>>
> >>
> >> It's a bit unfortunate you don't have a known good kernel there. Anyway,
> >> one thing that recently came up is that DELAY_DEQUEUE can cause some
> >> delays, specifically it can inhibit wakeup migration.
> >
> > I disabled DELAY_DEQUEUE and I'm still seeing preemptions or
> > scheduling latency (after wakeup)
>
> On the scheduling latency front, have you tried running with
> RUN_TO_PARITY and/or PLACE_LAG disabled? If the tick granularity on your
> system is less than the "base_slice_ns", disabling RUN_TO_PARITY can
> help switch to a newly woken up task slightly faster. Disabling
> PLACE_LAG makes sure the newly woken task is always eligible for
> selection. However, both come with the added disadvantage of a sharp
> increase in the number of involuntary context switches for some of the
> scenarios we have tested.

Yeah, I don't think I can just change these because that'd have a much
wider impact on power and performance. I really need something
isolated to the suspend/resume scenario. Or just a generic bug fix
where the scheduler does better CPU selection for a thread. I'm saying
better because I'd think this would be better from a power perspective
too in the specific example I gave.

> There is a separate thread from Cristian
> making a case to toggle these features via sysfs and keep them disabled
> by default [0]
>
> [0] https://lore.kernel.org/lkml/20241017052000.99200-1-cpru@amazon.com/
>
> > when there are plenty of CPUs even
> > within the same cluster/frequency domain.
>
> I'm not aware of any recent EAS-specific changes that could have led to
> larger scheduling latencies, but Tobias had reported a similar increase
> in kworker scheduling latency, in a different context, when EEVDF was
> first introduced [1]. I'm not sure if he is still observing the same
> behavior on current upstream, but would it be possible to check whether
> you see the large scheduling latency only starting with v6.6 (when
> EEVDF was introduced) and not on v6.5 (which ran the older CFS logic)?
> I'm also assuming the system / benchmark does change the default
> scheduler-related debug tunables, some of which went away in v6.6.

Hmmm... I don't know if this is specific to EEVDF. But going back to
v6.5 has a lot of other hurdles that I don't want to get into.

>
> [1] https://lore.kernel.org/lkml/c7b38bc27cc2c480f0c5383366416455@linux.ibm.com/
>
> >
> > Can we tell the scheduler to just spread out all the tasks during
> > suspend/resume? Doesn't make a lot of sense to try and save power
> > during a suspend/resume. It's almost always cheaper/better to do those
> > quickly.
>
> That would increase the resume latency right since each runnable task
> needs to go through a full idle CPU selection cycle? Isn't time a
> consideration / concern in the resume path? Unless we go through the
> slow path, it is very likely we'll end up making the same task
> placement decisions again?

I actually quickly hacked up the cpu_overutilized() function to return
true during suspend/resume and the threads are nicely spread out and
running in parallel. That actually reduces the total of the
dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
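
A rough, hypothetical sketch of that hack (not the actual patch:
pm_transition_in_progress is an invented flag the PM core would set
around the dpm_suspend*()/dpm_resume*() phases, and the utilization
check is only a stand-in for the real cpu_overutilized() logic in
kernel/sched/fair.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented flag: the PM core would set this for the duration of the
 * dpm_suspend*()/dpm_resume*() phases. Hypothetical, for illustration. */
static bool pm_transition_in_progress;

/* Stand-ins for the scheduler's per-CPU utilization and capacity. */
static unsigned long cpu_util_stub(int cpu)     { (void)cpu; return 100; }
static unsigned long cpu_capacity_stub(int cpu) { (void)cpu; return 1024; }

/* The hack: report every CPU as overutilized while a PM transition is
 * in flight, so EAS bails out and tasks spread via the regular
 * idle-CPU selection path. */
static bool cpu_overutilized_sketch(int cpu)
{
	if (pm_transition_in_progress)
		return true;

	/* Roughly the upstream rule: util above ~80% of capacity. */
	return cpu_util_stub(cpu) * 100 > cpu_capacity_stub(cpu) * 80;
}
```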

Also, this whole email thread started because I'm optimizing the
suspend/resume code to reduce a lot of sleeps/wakeups and the number
of kworker threads. And with that + over utilization hack, resume time
has dropped to 60ms.

Peter,

Would you be open to the scheduler being aware of
dpm_suspend*()/dpm_resume*() phases and triggering the CPU
overutilized behavior during these phases? I know it's a very use case
specific behavior but how often do we NOT want to speed up
suspend/resume? We can make this a CONFIG or a kernel command line
option -- say, fast_suspend or something like that.

Thanks,
Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11  6:15       ` Saravana Kannan
@ 2024-11-11  8:25         ` Christian Loehle
  2024-11-11  9:02           ` Saravana Kannan
  2024-11-11 10:40         ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Loehle @ 2024-11-11  8:25 UTC (permalink / raw)
  To: Saravana Kannan, K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On 11/11/24 06:15, Saravana Kannan wrote:
[...]
>>> Can we tell the scheduler to just spread out all the tasks during
>>> suspend/resume? Doesn't make a lot of sense to try and save power
>>> during a suspend/resume. It's almost always cheaper/better to do those
>>> quickly.
>>
>> That would increase the resume latency right since each runnable task
>> needs to go through a full idle CPU selection cycle? Isn't time a
>> consideration / concern in the resume path? Unless we go through the
>> slow path, it is very likely we'll end up making the same task
>> placement decisions again?
> 
> I actually quickly hacked up the cpu_overutilized() function to return
> true during suspend/resume and the threads are nicely spread out and
> running in parallel. That actually reduces the total of the
> dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> 
> Also, this whole email thread started because I'm optimizing the
> suspend/resume code to reduce a lot of sleeps/wakeups and the number
> of kworker threads. And with that + over utilization hack, resume time
> has dropped to 60ms.
> 
> Peter,
> 
> Would you be open to the scheduler being aware of
> dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> overutilized behavior during these phases? I know it's a very use case
> specific behavior but how often do we NOT want to speed up
> suspend/resume? We can make this a CONFIG or a kernel command line
> option -- say, fast_suspend or something like that.
>

Just to confirm, you essentially want to disable EAS during
suspend/resume, or does sis (select_idle_sibling()) also not give you
an acceptable placement?

Regards,
Christian


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11  8:25         ` Christian Loehle
@ 2024-11-11  9:02           ` Saravana Kannan
  2024-11-11  9:08             ` Christian Loehle
  0 siblings, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-11  9:02 UTC (permalink / raw)
  To: Christian Loehle
  Cc: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Benjamin Segall, Mel Gorman, Valentin Schneider, LKML, wuyun.abel,
	youssefesmat, Thomas Gleixner, efault, John Stultz,
	Vincent Palomares, Tobias Huschle

On Mon, Nov 11, 2024 at 12:25 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 11/11/24 06:15, Saravana Kannan wrote:
> [...]
> >>> Can we tell the scheduler to just spread out all the tasks during
> >>> suspend/resume? Doesn't make a lot of sense to try and save power
> >>> during a suspend/resume. It's almost always cheaper/better to do those
> >>> quickly.
> >>
> >> That would increase the resume latency right since each runnable task
> >> needs to go through a full idle CPU selection cycle? Isn't time a
> >> consideration / concern in the resume path? Unless we go through the
> >> slow path, it is very likely we'll end up making the same task
> >> placement decisions again?
> >
> > I actually quickly hacked up the cpu_overutilized() function to return
> > true during suspend/resume and the threads are nicely spread out and
> > running in parallel. That actually reduces the total of the
> > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> >
> > Also, this whole email thread started because I'm optimizing the
> > suspend/resume code to reduce a lot of sleeps/wakeups and the number
> > of kworker threads. And with that + over utilization hack, resume time
> > has dropped to 60ms.
> >
> > Peter,
> >
> > Would you be open to the scheduler being aware of
> > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > overutilized behavior during these phases? I know it's a very use case
> > specific behavior but how often do we NOT want to speed up
> > suspend/resume? We can make this a CONFIG or a kernel command line
> > option -- say, fast_suspend or something like that.
> >
>
> Just to confirm, you essentially want to disable EAS during
> suspend/resume, or does sis (select_idle_sibling()) also not give you
> an acceptable placement?

If I effectively disable EAS during the dpm_resume/no_irq/early()
phases (the part of the resume where devices are resumed and can run
in parallel), that gives the best results. It shaves 15ms off.

More important than disabling EAS, I think the main need is to not
preempt a runnable thread or delay scheduling a runnable thread. But
yes, effectively, all CPUs end up getting used because there's enough
work to keep all the CPUs busy for 5ms. With the current behavior (is
it solely because of EAS?), some of the 5ms runs get stacked on one
CPU and it ends up taking 5ms longer. And this happens in multiple
phases and bumps it up by 15ms today. And this is all data averaged
over 100+ samples, so it's very clear-cut and not just noise.
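
The arithmetic here can be sanity-checked with a toy makespan model:
N independent 5ms items on M CPUs finish in about ceil(N/M) * 5ms when
spread, and every item that stacks on a busy CPU while another CPU sits
idle adds a full 5ms to that phase. A minimal sketch, with illustrative
(not measured) numbers:

```c
#include <assert.h>

#define ITEM_MS 5  /* each work item runs ~5ms, per the trace */

/* Makespan (ms) when nitems are placed greedily on the least-loaded
 * of ncpus; forcing a smaller ncpus models tasks stacking on one CPU. */
static int makespan_ms(int nitems, int ncpus)
{
	int load[64] = { 0 };  /* per-CPU busy time; assumes ncpus <= 64 */
	int worst = 0;

	for (int i = 0; i < nitems; i++) {
		int min = 0;

		for (int c = 1; c < ncpus; c++)
			if (load[c] < load[min])
				min = c;
		load[min] += ITEM_MS;
	}
	for (int c = 0; c < ncpus; c++)
		if (load[c] > worst)
			worst = load[c];
	return worst;
}
```

With eight items spread over the Pixel 6's eight CPUs, makespan_ms(8, 8)
is 5ms; losing one CPU to stacking (makespan_ms(8, 7)) doubles that
phase to 10ms, so one stacking event in each of three phases matches
the ~15ms delta above.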

-Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11  9:02           ` Saravana Kannan
@ 2024-11-11  9:08             ` Christian Loehle
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Loehle @ 2024-11-11  9:08 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Benjamin Segall, Mel Gorman, Valentin Schneider, LKML, wuyun.abel,
	youssefesmat, Thomas Gleixner, efault, John Stultz,
	Vincent Palomares, Tobias Huschle

On 11/11/24 09:02, Saravana Kannan wrote:
> On Mon, Nov 11, 2024 at 12:25 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 11/11/24 06:15, Saravana Kannan wrote:
>> [...]
>>>>> Can we tell the scheduler to just spread out all the tasks during
>>>>> suspend/resume? Doesn't make a lot of sense to try and save power
>>>>> during a suspend/resume. It's almost always cheaper/better to do those
>>>>> quickly.
>>>>
>>>> That would increase the resume latency right since each runnable task
>>>> needs to go through a full idle CPU selection cycle? Isn't time a
>>>> consideration / concern in the resume path? Unless we go through the
>>>> slow path, it is very likely we'll end up making the same task
>>>> placement decisions again?
>>>
>>> I actually quickly hacked up the cpu_overutilized() function to return
>>> true during suspend/resume and the threads are nicely spread out and
>>> running in parallel. That actually reduces the total of the
>>> dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
>>>
>>> Also, this whole email thread started because I'm optimizing the
>>> suspend/resume code to reduce a lot of sleeps/wakeups and the number
>>> of kworker threads. And with that + over utilization hack, resume time
>>> has dropped to 60ms.
>>>
>>> Peter,
>>>
>>> Would you be open to the scheduler being aware of
>>> dpm_suspend*()/dpm_resume*() phases and triggering the CPU
>>> overutilized behavior during these phases? I know it's a very use case
>>> specific behavior but how often do we NOT want to speed up
>>> suspend/resume? We can make this a CONFIG or a kernel command line
>>> option -- say, fast_suspend or something like that.
>>>
>>
>> Just to confirm, you essentially want to disable EAS during
>> suspend/resume, or does sis (select_idle_sibling()) also not give you
>> an acceptable placement?
> 
> If I effectively disable EAS during the dpm_resume/no_irq/early()
> phases (the part of the resume where devices are resumed and can run
> in parallel), that gives the best results. It shaves 15ms off.
> 
> More important than disabling EAS, I think the main need is to not
> preempt a runnable thread or delay scheduling a runnable thread. But
> yes, effectively, all CPUs end up getting used because there's enough
> work to keep all the CPUs busy for 5ms. With the current behavior (is
> it solely because of EAS?), some of the 5ms runs get stacked on one
> CPU and it ends up taking 5ms longer. And this happens in multiple
> phases and bumps it up by 15ms today. And this is all data averaged
> over 100+ samples, so it's very clear-cut and not just noise.

"Is it only EAS?"
I would hope so; EAS should be responsible for all placement in your
case.
Right, but potential latency costs are a side-effect of co-scheduling,
so I'm not sure I understand why you'd rather make EAS work for this
specific use-case instead of just disabling it for the phases where we
know it can't do the best job.
The entire post-EEVDF discussion is all about "some workloads like
preemption, others don't", but as long as we have plenty of idle
CPUs all that seems like unnecessary effort. Am I missing something?

Regards,
Christian


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11  6:15       ` Saravana Kannan
  2024-11-11  8:25         ` Christian Loehle
@ 2024-11-11 10:40         ` Peter Zijlstra
  2024-11-11 11:15           ` Vincent Guittot
  2024-11-11 18:23           ` Saravana Kannan
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2024-11-11 10:40 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: K Prateek Nayak, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:

> I actually quickly hacked up the cpu_overutilized() function to return
> true during suspend/resume and the threads are nicely spread out and
> running in parallel. That actually reduces the total of the
> dpm_resume*() phases from 90ms to 75ms on my Pixel 6.

Right, so that kills EAS and makes it fall through to the regular
select_idle_sibling() thing.

> Peter,
> 
> Would you be open to the scheduler being aware of
> dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> overutilized behavior during these phases? I know it's a very use case
> specific behavior but how often do we NOT want to speed up
> suspend/resume? We can make this a CONFIG or a kernel command line
> option -- say, fast_suspend or something like that.

Well, I don't mind if Vincent doesn't. It seems like a very
specific/targeted thing and should not affect much else, so it is a
relatively safe thing to do.

Perhaps a more direct hack in is_rd_overutilized() would be even less
invasive, changing cpu_overutilized() relies on that getting propagated
to rd->overutilized, might as well skip that step, no?
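
Sketched the same way, and equally hypothetical (the flag is invented,
and the struct is only a stand-in for the real root_domain in
kernel/sched/sched.h), the more direct variant would be:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented flag the PM core would set around the dpm_*() phases. */
static bool pm_transition_in_progress;

struct root_domain_stub {
	bool overutilized;  /* stand-in for rd->overutilized */
};

/* Short-circuit at the root-domain check itself, skipping the
 * per-CPU -> rd->overutilized propagation step entirely. */
static bool is_rd_overutilized_sketch(const struct root_domain_stub *rd)
{
	if (pm_transition_in_progress)
		return true;
	return rd->overutilized;
}
```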


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 10:40         ` Peter Zijlstra
@ 2024-11-11 11:15           ` Vincent Guittot
  2024-11-11 18:17             ` Saravana Kannan
  2024-11-11 18:23           ` Saravana Kannan
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2024-11-11 11:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Saravana Kannan, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, 11 Nov 2024 at 11:41, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
>
> > I actually quickly hacked up the cpu_overutilized() function to return
> > true during suspend/resume and the threads are nicely spread out and
> > running in parallel. That actually reduces the total of the
> > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
>
> Right, so that kills EAS and makes it fall through to the regular
> select_idle_sibling() thing.
>
> > Peter,
> >
> > Would you be open to the scheduler being aware of
> > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > overutilized behavior during these phases? I know it's a very use case
> > specific behavior but how often do we NOT want to speed up
> > suspend/resume? We can make this a CONFIG or a kernel command line
> > option -- say, fast_suspend or something like that.
>
> Well, I don't mind if Vincent doesn't. It seems like a very
> specific/targeted thing and should not affect much else, so it is a
> relatively safe thing to do.

I would like to understand why all the idle little CPUs are not used in
Saravana's example and why tasks are packed on the same CPU instead.

>
> Perhaps a more direct hack in is_rd_overutilized() would be even less
> invasive, changing cpu_overutilized() relies on that getting propagated
> to rd->overutilized, might as well skip that step, no?


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 11:15           ` Vincent Guittot
@ 2024-11-11 18:17             ` Saravana Kannan
  2024-11-11 19:00               ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-11 18:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, Nov 11, 2024 at 3:15 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 11 Nov 2024 at 11:41, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> >
> > > I actually quickly hacked up the cpu_overutilized() function to return
> > > true during suspend/resume and the threads are nicely spread out and
> > > running in parallel. That actually reduces the total of the
> > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> >
> > Right, so that kills EAS and makes it fall through to the regular
> > select_idle_sibling() thing.
> >
> > > Peter,
> > >
> > > Would you be open to the scheduler being aware of
> > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > overutilized behavior during these phases? I know it's a very use case
> > > specific behavior but how often do we NOT want to speed up
> > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > option -- say, fast_suspend or something like that.
> >
> > Well, I don't mind if Vincent doesn't. It seems like a very
> > specific/targeted thing and should not affect much else, so it is a
> > relatively safe thing to do.
>
> I would like to understand why all the idle little CPUs are not used in
> Saravana's example and why tasks are packed on the same CPU instead.

If you want to try this on your end and debug it further, it should be
pretty easy to reproduce on a Pixel 6 even without my suspend/resume
changes.

Just run this on the device to mark all devices as async
suspend/resume. This assumes you have CONFIG_PM_DEBUG enabled:

find /sys/devices/ -name async | while read -r filename; do
	echo enabled > "$filename"
done

Then look at the dpm_resume_noirq() phase. You should see some kworkers
that are runnable but not running for a while, even while a little CPU
is idle. It should happen within a few tries. You need to unplug the USB
cable to let the device suspend, and wait at least 10 seconds after the
screen goes off.

But even if you fix EAS to pick the little CPUs, I think we also want
to use the mid and big CPUs. That's not going to happen, right?

-Saravana

> >
> > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > invasive, changing cpu_overutilized() relies on that getting propagated
> > to rd->overutilized, might as well skip that step, no?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 10:40         ` Peter Zijlstra
  2024-11-11 11:15           ` Vincent Guittot
@ 2024-11-11 18:23           ` Saravana Kannan
  2024-11-11 19:01             ` Vincent Guittot
  1 sibling, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-11 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
>
> > I actually quickly hacked up the cpu_overutilized() function to return
> > true during suspend/resume and the threads are nicely spread out and
> > running in parallel. That actually reduces the total of the
> > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
>
> Right, so that kills EAS and makes it fall through to the regular
> select_idle_sibling() thing.
>
> > Peter,
> >
> > Would you be open to the scheduler being aware of
> > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > overutilized behavior during these phases? I know it's a very use case
> > specific behavior but how often do we NOT want to speed up
> > suspend/resume? We can make this a CONFIG or a kernel command line
> > option -- say, fast_suspend or something like that.
>
> Well, I don't mind if Vincent doesn't. It seems like a very
> specific/targeted thing and should not affect much else, so it is a
> relatively safe thing to do.
>
> Perhaps a more direct hack in is_rd_overutilized() would be even less
> invasive, changing cpu_overutilized() relies on that getting propagated
> to rd->overutilized, might as well skip that step, no?

is_rd_overutilized() sounds good to me. Outside of setting a flag in
sched.c that the suspend/resume code sets/clears, I can't think of an
interface that's better at avoiding abuse. Let me know if you have
any. Otherwise, I'll just go with the flag option. If Vincent gets the
scheduler to do the right thing without this, I'll happily drop this
targeted hack.

-Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 18:17             ` Saravana Kannan
@ 2024-11-11 19:00               ` Vincent Guittot
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2024-11-11 19:00 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, 11 Nov 2024 at 19:17, Saravana Kannan <saravanak@google.com> wrote:
>
> On Mon, Nov 11, 2024 at 3:15 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Mon, 11 Nov 2024 at 11:41, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> > >
> > > > I actually quickly hacked up the cpu_overutilized() function to return
> > > > true during suspend/resume and the threads are nicely spread out and
> > > > running in parallel. That actually reduces the total of the
> > > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> > >
> > > Right, so that kills EAS and makes it fall through to the regular
> > > select_idle_sibling() thing.
> > >
> > > > Peter,
> > > >
> > > > Would you be open to the scheduler being aware of
> > > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > > overutilized behavior during these phases? I know it's a very use case
> > > > specific behavior but how often do we NOT want to speed up
> > > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > > option -- say, fast_suspend or something like that.
> > >
> > > Well, I don't mind if Vincent doesn't. It seems like a very
> > > specific/targeted thing and should not affect much else, so it is a
> > > relatively safe thing to do.
> >
> > I would like to understand why all idle little cpus are not used in
> > saravana's example and tasks are packed on the same cpu instead.
>
> If you want to try this on your end and debug it further, it should be
> pretty easy to reproduce on a Pixel 6 even without my suspend/resume
> changes.

You are using v6.12-rc5 on the Pixel 6?

>
> Just run this on the device to mark all devices as async
> suspend/resume. This assumes you have CONFIG_PM_DEBUG enabled.
>
> find /sys/devices/ -name async | while read -r filename; do
>     echo enabled > "$filename"
> done
>
> And look at the dpm_resume_noirq() phase. You should see some kworkers
> that are runnable but not running for a while, even though a little
> CPU is idle. It should happen within a few tries. You need to unplug
> the USB cable to let the device suspend, and wait at least 10 seconds
> after the screen goes off.
>
> But even if you fix EAS to pick little CPUs, I think we also want to
> use the mid and big CPUs. That's not going to happen right?

Who knows?
Right now, the trace that you shared clearly shows wrong behavior.

>
> -Saravana
>
> > >
> > > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > > invasive, changing cpu_overutilized() relies on that getting propagated
> > > to rd->overutilized, might as well skip that step, no?


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 18:23           ` Saravana Kannan
@ 2024-11-11 19:01             ` Vincent Guittot
  2024-11-11 19:12               ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2024-11-11 19:01 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, 11 Nov 2024 at 19:24, Saravana Kannan <saravanak@google.com> wrote:
>
> On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> >
> > > I actually quickly hacked up the cpu_overutilized() function to return
> > > true during suspend/resume and the threads are nicely spread out and
> > > running in parallel. That actually reduces the total of the
> > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> >
> > Right, so that kills EAS and makes it fall through to the regular
> > select_idle_sibling() thing.
> >
> > > Peter,
> > >
> > > Would you be open to the scheduler being aware of
> > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > overutilized behavior during these phases? I know it's a very use case
> > > specific behavior but how often do we NOT want to speed up
> > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > option -- say, fast_suspend or something like that.
> >
> > Well, I don't mind if Vincent doesn't. It seems like a very
> > specific/targeted thing and should not affect much else, so it is a
> > relatively safe thing to do.
> >
> > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > invasive, changing cpu_overutilized() relies on that getting propagated
> > to rd->overutilized, might as well skip that step, no?
>
> is_rd_overutilized() sounds good to me. Outside of setting a flag in

As of now, I'm not convinced that this is a solution; it's just a quick
hack for your problem. We must first understand what is wrong.

> sched.c that the suspend/resume code sets/clears, I can't think of an
> interface that's better at avoiding abuse. Let me know if you have
> any. Otherwise, I'll just go with the flag option. If Vincent gets the
> scheduler to do the right thing without this, I'll happily drop this
> targeted hack.
>
> -Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 19:01             ` Vincent Guittot
@ 2024-11-11 19:12               ` Vincent Guittot
  2024-11-12  7:23                 ` Saravana Kannan
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2024-11-11 19:12 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, 11 Nov 2024 at 20:01, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 11 Nov 2024 at 19:24, Saravana Kannan <saravanak@google.com> wrote:
> >
> > On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> > >
> > > > I actually quickly hacked up the cpu_overutilized() function to return
> > > > true during suspend/resume and the threads are nicely spread out and
> > > > running in parallel. That actually reduces the total of the
> > > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> > >
> > > Right, so that kills EAS and makes it fall through to the regular
> > > select_idle_sibling() thing.
> > >
> > > > Peter,
> > > >
> > > > Would you be open to the scheduler being aware of
> > > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > > overutilized behavior during these phases? I know it's a very use case
> > > > specific behavior but how often do we NOT want to speed up
> > > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > > option -- say, fast_suspend or something like that.
> > >
> > > Well, I don't mind if Vincent doesn't. It seems like a very
> > > specific/targeted thing and should not affect much else, so it is a
> > > relatively safe thing to do.
> > >
> > > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > > invasive, changing cpu_overutilized() relies on that getting propagated
> > > to rd->overutilized, might as well skip that step, no?
> >
> > is_rd_overutilized() sounds good to me. Outside of setting a flag in
>
> As of now, I'm not convinced that this is a solution; it's just a quick
> hack for your problem. We must first understand what is wrong.

And you'd better switch to the performance cpufreq governor to disable
EAS and run at max frequency if you further want to decrease latency.

>
> > sched.c that the suspend/resume code sets/clears, I can't think of an
> > interface that's better at avoiding abuse. Let me know if you have
> > any. Otherwise, I'll just go with the flag option. If Vincent gets the
> > scheduler to do the right thing without this, I'll happily drop this
> > targeted hack.
> >
> > -Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-11 19:12               ` Vincent Guittot
@ 2024-11-12  7:23                 ` Saravana Kannan
  2024-11-12  9:03                   ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-12  7:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Mon, Nov 11, 2024 at 11:12 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 11 Nov 2024 at 20:01, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Mon, 11 Nov 2024 at 19:24, Saravana Kannan <saravanak@google.com> wrote:
> > >
> > > On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> > > >
> > > > > I actually quickly hacked up the cpu_overutilized() function to return
> > > > > true during suspend/resume and the threads are nicely spread out and
> > > > > running in parallel. That actually reduces the total of the
> > > > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> > > >
> > > > Right, so that kills EAS and makes it fall through to the regular
> > > > select_idle_sibling() thing.
> > > >
> > > > > Peter,
> > > > >
> > > > > Would you be open to the scheduler being aware of
> > > > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > > > overutilized behavior during these phases? I know it's a very use case
> > > > > specific behavior but how often do we NOT want to speed up
> > > > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > > > option -- say, fast_suspend or something like that.
> > > >
> > > > Well, I don't mind if Vincent doesn't. It seems like a very
> > > > specific/targeted thing and should not affect much else, so it is a
> > > > relatively safe thing to do.
> > > >
> > > > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > > > invasive, changing cpu_overutilized() relies on that getting propagated
> > > > to rd->overutilized, might as well skip that step, no?
> > >
> > > is_rd_overutilized() sounds good to me. Outside of setting a flag in
> >
> > As of now, I'm not convinced that this is a solution; it's just a quick
> > hack for your problem. We must first understand what is wrong.
>
> And you'd better switch to the performance cpufreq governor to disable
> EAS and run at max frequency if you further want to decrease latency.

Ohhh... now that you mention fixing CPU frequencies, a lot of systems
fix their CPU frequencies during suspend/resume. Pixel 6 is one of
them. In the case of Pixel 6, the driver sets the policy min/max to
these fixed frequencies to force the CPU to stay at one frequency.
Will EAS handle this correctly? I wonder if that'd affect the task
placement decision. Also, other systems might limit CPU frequencies in
ways that EAS can't tell. If the CPU frequencies are frozen, I'm not
sure EAS makes a lot of sense. Except maybe using CPU max capacity to
make sure little CPUs are busy first before using the big CPUs?

But even if EAS thinks the CPU freq could go up (when it can't), it
still doesn't make a lot of sense to not use those idle CPUs and
instead try to bump up the frequency (by putting more threads in a
CPU).

Anyway, with all this in mind, it makes more sense to me to just
trigger the "overutilized" mode during these specific parts of
suspend/resume.

-Saravana

>
> >
> > > sched.c that the suspend/resume code sets/clears, I can't think of an
> > > interface that's better at avoiding abuse. Let me know if you have
> > > any. Otherwise, I'll just go with the flag option. If Vincent gets the
> > > scheduler to do the right thing without this, I'll happily drop this
> > > targeted hack.
> > >
> > > -Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-12  7:23                 ` Saravana Kannan
@ 2024-11-12  9:03                   ` Vincent Guittot
  2024-11-12 16:25                     ` Saravana Kannan
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2024-11-12  9:03 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Tue, 12 Nov 2024 at 08:24, Saravana Kannan <saravanak@google.com> wrote:
>
> On Mon, Nov 11, 2024 at 11:12 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Mon, 11 Nov 2024 at 20:01, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Mon, 11 Nov 2024 at 19:24, Saravana Kannan <saravanak@google.com> wrote:
> > > >
> > > > On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> > > > >
> > > > > > I actually quickly hacked up the cpu_overutilized() function to return
> > > > > > true during suspend/resume and the threads are nicely spread out and
> > > > > > running in parallel. That actually reduces the total of the
> > > > > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> > > > >
> > > > > Right, so that kills EAS and makes it fall through to the regular
> > > > > select_idle_sibling() thing.
> > > > >
> > > > > > Peter,
> > > > > >
> > > > > > Would you be open to the scheduler being aware of
> > > > > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > > > > overutilized behavior during these phases? I know it's a very use case
> > > > > > specific behavior but how often do we NOT want to speed up
> > > > > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > > > > option -- say, fast_suspend or something like that.
> > > > >
> > > > > Well, I don't mind if Vincent doesn't. It seems like a very
> > > > > specific/targeted thing and should not affect much else, so it is a
> > > > > relatively safe thing to do.
> > > > >
> > > > > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > > > > invasive, changing cpu_overutilized() relies on that getting propagated
> > > > > to rd->overutilized, might as well skip that step, no?
> > > >
> > > > is_rd_overutilized() sounds good to me. Outside of setting a flag in
> > >
> > > As of now, I'm not convinced that this is a solution; it's just a quick
> > > hack for your problem. We must first understand what is wrong.
> >
> > And you'd better switch to the performance cpufreq governor to disable
> > EAS and run at max frequency if you further want to decrease latency.
>
> Ohhh... now that you mention fixing CPU frequencies, a lot of systems
> fix their CPU frequencies during suspend/resume. Pixel 6 is one of
> them. In the case of Pixel 6, the driver sets the policy min/max to
> these fixed frequencies to force the CPU to stay at one frequency.
> Will EAS handle this correctly? I wonder if that'd affect the task

AFAICT, it should

> placement decision. Also, other systems might limit CPU frequencies in
> ways that EAS can't tell. If the CPU frequencies are frozen, I'm not
> sure EAS makes a lot of sense. Except maybe using CPU max capacity to
> make sure little CPUs are busy first before using the big CPUs?
>
> But even if EAS thinks the CPU freq could go up (when it can't), it
> still doesn't make a lot of sense to not use those idle CPUs and
> instead try to bump up the frequency (by putting more threads in a
> CPU).

In this case, you just need to do the below before entering suspend
(and write 1 back after resuming):
  echo 0 > /proc/sys/kernel/sched_energy_aware
instead of hacking overutilized.
Writing 0 disables EAS without rebuilding the sched domains.

>
> Anyway, with all this in mind, it makes more sense to me to just
> trigger the "overutilized" mode during these specific parts of
> suspend/resume.
>
> -Saravana
>
> >
> > >
> > > > sched.c that the suspend/resume code sets/clears, I can't think of an
> > > > interface that's better at avoiding abuse. Let me know if you have
> > > > any. Otherwise, I'll just go with the flag option. If Vincent gets the
> > > > scheduler to do the right thing without this, I'll happily drop this
> > > > targeted hack.
> > > >
> > > > -Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-12  9:03                   ` Vincent Guittot
@ 2024-11-12 16:25                     ` Saravana Kannan
  2024-11-12 17:00                       ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-12 16:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Tue, Nov 12, 2024 at 1:03 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 12 Nov 2024 at 08:24, Saravana Kannan <saravanak@google.com> wrote:
> >
> > On Mon, Nov 11, 2024 at 11:12 AM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Mon, 11 Nov 2024 at 20:01, Vincent Guittot
> > > <vincent.guittot@linaro.org> wrote:
> > > >
> > > > On Mon, 11 Nov 2024 at 19:24, Saravana Kannan <saravanak@google.com> wrote:
> > > > >
> > > > > On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > >
> > > > > > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> > > > > >
> > > > > > > I actually quickly hacked up the cpu_overutilized() function to return
> > > > > > > true during suspend/resume and the threads are nicely spread out and
> > > > > > > running in parallel. That actually reduces the total of the
> > > > > > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> > > > > >
> > > > > > Right, so that kills EAS and makes it fall through to the regular
> > > > > > select_idle_sibling() thing.
> > > > > >
> > > > > > > Peter,
> > > > > > >
> > > > > > > Would you be open to the scheduler being aware of
> > > > > > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > > > > > overutilized behavior during these phases? I know it's a very use case
> > > > > > > specific behavior but how often do we NOT want to speed up
> > > > > > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > > > > > option -- say, fast_suspend or something like that.
> > > > > >
> > > > > > Well, I don't mind if Vincent doesn't. It seems like a very
> > > > > > specific/targeted thing and should not affect much else, so it is a
> > > > > > relatively safe thing to do.
> > > > > >
> > > > > > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > > > > > invasive, changing cpu_overutilized() relies on that getting propagated
> > > > > > to rd->overutilized, might as well skip that step, no?
> > > > >
> > > > > is_rd_overutilized() sounds good to me. Outside of setting a flag in
> > > >
> > > > As of now, I'm not convinced that this is a solution; it's just a quick
> > > > hack for your problem. We must first understand what is wrong.
> > >
> > > And you'd better switch to the performance cpufreq governor to disable
> > > EAS and run at max frequency if you further want to decrease latency.
> >
> > Ohhh... now that you mention fixing CPU frequencies, a lot of systems
> > fix their CPU frequencies during suspend/resume. Pixel 6 is one of
> > them. In the case of Pixel 6, the driver sets the policy min/max to
> > these fixed frequencies to force the CPU to stay at one frequency.
> > Will EAS handle this correctly? I wonder if that'd affect the task
>
> AFAICT, it should

To be clear, I'm not opposed to any sched fixes that will do the right
thing naturally.

> > placement decision. Also, other systems might limit CPU frequencies in
> > ways that EAS can't tell. If the CPU frequencies are frozen, I'm not
> > sure EAS makes a lot of sense. Except maybe using CPU max capacity to
> > make sure little CPUs are busy first before using the big CPUs?
> >
> > But even if EAS thinks the CPU freq could go up (when it can't), it
> > still doesn't make a lot of sense to not use those idle CPUs and
> > instead try to bump up the frequency (by putting more threads in a
> > CPU).
>
> In this case, you just need to do the below before entering suspend
> (and write 1 back after resuming):
>   echo 0 > /proc/sys/kernel/sched_energy_aware
> instead of hacking overutilized.
> Writing 0 disables EAS without rebuilding the sched domains.

That disables EAS for a huge portion of the suspend/resume where we do
want it to be enabled.

Also, as I said before, I want to do this only for the "devices
resume" part where there is a lot of parallelism. Not for the entire
system suspend/resume.

Is there an in-kernel version of this call? Do I just need to set and
clear sysctl_sched_energy_aware? Also, does setting/clearing
overutilized rebuild the sched domain?

Thanks,
Saravana

>
> >
> > Anyway, with all this in mind, it makes more sense to me to just
> > trigger the "overutilized" mode during these specific parts of
> > suspend/resume.
> >
> > -Saravana
> >
> > >
> > > >
> > > > > sched.c that the suspend/resume code sets/clears, I can't think of an
> > > > > interface that's better at avoiding abuse. Let me know if you have
> > > > > any. Otherwise, I'll just go with the flag option. If Vincent gets the
> > > > > scheduler to do the right thing without this, I'll happily drop this
> > > > > targeted hack.
> > > > >
> > > > > -Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-12 16:25                     ` Saravana Kannan
@ 2024-11-12 17:00                       ` Vincent Guittot
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2024-11-12 17:00 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman,
	Valentin Schneider, LKML, wuyun.abel, youssefesmat,
	Thomas Gleixner, efault, John Stultz, Vincent Palomares,
	Tobias Huschle

On Tue, 12 Nov 2024 at 17:26, Saravana Kannan <saravanak@google.com> wrote:
>
> On Tue, Nov 12, 2024 at 1:03 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 12 Nov 2024 at 08:24, Saravana Kannan <saravanak@google.com> wrote:
> > >
> > > On Mon, Nov 11, 2024 at 11:12 AM Vincent Guittot
> > > <vincent.guittot@linaro.org> wrote:
> > > >
> > > > On Mon, 11 Nov 2024 at 20:01, Vincent Guittot
> > > > <vincent.guittot@linaro.org> wrote:
> > > > >
> > > > > On Mon, 11 Nov 2024 at 19:24, Saravana Kannan <saravanak@google.com> wrote:
> > > > > >
> > > > > > On Mon, Nov 11, 2024 at 2:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > >
> > > > > > > On Sun, Nov 10, 2024 at 10:15:07PM -0800, Saravana Kannan wrote:
> > > > > > >
> > > > > > > > I actually quickly hacked up the cpu_overutilized() function to return
> > > > > > > > true during suspend/resume and the threads are nicely spread out and
> > > > > > > > running in parallel. That actually reduces the total of the
> > > > > > > > dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
> > > > > > >
> > > > > > > Right, so that kills EAS and makes it fall through to the regular
> > > > > > > select_idle_sibling() thing.
> > > > > > >
> > > > > > > > Peter,
> > > > > > > >
> > > > > > > > Would you be open to the scheduler being aware of
> > > > > > > > dpm_suspend*()/dpm_resume*() phases and triggering the CPU
> > > > > > > > overutilized behavior during these phases? I know it's a very use case
> > > > > > > > specific behavior but how often do we NOT want to speed up
> > > > > > > > suspend/resume? We can make this a CONFIG or a kernel command line
> > > > > > > > option -- say, fast_suspend or something like that.
> > > > > > >
> > > > > > > Well, I don't mind if Vincent doesn't. It seems like a very
> > > > > > > specific/targeted thing and should not affect much else, so it is a
> > > > > > > relatively safe thing to do.
> > > > > > >
> > > > > > > Perhaps a more direct hack in is_rd_overutilized() would be even less
> > > > > > > invasive, changing cpu_overutilized() relies on that getting propagated
> > > > > > > to rd->overutilized, might as well skip that step, no?
> > > > > >
> > > > > > is_rd_overutilized() sounds good to me. Outside of setting a flag in
> > > > >
> > > > > As of now, I'm not convinced that this is a solution; it's just a quick
> > > > > hack for your problem. We must first understand what is wrong.
> > > >
> > > > And you'd better switch to the performance cpufreq governor to disable
> > > > EAS and run at max frequency if you further want to decrease latency.
> > >
> > > Ohhh... now that you mention fixing CPU frequencies, a lot of systems
> > > fix their CPU frequencies during suspend/resume. Pixel 6 is one of
> > > them. In the case of Pixel 6, the driver sets the policy min/max to
> > > these fixed frequencies to force the CPU to stay at one frequency.
> > > Will EAS handle this correctly? I wonder if that'd affect the task
> >
> > AFAICT, it should
>
> To be clear, I'm not opposed to any sched fixes that will do the right
> thing naturally.

A quick try on RB5, while continuing to test my EAS rework patches,
doesn't show the problem; I still need to check with the current EAS
version.

>
> > > placement decision. Also, other systems might limit CPU frequencies in
> > > ways that EAS can't tell. If the CPU frequencies are frozen, I'm not
> > > sure EAS makes a lot of sense. Except maybe using CPU max capacity to
> > > make sure little CPUs are busy first before using the big CPUs?
> > >
> > > But even if EAS thinks the CPU freq could go up (when it can't), it
> > > still doesn't make a lot of sense to not use those idle CPUs and
> > > instead try to bump up the frequency (by putting more threads in a
> > > CPU).
> >
> > In this case, you just need to do the below before entering suspend
> > (and write 1 back after resuming):
> >   echo 0 > /proc/sys/kernel/sched_energy_aware
> > instead of hacking overutilized.
> > Writing 0 disables EAS without rebuilding the sched domains.
>
> That disables EAS for a huge portion of the suspend/resume where we do
> want it to be enabled.
>
> Also, as I said before, I want to do this only for the "devices
> resume" part where there is a lot of parallelism. Not for the entire
> system suspend/resume.

Would this really be a problem? You might not get EAS disabled for
exactly your portion but, on the other hand, you want to speed up
suspend/resume.
I mean, if systems already fix the frequency of CPUs during
suspend/resume, they can just disable EAS as well. EAS will be
disabled, but sched_asym_cpucapacity will remain enabled.

>
> Is there an in-kernel version of this call? Do I just need to set and
> clear sysctl_sched_energy_aware? Also, does setting/clearing

No, it ends up updating a static key.

> overutilized rebuild the sched domain?

No.

But the system is not overutilized; as you mentioned in your
description, you have a scheduling latency constraint on the kworker
threads.


>
> Thanks,
> Saravana
>
> >
> > >
> > > Anyway, with all this in mind, it makes more sense to me to just
> > > trigger the "overutilized" mode during these specific parts of
> > > suspend/resume.
> > >
> > > -Saravana
> > >
> > > >
> > > > >
> > > > > > sched.c that the suspend/resume code sets/clears, I can't think of an
> > > > > > interface that's better at avoiding abuse. Let me know if you have
> > > > > > any. Otherwise, I'll just go with the flag option. If Vincent gets the
> > > > > > scheduler to do the right thing without this, I'll happily drop this
> > > > > > targeted hack.
> > > > > >
> > > > > > -Saravana


* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-08  9:02 ` Vincent Guittot
@ 2024-11-14  6:36   ` Saravana Kannan
  2024-11-14 13:06     ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Saravana Kannan @ 2024-11-14  6:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra (Intel), Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider,
	LKML, wuyun.abel, youssefesmat, Thomas Gleixner, efault,
	K Prateek Nayak, John Stultz, Vincent Palomares

Ugh... just realized that for a few of the emails I've been replying
directly to one person instead of reply-all.

On Fri, Nov 8, 2024 at 1:02 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 8 Nov 2024 at 08:28, Saravana Kannan <saravanak@google.com> wrote:
> >
> > Hi scheduler folks,
> >
> > I'm running into some weird scheduling issues when testing non-sched
> > changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> > this is an issue in earlier kernel versions or not.
> >
> > The async suspend/resume code calls async_schedule_dev_nocall() to
> > queue up a bunch of work. These queued up work seem to be running in
> > kworker threads.
> >
> > However, there have been many times where I see scheduling latency
> > (runnable, but not running) of 4.5 ms or higher for a kworker thread
> > when there are plenty of idle CPUs.
>
> You are using EAS, aren't you ?
> so the energy impact drive the cpu selection not cpu idleness
>
> There is a proposal to change feec to also take into account such case
> in addition to the energy impact
> https://lore.kernel.org/lkml/64ed0fb8-12ea-4452-9ec2-7ad127b65529@arm.com/T/
>
> I still have to finalize v2

Anyway, I tried this series (got it from
https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/rework-eas)
and:
1. The timing hasn't improved at all compared to not having the series.
2. There's still a lot of preemption of runnable tasks with some empty CPUs.

For example:
https://ui.perfetto.dev/#!/?s=955ff7e73edf32dab27501025211fa2ce322f725

Thanks,
Saravana


>
> >
> > Does async_schedule_dev_nocall() have some weird limitations on where
> > they can be run? I know it has some NUMA related stuff, but the Pixel
> > 6 doesn't have NUMA. This oddity unnecessarily increases
> > suspend/resume latency as it adds up across kworker threads. So, I'd
> > appreciate any insights on what might be happening?
> >
> > If you know how to use perfetto (it's really pretty simple, all you
> > need to know is WASD and clicking), here's an example:
> > https://ui.perfetto.dev/#!/?s=e20045736e7dfa1e897db6489710061d2495be92
> >
> > Thanks,
> > Saravana

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Very high scheduling delay with plenty of idle CPUs
  2024-11-14  6:36   ` Saravana Kannan
@ 2024-11-14 13:06     ` Vincent Guittot
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2024-11-14 13:06 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Ingo Molnar, Peter Zijlstra (Intel), Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider,
	LKML, wuyun.abel, youssefesmat, Thomas Gleixner, efault,
	K Prateek Nayak, John Stultz, Vincent Palomares

On Thu, 14 Nov 2024 at 07:37, Saravana Kannan <saravanak@google.com> wrote:
>
> Ugh... just realized that for a few of the emails I've been replying
> directly to one person instead of reply-all.
>
> On Fri, Nov 8, 2024 at 1:02 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Fri, 8 Nov 2024 at 08:28, Saravana Kannan <saravanak@google.com> wrote:
> > >
> > > Hi scheduler folks,
> > >
> > > I'm running into some weird scheduling issues when testing non-sched
> > > changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> > > this is an issue in earlier kernel versions or not.
> > >
> > > The async suspend/resume code calls async_schedule_dev_nocall() to
> > > queue up a bunch of work. These queued up work seem to be running in
> > > kworker threads.
> > >
> > > However, there have been many times where I see scheduling latency
> > > (runnable, but not running) of 4.5 ms or higher for a kworker thread
> > > when there are plenty of idle CPUs.
> >
> > You are using EAS, aren't you ?
> > so the energy impact drive the cpu selection not cpu idleness
> >
> > There is a proposal to change feec to also take into account such case
> > in addition to the energy impact
> > https://lore.kernel.org/lkml/64ed0fb8-12ea-4452-9ec2-7ad127b65529@arm.com/T/
> >
> > I still have to finalize v2
>
> Anyway, I tried this series (got it from
> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/rework-eas)
> and:
> 1. The timing hasn't improved at all compared to not having the series.

Surprising, as I can see improvements on RB5 with this series:
unbounded kworkers spread across the little CPUs, unlike the current
implementation, but the med and big cores wait for the littles to be
filled first, which is not the case when EAS is disabled.

> 2. There's still a lot of preemption of runnable tasks with some empty CPUs.

Yes, the little CPUs are fully filled, but the med and big CPUs are
only used later, once the utilization of the littles has increased.

>
> For example:
> https://ui.perfetto.dev/#!/?s=955ff7e73edf32dab27501025211fa2ce322f725
>
> Thanks,
> Saravana
>
>
> >
> > >
> > > Does async_schedule_dev_nocall() have some weird limitations on where
> > > they can be run? I know it has some NUMA related stuff, but the Pixel
> > > 6 doesn't have NUMA. This oddity unnecessarily increases
> > > suspend/resume latency as it adds up across kworker threads. So, I'd
> > > appreciate any insights on what might be happening?
> > >
> > > If you know how to use perfetto (it's really pretty simple, all you
> > > need to know is WASD and clicking), here's an example:
> > > https://ui.perfetto.dev/#!/?s=e20045736e7dfa1e897db6489710061d2495be92
> > >
> > > Thanks,
> > > Saravana

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-11-14 13:06 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-08  7:28 Very high scheduling delay with plenty of idle CPUs Saravana Kannan
2024-11-08  8:31 ` Peter Zijlstra
2024-11-10  5:49   ` Saravana Kannan
2024-11-11  5:17     ` K Prateek Nayak
2024-11-11  6:15       ` Saravana Kannan
2024-11-11  8:25         ` Christian Loehle
2024-11-11  9:02           ` Saravana Kannan
2024-11-11  9:08             ` Christian Loehle
2024-11-11 10:40         ` Peter Zijlstra
2024-11-11 11:15           ` Vincent Guittot
2024-11-11 18:17             ` Saravana Kannan
2024-11-11 19:00               ` Vincent Guittot
2024-11-11 18:23           ` Saravana Kannan
2024-11-11 19:01             ` Vincent Guittot
2024-11-11 19:12               ` Vincent Guittot
2024-11-12  7:23                 ` Saravana Kannan
2024-11-12  9:03                   ` Vincent Guittot
2024-11-12 16:25                     ` Saravana Kannan
2024-11-12 17:00                       ` Vincent Guittot
2024-11-08  9:02 ` Vincent Guittot
2024-11-14  6:36   ` Saravana Kannan
2024-11-14 13:06     ` Vincent Guittot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox