linux-arm-kernel.lists.infradead.org archive mirror
* sched: ARM: arch_scale_freq_power
@ 2011-10-06 11:36 Vincent Guittot
  2011-10-11  7:16 ` Amit Kucheria
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2011-10-06 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

I am working to link the cpu_power of ARM cores to their frequency by using
arch_scale_freq_power. It's explained in the kernel that cpu_power is
used to distribute load on cpus and a cpu with more cpu_power will
pick up more load. The default value is SCHED_POWER_SCALE and I
increase the value if I want a cpu to have more load than another one.
Is there an advised range for the cpu_power value, as well as any time-scale
constraints for updating it?
I'm also wondering why this scheduler feature is currently disabled by default.
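
For reference, what I have in mind is something along these lines. This is
only a minimal sketch on my side (it assumes the cpufreq helpers
cpufreq_quick_get()/cpufreq_quick_get_max() are usable from this context and
that a simple linear frequency ratio is good enough), not the final code:

#include <linux/cpufreq.h>
#include <linux/sched.h>

/*
 * Sketch: scale cpu_power linearly with the current/max frequency ratio.
 * SCHED_POWER_SCALE (1024) is the scheduler's unit; the cpufreq helpers
 * return frequencies in kHz, and 0 when the frequency is not known yet.
 */
unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
	unsigned long cur = cpufreq_quick_get(cpu);
	unsigned long max = cpufreq_quick_get_max(cpu);

	if (!cur || !max)
		return SCHED_POWER_SCALE;

	return (cur << SCHED_POWER_SHIFT) / max;
}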

Regards,
Vincent


* sched: ARM: arch_scale_freq_power
  2011-10-06 11:36 sched: ARM: arch_scale_freq_power Vincent Guittot
@ 2011-10-11  7:16 ` Amit Kucheria
  2011-10-11  7:57   ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Amit Kucheria @ 2011-10-11  7:16 UTC (permalink / raw)
  To: linux-arm-kernel

Adding Peter to the discussion..

On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> I am working to link the cpu_power of ARM cores to their frequency by using
> arch_scale_freq_power. It's explained in the kernel that cpu_power is
> used to distribute load on cpus and a cpu with more cpu_power will
> pick up more load. The default value is SCHED_POWER_SCALE and I
> increase the value if I want a cpu to have more load than another one.
> Is there an advised range for the cpu_power value, as well as any time-scale
> constraints for updating it?
> I'm also wondering why this scheduler feature is currently disabled by default.
>
> Regards,
> Vincent

In discussions with Vincent regarding this, I've wondered whether
cpu_power wouldn't be better renamed to cpu_capacity since that is
what it really seems to describe.


* sched: ARM: arch_scale_freq_power
  2011-10-11  7:16 ` Amit Kucheria
@ 2011-10-11  7:57   ` Peter Zijlstra
  2011-10-11  8:51     ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2011-10-11  7:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-10-11 at 12:46 +0530, Amit Kucheria wrote:
> Adding Peter to the discussion..

Right, CCing the folks who actually wrote the code you're asking
questions about always helps ;-)

> On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> > I am working to link the cpu_power of ARM cores to their frequency by using
> > arch_scale_freq_power. 

Why and how? In particular note that if you're using something like the
on-demand cpufreq governor this isn't going to work.

> > It's explained in the kernel that cpu_power is
> > used to distribute load on cpus and a cpu with more cpu_power will
> > pick up more load. The default value is SCHED_POWER_SCALE and I
> > increase the value if I want a cpu to have more load than another one.
> > Is there an advised range for the cpu_power value, as well as any time-scale
> > constraints for updating it?

Basically 1024 is the unit and denotes the capacity of a full core at
'normal' speed. 

Typically cpufreq would down-clock a core and thus you'd end up with a
smaller number (linearly proportional to the freq ratio etc. although if
you want to go really fancy you could determine the actual
throughput/freq curves).

Things like x86 turbo mode would result in a >1024 value.

Things like SMT would typically result in <1024 and the SMT sum over the
core >1024 (if you're lucky).
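
For example (illustrative numbers only, not measurements): a core down-clocked
to half its nominal frequency would end up around 1024/2 = 512, a 25% turbo
boost around 1280, and two SMT siblings might each report something like 589
so that the pair sums to ~1178.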

> > I'm also wondering why this scheduler feature is currently disabled by default.

Because the only implementation in existence (x86) is broken and I
haven't gotten around to fixing it. Arguably we should disable that for
the time being, see below.

> In discussions with Vincent regarding this, I've wondered whether
> cpu_power wouldn't be better renamed to cpu_capacity since that is
> what it really seems to describe.

Possibly, but it's been cpu_power for ages and we use capacity to
describe something else.

---
 arch/x86/kernel/cpu/sched.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
index a640ae5..90ae68c 100644
--- a/arch/x86/kernel/cpu/sched.c
+++ b/arch/x86/kernel/cpu/sched.c
@@ -6,7 +6,14 @@
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
 
-#ifdef CONFIG_SMP
+#if 0 /* def CONFIG_SMP */
+
+/*
+ * Currently broken, we need to filter out idle time because the aperf/mperf
+ * ratio measures actual throughput, not capacity. This means that if a logical
+ * cpu idles it will report less capacity and receive less work, which isn't
+ * what we want.
+ */
 
 static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
 


* sched: ARM: arch_scale_freq_power
  2011-10-11  7:57   ` Peter Zijlstra
@ 2011-10-11  8:51     ` Vincent Guittot
  2011-10-11  9:13       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2011-10-11  8:51 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 October 2011 09:57, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-10-11 at 12:46 +0530, Amit Kucheria wrote:
>> Adding Peter to the discussion..
>
> Right, CCing the folks who actually wrote the code you're asking
> questions about always helps ;-)
>
>> On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot
>> <vincent.guittot@linaro.org> wrote:
>> > I am working to link the cpu_power of ARM cores to their frequency by using
>> > arch_scale_freq_power.
>
> Why and how? In particular note that if you're using something like the
> on-demand cpufreq governor this isn't going to work.
>

I have several goals. The 1st one is that I need to put more load on
some cpus when I have packages with different cpu frequencies.
I am also studying whether I can follow the real cpu frequency but it seems to be
not so easy. I have noticed that the cpu_power is updated periodically
except when we have a lot of newly_idle events.
Then, I have some use cases which have several running tasks but a low
cpu load. In this case, the small tasks are spread over several cpus by
the load_balance whereas they could easily be handled by one cpu
without a significant performance impact. If the cpu_power is
higher than 1024, the cpu is no longer seen as out of capacity by the
load_balance as soon as a short process is running, and the main result
is that the small tasks will stay on the same cpu. This configuration
is mainly useful for ARM dual-core systems where we want to power gate
one cpu. I use cyclictest to simulate such a use case.
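
(As a worked example of what I mean, based on my reading of the capacity
computation in update_sg_lb_stats: group_capacity is the group's power divided
by SCHED_POWER_SCALE and rounded to the nearest integer, so a cpu left at 1024
advertises a capacity of 1 and is already "full" with one running task,
whereas a cpu bumped to ~2048 advertises a capacity of 2 and can keep two
small tasks before load_balance tries to spread them.)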

>> > It's explained in the kernel that cpu_power is
>> > used to distribute load on cpus and a cpu with more cpu_power will
>> > pick up more load. The default value is SCHED_POWER_SCALE and I
>> > increase the value if I want a cpu to have more load than another one.
>> > Is there an advised range for the cpu_power value, as well as any time-scale
>> > constraints for updating it?
>
> Basically 1024 is the unit and denotes the capacity of a full core at
> 'normal' speed.
>
> Typically cpufreq would down-clock a core and thus you'd end up with a
> smaller number (linearly proportional to the freq ratio etc. although if
> you want to go really fancy you could determine the actual
> throughput/freq curves).
>
> Things like x86 turbo mode would result in a >1024 value.
>
> Things like SMT would typically result in <1024 and the SMT sum over the
> core >1024 (if you're lucky).
>
>> > I'm also wondering why this scheduler feature is currently disabled by default.
>
> Because the only implementation in existence (x86) is broken and I
> haven't gotten around to fixing it. Arguably we should disable that for
> the time being, see below.
>
>> In discussions with Vincent regarding this, I've wondered whether
>> cpu_power wouldn't be better renamed to cpu_capacity since that is
>> what it really seems to describe.
>
> Possibly, but it's been cpu_power for ages and we use capacity to
> describe something else.
>
> ---
>  arch/x86/kernel/cpu/sched.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
> index a640ae5..90ae68c 100644
> --- a/arch/x86/kernel/cpu/sched.c
> +++ b/arch/x86/kernel/cpu/sched.c
> @@ -6,7 +6,14 @@
>  #include <asm/cpufeature.h>
>  #include <asm/processor.h>
>
> -#ifdef CONFIG_SMP
> +#if 0 /* def CONFIG_SMP */
> +
> +/*
> + * Currently broken, we need to filter out idle time because the aperf/mperf
> + * ratio measures actual throughput, not capacity. This means that if a logical
> + * cpu idles it will report less capacity and receive less work, which isn't
> + * what we want.
> + */
>
>  static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
>
>
>
>


* sched: ARM: arch_scale_freq_power
  2011-10-11  8:51     ` Vincent Guittot
@ 2011-10-11  9:13       ` Peter Zijlstra
  2011-10-11  9:38         ` Amit Kucheria
  2011-10-11  9:40         ` Vincent Guittot
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2011-10-11  9:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
> I have several goals. The 1st one is that I need to put more load on
> some cpus when I have packages with different cpu frequencies.

That should be rather easy.

> I am also studying whether I can follow the real cpu frequency but it seems to be
> not so easy.

Why not?

>  I have noticed that the cpu_power is updated periodically
> except when we have a lot of newly_idle events.

We can certainly fix that.

> Then, I have some use cases which have several running tasks but a low
> cpu load. In this case, the small tasks are spread over several cpus by
> the load_balance whereas they could easily be handled by one cpu
> without a significant performance impact.

That shouldn't be done using cpu_power, we have sched_smt_power_savings
and sched_mc_power_savings for stuff like that.

Although I would really like to kill all those different
sched_*_power_savings knobs and reduce it to one.

> If the cpu_power is
> higher than 1024, the cpu is no longer seen as out of capacity by the
> load_balance as soon as a short process is running, and the main result
> is that the small tasks will stay on the same cpu. This configuration
> is mainly useful for ARM dual-core systems where we want to power gate
> one cpu. I use cyclictest to simulate such a use case.

Yeah, but that's wrong.


* sched: ARM: arch_scale_freq_power
  2011-10-11  9:13       ` Peter Zijlstra
@ 2011-10-11  9:38         ` Amit Kucheria
  2011-10-11 10:03           ` Peter Zijlstra
  2011-10-11  9:40         ` Vincent Guittot
  1 sibling, 1 reply; 11+ messages in thread
From: Amit Kucheria @ 2011-10-11  9:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Oct 11, 2011 at 2:43 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
>> I have several goals. The 1st one is that I need to put more load on
>> some cpus when I have packages with different cpu frequencies.
>
> That should be rather easy.
>
>> I am also studying whether I can follow the real cpu frequency but it seems to be
>> not so easy.
>
> Why not?
>
>>  I have noticed that the cpu_power is updated periodically
>> except when we have a lot of newly_idle events.
>
> We can certainly fix that.
>
>> Then, I have some use cases which have several running tasks but a low
>> cpu load. In this case, the small tasks are spread over several cpus by
>> the load_balance whereas they could easily be handled by one cpu
>> without a significant performance impact.
>
> That shouldn't be done using cpu_power, we have sched_smt_power_savings
> and sched_mc_power_savings for stuff like that.

AFAICT, sched_mc assumes all cores have the same capacity - which is
certainly true of the x86 architecture. But in ARM you can see hybrid
cores[1] designed using different fab technology, so that some cores
can run at 'n' GHz and some at 'm' GHz. The idea is that when there
isn't much to do (e.g. periodic keep-alives for messaging, email, etc.)
you don't wake up the higher power-consuming cores.


* sched: ARM: arch_scale_freq_power
  2011-10-11  9:13       ` Peter Zijlstra
  2011-10-11  9:38         ` Amit Kucheria
@ 2011-10-11  9:40         ` Vincent Guittot
  2011-10-11 10:27           ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2011-10-11  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 October 2011 11:13, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
>> I have several goals. The 1st one is that I need to put more load on
>> some cpus when I have packages with different cpu frequencies.
>
> That should be rather easy.
>

I agree. I was mainly wondering if I should use a [1-1024] or a
[1024-xxxx] range, and it seems that both can be used: SMT
uses <1024 and x86 turbo mode uses >1024.

>> I am also studying whether I can follow the real cpu frequency but it seems to be
>> not so easy.
>
> Why not?
>

In fact, the problem is not really to follow the frequency but to be
sure that update_group_power is called often enough by load_balance.
The newly_idle events were also a major problem.

>>  I have noticed that the cpu_power is updated periodically
>> except when we have a lot of newly_idle events.
>
> We can certainly fix that.
>

That's good news.

>> Then, I have some use cases which have several running tasks but a low
>> cpu load. In this case, the small tasks are spread over several cpus by
>> the load_balance whereas they could easily be handled by one cpu
>> without a significant performance impact.
>
> That shouldn't be done using cpu_power, we have sched_smt_power_savings
> and sched_mc_power_savings for stuff like that.
>

sched_mc_power_savings works fine when we have more than 2 cpus but
can't be applied on a dual core because it needs at least 2 sched_groups,
and the nr_running of these sched_groups must be higher than 0 but
smaller than group_capacity, which is 1 on a dual-core system.

> Although I would really like to kill all those different
> sched_*_power_savings knobs and reduce it to one.
>
>> If the cpu_power is
>> higher than 1024, the cpu is no longer seen as out of capacity by the
>> load_balance as soon as a short process is running, and the main result
>> is that the small tasks will stay on the same cpu. This configuration
>> is mainly useful for ARM dual-core systems where we want to power gate
>> one cpu. I use cyclictest to simulate such a use case.
>
> Yeah, but that's wrong.

That's the only way I have found to gather small, unrelated tasks onto
one cpu. Do you know of any better solution?

>
>
Regards,
Vincent


* sched: ARM: arch_scale_freq_power
  2011-10-11  9:38         ` Amit Kucheria
@ 2011-10-11 10:03           ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2011-10-11 10:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-10-11 at 15:08 +0530, Amit Kucheria wrote:

> > That shouldn't be done using cpu_power, we have sched_smt_power_savings
> > and sched_mc_power_savings for stuff like that.
> 
> AFAICT, sched_mc assumes all cores have the same capacity - which is
> certainly true of the x86 architecture. But in ARM you can see hybrid
> cores[1] designed using different fab technology, so that some cores
> can run at 'n' GHz and some at 'm' GHz. The idea is that when there
> isn't much to do (e.g. periodic keep-alives for messaging, email, etc.)
> you don't wake up the higher power-consuming cores.
> 
> From TFA[1],  "Sheeva was already capable of 1.2GHz, but the new
> design can go up to 1.5GHz. But only two of the 628's Sheeva cores run
> at the full 1.5GHz. The third one is down-clocked to 624MHz, an
> interesting design choice that saves on power but adds some extra
> utility. In a sense, the 628 could be called a 2.5-core design."

Cute :-)

> Are we mistaken in thinking that sched_mc can not currently handle
> this usecase? How would we 'tune' sched_mc to do this w/o playing with
> cpu_power?

Yeah, sched_mc wants some TLC there.

> > Although I would really like to kill all those different
> > sched_*_power_savings knobs and reduce it to one.
> >
> >> If the cpu_power is
> >> higher than 1024, the cpu is no longer seen as out of capacity by the
> >> load_balance as soon as a short process is running, and the main result
> >> is that the small tasks will stay on the same cpu. This configuration
> >> is mainly useful for ARM dual-core systems where we want to power gate
> >> one cpu. I use cyclictest to simulate such a use case.
> >
> > Yeah, but that's wrong.
> 
> What is wrong - the use case simulation using cyclictest? Can you
> suggest better tools?

Using cpu_power to do power saving load-balancing like that.

So ideally cpu_power is simply a factor in the weight balance decision
such that:

	cpu_weight_i      cpu_weight_j
	------------  ~=  ------------
	cpu_power_i       cpu_power_j

This yields that under sufficient[*] load, e.g. 5 equal-weight tasks and
your 2.5-core thingy, you'd get 2:2:1.
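
(Quick check with assumed numbers: give the two fast cores cpu_power 1024 and
the slow one 512, then put 2 tasks of weight 1024 on each fast core and 1 on
the slow one; 2048/1024 == 2048/1024 == 1024/512, i.e. the weight/power
ratios are equal, hence the 2:2:1 split.)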

The decision on what to do on under-utilized systems should be separate
from this.

Currently the load-balancer doesn't know about 'short' running processes
at all; we just have nr_running and weight, and it doesn't know/care about
how long those will be around for.

Now for some of the cgroup crap we track a time-weighted weight average,
and pjt was talking about pulling that up into the normal code to get
rid of our multitude of different ways to calculate actual load. [**]

(/me pokes pjt with a sharp stick, where those patches at!?)

But that only gets you half-way there, you also need to compute an
effective time-weighted load per task to go with that... now while all
that is quite feasible, the problem is overhead. We very much already
are way too expensive and should be cutting back, not keep adding more
and more accounting.

[*] Sufficient such that the weight problem is feasible. E.g. 3 equal
tasks on 2 equal cores can never be statically balanced, 2 unequal tasks
on 2 equal cores (or v.v.) can't ever be balanced.

[**] I suspect this might solve the over-balancing problem triggered by
tasks woken from the tick that also does the load-balance pass. This
load-balance pass will run in softirq context and thus preempt all
those just-woken tasks, thus giving the impression the CPU is very busy,
while in fact most of those tasks will instantly go back to sleep after
finding nothing to do.
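
To make the 'time-weighted load per task' idea above a bit more concrete,
something like the below is what I mean -- a pure illustration with made-up
names, period and decay factor, not existing kernel code:

#include <linux/types.h>

#define LOAD_AVG_PERIOD_NS	1000000ULL	/* 1ms period, made up */

struct task_load_avg {
	u64 last_update_ns;	/* last time a full period was folded in */
	unsigned long load_avg;	/* decayed, weight-scaled contribution */
};

/*
 * Every elapsed period, halve the old contribution and add half the task's
 * weight if it was running, so recent runtime counts more than old runtime.
 */
static void update_task_load_avg(struct task_load_avg *la, unsigned long weight,
				 u64 now_ns, bool was_running)
{
	while (now_ns - la->last_update_ns >= LOAD_AVG_PERIOD_NS) {
		la->load_avg >>= 1;
		if (was_running)
			la->load_avg += weight >> 1;
		la->last_update_ns += LOAD_AVG_PERIOD_NS;
	}
}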


* sched: ARM: arch_scale_freq_power
  2011-10-11  9:40         ` Vincent Guittot
@ 2011-10-11 10:27           ` Peter Zijlstra
  2011-10-11 16:03             ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2011-10-11 10:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
> On 11 October 2011 11:13, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
> >> I have several goals. The 1st one is that I need to put more load on
> >> some cpus when I have packages with different cpu frequencies.
> >
> > That should be rather easy.
> >
> 
> I agree. I was mainly wondering if I should use a [1-1024] or a
> [1024-xxxx] range, and it seems that both can be used: SMT
> uses <1024 and x86 turbo mode uses >1024.

Well, turbo mode would typically only boost a cpu 25% or so, and only
while idling other cores to keep under its thermal limit. So it's not
sufficient to actually affect the capacity calculation much if at all.

> >> Then, I have some use cases which have several running tasks but a low
> >> cpu load. In this case, the small tasks are spread over several cpus by
> >> the load_balance whereas they could easily be handled by one cpu
> >> without a significant performance impact.
> >
> > That shouldn't be done using cpu_power, we have sched_smt_power_savings
> > and sched_mc_power_savings for stuff like that.
> >
> 
> sched_mc_power_savings works fine when we have more than 2 cpus but
> can't be applied on a dual core because it needs at least 2 sched_groups,
> and the nr_running of these sched_groups must be higher than 0 but
> smaller than group_capacity, which is 1 on a dual-core system.

SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the
capacity iirc. And I know some IBM dudes were toying with the idea of
playing tricks with the capacity numbers, but that never went anywhere.

> > Although I would really like to kill all those different
> > sched_*_power_savings knobs and reduce it to one.
> >
> >> If the cpu_power is
> >> higher than 1024, the cpu is no longer seen as out of capacity by the
> >> load_balance as soon as a short process is running, and the main result
> >> is that the small tasks will stay on the same cpu. This configuration
> >> is mainly useful for ARM dual-core systems where we want to power gate
> >> one cpu. I use cyclictest to simulate such a use case.
> >
> > Yeah, but that's wrong.
> 
> That's the only way I have found to gather small, unrelated tasks onto
> one cpu. Do you know of any better solution?

How do you know the task is 'small'?

For that you would need to track a time-weighted effective load average
of the task and we don't have that.

[ how bad is all this u64 math on ARM btw? and when will ARM finally
  agree all this 32bit nonsense is a waste of time and silicon? ]

But yeah, the whole nr_running vs capacity thing was traditionally to
deal with spreading single tasks around. And traditional power aware
scheduling was mostly about packing those on sockets (keeps other
sockets idle) instead of spreading them around sockets (optimizes
cache).

Now I wouldn't at all mind you ripping out all that
sched_*_power_savings crap and replacing it, I doubt it actually works
anyway. I haven't got many patches on the subject, and I know I don't
have the equipment to measure power usage.

Also, the few patches I got mostly made the sched_*_power_savings mess
bigger, which I refuse to do (what sysadmin wants a 27-state space
to configure his power-aware scheduling?). This has mostly made people go
away instead of fixing things up :-(

As to what the replacement would have to look like, dunno, it's not
something I've really thought much about, but maybe the time-weighted
stuff is the only sane approach, that combined with options on how to
spread tasks (core, socket, node, etc.).

I really think changing the load-balancer is the right way to go about
solving your power issue (hot-plugging a cpu really is an insane way to
idle a core) and I'm open to discussing what would work for you.

All I really ask is to not cobble something together, the load-balancer
is a horridly complex thing already and the last thing it needs is more
special cases that don't interact properly.


* sched: ARM: arch_scale_freq_power
  2011-10-11 10:27           ` Peter Zijlstra
@ 2011-10-11 16:03             ` Vincent Guittot
  2011-10-11 16:21               ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2011-10-11 16:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 October 2011 12:27, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
>> On 11 October 2011 11:13, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
>> >> I have several goals. The 1st one is that I need to put more load on
>> >> some cpus when I have packages with different cpu frequencies.
>> >
>> > That should be rather easy.
>> >
>>
>> I agree. I was mainly wondering if I should use a [1-1024] or a
>> [1024-xxxx] range, and it seems that both can be used: SMT
>> uses <1024 and x86 turbo mode uses >1024.
>
> Well, turbo mode would typically only boost a cpu 25% or so, and only
> while idling other cores to keep under its thermal limit. So it's not
> sufficient to actually affect the capacity calculation much if at all.
>

OK

>> >> Then, I have some use cases which have several running tasks but a low
>> >> cpu load. In this case, the small tasks are spread over several cpus by
>> >> the load_balance whereas they could easily be handled by one cpu
>> >> without a significant performance impact.
>> >
>> > That shouldn't be done using cpu_power, we have sched_smt_power_savings
>> > and sched_mc_power_savings for stuff like that.
>> >
>>
>> sched_mc_power_savings works fine when we have more than 2 cpus but
>> can't be applied on a dual core because it needs at least 2 sched_groups,
>> and the nr_running of these sched_groups must be higher than 0 but
>> smaller than group_capacity, which is 1 on a dual-core system.
>
> SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the
> capacity iirc. And I know some IBM dudes were toying with the idea of
> playing tricks with the capacity numbers, but that never went anywhere.
>

Yes, but it's only a special case for 2 tasks on a dual core, and the
SD_WAKE_AFFINE flag and cpu_idle_sibling can override this decision.

>> > Although I would really like to kill all those different
>> > sched_*_power_savings knobs and reduce it to one.
>> >
>> >> If the cpu_power is
>> >> higher than 1024, the cpu is no longer seen as out of capacity by the
>> >> load_balance as soon as a short process is running, and the main result
>> >> is that the small tasks will stay on the same cpu. This configuration
>> >> is mainly useful for ARM dual-core systems where we want to power gate
>> >> one cpu. I use cyclictest to simulate such a use case.
>> >
>> > Yeah, but that's wrong.
>>
>> That's the only way I have found to gather small, unrelated tasks onto
>> one cpu. Do you know of any better solution?
>
> How do you know the task is 'small'?
>

I want to use cpufreq to be notified that we have a large/small cpu
load. If we have several tasks but the cpu uses the lowest frequency,
it "should" mean that we have small tasks that are running (less than
20ms*95% of added duration) and we could gather them on one cpu (by
increasing the cpu_power on a dual core).

> For that you would need to track a time-weighted effective load average
> of the task and we don't have that.
>

Yes, that's why I use cpufreq until a better option, like a
time-weighted load average, is available.
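
To be concrete about the hack, this is roughly the kind of hook-up I am
experimenting with. Only the cpufreq transition notifier API itself is the
standard one; the per-cpu variable, the 2x factor and the way
arch_scale_freq_power() consumes it are my own assumptions for this sketch:

#include <linux/cpufreq.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(unsigned long, packing_cpu_power) = SCHED_POWER_SCALE;

/*
 * When the governor has settled at the lowest frequency, assume the running
 * tasks are "small" and advertise extra capacity on this cpu so that
 * load_balance stops spreading them (the 2x factor is arbitrary).
 */
static int packing_cpufreq_notifier(struct notifier_block *nb,
				    unsigned long event, void *data)
{
	struct cpufreq_freqs *freqs = data;
	struct cpufreq_policy *policy;

	if (event != CPUFREQ_POSTCHANGE)
		return NOTIFY_OK;

	policy = cpufreq_cpu_get(freqs->cpu);
	if (!policy)
		return NOTIFY_OK;

	per_cpu(packing_cpu_power, freqs->cpu) =
		(freqs->new <= policy->min) ? 2 * SCHED_POWER_SCALE
					    : SCHED_POWER_SCALE;
	cpufreq_cpu_put(policy);

	return NOTIFY_OK;
}

static struct notifier_block packing_cpufreq_nb = {
	.notifier_call = packing_cpufreq_notifier,
};

/*
 * Registered once from the arch setup code (assumed location):
 * cpufreq_register_notifier(&packing_cpufreq_nb, CPUFREQ_TRANSITION_NOTIFIER);
 * arch_scale_freq_power() then simply returns the cached per-cpu value.
 */
unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
	return per_cpu(packing_cpu_power, cpu);
}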

> [ how bad is all this u64 math on ARM btw? and when will ARM finally
>   agree all this 32bit nonsense is a waste of time and silicon? ]
>
> But yeah, the whole nr_running vs capacity thing was traditionally to
> deal with spreading single tasks around. And traditional power aware
> scheduling was mostly about packing those on sockets (keeps other
> sockets idle) instead of spreading them around sockets (optimizes
> cache).
>
> Now I wouldn't at all mind you ripping out all that
> sched_*_power_savings crap and replacing it, I doubt it actually works
> anyway. I haven't got many patches on the subject, and I know I don't
> have the equipment to measure power usage.
>
> Also, the few patches I got mostly made the sched_*_power_savings mess
> bigger, which I refuse to do (what sysadmin wants a 27-state space
> to configure his power-aware scheduling?). This has mostly made people go
> away instead of fixing things up :-(
>
> As to what the replacement would have to look like, dunno, it's not
> something I've really thought much about, but maybe the time-weighted
> stuff is the only sane approach, that combined with options on how to
> spread tasks (core, socket, node, etc.).
>
> I really think changing the load-balancer is the right way to go about
> solving your power issue (hot-plugging a cpu really is an insane way to
> idle a core) and I'm open to discussing what would work for you.
>

Great. My 1st goal was not to modify the load-balancer and sched_mc
(or to modify them as little as possible) and to study how I could tune the
scheduler parameters to get the best power consumption on an ARM platform.
Now, changing the load-balancer is probably a better solution.

> All I really ask is to not cobble something together, the load-balancer
> is a horridly complex thing already and the last thing it needs is more
> special cases that don't interact properly.
>
>
>


* sched: ARM: arch_scale_freq_power
  2011-10-11 16:03             ` Vincent Guittot
@ 2011-10-11 16:21               ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2011-10-11 16:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-10-11 at 18:03 +0200, Vincent Guittot wrote:
> > How do you know the task is 'small'?
> >
> 
> I want to use cpufreq to be notified that we have a large/small cpu
> load. If we have several tasks but the cpu uses the lowest frequency,
> it "should" mean that we have small tasks that are running (less than
> 20ms*95% of added duration) and we could gather them on one cpu (by
> increasing the cpu_power on a dual core).
> 
> > For that you would need to track a time-weighted effective load average
> > of the task and we don't have that.
> >
> 
> Yes, that's why I use cpufreq until a better option, like a
> time-weighted load average, is available.

Egads... so basically you're (ab)using the ondemand cpufreq stats to get
a guesstimate of the time-weighted load of the cpu, and then (ab)using the
scheduler cpufreq hook to pump up its capacity numbers.

No cookies for you.


Thread overview: 11+ messages
2011-10-06 11:36 sched: ARM: arch_scale_freq_power Vincent Guittot
2011-10-11  7:16 ` Amit Kucheria
2011-10-11  7:57   ` Peter Zijlstra
2011-10-11  8:51     ` Vincent Guittot
2011-10-11  9:13       ` Peter Zijlstra
2011-10-11  9:38         ` Amit Kucheria
2011-10-11 10:03           ` Peter Zijlstra
2011-10-11  9:40         ` Vincent Guittot
2011-10-11 10:27           ` Peter Zijlstra
2011-10-11 16:03             ` Vincent Guittot
2011-10-11 16:21               ` Peter Zijlstra
