* Re: [patch v7 0/21] sched: power aware scheduling
  [not found] <1365040862-8390-1-git-send-email-alex.shi@intel.com>
@ 2013-04-11 21:02 ` Len Brown
  2013-04-12  8:46   ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Len Brown @ 2013-04-11 21:02 UTC (permalink / raw)
To: Alex Shi
Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
    morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar,
    linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams,
    tony.luck, keescook, mgorman, riel, Linux PM list

On 04/03/2013 10:00 PM, Alex Shi wrote:

> As mentioned in the power aware scheduling proposal, power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, fewer active sched groups will reduce cpu power consumption

linux-pm@vger.kernel.org should be cc: on Linux proposals that affect power.

> Since the patch can pack tasks into fewer groups perfectly, I just show
> some performance/power testing data here:
> =========================================
> $for ((i = 0; i < x; i++)) ; do while true; do :; done & done
>
> On my SNB laptop with 4 cores * HT: the data is avg Watts
>		powersaving	performance
> x = 8		72.9482		72.6702
> x = 4		61.2737		66.7649
> x = 2		44.8491		59.0679
> x = 1		43.225		43.0638
>
> on SNB EP machine with 2 sockets * 8 cores * HT:
>		powersaving	performance
> x = 32	393.062		395.134
> x = 16	277.438		376.152
> x = 8		209.33		272.398
> x = 4		199		238.309
> x = 2		175.245		210.739
> x = 1		174.264		173.603

The numbers above say nothing about performance, and thus don't tell
us much.  In particular, they don't tell us if reducing power by
hacking the scheduler is more or less efficient than using the
existing techniques that are already shipping, such as controlling
P-states.

> the task number keeps varying in this benchmark, 'make -j <x> vmlinux'
> on my SNB EP 2 sockets machine with 8 cores * HT:
>		powersaving		performance
> x = 2		189.416 /228 23		193.355 /209 24

Energy = Power * Time
189.416*228 = 43186.848 Joules for powersaving to retire the workload
193.355*209 = 40411.195 Joules for performance to retire the workload.

So the net effect of the 'powersaving' mode here is:
1. 228/209 = 9% performance degradation
2. 43186.848/40411.195 = 6.9% more energy to retire the workload.

These numbers suggest that this patch series simultaneously
has a negative impact on performance and energy required
to retire the workload.  Why do it?

> x = 4		215.728 /132 35		219.69 /122 37

ditto here.
8% increase in time.
6% increase in energy.

> x = 8		244.31 /75 54		252.709 /68 58

ditto here.
10% increase in time.
6% increase in energy.

> x = 16	299.915 /43 77		259.127 /58 66

Are you sure that powersave mode ran in 43 seconds
when performance mode ran in 58 seconds?

If that is true, then somewhere in this patch series
you have a _significant_ performance benefit
on this workload under these conditions!

Interestingly, powersave mode also ran at
15% higher power than performance mode.
Maybe "powersave" isn't quite the right name for it:-)

> x = 32	341.221 /35 83		323.418 /38 81

Why does this patch series have a performance impact (8%)
at x = 32?  All the processors are always busy, no?

> data explanation: 189.416 /228 23
>	189.416: average Watts during compilation
>	228: seconds (compile time)
>	23: scaled performance/watts = 1000000 / seconds / watts
> The performance value of kbuild is better at threads 16/32; that's due
> to the lazy power balance reducing context switches, and the CPU having
> more boost chance under powersaving balance.

25% is a huge difference in performance.
Can you get a performance benefit in that scenario
without having a negative performance impact
in the other scenarios?  In particular,
an 8% hit to the fully utilized case is a deal killer.

The x=16 performance change here suggests there is value
someplace in this patch series to increase performance.
However, the case that these scheduling changes are
a benefit from an energy efficiency point of view
is yet to be made.

thanks,
-Len Brown
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread
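For reference, Len's energy arithmetic and the quoted "scaled
performance/watts" metric can be reproduced with a minimal standalone
sketch; the numbers come from the x = 2 kbuild row above, and the
helper names are illustrative, not from the patch set:

#include <stdio.h>

/* Energy (J) = average power (W) * time to retire the workload (s) */
static double energy_joules(double avg_watts, double seconds)
{
	return avg_watts * seconds;
}

/* The quoted metric: scaled performance/watts = 1000000 / seconds / watts */
static double scaled_perf_per_watt(double avg_watts, double seconds)
{
	return 1000000.0 / seconds / avg_watts;
}

int main(void)
{
	/* x = 2 kbuild row: powersaving vs performance policy */
	double e_save = energy_joules(189.416, 228);	/* 43186.848 J */
	double e_perf = energy_joules(193.355, 209);	/* 40411.195 J */

	/* 23.15 and 24.75; the table truncates these to 23 and 24 */
	printf("powersaving: %.3f J, perf/W %.2f\n",
	       e_save, scaled_perf_per_watt(189.416, 228));
	printf("performance: %.3f J, perf/W %.2f\n",
	       e_perf, scaled_perf_per_watt(193.355, 209));
	printf("powersaving energy overhead: %.1f%%\n",
	       (e_save / e_perf - 1) * 100);	/* ~6.9% */
	return 0;
}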
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-11 21:02 ` [patch v7 0/21] sched: power aware scheduling Len Brown
@ 2013-04-12  8:46   ` Alex Shi
  2013-04-12 16:23     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-04-12  8:46 UTC (permalink / raw)
To: Len Brown
Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
    morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar,
    linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams,
    tony.luck, keescook, mgorman, riel, Linux PM list

On 04/12/2013 05:02 AM, Len Brown wrote:
>> x = 16	299.915 /43 77		259.127 /58 66
> Are you sure that powersave mode ran in 43 seconds
> when performance mode ran in 58 seconds?

Thanks a lot for the comments, Len! I will do more testing with your
fspin tool. :)

Powersaving uses less time when threads = 16 or 32. The main
contribution comes from the CPU frequency boost. I disabled cpufreq
boost and found the compile time becomes similar between powersaving
and performance at 32 threads, while powersaving is slower at 16
threads. Fewer context switches from the lazy power balance should
also help.

>
> If that is true, then somewhere in this patch series
> you have a _significant_ performance benefit
> on this workload under these conditions!
>
> Interestingly, powersave mode also ran at
> 15% higher power than performance mode.
> Maybe "powersave" isn't quite the right name for it:-)

What other name would you suggest? :)

>
>> x = 32	341.221 /35 83		323.418 /38 81
> Why does this patch series have a performance impact (8%)
> at x = 32?  All the processors are always busy, no?

No, not all processors are always busy in 'make -j <x> vmlinux'.
So compile time also benefits from boost and fewer context switches.
The performance policy doesn't introduce any impact; nothing is added
on the performance policy side.

>
>> data explanation: 189.416 /228 23
>>	189.416: average Watts during compilation
>>	228: seconds (compile time)
>>	23: scaled performance/watts = 1000000 / seconds / watts
>> The performance value of kbuild is better at threads 16/32; that's due
>> to the lazy power balance reducing context switches, and the CPU having
>> more boost chance under powersaving balance.
> 25% is a huge difference in performance.
> Can you get a performance benefit in that scenario
> without having a negative performance impact
> in the other scenarios?  In particular,

I will try packing tasks based on cpu capacity, not cpu weight.

> an 8% hit to the fully utilized case is a deal killer.

That is an 8% gain for powersaving, not an 8% loss for the
performance policy. :)

>
> The x=16 performance change here suggests there is value
> someplace in this patch series to increase performance.
> However, the case that these scheduling changes are
> a benefit from an energy efficiency point of view
> is yet to be made.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-12 8:46 ` Alex Shi @ 2013-04-12 16:23 ` Borislav Petkov 2013-04-12 16:48 ` Mike Galbraith 2013-04-14 1:28 ` Alex Shi 0 siblings, 2 replies; 30+ messages in thread From: Borislav Petkov @ 2013-04-12 16:23 UTC (permalink / raw) To: Alex Shi Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, efault, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote: > Thanks a lot for comments, Len! AFAICT, you kinda forgot to answer his most important question: > These numbers suggest that this patch series simultaneously > has a negative impact on performance and energy required > to retire the workload. Why do it? -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-12 16:23 ` Borislav Petkov @ 2013-04-12 16:48 ` Mike Galbraith 2013-04-12 17:12 ` Borislav Petkov 2013-04-17 21:53 ` Len Brown 2013-04-14 1:28 ` Alex Shi 1 sibling, 2 replies; 30+ messages in thread From: Mike Galbraith @ 2013-04-12 16:48 UTC (permalink / raw) To: Borislav Petkov Cc: Alex Shi, Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote: > On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote: > > Thanks a lot for comments, Len! > > AFAICT, you kinda forgot to answer his most important question: > > > These numbers suggest that this patch series simultaneously > > has a negative impact on performance and energy required > > to retire the workload. Why do it? Hm. When I tested AIM7 compute on a NUMA box, there was a marked throughput increase at the low to moderate load end of the test spectrum IIRC. Fully repeatable. There were also other benefits unrelated to power, ie mitigation of the evil face of select_idle_sibling(). I rather liked what I saw during ~big box test-drive. (just saying there are other aspects besides joules in there) -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-12 16:48 ` Mike Galbraith @ 2013-04-12 17:12 ` Borislav Petkov 2013-04-14 1:36 ` Alex Shi 2013-04-17 21:53 ` Len Brown 1 sibling, 1 reply; 30+ messages in thread From: Borislav Petkov @ 2013-04-12 17:12 UTC (permalink / raw) To: Mike Galbraith Cc: Alex Shi, Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Fri, Apr 12, 2013 at 06:48:31PM +0200, Mike Galbraith wrote: > (just saying there are other aspects besides joules in there) Yeah, but we don't allow any regressions in sched*, do we? Can we pick only the good cherries? :-) -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-12 17:12 ` Borislav Petkov
@ 2013-04-14  1:36   ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2013-04-14  1:36 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mike Galbraith, Len Brown, mingo, peterz, tglx, akpm, arjan, pjt,
    namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti,
    viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
    clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/13/2013 01:12 AM, Borislav Petkov wrote:
> On Fri, Apr 12, 2013 at 06:48:31PM +0200, Mike Galbraith wrote:
>> (just saying there are other aspects besides joules in there)
>
> Yeah, but we don't allow any regressions in sched*, do we? Can we pick
> only the good cherries? :-)
>

Thanks for all the discussion on this thread. :)

I think we can bear a small power-efficiency loss when we want power
saving.

For the second question: the performance increase comes from the CPU
boost feature. As the hardware defines it, if some cores in a CPU
socket are idle, the other cores have more chance to boost to a higher
frequency. The task packing tries to pack tasks so that more cores are
left idle.

The difficulty in merging this feature into the current performance
policy is that the current balance policy tries to give each task as
many CPU resources as possible, which directly conflicts with the
condition needed for CPU boost.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
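A minimal sketch of the packing idea described above, assuming a
simplified model in which each sched group reports a utilization
estimate and a capacity-derived threshold; the struct and function
names are illustrative, not the patch set's actual API:

#include <stddef.h>

/* Illustrative only: one sched group's utilization estimate and the
 * packing threshold derived from its compute capacity. */
struct group_stats {
	unsigned int util;	/* summed utilization of the group's runqueues */
	unsigned int threshold;	/* capacity the group can offer fair tasks */
};

/*
 * Powersaving placement: put a waking task into the first group that
 * still has room below its threshold, so the remaining groups stay
 * fully idle -- their cores can sleep, and busy cores elsewhere get
 * more turbo headroom.  Return -1 to say "no room anywhere, let the
 * performance-oriented balancer take over".
 */
static int pick_packing_group(const struct group_stats *groups, size_t n,
			      unsigned int task_util)
{
	for (size_t i = 0; i < n; i++)
		if (groups[i].util + task_util <= groups[i].threshold)
			return (int)i;
	return -1;
}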
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-12 16:48 ` Mike Galbraith 2013-04-12 17:12 ` Borislav Petkov @ 2013-04-17 21:53 ` Len Brown 2013-04-18 1:51 ` Mike Galbraith 2013-04-26 15:11 ` Mike Galbraith 1 sibling, 2 replies; 30+ messages in thread From: Len Brown @ 2013-04-17 21:53 UTC (permalink / raw) To: Mike Galbraith Cc: Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On 04/12/2013 12:48 PM, Mike Galbraith wrote: > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote: >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote: >>> Thanks a lot for comments, Len! >> >> AFAICT, you kinda forgot to answer his most important question: >> >>> These numbers suggest that this patch series simultaneously >>> has a negative impact on performance and energy required >>> to retire the workload. Why do it? > > Hm. When I tested AIM7 compute on a NUMA box, there was a marked > throughput increase at the low to moderate load end of the test spectrum > IIRC. Fully repeatable. There were also other benefits unrelated to > power, ie mitigation of the evil face of select_idle_sibling(). I > rather liked what I saw during ~big box test-drive. > > (just saying there are other aspects besides joules in there) Mike, Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled, and then independently re-enable them? If you still see the performance benefit, then that proves that the scheduler hacks are not about tricking into turbo mode, but something else. If the performance gains *are* about interactions with turbo-mode, then perhaps what we should really be doing here is making the scheduler explicitly turbo-aware? Of course, that begs the question of how the scheduler should be aware of cpufreq in general... thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-17 21:53 ` Len Brown @ 2013-04-18 1:51 ` Mike Galbraith 2013-04-26 15:11 ` Mike Galbraith 1 sibling, 0 replies; 30+ messages in thread From: Mike Galbraith @ 2013-04-18 1:51 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Wed, 2013-04-17 at 17:53 -0400, Len Brown wrote: > On 04/12/2013 12:48 PM, Mike Galbraith wrote: > > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote: > >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote: > >>> Thanks a lot for comments, Len! > >> > >> AFAICT, you kinda forgot to answer his most important question: > >> > >>> These numbers suggest that this patch series simultaneously > >>> has a negative impact on performance and energy required > >>> to retire the workload. Why do it? > > > > Hm. When I tested AIM7 compute on a NUMA box, there was a marked > > throughput increase at the low to moderate load end of the test spectrum > > IIRC. Fully repeatable. There were also other benefits unrelated to > > power, ie mitigation of the evil face of select_idle_sibling(). I > > rather liked what I saw during ~big box test-drive. > > > > (just saying there are other aspects besides joules in there) > > Mike, > > Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled, > and then independently re-enable them? Unfortunately no, because I don't have remote access to buttons. > If you still see the performance benefit, then that proves > that the scheduler hacks are not about tricking into > turbo mode, but something else. Yeah, turbo playing a role in that makes lots of sense. Someone else will have to test that though. It was 100% repeatable, so should be easy to verify. -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-17 21:53 ` Len Brown 2013-04-18 1:51 ` Mike Galbraith @ 2013-04-26 15:11 ` Mike Galbraith 2013-04-30 5:16 ` Mike Galbraith 1 sibling, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2013-04-26 15:11 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Wed, 2013-04-17 at 17:53 -0400, Len Brown wrote: > On 04/12/2013 12:48 PM, Mike Galbraith wrote: > > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote: > >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote: > >>> Thanks a lot for comments, Len! > >> > >> AFAICT, you kinda forgot to answer his most important question: > >> > >>> These numbers suggest that this patch series simultaneously > >>> has a negative impact on performance and energy required > >>> to retire the workload. Why do it? > > > > Hm. When I tested AIM7 compute on a NUMA box, there was a marked > > throughput increase at the low to moderate load end of the test spectrum > > IIRC. Fully repeatable. There were also other benefits unrelated to > > power, ie mitigation of the evil face of select_idle_sibling(). I > > rather liked what I saw during ~big box test-drive. > > > > (just saying there are other aspects besides joules in there) > > Mike, > > Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled, > and then independently re-enable them? > > If you still see the performance benefit, then that proves > that the scheduler hacks are not about tricking into > turbo mode, but something else. I did that today, neither turbo nor HT affected the performance gain. I used the same box and patch set as tested before (v4), but plugged into linus HEAD. "powersaving" AIM7 numbers are ~identical to those I posted before, "performance" is lower at the low end of AIM7 test spectrum, but as before, delta goes away once the load becomes hefty. -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-26 15:11 ` Mike Galbraith @ 2013-04-30 5:16 ` Mike Galbraith 2013-04-30 8:30 ` Mike Galbraith 0 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2013-04-30 5:16 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Fri, 2013-04-26 at 17:11 +0200, Mike Galbraith wrote: > On Wed, 2013-04-17 at 17:53 -0400, Len Brown wrote: > > On 04/12/2013 12:48 PM, Mike Galbraith wrote: > > > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote: > > >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote: > > >>> Thanks a lot for comments, Len! > > >> > > >> AFAICT, you kinda forgot to answer his most important question: > > >> > > >>> These numbers suggest that this patch series simultaneously > > >>> has a negative impact on performance and energy required > > >>> to retire the workload. Why do it? > > > > > > Hm. When I tested AIM7 compute on a NUMA box, there was a marked > > > throughput increase at the low to moderate load end of the test spectrum > > > IIRC. Fully repeatable. There were also other benefits unrelated to > > > power, ie mitigation of the evil face of select_idle_sibling(). I > > > rather liked what I saw during ~big box test-drive. > > > > > > (just saying there are other aspects besides joules in there) > > > > Mike, > > > > Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled, > > and then independently re-enable them? > > > > If you still see the performance benefit, then that proves > > that the scheduler hacks are not about tricking into > > turbo mode, but something else. > > I did that today, neither turbo nor HT affected the performance gain. I > used the same box and patch set as tested before (v4), but plugged into > linus HEAD. "powersaving" AIM7 numbers are ~identical to those I posted > before, "performance" is lower at the low end of AIM7 test spectrum, but > as before, delta goes away once the load becomes hefty. Well now, that's not exactly what I expected to see for AIM7 compute. Filesystem is munching cycles otherwise used for compute when load is spread across the whole box vs consolidated. 
performance

 PerfTop: 35 irqs/sec  kernel:94.3%  exact: 0.0% [1000Hz cycles],  (all, 80 CPUs)
----------------------------------------------------------------------------------

 samples  pcnt  function                        DSO
 _______  _____ ______________________________  ________________________________________

 9367.00  15.5% jbd2_journal_put_journal_head   /lib/modules/3.9.0-default/build/vmlinux
 7658.00  12.7% jbd2_journal_add_journal_head   /lib/modules/3.9.0-default/build/vmlinux
 7042.00  11.7% jbd2_journal_grab_journal_head  /lib/modules/3.9.0-default/build/vmlinux
 4433.00   7.4% sieve                           /abuild/mike/aim7/multitask
 3248.00   5.4% jbd_lock_bh_state               /lib/modules/3.9.0-default/build/vmlinux
 3034.00   5.0% do_get_write_access             /lib/modules/3.9.0-default/build/vmlinux
 2058.00   3.4% mul_double                      /abuild/mike/aim7/multitask
 2038.00   3.4% add_double                      /abuild/mike/aim7/multitask
 1365.00   2.3% native_write_msr_safe           /lib/modules/3.9.0-default/build/vmlinux
 1333.00   2.2% __find_get_block                /lib/modules/3.9.0-default/build/vmlinux
 1213.00   2.0% add_long                        /abuild/mike/aim7/multitask
 1208.00   2.0% add_int                         /abuild/mike/aim7/multitask
 1084.00   1.8% __wait_on_bit_lock              /lib/modules/3.9.0-default/build/vmlinux
 1065.00   1.8% div_double                      /abuild/mike/aim7/multitask
  901.00   1.5% intel_idle                      /lib/modules/3.9.0-default/build/vmlinux
  812.00   1.3% _raw_spin_lock_irqsave          /lib/modules/3.9.0-default/build/vmlinux
  559.00   0.9% jbd2_journal_dirty_metadata     /lib/modules/3.9.0-default/build/vmlinux
  464.00   0.8% copy_user_generic_string        /lib/modules/3.9.0-default/build/vmlinux
  455.00   0.8% div_int                         /abuild/mike/aim7/multitask
  430.00   0.7% string_rtns_1                   /abuild/mike/aim7/multitask
  419.00   0.7% strncat                         /lib64/libc-2.11.3.so
  412.00   0.7% wake_bit_function               /lib/modules/3.9.0-default/build/vmlinux
  347.00   0.6% jbd2_journal_cancel_revoke      /lib/modules/3.9.0-default/build/vmlinux
  346.00   0.6% ext4_mark_iloc_dirty            /lib/modules/3.9.0-default/build/vmlinux
  306.00   0.5% __brelse                        /lib/modules/3.9.0-default/build/vmlinux

powersaving

 PerfTop: 59 irqs/sec  kernel:78.0%  exact: 0.0% [1000Hz cycles],  (all, 80 CPUs)
----------------------------------------------------------------------------------

 samples  pcnt  function                        DSO
 _______  _____ ______________________________  ________________________________________

 6383.00  22.5% sieve                           /abuild/mike/aim7/multitask
 2380.00   8.4% mul_double                      /abuild/mike/aim7/multitask
 2375.00   8.4% add_double                      /abuild/mike/aim7/multitask
 1678.00   5.9% add_long                        /abuild/mike/aim7/multitask
 1633.00   5.8% add_int                         /abuild/mike/aim7/multitask
 1338.00   4.7% div_double                      /abuild/mike/aim7/multitask
  770.00   2.7% strncat                         /lib64/libc-2.11.3.so
  698.00   2.5% string_rtns_1                   /abuild/mike/aim7/multitask
  678.00   2.4% copy_user_generic_string        /lib/modules/3.9.0-default/build/vmlinux
  569.00   2.0% div_int                         /abuild/mike/aim7/multitask
  329.00   1.2% jbd2_journal_put_journal_head   /lib/modules/3.9.0-default/build/vmlinux
  306.00   1.1% array_rtns                      /abuild/mike/aim7/multitask
  298.00   1.1% do_get_write_access             /lib/modules/3.9.0-default/build/vmlinux
  270.00   1.0% jbd2_journal_add_journal_head   /lib/modules/3.9.0-default/build/vmlinux
  258.00   0.9% _int_malloc                     /lib64/libc-2.11.3.so
  251.00   0.9% __find_get_block                /lib/modules/3.9.0-default/build/vmlinux
  236.00   0.8% __memset                        /lib/modules/3.9.0-default/build/vmlinux
  224.00   0.8% jbd2_journal_grab_journal_head  /lib/modules/3.9.0-default/build/vmlinux
  221.00   0.8% intel_idle                      /lib/modules/3.9.0-default/build/vmlinux
  161.00   0.6% jbd_lock_bh_state               /lib/modules/3.9.0-default/build/vmlinux
  161.00   0.6% start_this_handle               /lib/modules/3.9.0-default/build/vmlinux
  153.00   0.5% __GI_memset                     /lib64/libc-2.11.3.so
  147.00   0.5% ext4_do_update_inode            /lib/modules/3.9.0-default/build/vmlinux
  135.00   0.5% jbd2_journal_stop               /lib/modules/3.9.0-default/build/vmlinux
  123.00   0.4% jbd2_journal_dirty_metadata     /lib/modules/3.9.0-default/build/vmlinux

performance
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b  swpd     free   buff  cache  si so bi   bo    in     cs us sy  id wa st
14  7  0  47716456 255124 674808   0  0  0    0   6183  93733  1  3  95  1  0
 0  0  0  47791912 255152 602068   0  0  0 2671  14526  49606  2  2  94  1  0
 1  0  0  47794384 255152 603796   0  0  0    0     68    111  0  0 100  0  0
 8  6  0  47672340 255156 730040   0  0  0    0  36249 103961  2  8  86  4  0
 0  0  0  47793976 255216 604616   0  0  0 2686   5322   6379  2  1  97  0  0
 0  0  0  47799128 255216 603108   0  0  0    0     62    106  0  0 100  0  0
 3  0  0  47795972 255300 603136   0  0  0 2626  39115 146228  3  5  88  3  0
 0  0  0  47797176 255300 603284   0  0  0   43    128    216  0  0 100  0  0
 0  0  0  47803244 255300 602580   0  0  0    0     78    124  0  0 100  0  0
 0  0  0  47789120 255336 603940   0  0  0 2676  14085  85798  3  3  92  1  0

powersaving
 0  0  0  47820780 255516 590292   0  0  0   31     81    126  0  0 100  0  0
 0  0  0  47823712 255516 589376   0  0  0    0    107    190  0  0 100  0  0
 0  0  0  47826608 255516 588060   0  0  0    0     76    130  0  0 100  0  0
 0  0  0  47811260 255632 602080   0  0  0 2678    106    200  0  0 100  0  0
 0  0  0  47812548 255632 601892   0  0  0    0     69    110  0  0 100  0  0
 0  0  0  47808284 255680 604400   0  0  0 2668   1588   3451  4  2  94  0  0
 0  0  0  47810300 255680 603624   0  0  0    0     77    124  0  0 100  0  0
20  3  0  47760764 255720 643744   0  0  1    0    948   2817  2  1  97  0  0
 0  0  0  47817828 255756 602400   0  0  1 2703    984    797  2  0  98  0  0
 0  0  0  47819548 255756 602532   0  0  0    0     93    158  0  0 100  0  0
 1  0  0  47819312 255792 603080   0  0  0 2661   1774   3348  4  2  94  0  0
 0  0  0  47821912 255800 602608   0  0  0    2     66    107  0  0 100  0  0

Invisible ink is pretty expensive stuff.

-Mike

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-30 5:16 ` Mike Galbraith @ 2013-04-30 8:30 ` Mike Galbraith 2013-04-30 8:41 ` Ingo Molnar 0 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2013-04-30 8:30 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Tue, 2013-04-30 at 07:16 +0200, Mike Galbraith wrote: > Well now, that's not exactly what I expected to see for AIM7 compute. > Filesystem is munching cycles otherwise used for compute when load is > spread across the whole box vs consolidated. So AIM7 compute performance delta boils down to: powersaving stacks tasks, so they pat single bit of spinning rust sequentially/gently. -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-30 8:30 ` Mike Galbraith @ 2013-04-30 8:41 ` Ingo Molnar 2013-04-30 9:35 ` Mike Galbraith 0 siblings, 1 reply; 30+ messages in thread From: Ingo Molnar @ 2013-04-30 8:41 UTC (permalink / raw) To: Mike Galbraith Cc: Len Brown, Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list * Mike Galbraith <bitbucket@online.de> wrote: > On Tue, 2013-04-30 at 07:16 +0200, Mike Galbraith wrote: > > > Well now, that's not exactly what I expected to see for AIM7 compute. > > Filesystem is munching cycles otherwise used for compute when load is > > spread across the whole box vs consolidated. > > So AIM7 compute performance delta boils down to: powersaving stacks > tasks, so they pat single bit of spinning rust sequentially/gently. So AIM7 with real block IO improved, due to sequentiality. Does it improve if AIM7 works on an SSD, or into ramdisk? Which are the workloads where 'powersaving' mode hurts workload performance measurably? Thanks, Ingo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-30  8:41 ` Ingo Molnar
@ 2013-04-30  9:35   ` Mike Galbraith
  2013-04-30  9:49     ` Mike Galbraith
  0 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2013-04-30  9:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Len Brown, Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm,
    arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh,
    preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki,
    jkosina, clark.williams, tony.luck, keescook, mgorman, riel,
    Linux PM list

On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote:
> * Mike Galbraith <bitbucket@online.de> wrote:
>
>> On Tue, 2013-04-30 at 07:16 +0200, Mike Galbraith wrote:
>>
>>> Well now, that's not exactly what I expected to see for AIM7 compute.
>>> Filesystem is munching cycles otherwise used for compute when load is
>>> spread across the whole box vs consolidated.
>>
>> So AIM7 compute performance delta boils down to: powersaving stacks
>> tasks, so they pat single bit of spinning rust sequentially/gently.
>
> So AIM7 with real block IO improved, due to sequentiality. Does it improve
> if AIM7 works on an SSD, or into ramdisk?

Seriously doubt it, but I suppose I can try tmpfs.

performance
Tasks   jobs/min  jti  jobs/min/task    real    cpu
   20   11170.51   99      558.5253   10.85   15.19   Tue Apr 30 11:21:46 2013
   20   11078.61   99      553.9305   10.94   15.59   Tue Apr 30 11:21:57 2013
   20   11191.14   99      559.5568   10.83   15.29   Tue Apr 30 11:22:08 2013

powersaving
Tasks   jobs/min  jti  jobs/min/task    real    cpu
   20   10978.26   99      548.9130   11.04   19.25   Tue Apr 30 11:22:38 2013
   20   10988.21   99      549.4107   11.03   18.71   Tue Apr 30 11:22:49 2013
   20   11008.17   99      550.4087   11.01   18.85   Tue Apr 30 11:23:00 2013

Nope.

> Which are the workloads where 'powersaving' mode hurts workload
> performance measurably?

Well, it'll lose throughput any time there's parallel execution
potential but it's serialized instead.. using average will inevitably
stack tasks sometimes, but that's its goal.  Hackbench shows it.

performance
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.487
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.487
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.497

powersaving
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.702
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.679
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 1.137

-Mike

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-30 9:35 ` Mike Galbraith @ 2013-04-30 9:49 ` Mike Galbraith 2013-04-30 9:56 ` Mike Galbraith 0 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2013-04-30 9:49 UTC (permalink / raw) To: Ingo Molnar Cc: Len Brown, Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Tue, 2013-04-30 at 11:35 +0200, Mike Galbraith wrote: > On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote: > > Which are the workloads where 'powersaving' mode hurts workload > > performance measurably? > > Well, it'll lose throughput any time there's parallel execution > potential but it's serialized instead.. using average will inevitably > stack tasks sometimes, but that's its goal. Hackbench shows it. (but that consolidation can be a winner too, and I bet a nickle it would be for a socket sized pgbench run) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling 2013-04-30 9:49 ` Mike Galbraith @ 2013-04-30 9:56 ` Mike Galbraith 2013-05-17 8:06 ` Preeti U Murthy 0 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2013-04-30 9:56 UTC (permalink / raw) To: Ingo Molnar Cc: Len Brown, Borislav Petkov, Alex Shi, mingo, peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen, vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list On Tue, 2013-04-30 at 11:49 +0200, Mike Galbraith wrote: > On Tue, 2013-04-30 at 11:35 +0200, Mike Galbraith wrote: > > On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote: > > > > Which are the workloads where 'powersaving' mode hurts workload > > > performance measurably? > > > > Well, it'll lose throughput any time there's parallel execution > > potential but it's serialized instead.. using average will inevitably > > stack tasks sometimes, but that's its goal. Hackbench shows it. > > (but that consolidation can be a winner too, and I bet a nickle it would > be for a socket sized pgbench run) (belay that, was thinking of keeping all tasks on a single node, but it'll likely stack the whole thing on a CPU or two, if so, it'll hurt) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-30  9:56 ` Mike Galbraith
@ 2013-05-17  8:06   ` Preeti U Murthy
  2013-05-20  1:01     ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Preeti U Murthy @ 2013-05-17  8:06 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Len Brown, Borislav Petkov, Alex Shi, mingo, peterz,
    tglx, akpm, arjan, pjt, namhyung, morten.rasmussen,
    vincent.guittot, gregkh, viresh.kumar, linux-kernel, len.brown,
    rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook,
    mgorman, riel, Linux PM list

On 04/30/2013 03:26 PM, Mike Galbraith wrote:
> On Tue, 2013-04-30 at 11:49 +0200, Mike Galbraith wrote:
>> On Tue, 2013-04-30 at 11:35 +0200, Mike Galbraith wrote:
>>> On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote:
>>
>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>> performance measurably?

I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
The power efficiency drops significantly with the powersaving policy of
this patch, compared to the power efficiency of the scheduler without
this patch.

The below parameters are measured relative to the default scheduler
behaviour.

A: Drop in power efficiency with the patch+powersaving policy
B: Drop in performance with the patch+powersaving policy
C: Decrease in power consumption with the patch+powersaving policy

NumThreads	A	B	C
-----------------------------------------
2		33%	36%	4%
4		31%	33%	3%
8		28%	30%	3%
16		31%	33%	4%

Each of the above runs is for 30s.

On investigating socket utilization, I found that only 1 socket was
being used during all the above threaded runs. As can be guessed, this
is due to the group_weight being considered for the threshold metric.
This stacks up tasks on a core, and further on a socket, thus
throttling them, as observed by Mike below.

I therefore think we must switch to group_capacity as the metric for
the threshold, and use only (rq->utils * nr_running) for the
group_utils calculation during non-bursty wakeup scenarios.
This way we are comparing the right quantities: the utilization of the
runqueue by the fair tasks, against the cpu capacity available to them
after the rt tasks have consumed their share.

After I made the above modification, all of the above three parameters
came out nearly null. However, I am still observing the load balancing
of the scheduler with the patch and powersaving policy enabled: it
behaves very close to the default scheduler (spreading tasks across
sockets). That also explains why there is no performance drop or gain
with the patch+powersaving policy enabled. I will look into this
observation and report back.

>>>
>>> Well, it'll lose throughput any time there's parallel execution
>>> potential but it's serialized instead.. using average will inevitably
>>> stack tasks sometimes, but that's its goal.  Hackbench shows it.
>>
>> (but that consolidation can be a winner too, and I bet a nickle it would
>> be for a socket sized pgbench run)
>
> (belay that, was thinking of keeping all tasks on a single node, but
> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)

At this point, I would like to raise one issue.
*Is the goal of the power-aware scheduler to improve the power
efficiency of the scheduler, or to accept a compromise on power
efficiency in exchange for a definite decrease in power consumption,
since it is the user who has decided to prioritise lower power
consumption over performance?*

Regards
Preeti U Murthy

^ permalink raw reply	[flat|nested] 30+ messages in thread
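A minimal sketch of the comparison Preeti proposes above, assuming
per-runqueue utilization statistics and a cpu_power value that already
has rt time deducted; the names are illustrative, not the patch set's
actual fields:

/* Illustrative per-runqueue statistics: utilization and capacity are
 * scaled so that 1024 == one fully busy CPU. */
struct rq_stats {
	unsigned int util;		/* average utilization by fair tasks */
	unsigned int nr_running;	/* fair tasks on the runqueue */
	unsigned int cpu_power;		/* capacity left after rt consumption */
};

/*
 * Threshold check along the lines Preeti suggests: compare the
 * fair-task demand (util * nr_running) against the group's capacity
 * (the sum of cpu_power), rather than against group_weight, which
 * ignores both rt pressure and SMT capacity.
 */
static int group_can_pack(const struct rq_stats *rqs, int n,
			  unsigned int waking_util)
{
	unsigned long utils = waking_util, capacity = 0;
	int i;

	for (i = 0; i < n; i++) {
		utils += (unsigned long)rqs[i].util * rqs[i].nr_running;
		capacity += rqs[i].cpu_power;
	}
	return utils <= capacity;
}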
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-05-17  8:06 ` Preeti U Murthy
@ 2013-05-20  1:01   ` Alex Shi
  2013-05-20  2:30     ` Preeti U Murthy
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-05-20  1:01 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Mike Galbraith, Ingo Molnar, Len Brown, Borislav Petkov, mingo,
    peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen,
    vincent.guittot, gregkh, viresh.kumar, linux-kernel, len.brown,
    rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook,
    mgorman, riel, Linux PM list

>>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>>> performance measurably?
>
> I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.

Is this a 2 * 16 * 4 LCPUs PowerPC machine?

> The power efficiency drops significantly with the powersaving policy of
> this patch, compared to the power efficiency of the scheduler without
> this patch.
>
> The below parameters are measured relative to the default scheduler
> behaviour.
>
> A: Drop in power efficiency with the patch+powersaving policy
> B: Drop in performance with the patch+powersaving policy
> C: Decrease in power consumption with the patch+powersaving policy
>
> NumThreads	A	B	C
> -----------------------------------------
> 2		33%	36%	4%
> 4		31%	33%	3%
> 8		28%	30%	3%
> 16		31%	33%	4%
>
> Each of the above runs is for 30s.
>
> On investigating socket utilization, I found that only 1 socket was
> being used during all the above threaded runs. As can be guessed, this
> is due to the group_weight being considered for the threshold metric.
> This stacks up tasks on a core, and further on a socket, thus
> throttling them, as observed by Mike below.
>
> I therefore think we must switch to group_capacity as the metric for
> the threshold, and use only (rq->utils * nr_running) for the
> group_utils calculation during non-bursty wakeup scenarios.
> This way we are comparing the right quantities: the utilization of the
> runqueue by the fair tasks, against the cpu capacity available to them
> after the rt tasks have consumed their share.
>
> After I made the above modification, all of the above three parameters
> came out nearly null. However, I am still observing the load balancing
> of the scheduler with the patch and powersaving policy enabled: it
> behaves very close to the default scheduler (spreading tasks across
> sockets). That also explains why there is no performance drop or gain
> with the patch+powersaving policy enabled. I will look into this
> observation and report back.

Thanks a lot for the great testing!
It seems one task per SMT CPU isn't power efficient, and I got a
similar result last week. I tested with fspin (it does endless
calculation; it is in the linux-next tree): when I bound one task per
SMT CPU, the power efficiency really dropped at almost every thread
count, but when binding one task per core, it had better power
efficiency at all thread counts.
Besides moving tasks depending on group_capacity, another choice is to
balance tasks according to cpu_power. I have done that conversion in
code, but it needs to go through an internal open-source process
before I can publish it.

>
>>>>
>>>> Well, it'll lose throughput any time there's parallel execution
>>>> potential but it's serialized instead.. using average will inevitably
>>>> stack tasks sometimes, but that's its goal.  Hackbench shows it.
>>>
>>> (but that consolidation can be a winner too, and I bet a nickle it would
>>> be for a socket sized pgbench run)
>>
>> (belay that, was thinking of keeping all tasks on a single node, but
>> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)
>
> At this point, I would like to raise one issue.
> *Is the goal of the power-aware scheduler to improve the power
> efficiency of the scheduler, or to accept a compromise on power
> efficiency in exchange for a definite decrease in power consumption,
> since it is the user who has decided to prioritise lower power
> consumption over performance?*
>

It could be one reason for this feature, but I would like it to have
better efficiency, like packing tasks according to cpu_power, not the
current group_weight.

>
> Regards
> Preeti U Murthy
>

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-05-20  1:01 ` Alex Shi
@ 2013-05-20  2:30   ` Preeti U Murthy
  0 siblings, 0 replies; 30+ messages in thread
From: Preeti U Murthy @ 2013-05-20  2:30 UTC (permalink / raw)
To: Alex Shi
Cc: Mike Galbraith, Ingo Molnar, Len Brown, Borislav Petkov, mingo,
    peterz, tglx, akpm, arjan, pjt, namhyung, morten.rasmussen,
    vincent.guittot, gregkh, viresh.kumar, linux-kernel, len.brown,
    rafael.j.wysocki, jkosina, clark.williams, tony.luck, keescook,
    mgorman, riel, Linux PM list

Hi Alex,

On 05/20/2013 06:31 AM, Alex Shi wrote:
>
>>>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>>>> performance measurably?
>>
>> I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
>
> Is this a 2 * 16 * 4 LCPUs PowerPC machine?

This is a 2 * 8 * 4 LCPUs PowerPC machine.

>> The power efficiency drops significantly with the powersaving policy of
>> this patch, compared to the power efficiency of the scheduler without
>> this patch.
>>
>> The below parameters are measured relative to the default scheduler
>> behaviour.
>>
>> A: Drop in power efficiency with the patch+powersaving policy
>> B: Drop in performance with the patch+powersaving policy
>> C: Decrease in power consumption with the patch+powersaving policy
>>
>> NumThreads	A	B	C
>> -----------------------------------------
>> 2		33%	36%	4%
>> 4		31%	33%	3%
>> 8		28%	30%	3%
>> 16		31%	33%	4%
>>
>> Each of the above runs is for 30s.
>>
>> On investigating socket utilization, I found that only 1 socket was
>> being used during all the above threaded runs. As can be guessed, this
>> is due to the group_weight being considered for the threshold metric.
>> This stacks up tasks on a core, and further on a socket, thus
>> throttling them, as observed by Mike below.
>>
>> I therefore think we must switch to group_capacity as the metric for
>> the threshold, and use only (rq->utils * nr_running) for the
>> group_utils calculation during non-bursty wakeup scenarios.
>> This way we are comparing the right quantities: the utilization of the
>> runqueue by the fair tasks, against the cpu capacity available to them
>> after the rt tasks have consumed their share.
>>
>> After I made the above modification, all of the above three parameters
>> came out nearly null. However, I am still observing the load balancing
>> of the scheduler with the patch and powersaving policy enabled: it
>> behaves very close to the default scheduler (spreading tasks across
>> sockets). That also explains why there is no performance drop or gain
>> with the patch+powersaving policy enabled. I will look into this
>> observation and report back.
>
> Thanks a lot for the great testing!
> It seems one task per SMT CPU isn't power efficient, and I got a
> similar result last week. I tested with fspin (it does endless
> calculation; it is in the linux-next tree): when I bound one task per
> SMT CPU, the power efficiency really dropped at almost every thread
> count, but when binding one task per core, it had better power
> efficiency at all thread counts.
> Besides moving tasks depending on group_capacity, another choice is to
> balance tasks according to cpu_power. I have done that conversion in
> code, but it needs to go through an internal open-source process
> before I can publish it.

What do you mean by *another* choice of balancing tasks according to
cpu_power? group_capacity is based on cpu_power. Also, your balance
policy in v6 was doing the same, right? It was rightly comparing
rq->utils * nr_running against cpu_power. Why not simply switch to
that code for power-policy load balancing?
>>>>> Well, it'll lose throughput any time there's parallel execution
>>>>> potential but it's serialized instead.. using average will inevitably
>>>>> stack tasks sometimes, but that's its goal.  Hackbench shows it.
>>>>
>>>> (but that consolidation can be a winner too, and I bet a nickle it would
>>>> be for a socket sized pgbench run)
>>>
>>> (belay that, was thinking of keeping all tasks on a single node, but
>>> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)
>>
>> At this point, I would like to raise one issue.
>> *Is the goal of the power-aware scheduler to improve the power
>> efficiency of the scheduler, or to accept a compromise on power
>> efficiency in exchange for a definite decrease in power consumption,
>> since it is the user who has decided to prioritise lower power
>> consumption over performance?*
>
> It could be one reason for this feature, but I would like it to have
> better efficiency, like packing tasks according to cpu_power, not the
> current group_weight.

Yes, we could try the patch using group_capacity and observe the
results for power efficiency, before we decide to compromise on power
efficiency for a decrease in power consumption.

Regards
Preeti U Murthy

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-12 16:23 ` Borislav Petkov
  2013-04-12 16:48   ` Mike Galbraith
@ 2013-04-14  1:28   ` Alex Shi
  2013-04-14  5:10     ` Alex Shi
  2013-04-14 15:59     ` Borislav Petkov
  1 sibling, 2 replies; 30+ messages in thread
From: Alex Shi @ 2013-04-14  1:28 UTC (permalink / raw)
To: Borislav Petkov
Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
    efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
    viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
    clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/13/2013 12:23 AM, Borislav Petkov wrote:
> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
>> Thanks a lot for the comments, Len!
> AFAICT, you kinda forgot to answer his most important question:
>
>> These numbers suggest that this patch series simultaneously
>> has a negative impact on performance and energy required
>> to retire the workload.  Why do it?

Even if in some scenarios the total energy costs more, at least the
average watts dropped in those scenarios. Len said he has a low
p-state which can work there, but that is different. I sent some data
to another email thread to show the difference:

The following is the result of running the kbuild test twice under 3
conditions on the SNB EP box. The middle column is the lowest-p-state
test result; we can see it has the lowest power consumption, but also
the lowest performance/watt value.
At least for the kbuild benchmark, the powersaving policy is the best
compromise between power saving and power efficiency. Furthermore, due
to the CPU boost feature, it has better performance in some scenarios.

	powersaving + ondemand	userspace + fixed 1.2GHz	performance + ondemand
x = 8	231.318 /75 57		165.063 /166 36			253.552 /63 62
x = 16	280.357 /49 72		174.408 /106 54			296.776 /41 82
x = 32	325.206 /34 90		178.675 /90 62			314.153 /37 86

x = 8	233.623 /74 57		164.507 /168 36			254.775 /65 60
x = 16	272.54 /38 96		174.364 /106 54			297.731 /42 79
x = 32	320.758 /34 91		177.917 /91 61			317.875 /35 89
x = 64	326.837 /33 92		179.037 /90 62			320.615 /36 86

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-14  1:28 ` Alex Shi
@ 2013-04-14  5:10   ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2013-04-14  5:10 UTC (permalink / raw)
To: Borislav Petkov
Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
    efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
    viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
    clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/14/2013 09:28 AM, Alex Shi wrote:
>>>> These numbers suggest that this patch series simultaneously
>>>> has a negative impact on performance and energy required
>>>> to retire the workload.  Why do it?
>
> Even if in some scenarios the total energy costs more, at least the
> average watts dropped in those scenarios. Len said he has a low
> p-state which can work there, but that is different. I sent some data
> to another email thread to show the difference:
>
> The following is the result of running the kbuild test twice under 3
> conditions on the SNB EP box. The middle column is the lowest-p-state
> test result; we can see it has the lowest power consumption, but also
> the lowest performance/watt value.
> At least for the kbuild benchmark, the powersaving policy is the best
> compromise between power saving and power efficiency. Furthermore, due
> to the CPU boost feature, it has better performance in some scenarios.

BTW, another benefit of powersaving is that the powersaving policy is
very flexible with respect to system load: when the number of tasks in
a sched domain goes beyond the LCPU number, it switches to
performance-oriented balancing. That gives similar performance when
the system is busy.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
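A minimal sketch of the fallback Alex describes, assuming the policy
is re-derived from a sched domain's task count versus its logical CPU
count; the names are illustrative, not the patch set's actual code:

/* Illustrative: the powersaving policy only applies while the sched
 * domain is under-utilised; once there are more tasks than logical
 * CPUs, balancing falls back to the performance policy. */
enum balance_policy { BALANCE_POWERSAVING, BALANCE_PERFORMANCE };

static enum balance_policy effective_policy(unsigned int nr_tasks,
					    unsigned int nr_lcpus)
{
	return nr_tasks > nr_lcpus ? BALANCE_PERFORMANCE
				   : BALANCE_POWERSAVING;
}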
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-14  1:28 ` Alex Shi
  2013-04-14  5:10   ` Alex Shi
@ 2013-04-14 15:59   ` Borislav Petkov
  2013-04-15  6:04     ` Alex Shi
  1 sibling, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2013-04-14 15:59 UTC (permalink / raw)
To: Alex Shi
Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
    efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
    viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
    clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On Sun, Apr 14, 2013 at 09:28:50AM +0800, Alex Shi wrote:
> Even if in some scenarios the total energy costs more, at least the
> average watts dropped in those scenarios.

Ok, what's wrong with x = 32 then? So basically, if you're looking at
avg watts, you don't want to have more than 16 threads, otherwise
powersaving sucks on that particular uarch and platform. Can you say
that for all platforms out there?

Also, I've added in the columns below the Energy = Power * Time thing.
And the funny thing is, exactly where avg watts is better in
powersaving, the energy to retire the workload is worse. And the other
way around. Basically, avg watts vs retire energy is reciprocal.
Great :-\.

> Len said he has a low p-state which can work there, but that is
> different. I sent some data to another email thread to show the
> difference:
>
> The following is the result of running the kbuild test twice under 3
> conditions on the SNB EP box. The middle column is the lowest-p-state
> test result; we can see it has the lowest power consumption, but also
> the lowest performance/watt value.
> At least for the kbuild benchmark, the powersaving policy is the best
> compromise between power saving and power efficiency. Furthermore, due
> to the CPU boost feature, it has better performance in some scenarios.
>
>	powersaving + ondemand	userspace + fixed 1.2GHz	performance + ondemand
> x = 8	231.318 /75 57		165.063 /166 36			253.552 /63 62
> x = 16	280.357 /49 72		174.408 /106 54			296.776 /41 82
> x = 32	325.206 /34 90		178.675 /90 62			314.153 /37 86
>
> x = 8	233.623 /74 57		164.507 /168 36			254.775 /65 60
> x = 16	272.54 /38 96		174.364 /106 54			297.731 /42 79
> x = 32	320.758 /34 91		177.917 /91 61			317.875 /35 89
> x = 64	326.837 /33 92		179.037 /90 62			320.615 /36 86

Energy = Power * Time, in Joules, per policy:

powersaving	fixed 1.2GHz	performance
17348.850	27400.458	15973.776
13737.493	18487.248	12167.816
11057.004	16080.750	11623.661

17288.102	27637.176	16560.375
10356.52	18482.584	12504.702
10905.772	16190.447	11125.625
10785.621	16113.330	11542.140

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-14 15:59 ` Borislav Petkov
@ 2013-04-15  6:04   ` Alex Shi
  2013-04-15  6:16     ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-04-15  6:04 UTC (permalink / raw)
To: Borislav Petkov
Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
    efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
    viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
    clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/14/2013 11:59 PM, Borislav Petkov wrote:
> On Sun, Apr 14, 2013 at 09:28:50AM +0800, Alex Shi wrote:
>> Even if in some scenarios the total energy costs more, at least the
>> average watts dropped in those scenarios.
>
> Ok, what's wrong with x = 32 then? So basically, if you're looking at
> avg watts, you don't want to have more than 16 threads, otherwise
> powersaving sucks on that particular uarch and platform. Can you say
> that for all platforms out there?

The CPU frequency boost makes the average watts higher with x = 32,
but it also gives higher power efficiency. We can disable cpufreq
boost if we want lower power consumption at all times, but to my
understanding, power efficiency is the better way to save power.
As for other platforms, I would be glad to see anyone test this and
send me the results...

>
> Also, I've added in the columns below the Energy = Power * Time thing.

Thanks. BTW, the third number in each column is 'performance/watt';
it shows the same information from the other side. :)

>
> And the funny thing is, exactly where avg watts is better in
> powersaving, the energy to retire the workload is worse. And the other
> way around. Basically, avg watts vs retire energy is reciprocal.
> Great :-\.
>
>> Len said he has a low p-state which can work there, but that is
>> different. I sent some data to another email thread to show the
>> difference:
>>
>> The following is the result of running the kbuild test twice under 3
>> conditions on the SNB EP box. The middle column is the lowest-p-state
>> test result; we can see it has the lowest power consumption, but also
>> the lowest performance/watt value.
>> At least for the kbuild benchmark, the powersaving policy is the best
>> compromise between power saving and power efficiency. Furthermore, due
>> to the CPU boost feature, it has better performance in some scenarios.
>>
>>	powersaving + ondemand	userspace + fixed 1.2GHz	performance + ondemand
>> x = 8	231.318 /75 57		165.063 /166 36			253.552 /63 62
>> x = 16	280.357 /49 72		174.408 /106 54			296.776 /41 82
>> x = 32	325.206 /34 90		178.675 /90 62			314.153 /37 86
>>
>> x = 8	233.623 /74 57		164.507 /168 36			254.775 /65 60
>> x = 16	272.54 /38 96		174.364 /106 54			297.731 /42 79
>> x = 32	320.758 /34 91		177.917 /91 61			317.875 /35 89
>> x = 64	326.837 /33 92		179.037 /90 62			320.615 /36 86
>
> Energy = Power * Time, in Joules, per policy:
>
> powersaving	fixed 1.2GHz	performance
> 17348.850	27400.458	15973.776
> 13737.493	18487.248	12167.816
> 11057.004	16080.750	11623.661
>
> 17288.102	27637.176	16560.375
> 10356.52	18482.584	12504.702
> 10905.772	16190.447	11125.625
> 10785.621	16113.330	11542.140

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-15  6:04 ` Alex Shi
@ 2013-04-15  6:16   ` Alex Shi
  2013-04-15  9:52     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-04-15  6:16 UTC (permalink / raw)
To: Borislav Petkov
Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
    efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
    viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
    clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/15/2013 02:04 PM, Alex Shi wrote:
> On 04/14/2013 11:59 PM, Borislav Petkov wrote:
>> On Sun, Apr 14, 2013 at 09:28:50AM +0800, Alex Shi wrote:
>>> Even if in some scenarios the total energy costs more, at least the
>>> average watts dropped in those scenarios.
>>
>> Ok, what's wrong with x = 32 then? So basically, if you're looking at
>> avg watts, you don't want to have more than 16 threads, otherwise
>> powersaving sucks on that particular uarch and platform. Can you say
>> that for all platforms out there?
>
> The CPU frequency boost makes the average watts higher with x = 32,
> but it also gives higher power efficiency. We can disable cpufreq
> boost if we want lower power consumption at all times, but to my
> understanding, power efficiency is the better way to save power.

BTW, the lowest p-state, no frequency boost, plus this powersaving
policy will give the lowest power consumption.

And I need to say again: the powersaving policy only has an effect
when the system is under-utilised. When the system gets busy, it has
no effect; the performance-oriented policy takes over the balancing
behaviour.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-15  6:16 ` Alex Shi
@ 2013-04-15  9:52   ` Borislav Petkov
  2013-04-15 13:50     ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2013-04-15  9:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On Mon, Apr 15, 2013 at 02:16:55PM +0800, Alex Shi wrote:
> And I need to say it again: the powersaving policy only takes effect
> when the system is under-utilised. When the system gets busy, it has
> no effect; the performance-oriented policy takes over the balancing
> behaviour.

And AFAIU your patches, you do this automatically, right? In which
case, an underutilized system will have switched to powersaving
balancing and will use *more* energy to retire the workload. Correct?

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-15  9:52 ` Borislav Petkov
@ 2013-04-15 13:50   ` Alex Shi
  2013-04-15 23:12     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-04-15 13:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/15/2013 05:52 PM, Borislav Petkov wrote:
> On Mon, Apr 15, 2013 at 02:16:55PM +0800, Alex Shi wrote:
>> And I need to say it again: the powersaving policy only takes effect
>> when the system is under-utilised. When the system gets busy, it has
>> no effect; the performance-oriented policy takes over the balancing
>> behaviour.
>
> And AFAIU your patches, you do this automatically, right?

Yes.

> In which case, an underutilized system will have switched to
> powersaving balancing and will use *more* energy to retire the
> workload. Correct?

Considering fairness and the total number of threads, powersaving costs
quite similar energy on the kbuild benchmark, and is sometimes even
better (columns as above: powersaving, fixed 1.2GHz, performance; in
Joules):

   17348.850      27400.458      15973.776
   13737.493      18487.248      12167.816
   11057.004      16080.750      11623.661

   17288.102      27637.176      16560.375
   10356.52       18482.584      12504.702
   10905.772      16190.447      11125.625
   10785.621      16113.330      11542.140

--
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-15 13:50 ` Alex Shi
@ 2013-04-15 23:12   ` Borislav Petkov
  2013-04-16  0:22     ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2013-04-15 23:12 UTC (permalink / raw)
  To: Alex Shi
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On Mon, Apr 15, 2013 at 09:50:22PM +0800, Alex Shi wrote:
> Considering fairness and the total number of threads, powersaving
> costs quite similar energy on the kbuild benchmark, and is sometimes
> even better:
>
>    17348.850      27400.458      15973.776
>    13737.493      18487.248      12167.816

Yeah, but those lines don't look good: powersaving needs more energy
than performance.

And what is even crazier is the fixed 1.2 GHz case. I'd guess in the
normal case those cores run at triple that freq., i.e. somewhere around
3-4 GHz. And yet, 1.2 GHz eats almost *double* the energy of
performance and powersaving.

So for the x=8 and maybe even the x=16 case we're basically better off
with performance.

Or could it be that the power measurements are not really that
accurate, and those numbers above are not really correct?

Hmm.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-15 23:12 ` Borislav Petkov
@ 2013-04-16  0:22   ` Alex Shi
  2013-04-16 10:24     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-04-16  0:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/16/2013 07:12 AM, Borislav Petkov wrote:
> On Mon, Apr 15, 2013 at 09:50:22PM +0800, Alex Shi wrote:
>> Considering fairness and the total number of threads, powersaving
>> costs quite similar energy on the kbuild benchmark, and is sometimes
>> even better:
>>
>>    17348.850      27400.458      15973.776
>>    13737.493      18487.248      12167.816
>
> Yeah, but those lines don't look good: powersaving needs more energy
> than performance.
>
> And what is even crazier is the fixed 1.2 GHz case. I'd guess in the
> normal case those cores run at triple that freq., i.e. somewhere
> around 3-4 GHz. And yet, 1.2 GHz eats almost *double* the energy of
> performance and powersaving.

Yes, the max frequency is 2.7 GHz, plus boost.

> So for the x=8 and maybe even the x=16 case we're basically better off
> with performance.
>
> Or could it be that the power measurements are not really that
> accurate, and those numbers above are not really correct?

The testing has a little variation, but the power data is quite
accurate. I may change the packing to go by CPU capacity instead of the
current CPU weight; that should give a better power-efficiency value.

--
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
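To illustrate the difference between the two packing criteria (a sketch
with made-up types; the actual patch series works on the scheduler's
load-balance statistics): a group's weight is its raw number of CPUs,
while its capacity reflects how many full-CPU equivalents it can really
deliver once SMT sharing, RT task time and the like are accounted for,
so packing by capacity stops filling a group earlier.

/* Illustrative only; not the data structures from the patch series. */
struct group_stats {
	unsigned int nr_running;  /* tasks currently packed into the group */
	unsigned int weight;      /* raw number of CPUs in the group       */
	unsigned int capacity;    /* full-CPU equivalents actually usable  */
};

/* Can this group accept one more task under the powersaving policy? */
static int group_has_room(const struct group_stats *gs,
			  int pack_by_capacity)
{
	unsigned int limit = pack_by_capacity ? gs->capacity : gs->weight;

	return gs->nr_running < limit;
}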
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-16  0:22 ` Alex Shi
@ 2013-04-16 10:24   ` Borislav Petkov
  2013-04-17  1:18     ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2013-04-16 10:24 UTC (permalink / raw)
  To: Alex Shi
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On Tue, Apr 16, 2013 at 08:22:19AM +0800, Alex Shi wrote:
> The testing has a little variation, but the power data is quite
> accurate. I may change the packing to go by CPU capacity instead of
> the current CPU weight; that should give a better power-efficiency
> value.

Yeah, this probably needs careful measuring, and by "this" I mean how
to place N tasks where N is less than the number of cores in the
system.

One strategy I can imagine is migrating them all together onto a single
physical socket (maybe even overcommitting it), so that you can flush
the caches of the cores on the other sockets, power those sockets down,
and avoid coherence traffic waking them up. My supposition here is that
putting whole unused sockets into a deep sleep state could save a lot
of power.

Or not, who knows. Only empirical measurements will show us what
actually happens.

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 30+ messages in thread
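That consolidation strategy can already be approximated from user space
with CPU affinity, e.g. confining a workload to the cores of one socket
so the other sockets can drop into deep package sleep states. A sketch
(the CPU range is an assumption; the socket-to-CPU mapping is machine
specific, see /sys/devices/system/cpu/cpu*/topology/physical_package_id):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	cpu_set_t set;
	int cpu;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}

	/* Assume CPUs 0-7 are the cores of socket 0 on this machine. */
	CPU_ZERO(&set);
	for (cpu = 0; cpu < 8; cpu++)
		CPU_SET(cpu, &set);

	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Run the workload confined to socket 0, e.g. "make -j8". */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}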
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-16 10:24 ` Borislav Petkov
@ 2013-04-17  1:18   ` Alex Shi
  2013-04-17  7:38     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2013-04-17  1:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On 04/16/2013 06:24 PM, Borislav Petkov wrote:
> Yeah, this probably needs careful measuring, and by "this" I mean how
> to place N tasks where N is less than the number of cores in the
> system.
>
> One strategy I can imagine is migrating them all together onto a
> single physical socket (maybe even overcommitting it), so that you
> can flush the caches of the cores on the other sockets, power those
> sockets down, and avoid coherence traffic waking them up. My
> supposition here is that putting whole unused sockets into a deep
> sleep state could save a lot of power.

Sure. Currently, even if the whole socket goes to sleep, when memory on
that node is still being accessed the CPU socket still spends some
power on the 'uncore' part. So the further step is to reduce remote
memory accesses to save more power, which is also what NUMA balancing
wants to do.

And the step after that is to detect whether the socket is cache
intensive, i.e. whether there is much cache thrashing on the node.

In theory, there is still a lot of tuning space. :)

> Or not, who knows. Only empirical measurements will show us what
> actually happens.

Sure. :)

--
Thanks
    Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
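The memory half of that can be exercised from user space today via
libnuma, e.g. pulling a process's pages over to the node its tasks were
packed onto so the other node's DRAM links and uncore can idle. A
sketch (node numbers and the helper name are assumptions for the
example; check numa_available() first and link with -lnuma):

#include <numa.h>
#include <stdio.h>

/*
 * Illustrative sketch: after packing a process's tasks onto node 0,
 * move its pages off node 1 too, so node 1 no longer has to serve
 * remote memory accesses.
 */
int pull_memory_to_node0(int pid)
{
	struct bitmask *from = numa_allocate_nodemask();
	struct bitmask *to = numa_allocate_nodemask();
	int ret;

	numa_bitmask_setbit(from, 1);	/* source: node 1 */
	numa_bitmask_setbit(to, 0);	/* target: node 0 */

	/* Returns the number of pages it could not move, -1 on error. */
	ret = numa_migrate_pages(pid, from, to);
	if (ret < 0)
		perror("numa_migrate_pages");

	numa_free_nodemask(from);
	numa_free_nodemask(to);
	return ret;
}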
* Re: [patch v7 0/21] sched: power aware scheduling
  2013-04-17  1:18 ` Alex Shi
@ 2013-04-17  7:38   ` Borislav Petkov
  0 siblings, 0 replies; 30+ messages in thread
From: Borislav Petkov @ 2013-04-17  7:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Len Brown, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
      efault, morten.rasmussen, vincent.guittot, gregkh, preeti,
      viresh.kumar, linux-kernel, len.brown, rafael.j.wysocki, jkosina,
      clark.williams, tony.luck, keescook, mgorman, riel, Linux PM list

On Wed, Apr 17, 2013 at 09:18:28AM +0800, Alex Shi wrote:
> Sure. Currently, even if the whole socket goes to sleep, when memory
> on that node is still being accessed the CPU socket still spends some
> power on the 'uncore' part. So the further step is to reduce remote
> memory accesses to save more power, which is also what NUMA balancing
> wants to do.

Yeah, if you also mean that you need to further migrate the threads'
memory away from the node, so that it doesn't need to serve memory
accesses from other sockets, then that should probably help save even
more power. You would probably still need to serve probes from the L3,
but your DRAM links will be powered down and such.

> And the step after that is to detect whether the socket is cache
> intensive, i.e. whether there is much cache thrashing on the node.

Yeah, that would probably be harder to determine: is the cache
thrashing (and I think you mean L3 here) worse than migrating tasks to
other nodes and keeping them powered on, just because my current node
is not supposed to thrash the L3? Hmm.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2013-05-20  2:32 UTC | newest]

Thread overview: 30+ messages
[not found] <1365040862-8390-1-git-send-email-alex.shi@intel.com>
2013-04-11 21:02 ` [patch v7 0/21] sched: power aware scheduling Len Brown
2013-04-12  8:46   ` Alex Shi
2013-04-12 16:23     ` Borislav Petkov
2013-04-12 16:48       ` Mike Galbraith
2013-04-12 17:12         ` Borislav Petkov
2013-04-14  1:36           ` Alex Shi
2013-04-17 21:53             ` Len Brown
2013-04-18  1:51               ` Mike Galbraith
2013-04-26 15:11                 ` Mike Galbraith
2013-04-30  5:16                   ` Mike Galbraith
2013-04-30  8:30                     ` Mike Galbraith
2013-04-30  8:41                       ` Ingo Molnar
2013-04-30  9:35                         ` Mike Galbraith
2013-04-30  9:49                           ` Mike Galbraith
2013-04-30  9:56                             ` Mike Galbraith
2013-05-17  8:06                       ` Preeti U Murthy
2013-05-20  1:01                         ` Alex Shi
2013-05-20  2:30                           ` Preeti U Murthy
2013-04-14  1:28           ` Alex Shi
2013-04-14  5:10             ` Alex Shi
2013-04-14 15:59               ` Borislav Petkov
2013-04-15  6:04                 ` Alex Shi
2013-04-15  6:16                   ` Alex Shi
2013-04-15  9:52                     ` Borislav Petkov
2013-04-15 13:50                       ` Alex Shi
2013-04-15 23:12                         ` Borislav Petkov
2013-04-16  0:22                           ` Alex Shi
2013-04-16 10:24                             ` Borislav Petkov
2013-04-17  1:18                               ` Alex Shi
2013-04-17  7:38                                 ` Borislav Petkov