* [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
@ 2014-04-24 19:30 Yuyang Du
From: Yuyang Du @ 2014-04-24 19:30 UTC (permalink / raw)
To: mingo, peterz, linux-kernel, linux-pm
Cc: arjan.van.de.ven, len.brown, rafael.j.wysocki, alan.cox,
mark.gross, morten.rasmussen, vincent.guittot, yuyang.du
Hi Ingo, PeterZ, and others,
The current scheduler's load balancing is completely work-conserving. For
workloads with generally low CPU utilization but frequent bursts of transient
tasks, migrating tasks to engage all available CPUs for the sake of
work-conservation can incur significant overhead: cache locality loss,
idle/active HW state transition latency and power, shallower idle states,
etc. This is inefficient in both power and performance, especially on today's
low-power mobile processors.
This RFC introduces a notion of idleness-conservation into the otherwise
work-conserving scheduler (we certainly do not want to be extreme in only one
direction). But to what extent should idleness be conserved, bearing in mind
that we don't want to sacrifice performance? We first need a load/idleness
indicator to answer that.
Thanks to CFS's goal to "model an ideal, precise multi-tasking CPU", the tasks
in the runqueue can be seen as running concurrently. So it is natural to use
task concurrency as a load indicator. To that end, we do two things:
1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:
a = sum(concurrency * time) / period
2) Exponentially decay past periods and sum them all, for hysteresis against
load drops and resilience against load rises (let f be the decay factor, and
a_x the average of the xth period since period 0):
s = a_n + f * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0
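
Equivalently, s can be maintained incrementally with the recurrence

s_n = a_n + f * s_(n-1), s_0 = a_0

so only the previous decayed sum needs to be stored rather than the whole
period history.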
We name this load indicator CPU ConCurrency (CC): task concurrency determines
how many CPUs need to be running concurrently.
To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
scheduler tick, and 4) enter/exit idle.
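
For illustration, a minimal sketch of that bookkeeping follows. It is not the
actual patch: the cpu_concurrency_t structure, the rq->concurrency field and
the constants are invented for this example (only rq->nr_running is an
existing field), and a real implementation would also need locking and nohz
handling. The 32ms period and the halflife of one period correspond to the
values mentioned in the follow-up discussion below.

struct cpu_concurrency_t {
        u64     period_start;   /* start of the current period (ns) */
        u64     last_update;    /* last time contrib was accumulated (ns) */
        u64     contrib;        /* sum(nr_running * time) in this period */
        u64     decayed_sum;    /* s, fixed point (unit = 1 << CC_RESOLUTION) */
};

#define CC_PERIOD_NS    (32 * NSEC_PER_MSEC)    /* period length */
#define CC_RESOLUTION   10                      /* fixed-point shift */

/*
 * Called from enqueue, dequeue, the scheduler tick and idle entry/exit,
 * before rq->nr_running is changed by the caller.
 */
static void update_cpu_concurrency(struct rq *rq, u64 now)
{
        struct cpu_concurrency_t *cc = &rq->concurrency;

        /* Close out every period that has fully elapsed. */
        while (now - cc->period_start >= CC_PERIOD_NS) {
                u64 end = cc->period_start + CC_PERIOD_NS;
                u64 a_n;

                cc->contrib += (end - cc->last_update) * rq->nr_running;
                a_n = div_u64(cc->contrib << CC_RESOLUTION, CC_PERIOD_NS);

                /* s_n = a_n + f * s_(n-1), with f = 1/2 (halflife = 1 period) */
                cc->decayed_sum = a_n + (cc->decayed_sum >> 1);

                cc->contrib = 0;
                cc->period_start = end;
                cc->last_update = end;
        }

        /* Partial progress within the still-open period. */
        cc->contrib += (now - cc->last_update) * rq->nr_running;
        cc->last_update = now;
}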
Using CC, we implemented a Workload Consolidation patch on two Intel mobile
platforms (a quad-core composed of two dual-core modules): load and load
balancing are contained within the first dual-core module when the aggregated
CC is low, and spread across the full quad-core otherwise. Results show power
savings and no substantial performance regression (even gains for some
workloads).
Thanks,
Yuyang
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Mike Galbraith @ 2014-04-25 6:23 UTC (permalink / raw)
To: Yuyang Du
Cc: mingo, peterz, linux-kernel, linux-pm, arjan.van.de.ven, len.brown,
    rafael.j.wysocki, alan.cox, mark.gross, morten.rasmussen, vincent.guittot

On Fri, 2014-04-25 at 03:30 +0800, Yuyang Du wrote:
> To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
> scheduler tick, and 4) enter/exit idle.

Boo hiss to 1, 2 and 4. Less fastpath math would be better.

-Mike
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Vincent Guittot @ 2014-04-25 8:00 UTC (permalink / raw)
To: Yuyang Du
Cc: mingo@redhat.com, Peter Zijlstra, linux-kernel, linux-pm@vger.kernel.org,
    arjan.van.de.ven, Len Brown, rafael.j.wysocki, alan.cox, Gross, Mark,
    Morten Rasmussen

On 24 April 2014 21:30, Yuyang Du <yuyang.du@intel.com> wrote:
[...]
> 1) Divide continuous time into periods, and average the task concurrency
> within each period, to tolerate transient bursts:
> a = sum(concurrency * time) / period
> 2) Exponentially decay past periods and sum them all, for hysteresis against
> load drops and resilience against load rises (let f be the decay factor, and
> a_x the average of the xth period since period 0):
> s = a_n + f * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0

In the original version of the per-entity load tracking patchset, there was
a usage_avg_sum field that counted the time the task was really running on
the CPU. By combining this (since removed) field with runnable_avg_sum, you
should get a similar concurrency value, but with the current load tracking
mechanism instead of creating a new one.

Vincent
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Peter Zijlstra @ 2014-04-25 9:57 UTC (permalink / raw)
To: Vincent Guittot
Cc: Yuyang Du, mingo@redhat.com, linux-kernel, linux-pm@vger.kernel.org,
    arjan.van.de.ven, Len Brown, rafael.j.wysocki, alan.cox, Gross, Mark,
    Morten Rasmussen

On Fri, Apr 25, 2014 at 10:00:02AM +0200, Vincent Guittot wrote:
[...]
> In the original version of the per-entity load tracking patchset, there was
> a usage_avg_sum field that counted the time the task was really running on
> the CPU. By combining this (since removed) field with runnable_avg_sum, you
> should get a similar concurrency value, but with the current load tracking
> mechanism instead of creating a new one.

I'm not entirely sure I understood what was proposed, but I suspect it's very
close to what I told you to do with the capacity muck: use avg utilization
instead of 1 active task per core.

And yes, the current load tracking should be pretty close. We just need to
come up with another way of doing SMT again; bloody inconvenient SMT.
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Morten Rasmussen @ 2014-04-25 10:23 UTC (permalink / raw)
To: Yuyang Du
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org,
    linux-pm@vger.kernel.org, arjan.van.de.ven@intel.com, len.brown@intel.com,
    rafael.j.wysocki@intel.com, alan.cox@intel.com, mark.gross@intel.com,
    vincent.guittot@linaro.org

Hi Yuyang,

On Thu, Apr 24, 2014 at 08:30:05PM +0100, Yuyang Du wrote:
[...]
> Using CC, we implemented a Workload Consolidation patch on two Intel mobile
> platforms (a quad-core composed of two dual-core modules): load and load
> balancing are contained within the first dual-core module when the
> aggregated CC is low, and spread across the full quad-core otherwise.
> Results show power savings and no substantial performance regression (even
> gains for some workloads).

The idea you present seems quite similar to the task packing proposals by
Vincent and others that were discussed about a year ago. One of the main
issues related to task packing/consolidation is that it is not always
beneficial.

I have spent some time over the last couple of weeks looking into this,
trying to figure out when task consolidation makes sense. The pattern I have
seen is that it makes most sense when the task energy is dominated by
wake-up costs, that is, for short-running tasks. The actual energy savings
come from a reduced number of wake-ups, if the consolidation cpu is busy
enough to be already awake when another task wakes up, and from keeping the
consolidation cpu in a shallower idle state and thereby reducing the wake-up
costs. The wake-up cost savings outweigh the additional leakage in the
shallower idle state in some scenarios. All of this is of course quite
platform dependent: different idle state leakage power and wake-up costs may
change the picture.

I'm therefore quite interested in knowing what sort of test scenarios you
used and the parameters for CC (f and the size of the periods).

I'm not convinced (yet) that a cpu load concurrency indicator is sufficient
to make the call on when to consolidate tasks. I'm wondering whether we need
a power model to guide the decisions.

Whether you use CC or reintroduce usage_avg as Vincent proposes, I believe
the overhead should be roughly the same.

Morten
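
To put the trade-off Morten describes in rough, back-of-the-envelope form
(the symbols here are illustrative only and not taken from any patch or
existing model), consolidation pays off roughly when the wake-up energy
saved exceeds the extra leakage spent in the shallower idle state:

N_wakeups * (E_wakeup_deep - E_wakeup_shallow) > (P_leak_shallow - P_leak_deep) * T_idle

where N_wakeups is the number of wake-ups that no longer exit the deep state,
E_wakeup_* is the energy cost of exiting each state, P_leak_* is the
idle-state leakage power, and T_idle is the idle time spent in the shallower
state. Which side wins is, as noted above, platform dependent.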
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Rafael J. Wysocki @ 2014-04-25 12:19 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Yuyang Du, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org

On Friday, April 25, 2014 11:23:07 AM Morten Rasmussen wrote:
[...]
> I have spent some time over the last couple of weeks looking into this,
> trying to figure out when task consolidation makes sense. The pattern I have
> seen is that it makes most sense when the task energy is dominated by
> wake-up costs, that is, for short-running tasks. [...]

The problem, however, is that it usually is not really known in advance
whether or not a given task will be short-running. There simply is no way
to tell.

The only kinds of information we can possibly use to base decisions on are
(1) things that don't change (or if they change, we know exactly when and
how), such as the system's topology, and (2) information on what happened in
the past. So, for example, if there's a task that has been running for some
time already and it has behaved in approximately the same way all the time,
it is reasonable to assume that it will behave in this way in the future. We
need to let it run for a while to collect that information, though.

Without that kind of information we can only speculate about what's going to
happen, and different methods of speculation may lead to better or worse
results in a given situation, but still that's only speculation and the
results are only known after the fact.

In the reverse case, if I know the system topology and I have a particular
workload, I know what's going to happen, so I can find a load balancing
method that will be perfect for this particular workload on this particular
system. That's not the situation the scheduler has to deal with, though,
because the workload is unknown to it until it has been measured.

So in my opinion we need to figure out how to measure workloads while they
are running and then use that information to make load balancing decisions.

In principle, given the system's topology, task packing may lead to better
results for some workloads, but not necessarily for all of them. So we need
a way to determine (a) whether or not task packing is an option at all in
the given system (that may change over time due to user policy changes etc.)
and, if that is the case, (b) whether the current workload is eligible for
task packing.

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Morten Rasmussen @ 2014-04-25 14:53 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Yuyang Du, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org

On Fri, Apr 25, 2014 at 01:19:46PM +0100, Rafael J. Wysocki wrote:
[...]
> So in my opinion we need to figure out how to measure workloads while they
> are running and then use that information to make load balancing decisions.
>
> In principle, given the system's topology, task packing may lead to better
> results for some workloads, but not necessarily for all of them. So we need
> a way to determine (a) whether or not task packing is an option at all in
> the given system (that may change over time due to user policy changes etc.)
> and, if that is the case, (b) whether the current workload is eligible for
> task packing.

I fully agree. My point was that there is more to task consolidation than
just observing the degree of task parallelism. The system topology has a lot
to say when deciding whether or not to pack. That was the motivation for
proposing to have a power model for the system topology to help make that
decision.

We do already have some per-task metrics available that may be useful for
determining whether a workload is eligible for task packing. The
load_avg_contrib gives us an indication of the task's cpu utilization, and
we also count task wake-ups. If we tracked task wake-ups over time (right
now we only have the sum) we should be able to reason about the number of
wake-ups that a task causes. Lots of wake-ups and a low load_avg_contrib
would indicate that the task's power is likely to be dominated by the
wake-up costs if it is placed on a cpu in a deep idle state.

I fully agree that measuring the workloads while they are running is the way
to go. I'm just wondering if the proposed cpu concurrency measure is
sufficient to make the task packing decision for all system topologies, or
if we need something that incorporates more system topology information. If
the latter, we may want to roll it all into something like an
energy_diff(src_cpu, dst_cpu, task) helper function for use in
load-balancing decisions.

Morten
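
The per-task signals Morten mentions could feed a classification along these
lines. This is only an illustrative sketch, not an existing kernel API: the
wake-up rate argument and both thresholds are invented for the example; in
the 3.14-era kernel, load_avg_contrib (in struct sched_avg) is real, while
only cumulative wake-up counts exist in the schedstats.

#define WAKEUP_UTIL_THRESH      64      /* hypothetical "low utilization" cutoff */
#define WAKEUP_RATE_THRESH      100     /* hypothetical wake-ups/sec cutoff */

static bool task_wakeup_dominated(struct task_struct *p,
                                  unsigned long wakeups_per_sec)
{
        /* Per-entity load tracking signal (se.avg.load_avg_contrib). */
        unsigned long util = p->se.avg.load_avg_contrib;

        /*
         * Low utilization plus frequent wake-ups: the task's energy is
         * likely dominated by idle-exit costs, so packing it onto an
         * already-awake cpu may help.
         */
        return util < WAKEUP_UTIL_THRESH &&
               wakeups_per_sec > WAKEUP_RATE_THRESH;
}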
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Rafael J. Wysocki @ 2014-04-27 14:05 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Yuyang Du, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org

On Friday, April 25, 2014 03:53:34 PM Morten Rasmussen wrote:
> On Fri, Apr 25, 2014 at 01:19:46PM +0100, Rafael J. Wysocki wrote:
[...]
> I fully agree. My point was that there is more to task consolidation than
> just observing the degree of task parallelism. The system topology has a lot
> to say when deciding whether or not to pack. That was the motivation for
> proposing to have a power model for the system topology to help make that
> decision.

Agreed.

> We do already have some per-task metrics available that may be useful for
> determining whether a workload is eligible for task packing. [...] Lots of
> wake-ups and a low load_avg_contrib would indicate that the task's power is
> likely to be dominated by the wake-up costs if it is placed on a cpu in a
> deep idle state.

Agreed too.

> I fully agree that measuring the workloads while they are running is the way
> to go. I'm just wondering if the proposed cpu concurrency measure is
> sufficient to make the task packing decision for all system topologies, or
> if we need something that incorporates more system topology information. If
> the latter, we may want to roll it all into something like an
> energy_diff(src_cpu, dst_cpu, task) helper function for use in
> load-balancing decisions.

I think that the latter is the case. For example, quite a lot may depend on
how much cache is shared between different CPU cores and what the cache
hierarchy is in general, especially if the workload fits into the cache
entirely.

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Yuyang Du @ 2014-04-27 20:07 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Rafael J. Wysocki, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org,
    rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
    harinarayanan.seshadri, yuyang.du

On Fri, Apr 25, 2014 at 03:53:34PM +0100, Morten Rasmussen wrote:
[...]
> I fully agree that measuring the workloads while they are running is the way
> to go. I'm just wondering if the proposed cpu concurrency measure is
> sufficient to make the task packing decision for all system topologies, or
> if we need something that incorporates more system topology information. If
> the latter, we may want to roll it all into something like an
> energy_diff(src_cpu, dst_cpu, task) helper function for use in
> load-balancing decisions.

Thank you.

On top of CC, in the consolidation part, we 1) attach the CPU topology to
"help make that decision" and to be adaptive beyond our experimental
platforms, and 2) intercept the current load balance for load and load
balancing containment.

Maybe the way we consolidate the workload differs from previous proposals in
that:

1) We don't do it per task. We only look at how many CPUs need to run
concurrently (on average and by prediction, at power-gated units) for the
workload, and simply consolidate.

2) I am not sure it is sufficient either, :). But I can offer another two
ways of interpreting CC.

2.1) The current work-conserving load balance also uses CC, but instantaneous
CC (similar to what PeterZ said to Vincent?).

2.2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization.
If we change "a = sum(concurrency * time) / period" to
"a' = sum(1 * time) / period", then a' is just the CPU utilization. And the
way we weight runqueue length is the simplest one (excluding the exponential
decays; you may have other ways).

The workloads they (not me) used to evaluate the Workload Consolidation are
1) 50+ perf/ux benchmarks (almost all of the magazine ones), and 2) ~10 power
workloads; of course, these are the easiest ones, such as browsing, audio,
video, recording, imaging, etc.

Thanks,
Yuyang
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Morten Rasmussen @ 2014-04-28 15:22 UTC (permalink / raw)
To: Yuyang Du
Cc: Rafael J. Wysocki, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org,
    rajeev.d.muralidhar@intel.com, vishwesh.m.rudramuni@intel.com,
    nicole.chalhoub@intel.com, ajaya.durg@intel.com,
    harinarayanan.seshadri@intel.com

On Sun, Apr 27, 2014 at 09:07:25PM +0100, Yuyang Du wrote:
[...]
> 1) We don't do it per task. We only look at how many CPUs need to run
> concurrently (on average and by prediction, at power-gated units) for the
> workload, and simply consolidate.

I'm a bit confused: do you have one global CC that tracks the number of tasks
across all runqueues in the system, or one for each cpu? There could be some
contention when updating that value on larger systems if it is one global CC.
If they are separate, how do you then decide when to consolidate?

How do you determine your "f" parameter? How fast is the reaction time? If
you have had a period of consolidation and then a bunch of tasks wake up at
the same time, how long will it be until you spread the load to all cpus?

> 2.1) The current work-conserving load balance also uses CC, but instantaneous
> CC (similar to what PeterZ said to Vincent?).

The existing load balancing based on load_avg_contrib factors in task
parallelism implicitly. If you have two tasks runnable at the same time, one
of them will have to wait on the rq, resulting in it getting a higher
load_avg_contrib than it would have had if the two tasks had become runnable
at different times (no parallelism). The higher load_avg_contrib means that
the load balancer is more likely to spread tasks that overlap in time,
similar to what you achieve with CC. But it doesn't do the reverse.

> 2.2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization.
> If we change "a = sum(concurrency * time) / period" to
> "a' = sum(1 * time) / period", then a' is just the CPU utilization. And the
> way we weight runqueue length is the simplest one (excluding the exponential
> decays; you may have other ways).

Right. How do you distinguish between having a concurrency of 1 for 100% of
the time and having a concurrency of 2 for 50% of the time? Both should give
an average concurrency of very close to 1, depending on your exponential
decay.

It seems to me that you are losing some important information by tracking per
cpu and not per task. Also, your load balance behaviour is very sensitive to
the choice of decay factor. We have that issue with the runqueue load
tracking already. It reacts very slowly to load changes, so it can't really
be used for periodic load-balancing decisions.

> The workloads they (not me) used to evaluate the Workload Consolidation are
> 1) 50+ perf/ux benchmarks (almost all of the magazine ones), and 2) ~10 power
> workloads; of course, these are the easiest ones, such as browsing, audio,
> video, recording, imaging, etc.

Can you share how much of the time the benchmarks actually ran consolidated
vs spread out? IIUC, you consolidate on two cpus, which should be enough for
a lot of workloads.

Morten
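
To spell out the ambiguity with the period average defined in the RFC: one
task running for the whole period gives a = (1 * period) / period = 1, while
two tasks overlapping for half the period (the cpu being idle otherwise) give
a = (2 * period/2 + 0 * period/2) / period = 1 as well, so the per-cpu
average alone cannot distinguish the two cases.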
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Yuyang Du @ 2014-04-28 19:26 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Rafael J. Wysocki, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org,
    rajeev.d.muralidhar@intel.com, vishwesh.m.rudramuni@intel.com,
    nicole.chalhoub@intel.com, ajaya.durg@intel.com,
    harinarayanan.seshadri@intel.com, yuyang.du

> I'm a bit confused: do you have one global CC that tracks the number of tasks
> across all runqueues in the system, or one for each cpu? There could be some
> contention when updating that value on larger systems if it is one global CC.
> If they are separate, how do you then decide when to consolidate?

Oh, we are getting down to business.

Currently, CC is per CPU. To consolidate, the formula is based on a
heuristic. Suppose we have 2 CPUs, and their task concurrency over time looks
like this ('-' means no task, 'x' means having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be
CC' = CC[0] + CC[1] for case 1, and CC'' = (CC[0] + CC[1]) * 2 for case 2.
For the cases in between case 1 and case 2, in terms of how the xxx's
overlap, the consolidated CC should be between CC' and CC''. So we uniformly
use this condition to evaluate for consolidation (suppose we consolidate m
CPUs to n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n * consolidate_coefficient

The consolidate_coefficient could be, say, 100%, or more or less.

> How do you determine your "f" parameter? How fast is the reaction time? If
> you have had a period of consolidation and then a bunch of tasks wake up at
> the same time, how long will it be until you spread the load to all cpus?

Per CPU vs. per task? This is really not about which is (more) informative,
or why not both. It is about this: when you have task concurrency and CPU
utilization at the same time, and you must make a fast decision right now,
then what?

Actually, it is also about how I want to position the whole CC fuss. CC and
the associated CPU workload consolidation can be regarded as another "layer"
beneath the current sophisticated load balancing, such that this layer senses
globally how many CPUs are needed, and the load balancer then does whatever
it is currently supposed to do within the needed CPUs. I think this is just a
design choice that is effective, yet simpler and less intrusive, as a way to
realize consolidation/packing.

> It seems to me that you are losing some important information by tracking per
> cpu and not per task. Also, your load balance behaviour is very sensitive to
> the choice of decay factor. We have that issue with the runqueue load
> tracking already. It reacts very slowly to load changes, so it can't really
> be used for periodic load-balancing decisions.

The current halflife is 1 period, and the period was 32ms, and is now 64ms
for more aggressive consolidation.

Thanks,
Yuyang
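
For concreteness, the evaluation above could be implemented roughly as in the
sketch below. This is not the actual patch: cc_decayed_sum(), CC_ONE and
cc_consolidate_coeff are assumptions made for the example, log is taken as
log2, and "<=" is read as "the load fits on n CPUs".

#define CC_ONE (1 << 10)        /* fixed-point unit: a concurrency of exactly 1 */

/* Hypothetical per-cpu accessor for the decayed CC sum, in CC_ONE units. */
extern u64 cc_decayed_sum(int cpu);

/* Consolidation coefficient in percent; 100 per the text above. */
static unsigned int cc_consolidate_coeff = 100;

/*
 * Would the load currently spread over the m CPUs in 'span' fit on n of
 * them? Implements the comparison quoted above; requires m > n.
 */
static bool cc_fits_on_n_cpus(const struct cpumask *span, int m, int n)
{
        u64 sum_cc = 0;
        int cpu;

        if (n <= 0 || n >= m)
                return false;

        for_each_cpu(cpu, span)
                sum_cc += cc_decayed_sum(cpu);

        /*
         * sum_cc lies between CC' (no overlap) and CC'' (full overlap);
         * weight it by (n + log2(m - n)) and compare against n CPUs that
         * are each fully occupied (concurrency 1), scaled by the
         * consolidate coefficient.
         */
        return sum_cc * (n + ilog2(m - n)) * 100 <=
               (u64)n * n * CC_ONE * cc_consolidate_coeff;
}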