* [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
@ 2014-04-24 19:30 Yuyang Du
From: Yuyang Du @ 2014-04-24 19:30 UTC (permalink / raw)
To: mingo, peterz, linux-kernel, linux-pm
Cc: arjan.van.de.ven, len.brown, rafael.j.wysocki, alan.cox,
mark.gross, morten.rasmussen, vincent.guittot, yuyang.du
Hi Ingo, PeterZ, and others,
The current scheduler's load balancing is completely work-conserving. For
workloads with generally low CPU utilization but frequent bursts of transient
tasks, migrating tasks to engage all available CPUs for the sake of
work-conservation can incur significant overhead: cache locality loss,
idle/active HW state transition latency and power, shallower idle states,
etc. This is inefficient in both power and performance, especially on today's
low-power mobile processors.
This RFC introduces a notion of idleness-conservation into the otherwise
work-conserving scheduler (we certainly do not want to be extreme in only one
direction). But to what extent should idleness be conserved, bearing in mind
that we don't want to sacrifice performance? We first need a load/idleness
indicator to answer that.
Thanks to CFS's goal to "model an ideal, precise multi-tasking CPU", the tasks
in the runqueue can be seen as running concurrently. So it is natural to use
task concurrency as a load indicator. To that end, we do two things:
1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:
a = sum(concurrency * time) / period
2) Exponentially decay past periods and sum them all, for hysteresis against
load drops and resilience against load rises (let f be the decay factor, and
a_x the average of the xth period since period 0):
s = a_n + f * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0
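
Equivalently, s can be maintained incrementally with the recurrence

s_n = a_n + f * s_(n-1), s_0 = a_0

so only the previous decayed sum needs to be stored rather than the whole
period history.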
We name this load indicator CPU ConCurrency (CC): task concurrency determines
how many CPUs need to be running concurrently.
To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
scheduler tick, and 4) enter/exit idle.
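
For illustration, a minimal sketch of that bookkeeping follows. It is not the
actual patch: the cpu_concurrency_t structure, the rq->concurrency field and
the constants are invented for this example (only rq->nr_running is an
existing field), and a real implementation would also need locking and nohz
handling. The 32ms period and the halflife of one period correspond to the
values mentioned in the follow-up discussion below.

struct cpu_concurrency_t {
        u64     period_start;   /* start of the current period (ns) */
        u64     last_update;    /* last time contrib was accumulated (ns) */
        u64     contrib;        /* sum(nr_running * time) in this period */
        u64     decayed_sum;    /* s, fixed point (unit = 1 << CC_RESOLUTION) */
};

#define CC_PERIOD_NS    (32 * NSEC_PER_MSEC)    /* period length */
#define CC_RESOLUTION   10                      /* fixed-point shift */

/*
 * Called from enqueue, dequeue, the scheduler tick and idle entry/exit,
 * before rq->nr_running is changed by the caller.
 */
static void update_cpu_concurrency(struct rq *rq, u64 now)
{
        struct cpu_concurrency_t *cc = &rq->concurrency;

        /* Close out every period that has fully elapsed. */
        while (now - cc->period_start >= CC_PERIOD_NS) {
                u64 end = cc->period_start + CC_PERIOD_NS;
                u64 a_n;

                cc->contrib += (end - cc->last_update) * rq->nr_running;
                a_n = div_u64(cc->contrib << CC_RESOLUTION, CC_PERIOD_NS);

                /* s_n = a_n + f * s_(n-1), with f = 1/2 (halflife = 1 period) */
                cc->decayed_sum = a_n + (cc->decayed_sum >> 1);

                cc->contrib = 0;
                cc->period_start = end;
                cc->last_update = end;
        }

        /* Partial progress within the still-open period. */
        cc->contrib += (now - cc->last_update) * rq->nr_running;
        cc->last_update = now;
}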
Using CC, we implemented a Workload Consolidation patch on two Intel mobile
platforms (a quad-core composed of two dual-core modules): load and load
balancing are contained within the first dual-core module when the aggregated
CC is low, and spread across the full quad-core otherwise. Results show power
savings and no substantial performance regression (even gains for some
workloads).
Thanks,
Yuyang
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Mike Galbraith @ 2014-04-25 6:23 UTC (permalink / raw)
To: Yuyang Du
Cc: mingo, peterz, linux-kernel, linux-pm, arjan.van.de.ven, len.brown,
    rafael.j.wysocki, alan.cox, mark.gross, morten.rasmussen, vincent.guittot

On Fri, 2014-04-25 at 03:30 +0800, Yuyang Du wrote:
> To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
> scheduler tick, and 4) enter/exit idle.

Boo hiss to 1, 2 and 4. Less fastpath math would be better.

-Mike
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Vincent Guittot @ 2014-04-25 8:00 UTC (permalink / raw)
To: Yuyang Du
Cc: mingo@redhat.com, Peter Zijlstra, linux-kernel, linux-pm@vger.kernel.org,
    arjan.van.de.ven, Len Brown, rafael.j.wysocki, alan.cox, Gross, Mark,
    Morten Rasmussen

On 24 April 2014 21:30, Yuyang Du <yuyang.du@intel.com> wrote:
[...]
> 1) Divide continuous time into periods, and average the task concurrency
> within each period, to tolerate transient bursts:
> a = sum(concurrency * time) / period
> 2) Exponentially decay past periods and sum them all, for hysteresis against
> load drops and resilience against load rises (let f be the decay factor, and
> a_x the average of the xth period since period 0):
> s = a_n + f * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0

In the original version of the per-entity load tracking patchset, there was
a usage_avg_sum field that counted the time the task was really running on
the CPU. By combining this (since removed) field with runnable_avg_sum, you
should get a similar concurrency value, but with the current load tracking
mechanism instead of creating a new one.

Vincent
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Peter Zijlstra @ 2014-04-25 9:57 UTC (permalink / raw)
To: Vincent Guittot
Cc: Yuyang Du, mingo@redhat.com, linux-kernel, linux-pm@vger.kernel.org,
    arjan.van.de.ven, Len Brown, rafael.j.wysocki, alan.cox, Gross, Mark,
    Morten Rasmussen

On Fri, Apr 25, 2014 at 10:00:02AM +0200, Vincent Guittot wrote:
[...]
> In the original version of the per-entity load tracking patchset, there was
> a usage_avg_sum field that counted the time the task was really running on
> the CPU. By combining this (since removed) field with runnable_avg_sum, you
> should get a similar concurrency value, but with the current load tracking
> mechanism instead of creating a new one.

I'm not entirely sure I understood what was proposed, but I suspect it's very
close to what I told you to do with the capacity muck: use avg utilization
instead of 1 active task per core.

And yes, the current load tracking should be pretty close. We just need to
come up with another way of doing SMT again; bloody inconvenient SMT.
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Morten Rasmussen @ 2014-04-25 10:23 UTC (permalink / raw)
To: Yuyang Du
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org,
    linux-pm@vger.kernel.org, arjan.van.de.ven@intel.com, len.brown@intel.com,
    rafael.j.wysocki@intel.com, alan.cox@intel.com, mark.gross@intel.com,
    vincent.guittot@linaro.org

Hi Yuyang,

On Thu, Apr 24, 2014 at 08:30:05PM +0100, Yuyang Du wrote:
[...]
> Using CC, we implemented a Workload Consolidation patch on two Intel mobile
> platforms (a quad-core composed of two dual-core modules): load and load
> balancing are contained within the first dual-core module when the
> aggregated CC is low, and spread across the full quad-core otherwise.
> Results show power savings and no substantial performance regression (even
> gains for some workloads).

The idea you present seems quite similar to the task packing proposals by
Vincent and others that were discussed about a year ago. One of the main
issues related to task packing/consolidation is that it is not always
beneficial.

I have spent some time over the last couple of weeks looking into this,
trying to figure out when task consolidation makes sense. The pattern I have
seen is that it makes most sense when the task energy is dominated by
wake-up costs, that is, for short-running tasks. The actual energy savings
come from a reduced number of wake-ups, if the consolidation cpu is busy
enough to be already awake when another task wakes up, and from keeping the
consolidation cpu in a shallower idle state and thereby reducing the wake-up
costs. The wake-up cost savings outweigh the additional leakage in the
shallower idle state in some scenarios. All of this is of course quite
platform dependent: different idle state leakage power and wake-up costs may
change the picture.

I'm therefore quite interested in knowing what sort of test scenarios you
used and the parameters for CC (f and the size of the periods).

I'm not convinced (yet) that a cpu load concurrency indicator is sufficient
to make the call on when to consolidate tasks. I'm wondering whether we need
a power model to guide the decisions.

Whether you use CC or reintroduce usage_avg as Vincent proposes, I believe
the overhead should be roughly the same.

Morten
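
To put the trade-off Morten describes in rough, back-of-the-envelope form
(the symbols here are illustrative only and not taken from any patch or
existing model), consolidation pays off roughly when the wake-up energy
saved exceeds the extra leakage spent in the shallower idle state:

N_wakeups * (E_wakeup_deep - E_wakeup_shallow) > (P_leak_shallow - P_leak_deep) * T_idle

where N_wakeups is the number of wake-ups that no longer exit the deep state,
E_wakeup_* is the energy cost of exiting each state, P_leak_* is the
idle-state leakage power, and T_idle is the idle time spent in the shallower
state. Which side wins is, as noted above, platform dependent.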
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Rafael J. Wysocki @ 2014-04-25 12:19 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Yuyang Du, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org

On Friday, April 25, 2014 11:23:07 AM Morten Rasmussen wrote:
[...]
> I have spent some time over the last couple of weeks looking into this,
> trying to figure out when task consolidation makes sense. The pattern I have
> seen is that it makes most sense when the task energy is dominated by
> wake-up costs, that is, for short-running tasks. [...]

The problem, however, is that it usually is not really known in advance
whether or not a given task will be short-running. There simply is no way
to tell.

The only kinds of information we can possibly use to base decisions on are
(1) things that don't change (or if they change, we know exactly when and
how), such as the system's topology, and (2) information on what happened in
the past. So, for example, if there's a task that has been running for some
time already and it has behaved in approximately the same way all the time,
it is reasonable to assume that it will behave in this way in the future. We
need to let it run for a while to collect that information, though.

Without that kind of information we can only speculate about what's going to
happen, and different methods of speculation may lead to better or worse
results in a given situation, but still that's only speculation and the
results are only known after the fact.

In the reverse case, if I know the system topology and I have a particular
workload, I know what's going to happen, so I can find a load balancing
method that will be perfect for this particular workload on this particular
system. That's not the situation the scheduler has to deal with, though,
because the workload is unknown to it until it has been measured.

So in my opinion we need to figure out how to measure workloads while they
are running and then use that information to make load balancing decisions.

In principle, given the system's topology, task packing may lead to better
results for some workloads, but not necessarily for all of them. So we need
a way to determine (a) whether or not task packing is an option at all in
the given system (that may change over time due to user policy changes etc.)
and, if that is the case, (b) whether the current workload is eligible for
task packing.

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Morten Rasmussen @ 2014-04-25 14:53 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Yuyang Du, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org

On Fri, Apr 25, 2014 at 01:19:46PM +0100, Rafael J. Wysocki wrote:
[...]
> So in my opinion we need to figure out how to measure workloads while they
> are running and then use that information to make load balancing decisions.
>
> In principle, given the system's topology, task packing may lead to better
> results for some workloads, but not necessarily for all of them. So we need
> a way to determine (a) whether or not task packing is an option at all in
> the given system (that may change over time due to user policy changes etc.)
> and, if that is the case, (b) whether the current workload is eligible for
> task packing.

I fully agree. My point was that there is more to task consolidation than
just observing the degree of task parallelism. The system topology has a lot
to say when deciding whether or not to pack. That was the motivation for
proposing to have a power model for the system topology to help make that
decision.

We do already have some per-task metrics available that may be useful for
determining whether a workload is eligible for task packing. The
load_avg_contrib gives us an indication of the task's cpu utilization, and
we also count task wake-ups. If we tracked task wake-ups over time (right
now we only have the sum) we should be able to reason about the number of
wake-ups that a task causes. Lots of wake-ups and a low load_avg_contrib
would indicate that the task's power is likely to be dominated by the
wake-up costs if it is placed on a cpu in a deep idle state.

I fully agree that measuring the workloads while they are running is the way
to go. I'm just wondering if the proposed cpu concurrency measure is
sufficient to make the task packing decision for all system topologies, or
if we need something that incorporates more system topology information. If
the latter, we may want to roll it all into something like an
energy_diff(src_cpu, dst_cpu, task) helper function for use in
load-balancing decisions.

Morten
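
The per-task signals Morten mentions could feed a classification along these
lines. This is only an illustrative sketch, not an existing kernel API: the
wake-up rate argument and both thresholds are invented for the example; in
the 3.14-era kernel, load_avg_contrib (in struct sched_avg) is real, while
only cumulative wake-up counts exist in the schedstats.

#define WAKEUP_UTIL_THRESH      64      /* hypothetical "low utilization" cutoff */
#define WAKEUP_RATE_THRESH      100     /* hypothetical wake-ups/sec cutoff */

static bool task_wakeup_dominated(struct task_struct *p,
                                  unsigned long wakeups_per_sec)
{
        /* Per-entity load tracking signal (se.avg.load_avg_contrib). */
        unsigned long util = p->se.avg.load_avg_contrib;

        /*
         * Low utilization plus frequent wake-ups: the task's energy is
         * likely dominated by idle-exit costs, so packing it onto an
         * already-awake cpu may help.
         */
        return util < WAKEUP_UTIL_THRESH &&
               wakeups_per_sec > WAKEUP_RATE_THRESH;
}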
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Rafael J. Wysocki @ 2014-04-27 14:05 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Yuyang Du, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org

On Friday, April 25, 2014 03:53:34 PM Morten Rasmussen wrote:
> On Fri, Apr 25, 2014 at 01:19:46PM +0100, Rafael J. Wysocki wrote:
[...]
> I fully agree. My point was that there is more to task consolidation than
> just observing the degree of task parallelism. The system topology has a lot
> to say when deciding whether or not to pack. That was the motivation for
> proposing to have a power model for the system topology to help make that
> decision.

Agreed.

> We do already have some per-task metrics available that may be useful for
> determining whether a workload is eligible for task packing. [...] Lots of
> wake-ups and a low load_avg_contrib would indicate that the task's power is
> likely to be dominated by the wake-up costs if it is placed on a cpu in a
> deep idle state.

Agreed too.

> I fully agree that measuring the workloads while they are running is the way
> to go. I'm just wondering if the proposed cpu concurrency measure is
> sufficient to make the task packing decision for all system topologies, or
> if we need something that incorporates more system topology information. If
> the latter, we may want to roll it all into something like an
> energy_diff(src_cpu, dst_cpu, task) helper function for use in
> load-balancing decisions.

I think that the latter is the case. For example, quite a lot may depend on
how much cache is shared between different CPU cores and what the cache
hierarchy is in general, especially if the workload fits into the cache
entirely.

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Yuyang Du @ 2014-04-27 20:07 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Rafael J. Wysocki, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org,
    rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
    harinarayanan.seshadri, yuyang.du

On Fri, Apr 25, 2014 at 03:53:34PM +0100, Morten Rasmussen wrote:
[...]
> I fully agree that measuring the workloads while they are running is the way
> to go. I'm just wondering if the proposed cpu concurrency measure is
> sufficient to make the task packing decision for all system topologies, or
> if we need something that incorporates more system topology information. If
> the latter, we may want to roll it all into something like an
> energy_diff(src_cpu, dst_cpu, task) helper function for use in
> load-balancing decisions.

Thank you.

On top of CC, in the consolidation part, we 1) attach the CPU topology to
"help make that decision" and to be adaptive beyond our experimental
platforms, and 2) intercept the current load balance for load and load
balancing containment.

Maybe the way we consolidate the workload differs from previous proposals in
that:

1) We don't do it per task. We only look at how many CPUs need to run
concurrently (on average and by prediction, at power-gated units) for the
workload, and simply consolidate.

2) I am not sure it is sufficient either, :). But I can offer another two
ways of interpreting CC.

2.1) The current work-conserving load balance also uses CC, but instantaneous
CC (similar to what PeterZ said to Vincent?).

2.2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization.
If we change "a = sum(concurrency * time) / period" to
"a' = sum(1 * time) / period", then a' is just the CPU utilization. And the
way we weight runqueue length is the simplest one (excluding the exponential
decays; you may have other ways).

The workloads they (not me) used to evaluate the Workload Consolidation are
1) 50+ perf/ux benchmarks (almost all of the magazine ones), and 2) ~10 power
workloads; of course, these are the easiest ones, such as browsing, audio,
video, recording, imaging, etc.

Thanks,
Yuyang
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Morten Rasmussen @ 2014-04-28 15:22 UTC (permalink / raw)
To: Yuyang Du
Cc: Rafael J. Wysocki, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org,
    rajeev.d.muralidhar@intel.com, vishwesh.m.rudramuni@intel.com,
    nicole.chalhoub@intel.com, ajaya.durg@intel.com,
    harinarayanan.seshadri@intel.com

On Sun, Apr 27, 2014 at 09:07:25PM +0100, Yuyang Du wrote:
[...]
> 1) We don't do it per task. We only look at how many CPUs need to run
> concurrently (on average and by prediction, at power-gated units) for the
> workload, and simply consolidate.

I'm a bit confused: do you have one global CC that tracks the number of tasks
across all runqueues in the system, or one for each cpu? There could be some
contention when updating that value on larger systems if it is one global CC.
If they are separate, how do you then decide when to consolidate?

How do you determine your "f" parameter? How fast is the reaction time? If
you have had a period of consolidation and then a bunch of tasks wake up at
the same time, how long will it be until you spread the load to all cpus?

> 2.1) The current work-conserving load balance also uses CC, but instantaneous
> CC (similar to what PeterZ said to Vincent?).

The existing load balancing based on load_avg_contrib factors in task
parallelism implicitly. If you have two tasks runnable at the same time, one
of them will have to wait on the rq, resulting in it getting a higher
load_avg_contrib than it would have had if the two tasks had become runnable
at different times (no parallelism). The higher load_avg_contrib means that
the load balancer is more likely to spread tasks that overlap in time,
similar to what you achieve with CC. But it doesn't do the reverse.

> 2.2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization.
> If we change "a = sum(concurrency * time) / period" to
> "a' = sum(1 * time) / period", then a' is just the CPU utilization. And the
> way we weight runqueue length is the simplest one (excluding the exponential
> decays; you may have other ways).

Right. How do you distinguish between having a concurrency of 1 for 100% of
the time and having a concurrency of 2 for 50% of the time? Both should give
an average concurrency of very close to 1, depending on your exponential
decay.

It seems to me that you are losing some important information by tracking per
cpu and not per task. Also, your load balance behaviour is very sensitive to
the choice of decay factor. We have that issue with the runqueue load
tracking already. It reacts very slowly to load changes, so it can't really
be used for periodic load-balancing decisions.

> The workloads they (not me) used to evaluate the Workload Consolidation are
> 1) 50+ perf/ux benchmarks (almost all of the magazine ones), and 2) ~10 power
> workloads; of course, these are the easiest ones, such as browsing, audio,
> video, recording, imaging, etc.

Can you share how much of the time the benchmarks actually ran consolidated
vs spread out? IIUC, you consolidate on two cpus, which should be enough for
a lot of workloads.

Morten
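
To spell out the ambiguity with the period average defined in the RFC: one
task running for the whole period gives a = (1 * period) / period = 1, while
two tasks overlapping for half the period (the cpu being idle otherwise) give
a = (2 * period/2 + 0 * period/2) / period = 1 as well, so the per-cpu
average alone cannot distinguish the two cases.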
* Re: [RFC] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Yuyang Du @ 2014-04-28 19:26 UTC (permalink / raw)
To: Morten Rasmussen
Cc: Rafael J. Wysocki, mingo@redhat.com, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    arjan.van.de.ven@intel.com, len.brown@intel.com, rafael.j.wysocki@intel.com,
    alan.cox@intel.com, mark.gross@intel.com, vincent.guittot@linaro.org,
    rajeev.d.muralidhar@intel.com, vishwesh.m.rudramuni@intel.com,
    nicole.chalhoub@intel.com, ajaya.durg@intel.com,
    harinarayanan.seshadri@intel.com, yuyang.du

> I'm a bit confused: do you have one global CC that tracks the number of tasks
> across all runqueues in the system, or one for each cpu? There could be some
> contention when updating that value on larger systems if it is one global CC.
> If they are separate, how do you then decide when to consolidate?

Oh, we are getting down to business.

Currently, CC is per CPU. To consolidate, the formula is based on a
heuristic. Suppose we have 2 CPUs, and their task concurrency over time looks
like this ('-' means no task, 'x' means having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be
CC' = CC[0] + CC[1] for case 1, and CC'' = (CC[0] + CC[1]) * 2 for case 2.
For the cases in between case 1 and case 2, in terms of how the xxx's
overlap, the consolidated CC should be between CC' and CC''. So we uniformly
use this condition to evaluate for consolidation (suppose we consolidate m
CPUs to n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n * consolidate_coefficient

The consolidate_coefficient could be, say, 100%, or more or less.

> How do you determine your "f" parameter? How fast is the reaction time? If
> you have had a period of consolidation and then a bunch of tasks wake up at
> the same time, how long will it be until you spread the load to all cpus?

Per CPU vs. per task? This is really not about which is (more) informative,
or why not both. It is about this: when you have task concurrency and CPU
utilization at the same time, and you must make a fast decision right now,
then what?

Actually, it is also about how I want to position the whole CC fuss. CC and
the associated CPU workload consolidation can be regarded as another "layer"
beneath the current sophisticated load balancing, such that this layer senses
globally how many CPUs are needed, and the load balancer then does whatever
it is currently supposed to do within the needed CPUs. I think this is just a
design choice that is effective, yet simpler and less intrusive, as a way to
realize consolidation/packing.

> It seems to me that you are losing some important information by tracking per
> cpu and not per task. Also, your load balance behaviour is very sensitive to
> the choice of decay factor. We have that issue with the runqueue load
> tracking already. It reacts very slowly to load changes, so it can't really
> be used for periodic load-balancing decisions.

The current halflife is 1 period, and the period was 32ms, and is now 64ms
for more aggressive consolidation.

Thanks,
Yuyang
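
For concreteness, the evaluation above could be implemented roughly as in the
sketch below. This is not the actual patch: cc_decayed_sum(), CC_ONE and
cc_consolidate_coeff are assumptions made for the example, log is taken as
log2, and "<=" is read as "the load fits on n CPUs".

#define CC_ONE (1 << 10)        /* fixed-point unit: a concurrency of exactly 1 */

/* Hypothetical per-cpu accessor for the decayed CC sum, in CC_ONE units. */
extern u64 cc_decayed_sum(int cpu);

/* Consolidation coefficient in percent; 100 per the text above. */
static unsigned int cc_consolidate_coeff = 100;

/*
 * Would the load currently spread over the m CPUs in 'span' fit on n of
 * them? Implements the comparison quoted above; requires m > n.
 */
static bool cc_fits_on_n_cpus(const struct cpumask *span, int m, int n)
{
        u64 sum_cc = 0;
        int cpu;

        if (n <= 0 || n >= m)
                return false;

        for_each_cpu(cpu, span)
                sum_cc += cc_decayed_sum(cpu);

        /*
         * sum_cc lies between CC' (no overlap) and CC'' (full overlap);
         * weight it by (n + log2(m - n)) and compare against n CPUs that
         * are each fully occupied (concurrency 1), scaled by the
         * consolidate coefficient.
         */
        return sum_cc * (n + ilog2(m - n)) * 100 <=
               (u64)n * n * CC_ONE * cc_consolidate_coeff;
}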