CFS vs. cpufreq/cstates vs. latency

* CFS vs. cpufreq/cstates vs. latency
From: Rik van Riel @ 2012-07-17 14:23 UTC
  To: Linux kernel Mailing List
  Cc: Peter Zijlstra, Ingo Molnar, Avi Kivity, Gleb Natapov,
	Michael S. Tsirkin, Andi Kleen

While tracking down a latency issue with communication between
KVM guests, we ran into a very interesting issue, an interplay
of CFS and power saving code.

About 3/4 of the 230us latency came from CPUs waking up out of
C-states. Disabling C states reduced the latency to 60us...
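
One way to repeat this kind of comparison from user space is the PM QoS
character device /dev/cpu_dma_latency. The message does not say how C
states were disabled in the original test, so the following is only an
illustrative sketch; the helper name and its `path` parameter are made
up for the example.

```python
import os
import struct


def hold_cpu_dma_latency(max_latency_us, path="/dev/cpu_dma_latency"):
    """Register a PM QoS latency bound and return the open descriptor.

    The kernel expects a native-endian signed 32-bit value in
    microseconds. The caller must keep the descriptor open for as
    long as the bound should apply.
    """
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, struct.pack("=i", max_latency_us))
    return fd


if __name__ == "__main__":
    # Usually needs root. A bound of 0 asks cpuidle to avoid any idle
    # state whose exit latency is above 0us while the fd stays open.
    fd = hold_cpu_dma_latency(0)
    try:
        input("C-state exit latency capped; press Enter to release\n")
    finally:
        os.close(fd)
```

The request is dropped as soon as the descriptor is closed, so the
latency measurement has to run while the helper process is still alive.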

The issue? The communication is between various threads and
processes, each of which last ran on a CPU that is now in a
deeper C state. The total latency from that is "CPU wakeup
latency * NR CPUs woken".
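
A rough worked version of that formula, using the figures quoted above.
The number of CPUs woken is not stated in this message, so the value 4
below is purely an assumption, as are the function names.

```python
def serial_wakeup_latency_us(base_us, exit_latency_us, cpus_woken):
    """Total latency when every hop first has to wake a sleeping CPU."""
    return base_us + exit_latency_us * cpus_woken


def implied_exit_latency_us(total_us, base_us, cpus_woken):
    """Average per-CPU wakeup cost implied by the two measurements."""
    return (total_us - base_us) / cpus_woken


# Measured: 230us with C-states enabled, 60us with them disabled.
c_state_share = (230 - 60) / 230  # roughly 3/4 of the total

# ASSUMPTION: four CPUs woken in sequence (not given in the message).
per_cpu_us = implied_exit_latency_us(230, 60, cpus_woken=4)

print(f"share of latency from C-state exits: {c_state_share:.2f}")
print(f"implied exit latency per CPU: {per_cpu_us} us")
```

The point of the model is that the cost grows linearly with the number
of sleeping CPUs in the wakeup chain, so cutting either factor helps.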

This problem could be common to many different multi-threaded
or multi-process applications. It looks like something that
would be fixable at the scheduler + cpufreq level.

Specifically, waking up some process requires that the CPU
which is running the wakeup is already in C0 state. If the
CPU on which the to-be-woken task ran last is in a deep C
state, it may make sense to simply run the woken up task
on the local CPU, not the CPU where it was originally.

I seem to remember some scheduling code that (for power
saving reasons) tried running all the tasks on one CPU,
until that CPU got busy, and then spilled over onto other
CPUs.

I do not seem to be able to find that code in recent kernels,
but I have the feeling that a policy like that (related to
WAKE_AFFINE scheduling?) could improve this issue.

As an additional benefit, it has the possibility of further
improving power saving.

What do the scheduler and cpufreq people think about this
problem?

Any preferred ways to solve the "N * cpu wakeup latency"
problem that is plaguing multi-process and multi-threaded
workloads?

-- 
All rights reversed


* Re: CFS vs. cpufreq/cstates vs. latency
From: Chris Friesen @ 2012-07-17 17:56 UTC
  To: Rik van Riel
  Cc: Linux kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Avi Kivity, Gleb Natapov, Michael S. Tsirkin, Andi Kleen

On 07/17/2012 08:23 AM, Rik van Riel wrote:

> Specifically, waking up some process requires that the CPU
> which is running the wakeup is already in C0 state. If the
> CPU on which the to-be-woken task ran last is in a deep C
> state, it may make sense to simply run the woken up task
> on the local CPU, not the CPU where it was originally.

While it sounds interesting, I can see possible issues with this:

1) If we're using NUMA, there will be additional cost to running a task
   whose memory is on a remote node.  It might make sense to try to run
   the task on a CPU on that node if possible.
2) It might not make sense to migrate if the local CPU is close to
   capacity.

Presumably the scheduler could take into account the expected delay for
coming out of the C state (which we should know), the expected cost of
migrating the task to the running CPU, and the expected run length of
the task in order to decide whether this makes sense.
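
The trade-off described above could be sketched roughly as follows.
This is a toy user-space model, not scheduler code: every name, the
linear "remote slowdown" term, and the 0.8 load limit are assumptions
chosen only to make the comparison concrete.

```python
from dataclasses import dataclass


@dataclass
class WakeupEstimate:
    exit_latency_us: float      # expected delay for the previous CPU to leave its C state
    migration_cost_us: float    # one-off cost of moving the task (cold cache)
    expected_runtime_us: float  # expected run length of the woken task
    remote_slowdown: float      # ASSUMED fractional slowdown from remote-node memory
    local_cpu_load: float       # utilisation of the waking CPU, 0.0 to 1.0


def run_on_local_cpu(est, load_limit=0.8):
    """Return True if waking the task on the local CPU looks cheaper."""
    # Point 2: do not pile work onto a CPU that is close to capacity.
    if est.local_cpu_load >= load_limit:
        return False
    # Point 1: the NUMA penalty is modelled as growing with run length.
    cost_of_moving = (est.migration_cost_us
                      + est.remote_slowdown * est.expected_runtime_us)
    # Migrate only if it costs less than waiting for the C-state exit.
    return cost_of_moving < est.exit_latency_us


short_task = WakeupEstimate(200, 20, 50, 0.3, 0.2)
long_task = WakeupEstimate(200, 20, 5000, 0.3, 0.2)
print(run_on_local_cpu(short_task), run_on_local_cpu(long_task))
```

With these made-up numbers the short task is pulled to the local CPU,
while the long-running one is left where its memory is.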

> I seem to remember some scheduling code that (for power
> saving reasons) tried running all the tasks on one CPU,
> until that CPU got busy, and then spilled over onto other
> CPUs.

I suspect you're thinking of

/sys/devices/system/cpu/sched_mc_power_savings
/sys/devices/system/cpu/sched_smt_power_savings 

> I do not seem to be able to find that code in recent kernels,
> but I have the feeling that a policy like that (related to
> WAKE_AFFINE scheduling?) could improve this issue.

Looks like it was removed in 8e7fbcb because it was broken.

Chris


* Re: CFS vs. cpufreq/cstates vs. latency
From: Peter Zijlstra @ 2012-07-20 16:57 UTC
  To: Rik van Riel
  Cc: Linux kernel Mailing List, Ingo Molnar, Avi Kivity, Gleb Natapov,
	Michael S. Tsirkin, Andi Kleen

On Tue, 2012-07-17 at 10:23 -0400, Rik van Riel wrote:
> While tracking down a latency issue with communication between
> KVM guests, we ran into a very interesting issue, an interplay
> of CFS and power saving code.
> 
> About 3/4 of the 230us latency came from CPUs waking up out of
> C-states. Disabling C states reduced the latency to 60us...
> 
> The issue? The communication is between various threads and
> processes, each of which last ran on a CPU that is now in a
> deeper C state. The total latency from that is "CPU wakeup
> latency * NR CPUs woken".
> 
> This problem could be common to many different multi-threaded
> or multi-process applications. It looks like something that
> would be fixable at the scheduler + cpufreq level.

There's tons to be fixed there... we should pull most if not all cpufreq
load accounting into the scheduler, it already does most of it anyway.

Also, you want to do per-task policy tracking, something which isn't
possible with the current per-cpu cpufreq setup.

Sadly some hardware makes this very difficult indeed because it needs a
schedulable context to change the cpu freq/volt etc.

> Specifically, waking up some process requires that the CPU
> which is running the wakeup is already in C0 state. If the
> CPU on which the to-be-woken task ran last is in a deep C
> state, it may make sense to simply run the woken up task
> on the local CPU, not the CPU where it was originally.

That's cpuidle, not cpufreq :-) Yay for more players, but yes, I know
I've talked about this very issue to a number of people.

Same as for cpufreq, the accounting crap should move into the scheduler,
we want to use the idle-time guestimator for different things as well.

> I seem to remember some scheduling code that (for power
> saving reasons) tried running all the tasks on one CPU,
> until that CPU got busy, and then spilled over onto other
> CPUs.
> 
> I do not seem to be able to find that code in recent kernels,
> but I have the feeling that a policy like that (related to
> WAKE_AFFINE scheduling?) could improve this issue.
> 
> As an additional benefit, it has the possibility of further
> improving power saving.

What power saving? I recently ripped all that stuff out because it was
terminally broken and the fixes I got were beyond ugly.

There were some people interested in writing a new power aware balancer
infrastructure, but nothing has been forthcoming as yet. Although it
could be they're waiting for PJT's load tracking patches to hit
mainline.

Anyway, you're conflating issues.. you don't want a power aware
balancer, you just don't want it to be unaware of C-states irrespective
of whatever balance policy we're using.
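
To illustrate what "not unaware of C-states" might mean in the simplest
case, here is a toy sketch of a wakeup placement step that prefers the
idle CPU that is cheapest to wake. It is not taken from any kernel
code; the function name and the dictionary input are hypothetical.

```python
def pick_idle_cpu(prev_cpu, exit_latency_us):
    """Choose among idle candidate CPUs by C-state exit latency.

    exit_latency_us maps each idle candidate CPU to the exit latency
    (in microseconds) of the idle state it is currently in. Ties go
    to prev_cpu, where the task's cache contents may still be warm.
    """
    if not exit_latency_us:
        return prev_cpu  # nothing better known; keep the old placement
    return min(exit_latency_us,
               key=lambda cpu: (exit_latency_us[cpu], cpu != prev_cpu))


# CPU 2 ran the task last but is in a deep state; CPU 1 is in C1.
print(pick_idle_cpu(2, {1: 3, 2: 200, 3: 60}))
```

The balance policy itself is untouched here; the C-state only breaks
the choice between CPUs the policy already considers acceptable.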

> What do the scheduler and cpufreq people think about this
> problem?
> 
> Any preferred ways to solve the "N * cpu wakeup latency"
> problem that is plaguing multi-process and multi-threaded
> workloads?

Yeah, unify all the various load tracking and guestimator logic in the
scheduler and go from there ;-)


* Re: CFS vs. cpufreq/cstates vs. latency
From: Avi Kivity @ 2012-07-22 10:07 UTC
  To: Rik van Riel
  Cc: Linux kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Gleb Natapov, Michael S. Tsirkin, Andi Kleen

On 07/17/2012 05:23 PM, Rik van Riel wrote:
> While tracking down a latency issue with communication between
> KVM guests, we ran into a very interesting issue, an interplay
> of CFS and power saving code.
> 
> About 3/4 of the 230us latency came from CPUs waking up out of
> C-states. Disabling C states reduced the latency to 60us...
> 
> The issue? The communication is between various threads and
> processes, each of which last ran on a CPU that is now in a
> deeper C state. The total latency from that is "CPU wakeup
> latency * NR CPUs woken".
> 
> This problem could be common to many different multi-threaded
> or multi-process applications. It looks like something that
> would be fixable at the scheduler + cpufreq level.
> 
> Specifically, waking up some process requires that the CPU
> which is running the wakeup is already in C0 state. If the
> CPU on which the to-be-woken task ran last is in a deep C
> state, it may make sense to simply run the woken up task
> on the local CPU, not the CPU where it was originally.
> 
> I seem to remember some scheduling code that (for power
> saving reasons) tried running all the tasks on one CPU,
> until that CPU got busy, and then spilled over onto other
> CPUs.
> 
> I do not seem to be able to find that code in recent kernels,
> but I have the feeling that a policy like that (related to
> WAKE_AFFINE scheduling?) could improve this issue.
> 
> As an additional benefit, it has the possibility of further
> improving power saving.
> 
> What do the scheduler and cpufreq people think about this
> problem?
> 
> Any preferred ways to solve the "N * cpu wakeup latency"
> problem that is plaguing multi-process and multi-threaded
> workloads?

A few notes:

- if you go into deep C-state, it may be worthwhile to migrate all the
interrupts away from that cpu.  sysfs says C3 latency is 200 us on one
of my machines, if we go there we should migrate anything important away.

- I believe some of those C-states flush the cache, so executing on a
cpu that has awoken from one of these states will be slow for a
while; needs to be taken into account.

- C1 state is listed as having 3 us latency.  If we're expecting a
wakeup soon and are sensitive to latency, it's better to spin for a bit
before sleeping.
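
The per-state latencies mentioned in these notes are exported by the
cpuidle driver under /sys/devices/system/cpu/cpuN/cpuidle/stateM/, in
the "name" and "latency" (microseconds) files. The sketch below reads
them; the function names, the `sysfs_root` parameter and the spin rule
are assumptions made for the example, not an established policy.

```python
from pathlib import Path


def read_cpuidle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(state name, exit latency in us), ...] for one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(base.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text())
        states.append((name, latency_us))
    return states


def worth_spinning(expected_wait_us, exit_latency_us):
    """ASSUMED rule: spin if the wait is no longer than the exit cost."""
    return expected_wait_us <= exit_latency_us


if __name__ == "__main__":
    for name, latency_us in read_cpuidle_states(0):
        print(f"{name}: {latency_us} us")
```

On a machine without cpuidle support the directory is absent and the
function simply returns an empty list.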


-- 
error compiling committee.c: too many arguments to function




* Re: CFS vs. cpufreq/cstates vs. latency
From: Chris Friesen @ 2012-07-24 18:08 UTC
  To: Avi Kivity
  Cc: Rik van Riel, Linux kernel Mailing List, Peter Zijlstra,
	Ingo Molnar, Gleb Natapov, Michael S. Tsirkin, Andi Kleen

> On 07/17/2012 05:23 PM, Rik van Riel wrote:
>> While tracking down a latency issue with communication between
>> KVM guests, we ran into a very interesting issue, an interplay
>> of CFS and power saving code.
>>
>> About 3/4 of the 230us latency came from CPUs waking up out of
>> C-states. Disabling C states reduced the latency to 60us...
>>
>> The issue? The communication is between various threads and
>> processes, each of which last ran on a CPU that is now in a
>> deeper C state. The total latency from that is "CPU wakeup
>> latency * NR CPUs woken".
>>
>> This problem could be common to many different multi-threaded
>> or multi-process applications. It looks like something that
>> would be fixable at the scheduler + cpufreq level.
>>
>> Specifically, waking up some process requires that the CPU
>> which is running the wakeup is already in C0 state. If the
>> CPU on which the to-be-woken task ran last is in a deep C
>> state, it may make sense to simply run the woken up task
>> on the local CPU, not the CPU where it was originally.
>>
>> I seem to remember some scheduling code that (for power
>> saving reasons) tried running all the tasks on one CPU,
>> until that CPU got busy, and then spilled over onto other
>> CPUs.
>>
>> I do not seem to be able to find that code in recent kernels,
>> but I have the feeling that a policy like that (related to
>> WAKE_AFFINE scheduling?) could improve this issue.
>>
>> As an additional benefit, it has the possibility of further
>> improving power saving.
>>
>> What do the scheduler and cpufreq people think about this
>> problem?
>>
>> Any preferred ways to solve the "N * cpu wakeup latency"
>> problem that is plaguing multi-process and multi-threaded
>> workloads?
> A few notes:
>
> - if you go into deep C-state, it may be worthwhile to migrate all the
> interrupts away from that cpu.  sysfs says C3 latency is 200 us on one
> of my machines, if we go there we should migrate anything important away.
>
> - I believe some of those C-states flush the cache, so executing on a
> cpu that has awoken from one of these states will be slow for a
> while; needs to be taken into account.

On current Intel, I think C3 flushes L1/L2, and when all cores on a socket
are in C7 the last-level cache is flushed.

Chris


