Re: [PATCH 0/4] Really lazy fpu

* Re: [PATCH 0/4] Really lazy fpu
@ 2010-06-16 11:32 George Spelvin
  2010-06-16 11:46 ` Avi Kivity
  0 siblings, 1 reply; 17+ messages in thread
From: George Spelvin @ 2010-06-16 11:32 UTC (permalink / raw)
  To: avi, mingo; +Cc: linux, linux-kernel, npiggin

> But on busy servers where most wakeups are IRQ based the chance of being on 
> the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of 
> CPUs.

That doesn't seem right.  If the server is busy with FPU-using tasks, then
the FPU state has already been swapped out, and no IPI is necessary.

The worst case seems to be a lot of non-FPU CPU hogs, and a few FPU-using tasks
that get bounced around the CPUs like pinballs.

It is an explicit scheduler goal to keep tasks on the same CPU across
schedules, so they get to re-use their cache state.  The IPI only happens
when that goal is not met, *and* the FPU state has not been forced out
by another FPU-using task.

Not completely trivial to arrange.

(A halfway version of this optimization which would avoid the need for
an IPI would be to *save* the FPU state, but mark it "clean", so the re-load
can be skipped if we're lucky.  If the code supported this as well as the
IPI alternative, you could make a heuristic guess at switch-out time
whether to save immediately or hope the odds of needing the IPI are less than
the fxsave/IPI cost ratio.)
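
The heuristic suggested above can be sketched as a small cost comparison.
This is only an illustrative Python model, not kernel code: the function
names, the assumption that migration and eviction are independent, and the
cycle counts mentioned in the note below are all hypothetical.

```python
def ipi_probability(p_migrate: float, p_evicted: float) -> float:
    """Chance that a deferred save ends up needing an IPI.

    Assumes (hypothetically) that migration and eviction are independent:
    an IPI is needed only if the task moved to another CPU *and* no other
    FPU-using task has already forced its state out of the registers.
    """
    return p_migrate * (1.0 - p_evicted)


def should_defer_save(p_ipi: float, fxsave_cycles: float, ipi_cycles: float) -> bool:
    """True when skipping the save at switch-out is the better bet.

    Deferring wins when the odds of needing the IPI are below the
    fxsave/IPI cost ratio, i.e. p_ipi < fxsave_cycles / ipi_cycles.
    """
    return p_ipi < fxsave_cycles / ipi_cycles
```

With an assumed 300-cycle save and an assumed 3000-cycle IPI, the break-even
point in this model is a 10% chance of needing the IPI.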


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16 11:32 [PATCH 0/4] Really lazy fpu George Spelvin
@ 2010-06-16 11:46 ` Avi Kivity
  2010-06-17  9:38   ` George Spelvin
  0 siblings, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 11:46 UTC (permalink / raw)
  To: George Spelvin; +Cc: mingo, linux-kernel, npiggin

On 06/16/2010 02:32 PM, George Spelvin wrote:
>
> (A halfway version of this optimization which would avoid the need for
> an IPI would be to *save* the FPU state, but mark it "clean", so the re-load
> can be skipped if we're lucky.  If the code supported this as well as the
> IPI alternative, you could make a heuristic guess at switch-out time
> whether to save immediately or hope the odds of needing the IPI are less than
> the fxsave/IPI cost ratio.)
>    

That's an interesting optimization - and we already have something 
similar in the form of fpu preload.  Shouldn't be too hard to do.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16 11:46 ` Avi Kivity
@ 2010-06-17  9:38   ` George Spelvin
  0 siblings, 0 replies; 17+ messages in thread
From: George Spelvin @ 2010-06-17  9:38 UTC (permalink / raw)
  To: avi; +Cc: linux-kernel, mingo, npiggin

> That's an interesting optimization - and we already have something 
> similar in the form of fpu preload.  Shouldn't be too hard to do.

Unfortunately, there's no dirty flag for the preloaded FPU state.
Unless you take an interrupt, which defeats the whole purpose of
preload.

AFAIK, I should add; there's a lot of obscure stuff in the x86
system-level architecture.  But a bit of searching around the source
didn't show me anything; once we've used the CPU for 5 context switches,
the kernel calls __math_state_restore when loading the new state,
which sets TS_USEDFPU.

(While you're mucking about in there, do you suppose the gas < 2.16
workaround in arch/x86/include/asm/i387.h:fxsave() can be yanked yet?)


* [PATCH 0/4] Really lazy fpu
@ 2010-06-13 15:03 Avi Kivity
  2010-06-13 20:45 ` Valdis.Kletnieks
  2010-06-16  7:24 ` Avi Kivity
  0 siblings, 2 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-13 15:03 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin; +Cc: kvm, linux-kernel

Currently fpu management is only lazy in one direction.  When we switch into
a task, we may avoid loading the fpu state in the hope that the task will
never use it.  If we guess right we save an fpu load/save cycle; if not,
a Device not Available exception will remind us to load the fpu.

However, in the other direction, fpu management is eager.  When we switch out
of an fpu-using task, we always save its fpu state.

This is wasteful if none of the tasks that run until we switch back in use
the fpu, since we could have kept the task's fpu on the cpu all this time
and saved an fpu save/load cycle.  This can be quite common with threaded
interrupts, but will also happen with normal kernel threads and even normal
user tasks.

This patch series converts task fpu management to be fully lazy.  When
switching out of a task, we keep its fpu state on the cpu, only flushing it
if some other task needs the fpu.
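
The difference between the two schemes can be illustrated with a toy
single-CPU model. The following Python sketch is not the patch's
implementation; `count_fpu_transfers`, its `schedule` format, and the
simplification that a task's first fpu use counts as a load are all
assumptions made for illustration.

```python
def count_fpu_transfers(schedule, lazy_save):
    """Count fpu state saves and loads for a sequence of time slices.

    schedule:  list of (task_name, uses_fpu) slices on one cpu.
    lazy_save: False models the current behaviour (save at switch-out),
               True models keeping the state on the cpu until another
               task needs the fpu.
    Returns (saves, loads).
    """
    saves = loads = 0
    owner = None  # task whose state is currently live in the fpu registers
    for task, uses_fpu in schedule:
        if uses_fpu and owner != task:
            if owner is not None:
                saves += 1  # flush the previous owner's state first
            loads += 1      # load (or initialise) this task's state
            owner = task
        if not lazy_save and owner == task:
            saves += 1      # eager: always save when switching out
            owner = None
    return saves, loads


# One fpu-using task interleaved with a non-fpu interrupt thread.
schedule = [("A", True), ("irq", False), ("A", True), ("irq", False), ("A", True)]
eager = count_fpu_transfers(schedule, lazy_save=False)
lazy = count_fpu_transfers(schedule, lazy_save=True)
```

In this model the interleaved case needs three saves and three loads when
saving eagerly, and a single load when saving lazily. When two fpu-using
tasks alternate, the lazy scheme saves less but still has to flush on every
change of owner.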

Open issues/TODO:

- patch 2 enables interrupts during #NM.  There's a comment that says
  it shouldn't be done, presumably because of old-style #FERR handling.
  Need to fix one way or the other (dropping #FERR support, eagerly saving
  state when #FERR is detected, or dropping the entire optimization on i386)
- flush fpu state on cpu offlining (trivial)
- make sure the AMD FXSAVE workaround still works correctly
- reduce IPIs by flushing fpu state when we know a task is being migrated
  (guidance from scheduler folk appreciated)
- preemptible kernel_fpu_begin() to improve latency on raid and crypto setups
  (will post patches)
- lazy host-side kvm fpu management (will post patches)
- accelerate signal delivery by allocating signal handlers their own fpu
  state, and letting them run with the normal task's fpu until they use
  an fp instruction (will generously leave to interested parties)

Avi Kivity (4):
  x86, fpu: merge __save_init_fpu() implementations
  x86, fpu: run device not available trap with interrupts enabled
  x86, fpu: Let the fpu remember which cpu it is active on
  x86, fpu: don't save fpu state when switching from a task

 arch/x86/include/asm/i387.h      |  126 +++++++++++++++++++++++++++++++++-----
 arch/x86/include/asm/processor.h |    4 +
 arch/x86/kernel/i387.c           |    3 +
 arch/x86/kernel/process.c        |    1 +
 arch/x86/kernel/process_32.c     |   12 +++-
 arch/x86/kernel/process_64.c     |   13 +++--
 arch/x86/kernel/traps.c          |   13 ++---
 7 files changed, 139 insertions(+), 33 deletions(-)


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-13 15:03 Avi Kivity
@ 2010-06-13 20:45 ` Valdis.Kletnieks
  2010-06-14  7:47   ` Avi Kivity
  2010-06-16  7:24 ` Avi Kivity
  1 sibling, 1 reply; 17+ messages in thread
From: Valdis.Kletnieks @ 2010-06-13 20:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, H. Peter Anvin, kvm, linux-kernel

On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
> Currently fpu management is only lazy in one direction.  When we switch into
> a task, we may avoid loading the fpu state in the hope that the task will
> never use it.  If we guess right we save an fpu load/save cycle; if not,
> a Device not Available exception will remind us to load the fpu.
> 
> However, in the other direction, fpu management is eager.  When we switch out
> of an fpu-using task, we always save its fpu state.

Does anybody have numbers on how many clocks it takes a modern CPU design
to do a FPU state save or restore?  I know it must have been painful in the
days before cache memory, having to make added trips out to RAM for 128-bit
registers.  But what's the impact today? (Yes, I see there's the potential
for a painful IPI call - anything else?)

Do we have any numbers on how many saves/restores this will save us when
running the hypothetical "standard Gnome desktop" environment?  How common
is the "we went all the way around to the original single FPU-using task" case?


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-13 20:45 ` Valdis.Kletnieks
@ 2010-06-14  7:47   ` Avi Kivity
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-14  7:47 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Ingo Molnar, H. Peter Anvin, kvm, linux-kernel

On 06/13/2010 11:45 PM, Valdis.Kletnieks@vt.edu wrote:
> On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
>    
>> Currently fpu management is only lazy in one direction.  When we switch into
>> a task, we may avoid loading the fpu state in the hope that the task will
>> never use it.  If we guess right we save an fpu load/save cycle; if not,
>> a Device not Available exception will remind us to load the fpu.
>>
>> However, in the other direction, fpu management is eager.  When we switch out
>> of an fpu-using task, we always save its fpu state.
>>      
> Does anybody have numbers on how many clocks it takes a modern CPU design
> to do a FPU state save or restore?

320 cycles for a back-to-back round trip.  Presumably less on more 
modern hardware, more if uncached, more on even more modern hardware 
that has the xsave header (8 bytes) and ymm state (256 bytes) in addition.

> I know it must have been painful in the
> days before cache memory, having to make added trips out to RAM for 128-bit
> registers.  But what's the impact today?

I'd estimate between 300 and 600 cycles depending on the factors above.

> (Yes, I see there's the potential
> for a painful IPI call - anything else?)
>    

The IPI is only taken after a task migration, hopefully a rare event.  
The patchset also adds the overhead of irq save/restore.  I think I can 
remove that at the cost of some complexity, but prefer to start with a 
simple approach.

> Do we have any numbers on how many saves/restores this will save us when
> running the hypothetical "standard Gnome desktop" environment?

The potential is in the number of context switches per second.  On a 
desktop environment, I don't see much potential for a throughput 
improvement, rather latency reduction from making the crypto threads 
preemptible and reducing context switch times.

Servers with high context switch rates, esp. with real-time preemptible 
kernels (due to threaded interrupts), will see throughput gains.  And, 
of course, kvm will benefit from not needing to switch the fpu when 
going from guest to host userspace or to a host kernel thread (vhost-net).
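
To put the 300 to 600 cycle estimate in perspective, the back-of-the-envelope
Python sketch below converts it into a fraction of cpu time. The function name,
the context switch rate, and the clock frequency are assumptions chosen for
illustration, not measurements from this thread.

```python
def fpu_overhead_fraction(switches_per_sec: float,
                          cycles_per_switch: float,
                          cpu_hz: float) -> float:
    """Upper bound on the fraction of cpu time spent on fpu save/restore.

    Assumes every context switch pays the full save/restore cost, which
    overstates the cost when many tasks never touch the fpu.
    """
    return switches_per_sec * cycles_per_switch / cpu_hz


# Hypothetical server: 100,000 context switches per second on a 3 GHz core.
low = fpu_overhead_fraction(100_000, 300, 3e9)
high = fpu_overhead_fraction(100_000, 600, 3e9)
```

Under these assumed numbers the bound works out to between 1% and 2% of one
core, which is the most the optimization could recover in that scenario.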

> How common
> is the "we went all the way around to the original single FPU-using task" case?
>    

When your context switch is due to an oversubscribed cpu, not very 
common.  When it is due to the need to service an event and go back to 
sleep, very common.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-13 15:03 Avi Kivity
  2010-06-13 20:45 ` Valdis.Kletnieks
@ 2010-06-16  7:24 ` Avi Kivity
  2010-06-16  7:32   ` H. Peter Anvin
  1 sibling, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2010-06-16  7:24 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin; +Cc: kvm, linux-kernel

On 06/13/2010 06:03 PM, Avi Kivity wrote:
> Currently fpu management is only lazy in one direction.  When we switch into
> a task, we may avoid loading the fpu state in the hope that the task will
> never use it.  If we guess right we save an fpu load/save cycle; if not,
> a Device not Available exception will remind us to load the fpu.
>
> However, in the other direction, fpu management is eager.  When we switch out
> of an fpu-using task, we always save its fpu state.
>
> This is wasteful if none of the tasks that run until we switch back in use
> the fpu, since we could have kept the task's fpu on the cpu all this time
> and saved an fpu save/load cycle.  This can be quite common with threaded
> interrupts, but will also happen with normal kernel threads and even normal
> user tasks.
>
> This patch series converts task fpu management to be fully lazy.  When
> switching out of a task, we keep its fpu state on the cpu, only flushing it
> if some other task needs the fpu.
>    

Ingo, Peter, any feedback on this?

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  7:24 ` Avi Kivity
@ 2010-06-16  7:32   ` H. Peter Anvin
  2010-06-16  8:02     ` Avi Kivity
  0 siblings, 1 reply; 17+ messages in thread
From: H. Peter Anvin @ 2010-06-16  7:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, kvm, linux-kernel

On 06/16/2010 12:24 AM, Avi Kivity wrote:
> 
> Ingo, Peter, any feedback on this?
> 

Conceptually, this makes sense to me.  However, I have a concern about what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  7:32   ` H. Peter Anvin
@ 2010-06-16  8:02     ` Avi Kivity
  2010-06-16  8:39         ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2010-06-16  8:02 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ingo Molnar, kvm, linux-kernel

On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> On 06/16/2010 12:24 AM, Avi Kivity wrote:
>    
>> Ingo, Peter, any feedback on this?
>>      
> Conceptually, this makes sense to me.  However, I have a concern about what
> happens when a task is scheduled on another CPU, while its FPU state is
> still in registers in the original CPU.  That would seem to require
> expensive IPIs to spill the state in order for the rescheduling to
> proceed, and this could really damage performance.
>    

Right, this optimization isn't free.

I think the tradeoff is favourable since task migrations are much less 
frequent than context switches within the same cpu, can the scheduler 
experts comment?

We can also mitigate some of the IPIs if we know that we're migrating on 
the cpu we're migrating from (i.e. we're pushing tasks to another cpu, 
not pulling them from their cpu).  Is that a common case, and if so, 
where can I hook a call to unlazy_fpu() (or its new equivalent)?

Note that kvm on intel has exactly the same issue (the VMPTR and VMCS 
are on-chip registers that are expensive to load and save, so we keep 
them loaded even while not scheduled, and IPI if we notice we've 
migrated; note that architecturally the cpu can cache multiple VMCSs 
simultaneously (though I doubt they cache multiple VMCSs 
microarchitecturally at this point)).

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  8:02     ` Avi Kivity
@ 2010-06-16  8:39         ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2010-06-16  8:39 UTC (permalink / raw)
  To: Avi Kivity, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner,
	Suresh Siddha, Linus Torvalds, Frédéric Weisbecker,
	Andrew Morton, Nick Piggin, Eric Dumazet, Mike Galbraith
  Cc: H. Peter Anvin, kvm, linux-kernel

(Cc:-ed various performance/optimization folks)

* Avi Kivity <avi@redhat.com> wrote:

> On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> >On 06/16/2010 12:24 AM, Avi Kivity wrote:
> >>Ingo, Peter, any feedback on this?
> > Conceptually, this makes sense to me.  However, I have a concern about what
> > happens when a task is scheduled on another CPU, while its FPU state is
> > still in registers in the original CPU.  That would seem to require
> > expensive IPIs to spill the state in order for the rescheduling to
> > proceed, and this could really damage performance.
> 
> Right, this optimization isn't free.
> 
> I think the tradeoff is favourable since task migrations are much
> less frequent than context switches within the same cpu, can the
> scheduler experts comment?

This cannot be stated categorically without precise measurements of 
known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
various lmbench components, X benchmarks, tiobench - you name it. Combined 
with the fact that most micro-benchmarks won't be using the FPU, while in the
long run most processes will be using the FPU due to SIMD instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and I haven't seen a _single_ performance measurement in the
submission mail. This is really essential.

So this does not look like a patch-set we could apply without gathering a 
_ton_ of hard data about advantages and disadvantages.

> We can also mitigate some of the IPIs if we know that we're migrating on the 
> cpu we're migrating from (i.e. we're pushing tasks to another cpu, not 
> pulling them from their cpu).  Is that a common case, and if so, where can I 
> hook a call to unlazy_fpu() (or its new equivalent)?

When the system goes from idle to less idle then most of the 'fast' migrations 
happen on a 'push' model - on a busy CPU we wake up a new task and push it out 
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably 
little cost.

But on busy servers where most wakeups are IRQ based the chance of being on 
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of 
CPUs.
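
The 1/nr_cpus figure above can be tabulated with a short Python sketch. It
assumes, hypothetically, that a wakeup lands on each CPU with equal
probability and ignores any scheduler affinity; the function name is made up
for the example.

```python
def wrong_cpu_probability(nr_cpus: int) -> float:
    """Chance that a uniformly placed wakeup misses the CPU holding the FPU state."""
    if nr_cpus < 1:
        raise ValueError("nr_cpus must be at least 1")
    return 1.0 - 1.0 / nr_cpus


# Miss probability for a few machine sizes under the uniform assumption.
table = {n: wrong_cpu_probability(n) for n in (2, 4, 8, 64)}
```

Under that assumption the miss rate climbs from 50% on two CPUs to over 98%
on 64, which is the trend being described.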

If there's some sucky corner case in theory we could approach it statistically 
and measure the ratio of fast vs. slow migration vs. local context switches - 
but that looks a bit complex.

Dunno.

	Ingo

* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  8:39         ` Ingo Molnar
  (?)
@ 2010-06-16  9:01         ` Samuel Thibault
  2010-06-16  9:43           ` Avi Kivity
  2010-06-16  9:43           ` Avi Kivity
  -1 siblings, 2 replies; 17+ messages in thread
From: Samuel Thibault @ 2010-06-16  9:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner,
	Suresh Siddha, Linus Torvalds, Frédéric Weisbecker,
	Andrew Morton, Nick Piggin, Eric Dumazet, Mike Galbraith,
	H. Peter Anvin, kvm, linux-kernel

Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit :
> in the long run most processes will be using the FPU due to SIMD
> instructions.

I believe glibc already uses SIMD instructions for e.g. memcpy and
friends, i.e. basically all applications...

Samuel


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  9:01         ` Samuel Thibault
@ 2010-06-16  9:43           ` Avi Kivity
  2010-06-16  9:43           ` Avi Kivity
  1 sibling, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-16  9:43 UTC (permalink / raw)
  To: Samuel Thibault, Ingo Molnar, Peter Zijlstra, Arjan van de Ven,
	Thomas Gleixner, Suresh Siddha, Linus Torvalds,
	Frédéric Weisbecker, Andrew Morton, Nick Piggin, Eric Dumazet,
	Mike Galbraith, H. Peter Anvin, kvm, linux-kernel

On 06/16/2010 12:01 PM, Samuel Thibault wrote:
> Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit :
>    
>> in the long run most processes will be using the FPU due to SIMD
>> instructions.
>>      
> I believe glibc already uses SIMD instructions for e.g. memcpy and
> friends, i.e. basically all applications...
>    

I think they ought to be using 'rep movs' on newer processors, but yes 
you're right.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  8:39         ` Ingo Molnar
  (?)
  (?)
@ 2010-06-16  9:10         ` Nick Piggin
  2010-06-16  9:30           ` Avi Kivity
  -1 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2010-06-16  9:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner,
	Suresh Siddha, Linus Torvalds, Frédéric Weisbecker,
	Andrew Morton, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm,
	linux-kernel

On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
> 
> (Cc:-ed various performance/optimization folks)
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> > On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> > >On 06/16/2010 12:24 AM, Avi Kivity wrote:
> > >>Ingo, Peter, any feedback on this?
> > > Conceptually, this makes sense to me.  However, I have a concern about what
> > > happens when a task is scheduled on another CPU, while its FPU state is
> > > still in registers in the original CPU.  That would seem to require
> > > expensive IPIs to spill the state in order for the rescheduling to
> > > proceed, and this could really damage performance.
> > 
> > Right, this optimization isn't free.
> > 
> > I think the tradeoff is favourable since task migrations are much
> > less frequent than context switches within the same cpu, can the
> > scheduler experts comment?
> 
> This cannot be stated categorically without precise measurements of 
> known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
> these workloads have different characteristics.
> 
> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
> various lmbench components, X benchmarks, tiobench - you name it. Combined 
> with the fact that most micro-benchmarks won't be using the FPU, while in the
> long run most processes will be using the FPU due to SIMD instructions. So
> even a positive result might be skewed in practice. Has to be measured
> carefully IMO - and I haven't seen a _single_ performance measurement in the
> submission mail. This is really essential.

It can be nice to code an absolute worst-case microbenchmark too.

Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contended resource (IO or mutex or whatever).
I suspect the main issue in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.



* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  9:10         ` Nick Piggin
@ 2010-06-16  9:30           ` Avi Kivity
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-16  9:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner,
	Suresh Siddha, Linus Torvalds, Frédéric Weisbecker,
	Andrew Morton, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm,
	linux-kernel

On 06/16/2010 12:10 PM, Nick Piggin wrote:
>
>> This cannot be stated categorically without precise measurements of
>> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
>> these workloads have different characteristics.
>>
>> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
>> various lmbench components, X benchmarks, tiobench - you name it. Combined
>> with the fact that most micro-benchmarks won't be using the FPU, while in the
>> long run most processes will be using the FPU due to SIMD instructions. So
>> even a positive result might be skewed in practice. Has to be measured
>> carefully IMO - and I haven't seen a _single_ performance measurement in the
>> submission mail. This is really essential.
>>      
> It can be nice to code an absolute worst-case microbenchmark too.
>    

Sure.

> Task migration can actually be very important to the point of being
> almost a fastpath in some workloads where threads are oversubscribed to
> CPUs and blocking on some contended resource (IO or mutex or whatever).
> I suspect the main issue in that case is the actual context switching
> and contention, but it would be nice to see just how much slower it
> could get.
>    

If it's just cpu oversubscription then the IPIs will be limited by the 
rebalance rate and the time slice, so as you say it has to involve 
contention and frequent wakeups as well as heavy cpu usage.  That won't 
be easy to code.  Can you suggest an existing benchmark to run?

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/4] Really lazy fpu
  2010-06-16  8:39         ` Ingo Molnar
                           ` (2 preceding siblings ...)
  (?)
@ 2010-06-16  9:28         ` Avi Kivity
  -1 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-16  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha,
	Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Nick Piggin,
	Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm, linux-kernel

On 06/16/2010 11:39 AM, Ingo Molnar wrote:
> (Cc:-ed various performance/optimization folks)
>
> * Avi Kivity<avi@redhat.com>  wrote:
>
>    
>> On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
>>      
>>> On 06/16/2010 12:24 AM, Avi Kivity wrote:
>>>        
>>>> Ingo, Peter, any feedback on this?
>>>>          
>>> Conceptually, this makes sense to me.  However, I have a concern about what
>>> happens when a task is scheduled on another CPU, while its FPU state is
>>> still in registers in the original CPU.  That would seem to require
>>> expensive IPIs to spill the state in order for the rescheduling to
>>> proceed, and this could really damage performance.
>>>        
>> Right, this optimization isn't free.
>>
>> I think the tradeoff is favourable since task migrations are much
>> less frequent than context switches within the same cpu, can the
>> scheduler experts comment?
>>      
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
> these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
> various lmbench components, X benchmarks, tiobench - you name it. Combined
> with the fact that most micro-benchmarks won't be using the FPU, while in the
> long run most processes will be using the FPU due to SIMD instructions. So
> even a positive result might be skewed in practice. Has to be measured
> carefully IMO - and I haven't seen a _single_ performance measurement in the
> submission mail. This is really essential.
>    

I really have no idea what to measure.  Which would you most like to see?

> So this does not look like a patch-set we could apply without gathering a
> _ton_ of hard data about advantages and disadvantages.
>    

I agree (not to mention that I'm not really close to having an applyable 
patchset).

Note some of the advantages will not be in throughput but in latency 
(making kernel_fpu_begin() preemptible, and reducing context switch time 
for event threads).

>> We can also mitigate some of the IPIs if we know that we're migrating on the
>> cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
>> pulling them from their cpu).  Is that a common case, and if so, where can I
>> hook a call to unlazy_fpu() (or its new equivalent)?
>>      
> When the system goes from idle to less idle then most of the 'fast' migrations
> happen on a 'push' model - on a busy CPU we wake up a new task and push it out
> to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
> little cost.
>    

Can you point me to the code which does this?

> But on busy servers where most wakeups are IRQ based the chance of being on
> the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
> CPUs.
>    

But don't we usually avoid pulls due to NUMA and cache considerations?

> If there's some sucky corner case in theory we could approach it statistically
> and measure the ratio of fast vs. slow migration vs. local context switches -
> but that looks a bit complex.
>
>    

I certainly wouldn't want to start with it.

> Dunno.
>    

-- 
error compiling committee.c: too many arguments to function

