* Re: [PATCH 0/4] Really lazy fpu
@ 2010-06-16 11:32 George Spelvin
2010-06-16 11:46 ` Avi Kivity
0 siblings, 1 reply; 17+ messages in thread
From: George Spelvin @ 2010-06-16 11:32 UTC
To: avi, mingo; +Cc: linux, linux-kernel, npiggin
> But on busy servers where most wakeups are IRQ based the chance of being on
> the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
> CPUs.
That doesn't seem right. If the server is busy with FPU-using tasks, then
the FPU state has already been swapped out, and no IPI is necessary.
The worst-case seems to be a lot of non-FPU CPU hogs, and a few FPU-using tasks
that get bounced around the CPUs like pinballs.
It is an explicit scheduler goal to keep tasks on the same CPU across
schedules, so they get to re-use their cache state. The IPI only happens
when that goal is not met, *and* the FPU state has not been forced out
by another FPU-using task.
Not completely trivial to arrange.
(A halfway version of this optimization, which would avoid the need for
an IPI, would be to *save* the FPU state but mark it "clean", so the re-load
can be skipped if we're lucky. If the code supported this as well as the
IPI alternative, you could make a heuristic guess at switch-out time
whether to save immediately or hope the odds of needing the IPI are less than
the fxsave/IPI cost ratio.)
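
[Editorial sketch] The switch-out heuristic described in the parenthetical above can be read as a break-even test: stay lazy only while the estimated probability of needing an IPI is below the fxsave/IPI cost ratio. The following is a minimal illustration under stated assumptions, not kernel code. `struct fpu_costs`, `should_save_eagerly()` and the probability estimate are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost model; these names do not exist in the kernel. */
struct fpu_costs {
    double fxsave_cycles; /* assumed cost of saving FPU state at switch-out */
    double ipi_cycles;    /* assumed cost of a cross-CPU flush via IPI */
};

/*
 * Decide at switch-out time whether to save the FPU state immediately.
 * p_ipi is an estimate of the probability that the task next runs on a
 * different CPU while its state is still live in this CPU's registers.
 * Staying lazy pays off only while p_ipi < fxsave_cycles / ipi_cycles,
 * i.e. while the expected IPI cost is below the cost of one save.
 */
bool should_save_eagerly(const struct fpu_costs *c, double p_ipi)
{
    assert(p_ipi >= 0.0 && p_ipi <= 1.0);
    return p_ipi * c->ipi_cycles >= c->fxsave_cycles;
}
```

With placeholder costs of 150 cycles for the save and 3000 cycles for the IPI (both made up for illustration), the break-even probability is 0.05: a task with a 1% chance of needing the IPI stays lazy, one with a 20% chance is saved right away.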
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 11:32 [PATCH 0/4] Really lazy fpu George Spelvin
@ 2010-06-16 11:46 ` Avi Kivity
2010-06-17 9:38 ` George Spelvin
0 siblings, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 11:46 UTC
To: George Spelvin; +Cc: mingo, linux-kernel, npiggin
On 06/16/2010 02:32 PM, George Spelvin wrote:
> (A halfway version of this optimization, which would avoid the need for
> an IPI, would be to *save* the FPU state but mark it "clean", so the re-load
> can be skipped if we're lucky. If the code supported this as well as the
> IPI alternative, you could make a heuristic guess at switch-out time
> whether to save immediately or hope the odds of needing the IPI are less than
> the fxsave/IPI cost ratio.)
That's an interesting optimization - and we already have something
similar in the form of fpu preload. Shouldn't be too hard to do.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 11:46 ` Avi Kivity
@ 2010-06-17 9:38 ` George Spelvin
0 siblings, 0 replies; 17+ messages in thread
From: George Spelvin @ 2010-06-17 9:38 UTC
To: avi; +Cc: linux-kernel, mingo, npiggin
> That's an interesting optimization - and we already have something
> similar in the form of fpu preload. Shouldn't be too hard to do.
Unfortunately, there's no dirty flag for the preloaded FPU state.
Unless you take an interrupt, which defeats the whole purpose of preload.
AFAIK, I should add; there's a lot of obscure stuff in the x86
system-level architecture. But a bit of searching around the source
didn't show me anything; once we've used the CPU for 5 context switches,
the kernel calls __math_state_restore when loading the new state,
which sets TS_USEDFPU.
(While you're mucking about in there, do you suppose the gas < 2.16
workaround in arch/x86/include/asm/i387.h:fxsave() can be yanked yet?)
* [PATCH 0/4] Really lazy fpu
@ 2010-06-13 15:03 Avi Kivity
2010-06-13 20:45 ` Valdis.Kletnieks
2010-06-16 7:24 ` Avi Kivity
0 siblings, 2 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-13 15:03 UTC
To: Ingo Molnar, H. Peter Anvin; +Cc: kvm, linux-kernel
Currently fpu management is only lazy in one direction. When we switch into
a task, we may avoid loading the fpu state in the hope that the task will
never use it. If we guess right we save an fpu load/save cycle; if not,
a Device not Available exception will remind us to load the fpu.
However, in the other direction, fpu management is eager. When we switch out
of an fpu-using task, we always save its fpu state.
This is wasteful if the task(s) that run until we switch back in all don't use
the fpu, since we could have kept the task's fpu on the cpu all this time
and saved an fpu save/load cycle. This can be quite common with threaded
interrupts, but will also happen with normal kernel threads and even normal
user tasks.
This patch series converts task fpu management to be fully lazy. When
switching out of a task, we keep its fpu state on the cpu, only flushing it
if some other task needs the fpu.
Open issues/TODO:
- patch 2 enables interrupts during #NM. There's a comment that says
  it shouldn't be done, presumably because of old-style #FERR handling.
  Need to fix one way or the other (dropping #FERR support, eagerly saving
  state when #FERR is detected, or dropping the entire optimization on i386)
- flush fpu state on cpu offlining (trivial)
- make sure the AMD FXSAVE workaround still works correctly
- reduce IPIs by flushing fpu state when we know a task is being migrated
  (guidance from scheduler folk appreciated)
- preemptible kernel_fpu_begin() to improve latency on raid and crypto setups
  (will post patches)
- lazy host-side kvm fpu management (will post patches)
- accelerate signal delivery by allocating signal handlers their own fpu
  state, and letting them run with the normal task's fpu until they use
  an fp instruction (will generously leave to interested parties)
Avi Kivity (4):
  x86, fpu: merge __save_init_fpu() implementations
  x86, fpu: run device not available trap with interrupts enabled
  x86, fpu: Let the fpu remember which cpu it is active on
  x86, fpu: don't save fpu state when switching from a task
 arch/x86/include/asm/i387.h      | 126 +++++++++++++++++++++++++++++++++-----
 arch/x86/include/asm/processor.h | 4 +
 arch/x86/kernel/i387.c           | 3 +
 arch/x86/kernel/process.c        | 1 +
 arch/x86/kernel/process_32.c     | 12 +++-
 arch/x86/kernel/process_64.c     | 13 +++--
 arch/x86/kernel/traps.c          | 13 ++---
 7 files changed, 139 insertions(+), 33 deletions(-)
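
[Editorial sketch] The fully lazy scheme in the cover letter can be modelled in user space to make the bookkeeping concrete. This is a toy model under stated assumptions, not the patch itself: `struct task`, `fpu_owner[]`, `flush_fpu()`, `switch_out()` and `fpu_use()` are hypothetical names, and a remote flush (which the thread says needs an IPI in the kernel) is modelled as a plain function call.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4
#define NO_CPU  (-1)

/* Toy task: tracks where its FPU state lives and counts memory traffic. */
struct task {
    int fpu_cpu; /* CPU whose registers hold this task's state, or NO_CPU */
    int saves;   /* times the state was written back to memory */
    int loads;   /* times the state was loaded into registers */
};

/* Task whose FPU state is currently live in each CPU's registers. */
static struct task *fpu_owner[NR_CPUS];

/* Write the live state on 'cpu' back to memory (an IPI if 'cpu' is remote). */
void flush_fpu(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    struct task *t = fpu_owner[cpu];
    if (t != NULL) {
        t->saves++;
        t->fpu_cpu = NO_CPU;
        fpu_owner[cpu] = NULL;
    }
}

/* Lazy switch-out: deliberately a no-op, the state stays in registers. */
void switch_out(struct task *t, int cpu)
{
    (void)t;
    (void)cpu;
}

/* Task 't' touches the FPU on 'cpu' (models the Device not Available path). */
void fpu_use(struct task *t, int cpu)
{
    if (fpu_owner[cpu] == t)
        return;                /* state still live here: no save, no load */
    if (t->fpu_cpu != NO_CPU)
        flush_fpu(t->fpu_cpu); /* task migrated: flush the old CPU first */
    flush_fpu(cpu);            /* evict whoever owns this CPU's registers */
    t->loads++;
    t->fpu_cpu = cpu;
    fpu_owner[cpu] = t;
}
```

In this model a task that is switched out and back in on the same CPU, with only non-FPU tasks in between, costs no save and no load; the save happens only when another task claims the FPU or the task migrates.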
* Re: [PATCH 0/4] Really lazy fpu
2010-06-13 15:03 Avi Kivity
@ 2010-06-13 20:45 ` Valdis.Kletnieks
2010-06-14 7:47 ` Avi Kivity
2010-06-16 7:24 ` Avi Kivity
1 sibling, 1 reply; 17+ messages in thread
From: Valdis.Kletnieks @ 2010-06-13 20:45 UTC
To: Avi Kivity; +Cc: Ingo Molnar, H. Peter Anvin, kvm, linux-kernel
On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
> Currently fpu management is only lazy in one direction. When we switch into
> a task, we may avoid loading the fpu state in the hope that the task will
> never use it. If we guess right we save an fpu load/save cycle; if not,
> a Device not Available exception will remind us to load the fpu.
>
> However, in the other direction, fpu management is eager. When we switch out
> of an fpu-using task, we always save its fpu state.
Does anybody have numbers on how many clocks it takes a modern CPU design
to do a FPU state save or restore? I know it must have been painful in the
days before cache memory, having to make added trips out to RAM for 128-bit
registers. But what's the impact today? (Yes, I see there's the potential
for a painful IPI call - anything else?)
Do we have any numbers on how many saves/restores this will save us when
running the hypothetical "standard Gnome desktop" environment? How common
is the "we went all the way around to the original single FPU-using task" case?
* Re: [PATCH 0/4] Really lazy fpu
2010-06-13 20:45 ` Valdis.Kletnieks
@ 2010-06-14 7:47 ` Avi Kivity
0 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-14 7:47 UTC
To: Valdis.Kletnieks; +Cc: Ingo Molnar, H. Peter Anvin, kvm, linux-kernel
On 06/13/2010 11:45 PM, Valdis.Kletnieks@vt.edu wrote:
> On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
>> Currently fpu management is only lazy in one direction. When we switch into
>> a task, we may avoid loading the fpu state in the hope that the task will
>> never use it. If we guess right we save an fpu load/save cycle; if not,
>> a Device not Available exception will remind us to load the fpu.
>>
>> However, in the other direction, fpu management is eager. When we switch out
>> of an fpu-using task, we always save its fpu state.
> Does anybody have numbers on how many clocks it takes a modern CPU design
> to do a FPU state save or restore?
320 cycles for a back-to-back round trip. Presumably less on more modern
hardware, more if uncached, more on even more modern hardware that has
the xsave header (8 bytes) and ymm state (256 bytes) in addition.
> I know it must have been painful in the
> days before cache memory, having to make added trips out to RAM for 128-bit
> registers. But what's the impact today?
I'd estimate between 300 and 600 cycles depending on the factors above.
> (Yes, I see there's the potential
> for a painful IPI call - anything else?)
The IPI is only taken after a task migration, hopefully a rare event.
The patchset also adds the overhead of irq save/restore. I think I can
remove that at the cost of some complexity, but prefer to start with a
simple approach.
> Do we have any numbers on how many saves/restores this will save us when
> running the hypothetical "standard Gnome desktop" environment?
The potential is in the number of context switches per second. On a
desktop environment, I don't see much potential for a throughput
improvement, rather latency reduction from making the crypto threads
preemptible and reducing context switch times.
Servers with high context switch rates, esp. with real-time preemptible
kernels (due to threaded interrupts), will see throughput gains. And,
of course, kvm will benefit from not needing to switch the fpu when
going from guest to host userspace or to a host kernel thread (vhost-net).
> How common
> is the "we went all the way around to the original single FPU-using task" case?
When your context switch is due to an oversubscribed cpu, not very
common. When it is due to the need to service an event and go back to
sleep, very common.
--
error compiling committee.c: too many arguments to function
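
[Editorial sketch] The 300 to 600 cycle estimate above can be turned into a rough upper bound on the CPU time at stake. This is back-of-the-envelope arithmetic only: the helper name `fpu_overhead_fraction()`, the context switch rate, and the clock frequency are assumptions chosen for illustration, not figures from the thread.

```c
#include <assert.h>

/*
 * Fraction of one CPU spent on FPU save/restore round trips, assuming
 * every context switch pays the full round-trip cost.
 *   cycles_per_round_trip: e.g. 300..600, the range estimated above
 *   switches_per_sec:      assumed context switch rate on that CPU
 *   cpu_hz:                assumed clock frequency in cycles per second
 */
double fpu_overhead_fraction(double cycles_per_round_trip,
                             double switches_per_sec,
                             double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return cycles_per_round_trip * switches_per_sec / cpu_hz;
}
```

For example, with an assumed 100,000 context switches per second on a 3 GHz core, the formula gives 1% of the core at 300 cycles and 2% at 600 cycles. Real savings would be lower, since only the switches that return to the same FPU-using task avoid the round trip.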
* Re: [PATCH 0/4] Really lazy fpu
2010-06-13 15:03 Avi Kivity
2010-06-13 20:45 ` Valdis.Kletnieks
@ 2010-06-16 7:24 ` Avi Kivity
2010-06-16 7:32 ` H. Peter Anvin
1 sibling, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 7:24 UTC
To: Ingo Molnar, H. Peter Anvin; +Cc: kvm, linux-kernel
On 06/13/2010 06:03 PM, Avi Kivity wrote:
> Currently fpu management is only lazy in one direction. When we switch into
> a task, we may avoid loading the fpu state in the hope that the task will
> never use it. If we guess right we save an fpu load/save cycle; if not,
> a Device not Available exception will remind us to load the fpu.
>
> However, in the other direction, fpu management is eager. When we switch out
> of an fpu-using task, we always save its fpu state.
>
> This is wasteful if the task(s) that run until we switch back in all don't use
> the fpu, since we could have kept the task's fpu on the cpu all this time
> and saved an fpu save/load cycle. This can be quite common with threaded
> interrupts, but will also happen with normal kernel threads and even normal
> user tasks.
>
> This patch series converts task fpu management to be fully lazy. When
> switching out of a task, we keep its fpu state on the cpu, only flushing it
> if some other task needs the fpu.
Ingo, Peter, any feedback on this?
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 7:24 ` Avi Kivity
@ 2010-06-16 7:32 ` H. Peter Anvin
2010-06-16 8:02 ` Avi Kivity
0 siblings, 1 reply; 17+ messages in thread
From: H. Peter Anvin @ 2010-06-16 7:32 UTC
To: Avi Kivity; +Cc: Ingo Molnar, kvm, linux-kernel
On 06/16/2010 12:24 AM, Avi Kivity wrote:
> Ingo, Peter, any feedback on this?
Conceptually, this makes sense to me. However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU. That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 7:32 ` H. Peter Anvin
@ 2010-06-16 8:02 ` Avi Kivity
2010-06-16 8:39 ` Ingo Molnar
0 siblings, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 8:02 UTC
To: H. Peter Anvin; +Cc: Ingo Molnar, kvm, linux-kernel
On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> On 06/16/2010 12:24 AM, Avi Kivity wrote:
>> Ingo, Peter, any feedback on this?
> Conceptually, this makes sense to me. However, I have a concern what
> happens when a task is scheduled on another CPU, while its FPU state is
> still in registers in the original CPU. That would seem to require
> expensive IPIs to spill the state in order for the rescheduling to
> proceed, and this could really damage performance.
Right, this optimization isn't free.
I think the tradeoff is favourable since task migrations are much
less frequent than context switches within the same cpu, can the
scheduler experts comment?
We can also mitigate some of the IPIs if we know that we're migrating on the
cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
pulling them from their cpu). Is that a common case, and if so, where can I
hook a call to unlazy_fpu() (or its new equivalent)?
Note that kvm on intel has exactly the same issue (the VMPTR and VMCS
are on-chip registers that are expensive to load and save, so we keep
them loaded even while not scheduled, and IPI if we notice we've
migrated; note that architecturally the cpu can cache multiple VMCSs
simultaneously (though I doubt they cache multiple VMCSs
microarchitecturally at this point)).
--
error compiling committee.c: too many arguments to function
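
[Editorial sketch] The push-versus-pull distinction in the message above can be expressed as a small classification of what a migration would cost. This is an illustration under stated assumptions only: `enum flush_kind` and `flush_needed()` are hypothetical names and do not correspond to any scheduler hook.

```c
#include <assert.h>

#define NO_CPU (-1)

/* What must happen to a task's live FPU state before it runs on dst_cpu. */
enum flush_kind {
    FLUSH_NONE,  /* state is in memory, or already live on dst_cpu */
    FLUSH_LOCAL, /* "push": the migrating CPU holds the state, no IPI */
    FLUSH_IPI    /* "pull": another CPU holds the state and must be told */
};

/*
 * fpu_cpu:     CPU whose registers hold the task's state, or NO_CPU
 * running_cpu: CPU executing the migration code
 * dst_cpu:     CPU the task is being moved to
 */
enum flush_kind flush_needed(int fpu_cpu, int running_cpu, int dst_cpu)
{
    assert(running_cpu >= 0 && dst_cpu >= 0);
    if (fpu_cpu == NO_CPU || fpu_cpu == dst_cpu)
        return FLUSH_NONE;
    if (fpu_cpu == running_cpu)
        return FLUSH_LOCAL;
    return FLUSH_IPI;
}
```

Only the last case carries the IPI cost raised earlier in the thread, which is why knowing how often migrations are pushes matters for the tradeoff.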
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 8:02 ` Avi Kivity
@ 2010-06-16 8:39 ` Ingo Molnar
0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2010-06-16 8:39 UTC
To: Avi Kivity, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha, Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Nick Piggin, Eric Dumazet, Mike Galbraith
Cc: H. Peter Anvin, kvm, linux-kernel
(Cc:-ed various performance/optimization folks)
* Avi Kivity <avi@redhat.com> wrote:
> On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> >On 06/16/2010 12:24 AM, Avi Kivity wrote:
> >>Ingo, Peter, any feedback on this?
> > Conceptually, this makes sense to me. However, I have a concern what
> > happens when a task is scheduled on another CPU, while its FPU state is
> > still in registers in the original CPU. That would seem to require
> > expensive IPIs to spill the state in order for the rescheduling to
> > proceed, and this could really damage performance.
>
> Right, this optimization isn't free.
>
> I think the tradeoff is favourable since task migrations are much
> less frequent than context switches within the same cpu, can the
> scheduler experts comment?
This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.
I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combined
with the fact that most micro-benchmarks won't be using the FPU, while in the
long run most processes will be using the FPU due to SIMD instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and I haven't seen a _single_ performance measurement in the
submission mail. This is really essential.
So this does not look like a patch-set we could apply without gathering a
_ton_ of hard data about advantages and disadvantages.
> We can also mitigate some of the IPIs if we know that we're migrating on the
> cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
> pulling them from their cpu). Is that a common case, and if so, where can I
> hook a call to unlazy_fpu() (or its new equivalent)?
When the system goes from idle to less idle then most of the 'fast' migrations
happen on a 'push' model - on a busy CPU we wake up a new task and push it out
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
little cost.
But on busy servers where most wakeups are IRQ based the chance of being on
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
CPUs.
If there's some sucky corner case in theory we could approach it statistically
and measure the ratio of fast vs. slow migration vs. local context switches -
but that looks a bit complex.
Dunno.
Ingo
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 8:39 ` Ingo Molnar
(?)
@ 2010-06-16 9:01 ` Samuel Thibault
2010-06-16 9:43 ` Avi Kivity
2010-06-16 9:43 ` Avi Kivity
-1 siblings, 2 replies; 17+ messages in thread
From: Samuel Thibault @ 2010-06-16 9:01 UTC
To: Ingo Molnar
Cc: Avi Kivity, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha, Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Nick Piggin, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm, linux-kernel
Ingo Molnar wrote on Wed 16 Jun 2010 10:39:41 +0200:
> in the long run most processes will be using the FPU due to SIMD
> instructions.
I believe glibc already uses SIMD instructions for e.g. memcpy and
friends, i.e. basically all applications...
Samuel
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 9:01 ` Samuel Thibault
@ 2010-06-16 9:43 ` Avi Kivity
2010-06-16 9:43 ` Avi Kivity
1 sibling, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 9:43 UTC
To: Samuel Thibault, Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha, Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Nick Piggin, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm, linux-kernel
On 06/16/2010 12:01 PM, Samuel Thibault wrote:
> Ingo Molnar wrote on Wed 16 Jun 2010 10:39:41 +0200:
>> in the long run most processes will be using the FPU due to SIMD
>> instructions.
> I believe glibc already uses SIMD instructions for e.g. memcpy and
> friends, i.e. basically all applications...
I think they ought to be using 'rep movs' on newer processors, but yes
you're right.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 8:39 ` Ingo Molnar
(?)
(?)
@ 2010-06-16 9:10 ` Nick Piggin
2010-06-16 9:30 ` Avi Kivity
-1 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2010-06-16 9:10 UTC
To: Ingo Molnar
Cc: Avi Kivity, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha, Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm, linux-kernel
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
> (Cc:-ed various performance/optimization folks)
>
> * Avi Kivity <avi@redhat.com> wrote:
>
> > On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> > >On 06/16/2010 12:24 AM, Avi Kivity wrote:
> > >>Ingo, Peter, any feedback on this?
> > > Conceptually, this makes sense to me. However, I have a concern what
> > > happens when a task is scheduled on another CPU, while its FPU state is
> > > still in registers in the original CPU. That would seem to require
> > > expensive IPIs to spill the state in order for the rescheduling to
> > > proceed, and this could really damage performance.
> >
> > Right, this optimization isn't free.
> >
> > I think the tradeoff is favourable since task migrations are much
> > less frequent than context switches within the same cpu, can the
> > scheduler experts comment?
>
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
> these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
> various lmbench components, X benchmarks, tiobench - you name it. Combined
> with the fact that most micro-benchmarks won't be using the FPU, while in the
> long run most processes will be using the FPU due to SIMD instructions. So
> even a positive result might be skewed in practice. Has to be measured
> carefully IMO - and I haven't seen a _single_ performance measurement in the
> submission mail. This is really essential.
It can be nice to code an absolute worst-case microbenchmark too.
Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contended resource (IO or mutex or whatever).
I suspect the main issue in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 9:10 ` Nick Piggin
@ 2010-06-16 9:30 ` Avi Kivity
0 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 9:30 UTC
To: Nick Piggin
Cc: Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha, Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm, linux-kernel
On 06/16/2010 12:10 PM, Nick Piggin wrote:
>> This cannot be stated categorically without precise measurements of
>> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
>> these workloads have different characteristics.
>>
>> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
>> various lmbench components, X benchmarks, tiobench - you name it. Combined
>> with the fact that most micro-benchmarks won't be using the FPU, while in the
>> long run most processes will be using the FPU due to SIMD instructions. So
>> even a positive result might be skewed in practice. Has to be measured
>> carefully IMO - and I haven't seen a _single_ performance measurement in the
>> submission mail. This is really essential.
> It can be nice to code an absolute worst-case microbenchmark too.
Sure.
> Task migration can actually be very important to the point of being
> almost a fastpath in some workloads where threads are oversubscribed to
> CPUs and blocking on some contended resource (IO or mutex or whatever).
> I suspect the main issue in that case is the actual context switching
> and contention, but it would be nice to see just how much slower it
> could get.
If it's just cpu oversubscription then the IPIs will be limited by the
rebalance rate and the time slice, so as you say it has to involve
contention and frequent wakeups as well as heavy cpu usage. That won't
be easy to code. Can you suggest an existing benchmark to run?
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 0/4] Really lazy fpu
2010-06-16 8:39 ` Ingo Molnar
` (2 preceding siblings ...)
(?)
@ 2010-06-16 9:28 ` Avi Kivity
-1 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2010-06-16 9:28 UTC
To: Ingo Molnar
Cc: Peter Zijlstra, Arjan van de Ven, Thomas Gleixner, Suresh Siddha, Linus Torvalds, Frédéric Weisbecker, Andrew Morton, Nick Piggin, Eric Dumazet, Mike Galbraith, H. Peter Anvin, kvm, linux-kernel
On 06/16/2010 11:39 AM, Ingo Molnar wrote:
> (Cc:-ed various performance/optimization folks)
>
> * Avi Kivity<avi@redhat.com> wrote:
>
>> On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
>>> On 06/16/2010 12:24 AM, Avi Kivity wrote:
>>>> Ingo, Peter, any feedback on this?
>>> Conceptually, this makes sense to me. However, I have a concern what
>>> happens when a task is scheduled on another CPU, while its FPU state is
>>> still in registers in the original CPU. That would seem to require
>>> expensive IPIs to spill the state in order for the rescheduling to
>>> proceed, and this could really damage performance.
>> Right, this optimization isn't free.
>>
>> I think the tradeoff is favourable since task migrations are much
>> less frequent than context switches within the same cpu, can the
>> scheduler experts comment?
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
> these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
> various lmbench components, X benchmarks, tiobench - you name it. Combined
> with the fact that most micro-benchmarks won't be using the FPU, while in the
> long run most processes will be using the FPU due to SIMD instructions. So
> even a positive result might be skewed in practice. Has to be measured
> carefully IMO - and I haven't seen a _single_ performance measurement in the
> submission mail. This is really essential.
I have really no idea what to measure. Which would you most like to see?
> So this does not look like a patch-set we could apply without gathering a
> _ton_ of hard data about advantages and disadvantages.
I agree (not to mention that I'm not really close to having an applyable
patchset).
Note some of the advantages will not be in throughput but in latency
(making kernel_fpu_begin() preemptible, and reducing context switch time
for event threads).
>> We can also mitigate some of the IPIs if we know that we're migrating on the
>> cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
>> pulling them from their cpu). Is that a common case, and if so, where can I
>> hook a call to unlazy_fpu() (or its new equivalent)?
> When the system goes from idle to less idle then most of the 'fast' migrations
> happen on a 'push' model - on a busy CPU we wake up a new task and push it out
> to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
> little cost.
Can you point me to the code which does this?
> But on busy servers where most wakeups are IRQ based the chance of being on
> the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
> CPUs.
But don't we usually avoid pulls due to NUMA and cache considerations?
> If there's some sucky corner case in theory we could approach it statistically
> and measure the ratio of fast vs. slow migration vs. local context switches -
> but that looks a bit complex.
I certainly wouldn't want to start with it.
> Dunno.
--
error compiling committee.c: too many arguments to function