* [RFC] syscall calling convention, stts/clts, and xstate latency @ 2011-07-24 21:07 Andrew Lutomirski 2011-07-24 21:15 ` Ingo Molnar 0 siblings, 1 reply; 15+ messages in thread From: Andrew Lutomirski @ 2011-07-24 21:07 UTC (permalink / raw) To: linux-kernel, x86 I was trying to understand the FPU/xstate saving code, and I ran some benchmarks with surprising results. These are all on Sandy Bridge i7-2600. Please take all numbers with a grain of salt -- they're in tight-ish loops and don't really take into account real-world cache effects. A clts/stts pair takes about 80 ns. Accessing extended state from userspace with TS set takes 239 ns. A kernel_fpu_begin / kernel_fpu_end pair with no userspace xstate access takes 80 ns (presumably 79 of those 80 are the clts/stts). (Note: The numbers in this paragraph were measured using a hacked-up kernel and KVM.) With nonzero ymm state, xsave + clflush (on the first cacheline of xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns, xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns. With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 ns and xsaveopt saves another 5 ns. Zeroing the state completely with vzeroall adds 2 ns. Not sure what's going on. All of this makes me think that, at least on Sandy Bridge, lazy xstate saving is a bad optimization -- if the cache is being nice, save/restore is faster than twiddling the TS bit. And the cost of the trap when TS is set blows everything else away. Which brings me to another question: what do you think about declaring some of the extended state to be clobbered by syscall? Ideally, we'd treat syscall like a regular function and clobber everything except the floating point control word and mxcsr. More conservatively, we'd leave xmm and x87 state but clobber ymm. This would let us keep the cost of the state save and restore down when kernel_fpu_begin is used in a syscall path and when a context switch happens as a result of a syscall. 
glibc does *not* mark the xmm registers as clobbered when it issues syscalls, but I suspect that everything everywhere that issues syscalls does it from a function, and functions are implicitly assumed to clobber extended state. (And if anything out there assumes that ymm state is preserved, I'd be amazed.) --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency 2011-07-24 21:07 [RFC] syscall calling convention, stts/clts, and xstate latency Andrew Lutomirski @ 2011-07-24 21:15 ` Ingo Molnar 2011-07-24 22:34 ` Andrew Lutomirski 2011-07-25 7:42 ` Avi Kivity 0 siblings, 2 replies; 15+ messages in thread From: Ingo Molnar @ 2011-07-24 21:15 UTC (permalink / raw) To: Andrew Lutomirski Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity * Andrew Lutomirski <luto@mit.edu> wrote: > I was trying to understand the FPU/xstate saving code, and I ran > some benchmarks with surprising results. These are all on Sandy > Bridge i7-2600. Please take all numbers with a grain of salt -- > they're in tight-ish loops and don't really take into account > real-world cache effects. > > A clts/stts pair takes about 80 ns. Accessing extended state from > userspace with TS set takes 239 ns. A kernel_fpu_begin / > kernel_fpu_end pair with no userspace xstate access takes 80 ns > (presumably 79 of those 80 are the clts/stts). (Note: The numbers > in this paragraph were measured using a hacked-up kernel and KVM.) > > With nonzero ymm state, xsave + clflush (on the first cacheline of > xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns, > xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns. > > With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 > ns and xsaveopt saves another 5 ns. > > Zeroing the state completely with vzeroall adds 2 ns. Not sure > what's going on. > > All of this makes me think that, at least on Sandy Bridge, lazy > xstate saving is a bad optimization -- if the cache is being nice, > save/restore is faster than twiddling the TS bit. And the cost of > the trap when TS is set blows everything else away. Interesting. Mind cooking up a delazying patch and measure it on native as well? KVM generally makes exceptions more expensive, so the effect of lazy exceptions might be less on native. 
> > Which brings me to another question: what do you think about > declaring some of the extended state to be clobbered by syscall? > Ideally, we'd treat syscall like a regular function and clobber > everything except the floating point control word and mxcsr. More > conservatively, we'd leave xmm and x87 state but clobber ymm. This > would let us keep the cost of the state save and restore down when > kernel_fpu_begin is used in a syscall path and when a context > switch happens as a result of a syscall. > > glibc does *not* mark the xmm registers as clobbered when it issues > syscalls, but I suspect that everything everywhere that issues > syscalls does it from a function, and functions are implicitly > assumed to clobber extended state. (And if anything out there > assumes that ymm state is preserved, I'd be amazed.) To build the kernel with sse optimizations? Would certainly be interesting to try. Thanks, Ingo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency 2011-07-24 21:15 ` Ingo Molnar @ 2011-07-24 22:34 ` Andrew Lutomirski 2011-07-25 3:21 ` Andrew Lutomirski 2011-07-25 6:38 ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar 2011-07-25 7:42 ` Avi Kivity 1 sibling, 2 replies; 15+ messages in thread From: Andrew Lutomirski @ 2011-07-24 22:34 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Lutomirski <luto@mit.edu> wrote: > >> I was trying to understand the FPU/xstate saving code, and I ran >> some benchmarks with surprising results. These are all on Sandy >> Bridge i7-2600. Please take all numbers with a grain of salt -- >> they're in tight-ish loops and don't really take into account >> real-world cache effects. >> >> A clts/stts pair takes about 80 ns. Accessing extended state from >> userspace with TS set takes 239 ns. A kernel_fpu_begin / >> kernel_fpu_end pair with no userspace xstate access takes 80 ns >> (presumably 79 of those 80 are the clts/stts). (Note: The numbers >> in this paragraph were measured using a hacked-up kernel and KVM.) >> >> With nonzero ymm state, xsave + clflush (on the first cacheline of >> xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns, >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns. >> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 >> ns and xsaveopt saves another 5 ns. >> >> Zeroing the state completely with vzeroall adds 2 ns. Not sure >> what's going on. >> >> All of this makes me think that, at least on Sandy Bridge, lazy >> xstate saving is a bad optimization -- if the cache is being nice, >> save/restore is faster than twiddling the TS bit. And the cost of >> the trap when TS is set blows everything else away. > > Interesting. Mind cooking up a delazying patch and measure it on > native as well? 
KVM generally makes exceptions more expensive, so the > effect of lazy exceptions might be less on native. Using the same patch on native, I get:

kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
stts/clts: 73 ns (clearly there's a bit of error here)
userspace xstate with TS set: 229 ns

So virtualization adds only a little bit of overhead. This isn't really a delazying patch -- it's two arch_prctls, one of them is kernel_fpu_begin();kernel_fpu_end(). The other is the same thing in a loop. The other numbers were already native since I measured them entirely in userspace. They look the same after rebooting. > >> >> Which brings me to another question: what do you think about >> declaring some of the extended state to be clobbered by syscall? >> Ideally, we'd treat syscall like a regular function and clobber >> everything except the floating point control word and mxcsr. More >> conservatively, we'd leave xmm and x87 state but clobber ymm. This >> would let us keep the cost of the state save and restore down when >> kernel_fpu_begin is used in a syscall path and when a context >> switch happens as a result of a syscall. >> >> glibc does *not* mark the xmm registers as clobbered when it issues >> syscalls, but I suspect that everything everywhere that issues >> syscalls does it from a function, and functions are implicitly >> assumed to clobber extended state. (And if anything out there >> assumes that ymm state is preserved, I'd be amazed.) > > To build the kernel with sse optimizations? Would certainly be > interesting to try. I had in mind something a little less ambitious: making kernel_fpu_begin very fast, especially when used more than once. Currently it's slow enough to have spawned arch/x86/crypto/fpu.c, which is a hideous piece of infrastructure that exists solely to reduce the number of kernel_fpu_begin/end pairs when using AES-NI.
Clobbering registers in syscall would reduce the cost even more, but it might require having a way to detect whether the most recent kernel entry was via syscall or some other means. Making the whole kernel safe for xstate use would be technically possible, but it would add about three cycles to syscalls (for vzeroall -- non-AVX machines would take a larger hit) and apparently about 57 ns to non-syscall traps. That seems worse than the lazier approach. --Andy
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency 2011-07-24 22:34 ` Andrew Lutomirski @ 2011-07-25 3:21 ` Andrew Lutomirski 2011-07-25 6:42 ` Ingo Molnar 2011-07-25 10:05 ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski 2011-07-25 6:38 ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar 1 sibling, 2 replies; 15+ messages in thread From: Andrew Lutomirski @ 2011-07-25 3:21 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity On Sun, Jul 24, 2011 at 6:34 PM, Andrew Lutomirski <luto@mit.edu> wrote: > > I had in mind something a little less ambitious: making > kernel_fpu_begin very fast, especially when used more than once. > Currently it's slow enough to have spawned arch/x86/crypto/fpu.c, > which is a hideous piece of infrastructure that exists solely to > reduce the number of kernel_fpu_begin/end pairs when using AES-NI. > Clobbering registers in syscall would reduce the cost even more, but > it might require having a way to detect whether the most recent kernel > entry was via syscall or some other means. I think it will be very hard to inadvertently cause a regression, because the current code looks pretty bad. 1. Once a task uses xstate for five timeslices, the kernel decides that it will continue using it. The only thing that clears that condition is __unlazy_fpu called with TS_USEDFPU set. The only way I can see for that to happen is if kernel_fpu_begin is called twice in a row between context switches, and that has little do with the task's xstate usage. 2. __switch_to, when switching to a task with fpu_counter > 5, will do stts(); clts(). The combination means that when switching between two xstate-using tasks (or even tasks that were once xstate-using), we pay the full price of a state save/restore *and* stts/clts. --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency 2011-07-25 3:21 ` Andrew Lutomirski @ 2011-07-25 6:42 ` Ingo Molnar 2011-07-25 10:05 ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski 1 sibling, 0 replies; 15+ messages in thread From: Ingo Molnar @ 2011-07-25 6:42 UTC (permalink / raw) To: Andrew Lutomirski Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity * Andrew Lutomirski <luto@mit.edu> wrote: > On Sun, Jul 24, 2011 at 6:34 PM, Andrew Lutomirski <luto@mit.edu> wrote: > > > > I had in mind something a little less ambitious: making > > kernel_fpu_begin very fast, especially when used more than once. > > Currently it's slow enough to have spawned arch/x86/crypto/fpu.c, > > which is a hideous piece of infrastructure that exists solely to > > reduce the number of kernel_fpu_begin/end pairs when using > > AES-NI. Clobbering registers in syscall would reduce the cost > > even more, but it might require having a way to detect whether > > the most recent kernel entry was via syscall or some other means. > > I think it will be very hard to inadvertently cause a regression, > because the current code looks pretty bad. [ heh, one of the rare cases where bad code works in our favor ;-) ] > 1. Once a task uses xstate for five timeslices, the kernel decides > that it will continue using it. The only thing that clears that > condition is __unlazy_fpu called with TS_USEDFPU set. The only way > I can see for that to happen is if kernel_fpu_begin is called twice > in a row between context switches, and that has little do with the > task's xstate usage. > > 2. __switch_to, when switching to a task with fpu_counter > 5, will > do stts(); clts(). > > The combination means that when switching between two xstate-using > tasks (or even tasks that were once xstate-using), we pay the full > price of a state save/restore *and* stts/clts. I'm all for simplifying this for modern x86 CPUs. 
The lazy FPU switching logic was kind of neat on UP but started showing its limitations with SMP already - and that was 10 years ago. So if the numbers prove you right then go for it. It's an added bonus that this could enable the kernel to be built using vector instructions - you may or may not want to shoot for the glory of achieving that feat first ;-) Thanks, Ingo
* [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to 2011-07-25 3:21 ` Andrew Lutomirski 2011-07-25 6:42 ` Ingo Molnar @ 2011-07-25 10:05 ` Andy Lutomirski 2011-07-25 11:12 ` Ingo Molnar 1 sibling, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2011-07-25 10:05 UTC (permalink / raw) To: Ingo Molnar Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds, Arjan van de Ven, Avi Kivity An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and when other things are going on it's apparently even worse. This saves 10% on context switches between threads that both use extended state. Signed-off-by: Andy Lutomirski <luto@mit.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arjan van de Ven <arjan@infradead.org>, Cc: Avi Kivity <avi@redhat.com> --- This is not as well tested as it should be (especially on 32-bit, where I haven't actually tried compiling it), but I think this might be 3.1 material so I want to get it out for review before it's even more unjustifiably late :) Argument for inclusion in 3.1 (after a bit more testing): - It's dead simple. - It's a 10% speedup on context switching under the right conditions [1] - It's unlikely to slow any workload down, since it doesn't add any work anywhere. Argument against: - It's late.
[1] https://gitorious.org/linux-test-utils/linux-clock-tests/blobs/master/context_switch_latency.c

 arch/x86/include/asm/i387.h  | 10 ++++++++++
 arch/x86/kernel/process_32.c | 10 ++++------
 arch/x86/kernel/process_64.c |  7 +++----
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..9d2d08b 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -295,6 +295,16 @@ static inline void __unlazy_fpu(struct task_struct *tsk)
 		tsk->fpu_counter = 0;
 }
 
+static inline void __unlazy_fpu_clts(struct task_struct *tsk)
+{
+	if (task_thread_info(tsk)->status & TS_USEDFPU) {
+		__save_init_fpu(tsk);
+	} else {
+		tsk->fpu_counter = 0;
+		clts();
+	}
+}
+
 static inline void __clear_fpu(struct task_struct *tsk)
 {
 	if (task_thread_info(tsk)->status & TS_USEDFPU) {
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a3d0dc5..c707741 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -304,7 +304,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
 
-	__unlazy_fpu(prev_p);
+	if (preload_fpu)
+		__unlazy_fpu_clts(prev_p);
+	else
+		__unlazy_fpu(prev_p);
 
 	/* we're going to use this soon, after a few expensive things */
 	if (preload_fpu)
@@ -348,11 +351,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
 		__switch_to_xtra(prev_p, next_p, tss);
 
-	/* If we're going to preload the fpu context, make sure clts
-	   is run while we're batching the cpu state updates. */
-	if (preload_fpu)
-		clts();
-
 	/*
 	 * Leave lazy mode, flushing any hypercalls made here.
 	 * This must be done before restoring TLS segments so
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b1f3f53..272bddd 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -419,11 +419,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	load_TLS(next, cpu);
 
 	/* Must be after DS reload */
-	__unlazy_fpu(prev_p);
-
-	/* Make sure cpu is ready for new context */
 	if (preload_fpu)
-		clts();
+		__unlazy_fpu_clts(prev_p);
+	else
+		__unlazy_fpu(prev_p);
 
 	/*
 	 * Leave lazy mode, flushing any hypercalls made here.
-- 
1.7.6
* Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to 2011-07-25 10:05 ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski @ 2011-07-25 11:12 ` Ingo Molnar 2011-07-25 13:04 ` Andrew Lutomirski 0 siblings, 1 reply; 15+ messages in thread From: Ingo Molnar @ 2011-07-25 11:12 UTC (permalink / raw) To: Andy Lutomirski Cc: x86, linux-kernel, Linus Torvalds, Arjan van de Ven, Avi Kivity * Andy Lutomirski <luto@MIT.EDU> wrote: > An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and > when other things are going on it's apparently even worse. This > saves 10% on context switches between threads that both use extended > state. > > Signed-off-by: Andy Lutomirski <luto@mit.edu> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Arjan van de Ven <arjan@infradead.org>, > Cc: Avi Kivity <avi@redhat.com> > --- > > This is not as well tested as it should be (especially on 32-bit, where > I haven't actually tried compiling it), but I think this might be 3.1 > material so I want to get it out for review before it's even more > unjustifiably late :) > > Argument for inclusion in 3.1 (after a bit more testing): > - It's dead simple. > - It's a 10% speedup on context switching under the right conditions [1] > - It's unlikely to slow any workload down, since it doesn't add any work > anywwhere. > > Argument against: > - It's late. I think it's late. Would be much better to stick it into the x86/xsave tree i pointed to and treat and debug it as a coherent unit. FPU bugs need a lot of time to surface so we definitely do not want to fast-track it. In fact if we want it in v3.2 we should start assembling the tree right now. Also, if you are tempted by the prospect of possibly enabling vector instructions for the x86 kernel, we could try that too, and get multiple speedups for the price of having to debug the tree only once ;-) Thanks, Ingo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to 2011-07-25 11:12 ` Ingo Molnar @ 2011-07-25 13:04 ` Andrew Lutomirski 2011-07-25 14:13 ` Ingo Molnar 0 siblings, 1 reply; 15+ messages in thread From: Andrew Lutomirski @ 2011-07-25 13:04 UTC (permalink / raw) To: Ingo Molnar Cc: x86, linux-kernel, Linus Torvalds, Arjan van de Ven, Avi Kivity On Mon, Jul 25, 2011 at 7:12 AM, Ingo Molnar <mingo@elte.hu> wrote: > > * Andy Lutomirski <luto@MIT.EDU> wrote: > >> An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and >> when other things are going on it's apparently even worse. This >> saves 10% on context switches between threads that both use extended >> state. >> >> Signed-off-by: Andy Lutomirski <luto@mit.edu> >> Cc: Linus Torvalds <torvalds@linux-foundation.org> >> Cc: Arjan van de Ven <arjan@infradead.org>, >> Cc: Avi Kivity <avi@redhat.com> >> --- >> >> This is not as well tested as it should be (especially on 32-bit, where >> I haven't actually tried compiling it), but I think this might be 3.1 >> material so I want to get it out for review before it's even more >> unjustifiably late :) >> >> Argument for inclusion in 3.1 (after a bit more testing): >> - It's dead simple. >> - It's a 10% speedup on context switching under the right conditions [1] >> - It's unlikely to slow any workload down, since it doesn't add any work >> anywwhere. >> >> Argument against: >> - It's late. > > I think it's late. > > Would be much better to stick it into the x86/xsave tree i pointed to > and treat and debug it as a coherent unit. FPU bugs need a lot of > time to surface so we definitely do not want to fast-track it. In > fact if we want it in v3.2 we should start assembling the tree right > now. Fair enough. I make no guarantee that I'll have anything ready in less than a few weeks. I'm defending my thesis in a week, and kernel hacking is entirely a distraction. :) (The only thing my thesis has to do with operating systems is that I mention recvmmsg.) 
> > Also, if you are tempted by the prospect of possibly enabling vector > instructions for the x86 kernel, we could try that too, and get > multiple speedups for the price of having to debug the tree only once > ;-) I'll play with it. I have some other cleanup / speedup ideas, too, and I'll see where they go. Given that the kernel doesn't really use floating-point math, I'm not sure that gcc will do much unless we turn on -ftree-vectorize, and that's a little scary. --Andy
* Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to 2011-07-25 13:04 ` Andrew Lutomirski @ 2011-07-25 14:13 ` Ingo Molnar 0 siblings, 0 replies; 15+ messages in thread From: Ingo Molnar @ 2011-07-25 14:13 UTC (permalink / raw) To: Andrew Lutomirski Cc: x86, linux-kernel, Linus Torvalds, Arjan van de Ven, Avi Kivity * Andrew Lutomirski <luto@mit.edu> wrote: > > Also, if you are tempted by the prospect of possibly enabling > > vector instructions for the x86 kernel, we could try that too, > > and get multiple speedups for the price of having to debug the > > tree only once ;-) > > I'll play with it. I have some other cleanup / speedup ideas, too, > and I'll see where they go. Given that the kernel doesn't really > use floating-point math, I'm not sure that gcc will do much unless > we turn on -ftree-vectorize, and that's a little scary. It's indeed scary - but as long as it boots it would allow some baseline figures to be estimated - is there any win, and if yes, how much. It might be a complete dud in the end. Thanks, Ingo
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency 2011-07-24 22:34 ` Andrew Lutomirski 2011-07-25 3:21 ` Andrew Lutomirski @ 2011-07-25 6:38 ` Ingo Molnar 2011-07-25 9:44 ` Andrew Lutomirski 1 sibling, 1 reply; 15+ messages in thread From: Ingo Molnar @ 2011-07-25 6:38 UTC (permalink / raw) To: Andrew Lutomirski Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity * Andrew Lutomirski <luto@mit.edu> wrote: > On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Andrew Lutomirski <luto@mit.edu> wrote: > > > >> I was trying to understand the FPU/xstate saving code, and I ran > >> some benchmarks with surprising results. These are all on Sandy > >> Bridge i7-2600. Please take all numbers with a grain of salt -- > >> they're in tight-ish loops and don't really take into account > >> real-world cache effects. > >> > >> A clts/stts pair takes about 80 ns. Accessing extended state from > >> userspace with TS set takes 239 ns. A kernel_fpu_begin / > >> kernel_fpu_end pair with no userspace xstate access takes 80 ns > >> (presumably 79 of those 80 are the clts/stts). (Note: The numbers > >> in this paragraph were measured using a hacked-up kernel and KVM.) > >> > >> With nonzero ymm state, xsave + clflush (on the first cacheline of > >> xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns, > >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns. > >> > >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 > >> ns and xsaveopt saves another 5 ns. > >> > >> Zeroing the state completely with vzeroall adds 2 ns. Not sure > >> what's going on. > >> > >> All of this makes me think that, at least on Sandy Bridge, lazy > >> xstate saving is a bad optimization -- if the cache is being nice, > >> save/restore is faster than twiddling the TS bit. And the cost of > >> the trap when TS is set blows everything else away. > > > > Interesting. 
Mind cooking up a delazying patch and measure it on > > native as well? KVM generally makes exceptions more expensive, so the > > effect of lazy exceptions might be less on native. > > Using the same patch on native, I get: > > kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns > stts/clts: 73 ns (clearly there's a bit of error here) userspace > xstate with TS set: 229 ns > > So virtualization adds only a little bit of overhead. KVM rocks. > This isn't really a delazying patch -- it's two arch_prctls, one of > them is kernel_fpu_begin();kernel_fpu_end(). The other is the same > thing in a loop. > > The other numbers were already native since I measured them > entirely in userspace. They look the same after rebooting. I should have mentioned it earlier, but there's a certain amount of delazying patches in the tip:x86/xsave branch: $ gll linus..x86/xsave 300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave 66beba27e8b5: x86, xsave: remove lazy allocation of xstate area 1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP) 4182a4d68bac: x86, xsave: add support for non-lazy xstates 324cbb83e215: x86, xsave: more cleanups 2efd67935eb7: x86, xsave: remove unused code 0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup 7f4f0a56a7d3: x86, xsave: rework fpu/xsave support 26bce4e4c56f: x86, xsave: cleanup fpu/xsave support it's not in tip:master because the LWP bits need (much) more work to be palatable - but we could spin them off and complete them as per your suggestions if they are an independent speedup on modern CPUs. > >> Which brings me to another question: what do you think about > >> declaring some of the extended state to be clobbered by syscall? > >> Ideally, we'd treat syscall like a regular function and clobber > >> everything except the floating point control word and mxcsr. 
More > >> conservatively, we'd leave xmm and x87 state but clobber ymm. This > >> would let us keep the cost of the state save and restore down when > >> kernel_fpu_begin is used in a syscall path and when a context > >> switch happens as a result of a syscall. > >> > >> glibc does *not* mark the xmm registers as clobbered when it issues > >> syscalls, but I suspect that everything everywhere that issues > >> syscalls does it from a function, and functions are implicitly > >> assumed to clobber extended state. (And if anything out there > >> assumes that ymm state is preserved, I'd be amazed.) > > > > To build the kernel with sse optimizations? Would certainly be > > interesting to try. > > I had in mind something a little less ambitious: making > kernel_fpu_begin very fast, especially when used more than once. > Currently it's slow enough to have spawned arch/x86/crypto/fpu.c, > which is a hideous piece of infrastructure that exists solely to > reduce the number of kernel_fpu_begin/end pairs when using AES-NI. > Clobbering registers in syscall would reduce the cost even more, > but it might require having a way to detect whether the most recent > kernel entry was via syscall or some other means. > > Making the whole kernel safe for xstate use would be technically > possible, but it would add about three cycles to syscalls (for > vzeroall -- non-AVX machines would take a larger hit) and > apparently about 57 ns to non-syscall traps. That seems worse than > the lazier approach. 3 cycles per syscall is acceptable, if the average optimization savings per syscall are better than 3 cycles - which is not impossible at all: using more registers generally moves the pressure away from GP registers and allows the compiler to be smarter. (older CPUs with higher switching costs wouldnt want to run such kernels, obviously.) So it's very much worth trying, if only to get some hard numbers. 
That would also turn the somewhat awkward way of how we use vector operations in the crypto code into something more natural. In theory you could write a crypto algorithm in C and the compiler would use vector instructions and get a pretty good end result. (one can always hope, right?) But more importantly, doing that would push vector operations *way* beyond the somewhat niche area of crypto/RAID optimizations. User-space already saves/restores the vector registers so they have already done much of the register switching cost - the kernel just has to take advantage of that. Thanks, Ingo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25 6:38 ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar
@ 2011-07-25 9:44 ` Andrew Lutomirski
  2011-07-25 9:51 ` Ingo Molnar
  2011-07-25 11:04 ` Hans Rosenfeld
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-25 9:44 UTC (permalink / raw)
To: Ingo Molnar, Hans Rosenfeld
Cc: linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity

On Mon, Jul 25, 2011 at 2:38 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Lutomirski <luto@mit.edu> wrote:
>
>> On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Andrew Lutomirski <luto@mit.edu> wrote:
>> >
>> >> I was trying to understand the FPU/xstate saving code, and I ran
>> >> some benchmarks with surprising results. These are all on Sandy
>> >> Bridge i7-2600. Please take all numbers with a grain of salt --
>> >> they're in tight-ish loops and don't really take into account
>> >> real-world cache effects.
>> >>
>> >> A clts/stts pair takes about 80 ns. Accessing extended state from
>> >> userspace with TS set takes 239 ns. A kernel_fpu_begin /
>> >> kernel_fpu_end pair with no userspace xstate access takes 80 ns
>> >> (presumably 79 of those 80 are the clts/stts). (Note: The numbers
>> >> in this paragraph were measured using a hacked-up kernel and KVM.)
>> >>
>> >> With nonzero ymm state, xsave + clflush (on the first cacheline of
>> >> xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns,
>> >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
>> >>
>> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
>> >> ns and xsaveopt saves another 5 ns.
>> >>
>> >> Zeroing the state completely with vzeroall adds 2 ns. Not sure
>> >> what's going on.
>> >>
>> >> All of this makes me think that, at least on Sandy Bridge, lazy
>> >> xstate saving is a bad optimization -- if the cache is being nice,
>> >> save/restore is faster than twiddling the TS bit. And the cost of
>> >> the trap when TS is set blows everything else away.
>> >
>> > Interesting. Mind cooking up a delazying patch and measure it on
>> > native as well? KVM generally makes exceptions more expensive, so the
>> > effect of lazy exceptions might be less on native.
>>
>> Using the same patch on native, I get:
>>
>> kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
>> stts/clts: 73 ns (clearly there's a bit of error here) userspace
>> xstate with TS set: 229 ns
>>
>> So virtualization adds only a little bit of overhead.
>
> KVM rocks.
>
>> This isn't really a delazying patch -- it's two arch_prctls, one of
>> them is kernel_fpu_begin();kernel_fpu_end(). The other is the same
>> thing in a loop.
>>
>> The other numbers were already native since I measured them
>> entirely in userspace. They look the same after rebooting.
>
> I should have mentioned it earlier, but there's a certain amount of
> delazying patches in the tip:x86/xsave branch:
>
> $ gll linus..x86/xsave
> 300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
> f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
> 66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
> 1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
> 4182a4d68bac: x86, xsave: add support for non-lazy xstates
> 324cbb83e215: x86, xsave: more cleanups
> 2efd67935eb7: x86, xsave: remove unused code
> 0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
> 7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
> 26bce4e4c56f: x86, xsave: cleanup fpu/xsave support
>
> it's not in tip:master because the LWP bits need (much) more work to
> be palatable - but we could spin them off and complete them as per
> your suggestions if they are an independent speedup on modern CPUs.

Hans, what's the status of these? I want to do some other cleanups
(now or in a couple of weeks) that will probably conflict with your
xsave work.

--Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25 9:44 ` Andrew Lutomirski
@ 2011-07-25 9:51 ` Ingo Molnar
  2011-07-25 11:04 ` Hans Rosenfeld
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25 9:51 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Hans Rosenfeld, linux-kernel, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity

* Andrew Lutomirski <luto@mit.edu> wrote:

> On Mon, Jul 25, 2011 at 2:38 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Lutomirski <luto@mit.edu> wrote:
> >
> >> On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> >
> >> > * Andrew Lutomirski <luto@mit.edu> wrote:
> >> >
> >> >> I was trying to understand the FPU/xstate saving code, and I ran
> >> >> some benchmarks with surprising results. These are all on Sandy
> >> >> Bridge i7-2600. Please take all numbers with a grain of salt --
> >> >> they're in tight-ish loops and don't really take into account
> >> >> real-world cache effects.
> >> >>
> >> >> A clts/stts pair takes about 80 ns. Accessing extended state from
> >> >> userspace with TS set takes 239 ns. A kernel_fpu_begin /
> >> >> kernel_fpu_end pair with no userspace xstate access takes 80 ns
> >> >> (presumably 79 of those 80 are the clts/stts). (Note: The numbers
> >> >> in this paragraph were measured using a hacked-up kernel and KVM.)
> >> >>
> >> >> With nonzero ymm state, xsave + clflush (on the first cacheline of
> >> >> xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns,
> >> >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
> >> >>
> >> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
> >> >> ns and xsaveopt saves another 5 ns.
> >> >>
> >> >> Zeroing the state completely with vzeroall adds 2 ns. Not sure
> >> >> what's going on.
> >> >>
> >> >> All of this makes me think that, at least on Sandy Bridge, lazy
> >> >> xstate saving is a bad optimization -- if the cache is being nice,
> >> >> save/restore is faster than twiddling the TS bit. And the cost of
> >> >> the trap when TS is set blows everything else away.
> >> >
> >> > Interesting. Mind cooking up a delazying patch and measure it on
> >> > native as well? KVM generally makes exceptions more expensive, so the
> >> > effect of lazy exceptions might be less on native.
> >>
> >> Using the same patch on native, I get:
> >>
> >> kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
> >> stts/clts: 73 ns (clearly there's a bit of error here) userspace
> >> xstate with TS set: 229 ns
> >>
> >> So virtualization adds only a little bit of overhead.
> >
> > KVM rocks.
> >
> >> This isn't really a delazying patch -- it's two arch_prctls, one of
> >> them is kernel_fpu_begin();kernel_fpu_end(). The other is the same
> >> thing in a loop.
> >>
> >> The other numbers were already native since I measured them
> >> entirely in userspace. They look the same after rebooting.
> >
> > I should have mentioned it earlier, but there's a certain amount of
> > delazying patches in the tip:x86/xsave branch:
> >
> > $ gll linus..x86/xsave
> > 300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
> > f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
> > 66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
> > 1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
> > 4182a4d68bac: x86, xsave: add support for non-lazy xstates
> > 324cbb83e215: x86, xsave: more cleanups
> > 2efd67935eb7: x86, xsave: remove unused code
> > 0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
> > 7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
> > 26bce4e4c56f: x86, xsave: cleanup fpu/xsave support
> >
> > it's not in tip:master because the LWP bits need (much) more work to
> > be palatable - but we could spin them off and complete them as per
> > your suggestions if they are an independent speedup on modern CPUs.
>
> Hans, what's the status of these? I want to do some other cleanups
> (now or in a couple of weeks) that will probably conflict with your
> xsave work.

if you extract this bit:

  1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)

then we can keep all the other patches. this could be done by:

  git reset --hard 4182a4d68bac   # careful, this zaps your current dirty state
  git cherry-pick 66beba27e8b5
  git cherry-pick 300c6120b465

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread
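[Editorial note: Ingo's reset-and-cherry-pick recipe rebuilds the branch tip by hand, skipping the LWP commit and the merge. As an illustration only (using the commit ids from his list, and assuming the branch is checked out locally as `x86/xsave`), the same history surgery can be done in one step with `git rebase --onto`, which replays everything after the LWP commit onto its parent:]

```shell
# Drop only the LWP commit (1039b306b1c6) from the branch: replay all
# commits after it onto the commit just before it (4182a4d68bac).
# Merge commits in the replayed range are linearized away by default.
git checkout x86/xsave
git rebase --onto 4182a4d68bac 1039b306b1c6
```

The cherry-pick variant has the advantage of being explicit about which commits survive; the rebase variant scales better when more than two commits follow the one being dropped.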
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25 9:44 ` Andrew Lutomirski
  2011-07-25 9:51 ` Ingo Molnar
@ 2011-07-25 11:04 ` Hans Rosenfeld
  1 sibling, 0 replies; 15+ messages in thread
From: Hans Rosenfeld @ 2011-07-25 11:04 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, x86, Linus Torvalds, Arjan van de Ven, Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

On Mon, Jul 25, 2011 at 05:44:32AM -0400, Andrew Lutomirski wrote:
> On Mon, Jul 25, 2011 at 2:38 AM, Ingo Molnar <mingo@elte.hu> wrote:
> > I should have mentioned it earlier, but there's a certain amount of
> > delazying patches in the tip:x86/xsave branch:
> >
> > $ gll linus..x86/xsave
> > 300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
> > f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
> > 66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
> > 1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
> > 4182a4d68bac: x86, xsave: add support for non-lazy xstates
> > 324cbb83e215: x86, xsave: more cleanups
> > 2efd67935eb7: x86, xsave: remove unused code
> > 0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
> > 7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
> > 26bce4e4c56f: x86, xsave: cleanup fpu/xsave support
> >
> > it's not in tip:master because the LWP bits need (much) more work to
> > be palatable - but we could spin them off and complete them as per
> > your suggestions if they are an independent speedup on modern CPUs.
>
> Hans, what's the status of these? I want to do some other cleanups
> (now or in a couple of weeks) that will probably conflict with your
> xsave work.

I know of one bug in there that occasionally causes panics at boot, see
the attached patch for a fix.

Hans

-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

[-- Attachment #2: 0001-x86-xsave-clear-pre-allocated-xsave-area.patch --]
[-- Type: text/plain, Size: 1056 bytes --]

From 599d3ee9a9e743377739480a8a893582f1409a8d Mon Sep 17 00:00:00 2001
From: Hans Rosenfeld <hans.rosenfeld@amd.com>
Date: Wed, 6 Jul 2011 16:31:19 +0200
Subject: [PATCH 1/1] x86, xsave: clear pre-allocated xsave area

Bogus data in the xsave area can cause xrstor to panic, so make sure
that the pre-allocated xsave area is all nice and clean before being
used.

Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
---
 arch/x86/kernel/process.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c5ae256..03c5ded 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -28,8 +28,15 @@ EXPORT_SYMBOL_GPL(task_xstate_cachep);
 
 int arch_prealloc_fpu(struct task_struct *tsk)
 {
-	if (!fpu_allocated(&tsk->thread.fpu))
-		return fpu_alloc(&tsk->thread.fpu);
+	if (!fpu_allocated(&tsk->thread.fpu)) {
+		int err = fpu_alloc(&tsk->thread.fpu);
+
+		if (err)
+			return err;
+
+		fpu_clear(&tsk->thread.fpu);
+	}
+
 	return 0;
 }
 
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-24 21:15 ` Ingo Molnar
  2011-07-24 22:34 ` Andrew Lutomirski
@ 2011-07-25 7:42 ` Avi Kivity
  2011-07-25 7:54 ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2011-07-25 7:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Lutomirski, linux-kernel, x86, Linus Torvalds, Arjan van de Ven

On 07/25/2011 12:15 AM, Ingo Molnar wrote:
> > All of this makes me think that, at least on Sandy Bridge, lazy
> > xstate saving is a bad optimization -- if the cache is being nice,
> > save/restore is faster than twiddling the TS bit. And the cost of
> > the trap when TS is set blows everything else away.
>
> Interesting. Mind cooking up a delazying patch and measure it on
> native as well? KVM generally makes exceptions more expensive, so the
> effect of lazy exceptions might be less on native.

While this is true in general, kvm will trap #NM only after a host
context switch or an exit to host userspace. These are supposedly
rare so you won't see them a lot, especially in a benchmark
scenario with just one guest.

("host context switch" includes switching to the idle thread when
the guest executes HLT, something I tried to optimize in the past
but it proved too difficult for the gain)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC] syscall calling convention, stts/clts, and xstate latency
  2011-07-25 7:42 ` Avi Kivity
@ 2011-07-25 7:54 ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-07-25 7:54 UTC (permalink / raw)
To: Avi Kivity
Cc: Andrew Lutomirski, linux-kernel, x86, Linus Torvalds, Arjan van de Ven

* Avi Kivity <avi@redhat.com> wrote:

> On 07/25/2011 12:15 AM, Ingo Molnar wrote:
> >> All of this makes me think that, at least on Sandy Bridge, lazy
> >> xstate saving is a bad optimization -- if the cache is being nice,
> >> save/restore is faster than twiddling the TS bit. And the cost of
> >> the trap when TS is set blows everything else away.
> >
> > Interesting. Mind cooking up a delazying patch and measure it on
> > native as well? KVM generally makes exceptions more expensive, so
> > the effect of lazy exceptions might be less on native.
>
> While this is true in general, kvm will trap #NM only after a host
> context switch or an exit to host userspace. These are supposedly
> rare so you won't see them a lot, especially in a benchmark
> scenario with just one guest.
>
> ("host context switch" includes switching to the idle thread when
> the guest executes HLT, something I tried to optimize in the past
> but it proved too difficult for the gain)

Yeah - but this was a fair thing to test before Andy embarks on
something more ambitious on the native side.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2011-07-25 14:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-24 21:07 [RFC] syscall calling convention, stts/clts, and xstate latency Andrew Lutomirski
2011-07-24 21:15 ` Ingo Molnar
2011-07-24 22:34 ` Andrew Lutomirski
2011-07-25 3:21 ` Andrew Lutomirski
2011-07-25 6:42 ` Ingo Molnar
2011-07-25 10:05 ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski
2011-07-25 11:12 ` Ingo Molnar
2011-07-25 13:04 ` Andrew Lutomirski
2011-07-25 14:13 ` Ingo Molnar
2011-07-25 6:38 ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar
2011-07-25 9:44 ` Andrew Lutomirski
2011-07-25 9:51 ` Ingo Molnar
2011-07-25 11:04 ` Hans Rosenfeld
2011-07-25 7:42 ` Avi Kivity
2011-07-25 7:54 ` Ingo Molnar
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox