public inbox for linux-kernel@vger.kernel.org
* [RFC] syscall calling convention, stts/clts, and xstate latency
@ 2011-07-24 21:07 Andrew Lutomirski
  2011-07-24 21:15 ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Lutomirski @ 2011-07-24 21:07 UTC (permalink / raw)
  To: linux-kernel, x86

I was trying to understand the FPU/xstate saving code, and I ran some
benchmarks with surprising results.  These are all on a Sandy Bridge
i7-2600.  Please take all numbers with a grain of salt -- they're
measured in tight-ish loops and don't really account for real-world
cache effects.

A clts/stts pair takes about 80 ns.  Accessing extended state from
userspace with TS set takes 239 ns.  A kernel_fpu_begin /
kernel_fpu_end pair with no userspace xstate access takes 80 ns
(presumably 79 of those 80 are the clts/stts).  (Note: The numbers in
this paragraph were measured using a hacked-up kernel and KVM.)

With nonzero ymm state, xsave + clflush (on the first cacheline of
xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24 ns, xsaveopt
(with unchanged state) = 16 ns, and xrstor = 40 ns.

With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 ns
and xsaveopt saves another 5 ns.

Zeroing the state completely with vzeroall adds 2 ns.  Not sure what's going on.

All of this makes me think that, at least on Sandy Bridge, lazy xstate
saving is a bad optimization -- if the cache is being nice,
save/restore is faster than twiddling the TS bit.  And the cost of the
trap when TS is set blows everything else away.


Which brings me to another question: what do you think about declaring
some of the extended state to be clobbered by syscall?  Ideally, we'd
treat syscall like a regular function and clobber everything except
the floating point control word and mxcsr.  More conservatively, we'd
leave xmm and x87 state but clobber ymm.  This would let us keep the
cost of the state save and restore down when kernel_fpu_begin is used
in a syscall path and when a context switch happens as a result of a
syscall.

glibc does *not* mark the xmm registers as clobbered when it issues
syscalls, but I suspect that everything everywhere that issues
syscalls does it from a function, and functions are implicitly assumed
to clobber extended state.  (And if anything out there assumes that
ymm state is preserved, I'd be amazed.)


--Andy


Thread overview: 15+ messages
2011-07-24 21:07 [RFC] syscall calling convention, stts/clts, and xstate latency Andrew Lutomirski
2011-07-24 21:15 ` Ingo Molnar
2011-07-24 22:34   ` Andrew Lutomirski
2011-07-25  3:21     ` Andrew Lutomirski
2011-07-25  6:42       ` Ingo Molnar
2011-07-25 10:05       ` [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to Andy Lutomirski
2011-07-25 11:12         ` Ingo Molnar
2011-07-25 13:04           ` Andrew Lutomirski
2011-07-25 14:13             ` Ingo Molnar
2011-07-25  6:38     ` [RFC] syscall calling convention, stts/clts, and xstate latency Ingo Molnar
2011-07-25  9:44       ` Andrew Lutomirski
2011-07-25  9:51         ` Ingo Molnar
2011-07-25 11:04         ` Hans Rosenfeld
2011-07-25  7:42   ` Avi Kivity
2011-07-25  7:54     ` Ingo Molnar
