* Re: [RFC 0/3] Basic support for LWP
[not found] <1286212172-654419-1-git-send-email-hans.rosenfeld@amd.com>
@ 2010-10-04 22:13 ` Thomas Gleixner
  2010-10-05 14:51   ` Hans Rosenfeld
  0 siblings, 1 reply; 26+ messages in thread

From: Thomas Gleixner @ 2010-10-04 22:13 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: LKML, H. Peter Anvin, Ingo Molnar, Andreas Herrmann, Peter Zijlstra

On Mon, 4 Oct 2010, Hans Rosenfeld wrote:
> LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> user mode processes to gather performance data about themselves with
> very low overhead. The specification can be found here:
> http://developer.amd.com/cpu/LWP/Pages/default.aspx

All I can see there is marketing blurb. The spec PDF does not tell me
either what the end result of your patches will look like.

> This code adds basic support for LWP to the context switch code, which
> is the minimum needed to use LWP. Support for other LWP features like
> interrupts will be added later.

Oh no. We are not going to merge that and give you carte blanche to
come up with another profiling/performance monitoring facility which
is completely disconnected from everything else.

Please provide either the full patch set for this facility or at least a
reasonable explanation of how it will look when your code is ready.
What's the user space interface, how does it hook into perf, etc. ...

Thanks,

	tglx

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC 0/3] Basic support for LWP

From: Hans Rosenfeld @ 2010-10-05 14:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: robert.richter, LKML, H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

[ adding Robert ]

On Mon, Oct 04, 2010 at 06:13:46PM -0400, Thomas Gleixner wrote:
> > LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> > user mode processes to gather performance data about themselves with
> > very low overhead. The specification can be found here:
> > http://developer.amd.com/cpu/LWP/Pages/default.aspx
>
> All I can see there is marketing blurb. The spec PDF does not tell me
> either what the end result of your patches will look like.

Well, you should get a basic understanding of how LWP works from the
spec. I'm pretty sure there is more than that blurb in there.

What I sent _is_ the end result. Supporting LWP in the task switching
code is the only thing necessary to use the LWP instructions in user
space. And that is also what I'd like to get some comments on.

> > This code adds basic support for LWP to the context switch code, which
> > is the minimum needed to use LWP. Support for other LWP features like
> > interrupts will be added later.
>
> Oh no. We are not going to merge that and give you carte blanche to
> come up with another profiling/performance monitoring facility which
> is completely disconnected from everything else.

I think you got something wrong here, probably because I wasn't clear
enough :)

What I sent was just the basic kernel support, which means
saving/restoring the LWP state in task switches. Technically no other
kernel changes are necessary to use LWP in user space programs. No
kernel framework of any kind is necessary to use LWP.

Additionally, Robert is checking whether and how LWP support could be
integrated into perf. This of course depends on having basic LWP task
switching support in the kernel.

> Please provide either the full patch set for this facility or at least a
> reasonable explanation of how it will look when your code is ready.
> What's the user space interface, how does it hook into perf, etc. ...

The basic principle of LWP is that a user program reserves some memory
for an LWP control block and a ring buffer, writes the ring buffer
address and some setup bits into the control block, and executes the
LLWPCB instruction to load the control block into the CPU. While the
program runs, it can periodically read the samples from the ring buffer
and do something useful with them.

The program being profiled needs to include support for LWP to do that.
In case you can't or don't want to modify the program being profiled, a
small LWP library would have to be written and LD_PRELOADed.

I don't really know what it will look like in the end or how it will be
integrated into perf. The tool/library developers will probably know
best how to make use of LWP. The point is that LWP is supposed to be
controlled and used from user space without any kernel interaction,
provided that the kernel saves/restores the LWP context. That's also
how I tested it, using a small test program that just uses the LWP
instructions.

Technically LWP is quite similar to the FPU; it even uses the same
facilities for OS support. Just like with FPU support, the kernel
doesn't have to care at all about what user space is going to do with
it. It just has to make sure state is saved and restored when
necessary.

Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [RFC 0/3] Basic support for LWP

From: Thomas Gleixner @ 2010-10-05 15:34 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: robert.richter, LKML, H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

On Tue, 5 Oct 2010, Hans Rosenfeld wrote:
> [ adding Robert ]
>
> On Mon, Oct 04, 2010 at 06:13:46PM -0400, Thomas Gleixner wrote:
> > > LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> > > user mode processes to gather performance data about themselves with
> > > very low overhead. The specification can be found here:
> > > http://developer.amd.com/cpu/LWP/Pages/default.aspx
> >
> > All I can see there is marketing blurb. The spec PDF does not tell me
> > either what the end result of your patches will look like.
>
> Well, you should get a basic understanding of how LWP works from the
> spec. I'm pretty sure there is more than that blurb in there.
>
> What I sent _is_ the end result. Supporting LWP in the task switching
> code is the only thing necessary to use the LWP instructions in user
> space. And that is also what I'd like to get some comments on.

So it's the end result. Ok, then explain the following sentence:

> > > This code adds basic support for LWP to the context switch code, which
> > > is the minimum needed to use LWP. Support for other LWP features like
> > > interrupts will be added later.

If your patch _IS_ the end result then there are no other LWP
features, right? If there are, then your patch is _NOT_ the end result
and we really want to know what is coming.

Thanks,

	tglx
* Re: [RFC 0/3] Basic support for LWP

From: Hans Rosenfeld @ 2010-10-05 18:27 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Richter, Robert, LKML, H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

On Tue, Oct 05, 2010 at 11:34:11AM -0400, Thomas Gleixner wrote:
> So it's the end result. Ok, then explain the following sentence:
>
> > > > This code adds basic support for LWP to the context switch code, which
> > > > is the minimum needed to use LWP. Support for other LWP features like
> > > > interrupts will be added later.
>
> If your patch _IS_ the end result then there are no other LWP
> features, right?

Ok, you got me there. I should have checked again before I wrote this.
There is at least one other optional feature: support for a
thresholding interrupt, which the OS should relay to the user process
in one way or another. But that's not required to use the LWP
instructions. Maybe Robert will add it to perf some day, or maybe not.
It is completely optional, and how it is implemented at some point in
the future is completely irrelevant to the basic support of LWP.

> If there are, then your patch is _NOT_ the end result and we really
> want to know what is coming.

As far as I am concerned, it _is_ the end result. It implements the
necessary changes to make LWP usable, and that's it. Whatever optional
features Robert or anybody else may or may not implement, this in no
way affects the necessity of managing the LWP state for user
processes. Those are two completely unrelated issues.

Now, since this was obviously not made clear enough before: don't
merge this code anywhere yet. This is an RFC and nothing else. It
implements a basic change necessary to support a new instruction set,
which just so happens to be a profiling mechanism, too.

Please review the code and flame me for anything you don't like in it.

Hans

PS: The patches were not delivered to LKML previously, so I'll resend
them.

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* [RFC 0/3] Basic support for LWP

From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw)
To: linux-kernel; +Cc: Hans Rosenfeld

LWP (Light-Weight Profiling) is a new profiling mechanism that allows
user mode processes to gather performance data about themselves with
very low overhead. The specification can be found here:
http://developer.amd.com/cpu/LWP/Pages/default.aspx

These patches are against tip/master
531548b311024e1bec01bc412f5258e243a796b7.

This code adds basic support for LWP to the context switch code, which
is the minimum needed to use LWP. Support for other LWP features like
interrupts will be added later.

Hans Rosenfeld (3):
  Cleanup xsave/xrstor support.
  Allow saving of individual states in fpu_xsave().
  Save/restore LWP state in context switches.

 arch/x86/include/asm/i387.h       | 20 +++------
 arch/x86/include/asm/lwp.h        | 68 ++++++++++++++++++++++++++++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/include/asm/processor.h  | 12 +++++
 arch/x86/include/asm/sigcontext.h | 12 +++++
 arch/x86/include/asm/xsave.h      | 90 +++++++++++++++---------------------
 arch/x86/kernel/process_32.c      | 13 +++++
 arch/x86/kernel/process_64.c      | 13 +++++
 arch/x86/kernel/traps.c           |  2 +-
 arch/x86/kernel/xsave.c           |  6 +-
 10 files changed, 167 insertions(+), 70 deletions(-)
 create mode 100644 arch/x86/include/asm/lwp.h
* [RFC 1/3] Cleanup xsave/xrstor support. 2010-10-05 18:27 ` Hans Rosenfeld 2010-10-05 18:30 ` Hans Rosenfeld @ 2010-10-05 18:30 ` Hans Rosenfeld 2010-10-05 18:30 ` [RFC 2/3] Allow saving of individual states in fpu_xsave() Hans Rosenfeld ` (2 subsequent siblings) 4 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw) To: linux-kernel; +Cc: Hans Rosenfeld Removed the functions fpu_fxrstor_checking() and restore_fpu_checking() because they weren't doing anything. Removed redundant xsave/xrstor implementations. Since xsave/xrstor is not specific to the FPU, and also for consistency, all xsave/xrstor functions now take a xsave_struct argument. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 20 +++------- arch/x86/include/asm/xsave.h | 81 +++++++++++++++--------------------------- arch/x86/kernel/traps.c | 2 +- arch/x86/kernel/xsave.c | 4 +- 4 files changed, 38 insertions(+), 69 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 70626ed..de8593b 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -199,12 +199,14 @@ static inline void fpu_fxsave(struct fpu *fpu) static inline void fpu_save_init(struct fpu *fpu) { if (use_xsave()) { - fpu_xsave(fpu); + struct xsave_struct *xstate = &fpu->state->xsave; + + fpu_xsave(xstate); /* * xsave header may indicate the init state of the FP. 
*/ - if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP)) + if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP)) return; } else if (use_fxsr()) { fpu_fxsave(fpu); @@ -234,22 +236,12 @@ static inline void __save_init_fpu(struct task_struct *tsk) task_thread_info(tsk)->status &= ~TS_USEDFPU; } -static inline int fpu_fxrstor_checking(struct fpu *fpu) -{ - return fxrstor_checking(&fpu->state->fxsave); -} - static inline int fpu_restore_checking(struct fpu *fpu) { if (use_xsave()) - return fpu_xrstor_checking(fpu); + return xrstor_checking(&fpu->state->xsave, -1); else - return fpu_fxrstor_checking(fpu); -} - -static inline int restore_fpu_checking(struct task_struct *tsk) -{ - return fpu_restore_checking(&tsk->thread.fpu); + return fxrstor_checking(&fpu->state->fxsave); } /* diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index c6ce245..8bcbbce 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -42,10 +42,11 @@ extern int check_for_xstate(struct i387_fxsave_struct __user *buf, void __user *fpstate, struct _fpx_sw_bytes *sw); -static inline int fpu_xrstor_checking(struct fpu *fpu) +static inline int xrstor_checking(struct xsave_struct *fx, u64 mask) { - struct xsave_struct *fx = &fpu->state->xsave; int err; + u32 lmask = mask; + u32 hmask = mask >> 32; asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" "2:\n" @@ -55,13 +56,23 @@ static inline int fpu_xrstor_checking(struct fpu *fpu) ".previous\n" _ASM_EXTABLE(1b, 3b) : [err] "=r" (err) - : "D" (fx), "m" (*fx), "a" (-1), "d" (-1), "0" (0) + : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask), "0" (0) : "memory"); return err; } -static inline int xsave_user(struct xsave_struct __user *buf) +static inline void xrstor_state(struct xsave_struct *fx, u64 mask) +{ + u32 lmask = mask; + u32 hmask = mask >> 32; + + asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" + : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) + : "memory"); +} + +static inline int 
xsave_checking(struct xsave_struct __user *buf) { int err; @@ -74,58 +85,24 @@ static inline int xsave_user(struct xsave_struct __user *buf) if (unlikely(err)) return -EFAULT; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (buf), "a" (-1), "d" (-1), "0" (0) - : "memory"); + asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl $-1,%[err]\n" + " jmp 2b\n" + ".previous\n" + _ASM_EXTABLE(1b,3b) + : [err] "=r" (err) + : "D" (buf), "a" (-1), "d" (-1), "0" (0) + : "memory"); + if (unlikely(err) && __clear_user(buf, xstate_size)) err = -EFAULT; - /* No need to clear here because the caller clears USED_MATH */ - return err; -} - -static inline int xrestore_user(struct xsave_struct __user *buf, u64 mask) -{ - int err; - struct xsave_struct *xstate = ((__force struct xsave_struct *)buf); - u32 lmask = mask; - u32 hmask = mask >> 32; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (xstate), "a" (lmask), "d" (hmask), "0" (0) - : "memory"); /* memory required? 
*/ + /* No need to clear here because the caller clears USED_MATH */ return err; } -static inline void xrstor_state(struct xsave_struct *fx, u64 mask) -{ - u32 lmask = mask; - u32 hmask = mask >> 32; - - asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" - : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) - : "memory"); -} - static inline void xsave_state(struct xsave_struct *fx, u64 mask) { u32 lmask = mask; @@ -136,7 +113,7 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask) : "memory"); } -static inline void fpu_xsave(struct fpu *fpu) +static inline void fpu_xsave(struct xsave_struct *fx) { /* This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. */ @@ -144,7 +121,7 @@ static inline void fpu_xsave(struct fpu *fpu) ".byte " REX_PREFIX "0x0f,0xae,0x27", ".byte " REX_PREFIX "0x0f,0xae,0x37", X86_FEATURE_XSAVEOPT, - [fx] "D" (&fpu->state->xsave), "a" (-1), "d" (-1) : + [fx] "D" (fx), "a" (-1), "d" (-1) : "memory"); } #endif diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index cb838ca..966793d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -731,7 +731,7 @@ void __math_state_restore(void) /* * Paranoid restore. send a SIGSEGV if we fail to restore the state. */ - if (unlikely(restore_fpu_checking(tsk))) { + if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) { stts(); force_sig(SIGSEGV, tsk); return; diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 9c253bd..5eb15d4 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -170,7 +170,7 @@ int save_i387_xstate(void __user *buf) if (task_thread_info(tsk)->status & TS_USEDFPU) { if (use_xsave()) - err = xsave_user(buf); + err = xsave_checking(buf); else err = fxsave_user(buf); @@ -247,7 +247,7 @@ static int restore_user_xstate(void __user *buf) /* * restore the state passed by the user. 
*/ - err = xrestore_user(buf, mask); + err = xrstor_checking((__force struct xsave_struct *)buf, mask); if (err) return err; -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RFC 2/3] Allow saving of individual states in fpu_xsave(). 2010-10-05 18:27 ` Hans Rosenfeld 2010-10-05 18:30 ` Hans Rosenfeld 2010-10-05 18:30 ` [RFC 1/3] Cleanup xsave/xrstor support Hans Rosenfeld @ 2010-10-05 18:30 ` Hans Rosenfeld 2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld 2010-10-05 19:05 ` [RFC 0/3] Basic support for LWP Ingo Molnar 4 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw) To: linux-kernel; +Cc: Hans Rosenfeld Add a feature mask argument to fpu_xsave(). Introduce XCNTXT_DEFAULT to specify which feature states are to be saved/restored by default at context switches. Rename XCNTXT_MASK to XCNTXT_SUPPORTED to make clear what it really is. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 4 ++-- arch/x86/include/asm/xsave.h | 14 +++++++++++--- arch/x86/kernel/xsave.c | 2 +- 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index de8593b..33600d3 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -201,7 +201,7 @@ static inline void fpu_save_init(struct fpu *fpu) if (use_xsave()) { struct xsave_struct *xstate = &fpu->state->xsave; - fpu_xsave(xstate); + fpu_xsave(xstate, XCNTXT_DEFAULT); /* * xsave header may indicate the init state of the FP. @@ -239,7 +239,7 @@ static inline void __save_init_fpu(struct task_struct *tsk) static inline int fpu_restore_checking(struct fpu *fpu) { if (use_xsave()) - return xrstor_checking(&fpu->state->xsave, -1); + return xrstor_checking(&fpu->state->xsave, XCNTXT_DEFAULT); else return fxrstor_checking(&fpu->state->fxsave); } diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 8bcbbce..8820b57 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -23,7 +23,12 @@ /* * These are the features that the OS can handle currently. 
*/ -#define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) + +/* + * These are the features that the OS saves/restores by default. + */ +#define XCNTXT_DEFAULT (-1) #ifdef CONFIG_X86_64 #define REX_PREFIX "0x48, " @@ -113,15 +118,18 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask) : "memory"); } -static inline void fpu_xsave(struct xsave_struct *fx) +static inline void fpu_xsave(struct xsave_struct *fx, u64 mask) { + u32 lmask = mask; + u32 hmask = mask >> 32; + /* This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. */ alternative_input( ".byte " REX_PREFIX "0x0f,0xae,0x27", ".byte " REX_PREFIX "0x0f,0xae,0x37", X86_FEATURE_XSAVEOPT, - [fx] "D" (fx), "a" (-1), "d" (-1) : + [fx] "D" (fx), "a" (lmask), "d" (hmask) : "memory"); } #endif diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 5eb15d4..11c3445 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -434,7 +434,7 @@ static void __init xstate_enable_boot_cpu(void) /* * Support only the state known to OS. */ - pcntxt_mask = pcntxt_mask & XCNTXT_MASK; + pcntxt_mask = pcntxt_mask & XCNTXT_SUPPORTED; xstate_enable(); -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RFC 3/3] Save/restore LWP state in context switches.

From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw)
To: linux-kernel; +Cc: Hans Rosenfeld

LWP (Light-Weight Profiling) is a new per-thread profiling mechanism
that can be enabled by any thread at any time if the OS claims to
support it (by setting a bit in XCR0). A thread's LWP state
(configuration & unsaved collected data) is supposed to be saved and
restored with xsave and xrstor by the OS.

Unfortunately, LWP does not support any kind of lazy switching, nor does
it use the TS bit in CR0. Since any thread can enable LWP at any time
without the kernel knowing, the context switch code is supposed to
save/restore LWP context unconditionally. This would require a valid
xsave state area for all threads, whether or not they use any FPU or LWP
functionality. It would also make the already complex lazy switching
code more complicated.

To avoid this memory overhead, especially on systems not supporting
LWP, and also to avoid more intrusive changes to the code that handles
FPU state, this patch handles LWP separately from the FPU. Only if a
system supports LWP does the context switch code check whether LWP has
been used by the thread that is being taken off the CPU, by reading the
LWP_CBADDR MSR, which is nonzero if LWP has been used by the thread.
Only in that case is the LWP state saved to the common xsave area in the
thread's FPU context. This means, of course, that an FPU context has to
be allocated and initialized when a thread uses LWP before it has used
the FPU.

Similarly, restoring the LWP state is only done when an FPU context
exists and the LWP bit in the xstate header is set.
To make things a little more complicated, xsave and xrstor _do_ use the TS bit and trap when it is set. To avoid unwanted traps, the TS bit has to be cleared before and restored after doing xsave or xrstor for LWP. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/lwp.h | 68 +++++++++++++++++++++++++++++++++++++ arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/processor.h | 12 ++++++ arch/x86/include/asm/sigcontext.h | 12 ++++++ arch/x86/include/asm/xsave.h | 5 ++- arch/x86/kernel/process_32.c | 13 +++++++ arch/x86/kernel/process_64.c | 13 +++++++ 7 files changed, 122 insertions(+), 2 deletions(-) create mode 100644 arch/x86/include/asm/lwp.h diff --git a/arch/x86/include/asm/lwp.h b/arch/x86/include/asm/lwp.h new file mode 100644 index 0000000..6a8e348 --- /dev/null +++ b/arch/x86/include/asm/lwp.h @@ -0,0 +1,68 @@ +/* + * Copyright (C) 2010 Advanced Micro Devices, Inc. + * Contributed by Hans Rosenfeld <hans.rosenfeld@amd.com> + */ + +#ifndef _ASM_X86_LWP_H +#define _ASM_X86_LWP_H + +#ifndef __ASSEMBLY__ + +#include <asm/i387.h> +#include <asm/xsave.h> +#include <asm/processor.h> +#include <asm/msr-index.h> +#include <asm/system.h> + +/* + * Save LWP state if it has been used by the task. Initializes the FPU + * state if necessary. + */ +static inline void save_lwp_state(struct task_struct *tsk) +{ + u64 lwpcb; + bool ts; + + /* Has LWP been used? */ + rdmsrl(MSR_AMD64_LWP_CBADDR, lwpcb); + if (!lwpcb) + return; + + /* set up FPU state if necessary */ + if (!fpu_allocated(&tsk->thread.fpu)) + init_fpu(tsk); + + /* Xsave traps when TS is set. Disable it temporarily. */ + ts = read_cr0() & X86_CR0_TS; + clts(); + fpu_xsave(&tsk->thread.fpu.state->xsave, XSTATE_LWP); + if (ts) + stts(); + + /* Disable LWP for next thread. */ + wrmsrl(MSR_AMD64_LWP_CBADDR, 0); +} + +/* + * Restore LWP state if present in the tasks xsave area. 
+ */ +static inline void restore_lwp_state(struct task_struct *tsk) +{ + bool ts; + + /* Don't do anything if there is no LWP state for this thread */ + if (!fpu_allocated(&tsk->thread.fpu) || + !(tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv & XSTATE_LWP)) + return; + + /* Xrstor traps when TS is set. Disable it temporarily. */ + ts = read_cr0() & X86_CR0_TS; + clts(); + xrstor_state(&tsk->thread.fpu.state->xsave, XSTATE_LWP); + if (ts) + stts(); +} + +#endif /* __ASSEMBLY__ */ + +#endif /* _ASM_X86_LWP_H */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 986f779..9183ecc 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -121,6 +121,7 @@ #define MSR_AMD64_IBSDCLINAD 0xc0011038 #define MSR_AMD64_IBSDCPHYSAD 0xc0011039 #define MSR_AMD64_IBSCTL 0xc001103a +#define MSR_AMD64_LWP_CBADDR 0xc0000106 /* Fam 10h MSRs */ #define MSR_FAM10H_MMIO_CONF_BASE 0xc0010058 diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index cae9c3c..cbb5da9 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -358,6 +358,17 @@ struct ymmh_struct { u32 ymmh_space[64]; }; +struct lwp_struct { + u64 lwpcb_addr; + u32 flags; + u32 buf_head_offset; + u64 buf_base; + u32 buf_size; + u32 filters; + u64 saved_event_record[4]; + u32 event_counter[16]; +}; + struct xsave_hdr_struct { u64 xstate_bv; u64 reserved1[2]; @@ -368,6 +379,7 @@ struct xsave_struct { struct i387_fxsave_struct i387; struct xsave_hdr_struct xsave_hdr; struct ymmh_struct ymmh; + struct lwp_struct lwp; /* new processor state extensions will go here */ } __attribute__ ((packed, aligned (64))); diff --git a/arch/x86/include/asm/sigcontext.h b/arch/x86/include/asm/sigcontext.h index 04459d2..0a58b82 100644 --- a/arch/x86/include/asm/sigcontext.h +++ b/arch/x86/include/asm/sigcontext.h @@ -274,6 +274,17 @@ struct _ymmh_state { __u32 ymmh_space[64]; }; +struct _lwp_state { + __u64 lwpcb_addr; + 
__u32 flags; + __u32 buf_head_offset; + __u64 buf_base; + __u32 buf_size; + __u32 filters; + __u64 saved_event_record[4]; + __u32 event_counter[16]; +}; + /* * Extended state pointed by the fpstate pointer in the sigcontext. * In addition to the fpstate, information encoded in the xstate_hdr @@ -284,6 +295,7 @@ struct _xstate { struct _fpstate fpstate; struct _xsave_hdr xstate_hdr; struct _ymmh_state ymmh; + struct _lwp_state lwp; /* new processor state extensions go here */ }; diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 8820b57..ceb0b85 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -9,6 +9,7 @@ #define XSTATE_FP 0x1 #define XSTATE_SSE 0x2 #define XSTATE_YMM 0x4 +#define XSTATE_LWP (1ULL << 62) #define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE) @@ -23,12 +24,12 @@ /* * These are the features that the OS can handle currently. */ -#define XCNTXT_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM | XSTATE_LWP) /* * These are the features that the OS saves/restores by default. */ -#define XCNTXT_DEFAULT (-1) +#define XCNTXT_DEFAULT (~XSTATE_LWP) #ifdef CONFIG_X86_64 #define REX_PREFIX "0x48, " diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 96586c3..69ced7d 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -44,6 +44,7 @@ #include <asm/ldt.h> #include <asm/processor.h> #include <asm/i387.h> +#include <asm/lwp.h> #include <asm/desc.h> #ifdef CONFIG_MATH_EMULATION #include <asm/math_emu.h> @@ -311,6 +312,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) __unlazy_fpu(prev_p); + /* + * Save LWP state if LWP is available. 
+ */ + if (static_cpu_has(X86_FEATURE_LWP)) + save_lwp_state(prev_p); + /* we're going to use this soon, after a few expensive things */ if (preload_fpu) prefetch(next->fpu.state); @@ -353,6 +360,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT)) __switch_to_xtra(prev_p, next_p, tss); + /* + * Restore LWP state if available. + */ + if (static_cpu_has(X86_FEATURE_LWP)) + restore_lwp_state(next_p); + /* If we're going to preload the fpu context, make sure clts is run while we're batching the cpu state updates. */ if (preload_fpu) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index b3d7a3a..a70c6cc 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -42,6 +42,7 @@ #include <asm/system.h> #include <asm/processor.h> #include <asm/i387.h> +#include <asm/lwp.h> #include <asm/mmu_context.h> #include <asm/prctl.h> #include <asm/desc.h> @@ -423,6 +424,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) load_TLS(next, cpu); + /* + * Save LWP state if LWP is available. + */ + if (static_cpu_has(X86_FEATURE_LWP)) + save_lwp_state(prev_p); + /* Must be after DS reload */ __unlazy_fpu(prev_p); @@ -489,6 +496,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) __switch_to_xtra(prev_p, next_p, tss); /* + * Restore LWP state if available. + */ + if (static_cpu_has(X86_FEATURE_LWP)) + restore_lwp_state(next_p); + + /* * Preload the FPU context, now that we've determined that the * task is likely to be using it. */ -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC 3/3] Save/restore LWP state in context switches. 2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld @ 2010-10-06 11:12 ` Brian Gerst 2010-10-07 14:58 ` Hans Rosenfeld ` (3 more replies) 0 siblings, 4 replies; 26+ messages in thread From: Brian Gerst @ 2010-10-06 11:12 UTC (permalink / raw) To: Hans Rosenfeld; +Cc: linux-kernel On Tue, Oct 5, 2010 at 2:30 PM, Hans Rosenfeld <hans.rosenfeld@amd.com> wrote: > LWP (Light-Weight Profiling) is a new per-thread profiling mechanism > that can be enabled by any thread at any time if the OS claims to > support it (by setting a bit in XCR0). A threads LWP state > (configuration & unsaved collected data) is supposed to be saved and > restored with xsave and xrstor by the OS. > > Unfortunately, LWP does not support any kind of lazy switching, nor does > it use the TS bit in CR0. Since any thread can enable LWP at any time > without the kernel knowing, the context switch code is supposed to > save/restore LWP context unconditionally. This would require a valid > xsave state area for all threads, whether or not they use any FPU or LWP > functionality. It would also make the already complex lazy switching > code more complicated. > > To avoid this memory overhead, especially for systems not supporting > LWP, and also to avoid more intrusive changes to the code that handles > FPU state, this patch handles LWP separately from the FPU. Only if a > system supports LWP, the context switch code checks whether LWP has been > used by the thread that is being taken off the CPU by reading the > LWP_CBADDR MSR, which is nonzero if LWP has been used by the thread. > Only in that case the LWP state is saved to the common xsave area in the > threads FPU context. This means, of course, that an FPU context has to > be allocated and initialized when a thread first uses LWP before using > the FPU. 
> > Similarly, restoring the LWP state is only done when an FPU context > exists and the LWP bit in the xstate header is set. > > To make things a little more complicated, xsave and xrstor _do_ use the > TS bit and trap when it is set. To avoid unwanted traps, the TS bit has > to be cleared before and restored after doing xsave or xrstor for LWP. > > Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> I would prefer to see the xsave code refactored so that you would only need one xsave/xrstor call per context switch. We are currently treating xsave as an extension to the FPU state. But it would be better to treat FPU as a subset of the extended state. That way more state can be added without touching the FPU code. -- Brian Gerst ^ permalink raw reply [flat|nested] 26+ messages in thread
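The conditional-save scheme quoted above can be modeled in plain userspace C. This is only an illustrative sketch of the idea, not the patch's kernel code: the struct, field names, and `save_lwp_state_model()` are all made up for the example, and the MSR read is simulated by a plain field access.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the commit message above: on a context switch, LWP state is
 * saved only when the LWP_CBADDR MSR is nonzero, i.e. only when the
 * outgoing thread actually enabled LWP. All names here are illustrative
 * stand-ins, not the kernel's. */

struct model_task {
    uint64_t lwp_cbaddr;      /* stands in for reading the LWP_CBADDR MSR */
    int      lwp_state_saved; /* would be the LWP area of the xsave buffer */
};

static void save_lwp_state_model(struct model_task *prev)
{
    /* One cheap MSR read decides whether the (expensive) save of the
     * LWP state area is needed at all; threads that never enabled LWP
     * pay only for this check. */
    if (prev->lwp_cbaddr != 0)
        prev->lwp_state_saved = 1; /* would be xsave of the LWP state here */
}
```

The point of the check is visible in the model: a thread with a zero (modeled) LWP_CBADDR goes through the context switch without any LWP save at all.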
* Re: [RFC 3/3] Save/restore LWP state in context switches. 2010-10-06 11:12 ` Brian Gerst @ 2010-10-07 14:58 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 0/2] FPU/xsave rework in preparation for LWP Hans Rosenfeld ` (2 subsequent siblings) 3 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-10-07 14:58 UTC (permalink / raw) To: Brian Gerst; +Cc: linux-kernel@vger.kernel.org On Wed, Oct 06, 2010 at 07:12:11AM -0400, Brian Gerst wrote: > On Tue, Oct 5, 2010 at 2:30 PM, Hans Rosenfeld <hans.rosenfeld@amd.com> wrote: > > LWP (Light-Weight Profiling) is a new per-thread profiling mechanism > > that can be enabled by any thread at any time if the OS claims to > > support it (by setting a bit in XCR0). A thread's LWP state > > (configuration & unsaved collected data) is supposed to be saved and > > restored with xsave and xrstor by the OS. > > > > Unfortunately, LWP does not support any kind of lazy switching, nor does > > it use the TS bit in CR0. Since any thread can enable LWP at any time > > without the kernel knowing, the context switch code is supposed to > > save/restore LWP context unconditionally. This would require a valid > > xsave state area for all threads, whether or not they use any FPU or LWP > > functionality. It would also make the already complex lazy switching > > code more complicated. > > > > To avoid this memory overhead, especially for systems not supporting > > LWP, and also to avoid more intrusive changes to the code that handles > > FPU state, this patch handles LWP separately from the FPU. Only if a > > system supports LWP, the context switch code checks whether LWP has been > > used by the thread that is being taken off the CPU by reading the > > LWP_CBADDR MSR, which is nonzero if LWP has been used by the thread. > > Only in that case the LWP state is saved to the common xsave area in the > > thread's FPU context. 
This means, of course, that an FPU context has to > > be allocated and initialized when a thread first uses LWP before using > > the FPU. > > > > Similarly, restoring the LWP state is only done when an FPU context > > exists and the LWP bit in the xstate header is set. > > > > To make things a little more complicated, xsave and xrstor _do_ use the > > TS bit and trap when it is set. To avoid unwanted traps, the TS bit has > > to be cleared before and restored after doing xsave or xrstor for LWP. > > > > Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> > > I would prefer to see the xsave code refactored so that you would only > need one xsave/xrstor call per context switch. We are currently > treating xsave as an extension to the FPU state. But it would be > better to treat FPU as a subset of the extended state. That way more > state can be added without touching the FPU code. IIRC Robert did some changes to the xsave stuff to make it more independent from the FPU stuff. The first of my patches also goes in that direction, but that's not the main point of it. The main point is to remove useless or duplicated code and make it easier to understand :) I have spent some more time meditating about that code and thinking about using only one xsave/xrstor call per context switch. It is possible, but it would at least require pre-allocating the xsave state area for each thread. The xsave part would be fairly simple; the FPU state is saved non-lazily anyway if it has been used. Xsave would probably save the state more often than necessary, as it can't know whether the state really changed since the last time it was saved. This could be avoided by checking this before doing xsave, and then doing xsave only for the parts that really need saving (like LWP). Xrstor would need more work because of lazy FPU switching. You would always have to check whether you want to preload the FPU state or not, and do xrstor for all or only the necessary states. 
In any case, one would have to take care of CR0.TS when doing all that. So, as far as I understand it, to make this simpler and cleaner, at least the preallocation of xsave or fpu state buffers is necessary. To really clean it up, dropping lazy switching would probably help. But I'm certainly not proposing to do that :) Hans -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown ^ permalink raw reply [flat|nested] 26+ messages in thread
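The CR0.TS interaction discussed in this subthread can be sketched as a small userspace model. Everything here is simulated — `clts`/`stts` become plain flag writes and the #NM trap becomes a counter — so this only illustrates the control flow, not the real privileged instructions:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the CR0.TS handling described above: xsave and xrstor trap
 * (#NM) when CR0.TS is set, so a caller that may run with TS set has to
 * clear it first and put it back afterwards. */

static bool cr0_ts;   /* simulated CR0.TS bit */
static int  nm_traps; /* count of simulated #NM traps */

static void xsave_model(void)
{
    if (cr0_ts)
        nm_traps++; /* real hardware would raise #NM here */
}

static void lwp_xsave_with_ts_handling(void)
{
    bool ts = cr0_ts;
    cr0_ts = false;   /* clts: allow xsave to run trap-free */
    xsave_model();
    cr0_ts = ts;      /* put TS back so lazy FPU switching still works */
}
```

Calling the wrapper with TS set saves the state without a trap and leaves TS exactly as it was, which is the behavior the patch's save/restore paths need.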
* [RFC 0/2] FPU/xsave rework in preparation for LWP 2010-10-06 11:12 ` Brian Gerst 2010-10-07 14:58 ` Hans Rosenfeld @ 2010-11-23 20:41 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 1/2] x86, xsave: cleanup xsave/xrstor support Hans Rosenfeld 2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld 3 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-11-23 20:41 UTC (permalink / raw) To: linux-kernel Cc: x86, brgerst, robert.richter, andreas.herrmann3, Hans Rosenfeld Hi, please review and comment on these patches. They are based on tip/master a17d38493c3dd2253828b107b5d405dd7cdcbf43. The first one is just a generic cleanup of the xsave code that I already sent a couple of weeks ago. When my previous LWP patchset was reviewed, I was told that it would be much better to use only a single call to xsave and xrstor in the context switch. This is basically what I try to achieve with the second patch: base all "extended state" management on xsave/xrstor (with fallback to fxsave/fxrstor or even fsave/frstor) and keep track of which extended states a thread actually used. The actual LWP support is not included here; I will work on this next. As a side effect of this rework, there will be a single simple interface that can be used by other parts of the kernel that need to save and restore extended states (thinking of KVM here). This should also improve the general readability and maintainability of this code, although I would agree that this isn't visible yet :) There is a lot of room for cleaning things up here, which I will do along with the LWP support. The LWP changes will require that each thread will always have an xstate area, but I don't expect this to be much of a problem. While this would add an overhead of around 1kb for each thread that wouldn't need an xstate area otherwise, I expect that there are only very few such threads on any reasonably modern system. I will have to check that when the LWP support is done. 
This would allow for another big cleanup by removing all the complexity of supporting threads that don't have an xstate area. Hans Rosenfeld (2): x86, xsave: cleanup xsave/xrstor support x86, xsave: rework xsave support arch/x86/include/asm/i387.h | 111 +++++++---------------- arch/x86/include/asm/thread_info.h | 2 + arch/x86/include/asm/xsave.h | 88 ++++++++----------- arch/x86/kernel/i387.c | 11 ++- arch/x86/kernel/process_64.c | 24 +++--- arch/x86/kernel/traps.c | 31 +------ arch/x86/kernel/xsave.c | 174 +++++++++++++++++++----------------- arch/x86/kvm/x86.c | 7 +- 8 files changed, 188 insertions(+), 260 deletions(-) ^ permalink raw reply [flat|nested] 26+ messages in thread
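The per-thread state tracking the cover letter describes — a mask of live extended states plus the existing recent-use counter — can be modeled in userspace. The constants and the threshold of 5 timeslices match what the series uses; the struct and function names are illustrative stand-ins, and the real save/restore work is elided:

```c
#include <assert.h>
#include <stdint.h>

/* xstate component bits as used by the series (bit 0 x87, bit 1 SSE,
 * bit 2 YMM); XCNTXT_LAZY is the set of lazily-switched states. */
#define XSTATE_FP   0x1ULL
#define XSTATE_SSE  0x2ULL
#define XSTATE_YMM  0x4ULL
#define XCNTXT_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)

struct model_thread {
    uint64_t xstate_mask; /* which xstates are live for this thread */
    unsigned fpu_counter; /* recent-use heuristic, as in the kernel */
};

/* Mirrors the save path: after saving, drop the lazy states from the
 * mask unless the thread used the FPU in the last 5 timeslices (in
 * which case the switch code will preload them again). */
static void save_xstates_model(struct model_thread *t)
{
    if (t->fpu_counter < 5)
        t->xstate_mask &= ~XCNTXT_LAZY;
}

/* Mirrors the restore path: mark the states live and bump the counter. */
static void restore_xstates_model(struct model_thread *t, uint64_t mask)
{
    t->xstate_mask |= mask;
    t->fpu_counter++;
}
```

A future "non-lazy" state such as LWP would simply keep its bit set in `xstate_mask` permanently, which is why no extra per-switch bookkeeping would be needed for it.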
* [RFC 1/2] x86, xsave: cleanup xsave/xrstor support 2010-10-06 11:12 ` Brian Gerst 2010-10-07 14:58 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 0/2] FPU/xsave rework in preparation for LWP Hans Rosenfeld @ 2010-11-23 20:41 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld 3 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-11-23 20:41 UTC (permalink / raw) To: linux-kernel Cc: x86, brgerst, robert.richter, andreas.herrmann3, Hans Rosenfeld Removed the functions fpu_fxrstor_checking() and restore_fpu_checking() because they weren't doing anything. Removed redundant xsave/xrstor implementations. Since xsave/xrstor is not specific to the FPU, and also for consistency, all xsave/xrstor functions now take a xsave_struct argument. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 20 +++------- arch/x86/include/asm/xsave.h | 81 +++++++++++++++--------------------------- arch/x86/kernel/traps.c | 2 +- arch/x86/kernel/xsave.c | 4 +- 4 files changed, 38 insertions(+), 69 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index ef32890..d908383 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -227,12 +227,14 @@ static inline void fpu_fxsave(struct fpu *fpu) static inline void fpu_save_init(struct fpu *fpu) { if (use_xsave()) { - fpu_xsave(fpu); + struct xsave_struct *xstate = &fpu->state->xsave; + + fpu_xsave(xstate); /* * xsave header may indicate the init state of the FP. 
*/ - if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP)) + if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP)) return; } else if (use_fxsr()) { fpu_fxsave(fpu); @@ -262,22 +264,12 @@ static inline void __save_init_fpu(struct task_struct *tsk) task_thread_info(tsk)->status &= ~TS_USEDFPU; } -static inline int fpu_fxrstor_checking(struct fpu *fpu) -{ - return fxrstor_checking(&fpu->state->fxsave); -} - static inline int fpu_restore_checking(struct fpu *fpu) { if (use_xsave()) - return fpu_xrstor_checking(fpu); + return xrstor_checking(&fpu->state->xsave, -1); else - return fpu_fxrstor_checking(fpu); -} - -static inline int restore_fpu_checking(struct task_struct *tsk) -{ - return fpu_restore_checking(&tsk->thread.fpu); + return fxrstor_checking(&fpu->state->fxsave); } /* diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index c6ce245..8bcbbce 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -42,10 +42,11 @@ extern int check_for_xstate(struct i387_fxsave_struct __user *buf, void __user *fpstate, struct _fpx_sw_bytes *sw); -static inline int fpu_xrstor_checking(struct fpu *fpu) +static inline int xrstor_checking(struct xsave_struct *fx, u64 mask) { - struct xsave_struct *fx = &fpu->state->xsave; int err; + u32 lmask = mask; + u32 hmask = mask >> 32; asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" "2:\n" @@ -55,13 +56,23 @@ static inline int fpu_xrstor_checking(struct fpu *fpu) ".previous\n" _ASM_EXTABLE(1b, 3b) : [err] "=r" (err) - : "D" (fx), "m" (*fx), "a" (-1), "d" (-1), "0" (0) + : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask), "0" (0) : "memory"); return err; } -static inline int xsave_user(struct xsave_struct __user *buf) +static inline void xrstor_state(struct xsave_struct *fx, u64 mask) +{ + u32 lmask = mask; + u32 hmask = mask >> 32; + + asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" + : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) + : "memory"); +} + +static inline int 
xsave_checking(struct xsave_struct __user *buf) { int err; @@ -74,58 +85,24 @@ static inline int xsave_user(struct xsave_struct __user *buf) if (unlikely(err)) return -EFAULT; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (buf), "a" (-1), "d" (-1), "0" (0) - : "memory"); + asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl $-1,%[err]\n" + " jmp 2b\n" + ".previous\n" + _ASM_EXTABLE(1b,3b) + : [err] "=r" (err) + : "D" (buf), "a" (-1), "d" (-1), "0" (0) + : "memory"); + if (unlikely(err) && __clear_user(buf, xstate_size)) err = -EFAULT; - /* No need to clear here because the caller clears USED_MATH */ - return err; -} - -static inline int xrestore_user(struct xsave_struct __user *buf, u64 mask) -{ - int err; - struct xsave_struct *xstate = ((__force struct xsave_struct *)buf); - u32 lmask = mask; - u32 hmask = mask >> 32; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (xstate), "a" (lmask), "d" (hmask), "0" (0) - : "memory"); /* memory required? 
*/ + /* No need to clear here because the caller clears USED_MATH */ return err; } -static inline void xrstor_state(struct xsave_struct *fx, u64 mask) -{ - u32 lmask = mask; - u32 hmask = mask >> 32; - - asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" - : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) - : "memory"); -} - static inline void xsave_state(struct xsave_struct *fx, u64 mask) { u32 lmask = mask; @@ -136,7 +113,7 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask) : "memory"); } -static inline void fpu_xsave(struct fpu *fpu) +static inline void fpu_xsave(struct xsave_struct *fx) { /* This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. */ @@ -144,7 +121,7 @@ static inline void fpu_xsave(struct fpu *fpu) ".byte " REX_PREFIX "0x0f,0xae,0x27", ".byte " REX_PREFIX "0x0f,0xae,0x37", X86_FEATURE_XSAVEOPT, - [fx] "D" (&fpu->state->xsave), "a" (-1), "d" (-1) : + [fx] "D" (fx), "a" (-1), "d" (-1) : "memory"); } #endif diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index f02c179..1b0d148 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -720,7 +720,7 @@ void __math_state_restore(void) /* * Paranoid restore. send a SIGSEGV if we fail to restore the state. */ - if (unlikely(restore_fpu_checking(tsk))) { + if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) { stts(); force_sig(SIGSEGV, tsk); return; diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 9c253bd..5eb15d4 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -170,7 +170,7 @@ int save_i387_xstate(void __user *buf) if (task_thread_info(tsk)->status & TS_USEDFPU) { if (use_xsave()) - err = xsave_user(buf); + err = xsave_checking(buf); else err = fxsave_user(buf); @@ -247,7 +247,7 @@ static int restore_user_xstate(void __user *buf) /* * restore the state passed by the user. 
*/ - err = xrestore_user(buf, mask); + err = xrstor_checking((__force struct xsave_struct *)buf, mask); if (err) return err; -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
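A detail worth noting in the cleaned-up `xrstor_checking()`/`xrstor_state()` above: xsave and xrstor take their 64-bit component bitmask in the edx:eax register pair, which is why every call site splits the `u64 mask` into `lmask`/`hmask`. A minimal demonstration of that split (the helper name is made up for the example):

```c
#include <assert.h>
#include <stdint.h>

/* xsave/xrstor receive the 64-bit component mask in edx:eax, so the
 * kernel splits the u64 exactly like the patch does:
 *     u32 lmask = mask;        ->  eax
 *     u32 hmask = mask >> 32;  ->  edx
 */
static void split_xstate_mask(uint64_t mask, uint32_t *lmask, uint32_t *hmask)
{
    *lmask = (uint32_t)mask;
    *hmask = (uint32_t)(mask >> 32);
}
```

Passing `-1` as the mask (as the FPU restore path does) therefore becomes all-ones in both eax and edx, i.e. "restore every component the hardware supports".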
* [RFC 2/2] x86, xsave: rework xsave support 2010-10-06 11:12 ` Brian Gerst ` (2 preceding siblings ...) 2010-11-23 20:41 ` [RFC 1/2] x86, xsave: cleanup xsave/xrstor support Hans Rosenfeld @ 2010-11-23 20:41 ` Hans Rosenfeld 2010-11-25 0:36 ` Brian Gerst 3 siblings, 1 reply; 26+ messages in thread From: Hans Rosenfeld @ 2010-11-23 20:41 UTC (permalink / raw) To: linux-kernel Cc: x86, brgerst, robert.richter, andreas.herrmann3, Hans Rosenfeld This is an attempt to rework the code that handles FPU and related extended states. Since FPU, XMM and YMM states are just variants of what xsave handles, all of the old FPU-specific state handling code will be hidden behind a set of functions that resemble xsave and xrstor. For hardware that does not support xsave, the code falls back to fxsave/fxrstor or even fsave/frstor. An xstate_mask member will be added to the thread_info structure that will control which states are to be saved by xsave. It is set to include all "lazy" states (that is, all states currently supported) by the #NM handler when a lazy restore is triggered or by switch_to() when the task's FPU context is preloaded. Xstate_mask is intended to completely replace TS_USEDFPU in a later cleanup patch. When "non-lazy" states such as for LWP are added later, the corresponding bits in xstate_mask are supposed to be set for all threads all the time. There will be no performance penalty for threads not using these states, as xsave and xrstor will ignore unused states. This patch is not complete and not final at all. Support for 32-bit is lacking, and the context handling for signals will probably change again. I haven't benchmarked it yet, but I have tested it for both the fxsave and xsave cases. 
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 103 +++++++--------------- arch/x86/include/asm/thread_info.h | 2 + arch/x86/include/asm/xsave.h | 7 ++ arch/x86/kernel/i387.c | 11 ++- arch/x86/kernel/process_64.c | 24 +++--- arch/x86/kernel/traps.c | 31 +------ arch/x86/kernel/xsave.c | 174 +++++++++++++++++++----------------- arch/x86/kvm/x86.c | 7 +- 8 files changed, 159 insertions(+), 200 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index d908383..53b62c5 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -159,7 +159,7 @@ static inline int fxsave_user(struct i387_fxsave_struct __user *fx) return err; } -static inline void fpu_fxsave(struct fpu *fpu) +static inline void fpu_fxsave(struct i387_fxsave_struct *fx) { /* Using "rex64; fxsave %0" is broken because, if the memory operand uses any extended registers for addressing, a second REX prefix @@ -170,7 +170,7 @@ static inline void fpu_fxsave(struct fpu *fpu) /* Using "fxsaveq %0" would be the ideal choice, but is only supported starting with gas 2.16. */ __asm__ __volatile__("fxsaveq %0" - : "=m" (fpu->state->fxsave)); + : "=m" (*fx)); #else /* Using, as a workaround, the properly prefixed form below isn't accepted by any binutils version so far released, complaining that @@ -181,8 +181,8 @@ static inline void fpu_fxsave(struct fpu *fpu) This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. 
*/ asm volatile("rex64/fxsave (%[fx])" - : "=m" (fpu->state->fxsave) - : [fx] "R" (&fpu->state->fxsave)); + : "=m" (*fx) + : [fx] "R" (fx)); #endif } @@ -204,14 +204,34 @@ static inline int fxrstor_checking(struct i387_fxsave_struct *fx) return 0; } -static inline void fpu_fxsave(struct fpu *fpu) +static inline void fpu_fxsave(struct i387_fxsave_struct *fx) { asm volatile("fxsave %[fx]" - : [fx] "=m" (fpu->state->fxsave)); + : [fx] "=m" (fx)); } #endif /* CONFIG_X86_64 */ +/* + * These must be called with preempt disabled + */ + +static inline int fpu_restore(struct i387_fxsave_struct *fx) +{ + return fxrstor_checking(fx); +} + +static inline void fpu_save(struct i387_fxsave_struct *fx) +{ + if (use_fxsr()) { + fpu_fxsave(fx); + } else { + asm volatile("fsave %[fx]; fwait" + : [fx] "=m" (fx)); + return; + } +} + /* We need a safe address that is cheap to find and that is already in L1 during context switch. The best choices are unfortunately different for UP and SMP */ @@ -221,30 +241,9 @@ static inline void fpu_fxsave(struct fpu *fpu) #define safe_address (kstat_cpu(0).cpustat.user) #endif -/* - * These must be called with preempt disabled - */ -static inline void fpu_save_init(struct fpu *fpu) +static inline void fpu_clean(struct i387_fxsave_struct *fx) { - if (use_xsave()) { - struct xsave_struct *xstate = &fpu->state->xsave; - - fpu_xsave(xstate); - - /* - * xsave header may indicate the init state of the FP. 
- */ - if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP)) - return; - } else if (use_fxsr()) { - fpu_fxsave(fpu); - } else { - asm volatile("fsave %[fx]; fwait" - : [fx] "=m" (fpu->state->fsave)); - return; - } - - if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) + if (unlikely(fx->swd & X87_FSW_ES)) asm volatile("fnclex"); /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception @@ -258,35 +257,12 @@ static inline void fpu_save_init(struct fpu *fpu) [addr] "m" (safe_address)); } -static inline void __save_init_fpu(struct task_struct *tsk) -{ - fpu_save_init(&tsk->thread.fpu); - task_thread_info(tsk)->status &= ~TS_USEDFPU; -} - -static inline int fpu_restore_checking(struct fpu *fpu) -{ - if (use_xsave()) - return xrstor_checking(&fpu->state->xsave, -1); - else - return fxrstor_checking(&fpu->state->fxsave); -} - /* * Signal frame handlers... */ extern int save_i387_xstate(void __user *buf); extern int restore_i387_xstate(void __user *buf); -static inline void __unlazy_fpu(struct task_struct *tsk) -{ - if (task_thread_info(tsk)->status & TS_USEDFPU) { - __save_init_fpu(tsk); - stts(); - } else - tsk->fpu_counter = 0; -} - static inline void __clear_fpu(struct task_struct *tsk) { if (task_thread_info(tsk)->status & TS_USEDFPU) { @@ -295,6 +271,7 @@ static inline void __clear_fpu(struct task_struct *tsk) "2:\n" _ASM_EXTABLE(1b, 2b)); task_thread_info(tsk)->status &= ~TS_USEDFPU; + task_thread_info(tsk)->status &= ~XCNTXT_LAZY; stts(); } } @@ -303,10 +280,9 @@ static inline void kernel_fpu_begin(void) { struct thread_info *me = current_thread_info(); preempt_disable(); - if (me->status & TS_USEDFPU) - __save_init_fpu(me->task); - else - clts(); + /* XXX saves nonlazy unnecessarily? 
*/ + save_xstates(me->task); + clts(); } static inline void kernel_fpu_end(void) @@ -357,21 +333,6 @@ static inline void irq_ts_restore(int TS_state) /* * These disable preemption on their own and are safe */ -static inline void save_init_fpu(struct task_struct *tsk) -{ - preempt_disable(); - __save_init_fpu(tsk); - stts(); - preempt_enable(); -} - -static inline void unlazy_fpu(struct task_struct *tsk) -{ - preempt_disable(); - __unlazy_fpu(tsk); - preempt_enable(); -} - static inline void clear_fpu(struct task_struct *tsk) { preempt_disable(); diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index f0b6e5d..5c92d21 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -26,6 +26,7 @@ struct exec_domain; struct thread_info { struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ + __u64 xstate_mask; /* xstates in use */ __u32 flags; /* low level flags */ __u32 status; /* thread synchronous flags */ __u32 cpu; /* current CPU */ @@ -47,6 +48,7 @@ struct thread_info { { \ .task = &tsk, \ .exec_domain = &default_exec_domain, \ + .xstate_mask = 0, \ .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 8bcbbce..2eb019e 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -25,6 +25,8 @@ */ #define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_LAZY XCNTXT_MASK + #ifdef CONFIG_X86_64 #define REX_PREFIX "0x48, " #else @@ -35,6 +37,11 @@ extern unsigned int xstate_size; extern u64 pcntxt_mask; extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS]; +extern void xsave(struct xsave_struct *, u64); +extern int xrstor(struct xsave_struct *, u64); +extern void save_xstates(struct task_struct *); +extern void restore_xstates(struct task_struct *, u64); + extern void xsave_init(void); extern void 
update_regset_xstate_info(unsigned int size, u64 xstate_mask); extern int init_fpu(struct task_struct *child); diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 58bb239..72bc6f0 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -152,8 +152,11 @@ int init_fpu(struct task_struct *tsk) int ret; if (tsk_used_math(tsk)) { - if (HAVE_HWFP && tsk == current) - unlazy_fpu(tsk); + if (HAVE_HWFP && tsk == current) { + preempt_disable(); + save_xstates(tsk); + preempt_enable(); + } return 0; } @@ -599,7 +602,9 @@ int save_i387_xstate_ia32(void __user *buf) NULL, fp) ? -1 : 1; } - unlazy_fpu(tsk); + preempt_disable(); + save_xstates(tsk); + preempt_enable(); if (cpu_has_xsave) return save_i387_xsave(fp); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index b3d7a3a..c95a9e5 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -253,7 +253,9 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls) */ void prepare_to_copy(struct task_struct *tsk) { - unlazy_fpu(tsk); + preempt_disable(); + save_xstates(tsk); + preempt_enable(); } int copy_thread(unsigned long clone_flags, unsigned long sp, @@ -382,18 +384,18 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) int cpu = smp_processor_id(); struct tss_struct *tss = &per_cpu(init_tss, cpu); unsigned fsindex, gsindex; - bool preload_fpu; + u64 preload_lazy = 0; /* * If the task has used fpu the last 5 timeslices, just do a full * restore of the math state immediately to avoid the trap; the * chances of needing FPU soon are obviously high now */ - preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5; - - /* we're going to use this soon, after a few expensive things */ - if (preload_fpu) + if (tsk_used_math(next_p) && next_p->fpu_counter > 5) { + preload_lazy = XCNTXT_LAZY; + /* we're going to use this soon, after a few expensive things */ prefetch(next->fpu.state); + } /* * Reload esp0, LDT and the page table 
pointer: @@ -424,11 +426,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) load_TLS(next, cpu); /* Must be after DS reload */ - __unlazy_fpu(prev_p); - - /* Make sure cpu is ready for new context */ - if (preload_fpu) - clts(); + save_xstates(prev_p); /* * Leave lazy mode, flushing any hypercalls made here. @@ -492,8 +490,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) * Preload the FPU context, now that we've determined that the * task is likely to be using it. */ - if (preload_fpu) - __math_state_restore(); + if (preload_lazy) + restore_xstates(next_p, preload_lazy); return prev_p; } diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 1b0d148..b40691b 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -617,7 +617,10 @@ void math_error(struct pt_regs *regs, int error_code, int trapnr) /* * Save the info for the exception handler and clear the error. */ - save_init_fpu(task); + preempt_disable(); + save_xstates(task); + preempt_enable(); + task->thread.trap_no = trapnr; task->thread.error_code = error_code; info.si_signo = SIGFPE; @@ -709,28 +712,6 @@ asmlinkage void __attribute__((weak)) smp_threshold_interrupt(void) } /* - * __math_state_restore assumes that cr0.TS is already clear and the - * fpu state is all ready for use. Used during context switch. - */ -void __math_state_restore(void) -{ - struct thread_info *thread = current_thread_info(); - struct task_struct *tsk = thread->task; - - /* - * Paranoid restore. send a SIGSEGV if we fail to restore the state. 
- */ - if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) { - stts(); - force_sig(SIGSEGV, tsk); - return; - } - - thread->status |= TS_USEDFPU; /* So we fnsave on switch_to() */ - tsk->fpu_counter++; -} - -/* * 'math_state_restore()' saves the current math information in the * old math state array, and gets the new ones from the current task * @@ -760,9 +741,7 @@ asmlinkage void math_state_restore(void) local_irq_disable(); } - clts(); /* Allow maths ops (or we recurse) */ - - __math_state_restore(); + restore_xstates(tsk, XCNTXT_LAZY); } EXPORT_SYMBOL_GPL(math_state_restore); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 5eb15d4..5457332 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -47,8 +47,6 @@ void __sanitize_i387_state(struct task_struct *tsk) if (!fx) return; - BUG_ON(task_thread_info(tsk)->status & TS_USEDFPU); - xstate_bv = tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv; /* @@ -168,22 +166,10 @@ int save_i387_xstate(void __user *buf) if (!used_math()) return 0; - if (task_thread_info(tsk)->status & TS_USEDFPU) { - if (use_xsave()) - err = xsave_checking(buf); - else - err = fxsave_user(buf); - - if (err) - return err; - task_thread_info(tsk)->status &= ~TS_USEDFPU; - stts(); - } else { - sanitize_i387_state(tsk); - if (__copy_to_user(buf, &tsk->thread.fpu.state->fxsave, - xstate_size)) - return -1; - } + save_xstates(tsk); + sanitize_i387_state(tsk); + if (__copy_to_user(buf, &tsk->thread.fpu.state->xsave, xstate_size)) + return -1; clear_used_math(); /* trigger finit */ @@ -229,62 +215,24 @@ int save_i387_xstate(void __user *buf) } /* - * Restore the extended state if present. Otherwise, restore the FP/SSE - * state. 
- */ -static int restore_user_xstate(void __user *buf) -{ - struct _fpx_sw_bytes fx_sw_user; - u64 mask; - int err; - - if (((unsigned long)buf % 64) || - check_for_xstate(buf, buf, &fx_sw_user)) - goto fx_only; - - mask = fx_sw_user.xstate_bv; - - /* - * restore the state passed by the user. - */ - err = xrstor_checking((__force struct xsave_struct *)buf, mask); - if (err) - return err; - - /* - * init the state skipped by the user. - */ - mask = pcntxt_mask & ~mask; - if (unlikely(mask)) - xrstor_state(init_xstate_buf, mask); - - return 0; - -fx_only: - /* - * couldn't find the extended state information in the - * memory layout. Restore just the FP/SSE and init all - * the other extended state. - */ - xrstor_state(init_xstate_buf, pcntxt_mask & ~XSTATE_FPSSE); - return fxrstor_checking((__force struct i387_fxsave_struct *)buf); -} - -/* * This restores directly out of user space. Exceptions are handled. */ int restore_i387_xstate(void __user *buf) { + struct _fpx_sw_bytes fx_sw_user; struct task_struct *tsk = current; int err = 0; if (!buf) { - if (used_math()) - goto clear; + if (used_math()) { + clear_fpu(tsk); + clear_used_math(); + } return 0; - } else - if (!access_ok(VERIFY_READ, buf, sig_xstate_size)) - return -EACCES; + } + + if (!access_ok(VERIFY_READ, buf, sig_xstate_size)) + return -EACCES; if (!used_math()) { err = init_fpu(tsk); @@ -292,25 +240,21 @@ int restore_i387_xstate(void __user *buf) return err; } - if (!(task_thread_info(current)->status & TS_USEDFPU)) { - clts(); - task_thread_info(current)->status |= TS_USEDFPU; - } - if (use_xsave()) - err = restore_user_xstate(buf); - else - err = fxrstor_checking((__force struct i387_fxsave_struct *) - buf); - if (unlikely(err)) { - /* - * Encountered an error while doing the restore from the - * user buffer, clear the fpu state. 
-	 */
-clear:
-	clear_fpu(tsk);
-	clear_used_math();
-	}
-	return err;
+	if (__copy_from_user(&tsk->thread.fpu.state->xsave, buf, xstate_size))
+		return -1;
+
+	/*
+	 * Restore only states specified by the user. If there is anything wrong
+	 * with the xstate, restore only FP/SSE. XRSTOR will initialize all
+	 * other states.
+	 */
+	if (!check_for_xstate(buf, buf, &fx_sw_user))
+		tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv &= fx_sw_user.xstate_bv;
+	else if (use_xsave())
+		tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv = XSTATE_FPSSE;
+
+	restore_xstates(tsk, XCNTXT_LAZY);
+	return 0;
 }
 #endif
@@ -473,3 +417,65 @@ void __cpuinit xsave_init(void)
 	next_func = xstate_enable;
 	this_func();
 }
+
+void xsave(struct xsave_struct *x, u64 mask)
+{
+	clts();
+
+	if (use_xsave())
+		xsave_state(x, mask);
+	else if (mask & XCNTXT_LAZY)
+		fpu_save(&x->i387);
+
+	if (mask & XCNTXT_LAZY)
+		fpu_clean(&x->i387);
+
+	stts();
+}
+
+void save_xstates(struct task_struct *tsk)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+
+	if (!fpu_allocated(&tsk->thread.fpu))
+		return;
+
+	xsave(&tsk->thread.fpu.state->xsave, ti->xstate_mask);
+
+	if (!(ti->xstate_mask & XCNTXT_LAZY))
+		tsk->fpu_counter = 0;
+
+	if (tsk->fpu_counter < 5)
+		ti->xstate_mask &= ~XCNTXT_LAZY;
+
+	ti->status &= ~TS_USEDFPU;
+}
+
+int xrstor(struct xsave_struct *x, u64 mask)
+{
+	clts();
+
+	if (use_xsave())
+		return xrstor_checking(x, mask);
+	else if (mask & XCNTXT_LAZY)
+		return fpu_restore(&x->i387);
+
+	return 0;
+}
+
+void restore_xstates(struct task_struct *tsk, u64 mask)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+
+	if (!fpu_allocated(&tsk->thread.fpu))
+		return;
+
+	if (unlikely(xrstor(&tsk->thread.fpu.state->xsave, mask))) {
+		stts();
+		force_sig(SIGSEGV, tsk);
+	} else {
+		ti->xstate_mask |= mask;
+		ti->status |= TS_USEDFPU;
+		tsk->fpu_counter++;
+	}
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cdac9e5..771044d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -57,6 +57,7 @@
 #include <asm/xcr.h>
 #include <asm/pvclock.h>
 #include <asm/div64.h>
+#include <asm/xsave.h>
 
 #define MAX_IO_MSRS 256
 
 #define CR0_RESERVED_BITS \
@@ -5712,8 +5713,8 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	 */
 	kvm_put_guest_xcr0(vcpu);
 	vcpu->guest_fpu_loaded = 1;
-	unlazy_fpu(current);
-	fpu_restore_checking(&vcpu->arch.guest_fpu);
+	save_xstates(current);
+	xrstor(&vcpu->arch.guest_fpu.state->xsave, -1);
 	trace_kvm_fpu(1);
 }
@@ -5725,7 +5726,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 		return;
 	vcpu->guest_fpu_loaded = 0;
-	fpu_save_init(&vcpu->arch.guest_fpu);
+	xsave(&vcpu->arch.guest_fpu.state->xsave, -1);
 	++vcpu->stat.fpu_reload;
 	kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 	trace_kvm_fpu(0);
--
1.5.6.5
* Re: [RFC 2/2] x86, xsave: rework xsave support
  2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld
@ 2010-11-25  0:36   ` Brian Gerst
  0 siblings, 0 replies; 26+ messages in thread
From: Brian Gerst @ 2010-11-25  0:36 UTC (permalink / raw)
To: Hans Rosenfeld; +Cc: linux-kernel, x86, robert.richter, andreas.herrmann3

On Tue, Nov 23, 2010 at 3:41 PM, Hans Rosenfeld <hans.rosenfeld@amd.com> wrote:
> This is an attempt to rework the code that handles FPU and related
> extended states. Since FPU, XMM and YMM states are just variants of what
> xsave handles, all of the old FPU-specific state handling code will be
> hidden behind a set of functions that resemble xsave and xrstor. For
> hardware that does not support xsave, the code falls back to
> fxsave/fxrstor or even fsave/frstor.
>
> An xstate_mask member will be added to the thread_info structure that
> will control which states are to be saved by xsave. It is set to include
> all "lazy" states (that is, all states currently supported) by the #NM
> handler when a lazy restore is triggered, or by switch_to() when the
> task's FPU context is preloaded. Xstate_mask is intended to completely
> replace TS_USEDFPU in a later cleanup patch.
>
> When "non-lazy" states such as those for LWP are added later, the
> corresponding bits in xstate_mask are supposed to be set for all threads
> all the time. There will be no performance penalty for threads not using
> these states, as xsave and xrstor will ignore unused states.
>
> This patch is not complete and not final at all. Support for 32-bit is
> lacking, and the context handling for signals will probably change
> again. I haven't benchmarked it yet, but I have tested it for both the
> fxsave and xsave cases.

Looks good, but I would suggest adding wrappers for save_xstates() and
restore_xstates() that handle preemption, the way unlazy_fpu() did.

--
Brian Gerst
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 18:27 ` Hans Rosenfeld
                   ` (3 preceding siblings ...)
  2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld
@ 2010-10-05 19:05 ` Ingo Molnar
  2010-10-06  7:35   ` Robert Richter
  4 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2010-10-05 19:05 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: Thomas Gleixner, Richter, Robert, LKML, H. Peter Anvin, Herrmann3,
    Andreas, Peter Zijlstra, Peter Zijlstra, Frédéric Weisbecker,
    Steven Rostedt, Arnaldo Carvalho de Melo

* Hans Rosenfeld <hans.rosenfeld@amd.com> wrote:

> But that's not required to use LWP instructions. Maybe Robert will add
> it to perf some day, or maybe not. It is completely optional, and how
> it is implemented at some point in the future is completely irrelevant
> to the basic support of LWP.

It isn't irrelevant. If the concept is implemented in a crappy way, if
there are random user-space libraries that do not properly expose these
capabilities and do not allow them to extend existing perf functionality
in a natural way, then we might be better off not adding overhead to the
scheduler and not enabling it at all.

So thought needs to be given to what the point of it all is and how it
integrates into perf. If it doesn't integrate, if the whole plan is to
just get it to user space where it can be messed up freely in some
CPU-specific way, then color me thoroughly uninterested. We have a
generic instrumentation framework for a reason.

We very much want to know the structure of that area and want to make
use of it, not just on a per-task basis. We also want to expose it all
in a more generalized form.

I cannot believe something like this was committed into silicon without
at minimum talking to people who have a clue about where it will (and
should) all go ...

	Ingo
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 19:05 ` [RFC 0/3] Basic support for LWP Ingo Molnar
@ 2010-10-06  7:35 ` Robert Richter
  [not found]        ` <AANLkTi=T0QmcKeZcgcR+GKk-9OwQUB_x8XdHiNuU7tE_@mail.gmail.com>
  0 siblings, 1 reply; 26+ messages in thread
From: Robert Richter @ 2010-10-06  7:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rosenfeld, Hans, Thomas Gleixner, LKML, H. Peter Anvin, Herrmann3,
    Andreas, Peter Zijlstra, Peter Zijlstra, Frédéric Weisbecker,
    Steven Rostedt, Arnaldo Carvalho de Melo, Stephane Eranian

On 05.10.10 15:05:01, Ingo Molnar wrote:

> So thoughts need to be made what the point of it all is and how it
> integrates into perf. If it doesnt integrate, if the whole plan is to
> just get it to user-space where it can be messed up freely in some CPU
> specific way then color me thoroughly uninterested. We have a generic
> instrumentation framework for a reason.

I was looking at how this integrates into the perf syscall. It might be
a little disappointing, but there is not much for the kernel to do. Ring
buffer handling is implemented in hardware now; user land sets up
address ranges in the task's address space for buffers and can thus
access them directly. We do not need an interrupt handler to fill the
buffers. The pmu state is saved and restored during context switches in
ways that have been proven for the fpu (xsave) and for virtualization
(the LWPCB resembles a VMCB). So, overall, hardware is now doing the job
of writing samples into a userland buffer and managing the pmu state.
This greatly reduces system overhead while profiling, especially because
we don't have to walk through a software stack with each sample (this is
where the 'lightweight' comes from).

Of course this does not fit into current frameworks because of its
difference in concept, but in general we want to see it in perf. So the
main work of integration is left to tool and library implementation. But
for this, Linux must implement LWP context switch support. This is what
Hans did.

We also measured the system impact that comes from the additional
rdmsrl() if the cpu supports LWP. There is no significant performance
decrease in a worst-case scenario. So, this is how we think it is best
to implement it, and we need your feedback here.

I think we should consider applying patch 1/3 as it is unrelated to LWP
and reworks and improves the code.

Thanks,

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
* Re: [RFC 0/3] Basic support for LWP
  [not found] ` <AANLkTi=T0QmcKeZcgcR+GKk-9OwQUB_x8XdHiNuU7tE_@mail.gmail.com>
@ 2010-10-07 10:46   ` Stephane Eranian
  2010-10-07 13:59     ` H. Peter Anvin
  0 siblings, 1 reply; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 10:46 UTC (permalink / raw)
To: mingo
Cc: Hans.Rosenfeld, robert.richter, tglx, linux-kernel, hpa,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

Hi,

Some comments on all of this.

First of all, LWP adds value because it allows collecting information
which is otherwise unavailable or of lesser quality.

LWP is meant to measure only at the user level (only CPL=3). It is blind
to kernel-level execution or even priv levels 1 and 2.

LWP is ALWAYS precise (that's THE key value add in my mind). It reports
the address of the instruction that caused the counter to overflow. This
is true for all events it can measure, including cache misses. That
precision can be achieved through HW support.

LWP does not have the data pollution of IBS. It only records
instructions that caused the type of event you want to measure. For
instance, if you want to sample cache misses, it collects only data
related to instructions that generate cache misses, unlike IBS. It is a
radically different approach. Don't get me wrong, both are useful; it
all depends on what you want to measure.

LWP does allow data normalization. You know how many instances of the
sampling events have been observed, unlike IBS.

LWP is lightweight in the sense that the HW records samples for you in a
memory buffer. This is similar to PEBS in that regard. But it does not
capture full machine state. It also has a limited number of events.

LWP allows users to insert their own samples in the buffer. That can be
useful to record other kinds of information. It could be used to make
this work in system-wide mode, for instance.

The LWP spec has been out for review by EVERYBODY since 2007. I remember
looking at it back then. So the comments about "nobody was consulted
before this got into silicon" are erroneous. Everybody had a chance to
comment.

LWP operates independently from the regular core PMU. Both can be active
at the same time with no problems.

As for the patch itself, I am not an expert at xsave/xrstor, but it
seems to me you could decouple LWP from the FPU. I think Brian had the
same comment. I suspect this can be done and it will certainly look
cleaner.

On the comments about integrating this into perf_events: LWP is designed
for self-monitoring ONLY. It can be controlled entirely from user space
(once the kernel enables it and has xsave/xrstor support). Thus it seems
natural, at first, to implement this in a library that can be either
explicitly called by developers or implicitly used via LD_PRELOAD.

But I suspect it could also be implemented as part of perf_events in a
manner similar to PEBS, using the precise_ip mode to trigger it. This is
assuming that the LWP instructions work at priv level 0. Both the LWP
control block and the LWP buffer would be allocated by the kernel and
yet be accessible from priv level 3. It would also need its own event
encoding, as there isn't necessarily a 1-to-1 mapping with core PMU
events. It would also use the interrupt mode in this case.

In terms of access control, I believe it would be possible to make LWP
invisible to users by tweaking the LWP_CFG register to report no support
and yet have the kernel control LWP.

You'd have to find a way to copy ALL the information in each LWP record
into the perf sampling buffer format.

What does this buy us? An integrated solution with perf and the perf
tools, no dedicated library, finer-grained access control. Possibly
precise sampling in per-cpu mode (leveraging LWPVAL).

What do we lose? The lightweight aspect of it. There would be syscalls
for setup/start/stop, and data would have to be copied between the LWP
record format and the perf sampling buffer.

Personally, I would have gone with the user-level approach first, using
polling mode on the LWP buffer, to evaluate whether LWP delivers on its
promise w.r.t. high-quality data and low overhead. But I also understand
the motivation to integrate this into perf to provide a uniform
interface.

But one thing is clear: LWP adds value, and I want access to the LWP
data because it is of higher quality than what we can get today on AMD
processors. It can really help drive certain optimizations, e.g., in
compilers.

On Wed, Oct 6, 2010 at 9:35 AM, Robert Richter <robert.richter@amd.com> wrote:
> On 05.10.10 15:05:01, Ingo Molnar wrote:
>
>> So thoughts need to be made what the point of it all is and how it
>> integrates into perf. If it doesnt integrate, if the whole plan is to
>> just get it to user-space where it can be messed up freely in some CPU
>> specific way then color me thoroughly uninterested. We have a generic
>> instrumentation framework for a reason.
>
> I was looking at how this integrates into the perf syscall. It might
> be a little disappointing, but there is not much to do for the kernel.
> Ring buffer handling is implemented in hardware now, the user land
> sets up address ranges in the task's address space for buffers and
> thus may direct access it. We do not need an interrupt handler to fill
> the buffers. The pmu state is saved and restored during context
> switches in ways that has been proven for the fpu (xsave) and
> virtualization (VMCB like LWPCB). So, overall hardware is now doing
> the job for writing samples into a userland buffer and managing the
> pmu state. This reduces system overhead while profiling a lot
> especially because we don't have to walk through a software stack with
> each sample (this is where the 'lightweight' comes from).
>
> Of course this does not fit into current frameworks because of its
> difference in concept, but in general we want to see it in perf. So
> the main work of integration will leave here to tool and library
> implementation. But for this Linux must implement LWP context switch
> support. This is what Hans did.
>
> We also measured the system impact which comes from the additional
> rdmsrl() if the cpu supports LWP. There is no significant performance
> decrease in a worst case scenario. So, this is how we think it is the
> best to implement it and we need your feedback here.
>
> I think we should consider to apply patch 1/3 as it is unrelated to
> LWP and reworks and improves the code.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 10:46 ` Stephane Eranian
@ 2010-10-07 13:59   ` H. Peter Anvin
  2010-10-07 14:11     ` Stephane Eranian
  2010-10-07 15:12     ` Stephane Eranian
  0 siblings, 2 replies; 26+ messages in thread
From: H. Peter Anvin @ 2010-10-07 13:59 UTC (permalink / raw)
To: Stephane Eranian
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On 10/07/2010 03:46 AM, Stephane Eranian wrote:
>
> As for the patch itself, I am not an expert at xsave/xrstor, but it seems to
> me you could decouple LWP from FPU. I think Brian had the same comment.
> I suspect this can be done and it will certainly look cleaner.
>

Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
if you're using XSAVE and not honoring CR0.TS you have a major design
flaw.

Nothing that can't be dealt with, but still...

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 13:59 ` H. Peter Anvin
@ 2010-10-07 14:11   ` Stephane Eranian
  2010-10-07 14:20     ` Hans Rosenfeld
  2010-10-07 14:20     ` H. Peter Anvin
  1 sibling, 2 replies; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 14:11 UTC (permalink / raw)
To: H. Peter Anvin
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On Thu, Oct 7, 2010 at 3:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 10/07/2010 03:46 AM, Stephane Eranian wrote:
>>
>> As for the patch itself, I am not an expert at xsave/xrstor, but it seems to
>> me you could decouple LWP from FPU. I think Brian had the same comment.
>> I suspect this can be done and it will certainly look cleaner.
>>
>
> Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
> if you're using XSAVE and not honoring CR0.TS you have a major design flaw.
>
Is that to say that if you use LWP you will have to save/restore FPU
state even though you're not actually using it?

> Nothing that can't be dealt with, but still...
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:11 ` Stephane Eranian
@ 2010-10-07 14:20   ` Hans Rosenfeld
  2010-10-07 14:20   ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: Hans Rosenfeld @ 2010-10-07 14:20 UTC (permalink / raw)
To: Stephane Eranian
Cc: H. Peter Anvin, mingo@elte.hu, Richter, Robert, tglx@linutronix.de,
    linux-kernel@vger.kernel.org, Herrmann3, Andreas,
    peterz@infradead.org, fweisbec@gmail.com, rostedt@goodmis.org,
    acme@redhat.com, Peter Zijlstra, eranian@gmail.com

On Thu, Oct 07, 2010 at 10:11:42AM -0400, Stephane Eranian wrote:
> On Thu, Oct 7, 2010 at 3:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
> > if you're using XSAVE and not honoring CR0.TS you have a major design flaw.
> >
> Is that to say, that if you use LWP you will have to save/restore FPU state even
> though you're not actually using it?

No, you don't necessarily have to care about the FPU stuff. XSAVE can
save different states individually to the same buffer, and XRSTOR can
restore them individually. You just have to make sure CR0.TS is not set
when you execute the XSAVE or XRSTOR instruction.

That's what I did, and it worked :)


Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:11 ` Stephane Eranian
  2010-10-07 14:20   ` Hans Rosenfeld
@ 2010-10-07 14:20   ` H. Peter Anvin
  2010-10-07 14:25     ` Stephane Eranian
  1 sibling, 1 reply; 26+ messages in thread
From: H. Peter Anvin @ 2010-10-07 14:20 UTC (permalink / raw)
To: Stephane Eranian
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On 10/07/2010 07:11 AM, Stephane Eranian wrote:
>
> Is that to say, that if you use LWP you will have to save/restore FPU state even
> though you're not actually using it?
>

No, but you wouldn't be able to use lazy FPU.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:20 ` H. Peter Anvin
@ 2010-10-07 14:25   ` Stephane Eranian
  2010-10-07 14:47     ` H. Peter Anvin
  0 siblings, 1 reply; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 14:25 UTC (permalink / raw)
To: H. Peter Anvin
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On Thu, Oct 7, 2010 at 4:20 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> No, but you wouldn't be able to use lazy FPU.
>

You mean lazy restore, I am guessing here. Is that to say you cannot
figure out whether the FPU state in the CPU on ctxsw in is yours
anymore?
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:25 ` Stephane Eranian
@ 2010-10-07 14:47   ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2010-10-07 14:47 UTC (permalink / raw)
To: Stephane Eranian
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On 10/07/2010 07:25 AM, Stephane Eranian wrote:
>>
>> No, but you wouldn't be able to use lazy FPU.
>>
> You mean lazy restore, I am guessing here.
> Is that to say, you cannot figure out whether the FPU state in
> the CPU on ctxsw in is yours anymore?
>

Lazy save/restore, and more or less, yes. Really, XSAVE was designed to
be saved/restored as a unit, with unused state skipped.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 13:59 ` H. Peter Anvin
  2010-10-07 14:11   ` Stephane Eranian
@ 2010-10-07 15:12   ` Stephane Eranian
  1 sibling, 0 replies; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 15:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On Thu, Oct 7, 2010 at 3:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
> if you're using XSAVE and not honoring CR0.TS you have a major design flaw.
>

I looked at the LWP documentation and it says:

  "LWP does not support the “lazy” state save and restore that is
  possible for floating point and SSE state. It does not interact with
  the CR0.TS bit. Operating systems that support LWP must always do an
  XSAVE to preserve the old thread’s LWP context and an XRSTOR to set
  up the new LWP context. The OS can continue to do a lazy switch of
  the FP and SSE state by ensuring that the corresponding bits in
  EDX:EAX are clear when it executes the XSAVE and XRSTOR to handle the
  LWP context."

They imply that you can still do lazy FP/SSE save/restore even though
CR0.TS does not monitor LWP access. If LWP is used, the kernel needs to
call xsave/xrstor on ctxsw. It can do this lazily by checking
LWP_CBADDR.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 14:51 ` Hans Rosenfeld
  2010-10-05 15:34   ` Thomas Gleixner
@ 2010-10-05 18:43   ` Davidlohr Bueso
  2010-10-06 10:26     ` Andi Kleen
  1 sibling, 1 reply; 26+ messages in thread
From: Davidlohr Bueso @ 2010-10-05 18:43 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: Thomas Gleixner, robert.richter, LKML, H. Peter Anvin, Ingo Molnar,
    Herrmann3, Andreas, Peter Zijlstra

On Tue, 2010-10-05 at 16:51 +0200, Hans Rosenfeld wrote:
> [ adding Robert ]
>
> On Mon, Oct 04, 2010 at 06:13:46PM -0400, Thomas Gleixner wrote:
> > > LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> > > user mode processes to gather performance data about themselves with
> > > very low overhead. The specification can be found here:
> > > http://developer.amd.com/cpu/LWP/Pages/default.aspx
> >
> > All I can see there is marketing blurb. The spec pdf does not tell me
> > either how the end result of your patches will look like.
>
> Well, you should get a basic understanding of how LWP works from the
> spec. I'm pretty sure there is more than that blurb in there.
>
> What I sent _is_ the end result. Supporting LWP in the task switching
> code is the only thing necessary to use the LWP instructions in user
> space. And that is also what I'd like to get some comments on.
>
> > > This code adds basic support for LWP to the context switch code, which
> > > is the minimum needed to use LWP. Support for other LWP features like
> > > interrupts will be added later.
> >
> > Oh no. We are not going to merge that and give you a card blanche to
> > come up with another profiling/performance monitoring facility which
> > is completely disconnected to everything else.
>
> I think you got something wrong here, probably because I wasn't clear
> enough :)
>
> What I sent was just the basic kernel support, which means
> saving/restoring the LWP state in task switches. Technically no other
> kernel changes are necessary to use LWP in user space programs. No
> kernel framework of any kind is necessary to use LWP. Additionally,
> Robert is checking whether and how LWP support could be integrated
> into perf. This of course depends on having basic LWP task switching
> support in the kernel.

That would be the way to go, I don't think anyone wants yet another
profiling userspace tool.

> > Please provide either the full patch of this facility or at least a
> > reasonable explanation how it will look like when your code is
> > ready. What's the user space interface, how does it hook into perf,
> > etc ...
>
> The basic principle of LWP is that a user program reserves some memory
> for a LWP control block and a ring buffer, writes the ring buffer
> address and some setup bits into the control block, and executes the
> LLWPCB instruction to load the control block into the CPU. While the
> program runs it can periodically read the samples from the ring buffer
> and do something useful with them. The program being profiled needs to
> include support for LWP to do that. In case you can't or don't want to
> modify the program being profiled, a small LWP library would have to
> be written that would have to be LD_PRELOADed. I don't really know
> what it will look like in the end and how it will be integrated into
> perf. The tool/library developers will probably know best how to make
> use of LWP.
>
> The point is that LWP is supposed to be controlled and used from user
> space without any kernel interaction, provided that the kernel
> saves/restores the LWP context. That's also how I tested it, using a
> small test program that just uses the LWP instructions.
>
> Technically LWP is quite similar to the FPU, it even uses the same
> facilities for OS support. Just like FPU support, the kernel doesn't
> have to care at all about what user space is going to do with it. It
> just has to make sure state is saved and restored when necessary.
>
>
> Hans
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 18:43 ` Davidlohr Bueso
@ 2010-10-06 10:26   ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2010-10-06 10:26 UTC (permalink / raw)
To: dave
Cc: Hans Rosenfeld, Thomas Gleixner, robert.richter, LKML,
    H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

Davidlohr Bueso <dave@gnu.org> writes:
>
> That would be the way to go, I don't think anyone wants yet another
> profiling userspace tool.

Why not? Are you saying innovation in profilers is dead and we're all
only allowed to use a single one now?

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
Thread overview: 26+ messages
[not found] <1286212172-654419-1-git-send-email-hans.rosenfeld@amd.com>
2010-10-04 22:13 ` [RFC 0/3] Basic support for LWP Thomas Gleixner
2010-10-05 14:51 ` Hans Rosenfeld
2010-10-05 15:34 ` Thomas Gleixner
2010-10-05 18:27 ` Hans Rosenfeld
2010-10-05 18:30 ` Hans Rosenfeld
2010-10-05 18:30 ` [RFC 1/3] Cleanup xsave/xrstor support Hans Rosenfeld
2010-10-05 18:30 ` [RFC 2/3] Allow saving of individual states in fpu_xsave() Hans Rosenfeld
2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld
2010-10-06 11:12 ` Brian Gerst
2010-10-07 14:58 ` Hans Rosenfeld
2010-11-23 20:41 ` [RFC 0/2] FPU/xsave rework in preparation for LWP Hans Rosenfeld
2010-11-23 20:41 ` [RFC 1/2] x86, xsave: cleanup xsave/xrstor support Hans Rosenfeld
2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld
2010-11-25 0:36 ` Brian Gerst
2010-10-05 19:05 ` [RFC 0/3] Basic support for LWP Ingo Molnar
2010-10-06 7:35 ` Robert Richter
[not found] ` <AANLkTi=T0QmcKeZcgcR+GKk-9OwQUB_x8XdHiNuU7tE_@mail.gmail.com>
2010-10-07 10:46 ` Stephane Eranian
2010-10-07 13:59 ` H. Peter Anvin
2010-10-07 14:11 ` Stephane Eranian
2010-10-07 14:20 ` Hans Rosenfeld
2010-10-07 14:20 ` H. Peter Anvin
2010-10-07 14:25 ` Stephane Eranian
2010-10-07 14:47 ` H. Peter Anvin
2010-10-07 15:12 ` Stephane Eranian
2010-10-05 18:43 ` Davidlohr Bueso
2010-10-06 10:26 ` Andi Kleen