* Re: [RFC 0/3] Basic support for LWP
[not found] <1286212172-654419-1-git-send-email-hans.rosenfeld@amd.com>
@ 2010-10-04 22:13 ` Thomas Gleixner
  2010-10-05 14:51   ` Hans Rosenfeld
  0 siblings, 1 reply; 26+ messages in thread

From: Thomas Gleixner @ 2010-10-04 22:13 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: LKML, H. Peter Anvin, Ingo Molnar, Andreas Herrmann, Peter Zijlstra

On Mon, 4 Oct 2010, Hans Rosenfeld wrote:
> LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> user mode processes to gather performance data about themselves with
> very low overhead. The specification can be found here:
> http://developer.amd.com/cpu/LWP/Pages/default.aspx

All I can see there is marketing blurb. The spec PDF does not tell me
either what the end result of your patches will look like.

> This code adds basic support for LWP to the context switch code, which
> is the minimum needed to use LWP. Support for other LWP features like
> interrupts will be added later.

Oh no. We are not going to merge that and give you carte blanche to
come up with another profiling/performance monitoring facility which
is completely disconnected from everything else.

Please provide either the full patch set for this facility or at least a
reasonable explanation of how it will look when your code is ready.
What's the user space interface, how does it hook into perf, etc. ...

Thanks,

	tglx

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC 0/3] Basic support for LWP

From: Hans Rosenfeld @ 2010-10-05 14:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: robert.richter, LKML, H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

[ adding Robert ]

On Mon, Oct 04, 2010 at 06:13:46PM -0400, Thomas Gleixner wrote:
> > LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> > user mode processes to gather performance data about themselves with
> > very low overhead. The specification can be found here:
> > http://developer.amd.com/cpu/LWP/Pages/default.aspx
>
> All I can see there is marketing blurb. The spec PDF does not tell me
> either what the end result of your patches will look like.

Well, you should get a basic understanding of how LWP works from the
spec. I'm pretty sure there is more than that blurb in there.

What I sent _is_ the end result. Supporting LWP in the task switching
code is the only thing necessary to use the LWP instructions in user
space. And that is also what I'd like to get some comments on.

> > This code adds basic support for LWP to the context switch code, which
> > is the minimum needed to use LWP. Support for other LWP features like
> > interrupts will be added later.
>
> Oh no. We are not going to merge that and give you carte blanche to
> come up with another profiling/performance monitoring facility which
> is completely disconnected from everything else.

I think you got something wrong here, probably because I wasn't clear
enough :)

What I sent was just the basic kernel support, which means
saving/restoring the LWP state in task switches. Technically no other
kernel changes are necessary to use LWP in user space programs. No
kernel framework of any kind is necessary to use LWP.

Additionally, Robert is checking whether and how LWP support could be
integrated into perf. This of course depends on having basic LWP task
switching support in the kernel.

> Please provide either the full patch set for this facility or at least a
> reasonable explanation of how it will look when your code is ready.
> What's the user space interface, how does it hook into perf, etc. ...

The basic principle of LWP is that a user program reserves some memory
for an LWP control block and a ring buffer, writes the ring buffer
address and some setup bits into the control block, and executes the
LLWPCB instruction to load the control block into the CPU. While the
program runs, it can periodically read the samples from the ring buffer
and do something useful with them.

The program being profiled needs to include support for LWP to do that.
In case you can't or don't want to modify the program being profiled, a
small LWP library would have to be written and LD_PRELOADed.

I don't really know what it will look like in the end or how it will be
integrated into perf. The tool/library developers will probably know
best how to make use of LWP. The point is that LWP is supposed to be
controlled and used from user space without any kernel interaction,
provided that the kernel saves/restores the LWP context. That's also
how I tested it, using a small test program that just uses the LWP
instructions.

Technically LWP is quite similar to the FPU; it even uses the same
facilities for OS support. Just like with FPU support, the kernel
doesn't have to care at all about what user space is going to do with
it. It just has to make sure state is saved and restored when
necessary.

Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [RFC 0/3] Basic support for LWP

From: Thomas Gleixner @ 2010-10-05 15:34 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: robert.richter, LKML, H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

On Tue, 5 Oct 2010, Hans Rosenfeld wrote:
> [ adding Robert ]
>
> On Mon, Oct 04, 2010 at 06:13:46PM -0400, Thomas Gleixner wrote:
> > > LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> > > user mode processes to gather performance data about themselves with
> > > very low overhead. The specification can be found here:
> > > http://developer.amd.com/cpu/LWP/Pages/default.aspx
> >
> > All I can see there is marketing blurb. The spec PDF does not tell me
> > either what the end result of your patches will look like.
>
> Well, you should get a basic understanding of how LWP works from the
> spec. I'm pretty sure there is more than that blurb in there.
>
> What I sent _is_ the end result. Supporting LWP in the task switching
> code is the only thing necessary to use the LWP instructions in user
> space. And that is also what I'd like to get some comments on.

So it's the end result. Ok, then explain the following sentence:

> > > This code adds basic support for LWP to the context switch code, which
> > > is the minimum needed to use LWP. Support for other LWP features like
> > > interrupts will be added later.

If your patch _IS_ the end result then there are no other LWP
features, right? If there are, then your patch is _NOT_ the end result
and we really want to know what is coming.

Thanks,

	tglx
* Re: [RFC 0/3] Basic support for LWP

From: Hans Rosenfeld @ 2010-10-05 18:27 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Richter, Robert, LKML, H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

On Tue, Oct 05, 2010 at 11:34:11AM -0400, Thomas Gleixner wrote:
> So it's the end result. Ok, then explain the following sentence:
>
> > > > This code adds basic support for LWP to the context switch code, which
> > > > is the minimum needed to use LWP. Support for other LWP features like
> > > > interrupts will be added later.
>
> If your patch _IS_ the end result then there are no other LWP
> features, right?

Ok, you got me there. I should have checked again before I wrote this.
There is at least one other optional feature: support for a
thresholding interrupt, which the OS should relay to the user process
in one way or another. But that's not required to use the LWP
instructions. Maybe Robert will add it to perf some day, or maybe not.
It is completely optional, and how it is implemented at some point in
the future is completely irrelevant to the basic support of LWP.

> If there are, then your patch is _NOT_ the end result and we really
> want to know what is coming.

As far as I am concerned, it _is_ the end result. It implements the
necessary changes to make LWP usable, and that's it. Whatever optional
features Robert or anybody else may or may not implement, this in no
way affects the necessity of managing the LWP state for user
processes. Those are two completely unrelated issues.

Now, since this was obviously not made clear enough before: don't
merge this code anywhere yet. This is an RFC and nothing else. It
implements a basic change necessary to support a new instruction set,
which just so happens to be a profiling mechanism, too.

Please review the code and flame me for anything you don't like in it.

Hans

PS: The patches were not delivered to LKML previously, so I'll resend
them.

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* [RFC 0/3] Basic support for LWP

From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw)
To: linux-kernel; +Cc: Hans Rosenfeld

LWP (Light-Weight Profiling) is a new profiling mechanism that allows
user mode processes to gather performance data about themselves with
very low overhead. The specification can be found here:
http://developer.amd.com/cpu/LWP/Pages/default.aspx

These patches are against tip/master
531548b311024e1bec01bc412f5258e243a796b7.

This code adds basic support for LWP to the context switch code, which
is the minimum needed to use LWP. Support for other LWP features like
interrupts will be added later.

Hans Rosenfeld (3):
  Cleanup xsave/xrstor support.
  Allow saving of individual states in fpu_xsave().
  Save/restore LWP state in context switches.

 arch/x86/include/asm/i387.h       | 20 +++------
 arch/x86/include/asm/lwp.h        | 68 ++++++++++++++++++++++++++++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/include/asm/processor.h  | 12 +++++
 arch/x86/include/asm/sigcontext.h | 12 +++++
 arch/x86/include/asm/xsave.h      | 90 +++++++++++++++---------------------
 arch/x86/kernel/process_32.c      | 13 +++++
 arch/x86/kernel/process_64.c      | 13 +++++
 arch/x86/kernel/traps.c           |  2 +-
 arch/x86/kernel/xsave.c           |  6 +-
 10 files changed, 167 insertions(+), 70 deletions(-)
 create mode 100644 arch/x86/include/asm/lwp.h
* [RFC 1/3] Cleanup xsave/xrstor support. 2010-10-05 18:27 ` Hans Rosenfeld 2010-10-05 18:30 ` Hans Rosenfeld @ 2010-10-05 18:30 ` Hans Rosenfeld 2010-10-05 18:30 ` [RFC 2/3] Allow saving of individual states in fpu_xsave() Hans Rosenfeld ` (2 subsequent siblings) 4 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw) To: linux-kernel; +Cc: Hans Rosenfeld Removed the functions fpu_fxrstor_checking() and restore_fpu_checking() because they weren't doing anything. Removed redundant xsave/xrstor implementations. Since xsave/xrstor is not specific to the FPU, and also for consistency, all xsave/xrstor functions now take a xsave_struct argument. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 20 +++------- arch/x86/include/asm/xsave.h | 81 +++++++++++++++--------------------------- arch/x86/kernel/traps.c | 2 +- arch/x86/kernel/xsave.c | 4 +- 4 files changed, 38 insertions(+), 69 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 70626ed..de8593b 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -199,12 +199,14 @@ static inline void fpu_fxsave(struct fpu *fpu) static inline void fpu_save_init(struct fpu *fpu) { if (use_xsave()) { - fpu_xsave(fpu); + struct xsave_struct *xstate = &fpu->state->xsave; + + fpu_xsave(xstate); /* * xsave header may indicate the init state of the FP. 
*/ - if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP)) + if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP)) return; } else if (use_fxsr()) { fpu_fxsave(fpu); @@ -234,22 +236,12 @@ static inline void __save_init_fpu(struct task_struct *tsk) task_thread_info(tsk)->status &= ~TS_USEDFPU; } -static inline int fpu_fxrstor_checking(struct fpu *fpu) -{ - return fxrstor_checking(&fpu->state->fxsave); -} - static inline int fpu_restore_checking(struct fpu *fpu) { if (use_xsave()) - return fpu_xrstor_checking(fpu); + return xrstor_checking(&fpu->state->xsave, -1); else - return fpu_fxrstor_checking(fpu); -} - -static inline int restore_fpu_checking(struct task_struct *tsk) -{ - return fpu_restore_checking(&tsk->thread.fpu); + return fxrstor_checking(&fpu->state->fxsave); } /* diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index c6ce245..8bcbbce 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -42,10 +42,11 @@ extern int check_for_xstate(struct i387_fxsave_struct __user *buf, void __user *fpstate, struct _fpx_sw_bytes *sw); -static inline int fpu_xrstor_checking(struct fpu *fpu) +static inline int xrstor_checking(struct xsave_struct *fx, u64 mask) { - struct xsave_struct *fx = &fpu->state->xsave; int err; + u32 lmask = mask; + u32 hmask = mask >> 32; asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" "2:\n" @@ -55,13 +56,23 @@ static inline int fpu_xrstor_checking(struct fpu *fpu) ".previous\n" _ASM_EXTABLE(1b, 3b) : [err] "=r" (err) - : "D" (fx), "m" (*fx), "a" (-1), "d" (-1), "0" (0) + : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask), "0" (0) : "memory"); return err; } -static inline int xsave_user(struct xsave_struct __user *buf) +static inline void xrstor_state(struct xsave_struct *fx, u64 mask) +{ + u32 lmask = mask; + u32 hmask = mask >> 32; + + asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" + : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) + : "memory"); +} + +static inline int 
xsave_checking(struct xsave_struct __user *buf) { int err; @@ -74,58 +85,24 @@ static inline int xsave_user(struct xsave_struct __user *buf) if (unlikely(err)) return -EFAULT; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (buf), "a" (-1), "d" (-1), "0" (0) - : "memory"); + asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl $-1,%[err]\n" + " jmp 2b\n" + ".previous\n" + _ASM_EXTABLE(1b,3b) + : [err] "=r" (err) + : "D" (buf), "a" (-1), "d" (-1), "0" (0) + : "memory"); + if (unlikely(err) && __clear_user(buf, xstate_size)) err = -EFAULT; - /* No need to clear here because the caller clears USED_MATH */ - return err; -} - -static inline int xrestore_user(struct xsave_struct __user *buf, u64 mask) -{ - int err; - struct xsave_struct *xstate = ((__force struct xsave_struct *)buf); - u32 lmask = mask; - u32 hmask = mask >> 32; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (xstate), "a" (lmask), "d" (hmask), "0" (0) - : "memory"); /* memory required? 
*/ + /* No need to clear here because the caller clears USED_MATH */ return err; } -static inline void xrstor_state(struct xsave_struct *fx, u64 mask) -{ - u32 lmask = mask; - u32 hmask = mask >> 32; - - asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" - : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) - : "memory"); -} - static inline void xsave_state(struct xsave_struct *fx, u64 mask) { u32 lmask = mask; @@ -136,7 +113,7 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask) : "memory"); } -static inline void fpu_xsave(struct fpu *fpu) +static inline void fpu_xsave(struct xsave_struct *fx) { /* This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. */ @@ -144,7 +121,7 @@ static inline void fpu_xsave(struct fpu *fpu) ".byte " REX_PREFIX "0x0f,0xae,0x27", ".byte " REX_PREFIX "0x0f,0xae,0x37", X86_FEATURE_XSAVEOPT, - [fx] "D" (&fpu->state->xsave), "a" (-1), "d" (-1) : + [fx] "D" (fx), "a" (-1), "d" (-1) : "memory"); } #endif diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index cb838ca..966793d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -731,7 +731,7 @@ void __math_state_restore(void) /* * Paranoid restore. send a SIGSEGV if we fail to restore the state. */ - if (unlikely(restore_fpu_checking(tsk))) { + if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) { stts(); force_sig(SIGSEGV, tsk); return; diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 9c253bd..5eb15d4 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -170,7 +170,7 @@ int save_i387_xstate(void __user *buf) if (task_thread_info(tsk)->status & TS_USEDFPU) { if (use_xsave()) - err = xsave_user(buf); + err = xsave_checking(buf); else err = fxsave_user(buf); @@ -247,7 +247,7 @@ static int restore_user_xstate(void __user *buf) /* * restore the state passed by the user. 
*/ - err = xrestore_user(buf, mask); + err = xrstor_checking((__force struct xsave_struct *)buf, mask); if (err) return err; -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RFC 2/3] Allow saving of individual states in fpu_xsave(). 2010-10-05 18:27 ` Hans Rosenfeld 2010-10-05 18:30 ` Hans Rosenfeld 2010-10-05 18:30 ` [RFC 1/3] Cleanup xsave/xrstor support Hans Rosenfeld @ 2010-10-05 18:30 ` Hans Rosenfeld 2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld 2010-10-05 19:05 ` [RFC 0/3] Basic support for LWP Ingo Molnar 4 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw) To: linux-kernel; +Cc: Hans Rosenfeld Add a feature mask argument to fpu_xsave(). Introduce XCNTXT_DEFAULT to specify which feature states are to be saved/restored by default at context switches. Rename XCNTXT_MASK to XCNTXT_SUPPORTED to make clear what it really is. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 4 ++-- arch/x86/include/asm/xsave.h | 14 +++++++++++--- arch/x86/kernel/xsave.c | 2 +- 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index de8593b..33600d3 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -201,7 +201,7 @@ static inline void fpu_save_init(struct fpu *fpu) if (use_xsave()) { struct xsave_struct *xstate = &fpu->state->xsave; - fpu_xsave(xstate); + fpu_xsave(xstate, XCNTXT_DEFAULT); /* * xsave header may indicate the init state of the FP. @@ -239,7 +239,7 @@ static inline void __save_init_fpu(struct task_struct *tsk) static inline int fpu_restore_checking(struct fpu *fpu) { if (use_xsave()) - return xrstor_checking(&fpu->state->xsave, -1); + return xrstor_checking(&fpu->state->xsave, XCNTXT_DEFAULT); else return fxrstor_checking(&fpu->state->fxsave); } diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 8bcbbce..8820b57 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -23,7 +23,12 @@ /* * These are the features that the OS can handle currently. 
*/ -#define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) + +/* + * These are the features that the OS saves/restores by default. + */ +#define XCNTXT_DEFAULT (-1) #ifdef CONFIG_X86_64 #define REX_PREFIX "0x48, " @@ -113,15 +118,18 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask) : "memory"); } -static inline void fpu_xsave(struct xsave_struct *fx) +static inline void fpu_xsave(struct xsave_struct *fx, u64 mask) { + u32 lmask = mask; + u32 hmask = mask >> 32; + /* This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. */ alternative_input( ".byte " REX_PREFIX "0x0f,0xae,0x27", ".byte " REX_PREFIX "0x0f,0xae,0x37", X86_FEATURE_XSAVEOPT, - [fx] "D" (fx), "a" (-1), "d" (-1) : + [fx] "D" (fx), "a" (lmask), "d" (hmask) : "memory"); } #endif diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 5eb15d4..11c3445 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -434,7 +434,7 @@ static void __init xstate_enable_boot_cpu(void) /* * Support only the state known to OS. */ - pcntxt_mask = pcntxt_mask & XCNTXT_MASK; + pcntxt_mask = pcntxt_mask & XCNTXT_SUPPORTED; xstate_enable(); -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RFC 3/3] Save/restore LWP state in context switches.

From: Hans Rosenfeld @ 2010-10-05 18:30 UTC (permalink / raw)
To: linux-kernel; +Cc: Hans Rosenfeld

LWP (Light-Weight Profiling) is a new per-thread profiling mechanism
that can be enabled by any thread at any time if the OS claims to
support it (by setting a bit in XCR0). A thread's LWP state
(configuration & unsaved collected data) is supposed to be saved and
restored with xsave and xrstor by the OS.

Unfortunately, LWP does not support any kind of lazy switching, nor does
it use the TS bit in CR0. Since any thread can enable LWP at any time
without the kernel knowing, the context switch code is supposed to
save/restore LWP context unconditionally. This would require a valid
xsave state area for all threads, whether or not they use any FPU or LWP
functionality. It would also make the already complex lazy switching
code more complicated.

To avoid this memory overhead, especially on systems not supporting
LWP, and also to avoid more intrusive changes to the code that handles
FPU state, this patch handles LWP separately from the FPU. Only if a
system supports LWP does the context switch code check whether LWP has
been used by the thread that is being taken off the CPU, by reading the
LWP_CBADDR MSR, which is nonzero if LWP has been used by the thread.
Only in that case is the LWP state saved to the common xsave area in the
thread's FPU context. This means, of course, that an FPU context has to
be allocated and initialized when a thread uses LWP before it has used
the FPU.

Similarly, restoring the LWP state is only done when an FPU context
exists and the LWP bit in the xstate header is set.
To make things a little more complicated, xsave and xrstor _do_ use the TS bit and trap when it is set. To avoid unwanted traps, the TS bit has to be cleared before and restored after doing xsave or xrstor for LWP. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/lwp.h | 68 +++++++++++++++++++++++++++++++++++++ arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/processor.h | 12 ++++++ arch/x86/include/asm/sigcontext.h | 12 ++++++ arch/x86/include/asm/xsave.h | 5 ++- arch/x86/kernel/process_32.c | 13 +++++++ arch/x86/kernel/process_64.c | 13 +++++++ 7 files changed, 122 insertions(+), 2 deletions(-) create mode 100644 arch/x86/include/asm/lwp.h diff --git a/arch/x86/include/asm/lwp.h b/arch/x86/include/asm/lwp.h new file mode 100644 index 0000000..6a8e348 --- /dev/null +++ b/arch/x86/include/asm/lwp.h @@ -0,0 +1,68 @@ +/* + * Copyright (C) 2010 Advanced Micro Devices, Inc. + * Contributed by Hans Rosenfeld <hans.rosenfeld@amd.com> + */ + +#ifndef _ASM_X86_LWP_H +#define _ASM_X86_LWP_H + +#ifndef __ASSEMBLY__ + +#include <asm/i387.h> +#include <asm/xsave.h> +#include <asm/processor.h> +#include <asm/msr-index.h> +#include <asm/system.h> + +/* + * Save LWP state if it has been used by the task. Initializes the FPU + * state if necessary. + */ +static inline void save_lwp_state(struct task_struct *tsk) +{ + u64 lwpcb; + bool ts; + + /* Has LWP been used? */ + rdmsrl(MSR_AMD64_LWP_CBADDR, lwpcb); + if (!lwpcb) + return; + + /* set up FPU state if necessary */ + if (!fpu_allocated(&tsk->thread.fpu)) + init_fpu(tsk); + + /* Xsave traps when TS is set. Disable it temporarily. */ + ts = read_cr0() & X86_CR0_TS; + clts(); + fpu_xsave(&tsk->thread.fpu.state->xsave, XSTATE_LWP); + if (ts) + stts(); + + /* Disable LWP for next thread. */ + wrmsrl(MSR_AMD64_LWP_CBADDR, 0); +} + +/* + * Restore LWP state if present in the tasks xsave area. 
+ */ +static inline void restore_lwp_state(struct task_struct *tsk) +{ + bool ts; + + /* Don't do anything if there is no LWP state for this thread */ + if (!fpu_allocated(&tsk->thread.fpu) || + !(tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv & XSTATE_LWP)) + return; + + /* Xrstor traps when TS is set. Disable it temporarily. */ + ts = read_cr0() & X86_CR0_TS; + clts(); + xrstor_state(&tsk->thread.fpu.state->xsave, XSTATE_LWP); + if (ts) + stts(); +} + +#endif /* __ASSEMBLY__ */ + +#endif /* _ASM_X86_LWP_H */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 986f779..9183ecc 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -121,6 +121,7 @@ #define MSR_AMD64_IBSDCLINAD 0xc0011038 #define MSR_AMD64_IBSDCPHYSAD 0xc0011039 #define MSR_AMD64_IBSCTL 0xc001103a +#define MSR_AMD64_LWP_CBADDR 0xc0000106 /* Fam 10h MSRs */ #define MSR_FAM10H_MMIO_CONF_BASE 0xc0010058 diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index cae9c3c..cbb5da9 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -358,6 +358,17 @@ struct ymmh_struct { u32 ymmh_space[64]; }; +struct lwp_struct { + u64 lwpcb_addr; + u32 flags; + u32 buf_head_offset; + u64 buf_base; + u32 buf_size; + u32 filters; + u64 saved_event_record[4]; + u32 event_counter[16]; +}; + struct xsave_hdr_struct { u64 xstate_bv; u64 reserved1[2]; @@ -368,6 +379,7 @@ struct xsave_struct { struct i387_fxsave_struct i387; struct xsave_hdr_struct xsave_hdr; struct ymmh_struct ymmh; + struct lwp_struct lwp; /* new processor state extensions will go here */ } __attribute__ ((packed, aligned (64))); diff --git a/arch/x86/include/asm/sigcontext.h b/arch/x86/include/asm/sigcontext.h index 04459d2..0a58b82 100644 --- a/arch/x86/include/asm/sigcontext.h +++ b/arch/x86/include/asm/sigcontext.h @@ -274,6 +274,17 @@ struct _ymmh_state { __u32 ymmh_space[64]; }; +struct _lwp_state { + __u64 lwpcb_addr; + 
__u32 flags; + __u32 buf_head_offset; + __u64 buf_base; + __u32 buf_size; + __u32 filters; + __u64 saved_event_record[4]; + __u32 event_counter[16]; +}; + /* * Extended state pointed by the fpstate pointer in the sigcontext. * In addition to the fpstate, information encoded in the xstate_hdr @@ -284,6 +295,7 @@ struct _xstate { struct _fpstate fpstate; struct _xsave_hdr xstate_hdr; struct _ymmh_state ymmh; + struct _lwp_state lwp; /* new processor state extensions go here */ }; diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 8820b57..ceb0b85 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -9,6 +9,7 @@ #define XSTATE_FP 0x1 #define XSTATE_SSE 0x2 #define XSTATE_YMM 0x4 +#define XSTATE_LWP (1ULL << 62) #define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE) @@ -23,12 +24,12 @@ /* * These are the features that the OS can handle currently. */ -#define XCNTXT_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_SUPPORTED (XSTATE_FP | XSTATE_SSE | XSTATE_YMM | XSTATE_LWP) /* * These are the features that the OS saves/restores by default. */ -#define XCNTXT_DEFAULT (-1) +#define XCNTXT_DEFAULT (~XSTATE_LWP) #ifdef CONFIG_X86_64 #define REX_PREFIX "0x48, " diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 96586c3..69ced7d 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -44,6 +44,7 @@ #include <asm/ldt.h> #include <asm/processor.h> #include <asm/i387.h> +#include <asm/lwp.h> #include <asm/desc.h> #ifdef CONFIG_MATH_EMULATION #include <asm/math_emu.h> @@ -311,6 +312,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) __unlazy_fpu(prev_p); + /* + * Save LWP state if LWP is available. 
+ */ + if (static_cpu_has(X86_FEATURE_LWP)) + save_lwp_state(prev_p); + /* we're going to use this soon, after a few expensive things */ if (preload_fpu) prefetch(next->fpu.state); @@ -353,6 +360,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT)) __switch_to_xtra(prev_p, next_p, tss); + /* + * Restore LWP state if available. + */ + if (static_cpu_has(X86_FEATURE_LWP)) + restore_lwp_state(next_p); + /* If we're going to preload the fpu context, make sure clts is run while we're batching the cpu state updates. */ if (preload_fpu) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index b3d7a3a..a70c6cc 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -42,6 +42,7 @@ #include <asm/system.h> #include <asm/processor.h> #include <asm/i387.h> +#include <asm/lwp.h> #include <asm/mmu_context.h> #include <asm/prctl.h> #include <asm/desc.h> @@ -423,6 +424,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) load_TLS(next, cpu); + /* + * Save LWP state if LWP is available. + */ + if (static_cpu_has(X86_FEATURE_LWP)) + save_lwp_state(prev_p); + /* Must be after DS reload */ __unlazy_fpu(prev_p); @@ -489,6 +496,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) __switch_to_xtra(prev_p, next_p, tss); /* + * Restore LWP state if available. + */ + if (static_cpu_has(X86_FEATURE_LWP)) + restore_lwp_state(next_p); + + /* * Preload the FPU context, now that we've determined that the * task is likely to be using it. */ -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC 3/3] Save/restore LWP state in context switches. 2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld @ 2010-10-06 11:12 ` Brian Gerst 2010-10-07 14:58 ` Hans Rosenfeld ` (3 more replies) 0 siblings, 4 replies; 26+ messages in thread From: Brian Gerst @ 2010-10-06 11:12 UTC (permalink / raw) To: Hans Rosenfeld; +Cc: linux-kernel On Tue, Oct 5, 2010 at 2:30 PM, Hans Rosenfeld <hans.rosenfeld@amd.com> wrote: > LWP (Light-Weight Profiling) is a new per-thread profiling mechanism > that can be enabled by any thread at any time if the OS claims to > support it (by setting a bit in XCR0). A threads LWP state > (configuration & unsaved collected data) is supposed to be saved and > restored with xsave and xrstor by the OS. > > Unfortunately, LWP does not support any kind of lazy switching, nor does > it use the TS bit in CR0. Since any thread can enable LWP at any time > without the kernel knowing, the context switch code is supposed to > save/restore LWP context unconditionally. This would require a valid > xsave state area for all threads, whether or not they use any FPU or LWP > functionality. It would also make the already complex lazy switching > code more complicated. > > To avoid this memory overhead, especially for systems not supporting > LWP, and also to avoid more intrusive changes to the code that handles > FPU state, this patch handles LWP separately from the FPU. Only if a > system supports LWP, the context switch code checks whether LWP has been > used by the thread that is being taken off the CPU by reading the > LWP_CBADDR MSR, which is nonzero if LWP has been used by the thread. > Only in that case the LWP state is saved to the common xsave area in the > threads FPU context. This means, of course, that an FPU context has to > be allocated and initialized when a thread first uses LWP before using > the FPU. 
> > Similarly, restoring the LWP state is only done when an FPU context > exists and the LWP bit in the xstate header is set. > > To make things a little more complicated, xsave and xrstor _do_ use the > TS bit and trap when it is set. To avoid unwanted traps, the TS bit has > to be cleared before and restored after doing xsave or xrstor for LWP. > > Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> I would prefer to see the xsave code refactored so that you would only need one xsave/xrstor call per context switch. We are currently treating xsave as an extension to the FPU state. But it would be better to treat FPU as a subset of the extended state. That way more state can be added without touching the FPU code. -- Brian Gerst ^ permalink raw reply [flat|nested] 26+ messages in thread
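The conditional-save scheme quoted above can be modeled in plain userspace C. This is only an illustrative sketch of the idea, not the patch's kernel code: the struct, field names, and `save_lwp_state_model()` are all made up for the example, and the MSR read is simulated by a plain field access.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the commit message above: on a context switch, LWP state is
 * saved only when the LWP_CBADDR MSR is nonzero, i.e. only when the
 * outgoing thread actually enabled LWP. All names here are illustrative
 * stand-ins, not the kernel's. */

struct model_task {
    uint64_t lwp_cbaddr;      /* stands in for reading the LWP_CBADDR MSR */
    int      lwp_state_saved; /* would be the LWP area of the xsave buffer */
};

static void save_lwp_state_model(struct model_task *prev)
{
    /* One cheap MSR read decides whether the (expensive) save of the
     * LWP state area is needed at all; threads that never enabled LWP
     * pay only for this check. */
    if (prev->lwp_cbaddr != 0)
        prev->lwp_state_saved = 1; /* would be xsave of the LWP state here */
}
```

The point of the check is visible in the model: a thread with a zero (modeled) LWP_CBADDR goes through the context switch without any LWP save at all.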
* Re: [RFC 3/3] Save/restore LWP state in context switches. 2010-10-06 11:12 ` Brian Gerst @ 2010-10-07 14:58 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 0/2] FPU/xsave rework in preparation for LWP Hans Rosenfeld ` (2 subsequent siblings) 3 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-10-07 14:58 UTC (permalink / raw) To: Brian Gerst; +Cc: linux-kernel@vger.kernel.org On Wed, Oct 06, 2010 at 07:12:11AM -0400, Brian Gerst wrote: > On Tue, Oct 5, 2010 at 2:30 PM, Hans Rosenfeld <hans.rosenfeld@amd.com> wrote: > > LWP (Light-Weight Profiling) is a new per-thread profiling mechanism > > that can be enabled by any thread at any time if the OS claims to > > support it (by setting a bit in XCR0). A thread's LWP state > > (configuration & unsaved collected data) is supposed to be saved and > > restored with xsave and xrstor by the OS. > > > > Unfortunately, LWP does not support any kind of lazy switching, nor does > > it use the TS bit in CR0. Since any thread can enable LWP at any time > > without the kernel knowing, the context switch code is supposed to > > save/restore LWP context unconditionally. This would require a valid > > xsave state area for all threads, whether or not they use any FPU or LWP > > functionality. It would also make the already complex lazy switching > > code more complicated. > > > > To avoid this memory overhead, especially for systems not supporting > > LWP, and also to avoid more intrusive changes to the code that handles > > FPU state, this patch handles LWP separately from the FPU. Only if a > > system supports LWP, the context switch code checks whether LWP has been > > used by the thread that is being taken off the CPU by reading the > > LWP_CBADDR MSR, which is nonzero if LWP has been used by the thread. > > Only in that case the LWP state is saved to the common xsave area in the > > thread's FPU context. 
This means, of course, that an FPU context has to > > be allocated and initialized when a thread first uses LWP before using > > the FPU. > > > > Similarly, restoring the LWP state is only done when an FPU context > > exists and the LWP bit in the xstate header is set. > > > > To make things a little more complicated, xsave and xrstor _do_ use the > > TS bit and trap when it is set. To avoid unwanted traps, the TS bit has > > to be cleared before and restored after doing xsave or xrstor for LWP. > > > > Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> > > I would prefer to see the xsave code refactored so that you would only > need one xsave/xrstor call per context switch. We are currently > treating xsave as an extension to the FPU state. But it would be > better to treat FPU as a subset of the extended state. That way more > state can be added without touching the FPU code. IIRC Robert did some changes to the xsave stuff to make it more independent from the FPU stuff. The first of my patches also goes in that direction, but that's not the main point of it. The main point is to remove useless or duplicated code and make it easier to understand :) I have spent some more time meditating about that code and thinking about using only one xsave/xrstor call per context switch. It is possible, but it would at least require pre-allocating the xsave state area for each thread. The xsave part would be fairly simple; the FPU state is saved non-lazily anyway if it has been used. Xsave would probably save the state more often than necessary, as it can't know whether the state really changed since the last time it was saved. This could be avoided by checking this before doing xsave, and then doing xsave only for the parts that really need saving (like LWP). Xrstor would need more work because of lazy FPU switching. You would always have to check whether you want to preload the FPU state or not, and do xrstor for all or only the necessary states. 
In any case, one would have to take care of CR0.TS when doing all that. So, as far as I understand it, to make this simpler and cleaner, at least the preallocation of xsave or fpu state buffers is necessary. To really clean it up, dropping lazy switching would probably help. But I'm certainly not proposing to do that :) Hans -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown ^ permalink raw reply [flat|nested] 26+ messages in thread
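The CR0.TS interaction discussed in this subthread can be sketched as a small userspace model. Everything here is simulated — `clts`/`stts` become plain flag writes and the #NM trap becomes a counter — so this only illustrates the control flow, not the real privileged instructions:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the CR0.TS handling described above: xsave and xrstor trap
 * (#NM) when CR0.TS is set, so a caller that may run with TS set has to
 * clear it first and put it back afterwards. */

static bool cr0_ts;   /* simulated CR0.TS bit */
static int  nm_traps; /* count of simulated #NM traps */

static void xsave_model(void)
{
    if (cr0_ts)
        nm_traps++; /* real hardware would raise #NM here */
}

static void lwp_xsave_with_ts_handling(void)
{
    bool ts = cr0_ts;
    cr0_ts = false;   /* clts: allow xsave to run trap-free */
    xsave_model();
    cr0_ts = ts;      /* put TS back so lazy FPU switching still works */
}
```

Calling the wrapper with TS set saves the state without a trap and leaves TS exactly as it was, which is the behavior the patch's save/restore paths need.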
* [RFC 0/2] FPU/xsave rework in preparation for LWP 2010-10-06 11:12 ` Brian Gerst 2010-10-07 14:58 ` Hans Rosenfeld @ 2010-11-23 20:41 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 1/2] x86, xsave: cleanup xsave/xrstor support Hans Rosenfeld 2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld 3 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-11-23 20:41 UTC (permalink / raw) To: linux-kernel Cc: x86, brgerst, robert.richter, andreas.herrmann3, Hans Rosenfeld Hi, please review and comment on these patches. They are based on tip/master a17d38493c3dd2253828b107b5d405dd7cdcbf43. The first one is just a generic cleanup of the xsave code that I already sent a couple of weeks ago. When my previous LWP patchset was reviewed, I was told that it would be much better to use only a single call to xsave and xrstor in the context switch. This is basically what I try to achieve with the second patch: base all "extended state" management on xsave/xrstor (with fallback to fxsave/fxrstor or even fsave/frstor) and keep track of which extended states a thread actually used. The actual LWP support is not included here; I will work on this next. As a side effect of this rework, there will be a single simple interface that can be used by other parts of the kernel that need to save and restore extended states (thinking of KVM here). This should also improve the general readability and maintainability of this code, although I would agree that this isn't visible yet :) There is a lot of room for cleaning things up here, which I will do along with the LWP support. The LWP changes will require that each thread will always have an xstate area, but I don't expect this to be much of a problem. While this would add an overhead of around 1kb for each thread that wouldn't need an xstate area otherwise, I expect that there are only very few such threads on any reasonably modern system. I will have to check that when the LWP support is done. 
This would allow for another big cleanup by removing all the complexity of supporting threads that don't have an xstate area. Hans Rosenfeld (2): x86, xsave: cleanup xsave/xrstor support x86, xsave: rework xsave support arch/x86/include/asm/i387.h | 111 +++++++---------------- arch/x86/include/asm/thread_info.h | 2 + arch/x86/include/asm/xsave.h | 88 ++++++++----------- arch/x86/kernel/i387.c | 11 ++- arch/x86/kernel/process_64.c | 24 +++--- arch/x86/kernel/traps.c | 31 +------ arch/x86/kernel/xsave.c | 174 +++++++++++++++++++----------------- arch/x86/kvm/x86.c | 7 +- 8 files changed, 188 insertions(+), 260 deletions(-) ^ permalink raw reply [flat|nested] 26+ messages in thread
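The per-thread state tracking the cover letter describes — a mask of live extended states plus the existing recent-use counter — can be modeled in userspace. The constants and the threshold of 5 timeslices match what the series uses; the struct and function names are illustrative stand-ins, and the real save/restore work is elided:

```c
#include <assert.h>
#include <stdint.h>

/* xstate component bits as used by the series (bit 0 x87, bit 1 SSE,
 * bit 2 YMM); XCNTXT_LAZY is the set of lazily-switched states. */
#define XSTATE_FP   0x1ULL
#define XSTATE_SSE  0x2ULL
#define XSTATE_YMM  0x4ULL
#define XCNTXT_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)

struct model_thread {
    uint64_t xstate_mask; /* which xstates are live for this thread */
    unsigned fpu_counter; /* recent-use heuristic, as in the kernel */
};

/* Mirrors the save path: after saving, drop the lazy states from the
 * mask unless the thread used the FPU in the last 5 timeslices (in
 * which case the switch code will preload them again). */
static void save_xstates_model(struct model_thread *t)
{
    if (t->fpu_counter < 5)
        t->xstate_mask &= ~XCNTXT_LAZY;
}

/* Mirrors the restore path: mark the states live and bump the counter. */
static void restore_xstates_model(struct model_thread *t, uint64_t mask)
{
    t->xstate_mask |= mask;
    t->fpu_counter++;
}
```

A future "non-lazy" state such as LWP would simply keep its bit set in `xstate_mask` permanently, which is why no extra per-switch bookkeeping would be needed for it.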
* [RFC 1/2] x86, xsave: cleanup xsave/xrstor support 2010-10-06 11:12 ` Brian Gerst 2010-10-07 14:58 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 0/2] FPU/xsave rework in preparation for LWP Hans Rosenfeld @ 2010-11-23 20:41 ` Hans Rosenfeld 2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld 3 siblings, 0 replies; 26+ messages in thread From: Hans Rosenfeld @ 2010-11-23 20:41 UTC (permalink / raw) To: linux-kernel Cc: x86, brgerst, robert.richter, andreas.herrmann3, Hans Rosenfeld Removed the functions fpu_fxrstor_checking() and restore_fpu_checking() because they weren't doing anything. Removed redundant xsave/xrstor implementations. Since xsave/xrstor is not specific to the FPU, and also for consistency, all xsave/xrstor functions now take a xsave_struct argument. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 20 +++------- arch/x86/include/asm/xsave.h | 81 +++++++++++++++--------------------------- arch/x86/kernel/traps.c | 2 +- arch/x86/kernel/xsave.c | 4 +- 4 files changed, 38 insertions(+), 69 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index ef32890..d908383 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -227,12 +227,14 @@ static inline void fpu_fxsave(struct fpu *fpu) static inline void fpu_save_init(struct fpu *fpu) { if (use_xsave()) { - fpu_xsave(fpu); + struct xsave_struct *xstate = &fpu->state->xsave; + + fpu_xsave(xstate); /* * xsave header may indicate the init state of the FP. 
*/ - if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP)) + if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP)) return; } else if (use_fxsr()) { fpu_fxsave(fpu); @@ -262,22 +264,12 @@ static inline void __save_init_fpu(struct task_struct *tsk) task_thread_info(tsk)->status &= ~TS_USEDFPU; } -static inline int fpu_fxrstor_checking(struct fpu *fpu) -{ - return fxrstor_checking(&fpu->state->fxsave); -} - static inline int fpu_restore_checking(struct fpu *fpu) { if (use_xsave()) - return fpu_xrstor_checking(fpu); + return xrstor_checking(&fpu->state->xsave, -1); else - return fpu_fxrstor_checking(fpu); -} - -static inline int restore_fpu_checking(struct task_struct *tsk) -{ - return fpu_restore_checking(&tsk->thread.fpu); + return fxrstor_checking(&fpu->state->fxsave); } /* diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index c6ce245..8bcbbce 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -42,10 +42,11 @@ extern int check_for_xstate(struct i387_fxsave_struct __user *buf, void __user *fpstate, struct _fpx_sw_bytes *sw); -static inline int fpu_xrstor_checking(struct fpu *fpu) +static inline int xrstor_checking(struct xsave_struct *fx, u64 mask) { - struct xsave_struct *fx = &fpu->state->xsave; int err; + u32 lmask = mask; + u32 hmask = mask >> 32; asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" "2:\n" @@ -55,13 +56,23 @@ static inline int fpu_xrstor_checking(struct fpu *fpu) ".previous\n" _ASM_EXTABLE(1b, 3b) : [err] "=r" (err) - : "D" (fx), "m" (*fx), "a" (-1), "d" (-1), "0" (0) + : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask), "0" (0) : "memory"); return err; } -static inline int xsave_user(struct xsave_struct __user *buf) +static inline void xrstor_state(struct xsave_struct *fx, u64 mask) +{ + u32 lmask = mask; + u32 hmask = mask >> 32; + + asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" + : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) + : "memory"); +} + +static inline int 
xsave_checking(struct xsave_struct __user *buf) { int err; @@ -74,58 +85,24 @@ static inline int xsave_user(struct xsave_struct __user *buf) if (unlikely(err)) return -EFAULT; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (buf), "a" (-1), "d" (-1), "0" (0) - : "memory"); + asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl $-1,%[err]\n" + " jmp 2b\n" + ".previous\n" + _ASM_EXTABLE(1b,3b) + : [err] "=r" (err) + : "D" (buf), "a" (-1), "d" (-1), "0" (0) + : "memory"); + if (unlikely(err) && __clear_user(buf, xstate_size)) err = -EFAULT; - /* No need to clear here because the caller clears USED_MATH */ - return err; -} - -static inline int xrestore_user(struct xsave_struct __user *buf, u64 mask) -{ - int err; - struct xsave_struct *xstate = ((__force struct xsave_struct *)buf); - u32 lmask = mask; - u32 hmask = mask >> 32; - __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl $-1,%[err]\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - _ASM_ALIGN "\n" - _ASM_PTR "1b,3b\n" - ".previous" - : [err] "=r" (err) - : "D" (xstate), "a" (lmask), "d" (hmask), "0" (0) - : "memory"); /* memory required? 
*/ + /* No need to clear here because the caller clears USED_MATH */ return err; } -static inline void xrstor_state(struct xsave_struct *fx, u64 mask) -{ - u32 lmask = mask; - u32 hmask = mask >> 32; - - asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t" - : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask) - : "memory"); -} - static inline void xsave_state(struct xsave_struct *fx, u64 mask) { u32 lmask = mask; @@ -136,7 +113,7 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask) : "memory"); } -static inline void fpu_xsave(struct fpu *fpu) +static inline void fpu_xsave(struct xsave_struct *fx) { /* This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. */ @@ -144,7 +121,7 @@ static inline void fpu_xsave(struct fpu *fpu) ".byte " REX_PREFIX "0x0f,0xae,0x27", ".byte " REX_PREFIX "0x0f,0xae,0x37", X86_FEATURE_XSAVEOPT, - [fx] "D" (&fpu->state->xsave), "a" (-1), "d" (-1) : + [fx] "D" (fx), "a" (-1), "d" (-1) : "memory"); } #endif diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index f02c179..1b0d148 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -720,7 +720,7 @@ void __math_state_restore(void) /* * Paranoid restore. send a SIGSEGV if we fail to restore the state. */ - if (unlikely(restore_fpu_checking(tsk))) { + if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) { stts(); force_sig(SIGSEGV, tsk); return; diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 9c253bd..5eb15d4 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -170,7 +170,7 @@ int save_i387_xstate(void __user *buf) if (task_thread_info(tsk)->status & TS_USEDFPU) { if (use_xsave()) - err = xsave_user(buf); + err = xsave_checking(buf); else err = fxsave_user(buf); @@ -247,7 +247,7 @@ static int restore_user_xstate(void __user *buf) /* * restore the state passed by the user. 
*/ - err = xrestore_user(buf, mask); + err = xrstor_checking((__force struct xsave_struct *)buf, mask); if (err) return err; -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
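A detail worth noting in the cleaned-up `xrstor_checking()`/`xrstor_state()` above: xsave and xrstor take their 64-bit component bitmask in the edx:eax register pair, which is why every call site splits the `u64 mask` into `lmask`/`hmask`. A minimal demonstration of that split (the helper name is made up for the example):

```c
#include <assert.h>
#include <stdint.h>

/* xsave/xrstor receive the 64-bit component mask in edx:eax, so the
 * kernel splits the u64 exactly like the patch does:
 *     u32 lmask = mask;        ->  eax
 *     u32 hmask = mask >> 32;  ->  edx
 */
static void split_xstate_mask(uint64_t mask, uint32_t *lmask, uint32_t *hmask)
{
    *lmask = (uint32_t)mask;
    *hmask = (uint32_t)(mask >> 32);
}
```

Passing `-1` as the mask (as the FPU restore path does) therefore becomes all-ones in both eax and edx, i.e. "restore every component the hardware supports".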
* [RFC 2/2] x86, xsave: rework xsave support 2010-10-06 11:12 ` Brian Gerst ` (2 preceding siblings ...) 2010-11-23 20:41 ` [RFC 1/2] x86, xsave: cleanup xsave/xrstor support Hans Rosenfeld @ 2010-11-23 20:41 ` Hans Rosenfeld 2010-11-25 0:36 ` Brian Gerst 3 siblings, 1 reply; 26+ messages in thread From: Hans Rosenfeld @ 2010-11-23 20:41 UTC (permalink / raw) To: linux-kernel Cc: x86, brgerst, robert.richter, andreas.herrmann3, Hans Rosenfeld This is an attempt to rework the code that handles FPU and related extended states. Since FPU, XMM and YMM states are just variants of what xsave handles, all of the old FPU-specific state handling code will be hidden behind a set of functions that resemble xsave and xrstor. For hardware that does not support xsave, the code falls back to fxsave/fxrstor or even fsave/frstor. An xstate_mask member will be added to the thread_info structure that will control which states are to be saved by xsave. It is set to include all "lazy" states (that is, all states currently supported) by the #NM handler when a lazy restore is triggered or by switch_to() when the task's FPU context is preloaded. Xstate_mask is intended to completely replace TS_USEDFPU in a later cleanup patch. When "non-lazy" states such as for LWP are added later, the corresponding bits in xstate_mask are supposed to be set for all threads all the time. There will be no performance penalty for threads not using these states, as xsave and xrstor will ignore unused states. This patch is not complete and not final at all. Support for 32-bit is lacking, and the context handling for signals will probably change again. I haven't benchmarked it yet, but I have tested it for both the fxsave and xsave cases. 
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> --- arch/x86/include/asm/i387.h | 103 +++++++--------------- arch/x86/include/asm/thread_info.h | 2 + arch/x86/include/asm/xsave.h | 7 ++ arch/x86/kernel/i387.c | 11 ++- arch/x86/kernel/process_64.c | 24 +++--- arch/x86/kernel/traps.c | 31 +------ arch/x86/kernel/xsave.c | 174 +++++++++++++++++++----------------- arch/x86/kvm/x86.c | 7 +- 8 files changed, 159 insertions(+), 200 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index d908383..53b62c5 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -159,7 +159,7 @@ static inline int fxsave_user(struct i387_fxsave_struct __user *fx) return err; } -static inline void fpu_fxsave(struct fpu *fpu) +static inline void fpu_fxsave(struct i387_fxsave_struct *fx) { /* Using "rex64; fxsave %0" is broken because, if the memory operand uses any extended registers for addressing, a second REX prefix @@ -170,7 +170,7 @@ static inline void fpu_fxsave(struct fpu *fpu) /* Using "fxsaveq %0" would be the ideal choice, but is only supported starting with gas 2.16. */ __asm__ __volatile__("fxsaveq %0" - : "=m" (fpu->state->fxsave)); + : "=m" (*fx)); #else /* Using, as a workaround, the properly prefixed form below isn't accepted by any binutils version so far released, complaining that @@ -181,8 +181,8 @@ static inline void fpu_fxsave(struct fpu *fpu) This, however, we can work around by forcing the compiler to select an addressing mode that doesn't require extended registers. 
*/ asm volatile("rex64/fxsave (%[fx])" - : "=m" (fpu->state->fxsave) - : [fx] "R" (&fpu->state->fxsave)); + : "=m" (*fx) + : [fx] "R" (fx)); #endif } @@ -204,14 +204,34 @@ static inline int fxrstor_checking(struct i387_fxsave_struct *fx) return 0; } -static inline void fpu_fxsave(struct fpu *fpu) +static inline void fpu_fxsave(struct i387_fxsave_struct *fx) { asm volatile("fxsave %[fx]" - : [fx] "=m" (fpu->state->fxsave)); + : [fx] "=m" (fx)); } #endif /* CONFIG_X86_64 */ +/* + * These must be called with preempt disabled + */ + +static inline int fpu_restore(struct i387_fxsave_struct *fx) +{ + return fxrstor_checking(fx); +} + +static inline void fpu_save(struct i387_fxsave_struct *fx) +{ + if (use_fxsr()) { + fpu_fxsave(fx); + } else { + asm volatile("fsave %[fx]; fwait" + : [fx] "=m" (fx)); + return; + } +} + /* We need a safe address that is cheap to find and that is already in L1 during context switch. The best choices are unfortunately different for UP and SMP */ @@ -221,30 +241,9 @@ static inline void fpu_fxsave(struct fpu *fpu) #define safe_address (kstat_cpu(0).cpustat.user) #endif -/* - * These must be called with preempt disabled - */ -static inline void fpu_save_init(struct fpu *fpu) +static inline void fpu_clean(struct i387_fxsave_struct *fx) { - if (use_xsave()) { - struct xsave_struct *xstate = &fpu->state->xsave; - - fpu_xsave(xstate); - - /* - * xsave header may indicate the init state of the FP. 
- */ - if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP)) - return; - } else if (use_fxsr()) { - fpu_fxsave(fpu); - } else { - asm volatile("fsave %[fx]; fwait" - : [fx] "=m" (fpu->state->fsave)); - return; - } - - if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) + if (unlikely(fx->swd & X87_FSW_ES)) asm volatile("fnclex"); /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception @@ -258,35 +257,12 @@ static inline void fpu_save_init(struct fpu *fpu) [addr] "m" (safe_address)); } -static inline void __save_init_fpu(struct task_struct *tsk) -{ - fpu_save_init(&tsk->thread.fpu); - task_thread_info(tsk)->status &= ~TS_USEDFPU; -} - -static inline int fpu_restore_checking(struct fpu *fpu) -{ - if (use_xsave()) - return xrstor_checking(&fpu->state->xsave, -1); - else - return fxrstor_checking(&fpu->state->fxsave); -} - /* * Signal frame handlers... */ extern int save_i387_xstate(void __user *buf); extern int restore_i387_xstate(void __user *buf); -static inline void __unlazy_fpu(struct task_struct *tsk) -{ - if (task_thread_info(tsk)->status & TS_USEDFPU) { - __save_init_fpu(tsk); - stts(); - } else - tsk->fpu_counter = 0; -} - static inline void __clear_fpu(struct task_struct *tsk) { if (task_thread_info(tsk)->status & TS_USEDFPU) { @@ -295,6 +271,7 @@ static inline void __clear_fpu(struct task_struct *tsk) "2:\n" _ASM_EXTABLE(1b, 2b)); task_thread_info(tsk)->status &= ~TS_USEDFPU; + task_thread_info(tsk)->status &= ~XCNTXT_LAZY; stts(); } } @@ -303,10 +280,9 @@ static inline void kernel_fpu_begin(void) { struct thread_info *me = current_thread_info(); preempt_disable(); - if (me->status & TS_USEDFPU) - __save_init_fpu(me->task); - else - clts(); + /* XXX saves nonlazy unnecessarily? 
*/ + save_xstates(me->task); + clts(); } static inline void kernel_fpu_end(void) @@ -357,21 +333,6 @@ static inline void irq_ts_restore(int TS_state) /* * These disable preemption on their own and are safe */ -static inline void save_init_fpu(struct task_struct *tsk) -{ - preempt_disable(); - __save_init_fpu(tsk); - stts(); - preempt_enable(); -} - -static inline void unlazy_fpu(struct task_struct *tsk) -{ - preempt_disable(); - __unlazy_fpu(tsk); - preempt_enable(); -} - static inline void clear_fpu(struct task_struct *tsk) { preempt_disable(); diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index f0b6e5d..5c92d21 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -26,6 +26,7 @@ struct exec_domain; struct thread_info { struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ + __u64 xstate_mask; /* xstates in use */ __u32 flags; /* low level flags */ __u32 status; /* thread synchronous flags */ __u32 cpu; /* current CPU */ @@ -47,6 +48,7 @@ struct thread_info { { \ .task = &tsk, \ .exec_domain = &default_exec_domain, \ + .xstate_mask = 0, \ .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 8bcbbce..2eb019e 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -25,6 +25,8 @@ */ #define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_LAZY XCNTXT_MASK + #ifdef CONFIG_X86_64 #define REX_PREFIX "0x48, " #else @@ -35,6 +37,11 @@ extern unsigned int xstate_size; extern u64 pcntxt_mask; extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS]; +extern void xsave(struct xsave_struct *, u64); +extern int xrstor(struct xsave_struct *, u64); +extern void save_xstates(struct task_struct *); +extern void restore_xstates(struct task_struct *, u64); + extern void xsave_init(void); extern void 
update_regset_xstate_info(unsigned int size, u64 xstate_mask); extern int init_fpu(struct task_struct *child); diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 58bb239..72bc6f0 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -152,8 +152,11 @@ int init_fpu(struct task_struct *tsk) int ret; if (tsk_used_math(tsk)) { - if (HAVE_HWFP && tsk == current) - unlazy_fpu(tsk); + if (HAVE_HWFP && tsk == current) { + preempt_disable(); + save_xstates(tsk); + preempt_enable(); + } return 0; } @@ -599,7 +602,9 @@ int save_i387_xstate_ia32(void __user *buf) NULL, fp) ? -1 : 1; } - unlazy_fpu(tsk); + preempt_disable(); + save_xstates(tsk); + preempt_enable(); if (cpu_has_xsave) return save_i387_xsave(fp); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index b3d7a3a..c95a9e5 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -253,7 +253,9 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls) */ void prepare_to_copy(struct task_struct *tsk) { - unlazy_fpu(tsk); + preempt_disable(); + save_xstates(tsk); + preempt_enable(); } int copy_thread(unsigned long clone_flags, unsigned long sp, @@ -382,18 +384,18 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) int cpu = smp_processor_id(); struct tss_struct *tss = &per_cpu(init_tss, cpu); unsigned fsindex, gsindex; - bool preload_fpu; + u64 preload_lazy = 0; /* * If the task has used fpu the last 5 timeslices, just do a full * restore of the math state immediately to avoid the trap; the * chances of needing FPU soon are obviously high now */ - preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5; - - /* we're going to use this soon, after a few expensive things */ - if (preload_fpu) + if (tsk_used_math(next_p) && next_p->fpu_counter > 5) { + preload_lazy = XCNTXT_LAZY; + /* we're going to use this soon, after a few expensive things */ prefetch(next->fpu.state); + } /* * Reload esp0, LDT and the page table 
pointer: @@ -424,11 +426,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) load_TLS(next, cpu); /* Must be after DS reload */ - __unlazy_fpu(prev_p); - - /* Make sure cpu is ready for new context */ - if (preload_fpu) - clts(); + save_xstates(prev_p); /* * Leave lazy mode, flushing any hypercalls made here. @@ -492,8 +490,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) * Preload the FPU context, now that we've determined that the * task is likely to be using it. */ - if (preload_fpu) - __math_state_restore(); + if (preload_lazy) + restore_xstates(next_p, preload_lazy); return prev_p; } diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 1b0d148..b40691b 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -617,7 +617,10 @@ void math_error(struct pt_regs *regs, int error_code, int trapnr) /* * Save the info for the exception handler and clear the error. */ - save_init_fpu(task); + preempt_disable(); + save_xstates(task); + preempt_enable(); + task->thread.trap_no = trapnr; task->thread.error_code = error_code; info.si_signo = SIGFPE; @@ -709,28 +712,6 @@ asmlinkage void __attribute__((weak)) smp_threshold_interrupt(void) } /* - * __math_state_restore assumes that cr0.TS is already clear and the - * fpu state is all ready for use. Used during context switch. - */ -void __math_state_restore(void) -{ - struct thread_info *thread = current_thread_info(); - struct task_struct *tsk = thread->task; - - /* - * Paranoid restore. send a SIGSEGV if we fail to restore the state. 
- */ - if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) { - stts(); - force_sig(SIGSEGV, tsk); - return; - } - - thread->status |= TS_USEDFPU; /* So we fnsave on switch_to() */ - tsk->fpu_counter++; -} - -/* * 'math_state_restore()' saves the current math information in the * old math state array, and gets the new ones from the current task * @@ -760,9 +741,7 @@ asmlinkage void math_state_restore(void) local_irq_disable(); } - clts(); /* Allow maths ops (or we recurse) */ - - __math_state_restore(); + restore_xstates(tsk, XCNTXT_LAZY); } EXPORT_SYMBOL_GPL(math_state_restore); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 5eb15d4..5457332 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -47,8 +47,6 @@ void __sanitize_i387_state(struct task_struct *tsk) if (!fx) return; - BUG_ON(task_thread_info(tsk)->status & TS_USEDFPU); - xstate_bv = tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv; /* @@ -168,22 +166,10 @@ int save_i387_xstate(void __user *buf) if (!used_math()) return 0; - if (task_thread_info(tsk)->status & TS_USEDFPU) { - if (use_xsave()) - err = xsave_checking(buf); - else - err = fxsave_user(buf); - - if (err) - return err; - task_thread_info(tsk)->status &= ~TS_USEDFPU; - stts(); - } else { - sanitize_i387_state(tsk); - if (__copy_to_user(buf, &tsk->thread.fpu.state->fxsave, - xstate_size)) - return -1; - } + save_xstates(tsk); + sanitize_i387_state(tsk); + if (__copy_to_user(buf, &tsk->thread.fpu.state->xsave, xstate_size)) + return -1; clear_used_math(); /* trigger finit */ @@ -229,62 +215,24 @@ int save_i387_xstate(void __user *buf) } /* - * Restore the extended state if present. Otherwise, restore the FP/SSE - * state. 
- */ -static int restore_user_xstate(void __user *buf) -{ - struct _fpx_sw_bytes fx_sw_user; - u64 mask; - int err; - - if (((unsigned long)buf % 64) || - check_for_xstate(buf, buf, &fx_sw_user)) - goto fx_only; - - mask = fx_sw_user.xstate_bv; - - /* - * restore the state passed by the user. - */ - err = xrstor_checking((__force struct xsave_struct *)buf, mask); - if (err) - return err; - - /* - * init the state skipped by the user. - */ - mask = pcntxt_mask & ~mask; - if (unlikely(mask)) - xrstor_state(init_xstate_buf, mask); - - return 0; - -fx_only: - /* - * couldn't find the extended state information in the - * memory layout. Restore just the FP/SSE and init all - * the other extended state. - */ - xrstor_state(init_xstate_buf, pcntxt_mask & ~XSTATE_FPSSE); - return fxrstor_checking((__force struct i387_fxsave_struct *)buf); -} - -/* * This restores directly out of user space. Exceptions are handled. */ int restore_i387_xstate(void __user *buf) { + struct _fpx_sw_bytes fx_sw_user; struct task_struct *tsk = current; int err = 0; if (!buf) { - if (used_math()) - goto clear; + if (used_math()) { + clear_fpu(tsk); + clear_used_math(); + } return 0; - } else - if (!access_ok(VERIFY_READ, buf, sig_xstate_size)) - return -EACCES; + } + + if (!access_ok(VERIFY_READ, buf, sig_xstate_size)) + return -EACCES; if (!used_math()) { err = init_fpu(tsk); @@ -292,25 +240,21 @@ int restore_i387_xstate(void __user *buf) return err; } - if (!(task_thread_info(current)->status & TS_USEDFPU)) { - clts(); - task_thread_info(current)->status |= TS_USEDFPU; - } - if (use_xsave()) - err = restore_user_xstate(buf); - else - err = fxrstor_checking((__force struct i387_fxsave_struct *) - buf); - if (unlikely(err)) { - /* - * Encountered an error while doing the restore from the - * user buffer, clear the fpu state. 
-	 */
-clear:
-	clear_fpu(tsk);
-	clear_used_math();
-	}
-	return err;
+	if (__copy_from_user(&tsk->thread.fpu.state->xsave, buf, xstate_size))
+		return -1;
+
+	/*
+	 * Restore only states specified by the user. If there is anything wrong
+	 * with the xstate, restore only FP/SSE. XRSTOR will initialize all
+	 * other states.
+	 */
+	if (!check_for_xstate(buf, buf, &fx_sw_user))
+		tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv &= fx_sw_user.xstate_bv;
+	else if (use_xsave())
+		tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv = XSTATE_FPSSE;
+
+	restore_xstates(tsk, XCNTXT_LAZY);
+	return 0;
 }
 #endif
@@ -473,3 +417,65 @@ void __cpuinit xsave_init(void)
 	next_func = xstate_enable;
 	this_func();
 }
+
+void xsave(struct xsave_struct *x, u64 mask)
+{
+	clts();
+
+	if (use_xsave())
+		xsave_state(x, mask);
+	else if (mask & XCNTXT_LAZY)
+		fpu_save(&x->i387);
+
+	if (mask & XCNTXT_LAZY)
+		fpu_clean(&x->i387);
+
+	stts();
+}
+
+void save_xstates(struct task_struct *tsk)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+
+	if (!fpu_allocated(&tsk->thread.fpu))
+		return;
+
+	xsave(&tsk->thread.fpu.state->xsave, ti->xstate_mask);
+
+	if (!(ti->xstate_mask & XCNTXT_LAZY))
+		tsk->fpu_counter = 0;
+
+	if (tsk->fpu_counter < 5)
+		ti->xstate_mask &= ~XCNTXT_LAZY;
+
+	ti->status &= ~TS_USEDFPU;
+}
+
+int xrstor(struct xsave_struct *x, u64 mask)
+{
+	clts();
+
+	if (use_xsave())
+		return xrstor_checking(x, mask);
+	else if (mask & XCNTXT_LAZY)
+		return fpu_restore(&x->i387);
+
+	return 0;
+}
+
+void restore_xstates(struct task_struct *tsk, u64 mask)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+
+	if (!fpu_allocated(&tsk->thread.fpu))
+		return;
+
+	if (unlikely(xrstor(&tsk->thread.fpu.state->xsave, mask))) {
+		stts();
+		force_sig(SIGSEGV, tsk);
+	} else {
+		ti->xstate_mask |= mask;
+		ti->status |= TS_USEDFPU;
+		tsk->fpu_counter++;
+	}
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cdac9e5..771044d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -57,6 +57,7 @@
 #include <asm/xcr.h>
 #include <asm/pvclock.h>
 #include <asm/div64.h>
+#include <asm/xsave.h>
 
 #define MAX_IO_MSRS 256
 
 #define CR0_RESERVED_BITS \
@@ -5712,8 +5713,8 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	 */
 	kvm_put_guest_xcr0(vcpu);
 	vcpu->guest_fpu_loaded = 1;
-	unlazy_fpu(current);
-	fpu_restore_checking(&vcpu->arch.guest_fpu);
+	save_xstates(current);
+	xrstor(&vcpu->arch.guest_fpu.state->xsave, -1);
 	trace_kvm_fpu(1);
 }
@@ -5725,7 +5726,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 		return;
 	vcpu->guest_fpu_loaded = 0;
-	fpu_save_init(&vcpu->arch.guest_fpu);
+	xsave(&vcpu->arch.guest_fpu.state->xsave, -1);
 	++vcpu->stat.fpu_reload;
 	kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 	trace_kvm_fpu(0);
--
1.5.6.5
* Re: [RFC 2/2] x86, xsave: rework xsave support
  2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld
@ 2010-11-25  0:36   ` Brian Gerst
  0 siblings, 0 replies; 26+ messages in thread
From: Brian Gerst @ 2010-11-25  0:36 UTC (permalink / raw)
To: Hans Rosenfeld; +Cc: linux-kernel, x86, robert.richter, andreas.herrmann3

On Tue, Nov 23, 2010 at 3:41 PM, Hans Rosenfeld <hans.rosenfeld@amd.com> wrote:
> This is an attempt to rework the code that handles FPU and related
> extended states. Since FPU, XMM and YMM states are just variants of what
> xsave handles, all of the old FPU-specific state handling code will be
> hidden behind a set of functions that resemble xsave and xrstor. For
> hardware that does not support xsave, the code falls back to
> fxsave/fxrstor or even fsave/frstor.
>
> An xstate_mask member will be added to the thread_info structure that
> will control which states are to be saved by xsave. It is set to include
> all "lazy" states (that is, all states currently supported) by the #NM
> handler when a lazy restore is triggered, or by switch_to() when the
> task's FPU context is preloaded. Xstate_mask is intended to completely
> replace TS_USEDFPU in a later cleanup patch.
>
> When "non-lazy" states such as those for LWP are added later, the
> corresponding bits in xstate_mask are supposed to be set for all threads
> all the time. There will be no performance penalty for threads not using
> these states, as xsave and xrstor will ignore unused states.
>
> This patch is not complete and not final at all. Support for 32-bit is
> lacking, and the context handling for signals will probably change
> again. I haven't benchmarked it yet, but I have tested it for both the
> fxsave and xsave cases.

Looks good, but I would suggest adding wrappers for save_xstates() and
restore_xstates() that handle preemption, the way unlazy_fpu() did.

--
Brian Gerst
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 18:27 ` Hans Rosenfeld
                   ` (3 preceding siblings ...)
  2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld
@ 2010-10-05 19:05 ` Ingo Molnar
  2010-10-06  7:35   ` Robert Richter
  4 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2010-10-05 19:05 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: Thomas Gleixner, Richter, Robert, LKML, H. Peter Anvin, Herrmann3,
    Andreas, Peter Zijlstra, Peter Zijlstra, Frédéric Weisbecker,
    Steven Rostedt, Arnaldo Carvalho de Melo

* Hans Rosenfeld <hans.rosenfeld@amd.com> wrote:

> But that's not required to use LWP instructions. Maybe Robert will add
> it to perf some day, or maybe not. It is completely optional, and how
> it is implemented at some point in the future is completely irrelevant
> to the basic support of LWP.

It isn't irrelevant. If the concept is implemented in a crappy way, if
there are random user-space libraries that do not properly expose these
capabilities and do not allow them to extend existing perf functionality
in a natural way, then we might be better off not adding overhead to the
scheduler and not enabling it at all.

So thought needs to be given to what the point of it all is and how it
integrates into perf. If it doesn't integrate, if the whole plan is to
just get it to user space where it can be messed up freely in some
CPU-specific way, then color me thoroughly uninterested. We have a
generic instrumentation framework for a reason.

We very much want to know the structure of that area and want to make
use of it, not just on a per-task basis. We also want to expose it all
in a more generalized form.

I cannot believe something like this was committed into silicon without
at minimum talking to people who have a clue about where it will (and
should) all go ...

	Ingo
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 19:05 ` [RFC 0/3] Basic support for LWP Ingo Molnar
@ 2010-10-06  7:35 ` Robert Richter
  [not found]        ` <AANLkTi=T0QmcKeZcgcR+GKk-9OwQUB_x8XdHiNuU7tE_@mail.gmail.com>
  0 siblings, 1 reply; 26+ messages in thread
From: Robert Richter @ 2010-10-06  7:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rosenfeld, Hans, Thomas Gleixner, LKML, H. Peter Anvin, Herrmann3,
    Andreas, Peter Zijlstra, Peter Zijlstra, Frédéric Weisbecker,
    Steven Rostedt, Arnaldo Carvalho de Melo, Stephane Eranian

On 05.10.10 15:05:01, Ingo Molnar wrote:

> So thoughts need to be made what the point of it all is and how it
> integrates into perf. If it doesnt integrate, if the whole plan is to
> just get it to user-space where it can be messed up freely in some CPU
> specific way then color me thoroughly uninterested. We have a generic
> instrumentation framework for a reason.

I was looking at how this integrates into the perf syscall. It might be
a little disappointing, but there is not much for the kernel to do. Ring
buffer handling is implemented in hardware now; user land sets up
address ranges in the task's address space for buffers and can thus
access them directly. We do not need an interrupt handler to fill the
buffers. The pmu state is saved and restored during context switches in
ways that have been proven for the fpu (xsave) and for virtualization
(the LWPCB resembles a VMCB). So, overall, hardware is now doing the job
of writing samples into a userland buffer and managing the pmu state.
This greatly reduces system overhead while profiling, especially because
we don't have to walk through a software stack with each sample (this is
where the 'lightweight' comes from).

Of course this does not fit into current frameworks because of its
difference in concept, but in general we want to see it in perf. So the
main work of integration is left to tool and library implementation. But
for this, Linux must implement LWP context switch support. This is what
Hans did.

We also measured the system impact that comes from the additional
rdmsrl() if the cpu supports LWP. There is no significant performance
decrease in a worst-case scenario. So, this is how we think it is best
to implement it, and we need your feedback here.

I think we should consider applying patch 1/3 as it is unrelated to LWP
and reworks and improves the code.

Thanks,

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
* Re: [RFC 0/3] Basic support for LWP
  [not found] ` <AANLkTi=T0QmcKeZcgcR+GKk-9OwQUB_x8XdHiNuU7tE_@mail.gmail.com>
@ 2010-10-07 10:46   ` Stephane Eranian
  2010-10-07 13:59     ` H. Peter Anvin
  0 siblings, 1 reply; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 10:46 UTC (permalink / raw)
To: mingo
Cc: Hans.Rosenfeld, robert.richter, tglx, linux-kernel, hpa,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

Hi,

Some comments on all of this.

First of all, LWP adds value because it allows collecting information
which is otherwise unavailable or of lesser quality.

LWP is meant to measure only at the user level (only CPL=3). It is blind
to kernel-level execution or even priv levels 1 and 2.

LWP is ALWAYS precise (that's THE key value add in my mind). It reports
the address of the instruction that caused the counter to overflow. This
is true for all events it can measure, including cache misses. That
precision can be achieved through HW support.

LWP does not have the data pollution of IBS. It only records
instructions that caused the type of event you want to measure. For
instance, if you want to sample cache misses, it collects only data
related to instructions that generate cache misses, unlike IBS. It is a
radically different approach. Don't get me wrong, both are useful; it
all depends on what you want to measure.

LWP does allow data normalization. You know how many instances of the
sampling events have been observed, unlike IBS.

LWP is lightweight in the sense that the HW records samples for you in a
memory buffer. This is similar to PEBS in that regard. But it does not
capture full machine state. It also has a limited number of events.

LWP allows users to insert their own samples in the buffer. That can be
useful to record other kinds of information. It could be used to make
this work in system-wide mode, for instance.

The LWP spec has been out for review by EVERYBODY since 2007. I remember
looking at it back then. So the comments about "nobody was consulted
before this got into silicon" are erroneous. Everybody had a chance to
comment.

LWP operates independently from the regular core PMU. Both can be active
at the same time with no problems.

As for the patch itself, I am not an expert at xsave/xrstor, but it
seems to me you could decouple LWP from the FPU. I think Brian had the
same comment. I suspect this can be done and it will certainly look
cleaner.

On the comments about integrating this into perf_events: LWP is designed
for self-monitoring ONLY. It can be controlled entirely from user space
(once the kernel enables it and has xsave/xrstor support). Thus it seems
natural, at first, to implement this in a library that can be either
explicitly called by developers or implicitly used via LD_PRELOAD.

But I suspect it could also be implemented as part of perf_events in a
manner similar to PEBS, using the precise_ip mode to trigger it. This is
assuming that the LWP instructions work at priv level 0. Both the LWP
control block and the LWP buffer would be allocated by the kernel and
yet be accessible from priv level 3. It would also need its own event
encoding, as there isn't necessarily a 1-to-1 mapping with core PMU
events. It would also use the interrupt mode in this case.

In terms of access control, I believe it would be possible to make LWP
invisible to users by tweaking the LWP_CFG register to report no support
and yet have the kernel control LWP.

You'd have to find a way to copy ALL the information in each LWP record
into the perf sampling buffer format.

What does this buy us? An integrated solution with perf and the perf
tools, no dedicated library, finer-grained access control. Possibly
precise sampling in per-cpu mode (leveraging LWPVAL).

What do we lose? The lightweight aspect of it. There would be syscalls
for setup/start/stop, and data would have to be copied between the LWP
record format and the perf sampling buffer.

Personally, I would have gone with the user-level approach first, using
polling mode on the LWP buffer, to evaluate whether LWP delivers on its
promise w.r.t. high-quality data and low overhead. But I also understand
the motivation to integrate this into perf to provide a uniform
interface.

But one thing is clear: LWP adds value, and I want access to the LWP
data because it is of higher quality than what we can get today on AMD
processors. It can really help drive certain optimizations, e.g., in
compilers.

On Wed, Oct 6, 2010 at 9:35 AM, Robert Richter <robert.richter@amd.com> wrote:
> On 05.10.10 15:05:01, Ingo Molnar wrote:
>
>> So thoughts need to be made what the point of it all is and how it
>> integrates into perf. If it doesnt integrate, if the whole plan is to
>> just get it to user-space where it can be messed up freely in some CPU
>> specific way then color me thoroughly uninterested. We have a generic
>> instrumentation framework for a reason.
>
> I was looking at how this integrates into the perf syscall. It might
> be a little disappointing, but there is not much to do for the kernel.
> Ring buffer handling is implemented in hardware now, the user land
> sets up address ranges in the task's address space for buffers and
> thus may direct access it. We do not need an interrupt handler to fill
> the buffers. The pmu state is saved and restored during context
> switches in ways that has been proven for the fpu (xsave) and
> virtualization (VMCB like LWPCB). So, overall hardware is now doing
> the job for writing samples into a userland buffer and managing the
> pmu state. This reduces system overhead while profiling a lot
> especially because we don't have to walk through a software stack with
> each sample (this is where the 'lightweight' comes from).
>
> Of course this does not fit into current frameworks because of its
> difference in concept, but in general we want to see it in perf. So
> the main work of integration will leave here to tool and library
> implementation. But for this Linux must implement LWP context switch
> support. This is what Hans did.
>
> We also measured the system impact which comes from the additional
> rdmsrl() if the cpu supports LWP. There is no significant performance
> decrease in a worst case scenario. So, this is how we think it is the
> best to implement it and we need your feedback here.
>
> I think we should consider to apply patch 1/3 as it is unrelated to
> LWP and reworks and improves the code.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 10:46 ` Stephane Eranian
@ 2010-10-07 13:59   ` H. Peter Anvin
  2010-10-07 14:11     ` Stephane Eranian
  2010-10-07 15:12     ` Stephane Eranian
  0 siblings, 2 replies; 26+ messages in thread
From: H. Peter Anvin @ 2010-10-07 13:59 UTC (permalink / raw)
To: Stephane Eranian
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On 10/07/2010 03:46 AM, Stephane Eranian wrote:
>
> As for the patch itself, I am not an expert at xsave/xrstor, but it seems to
> me you could decouple LWP from FPU. I think Brian had the same comment.
> I suspect this can be done and it will certainly look cleaner.
>

Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
if you're using XSAVE and not honoring CR0.TS you have a major design
flaw.

Nothing that can't be dealt with, but still...

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 13:59 ` H. Peter Anvin
@ 2010-10-07 14:11   ` Stephane Eranian
  2010-10-07 14:20     ` Hans Rosenfeld
  2010-10-07 14:20     ` H. Peter Anvin
  1 sibling, 2 replies; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 14:11 UTC (permalink / raw)
To: H. Peter Anvin
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On Thu, Oct 7, 2010 at 3:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 10/07/2010 03:46 AM, Stephane Eranian wrote:
>>
>> As for the patch itself, I am not an expert at xsave/xrstor, but it seems to
>> me you could decouple LWP from FPU. I think Brian had the same comment.
>> I suspect this can be done and it will certainly look cleaner.
>>
>
> Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
> if you're using XSAVE and not honoring CR0.TS you have a major design flaw.
>
Is that to say that if you use LWP you will have to save/restore FPU
state even though you're not actually using it?

> Nothing that can't be dealt with, but still...
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:11 ` Stephane Eranian
@ 2010-10-07 14:20   ` Hans Rosenfeld
  2010-10-07 14:20   ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: Hans Rosenfeld @ 2010-10-07 14:20 UTC (permalink / raw)
To: Stephane Eranian
Cc: H. Peter Anvin, mingo@elte.hu, Richter, Robert, tglx@linutronix.de,
    linux-kernel@vger.kernel.org, Herrmann3, Andreas,
    peterz@infradead.org, fweisbec@gmail.com, rostedt@goodmis.org,
    acme@redhat.com, Peter Zijlstra, eranian@gmail.com

On Thu, Oct 07, 2010 at 10:11:42AM -0400, Stephane Eranian wrote:
> On Thu, Oct 7, 2010 at 3:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
> > if you're using XSAVE and not honoring CR0.TS you have a major design flaw.
> >
> Is that to say, that if you use LWP you will have to save/restore FPU state even
> though you're not actually using it?

No, you don't necessarily have to care about the FPU stuff. XSAVE can
save different states individually to the same buffer, and XRSTOR can
restore them individually. You just have to make sure CR0.TS is not set
when you execute the XSAVE or XRSTOR instruction.

That's what I did, and it worked :)


Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:11 ` Stephane Eranian
  2010-10-07 14:20   ` Hans Rosenfeld
@ 2010-10-07 14:20   ` H. Peter Anvin
  2010-10-07 14:25     ` Stephane Eranian
  1 sibling, 1 reply; 26+ messages in thread
From: H. Peter Anvin @ 2010-10-07 14:20 UTC (permalink / raw)
To: Stephane Eranian
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On 10/07/2010 07:11 AM, Stephane Eranian wrote:
>
> Is that to say, that if you use LWP you will have to save/restore FPU state even
> though you're not actually using it?
>

No, but you wouldn't be able to use lazy FPU.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:20 ` H. Peter Anvin
@ 2010-10-07 14:25   ` Stephane Eranian
  2010-10-07 14:47     ` H. Peter Anvin
  0 siblings, 1 reply; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 14:25 UTC (permalink / raw)
To: H. Peter Anvin
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On Thu, Oct 7, 2010 at 4:20 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> No, but you wouldn't be able to use lazy FPU.
>

You mean lazy restore, I am guessing here. Is that to say you cannot
figure out whether the FPU state in the CPU on ctxsw in is yours
anymore?
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 14:25 ` Stephane Eranian
@ 2010-10-07 14:47   ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2010-10-07 14:47 UTC (permalink / raw)
To: Stephane Eranian
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On 10/07/2010 07:25 AM, Stephane Eranian wrote:
>>
>> No, but you wouldn't be able to use lazy FPU.
>>
> You mean lazy restore, I am guessing here.
> Is that to say, you cannot figure out whether the FPU state in
> the CPU on ctxsw in is yours anymore?
>

Lazy save/restore, and more or less, yes. Really, XSAVE was designed to
be saved/restored as a unit, with unused state skipped.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-07 13:59 ` H. Peter Anvin
  2010-10-07 14:11   ` Stephane Eranian
@ 2010-10-07 15:12   ` Stephane Eranian
  1 sibling, 0 replies; 26+ messages in thread
From: Stephane Eranian @ 2010-10-07 15:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: mingo, Hans.Rosenfeld, robert.richter, tglx, linux-kernel,
    Andreas.Herrmann3, peterz, fweisbec, rostedt, acme, Peter Zijlstra,
    eranian

On Thu, Oct 7, 2010 at 3:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Well, once you're using XSAVE you're not decoupled from the FPU. Worse,
> if you're using XSAVE and not honoring CR0.TS you have a major design flaw.
>

I looked at the LWP documentation and it says:

  "LWP does not support the “lazy” state save and restore that is
  possible for floating point and SSE state. It does not interact with
  the CR0.TS bit. Operating systems that support LWP must always do an
  XSAVE to preserve the old thread’s LWP context and an XRSTOR to set
  up the new LWP context. The OS can continue to do a lazy switch of
  the FP and SSE state by ensuring that the corresponding bits in
  EDX:EAX are clear when it executes the XSAVE and XRSTOR to handle the
  LWP context."

They imply that you can still do lazy FP/SSE save/restore even though
CR0.TS does not monitor LWP access. If LWP is used, the kernel needs to
call xsave/xrstor on ctxsw. It can do this lazily by checking
LWP_CBADDR.
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 14:51 ` Hans Rosenfeld
  2010-10-05 15:34   ` Thomas Gleixner
@ 2010-10-05 18:43   ` Davidlohr Bueso
  2010-10-06 10:26     ` Andi Kleen
  1 sibling, 1 reply; 26+ messages in thread
From: Davidlohr Bueso @ 2010-10-05 18:43 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: Thomas Gleixner, robert.richter, LKML, H. Peter Anvin, Ingo Molnar,
    Herrmann3, Andreas, Peter Zijlstra

On Tue, 2010-10-05 at 16:51 +0200, Hans Rosenfeld wrote:
> [ adding Robert ]
>
> On Mon, Oct 04, 2010 at 06:13:46PM -0400, Thomas Gleixner wrote:
> > > LWP (Light-Weight Profiling) is a new profiling mechanism that allows
> > > user mode processes to gather performance data about themselves with
> > > very low overhead. The specification can be found here:
> > > http://developer.amd.com/cpu/LWP/Pages/default.aspx
> >
> > All I can see there is marketing blurb. The spec pdf does not tell me
> > either how the end result of your patches will look like.
>
> Well, you should get a basic understanding of how LWP works from the
> spec. I'm pretty sure there is more than that blurb in there.
>
> What I sent _is_ the end result. Supporting LWP in the task switching
> code is the only thing necessary to use the LWP instructions in user
> space. And that is also what I'd like to get some comments on.
>
> > > This code adds basic support for LWP to the context switch code, which
> > > is the minimum needed to use LWP. Support for other LWP features like
> > > interrupts will be added later.
> >
> > Oh no. We are not going to merge that and give you a card blanche to
> > come up with another profiling/performance monitoring facility which
> > is completely disconnected to everything else.
>
> I think you got something wrong here, probably because I wasn't clear
> enough :)
>
> What I sent was just the basic kernel support, which means
> saving/restoring the LWP state in task switches. Technically no other
> kernel changes are necessary to use LWP in user space programs. No
> kernel framework of any kind is necessary to use LWP. Additionally,
> Robert is checking whether and how LWP support could be integrated
> into perf. This of course depends on having basic LWP task switching
> support in the kernel.

That would be the way to go, I don't think anyone wants yet another
profiling userspace tool.

> > Please provide either the full patch of this facility or at least a
> > reasonable explanation how it will look like when your code is
> > ready. What's the user space interface, how does it hook into perf,
> > etc ...
>
> The basic principle of LWP is that a user program reserves some memory
> for a LWP control block and a ring buffer, writes the ring buffer
> address and some setup bits into the control block, and executes the
> LLWPCB instruction to load the control block into the CPU. While the
> program runs it can periodically read the samples from the ring buffer
> and do something useful with them. The program being profiled needs to
> include support for LWP to do that. In case you can't or don't want to
> modify the program being profiled, a small LWP library would have to
> be written that would have to be LD_PRELOADed. I don't really know
> what it will look like in the end and how it will be integrated into
> perf. The tool/library developers will probably know best how to make
> use of LWP.
>
> The point is that LWP is supposed to be controlled and used from user
> space without any kernel interaction, provided that the kernel
> saves/restores the LWP context. That's also how I tested it, using a
> small test program that just uses the LWP instructions.
>
> Technically LWP is quite similar to the FPU, it even uses the same
> facilities for OS support. Just like FPU support, the kernel doesn't
> have to care at all about what user space is going to do with it. It
> just has to make sure state is saved and restored when necessary.
>
>
> Hans
* Re: [RFC 0/3] Basic support for LWP
  2010-10-05 18:43 ` Davidlohr Bueso
@ 2010-10-06 10:26   ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2010-10-06 10:26 UTC (permalink / raw)
To: dave
Cc: Hans Rosenfeld, Thomas Gleixner, robert.richter, LKML,
    H. Peter Anvin, Ingo Molnar, Herrmann3, Andreas, Peter Zijlstra

Davidlohr Bueso <dave@gnu.org> writes:
>
> That would be the way to go, I don't think anyone wants yet another
> profiling userspace tool.

Why not? Are you saying innovation in profilers is dead and we're all
only allowed to use a single one now?

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
Thread overview: 26+ messages
[not found] <1286212172-654419-1-git-send-email-hans.rosenfeld@amd.com>
2010-10-04 22:13 ` [RFC 0/3] Basic support for LWP Thomas Gleixner
2010-10-05 14:51 ` Hans Rosenfeld
2010-10-05 15:34 ` Thomas Gleixner
2010-10-05 18:27 ` Hans Rosenfeld
2010-10-05 18:30 ` Hans Rosenfeld
2010-10-05 18:30 ` [RFC 1/3] Cleanup xsave/xrstor support Hans Rosenfeld
2010-10-05 18:30 ` [RFC 2/3] Allow saving of individual states in fpu_xsave() Hans Rosenfeld
2010-10-05 18:30 ` [RFC 3/3] Save/restore LWP state in context switches Hans Rosenfeld
2010-10-06 11:12 ` Brian Gerst
2010-10-07 14:58 ` Hans Rosenfeld
2010-11-23 20:41 ` [RFC 0/2] FPU/xsave rework in preparation for LWP Hans Rosenfeld
2010-11-23 20:41 ` [RFC 1/2] x86, xsave: cleanup xsave/xrstor support Hans Rosenfeld
2010-11-23 20:41 ` [RFC 2/2] x86, xsave: rework xsave support Hans Rosenfeld
2010-11-25 0:36 ` Brian Gerst
2010-10-05 19:05 ` [RFC 0/3] Basic support for LWP Ingo Molnar
2010-10-06 7:35 ` Robert Richter
[not found] ` <AANLkTi=T0QmcKeZcgcR+GKk-9OwQUB_x8XdHiNuU7tE_@mail.gmail.com>
2010-10-07 10:46 ` Stephane Eranian
2010-10-07 13:59 ` H. Peter Anvin
2010-10-07 14:11 ` Stephane Eranian
2010-10-07 14:20 ` Hans Rosenfeld
2010-10-07 14:20 ` H. Peter Anvin
2010-10-07 14:25 ` Stephane Eranian
2010-10-07 14:47 ` H. Peter Anvin
2010-10-07 15:12 ` Stephane Eranian
2010-10-05 18:43 ` Davidlohr Bueso
2010-10-06 10:26 ` Andi Kleen