* [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
@ 2011-05-03 20:17 Wei Huang
2011-05-04 7:09 ` Jan Beulich
0 siblings, 1 reply; 6+ messages in thread
From: Wei Huang @ 2011-05-03 20:17 UTC (permalink / raw)
To: 'xen-devel@lists.xensource.com', Keir Fraser, Jan Beulich
[-- Attachment #1: Type: text/plain, Size: 504 bytes --]
FPU: create lazy and non-lazy FPU restore functions
Currently Xen relies on #NM (via CR0.TS) to trigger FPU context restore.
But not all FPU state is tracked by the TS bit. This patch creates two
FPU restore functions: vcpu_restore_fpu_lazy() and
vcpu_restore_fpu_eager(). vcpu_restore_fpu_lazy() is still used when #NM
is triggered. vcpu_restore_fpu_eager(), by comparison, is called for the
vcpu being scheduled in on every context switch.
Signed-off-by: Wei Huang <wei.huang2@amd.com>
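
In outline, the division of labour introduced here is roughly the
following (a simplified sketch of the call sites; the diff below spells
them out in full):

    /* Context-switch path: runs for every vcpu being switched in,
     * covering state (e.g. LWP) that CR0.TS cannot track. */
    __context_switch()
        vcpu_restore_fpu_eager(n);

    /* #NM path: runs only when the guest first touches the FPU after
     * a switch, restoring the TS-tracked (lazy) state. */
    do_device_not_available() / svm_fpu_enter() / vmx_fpu_enter()
        vcpu_restore_fpu_lazy(v);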
[-- Attachment #2: lwp6.txt --]
[-- Type: text/plain, Size: 3909 bytes --]
# HG changeset patch
# User Wei Huang <wei.huang2@amd.com>
# Date 1304449177 18000
# Node ID 40c9bab757e1cc3964ae6869022eebba7141d6eb
# Parent 83db82b67f65bee91f35e9caaad700a78ac0a3fc
FPU: create lazy and non-lazy FPU restore functions
Currently Xen relies on #NM (via CR0.TS) to trigger FPU context restore. But not all FPU state is tracked by the TS bit. This patch creates two FPU restore functions: vcpu_restore_fpu_lazy() and vcpu_restore_fpu_eager(). vcpu_restore_fpu_lazy() is still used when #NM is triggered. vcpu_restore_fpu_eager(), by comparison, is called for the vcpu being scheduled in on every context switch.
Signed-off-by: Wei Huang <wei.huang2@amd.com>
diff -r 83db82b67f65 -r 40c9bab757e1 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c Tue May 03 13:49:27 2011 -0500
+++ b/xen/arch/x86/domain.c Tue May 03 13:59:37 2011 -0500
@@ -1578,6 +1578,7 @@
         memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
         if ( xsave_enabled(n) && n->arch.xcr0 != get_xcr0() )
             set_xcr0(n->arch.xcr0);
+        vcpu_restore_fpu_eager(n);
         n->arch.ctxt_switch_to(n);
     }
 
diff -r 83db82b67f65 -r 40c9bab757e1 xen/arch/x86/hvm/svm/svm.c
--- a/xen/arch/x86/hvm/svm/svm.c Tue May 03 13:49:27 2011 -0500
+++ b/xen/arch/x86/hvm/svm/svm.c Tue May 03 13:59:37 2011 -0500
@@ -348,7 +348,7 @@
 {
     struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
 
-    vcpu_restore_fpu(v);
+    vcpu_restore_fpu_lazy(v);
     vmcb_set_exception_intercepts(
         vmcb, vmcb_get_exception_intercepts(vmcb) & ~(1U << TRAP_no_device));
 }
diff -r 83db82b67f65 -r 40c9bab757e1 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c Tue May 03 13:49:27 2011 -0500
+++ b/xen/arch/x86/hvm/vmx/vmx.c Tue May 03 13:59:37 2011 -0500
@@ -612,7 +612,7 @@
 
 static void vmx_fpu_enter(struct vcpu *v)
 {
-    vcpu_restore_fpu(v);
+    vcpu_restore_fpu_lazy(v);
     v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device);
     vmx_update_exception_bitmap(v);
     v->arch.hvm_vmx.host_cr0 &= ~X86_CR0_TS;
diff -r 83db82b67f65 -r 40c9bab757e1 xen/arch/x86/i387.c
--- a/xen/arch/x86/i387.c Tue May 03 13:49:27 2011 -0500
+++ b/xen/arch/x86/i387.c Tue May 03 13:59:37 2011 -0500
@@ -160,10 +160,25 @@
 /*******************************/
 /*      VCPU FPU Functions     */
 /*******************************/
+/* Restore FPU state whenever VCPU is scheduled in. */
+void vcpu_restore_fpu_eager(struct vcpu *v)
+{
+    ASSERT(!is_idle_vcpu(v));
+
+    /* Avoid recursion */
+    clts();
+
+    /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
+    if ( xsave_enabled(v) )
+        fpu_xrstor(v, XSTATE_NONLAZY);
+
+    stts();
+}
+
 /*
  * Restore FPU state when #NM is triggered.
  */
-void vcpu_restore_fpu(struct vcpu *v)
+void vcpu_restore_fpu_lazy(struct vcpu *v)
 {
     ASSERT(!is_idle_vcpu(v));
 
@@ -174,7 +189,7 @@
         return;
 
     if ( xsave_enabled(v) )
-        fpu_xrstor(v, XSTATE_ALL);
+        fpu_xrstor(v, XSTATE_LAZY);
     else if ( v->fpu_initialised )
     {
         if ( cpu_has_fxsr )
diff -r 83db82b67f65 -r 40c9bab757e1 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c Tue May 03 13:49:27 2011 -0500
+++ b/xen/arch/x86/traps.c Tue May 03 13:59:37 2011 -0500
@@ -3198,7 +3198,7 @@
 
     BUG_ON(!guest_mode(regs));
 
-    vcpu_restore_fpu(curr);
+    vcpu_restore_fpu_lazy(curr);
 
     if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
     {
diff -r 83db82b67f65 -r 40c9bab757e1 xen/include/asm-x86/i387.h
--- a/xen/include/asm-x86/i387.h Tue May 03 13:49:27 2011 -0500
+++ b/xen/include/asm-x86/i387.h Tue May 03 13:59:37 2011 -0500
@@ -14,7 +14,8 @@
 #include <xen/types.h>
 #include <xen/percpu.h>
 
-void vcpu_restore_fpu(struct vcpu *v);
+void vcpu_restore_fpu_eager(struct vcpu *v);
+void vcpu_restore_fpu_lazy(struct vcpu *v);
 void vcpu_save_fpu(struct vcpu *v);
 int vcpu_init_fpu(struct vcpu *v);
 
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
* Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
2011-05-03 20:17 [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions Wei Huang
@ 2011-05-04 7:09 ` Jan Beulich
2011-05-04 16:33 ` Wei Huang
0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2011-05-04 7:09 UTC (permalink / raw)
To: Wei Huang; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> On 03.05.11 at 22:17, Wei Huang <wei.huang2@amd.com> wrote:

Again as pointed out earlier, ...

>--- a/xen/arch/x86/domain.c Tue May 03 13:49:27 2011 -0500
>+++ b/xen/arch/x86/domain.c Tue May 03 13:59:37 2011 -0500
>@@ -1578,6 +1578,7 @@
>         memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
>         if ( xsave_enabled(n) && n->arch.xcr0 != get_xcr0() )
>             set_xcr0(n->arch.xcr0);
>+        vcpu_restore_fpu_eager(n);

... this call is unconditional, ...

>         n->arch.ctxt_switch_to(n);
>     }
>
>--- a/xen/arch/x86/i387.c Tue May 03 13:49:27 2011 -0500
>+++ b/xen/arch/x86/i387.c Tue May 03 13:59:37 2011 -0500
>@@ -160,10 +160,25 @@
> /*******************************/
> /*      VCPU FPU Functions     */
> /*******************************/
>+/* Restore FPU state whenever VCPU is scheduled in. */
>+void vcpu_restore_fpu_eager(struct vcpu *v)
>+{
>+    ASSERT(!is_idle_vcpu(v));
>+
>+    /* Avoid recursion */
>+    clts();
>+
>+    /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
>+    if ( xsave_enabled(v) )
>+        fpu_xrstor(v, XSTATE_NONLAZY);
>+
>+    stts();

... while here you do an unconditional clts followed by an xrstor only
checking whether xsave is enabled (but not checking whether there's
any non-lazy state to be restored) and, possibly the most expensive
of all, an unconditional write of CR0.

Jan

>+}
>+
> /*
>  * Restore FPU state when #NM is triggered.
>  */
>-void vcpu_restore_fpu(struct vcpu *v)
>+void vcpu_restore_fpu_lazy(struct vcpu *v)
> {
>     ASSERT(!is_idle_vcpu(v));
>
* Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
2011-05-04 7:09 ` Jan Beulich
@ 2011-05-04 16:33 ` Wei Huang
2011-05-05 7:13 ` Jan Beulich
0 siblings, 1 reply; 6+ messages in thread
From: Wei Huang @ 2011-05-04 16:33 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel@lists.xensource.com, Keir Fraser

Checking whether there is non-lazy state to save is architecture
specific and very messy. For instance, we need to read LWP_CBADDR to
confirm LWP's dirty state. This MSR is AMD specific and we don't want to
add it here. Plus reading data from the LWP_CBADDR MSR might be as
expensive as clts/stts.

My previous email showed that the overhead with LWP is around 1%-2% of
__context_switch(). For non-LWP-capable CPUs, this overhead should be
much smaller (only clts and stts) because xfeature_mask[LWP] is 0.

Yes, clts() and stts() don't have to be called every time. How about
this one?

/* Restore FPU state whenever VCPU is scheduled in. */
void vcpu_restore_fpu_eager(struct vcpu *v)
{
    ASSERT(!is_idle_vcpu(v));

    /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
    if ( xsave_enabled(v) )
    {
        /* Avoid recursion */
        clts();
        fpu_xrstor(v, XSTATE_NONLAZY);
        stts();
    }
}

On 05/04/2011 02:09 AM, Jan Beulich wrote:
>>>> On 03.05.11 at 22:17, Wei Huang <wei.huang2@amd.com> wrote:
> Again as pointed out earlier, ...
>
>> --- a/xen/arch/x86/domain.c Tue May 03 13:49:27 2011 -0500
>> +++ b/xen/arch/x86/domain.c Tue May 03 13:59:37 2011 -0500
>> @@ -1578,6 +1578,7 @@
>>          memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
>>          if ( xsave_enabled(n) && n->arch.xcr0 != get_xcr0() )
>>              set_xcr0(n->arch.xcr0);
>> +        vcpu_restore_fpu_eager(n);
> ... this call is unconditional, ...
>
>>          n->arch.ctxt_switch_to(n);
>>      }
>>
>> --- a/xen/arch/x86/i387.c Tue May 03 13:49:27 2011 -0500
>> +++ b/xen/arch/x86/i387.c Tue May 03 13:59:37 2011 -0500
>> @@ -160,10 +160,25 @@
>>  /*******************************/
>>  /*      VCPU FPU Functions     */
>>  /*******************************/
>> +/* Restore FPU state whenever VCPU is scheduled in. */
>> +void vcpu_restore_fpu_eager(struct vcpu *v)
>> +{
>> +    ASSERT(!is_idle_vcpu(v));
>> +
>> +    /* Avoid recursion */
>> +    clts();
>> +
>> +    /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
>> +    if ( xsave_enabled(v) )
>> +        fpu_xrstor(v, XSTATE_NONLAZY);
>> +
>> +    stts();
> ... while here you do an unconditional clts followed by an xrstor only
> checking whether xsave is enabled (but not checking whether there's
> any non-lazy state to be restored) and, possibly the most expensive
> of all, an unconditional write of CR0.
>
> Jan
>
>> +}
>> +
>>  /*
>>   * Restore FPU state when #NM is triggered.
>>   */
>> -void vcpu_restore_fpu(struct vcpu *v)
>> +void vcpu_restore_fpu_lazy(struct vcpu *v)
>>  {
>>      ASSERT(!is_idle_vcpu(v));
>>
* Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
2011-05-04 16:33 ` Wei Huang
@ 2011-05-05 7:13 ` Jan Beulich
2011-05-05 21:41 ` Wei Huang
0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2011-05-05 7:13 UTC (permalink / raw)
To: Wei Huang; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> On 04.05.11 at 18:33, Wei Huang <wei.huang2@amd.com> wrote:
> Checking whether there is non-lazy state to save is architecture
> specific and very messy. For instance, we need to read LWP_CBADDR to
> confirm LWP's dirty state. This MSR is AMD specific and we don't want to
> add it here. Plus reading data from the LWP_CBADDR MSR might be as
> expensive as clts/stts.
>
> My previous email showed that the overhead with LWP is around 1%-2% of
> __context_switch(). For non-LWP-capable CPUs, this overhead should be
> much smaller (only clts and stts) because xfeature_mask[LWP] is 0.

I wasn't talking about determining whether LWP state is dirty, but
much rather about LWP not being in use at all.

> Yes, clts() and stts() don't have to be called every time. How about
> this one?
>
> /* Restore FPU state whenever VCPU is scheduled in. */
> void vcpu_restore_fpu_eager(struct vcpu *v)
> {
>     ASSERT(!is_idle_vcpu(v));
>
>     /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
>     if ( xsave_enabled(v) )
>     {
>         /* Avoid recursion */
>         clts();
>         fpu_xrstor(v, XSTATE_NONLAZY);
>         stts();
>     }
> }

That's certainly better, but I'd still like to see the xsave_enabled()
check to be replaced by some form of lwp_enabled() or
lazy_xsave_needed() or some such (which will at once exclude all
pv guests until you care to add support for them).

Jan
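
(A guard along the lines Jan asks for might look like the sketch below.
The name lwp_enabled() is Jan's suggested name rather than an existing
Xen helper, and the mask test assumes the non-lazy feature bits are
recorded in xcr0_accum — both are illustrative assumptions, not part of
the posted series.)

    static inline int lwp_enabled(const struct vcpu *v)
    {
        /* Non-lazy state exists only for xsave-capable guests that
         * actually enabled a non-lazy feature (currently just LWP). */
        return xsave_enabled(v) &&
               (v->arch.xcr0_accum & XSTATE_NONLAZY);
    }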
* Re: Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
2011-05-05 7:13 ` Jan Beulich
@ 2011-05-05 21:41 ` Wei Huang
2011-05-06 7:49 ` Jan Beulich
0 siblings, 1 reply; 6+ messages in thread
From: Wei Huang @ 2011-05-05 21:41 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2363 bytes --]

Hi Jan,

If we want to make LWP restore optional in vcpu_restore_fpu_eager(), we
have to change vcpu_save_fpu() as well. Otherwise, the extended state
will become inconsistent for non-LWP VCPUs (because save and restore are
asymmetric). There are two approaches:

1. In vcpu_save_fpu(), clean the physical CPU's extended state for the
VCPU which is being scheduled out. This prevents messy state from
causing problems. The disadvantage is the cleaning cost, which would
outweigh the benefits.

2. Add a new variable in VCPU to track whether nonlazy state is dirty. I
think this is better. See the attached file.

Let me know if it is what you want. After that, I will re-spin the
patches.

Thanks,
-Wei

On 05/05/2011 02:13 AM, Jan Beulich wrote:
>>>> On 04.05.11 at 18:33, Wei Huang <wei.huang2@amd.com> wrote:
>> Checking whether there is non-lazy state to save is architecture
>> specific and very messy. For instance, we need to read LWP_CBADDR to
>> confirm LWP's dirty state. This MSR is AMD specific and we don't want to
>> add it here. Plus reading data from the LWP_CBADDR MSR might be as
>> expensive as clts/stts.
>>
>> My previous email showed that the overhead with LWP is around 1%-2% of
>> __context_switch(). For non-LWP-capable CPUs, this overhead should be
>> much smaller (only clts and stts) because xfeature_mask[LWP] is 0.
> I wasn't talking about determining whether LWP state is dirty, but
> much rather about LWP not being in use at all.
>
>> Yes, clts() and stts() don't have to be called every time. How about
>> this one?
>>
>> /* Restore FPU state whenever VCPU is scheduled in. */
>> void vcpu_restore_fpu_eager(struct vcpu *v)
>> {
>>     ASSERT(!is_idle_vcpu(v));
>>
>>     /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
>>     if ( xsave_enabled(v) )
>>     {
>>         /* Avoid recursion */
>>         clts();
>>         fpu_xrstor(v, XSTATE_NONLAZY);
>>         stts();
>>     }
>> }
> That's certainly better, but I'd still like to see the xsave_enabled()
> check to be replaced by some form of lwp_enabled() or
> lazy_xsave_needed() or some such (which will at once exclude all
> pv guests until you care to add support for them).
>
> Jan

[-- Attachment #2: nonlazy_dirty.txt --]
[-- Type: text/plain, Size: 2594 bytes --]

diff -r 343470b5ad6b xen/arch/x86/hvm/svm/svm.c
--- a/xen/arch/x86/hvm/svm/svm.c Tue May 03 14:05:28 2011 -0500
+++ b/xen/arch/x86/hvm/svm/svm.c Thu May 05 16:40:41 2011 -0500
@@ -716,7 +716,7 @@
     unsigned int eax, ebx, ecx, edx;
     uint32_t msr_low;
 
-    if ( cpu_has_lwp )
+    if ( xsave_enabled(v) && cpu_has_lwp )
     {
         hvm_cpuid(0x8000001c, &eax, &ebx, &ecx, &edx);
         msr_low = (uint32_t)msr_content;
@@ -729,6 +729,9 @@
         /* CPU might automatically correct reserved bits. So read it back. */
         rdmsrl(MSR_AMD64_LWP_CFG, msr_content);
         v->arch.hvm_svm.guest_lwp_cfg = msr_content;
+
+        /* track nonlazy state if LWP_CFG is non-zero. */
+        v->arch.nonlazy_xstate_dirty = !!(msr_content);
     }
 
     return 0;
diff -r 343470b5ad6b xen/arch/x86/i387.c
--- a/xen/arch/x86/i387.c Tue May 03 14:05:28 2011 -0500
+++ b/xen/arch/x86/i387.c Thu May 05 16:40:41 2011 -0500
@@ -98,13 +98,13 @@
 /*      FPU Save Functions     */
 /*******************************/
 /* Save x87 extended state */
-static inline void fpu_xsave(struct vcpu *v, uint64_t mask)
+static inline void fpu_xsave(struct vcpu *v)
 {
     /* XCR0 normally represents what guest OS set. In case of Xen itself,
      * we set all accumulated feature mask before doing save/restore. */
     set_xcr0(v->arch.xcr0_accum);
-    xsave(v, mask);
+    xsave(v, v->arch.nonlazy_xstate_dirty ? XSTATE_ALL : XSTATE_LAZY);
     set_xcr0(v->arch.xcr0);
 }
 
@@ -164,15 +164,15 @@
 void vcpu_restore_fpu_eager(struct vcpu *v)
 {
     ASSERT(!is_idle_vcpu(v));
-
-    /* Avoid recursion */
-    clts();
 
     /* Restore the nonlazy extended state which is not tracked by CR0.TS. */
-    if ( xsave_enabled(v) )
+    if ( xsave_enabled(v) && v->arch.nonlazy_xstate_dirty )
+    {
+        /* Avoid recursion */
+        clts();
         fpu_xrstor(v, XSTATE_NONLAZY);
-
-    stts();
+        stts();
+    }
 }
 
@@ -219,7 +219,7 @@
     clts();
 
     if ( xsave_enabled(v) )
-        fpu_xsave(v, XSTATE_ALL);
+        fpu_xsave(v);
     else if ( cpu_has_fxsr )
         fpu_fxsave(v);
     else
diff -r 343470b5ad6b xen/include/asm-x86/domain.h
--- a/xen/include/asm-x86/domain.h Tue May 03 14:05:28 2011 -0500
+++ b/xen/include/asm-x86/domain.h Thu May 05 16:40:41 2011 -0500
@@ -492,6 +492,9 @@
      * it explicitly enables it via xcr0.
      */
     uint64_t xcr0_accum;
+    /* This variable determines whether nonlazy extended state is dirty and
+     * needs to be tracked. */
+    bool_t nonlazy_xstate_dirty;
 
     struct paging_vcpu paging;
* Re: Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
2011-05-05 21:41 ` Wei Huang
@ 2011-05-06 7:49 ` Jan Beulich
0 siblings, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2011-05-06 7:49 UTC (permalink / raw)
To: Wei Huang; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> On 05.05.11 at 23:41, Wei Huang <wei.huang2@amd.com> wrote:
> Hi Jan,
>
> If we want to make LWP restore optional in vcpu_restore_fpu_eager(), we
> have to change vcpu_save_fpu() as well. Otherwise, the extended state
> will become inconsistent for non-LWP VCPUs (because save and restore are
> asymmetric). There are two approaches:
>
> 1. In vcpu_save_fpu(), clean the physical CPU's extended state for the
> VCPU which is being scheduled out. This prevents messy state from
> causing problems. The disadvantage is the cleaning cost, which would
> outweigh the benefits.

Cleaning cost? Wasn't it that one can express to default-initialize
fields during xrstor (which, if indeed expensive, you'd want to trigger
only if you know the physical CPU's state is dirty, i.e. in this case
requiring a per-CPU variable that gets evaluated and updated on
context restore).

> 2. Add a new variable in VCPU to track whether nonlazy state is dirty. I
> think this is better. See the attached file.
>
> Let me know if it is what you want. After that, I will re-spin the
> patches.

Yes, this looks like what I meant. Two suggestions: The new field's
name (nonlazy_xstate_dirty) would perhaps better be something like
nonlazy_xstate_used, so that name and use are in sync. And the check
in vcpu_restore_fpu_eager() probably doesn't need to re-evaluate
xsave_enabled(v), since the flag can't get set without this (if you
absolutely want to, put in an ASSERT() to this effect).

Jan
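
(Folding both of Jan's suggestions back into Wei's draft gives roughly
the following shape — a sketch of where the thread converges, not
necessarily the changeset as finally committed:)

    /* Restore FPU state whenever VCPU is scheduled in. */
    void vcpu_restore_fpu_eager(struct vcpu *v)
    {
        ASSERT(!is_idle_vcpu(v));

        /* Restore nonlazy extended state not tracked by CR0.TS. */
        if ( v->arch.nonlazy_xstate_used )
        {
            /* The flag is only ever set when xsave is enabled. */
            ASSERT(xsave_enabled(v));
            clts();                     /* avoid #NM recursion */
            fpu_xrstor(v, XSTATE_NONLAZY);
            stts();
        }
    }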