* NMI deferral on i386
@ 2007-05-15 14:46 Jan Beulich
  2007-05-15 15:00 ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2007-05-15 14:46 UTC (permalink / raw)
  To: xen-devel

Looking at cloning this logic to more properly support MCE, I see two issues
with this code:

- by using iret, the NMI is being acknowledged to the CPU, and since nothing
  was done to address its reason, I can't see why it shouldn't re-trigger right
  after that iret (unless it was sent as an IPI)
- by re-issuing it on vector 31, the resulting interrupt will have lower priority
  than any external interrupt, hence all pending interrupts will be serviced
  before getting to actually handle the NMI; ideally this should use the highest
  possible vector, but since priorities are grouped anyway, at least allocating
  the vector from the high priority pool would seem necessary
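
The priority argument in the second point can be sketched in C. This is a minimal illustration, assuming only the architectural rule that the local APIC takes an interrupt's priority class from the upper four bits of its vector. The helper names are hypothetical and are not taken from the Xen source.

```c
#include <assert.h>
#include <stdint.h>

/* The local APIC derives an interrupt's priority class from the upper
 * four bits of its vector. */
static unsigned int apic_priority_class(uint8_t vector)
{
    return vector >> 4;
}

/* Hypothetical helper: can a pending interrupt on new_vec preempt a
 * handler that is in service on cur_vec? Only a strictly higher
 * priority class can. */
static int can_preempt(uint8_t new_vec, uint8_t cur_vec)
{
    return apic_priority_class(new_vec) > apic_priority_class(cur_vec);
}

/* Hypothetical helper: of two pending vectors, the numerically higher
 * one is delivered first. */
static uint8_t delivered_first(uint8_t a, uint8_t b)
{
    return a > b ? a : b;
}
```

Vector 31 falls into class 1, whereas external interrupts use vectors from 0x20 upwards (class 2 and above), so a deferred NMI on vector 31 loses against every pending external interrupt. A vector such as 0xF0 (used here purely as an example value) would sit in the highest class.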

Thanks, Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NMI deferral on i386
  2007-05-15 14:46 NMI deferral on i386 Jan Beulich
@ 2007-05-15 15:00 ` Keir Fraser
  2007-05-16  8:17   ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2007-05-15 15:00 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 15/5/07 15:46, "Jan Beulich" <jbeulich@novell.com> wrote:

> - by using iret, the NMI is being acknowledged to the CPU, and since nothing
>   was done to address its reason, I can't see why it shouldn't re-trigger
>   right after that iret (unless it was sent as an IPI)

Yes, it's good enough for watchdog and oprofile. Level-triggered external
NMIs will of course be a problem. We could possibly work around this by
masking LINT1 if we are CPU0 (and, of course, if LAPIC is enabled) and then
unmasking only at the end of real NMI handler. And of course x86/64 doesn't
have this problem at all, and practically speaking is pretty much the only
hypervisor build that vendors seem to care about.
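
A rough sketch of that LINT1 workaround, assuming a plain struct as a stand-in for the memory-mapped LVT LINT1 register. The struct and function names are hypothetical; only the position of the LVT mask bit (bit 16) is architectural.

```c
#include <assert.h>
#include <stdint.h>

#define LVT_MASKED (1u << 16)   /* mask bit of a local APIC LVT entry */

/* Hypothetical stand-in for the memory-mapped LVT LINT1 register. */
struct fake_lapic {
    uint32_t lvt_lint1;
};

/* Early NMI entry: mask LINT1 before deferring, so that a
 * level-triggered external NMI cannot re-trigger right after iret. */
static void nmi_entry_defer(struct fake_lapic *apic, unsigned int cpu,
                            int lapic_enabled)
{
    assert(apic != 0);
    if (cpu == 0 && lapic_enabled)
        apic->lvt_lint1 |= LVT_MASKED;
    /* ... the deferred handler would be queued here ... */
}

/* End of the real NMI handler: the source has been dealt with, so
 * LINT1 can be unmasked again. */
static void nmi_handler_done(struct fake_lapic *apic, unsigned int cpu,
                             int lapic_enabled)
{
    assert(apic != 0);
    if (cpu == 0 && lapic_enabled)
        apic->lvt_lint1 &= ~LVT_MASKED;
}
```

Only the mask bit is touched, so the delivery mode programmed into the entry (0x400 for NMI) is preserved across the mask/unmask pair.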

> - by re-issuing it on vector 31, the resulting interrupt will have lower
>   priority than any external interrupt, hence all pending interrupts will be
>   serviced before getting to actually handle the NMI; ideally this should use
>   the highest possible vector, but since priorities are grouped anyway, at
>   least allocating the vector from the high priority pool would seem necessary

Yes, this is true.

 -- Keir

* Re: NMI deferral on i386
  2007-05-15 15:00 ` Keir Fraser
@ 2007-05-16  8:17   ` Jan Beulich
  2007-05-16  8:28     ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2007-05-16  8:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <keir@xensource.com> 15.05.07 17:00 >>>
>On 15/5/07 15:46, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> - by using iret, the NMI is being acknowledged to the CPU, and since nothing
>>   was done to address its reason, I can't see why it shouldn't re-trigger
>>   right after that iret (unless it was sent as an IPI)
>
>Yes, it's good enough for watchdog and oprofile. Level-triggered external
>NMIs will of course be a problem. We could possibly work around this by
>masking LINT1 if we are CPU0 (and, of course, if LAPIC is enabled) and then
>unmasking only at the end of real NMI handler. And of course x86/64 doesn't
>have this problem at all, and practically speaking is pretty much the only
>hypervisor build that vendors seem to care about.

What if we removed the deferral altogether and made the NMI handler store
into the outermost frame (after all, selector registers have fixed places on
that frame), marking that frame accordingly so that the interrupted save
sequence can avoid overwriting the values saved this way? This would be
necessary only if both %ds and %es are neither __HYPERVISOR_DS nor null
(neatly avoiding special-casing the vm86 mode entry in the outer frame). It
would add an extra branch to __SAVE_ALL_PRE and split the selector register
stores into three steps: moving %ds and %es into general purpose registers,
testing the flag that NMI or MCE handlers may set, and storing the GPRs into
the frame if the flag was clear.
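
The idea can be modelled in C as a sketch. It assumes a simplified frame holding only %ds and %es, a hypothetical flag field, and a placeholder selector value. It also reads the "both ... neither" condition literally, which is one interpretation of the proposal rather than a settled design.

```c
#include <assert.h>
#include <stdint.h>

#define HYPERVISOR_DS 0xe010u   /* placeholder selector value for this sketch */

struct outer_frame {
    uint16_t ds, es;
    int segs_saved_early;       /* hypothetical flag set by NMI/MCE handlers */
};

static int is_guest_selector(uint16_t sel)
{
    return sel != HYPERVISOR_DS && sel != 0;
}

/* NMI/MCE handler side: if both live selectors still hold guest values,
 * store them into the outermost frame and mark the frame.
 * Returns 1 if the values were stored. */
static int nmi_store_segs(struct outer_frame *f, uint16_t ds, uint16_t es)
{
    assert(f != 0);
    if (!(is_guest_selector(ds) && is_guest_selector(es)))
        return 0;
    f->ds = ds;
    f->es = es;
    f->segs_saved_early = 1;
    return 1;
}

/* Interrupted save sequence side: move the selectors into temporaries,
 * test the flag, and store only if the flag is clear. */
static void save_segs(struct outer_frame *f, uint16_t live_ds, uint16_t live_es)
{
    uint16_t ds = live_ds;      /* movl %ds,<gpr> */
    uint16_t es = live_es;      /* movl %es,<gpr> */

    assert(f != 0);
    if (f->segs_saved_early)
        return;                 /* keep what the NMI/MCE handler saved */
    f->ds = ds;
    f->es = es;
}
```

If the NMI arrives before the save sequence has stored the selectors, the handler records the guest values and the resumed save sequence, which now sees the hypervisor selectors, leaves them alone. Without an intervening NMI the save sequence stores the live values as before.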

Jan

* Re: NMI deferral on i386
  2007-05-16  8:17   ` Jan Beulich
@ 2007-05-16  8:28     ` Keir Fraser
  2007-05-16 10:10       ` Jan Beulich
  2007-05-21 14:01       ` Jan Beulich
  0 siblings, 2 replies; 11+ messages in thread
From: Keir Fraser @ 2007-05-16  8:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 16/5/07 09:17, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Yes, it's good enough for watchdog and oprofile. Level-triggered external
>> NMIs will of course be a problem. We could possibly work around this by
>> masking LINT1 if we are CPU0 (and, of course, if LAPIC is enabled) and then
>> unmasking only at the end of real NMI handler. And of course x86/64 doesn't
>> have this problem at all, and practically speaking is pretty much the only
>> hypervisor build that vendors seem to care about.
> 
> What if we removed the deferral altogether, and made the NMI handler
> store into the outermost frame (after all, selector registers have fixed
> places on that frame), marking that frame accordingly so that
> overwriting the values saved this way can be avoided in the
> interrupted save sequence (would be necessary only if both %ds and
> %es are neither __HYPERVISOR_DS nor null [neatly avoiding special
> casing the vm86 mode entry in the outer frame], and would add an extra
> branch to __SAVE_ALL_PRE plus splitting the selector register stores
> into moving %ds and %es into general purpose registers, testing the
> flag NMI or MCE handlers may set, and storing the GPRs into the frame
> if the flag was clear).

It sounds a bit painful. Also it's the exit-to-guest path that is more of a
pain to deal with. In this case we may have restored a segment register by
the time we take the NMI. What do we do in this case about restoring the
segment register safely? Races in updating GDT/LDT may mean that the reload
still may fault, even though it didn't just before; also we may need to do
work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to fix
up a #GP or #PF.

 -- Keir

* Re: NMI deferral on i386
  2007-05-16  8:28     ` Keir Fraser
@ 2007-05-16 10:10       ` Jan Beulich
  2007-05-16 12:32         ` Keir Fraser
  2007-05-21 14:01       ` Jan Beulich
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2007-05-16 10:10 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>It sounds a bit painful. Also it's the exit-to-guest path that is more of a
>pain to deal with. In this case we may have restored a segment register by
>the time we take the NMI. What do we do in this case about restoring the
>segment register safely? Races in updating GDT/LDT may mean that the reload
>still may fault, even though it didn't just before; also we may need to do
>work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to fix
>up a #GP or #PF.

Indeed. Nevertheless, for non-restartable MCEs, deferral is impossible, and
hence some mechanism would still be needed (unless we say the machine's
going down anyway in this case and we don't care about getting a proper
reason logged).
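
Whether a machine check is restartable is reported by the RIPV bit of the IA32_MCG_STATUS MSR. A minimal sketch of the resulting decision follows. The bit positions are architectural, while the enum and function names are hypothetical and the MSR value is passed in rather than read with rdmsr.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ull << 0)   /* restart IP valid */
#define MCG_STATUS_EIPV (1ull << 1)   /* error IP valid */
#define MCG_STATUS_MCIP (1ull << 2)   /* machine check in progress */

enum mce_action {
    MCE_MAY_DEFER,      /* saved CS:EIP is valid, iret and deferral possible */
    MCE_HANDLE_NOW      /* no valid restart point, must be handled in place */
};

/* Hypothetical helper: classify a machine check from the MCG_STATUS value. */
static enum mce_action mce_classify(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_RIPV) ? MCE_MAY_DEFER : MCE_HANDLE_NOW;
}
```

For example, `mce_classify(MCG_STATUS_MCIP)` yields `MCE_HANDLE_NOW`, which is the case that any deferral scheme cannot cover.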

Jan

* Re: NMI deferral on i386
  2007-05-16 10:10       ` Jan Beulich
@ 2007-05-16 12:32         ` Keir Fraser
  2007-05-16 14:19           ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2007-05-16 12:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 16/5/07 11:10, "Jan Beulich" <jbeulich@novell.com> wrote:

>> It sounds a bit painful. Also it's the exit-to-guest path that is more of a
>> pain to deal with. In this case we may have restored a segment register by
>> the time we take the NMI. What do we do in this case about restoring the
>> segment register safely? Races in updating GDT/LDT may mean that the reload
>> still may fault, even though it didn't just before; also we may need to do
>> work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to fix
>> up a #GP or #PF.
> 
> Indeed. Nevertheless, for non-restartable MCEs, deferral is impossible, and
> hence some mechanism would still be needed (unless we say the machine's
> going down anyway in this case and we don't care about getting a proper
> reason logged).

You mean, like with a #DF, that sometimes a #MC may have bogus CS:EIP and so
you cannot IRET from it? I'm not sure how much we care about losing these
and turning a possibly-informative crash into an ugly and confusing crash.
Personally I've never seen a #MC or had one reported to me, restartable or
not. Maybe I'm lucky. :-)

 -- Keir

* Re: NMI deferral on i386
  2007-05-16 12:32         ` Keir Fraser
@ 2007-05-16 14:19           ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2007-05-16 14:19 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <keir@xensource.com> 16.05.07 14:32 >>>
>On 16/5/07 11:10, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> It sounds a bit painful. Also it's the exit-to-guest path that is more of a
>>> pain to deal with. In this case we may have restored a segment register by
>>> the time we take the NMI. What do we do in this case about restoring the
>>> segment register safely? Races in updating GDT/LDT may mean that the reload
>>> still may fault, even though it didn't just before; also we may need to do
>>> work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to fix
>>> up a #GP or #PF.
>> 
>> Indeed. Nevertheless, for non-restartable MCEs, deferral is impossible, and
>> hence some mechanism would still be needed (unless we say the machine's
>> going down anyway in this case and we don't care about getting a proper
>> reason logged).
>
>You mean, like with a #DF, that sometimes a #MC may have bogus CS:EIP and so
>you cannot IRET from it? I'm not sure how much we care about losing these
>and turning a possibly-informative crash into an ugly and confusing crash.

Yes, that's what I mean. And I'm not so much concerned about turning a (very
rare) 'nice' crash into an 'ugly' one, but more about the fact that until the
system actually crashes it may continue to execute for a short while, possibly
making the data corruption situation worse.

>Personally I've never seen a #MC or had one reported to me, restartable or
>not. Maybe I'm lucky. :-)

I did see quite a few non-restartable ones (on native Linux), and it took me
some time to actually get the system into a state where I could see the
related messages before it rebooted. I don't have that system anymore,
though, otherwise I might have been able to use it for testing purposes
here.

Jan

* Re: NMI deferral on i386
  2007-05-16  8:28     ` Keir Fraser
  2007-05-16 10:10       ` Jan Beulich
@ 2007-05-21 14:01       ` Jan Beulich
  2007-05-21 14:17         ` Keir Fraser
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2007-05-21 14:01 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2222 bytes --]

>>> Keir Fraser <keir@xensource.com> 16.05.07 10:28 >>>
>On 16/5/07 09:17, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> Yes, it's good enough for watchdog and oprofile. Level-triggered external
>>> NMIs will of course be a problem. We could possibly work around this by
>>> masking LINT1 if we are CPU0 (and, of course, if LAPIC is enabled) and then
>>> unmasking only at the end of real NMI handler. And of course x86/64 doesn't
>>> have this problem at all, and practically speaking is pretty much the only
>>> hypervisor build that vendors seem to care about.
>> 
>> What if we removed the deferral altogether, and made the NMI handler
>> store into the outermost frame (after all, selector registers have fixed
>> places on that frame), marking that frame accordingly so that
>> overwriting the values saved this way can be avoided in the
>> interrupted save sequence (would be necessary only if both %ds and
>> %es are neither __HYPERVISOR_DS nor null [neatly avoiding special
>> casing the vm86 mode entry in the outer frame], and would add an extra
>> branch to __SAVE_ALL_PRE plus splitting the selector register stores
>> into moving %ds and %es into general purpose registers, testing the
>> flag NMI or MCE handlers may set, and storing the GPRs into the frame
>> if the flag was clear).
>
>It sounds a bit painful. Also it's the exit-to-guest path that is more of a
>pain to deal with. In this case we may have restored a segment register by
>the time we take the NMI. What do we do in this case about restoring the
>segment register safely? Races in updating GDT/LDT may mean that the reload
>still may fault, even though it didn't just before; also we may need to do
>work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to fix
>up a #GP or #PF.

I think I found a pretty reasonable solution, which I'm attaching in its current
(3.1-based) form, together with the prior (untested) variant that copied the
NMI behavior. Even if it doesn't apply to -unstable, I'd be glad if you could
have a brief look to see whether you consider the approach too intrusive (in
which case there would be no point in trying to bring it forward to -unstable).
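
For orientation, the entry-path decision made by the attached entry.S changes can be modelled in C roughly as follows. This is a sketch under stated assumptions: the HYPERVISOR_DS value is a placeholder, the function names are hypothetical, and the real code is assembly that runs before any segment register may be touched.

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_VM      0x00020000u
#define APIC_DM_FIXED      0x00000u
#define APIC_DEST_PHYSICAL 0x00000u
#define APIC_DEST_SELF     0x40000u
#define HYPERVISOR_DS      0xe010u    /* placeholder selector value */

/* Ring != 0 or vm86 mode: the exception interrupted guest code. */
static int interrupted_guest(uint16_t cs, uint32_t eflags)
{
    return (cs & 3) != 0 || (eflags & X86_EFLAGS_VM) != 0;
}

/* The MCE entry path additionally accepts a null selector (null_ok != 0);
 * the NMI entry path only accepts the hypervisor selector. */
static int seg_is_safe(uint16_t sel, int null_ok)
{
    return sel == HYPERVISOR_DS || (null_ok && sel == 0);
}

/* Returns 1 if the handler must be deferred through a self-IPI. */
static int must_defer(uint16_t cs, uint32_t eflags,
                      uint16_t ds, uint16_t es, int null_ok)
{
    if (interrupted_guest(cs, eflags))
        return 0;               /* handle directly */
    return !seg_is_safe(ds, null_ok) || !seg_is_safe(es, null_ok);
}

/* ICR value for a fixed-mode self-IPI on the given deferral vector. */
static uint32_t self_ipi_icr(uint8_t vector)
{
    return APIC_DM_FIXED | APIC_DEST_SELF | APIC_DEST_PHYSICAL | vector;
}
```

Deferral thus only happens when Xen itself was interrupted while %ds or %es still held a guest selector. The deferral vector is a run-time value taken from the high priority pool, which is why the ICR value is composed at run time instead of using a compile-time constant.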

Jan


[-- Attachment #2: x86-machine-check.patch.0 --]
[-- Type: application/octet-stream, Size: 27006 bytes --]

Index: 2007-05-14/xen/arch/x86/cpu/mcheck/k7.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/k7.c	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/k7.c	2007-05-15 15:56:00.000000000 +0200
@@ -16,7 +16,7 @@
 #include "mce.h"
 
 /* Machine Check Handler For AMD Athlon/Duron */
-static fastcall void k7_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void k7_machine_check(struct cpu_user_regs * regs)
 {
 	int recover=1;
 	u32 alow, ahigh, high, low;
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/mce.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/mce.c	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/mce.c	2007-05-15 15:56:16.000000000 +0200
@@ -18,13 +18,13 @@ int mce_disabled = 0;
 int nr_mce_banks;
 
 /* Handle unconfigured int18 (should never happen) */
-static fastcall void unexpected_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void unexpected_machine_check(struct cpu_user_regs * regs)
 {	
 	printk(KERN_ERR "CPU#%d: Unexpected int18 (Machine Check).\n", smp_processor_id());
 }
 
 /* Call the installed machine check handler for this CPU setup. */
-void fastcall (*machine_check_vector)(struct cpu_user_regs *, long error_code) = unexpected_machine_check;
+void fastcall (*machine_check_vector)(struct cpu_user_regs *) = unexpected_machine_check;
 
 /* This has to be run for each processor */
 void mcheck_init(struct cpuinfo_x86 *c)
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/mce.h
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/mce.h	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/mce.h	2007-05-15 17:34:43.000000000 +0200
@@ -1,4 +1,5 @@
 #include <xen/init.h>
+#include <asm/processor.h>
 
 void amd_mcheck_init(struct cpuinfo_x86 *c);
 void intel_p4_mcheck_init(struct cpuinfo_x86 *c);
@@ -6,9 +7,6 @@ void intel_p5_mcheck_init(struct cpuinfo
 void intel_p6_mcheck_init(struct cpuinfo_x86 *c);
 void winchip_mcheck_init(struct cpuinfo_x86 *c);
 
-/* Call the installed machine check handler for this CPU setup. */
-extern fastcall void (*machine_check_vector)(struct cpu_user_regs *, long error_code);
-
 extern int mce_disabled __initdata;
 extern int nr_mce_banks;
 
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/p4.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/p4.c	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/p4.c	2007-05-15 15:56:31.000000000 +0200
@@ -158,7 +158,7 @@ done:
 	return mce_num_extended_msrs;
 }
 
-static fastcall void intel_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void intel_machine_check(struct cpu_user_regs * regs)
 {
 	int recover=1;
 	u32 alow, ahigh, high, low;
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/p5.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/p5.c	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/p5.c	2007-05-15 15:56:39.000000000 +0200
@@ -15,7 +15,7 @@
 #include "mce.h"
 
 /* Machine check handler for Pentium class Intel */
-static fastcall void pentium_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void pentium_machine_check(struct cpu_user_regs * regs)
 {
 	u32 loaddr, hi, lotype;
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/p6.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/p6.c	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/p6.c	2007-05-15 15:56:43.000000000 +0200
@@ -15,7 +15,7 @@
 #include "mce.h"
 
 /* Machine Check Handler For PII/PIII */
-static fastcall void intel_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void intel_machine_check(struct cpu_user_regs * regs)
 {
 	int recover=1;
 	u32 alow, ahigh, high, low;
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/winchip.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/winchip.c	2007-04-23 10:01:40.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/winchip.c	2007-05-15 15:56:48.000000000 +0200
@@ -16,7 +16,7 @@
 #include "mce.h"
 
 /* Machine check handler for WinChip C6 */
-static fastcall void winchip_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void winchip_machine_check(struct cpu_user_regs * regs)
 {
 	printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK);
Index: 2007-05-14/xen/arch/x86/hvm/svm/svm.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/svm/svm.c	2007-05-14 14:33:28.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/svm/svm.c	2007-05-15 17:57:20.000000000 +0200
@@ -407,7 +407,7 @@ int svm_vmcb_restore(struct vcpu *v, str
     }
 
  skip_cr3:
-    vmcb->cr4 = c->cr4 | SVM_CR4_HOST_MASK;
+    vmcb->cr4 = c->cr4 | HVM_CR4_HOST_MASK;
     v->arch.hvm_svm.cpu_shadow_cr4 = c->cr4;
     
     vmcb->idtr.limit = c->idtr_limit;
@@ -464,7 +464,8 @@ int svm_vmcb_restore(struct vcpu *v, str
     /* update VMCB for nested paging restore */
     if ( paging_mode_hap(v->domain) ) {
         vmcb->cr0 = v->arch.hvm_svm.cpu_shadow_cr0;
-        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4;
+        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4 |
+                    (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
         vmcb->cr3 = c->cr3;
         vmcb->np_enable = 1;
         vmcb->g_pat = 0x0007040600070406ULL; /* guest PAT */
@@ -1731,9 +1732,19 @@ static int mov_to_cr(int gpreg, int cr, 
         break;
 
     case 4: /* CR4 */
+        if ( value & ~mmu_cr4_features )
+        {
+            HVM_DBG_LOG(DBG_LEVEL_1, "Guest attempts to enable unsupported "
+                        "CR4 features %lx (host %lx)",
+                        value, mmu_cr4_features);
+            svm_inject_exception(v, TRAP_gp_fault, 1, 0);
+            break;
+        }
+
         if ( paging_mode_hap(v->domain) )
         {
-            vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4 = value;
+            v->arch.hvm_svm.cpu_shadow_cr4 = value;
+            vmcb->cr4 = value | (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
             paging_update_paging_modes(v);
             break;
         }
@@ -1779,7 +1790,7 @@ static int mov_to_cr(int gpreg, int cr, 
         }
 
         v->arch.hvm_svm.cpu_shadow_cr4 = value;
-        vmcb->cr4 = value | SVM_CR4_HOST_MASK;
+        vmcb->cr4 = value | HVM_CR4_HOST_MASK;
   
         /*
          * Writing to CR4 to modify the PSE, PGE, or PAE flag invalidates
@@ -2141,12 +2152,13 @@ static int svm_reset_to_realmode(struct 
     vmcb->cr2 = 0;
     vmcb->efer = EFER_SVME;
 
-    vmcb->cr4 = SVM_CR4_HOST_MASK;
+    vmcb->cr4 = HVM_CR4_HOST_MASK;
     v->arch.hvm_svm.cpu_shadow_cr4 = 0;
 
     if ( paging_mode_hap(v->domain) ) {
         vmcb->cr0 = v->arch.hvm_svm.cpu_shadow_cr0;
-        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4;
+        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4 |
+                    (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
     }
 
     /* This will jump to ROMBIOS */
@@ -2287,6 +2299,12 @@ asmlinkage void svm_vmexit_handler(struc
         break;
     }
 
+    case VMEXIT_EXCEPTION_MC:
+        HVMTRACE_0D(MCE, v);
+        svm_store_cpu_guest_regs(v, regs, NULL);
+        machine_check_vector(regs);
+        break;
+
     case VMEXIT_VINTR:
         vmcb->vintr.fields.irq = 0;
         vmcb->general1_intercepts &= ~GENERAL1_INTERCEPT_VINTR;
Index: 2007-05-14/xen/arch/x86/hvm/svm/vmcb.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/svm/vmcb.c	2007-04-23 10:01:41.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/svm/vmcb.c	2007-05-15 17:28:40.000000000 +0200
@@ -225,7 +225,7 @@ static int construct_vmcb(struct vcpu *v
     /* Guest CR4. */
     arch_svm->cpu_shadow_cr4 =
         read_cr4() & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE);
-    vmcb->cr4 = arch_svm->cpu_shadow_cr4 | SVM_CR4_HOST_MASK;
+    vmcb->cr4 = arch_svm->cpu_shadow_cr4 | HVM_CR4_HOST_MASK;
 
     paging_update_paging_modes(v);
     vmcb->cr3 = v->arch.hvm_vcpu.hw_cr3; 
@@ -236,11 +236,13 @@ static int construct_vmcb(struct vcpu *v
         vmcb->np_enable = 1; /* enable nested paging */
         vmcb->g_pat = 0x0007040600070406ULL; /* guest PAT */
         vmcb->h_cr3 = pagetable_get_paddr(v->domain->arch.phys_table);
-        vmcb->cr4 = arch_svm->cpu_shadow_cr4 = 0;
+        vmcb->cr4 = arch_svm->cpu_shadow_cr4 =
+                    (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
+        vmcb->exception_intercepts = HVM_TRAP_MASK;
     }
     else
     {
-        vmcb->exception_intercepts = 1U << TRAP_page_fault;
+        vmcb->exception_intercepts = HVM_TRAP_MASK | (1U << TRAP_page_fault);
     }
 
     return 0;
Index: 2007-05-14/xen/arch/x86/hvm/vmx/vmcs.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/vmx/vmcs.c	2007-05-14 14:40:20.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/vmx/vmcs.c	2007-05-15 17:28:58.000000000 +0200
@@ -400,7 +400,7 @@ static void construct_vmcs(struct vcpu *
     __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
 #endif
 
-    __vmwrite(EXCEPTION_BITMAP, 1U << TRAP_page_fault);
+    __vmwrite(EXCEPTION_BITMAP, HVM_TRAP_MASK | (1U << TRAP_page_fault));
 
     /* Guest CR0. */
     cr0 = read_cr0();
Index: 2007-05-14/xen/arch/x86/hvm/vmx/vmx.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/vmx/vmx.c	2007-05-14 14:33:28.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/vmx/vmx.c	2007-05-15 17:56:59.000000000 +0200
@@ -600,7 +600,7 @@ int vmx_vmcs_restore(struct vcpu *v, str
     }
 #endif
 
-    __vmwrite(GUEST_CR4, (c->cr4 | VMX_CR4_HOST_MASK));
+    __vmwrite(GUEST_CR4, (c->cr4 | HVM_CR4_HOST_MASK));
     v->arch.hvm_vmx.cpu_shadow_cr4 = c->cr4;
     __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vmx.cpu_shadow_cr4);
 
@@ -1886,7 +1886,7 @@ static int vmx_world_restore(struct vcpu
     else
         HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %x", c->cr3);
 
-    __vmwrite(GUEST_CR4, (c->cr4 | VMX_CR4_HOST_MASK));
+    __vmwrite(GUEST_CR4, (c->cr4 | HVM_CR4_HOST_MASK));
     v->arch.hvm_vmx.cpu_shadow_cr4 = c->cr4;
     __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vmx.cpu_shadow_cr4);
 
@@ -2275,6 +2275,14 @@ static int mov_to_cr(int gp, int cr, str
     case 4: /* CR4 */
         old_cr = v->arch.hvm_vmx.cpu_shadow_cr4;
 
+        if ( value & ~mmu_cr4_features )
+        {
+            HVM_DBG_LOG(DBG_LEVEL_1, "Guest attempts to enable unsupported "
+                        "CR4 features %lx (host %lx)",
+                        value, mmu_cr4_features);
+            vmx_inject_hw_exception(v, TRAP_gp_fault, 0);
+            break;
+        }
         if ( (value & X86_CR4_PAE) && !(old_cr & X86_CR4_PAE) )
         {
             if ( vmx_pgbit_test(v) )
@@ -2315,7 +2323,7 @@ static int mov_to_cr(int gp, int cr, str
             }
         }
 
-        __vmwrite(GUEST_CR4, value| VMX_CR4_HOST_MASK);
+        __vmwrite(GUEST_CR4, value | HVM_CR4_HOST_MASK);
         v->arch.hvm_vmx.cpu_shadow_cr4 = value;
         __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vmx.cpu_shadow_cr4);
 
@@ -2623,7 +2631,8 @@ static void vmx_reflect_exception(struct
     }
 }
 
-static void vmx_failed_vmentry(unsigned int exit_reason)
+static void vmx_failed_vmentry(unsigned int exit_reason,
+                               struct cpu_user_regs *regs)
 {
     unsigned int failed_vmentry_reason = (uint16_t)exit_reason;
     unsigned long exit_qualification;
@@ -2640,6 +2649,9 @@ static void vmx_failed_vmentry(unsigned 
         break;
     case EXIT_REASON_MACHINE_CHECK:
         printk("caused by machine check.\n");
+        HVMTRACE_0D(MCE, current);
+        hvm_store_cpu_guest_regs(current, regs, NULL);
+        machine_check_vector(regs);
         break;
     default:
         printk("reason not known yet!");
@@ -2665,11 +2677,12 @@ asmlinkage void vmx_vmexit_handler(struc
 
     perfc_incra(vmexits, exit_reason);
 
-    if ( exit_reason != EXIT_REASON_EXTERNAL_INTERRUPT )
-        local_irq_enable();
-
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
-        return vmx_failed_vmentry(exit_reason);
+        return vmx_failed_vmentry(exit_reason, regs);
+
+    if ( exit_reason != EXIT_REASON_EXTERNAL_INTERRUPT &&
+         exit_reason != EXIT_REASON_EXCEPTION_NMI )
+        local_irq_enable();
 
     switch ( exit_reason )
     {
@@ -2689,6 +2702,9 @@ asmlinkage void vmx_vmexit_handler(struc
 
         perfc_incra(cause_vector, vector);
 
+        if ( vector != TRAP_nmi && vector != TRAP_machine_check )
+            local_irq_enable();
+
         switch ( vector )
         {
         case TRAP_debug:
@@ -2726,6 +2742,11 @@ asmlinkage void vmx_vmexit_handler(struc
             else
                 vmx_reflect_exception(v);
             break;
+        case TRAP_machine_check:
+            HVMTRACE_0D(MCE, v);
+            hvm_store_cpu_guest_regs(v, regs, NULL);
+            machine_check_vector(regs);
+            break;
         default:
             goto exit_and_crash;
         }
Index: 2007-05-14/xen/arch/x86/smpboot.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/smpboot.c	2007-04-23 10:01:42.000000000 +0200
+++ 2007-05-14/xen/arch/x86/smpboot.c	2007-05-16 12:20:05.000000000 +0200
@@ -1215,16 +1215,21 @@ void __init smp_cpus_done(unsigned int m
 #endif
 }
 
+#ifdef __i386__
+uint8_t nmi_deferral_vector, mce_deferral_vector;
+#endif
+
 void __init smp_intr_init(void)
 {
 	int irq, seridx;
+	unsigned int vector = FIRST_HIPRIORITY_VECTOR;
 
 	/*
 	 * IRQ0 must be given a fixed assignment and initialized,
 	 * because it's used before the IO-APIC is set up.
 	 */
-	irq_vector[0] = FIRST_HIPRIORITY_VECTOR;
-	vector_irq[FIRST_HIPRIORITY_VECTOR] = 0;
+	irq_vector[0] = vector;
+	vector_irq[vector++] = 0;
 
 	/*
 	 * Also ensure serial interrupts are high priority. We do not
@@ -1233,10 +1238,18 @@ void __init smp_intr_init(void)
 	for (seridx = 0; seridx < 2; seridx++) {
 		if ((irq = serial_irq(seridx)) < 0)
 			continue;
-		irq_vector[irq] = FIRST_HIPRIORITY_VECTOR + seridx + 1;
-		vector_irq[FIRST_HIPRIORITY_VECTOR + seridx + 1] = irq;
+		irq_vector[irq] = vector;
+		vector_irq[vector++] = irq;
 	}
 
+#ifdef __i386__
+	/* The same applies to NMI/MCE deferral interrupts. */
+	nmi_deferral_vector = vector++;
+	mce_deferral_vector = vector++;
+#endif
+
+	BUG_ON(vector > LAST_HIPRIORITY_VECTOR + 1);
+
 	/* IPI for event checking. */
 	set_intr_gate(EVENT_CHECK_VECTOR, event_check_interrupt);
 
Index: 2007-05-14/xen/arch/x86/traps.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/traps.c	2007-05-14 14:40:35.000000000 +0200
+++ 2007-05-14/xen/arch/x86/traps.c	2007-05-15 15:52:58.000000000 +0200
@@ -707,12 +707,6 @@ asmlinkage int do_int3(struct cpu_user_r
     return do_guest_trap(TRAP_int3, regs, 0);
 }
 
-asmlinkage int do_machine_check(struct cpu_user_regs *regs)
-{
-    fatal_trap(TRAP_machine_check, regs);
-    return 0;
-}
-
 void propagate_page_fault(unsigned long addr, u16 error_code)
 {
     struct trap_info *ti;
Index: 2007-05-14/xen/arch/x86/x86_32/entry.S
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_32/entry.S	2007-04-27 09:57:47.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_32/entry.S	2007-05-16 12:21:27.000000000 +0200
@@ -546,10 +546,6 @@ ENTRY(page_fault)
         movw  $TRAP_page_fault,2(%esp)
         jmp   handle_exception
 
-ENTRY(machine_check)
-        pushl $TRAP_machine_check<<16
-        jmp   handle_exception
-
 ENTRY(spurious_interrupt_bug)
         pushl $TRAP_spurious_int<<16
         jmp   handle_exception
@@ -583,12 +579,13 @@ ENTRY(nmi)
         movb  UREGS_cs(%esp),%al
         testl $(3|X86_EFLAGS_VM),%eax
         jnz   continue_nmi
+        movzbl %ss:nmi_deferral_vector,%edx
         movl  %ds,%eax
         cmpw  $(__HYPERVISOR_DS),%ax
-        jne   defer_nmi
+        jne   defer_nmi_mce
         movl  %es,%eax
         cmpw  $(__HYPERVISOR_DS),%ax
-        jne   defer_nmi
+        jne   defer_nmi_mce
 
 continue_nmi:
         SET_XEN_SEGMENTS(d)
@@ -597,16 +594,49 @@ continue_nmi:
         call  do_nmi
         addl  $4,%esp
         jmp   ret_from_intr
+#endif /* !CONFIG_X86_SUPERVISOR_MODE_KERNEL */
+
+ENTRY(machine_check)
+        # See NMI handler for explanations.
+#ifdef CONFIG_X86_SUPERVISOR_MODE_KERNEL
+        iret
+#else
+        pushl $TRAP_machine_check<<16
+        SAVE_ALL_NOSEGREGS(a)
+        movl  UREGS_eflags(%esp),%eax
+        movb  UREGS_cs(%esp),%al
+        testl $(3|X86_EFLAGS_VM),%eax
+        jnz   .Lcontinue_mce
+        movzbl %ss:mce_deferral_vector,%edx
+        movl  $__HYPERVISOR_DS,%ecx
+        movl  %ds,%eax
+        testw %ax,%ax
+        cmovzl %ecx,%eax
+        cmpw  %cx,%ax
+        jne   defer_nmi_mce
+        movl  %es,%eax
+        testw %ax,%ax
+        cmovzl %ecx,%eax
+        cmpw  %cx,%ax
+        jne   defer_nmi_mce
+
+.Lcontinue_mce:
+        SET_XEN_SEGMENTS(d)
+        movl  %esp,%eax
+        pushl %eax
+        call  *machine_check_vector
+        addl  $4,%esp
+        jmp   ret_from_intr
 
-defer_nmi:
+defer_nmi_mce:
         movl  $FIXMAP_apic_base,%eax
         # apic_wait_icr_idle()
 1:      movl  %ss:APIC_ICR(%eax),%ebx
         testl $APIC_ICR_BUSY,%ebx
         jnz   1b
-        # __send_IPI_shortcut(APIC_DEST_SELF, TRAP_deferred_nmi)
-        movl  $(APIC_DM_FIXED | APIC_DEST_SELF | APIC_DEST_PHYSICAL | \
-                TRAP_deferred_nmi),%ss:APIC_ICR(%eax)
+        # __send_IPI_shortcut(APIC_DEST_SELF, %edx)
+        orl   $(APIC_DM_FIXED | APIC_DEST_SELF | APIC_DEST_PHYSICAL),%edx
+        movl  %edx,%ss:APIC_ICR(%eax)
         jmp   restore_all_xen
 #endif /* !CONFIG_X86_SUPERVISOR_MODE_KERNEL */
 
@@ -644,7 +674,7 @@ ENTRY(exception_table)
         .long do_spurious_interrupt_bug
         .long do_coprocessor_error
         .long do_alignment_check
-        .long do_machine_check
+        .long 0 # machine_check
         .long do_simd_coprocessor_error
 
 ENTRY(hypercall_table)
Index: 2007-05-14/xen/arch/x86/x86_32/traps.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_32/traps.c	2007-04-23 10:01:42.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_32/traps.c	2007-05-16 12:29:20.000000000 +0200
@@ -236,14 +236,25 @@ unsigned long do_iret(void)
 }
 
 #include <asm/asm_defns.h>
-BUILD_SMP_INTERRUPT(deferred_nmi, TRAP_deferred_nmi)
+extern uint8_t nmi_deferral_vector, mce_deferral_vector;
+
+BUILD_SMP_INTERRUPT(deferred_nmi, 0)
 fastcall void smp_deferred_nmi(struct cpu_user_regs *regs)
 {
     asmlinkage void do_nmi(struct cpu_user_regs *);
     ack_APIC_irq();
+    regs->entry_vector = nmi_deferral_vector;
     do_nmi(regs);
 }
 
+BUILD_SMP_INTERRUPT(deferred_mce, 0)
+fastcall void smp_deferred_mce(struct cpu_user_regs *regs)
+{
+    ack_APIC_irq();
+    regs->entry_vector = mce_deferral_vector;
+    machine_check_vector(regs);
+}
+
 void __init percpu_traps_init(void)
 {
     struct tss_struct *tss = &doublefault_tss;
@@ -258,7 +269,8 @@ void __init percpu_traps_init(void)
     /* The hypercall entry vector is only accessible from ring 1. */
     _set_gate(idt_table+HYPERCALL_VECTOR, 14, 1, &hypercall);
 
-    set_intr_gate(TRAP_deferred_nmi, &deferred_nmi);
+    set_intr_gate(nmi_deferral_vector, &deferred_nmi);
+    set_intr_gate(mce_deferral_vector, &deferred_mce);
 
     /*
      * Make a separate task for double faults. This will get us debug output if
Index: 2007-05-14/xen/arch/x86/x86_64/entry.S
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_64/entry.S	2007-04-27 09:57:47.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_64/entry.S	2007-05-15 15:54:11.000000000 +0200
@@ -518,11 +518,6 @@ ENTRY(page_fault)
         movl  $TRAP_page_fault,4(%rsp)
         jmp   handle_exception
 
-ENTRY(machine_check)
-        pushq $0
-        movl  $TRAP_machine_check,4(%rsp)
-        jmp   handle_exception
-
 ENTRY(spurious_interrupt_bug)
         pushq $0
         movl  $TRAP_spurious_int,4(%rsp)
@@ -559,6 +554,23 @@ nmi_in_hypervisor_mode:
         call  do_nmi
         jmp   ret_from_intr
 
+ENTRY(machine_check)
+        pushq $0
+        movl  $TRAP_machine_check,4(%rsp)
+        SAVE_ALL
+        testb $3,UREGS_cs(%rsp)
+        jz    .Lmc_in_hypervisor_mode
+        /* Interrupted guest context. Copy the context to stack bottom. */
+        GET_GUEST_REGS(%rdi)
+        movq  %rsp,%rsi
+        movl  $UREGS_kernel_sizeof/8,%ecx
+        movq  %rdi,%rsp
+        rep movsq
+.Lmc_in_hypervisor_mode:
+        movq  %rsp,%rdi
+        call  *machine_check_vector(%rip)
+        jmp   ret_from_intr
+
 .data
 
 ENTRY(exception_table)
@@ -580,7 +592,7 @@ ENTRY(exception_table)
         .quad do_spurious_interrupt_bug
         .quad do_coprocessor_error
         .quad do_alignment_check
-        .quad do_machine_check
+        .quad 0 # machine_check
         .quad do_simd_coprocessor_error
 
 ENTRY(hypercall_table)
Index: 2007-05-14/xen/arch/x86/x86_64/traps.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_64/traps.c	2007-05-03 09:45:09.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_64/traps.c	2007-05-15 15:46:51.000000000 +0200
@@ -260,6 +260,7 @@ void __init percpu_traps_init(void)
         set_intr_gate(TRAP_double_fault, &double_fault);
         idt_table[TRAP_double_fault].a |= 1UL << 32; /* IST1 */
         idt_table[TRAP_nmi].a          |= 2UL << 32; /* IST2 */
+        idt_table[TRAP_machine_check].a|= 3UL << 32; /* IST3 */
 
         /*
          * The 32-on-64 hypercall entry vector is only accessible from ring 1.
@@ -274,7 +275,10 @@ void __init percpu_traps_init(void)
     stack_bottom = (char *)get_stack_bottom();
     stack        = (char *)((unsigned long)stack_bottom & ~(STACK_SIZE - 1));
 
-    /* Double-fault handler has its own per-CPU 2kB stack. */
+    /* Machine Check handler has its own per-CPU 1kB stack. */
+    init_tss[cpu].ist[2] = (unsigned long)&stack[1024];
+
+    /* Double-fault handler has its own per-CPU 1kB stack. */
     init_tss[cpu].ist[0] = (unsigned long)&stack[2048];
 
     /* NMI handler has its own per-CPU 1kB stack. */
Index: 2007-05-14/xen/include/asm-x86/hvm/hvm.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/hvm.h	2007-05-14 14:40:20.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/hvm.h	2007-05-15 17:29:10.000000000 +0200
@@ -277,4 +277,11 @@ static inline int hvm_event_injection_fa
     return hvm_funcs.event_injection_faulted(v);
 }
 
+/* These bits in the CR4 are owned by the host */
+#define HVM_CR4_HOST_MASK (mmu_cr4_features & \
+    (X86_CR4_VMXE | X86_CR4_PAE | X86_CR4_MCE))
+
+/* These exceptions must always be intercepted. */
+#define HVM_TRAP_MASK (1U << TRAP_machine_check)
+
 #endif /* __ASM_X86_HVM_HVM_H__ */
Index: 2007-05-14/xen/include/asm-x86/hvm/svm/vmcb.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/svm/vmcb.h	2007-04-23 10:01:46.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/svm/vmcb.h	2007-05-15 17:07:57.000000000 +0200
@@ -465,14 +465,6 @@ void svm_destroy_vmcb(struct vcpu *v);
 
 void setup_vmcb_dump(void);
 
-/* These bits in the CR4 are owned by the host */
-#if CONFIG_PAGING_LEVELS >= 3
-#define SVM_CR4_HOST_MASK (X86_CR4_PAE)
-#else
-#define SVM_CR4_HOST_MASK 0
-#endif
-
-
 #endif /* ASM_X86_HVM_SVM_VMCS_H__ */
 
 /*
Index: 2007-05-14/xen/include/asm-x86/hvm/trace.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/trace.h	2007-04-23 10:01:46.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/trace.h	2007-05-15 17:30:42.000000000 +0200
@@ -21,6 +21,7 @@
 #define DO_TRC_HVM_CPUID       1
 #define DO_TRC_HVM_INTR        1
 #define DO_TRC_HVM_NMI         1
+#define DO_TRC_HVM_MCE         1
 #define DO_TRC_HVM_SMI         1
 #define DO_TRC_HVM_VMMCALL     1
 #define DO_TRC_HVM_HLT         1
Index: 2007-05-14/xen/include/asm-x86/hvm/vmx/vmx.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/vmx/vmx.h	2007-05-14 14:40:20.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/vmx/vmx.h	2007-05-15 17:08:05.000000000 +0200
@@ -128,13 +128,6 @@ void set_guest_time(struct vcpu *v, u64 
 #define TYPE_MOV_FROM_DR                (1 << 4)
 #define DEBUG_REG_ACCESS_REG            0xf00   /* 11:8, general purpose register */
 
-/* These bits in the CR4 are owned by the host */
-#if CONFIG_PAGING_LEVELS >= 3
-#define VMX_CR4_HOST_MASK (X86_CR4_VMXE | X86_CR4_PAE)
-#else
-#define VMX_CR4_HOST_MASK (X86_CR4_VMXE)
-#endif
-
 #define VMCALL_OPCODE   ".byte 0x0f,0x01,0xc1\n"
 #define VMCLEAR_OPCODE  ".byte 0x66,0x0f,0xc7\n"        /* reg/opcode: /6 */
 #define VMLAUNCH_OPCODE ".byte 0x0f,0x01,0xc2\n"
Index: 2007-05-14/xen/include/asm-x86/processor.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/processor.h	2007-05-15 10:24:15.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/processor.h	2007-05-16 12:30:05.000000000 +0200
@@ -104,7 +104,6 @@
 #define TRAP_alignment_check  17
 #define TRAP_machine_check    18
 #define TRAP_simd_error       19
-#define TRAP_deferred_nmi     31
 
 /* Set for entry via SYSCALL. Informs return code to use SYSRETQ not IRETQ. */
 /* NB. Same as VGCF_in_syscall. No bits in common with any other TRAP_ defn. */
@@ -569,6 +568,7 @@ extern void mtrr_ap_init(void);
 extern void mtrr_bp_init(void);
 
 extern void mcheck_init(struct cpuinfo_x86 *c);
+extern asmlinkage void (*machine_check_vector)(struct cpu_user_regs *);
 
 int cpuid_hypervisor_leaves(
     uint32_t idx, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
Index: 2007-05-14/xen/include/public/trace.h
===================================================================
--- 2007-05-14.orig/xen/include/public/trace.h	2007-04-23 10:01:47.000000000 +0200
+++ 2007-05-14/xen/include/public/trace.h	2007-05-15 17:55:19.000000000 +0200
@@ -88,6 +88,7 @@
 #define TRC_HVM_VMMCALL         (TRC_HVM_HANDLER + 0x12)
 #define TRC_HVM_HLT             (TRC_HVM_HANDLER + 0x13)
 #define TRC_HVM_INVLPG          (TRC_HVM_HANDLER + 0x14)
+#define TRC_HVM_MCE             (TRC_HVM_HANDLER + 0x15)
 
 /* This structure represents a single trace buffer record. */
 struct t_rec {
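
A note on the defer_nmi_mce path in the patch above: it no longer hard-codes TRAP_deferred_nmi but ORs the run-time vector in %edx into the ICR value. The following is only a rough C sketch of that composition and of why the vector matters for priority. The constant values are assumed to match asm-x86/apicdef.h and should be checked against the tree; the helper names self_ipi_icr() and priority_class() are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match asm-x86/apicdef.h; verify against the tree. */
#define APIC_DM_FIXED      0x00000u
#define APIC_DEST_PHYSICAL 0x00000u
#define APIC_DEST_SELF     0x40000u

/*
 * Hypothetical helper: low ICR word for a fixed-mode self-IPI, i.e. the
 * value the assembly builds with "orl $(...),%edx".
 */
static uint32_t self_ipi_icr(uint8_t vector)
{
    /* Vectors 0-15 are not valid for fixed-mode APIC interrupts. */
    assert(vector >= 16);
    return APIC_DM_FIXED | APIC_DEST_SELF | APIC_DEST_PHYSICAL | vector;
}

/* The local APIC priority class is the upper four bits of the vector. */
static unsigned int priority_class(uint8_t vector)
{
    return vector >> 4;
}
```

With the old fixed vector 31 the class is 1, which sits below any vector of 0x20 or above; a vector in the range 0xf0-0xff would land in class 15. Which vectors nmi_deferral_vector and mce_deferral_vector actually receive is not visible in this part of the patch.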

[-- Attachment #3: x86-machine-check.patch --]
[-- Type: text/plain, Size: 34499 bytes --]
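
For readers of the attachment below: the HVM part replaces the separate SVM_CR4_HOST_MASK and VMX_CR4_HOST_MASK with one HVM_CR4_HOST_MASK, and rejects guest CR4 writes that set bits the host does not have enabled. A minimal C sketch of that logic for the non-HAP case follows. The struct and function names are hypothetical stand-ins, and host_features plays the role of mmu_cr4_features; the CR4 bit values are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_PAE  0x0020u  /* bit 5 */
#define X86_CR4_MCE  0x0040u  /* bit 6 */
#define X86_CR4_VMXE 0x2000u  /* bit 13 */

/* Bits the host keeps control of, as in HVM_CR4_HOST_MASK. */
#define HOST_OWNED_BITS (X86_CR4_VMXE | X86_CR4_PAE | X86_CR4_MCE)

/* Hypothetical stand-in for the per-vCPU CR4 state. */
struct cr4_state {
    unsigned long shadow; /* value the guest reads back */
    unsigned long hw;     /* value the hardware actually runs with */
};

/*
 * Hypothetical model of a guest MOV to CR4. Returns false where the
 * patch would inject #GP into the guest; the state is left unchanged.
 */
static bool guest_write_cr4(struct cr4_state *s, unsigned long value,
                            unsigned long host_features)
{
    if (value & ~host_features)
        return false;

    s->shadow = value;
    s->hw = value | (host_features & HOST_OWNED_BITS);
    return true;
}
```

The HAP paths in the patch differ in that they mask X86_CR4_PAE out of the host-owned bits, which this sketch does not model.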

Index: 2007-05-14/xen/arch/x86/cpu/mcheck/k7.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/k7.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/k7.c	2007-05-15 15:56:00.000000000 +0200
@@ -16,7 +16,7 @@
 #include "mce.h"
 
 /* Machine Check Handler For AMD Athlon/Duron */
-static fastcall void k7_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void k7_machine_check(struct cpu_user_regs * regs)
 {
 	int recover=1;
 	u32 alow, ahigh, high, low;
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/mce.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/mce.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/mce.c	2007-05-15 15:56:16.000000000 +0200
@@ -18,13 +18,13 @@ int mce_disabled = 0;
 int nr_mce_banks;
 
 /* Handle unconfigured int18 (should never happen) */
-static fastcall void unexpected_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void unexpected_machine_check(struct cpu_user_regs * regs)
 {	
 	printk(KERN_ERR "CPU#%d: Unexpected int18 (Machine Check).\n", smp_processor_id());
 }
 
 /* Call the installed machine check handler for this CPU setup. */
-void fastcall (*machine_check_vector)(struct cpu_user_regs *, long error_code) = unexpected_machine_check;
+void fastcall (*machine_check_vector)(struct cpu_user_regs *) = unexpected_machine_check;
 
 /* This has to be run for each processor */
 void mcheck_init(struct cpuinfo_x86 *c)
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/mce.h
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/mce.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/mce.h	2007-05-15 17:34:43.000000000 +0200
@@ -1,4 +1,5 @@
 #include <xen/init.h>
+#include <asm/processor.h>
 
 void amd_mcheck_init(struct cpuinfo_x86 *c);
 void intel_p4_mcheck_init(struct cpuinfo_x86 *c);
@@ -6,9 +7,6 @@ void intel_p5_mcheck_init(struct cpuinfo
 void intel_p6_mcheck_init(struct cpuinfo_x86 *c);
 void winchip_mcheck_init(struct cpuinfo_x86 *c);
 
-/* Call the installed machine check handler for this CPU setup. */
-extern fastcall void (*machine_check_vector)(struct cpu_user_regs *, long error_code);
-
 extern int mce_disabled __initdata;
 extern int nr_mce_banks;
 
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/p4.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/p4.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/p4.c	2007-05-15 15:56:31.000000000 +0200
@@ -158,7 +158,7 @@ done:
 	return mce_num_extended_msrs;
 }
 
-static fastcall void intel_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void intel_machine_check(struct cpu_user_regs * regs)
 {
 	int recover=1;
 	u32 alow, ahigh, high, low;
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/p5.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/p5.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/p5.c	2007-05-15 15:56:39.000000000 +0200
@@ -15,7 +15,7 @@
 #include "mce.h"
 
 /* Machine check handler for Pentium class Intel */
-static fastcall void pentium_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void pentium_machine_check(struct cpu_user_regs * regs)
 {
 	u32 loaddr, hi, lotype;
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/p6.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/p6.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/p6.c	2007-05-15 15:56:43.000000000 +0200
@@ -15,7 +15,7 @@
 #include "mce.h"
 
 /* Machine Check Handler For PII/PIII */
-static fastcall void intel_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void intel_machine_check(struct cpu_user_regs * regs)
 {
 	int recover=1;
 	u32 alow, ahigh, high, low;
Index: 2007-05-14/xen/arch/x86/cpu/mcheck/winchip.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/cpu/mcheck/winchip.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/cpu/mcheck/winchip.c	2007-05-15 15:56:48.000000000 +0200
@@ -16,7 +16,7 @@
 #include "mce.h"
 
 /* Machine check handler for WinChip C6 */
-static fastcall void winchip_machine_check(struct cpu_user_regs * regs, long error_code)
+static fastcall void winchip_machine_check(struct cpu_user_regs * regs)
 {
 	printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK);
Index: 2007-05-14/xen/arch/x86/hvm/svm/svm.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/svm/svm.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/svm/svm.c	2007-05-15 17:57:20.000000000 +0200
@@ -407,7 +407,7 @@ int svm_vmcb_restore(struct vcpu *v, str
     }
 
  skip_cr3:
-    vmcb->cr4 = c->cr4 | SVM_CR4_HOST_MASK;
+    vmcb->cr4 = c->cr4 | HVM_CR4_HOST_MASK;
     v->arch.hvm_svm.cpu_shadow_cr4 = c->cr4;
     
     vmcb->idtr.limit = c->idtr_limit;
@@ -464,7 +464,8 @@ int svm_vmcb_restore(struct vcpu *v, str
     /* update VMCB for nested paging restore */
     if ( paging_mode_hap(v->domain) ) {
         vmcb->cr0 = v->arch.hvm_svm.cpu_shadow_cr0;
-        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4;
+        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4 |
+                    (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
         vmcb->cr3 = c->cr3;
         vmcb->np_enable = 1;
         vmcb->g_pat = 0x0007040600070406ULL; /* guest PAT */
@@ -1731,9 +1732,19 @@ static int mov_to_cr(int gpreg, int cr, 
         break;
 
     case 4: /* CR4 */
+        if ( value & ~mmu_cr4_features )
+        {
+            HVM_DBG_LOG(DBG_LEVEL_1, "Guest attempts to enable unsupported "
+                        "CR4 features %lx (host %lx)",
+                        value, mmu_cr4_features);
+            svm_inject_exception(v, TRAP_gp_fault, 1, 0);
+            break;
+        }
+
         if ( paging_mode_hap(v->domain) )
         {
-            vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4 = value;
+            v->arch.hvm_svm.cpu_shadow_cr4 = value;
+            vmcb->cr4 = value | (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
             paging_update_paging_modes(v);
             break;
         }
@@ -1779,7 +1790,7 @@ static int mov_to_cr(int gpreg, int cr, 
         }
 
         v->arch.hvm_svm.cpu_shadow_cr4 = value;
-        vmcb->cr4 = value | SVM_CR4_HOST_MASK;
+        vmcb->cr4 = value | HVM_CR4_HOST_MASK;
   
         /*
          * Writing to CR4 to modify the PSE, PGE, or PAE flag invalidates
@@ -2141,12 +2152,13 @@ static int svm_reset_to_realmode(struct 
     vmcb->cr2 = 0;
     vmcb->efer = EFER_SVME;
 
-    vmcb->cr4 = SVM_CR4_HOST_MASK;
+    vmcb->cr4 = HVM_CR4_HOST_MASK;
     v->arch.hvm_svm.cpu_shadow_cr4 = 0;
 
     if ( paging_mode_hap(v->domain) ) {
         vmcb->cr0 = v->arch.hvm_svm.cpu_shadow_cr0;
-        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4;
+        vmcb->cr4 = v->arch.hvm_svm.cpu_shadow_cr4 |
+                    (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
     }
 
     /* This will jump to ROMBIOS */
@@ -2287,6 +2299,12 @@ asmlinkage void svm_vmexit_handler(struc
         break;
     }
 
+    case VMEXIT_EXCEPTION_MC:
+        HVMTRACE_0D(MCE, v);
+        svm_store_cpu_guest_regs(v, regs, NULL);
+        machine_check_vector(regs);
+        break;
+
     case VMEXIT_VINTR:
         vmcb->vintr.fields.irq = 0;
         vmcb->general1_intercepts &= ~GENERAL1_INTERCEPT_VINTR;
Index: 2007-05-14/xen/arch/x86/hvm/svm/vmcb.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/svm/vmcb.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/svm/vmcb.c	2007-05-15 17:28:40.000000000 +0200
@@ -225,7 +225,7 @@ static int construct_vmcb(struct vcpu *v
     /* Guest CR4. */
     arch_svm->cpu_shadow_cr4 =
         read_cr4() & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE);
-    vmcb->cr4 = arch_svm->cpu_shadow_cr4 | SVM_CR4_HOST_MASK;
+    vmcb->cr4 = arch_svm->cpu_shadow_cr4 | HVM_CR4_HOST_MASK;
 
     paging_update_paging_modes(v);
     vmcb->cr3 = v->arch.hvm_vcpu.hw_cr3; 
@@ -236,11 +236,13 @@ static int construct_vmcb(struct vcpu *v
         vmcb->np_enable = 1; /* enable nested paging */
         vmcb->g_pat = 0x0007040600070406ULL; /* guest PAT */
         vmcb->h_cr3 = pagetable_get_paddr(v->domain->arch.phys_table);
-        vmcb->cr4 = arch_svm->cpu_shadow_cr4 = 0;
+        vmcb->cr4 = arch_svm->cpu_shadow_cr4 =
+                    (HVM_CR4_HOST_MASK & ~X86_CR4_PAE);
+        vmcb->exception_intercepts = HVM_TRAP_MASK;
     }
     else
     {
-        vmcb->exception_intercepts = 1U << TRAP_page_fault;
+        vmcb->exception_intercepts = HVM_TRAP_MASK | (1U << TRAP_page_fault);
     }
 
     return 0;
Index: 2007-05-14/xen/arch/x86/hvm/vmx/vmcs.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/vmx/vmcs.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/vmx/vmcs.c	2007-05-15 17:28:58.000000000 +0200
@@ -400,7 +400,7 @@ static void construct_vmcs(struct vcpu *
     __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
 #endif
 
-    __vmwrite(EXCEPTION_BITMAP, 1U << TRAP_page_fault);
+    __vmwrite(EXCEPTION_BITMAP, HVM_TRAP_MASK | (1U << TRAP_page_fault));
 
     /* Guest CR0. */
     cr0 = read_cr0();
Index: 2007-05-14/xen/arch/x86/hvm/vmx/vmx.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/hvm/vmx/vmx.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/hvm/vmx/vmx.c	2007-05-15 17:56:59.000000000 +0200
@@ -600,7 +600,7 @@ int vmx_vmcs_restore(struct vcpu *v, str
     }
 #endif
 
-    __vmwrite(GUEST_CR4, (c->cr4 | VMX_CR4_HOST_MASK));
+    __vmwrite(GUEST_CR4, (c->cr4 | HVM_CR4_HOST_MASK));
     v->arch.hvm_vmx.cpu_shadow_cr4 = c->cr4;
     __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vmx.cpu_shadow_cr4);
 
@@ -1886,7 +1886,7 @@ static int vmx_world_restore(struct vcpu
     else
         HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %x", c->cr3);
 
-    __vmwrite(GUEST_CR4, (c->cr4 | VMX_CR4_HOST_MASK));
+    __vmwrite(GUEST_CR4, (c->cr4 | HVM_CR4_HOST_MASK));
     v->arch.hvm_vmx.cpu_shadow_cr4 = c->cr4;
     __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vmx.cpu_shadow_cr4);
 
@@ -2275,6 +2275,14 @@ static int mov_to_cr(int gp, int cr, str
     case 4: /* CR4 */
         old_cr = v->arch.hvm_vmx.cpu_shadow_cr4;
 
+        if ( value & ~mmu_cr4_features )
+        {
+            HVM_DBG_LOG(DBG_LEVEL_1, "Guest attempts to enable unsupported "
+                        "CR4 features %lx (host %lx)",
+                        value, mmu_cr4_features);
+            vmx_inject_hw_exception(v, TRAP_gp_fault, 0);
+            break;
+        }
         if ( (value & X86_CR4_PAE) && !(old_cr & X86_CR4_PAE) )
         {
             if ( vmx_pgbit_test(v) )
@@ -2315,7 +2323,7 @@ static int mov_to_cr(int gp, int cr, str
             }
         }
 
-        __vmwrite(GUEST_CR4, value| VMX_CR4_HOST_MASK);
+        __vmwrite(GUEST_CR4, value | HVM_CR4_HOST_MASK);
         v->arch.hvm_vmx.cpu_shadow_cr4 = value;
         __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vmx.cpu_shadow_cr4);
 
@@ -2623,7 +2631,8 @@ static void vmx_reflect_exception(struct
     }
 }
 
-static void vmx_failed_vmentry(unsigned int exit_reason)
+static void vmx_failed_vmentry(unsigned int exit_reason,
+                               struct cpu_user_regs *regs)
 {
     unsigned int failed_vmentry_reason = (uint16_t)exit_reason;
     unsigned long exit_qualification;
@@ -2640,6 +2649,9 @@ static void vmx_failed_vmentry(unsigned 
         break;
     case EXIT_REASON_MACHINE_CHECK:
         printk("caused by machine check.\n");
+        HVMTRACE_0D(MCE, current);
+        hvm_store_cpu_guest_regs(current, regs, NULL);
+        machine_check_vector(regs);
         break;
     default:
         printk("reason not known yet!");
@@ -2665,11 +2677,12 @@ asmlinkage void vmx_vmexit_handler(struc
 
     perfc_incra(vmexits, exit_reason);
 
-    if ( exit_reason != EXIT_REASON_EXTERNAL_INTERRUPT )
-        local_irq_enable();
-
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
-        return vmx_failed_vmentry(exit_reason);
+        return vmx_failed_vmentry(exit_reason, regs);
+
+    if ( exit_reason != EXIT_REASON_EXTERNAL_INTERRUPT &&
+         exit_reason != EXIT_REASON_EXCEPTION_NMI )
+        local_irq_enable();
 
     switch ( exit_reason )
     {
@@ -2689,6 +2702,9 @@ asmlinkage void vmx_vmexit_handler(struc
 
         perfc_incra(cause_vector, vector);
 
+        if ( vector != TRAP_nmi && vector != TRAP_machine_check )
+            local_irq_enable();
+
         switch ( vector )
         {
         case TRAP_debug:
@@ -2726,6 +2742,11 @@ asmlinkage void vmx_vmexit_handler(struc
             else
                 vmx_reflect_exception(v);
             break;
+        case TRAP_machine_check:
+            HVMTRACE_0D(MCE, v);
+            hvm_store_cpu_guest_regs(v, regs, NULL);
+            machine_check_vector(regs);
+            break;
         default:
             goto exit_and_crash;
         }
Index: 2007-05-14/xen/arch/x86/traps.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/traps.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/traps.c	2007-05-15 15:52:58.000000000 +0200
@@ -707,12 +707,6 @@ asmlinkage int do_int3(struct cpu_user_r
     return do_guest_trap(TRAP_int3, regs, 0);
 }
 
-asmlinkage int do_machine_check(struct cpu_user_regs *regs)
-{
-    fatal_trap(TRAP_machine_check, regs);
-    return 0;
-}
-
 void propagate_page_fault(unsigned long addr, u16 error_code)
 {
     struct trap_info *ti;
Index: 2007-05-14/xen/arch/x86/x86_32/entry.S
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_32/entry.S	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_32/entry.S	2007-05-21 12:45:09.000000000 +0200
@@ -77,14 +77,29 @@
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
         testl $X86_EFLAGS_VM,UREGS_eflags(%esp)
-        jnz  restore_all_vm86
-#ifdef CONFIG_X86_SUPERVISOR_MODE_KERNEL
+#ifndef CONFIG_X86_SUPERVISOR_MODE_KERNEL
+        popl %ebx
+        popl %ecx
+        popl %edx
+        popl %esi
+        popl %edi
+        popl %ebp
+        popl %eax
+        leal 4(%esp),%esp
+        jnz  .Lrestore_iret_guest
+.Lrestore_sregs_guest:
+.Lft1:  mov  UREGS_ds-UREGS_eip(%esp),%ds
+.Lft2:  mov  UREGS_es-UREGS_eip(%esp),%es
+.Lft3:  mov  UREGS_fs-UREGS_eip(%esp),%fs
+.Lft4:  mov  UREGS_gs-UREGS_eip(%esp),%gs
+.Lrestore_iret_guest:
+#else
+        jnz   restore_all_vm86
         testl $2,UREGS_cs(%esp)
         jnz   1f
         call  restore_ring0_guest
         jmp   restore_all_vm86
 1:
-#endif
 .Lft1:  mov  UREGS_ds(%esp),%ds
 .Lft2:  mov  UREGS_es(%esp),%es
 .Lft3:  mov  UREGS_fs(%esp),%fs
@@ -98,6 +113,7 @@ restore_all_vm86:
         popl %ebp
         popl %eax
         addl $4,%esp
+#endif
 .Lft5:  iret
 .section .fixup,"ax"
 .Lfx5:  subl  $28,%esp
@@ -109,9 +125,13 @@ restore_all_vm86:
         movl  %edx,UREGS_edx+4(%esp)
         movl  %ecx,UREGS_ecx+4(%esp)
         movl  %ebx,UREGS_ebx+4(%esp)
+#ifndef CONFIG_X86_SUPERVISOR_MODE_KERNEL
+.equ .Lfx1, .Lfx5
+#else
 .Lfx1:  SET_XEN_SEGMENTS(a)
         movl  %eax,%fs
         movl  %eax,%gs
+#endif
         sti
         popl  %esi
         pushfl                         # EFLAGS
@@ -169,8 +189,8 @@ restore_all_xen:
 ENTRY(hypercall)
         subl $4,%esp
         FIXUP_RING0_GUEST_STACK
-        SAVE_ALL(b)
-        sti
+        SAVE_ALL(1f,1f)
+1:      sti
         GET_CURRENT(%ebx)
         cmpl  $NR_hypercalls,%eax
         jae   bad_hypercall
@@ -433,9 +453,13 @@ ENTRY(divide_error)
         ALIGN
 handle_exception:
         FIXUP_RING0_GUEST_STACK
-        SAVE_ALL_NOSEGREGS(a)
-        SET_XEN_SEGMENTS(a)
-        testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%esp)
+        SAVE_ALL(1f,2f)
+        .text 1
+1:      mov   %ecx,%ds
+        mov   %ecx,%es
+        jmp   2f
+        .previous
+2:      testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%esp)
         jz    exception_with_ints_disabled
         sti                             # re-enable interrupts
 1:      xorl  %eax,%eax
@@ -546,18 +570,14 @@ ENTRY(page_fault)
         movw  $TRAP_page_fault,2(%esp)
         jmp   handle_exception
 
-ENTRY(machine_check)
-        pushl $TRAP_machine_check<<16
-        jmp   handle_exception
-
 ENTRY(spurious_interrupt_bug)
         pushl $TRAP_spurious_int<<16
         jmp   handle_exception
 
 ENTRY(early_page_fault)
-        SAVE_ALL_NOSEGREGS(a)
-        movl  %esp,%edx
-        pushl %edx
+        SAVE_ALL(1f,1f)
+1:      movl  %esp,%eax
+        pushl %eax
         call  do_early_page_fault
         addl  $4,%esp
         jmp   restore_all_xen
@@ -568,49 +588,84 @@ ENTRY(nmi)
         iret
 #else
         # Save state but do not trash the segment registers!
-        # We may otherwise be unable to reload them or copy them to ring 1. 
+        pushl $TRAP_nmi<<16
+        SAVE_ALL(.Lnmi_xen,.Lnmi_common)
+.Lnmi_common:
+        movl  %esp,%eax
         pushl %eax
-        SAVE_ALL_NOSEGREGS(a)
-
-        # We can only process the NMI if:
-        #  A. We are the outermost Xen activation (in which case we have
-        #     the selectors safely saved on our stack)
-        #  B. DS and ES contain sane Xen values.
-        # In all other cases we bail without touching DS-GS, as we have
-        # interrupted an enclosing Xen activation in tricky prologue or
-        # epilogue code.
-        movl  UREGS_eflags(%esp),%eax
-        movb  UREGS_cs(%esp),%al
-        testl $(3|X86_EFLAGS_VM),%eax
-        jnz   continue_nmi
-        movl  %ds,%eax
-        cmpw  $(__HYPERVISOR_DS),%ax
-        jne   defer_nmi
-        movl  %es,%eax
-        cmpw  $(__HYPERVISOR_DS),%ax
-        jne   defer_nmi
-
-continue_nmi:
-        SET_XEN_SEGMENTS(d)
-        movl  %esp,%edx
-        pushl %edx
         call  do_nmi
         addl  $4,%esp
         jmp   ret_from_intr
+.Lnmi_xen:
+        GET_GUEST_REGS(%ebx)
+        testl $X86_EFLAGS_VM,%ss:UREGS_eflags(%ebx)
+        mov   %ds,%eax
+        mov   %es,%edx
+        jnz   .Lnmi_vm86
+        cmpw  %ax,%cx
+        mov   %ecx,%ds
+        cmovel UREGS_ds(%ebx),%eax
+        cmpw  %dx,%cx
+        movl  %eax,UREGS_ds(%ebx)
+        cmovel UREGS_es(%ebx),%edx
+        mov   %ecx,%es
+        movl  $.Lrestore_sregs_guest,%ecx
+        movl  %edx,UREGS_es(%ebx)
+        cmpl  %ecx,UREGS_eip(%esp)
+        jbe   .Lnmi_common
+        cmpl  $.Lrestore_iret_guest,UREGS_eip(%esp)
+        ja    .Lnmi_common
+        movl  %ecx,UREGS_eip(%esp)
+        jmp   .Lnmi_common
+.Lnmi_vm86:
+        mov   %ecx,%ds
+        mov   %ecx,%es
+        jmp   .Lnmi_common
+#endif /* !CONFIG_X86_SUPERVISOR_MODE_KERNEL */
 
-defer_nmi:
-        movl  $FIXMAP_apic_base,%eax
-        # apic_wait_icr_idle()
-1:      movl  %ss:APIC_ICR(%eax),%ebx
-        testl $APIC_ICR_BUSY,%ebx
-        jnz   1b
-        # __send_IPI_shortcut(APIC_DEST_SELF, TRAP_deferred_nmi)
-        movl  $(APIC_DM_FIXED | APIC_DEST_SELF | APIC_DEST_PHYSICAL | \
-                TRAP_deferred_nmi),%ss:APIC_ICR(%eax)
-        jmp   restore_all_xen
+ENTRY(machine_check)
+        # See NMI handler for explanations.
+#ifdef CONFIG_X86_SUPERVISOR_MODE_KERNEL
+        iret
+#else
+        pushl $TRAP_machine_check<<16
+        SAVE_ALL(.Lmce_xen,.Lmce_common)
+.Lmce_common:
+        movl  %esp,%eax
+        pushl %eax
+        call  *machine_check_vector
+        addl  $4,%esp
+        jmp   ret_from_intr
+.Lmce_xen:
+        GET_GUEST_REGS(%ebx)
+        testl $X86_EFLAGS_VM,%ss:UREGS_eflags(%ebx)
+        mov   %ds,%eax
+        mov   %es,%edx
+        jnz   .Lmce_vm86
+        cmpw  %ax,%cx
+        mov   %ecx,%ds
+        cmovel UREGS_ds(%ebx),%eax
+        cmpw  %dx,%cx
+        movl  %eax,UREGS_ds(%ebx)
+        cmovel UREGS_es(%ebx),%edx
+        mov   %ecx,%es
+        movl  $.Lrestore_sregs_guest,%ecx
+        movl  %edx,UREGS_es(%ebx)
+        cmpl  %ecx,UREGS_eip(%esp)
+        jbe   .Lmce_common
+        cmpl  $.Lrestore_iret_guest,UREGS_eip(%esp)
+        ja    .Lmce_common
+        movl  %ecx,UREGS_eip(%esp)
+        jmp   .Lmce_common
+.Lmce_vm86:
+        mov   %ecx,%ds
+        mov   %ecx,%es
+        jmp   .Lmce_common
 #endif /* !CONFIG_X86_SUPERVISOR_MODE_KERNEL */
 
 ENTRY(setup_vm86_frame)
+        mov %ecx,%ds
+        mov %ecx,%es
         # Copies the entire stack frame forwards by 16 bytes.
         .macro copy_vm86_words count=18
         .if \count
@@ -644,7 +699,7 @@ ENTRY(exception_table)
         .long do_spurious_interrupt_bug
         .long do_coprocessor_error
         .long do_alignment_check
-        .long do_machine_check
+        .long 0 # machine_check
         .long do_simd_coprocessor_error
 
 ENTRY(hypercall_table)
Index: 2007-05-14/xen/arch/x86/x86_32/traps.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_32/traps.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_32/traps.c	2007-05-21 09:00:10.000000000 +0200
@@ -235,15 +235,6 @@ unsigned long do_iret(void)
     return 0;
 }
 
-#include <asm/asm_defns.h>
-BUILD_SMP_INTERRUPT(deferred_nmi, TRAP_deferred_nmi)
-fastcall void smp_deferred_nmi(struct cpu_user_regs *regs)
-{
-    asmlinkage void do_nmi(struct cpu_user_regs *);
-    ack_APIC_irq();
-    do_nmi(regs);
-}
-
 void __init percpu_traps_init(void)
 {
     struct tss_struct *tss = &doublefault_tss;
@@ -258,8 +249,6 @@ void __init percpu_traps_init(void)
     /* The hypercall entry vector is only accessible from ring 1. */
     _set_gate(idt_table+HYPERCALL_VECTOR, 14, 1, &hypercall);
 
-    set_intr_gate(TRAP_deferred_nmi, &deferred_nmi);
-
     /*
      * Make a separate task for double faults. This will get us debug output if
      * we blow the kernel stack.
Index: 2007-05-14/xen/arch/x86/x86_64/entry.S
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_64/entry.S	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_64/entry.S	2007-05-21 11:24:15.000000000 +0200
@@ -518,11 +518,6 @@ ENTRY(page_fault)
         movl  $TRAP_page_fault,4(%rsp)
         jmp   handle_exception
 
-ENTRY(machine_check)
-        pushq $0
-        movl  $TRAP_machine_check,4(%rsp)
-        jmp   handle_exception
-
 ENTRY(spurious_interrupt_bug)
         pushq $0
         movl  $TRAP_spurious_int,4(%rsp)
@@ -559,6 +554,23 @@ nmi_in_hypervisor_mode:
         call  do_nmi
         jmp   ret_from_intr
 
+ENTRY(machine_check)
+        pushq $0
+        movl  $TRAP_machine_check,4(%rsp)
+        SAVE_ALL
+        testb $3,UREGS_cs(%rsp)
+        jz    .Lmce_in_hypervisor_mode
+        /* Interrupted guest context. Copy the context to stack bottom. */
+        GET_GUEST_REGS(%rdi)
+        movq  %rsp,%rsi
+        movl  $UREGS_kernel_sizeof/8,%ecx
+        movq  %rdi,%rsp
+        rep movsq
+.Lmce_in_hypervisor_mode:
+        movq  %rsp,%rdi
+        call  *machine_check_vector(%rip)
+        jmp   ret_from_intr
+
 .data
 
 ENTRY(exception_table)
@@ -580,7 +592,7 @@ ENTRY(exception_table)
         .quad do_spurious_interrupt_bug
         .quad do_coprocessor_error
         .quad do_alignment_check
-        .quad do_machine_check
+        .quad 0 # machine_check
         .quad do_simd_coprocessor_error
 
 ENTRY(hypercall_table)
Index: 2007-05-14/xen/arch/x86/x86_64/traps.c
===================================================================
--- 2007-05-14.orig/xen/arch/x86/x86_64/traps.c	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/arch/x86/x86_64/traps.c	2007-05-15 15:46:51.000000000 +0200
@@ -260,6 +260,7 @@ void __init percpu_traps_init(void)
         set_intr_gate(TRAP_double_fault, &double_fault);
         idt_table[TRAP_double_fault].a |= 1UL << 32; /* IST1 */
         idt_table[TRAP_nmi].a          |= 2UL << 32; /* IST2 */
+        idt_table[TRAP_machine_check].a|= 3UL << 32; /* IST3 */
 
         /*
          * The 32-on-64 hypercall entry vector is only accessible from ring 1.
@@ -274,7 +275,10 @@ void __init percpu_traps_init(void)
     stack_bottom = (char *)get_stack_bottom();
     stack        = (char *)((unsigned long)stack_bottom & ~(STACK_SIZE - 1));
 
-    /* Double-fault handler has its own per-CPU 2kB stack. */
+    /* Machine Check handler has its own per-CPU 1kB stack. */
+    init_tss[cpu].ist[2] = (unsigned long)&stack[1024];
+
+    /* Double-fault handler has its own per-CPU 1kB stack. */
     init_tss[cpu].ist[0] = (unsigned long)&stack[2048];
 
     /* NMI handler has its own per-CPU 1kB stack. */
Index: 2007-05-14/xen/include/asm-x86/hvm/hvm.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/hvm.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/hvm.h	2007-05-15 17:29:10.000000000 +0200
@@ -277,4 +277,11 @@ static inline int hvm_event_injection_fa
     return hvm_funcs.event_injection_faulted(v);
 }
 
+/* These bits in the CR4 are owned by the host */
+#define HVM_CR4_HOST_MASK (mmu_cr4_features & \
+    (X86_CR4_VMXE | X86_CR4_PAE | X86_CR4_MCE))
+
+/* These exceptions must always be intercepted. */
+#define HVM_TRAP_MASK (1U << TRAP_machine_check)
+
 #endif /* __ASM_X86_HVM_HVM_H__ */
Index: 2007-05-14/xen/include/asm-x86/hvm/svm/vmcb.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/svm/vmcb.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/svm/vmcb.h	2007-05-15 17:07:57.000000000 +0200
@@ -465,14 +465,6 @@ void svm_destroy_vmcb(struct vcpu *v);
 
 void setup_vmcb_dump(void);
 
-/* These bits in the CR4 are owned by the host */
-#if CONFIG_PAGING_LEVELS >= 3
-#define SVM_CR4_HOST_MASK (X86_CR4_PAE)
-#else
-#define SVM_CR4_HOST_MASK 0
-#endif
-
-
 #endif /* ASM_X86_HVM_SVM_VMCS_H__ */
 
 /*
Index: 2007-05-14/xen/include/asm-x86/hvm/trace.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/trace.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/trace.h	2007-05-15 17:30:42.000000000 +0200
@@ -21,6 +21,7 @@
 #define DO_TRC_HVM_CPUID       1
 #define DO_TRC_HVM_INTR        1
 #define DO_TRC_HVM_NMI         1
+#define DO_TRC_HVM_MCE         1
 #define DO_TRC_HVM_SMI         1
 #define DO_TRC_HVM_VMMCALL     1
 #define DO_TRC_HVM_HLT         1
Index: 2007-05-14/xen/include/asm-x86/hvm/vmx/vmx.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/hvm/vmx/vmx.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/hvm/vmx/vmx.h	2007-05-15 17:08:05.000000000 +0200
@@ -128,13 +128,6 @@ void set_guest_time(struct vcpu *v, u64 
 #define TYPE_MOV_FROM_DR                (1 << 4)
 #define DEBUG_REG_ACCESS_REG            0xf00   /* 11:8, general purpose register */
 
-/* These bits in the CR4 are owned by the host */
-#if CONFIG_PAGING_LEVELS >= 3
-#define VMX_CR4_HOST_MASK (X86_CR4_VMXE | X86_CR4_PAE)
-#else
-#define VMX_CR4_HOST_MASK (X86_CR4_VMXE)
-#endif
-
 #define VMCALL_OPCODE   ".byte 0x0f,0x01,0xc1\n"
 #define VMCLEAR_OPCODE  ".byte 0x66,0x0f,0xc7\n"        /* reg/opcode: /6 */
 #define VMLAUNCH_OPCODE ".byte 0x0f,0x01,0xc2\n"
Index: 2007-05-14/xen/include/asm-x86/processor.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/processor.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/processor.h	2007-05-16 12:30:05.000000000 +0200
@@ -104,7 +104,6 @@
 #define TRAP_alignment_check  17
 #define TRAP_machine_check    18
 #define TRAP_simd_error       19
-#define TRAP_deferred_nmi     31
 
 /* Set for entry via SYSCALL. Informs return code to use SYSRETQ not IRETQ. */
 /* NB. Same as VGCF_in_syscall. No bits in common with any other TRAP_ defn. */
@@ -569,6 +568,7 @@ extern void mtrr_ap_init(void);
 extern void mtrr_bp_init(void);
 
 extern void mcheck_init(struct cpuinfo_x86 *c);
+extern asmlinkage void (*machine_check_vector)(struct cpu_user_regs *);
 
 int cpuid_hypervisor_leaves(
     uint32_t idx, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
Index: 2007-05-14/xen/include/asm-x86/x86_32/asm_defns.h
===================================================================
--- 2007-05-14.orig/xen/include/asm-x86/x86_32/asm_defns.h	2007-04-23 10:01:46.000000000 +0200
+++ 2007-05-14/xen/include/asm-x86/x86_32/asm_defns.h	2007-05-21 12:44:12.000000000 +0200
@@ -22,7 +22,7 @@
 #define ASSERT_INTERRUPTS_ENABLED  ASSERT_INTERRUPT_STATUS(nz)
 #define ASSERT_INTERRUPTS_DISABLED ASSERT_INTERRUPT_STATUS(z)
 
-#define __SAVE_ALL_PRE                                  \
+#define SAVE_ALL(xen_lbl, vm86_lbl)                     \
         cld;                                            \
         pushl %eax;                                     \
         pushl %ebp;                                     \
@@ -33,31 +33,32 @@
         pushl %ecx;                                     \
         pushl %ebx;                                     \
         testl $(X86_EFLAGS_VM),UREGS_eflags(%esp);      \
-        jz 2f;                                          \
-        call setup_vm86_frame;                          \
-        jmp 3f;                                         \
-        2:testb $3,UREGS_cs(%esp);                      \
-        jz 1f;                                          \
-        mov %ds,UREGS_ds(%esp);                         \
-        mov %es,UREGS_es(%esp);                         \
+        mov %ds,%edi;                                   \
+        mov %es,%esi;                                   \
+        movl $(__HYPERVISOR_DS),%ecx;                   \
+        jnz 86f;                                        \
+        .text 1;                                        \
+        86:call setup_vm86_frame;                       \
+        jmp vm86_lbl;                                   \
+        .previous;                                      \
+        testb $3,UREGS_cs(%esp);                        \
+        jz xen_lbl;                                     \
+        cmpw %cx,%di;                                   \
+        mov %ecx,%ds;                                   \
         mov %fs,UREGS_fs(%esp);                         \
+        cmovel UREGS_ds(%esp),%edi;                     \
+        cmpw %cx,%si;                                   \
+        mov %edi,UREGS_ds(%esp);                        \
+        cmovel UREGS_es(%esp),%esi;                     \
+        mov %ecx,%es;                                   \
         mov %gs,UREGS_gs(%esp);                         \
-        3:
-
-#define SAVE_ALL_NOSEGREGS(_reg)                \
-        __SAVE_ALL_PRE                          \
-        1:
+        mov %esi,UREGS_es(%esp)
 
 #define SET_XEN_SEGMENTS(_reg)                          \
         movl $(__HYPERVISOR_DS),%e ## _reg ## x;        \
         mov %e ## _reg ## x,%ds;                        \
         mov %e ## _reg ## x,%es;
 
-#define SAVE_ALL(_reg)                          \
-        __SAVE_ALL_PRE                          \
-        SET_XEN_SEGMENTS(_reg)                  \
-        1:
-
 #ifdef PERF_COUNTERS
 #define PERFC_INCR(_name,_idx,_cur)                     \
         pushl _cur;                                     \
@@ -93,8 +94,8 @@ __asm__(                                
     STR(x) ":\n\t"                              \
     "pushl $"#v"<<16\n\t"                       \
     STR(FIXUP_RING0_GUEST_STACK)                \
-    STR(SAVE_ALL(a))                            \
-    "movl %esp,%eax\n\t"                        \
+    STR(SAVE_ALL(1f,1f)) "\n\t"                 \
+    "1:movl %esp,%eax\n\t"                      \
     "pushl %eax\n\t"                            \
     "call "STR(smp_##x)"\n\t"                   \
     "addl $4,%esp\n\t"                          \
@@ -105,8 +106,8 @@ __asm__(                                
     "\n" __ALIGN_STR"\n"                        \
     "common_interrupt:\n\t"                     \
     STR(FIXUP_RING0_GUEST_STACK)                \
-    STR(SAVE_ALL(a))                            \
-    "movl %esp,%eax\n\t"                        \
+    STR(SAVE_ALL(1f,1f)) "\n\t"                 \
+    "1:movl %esp,%eax\n\t"                      \
     "pushl %eax\n\t"                            \
     "call " STR(do_IRQ) "\n\t"                  \
     "addl $4,%esp\n\t"                          \
Index: 2007-05-14/xen/include/public/trace.h
===================================================================
--- 2007-05-14.orig/xen/include/public/trace.h	2007-05-21 08:58:02.000000000 +0200
+++ 2007-05-14/xen/include/public/trace.h	2007-05-15 17:55:19.000000000 +0200
@@ -88,6 +88,7 @@
 #define TRC_HVM_VMMCALL         (TRC_HVM_HANDLER + 0x12)
 #define TRC_HVM_HLT             (TRC_HVM_HANDLER + 0x13)
 #define TRC_HVM_INVLPG          (TRC_HVM_HANDLER + 0x14)
+#define TRC_HVM_MCE             (TRC_HVM_HANDLER + 0x15)
 
 /* This structure represents a single trace buffer record. */
 struct t_rec {

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NMI deferral on i386
  2007-05-21 14:01       ` Jan Beulich
@ 2007-05-21 14:17         ` Keir Fraser
  2007-05-21 14:34           ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2007-05-21 14:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 21/5/07 15:01, "Jan Beulich" <jbeulich@novell.com> wrote:

> I think I found a pretty reasonable solution, which I'm attaching in its
> current (3.1-based) form, together with the prior (untested) variant that
> copied the NMI behavior. Even if it doesn't apply to -unstable, I'd be glad
> if you could have a brief look to see whether you consider the approach too
> intrusive (in which case there would be no point in trying to bring it
> forward to -unstable).

You'll have to explain how the changes to the x86_32 entry.S work. The rest
of the patch looks good.

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NMI deferral on i386
  2007-05-21 14:17         ` Keir Fraser
@ 2007-05-21 14:34           ` Jan Beulich
  2007-05-23 10:03             ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2007-05-21 14:34 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <keir@xensource.com> 21.05.07 16:17 >>>
>On 21/5/07 15:01, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> I think I found a pretty reasonable solution, which I'm attaching in its
>> current (3.1-based) form, together with the prior (untested) variant that
>> copied the NMI behavior. Even if it doesn't apply to -unstable, I'd be glad
>> if you could have a brief look to see whether you consider the approach too
>> intrusive (in which case there would be no point in trying to bring it
>> forward to -unstable).
>
>You'll have to explain how the changes to the x86_32 entry.S work. The rest
>of the patch looks good.

The idea is to always check values read from %ds and %es against __HYPERVISOR_DS,
and only store them into the current frame (all normal handlers) or the
outer-most one (NMI and MCE) if the value read is different. That way, any NMI
or MCE occurring during frame setup will store, on behalf of the interrupted
handler, whichever selectors have not been saved so far. The interrupted
handler will either have managed to read the guest selector (in which case it
can store it regardless of whether the NMI/MCE kicked in between the read and
the store), or it will find __HYPERVISOR_DS already in the register, in which
case it knows not to store (as the nested handler will have done the store).

For the restore portion this makes use of the fact that there's exactly one
such code sequence. By moving the selector restore part past all other
restores (including all stack pointer adjustments), the NMI/MCE handlers can
safely detect whether any selector would have been restored already (by
range checking EIP) and move EIP back to the beginning of the selector
restore sequence without having to play with the stack pointer itself or any
other GPR.

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NMI deferral on i386
  2007-05-21 14:34           ` Jan Beulich
@ 2007-05-23 10:03             ` Keir Fraser
  0 siblings, 0 replies; 11+ messages in thread
From: Keir Fraser @ 2007-05-23 10:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

Well, this sounds fine to me. If you port it I'll apply it. I would prefer
it as a separate patch from the rest of the MCA/MCE changes really, but if
that's a pain then don't worry about it.

 -- Keir

On 21/5/07 15:34, "Jan Beulich" <jbeulich@novell.com> wrote:

> The idea is to always check values read from %ds and %es against
> __HYPERVISOR_DS, and only store into the current frame (all normal handlers)
> or the outer-most one (NMI and MCE) if the value read is different. That way,
> any NMI or MCE occurring during frame setup will store selectors not saved so
> far on behalf of the interrupted handler, with that interrupted handler either
> having managed to read the guest selector (in which case it can store it
> regardless of whether NMI/MCE kicked in between the read and the store) or
> finding __HYPERVISOR_DS already in the register, in which case it'll know not
> to store (as the nested handler would have done the store).
> 
> For the restore portion this makes use of the fact that there's exactly one
> such code sequence, and by moving the selector restore part past all other
> restores (including all stack pointer adjustments) the NMI/MCE handlers can
> safely detect whether any selector would have been restored already (by
> range checking EIP) and move EIP back to the beginning of the selector
> restore sequence without having to play with the stack pointer itself or any
> other gpr.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-05-23 10:03 UTC | newest]

Thread overview: 11+ messages
2007-05-15 14:46 NMI deferral on i386 Jan Beulich
2007-05-15 15:00 ` Keir Fraser
2007-05-16  8:17   ` Jan Beulich
2007-05-16  8:28     ` Keir Fraser
2007-05-16 10:10       ` Jan Beulich
2007-05-16 12:32         ` Keir Fraser
2007-05-16 14:19           ` Jan Beulich
2007-05-21 14:01       ` Jan Beulich
2007-05-21 14:17         ` Keir Fraser
2007-05-21 14:34           ` Jan Beulich
2007-05-23 10:03             ` Keir Fraser
