[PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
@ 2012-11-13 20:08 Malcolm Crossley
  2012-11-14 10:06 ` Jan Beulich
  2012-11-22  8:58 ` Jan Beulich
  0 siblings, 2 replies; 21+ messages in thread
From: Malcolm Crossley @ 2012-11-13 20:08 UTC (permalink / raw)
  To: xen-devel; +Cc: tim, eddie.dong, JBeulich, jun.nakajima, Ian.Campbell

The self_nmi() code cause's an NMI to be triggered by sending an APIC message
to the local processor. Unfortunately there is a delay in the delivery of the
APIC message and we may already have re-entered the HVM guest by the time the
NMI is taken. This results in the VMEXIT code calling the self_nmi() function
again. We have seen hundreds of iterations of this VMEXIT/VMENTER loop before
the HVM guest resumes normal operation.

Volume 3 Chapter 27 Section 1 of the Intel SDM states:

An NMI causes subsequent NMIs to be blocked, but only after the VM exit
completes.

So we believe it is safe to directly invoke the INT call as NMI's should
already be blocked.

The INT 2 call will perform an IRET which will unblock later calls to the NMI
handler, according to Intel SDM Volume 3 Chapter 6.7.1. We must ensure that the
IRET from the INT 2 IRET is the first IRET issued to prevent losing an NMI.
Moving the INT 2 call to before the interrupts are enabled should ensure we
don't lose the NMI.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

diff -r 62885b3c34c8 -r 7d6fd0219dd7 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2269,6 +2269,14 @@ void vmx_vmexit_handler(struct cpu_user_
         vector = intr_info & INTR_INFO_VECTOR_MASK;
         if ( vector == TRAP_machine_check )
             do_machine_check(regs);
+        else if ( vector == TRAP_nmi &&
+                ( (intr_info & INTR_INFO_INTR_TYPE_MASK) ==
+                  (X86_EVENTTYPE_NMI << 8) ) )
+            /* Must be called before interrupts are enabled to ensure
+             * the NMI handler code is run before the first IRET. The
+             * IRET unblocks subsequent NMI's (Intel SDM Vol 3, 6.7.1)
+             */
+            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
         break;
     case EXIT_REASON_MCE_DURING_VMENTRY:
         do_machine_check(regs);
@@ -2442,7 +2450,6 @@ void vmx_vmexit_handler(struct cpu_user_
                  (X86_EVENTTYPE_NMI << 8) )
                 goto exit_and_crash;
             HVMTRACE_0D(NMI);
-            self_nmi(); /* Real NMI, vector 2: normal processing. */
             break;
         case TRAP_machine_check:
             HVMTRACE_0D(MCE);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-13 20:08 [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler Malcolm Crossley
@ 2012-11-14 10:06 ` Jan Beulich
  2012-11-15 16:41   ` Tim Deegan
  2012-11-22  8:58 ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-11-14 10:06 UTC (permalink / raw)
  To: Malcolm Crossley, tim; +Cc: eddie.dong, Ian.Campbell, jun.nakajima, xen-devel

>>> On 13.11.12 at 21:08, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
> The self_nmi() code cause's an NMI to be triggered by sending an APIC message
> to the local processor. Unfortunately there is a delay in the delivery of 
> the
> APIC message and we may already have re-entered the HVM guest by the time the
> NMI is taken. This results in the VMEXIT code calling the self_nmi() 
> function
> again. We have seen hundreds of iterations of this VMEXIT/VMENTER loop 
> before
> the HVM guest resumes normal operation.
> 
> Volume 3 Chapter 27 Section 1 of the Intel SDM states:
> 
> An NMI causes subsequent NMIs to be blocked, but only after the VM exit
> completes.
> 
> So we believe it is safe to directly invoke the INT call as NMI's should
> already be blocked.
> 
> The INT 2 call will perform an IRET which will unblock later calls to the 
> NMI
> handler, according to Intel SDM Volume 3 Chapter 6.7.1. We must ensure that 
> the
> IRET from the INT 2 IRET is the first IRET issued to prevent losing an NMI.
> Moving the INT 2 call to before the interrupts are enabled should ensure we
> don't lose the NMI.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Acked-by: Tim Deegan <tim@xen.org>
> 
> diff -r 62885b3c34c8 -r 7d6fd0219dd7 xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -2269,6 +2269,14 @@ void vmx_vmexit_handler(struct cpu_user_
>          vector = intr_info & INTR_INFO_VECTOR_MASK;
>          if ( vector == TRAP_machine_check )
>              do_machine_check(regs);
> +        else if ( vector == TRAP_nmi &&
> +                ( (intr_info & INTR_INFO_INTR_TYPE_MASK) ==
> +                  (X86_EVENTTYPE_NMI << 8) ) )
> +            /* Must be called before interrupts are enabled to ensure
> +             * the NMI handler code is run before the first IRET. The
> +             * IRET unblocks subsequent NMI's (Intel SDM Vol 3, 6.7.1)
> +             */
> +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */

And I still don't like this use of "int $2" here: An aspect we didn't
consider so far is that a nested MCE would break things again
(yes, that also is the case for NMIs arriving when in VMX root
context, but this wouldn't be the case here if we called do_nmi()
and took care of the missing IRET in a place also covering the PV
path, e.g. in continue_idle_domain() or reset_stack_and_jump();
we should at least keep the window where this can happen as
small as possible).

Jan

>          break;
>      case EXIT_REASON_MCE_DURING_VMENTRY:
>          do_machine_check(regs);
> @@ -2442,7 +2450,6 @@ void vmx_vmexit_handler(struct cpu_user_
>                   (X86_EVENTTYPE_NMI << 8) )
>                  goto exit_and_crash;
>              HVMTRACE_0D(NMI);
> -            self_nmi(); /* Real NMI, vector 2: normal processing. */
>              break;
>          case TRAP_machine_check:
>              HVMTRACE_0D(MCE);
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-14 10:06 ` Jan Beulich
@ 2012-11-15 16:41   ` Tim Deegan
  2012-11-15 16:52     ` Andrew Cooper
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Tim Deegan @ 2012-11-15 16:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Malcolm Crossley, eddie.dong, Ian.Campbell, jun.nakajima,
	xen-devel

Hi, 

At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
> > +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
> 
> And I still don't like this use of "int $2" here: An aspect we didn't
> consider so far is that a nested MCE would break things again

OK, I think I understand the problem[s], but I'm going to spell it out
slowly so you can correct me. :)

[ tl;dr I agree that do_nmi() is better, and we should do that in this
  patch, but maybe we need to solve the general problem too. ]

On a PV guest, we have to use dedicated stacks for NMI and MCE in case
either of those things happens just before SYSRET when we're on the user
stack (no other interrupt or exception can happen at that point).

On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
running a HVM guest, so the stack issue doesn't apply (but nested NMIs
are still bad).

On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
guests.  We don't really have to but it saves time in the context switch
not to update the IDT.  Using do_nmi() here means that the first NMI is
handled on the normal stack instead.  It's also consistent with the way
we call do_machine_check() for the MCE case.  But it needs an explicit
IRET after the call to do_nmi() to make sure that NMIs get re-enabled.

These dedicated stacks make the general problem of re-entrant MCE/NMI
worse.  In the general case those handlers don't expect to be called in
a reentrant way, but blatting the stack turns a possible problem into a
definite one.

---

All of this would be moot except for the risk that we might take an MCE
while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
while we're still in the NMI handler, and a second NMI arriving could
break the NMI handler.  In the PV case, it will also clobber the NMI
handler's stack.  In the VMX case we would need to see something like
(NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.

The inverse case, taking an NMI while in the MCE handler, is not very
interesting.  There's no masking of MCEs so that handler already has to
deal with nested entry, and the IRET from the NMI handler has no effect.

We could potentially solve the problem by having the MCE handler check
whether it's returning to the NMI stack, and do a normal return in that
case.  It's a bit of extra code but only in the MCE handler, which is
not performance-critical. 

If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
is mostly one of taste.  do_nmi() saves an IDT indirection but
unbalances the call/return stack.  I slightly prefer 'int $2' just
because it makes the PV and non-PV cases more similar.

But first, we should take the current fix, with do_nmi() and iret() 
instead of 'int $2'.  The nested-MCE issue can be handled separately.

Does that make sense?

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 16:41   ` Tim Deegan
@ 2012-11-15 16:52     ` Andrew Cooper
  2012-11-15 17:25       ` Tim Deegan
  2012-11-15 17:03     ` Mats Petersson
  2012-11-16  8:07     ` Jan Beulich
  2 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2012-11-15 16:52 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Ian Campbell, eddie.dong@intel.com, xen-devel, Jan Beulich,
	jun.nakajima@intel.com, Malcolm Crossley


On 15/11/12 16:41, Tim Deegan wrote:
> Hi, 
>
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>>> +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
>
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>   patch, but maybe we need to solve the general problem too. ]
>
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).
>
> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).
>
> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>
> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse.  In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.

I have made a fairly simple patch which deliberately invokes a
re-entrant NMI.  The result is that a PCPU spins around the NMI handler
until the watchdog takes the host down.  It is also possible to get a
reentrant NMI if there is a pagefault (or handful of other possible
faults) when trying to execute the iret of the NMI itself; NMIs can get
re-enabled from the iret of the pagefault, and we take a new NMI before
attempting to retry the iret from the original NMI.

>
> ---
>
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler.  In the PV case, it will also clobber the NMI
> handler's stack.  In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.

There is the MCIP bit in an MCE status MSR which acts as a latch for
MCEs.  If a new MCE is generated while this bit is set, then a triple
fault occurs.  An MCE handler is required to set this bit to 0 to
indicate that it has dealt with the MCE.  However, there is a race
condition window between setting this bit to 0 and leaving the MCE stack
during which another MCE can arrive and corrupt the stack.

>
> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting.  There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.
>
> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case.  It's a bit of extra code but only in the MCE handler, which is
> not performance-critical. 
>
> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste.  do_nmi() saves an IDT indirection but
> unbalances the call/return stack.  I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.
>
> But first, we should take the current fix, with do_nmi() and iret() 
> instead of 'int $2'.  The nested-MCE issue can be handled separately.
>
> Does that make sense?

I have been looking at appling a similar fix to Linuses fix
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0)
to Xen, for both the NMI and MCE stacks.

Work is currently in the preliminary stages at the moment.

~Andrew

>
> Tim.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 16:41   ` Tim Deegan
  2012-11-15 16:52     ` Andrew Cooper
@ 2012-11-15 17:03     ` Mats Petersson
  2012-11-15 17:15       ` Tim Deegan
  2012-11-16  8:07     ` Jan Beulich
  2 siblings, 1 reply; 21+ messages in thread
From: Mats Petersson @ 2012-11-15 17:03 UTC (permalink / raw)
  To: xen-devel

On 15/11/12 16:41, Tim Deegan wrote:
> Hi,
>
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>>> +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
>
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>    patch, but maybe we need to solve the general problem too. ]
>
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).
>
> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).
>
> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.

Both AMD and Intel has an identical setup with regard to stacks and 
general "what happens when we taken one of these interrupts". As Andy 
Cooper is about to post, there are ways to solve the nesting of either 
of these interrupts.

There are subtle differences between Intel and AMD in how the actual 
interrupt gets handled when the processor is currently running a HVM 
guest, which is why the patch in this thread is "Intel only", because 
the Intel system "eats" the NMI, and it needs to be "reissued". On AMD, 
the interrupts are held pending until the guest has exited into the 
Hypervisor, and then taken after the processor executes the STGI (Set 
Global Interrupt Enable) instruction - analogous to what happens when 
you have a CLI and STI for ordinary interrupts.

The issues with regards to nesting of NMI and MCE is completely 
different from the "how do we issue an NMI from the HVM handling code 
when the guest got interrupted by NMI".

Nested NMI and nested MCE is the same problem on both AMD and Intel 
processors, and need a completely separate solution [Andy is working on 
this].

--
Mats

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 17:03     ` Mats Petersson
@ 2012-11-15 17:15       ` Tim Deegan
  2012-11-15 17:33         ` Mats Petersson
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2012-11-15 17:15 UTC (permalink / raw)
  To: Mats Petersson; +Cc: xen-devel

At 17:03 +0000 on 15 Nov (1352998993), Mats Petersson wrote:
> >On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> >running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> >are still bad).
> >
> >On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> >guests.  We don't really have to but it saves time in the context switch
> >not to update the IDT.  Using do_nmi() here means that the first NMI is
> >handled on the normal stack instead.  It's also consistent with the way
> >we call do_machine_check() for the MCE case.  But it needs an explicit
> >IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
> 
> Both AMD and Intel has an identical setup with regard to stacks and 
> general "what happens when we taken one of these interrupts".

My reading of svm_ctxt_switch_{to,from} makes me disagree with this. 
AFAICT, on SVM we're not using dedicated stacks at all. 

> The issues with regards to nesting of NMI and MCE is completely 
> different from the "how do we issue an NMI from the HVM handling code 
> when the guest got interrupted by NMI".

Yes.  As I said, we should take the fix to the VMX NMI handling now, and
sort out the nesting separately.

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 16:52     ` Andrew Cooper
@ 2012-11-15 17:25       ` Tim Deegan
  2012-11-16  8:17         ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2012-11-15 17:25 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ian Campbell, eddie.dong@intel.com, xen-devel, Jan Beulich,
	jun.nakajima@intel.com, Malcolm Crossley

At 16:52 +0000 on 15 Nov (1352998340), Andrew Cooper wrote:
> It is also possible to get a reentrant NMI if there is a pagefault (or
> handful of other possible faults) when trying to execute the iret of
> the NMI itself; NMIs can get re-enabled from the iret of the
> pagefault, and we take a new NMI before attempting to retry the iret
> from the original NMI.

Yes, I hadn't thought of that case.

> > All of this would be moot except for the risk that we might take an MCE
> > while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> > while we're still in the NMI handler, and a second NMI arriving could
> > break the NMI handler.  In the PV case, it will also clobber the NMI
> > handler's stack.  In the VMX case we would need to see something like
> > (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.
> 
> There is the MCIP bit in an MCE status MSR which acts as a latch for
> MCEs.  If a new MCE is generated while this bit is set, then a triple
> fault occurs.  An MCE handler is required to set this bit to 0 to
> indicate that it has dealt with the MCE.  However, there is a race
> condition window between setting this bit to 0 and leaving the MCE stack
> during which another MCE can arrive and corrupt the stack.

Sure.  Nested MCEs aren't very interesting to me.  If you're taking MCEs
faster than you can handle them, the difference between one of your CPUs
resetting and the stack getting blasted isn't that much. :)

> I have been looking at appling a similar fix to Linuses fix
> (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0)
> to Xen, for both the NMI and MCE stacks.
> 
> Work is currently in the preliminary stages at the moment.

Sounds great!

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 17:15       ` Tim Deegan
@ 2012-11-15 17:33         ` Mats Petersson
  2012-11-15 17:44           ` Tim Deegan
  0 siblings, 1 reply; 21+ messages in thread
From: Mats Petersson @ 2012-11-15 17:33 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel@lists.xen.org

On 15/11/12 17:15, Tim Deegan wrote:
> At 17:03 +0000 on 15 Nov (1352998993), Mats Petersson wrote:
>>> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
>>> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
>>> are still bad).
>>>
>>> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
>>> guests.  We don't really have to but it saves time in the context switch
>>> not to update the IDT.  Using do_nmi() here means that the first NMI is
>>> handled on the normal stack instead.  It's also consistent with the way
>>> we call do_machine_check() for the MCE case.  But it needs an explicit
>>> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>> Both AMD and Intel has an identical setup with regard to stacks and
>> general "what happens when we taken one of these interrupts".
> My reading of svm_ctxt_switch_{to,from} makes me disagree with this.
> AFAICT, on SVM we're not using dedicated stacks at all.
In SVM, the VMRUN returns to whatever stack you had before the VMRUN. 
This is not what I'm talking about, however. The stack used for the NMI 
and MCE comes from the interrupt descriptor entry for those respective 
vectors.

I'm fairly sure (but I haven't followed all the code-paths to verify 
this) that both AMD and Intel HVM code uses the same stack when a VMEXIT 
happens (as in, the the value in RSP, at least nominally, the same value 
each time you end up in the {svm,vmx}_exit_handler(), but the value may 
vary from one time to another, and probably isn't the exact same on an 
Intel vs. AMD comparison). The Intel solution is slightly different as 
to "how you end up with the RSP value you want to be at", but that's a 
beside the point.

Either way, what stack we are on at VMEXIT shouldn't matter to the 
handling of NMI or MCE, but we should avoid using the "current" stack to 
handle NMI, in case it's somehow causing further problems to do so, and 
we need the NMI to "get out of a problem". Similarly for MCE - if the 
memory used by RSP is bad, we probably don't want to reboot the machine 
by using it to store the return address for MCE (although I'm not sure 
we can completely avoid that)...

So in conclusion, the do_mce_exception() call probably should be a 
__asm__ __volatile__("int $X"), where X is the relevant vector. 
[Although I admit that when we take an MCE exception from HVM, it 
hopefully isn't using the Hypervisor stack that the VMEXIT happens to 
use... That would be rather bad in all manner of ways, and of course, 
the MCE stack may also be equally bad!]

--
Mats
>
>> The issues with regards to nesting of NMI and MCE is completely
>> different from the "how do we issue an NMI from the HVM handling code
>> when the guest got interrupted by NMI".
> Yes.  As I said, we should take the fix to the VMX NMI handling now, and
> sort out the nesting separately.
>
> Tim.
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 17:33         ` Mats Petersson
@ 2012-11-15 17:44           ` Tim Deegan
  2012-11-15 18:23             ` Mats Petersson
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2012-11-15 17:44 UTC (permalink / raw)
  To: Mats Petersson; +Cc: xen-devel@lists.xen.org

At 17:33 +0000 on 15 Nov (1353000782), Mats Petersson wrote:
> On 15/11/12 17:15, Tim Deegan wrote:
> >At 17:03 +0000 on 15 Nov (1352998993), Mats Petersson wrote:
> >>>On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> >>>running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> >>>are still bad).
> >>>
> >>>On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> >>>guests.  We don't really have to but it saves time in the context switch
> >>>not to update the IDT.  Using do_nmi() here means that the first NMI is
> >>>handled on the normal stack instead.  It's also consistent with the way
> >>>we call do_machine_check() for the MCE case.  But it needs an explicit
> >>>IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
> >>Both AMD and Intel has an identical setup with regard to stacks and
> >>general "what happens when we taken one of these interrupts".
> >My reading of svm_ctxt_switch_{to,from} makes me disagree with this.
> >AFAICT, on SVM we're not using dedicated stacks at all.
> In SVM, the VMRUN returns to whatever stack you had before the VMRUN. 
> This is not what I'm talking about, however. The stack used for the NMI 
> and MCE comes from the interrupt descriptor entry for those respective 
> vectors.

This is the code I was referring to:

    /*
     * Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
     * But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
     */
    idt_tables[cpu][TRAP_double_fault].a  &= ~(7UL << 32);
    idt_tables[cpu][TRAP_nmi].a           &= ~(7UL << 32);
    idt_tables[cpu][TRAP_machine_check].a &= ~(7UL << 32);

Am I misreading it?

> So in conclusion, the do_mce_exception() call probably should be a 
> __asm__ __volatile__("int $X"), where X is the relevant vector. 

This handles MCEs that were raised in guest context.  If we've managed
to get this far into the exit handler, the hypervisor stack is probably
OK. :)

I'd be happy to invoke the MCE handler though the IDT here, just for
symmetry with the other cases, but I don't think it makes much
difference. 

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 17:44           ` Tim Deegan
@ 2012-11-15 18:23             ` Mats Petersson
  0 siblings, 0 replies; 21+ messages in thread
From: Mats Petersson @ 2012-11-15 18:23 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel@lists.xen.org

On 15/11/12 17:44, Tim Deegan wrote:
> At 17:33 +0000 on 15 Nov (1353000782), Mats Petersson wrote:
>> On 15/11/12 17:15, Tim Deegan wrote:
>>> At 17:03 +0000 on 15 Nov (1352998993), Mats Petersson wrote:
>>>>> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
>>>>> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
>>>>> are still bad).
>>>>>
>>>>> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
>>>>> guests.  We don't really have to but it saves time in the context switch
>>>>> not to update the IDT.  Using do_nmi() here means that the first NMI is
>>>>> handled on the normal stack instead.  It's also consistent with the way
>>>>> we call do_machine_check() for the MCE case.  But it needs an explicit
>>>>> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>>>> Both AMD and Intel has an identical setup with regard to stacks and
>>>> general "what happens when we taken one of these interrupts".
>>> My reading of svm_ctxt_switch_{to,from} makes me disagree with this.
>>> AFAICT, on SVM we're not using dedicated stacks at all.
>> In SVM, the VMRUN returns to whatever stack you had before the VMRUN.
>> This is not what I'm talking about, however. The stack used for the NMI
>> and MCE comes from the interrupt descriptor entry for those respective
>> vectors.
> This is the code I was referring to:
>
>      /*
>       * Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
>       * But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
>       */
>      idt_tables[cpu][TRAP_double_fault].a  &= ~(7UL << 32);
>      idt_tables[cpu][TRAP_nmi].a           &= ~(7UL << 32);
>      idt_tables[cpu][TRAP_machine_check].a &= ~(7UL << 32);
>
> Am I misreading it?

No, you are reading it perfectly right, I'm wrong...

--
Mats
>
>> So in conclusion, the do_mce_exception() call probably should be a
>> __asm__ __volatile__("int $X"), where X is the relevant vector.
> This handles MCEs that were raised in guest context.  If we've managed
> to get this far into the exit handler, the hypervisor stack is probably
> OK. :)
>
> I'd be happy to invoke the MCE handler though the IDT here, just for
> symmetry with the other cases, but I don't think it makes much
> difference.
>
> Tim.
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 16:41   ` Tim Deegan
  2012-11-15 16:52     ` Andrew Cooper
  2012-11-15 17:03     ` Mats Petersson
@ 2012-11-16  8:07     ` Jan Beulich
  2012-11-16 10:56       ` Tim Deegan
  2 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-11-16  8:07 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Malcolm Crossley, eddie.dong, Ian.Campbell, jun.nakajima,
	xen-devel

>>> On 15.11.12 at 17:41, Tim Deegan <tim@xen.org> wrote:
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>> > +            asm volatile("int $2"); /* Real NMI, vector 2: normal 
> processing */
>> 
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> 
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
> 
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>   patch, but maybe we need to solve the general problem too. ]
> 
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).

Yes.

> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).

Yes, albeit that's a potential (separate) problem too (because we
don't distinguish the event having its origin in guest or hypervisor
context, i.e. an MCE due to a back stack page would be handled
on that same stack page, i.e. shutdown).

> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.

Or have the subsequent VMRESUME take care of this. There is a
sentence in "26.6.1 Interruptibility State" to that effect: "NMIs are
not blocked in VMX nonroot operation (except for ordinary blocking
for other reasons, such as by the MOV SS instruction, the
wait-for-SIPI state, etc.)", so NMIs get unblocked implicitly with
the VMRESUME.

Hence the only problem is with ending the NMI context when not
exiting back to VMX guest.

I continue to not be in favor of special casing this in VMX code,
considering that the problem is generic (i.e. similarly affects PV).
I.e. either we handle the other case similarly (special code added
also to the PV code path), or we deal with this in a single place,
keeping the NMIs masked for an extended period of time.

> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse.  In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.

Yes. I think almost everyone agrees that this is a design flaw.

> ---
> 
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler.  In the PV case, it will also clobber the NMI
> handler's stack.

No - the entry code switches away from the dedicated stacks when
the origin was in guest context (see handle_ist_exception in
xen/arch/x86/x86_64/entry.S).

> In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.
> 
> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting.  There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.

As already pointed out by Andrew, there is an in-progress bit
for this. The thing is that we'd have to switch away from the
dedicated stack before clearing that bit (which shouldn't be too
difficult; if the MCE was caused by the normal stack, we
shouldn't be getting to the point of trying to exit from the MCE
handler anyway).

> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case.  It's a bit of extra code but only in the MCE handler, which is
> not performance-critical. 

Yes, that could solve that nesting case (again not very difficult
to implement).

> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste.  do_nmi() saves an IDT indirection but
> unbalances the call/return stack.  I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.

Once the nesting is dealt with properly, and if we decide to do
the NMI-disabled-window-exit early, then yes, I agree (apart
from the taste aspect, which would still tell me not to use "int $xx"
in hypervisor code). And, as I think Mats said, this might then be
done for the MCE the same way for consistency.

> But first, we should take the current fix, with do_nmi() and iret() 
> instead of 'int $2'.  The nested-MCE issue can be handled separately.

Leaving the PV case un-addressed...

> Does that make sense?

Of course.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-15 17:25       ` Tim Deegan
@ 2012-11-16  8:17         ` Jan Beulich
  2012-11-16  9:59           ` Mats Petersson
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-11-16  8:17 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan
  Cc: Malcolm Crossley, eddie.dong@intel.com, Ian Campbell,
	jun.nakajima@intel.com, xen-devel

>>> On 15.11.12 at 18:25, Tim Deegan <tim@xen.org> wrote:
> At 16:52 +0000 on 15 Nov (1352998340), Andrew Cooper wrote:
>> It is also possible to get a reentrant NMI if there is a pagefault (or
>> handful of other possible faults) when trying to execute the iret of
>> the NMI itself; NMIs can get re-enabled from the iret of the
>> pagefault, and we take a new NMI before attempting to retry the iret
>> from the original NMI.
> 
> Yes, I hadn't thought of that case.

But what would make a fault happen on that IRET? Oh, yes,
there is one case - the guest having its previous instruction end
exactly at the canonical/non-canonical boundary. But for the
sake of correctness, that's a #GP then. I would suppose this
would better be filtered (manually injecting a #GP into the guest)
than allowed to actually cause a #GP.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16  8:17         ` Jan Beulich
@ 2012-11-16  9:59           ` Mats Petersson
  2012-11-16 10:18             ` Keir Fraser
  0 siblings, 1 reply; 21+ messages in thread
From: Mats Petersson @ 2012-11-16  9:59 UTC (permalink / raw)
  To: xen-devel

On 16/11/12 08:17, Jan Beulich wrote:
>>>> On 15.11.12 at 18:25, Tim Deegan <tim@xen.org> wrote:
>> At 16:52 +0000 on 15 Nov (1352998340), Andrew Cooper wrote:
>>> It is also possible to get a reentrant NMI if there is a pagefault (or
>>> handful of other possible faults) when trying to execute the iret of
>>> the NMI itself; NMIs can get re-enabled from the iret of the
>>> pagefault, and we take a new NMI before attempting to retry the iret
>>> from the original NMI.
>> Yes, I hadn't thought of that case.
> But what would make a fault happen on that IRET? Oh, yes,
> there is one case - the guest having its previous instruction end
> exactly at the canonical/non-canonical boundary. But for the
> sake of correctness, that's a #GP then. I would suppose this
> would better be filtered (manually injecting a #GP into the guest)
> than allowed to actually cause a #GP.
Or, if for some reason the address we return to is "not present". Now, 
in the current Xen, Xen itself doesn't get paged out, but in a PV guest, 
I'm pretty certain the guest could decide to page out some code-page, 
which just happens to be the one we were about to return to?

--
Mats
>
> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16  9:59           ` Mats Petersson
@ 2012-11-16 10:18             ` Keir Fraser
  0 siblings, 0 replies; 21+ messages in thread
From: Keir Fraser @ 2012-11-16 10:18 UTC (permalink / raw)
  To: Mats Petersson, xen-devel

On 16/11/2012 09:59, "Mats Petersson" <mats.petersson@citrix.com> wrote:

>>> Yes, I hadn't thought of that case.
>> But what would make a fault happen on that IRET? Oh, yes,
>> there is one case - the guest having its previous instruction end
>> exactly at the canonical/non-canonical boundary. But for the
>> sake of correctness, that's a #GP then. I would suppose this
>> would better be filtered (manually injecting a #GP into the guest)
>> than allowed to actually cause a #GP.
> Or, if for some reason the address we return to is "not present". Now,
> in the current Xen, Xen itself doesn't get paged out, but in a PV guest,
> I'm pretty certain the guest could decide to page out some code-page,
> which just happens to be the one we were about to return to?

That fault would occur in the context being returned to, with rIP == the
IRET return target.

 -- Keir

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16  8:07     ` Jan Beulich
@ 2012-11-16 10:56       ` Tim Deegan
  2012-11-16 11:23         ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2012-11-16 10:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Malcolm Crossley, eddie.dong, Ian.Campbell, jun.nakajima,
	xen-devel

At 08:07 +0000 on 16 Nov (1353053247), Jan Beulich wrote:
> I continue to not be in favor of special casing this in VMX code,
> considering that the problem is generic (i.e. similarly affects PV).
> I.e. either we handle the other case similarly (special code added
> also to the PV code path), or we deal with this in a single place,
> keeping the NMIs masked for an extended period of time.

Affects PV because PV might use SYSRET to return from the NMI handler?
Right.

> > All of this would be moot except for the risk that we might take an MCE
> > while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> > while we're still in the NMI handler, and a second NMI arriving could
> > break the NMI handler.  In the PV case, it will also clobber the NMI
> > handler's stack.
> 
> No - the entry code switches away from the dedicated stacks when
> the origin was in guest context (see handle_ist_exception in
> xen/arch/x86/x86_64/entry.S).

I see, thanks.  So it's only if we take the NMI while in the hypervisor
that we stay on the NMI stack and risk getting the stack clobbered.

> > We could potentially solve the problem by having the MCE handler check
> > whether it's returning to the NMI stack, and do a normal return in that
> > case.  It's a bit of extra code but only in the MCE handler, which is
> > not performance-critical. 
> 
> Yes, that could solve that nesting case (again not very difficult
> to implement).

How about we just have the MCE handler return without IRET in _all_
cases where it's returning to ring 0?  I think that entirely solves the
MCE-in-NMI problem, without all the extra mechanism meeded for the
linux-style solution.  (Unless we want to allow other traps in either
the NMI or MCE handlers).

[And it occurs to me that the linux-style solution is tricky because
detecting the case where you've taken an NMI and not yet set the
nmi-in-progress flag is hard in both SVM (in the NMI handler but on the
normal stack) and VMX (in the _vmexit_ handler and on the normal
stack).]

So I guess now I'm suggesting:
 - MCE never returns to Xen with IRET;
 - NMI handling in handle_vmexit() moves to beside the MCE handling;
 - Explicit IRET-to-self at the end of do_nmi() to unmask NMIs; and
 - no int $2. :)

How's that?  I feel sure I must have missed a case - itt sounds too
easy.

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16 10:56       ` Tim Deegan
@ 2012-11-16 11:23         ` Jan Beulich
  2012-11-16 11:52           ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-11-16 11:23 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Malcolm Crossley, eddie.dong, Ian.Campbell, jun.nakajima,
	xen-devel

>>> On 16.11.12 at 11:56, Tim Deegan <tim@xen.org> wrote:
> At 08:07 +0000 on 16 Nov (1353053247), Jan Beulich wrote:
>> > We could potentially solve the problem by having the MCE handler check
>> > whether it's returning to the NMI stack, and do a normal return in that
>> > case.  It's a bit of extra code but only in the MCE handler, which is
>> > not performance-critical. 
>> 
>> Yes, that could solve that nesting case (again not very difficult
>> to implement).
> 
> How about we just have the MCE handler return without IRET in _all_
> cases where it's returning to ring 0?  I think that entirely solves the
> MCE-in-NMI problem, without all the extra mechanism meeded for the
> linux-style solution.

Good suggestion.

>  (Unless we want to allow other traps in either
> the NMI or MCE handlers).

We should absolutely avoid that.

> [And it occurs to me that the linux-style solution is tricky because
> detecting the case where you've taken an NMI and not yet set the
> nmi-in-progress flag is hard in both SVM (in the NMI handler but on the
> normal stack) and VMX (in the _vmexit_ handler and on the normal
> stack).]

Agreed.

> So I guess now I'm suggesting:
>  - MCE never returns to Xen with IRET;

Yes. But that might need care with regard to EFI runtime services
(or maybe not, as we're, at least at present, not switching stacks
there). Nevertheless, to be on the safe side, we could restrict
avoiding the IRET to "Ring 0 and RIP in hypervisor space", as we
know we won't have interrupted the NMI handler if that's not the
case.

>  - NMI handling in handle_vmexit() moves to beside the MCE handling;

Yes.

>  - Explicit IRET-to-self at the end of do_nmi() to unmask NMIs; and

This IRET must then switch to the normal stack, if currently on
the NMI one, which might be a little tricky/fragile.

But then again we're on the NMI one only when we got there
from hypervisor context, and in that specific case we return
without handling softirqs anyway. So perhaps the stack switch
isn't needed, but the IRET-to-self must only be done when the
origin wasn't in hypervisor context.

>  - no int $2. :)

Yippee.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16 11:23         ` Jan Beulich
@ 2012-11-16 11:52           ` Andrew Cooper
  2012-11-16 13:53             ` Tim Deegan
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2012-11-16 11:52 UTC (permalink / raw)
  To: xen-devel


On 16/11/12 11:23, Jan Beulich wrote:
>>>> On 16.11.12 at 11:56, Tim Deegan <tim@xen.org> wrote:
>> At 08:07 +0000 on 16 Nov (1353053247), Jan Beulich wrote:
>>>> We could potentially solve the problem by having the MCE handler check
>>>> whether it's returning to the NMI stack, and do a normal return in that
>>>> case.  It's a bit of extra code but only in the MCE handler, which is
>>>> not performance-critical. 
>>> Yes, that could solve that nesting case (again not very difficult
>>> to implement).
>> How about we just have the MCE handler return without IRET in _all_
>> cases where it's returning to ring 0?  I think that entirely solves the
>> MCE-in-NMI problem, without all the extra mechanism meeded for the
>> linux-style solution.
> Good suggestion.
>
>>  (Unless we want to allow other traps in either
>> the NMI or MCE handlers).
> We should absolutely avoid that.
>
>> [And it occurs to me that the linux-style solution is tricky because
>> detecting the case where you've taken an NMI and not yet set the
>> nmi-in-progress flag is hard in both SVM (in the NMI handler but on the
>> normal stack) and VMX (in the _vmexit_ handler and on the normal
>> stack).]
> Agreed.

But we never need to detect this case.  If we take an NMI and ensure
there is no possibility for a trap before setting the nmi-in-progress
flag (which is not very hard, with it being a handful of instructions in
the handler), then we guarantee that NMIs are still blocked, and thus
cant be reentrant.

Also, for what it is worth, we do have traps on the NMI path in the form
of BUG()s, WARN()s and panic gubbins, although the host is in a fairly
dire state if we actually ever hit any of these.

~Andrew

>
>> So I guess now I'm suggesting:
>>  - MCE never returns to Xen with IRET;
> Yes. But that might need care with regard to EFI runtime services
> (or maybe not, as we're, at least at present, not switching stacks
> there). Nevertheless, to be on the safe side, we could restrict
> avoiding the IRET to "Ring 0 and RIP in hypervisor space", as we
> know we won't have interrupted the NMI handler if that's not the
> case.
>
>>  - NMI handling in handle_vmexit() moves to beside the MCE handling;
> Yes.
>
>>  - Explicit IRET-to-self at the end of do_nmi() to unmask NMIs; and
> This IRET must then switch to the normal stack, if currently on
> the NMI one, which might be a little tricky/fragile.
>
> But then again we're on the NMI one only when we got there
> from hypervisor context, and in that specific case we return
> without handling softirqs anyway. So perhaps the stack switch
> isn't needed, but the IRET-to-self must only be done when the
> origin wasn't in hypervisor context.
>
>>  - no int $2. :)
> Yippee.
>
> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16 11:52           ` Andrew Cooper
@ 2012-11-16 13:53             ` Tim Deegan
  2012-11-16 14:11               ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2012-11-16 13:53 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

At 11:52 +0000 on 16 Nov (1353066779), Andrew Cooper wrote:
> 
> On 16/11/12 11:23, Jan Beulich wrote:
> >>>> On 16.11.12 at 11:56, Tim Deegan <tim@xen.org> wrote:
> >> At 08:07 +0000 on 16 Nov (1353053247), Jan Beulich wrote:
> >>>> We could potentially solve the problem by having the MCE handler check
> >>>> whether it's returning to the NMI stack, and do a normal return in that
> >>>> case.  It's a bit of extra code but only in the MCE handler, which is
> >>>> not performance-critical. 
> >>> Yes, that could solve that nesting case (again not very difficult
> >>> to implement).
> >> How about we just have the MCE handler return without IRET in _all_
> >> cases where it's returning to ring 0?  I think that entirely solves the
> >> MCE-in-NMI problem, without all the extra mechanism meeded for the
> >> linux-style solution.
> > Good suggestion.
> >
> >>  (Unless we want to allow other traps in either
> >> the NMI or MCE handlers).
> > We should absolutely avoid that.
> >
> >> [And it occurs to me that the linux-style solution is tricky because
> >> detecting the case where you've taken an NMI and not yet set the
> >> nmi-in-progress flag is hard in both SVM (in the NMI handler but on the
> >> normal stack) and VMX (in the _vmexit_ handler and on the normal
> >> stack).]
> > Agreed.
> 
> But we never need to detect this case.  If we take an NMI and ensure
> there is no possibility for a trap before setting the nmi-in-progress
> flag

The problem is that there is no way to do that -- the trap we're worried
about is MCE, which can happen at any time.  That's why linux has the
backstop check for the case where the flag's not set but the return
address is on the NMI stack.

> (which is not very hard, with it being a handful of instructions in
> the handler),

It's quite a bit more than that in the VMX case.  I guess we need to
audit that code for possible faults.

> then we guarantee that NMIs are still blocked, and thus
> cant be reentrant.
> 
> Also, for what it is worth, we do have traps on the NMI path in the form
> of BUG()s, WARN()s and panic gubbins, although the host is in a fairly
> dire state if we actually ever hit any of these.

Ergh.  If there are any WARN()s we should get rid of them.  BUG()s are
fine. :)

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-16 13:53             ` Tim Deegan
@ 2012-11-16 14:11               ` Andrew Cooper
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Cooper @ 2012-11-16 14:11 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel@lists.xen.org

On 16/11/12 13:53, Tim Deegan wrote:
> At 11:52 +0000 on 16 Nov (1353066779), Andrew Cooper wrote:
>> On 16/11/12 11:23, Jan Beulich wrote:
>>>>>> On 16.11.12 at 11:56, Tim Deegan <tim@xen.org> wrote:
>>>> At 08:07 +0000 on 16 Nov (1353053247), Jan Beulich wrote:
>>>>>> We could potentially solve the problem by having the MCE handler check
>>>>>> whether it's returning to the NMI stack, and do a normal return in that
>>>>>> case.  It's a bit of extra code but only in the MCE handler, which is
>>>>>> not performance-critical. 
>>>>> Yes, that could solve that nesting case (again not very difficult
>>>>> to implement).
>>>> How about we just have the MCE handler return without IRET in _all_
>>>> cases where it's returning to ring 0?  I think that entirely solves the
>>>> MCE-in-NMI problem, without all the extra mechanism meeded for the
>>>> linux-style solution.
>>> Good suggestion.
>>>
>>>>  (Unless we want to allow other traps in either
>>>> the NMI or MCE handlers).
>>> We should absolutely avoid that.
>>>
>>>> [And it occurs to me that the linux-style solution is tricky because
>>>> detecting the case where you've taken an NMI and not yet set the
>>>> nmi-in-progress flag is hard in both SVM (in the NMI handler but on the
>>>> normal stack) and VMX (in the _vmexit_ handler and on the normal
>>>> stack).]
>>> Agreed.
>> But we never need to detect this case.  If we take an NMI and ensure
>> there is no possibility for a trap before setting the nmi-in-progress
>> flag
> The problem is that there is no way to do that -- the trap we're worried
> about is MCE, which can happen at any time.  That's why linux has the
> backstop check for the case where the flag's not set but the return
> address is on the NMI stack.

D'oh - your quite correct.  I overlooked that possibility.

>
>> (which is not very hard, with it being a handful of instructions in
>> the handler),
> It's quite a bit more than that in the VMX case.  I guess we need to
> audit that code for possible faults.

But if we fix the underlying NMI/MCE reentrant problem, then faults on
the vmexit patch cease to be an issue, do they not?  If and when
MCEs/NMIs/interrupts occur, they will be dealt with in the same manor as
any other interruption to hypervisor code.

~Andrew

>> then we guarantee that NMIs are still blocked, and thus
>> cant be reentrant.
>>
>> Also, for what it is worth, we do have traps on the NMI path in the form
>> of BUG()s, WARN()s and panic gubbins, although the host is in a fairly
>> dire state if we actually ever hit any of these.
> Ergh.  If there are any WARN()s we should get rid of them.  BUG()s are
> fine. :)
>
> Tim.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-13 20:08 [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler Malcolm Crossley
  2012-11-14 10:06 ` Jan Beulich
@ 2012-11-22  8:58 ` Jan Beulich
  2012-11-22 10:52   ` Andrew Cooper
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-11-22  8:58 UTC (permalink / raw)
  To: Malcolm Crossley, tim; +Cc: eddie.dong, Ian.Campbell, jun.nakajima, xen-devel

>>> On 13.11.12 at 21:08, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
> The self_nmi() code cause's an NMI to be triggered by sending an APIC message
> to the local processor. Unfortunately there is a delay in the delivery of 
> the
> APIC message and we may already have re-entered the HVM guest by the time the
> NMI is taken. This results in the VMEXIT code calling the self_nmi() 
> function
> again. We have seen hundreds of iterations of this VMEXIT/VMENTER loop 
> before
> the HVM guest resumes normal operation.
> 
> Volume 3 Chapter 27 Section 1 of the Intel SDM states:
> 
> An NMI causes subsequent NMIs to be blocked, but only after the VM exit
> completes.
> 
> So we believe it is safe to directly invoke the INT call as NMI's should
> already be blocked.
> 
> The INT 2 call will perform an IRET which will unblock later calls to the 
> NMI
> handler, according to Intel SDM Volume 3 Chapter 6.7.1. We must ensure that 
> the
> IRET from the INT 2 IRET is the first IRET issued to prevent losing an NMI.
> Moving the INT 2 call to before the interrupts are enabled should ensure we
> don't lose the NMI.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Acked-by: Tim Deegan <tim@xen.org>

As (I think) we agreed to not use "int $2", and as I don't recall
having seen a v3 of this patch - is that something that can be
expected any time soon? Ideally, I would want to incorporate
the changes here (and hopefully also the PV issue described
during the discussion) in the pending 4.2.1 (and, if applicable,
4.1.4) release(s).

Jan

> diff -r 62885b3c34c8 -r 7d6fd0219dd7 xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -2269,6 +2269,14 @@ void vmx_vmexit_handler(struct cpu_user_
>          vector = intr_info & INTR_INFO_VECTOR_MASK;
>          if ( vector == TRAP_machine_check )
>              do_machine_check(regs);
> +        else if ( vector == TRAP_nmi &&
> +                ( (intr_info & INTR_INFO_INTR_TYPE_MASK) ==
> +                  (X86_EVENTTYPE_NMI << 8) ) )
> +            /* Must be called before interrupts are enabled to ensure
> +             * the NMI handler code is run before the first IRET. The
> +             * IRET unblocks subsequent NMI's (Intel SDM Vol 3, 6.7.1)
> +             */
> +            asm volatile("int $2"); /* Real NMI, vector 2: normal 
> processing */
>          break;
>      case EXIT_REASON_MCE_DURING_VMENTRY:
>          do_machine_check(regs);
> @@ -2442,7 +2450,6 @@ void vmx_vmexit_handler(struct cpu_user_
>                   (X86_EVENTTYPE_NMI << 8) )
>                  goto exit_and_crash;
>              HVMTRACE_0D(NMI);
> -            self_nmi(); /* Real NMI, vector 2: normal processing. */
>              break;
>          case TRAP_machine_check:
>              HVMTRACE_0D(MCE);
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
  2012-11-22  8:58 ` Jan Beulich
@ 2012-11-22 10:52   ` Andrew Cooper
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Cooper @ 2012-11-22 10:52 UTC (permalink / raw)
  To: xen-devel, Jan Beulich

On 22/11/12 08:58, Jan Beulich wrote:
>>>> On 13.11.12 at 21:08, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
>> The self_nmi() code cause's an NMI to be triggered by sending an APIC message
>> to the local processor. Unfortunately there is a delay in the delivery of 
>> the
>> APIC message and we may already have re-entered the HVM guest by the time the
>> NMI is taken. This results in the VMEXIT code calling the self_nmi() 
>> function
>> again. We have seen hundreds of iterations of this VMEXIT/VMENTER loop 
>> before
>> the HVM guest resumes normal operation.
>>
>> Volume 3 Chapter 27 Section 1 of the Intel SDM states:
>>
>> An NMI causes subsequent NMIs to be blocked, but only after the VM exit
>> completes.
>>
>> So we believe it is safe to directly invoke the INT call as NMI's should
>> already be blocked.
>>
>> The INT 2 call will perform an IRET which will unblock later calls to the 
>> NMI
>> handler, according to Intel SDM Volume 3 Chapter 6.7.1. We must ensure that 
>> the
>> IRET from the INT 2 IRET is the first IRET issued to prevent losing an NMI.
>> Moving the INT 2 call to before the interrupts are enabled should ensure we
>> don't lose the NMI.
>>
>> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
>> Acked-by: Tim Deegan <tim@xen.org>
> As (I think) we agreed to not use "int $2", and as I don't recall
> having seen a v3 of this patch - is that something that can be
> expected any time soon? Ideally, I would want to incorporate
> the changes here (and hopefully also the PV issue described
> during the discussion) in the pending 4.2.1 (and, if applicable,
> 4.1.4) release(s).
>
> Jan

Malcolm is out of the office this week.  I will see about respinning a
v3 later today.

~Andrew

>
>> diff -r 62885b3c34c8 -r 7d6fd0219dd7 xen/arch/x86/hvm/vmx/vmx.c
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -2269,6 +2269,14 @@ void vmx_vmexit_handler(struct cpu_user_
>>          vector = intr_info & INTR_INFO_VECTOR_MASK;
>>          if ( vector == TRAP_machine_check )
>>              do_machine_check(regs);
>> +        else if ( vector == TRAP_nmi &&
>> +                ( (intr_info & INTR_INFO_INTR_TYPE_MASK) ==
>> +                  (X86_EVENTTYPE_NMI << 8) ) )
>> +            /* Must be called before interrupts are enabled to ensure
>> +             * the NMI handler code is run before the first IRET. The
>> +             * IRET unblocks subsequent NMI's (Intel SDM Vol 3, 6.7.1)
>> +             */
>> +            asm volatile("int $2"); /* Real NMI, vector 2: normal 
>> processing */
>>          break;
>>      case EXIT_REASON_MCE_DURING_VMENTRY:
>>          do_machine_check(regs);
>> @@ -2442,7 +2450,6 @@ void vmx_vmexit_handler(struct cpu_user_
>>                   (X86_EVENTTYPE_NMI << 8) )
>>                  goto exit_and_crash;
>>              HVMTRACE_0D(NMI);
>> -            self_nmi(); /* Real NMI, vector 2: normal processing. */
>>              break;
>>          case TRAP_machine_check:
>>              HVMTRACE_0D(MCE);
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org 
>> http://lists.xen.org/xen-devel 
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-11-22 10:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-13 20:08 [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler Malcolm Crossley
2012-11-14 10:06 ` Jan Beulich
2012-11-15 16:41   ` Tim Deegan
2012-11-15 16:52     ` Andrew Cooper
2012-11-15 17:25       ` Tim Deegan
2012-11-16  8:17         ` Jan Beulich
2012-11-16  9:59           ` Mats Petersson
2012-11-16 10:18             ` Keir Fraser
2012-11-15 17:03     ` Mats Petersson
2012-11-15 17:15       ` Tim Deegan
2012-11-15 17:33         ` Mats Petersson
2012-11-15 17:44           ` Tim Deegan
2012-11-15 18:23             ` Mats Petersson
2012-11-16  8:07     ` Jan Beulich
2012-11-16 10:56       ` Tim Deegan
2012-11-16 11:23         ` Jan Beulich
2012-11-16 11:52           ` Andrew Cooper
2012-11-16 13:53             ` Tim Deegan
2012-11-16 14:11               ` Andrew Cooper
2012-11-22  8:58 ` Jan Beulich
2012-11-22 10:52   ` Andrew Cooper

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).