All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Tim Deegan <tim@xen.org>
Cc: Ian Campbell <Ian.Campbell@citrix.com>,
	"eddie.dong@intel.com" <eddie.dong@intel.com>,
	xen-devel <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>,
	"jun.nakajima@intel.com" <jun.nakajima@intel.com>,
	Malcolm Crossley <malcolm.crossley@citrix.com>
Subject: Re: [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
Date: Thu, 15 Nov 2012 16:52:20 +0000	[thread overview]
Message-ID: <50A51DC4.7040205@citrix.com> (raw)
In-Reply-To: <20121115164156.GE75988@ocelot.phlegethon.org>


On 15/11/12 16:41, Tim Deegan wrote:
> Hi, 
>
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>>> +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
>
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>   patch, but maybe we need to solve the general problem too. ]
>
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).
>
> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).
>
> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>
> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse.  In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.

I have made a fairly simple patch which deliberately invokes a
re-entrant NMI.  The result is that a PCPU spins around the NMI handler
until the watchdog takes the host down.  It is also possible to get a
reentrant NMI if there is a pagefault (or handful of other possible
faults) when trying to execute the iret of the NMI itself; NMIs can get
re-enabled from the iret of the pagefault, and we take a new NMI before
attempting to retry the iret from the original NMI.

>
> ---
>
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler.  In the PV case, it will also clobber the NMI
> handler's stack.  In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.

There is the MCIP bit in an MCE status MSR which acts as a latch for
MCEs.  If a new MCE is generated while this bit is set, then a triple
fault occurs.  An MCE handler is required to set this bit to 0 to
indicate that it has dealt with the MCE.  However, there is a race
condition window between setting this bit to 0 and leaving the MCE stack
during which another MCE can arrive and corrupt the stack.

>
> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting.  There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.
>
> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case.  It's a bit of extra code but only in the MCE handler, which is
> not performance-critical. 
>
> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste.  do_nmi() saves an IDT indirection but
> unbalances the call/return stack.  I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.
>
> But first, we should take the current fix, with do_nmi() and iret() 
> instead of 'int $2'.  The nested-MCE issue can be handled separately.
>
> Does that make sense?

I have been looking at appling a similar fix to Linuses fix
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0)
to Xen, for both the NMI and MCE stacks.

Work is currently in the preliminary stages at the moment.

~Andrew

>
> Tim.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

  reply	other threads:[~2012-11-15 16:52 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-13 20:08 [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler Malcolm Crossley
2012-11-14 10:06 ` Jan Beulich
2012-11-15 16:41   ` Tim Deegan
2012-11-15 16:52     ` Andrew Cooper [this message]
2012-11-15 17:25       ` Tim Deegan
2012-11-16  8:17         ` Jan Beulich
2012-11-16  9:59           ` Mats Petersson
2012-11-16 10:18             ` Keir Fraser
2012-11-15 17:03     ` Mats Petersson
2012-11-15 17:15       ` Tim Deegan
2012-11-15 17:33         ` Mats Petersson
2012-11-15 17:44           ` Tim Deegan
2012-11-15 18:23             ` Mats Petersson
2012-11-16  8:07     ` Jan Beulich
2012-11-16 10:56       ` Tim Deegan
2012-11-16 11:23         ` Jan Beulich
2012-11-16 11:52           ` Andrew Cooper
2012-11-16 13:53             ` Tim Deegan
2012-11-16 14:11               ` Andrew Cooper
2012-11-22  8:58 ` Jan Beulich
2012-11-22 10:52   ` Andrew Cooper

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50A51DC4.7040205@citrix.com \
    --to=andrew.cooper3@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=JBeulich@suse.com \
    --cc=eddie.dong@intel.com \
    --cc=jun.nakajima@intel.com \
    --cc=malcolm.crossley@citrix.com \
    --cc=tim@xen.org \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.