Woes of NMIs and MCEs, and possibly how to fix

All of lore.kernel.org
 help / color / mirror / Atom feed

* Woes of NMIs and MCEs, and possibly how to fix
@ 2012-11-30 17:34 Andrew Cooper
  2012-11-30 17:56 ` Tim Deegan
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Andrew Cooper @ 2012-11-30 17:34 UTC (permalink / raw)
  To: xen-devel@lists.xen.org, Tim Deegan, Keir Fraser, Jan Beulich

Hello,

Yesterday, Tim and myself spent a very long time in front of a
whiteboard trying to develop a fix which covered all the problems, and
sadly it is very hard.  We managed to possibly come up with a long
solution which we think has no race conditions, but relies on very large
sections of reentrant code which cant use the stack or trash registers. 
As such, is it is not practical at all (assuming that any of us could
actually code it)

As a result, I thought instead that I would outline all the issues we
currently face.  We can then:
 * Decide which issues need fixing
 * Decide which issues need to at least be detected and crash gracefully
 * Decide which issues we are happy (or perhaps at least willing, if not
happy) to ignore

So, the issues are as follows.  (I have tried to list them in a logical
order, with 1 individual problem per number, but please do point out if
I have missed/miss-attributed entries)

1) Faults on the NMI path will re-enable NMIs before the handler
returns, leading to reentrant behaviour.  We should audit the NMI path
to try and remove any needless cases which might fault, but getting a
fault-free path will be hard (and is not going so solve the reentrant
behaviour itself).

2) Faults on the MCE path will re-enable NMIs, as will the iret of the
MCE itself if an MCE interrupts an NMI.

3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
way to predict if/when it may happen.  The best we can do is accept that
it might happen, and try to deal with the after effects.

4) "Fake NMIs" can be caused by hardware with access to the INTR pin
(very unlikely in modern systems with the LAPIC supporting virtual wire
mode), or by software executing an `int $0x2`.  This can cause the NMI
handler to run on the NMI stack, but without the normal hardware NMI
cessation logic being triggered.

5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
could only be caused by a bug in Xen, although it is also possible on a
system with out interrupt remapping. (Where the host administrator has
accepted the documented security issue, and decided still to pass-though
a device to a trusted VM, and the VM in question has a buggy driver for
the passed-through hardware)

6) Because of interrupt stack tables, real NMIs/MCEs can race with their
fake alternatives, where the real interrupt interrupts the fake one and
corrupts the exception frame of the fake one, loosing the original
context to return to.  (This is one of the two core problem of
reentrancy with NMIs and MCEs)

7) Real MCEs can race with each other.  If two real MCEs occur too close
together, the processor shuts down (We cant avoid this).  However, there
is large race condition between the MCE handler clearing the MCIP bit of
IA32_MCG_STATUS and the handler returning during which a new MCE can
occur and the exception frame will be corrupted.

In addition to the above issues, we have two NMI related bugs in Xen
which need fixing (which shall be part of the series which fixes the above)

8) VMEXIT reason NMI on Intel calls self_nmi() while NMIs are latched,
causing the PCPU to fall into loop of VMEXITs until the VCPU timeslice
has expired, at which point the return-to-guest path decides to schedule
instead of resuming the guest.

9) The NMI handler when returning to ring3 will leave NMIs latched, as
it uses the sysret path.

As for 1 possible solution which we cant use:

If it were not for the sysret stupidness[1] of requiring the hypervisor
to move to the guest stack before executing the `sysret` instruction, we
could do away with the stack tables for NMIs and MCEs alltogether, and
the above crazyness would be easy to fix.  However, the overhead of
always using iret to return to ring3 is not likely to be acceptable,
meaning that we cannot "fix" the problem by discarding interrupt stacks
and doing everything properly on the main hypervisor stack.

Looking at the above problems, I believe there is a solution if we are
willing to ignore the problem to do with SMM re-enabling NMIs, and if we
are happy to crash gracefully when mixes of NMIs and MCEs interrupt each
other and trash their exception frames (in situations were we could
technically fix up correctly), which is based on the Linux NMI solution.

As questions to the community - have I missed, or misrepresented any
points above which might perhaps influence the design of the solution? 
I think the list is complete, but would not be supprised if there is a
case still not considered yet.

~Andrew

[1] In an effort to prevent a flamewar with my comment, the situation we
find outself in now is almost certainly the result of unforseen
interactions of individual features, but we are left to pick up the many
pieces in way which cant completely be solved.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 17:34 Woes of NMIs and MCEs, and possibly how to fix Andrew Cooper
@ 2012-11-30 17:56 ` Tim Deegan
  2012-11-30 21:59   ` Andrew Cooper
  2012-12-03  8:28   ` Jan Beulich
  2012-11-30 19:12 ` Mats Petersson
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Tim Deegan @ 2012-11-30 17:56 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir Fraser, Jan Beulich, xen-devel@lists.xen.org

At 17:34 +0000 on 30 Nov (1354296851), Andrew Cooper wrote:
> Hello,
> 
> Yesterday, Tim and myself spent a very long time in front of a
> whiteboard trying to develop a fix which covered all the problems, and
> sadly it is very hard.  We managed to possibly come up with a long
> solution which we think has no race conditions, but relies on very large
> sections of reentrant code which cant use the stack or trash registers. 
> As such, is it is not practical at all (assuming that any of us could
> actually code it)

For the record, we also came up with a much simpler solution, which I
prefer:
 - The MCE handler should never return to Xen with IRET.
 - The NMI handler should always return with IRET.
 - There should be no faulting code in the NMI or MCE handlers.

That covers all the interesting cases except (3), (4) and (7) below, and
a simple per-cpu {nmi,mce}-in-progress flag will be enough to detect
(and crash) on _almost_ all cases where that bites us (the other cases
will crash less politely from their stacks being smashed).

Even if we go on to build some more bulletproof solution, I think we
should consider implementing that now, as the baseline.

Tim.

> As a result, I thought instead that I would outline all the issues we
> currently face.  We can then:
>  * Decide which issues need fixing
>  * Decide which issues need to at least be detected and crash gracefully
>  * Decide which issues we are happy (or perhaps at least willing, if not
> happy) to ignore
> 
> So, the issues are as follows.  (I have tried to list them in a logical
> order, with 1 individual problem per number, but please do point out if
> I have missed/miss-attributed entries)
> 
> 1) Faults on the NMI path will re-enable NMIs before the handler
> returns, leading to reentrant behaviour.  We should audit the NMI path
> to try and remove any needless cases which might fault, but getting a
> fault-free path will be hard (and is not going so solve the reentrant
> behaviour itself).
> 
> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> MCE itself if an MCE interrupts an NMI.
> 
> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
> way to predict if/when it may happen.  The best we can do is accept that
> it might happen, and try to deal with the after effects.
> 
> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
> (very unlikely in modern systems with the LAPIC supporting virtual wire
> mode), or by software executing an `int $0x2`.  This can cause the NMI
> handler to run on the NMI stack, but without the normal hardware NMI
> cessation logic being triggered.
> 
> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
> could only be caused by a bug in Xen, although it is also possible on a
> system with out interrupt remapping. (Where the host administrator has
> accepted the documented security issue, and decided still to pass-though
> a device to a trusted VM, and the VM in question has a buggy driver for
> the passed-through hardware)
> 
> 6) Because of interrupt stack tables, real NMIs/MCEs can race with their
> fake alternatives, where the real interrupt interrupts the fake one and
> corrupts the exception frame of the fake one, loosing the original
> context to return to.  (This is one of the two core problem of
> reentrancy with NMIs and MCEs)
> 
> 7) Real MCEs can race with each other.  If two real MCEs occur too close
> together, the processor shuts down (We cant avoid this).  However, there
> is large race condition between the MCE handler clearing the MCIP bit of
> IA32_MCG_STATUS and the handler returning during which a new MCE can
> occur and the exception frame will be corrupted.
> 
> 
> In addition to the above issues, we have two NMI related bugs in Xen
> which need fixing (which shall be part of the series which fixes the above)
> 
> 8) VMEXIT reason NMI on Intel calls self_nmi() while NMIs are latched,
> causing the PCPU to fall into loop of VMEXITs until the VCPU timeslice
> has expired, at which point the return-to-guest path decides to schedule
> instead of resuming the guest.
> 
> 9) The NMI handler when returning to ring3 will leave NMIs latched, as
> it uses the sysret path.
> 
> 
> As for 1 possible solution which we cant use:
> 
> If it were not for the sysret stupidness[1] of requiring the hypervisor
> to move to the guest stack before executing the `sysret` instruction, we
> could do away with the stack tables for NMIs and MCEs alltogether, and
> the above crazyness would be easy to fix.  However, the overhead of
> always using iret to return to ring3 is not likely to be acceptable,
> meaning that we cannot "fix" the problem by discarding interrupt stacks
> and doing everything properly on the main hypervisor stack.
> 
> 
> Looking at the above problems, I believe there is a solution if we are
> willing to ignore the problem to do with SMM re-enabling NMIs, and if we
> are happy to crash gracefully when mixes of NMIs and MCEs interrupt each
> other and trash their exception frames (in situations were we could
> technically fix up correctly), which is based on the Linux NMI solution.
> 
> As questions to the community - have I missed, or misrepresented any
> points above which might perhaps influence the design of the solution? 
> I think the list is complete, but would not be supprised if there is a
> case still not considered yet.
> 
> ~Andrew
> 
> 
> [1] In an effort to prevent a flamewar with my comment, the situation we
> find outself in now is almost certainly the result of unforseen
> interactions of individual features, but we are left to pick up the many
> pieces in way which cant completely be solved.
> 
> -- 
> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> T: +44 (0)1223 225 900, http://www.citrix.com
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 17:56 ` Tim Deegan
@ 2012-11-30 21:59   ` Andrew Cooper
  2012-12-03  8:28   ` Jan Beulich
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Cooper @ 2012-11-30 21:59 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

On 30/11/2012 17:56, Tim Deegan wrote:
> At 17:34 +0000 on 30 Nov (1354296851), Andrew Cooper wrote:
>> Hello,
>>
>> Yesterday, Tim and myself spent a very long time in front of a
>> whiteboard trying to develop a fix which covered all the problems, and
>> sadly it is very hard.  We managed to possibly come up with a long
>> solution which we think has no race conditions, but relies on very large
>> sections of reentrant code which cant use the stack or trash registers. 
>> As such, is it is not practical at all (assuming that any of us could
>> actually code it)
> For the record, we also came up with a much simpler solution, which I
> prefer:
>  - The MCE handler should never return to Xen with IRET.
>  - The NMI handler should always return with IRET.
>  - There should be no faulting code in the NMI or MCE handlers.
>
> That covers all the interesting cases except (3), (4) and (7) below, and
> a simple per-cpu {nmi,mce}-in-progress flag will be enough to detect
> (and crash) on _almost_ all cases where that bites us (the other cases
> will crash less politely from their stacks being smashed).
>
> Even if we go on to build some more bulletproof solution, I think we
> should consider implementing that now, as the baseline.
>
> Tim.

D'oh - I knew I forgot something.  Yes - the above solution does fix a
large number of the issues.

Having no faults on the NMI/MCE paths is a good thing all round, and we
should strive for it.  But that does not mean that they wont creep back
in in the future.

What I shall do (assuming no holes are shot in that idea by this thread)
is develop this solution first, along with the VMX NMI issue fix.  It is
substantially easier than the alternative.

I will then work on developing the further solution (assuming no holes
are shot in that idea).  As I alluded to in the parent post, I believe a
modification to the Linux solution should allow us to detect most of the
reentrant cases and deal with them correctly, and should allow us to
detect any smashed stacks and deal with them more politely than we
otherwise would.

~Andrew

>
>> As a result, I thought instead that I would outline all the issues we
>> currently face.  We can then:
>>  * Decide which issues need fixing
>>  * Decide which issues need to at least be detected and crash gracefully
>>  * Decide which issues we are happy (or perhaps at least willing, if not
>> happy) to ignore
>>
>> So, the issues are as follows.  (I have tried to list them in a logical
>> order, with 1 individual problem per number, but please do point out if
>> I have missed/miss-attributed entries)
>>
>> 1) Faults on the NMI path will re-enable NMIs before the handler
>> returns, leading to reentrant behaviour.  We should audit the NMI path
>> to try and remove any needless cases which might fault, but getting a
>> fault-free path will be hard (and is not going so solve the reentrant
>> behaviour itself).
>>
>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
>> MCE itself if an MCE interrupts an NMI.
>>
>> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
>> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
>> way to predict if/when it may happen.  The best we can do is accept that
>> it might happen, and try to deal with the after effects.
>>
>> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
>> (very unlikely in modern systems with the LAPIC supporting virtual wire
>> mode), or by software executing an `int $0x2`.  This can cause the NMI
>> handler to run on the NMI stack, but without the normal hardware NMI
>> cessation logic being triggered.
>>
>> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
>> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
>> could only be caused by a bug in Xen, although it is also possible on a
>> system with out interrupt remapping. (Where the host administrator has
>> accepted the documented security issue, and decided still to pass-though
>> a device to a trusted VM, and the VM in question has a buggy driver for
>> the passed-through hardware)
>>
>> 6) Because of interrupt stack tables, real NMIs/MCEs can race with their
>> fake alternatives, where the real interrupt interrupts the fake one and
>> corrupts the exception frame of the fake one, loosing the original
>> context to return to.  (This is one of the two core problem of
>> reentrancy with NMIs and MCEs)
>>
>> 7) Real MCEs can race with each other.  If two real MCEs occur too close
>> together, the processor shuts down (We cant avoid this).  However, there
>> is large race condition between the MCE handler clearing the MCIP bit of
>> IA32_MCG_STATUS and the handler returning during which a new MCE can
>> occur and the exception frame will be corrupted.
>>
>>
>> In addition to the above issues, we have two NMI related bugs in Xen
>> which need fixing (which shall be part of the series which fixes the above)
>>
>> 8) VMEXIT reason NMI on Intel calls self_nmi() while NMIs are latched,
>> causing the PCPU to fall into loop of VMEXITs until the VCPU timeslice
>> has expired, at which point the return-to-guest path decides to schedule
>> instead of resuming the guest.
>>
>> 9) The NMI handler when returning to ring3 will leave NMIs latched, as
>> it uses the sysret path.
>>
>>
>> As for 1 possible solution which we cant use:
>>
>> If it were not for the sysret stupidness[1] of requiring the hypervisor
>> to move to the guest stack before executing the `sysret` instruction, we
>> could do away with the stack tables for NMIs and MCEs alltogether, and
>> the above crazyness would be easy to fix.  However, the overhead of
>> always using iret to return to ring3 is not likely to be acceptable,
>> meaning that we cannot "fix" the problem by discarding interrupt stacks
>> and doing everything properly on the main hypervisor stack.
>>
>>
>> Looking at the above problems, I believe there is a solution if we are
>> willing to ignore the problem to do with SMM re-enabling NMIs, and if we
>> are happy to crash gracefully when mixes of NMIs and MCEs interrupt each
>> other and trash their exception frames (in situations were we could
>> technically fix up correctly), which is based on the Linux NMI solution.
>>
>> As questions to the community - have I missed, or misrepresented any
>> points above which might perhaps influence the design of the solution? 
>> I think the list is complete, but would not be supprised if there is a
>> case still not considered yet.
>>
>> ~Andrew
>>
>>
>> [1] In an effort to prevent a flamewar with my comment, the situation we
>> find outself in now is almost certainly the result of unforseen
>> interactions of individual features, but we are left to pick up the many
>> pieces in way which cant completely be solved.
>>
>> -- 
>> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
>> T: +44 (0)1223 225 900, http://www.citrix.com
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 17:56 ` Tim Deegan
  2012-11-30 21:59   ` Andrew Cooper
@ 2012-12-03  8:28   ` Jan Beulich
  1 sibling, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-12-03  8:28 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan; +Cc: Keir Fraser, xen-devel@lists.xen.org

>>> On 30.11.12 at 18:56, Tim Deegan <tim@xen.org> wrote:
> At 17:34 +0000 on 30 Nov (1354296851), Andrew Cooper wrote:
> For the record, we also came up with a much simpler solution, which I
> prefer:
>  - The MCE handler should never return to Xen with IRET.
>  - The NMI handler should always return with IRET.
>  - There should be no faulting code in the NMI or MCE handlers.
> 
> That covers all the interesting cases except (3), (4) and (7) below, and
> a simple per-cpu {nmi,mce}-in-progress flag will be enough to detect
> (and crash) on _almost_ all cases where that bites us (the other cases
> will crash less politely from their stacks being smashed).
> 
> Even if we go on to build some more bulletproof solution, I think we
> should consider implementing that now, as the baseline.

Fully agree. As said in an earlier reply to Andrew's original mail,
dealing with (3) and (4) doesn't seem necessary to me. 

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 17:34 Woes of NMIs and MCEs, and possibly how to fix Andrew Cooper
  2012-11-30 17:56 ` Tim Deegan
@ 2012-11-30 19:12 ` Mats Petersson
  2012-11-30 20:24   ` Tim Deegan
  2012-12-03  8:09   ` Jan Beulich
  2012-12-03  8:26 ` Jan Beulich
  2012-12-03 11:24 ` George Dunlap
  3 siblings, 2 replies; 14+ messages in thread
From: Mats Petersson @ 2012-11-30 19:12 UTC (permalink / raw)
  To: xen-devel

On 30/11/12 17:34, Andrew Cooper wrote:
> Hello,
>
> Yesterday, Tim and myself spent a very long time in front of a
> whiteboard trying to develop a fix which covered all the problems, and
> sadly it is very hard.  We managed to possibly come up with a long
> solution which we think has no race conditions, but relies on very large
> sections of reentrant code which cant use the stack or trash registers.
> As such, is it is not practical at all (assuming that any of us could
> actually code it)
>
>
> As a result, I thought instead that I would outline all the issues we
> currently face.  We can then:
>   * Decide which issues need fixing
>   * Decide which issues need to at least be detected and crash gracefully
>   * Decide which issues we are happy (or perhaps at least willing, if not
> happy) to ignore
>
> So, the issues are as follows.  (I have tried to list them in a logical
> order, with 1 individual problem per number, but please do point out if
> I have missed/miss-attributed entries)
>
> 1) Faults on the NMI path will re-enable NMIs before the handler
> returns, leading to reentrant behaviour.  We should audit the NMI path
> to try and remove any needless cases which might fault, but getting a
> fault-free path will be hard (and is not going so solve the reentrant
> behaviour itself).

What sort of faults are we expecting on the NMI path? Surely the trap 
handler isn't paged out? Other faults would be that the code is cause GP 
fault or illegal instructions, divide by zero or similar - these should 
all cause hypervisor panic anyways, surely? I'm sure I've missed 
something really important here, but I don't really see what faults we 
can expect to see within the NMI handler, that are "recoverable".
>
> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> MCE itself if an MCE interrupts an NMI.

The same questions apply as to #1 (just replace NMI with MCE)

>
> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
> way to predict if/when it may happen.  The best we can do is accept that
> it might happen, and try to deal with the after effects.
SMM is a messy thing that can interfere with most things in a system. We 
will have to rely on the BIOS developers to not mess up here. We can't 
do anything else in our code (on AMD hardware, in a HVM guest you could 
trap SMI as a VMEXIT, and then "deal with it in a container", but that 
doesn't fix SMI that happen whilst in the hypervisor, or in a PV kernel, 
so doesn't really help much).
>
> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
> (very unlikely in modern systems with the LAPIC supporting virtual wire
> mode), or by software executing an `int $0x2`.  This can cause the NMI
> handler to run on the NMI stack, but without the normal hardware NMI
> cessation logic being triggered.
>
> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
> could only be caused by a bug in Xen, although it is also possible on a
> system with out interrupt remapping. (Where the host administrator has
> accepted the documented security issue, and decided still to pass-though
> a device to a trusted VM, and the VM in question has a buggy driver for
> the passed-through hardware)
Surely both 4 & 5 are "bad guest behaviour", and whilst it's a "nice to 
have" to catch that, it's no different from running on bare metal doing 
daft things with vectors or writing code that doesn't behave at all 
"friendly". (4 is only available to Xen developers, which we hope are 
most of the time sane enough not to try these crazy things in a "live" 
system that matters). 5 is only available if you have pass through 
enabled. I don't think either is a particularly likely cause of real, in 
the field, problems.

That said, if it's a trivial fix on top of something that fixes the 
other problems mentioned, I'm OK with that being added.

>
> 6) Because of interrupt stack tables, real NMIs/MCEs can race with their
> fake alternatives, where the real interrupt interrupts the fake one and
> corrupts the exception frame of the fake one, loosing the original
> context to return to.  (This is one of the two core problem of
> reentrancy with NMIs and MCEs)
>
> 7) Real MCEs can race with each other.  If two real MCEs occur too close
> together, the processor shuts down (We cant avoid this).  However, there
> is large race condition between the MCE handler clearing the MCIP bit of
> IA32_MCG_STATUS and the handler returning during which a new MCE can
> occur and the exception frame will be corrupted.
 From what I understand, the purpose of this bit is really to ensure 
that any data needed from the MCE status registers has been fetched 
before the processor issues another MCE - otherwise you have a big race 
of "what data are we reading, and which of the multiple, in short 
succession, MCEs does this belong to. If you get two MCEs in such a 
short time that the MCE handler doesn't have time to gather the data 
from the status registers, it's likely that the machine isn't going to 
do very well for much longer anyways. Now, if we have a common stack, we 
should not reset the MCIP bit until it is time to return from the MCE 
handler - ideally on the last instruction before that, but that may be a 
little difficult to achieve, seeing as at that point, no registers will 
be available [as we're restoring those to return back to previous 
context], but something close to that should make for a very minimal 
(but admittedly still existing) window for a race. It is questionable if 
the MCE logic and processor trapping mechanism will react quickly enough 
to the MCIP bit being set, without getting to the iret [or whatever 
instruction is ending the handler]. If it does, then we die. It is not 
much different from the case where a MCE happens while the MCIP bit is 
set, which will cause a processor shutdown - that's a reboot for 
anything with a "PC compatible chipset", as CPU shutdown is pretty much 
a useless dead state for the processor, and the chipset therefore pulls 
the reset pin as soon as this state is detected.

>
>
> In addition to the above issues, we have two NMI related bugs in Xen
> which need fixing (which shall be part of the series which fixes the above)
>
> 8) VMEXIT reason NMI on Intel calls self_nmi() while NMIs are latched,
> causing the PCPU to fall into loop of VMEXITs until the VCPU timeslice
> has expired, at which point the return-to-guest path decides to schedule
> instead of resuming the guest.
The solution to this bug is to call the nmi handler either via the INT 2 
instruction or via a call to do_nmi() or something similar. There are 
not many other options, and code to fix this has been posted a couple of 
weeks ago. No, it's not completely "safe", but it's a whole lot better 
than the current non-working code. And that applies regardless of other 
issues with MCE and NMI handling.
>
> 9) The NMI handler when returning to ring3 will leave NMIs latched, as
> it uses the sysret path.
That should also be relatively easy to fix, either by actually using an 
IRET at the end of NMI handler (and using the "INT 2" solution above), 
or by making a fake stackframe for the "next instruction after IRET" on 
the stack, and then performing an IRET.

>
>
> As for 1 possible solution which we cant use:
>
> If it were not for the sysret stupidness[1] of requiring the hypervisor
> to move to the guest stack before executing the `sysret` instruction, we
> could do away with the stack tables for NMIs and MCEs alltogether, and
> the above crazyness would be easy to fix.  However, the overhead of
> always using iret to return to ring3 is not likely to be acceptable,
> meaning that we cannot "fix" the problem by discarding interrupt stacks
> and doing everything properly on the main hypervisor stack.
>
>
> Looking at the above problems, I believe there is a solution if we are
> willing to ignore the problem to do with SMM re-enabling NMIs, and if we
> are happy to crash gracefully when mixes of NMIs and MCEs interrupt each
> other and trash their exception frames (in situations were we could
> technically fix up correctly), which is based on the Linux NMI solution.
>
> As questions to the community - have I missed, or misrepresented any
> points above which might perhaps influence the design of the solution?
> I think the list is complete, but would not be supprised if there is a
> case still not considered yet.
>
> ~Andrew
>
>
> [1] In an effort to prevent a flamewar with my comment, the situation we
> find outself in now is almost certainly the result of unforseen
> interactions of individual features, but we are left to pick up the many
> pieces in way which cant completely be solved.
>

Happy to have my comments completely shot down into little bits, but I'm 
worrying that we're looking to solve a problem that doesn't actually 
need solving - at least as long as the code in the respective handlers 
are "doing the right thing", and if that happens to be broken, then we 
should fix THAT, not build lots of extra code to recover from such a thing.

--
Mats

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 19:12 ` Mats Petersson
@ 2012-11-30 20:24   ` Tim Deegan
  2012-11-30 22:55     ` Andrew Cooper
  2012-12-03  8:09   ` Jan Beulich
  1 sibling, 1 reply; 14+ messages in thread
From: Tim Deegan @ 2012-11-30 20:24 UTC (permalink / raw)
  To: Mats Petersson; +Cc: xen-devel

At 19:12 +0000 on 30 Nov (1354302731), Mats Petersson wrote:
> >1) Faults on the NMI path will re-enable NMIs before the handler
> >returns, leading to reentrant behaviour.  We should audit the NMI path
> >to try and remove any needless cases which might fault, but getting a
> >fault-free path will be hard (and is not going so solve the reentrant
> >behaviour itself).
> 
> What sort of faults are we expecting on the NMI path?

None, but we have to keep it that way, and if we accidentlly introduce
one there's no immediate indication that we've messed up.  Unless we add
'am I in the NMI handler?' to all the fixup code.

> >2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> >MCE itself if an MCE interrupts an NMI.
> 
> The same questions apply as to #1 (just replace NMI with MCE)

Andrew pointed out that some MCE code uses rdmsr_safe().

FWIW, I think that constraining MCE and NMI code not to do anything that
can fault is perfectly reasonable.  The MCE code has grown a lot
recently and probably needs an audit to check for spinlocks, faults &c.

> >3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
> >can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
> >way to predict if/when it may happen.  The best we can do is accept that
> >it might happen, and try to deal with the after effects.
> 
> SMM is a messy thing that can interfere with most things in a system. We 
> will have to rely on the BIOS developers to not mess up here. We can't 
> do anything else in our code (on AMD hardware, in a HVM guest you could 
> trap SMI as a VMEXIT, and then "deal with it in a container", but that 
> doesn't fix SMI that happen whilst in the hypervisor, or in a PV kernel, 
> so doesn't really help much).

You can't even do that any more -- some BIOSes turn on the SMM lock,
which disables the VMEXIT hook.

But yes, if BIOS vendors are executing IRET in SMM mode (and if 'fix
your BIOS' isn't an appropriate response), then the simple 'just don't
IRET anywhere that's rechable fom #NMI' solution can't prevent that, and
we need to handle nested NMIs.

But I'd be happy to cross that bridge when we come to it.

> Surely both 4 & 5 are "bad guest behaviour", and whilst it's a "nice to 
> have" to catch that, it's no different from running on bare metal doing 
> daft things with vectors or writing code that doesn't behave at all 
> "friendly". (4 is only available to Xen developers, which we hope are 
> most of the time sane enough not to try these crazy things in a "live" 
> system that matters). 5 is only available if you have pass through 
> enabled. I don't think either is a particularly likely cause of real, in 
> the field, problems.

AFAICS both are only available to Xen, unless there's a bug in Xen's MSI
handling, so they'd only be caused by Xen bugs or bad hardware.  And
both can be detected (in most cases) with the same simple flag that
would detect real nested NMI/MCEs.  Or more completely by a linux-style
solution.

> >7) Real MCEs can race with each other.  If two real MCEs occur too close
> >together, the processor shuts down (We cant avoid this).  However, there
> >is large race condition between the MCE handler clearing the MCIP bit of
> >IA32_MCG_STATUS and the handler returning during which a new MCE can
> >occur and the exception frame will be corrupted.
> 
> From what I understand, the purpose of this bit is really to ensure 
> that any data needed from the MCE status registers has been fetched 
> before the processor issues another MCE - otherwise you have a big race 
> of "what data are we reading, and which of the multiple, in short 
> succession, MCEs does this belong to. If you get two MCEs in such a 
> short time that the MCE handler doesn't have time to gather the data 
> from the status registers, it's likely that the machine isn't going to 
> do very well for much longer anyways. Now, if we have a common stack, we 
> should not reset the MCIP bit until it is time to return from the MCE 
> handler - ideally on the last instruction before that

We can do the equivalent with a 'mce-being-handled' flag, which allows
us to (try to) print a useful message intead of just losing a CPU. 

> >[1] In an effort to prevent a flamewar with my comment, the situation we
> >find outself in now is almost certainly the result of unforseen
> >interactions of individual features, but we are left to pick up the many
> >pieces in way which cant completely be solved.
> >
> 
> Happy to have my comments completely shot down into little bits, but I'm 
> worrying that we're looking to solve a problem that doesn't actually 
> need solving - at least as long as the code in the respective handlers 
> are "doing the right thing", and if that happens to be broken, then we 
> should fix THAT, not build lots of extra code to recover from such a thing.

I agree.  The only things we can't fix by DTRT in do_nmi() and do_mce()
are:
 - IRET in SMM mode re-enabling NMIs; and
 - detecting _every_ case where we get a nested NMI/MCE (all we can 
   do is detect _most_ cases, but the detection is just so we can print
   a message before the crash).

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 20:24   ` Tim Deegan
@ 2012-11-30 22:55     ` Andrew Cooper
  2012-12-01  9:09       ` Tim Deegan
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Cooper @ 2012-11-30 22:55 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Mats Petersson, xen-devel@lists.xen.org

On 30/11/2012 20:24, Tim Deegan wrote:
> At 19:12 +0000 on 30 Nov (1354302731), Mats Petersson wrote:
>>> 1) Faults on the NMI path will re-enable NMIs before the handler
>>> returns, leading to reentrant behaviour.  We should audit the NMI path
>>> to try and remove any needless cases which might fault, but getting a
>>> fault-free path will be hard (and is not going so solve the reentrant
>>> behaviour itself).
>> What sort of faults are we expecting on the NMI path?
> None, but we have to keep it that way, and if we accidentlly introduce
> one there's no immediate indication that we've messed up.  Unless we add
> 'am I in the NMI handler?' to all the fixup code.

Any spinlocking whatsoever is out, until we completely fix the
re-entrant entry to do_{nmi,mce}(), at which point spinlocks which are
ok so long as they are exclusively used inside their respective
handlers.  This includes printk (unless preceded by a
console_force_unlock(), which is only safe to use on paths where we have
decided to crash).  There was an audit a little while ago to move
printk()s into soft tasks.

WARN()s, (and with the same logic BUG()s and ASSERT()s, although these
are fatal) are out, both because they directly printk, and make use of
ud2 to force a trap.

Use of {rd,wr}msr() are problematic, as they might fault if we have
buggy code referring to MSRs which are not implemented, and the _safe()
variants will still fault.

>
>>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
>>> MCE itself if an MCE interrupts an NMI.
>> The same questions apply as to #1 (just replace NMI with MCE)
> Andrew pointed out that some MCE code uses rdmsr_safe().
>
> FWIW, I think that constraining MCE and NMI code not to do anything that
> can fault is perfectly reasonable.  The MCE code has grown a lot
> recently and probably needs an audit to check for spinlocks, faults &c.

Yes.  However, being able to deal gracefully with the case were we miss
things on code review which touches the NMI/MCE paths is certainly
better than crashing in subtle ways.

>
>>> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
>>> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
>>> way to predict if/when it may happen.  The best we can do is accept that
>>> it might happen, and try to deal with the after effects.
>> SMM is a messy thing that can interfere with most things in a system. We 
>> will have to rely on the BIOS developers to not mess up here. We can't 
>> do anything else in our code (on AMD hardware, in a HVM guest you could 
>> trap SMI as a VMEXIT, and then "deal with it in a container", but that 
>> doesn't fix SMI that happen whilst in the hypervisor, or in a PV kernel, 
>> so doesn't really help much).
> You can't even do that any more -- some BIOSes turn on the SMM lock,
> which disables the VMEXIT hook.
>
> But yes, if BIOS vendors are executing IRET in SMM mode (and if 'fix
> your BIOS' isn't an appropriate response), then the simple 'just don't
> IRET anywhere that's rechable fom #NMI' solution can't prevent that, and
> we need to handle nested NMIs.
>
> But I'd be happy to cross that bridge when we come to it.

I am fairly sure we the ability to discover this and crash gracefully
when we properly fix for the reentrant case.

>
>> Surely both 4 & 5 are "bad guest behaviour", and whilst it's a "nice to 
>> have" to catch that, it's no different from running on bare metal doing 
>> daft things with vectors or writing code that doesn't behave at all 
>> "friendly". (4 is only available to Xen developers, which we hope are 
>> most of the time sane enough not to try these crazy things in a "live" 
>> system that matters). 5 is only available if you have pass through 
>> enabled. I don't think either is a particularly likely cause of real, in 
>> the field, problems.
> AFAICS both are only available to Xen, unless there's a bug in Xen's MSI
> handling, so they'd only be caused by Xen bugs or bad hardware.  And
> both can be detected (in most cases) with the same simple flag that
> would detect real nested NMI/MCEs.  Or more completely by a linux-style
> solution.

Yes - the fake cases are exclusively buggy Xen or buggy hardware (with
the exception of the pci-passthough with no interrupt-remapping case, in
which one could argue that the host administrator gets to keep both
halves of his broken system after they decided to insecurely allow
passthrough in the first place)

>
>>> 7) Real MCEs can race with each other.  If two real MCEs occur too close
>>> together, the processor shuts down (We cant avoid this).  However, there
>>> is large race condition between the MCE handler clearing the MCIP bit of
>>> IA32_MCG_STATUS and the handler returning during which a new MCE can
>>> occur and the exception frame will be corrupted.
>> From what I understand, the purpose of this bit is really to ensure 
>> that any data needed from the MCE status registers has been fetched 
>> before the processor issues another MCE - otherwise you have a big race 
>> of "what data are we reading, and which of the multiple, in short 
>> succession, MCEs does this belong to. If you get two MCEs in such a 
>> short time that the MCE handler doesn't have time to gather the data 
>> from the status registers, it's likely that the machine isn't going to 
>> do very well for much longer anyways. Now, if we have a common stack, we 
>> should not reset the MCIP bit until it is time to return from the MCE 
>> handler - ideally on the last instruction before that
> We can do the equivalent with a 'mce-being-handled' flag, which allows
> us to (try to) print a useful message intead of just losing a CPU. 

At the moment, the clearing of the MCIP bit is quite early in a few of
the cpu family specific MCE handlers.  As it is an architectural MSR, I
was considering moving it outside the family specific handlers, and make
one of the last things on the MCE path, to help reduce the race
condition window until we properly fix reentrant MCEs.

>
>>> [1] In an effort to prevent a flamewar with my comment, the situation we
>>> find outself in now is almost certainly the result of unforseen
>>> interactions of individual features, but we are left to pick up the many
>>> pieces in way which cant completely be solved.
>>>
>> Happy to have my comments completely shot down into little bits, but I'm 
>> worrying that we're looking to solve a problem that doesn't actually 
>> need solving - at least as long as the code in the respective handlers 
>> are "doing the right thing", and if that happens to be broken, then we 
>> should fix THAT, not build lots of extra code to recover from such a thing.
> I agree.  The only things we can't fix by DTRT in do_nmi() and do_mce()
> are:
>  - IRET in SMM mode re-enabling NMIs; and
>  - detecting _every_ case where we get a nested NMI/MCE (all we can 
>    do is detect _most_ cases, but the detection is just so we can print
>    a message before the crash).
>
> Cheers,
>
> Tim.

We would need to modify the asm stubs to detect nesting.  I think it is
unreasonable to expect the C functions do_{nmi,mce}() to be reentrantly
safe.  With the proper solution, we can get the asm stubs to fixup
nested NMIs/MCEs without reentrantly calling into C (or crashing if we
detect the stack has been smashed)

~Andrew

>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 22:55     ` Andrew Cooper
@ 2012-12-01  9:09       ` Tim Deegan
  0 siblings, 0 replies; 14+ messages in thread
From: Tim Deegan @ 2012-12-01  9:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Mats Petersson, xen-devel@lists.xen.org

At 22:55 +0000 on 30 Nov (1354316132), Andrew Cooper wrote:
> Any spinlocking whatsoever is out, until we completely fix the
> re-entrant entry to do_{nmi,mce}(), at which point spinlocks which are
> ok so long as they are exclusively used inside their respective
> handlers.

AFAICT spinlocks are bad news even if they're only in their respective
handlers, as one CPU can get (NMI (MCE)) while the other gets (MCE (NMI)).

> >>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> >>> MCE itself if an MCE interrupts an NMI.
> >> The same questions apply as to #1 (just replace NMI with MCE)
> > Andrew pointed out that some MCE code uses rdmsr_safe().
> >
> > FWIW, I think that constraining MCE and NMI code not to do anything that
> > can fault is perfectly reasonable.  The MCE code has grown a lot
> > recently and probably needs an audit to check for spinlocks, faults &c.
> 
> Yes.  However, being able to deal gracefully with the case were we miss
> things on code review which touches the NMI/MCE paths is certainly
> better than crashing in subtle ways.

Agreed.

> At the moment, the clearing of the MCIP bit is quite early in a few of
> the cpu family specific MCE handlers.  As it is an architectural MSR, I
> was considering moving it outside the family specific handlers, and make
> one of the last things on the MCE path, to help reduce the race
> condition window until we properly fix reentrant MCEs.

Why not have a per-cpu mce-in-progress flag, and clear MCIP early?  That
way you get a panic instead of silently losing a CPU.

> >>> [1] In an effort to prevent a flamewar with my comment, the situation we
> >>> find outself in now is almost certainly the result of unforseen
> >>> interactions of individual features, but we are left to pick up the many
> >>> pieces in way which cant completely be solved.
> >>>
> >> Happy to have my comments completely shot down into little bits, but I'm 
> >> worrying that we're looking to solve a problem that doesn't actually 
> >> need solving - at least as long as the code in the respective handlers 
> >> are "doing the right thing", and if that happens to be broken, then we 
> >> should fix THAT, not build lots of extra code to recover from such a thing.
> > I agree.  The only things we can't fix by DTRT in do_nmi() and do_mce()
> > are:
> >  - IRET in SMM mode re-enabling NMIs; and
> >  - detecting _every_ case where we get a nested NMI/MCE (all we can 
> >    do is detect _most_ cases, but the detection is just so we can print
> >    a message before the crash).
> 
> We would need to modify the asm stubs to detect nesting.

We can detect _almost all_ nesting from C with an in-progress flag.  We
can probably expand that to cover all nesting by pushing the
flag-setting/flag-clearing out to the asm but that'd still be only a
couple of lines of change - a lot simpler than what we'd need to allow
nested MCE/NMIs to continue without crashing.

>  I think it is
> unreasonable to expect the C functions do_{nmi,mce}() to be reentrantly
> safe.

Good. :)  I'm not suggesting that.

Tim.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 19:12 ` Mats Petersson
  2012-11-30 20:24   ` Tim Deegan
@ 2012-12-03  8:09   ` Jan Beulich
  1 sibling, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-12-03  8:09 UTC (permalink / raw)
  To: Mats Petersson; +Cc: xen-devel

>>> On 30.11.12 at 20:12, Mats Petersson <mats.petersson@citrix.com> wrote:
> On 30/11/12 17:34, Andrew Cooper wrote:
>> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
>> (very unlikely in modern systems with the LAPIC supporting virtual wire
>> mode), or by software executing an `int $0x2`.  This can cause the NMI
>> handler to run on the NMI stack, but without the normal hardware NMI
>> cessation logic being triggered.
>>
>> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
>> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
>> could only be caused by a bug in Xen, although it is also possible on a
>> system with out interrupt remapping. (Where the host administrator has
>> accepted the documented security issue, and decided still to pass-though
>> a device to a trusted VM, and the VM in question has a buggy driver for
>> the passed-through hardware)
> Surely both 4 & 5 are "bad guest behaviour", and whilst it's a "nice to 
> have" to catch that, it's no different from running on bare metal doing 
> daft things with vectors or writing code that doesn't behave at all 
> "friendly". (4 is only available to Xen developers, which we hope are 
> most of the time sane enough not to try these crazy things in a "live" 
> system that matters). 5 is only available if you have pass through 
> enabled. I don't think either is a particularly likely cause of real, in 
> the field, problems.

If these were guest exposed, we'd have a much more severe
problem. But as Tim said, they aren't, hence we "only" need to
exclude Xen bugs here (and decide on how far we may want to go
with working around eventual hardware bugs).

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 17:34 Woes of NMIs and MCEs, and possibly how to fix Andrew Cooper
  2012-11-30 17:56 ` Tim Deegan
  2012-11-30 19:12 ` Mats Petersson
@ 2012-12-03  8:26 ` Jan Beulich
  2012-12-03 11:24 ` George Dunlap
  3 siblings, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-12-03  8:26 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Tim Deegan, Keir Fraser, xen-devel@lists.xen.org

>>> On 30.11.12 at 18:34, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 1) Faults on the NMI path will re-enable NMIs before the handler
> returns, leading to reentrant behaviour.  We should audit the NMI path
> to try and remove any needless cases which might fault, but getting a
> fault-free path will be hard (and is not going so solve the reentrant
> behaviour itself).
> 
> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> MCE itself if an MCE interrupts an NMI.

As apparently converged to later in the thread - we just need to
exclude the potential for faults inside the NMI and MCE paths. The
only reason I could this needing to change would be if we intended
to add an extensive NMI producer like native Linux'es perf
subsystem.

> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
> way to predict if/when it may happen.  The best we can do is accept that
> it might happen, and try to deal with the after effects.

I don't see us needing to deal with that in any way. SMM using IRET
carelessly is just plain wrong. Iirc SMM (just like VMEXIT) has a save/
restore field for the NMI mask, so if they make proper use of this,
there should be no problem.

> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
> (very unlikely in modern systems with the LAPIC supporting virtual wire
> mode), or by software executing an `int $0x2`.  This can cause the NMI
> handler to run on the NMI stack, but without the normal hardware NMI
> cessation logic being triggered.
> 
> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
> could only be caused by a bug in Xen, although it is also possible on a
> system with out interrupt remapping. (Where the host administrator has
> accepted the documented security issue, and decided still to pass-though
> a device to a trusted VM, and the VM in question has a buggy driver for
> the passed-through hardware)

Fake exceptions, as was also already said by others, are a Xen or
hardware bug and hence shouldn't need extra precautions either.

> 9) The NMI handler when returning to ring3 will leave NMIs latched, as
> it uses the sysret path.

This is a little imprecise: The problem is only when entering the
scheduler on the way out of an NMI, and resuming an unaware
PV vCPU on the given pCPU. Apart from forcing an IRET in that
case early (we can't be on the special NMI stack in that case, as
the NMI entry path switches to the normal stack when entered
from PV guest context, entry from VMX context happens on the
normal stack anyway, and entry from hypervisor context [which
includes the SVM case] doesn't end up handling softirqs on the
exit path), another option would be to clear the TRAP_syscall
flag when resuming a PV vCPU in the scheduler.

But the early IRET solution has other benefits (keeping the NMI
disabled window short), so would be preferable imo.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-11-30 17:34 Woes of NMIs and MCEs, and possibly how to fix Andrew Cooper
                   ` (2 preceding siblings ...)
  2012-12-03  8:26 ` Jan Beulich
@ 2012-12-03 11:24 ` George Dunlap
  2012-12-03 11:45   ` Mats Petersson
  2012-12-03 11:50   ` Jan Beulich
  3 siblings, 2 replies; 14+ messages in thread
From: George Dunlap @ 2012-12-03 11:24 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Tim Deegan, Jan Beulich, xen-devel@lists.xen.org

[-- Attachment #1.1: Type: text/plain, Size: 2078 bytes --]

On Fri, Nov 30, 2012 at 5:34 PM, Andrew Cooper <andrew.cooper3@citrix.com>wrote:

> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
> way to predict if/when it may happen.  The best we can do is accept that
> it might happen, and try to deal with the after effects.
>

Did you actually mean IRET, or did you mean RSM?  Does it make a difference?

> As for 1 possible solution which we cant use:
>
> If it were not for the sysret stupidness[1] of requiring the hypervisor
> to move to the guest stack before executing the `sysret` instruction, we
> could do away with the stack tables for NMIs and MCEs alltogether, and
> the above crazyness would be easy to fix.  However, the overhead of
> always using iret to return to ring3 is not likely to be acceptable,
> meaning that we cannot "fix" the problem by discarding interrupt stacks
> and doing everything properly on the main hypervisor stack.
>

64-bit Intel processors have SYSEXIT, right?  It's worth pointing out the
following alternatives, even if we never actually use them:

1. Use SYSEXIT on Intel processors and let the bugs (or some subset of
them) remain on AMD
2. Use SYSEXIT on Intel processors and IRET on AMD

Given that AMD has cut back their investment in OSS development, and is
talking about moving to ARM, it may only be a matter of time before Intel
is the only important player in the x86 world.

> [1] In an effort to prevent a flamewar with my comment, the situation we
> find outself in now is almost certainly the result of unforseen
> interactions of individual features, but we are left to pick up the many
> pieces in way which cant completely be solved.
>

The very first time I heard that SYSRET didn't restore the stack pointer, I
thought it was an obviously stupid idea that would cause all kinds of crazy
bugs. When you're designing operating systems, a little paranoia is a good
thing, and I can't help but think that the architects that let this go
through made a big mistake here.

 -George

[-- Attachment #1.2: Type: text/html, Size: 2736 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-12-03 11:24 ` George Dunlap
@ 2012-12-03 11:45   ` Mats Petersson
  2012-12-03 12:31     ` Jan Beulich
  2012-12-03 11:50   ` Jan Beulich
  1 sibling, 1 reply; 14+ messages in thread
From: Mats Petersson @ 2012-12-03 11:45 UTC (permalink / raw)
  To: xen-devel

On 03/12/12 11:24, George Dunlap wrote:
> On Fri, Nov 30, 2012 at 5:34 PM, Andrew Cooper 
> <andrew.cooper3@citrix.com <mailto:andrew.cooper3@citrix.com>> wrote:
>
>     3) SMM mode executing an iret will re-enable NMIs.  There is
>     nothing we
>     can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
>     way to predict if/when it may happen.  The best we can do is
>     accept that
>     it might happen, and try to deal with the after effects.
>
>
> Did you actually mean IRET, or did you mean RSM?  Does it make a 
> difference?

If, for some obscure reason, the SMM code decides, for example, to run 
code like "int 0x21", where the int 0x21 handler ends with the rather 
predictable IRET to return to the caller, then you would indeed "unlock" 
the NMI blocking that happens from the NMI being taken by the processor. 
NMI will still not interrupt the SMM code, but it WILL interrupt the 
code that was running before SMI was taken - which could be an NMI 
handler, that doesn't expect another NMI.

RSM doesn't, in and of itself [unless "messing" with the saved state] 
alter the NMI state in other ways than "restore to previous value".
>
>     As for 1 possible solution which we cant use:
>
>     If it were not for the sysret stupidness[1] of requiring the
>     hypervisor
>     to move to the guest stack before executing the `sysret`
>     instruction, we
>     could do away with the stack tables for NMIs and MCEs alltogether, and
>     the above crazyness would be easy to fix.  However, the overhead of
>     always using iret to return to ring3 is not likely to be acceptable,
>     meaning that we cannot "fix" the problem by discarding interrupt
>     stacks
>     and doing everything properly on the main hypervisor stack.
>
>
> 64-bit Intel processors have SYSEXIT, right?  It's worth pointing out 
> the following alternatives, even if we never actually use them:
>
> 1. Use SYSEXIT on Intel processors and let the bugs (or some subset of 
> them) remain on AMD
> 2. Use SYSEXIT on Intel processors and IRET on AMD
> Given that AMD has cut back their investment in OSS development, and 
> is talking about moving to ARM, it may only be a matter of time before 
> Intel is the only important player in the x86 world.
Surely we would still want to support existing machines made with AMD 
processors. And as far as possible, we should keep the code architecture 
independent. We do not want a bunch of "IF processor=INTEL" in the 
assembler code [and even less, "#if BUILD_FOR_INTEL" and separate 
binaries, I would expect].

SYSCALL and SYSRET is the corresponding pair of instructions as SYSENTER 
and SYSEXIT but for 64-bit OS's (don't ask me why they decided to add a 
new pair of instructions, rather than just alter the behaviour of 
SYSENTER/SYSEXIT. I'm sure there were some reason for this, but it's 
beyond my understanding.

--
Mats

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-12-03 11:45   ` Mats Petersson
@ 2012-12-03 12:31     ` Jan Beulich
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-12-03 12:31 UTC (permalink / raw)
  To: Mats Petersson; +Cc: xen-devel

>>> On 03.12.12 at 12:45, Mats Petersson <mats.petersson@citrix.com> wrote:
> On 03/12/12 11:24, George Dunlap wrote:
>> On Fri, Nov 30, 2012 at 5:34 PM, Andrew Cooper 
>> <andrew.cooper3@citrix.com <mailto:andrew.cooper3@citrix.com>> wrote:
>>
>>     3) SMM mode executing an iret will re-enable NMIs.  There is
>>     nothing we
>>     can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
>>     way to predict if/when it may happen.  The best we can do is
>>     accept that
>>     it might happen, and try to deal with the after effects.
>>
>>
>> Did you actually mean IRET, or did you mean RSM?  Does it make a 
>> difference?
> 
> If, for some obscure reason, the SMM code decides, for example, to run 
> code like "int 0x21", where the int 0x21 handler ends with the rather 
> predictable IRET to return to the caller, then you would indeed "unlock" 
> the NMI blocking that happens from the NMI being taken by the processor. 
> NMI will still not interrupt the SMM code, but it WILL interrupt the 
> code that was running before SMI was taken - which could be an NMI 
> handler, that doesn't expect another NMI.
> 
> RSM doesn't, in and of itself [unless "messing" with the saved state] 
> alter the NMI state in other ways than "restore to previous value".

And isn't it this "restore to previous value" that makes this a
non-issue - after RSM, NMI-s would again be masked.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Woes of NMIs and MCEs, and possibly how to fix
  2012-12-03 11:24 ` George Dunlap
  2012-12-03 11:45   ` Mats Petersson
@ 2012-12-03 11:50   ` Jan Beulich
  1 sibling, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-12-03 11:50 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap
  Cc: Tim Deegan, Keir Fraser, xen-devel@lists.xen.org

>>> On 03.12.12 at 12:24, George Dunlap <dunlapg@umich.edu> wrote:
> On Fri, Nov 30, 2012 at 5:34 PM, Andrew Cooper 
>> As for 1 possible solution which we cant use:
>>
>> If it were not for the sysret stupidness[1] of requiring the hypervisor
>> to move to the guest stack before executing the `sysret` instruction, we
>> could do away with the stack tables for NMIs and MCEs alltogether, and
>> the above crazyness would be easy to fix.  However, the overhead of
>> always using iret to return to ring3 is not likely to be acceptable,
>> meaning that we cannot "fix" the problem by discarding interrupt stacks
>> and doing everything properly on the main hypervisor stack.
>>
> 
> 64-bit Intel processors have SYSEXIT, right?  It's worth pointing out the
> following alternatives, even if we never actually use them:
> 
> 1. Use SYSEXIT on Intel processors and let the bugs (or some subset of
> them) remain on AMD
> 2. Use SYSEXIT on Intel processors and IRET on AMD

SYSEXIT isn't very suitable because you'd have to corrupt %edx,
i.e. it couldn't be used for hypercalls with just 1 or 2 arguments.

Plus our GDT layout doesn't match that needed by SYSEXIT, yet
some of the selector values are part of the ABI.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-12-03 12:31 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-30 17:34 Woes of NMIs and MCEs, and possibly how to fix Andrew Cooper
2012-11-30 17:56 ` Tim Deegan
2012-11-30 21:59   ` Andrew Cooper
2012-12-03  8:28   ` Jan Beulich
2012-11-30 19:12 ` Mats Petersson
2012-11-30 20:24   ` Tim Deegan
2012-11-30 22:55     ` Andrew Cooper
2012-12-01  9:09       ` Tim Deegan
2012-12-03  8:09   ` Jan Beulich
2012-12-03  8:26 ` Jan Beulich
2012-12-03 11:24 ` George Dunlap
2012-12-03 11:45   ` Mats Petersson
2012-12-03 12:31     ` Jan Beulich
2012-12-03 11:50   ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.