Re: RFC: MCA/MCE concept


From: Gavin Maltby <Gavin.Maltby@Sun.COM>
To: xen-devel@lists.xensource.com
Subject: Re: RFC: MCA/MCE concept
Date: Wed, 30 May 2007 14:50:37 +0100
Message-ID: <465D812D.9040907@sun.com>
In-Reply-To: <200705301110.50172.Christoph.Egger@amd.com>

Hi,

On 05/30/07 10:10, Christoph Egger wrote:

[cut]

>>>>> 2b) error == UE and UE impacts Xen or Dom0:
>>>> A very important aspect here is how you want to classify what impact an
>>>> uncorrectable has - generally, I can see very few situations where you
>>>> could confine the impact to a sub-portion of the system (i.e. a single
>>>> domU, dom0, or Xen). The general rule in my opinion must be to halt the
>>>> system, the question just is how likely it is that you can get a
>>>> meaningful message out (to screen, serial, or logs) that can help
>>>> analyze the problem afterwards. If it is somewhat likely, then dom0
>>>> should be involved, otherwise Xen should just shut down the system.
>>> Here you can best help out using HW features to handle errors.
>>> AMD CPUs features online-spare RAM and Chipkill since K8 RevF.
>>>
>>> CPUs such as the Sparc features Data Poisoning. That would be the
>>> most handy technique that can be used here.
>> But that assumes the error is recoverable (i.e. no other data got
>> corrupted). You still didn't clarify how you intend to determine the
>> impact an uncorrectable error had.
> 
> I know. I am lacking a sudden inspiration here.
> That's why I discuss this here before writing code that goes to nowhere.
> Anyone here with a flash of genius? :-)

For a first phase I'd suggest that treating an uncorrectable error as
terminal to the entire system (e.g., panic the hypervisor or set up a
hardware reset mechanism such as Sync Flood) is practical and safe, and
allows us to concentrate on getting some more basic elements in place.
As Christoph says, we need some form of data poisoning supported on the
platform to be able to isolate the impact of an uncorrectable error
reliably.  In the absence of such support I think some fancy heuristics
could work in some limited cases (e.g., an uncorrectable memory error on
a page that only a domU has a mapping to and which is not shared with any
other domain, not even via a front/backend driver), but the penalty for
bugs in those heuristics is silent data corruption, which is the ultimate
crime.
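
To make the phase-one policy and the later heuristic concrete, here is a
rough sketch in C.  Every name in it (mc_event, mc_decide, the
allow_heuristics switch) is made up for illustration and is not existing
Xen code; it only assumes that the hypervisor can tell how many domains
map the affected page and whether the page is shared.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; none of these are Xen APIs. */
enum mc_severity { MC_CORRECTED, MC_UNCORRECTED };
enum mc_action   { MC_LOG_ONLY, MC_KILL_DOMU, MC_HALT_SYSTEM };

struct mc_event {
    enum mc_severity severity;
    bool addr_valid;      /* the error address was captured */
    int  nr_mappers;      /* number of domains mapping the affected page */
    bool mapper_is_domu;  /* the single mapper is an unprivileged guest */
    bool page_shared;     /* shared with another domain, e.g. via a
                             front/backend driver */
};

enum mc_action mc_decide(const struct mc_event *ev, bool allow_heuristics)
{
    if (ev->severity == MC_CORRECTED)
        return MC_LOG_ONLY;

    /* Phase one: every uncorrectable error is terminal to the system. */
    if (!allow_heuristics)
        return MC_HALT_SYSTEM;

    /*
     * Possible later phase: confine the error only when the page is
     * provably private to exactly one domU.  Any doubt means halt,
     * because a wrong guess here is silent data corruption.
     */
    if (ev->addr_valid && ev->nr_mappers == 1 &&
        ev->mapper_is_domu && !ev->page_shared)
        return MC_KILL_DOMU;

    return MC_HALT_SYSTEM;
}
```

The default path is deliberately the halt: the heuristic has to prove
that confinement is safe, not the other way around.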

> 
>>>>> 3a) DomU is a PV guest:
>>>>>       if DomU installed MCA event handler, it gets notified to perform
>>>>>          self-healing
>>>>>       if DomU did not install MCA event handler, notify Dom0 to do
>>>>>          some operations on DomU (case II)
>>>>>       if neither DomU nor Dom0 did not install MCA event handlers,
>>>>>          then Xen kills DomU
>>>>> 3b) DomU is a HVM guest:
>>>>>       if DomU features a PV driver then behave as in 3a)
>>>> What significance do pv drivers have here? Or do you mean a pv MCA
>>>> driver?
[cut]

My feeling is that the hypervisor and dom0 own the hardware and as such
all hardware fault management should reside there.  So we should never
deliver any form of #MC to a domU, nor should a poll of MCA state from
a domU ever observe valid state (e.g., make the RDMSR return 0).
So all handling, logging and diagnosis, as well as hardware response
actions (such as deploying an online spare chip-select), are controlled
in the hypervisor/dom0 combination.  That seems a consistent model: e.g.,
if a domU is migrated to another system it should not carry the
diagnosis state of the original system across, since that state belongs
with the one domain that cannot migrate.
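
As an illustration of "make the RDMSR return 0", something along these
lines could sit in the path that handles MSR reads from a domU.  The MSR
numbers are the architectural x86 MCA ones; the function names and the
way the bank count is passed in are assumptions of mine, not an existing
Xen interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 machine-check MSR numbers. */
#define MSR_IA32_MCG_CAP     0x179u
#define MSR_IA32_MCG_STATUS  0x17au
#define MSR_IA32_MCG_CTL     0x17bu
#define MSR_IA32_MC0_CTL     0x400u  /* per bank: CTL, STATUS, ADDR, MISC */

/* True if msr is a global MCA register or belongs to one of nr_banks banks. */
static bool is_mca_msr(uint32_t msr, unsigned int nr_banks)
{
    if (msr >= MSR_IA32_MCG_CAP && msr <= MSR_IA32_MCG_CTL)
        return true;
    return msr >= MSR_IA32_MC0_CTL &&
           msr < MSR_IA32_MC0_CTL + 4u * nr_banks;
}

/*
 * Hypothetical hook for a domU RDMSR.  Returns true if the read was
 * handled here; in that case *val is 0, so the guest never observes
 * valid MCA state.  Returns false for all other MSRs and leaves *val
 * untouched so the normal path can deal with them.
 */
bool domu_rdmsr_filter(uint32_t msr, unsigned int nr_banks, uint64_t *val)
{
    if (!is_mca_msr(msr, nr_banks))
        return false;
    *val = 0;
    return true;
}
```

A side effect worth noting: a zero read of MCG_CAP also reports zero
banks, so a well-behaved guest would not go looking at the bank
registers at all.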

But that is not to say that a domU should never participate in a
higher-level fault management function (I think at a future phase), at
the direction of the hypervisor/dom0 combo.  For example if/when we can
isolate an uncorrectable error to a single domU we could forward such an
event to the affected domU if it has registered its ability/interest in
such events.  These won't be in the form of a faked #MC or anything;
instead they'd be some form of synchronous trap experienced when the
affected domU context next resumes on CPU.  The intelligent domU handler
can then decide whether the domU must panic, whether it could simply
kill the affected process, etc.  Those details are clearly sketchy, but
the idea is to up-level the communication to a domU to be more like
"you're broken" rather than "here's a machine-level hardware error for
you to interpret and decide what to do with".
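
Sketchy as the details are, the shape of it might be something like the
following.  All of the names are hypothetical, and killing a domU that
has not registered is an assumption borrowed from the 3a) proposal
quoted above rather than something I have settled on.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical high-level event: "you're broken", not raw MCA bank data. */
enum fault_kind { FAULT_NONE, FAULT_MEMORY_LOST };

struct fault_event {
    enum fault_kind kind;
    uint64_t guest_pfn;   /* guest-visible frame number that was affected */
};

struct domu {
    bool registered;      /* guest declared its ability/interest */
    bool pending;         /* an event is waiting for the next resume */
    bool killed;
    struct fault_event event;
};

/*
 * Hypervisor/dom0 side: called once diagnosis has isolated an
 * uncorrectable error to domain d.
 */
void fault_post(struct domu *d, enum fault_kind kind, uint64_t guest_pfn)
{
    if (!d->registered) {
        d->killed = true;             /* assumed fallback, see above */
        return;
    }
    d->event.kind = kind;
    d->event.guest_pfn = guest_pfn;
    d->pending = true;
}

/*
 * Called when d is next scheduled.  Returns the event to deliver as a
 * synchronous trap, or NULL if there is nothing to deliver.
 */
const struct fault_event *fault_on_resume(struct domu *d)
{
    if (!d->pending)
        return NULL;
    d->pending = false;
    return &d->event;
}
```

The guest handler that receives the event then makes the panic versus
kill-the-process decision itself; the hypervisor never hands it MCA
register contents.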

Gavin

