Re: RFC: MCA/MCE concept

From: Gavin Maltby <Gavin.Maltby@Sun.COM>
To: Christoph Egger <Christoph.Egger@amd.com>
Cc: xen-devel@lists.xensource.com, Keir Fraser <keir@xensource.com>
Subject: Re: RFC: MCA/MCE concept
Date: Wed, 06 Jun 2007 13:25:26 +0100
Message-ID: <4666A7B6.1020702@sun.com>
In-Reply-To: <200706061357.26924.Christoph.Egger@amd.com>

Hi,

On 06/06/07 12:57, Christoph Egger wrote:

>>>> For the first I've assumed so far that an event channel notification
>>>> of the MCA event will suffice;  as long as the hypervisor only polls
>>>> for correctable MCA errors at a low-frequency rate (currently 15s
>>>> interval) there is no danger of spamming that single notification.
>>> Why polling?
>> Polling for correctable errors, but #MC as usual for others.  Setting
>> MCi_CTL bits for correctable errors does not produce a machine check,
>> so polling is the only approach unless one sets additional (and
>> undocumented, certainly for AMD chips) config bits.  What I was getting
>> at here is that polling at largish intervals for correctables is
>> the correct approach - trapping for them or polling at a high-frequency
>> is bad because in cases where you have some form of solid correctable
>> error (say a single bad pin in a dimm socket affecting one or two ranks
>> of that dimm but never able to produce a UE) the trap handling and
>> diagnosis software consume the machine and things make little useful
>> forward progress.
> 
> I still don't see, why #MC for all kind of errors is bad.

I'm talking about whether the hypervisor takes a machine check
for an error or polls for it.  We do not want #MC for correctable
errors stopping the hypervisor from making progress.  And if the
hypervisor poll interval were too small, a solid error would again
keep the hypervisor busy producing (mostly or entirely duplicate)
error telemetry, and the diagnosis code in dom0 would burn
cpu cycles, too.

How errors observed by the hypervisor, be they from #MC or from
a poll, are propagated to the domains is unimportant from this
point of view - e.g., if we decide to take error telemetry
discovered via a poll in the hypervisor and propagate it
to the domain so that it is indistinguishable from a machine
check, that will not hurt or limit the domain's processing.

An untested design I had in mind, unashamedly influenced by what
we do in Solaris, was to have some common memory shared between
hypervisor and domain into which the hypervisor produces
error telemetry and the domain consumes that telemetry.
Producing and consuming is lockless using compare-and-swap.
There are two queues in this shared memory - one for uncorrectable
error telemetry and one for correctable error telemetry.  When the
domain gets whatever event notifies it that telemetry is waiting,
it processes the queues;  the event would be synchronous for
uncorrectable errors (i.e., the domain must process the telemetry
right now) or asynchronous in the case of correctable errors
(process when convenient).  The separation of CE and UE queues
stops CEs from flooding the more important UE events (you can
always drop CEs if there is no more space, but you can never
drop UEs).
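
To make the queue idea concrete, here is a minimal and equally untested
sketch in C11.  Everything named in it (struct mc_rec and its fields,
QLEN, the mc_* functions, the return codes) is hypothetical and exists
only for illustration; it is not Xen or Solaris code.  It assumes one
consumer per queue, so only the producer side uses compare-and-swap
(to reserve a slot); a multi-consumer variant would need CAS on the
head index as well.  What to do when the UE queue overflows is left
to the caller, since a UE must not be silently dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QLEN 8u /* slots per queue; power of two so indices survive wraparound */

struct mc_rec {                 /* hypothetical telemetry record */
    uint32_t bank;
    uint64_t status, addr;
};

struct mc_slot {
    atomic_uint ready;          /* 1 once rec is completely written */
    struct mc_rec rec;
};

struct mc_queue {
    atomic_uint head;           /* next slot the domain consumes */
    atomic_uint tail;           /* next slot a producer reserves */
    atomic_uint dropped;        /* records discarded because the queue was full */
    struct mc_slot slot[QLEN];
};

struct mc_shared { struct mc_queue ue, ce; };  /* would live in the shared page(s) */

enum mc_post { MC_POSTED, MC_CE_DROPPED, MC_UE_OVERFLOW };

static void mc_queue_init(struct mc_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
    atomic_init(&q->dropped, 0);
    for (unsigned i = 0; i < QLEN; i++)
        atomic_init(&q->slot[i].ready, 0);
}

/* Producer side: reserve a slot with compare-and-swap, then publish it. */
static bool mc_produce(struct mc_queue *q, const struct mc_rec *rec)
{
    unsigned t;
    for (;;) {
        unsigned h = atomic_load(&q->head);  /* read head first so t >= h */
        t = atomic_load(&q->tail);
        if (t - h >= QLEN)
            return false;                    /* no free slot */
        if (atomic_compare_exchange_weak(&q->tail, &t, t + 1))
            break;                           /* slot t is ours */
    }
    struct mc_slot *s = &q->slot[t % QLEN];
    s->rec = *rec;
    atomic_store_explicit(&s->ready, 1, memory_order_release);
    return true;
}

/* Consumer side: single consumer per queue, so no CAS is needed here. */
static bool mc_consume(struct mc_queue *q, struct mc_rec *out)
{
    unsigned h = atomic_load(&q->head);
    struct mc_slot *s = &q->slot[h % QLEN];
    if (!atomic_load_explicit(&s->ready, memory_order_acquire))
        return false;                        /* empty, or record still being written */
    *out = s->rec;
    atomic_store(&s->ready, 0);
    atomic_store(&q->head, h + 1);           /* hand the slot back to producers */
    return true;
}

/* Correctable errors may be dropped when the queue is full; count them. */
static enum mc_post mc_post_ce(struct mc_shared *shm, const struct mc_rec *rec)
{
    if (mc_produce(&shm->ce, rec))
        return MC_POSTED;
    atomic_fetch_add(&shm->ce.dropped, 1);
    return MC_CE_DROPPED;
}

/* Uncorrectable errors are never dropped silently; overflow is reported. */
static enum mc_post mc_post_ue(struct mc_shared *shm, const struct mc_rec *rec)
{
    return mc_produce(&shm->ue, rec) ? MC_POSTED : MC_UE_OVERFLOW;
}

/* Single-threaded demonstration: a CE storm cannot crowd out a UE. */
static unsigned mc_demo(void)
{
    struct mc_shared shm;
    struct mc_rec r = { .bank = 4, .status = 0, .addr = 0 };
    mc_queue_init(&shm.ue);
    mc_queue_init(&shm.ce);
    for (unsigned i = 0; i < QLEN + 3; i++)
        mc_post_ce(&shm, &r);                /* the last three find the queue full */
    assert(mc_post_ue(&shm, &r) == MC_POSTED);
    return atomic_load(&shm.ce.dropped);     /* 3 */
}
```

In a real handler the domain would drain the UE queue first, when the
synchronous event arrives, and drain the CE queue later from the
asynchronous notification.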

[cut]

> After some code reading I found a nmi_pending, nmi_masked and nmi_addr in
[cut]

Still chewing on that ...

Cheers

Gavin
