From: "Christoph Egger"
Subject: Re: RFC: MCA/MCE concept
Date: Wed, 30 May 2007 09:45:50 +0200
Message-ID: <200705300945.51163.Christoph.Egger@amd.com>
References: <200705291732.46709.Christoph.Egger@amd.com> <465D4190.76E4.0078.0@novell.com>
In-Reply-To: <465D4190.76E4.0078.0@novell.com>
To: xen-devel@lists.xensource.com
Cc: Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >case I) - Xen receives a MCE from the CPU
> >
> >1) Xen MCE handler figures out if error is a correctable error (CE)
> >   or uncorrectable error (UE)
> >2a) error == CE:
> >      Xen notifies Dom0 if Dom0 installed an MCA event handler
> >      for statistical purpose
> >2b) error == UE and UE impacts Xen or Dom0:
>
> A very important aspect here is how you want to classify what impact an
> uncorrectable error has - generally, I can see very few situations where
> you could confine the impact to a sub-portion of the system (i.e. a single
> domU, dom0, or Xen). The general rule in my opinion must be to halt the
> system, the question just is how likely it is that you can get a meaningful
> message out (to screen, serial, or logs) that can help analyze the problem
> afterwards. If it is somewhat likely, then dom0 should be involved,
> otherwise Xen should just shut down the system.

This is where hardware features for error handling help most.
AMD CPUs have featured online-spare RAM and Chipkill since K8 Rev F.
CPUs such as the SPARC feature data poisoning, which would be the
handiest technique to use here.
Maybe this line:

> >      Xen does some self-healing

should be this:

       Xen *tries* to do some self-healing

> >      and notifies Dom0 on success if Dom0 installed MCA event handler
> >      or Xen panics on failure

The first implementation can just panic here. The self-healing
will be implemented and improved over time.

> >2c) error == UE and UE impacts DomU:
> >      In case of Dom0 installed MCA event handler:
> >         Xen notifies Dom0 and Dom0 tells Xen whether
> >         to also notify DomU and/or does some operations
> >         on the DomU (case II)
> >      In case Dom0 did not install MCA event handler,
> >         Xen notifies DomU
> >3a) DomU is a PV guest:
> >      if DomU installed MCA event handler, it gets notified to perform
> >         self-healing
> >      if DomU did not install MCA event handler, notify Dom0 to do
> >         some operations on DomU (case II)
> >      if neither DomU nor Dom0 installed MCA event handlers,
> >         then Xen kills DomU
> >3b) DomU is a HVM guest:
> >      if DomU features a PV driver then behave as in 3a)
>
> What significance do pv drivers have here? Or do you mean a pv MCA
> driver?

Yes, I mean the pv MCA driver.

>
> >      if DomU enabled MCA/MCE via MSR, inject MCE into guest
> >      if DomU did not enable MCA/MCE via MSR, notify Dom0
> >         to do some operations on DomU (case II)
> >      if neither DomU enabled MCA/MCE nor Dom0 installed
> >         MCA event handler, Xen kills DomU
>
> Injecting an MCE to a hvm guest seems at least questionable. It can't
> really do anything about it (it doesn't even know the real topology of the
> system it's running on, so addresses stored in MSRs are meaningless -
> either you allow them to be read untranslated [in which case the guest
> cannot make sense of them] or you do translation for the guest [in which
> case it might make assumptions about co-locality of other nearby pages
> which will be wrong]).

Yes, Xen should do the translation for the guest. The guest's
assumptions about co-locality would then have to be fixed. I know
that's easier said than done.
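To make the case I routing above easier to discuss, here is a minimal
sketch of it as a pure decision function in C. It is only one reading of
the proposal, not Xen code: every type, field, and function name below is
hypothetical, and it assumes that for a UE hitting a DomU, Dom0 is asked
first whenever it has installed an MCA event handler (2c), with the DomU
paths (3a/3b) used only otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; none come from the Xen source tree. */
enum mce_kind   { MCE_CE, MCE_UE };                 /* step 1 */
enum mce_impact { IMPACT_XEN_OR_DOM0, IMPACT_DOMU };
enum guest_type { GUEST_PV, GUEST_HVM };

enum mce_action {
    ACT_NONE,         /* nobody registered, nothing to report */
    ACT_NOTIFY_DOM0,  /* Dom0 MCA event handler gets the event */
    ACT_PANIC,        /* Xen halts the system */
    ACT_NOTIFY_DOMU,  /* DomU's (pv) MCA handler gets the event */
    ACT_INJECT_MCE,   /* HVM guest that enabled MCA/MCE via MSR */
    ACT_KILL_DOMU     /* no handler anywhere: Xen kills the DomU */
};

struct mce_ctx {
    enum mce_kind   kind;
    enum mce_impact impact;          /* only meaningful for a UE */
    enum guest_type gtype;           /* only meaningful for IMPACT_DOMU */
    bool dom0_handler;               /* Dom0 installed an MCA event handler */
    bool domu_pv_mca;                /* DomU installed a pv MCA handler/driver */
    bool hvm_mce_enabled;            /* HVM guest enabled MCA/MCE via MSR */
    bool self_heal_ok;               /* result of Xen's self-healing attempt */
};

enum mce_action mce_decide(const struct mce_ctx *c)
{
    assert(c != NULL);

    /* 2a) correctable error: report to Dom0 for statistics, if it listens. */
    if (c->kind == MCE_CE)
        return c->dom0_handler ? ACT_NOTIFY_DOM0 : ACT_NONE;

    /* 2b) UE impacting Xen or Dom0: try self-healing, panic on failure. */
    if (c->impact == IMPACT_XEN_OR_DOM0) {
        if (!c->self_heal_ok)
            return ACT_PANIC;
        return c->dom0_handler ? ACT_NOTIFY_DOM0 : ACT_NONE;
    }

    /* 2c) UE impacting a DomU: Dom0 decides if it installed a handler. */
    if (c->dom0_handler)
        return ACT_NOTIFY_DOM0;

    /* 3a) PV guest, or 3b) HVM guest with a pv MCA driver. */
    if (c->domu_pv_mca)
        return ACT_NOTIFY_DOMU;

    /* 3b) HVM guest that enabled MCA/MCE via MSR. */
    if (c->gtype == GUEST_HVM && c->hvm_mce_enabled)
        return ACT_INJECT_MCE;

    /* Neither Dom0 nor DomU can handle it. */
    return ACT_KILL_DOMU;
}
```

With self_heal_ok hard-wired to false, 2b) degenerates to the "just
panic" behaviour suggested for the first implementation.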
> Doing this to a pv domU for purely notification purposes (where the guest
> knows it's running virtualized) is clearly a different matter.

Yes, I agree with you here. The general idea behind informing a DomU
is to let its own fault management handle the error. It is always better
to let it kill a screen saver process and keep the word processor running
than to kill the whole guest. The DomU should crash itself if it thinks
that is best.

> >case II) - Xen receives Dom0 instructions via Hypercall
> >
> >There are different reasons, why Xen should do something.
> >
> >  - Dom0 got enough CEs so that UEs are very likely to happen in order
> >    to "circumvent" UEs.
> >  - Possible operations on a DomU
> >      - save/restore DomU
> >      - (live-)migrate DomU to a different physical machine
> >      - etc.
>
> Very heavy-weight operations, which I think are unlikely to succeed if
> you already suspect the system's going to suffer a UE soon.

Yes, they are heavy-weight operations. Do you have any ideas about what
a Dom0 can do? The idea here is that the Dom0's fault management helps
guests survive as well as possible.

Christoph

-- 
AMD Saxony, Dresden Germany
Operating System Research Center

Legal Information:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address):
   Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register court Dresden: HRA 4896
General partner authorized to represent:
   AMD Saxony LLC (registered office Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC:
   Dr. Hans-R. Deppe, Thomas McCoy