From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Christoph Egger"
Subject: Re: RFC: MCA/MCE concept
Date: Fri, 1 Jun 2007 10:11:35 +0200
Message-ID: <200706011011.35336.Christoph.Egger@amd.com>
References: <907625E08839C4409CE5768403633E0B02561D81@sefsexmb1.amd.com>
In-Reply-To: <907625E08839C4409CE5768403633E0B02561D81@sefsexmb1.amd.com>
To: xen-devel@lists.xensource.com
Cc: Gavin Maltby
List-Id: xen-devel@lists.xenproject.org

On Wednesday 30 May 2007 17:03:55 Petersson, Mats wrote:
> [snip]
>
> > My feeling is that the hypervisor and dom0 own the hardware and as such
> > all hardware fault management should reside there. So we should never
> > deliver any form of #MC to a domU, nor should a poll of MCA state from
> > a domU ever observe valid state (e.g., make the RDMSR return 0).
> > So all handling, logging and diagnosis as well as hardware response
> > actions (such as to deploy an online spare chip-select) are controlled
> > in the hypervisor/dom0 combination. That seems a consistent model - e.g.,
> > if a domU is migrated to another system it should not carry the
> > diagnosis state of the original system across etc, since that belongs
> > with the one domain that cannot migrate.
>
> I agree entirely with this.
>
> > But that is not to say that (I think at a future phase) domU should not
> > participate in a higher-level fault management function, at the
> > direction of the hypervisor/dom0 combo. For example if/when we can
> > isolate an uncorrectable error to a single domU we could forward such
> > an event to the affected domU if it has registered its ability/interest
> > in such events. These won't be in the form of a faked #MC or anything,
> > instead they'd be some form of synchronous trap experienced when next
> > the affected domU context resumes on CPU. The intelligent domU handler
> > can then decide whether the domU must panic, whether it could simply
> > kill the affected process etc. Those details are clearly sketchy, but
> > the idea is to up-level the communication to a domU to be more like
> > "you're broken" rather than "here's a machine-level hardware error for
> > you to interpret and decide what to do with".
>
> Yes, this makes much more sense than forwarding #MC, as the guest would
> have a hard time to actually do anything really useful with this. As far
> as I know, most uncorrectable errors are near enough entirely fatal in
> most commercial non-Enterprise OS's anyways - e.g. in Windows XP or
> Server 2K3, it always ends in a blue-screen - which is hardly any better
> than the guest being "humanely euthanized" by Dom0.
>
> I take it this would be some sort of hypercall (available through the
> regular PV-driver interface for HVM guests) to say "Let me know if I'm
> broken - trap on vector X".

In short: guests with a PV MCA driver will see a certain event (assuming
the event mechanism is used for the notification), and guests without a
PV MCA driver will see a "General Protection Fault".
Is that right?

> --
> Mats
>
> > Gavin

-- 
AMD Saxony, Dresden, Germany
Operating System Research Center

Legal Information:
AMD Saxony Limited Liability Company & Co. KG
Sitz (Geschäftsanschrift):
   Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland
Registergericht Dresden: HRA 4896
vertretungsberechtigter Komplementär:
   AMD Saxony LLC (Sitz Wilmington, Delaware, USA)
Geschäftsführer der AMD Saxony LLC:
   Dr. Hans-R. Deppe, Thomas McCoy