From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gavin Maltby
Subject: Re: Re: [PATCH 1/4] CPU online/offline support in Xen
Date: Wed, 17 Sep 2008 14:17:02 +1000
Message-ID: <48D084BE.5050602@sun.com>
References: <200809111623.11316.Christoph.Egger@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset=utf-8
Content-Transfer-Encoding: 7BIT
In-reply-to: <200809111623.11316.Christoph.Egger@amd.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Christoph Egger
Cc: Haitao Shan, "Tian, Kevin", xen-devel@lists.xensource.com, "Shan, Haitao", Keir Fraser
List-Id: xen-devel@lists.xenproject.org

Christoph Egger wrote:
> On Thursday 11 September 2008 16:15:14 Keir Fraser wrote:
>> I applied the patch with the following changes:
>> * I rewrote your changes to fixup_irqs(). We should force lazy EOIs
>> *after* we have serviced any straggling interrupts. Also we should actually
>> clear the EOI stack so it is empty next time the CPU comes online.
>> * I simplified your changes to schedule.c in light of the fact we run in
>> stop_machine context. Hence we can be quite relaxed about locking, for
>> example.
>> * I removed your change to __csched_vcpu_is_migrateable() and instead put
>> a similar check in csched_load_balance(). I think this is clearer and also
>> cheaper.
>>
>> I note that the VCPU currently running on the offlined CPU continues to run
>> there even after __cpu_disable(), and until that CPU does a final run
>> through the scheduler soon after. I hope it does not matter there is one
>> vcpu with v->processor == offlined_cpu for a short while
>
> This is not acceptable regarding to machine check. When Dom0 offlines a
> defect cpu, nothing may continue on it or silent data corruption occurs.

I don't see this as a problem for machine check correctness.
If dom0 asks to offline a cpu (because it believes the cpu is busted
and a threat to uptime), that decision is fundamentally asynchronous
to the actual error handling that occurred at machine check exception
time:

- running in whatever context
- MCE occurs
- trap to hypervisor MCE handler
  . this decides on hypervisor panic, or other appropriate
    immediate (in handler) response
  . telemetry forwarded to dom0 for logging and analysis
- assume no hypervisor panic
- eons pass during which any unconstrained bad data remaining
  after initial handling may go anywhere
- dom0 gets telemetry and let's say diagnoses a fault and decides
  to call back into the hypervisor to offline the offending cpu

Note the "eons pass" bit; tonnes of instructions may run on the bad cpu
in this time, and a few more for some offline delay won't hurt.

Gavin