From mboxrd@z Thu Jan  1 00:00:00 1970
From: Gavin Maltby <Gavin.Maltby@Sun.COM>
Subject: Re: Re: [PATCH 1/4] CPU online/offline support in Xen
Date: Wed, 17 Sep 2008 14:17:02 +1000
Message-ID: <48D084BE.5050602@sun.com>
References: <C4EEE682.2707B%keir.fraser@eu.citrix.com>
	<200809111623.11316.Christoph.Egger@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset=utf-8
Content-Transfer-Encoding: 7BIT
Return-path: <xen-devel-bounces@lists.xensource.com>
In-reply-to: <200809111623.11316.Christoph.Egger@amd.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Christoph Egger <Christoph.Egger@amd.com>
Cc: Haitao Shan <maillists.shan@gmail.com>, "Tian,
	Kevin" <kevin.tian@intel.com>, xen-devel@lists.xensource.com, "Shan, Haitao" <haitao.shan@intel.com>, Keir Fraser <keir.fraser@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

Christoph Egger wrote:
> On Thursday 11 September 2008 16:15:14 Keir Fraser wrote:
>> I applied the patch with the following changes:
>>  * I rewrote your changes to fixup_irqs(). We should force lazy EOIs
>> *after* we have serviced any straggling interrupts. Also we should actually
>> clear the EOI stack so it is empty next time the CPU comes online.
>>  * I simplified your changes to schedule.c in light of the fact we run in
>> stop_machine context. Hence we can be quite relaxed about locking, for
>> example.
>>  * I removed your change to __csched_vcpu_is_migrateable() and instead put
>> a similar check in csched_load_balance(). I think this is clearer and also
>> cheaper.
>>
>> I note that the VCPU currently running on the offlined CPU continues to run
>> there even after __cpu_disable(), and until that CPU does a final run
>> through the scheduler soon after. I hope it does not matter there is one
>> vcpu with v->processor == offlined_cpu for a short while 
> 
> This is not acceptable regarding to machine check. When Dom0 offlines a
> defect cpu, nothing may continue on it or silent data corruption occurs.

I don't see this as a problem for machine check correctness.

If dom0 asks to offline a cpu (because it believes the cpu is busted and
a threat to uptime), that decision is fundamentally asynchronous
to the actual error handling that occurred at machine check exception
time:

  - running in whatever context
  - MCE occurs
  - trap to hypervisor MCE handler
      . this decides on hypervisor panic, or other appropriate
        immediate (in handler) response
      . telemetry forwarded to dom0 for logging and analysis
  - assume no hypervisor panic
  - eons pass during which any unconstrained bad data remaining
    after initial handling may go anywhere
  - dom0 gets telemetry and let's say diagnoses a fault and
    decides to call back into the hypervisor to offline the
    offending cpu

Note the "eons pass" bit: tonnes of instructions may run on the
bad cpu in that time, so a few more during a short offline delay
won't hurt.
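
To put rough numbers on that, here is a toy sketch in Python. It only
models the timeline above; the 2 second diagnosis time and the 1 ms
delay before the cpu's final pass through the scheduler are made-up
figures for illustration, and none of the names correspond to anything
in Xen.

```python
# Toy model of the MCE-to-offline timeline. All names and numbers are
# illustrative assumptions, not Xen code or measured values.
from dataclasses import dataclass


@dataclass
class Timeline:
    mce_ns: int              # MCE taken, hypervisor handler runs
    offline_request_ns: int  # dom0 calls back to offline the cpu
    cpu_quiesced_ns: int     # cpu has made its final pass through the scheduler

    def exposure_before_request(self) -> int:
        """Time the suspect cpu keeps running before dom0 even asks."""
        return self.offline_request_ns - self.mce_ns

    def extra_offline_delay(self) -> int:
        """Additional time the cpu runs after the offline request."""
        return self.cpu_quiesced_ns - self.offline_request_ns

    def delay_fraction(self) -> float:
        """Share of the total exposure window due to the offline delay."""
        total = self.cpu_quiesced_ns - self.mce_ns
        return self.extra_offline_delay() / total


# Assumed figures: 2 s for dom0 to diagnose, 1 ms for the final scheduler pass.
t = Timeline(mce_ns=0,
             offline_request_ns=2_000_000_000,
             cpu_quiesced_ns=2_001_000_000)

print(t.exposure_before_request(), t.extra_offline_delay(), t.delay_fraction())
```

With those assumed figures the offline delay is well under 0.1% of the
time the cpu has already been running since the error.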

Gavin