From: Gavin Maltby <Gavin.Maltby@Sun.COM>
To: Christoph Egger <Christoph.Egger@amd.com>
Cc: Haitao Shan <maillists.shan@gmail.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	xen-devel@lists.xensource.com,
	"Shan, Haitao" <haitao.shan@intel.com>,
	Keir Fraser <keir.fraser@eu.citrix.com>
Subject: Re: Re: [PATCH 1/4] CPU online/offline support in Xen
Date: Wed, 17 Sep 2008 14:17:02 +1000
Message-ID: <48D084BE.5050602@sun.com>
In-Reply-To: <200809111623.11316.Christoph.Egger@amd.com>

Christoph Egger wrote:
> On Thursday 11 September 2008 16:15:14 Keir Fraser wrote:
>> I applied the patch with the following changes:
>>  * I rewrote your changes to fixup_irqs(). We should force lazy EOIs
>> *after* we have serviced any straggling interrupts. Also we should actually
>> clear the EOI stack so it is empty next time the CPU comes online.
>>  * I simplified your changes to schedule.c in light of the fact we run in
>> stop_machine context. Hence we can be quite relaxed about locking, for
>> example.
>>  * I removed your change to __csched_vcpu_is_migrateable() and instead put
>> a similar check in csched_load_balance(). I think this is clearer and also
>> cheaper.
>>
>> I note that the VCPU currently running on the offlined CPU continues to run
>> there even after __cpu_disable(), and until that CPU does a final run
>> through the scheduler soon after. I hope it does not matter there is one
>> vcpu with v->processor == offlined_cpu for a short while 
> 
> This is not acceptable regarding to machine check. When Dom0 offlines a
> defect cpu, nothing may continue on it or silent data corruption occurs.

I don't see this as a problem for machine check correctness.

If dom0 asks to offline a cpu (because it believes the cpu is busted and
a threat to uptime), that decision is fundamentally asynchronous
to the actual error handling that occurred at machine check exception
time:

  - running in whatever context
  - MCE occurs
  - trap to hypervisor MCE handler
	. this decides on hypervisor panic, or other appropriate
	  immediate (in handler) response
	. telemetry forwarded to dom0 for logging and analysis
  - assume no hypervisor panic
  - eons pass during which any unconstrained bad data remaining
    after initial handling may go anywhere
  - dom0 gets telemetry and let's say diagnoses a fault and
    decides to call back into the hypervisor to offline the
    offending cpu

Note the "eons pass" bit;  tonnes of instructions may run on the
bad cpu in this time, and a few more for some offline delay won't
hurt.
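
To put rough numbers on the "eons pass" point, here is a minimal sketch in
Python. Every figure in it is an illustrative assumption (the instruction
rate, the time dom0 takes to diagnose, the length of the offline delay), and
the function names are hypothetical; none of this is measured from Xen or
from any real hardware. It only compares how many instructions the bad cpu
runs before the offline request with how many it runs during the extra delay.

```python
# Toy model of the timeline above. All numbers are illustrative assumptions,
# not measurements from Xen or any real hardware.

def instructions_run(seconds, instr_per_second):
    """Instructions a cpu retires in `seconds` at the given (assumed) rate."""
    return int(seconds * instr_per_second)


def offline_delay_share(diagnosis_seconds, offline_delay_seconds,
                        instr_per_second=1e9):
    """Fraction of post-MCE instructions that fall in the offline delay.

    diagnosis_seconds: assumed time from the MCE until dom0 asks the
        hypervisor to offline the cpu (the "eons pass" step).
    offline_delay_seconds: assumed time the vcpu keeps running on the cpu
        after the offline request, until the final scheduler pass.
    """
    before = instructions_run(diagnosis_seconds, instr_per_second)
    during = instructions_run(offline_delay_seconds, instr_per_second)
    return during / (before + during)


# Assumed: dom0 needs 1 s to log, diagnose and request the offline, and the
# vcpu lingers for 1 ms afterwards.
share = offline_delay_share(diagnosis_seconds=1.0,
                            offline_delay_seconds=0.001)
print(f"offline delay accounts for {share:.4%} of post-MCE instructions")
```

With any plausible choice of inputs the diagnosis window dominates, which is
the only point the sketch is meant to make.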

Gavin
