Re: Debugging a weird hardware fault.

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Keir Fraser <keir.xen@gmail.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	winston.l.wang@intel.com, gang.wei@intel.com
Subject: Re: Debugging a weird hardware fault.
Date: Tue, 2 Aug 2011 15:14:21 +0100
Message-ID: <4E38063D.2030103@citrix.com>
In-Reply-To: <CA581B8A.1EBFA%keir.xen@gmail.com>

On 29/07/11 08:10, Keir Fraser wrote:
> On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
>> Initially, an SMI was what I was thinking, but the triple fault occurs whether
>> you start bringing down CPUs or not.  While waiting 10 seconds in the
>> platform_op select statement, the fault still occurs when all CPUs are still
>> up, all IRQs still enabled and potentially domU's still up.  (Also, from
>> studying the Xen3.4 code, I believe that interrupts are still actually up
>> during time_suspend(), but are soon brought down by lapic_suspend() later in
>> device_power_down().)
>>
>> Conversely, in the hacked up case where I ditched most of the shared S3/S5
>> codepath and just hit the PM1A, the server correctly shut down and stayed shut
>> down, implying that the fault was caused by software (be it BIOS or OS) rather
>> than hardware.  From what I understand of the ACPI spec (and I claim very
>> little knowledge), there are a multitude of hardware events which could bring
>> the server out of S5, appearing as a triple fault, which would not be affected
>> by whether you had hit the PM1A register.
>>
>> In this specific example, dom0 regular shutdown code already brought down the
>> domUs (of which there were none because we never started any), and we were
>> running with 1 CPU only so no others were up.  This opens up a whole host of
>> other possibilities which could be having an effect between the
>> XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.
> Well I expect dom0 has done some going-to-sleep work that has left the
> platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control
> register and actually finalising the shutdown.
>
> For example, it will have executed the _GTS ACPI method if there is one.
> That is supposed to happen immediately before writing PM1.SLP_EN, with no
> intervening interrupt activity or I/O. Obviously things don't work out quite
> like that when running on Xen!
>
> This is an architectural limitation of how ACPI sleep is currently
> implemented for Xen. It may need some rethinking to do it really properly
> according to the spec. e.g., do a hypercall just to prepare Xen for
> shutdown, but return back to dom0 in some limited environment to actually
> have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code
> block that Xen should simply jump at to get the sleep to happen (where that
> code block would basically be dom0's acpi_enter_sleep() function). There are
> a few, somewhat distasteful, options that are more respectful of the ACPI
> spec than we are right now.
>
>  -- Keir
Just for information, this turned out to be a BIOS bug.  It was setting
a 6 second timer when executing _PTS, which triggered a system reset if
PM1{a,b} had not been written by the time the timer expired.  As Xen does
all of its shutdown after the call to _PTS and before writing PM1{a,b},
there is a significant time gap, which was falling foul of the timer in
most cases.
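
The race described above can be modelled in a few lines.  This is only
an illustrative sketch: the 6 second timeout is the figure quoted above,
while the function name and the step durations are made-up placeholders,
not measurements from the affected box.

```python
BIOS_TIMEOUT_S = 6.0  # timer armed by the BIOS when _PTS runs (figure from this thread)


def reset_fires(step_durations_s, timeout_s=BIOS_TIMEOUT_S):
    """Return True if the BIOS timer expires before PM1{a,b} is written.

    step_durations_s: seconds spent in each step that runs after _PTS
    and before the PM1 write (hypothetical values, for illustration only).
    """
    elapsed = sum(step_durations_s)
    return elapsed >= timeout_s


# Hypothetical timings: a native kernel goes almost straight from _PTS
# to the PM1 write.
native_path = [0.05]
# Hypothetical timings: Xen performs its own shutdown work in between.
xen_path = [0.05, 4.0, 3.5]

print(reset_fires(native_path))  # False
print(reset_fires(xen_path))     # True
```

Anything that keeps the total below the timeout avoids the reset, which
is why shortening or reordering the work between _PTS and the PM1 write
matters here.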

In this case, it seems likely that a BIOS fix can be done, as Supermicro
do provide a custom BIOS for the NetScaler box in question.

However, if anyone else comes across this issue, we did make a software
workaround.  You can replace /etc/init.d/halt (or the equivalent for your
chosen dom0 distro) so that it kexecs into a native kernel, which listens
for a special command line parameter and calls pm_power_off_prepare()
and pm_power_off() after the ACPI module has initialized[1].
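
As a rough sketch of the halt-script side of this workaround, the
snippet below only builds the two kexec invocations and prints them; it
does not run anything.  The -l, --initrd=, --append= and -e options are
standard kexec-tools options, but the kernel and initrd paths, the root
device and the parameter name "acpi_poweroff_now" are hypothetical
placeholders, since the real ones are not given in this mail.

```python
import shlex


def build_kexec_commands(kernel, initrd, base_cmdline, poweroff_param):
    """Build the two kexec invocations: load the native kernel, then jump to it."""
    cmdline = f"{base_cmdline} {poweroff_param}".strip()
    load = ["kexec", "-l", kernel, f"--initrd={initrd}", f"--append={cmdline}"]
    execute = ["kexec", "-e"]
    return [load, execute]


cmds = build_kexec_commands(
    kernel="/boot/vmlinuz-native",        # hypothetical path
    initrd="/boot/initrd-native.img",     # hypothetical path
    base_cmdline="root=/dev/sda1 ro",     # hypothetical root device
    poweroff_param="acpi_poweroff_now",   # hypothetical parameter name
)
for argv in cmds:
    print(shlex.join(argv))
```

The native kernel still has to recognise the parameter and call
pm_power_off_prepare() and pm_power_off() itself; that kernel-side part
is not covered by this sketch.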

This issue does, however, show that Xen itself is in breach of the ACPI
spec, which is a dangerous situation to be in given the fragility of
ACPI at the best of times.  In due course, I will put my mind to solving
the dom0-Xen ACPI interaction problems if the question is still open.

~Andrew Cooper

[1] Yes, this is a hack.  Sorry.  It's the easiest solution without
rewriting Xen.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
