Debugging a weird hardware fault.


* Debugging a weird hardware fault.
@ 2011-07-28 19:53 Andrew Cooper
  2011-07-28 20:42 ` Keir Fraser
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2011-07-28 19:53 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com

Hello,

I am trying to debug an issue which appears on the surface as "run
shutdown -h +0 in dom0 and the machine reboots".  The issue reproduces
on a Supermicro X8DT6 motherboard
(http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm)
only (as far as we can tell - we can't reproduce it on any other
hardware), on both Xen 3.4 and Xen 4.1.  The debugging described below
is specifically against 3.4.

It reproduces irrespective of number of CPUs and irrespective of IOMMU
utilization.  For all tests, the server is being run with maxcpus=1 on
the Xen command-line and no domUs at all.

Tracing the path of execution, Xen is getting the XENPF_enter_acpi_sleep
platform op and acting on it correctly, going down the ACPI S5 codepath.

My assumption is that the reboot is caused by a triple fault, as the
server reboots before it actually writes to the PM1A register (except
for the case where it actually works, at which point it writes correctly
and properly shuts down).  There is no indication on the serial console
of a fault or double fault.

My method of tracing is
#define SERIAL_CHAR(ch) __asm__ __volatile__ ("mov %0, %%al\n\t"      \
                                              "mov $0x3f8, %%dx\n\t"  \
                                              "out %%al,%%dx\n\t"     \
                                              :: "g"(ch) : "%ax", "%dx");
scattered over the codebase.
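
For readers who want to reuse this tracing trick, here is a minimal sketch of a variant that polls the 16550 line status register before writing, so that back-to-back characters are not dropped.  The port I/O is hidden behind a hypothetical `struct port_io` (not part of Xen) so the logic can be exercised in userspace against a fake UART; inside the hypervisor the two callbacks would simply be the real port read and write primitives.

```c
#include <assert.h>
#include <stdint.h>

#define COM1_BASE      0x3f8  /* same port the SERIAL_CHAR macro targets */
#define UART_THR       0      /* transmit holding register offset */
#define UART_LSR       5      /* line status register offset */
#define UART_LSR_THRE  0x20   /* LSR bit: transmit holding register empty */

/* Hypothetical abstraction over port I/O, used only for this sketch. */
struct port_io {
    uint8_t (*in8)(uint16_t port);
    void (*out8)(uint16_t port, uint8_t val);
};

/* Wait until the UART can accept a byte, then send it.
 * Returns how many times the status register had to be re-polled. */
static unsigned serial_char_polled(const struct port_io *io, char ch)
{
    unsigned polls = 0;

    assert(io && io->in8 && io->out8);
    while (!(io->in8(COM1_BASE + UART_LSR) & UART_LSR_THRE))
        polls++;
    io->out8(COM1_BASE + UART_THR, (uint8_t)ch);
    return polls;
}

/* Fake UART for userspace experiments: reports "busy" for a few polls. */
static unsigned fake_busy_polls;
static char fake_tx_log[16];
static unsigned fake_tx_len;

static uint8_t fake_in8(uint16_t port)
{
    if (port == COM1_BASE + UART_LSR && fake_busy_polls > 0) {
        fake_busy_polls--;
        return 0;            /* holding register still full */
    }
    return UART_LSR_THRE;    /* holding register empty, safe to write */
}

static void fake_out8(uint16_t port, uint8_t val)
{
    if (port == COM1_BASE + UART_THR && fake_tx_len < sizeof(fake_tx_log))
        fake_tx_log[fake_tx_len++] = (char)val;
}
```

The polling loop adds a small, variable delay per character, which matters when the fault being chased is itself timing dependent.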

The fault itself is time dependent - it occasionally works when the
shutdown code spends very little time in get_cmos_time.

By waiting at certain points, and in particular by inserting:

     for ( i = 0; i < 10; ++i )
     {
         SERIAL_CHAR('*');
         mdelay(1000);
     }

in the XENPF_enter_acpi_sleep case statement, it shows that the triple
fault is reliably 5 seconds after the hypercall, and in otherwise safe
code.  I SERIAL_CHAR'd the entry and exit of the NMI handler, which
shows that the triple fault is not caused by the NMI watchdog, which I
thought might be having an effect.

While waiting to print '*' every second, the serial console buffer
continues to be written to the UART, showing that other tasks are going
on while XENPF_enter_acpi_sleep is being serviced.

The server itself is otherwise totally stable, running PV, HVM (and some
bodged pv-on-hvm container for FreeBSD), along with performing SR-IOV
from 8 NICs with 40 VFs each.  I have a workaround: removing the call
to time_suspend(), at which point prodding the PM1A register happens
reliably before whatever causes the triple fault later.  However, this
is not a suitable solution for the S3 codepath which suffers the same
problem but really does need to run time_suspend.

My questions to the Xen community are:

what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is
in action, and more generally, how can I go about debugging which tasks
are being run.

Thanks in advance for any advice/tips

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


* Re: Debugging a weird hardware fault.
  2011-07-28 19:53 Debugging a weird hardware fault Andrew Cooper
@ 2011-07-28 20:42 ` Keir Fraser
  2011-07-28 22:45   ` Andrew Cooper
  0 siblings, 1 reply; 9+ messages in thread
From: Keir Fraser @ 2011-07-28 20:42 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xensource.com

On 28/07/2011 20:53, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> My questions to the Xen community are:
> 
> what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is
> in action, and more generally, how can I go about debugging which tasks
> are being run.

By the time you get to time_suspend(), you are running on CPU0, all other
CPUs are offline, all domUs are suspended, and IRQs are disabled. There's
not much scope for unexpected interruptions unless it's an NMI or SMI.

By that point the serial subsystem is in synchronous mode, rather than
interrupt-driven, so it's no wonder it continues to work.

 -- Keir


* RE: Debugging a weird hardware fault.
  2011-07-28 20:42 ` Keir Fraser
@ 2011-07-28 22:45   ` Andrew Cooper
  2011-07-29  7:10     ` Keir Fraser
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2011-07-28 22:45 UTC (permalink / raw)
  To: Keir Fraser, xen-devel@lists.xensource.com

________________________________________
From: Keir Fraser [keir.xen@gmail.com]
Sent: 28 July 2011 21:42
To: Andrew Cooper; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] Debugging a weird hardware fault.

On 28/07/2011 20:53, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> My questions to the Xen community are:
>
> what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is
> in action, and more generally, how can I go about debugging which tasks
> are being run.

By the time you get to time_suspend(), you are running on CPU0, all other
CPUs are offline, all domUs are suspended, and IRQs are disabled. There's
not much scope for unexpected interruptions unless it's an NMI or SMI.

By that point the serial subsystem is in synchronous mode, rather than
interrupt-driven, so it's no wonder it continues to work.

 -- Keir

Initially, an SMI was what I was thinking, but the triple fault occurs whether you start bringing down CPUs or not.  While waiting 10 seconds in the platform_op switch statement, the fault still occurs when all CPUs are still up, all IRQs still enabled and potentially domUs still up.  (Also, from studying the Xen 3.4 code, I believe that interrupts are still actually up during time_suspend(), but are soon brought down by lapic_suspend() later in device_power_down().)

Conversely, in the hacked-up case where I ditched most of the shared S3/S5 codepath and just hit the PM1A, the server correctly shut down and stayed shut down, implying that the fault was caused by software (be it BIOS or OS) rather than hardware.  From what I understand of the ACPI spec (and I claim very little knowledge), there are a multitude of hardware events which could bring the server out of S5, appearing as a triple fault, which would not be affected by whether you had hit the PM1A register.

In this specific example, dom0 regular shutdown code already brought down the domUs (of which there were none because we never started any), and we were running with 1 CPU only so no others were up.  This opens up a whole host of other possibilities which could be having an effect between the XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.

~Andrew


* Re: Debugging a weird hardware fault.
  2011-07-28 22:45   ` Andrew Cooper
@ 2011-07-29  7:10     ` Keir Fraser
  2011-07-29  7:24       ` Keir Fraser
  2011-08-02 14:14       ` Andrew Cooper
  0 siblings, 2 replies; 9+ messages in thread
From: Keir Fraser @ 2011-07-29  7:10 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xensource.com

On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> Initially, an SMI was what I was thinking, but the triple fault occurs whether
> you start bringing down CPUs or not.  While waiting 10 seconds in the
> platform_op switch statement, the fault still occurs when all CPUs are still
> up, all IRQs still enabled and potentially domUs still up.  (Also, from
> studying the Xen 3.4 code, I believe that interrupts are still actually up
> during time_suspend(), but are soon brought down by lapic_suspend() later in
> device_power_down().)
> 
> Conversely, in the hacked-up case where I ditched most of the shared S3/S5
> codepath and just hit the PM1A, the server correctly shut down and stayed shut
> down, implying that the fault was caused by software (be it BIOS or OS) rather
> than hardware.  From what I understand of the ACPI spec (and I claim very
> little knowledge), there are a multitude of hardware events which could bring
> the server out of S5, appearing as a triple fault, which would not be affected
> by whether you had hit the PM1A register.
> 
> In this specific example, dom0 regular shutdown code already brought down the
> domUs (of which there were none because we never started any), and we were
> running with 1 CPU only so no others were up.  This opens up a whole host of
> other possibilities which could be having an effect between the
> XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.

Well I expect dom0 has done some going-to-sleep work that has left the
platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control
register and actually finalising the shutdown.

For example, it will have executed the _GTS ACPI method if there is one.
That is supposed to happen immediately before writing PM1.SLP_EN, with no
intervening interrupt activity or I/O. Obviously things don't work out quite
like that when running on Xen!

This is an architectural limitation of how ACPI sleep is currently
implemented for Xen. It may need some rethinking to do it really properly
according to the spec. e.g., do a hypercall just to prepare Xen for
shutdown, but return back to dom0 in some limited environment to actually
have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code
block that Xen should simply jump at to get the sleep to happen (where that
code block would basically be dom0's acpi_enter_sleep() function). There are
a few, somewhat distasteful, options that are more respectful of the ACPI
spec than we are right now.
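
To make the ordering problem concrete, here is a toy model (not Xen code) of the first option above.  All names are hypothetical: `DOM0_ACPI_PREP` stands for dom0's firmware work such as _GTS, `XEN_TEARDOWN` for whatever Xen does to prepare for sleep, and the check only asks whether the PM1 write directly follows the firmware prep.

```c
#include <assert.h>
#include <stddef.h>

/* Coarse steps of a sleep transition; names are illustrative only. */
enum step {
    DOM0_ACPI_PREP,   /* dom0 firmware work, e.g. the _GTS method */
    XEN_TEARDOWN,     /* Xen preparing itself for the sleep state */
    PM1_SLP_EN_WRITE  /* final write that actually enters the sleep state */
};

#define MAX_STEPS 8

struct trace {
    enum step steps[MAX_STEPS];
    size_t len;
};

static void record(struct trace *t, enum step s)
{
    assert(t != NULL && t->len < MAX_STEPS);
    t->steps[t->len++] = s;
}

/* Ordering described in the thread: firmware prep, Xen's work, then PM1. */
static void current_flow(struct trace *t)
{
    record(t, DOM0_ACPI_PREP);
    record(t, XEN_TEARDOWN);
    record(t, PM1_SLP_EN_WRITE);
}

/* Hypothetical two-phase ordering: Xen prepares first, dom0 finishes. */
static void two_phase_flow(struct trace *t)
{
    record(t, XEN_TEARDOWN);
    record(t, DOM0_ACPI_PREP);
    record(t, PM1_SLP_EN_WRITE);
}

/* Returns 1 if the PM1 write directly follows the firmware prep step. */
static int prep_immediately_before_pm1(const struct trace *t)
{
    size_t i;

    for (i = 1; i < t->len; i++)
        if (t->steps[i] == PM1_SLP_EN_WRITE)
            return t->steps[i - 1] == DOM0_ACPI_PREP;
    return 0;
}
```

In this model only the two-phase ordering satisfies the check; the hard part in practice is the "limited environment" in which dom0 would have to run the last two steps.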

 -- Keir


* Re: Debugging a weird hardware fault.
  2011-07-29  7:10     ` Keir Fraser
@ 2011-07-29  7:24       ` Keir Fraser
  2011-08-02 14:14       ` Andrew Cooper
  1 sibling, 0 replies; 9+ messages in thread
From: Keir Fraser @ 2011-07-29  7:24 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xensource.com; +Cc: winston.l.wang, gang.wei

Cc'ing some of the Xen ACPI/PM maintainers to see if they have an opinion on
this issue...

On 29/07/2011 08:10, "Keir Fraser" <keir.xen@gmail.com> wrote:

> On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
> 
>> Initially, an SMI was what I was thinking, but the triple fault occurs whether
>> you start bringing down CPUs or not.  While waiting 10 seconds in the
>> platform_op switch statement, the fault still occurs when all CPUs are still
>> up, all IRQs still enabled and potentially domUs still up.  (Also, from
>> studying the Xen 3.4 code, I believe that interrupts are still actually up
>> during time_suspend(), but are soon brought down by lapic_suspend() later in
>> device_power_down().)
>> 
>> Conversely, in the hacked-up case where I ditched most of the shared S3/S5
>> codepath and just hit the PM1A, the server correctly shut down and stayed shut
>> down, implying that the fault was caused by software (be it BIOS or OS) rather
>> than hardware.  From what I understand of the ACPI spec (and I claim very
>> little knowledge), there are a multitude of hardware events which could bring
>> the server out of S5, appearing as a triple fault, which would not be affected
>> by whether you had hit the PM1A register.
>> 
>> In this specific example, dom0 regular shutdown code already brought down the
>> domUs (of which there were none because we never started any), and we were
>> running with 1 CPU only so no others were up.  This opens up a whole host of
>> other possibilities which could be having an effect between the
>> XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.
> 
> Well I expect dom0 has done some going-to-sleep work that has left the
> platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control
> register and actually finalising the shutdown.
> 
> For example, it will have executed the _GTS ACPI method if there is one.
> That is supposed to happen immediately before writing PM1.SLP_EN, with no
> intervening interrupt activity or I/O. Obviously things don't work out quite
> like that when running on Xen!
> 
> This is an architectural limitation of how ACPI sleep is currently
> implemented for Xen. It may need some rethinking to do it really properly
> according to the spec. e.g., do a hypercall just to prepare Xen for
> shutdown, but return back to dom0 in some limited environment to actually
> have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code
> block that Xen should simply jump at to get the sleep to happen (where that
> code block would basically be dom0's acpi_enter_sleep() function). There are
> a few, somewhat distasteful, options that are more respectful of the ACPI
> spec than we are right now.
> 
>  -- Keir
> 
> 


* Re: Debugging a weird hardware fault.
  2011-07-29  7:10     ` Keir Fraser
  2011-07-29  7:24       ` Keir Fraser
@ 2011-08-02 14:14       ` Andrew Cooper
  2011-08-02 14:26         ` Keir Fraser
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2011-08-02 14:14 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com, winston.l.wang, gang.wei

On 29/07/11 08:10, Keir Fraser wrote:
> On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
>> Initially, an SMI was what I was thinking, but the triple fault occurs whether
>> you start bringing down CPUs or not.  While waiting 10 seconds in the
>> platform_op switch statement, the fault still occurs when all CPUs are still
>> up, all IRQs still enabled and potentially domUs still up.  (Also, from
>> studying the Xen 3.4 code, I believe that interrupts are still actually up
>> during time_suspend(), but are soon brought down by lapic_suspend() later in
>> device_power_down().)
>>
>> Conversely, in the hacked-up case where I ditched most of the shared S3/S5
>> codepath and just hit the PM1A, the server correctly shut down and stayed shut
>> down, implying that the fault was caused by software (be it BIOS or OS) rather
>> than hardware.  From what I understand of the ACPI spec (and I claim very
>> little knowledge), there are a multitude of hardware events which could bring
>> the server out of S5, appearing as a triple fault, which would not be affected
>> by whether you had hit the PM1A register.
>>
>> In this specific example, dom0 regular shutdown code already brought down the
>> domUs (of which there were none because we never started any), and we were
>> running with 1 CPU only so no others were up.  This opens up a whole host of
>> other possibilities which could be having an effect between the
>> XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.
> Well I expect dom0 has done some going-to-sleep work that has left the
> platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control
> register and actually finalising the shutdown.
>
> For example, it will have executed the _GTS ACPI method if there is one.
> That is supposed to happen immediately before writing PM1.SLP_EN, with no
> intervening interrupt activity or I/O. Obviously things don't work out quite
> like that when running on Xen!
>
> This is an architectural limitation of how ACPI sleep is currently
> implemented for Xen. It may need some rethinking to do it really properly
> according to the spec. e.g., do a hypercall just to prepare Xen for
> shutdown, but return back to dom0 in some limited environment to actually
> have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code
> block that Xen should simply jump at to get the sleep to happen (where that
> code block would basically be dom0's acpi_enter_sleep() function). There are
> a few, somewhat distasteful, options that are more respectful of the ACPI
> spec than we are right now.
>
>  -- Keir
Just for information, this turned out to be a BIOS bug.  It was setting
a 6 second timer when executing _PTS, which hit the system reset if
PM1{a,b} had not been hit when the timer expired.  As Xen does all of
its shutdown after the call to _PTS and before PM1{a,b}, there is a
significant time gap, which was falling foul of the timer in most cases.
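
The race can be written down as a small model.  This is only a sketch: the 6000 ms constant is the timer described above, while the function name and the per-step durations a caller passes in are hypothetical.  It adds up the time spent between _PTS and the PM1 write and reports which event wins.

```c
#include <assert.h>
#include <stddef.h>

#define BIOS_RESET_TIMEOUT_MS 6000  /* the "6 second timer" armed at _PTS */

enum outcome { POWERED_OFF, RESET_BY_TIMER };

/* step_ms[] holds the duration of each piece of work done after _PTS and
 * before PM1{a,b} is written.  If the running total reaches the timeout,
 * the BIOS timer fires first and the box resets instead of powering off. */
static enum outcome shutdown_outcome(const unsigned *step_ms, size_t nsteps)
{
    unsigned long elapsed = 0;
    size_t i;

    assert(step_ms != NULL || nsteps == 0);
    for (i = 0; i < nsteps; i++) {
        elapsed += step_ms[i];
        if (elapsed >= BIOS_RESET_TIMEOUT_MS)
            return RESET_BY_TIMER;
    }
    return POWERED_OFF;
}
```

Under this model the ten-second mdelay() test loop from my first mail can never win the race, whereas a path that skips the slow steps can, which matches what was observed.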

In this case, it seems likely that a BIOS fix can be done, as Supermicro
do provide a custom BIOS for the NetScaler box in question.

However, if anyone else comes across this issue, we did make a software
solution.  You can replace /etc/init.d/halt (or equivalent for your
chosen dom0 distro) to KEXEC reboot into a native kernel which listens
for a special command line parameter and calls pm_power_off_prepare()
and pm_power_off() after the ACPI module has initialized[1].
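
As a rough illustration of the "listens for a special command line parameter" part, here is a userspace sketch of matching a whole token in a kernel command line string.  The flag name used in the note below is made up, and an in-kernel version would more likely register a handler with `__setup()` than parse the string by hand.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `flag` appears in `cmdline` as a complete,
 * space-delimited token, and 0 otherwise. */
static int cmdline_has_flag(const char *cmdline, const char *flag)
{
    const char *p;
    size_t flen;

    assert(cmdline != NULL && flag != NULL);
    flen = strlen(flag);
    if (flen == 0)
        return 0;

    p = cmdline;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == cmdline) || (p[-1] == ' ');
        int ends = (p[flen] == '\0') || (p[flen] == ' ') || (p[flen] == '\n');

        if (starts && ends)
            return 1;
        p += flen;   /* keep searching after this partial match */
    }
    return 0;
}
```

With a hypothetical flag such as `poweroff_after_acpi`, the replacement halt script would append it to the kexec'd kernel's command line, and the native kernel would power off once it sees it.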

This issue does however show that Xen itself is in breach of the ACPI
spec, which is a dangerous situation to be in given the fragility of
ACPI at the best of times.  In due course, I will put my mind to solving
the dom0-Xen ACPI interaction problems if the question is still open.

~Andrew Cooper

[1] Yes, this is a hack.  Sorry.  It's the easiest solution without
rewriting Xen.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


* Re: Debugging a weird hardware fault.
  2011-08-02 14:14       ` Andrew Cooper
@ 2011-08-02 14:26         ` Keir Fraser
  2011-08-02 14:56           ` Andrew Cooper
  0 siblings, 1 reply; 9+ messages in thread
From: Keir Fraser @ 2011-08-02 14:26 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel@lists.xensource.com, winston.l.wang, gang.wei

On 02/08/2011 07:14, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> Just for information, this turned out to be a BIOS bug.  It was setting
> a 6 second timer when executing _PTS, which hit the system reset if
> PM1{a,b} had not been hit when the timer expired.  As Xen does all of
> its shutdown after the call to _PTS and before PM1{a,b}, there is a
> significant time gap, which was falling foul of the timer in most cases.

Six seconds though, that's quite a long time! Is it a big box?

> In this case, it seems likely that a BIOS fix can be done, as Supermicro
> do provide a custom BIOS for the NetScaler box in question.
> 
> However, if anyone else comes across this issue, we did make a software
> solution.  You can replace /etc/init.d/halt (or equivalent for your
> chosen dom0 distro) to KEXEC reboot into a native kernel which listens
> for a special command line parameter and calls pm_power_off_prepare()
> and pm_power_off() after the ACPI module has initialized[1].
> 
> This issue does however show that Xen itself is in breach of the ACPI
> spec, which is a dangerous situation to be in given the fragility of
> ACPI at the best of times.  In due course, I will put my mind to solving
> the dom0-Xen ACPI interaction problems if the question is still open.

Yes, this is ultimately the issue. It's going to be a pain to fix properly,
unfortunately.

 -- Keir


* Re: Debugging a weird hardware fault.
  2011-08-02 14:26         ` Keir Fraser
@ 2011-08-02 14:56           ` Andrew Cooper
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Cooper @ 2011-08-02 14:56 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel@lists.xensource.com, winston.l.wang@intel.com,
	gang.wei@intel.com



On 02/08/11 15:26, Keir Fraser wrote:
> On 02/08/2011 07:14, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
>> Just for information, this turned out to be a BIOS bug.  It was setting
>> a 6 second timer when executing _PTS, which hit the system reset if
>> PM1{a,b} had not been hit when the timer expired.  As Xen does all of
>> its shutdown after the call to _PTS and before PM1{a,b}, there is a
>> significant time gap, which was falling foul of the timer in most cases.
> Six seconds though, that's quite a long time! Is it a big box?

It is a NetScaler SDX box, designed to have 24 logical pcpus, 96GB RAM,
320 PCI-passed-through ixgbe virtual functions (claiming 3 IRQs per VF).

It seems that Xen spends a fair amount of time doing freeze_domains
(even though dom0 has already shut down all domUs, albeit forcibly if
they haven't shut down nicely within 15 seconds), and bringing down the
other CPUs (in particular, it spends ages fiddling around with irq
affinities).

Overall, there is probably quite a bit of optimization which could be
done, but that still doesn't excuse a BIOS deciding that "a long time"
as per the ACPI spec is "less than 6 seconds".

~Andrew

>> In this case, it seems likely that a BIOS fix can be done, as Supermicro
>> do provide a custom BIOS for the NetScaler box in question.
>>
>> However, if anyone else comes across this issue, we did make a software
>> solution.  You can replace /etc/init.d/halt (or equivalent for your
>> chosen dom0 distro) to KEXEC reboot into a native kernel which listens
>> for a special command line parameter and calls pm_power_off_prepare()
>> and pm_power_off() after the ACPI module has initialized[1].
>>
>> This issue does however show that Xen itself is in breach of the ACPI
>> spec, which is a dangerous situation to be in given the fragility of
>> ACPI at the best of times.  In due course, I will put my mind to solving
>> the dom0-Xen ACPI interaction problems if the question is still open.
> Yes, this is ultimately the issue. It's going to be a pain to fix properly,
> unfortunately.
>
>  -- Keir
>
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


* Re: Debugging a weird hardware fault.
@ 2011-08-03 14:48 Jan Beulich
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2011-08-03 14:48 UTC (permalink / raw)
  To: andrew.cooper3, keir; +Cc: xen-devel, winston.l.wang, gang.wei

>>> Andrew Cooper 08/02/11 5:01 PM >>> 
>It seems that Xen spends a fair amount of time doing freeze_domains 
>(even though dom0 has already shut down all domUs, albeit forcibly if 
>they haven't shut down nicely within 15 seconds), and bringing down the 
>other CPUs (in particular, it spends ages fiddling around with irq 
>affinities). 

Is that independent of using a serial console? That is, are the delays
perhaps incurred just by that code being overly verbose? One of the
odd things I had noticed now and then is that during shutdown, various
IRQs get fixed up more than once (up to once per CPU brought down).
There surely are ways to have them moved to CPU0 directly in the
shutdown case.
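
A toy count of the difference, as a sketch only.  The fixup policy modelled here is hypothetical and is not Xen's actual algorithm: IRQs start spread round-robin across CPUs, CPUs 1..N-1 are offlined in ascending order, and each IRQ on a dying CPU is pushed to the next higher CPU (or to CPU0 when none is left).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IRQS 1024

/* Count IRQ moves when every CPU offline step re-homes that CPU's IRQs
 * onto the next CPU, which is itself about to go down (hypothetical policy). */
static unsigned long migrations_stepwise(unsigned ncpus, unsigned nirqs)
{
    unsigned irq_cpu[MAX_IRQS];
    unsigned long moves = 0;
    unsigned cpu, i;

    assert(ncpus >= 1 && nirqs <= MAX_IRQS);
    for (i = 0; i < nirqs; i++)
        irq_cpu[i] = i % ncpus;          /* round-robin initial affinity */

    for (cpu = 1; cpu < ncpus; cpu++) {
        unsigned target = (cpu + 1 < ncpus) ? cpu + 1 : 0;

        for (i = 0; i < nirqs; i++) {
            if (irq_cpu[i] == cpu) {
                irq_cpu[i] = target;
                moves++;
            }
        }
    }
    return moves;
}

/* Count IRQ moves when everything not already on CPU0 goes there once. */
static unsigned long migrations_direct(unsigned ncpus, unsigned nirqs)
{
    unsigned long moves = 0;
    unsigned i;

    assert(ncpus >= 1);
    for (i = 0; i < nirqs; i++)
        if (i % ncpus != 0)
            moves++;
    return moves;
}
```

With the 24 pcpus and 960 IRQs (320 VFs at 3 IRQs each) mentioned earlier in the thread, this worst-case model gives 11040 moves for the stepwise policy against 920 for the direct one.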

Jan
