* Debugging a weird hardware fault.
@ 2011-07-28 19:53 Andrew Cooper
  2011-07-28 20:42 ` Keir Fraser
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2011-07-28 19:53 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com

Hello,

I am trying to debug an issue which appears on the surface as "run
shutdown -h +0 in dom0 and the machine reboots".  The issue reproduces
on a Supermicro X8DT6 motherboard
(http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm)
only (as far as we can tell - we can't reproduce it on any other
hardware), on both Xen 3.4 and Xen 4.1.  The debugging described below
is specifically against 3.4.

It reproduces irrespective of number of CPUs and irrespective of IOMMU
utilization.  For all tests, the server is being run with maxcpus=1 on
the Xen command-line and no domUs at all.

Tracing the path of execution, Xen is getting the XENPF_enter_acpi_sleep
platform op and acting on it correctly, going down the ACPI S5 codepath.

My assumption is that the reboot is caused by a triple fault, as the
server reboots before it actually writes to the PM1A register (except
for the case where it actually works, at which point it writes correctly
and properly shuts down).  There is no indication on the serial console
of a fault or double fault.

My method of tracing is
#define SERIAL_CHAR(ch) __asm__ __volatile__ ("mov %0, %%al\n\t"     \
                                              "mov $0x3f8, %%dx\n\t" \
                                              "out %%al,%%dx\n\t" :: "g"(ch) : "%ax", "%dx");
scattered over the codebase.
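
For readers who want to try the same breadcrumb technique, here is a
minimal sketch of it as an inline function rather than a macro.  It is
not the code used above: port_out8() and port_in8() are hypothetical
stand-ins that record bytes in a buffer so the sketch runs in an
ordinary hosted build, and the wait on the line status register is an
addition that assumes a 16550-compatible UART at 0x3f8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COM1_BASE 0x3f8u            /* same port as the macro above */
#define UART_THR  (COM1_BASE + 0)   /* transmit holding register */
#define UART_LSR  (COM1_BASE + 5)   /* line status register */
#define LSR_THRE  0x20u             /* bit 5: transmit holding register empty */

/* Hypothetical hosted stand-ins for port I/O.  In a hypervisor build
 * these would be replaced by the real outb()/inb() primitives. */
static char   trace_log[64];
static size_t trace_len;

static void port_out8(uint16_t port, uint8_t val)
{
    /* Record bytes sent to the data register; drop them once full. */
    if (port == UART_THR && trace_len < sizeof(trace_log) - 1)
        trace_log[trace_len++] = (char)val;
}

static uint8_t port_in8(uint16_t port)
{
    /* The fake UART is always ready to accept another byte. */
    return (port == UART_LSR) ? LSR_THRE : 0;
}

/* Emit one marker character, waiting until the UART can take it. */
static inline void serial_char(char ch)
{
    while (!(port_in8(UART_LSR) & LSR_THRE))
        ;
    port_out8(UART_THR, (uint8_t)ch);
}

/* Compare everything traced so far against an expected marker string. */
static int trace_matches(const char *expected)
{
    trace_log[trace_len] = '\0';
    return strcmp(trace_log, expected) == 0;
}

/* Usage: drop markers at points of interest, then inspect the trail. */
static void trace_demo(void)
{
    serial_char('A');   /* e.g. on entry to a function */
    serial_char('*');   /* e.g. once per loop iteration */
    assert(trace_matches("A*"));
}
```

Taking a char argument avoids the operand-size ambiguity of passing an
arbitrary expression to "mov %0, %%al", and a function does not leave a
stray semicolon behind the way the macro does.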


The fault itself is time dependent - it occasionally works when the
shutdown code spends very little time in get_cmos_time.

By waiting at certain points, and in particular by inserting

    for ( i = 0; i < 10; ++i )
    {
        SERIAL_CHAR('*');
        mdelay(1000);
    }

in the XENPF_enter_acpi_sleep case statement, I can show that the triple
fault occurs reliably 5 seconds after the hypercall, and in otherwise
safe code.  I SERIAL_CHAR'd the entry and exit of the nmi handler, which
shows that the triple fault is not caused by the nmi watchdog, which I
thought might be having an effect.

While waiting to print '*' every second, the serial console buffer
continues to be written to the UART, showing that other tasks are going
on while XENPF_enter_acpi_sleep is being serviced.

The server itself is otherwise totally stable, running PV, HVM (and some
bodged pv-on-hvm container for FreeBSD), along with performing SR-IOV
from 8 NICs with 40 VFs each.  I have a workaround: removing the call
to time_suspend(), at which point prodding the PM1A register happens
reliably before whatever causes the triple fault later.  However, this
is not a suitable solution for the S3 codepath which suffers the same
problem but really does need to run time_suspend.

My questions to the Xen community are:

What (if any) new tasks get scheduled while a XENPF_enter_acpi_sleep is
in progress and, more generally, how can I go about debugging which
tasks are being run?

Thanks in advance for any advice/tips.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

* Re: Debugging a weird hardware fault.
@ 2011-08-03 14:48 Jan Beulich
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2011-08-03 14:48 UTC (permalink / raw)
  To: andrew.cooper3, keir; +Cc: xen-devel, winston.l.wang, gang.wei

>>> Andrew Cooper 08/02/11 5:01 PM >>> 
>It seems that Xen spends a fair amount of time doing freeze_domains 
>(even though dom0 has already shut down all domUs, albeit forcibly if 
>they haven't shut down nicely within 15 seconds), and bringing down the 
>other CPUs (in particular, it spends ages fiddling around with irq 
>affinities). 

Is that independent of using a serial console? That is, are the delays
perhaps incurred just by that code being overly verbose? One of the
odd things I had noticed now and then is that during shutdown, various
IRQs get fixed up more than once (up to once per CPU brought down).
There surely are ways to have them moved to CPU0 directly in the
shutdown case.

Jan
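
To illustrate the repeated-fixup observation above, here is a toy model,
not Xen code.  It assumes that when a CPU is brought down, every IRQ
bound to it is re-targeted at the next lower CPU, and compares the
number of fixups with moving each IRQ to CPU0 in one step.  NR_CPUS,
NR_IRQS, the round-robin starting layout and all function names are
made up for the example.

```c
#include <assert.h>

#define NR_CPUS 4
#define NR_IRQS 8

/* Hypothetical starting layout: IRQs spread round-robin over the CPUs. */
static void spread_irqs(unsigned int cpu_of[NR_IRQS])
{
    for (unsigned int irq = 0; irq < NR_IRQS; irq++)
        cpu_of[irq] = irq % NR_CPUS;
}

/* Bring CPUs down one at a time, highest first.  Each IRQ on the
 * departing CPU is pushed to the next lower CPU, so it may be fixed
 * up again when that CPU goes down in turn. */
static unsigned int fixups_stepwise(const unsigned int start[NR_IRQS])
{
    unsigned int cpu_of[NR_IRQS];
    unsigned int fixups = 0;

    for (unsigned int irq = 0; irq < NR_IRQS; irq++)
        cpu_of[irq] = start[irq];

    for (unsigned int cpu = NR_CPUS - 1; cpu > 0; cpu--)
        for (unsigned int irq = 0; irq < NR_IRQS; irq++)
            if (cpu_of[irq] == cpu) {
                cpu_of[irq] = cpu - 1;
                fixups++;
            }

    return fixups;
}

/* Shutdown-aware variant: every IRQ not already on CPU0 moves once. */
static unsigned int fixups_direct(const unsigned int start[NR_IRQS])
{
    unsigned int fixups = 0;

    for (unsigned int irq = 0; irq < NR_IRQS; irq++)
        if (start[irq] != 0)
            fixups++;

    return fixups;
}

/* Usage: the direct scheme never needs more fixups than the stepwise one. */
static void affinity_demo(void)
{
    unsigned int start[NR_IRQS];

    spread_irqs(start);
    assert(fixups_direct(start) <= fixups_stepwise(start));
}
```

In this model an IRQ that starts on CPU k is fixed up k times by the
stepwise scheme but at most once by the direct one, which is the "up to
once per CPU brought down" behaviour described above.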

