From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Andreas Kinzler <ml-xen-devel@hfp.de>
Cc: xen-devel@lists.xensource.com, JBeulich@novell.com,
    Keir Fraser <keir.fraser@eu.citrix.com>
Subject: Re: Instability with Xen, interrupt routing frozen, HPET broadcast
Date: Wed, 29 Sep 2010 12:50:48 -0700
Message-ID: <4CA39898.8080304@goop.org>
In-Reply-To: <4CA38093.9070802@hfp.de>

On 09/29/2010 11:08 AM, Andreas Kinzler wrote:
> On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>>> I have been working with Jan (via email) for a while to track down the
>>> following problem, and he suggested that I report it on xen-devel:
>>>
>>> Jul 9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>>> Jul 9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>>> Jul 9 01:49:10 virt kernel: Calling adapter init
>>> Jul 9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>>> Jul 9 01:49:49 virt kernel: Acquiring adapter information
>>> Jul 9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>>> Jul 9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>>> Jul 9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>>> Jul 9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>>> Jul 9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>>
>>> After the VMs have been running a while the aacraid driver reports a
>>> non-responding RAID controller. Most of the time the NIC is also no
>>> longer working.
>>> I tried nearly every combination of dom0 kernel (pvops0, xenified suse
>>> 2.6.31.x, xenified suse 2.6.32.x, xenified suse 2.6.34.x) with Xen
>>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
>>> No success in two months. Sooner or later, every combination showed the
>>> problem above. I did extensive tests to make sure that the
>>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>>
>>> Jan suggested trying the fix in c/s 22051, but it did not help. My
>>> answer to him:
>>>
>>>> In the meantime I did try xen-unstable c/s 22068 (contains staging c/s 22051) and
>>>> it did not fix the problem at all. I was able to fix a problem with the serial console
>>>> and so I got some debug info that is attached to this email. The following line looks
>>>> suspicious to me (irr=1, delivery_status=1):
>>>>
>>>> (XEN) IRQ 16 Vec216:
>>>> (XEN) Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical,
>>>> delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1
>>>>
>>>> IRQ 16 is the aacraid controller, which after a while seems to be unable to receive
>>>> interrupts. Can you see from the debug info what is going on?
>>>
>>> I also applied a small patch which disables HPET broadcast. The machine has now been
>>> running for 110 hours without a crash, while normally it crashes within a few minutes.
>>> Is there something wrong (race, deadlock) with HPET broadcasts in relation to blocked
>>> interrupt reception (see above)?
>> What kind of hardware does this happen on?
>
> It is a Supermicro X8SIL-F, Intel Xeon 3450 system.

That's exactly what my main test/devel machine is. It has been very
stable for me with xen-unstable. Is 4.0.1 different from xen-unstable
with respect to HPET?

The big problem I had initially was instability with the integrated
ethernet until I disabled PCIe ASPM. The symptom was that the ethernet
devices would disappear (ie, their PCI config space would start to read
all 0xff...)
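
For anyone who wants to check for that symptom, here is a minimal sketch. It assumes a Linux dom0 where sysfs exposes each device's config space as /sys/bus/pci/devices/<address>/config; the function names are made up for illustration and are not part of any existing tool. A device that has dropped off the bus reads back as all 0xff bytes, so the check is just that:

```python
from pathlib import Path

# Standard Linux sysfs location for PCI devices (assumed to be mounted).
SYSFS_PCI = Path("/sys/bus/pci/devices")


def config_space_reads_all_ff(data: bytes) -> bool:
    """True if a non-empty config-space dump consists only of 0xff bytes."""
    return len(data) > 0 and all(b == 0xFF for b in data)


def find_vanished_devices(root: Path = SYSFS_PCI) -> list:
    """Return addresses of devices whose first 64 config bytes are all 0xff."""
    vanished = []
    if not root.is_dir():
        return vanished
    for dev in sorted(root.iterdir()):
        try:
            with open(dev / "config", "rb") as f:
                header = f.read(64)  # the standard PCI header is 64 bytes
        except OSError:
            continue  # unreadable entry; skip rather than guess
        if config_space_reads_all_ff(header):
            vanished.append(dev.name)
    return vanished


if __name__ == "__main__":
    for address in find_vanished_devices():
        print(f"{address}: config space reads all 0xff")
```

An empty result only means no device looked dead at the moment of the check; it says nothing about whether ASPM is the cause.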

>> Should this patch be merged?
>
> Not easy to answer. I spent more than 10 weeks searching nearly full
> time for the reason of the stability issues. Finally I was able to
> track it down to the HPET broadcast code.
>
> We need to find the developer of the HPET broadcast code. Then, he
> should try to fix the code. I consider it a quite severe bug as it
> renders Xen nearly useless on affected systems. That is why I (and my
> boss who pays me) spent so much time (developing/fixing Xen is not
> really my core job) and money (buying an E5620 machine just for testing
> Xen).
>
> I think many people on affected systems are having problems. See
> http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html

Just out of interest, does disabling ASPM help? I had to disable it in
the BIOS, and set pcie_aspm=off on the kernel command line.

This is a total shot in the dark, but given that we're using identical
systems it seems worth a try.
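
To confirm that the command-line half of that change actually took effect after a reboot, something like the following sketch could be used. It assumes a Linux dom0 with /proc/cmdline available, and the helper names are hypothetical. It cannot see the BIOS setting, which has to be checked by hand.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line contains the pcie_aspm=off parameter."""
    return "pcie_aspm=off" in cmdline.split()


def read_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the running kernel's command line (Linux only)."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    if aspm_disabled_on_cmdline(read_kernel_cmdline()):
        print("pcie_aspm=off is set")
    else:
        print("pcie_aspm=off is NOT set")
```

Matching on whole whitespace-separated tokens avoids false positives from other values such as pcie_aspm=force.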
Thanks,
J