From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: Instability with Xen, interrupt routing frozen, HPET broadcast
Date: Wed, 29 Sep 2010 12:50:48 -0700
Message-ID: <4CA39898.8080304@goop.org>
References: <4C88A6F3.9020207@hfp.de> <20100921115604.GP2804@reaktio.net> <4CA38093.9070802@hfp.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
In-Reply-To: <4CA38093.9070802@hfp.de>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Andreas Kinzler
Cc: xen-devel@lists.xensource.com, JBeulich@novell.com, Keir Fraser
List-Id: xen-devel@lists.xenproject.org

On 09/29/2010 11:08 AM, Andreas Kinzler wrote:
> On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>>> I have been talking with Jan (via email) for a while now to track
>>> down the following problem, and he suggested that I report it on
>>> xen-devel:
>>>
>>> Jul 9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>>> Jul 9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>>> Jul 9 01:49:10 virt kernel: Calling adapter init
>>> Jul 9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>>> Jul 9 01:49:49 virt kernel: Acquiring adapter information
>>> Jul 9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>>> Jul 9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>>> Jul 9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>>> Jul 9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>>> Jul 9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>>
>>> After the VMs have been running a while the aacraid driver reports a
>>> non-responding RAID controller. Most of the time the NIC is also no
>>> longer working.
>>> I tried nearly every combination of dom0 kernel (pvops0, xenified suse
>>> 2.6.31.x, xenified suse 2.6.32.x, xenified suse 2.6.34.x) with Xen
>>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
>>> No success in two months. Every combination sooner or later had the
>>> problem shown above. I did extensive tests to make sure that the
>>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>>
>>> Jan suggested trying the fix in c/s 22051 but it did not help. My
>>> answer to him:
>>>
>>>> In the meantime I did try xen-unstable c/s 22068 (contains staging c/s 22051) and
>>>> it did not fix the problem at all. I was able to fix a problem with the serial console
>>>> and so I got some debug info that is attached to this email. The following line looks
>>>> suspicious to me (irr=1, delivery_status=1):
>>>
>>>> (XEN) IRQ 16 Vec216:
>>>> (XEN) Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical, delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1
>>>
>>>> IRQ 16 is the aacraid controller, which after some while seems to be unable to receive
>>>> interrupts. Can you see from the debug info what is going on?
>>>
>>> I also applied a small patch which disables HPET broadcast. The machine
>>> has now been running for 110 hours without a crash, while normally it
>>> crashes within a few minutes. Is there something wrong (race, deadlock)
>>> with HPET broadcasts in relation to blocked interrupt reception (see
>>> above)?
>> What kind of hardware does this happen on?
>
> It is a Supermicro X8SIL-F, Intel Xeon 3450 system.

That's exactly what my main test/devel machine is. It has been very
stable for me with xen-unstable. Is 4.0.1 different from xen-unstable
with respect to HPET?

The big problem I had initially was instability with the integrated
ethernet until I disabled PCIe ASPM.
The symptom was that the ethernet devices would disappear (i.e., their
PCI config space would start to read all 0xff...)

>> Should this patch be merged?
>
> Not easy to answer. I spent more than 10 weeks searching nearly full
> time for the cause of the stability issues. Finally I was able to
> track it down to the HPET broadcast code.
>
> We need to find the developer of the HPET broadcast code. Then, he
> should try to fix the code. I consider it quite a severe bug as it
> renders Xen nearly useless on affected systems. That is why I (and my
> boss who pays me) spent so much time (developing/fixing Xen is not
> really my core job) and money (buying an E5620 machine just for testing
> Xen).
>
> I think many people on affected systems are having problems. See
> http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html

Just out of interest, does disabling ASPM help? I had to disable it in
the BIOS, and set pcie_aspm=off on the kernel command line. This is a
total shot in the dark, but given that we're using identical systems it
seems worth a try.

Thanks,
    J
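
For anyone wanting to check the two ASPM-related points above on a
Linux dom0, a minimal Python sketch follows. It is an illustration
only, not something from this thread: the function names are
hypothetical, it assumes the standard /proc/cmdline and
/sys/bus/pci/devices/<device>/config files are readable, and it treats
"the first 64 bytes of config space all read 0xff" as the disappearing
device symptom. It says nothing about the BIOS setting, which has to be
checked separately.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into key -> value (bare flags map to "")."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if pcie_aspm=off appears on the given kernel command line."""
    return parse_cmdline(cmdline).get("pcie_aspm") == "off"


def config_space_reads_all_ff(header: bytes) -> bool:
    """True if every byte read from config space is 0xff (assumed symptom)."""
    return len(header) > 0 and all(b == 0xFF for b in header)


def vanished_devices(sysfs_root: str = "/sys/bus/pci/devices") -> list:
    """List PCI devices whose config space header reads back as all 0xff."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    gone = []
    for dev in sorted(root.iterdir()):
        try:
            header = (dev / "config").read_bytes()[:64]
        except OSError:
            continue  # unreadable entry: skip rather than guess
        if config_space_reads_all_ff(header):
            gone.append(dev.name)
    return gone


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))
    print("devices reading all 0xff:", vanished_devices())
```

Running it once right after boot and again when the NIC stops
responding would show whether the ethernet devices have dropped off the
bus in the way described above, or whether the failure looks different.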