From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andreas Kinzler
Subject: Re: Instability with Xen, interrupt routing frozen, HPET broadcast
Date: Wed, 29 Sep 2010 20:08:19 +0200
Message-ID: <4CA38093.9070802@hfp.de>
References: <4C88A6F3.9020207@hfp.de> <20100921115604.GP2804@reaktio.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <20100921115604.GP2804@reaktio.net>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Pasi Kärkkäinen
Cc: xen-devel@lists.xensource.com, Keir Fraser, JBeulich@novell.com
List-Id: xen-devel@lists.xenproject.org

On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>> I have been talking with Jan (via email) for a while now to track down
>> the following problem, and he suggested that I report it on xen-devel:
>>
>> Jul 9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>> Jul 9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>> Jul 9 01:49:10 virt kernel: Calling adapter init
>> Jul 9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>> Jul 9 01:49:49 virt kernel: Acquiring adapter information
>> Jul 9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>> Jul 9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>> Jul 9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>> Jul 9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>> Jul 9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>
>> After the VMs have been running for a while, the aacraid driver reports
>> a non-responding RAID controller. Most of the time the NIC is also no
>> longer working.
>> I tried nearly every combination of dom0 kernel (pvops0, xenified suse
>> 2.6.31.x, xenified suse 2.6.32.x, xenified suse 2.6.34.x) with Xen
>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
>> No success in two months. Sooner or later every combination showed the
>> problem above. I did extensive tests to make sure that the
>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>
>> Jan suggested trying the fix in c/s 22051, but it did not help. My
>> answer to him:
>>
>>> In the meantime I did try xen-unstable c/s 22068 (contains staging c/s 22051) and
>>> it did not fix the problem at all. I was able to fix a problem with the serial console
>>> and so I got some debug info that is attached to this email. The following line looks
>>> suspicious to me (irr=1, delivery_status=1):
>>
>>> (XEN) IRQ 16 Vec216:
>>> (XEN) Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical,
>>> delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1
>>
>>> IRQ 16 is the aacraid controller, which after a while seems to be unable to receive
>>> interrupts. Can you see from the debug info what is going on?
>>
>> I also applied a small patch which disables HPET broadcast. The machine
>> has now been running for 110 hours without a crash, while normally it
>> crashes within a few minutes. Is there something wrong (race, deadlock)
>> with HPET broadcasts in relation to blocked interrupt reception (see
>> above)?

> What kind of hardware does this happen on?

It is a Supermicro X8SIL-F, Intel Xeon 3450 system.

> Should this patch be merged?

Not easy to answer. I spent more than 10 weeks searching nearly full
time for the cause of the stability issues. Finally I was able to track
it down to the HPET broadcast code.

We need to find the developer of the HPET broadcast code. Then he
should try to fix the code.
I consider it quite a severe bug, as it renders Xen nearly useless on
affected systems. That is why I (and my boss, who pays me) spent so much
time (developing/fixing Xen is not really my core job) and money (buying
an E5620 machine just for testing Xen).

I think many people on affected systems are having problems. See
http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html

Regards
Andreas
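
A minimal sketch for spotting the same pattern in other debug dumps,
assuming the IO-APIC entries look like the one quoted above. The function
names are hypothetical and not part of Xen; the check covers only the
irr=1, delivery_status=1 combination called suspicious in the quoted
mail and is not a diagnosis of the underlying bug.

```python
import re

# Matches key=value and key:value pairs as they appear in the quoted dump,
# e.g. "vector=216", "trigger=level", "dest_id:1".
FIELD_RE = re.compile(r"(\w+)[=:](\w+)")


def parse_ioapic_entry(text):
    """Return the fields of one IO-APIC pin entry as a dict of strings."""
    # Drop the "(XEN)" log prefix so it cannot interfere with field matching.
    cleaned = text.replace("(XEN)", " ")
    return dict(FIELD_RE.findall(cleaned))


def looks_suspicious(fields):
    """True if the entry shows irr=1 together with delivery_status=1."""
    return fields.get("irr") == "1" and fields.get("delivery_status") == "1"


# The entry for IRQ 16 as quoted in this thread, rejoined onto one line.
entry = (
    "(XEN) Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical, "
    "delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1"
)

fields = parse_ioapic_entry(entry)
print(fields["vector"], looks_suspicious(fields))
```

Run over a full serial-console capture, one entry at a time, this would
only list candidate pins; whether a flagged pin is really stuck still has
to be judged from the rest of the debug output.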