From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: Instability with Xen, interrupt routing frozen, HPET broadcast
Date: Wed, 29 Sep 2010 17:18:02 -0400
Message-ID: <20100929211802.GA21793@dumpdata.com>
References: <4C88A6F3.9020207@hfp.de> <20100921115604.GP2804@reaktio.net> <4CA38093.9070802@hfp.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Andrew Lyon
Cc: xen-devel@lists.xensource.com, JBeulich@novell.com, Andreas Kinzler, Keir Fraser
List-Id: xen-devel@lists.xenproject.org

On Wed, Sep 29, 2010 at 08:34:28PM +0100, Andrew Lyon wrote:
> On Wed, Sep 29, 2010 at 7:08 PM, Andreas Kinzler wrote:
> > On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
> >>>
> >>> I am talking a while (via email) with Jan now to track the following
> >>> problem and he suggested that I report the problem on xen-devel:
> >>>
> >>> Jul  9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI
> >>> hang ?
> >>> Jul  9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
> >>> Jul  9 01:49:10 virt kernel: Calling adapter init
> >>> Jul  9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not
> >>> guaranteed on shared IRQs
> >>> Jul  9 01:49:49 virt kernel: Acquiring adapter information
> >>> Jul  9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
> >>> Jul  9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous
> >>> command timed out.
> >>> Jul  9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing
> >>> problem;
> >>> Jul  9 01:53:13 virt kernel: update mother board BIOS or consider
> >>> utilizing one of
> >>> Jul  9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic
> >>> etc)
> >>>
> >>> After the VMs have been running a while the aacraid driver reports a
> >>> non-responding RAID controller. Most of the time the NIC is also no
> >>> longer working.
> >>> I nearly tried every combination of dom0 kernel (pvops0, xenfied suse
> >>> 2.6.31.x, xenfied suse 2.6.32.x, xenfied suse 2.6.34.x) with Xen
> >>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
> >>> No success in two month. Every combination earlier or later had the
> >>> problem shown above. I did extensive tests to make sure that the
> >>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
> >>>
> >>> Jan suggested to try the fix in c/s 22051 but it did not help. My answer
> >>> to him:
> >>>
> >>>> In the meantime I did try xen-unstable c/s 22068 (contains staging c/s
> >>>> 22051) and it did not fix the problem at all. I was able to fix a
> >>>> problem with the serial console and so I got some debug info that is
> >>>> attached to this email. The following line looks suspicious to me
> >>>> (irr=1, delivery_status=1):
> >>>
> >>>> (XEN)     IRQ 16 Vec216:
> >>>> (XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical,
> >>>>             delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1
> >>>
> >>>> IRQ 16 is the aacraid controller which after some while seems to be
> >>>> enable to receive interrupts. Can you see from the debug info what is
> >>>> going on?
> >>>
> >>> I also applied a small patch which disables HPET broadcast.
> >>> The machine is now running
> >>> for 110 hours without a crash while normally it crashes within a few
> >>> minutes. Is there
> >>> something wrong (race, deadlock) with HPET broadcasts in relation to
> >>> blocked interrupt
> >>> reception (see above)?
> >>
> >> What kind of hardware does this happen on?
> >
> > It is a Supermicro X8SIL-F, Intel Xeon 3450 system.
> >
> >> Should this patch be merged?
> >
> > Not easy to answer. I spend more than 10 weeks searching nearly full time
> > for the reason of the stability issues. Finally I was able to track it down
> > to the HPET broadcast code.
> >
> > We need to find the developer of the HPET broadcast code. Then, he should
> > try to fix the code. I consider it a quite severe bug as it renders Xen
> > nearly useless on affected systems. That is why I (and my boss who pays me)
> > spend so much time (developing/fixing Xen is not really my core job) and
> > money (buying a E5620 machine just for testing Xen).
> >
> > I think many people on affected systems are having problems. See
> > http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html
> >
> > Regards Andreas
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
> >
>
> I will test that patch on my Supermicro X7DWA-N based dual Xeon
> workstation, I always use a Xenified kernel rather than pv_ops as it
> supports some features that I need and is compatible with nvidia
> binary drivers, but I've always had problems with very occasional

The PVOPS kernel works with the nouveau driver. Look at
http://wiki.xensource.com/xenwiki/XenPVOPSDRM for details.