From: "Jan Beulich"
Subject: Re: domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
Date: Tue, 29 Jun 2010 09:42:10 +0100
Message-ID: <4C29CE020200007800008881@vpn.id2.novell.com>
To: Keir Fraser, Dante Cinco
Cc: Xen-devel
List-Id: xen-devel@lists.xenproject.org

>>> On 28.06.10 at 20:22, Dante Cinco wrote:
> I have an HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) with Xen 4.0.0
> and dom0 Linux 2.6.32.12 x86_64 pvops and domU Linux kernel 2.6.30.1 x86_64.
> I'm using PCI passthrough (pci-stub) to pass my 4-port 8Gb PMC-Sierra Fibre
> Channel HBA to domU. After running I/Os for several hours, both dom0 and
> domU hangs and the Xen console shows the interrupt binding below where IRQ
> 66 shows in-flight=1 and mask set (---M). What's the best way to debug this
> problem?

There are potentially two problems here:

One is that the guest may fail to send the EOI notification. You would
want to check whether pirq_guest_eoi() got run after that last
occurrence of the interrupt.

The more worrying part is that Xen should time out on a guest failing
to send the EOI notification, and ack the interrupt nevertheless.
Looking at the code I fail to see how the ack_APIC_irq() would get
sent in this case: non-maskable MSIs get this issued from
end_msi_irq(), but ->end doesn't get invoked from
irq_guest_eoi_timer_fn() (only ->enable does). Keir, am I missing
something?

Otoh I can't see how this can work reliably in the first place: Since
there's no other way to mask such interrupts, sending an ack to the
LAPIC could result in an interrupt storm.
Disabling MSI on the affected device isn't a good option either, as we
know there are devices that switch to legacy IRQ mode irreversibly in
that case, and hence the device becomes unusable (presumably until
being reset). But very likely this would still be better than hanging
the entire box; it probably would just need a more graceful timeout.

Jan

> (XEN) IRQ: 66 affinity:00000000,00000000,00000000,00000001 vec:b9
> type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 79(---M),
> (XEN) IRQ: 67 affinity:00000000,00000000,00000000,00000004 vec:d9
> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 78(----),
> (XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000010 vec:22
> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 77(----),
> (XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000040 vec:2a
> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 76(----),
>
> (XEN) 07:00.3 - dom 1 - MSIs < 69 >
> (XEN) 07:00.2 - dom 1 - MSIs < 68 >
> (XEN) 07:00.1 - dom 1 - MSIs < 67 >
> (XEN) 07:00.0 - dom 1 - MSIs < 66 >
>
> (XEN) MSI 66 vec=b9 fixed edge assert phys cpu dest=00000000
> mask=0/0/-1
> (XEN) MSI 67 vec=d9 fixed edge assert phys cpu dest=00000004
> mask=0/0/-1
> (XEN) MSI 68 vec=22 fixed edge assert phys cpu dest=00000002
> mask=0/0/-1
> (XEN) MSI 69 vec=2a fixed edge assert phys cpu dest=00000006
> mask=0/0/-1
>
> Thanks.
>
> Dante
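
When watching for a recurrence, it may help to scan captured console output for bindings in the state reported above (in-flight non-zero together with an M flag). The following is only a rough sketch and not anything from Xen itself. It assumes the dump has already been decoded from quoted-printable (literal `=`, not `=3D`) and that lines look like the quoted `(XEN) IRQ:` lines, possibly mail-quoted with `> ` and wrapped. The names `IrqBinding`, `parse_irq_dump` and `stuck_bindings` are made up for illustration, and the check simply looks for an `M` anywhere in the flag string, since the thread does not spell out what each flag position means.

```python
import re
from typing import List, NamedTuple


class IrqBinding(NamedTuple):
    """One guest binding taken from a '(XEN) IRQ:' dump line (hypothetical type)."""
    irq: int
    vector: str
    irq_type: str
    in_flight: int
    domain: int
    pirq: int
    flags: str


# Matches one IRQ record once mail quoting and line wrapping have been removed.
_IRQ_RE = re.compile(
    r"IRQ:\s*(\d+)\s+affinity:\S+\s+vec:([0-9a-fA-F]+)\s+type=(\S+)"
    r"\s+status=[0-9a-fA-F]+\s+in-flight=(\d+)"
    r"\s+domain-list=((?:\s*\d+:\s*\d+\([^)]*\),?)*)"
)
# Matches a single 'domain: pirq(flags)' entry inside domain-list.
_BINDING_RE = re.compile(r"(\d+):\s*(\d+)\(([^)]*)\)")


def parse_irq_dump(text: str) -> List[IrqBinding]:
    """Return every guest binding found in the dump text."""
    # Drop a leading '> ' quote marker from each line, then rejoin wrapped lines.
    flat = " ".join(re.sub(r"^\s*>\s?", "", line) for line in text.splitlines())
    found = []
    for m in _IRQ_RE.finditer(flat):
        irq, vec, irq_type, in_flight, bindings = m.groups()
        for b in _BINDING_RE.finditer(bindings):
            dom, pirq, flags = b.groups()
            found.append(IrqBinding(int(irq), vec, irq_type, int(in_flight),
                                    int(dom), int(pirq), flags))
    return found


def stuck_bindings(text: str) -> List[IrqBinding]:
    """Bindings with a non-zero in-flight count and an 'M' in their flags."""
    return [b for b in parse_irq_dump(text) if b.in_flight > 0 and "M" in b.flags]


SAMPLE = """\
> (XEN) IRQ: 66 affinity:00000000,00000000,00000000,00000001 vec:b9
> type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 79(---M),
> (XEN) IRQ: 67 affinity:00000000,00000000,00000000,00000004 vec:d9
> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 78(----),
"""

if __name__ == "__main__":
    for b in stuck_bindings(SAMPLE):
        print(f"IRQ {b.irq} (vec {b.vector}, {b.irq_type}): "
              f"dom {b.domain} pirq {b.pirq} flags {b.flags} in-flight={b.in_flight}")
```

Run against the two sample records, only the IRQ 66 binding is reported. Such a filter only narrows down which interrupt to look at; whether pirq_guest_eoi() ran, or whether the timeout path acked the interrupt, still has to be checked in the hypervisor.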