From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: Xen-unstable: xen panic RIP: dpci_softirq
Date: Mon, 17 Nov 2014 11:34:16 -0500
Message-ID: <20141117163416.GA22137@laptop.dumpdata.com>
References: <193010671.20141114141112@eikelenboom.it>
 <546618620200007800047AD1@mail.emea.novell.com>
 <688701120.20141114153404@eikelenboom.it>
 <546629510200007800047BC3@mail.emea.novell.com>
 <1224708950.20141114162052@eikelenboom.it>
 <5466314E0200007800047C90@mail.emea.novell.com>
 <1393541150.20141114175923@eikelenboom.it>
 <20141114202513.GA3281@laptop.dumpdata.com>
 <1402169526.20141114230958@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail6.bemta5.messagelabs.com ([195.245.231.135])
 by lists.xen.org with esmtp (Exim 4.72) (envelope-from )
 id 1XqPFx-0001Q3-PP
 for xen-devel@lists.xenproject.org; Mon, 17 Nov 2014 16:34:25 +0000
Content-Disposition: inline
In-Reply-To: <1402169526.20141114230958@eikelenboom.it>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom
Cc: xen-devel , Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On Fri, Nov 14, 2014 at 11:09:58PM +0100, Sander Eikelenboom wrote:
>
> Friday, November 14, 2014, 9:25:13 PM, you wrote:
>
> > On Fri, Nov 14, 2014 at 05:59:23PM +0100, Sander Eikelenboom wrote:
> >>
> >> Friday, November 14, 2014, 4:43:58 PM, you wrote:
> >>
> >> >>>> On 14.11.14 at 16:20, wrote:
> >> >> If it still helps i could try Andrews suggestion and try out with only
> >> >> commit aeeea485 ..
> >>
> >> > Yes, even if it's pretty certain it's the second of the commits, verifying
> >> > this would be helpful (or if the assumption is wrong, the pattern it's
> >> > dying with would change and hence perhaps provide further clues).
> >>
> >> > Jan
> >>
> >>
> >> Ok with a revert of f6dd295 ..
> >> it survived cooking and eating a nice bowl of
> >> pasta without a panic. So it would probably be indeed that specific commit.
>
> > Could you try running with these two patches while you enjoy an beer in the evening?
>
> Hmm i didn't expect it not to panic and reboot anymore :-)

I should have also asked for you to run with 'iommu=verbose,debug', but
that can be done later..

The guest d16 looks to have two PCI passthrough devices:

(XEN) [2014-11-14 21:31:26.569] io.c:550: d16: bind: m_gsi=37 g_gsi=36 dev=00.00.5 intx=0
(XEN) [2014-11-14 21:31:28.095] io.c:550: d16: bind: m_gsi=47 g_gsi=40 dev=00.00.6 intx=0

And one of them uses just the GSI while the other uses four MSI-X, is
that about right?

I tried to reproduce that on my AMD box with two NICs:

# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Technical Corp. Device 1111
00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
00:05.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)

# cat /proc/interrupts |grep eth
 36:     384183          0  xen-pirq-ioapic-level  eth0
 63:          1          0  xen-pirq-msi-x         eth1
 64:         24     661961  xen-pirq-msi-x         eth1-rx-0
 65:        205          0  xen-pirq-msi-x         eth1-rx-1
 66:        162          0  xen-pirq-msi-x         eth1-tx-0
 67:        190          0  xen-pirq-msi-x         eth1-tx-1

Is that a similar distribution of IRQ/MSI-X to what you end up having?
>
> However xl dmesg (complete one attached) showed it would have:
>
> (XEN) [2014-11-14 21:35:50.646] --MARK--
> (XEN) [2014-11-14 21:35:56.861] grant_table.c:305:d0v0 Increased maptrack size to 9 frames
> (XEN) [2014-11-14 21:36:00.647] --MARK--
> (XEN) [2014-11-14 21:36:10.410] grant_table.c:1299:d16v1 Expanding dom (16) grant table from (5) to (6) frames.
> (XEN) [2014-11-14 21:36:10.820] --MARK--
> (XEN) [2014-11-14 21:36:20.820] --MARK--
> (XEN) [2014-11-14 21:36:30.820] --MARK--
> (XEN) [2014-11-14 21:36:40.821] --MARK--
> (XEN) [2014-11-14 21:36:50.821] --MARK--
> (XEN) [2014-11-14 21:37:00.388] CPU00:
> (XEN) [2014-11-14 21:37:00.399] CPU01:
> (XEN) [2014-11-14 21:37:00.410] d16 OK-softirq 20msec ago, state:1, 41220 count, [prev:ffff83054ef5e3e0, next:ffff83054ef5e3e0] PIRQ:0
> (XEN) [2014-11-14 21:37:00.445] d16 OK-raise 46msec ago, state:1, 41223 count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
> (XEN) [2014-11-14 21:37:00.481] d16 ERR-poison 92msec ago, state:0, 1 count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
> (XEN) [2014-11-14 21:37:00.515] d16 Z-softirq 28853msec ago, state:2, 1 count, [prev:0000000000200200, next:0000000000100100] PIRQ:0

The PIRQ:0 would imply that this is the legacy interrupt - which would
be your 0a:00.0 device (Conexant Systems, Inc. Device 8210).

And it is pounding on this CPU - and the issue is that the
'test_and_clear_bit' ends up returning 0 - which means STATE_SCHED
was not set at that point: (!?)

        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
        {
            hvm_dirq_assist(d, pirq_dpci);
            put_domain(d);
        }
        else
        {
            _record(&debug->zombie_softirq, pirq_dpci);

which causes us to record it [Z-softirq], which says we are in state 2 (1<domid, names[type],
                (unsigned long)((now - d->last) / MILLISECS(1)),
-               d->state, d->count, d->list.prev, d->list.next);
+               d->state, d->count, d->list.prev, d->list.next, d->dpci);
         if ( d->dpci )
         {
             struct hvm_pirq_dpci *pirq_dpci = d->dpci;

             for ( i = 0; i <= _HVM_IRQ_DPCI_GUEST_MSI_SHIFT; i++ )
-                if ( pirq_dpci->flags & 1 << _HVM_IRQ_DPCI_TRANSLATE_SHIFT )
+                if ( pirq_dpci->flags & (1 << i) )
                     printk("%s ", names_flag[i]);
             printk(" PIRQ:%d", pirq_dpci->pirq);