From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: Xen-unstable: xen panic RIP: dpci_softirq
Date: Tue, 18 Nov 2014 15:56:33 -0500
Message-ID: <20141118205633.GB6540@laptop.dumpdata.com>
References: <1402169526.20141114230958@eikelenboom.it>
 <20141117163416.GA22137@laptop.dumpdata.com>
 <1403873666.20141117180419@eikelenboom.it>
 <20141117204347.GA27617@laptop.dumpdata.com>
 <1271355060.20141117234011@eikelenboom.it>
 <20141118024927.GA32256@andromeda.dapyr.net>
 <1408328417.20141118120741@eikelenboom.it>
 <68258140.20141118160925@eikelenboom.it>
 <20141118161650.GC17095@laptop.dumpdata.com>
 <1222042576.20141118180323@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail6.bemta3.messagelabs.com ([195.245.230.39])
 by lists.xen.org with esmtp (Exim 4.72) (envelope-from )
 id 1XqppT-0005vw-Ry for xen-devel@lists.xenproject.org;
 Tue, 18 Nov 2014 20:56:52 +0000
Content-Disposition: inline
In-Reply-To: <1222042576.20141118180323@eikelenboom.it>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom
Cc: Konrad Rzeszutek Wilk , Jan Beulich , xen-devel
List-Id: xen-devel@lists.xenproject.org

>
> Uhmm i thought i had these switched off (due to problems earlier and then forgot
> about them .. however looking at the earlier reports these lines were also in
> those reports).
>
> The xen-syms and these last runs are all with a prestine xen tree cloned today (staging
> branch), so the qemu-xen and seabios defined with that were also freshly cloned
> and had a new default seabios config. (just to rule out anything stale in my tree)
>
> If you don't see those messages .. perhaps your seabios and qemu trees (and at least the
> seabios config) are not the most recent (they don't get updated automatically
> when you just do a git pull on the main tree) ?
>
> In /tools/firmware/seabios-dir/.config i have:
> CONFIG_USB=y
> CONFIG_USB_UHCI=y
> CONFIG_USB_OHCI=y
> CONFIG_USB_EHCI=y
> CONFIG_USB_XHCI=y
> CONFIG_USB_MSC=y
> CONFIG_USB_UAS=y
> CONFIG_USB_HUB=y
> CONFIG_USB_KEYBOARD=y
> CONFIG_USB_MOUSE=y
>

I seem to have the same thing. Perhaps it is my XHCI controller being wonky.

> And this is all just from a:
> - git clone git://xenbits.xen.org/xen.git -b staging
> - make clean && ./configure && make -j6 && make -j6 install

Aye.

.. snip..
> > 1) test_and_[set|clear]_bit sometimes return unexpected values.
> > [But this might be invalid as the addition of the ffff8303faaf25a8
> > might be correct - as the second dpci the softirq is processing
> > could be the MSI one]
>
> Would there be an easy way to stress test this function separately in some
> debugging function to see if it indeed is returning unexpected values ?

Sadly no. But you got me looking in the right direction when you mentioned
'timeout'.

>
> > 2) INIT_LIST_HEAD operations on the same CPU are not honored.
>
> Just curious, have you also tested the patches on AMD hardware ?

Yes. To reproduce this the first thing I did was to get an AMD box.

>
>
> >> When i look at the combination of (2) and (3), It seems it could be an
> >> interaction between the two passed through devices and/or different IRQ types.
>
> > Could be - as in it is causing this issue to show up faster than
> > expected. Or it is the one that triggers more than one dpci happening
> > at the same time.
>
> Well that didn't seem to be it (see separate amendment i mailed previously)

Right, the current theory I have is that the interrupts are not being
acked within 8 milliseconds, so we reset the 'state' - and at the same time
we get an interrupt and schedule it - while we are still processing the
same interrupt. This would explain why the 'test_and_clear_bit' got the
wrong value.
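
To make that theory concrete, here is a rough user-space sketch of the
sequence, replayed on a single thread. It is not the Xen code: `fake_dpci`,
`STATE_SCHED`, `schedule_dpci` and `replay_timeout_race` are made-up names,
and the bit helpers are C11 stand-ins for the hypervisor ones. It only shows
that clearing the whole 'state' while an entry is still queued lets the same
entry be queued a second time, after which one of the two
'test_and_clear_bit' calls sees the bit already clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_SCHED 0   /* hypothetical bit number, for illustration only */

/* Stand-in for 'struct hvm_pirq_dpci'; only what the sketch needs. */
struct fake_dpci {
    atomic_uint state;
    int times_queued;   /* how often the entry was put on a list */
};

struct replay_result {
    int times_queued;   /* queue operations on the same entry */
    int bit_seen_set;   /* test_and_clear_bit calls that returned true */
};

/* C11 stand-ins for the hypervisor bit helpers (assumption). */
static bool test_and_set_bit(int nr, atomic_uint *addr)
{
    unsigned int mask = 1u << nr;
    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

static bool test_and_clear_bit(int nr, atomic_uint *addr)
{
    unsigned int mask = 1u << nr;
    return (atomic_fetch_and(addr, ~mask) & mask) != 0;
}

/* Queue the entry only if it is not already marked as scheduled. */
static void schedule_dpci(struct fake_dpci *d)
{
    if (!test_and_set_bit(STATE_SCHED, &d->state))
        d->times_queued++;
}

/* Replays: interrupt, optional timeout reset, interrupt, softirq runs. */
static struct replay_result replay_timeout_race(bool reset_in_timeout)
{
    struct fake_dpci d;
    struct replay_result r = { 0, 0 };
    int i;

    atomic_init(&d.state, 0);
    d.times_queued = 0;

    schedule_dpci(&d);              /* first interrupt: entry is queued */
    assert(d.times_queued == 1);

    if (reset_in_timeout)
        atomic_store(&d.state, 0);  /* timeout path wipes the state */

    schedule_dpci(&d);              /* second interrupt, entry still queued */

    /* The softirq handler runs once per queued instance. */
    for (i = 0; i < d.times_queued; i++)
        if (test_and_clear_bit(STATE_SCHED, &d.state))
            r.bit_seen_set++;

    r.times_queued = d.times_queued;
    return r;
}
```

With the reset, replay_timeout_race(true) reports two queue operations but
only one handler run that saw the bit set. Without it,
replay_timeout_race(false) reports one of each.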

In regards to the list poison - following this thread of logic - with
the 'state = 0' set we open the floodgates for any CPU to put the same
'struct hvm_pirq_dpci' on its list.

We do reset the 'state' on _every_ GSI that is mapped to a guest - so we
also reset the 'state' for the MSI one (XHCI). Anyhow in your case:

CPUX:                            CPUY:
pt_irq_time_out:
state = 0;
[out of timer code, the          raise_softirq
 pirq_dpci is on the dpci_list]  [adds the pirq_dpci as state == 0]

softirq_dpci                     softirq_dpci:
    list_del
    [entries poison]
                                     list_del <= BOOM

That is what I believe is happening.

The INTX device - once I put a load on it - does not trigger
any pt_irq_time_out, so that would explain why I cannot hit this.

But I believe your card hits these "hiccups".
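
For what it is worth, the list side of the diagram above can be replayed in
user space with a Linux-style doubly linked list. This is only a sketch
under assumptions: the list helpers are re-implemented here rather than
taken from the Xen tree, the poison values are placeholders,
`checked_list_del` is a made-up variant that reports the problem instead of
dereferencing poison, and the two "CPUs" run back to back on one thread.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Placeholder poison values in the style of a Linux-like list.h. */
#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/*
 * Like list_del, but returns false when the entry is already poisoned.
 * A plain list_del would dereference the poison pointers at this point.
 */
static bool checked_list_del(struct list_head *e)
{
    if (e->next == LIST_POISON1 || e->prev == LIST_POISON2)
        return false;

    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = LIST_POISON1;
    e->prev = LIST_POISON2;
    return true;
}

/* Returns true when the first delete works and the second hits poison. */
static bool second_del_hits_poison(void)
{
    struct list_head cpux_list, cpuy_list, entry;
    bool x_ok, y_ok;

    init_list_head(&cpux_list);
    init_list_head(&cpuy_list);

    list_add_tail(&entry, &cpux_list);  /* CPUX: entry is on its list */

    /* 'state' was reset, so CPUY is allowed to queue the same entry. */
    list_add_tail(&entry, &cpuy_list);

    x_ok = checked_list_del(&entry);    /* CPUX softirq: poisons entry */
    y_ok = checked_list_del(&entry);    /* CPUY softirq: the BOOM case */

    return x_ok && !y_ok;
}
```

Calling second_del_hits_poison() returns true: the first delete succeeds and
poisons the entry, the second finds the poison. Note that after the second
list_add_tail the CPUX list head still points at an entry that is now linked
into CPUY's list, so the lists are already inconsistent before any delete.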