From: Andrew Cooper
Subject: Re: Commit 1aeb1156fa43fe2cd2b5003995b20466cd19a622: "x86 don't change affinity with interrupt unmasked", APCI errors and assorted pci trouble
Date: Wed, 1 Apr 2015 15:43:04 +0100
Message-ID: <551C03F8.9070204@citrix.com>
References: <1995398026.20150328163438@eikelenboom.it> <5516E53F.1080203@citrix.com> <1687342964.20150328211022@eikelenboom.it> <55192DBA.2070702@citrix.com> <351939600.20150330152618@eikelenboom.it>
In-Reply-To: <351939600.20150330152618@eikelenboom.it>
To: Sander Eikelenboom
Cc: xen-devel, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 30/03/15 14:26, Sander Eikelenboom wrote:
> Monday, March 30, 2015, 1:04:26 PM, you wrote:
>
>> On 28/03/15 20:10, Sander Eikelenboom wrote:
>>> Saturday, March 28, 2015, 6:30:39 PM, you wrote:
>>>
>>>> On 28/03/15 15:34, Sander Eikelenboom wrote:
>>>>> Hi Jan,
>>>>>
>>>>> Commit 1aeb1156fa43fe2cd2b5003995b20466cd19a622:
>>>>> "x86 don't change affinity with interrupt unmasked",
>>>>> gives trouble on my AMD box, symptoms:
>>>>> - APIC errors in xl dmesg that weren't previously there:
>>>>> (XEN) [2015-03-26 20:35:37.085] IOAPIC[0]: Set PCI routing entry (6-13 -> 0x88 -> IRQ 13 Mode:0 Active:0)
>>>>> (XEN) [2015-03-26 20:35:37.101] PCI: Using MCFG for segment 0000 bus 00-ff
>>>>> (XEN) [2015-03-26 20:35:37.097] IOAPIC[0]: Set PCI routing entry (6-8 -> 0x58 -> IRQ 8 Mode:0 Active:0)
>>>>> (XEN) [2015-03-26 20:35:37.112] IOAPIC[0]: Set PCI routing entry (6-18 -> 0xb8 -> IRQ 18 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.189] IOAPIC[0]: Set PCI routing entry (6-17 -> 0xc0 -> IRQ 17 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-29 -> 0xc8 -> IRQ 53 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-24 -> 0xd0 -> IRQ 48 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-30 -> 0xd8 -> IRQ 54 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-12 -> 0x21 -> IRQ 36 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-13 -> 0x29 -> IRQ 37 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.421] IOAPIC[1]: Set PCI routing entry (7-16 -> 0x31 -> IRQ 40 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.495] IOAPIC[1]: Set PCI routing entry (7-28 -> 0x39 -> IRQ 52 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.498] IOAPIC[0]: Set PCI routing entry (6-16 -> 0x89 -> IRQ 16 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.498] IOAPIC[1]: Set PCI routing entry (7-14 -> 0xa9 -> IRQ 38 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:37.548] IOAPIC[0]: Set PCI routing entry (6-22 -> 0xb9 -> IRQ 22 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:39.620] IOAPIC[1]: Set PCI routing entry (7-9 -> 0xc1 -> IRQ 33 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:39.646] IOAPIC[1]: Set PCI routing entry (7-8 -> 0xc9 -> IRQ 32 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:39.647] IOAPIC[1]: Set PCI routing entry (7-23 -> 0xd1 -> IRQ 47 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:41.732] IOAPIC[1]: Set PCI routing entry (7-5 -> 0xd9 -> IRQ 29 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:41.779] IOAPIC[1]: Set PCI routing entry (7-4 -> 0x22 -> IRQ 28 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:41.803] mm.c:803: d0: Forcing read-only access to MFN fed00
>>>>> (XEN) [2015-03-26 20:35:41.894] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x2a -> IRQ 19 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:42.057] IOAPIC[1]: Set PCI routing entry (7-22 -> 0x72 -> IRQ 46 Mode:1 Active:1)
>>>>> (XEN) [2015-03-26 20:35:42.093] IOAPIC[1]: Set PCI routing entry (7-27 -> 0x8a -> IRQ 51 Mode:1 Active:1)
>>>>>
>>>>> these:
>>>>> (XEN) [2015-03-26 20:35:42.205] APIC error on CPU0: 00(40)
>>>>> (XEN) [2015-03-26 20:35:42.372] APIC error on CPU0: 40(40)
>>>>>
>>>>> (XEN) [2015-03-26 20:35:42.691] d0 attempted to change d0v1's CR4 flags 00000660 -> 00000760
>>>>> (XEN) [2015-03-26 20:35:42.691] IOAPIC[1]: Set PCI routing entry (7-1 -> 0x9a -> IRQ 25 Mode:1 Active:1)
>>>>>
>>>>> and this one:
>>>>> (XEN) [2015-03-26 20:35:42.707] APIC error on CPU0: 40(40)
>>>>> (XEN) [2015-03-26 20:35:43.958] d0 attempted to change d0v0's CR4 flags 00000660 -> 00000760
>>>>> (XEN) [2015-03-26 20:35:43.970] d0 attempted to change d0v2's CR4 flags 00000660 -> 00000760
>>>>> (XEN) [2015-03-26 20:35:43.988] d0 attempted to change d0v3's CR4 flags 00000660 -> 00000760
>>>>> (XEN) [2015-03-26 20:35:43.992] d0 attempted to change d0v4's CR4 flags 00000660 -> 00000760
>>>>> (XEN) [2015-03-26 20:35:43.996] d0 attempted to change d0v5's CR4 flags 00000660 -> 00000760
>>>>> (d1) [2015-03-26 20:40:42.220] mapping kernel into physical memory
>>>>> (d1) [2015-03-26 20:40:42.220] about to get started...
>>>>>
>>>>>
>>>>> - random failures on dom0 SATA devices, the SATA controller is using multiple MSI
>>>>> interrupts.
>>>>>
>>>>> - failures on XHCI controllers passed through to a HVM guest which uses MSI-X
>>>>> interrupts. Leading to these in the guest dmesg:
>>>>> [ 350.246548] xhci_hcd 0000:00:05.0: Looking for event-dma 000000003cdf7140 trb-start 000000003cdf7240 trb-end 000000003cdf7240 seg-start 000000003cdf7000 seg-end 000000003cdf73f0
>>>>> [ 350.246548] xhci_hcd 0000:00:05.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>>>>> [ 350.246548] xhci_hcd 0000:00:05.0: Looking for event-dma 000000003cdf7150 trb-start 000000003cdf7240 trb-end 000000003cdf7240 seg-start 000000003cdf7000 seg-end 000000003cdf73f0
>>>>> [ 350.246548] xhci_hcd 0000:00:05.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>>>>>
>>>>>
>>>>> Reverting this specific commit makes all the troubles go away ..
>>>> That is unfortunate, as conceptually the identified patch definitely
>>>> fixes a bug.
>>>> The "APIC error" messages have bit 6 set, which is "Receive Illegal
>>>> Vector". i.e. a device has attempted to deliver an interrupt with a
>>>> vector field less than 16. I presume that this means that the device is
>>>> ending up with a malformed data field programmed into it.
>>>> Can you identify the PCI sbdf's of the problematic devices, and collect
>>>> debug-keys Q, M and i on a working system so I can identify precisely
>>>> which of the MSI interrupt drivers is in use (Xen has several, depending
>>>> on exact hardware circumstance). If you can, the same debug-keys with
>>>> the problematic changeset present might also be interesting.
>>>> ~Andrew
>>> Hi Andrew,
>>>
>>> The passed through xhci is 08:00.0
>>> The SATA controller is 00:11.0
>>>
>>> Most clear failure is on the xhci controller.
>>>
>>> The working and not working config only differ in the revert of the mentioned
>>> commit.
>>>
>>> Attached are:
>>>
>>> - lspci in dom0 of the working config
>>> - serial-log of the working config (with debug-keys Q, M and i after full boot
>>> and guest start)
>>> - serial-log of the not working config (with debug-keys Q, M and i after full
>>> boot and guest start)
>> Thanks.
>> As an utter longshot, can you give this patch a try? Could you also see
>> about capturing an lspci in dom0 while the bad situation is manifesting
>> itself?
>> ~Andrew
> Hi Andrew,
>
> lspci of the not working case attached, there are some differences
> compared to the working case, but on other device than i expected.
> (btw i'm running with the ivrs_ioapic[6]=00:14.0 override due to
> the bios tables not properly specifying the SB ioapic.)
>
> I tried the patch, but couldn't notice any difference,
> lspci output was exactly the same as of the not working case
> that is attached.

I still can't find a plausible reason for this failure, given the change,
which suggests that it might be a pre-existing subtle issue uncovered by
the change.

~Andrew
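
For reference, the "bit 6 set" reading of the APIC error lines quoted earlier in the thread can be checked with a small decoder. This is only a sketch, not part of Xen: the names `ESR_BITS`, `decode_esr` and `parse_apic_error` are made up for illustration, the bit names follow the Intel SDM description of the local APIC Error Status Register (AMD documents some bits differently), and treating the two hex fields as two successive ESR reads is an assumption about the log format.

```python
import re

# Local APIC Error Status Register bit names (Intel SDM wording).
# Assumption: AMD parts may reserve or rename some of these bits.
ESR_BITS = {
    0: "Send Checksum Error",
    1: "Receive Checksum Error",
    2: "Send Accept Error",
    3: "Receive Accept Error",
    4: "Redirectable IPI",
    5: "Send Illegal Vector",
    6: "Receive Illegal Vector",
    7: "Illegal Register Address",
}

# Matches lines such as "APIC error on CPU0: 00(40)".
LINE_RE = re.compile(
    r"APIC error on CPU(\d+): ([0-9a-fA-F]{2})\(([0-9a-fA-F]{2})\)"
)


def decode_esr(value):
    """Return the names of all ESR bits set in value, lowest bit first."""
    return [name for bit, name in sorted(ESR_BITS.items()) if value & (1 << bit)]


def parse_apic_error(line):
    """Parse one log line into (cpu, first_read_bits, second_read_bits).

    Returns None if the line is not an APIC error message.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    cpu = int(match.group(1))
    first = int(match.group(2), 16)
    second = int(match.group(3), 16)
    return cpu, decode_esr(first), decode_esr(second)


if __name__ == "__main__":
    sample = "(XEN) [2015-03-26 20:35:42.205] APIC error on CPU0: 00(40)"
    print(parse_apic_error(sample))
```

Feeding it the three "APIC error on CPU0" lines from the quoted xl dmesg shows only bit 6 (0x40) in the value in parentheses, which matches the "Receive Illegal Vector" interpretation given in the thread.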