From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wei Wang
Subject: Re: [PATCH] amd iommu: Dump flags of IO page faults
Date: Mon, 24 Sep 2012 14:24:16 +0200
Message-ID: <506050F0.7020703@amd.com>
References: <504764E2.4000809@amd.com> <326589347.20120906005936@eikelenboom.it> <5048A603.3070207@amd.com> <483486315.20120906155001@eikelenboom.it> <5048BB29.4040900@amd.com> <488803362.20120907093241@eikelenboom.it> <5049B650.4080101@amd.com> <74647167
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <74647167.20120924103835@eikelenboom.it>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom
Cc: "xen-devel@lists.xensource.com" , Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 09/24/2012 10:38 AM, Sander Eikelenboom wrote:
>
> Friday, September 7, 2012, 10:54:40 AM, you wrote:
>
>> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>>
>>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>>
>>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>>
>>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>>
>>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>>
>>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>>
>>>>>>>> Hi Jan,
>>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Wei
>>>>>>>
>>>>>>>> signed-off-by: Wei Wang
>>>>>>>
>>>>>>>
>>>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>>>
>>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>>>
>>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>>> your system would be helpful to debug this issue:
>>>>>> 1) xl info
>>>>>> 2) xl list
>>>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
>>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>>
>>>>> dom14 is not a HVM guest, it's a PV guest.
>>>
>>>> Ah, I see. A PV guest is quite different from hvm: it does not use p2m tables
>>>> as io page tables. So the no-sharept option does not work in this case. PV
>>>> guests always use separate io page tables. There might be some
>>>> incorrect mappings in the page tables. I will check this on my side.
>>>
>>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
>>> I haven't seen any IO PAGE FAULTS after that.
>>>
>>> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
>>> Have attached the xl/xm dmesg and lspci from booting with both versions.
>>>
>>> lspci:
>>>
>>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>> Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>> Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast>TAbort-SERR-
>>> Latency: 0
>>> Interrupt: pin A routed to IRQ 10
>>> Capabilities: [40] Secure device
>>> 4.1: Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>
>> Eh... That is interesting. So which dom0 are you using? There is a c/s
>> in 4.2 to prevent a recent dom0 from disabling the iommu interrupt (changeset
>> 25492:61844569a432). Otherwise, the iommu cannot send any events, including IO
>> PAGE faults. You could try to revert dom0 to an old version like 2.6
>> pv_ops to see if you really have no io page faults on 4.1
>
> Ok i finally got the time to do some more testing, tested 4.2 around that changeset, and made a copy of the guest using HVM instead of PV.
>
> The results:
> - On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the video device passed through works fine, both in an HVM and a PV guest, i don't see IO page faults getting reported.
> - On xen-4.2 changeset< 25492 and a 3.6-rc6 kernel (dom0 and domU): the video device passed through works fine, both in an HVM and a PV guest, i don't see IO page faults getting reported.
> - On xen-4.2 changeset> 25492 and a 3.6-rc6 kernel (dom0 and domU): the video device passed through works fine for a short while (around 5 to 10 minutes) in a PV guest, after that IO page faults get reported and the video freezes, i don't see any errors in the guest though.
> - On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
> PV: the video device passed through works fine for a short while (around 5 to 10 minutes), after that IO page faults get reported and the video freezes, i don't see any errors in the guest though.
> HVM: the video device passed through doesn't work from the start:
> - The device is there according to lspci
> - The video application starts fine, but delivers a green image, so the device is not working properly. I don't see IO page faults though.
>
> Attached are (all with xen-unstable tip and the guest as HVM (domain 15)):
> - xl dmesg
> - Patch which adds some more info, but all values reported seem to be zero (see xl dmesg)
> - lspci dom0
> - lspci HVM guest

Hi,
Thanks for the information, it is very helpful for debugging. I hope to start looking at this right after sending my next iommu patch queue upstream. Another question: did you see this issue on a system with a single pv/hvm guest, or only on a system with about 16 running VMs?

Thanks,
Wei

>
>
>
>>> 4.2: Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>> Address: 00000000fee0100c Data: 4128
>>> Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>>
>>> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts ?
>
>> The IRQ number is fine. MSI vector is stored at Data: 4128
>
>>>
>>> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>>>
>>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>>> 4.1: Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast>TAbort-SERR-
>>> 4.2: Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast>TAbort-SERR-
>>> Latency: 0, Cache Line Size: 64 bytes
>>> Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>> I/O behind bridge: 0000f000-00000fff
>>> Memory behind bridge: f9f00000-f9ffffff
>>> Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>>> 4.1: Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast>TAbort-
>>> 4.2: Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast>TAbort+
>>> BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort->Reset- FastB2B-
>>> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>> Capabilities: [50] Power Management version 3
>>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>> Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>>> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s<64ns, L1<1us
>>> ExtTag+ RBE+ FLReset-
>>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>> MaxPayload 128 bytes, MaxReadReq 128 bytes
>>> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>>> LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0<1us, L1<8us
>>> ClockPM- Surprise- LLActRep+ BwNot+
>>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>>> SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>>> Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>>> SltCtl: Enable: AttnBtn- PwrFlt-
MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>>> Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>>> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>>> Changed: MRL- PresDet+ LinkState+
>
>> That is probably because of the IO_PAGE_FAULT.
>
>> Thanks,
>> Wei
>
>>> serveerstertje:~# lspci -t
>>> -[0000:00]-+-00.0
>>>            +-00.2
>>>            +-02.0-[0b]----00.0
>>>            +-03.0-[0a]--+-00.0
>>>            |            +-00.1
>>>            |            +-00.2
>>>            |            +-00.3
>>>            |            +-00.4
>>>            |            +-00.5
>>>            |            +-00.6
>>>            |            \-00.7
>>>            +-05.0-[09]----00.0
>>>            +-06.0-[08]----00.0
>>>            +-0a.0-[07]----00.0
>>>            +-0b.0-[06]--+-00.0
>>>            |            \-00.1
>>>            +-0c.0-[05]----00.0
>>>            +-0d.0-[04]--+-00.0
>>>            |            +-00.1
>>>            |            +-00.2
>>>            |            +-00.3
>>>            |            +-00.4
>>>            |            +-00.5
>>>            |            +-00.6
>>>            |            \-00.7
>>>            +-11.0
>>>            +-12.0
>>>            +-12.2
>>>            +-13.0
>>>            +-13.2
>>>            +-14.0
>>>            +-14.3
>>>            +-14.4-[03]----06.0
>>>            +-14.5
>>>            +-15.0-[02]--
>>>            +-16.0
>>>            +-16.2
>>>            +-18.0
>>>            +-18.1
>>>            +-18.2
>>>            +-18.3
>>>            \-18.4
>>>
>>>
>>>
>>>
>>>
>>>> Thanks,
>>>> Wei
>>>
>>>>> I will try to make a complete package, and try with one pv domain only where the devices are being passed through just to simplify the setup.
>>>>>
>>>>>
>>>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>>>> happened. Did it stop working?
>>>>>
>>>>> Yes it stops working, the video capture just freezes, but the driver doesn't bail out.
>>>>> For the USB controller (0x0a06) it starts to give errors for usbdev_open in the guest.
>>>>>
>>>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>>>> my RD890 system
>>>>>
>>>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>>>
>>>>>> * so, what OEM board do you have?)
>>>>>
>>>>> MSI 890FXA-GD70
>>>>>
>>>>>> Also from your log, these lines look very strange:
>>>>>
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xe7, mfn=0xa463f
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xe9, mfn=0xa463d
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xeb, mfn=0xa463b
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page. gfn=0xed, mfn=0xa4639
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>> read-only memory page.
gfn=0xef, mfn=0xa4637
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id
>>>>>> = 0x0a06, fault address = 0xc2c2c2c0
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>> id = 0x0700, fault address = 0xa90f8300
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>> id = 0x0700, fault address = 0xa90f8340
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>> id = 0x0700, fault address = 0xa90f8380
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>> id = 0x0700, fault address = 0xa90f83c0
>>>>>
>>>>>> * they are just followed by the IO PAGE fault. Do you know where they
>>>>>> are from? Your video card driver maybe?
>>>>>
>>>>> From a HVM domain with an old (3.0.3) kernel, but the faults also occur without this domain being started.
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Wei
>>>>>
>>>>>
>>>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>>>
>>>>>>> Thx
>>>>>>>
>>>>>>> --
>>>>>>> Sander
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
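
For anyone matching the faulting device ids in the logs against the lspci output, here is a minimal sketch in Python. It assumes the log layout quoted in this thread (one unwrapped IO_PAGE_FAULT message per line, with the flags field optional) and the standard PCI encoding of a 16-bit device id as bus[15:8], device[7:3], function[2:0]. The names FAULT_RE, decode_bdf and parse_fault are made up for illustration and are not part of Xen. The flags value is only extracted, not interpreted.

```python
import re

# Assumed message layout, copied from the log lines quoted in this thread.
# The ", flags = ..." part is optional because the older logs do not have it.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT: domain = (\d+), device id = (0x[0-9a-fA-F]+), "
    r"fault address = (0x[0-9a-fA-F]+)(?:, flags = (0x[0-9a-fA-F]+))?"
)


def decode_bdf(device_id):
    """Split a 16-bit PCI device id into the bus:device.function form used by lspci."""
    bus = (device_id >> 8) & 0xFF
    dev = (device_id >> 3) & 0x1F
    fn = device_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


def parse_fault(line):
    """Return the fields of one IO_PAGE_FAULT log line, or None if it does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    domain, devid, addr, flags = m.groups()
    return {
        "domain": int(domain),
        "bdf": decode_bdf(int(devid, 16)),
        "address": int(addr, 16),
        "flags": int(flags, 16) if flags is not None else None,
    }


if __name__ == "__main__":
    sample = ("(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, "
              "device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020")
    print(parse_fault(sample))
```

With this decoding, device id 0x0a06 maps to 0a:00.6 (the device behind the 00:03.0 bridge) and 0x0700 maps to 07:00.0, which agrees with the lspci -t tree quoted above. Log lines that the mail client wrapped in the middle of "device id" need to be rejoined before parsing.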