* [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.
  2021-10-12 13:56 Ajay Garg
  2021-10-22 17:33 ` Ajay Garg
  2021-11-09 23:56 ` Alex Williamson
  0 siblings, 2 replies; 5+ messages in thread

From: Ajay Garg @ 2021-10-12 13:56 UTC (permalink / raw)
To: iommu; +Cc: alex.williamson

Origins at:
https://lists.linuxfoundation.org/pipermail/iommu/2021-October/thread.html

=== Changes from v1 => v2 ===

a) Improved patch-description.

b) A more root-level fix, as suggested by:
   1. Alex Williamson <alex.williamson@redhat.com>
   2. Lu Baolu <baolu.lu@linux.intel.com>

=== Issue ===

Kernel-log flooding is seen when an x86_64 L1 guest (Ubuntu 21) is booted
in qemu/kvm on an x86_64 host (Ubuntu 21), with a host PCI device attached.

The following kind of logs, along with the stacktraces, cause the flood:

......
DMAR: ERROR: DMA PTE for vPFN 0x428ec already set (to 3f6ec003 not 3f6ec003)
DMAR: ERROR: DMA PTE for vPFN 0x428ed already set (to 3f6ed003 not 3f6ed003)
DMAR: ERROR: DMA PTE for vPFN 0x428ee already set (to 3f6ee003 not 3f6ee003)
DMAR: ERROR: DMA PTE for vPFN 0x428ef already set (to 3f6ef003 not 3f6ef003)
DMAR: ERROR: DMA PTE for vPFN 0x428f0 already set (to 3f6f0003 not 3f6f0003)
......

=== Current Behaviour, leading to the issue ===

Currently, when we do a dma-unmapping, we unmap/unlink the mappings, but
the pte-entries are not cleared. Thus, the following sequence floods the
kernel logs:

i) A dma-unmapping makes the real/leaf-level pte-slot invalid, but the
pte-content itself is not cleared.

ii) Later, during some dma-mapping procedure, as the pte-slot is about to
hold a new pte-value, the intel-iommu driver checks whether a prior
pte-entry exists in the slot. If it does, it logs a kernel error, along
with a corresponding stacktrace.

iii) Step ii) runs in abundance, and the kernel logs run insane.
=== Fix ===

We ensure that, as part of a dma-unmapping, each (unmapped) pte-slot is
also cleared of its value/content (at the leaf level, where the real
iova => pfn mapping is stored).

This completes a "deep" dma-unmapping.

Signed-off-by: Ajay Garg <ajaygargnsit@gmail.com>
---
 drivers/iommu/intel/iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d75f59ae28e6..485a8ea71394 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5090,6 +5090,8 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	gather->freelist = domain_unmap(dmar_domain, start_pfn,
 					last_pfn, gather->freelist);
 
+	dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
+
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
 
-- 
2.30.2

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
* Re: [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.

From: Ajay Garg @ 2021-10-22 17:33 UTC (permalink / raw)
To: iommu; +Cc: Alex Williamson

Ping ..

Any updates on this, please?

It will be great to have the fix upstreamed (properly, of course).

Right now, the patch contains the change as suggested: explicitly/properly
clearing out dma-mappings when unmap is called. Please let me know in
whatever way I can help, including testing/debugging other approaches if
required.

Many thanks to Alex and Lu for their continued support on the issue.

P.S.: I might have missed mentioning the information about the device that
causes the flooding. Please find it below:

######################################
sudo lspci -vvv

0a:00.0 SD Host controller: O2 Micro, Inc.
OZ600FJ0/OZ900FJ0/OZ600FJS SD/MMC Card Reader Controller (rev 05) (prog-if 01)
	Subsystem: Dell OZ600FJ0/OZ900FJ0/OZ600FJS SD/MMC Card Reader Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 17
	IOMMU group: 14
	Region 0: Memory at e2c20000 (32-bit, non-prefetchable) [size=512]
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [80] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM L0s Enabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (ok), Width x1 (ok)
			TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [200 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
			RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
			RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
			RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Kernel driver in use: sdhci-pci
	Kernel modules: sdhci_pci
######################################

Thanks and Regards,
Ajay

On Tue, Oct 12, 2021 at 7:27 PM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> [v2 patch description and diff quoted in full; snipped]
* Re: [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.

From: Ajay Garg @ 2021-10-23 7:00 UTC (permalink / raw)
To: iommu; +Cc: Alex Williamson

Another piece of information:

The observations are the same if the current pci-device (sd/mmc
controller) is detached and another pci-device (sound controller) is
attached to the guest. So it looks like we can rule out any
(pci-)device-specific issue.

For reference, here are the details of the other pci-device I tried with:

###############################################
sudo lspci -vvv

00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
	DeviceName: Onboard Audio
	Subsystem: Dell 6 Series/C200 Series Chipset Family High Definition Audio Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 31
	IOMMU group: 5
	Region 0: Memory at e2e60000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00358  Data: 0000
	Capabilities: [70] Express (v1) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE- FLReset+
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			FLReset- MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
		VC1:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=1 ArbSelect=Fixed TC/VC=22
			Status:	NegoPending- InProgress-
	Capabilities: [130 v1] Root Complex Link
		Desc:	PortNumber=0f ComponentID=00 EltType=Config
		Link0:	Desc:	TargetPort=00 TargetComponent=00 AssocRCRB- LinkType=MemMapped LinkValid+
			Addr:	00000000fed1c000
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
###############################################

On Fri, Oct 22, 2021 at 11:03 PM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> [previous message quoted in full; snipped]
* Re: [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.

From: Alex Williamson @ 2021-11-09 23:56 UTC (permalink / raw)
To: baolu.lu; +Cc: iommu

Hi Baolu,

Have you looked into this? I'm able to reproduce by starting and
destroying an assigned device VM several times. It seems like it came in
with Joerg's pull request for the v5.15 merge window. Bisecting lands me
on 3f34f1259776, where intel-iommu added map/unmap_pages support, but I'm
not convinced that isn't an artifact: the regular map/unmap calls had been
simplified to only be used for single pages by that point. If I mask the
map/unmap_pages callbacks and use map/unmap with (pgsize * size) and
restore the previous pgsize_bitmap, I can generate the same faults. So
maybe the root issue was introduced somewhere else, or perhaps it is a
latent bug in the clearing of pte ranges, as Ajay proposes below. In any
case, I think there's a real issue here. Thanks,

Alex

On Tue, 12 Oct 2021 19:26:53 +0530
Ajay Garg <ajaygargnsit@gmail.com> wrote:

> [v2 patch quoted in full; snipped]
> > > > === Current Behaviour, leading to the issue === > > Currently, when we do a dma-unmapping, we unmap/unlink the mappings, but > the pte-entries are not cleared. > > Thus, following sequencing would flood the kernel-logs : > > i) > A dma-unmapping makes the real/leaf-level pte-slot invalid, but the > pte-content itself is not cleared. > > ii) > Now, during some later dma-mapping procedure, as the pte-slot is about > to hold a new pte-value, the intel-iommu checks if a prior > pte-entry exists in the pte-slot. If it exists, it logs a kernel-error, > along with a corresponding stacktrace. > > iii) > Step ii) runs in abundance, and the kernel-logs run insane. > > > > === Fix === > > We ensure that as part of a dma-unmapping, each (unmapped) pte-slot > is also cleared of its value/content (at the leaf-level, where the > real mapping from a iova => pfn mapping is stored). > > This completes a "deep" dma-unmapping. > > > > Signed-off-by: Ajay Garg <ajaygargnsit@gmail.com> > --- > drivers/iommu/intel/iommu.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > index d75f59ae28e6..485a8ea71394 100644 > --- a/drivers/iommu/intel/iommu.c > +++ b/drivers/iommu/intel/iommu.c > @@ -5090,6 +5090,8 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain, > gather->freelist = domain_unmap(dmar_domain, start_pfn, > last_pfn, gather->freelist); > > + dma_pte_clear_range(dmar_domain, start_pfn, last_pfn); > + > if (dmar_domain->max_addr == iova + size) > dmar_domain->max_addr = iova; > _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.

From: Lu Baolu @ 2021-11-10 6:33 UTC (permalink / raw)
To: Alex Williamson; +Cc: iommu

Hi Alex,

On 2021/11/10 7:56, Alex Williamson wrote:
> Hi Baolu,
>
> Have you looked into this?

I am looking at this.

> I'm able to reproduce by starting and
> destroying an assigned device VM several times. It seems like it came
> in with Joerg's pull request for the v5.15 merge window. Bisecting
> lands me on 3f34f1259776 where intel-iommu added map/unmap_pages
> support, but I'm not convinced that isn't an artifact that the regular
> map/unmap calls had been simplified to only be used for single
> pages by that point. If I mask the map/unmap_pages callbacks and use
> map/unmap with (pgsize * size) and restore the previous pgsize_bitmap,
> I can generate the same faults. So maybe the root issue was introduced
> somewhere else, or perhaps it is a latent bug in clearing of pte ranges
> as Ajay proposes below. In any case, I think there's a real issue
> here. Thanks,

I am trying to reproduce this issue with my local setup. I will come back
again after I have more details.

Best regards,
baolu

> On Tue, 12 Oct 2021 19:26:53 +0530
> Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
>> [v2 patch quoted in full; snipped]