* Xe performance regression with recent IOMMU changes
@ 2026-01-21 13:02 Francois Dugast
2026-01-21 13:11 ` Jason Gunthorpe
0 siblings, 1 reply; 10+ messages in thread
From: Francois Dugast @ 2026-01-21 13:02 UTC (permalink / raw)
To: iommu
Cc: intel-xe, Francois Dugast, Jason Gunthorpe, Joerg Roedel,
Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
Samiullah Khawaja, Matthew Brost, Thomas Hellström,
Tina Zhang, Lu Baolu, Kevin Tian
I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
can be observed during DMA mappings/unmappings required to issue copies
between system memory and the device when handling GPU faults. I am not
sure how other use cases or vendors are affected, but below is the
impact on execution times for BMG:
Before changes:
4KB
drm_pagemap_migrate_map_pages: 0.4 us
drm_pagemap_migrate_unmap_pages: 0.4 us
64KB
drm_pagemap_migrate_map_pages: 2.5 us
drm_pagemap_migrate_unmap_pages: 3.5 us
2MB
drm_pagemap_migrate_map_pages: 88 us
drm_pagemap_migrate_unmap_pages: 108 us
After changes:
4KB
drm_pagemap_migrate_map_pages: 0.7 us
drm_pagemap_migrate_unmap_pages: 0.7 us
64KB
drm_pagemap_migrate_map_pages: 3.5 us
drm_pagemap_migrate_unmap_pages: 10.5 us
2MB
drm_pagemap_migrate_map_pages: 102 us
drm_pagemap_migrate_unmap_pages: 330 us
Bisecting points to these commits:
d373449d8e97 iommu/vt-d: Use the generic iommu page table
d856f9d27885 iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
Simply reverting them brings back performance.
Any ideas about why this happens and how to fix it?
Thanks,
Francois
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Calvin Owens <calvin@wbinvd.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
--
2.43.0
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Xe performance regression with recent IOMMU changes
  2026-01-21 13:02 Xe performance regression with recent IOMMU changes Francois Dugast
@ 2026-01-21 13:11 ` Jason Gunthorpe
  2026-01-21 18:04   ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-21 13:11 UTC (permalink / raw)
  To: Francois Dugast
  Cc: iommu, intel-xe, Joerg Roedel, Calvin Owens, David Woodhouse,
      Will Deacon, Robin Murphy, Samiullah Khawaja, Matthew Brost,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> can be observed during DMA mappings/unmappings required to issue copies
> between system memory and the device when handling GPU faults. I am not
> sure how other use cases or vendors are affected, but below is the
> impact on execution times for BMG:
>
> Before changes:
> 4KB
> drm_pagemap_migrate_map_pages: 0.4 us
> drm_pagemap_migrate_unmap_pages: 0.4 us
> 64KB
> drm_pagemap_migrate_map_pages: 2.5 us
> drm_pagemap_migrate_unmap_pages: 3.5 us
> 2MB
> drm_pagemap_migrate_map_pages: 88 us
> drm_pagemap_migrate_unmap_pages: 108 us
>
> After changes:
> 4KB
> drm_pagemap_migrate_map_pages: 0.7 us
> drm_pagemap_migrate_unmap_pages: 0.7 us
> 64KB
> drm_pagemap_migrate_map_pages: 3.5 us
> drm_pagemap_migrate_unmap_pages: 10.5 us
> 2MB
> drm_pagemap_migrate_map_pages: 102 us
> drm_pagemap_migrate_unmap_pages: 330 us

I posted some more optimizations for these cases; they should reduce
the numbers.

This is the opposite of the benchmark numbers I ran, which showed
significant gains as the page count and sizes increased.

But something weird is going on to see a 3x increase in unmap; that
shouldn't be just algorithm overhead. That almost seems like additional
IOTLB invalidation overhead or something else going wrong.

Is this from a system with the VT-d cache flushing requirement? That
logic changed around too and could have this kind of big impact.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
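Whether a given VT-d unit has the cache flushing requirement Jason asks
about is visible in its extended capability register: ECAP bit 0 is C
(page-walk coherency), and 0 means the page-table walker is not cache
coherent. A minimal user-space check, assuming the intel-iommu sysfs
layout (the dmar0 index is illustrative):

	#include <stdio.h>

	int main(void)
	{
		/* Path assumed from the intel-iommu sysfs attribute group;
		 * repeat for each dmar* unit on the system. */
		FILE *f = fopen("/sys/class/iommu/dmar0/intel-iommu/ecap", "r");
		unsigned long long ecap;

		if (!f || fscanf(f, "%llx", &ecap) != 1)
			return 1;
		/* VT-d ECAP bit 0 is C (page-walk coherency);
		 * 0 => page-table updates need CPU cache flushing. */
		printf("dmar0 page-walk coherency: %llu\n", ecap & 1);
		return 0;
	}

A unit reporting 0 takes the incoherent flushing paths discussed below.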
* Re: Xe performance regression with recent IOMMU changes
  2026-01-21 13:11 ` Jason Gunthorpe
@ 2026-01-21 18:04   ` Jason Gunthorpe
  2026-01-22  6:15     ` Matthew Brost
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-21 18:04 UTC (permalink / raw)
  To: Francois Dugast
  Cc: iommu, intel-xe, Joerg Roedel, Calvin Owens, David Woodhouse,
      Will Deacon, Robin Murphy, Samiullah Khawaja, Matthew Brost,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > can be observed during DMA mappings/unmappings required to issue copies
> > between system memory and the device when handling GPU faults. I am not
> > sure how other use cases or vendors are affected, but below is the
> > impact on execution times for BMG:
> >
> > Before changes:
> > 4KB
> > drm_pagemap_migrate_map_pages: 0.4 us
> > drm_pagemap_migrate_unmap_pages: 0.4 us
> > 64KB
> > drm_pagemap_migrate_map_pages: 2.5 us
> > drm_pagemap_migrate_unmap_pages: 3.5 us
> > 2MB
> > drm_pagemap_migrate_map_pages: 88 us
> > drm_pagemap_migrate_unmap_pages: 108 us
> >
> > After changes:
> > 4KB
> > drm_pagemap_migrate_map_pages: 0.7 us
> > drm_pagemap_migrate_unmap_pages: 0.7 us
> > 64KB
> > drm_pagemap_migrate_map_pages: 3.5 us
> > drm_pagemap_migrate_unmap_pages: 10.5 us
> > 2MB
> > drm_pagemap_migrate_map_pages: 102 us
> > drm_pagemap_migrate_unmap_pages: 330 us
>
> I posted some more optimizations for these cases; they should reduce
> the numbers.
>
> This is the opposite of the benchmark numbers I ran, which showed
> significant gains as the page count and sizes increased.
>
> But something weird is going on to see a 3x increase in unmap; that
> shouldn't be just algorithm overhead. That almost seems like additional
> IOTLB invalidation overhead or something else going wrong.
>
> Is this from a system with the VT-d cache flushing requirement? That
> logic changed around too and could have this kind of big impact.

Oh, looking at the code a bit, you've got pretty much the slowest
possible thing you can do here:

	for (i = 0; i < npages;) {
		if (!pagemap_addr[i].addr ||
		    dma_mapping_error(dev, pagemap_addr[i].addr))
			goto next;

		dma_unmap_page(dev, pagemap_addr[i].addr,
			       PAGE_SIZE << pagemap_addr[i].order, dir);

It is weird though:

0.7 us * 512 = 358 us, so it is about the reported speed.

But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
reported?? It got 2x faster the more times you loop it? Huh?

The real way to fix this up is to use the new DMA API so this can be
collapsed into a single unmap. Then it will take < 1 us for all those
cases.

Look at the patches Leon made for the RDMA ODP stuff; it has a similar
looking workflow.

The optimizations I posted will help this noticeably.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
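For reference, the "new DMA API" mentioned here is the two-step
dma_iova_* interface that flows like RDMA ODP use. A minimal sketch of
the collapsed map path, assuming those calls; the function name and
flat pages[] array are illustrative rather than Xe's actual structures,
and error handling is trimmed:

	#include <linux/dma-mapping.h>
	#include <linux/mm.h>

	static int migrate_map_pages(struct device *dev,
				     struct dma_iova_state *state,
				     struct page **pages, unsigned long npages,
				     enum dma_data_direction dir)
	{
		size_t total = (size_t)npages << PAGE_SHIFT;
		size_t mapped = 0;
		unsigned long i;
		int ret;

		/* One contiguous IOVA range for the whole migration window;
		 * if this fails the caller falls back to a dma_map_page() loop. */
		if (!dma_iova_try_alloc(dev, state, page_to_phys(pages[0]), total))
			return -EOPNOTSUPP;

		/* Link each (possibly discontiguous) page into the range; the
		 * DMA address of page i is then state->addr + i * PAGE_SIZE. */
		for (i = 0; i < npages; i++) {
			ret = dma_iova_link(dev, state, page_to_phys(pages[i]),
					    mapped, PAGE_SIZE, dir, 0);
			if (ret)
				goto err_destroy;
			mapped += PAGE_SIZE;
		}

		/* One IOTLB sync for the whole range instead of one per page. */
		ret = dma_iova_sync(dev, state, 0, mapped);
		if (ret)
			goto err_destroy;
		return 0;

	err_destroy:
		/* Unlinks whatever was mapped so far and frees the IOVA range. */
		dma_iova_destroy(dev, state, mapped, dir, 0);
		return ret;
	}

The payoff is on teardown: the per-page dma_unmap_page() loop collapses
into a single dma_iova_destroy(dev, state, total, dir, 0), i.e. one
unlink and one IOTLB invalidation for the whole range.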
* Re: Xe performance regression with recent IOMMU changes
  2026-01-21 18:04 ` Jason Gunthorpe
@ 2026-01-22  6:15   ` Matthew Brost
  2026-01-22  7:29     ` Leon Romanovsky
  2026-01-22 13:31     ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Matthew Brost @ 2026-01-22 6:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Francois Dugast, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > can be observed during DMA mappings/unmappings required to issue copies
> > > between system memory and the device when handling GPU faults. I am not
> > > sure how other use cases or vendors are affected, but below is the
> > > impact on execution times for BMG:
> > >
> > > Before changes:
> > > 4KB
> > > drm_pagemap_migrate_map_pages: 0.4 us
> > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > 64KB
> > > drm_pagemap_migrate_map_pages: 2.5 us
> > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > 2MB
> > > drm_pagemap_migrate_map_pages: 88 us
> > > drm_pagemap_migrate_unmap_pages: 108 us
> > >
> > > After changes:
> > > 4KB
> > > drm_pagemap_migrate_map_pages: 0.7 us
> > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > 64KB
> > > drm_pagemap_migrate_map_pages: 3.5 us
> > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > 2MB
> > > drm_pagemap_migrate_map_pages: 102 us
> > > drm_pagemap_migrate_unmap_pages: 330 us
> >
> > I posted some more optimizations for these cases; they should reduce
> > the numbers.

We can try those — link? I believe I know the series, but just to make
sure we're on the same page.

> > This is the opposite of the benchmark numbers I ran, which showed
> > significant gains as the page count and sizes increased.
> >
> > But something weird is going on to see a 3x increase in unmap; that
> > shouldn't be just algorithm overhead. That almost seems like additional
> > IOTLB invalidation overhead or something else going wrong.
> >
> > Is this from a system with the VT-d cache flushing requirement? That
> > logic changed around too and could have this kind of big impact.
>
> Oh, looking at the code a bit, you've got pretty much the slowest
> possible thing you can do here:

This was a fairly common pattern prior to Leon's series, I believe. The
cross-references show this pattern appearing frequently in the kernel
[1]. I do agree with the point below that, with Leon's changes applied,
this could be refactored into an IOVA alloc/link/unlink/free flow,
which would work better (also, 2M device pages make the common 2M case
a moot point).

But that's not what we're discussing here. We're talking about a
regression introduced in the dma-mapping API for x86, which in my view
is unacceptable for a kernel release. So IMO we should revert those
changes [2].

[1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page
[2]
e6fbd544619c50b4a4d96ccb4676cac03cb iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
d856f9d27885c499d96ab7fe506083346ccf145d iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
6cbc09b7719ec7fd9f650f18b3828b7f60c17881 iommu/vt-d: Restore previous domain::aperture_end calculation
a97fbc3ee3e2a536fafaff04f21f45472db71769 syscore: Pass context data to callbacks
101a2854110fa8787226dae1202892071ff2c369 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
d373449d8e97891434db0c64afca79d903c1194e iommu/vt-d: Use the generic iommu page table

> 	for (i = 0; i < npages;) {
> 		if (!pagemap_addr[i].addr ||
> 		    dma_mapping_error(dev, pagemap_addr[i].addr))
> 			goto next;
>
> 		dma_unmap_page(dev, pagemap_addr[i].addr,
> 			       PAGE_SIZE << pagemap_addr[i].order, dir);
>
> It is weird though:
>
> 0.7 us * 512 = 358 us, so it is about the reported speed.
>
> But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
> reported?? It got 2x faster the more times you loop it? Huh?
>
> The real way to fix this up is to use the new DMA API so this can be
> collapsed into a single unmap. Then it will take < 1 us for all those
> cases.
>
> Look at the patches Leon made for the RDMA ODP stuff; it has a similar
> looking workflow.

See above. I agree this is the right direction, but we can't simply
regress kernels from existing performance.

> The optimizations I posted will help this noticeably.

I think we need to start with a revert and then discuss whether your
subsequent changes actually fix the problem.

Matt

> Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  6:15 ` Matthew Brost
@ 2026-01-22  7:29   ` Leon Romanovsky
  2026-01-22  7:36     ` Matthew Brost
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2026-01-22 7:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Jason Gunthorpe, Francois Dugast, iommu, intel-xe, Joerg Roedel,
      Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
      Samiullah Khawaja, Thomas Hellström, Tina Zhang, Lu Baolu,
      Kevin Tian

On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > > can be observed during DMA mappings/unmappings required to issue copies
> > > > between system memory and the device when handling GPU faults. I am not
> > > > sure how other use cases or vendors are affected, but below is the
> > > > impact on execution times for BMG:
> > > >
> > > > Before changes:
> > > > 4KB
> > > > drm_pagemap_migrate_map_pages: 0.4 us
> > > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > > 64KB
> > > > drm_pagemap_migrate_map_pages: 2.5 us
> > > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > > 2MB
> > > > drm_pagemap_migrate_map_pages: 88 us
> > > > drm_pagemap_migrate_unmap_pages: 108 us
> > > >
> > > > After changes:
> > > > 4KB
> > > > drm_pagemap_migrate_map_pages: 0.7 us
> > > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > > 64KB
> > > > drm_pagemap_migrate_map_pages: 3.5 us
> > > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > > 2MB
> > > > drm_pagemap_migrate_map_pages: 102 us
> > > > drm_pagemap_migrate_unmap_pages: 330 us
> > >
> > > I posted some more optimizations for these cases; they should reduce
> > > the numbers.
>
> We can try those — link? I believe I know the series, but just to make
> sure we're on the same page.
>
> > > This is the opposite of the benchmark numbers I ran, which showed
> > > significant gains as the page count and sizes increased.
> > >
> > > But something weird is going on to see a 3x increase in unmap; that
> > > shouldn't be just algorithm overhead. That almost seems like additional
> > > IOTLB invalidation overhead or something else going wrong.
> > >
> > > Is this from a system with the VT-d cache flushing requirement? That
> > > logic changed around too and could have this kind of big impact.
> >
> > Oh, looking at the code a bit, you've got pretty much the slowest
> > possible thing you can do here:
>
> This was a fairly common pattern prior to Leon's series, I believe. The
> cross-references show this pattern appearing frequently in the kernel
> [1]. I do agree with the point below that, with Leon's changes applied,
> this could be refactored into an IOVA alloc/link/unlink/free flow,
> which would work better (also, 2M device pages make the common 2M case
> a moot point).
>
> But that's not what we're discussing here. We're talking about a
> regression introduced in the dma-mapping API for x86, which in my view
> is unacceptable for a kernel release. So IMO we should revert those
> changes [2].
>
> [1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page

I think this comparison is unfair. The previous behavior was bad for
everyone, while the current issue affects only the specific
drm_pagemap_migrate_unmap_pages() flow. Cases where the performance of
dma_unmap_page() in non-direct mode matters are extremely rare.

It should be relatively straightforward to add a link/unlink path to
the drm_pagemap_*() helpers and achieve decent performance.

Thanks

> [2]
> e6fbd544619c50b4a4d96ccb4676cac03cb iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
> d856f9d27885c499d96ab7fe506083346ccf145d iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
> 6cbc09b7719ec7fd9f650f18b3828b7f60c17881 iommu/vt-d: Restore previous domain::aperture_end calculation
> a97fbc3ee3e2a536fafaff04f21f45472db71769 syscore: Pass context data to callbacks
> 101a2854110fa8787226dae1202892071ff2c369 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
> d373449d8e97891434db0c64afca79d903c1194e iommu/vt-d: Use the generic iommu page table
>
> > 	for (i = 0; i < npages;) {
> > 		if (!pagemap_addr[i].addr ||
> > 		    dma_mapping_error(dev, pagemap_addr[i].addr))
> > 			goto next;
> >
> > 		dma_unmap_page(dev, pagemap_addr[i].addr,
> > 			       PAGE_SIZE << pagemap_addr[i].order, dir);
> >
> > It is weird though:
> >
> > 0.7 us * 512 = 358 us, so it is about the reported speed.
> >
> > But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
> > reported?? It got 2x faster the more times you loop it? Huh?
> >
> > The real way to fix this up is to use the new DMA API so this can be
> > collapsed into a single unmap. Then it will take < 1 us for all those
> > cases.
> >
> > Look at the patches Leon made for the RDMA ODP stuff; it has a similar
> > looking workflow.
>
> See above. I agree this is the right direction, but we can't simply
> regress kernels from existing performance.
>
> > The optimizations I posted will help this noticeably.
>
> I think we need to start with a revert and then discuss whether your
> subsequent changes actually fix the problem.
>
> Matt
>
> > Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  7:29 ` Leon Romanovsky
@ 2026-01-22  7:36   ` Matthew Brost
  2026-01-22 10:26     ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Brost @ 2026-01-22 7:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Francois Dugast, iommu, intel-xe, Joerg Roedel,
      Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
      Samiullah Khawaja, Thomas Hellström, Tina Zhang, Lu Baolu,
      Kevin Tian

On Thu, Jan 22, 2026 at 09:29:13AM +0200, Leon Romanovsky wrote:
> On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> > On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > > > can be observed during DMA mappings/unmappings required to issue copies
> > > > > between system memory and the device when handling GPU faults. I am not
> > > > > sure how other use cases or vendors are affected, but below is the
> > > > > impact on execution times for BMG:
> > > > >
> > > > > Before changes:
> > > > > 4KB
> > > > > drm_pagemap_migrate_map_pages: 0.4 us
> > > > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > > > 64KB
> > > > > drm_pagemap_migrate_map_pages: 2.5 us
> > > > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > > > 2MB
> > > > > drm_pagemap_migrate_map_pages: 88 us
> > > > > drm_pagemap_migrate_unmap_pages: 108 us
> > > > >
> > > > > After changes:
> > > > > 4KB
> > > > > drm_pagemap_migrate_map_pages: 0.7 us
> > > > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > > > 64KB
> > > > > drm_pagemap_migrate_map_pages: 3.5 us
> > > > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > > > 2MB
> > > > > drm_pagemap_migrate_map_pages: 102 us
> > > > > drm_pagemap_migrate_unmap_pages: 330 us
> > > >
> > > > I posted some more optimizations for these cases; they should reduce
> > > > the numbers.
> >
> > We can try those — link? I believe I know the series, but just to make
> > sure we're on the same page.
> >
> > > > This is the opposite of the benchmark numbers I ran, which showed
> > > > significant gains as the page count and sizes increased.
> > > >
> > > > But something weird is going on to see a 3x increase in unmap; that
> > > > shouldn't be just algorithm overhead. That almost seems like additional
> > > > IOTLB invalidation overhead or something else going wrong.
> > > >
> > > > Is this from a system with the VT-d cache flushing requirement? That
> > > > logic changed around too and could have this kind of big impact.
> > >
> > > Oh, looking at the code a bit, you've got pretty much the slowest
> > > possible thing you can do here:
> >
> > This was a fairly common pattern prior to Leon's series, I believe. The
> > cross-references show this pattern appearing frequently in the kernel
> > [1]. I do agree with the point below that, with Leon's changes applied,
> > this could be refactored into an IOVA alloc/link/unlink/free flow,
> > which would work better (also, 2M device pages make the common 2M case
> > a moot point).
> >
> > But that's not what we're discussing here. We're talking about a
> > regression introduced in the dma-mapping API for x86, which in my view
> > is unacceptable for a kernel release. So IMO we should revert those
> > changes [2].
> >
> > [1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page
>
> I think this comparison is unfair. The previous behavior was bad for
> everyone, while the current issue affects only the specific
> drm_pagemap_migrate_unmap_pages() flow. Cases where the performance of
> dma_unmap_page() in non-direct mode matters are extremely rare.

I don't think you can reason about this without extensive testing
across multiple platforms. Nor is it fair to say: sorry, we slowed down
your existing code, good luck.

> It should be relatively straightforward to add a link/unlink path to
> the drm_pagemap_*() helpers and achieve decent performance.

I agree. Happy to work with you on this going *forward*.

Matt

> Thanks
>
> > [2]
> > e6fbd544619c50b4a4d96ccb4676cac03cb iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
> > d856f9d27885c499d96ab7fe506083346ccf145d iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
> > 6cbc09b7719ec7fd9f650f18b3828b7f60c17881 iommu/vt-d: Restore previous domain::aperture_end calculation
> > a97fbc3ee3e2a536fafaff04f21f45472db71769 syscore: Pass context data to callbacks
> > 101a2854110fa8787226dae1202892071ff2c369 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
> > d373449d8e97891434db0c64afca79d903c1194e iommu/vt-d: Use the generic iommu page table
> >
> > > 	for (i = 0; i < npages;) {
> > > 		if (!pagemap_addr[i].addr ||
> > > 		    dma_mapping_error(dev, pagemap_addr[i].addr))
> > > 			goto next;
> > >
> > > 		dma_unmap_page(dev, pagemap_addr[i].addr,
> > > 			       PAGE_SIZE << pagemap_addr[i].order, dir);
> > >
> > > It is weird though:
> > >
> > > 0.7 us * 512 = 358 us, so it is about the reported speed.
> > >
> > > But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
> > > reported?? It got 2x faster the more times you loop it? Huh?
> > >
> > > The real way to fix this up is to use the new DMA API so this can be
> > > collapsed into a single unmap. Then it will take < 1 us for all those
> > > cases.
> > >
> > > Look at the patches Leon made for the RDMA ODP stuff; it has a similar
> > > looking workflow.
> >
> > See above. I agree this is the right direction, but we can't simply
> > regress kernels from existing performance.
> >
> > > The optimizations I posted will help this noticeably.
> >
> > I think we need to start with a revert and then discuss whether your
> > subsequent changes actually fix the problem.
> >
> > Matt
> >
> > > Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  7:36 ` Matthew Brost
@ 2026-01-22 10:26   ` Leon Romanovsky
  0 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2026-01-22 10:26 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Jason Gunthorpe, Francois Dugast, iommu, intel-xe, Joerg Roedel,
      Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
      Samiullah Khawaja, Thomas Hellström, Tina Zhang, Lu Baolu,
      Kevin Tian

On Wed, Jan 21, 2026 at 11:36:47PM -0800, Matthew Brost wrote:
> On Thu, Jan 22, 2026 at 09:29:13AM +0200, Leon Romanovsky wrote:
> > On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> > > On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > > > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > > > > can be observed during DMA mappings/unmappings required to issue copies
> > > > > > between system memory and the device when handling GPU faults. I am not
> > > > > > sure how other use cases or vendors are affected, but below is the
> > > > > > impact on execution times for BMG:
> > > > > >
> > > > > > Before changes:
> > > > > > 4KB
> > > > > > drm_pagemap_migrate_map_pages: 0.4 us
> > > > > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > > > > 64KB
> > > > > > drm_pagemap_migrate_map_pages: 2.5 us
> > > > > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > > > > 2MB
> > > > > > drm_pagemap_migrate_map_pages: 88 us
> > > > > > drm_pagemap_migrate_unmap_pages: 108 us
> > > > > >
> > > > > > After changes:
> > > > > > 4KB
> > > > > > drm_pagemap_migrate_map_pages: 0.7 us
> > > > > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > > > > 64KB
> > > > > > drm_pagemap_migrate_map_pages: 3.5 us
> > > > > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > > > > 2MB
> > > > > > drm_pagemap_migrate_map_pages: 102 us
> > > > > > drm_pagemap_migrate_unmap_pages: 330 us
> > > > >
> > > > > I posted some more optimizations for these cases; they should reduce
> > > > > the numbers.
> > >
> > > We can try those — link? I believe I know the series, but just to make
> > > sure we're on the same page.
> > >
> > > > > This is the opposite of the benchmark numbers I ran, which showed
> > > > > significant gains as the page count and sizes increased.
> > > > >
> > > > > But something weird is going on to see a 3x increase in unmap; that
> > > > > shouldn't be just algorithm overhead. That almost seems like additional
> > > > > IOTLB invalidation overhead or something else going wrong.
> > > > >
> > > > > Is this from a system with the VT-d cache flushing requirement? That
> > > > > logic changed around too and could have this kind of big impact.
> > > >
> > > > Oh, looking at the code a bit, you've got pretty much the slowest
> > > > possible thing you can do here:
> > >
> > > This was a fairly common pattern prior to Leon's series, I believe. The
> > > cross-references show this pattern appearing frequently in the kernel
> > > [1]. I do agree with the point below that, with Leon's changes applied,
> > > this could be refactored into an IOVA alloc/link/unlink/free flow,
> > > which would work better (also, 2M device pages make the common 2M case
> > > a moot point).
> > >
> > > But that's not what we're discussing here. We're talking about a
> > > regression introduced in the dma-mapping API for x86, which in my view
> > > is unacceptable for a kernel release. So IMO we should revert those
> > > changes [2].
> > >
> > > [1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page
> >
> > I think this comparison is unfair. The previous behavior was bad for
> > everyone, while the current issue affects only the specific
> > drm_pagemap_migrate_unmap_pages() flow. Cases where the performance of
> > dma_unmap_page() in non-direct mode matters are extremely rare.
>
> I don't think you can reason about this without extensive testing
> across multiple platforms. Nor is it fair to say: sorry, we slowed down
> your existing code, good luck.

That is not what I said. I only made the specific point that a loop
over dma_unmap_page() is not universally performance critical.

Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  6:15 ` Matthew Brost
  2026-01-22  7:29   ` Leon Romanovsky
@ 2026-01-22 13:31   ` Jason Gunthorpe
  2026-01-23 16:27     ` Francois Dugast
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 13:31 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Francois Dugast, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> > > I posted some more optimizations for these cases; they should reduce
> > > the numbers.
>
> We can try those — link? I believe I know the series, but just to make
> sure we're on the same page.

https://patch.msgid.link/r/0-v2-973a6bdc820f+693-iommpt_map_direct_jgg@nvidia.com

I also need an answer on whether this testing is running on the
non-cache-coherent IOMMU HW Intel sometimes has; it makes a difference.

I also have in mind a fairly small change to speed up this special
unmap case. IMHO those two together will likely get you back close
enough. And then use link if you actually care about this scenario.

> This was a fairly common pattern prior to Leon's series, I believe. The
> cross-references show this pattern appearing frequently in the kernel
> [1].

Yes, the pattern is common, but virtually nobody actually uses it with
the iommu turned on, because it is something like 10x slower than just
using identity mode.

I understand this is a test suite and it should test with the iommu
enabled, but I'm deeply skeptical this represents actual users who also
care about performance. If they did, they'd already have set the iommu
to identity.

> > The optimizations I posted will help this noticeably.
>
> I think we need to start with a revert and then discuss whether your
> subsequent changes actually fix the problem.

We haven't even done some basic investigation; immediately demanding a
revert of such a large amount of work, for a use case I suspect doesn't
have users, is not reasonable.

This work was not done for no reason and is bringing performance wins
for other use cases that do actually have real users.

If we eventually really can't fix it then you can talk about reverts,
but given that link will absolutely fix xe, I don't see that happening.

Try the patches, give me the new numbers, tell me if you have the
non-cache iommu, and I will give you another one to try.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22 13:31 ` Jason Gunthorpe
@ 2026-01-23 16:27   ` Francois Dugast
  2026-01-23 19:07     ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Francois Dugast @ 2026-01-23 16:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Brost, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Thu, Jan 22, 2026 at 09:31:31AM -0400, Jason Gunthorpe wrote:
> Try the patches, give me the new numbers,

Thanks for the suggestion, but they do not seem to help. See the new
execution times below in ns, collected this time without kprobes to
reduce variation:

# iommu-tip + https://patch.msgid.link/r/0-v2-973a6bdc820f+693-iommpt_map_direct_jgg@nvidia.com
+-----------------------------------+--------+--------+--------+
|                                   | 4KB    | 64KB   | 2MB    |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_map_pages()   | 660    | 3951   | 113813 |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_unmap_pages() | 610    | 11136  | 322802 |
+-----------------------------------+--------+--------+--------+

# drm-tip
+-----------------------------------+--------+--------+--------+
|                                   | 4KB    | 64KB   | 2MB    |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_map_pages()   | 687    | 3890   | 114749 |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_unmap_pages() | 621    | 11180  | 334472 |
+-----------------------------------+--------+--------+--------+

# drm-tip + revert of IOMMU changes
+-----------------------------------+--------+--------+--------+
|                                   | 4KB    | 64KB   | 2MB    |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_map_pages()   | 355    | 3545   | 102706 |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_unmap_pages() | 305    | 4341   | 125919 |
+-----------------------------------+--------+--------+--------+

> tell me if you have the non-cache iommu

The setup used in this test has a non-cache-coherent IOMMU.

> and I will give you another one to try.

Sure, please do.

Francois

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-23 16:27 ` Francois Dugast
@ 2026-01-23 19:07   ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 19:07 UTC (permalink / raw)
  To: Francois Dugast
  Cc: Matthew Brost, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Fri, Jan 23, 2026 at 05:27:24PM +0100, Francois Dugast wrote:
> On Thu, Jan 22, 2026 at 09:31:31AM -0400, Jason Gunthorpe wrote:
> > Try the patches, give me the new numbers,
>
> Thanks for the suggestion, but they do not seem to help. See the new
> execution times below in ns, collected this time without kprobes to
> reduce variation:
>
> # iommu-tip + https://patch.msgid.link/r/0-v2-973a6bdc820f+693-iommpt_map_direct_jgg@nvidia.com
> +-----------------------------------+--------+--------+--------+
> |                                   | 4KB    | 64KB   | 2MB    |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_map_pages()   | 660    | 3951   | 113813 |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_unmap_pages() | 610    | 11136  | 322802 |
> +-----------------------------------+--------+--------+--------+
>
> # drm-tip
> +-----------------------------------+--------+--------+--------+
> |                                   | 4KB    | 64KB   | 2MB    |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_map_pages()   | 687    | 3890   | 114749 |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_unmap_pages() | 621    | 11180  | 334472 |
> +-----------------------------------+--------+--------+--------+

It is not nothing; that looks like about a 4% gain, which matches the
lower bound of what I was measuring for those patches as well.

There are two mysteries in your report. First, compared to my
measurements:

https://lore.kernel.org/linux-iommu/5-v3-634ccd3efce0+16d38-iommu_pt_vtd_jgg@nvidia.com/

iommu_map()
  pgsz      , avg new,old ns , min new,old ns , min % (+ve is better)
  2^12      , 53,66          , 50,64          , 21.21
  256*2^12  , 384,524        , 337,516        , 34.34

iommu_unmap()
  pgsz      , avg new,old ns , min new,old ns , min % (+ve is better)
  2^12      , 67,86          , 63,84          , 25.25
  256*2^12  , 216,335        , 198,317        , 37.37

Yours are about 10x higher. Granted, they are not exactly the same
thing, but I'm measuring the actual page table code as 20% faster, not
slower. So I'm really wondering what is so different in your situation.
Is the cache flushing causing the 10x delta?

Second, it is normal for map and unmap to be approximately the same;
your results have unmap being 165% slower. This surely must be a bug;
I have a guess that some cache flush has the incorrect length.

Still, that 10x difference is confusing. Are you running with debug
options in your kernel config? I wouldn't be surprised at all to be
told KASAN/gcov/etc. reacts much differently.

> > tell me if you have the non-cache iommu
>
> The setup used in this test has a non-cache-coherent IOMMU.

That helps a lot. The non-coherent case disables a meaningful
optimization for the 4k page map case and triggers a bunch of
hard-to-test cache flushing code that we can look at.

Any chance you can run this on a system that has a coherent IOMMU? That
would really help narrow things down.

Can you measure the iommu_map()/iommu_unmap() calls directly under the
DMA API?

Another thought is that something related to the gather outside the
actual page table is acting differently.

I will attempt to run some benchmarking here specifically with the
non-coherent mode enabled to see if I can find a bug.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
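A rough sketch of the direct iommu_map()/iommu_unmap() timing Jason
asks for, as a throwaway kernel-module snippet. The IOVA and the
physically contiguous buffer are assumed to be supplied by the caller,
and on a live DMA domain the IOVA would have to be reserved from the
dma-iommu allocator first; a dedicated test domain is cleaner:

	#include <linux/iommu.h>
	#include <linux/ktime.h>
	#include <linux/printk.h>

	/* Time one map/unmap round trip of 'size' bytes at 'iova' in the
	 * device's current domain. */
	static void time_map_unmap(struct device *dev, unsigned long iova,
				   phys_addr_t paddr, size_t size)
	{
		struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
		ktime_t t0, t1, t2;

		if (!domain)
			return;

		t0 = ktime_get();
		if (iommu_map(domain, iova, paddr, size,
			      IOMMU_READ | IOMMU_WRITE, GFP_KERNEL))
			return;
		t1 = ktime_get();
		iommu_unmap(domain, iova, size);
		t2 = ktime_get();

		pr_info("map %zu bytes: %lld ns, unmap: %lld ns\n", size,
			ktime_to_ns(ktime_sub(t1, t0)),
			ktime_to_ns(ktime_sub(t2, t1)));
	}

Comparing these numbers against the drm_pagemap_migrate_*() times would
show how much of the gap is page-table code versus DMA API overhead.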
end of thread, other threads:[~2026-01-23 19:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-01-21 13:02 Xe performance regression with recent IOMMU changes Francois Dugast
2026-01-21 13:11 ` Jason Gunthorpe
2026-01-21 18:04   ` Jason Gunthorpe
2026-01-22  6:15     ` Matthew Brost
2026-01-22  7:29       ` Leon Romanovsky
2026-01-22  7:36         ` Matthew Brost
2026-01-22 10:26           ` Leon Romanovsky
2026-01-22 13:31       ` Jason Gunthorpe
2026-01-23 16:27         ` Francois Dugast
2026-01-23 19:07           ` Jason Gunthorpe