* Xe performance regression with recent IOMMU changes
@ 2026-01-21 13:02 Francois Dugast
2026-01-21 13:11 ` Jason Gunthorpe
0 siblings, 1 reply; 10+ messages in thread
From: Francois Dugast @ 2026-01-21 13:02 UTC (permalink / raw)
To: iommu
Cc: intel-xe, Francois Dugast, Jason Gunthorpe, Joerg Roedel,
Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
Samiullah Khawaja, Matthew Brost, Thomas Hellström,
Tina Zhang, Lu Baolu, Kevin Tian
I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
can be observed during DMA mappings/unmappings required to issue copies
between system memory and the device when handling GPU faults. I am not
sure how other use cases or vendors are affected, but below is the
impact on execution times for BMG:
Before changes:
4KB
drm_pagemap_migrate_map_pages: 0.4 us
drm_pagemap_migrate_unmap_pages: 0.4 us
64KB
drm_pagemap_migrate_map_pages: 2.5 us
drm_pagemap_migrate_unmap_pages: 3.5 us
2MB
drm_pagemap_migrate_map_pages: 88 us
drm_pagemap_migrate_unmap_pages: 108 us
After changes:
4KB
drm_pagemap_migrate_map_pages: 0.7 us
drm_pagemap_migrate_unmap_pages: 0.7 us
64KB
drm_pagemap_migrate_map_pages: 3.5 us
drm_pagemap_migrate_unmap_pages: 10.5 us
2MB
drm_pagemap_migrate_map_pages: 102 us
drm_pagemap_migrate_unmap_pages: 330 us
Bisecting points to these commits:
d373449d8e97 iommu/vt-d: Use the generic iommu page table
d856f9d27885 iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
Simply reverting them brings back performance.
Any ideas about why this happens and how to fix it?
Thanks,
Francois
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Calvin Owens <calvin@wbinvd.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
--
2.43.0
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Xe performance regression with recent IOMMU changes
  2026-01-21 13:02 Xe performance regression with recent IOMMU changes Francois Dugast
@ 2026-01-21 13:11 ` Jason Gunthorpe
  2026-01-21 18:04   ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-21 13:11 UTC (permalink / raw)
  To: Francois Dugast
  Cc: iommu, intel-xe, Joerg Roedel, Calvin Owens, David Woodhouse,
      Will Deacon, Robin Murphy, Samiullah Khawaja, Matthew Brost,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> can be observed during DMA mappings/unmappings required to issue copies
> between system memory and the device when handling GPU faults. I am not
> sure how other use cases or vendors are affected, but below is the
> impact on execution times for BMG:
>
> Before changes:
> 4KB
> drm_pagemap_migrate_map_pages: 0.4 us
> drm_pagemap_migrate_unmap_pages: 0.4 us
> 64KB
> drm_pagemap_migrate_map_pages: 2.5 us
> drm_pagemap_migrate_unmap_pages: 3.5 us
> 2MB
> drm_pagemap_migrate_map_pages: 88 us
> drm_pagemap_migrate_unmap_pages: 108 us
>
> After changes:
> 4KB
> drm_pagemap_migrate_map_pages: 0.7 us
> drm_pagemap_migrate_unmap_pages: 0.7 us
> 64KB
> drm_pagemap_migrate_map_pages: 3.5 us
> drm_pagemap_migrate_unmap_pages: 10.5 us
> 2MB
> drm_pagemap_migrate_map_pages: 102 us
> drm_pagemap_migrate_unmap_pages: 330 us

I posted some more optimizations for these cases; they should reduce
the numbers.

This is the opposite of the benchmark numbers I ran, which showed
significant gains as the page count and sizes increased.

But something weird is going on to see a 3x increase in unmap; that
shouldn't be just algorithm overhead. That almost seems like additional
IOTLB invalidation overhead or something else going wrong.

Is this from a system with the VT-d cache flushing requirement? That
logic changed around too and could have this kind of big impact.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
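Whether a given VT-d unit has the cache flushing requirement Jason asks
about is visible in its extended capability register: ECAP bit 0 is C
(page-walk coherency), and 0 means the page-table walker is not cache
coherent. A minimal user-space check, assuming the intel-iommu sysfs
layout (the dmar0 index is illustrative):

	#include <stdio.h>

	int main(void)
	{
		/* Path assumed from the intel-iommu sysfs attribute group;
		 * repeat for each dmar* unit on the system. */
		FILE *f = fopen("/sys/class/iommu/dmar0/intel-iommu/ecap", "r");
		unsigned long long ecap;

		if (!f || fscanf(f, "%llx", &ecap) != 1)
			return 1;
		/* VT-d ECAP bit 0 is C (page-walk coherency);
		 * 0 => page-table updates need CPU cache flushing. */
		printf("dmar0 page-walk coherency: %llu\n", ecap & 1);
		return 0;
	}

A unit reporting 0 takes the incoherent flushing paths discussed below.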
* Re: Xe performance regression with recent IOMMU changes
  2026-01-21 13:11 ` Jason Gunthorpe
@ 2026-01-21 18:04   ` Jason Gunthorpe
  2026-01-22  6:15     ` Matthew Brost
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-21 18:04 UTC (permalink / raw)
  To: Francois Dugast
  Cc: iommu, intel-xe, Joerg Roedel, Calvin Owens, David Woodhouse,
      Will Deacon, Robin Murphy, Samiullah Khawaja, Matthew Brost,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > can be observed during DMA mappings/unmappings required to issue copies
> > between system memory and the device when handling GPU faults. I am not
> > sure how other use cases or vendors are affected, but below is the
> > impact on execution times for BMG:
> >
> > Before changes:
> > 4KB
> > drm_pagemap_migrate_map_pages: 0.4 us
> > drm_pagemap_migrate_unmap_pages: 0.4 us
> > 64KB
> > drm_pagemap_migrate_map_pages: 2.5 us
> > drm_pagemap_migrate_unmap_pages: 3.5 us
> > 2MB
> > drm_pagemap_migrate_map_pages: 88 us
> > drm_pagemap_migrate_unmap_pages: 108 us
> >
> > After changes:
> > 4KB
> > drm_pagemap_migrate_map_pages: 0.7 us
> > drm_pagemap_migrate_unmap_pages: 0.7 us
> > 64KB
> > drm_pagemap_migrate_map_pages: 3.5 us
> > drm_pagemap_migrate_unmap_pages: 10.5 us
> > 2MB
> > drm_pagemap_migrate_map_pages: 102 us
> > drm_pagemap_migrate_unmap_pages: 330 us
>
> I posted some more optimizations for these cases; they should reduce
> the numbers.
>
> This is the opposite of the benchmark numbers I ran, which showed
> significant gains as the page count and sizes increased.
>
> But something weird is going on to see a 3x increase in unmap; that
> shouldn't be just algorithm overhead. That almost seems like additional
> IOTLB invalidation overhead or something else going wrong.
>
> Is this from a system with the VT-d cache flushing requirement? That
> logic changed around too and could have this kind of big impact.

Oh, looking at the code a bit, you've got pretty much the slowest
possible thing you can do here:

	for (i = 0; i < npages;) {
		if (!pagemap_addr[i].addr ||
		    dma_mapping_error(dev, pagemap_addr[i].addr))
			goto next;

		dma_unmap_page(dev, pagemap_addr[i].addr,
			       PAGE_SIZE << pagemap_addr[i].order, dir);

It is weird though:

0.7 us * 512 = 358 us, so it is about the reported speed.

But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
reported?? It got 2x faster the more times you loop it? Huh?

The real way to fix this up is to use the new DMA API so this can be
collapsed into a single unmap. Then it will take < 1 us for all those
cases.

Look at the patches Leon made for the RDMA ODP stuff; it has a similar
looking workflow.

The optimizations I posted will help this noticeably.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
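For reference, the "new DMA API" mentioned here is the two-step
dma_iova_* interface that flows like RDMA ODP use. A minimal sketch of
the collapsed map path, assuming those calls; the function name and
flat pages[] array are illustrative rather than Xe's actual structures,
and error handling is trimmed:

	#include <linux/dma-mapping.h>
	#include <linux/mm.h>

	static int migrate_map_pages(struct device *dev,
				     struct dma_iova_state *state,
				     struct page **pages, unsigned long npages,
				     enum dma_data_direction dir)
	{
		size_t total = (size_t)npages << PAGE_SHIFT;
		size_t mapped = 0;
		unsigned long i;
		int ret;

		/* One contiguous IOVA range for the whole migration window;
		 * if this fails the caller falls back to a dma_map_page() loop. */
		if (!dma_iova_try_alloc(dev, state, page_to_phys(pages[0]), total))
			return -EOPNOTSUPP;

		/* Link each (possibly discontiguous) page into the range; the
		 * DMA address of page i is then state->addr + i * PAGE_SIZE. */
		for (i = 0; i < npages; i++) {
			ret = dma_iova_link(dev, state, page_to_phys(pages[i]),
					    mapped, PAGE_SIZE, dir, 0);
			if (ret)
				goto err_destroy;
			mapped += PAGE_SIZE;
		}

		/* One IOTLB sync for the whole range instead of one per page. */
		ret = dma_iova_sync(dev, state, 0, mapped);
		if (ret)
			goto err_destroy;
		return 0;

	err_destroy:
		/* Unlinks whatever was mapped so far and frees the IOVA range. */
		dma_iova_destroy(dev, state, mapped, dir, 0);
		return ret;
	}

The payoff is on teardown: the per-page dma_unmap_page() loop collapses
into a single dma_iova_destroy(dev, state, total, dir, 0), i.e. one
unlink and one IOTLB invalidation for the whole range.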
* Re: Xe performance regression with recent IOMMU changes
  2026-01-21 18:04 ` Jason Gunthorpe
@ 2026-01-22  6:15   ` Matthew Brost
  2026-01-22  7:29     ` Leon Romanovsky
  2026-01-22 13:31     ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Matthew Brost @ 2026-01-22 6:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Francois Dugast, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > can be observed during DMA mappings/unmappings required to issue copies
> > > between system memory and the device when handling GPU faults. I am not
> > > sure how other use cases or vendors are affected, but below is the
> > > impact on execution times for BMG:
> > >
> > > Before changes:
> > > 4KB
> > > drm_pagemap_migrate_map_pages: 0.4 us
> > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > 64KB
> > > drm_pagemap_migrate_map_pages: 2.5 us
> > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > 2MB
> > > drm_pagemap_migrate_map_pages: 88 us
> > > drm_pagemap_migrate_unmap_pages: 108 us
> > >
> > > After changes:
> > > 4KB
> > > drm_pagemap_migrate_map_pages: 0.7 us
> > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > 64KB
> > > drm_pagemap_migrate_map_pages: 3.5 us
> > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > 2MB
> > > drm_pagemap_migrate_map_pages: 102 us
> > > drm_pagemap_migrate_unmap_pages: 330 us
> >
> > I posted some more optimizations for these cases; they should reduce
> > the numbers.

We can try those — link? I believe I know the series, but just to make
sure we're on the same page.

> > This is the opposite of the benchmark numbers I ran, which showed
> > significant gains as the page count and sizes increased.
> >
> > But something weird is going on to see a 3x increase in unmap; that
> > shouldn't be just algorithm overhead. That almost seems like additional
> > IOTLB invalidation overhead or something else going wrong.
> >
> > Is this from a system with the VT-d cache flushing requirement? That
> > logic changed around too and could have this kind of big impact.
>
> Oh, looking at the code a bit, you've got pretty much the slowest
> possible thing you can do here:

This was a fairly common pattern prior to Leon's series, I believe. The
cross-references show this pattern appearing frequently in the kernel
[1]. I do agree with the point below that, with Leon's changes applied,
this could be refactored into an IOVA alloc/link/unlink/free flow,
which would work better (also, 2M device pages make the common 2M case
a moot point).

But that's not what we're discussing here. We're talking about a
regression introduced in the dma-mapping API for x86, which in my view
is unacceptable for a kernel release. So IMO we should revert those
changes [2].

[1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page
[2]
e6fbd544619c50b4a4d96ccb4676cac03cb iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
d856f9d27885c499d96ab7fe506083346ccf145d iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
6cbc09b7719ec7fd9f650f18b3828b7f60c17881 iommu/vt-d: Restore previous domain::aperture_end calculation
a97fbc3ee3e2a536fafaff04f21f45472db71769 syscore: Pass context data to callbacks
101a2854110fa8787226dae1202892071ff2c369 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
d373449d8e97891434db0c64afca79d903c1194e iommu/vt-d: Use the generic iommu page table

> 	for (i = 0; i < npages;) {
> 		if (!pagemap_addr[i].addr ||
> 		    dma_mapping_error(dev, pagemap_addr[i].addr))
> 			goto next;
>
> 		dma_unmap_page(dev, pagemap_addr[i].addr,
> 			       PAGE_SIZE << pagemap_addr[i].order, dir);
>
> It is weird though:
>
> 0.7 us * 512 = 358 us, so it is about the reported speed.
>
> But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
> reported?? It got 2x faster the more times you loop it? Huh?
>
> The real way to fix this up is to use the new DMA API so this can be
> collapsed into a single unmap. Then it will take < 1 us for all those
> cases.
>
> Look at the patches Leon made for the RDMA ODP stuff; it has a similar
> looking workflow.

See above. I agree this is the right direction, but we can't simply
regress kernels from existing performance.

> The optimizations I posted will help this noticeably.

I think we need to start with a revert and then discuss whether your
subsequent changes actually fix the problem.

Matt

> Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  6:15 ` Matthew Brost
@ 2026-01-22  7:29   ` Leon Romanovsky
  2026-01-22  7:36     ` Matthew Brost
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2026-01-22 7:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Jason Gunthorpe, Francois Dugast, iommu, intel-xe, Joerg Roedel,
      Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
      Samiullah Khawaja, Thomas Hellström, Tina Zhang, Lu Baolu,
      Kevin Tian

On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > > can be observed during DMA mappings/unmappings required to issue copies
> > > > between system memory and the device when handling GPU faults. I am not
> > > > sure how other use cases or vendors are affected, but below is the
> > > > impact on execution times for BMG:
> > > >
> > > > Before changes:
> > > > 4KB
> > > > drm_pagemap_migrate_map_pages: 0.4 us
> > > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > > 64KB
> > > > drm_pagemap_migrate_map_pages: 2.5 us
> > > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > > 2MB
> > > > drm_pagemap_migrate_map_pages: 88 us
> > > > drm_pagemap_migrate_unmap_pages: 108 us
> > > >
> > > > After changes:
> > > > 4KB
> > > > drm_pagemap_migrate_map_pages: 0.7 us
> > > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > > 64KB
> > > > drm_pagemap_migrate_map_pages: 3.5 us
> > > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > > 2MB
> > > > drm_pagemap_migrate_map_pages: 102 us
> > > > drm_pagemap_migrate_unmap_pages: 330 us
> > >
> > > I posted some more optimizations for these cases; they should reduce
> > > the numbers.
>
> We can try those — link? I believe I know the series, but just to make
> sure we're on the same page.
>
> > > This is the opposite of the benchmark numbers I ran, which showed
> > > significant gains as the page count and sizes increased.
> > >
> > > But something weird is going on to see a 3x increase in unmap; that
> > > shouldn't be just algorithm overhead. That almost seems like additional
> > > IOTLB invalidation overhead or something else going wrong.
> > >
> > > Is this from a system with the VT-d cache flushing requirement? That
> > > logic changed around too and could have this kind of big impact.
> >
> > Oh, looking at the code a bit, you've got pretty much the slowest
> > possible thing you can do here:
>
> This was a fairly common pattern prior to Leon's series, I believe. The
> cross-references show this pattern appearing frequently in the kernel
> [1]. I do agree with the point below that, with Leon's changes applied,
> this could be refactored into an IOVA alloc/link/unlink/free flow,
> which would work better (also, 2M device pages make the common 2M case
> a moot point).
>
> But that's not what we're discussing here. We're talking about a
> regression introduced in the dma-mapping API for x86, which in my view
> is unacceptable for a kernel release. So IMO we should revert those
> changes [2].
>
> [1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page

I think this comparison is unfair. The previous behavior was bad for
everyone, while the current issue affects only the specific
drm_pagemap_migrate_unmap_pages() flow. Cases where the performance of
dma_unmap_page() in non-direct mode matters are extremely rare.

It should be relatively straightforward to add a link/unlink path to
the drm_pagemap_*() helpers and achieve decent performance.

Thanks

> [2]
> e6fbd544619c50b4a4d96ccb4676cac03cb iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
> d856f9d27885c499d96ab7fe506083346ccf145d iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
> 6cbc09b7719ec7fd9f650f18b3828b7f60c17881 iommu/vt-d: Restore previous domain::aperture_end calculation
> a97fbc3ee3e2a536fafaff04f21f45472db71769 syscore: Pass context data to callbacks
> 101a2854110fa8787226dae1202892071ff2c369 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
> d373449d8e97891434db0c64afca79d903c1194e iommu/vt-d: Use the generic iommu page table
>
> > 	for (i = 0; i < npages;) {
> > 		if (!pagemap_addr[i].addr ||
> > 		    dma_mapping_error(dev, pagemap_addr[i].addr))
> > 			goto next;
> >
> > 		dma_unmap_page(dev, pagemap_addr[i].addr,
> > 			       PAGE_SIZE << pagemap_addr[i].order, dir);
> >
> > It is weird though:
> >
> > 0.7 us * 512 = 358 us, so it is about the reported speed.
> >
> > But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
> > reported?? It got 2x faster the more times you loop it? Huh?
> >
> > The real way to fix this up is to use the new DMA API so this can be
> > collapsed into a single unmap. Then it will take < 1 us for all those
> > cases.
> >
> > Look at the patches Leon made for the RDMA ODP stuff; it has a similar
> > looking workflow.
>
> See above. I agree this is the right direction, but we can't simply
> regress kernels from existing performance.
>
> > The optimizations I posted will help this noticeably.
>
> I think we need to start with a revert and then discuss whether your
> subsequent changes actually fix the problem.
>
> Matt
>
> > Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  7:29 ` Leon Romanovsky
@ 2026-01-22  7:36   ` Matthew Brost
  2026-01-22 10:26     ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Brost @ 2026-01-22 7:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Francois Dugast, iommu, intel-xe, Joerg Roedel,
      Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
      Samiullah Khawaja, Thomas Hellström, Tina Zhang, Lu Baolu,
      Kevin Tian

On Thu, Jan 22, 2026 at 09:29:13AM +0200, Leon Romanovsky wrote:
> On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> > On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > > > can be observed during DMA mappings/unmappings required to issue copies
> > > > > between system memory and the device when handling GPU faults. I am not
> > > > > sure how other use cases or vendors are affected, but below is the
> > > > > impact on execution times for BMG:
> > > > >
> > > > > Before changes:
> > > > > 4KB
> > > > > drm_pagemap_migrate_map_pages: 0.4 us
> > > > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > > > 64KB
> > > > > drm_pagemap_migrate_map_pages: 2.5 us
> > > > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > > > 2MB
> > > > > drm_pagemap_migrate_map_pages: 88 us
> > > > > drm_pagemap_migrate_unmap_pages: 108 us
> > > > >
> > > > > After changes:
> > > > > 4KB
> > > > > drm_pagemap_migrate_map_pages: 0.7 us
> > > > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > > > 64KB
> > > > > drm_pagemap_migrate_map_pages: 3.5 us
> > > > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > > > 2MB
> > > > > drm_pagemap_migrate_map_pages: 102 us
> > > > > drm_pagemap_migrate_unmap_pages: 330 us
> > > >
> > > > I posted some more optimizations for these cases; they should reduce
> > > > the numbers.
> >
> > We can try those — link? I believe I know the series, but just to make
> > sure we're on the same page.
> >
> > > > This is the opposite of the benchmark numbers I ran, which showed
> > > > significant gains as the page count and sizes increased.
> > > >
> > > > But something weird is going on to see a 3x increase in unmap; that
> > > > shouldn't be just algorithm overhead. That almost seems like additional
> > > > IOTLB invalidation overhead or something else going wrong.
> > > >
> > > > Is this from a system with the VT-d cache flushing requirement? That
> > > > logic changed around too and could have this kind of big impact.
> > >
> > > Oh, looking at the code a bit, you've got pretty much the slowest
> > > possible thing you can do here:
> >
> > This was a fairly common pattern prior to Leon's series, I believe. The
> > cross-references show this pattern appearing frequently in the kernel
> > [1]. I do agree with the point below that, with Leon's changes applied,
> > this could be refactored into an IOVA alloc/link/unlink/free flow,
> > which would work better (also, 2M device pages make the common 2M case
> > a moot point).
> >
> > But that's not what we're discussing here. We're talking about a
> > regression introduced in the dma-mapping API for x86, which in my view
> > is unacceptable for a kernel release. So IMO we should revert those
> > changes [2].
> >
> > [1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page
>
> I think this comparison is unfair. The previous behavior was bad for
> everyone, while the current issue affects only the specific
> drm_pagemap_migrate_unmap_pages() flow. Cases where the performance of
> dma_unmap_page() in non-direct mode matters are extremely rare.

I don't think you can reason about this without extensive testing
across multiple platforms. Nor is it fair to say: sorry, we slowed down
your existing code, good luck.

> It should be relatively straightforward to add a link/unlink path to
> the drm_pagemap_*() helpers and achieve decent performance.

I agree. Happy to work with you on this going *forward*.

Matt

> Thanks
>
> > [2]
> > e6fbd544619c50b4a4d96ccb4676cac03cb iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
> > d856f9d27885c499d96ab7fe506083346ccf145d iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
> > 6cbc09b7719ec7fd9f650f18b3828b7f60c17881 iommu/vt-d: Restore previous domain::aperture_end calculation
> > a97fbc3ee3e2a536fafaff04f21f45472db71769 syscore: Pass context data to callbacks
> > 101a2854110fa8787226dae1202892071ff2c369 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
> > d373449d8e97891434db0c64afca79d903c1194e iommu/vt-d: Use the generic iommu page table
> >
> > > 	for (i = 0; i < npages;) {
> > > 		if (!pagemap_addr[i].addr ||
> > > 		    dma_mapping_error(dev, pagemap_addr[i].addr))
> > > 			goto next;
> > >
> > > 		dma_unmap_page(dev, pagemap_addr[i].addr,
> > > 			       PAGE_SIZE << pagemap_addr[i].order, dir);
> > >
> > > It is weird though:
> > >
> > > 0.7 us * 512 = 358 us, so it is about the reported speed.
> > >
> > > But the old one is 0.4 us * 512 = 204 us, which is twice as slow as
> > > reported?? It got 2x faster the more times you loop it? Huh?
> > >
> > > The real way to fix this up is to use the new DMA API so this can be
> > > collapsed into a single unmap. Then it will take < 1 us for all those
> > > cases.
> > >
> > > Look at the patches Leon made for the RDMA ODP stuff; it has a similar
> > > looking workflow.
> >
> > See above. I agree this is the right direction, but we can't simply
> > regress kernels from existing performance.
> >
> > > The optimizations I posted will help this noticeably.
> >
> > I think we need to start with a revert and then discuss whether your
> > subsequent changes actually fix the problem.
> >
> > Matt
> >
> > > Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  7:36 ` Matthew Brost
@ 2026-01-22 10:26   ` Leon Romanovsky
  0 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2026-01-22 10:26 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Jason Gunthorpe, Francois Dugast, iommu, intel-xe, Joerg Roedel,
      Calvin Owens, David Woodhouse, Will Deacon, Robin Murphy,
      Samiullah Khawaja, Thomas Hellström, Tina Zhang, Lu Baolu,
      Kevin Tian

On Wed, Jan 21, 2026 at 11:36:47PM -0800, Matthew Brost wrote:
> On Thu, Jan 22, 2026 at 09:29:13AM +0200, Leon Romanovsky wrote:
> > On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> > > On Wed, Jan 21, 2026 at 02:04:49PM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 21, 2026 at 09:11:35AM -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 21, 2026 at 02:02:16PM +0100, Francois Dugast wrote:
> > > > > > I am reporting a slowdown in Xe caused by a couple of IOMMU changes. It
> > > > > > can be observed during DMA mappings/unmappings required to issue copies
> > > > > > between system memory and the device when handling GPU faults. I am not
> > > > > > sure how other use cases or vendors are affected, but below is the
> > > > > > impact on execution times for BMG:
> > > > > >
> > > > > > Before changes:
> > > > > > 4KB
> > > > > > drm_pagemap_migrate_map_pages: 0.4 us
> > > > > > drm_pagemap_migrate_unmap_pages: 0.4 us
> > > > > > 64KB
> > > > > > drm_pagemap_migrate_map_pages: 2.5 us
> > > > > > drm_pagemap_migrate_unmap_pages: 3.5 us
> > > > > > 2MB
> > > > > > drm_pagemap_migrate_map_pages: 88 us
> > > > > > drm_pagemap_migrate_unmap_pages: 108 us
> > > > > >
> > > > > > After changes:
> > > > > > 4KB
> > > > > > drm_pagemap_migrate_map_pages: 0.7 us
> > > > > > drm_pagemap_migrate_unmap_pages: 0.7 us
> > > > > > 64KB
> > > > > > drm_pagemap_migrate_map_pages: 3.5 us
> > > > > > drm_pagemap_migrate_unmap_pages: 10.5 us
> > > > > > 2MB
> > > > > > drm_pagemap_migrate_map_pages: 102 us
> > > > > > drm_pagemap_migrate_unmap_pages: 330 us
> > > > >
> > > > > I posted some more optimizations for these cases; they should reduce
> > > > > the numbers.
> > >
> > > We can try those — link? I believe I know the series, but just to make
> > > sure we're on the same page.
> > >
> > > > > This is the opposite of the benchmark numbers I ran, which showed
> > > > > significant gains as the page count and sizes increased.
> > > > >
> > > > > But something weird is going on to see a 3x increase in unmap; that
> > > > > shouldn't be just algorithm overhead. That almost seems like additional
> > > > > IOTLB invalidation overhead or something else going wrong.
> > > > >
> > > > > Is this from a system with the VT-d cache flushing requirement? That
> > > > > logic changed around too and could have this kind of big impact.
> > > >
> > > > Oh, looking at the code a bit, you've got pretty much the slowest
> > > > possible thing you can do here:
> > >
> > > This was a fairly common pattern prior to Leon's series, I believe. The
> > > cross-references show this pattern appearing frequently in the kernel
> > > [1]. I do agree with the point below that, with Leon's changes applied,
> > > this could be refactored into an IOVA alloc/link/unlink/free flow,
> > > which would work better (also, 2M device pages make the common 2M case
> > > a moot point).
> > >
> > > But that's not what we're discussing here. We're talking about a
> > > regression introduced in the dma-mapping API for x86, which in my view
> > > is unacceptable for a kernel release. So IMO we should revert those
> > > changes [2].
> > >
> > > [1] https://elixir.bootlin.com/linux/v6.18.6/A/ident/dma_unmap_page
> >
> > I think this comparison is unfair. The previous behavior was bad for
> > everyone, while the current issue affects only the specific
> > drm_pagemap_migrate_unmap_pages() flow. Cases where the performance of
> > dma_unmap_page() in non-direct mode matters are extremely rare.
>
> I don't think you can reason about this without extensive testing
> across multiple platforms. Nor is it fair to say: sorry, we slowed down
> your existing code, good luck.

That is not what I said. I only made the specific point that a loop
over dma_unmap_page() is not universally performance critical.

Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22  6:15 ` Matthew Brost
  2026-01-22  7:29   ` Leon Romanovsky
@ 2026-01-22 13:31   ` Jason Gunthorpe
  2026-01-23 16:27     ` Francois Dugast
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 13:31 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Francois Dugast, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Wed, Jan 21, 2026 at 10:15:14PM -0800, Matthew Brost wrote:
> > > I posted some more optimizations for these cases; they should reduce
> > > the numbers.
>
> We can try those — link? I believe I know the series, but just to make
> sure we're on the same page.

https://patch.msgid.link/r/0-v2-973a6bdc820f+693-iommpt_map_direct_jgg@nvidia.com

I also need an answer on whether this testing is running on the
non-cache-coherent IOMMU HW Intel sometimes has; it makes a difference.

I also have in mind a fairly small change to speed up this special
unmap case. IMHO those two together will likely get you back close
enough. And then use link if you actually care about this scenario.

> This was a fairly common pattern prior to Leon's series, I believe. The
> cross-references show this pattern appearing frequently in the kernel
> [1].

Yes, the pattern is common, but virtually nobody actually uses it with
the iommu turned on, because it is something like 10x slower than just
using identity mode.

I understand this is a test suite and it should test with the iommu
enabled, but I'm deeply skeptical this represents actual users who also
care about performance. If they did, they'd already have set the iommu
to identity.

> > The optimizations I posted will help this noticeably.
>
> I think we need to start with a revert and then discuss whether your
> subsequent changes actually fix the problem.

We haven't even done some basic investigation; immediately demanding a
revert of such a large amount of work, for a use case I suspect doesn't
have users, is not reasonable.

This work was not done for no reason and is bringing performance wins
for other use cases that do actually have real users.

If we eventually really can't fix it then you can talk about reverts,
but given that link will absolutely fix xe, I don't see that happening.

Try the patches, give me the new numbers, tell me if you have the
non-cache iommu, and I will give you another one to try.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-22 13:31 ` Jason Gunthorpe
@ 2026-01-23 16:27   ` Francois Dugast
  2026-01-23 19:07     ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Francois Dugast @ 2026-01-23 16:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Brost, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Thu, Jan 22, 2026 at 09:31:31AM -0400, Jason Gunthorpe wrote:
> Try the patches, give me the new numbers,

Thanks for the suggestion, but they do not seem to help. See the new
execution times below in ns, collected this time without kprobes to
reduce variation:

# iommu-tip + https://patch.msgid.link/r/0-v2-973a6bdc820f+693-iommpt_map_direct_jgg@nvidia.com
+-----------------------------------+--------+--------+--------+
|                                   | 4KB    | 64KB   | 2MB    |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_map_pages()   | 660    | 3951   | 113813 |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_unmap_pages() | 610    | 11136  | 322802 |
+-----------------------------------+--------+--------+--------+

# drm-tip
+-----------------------------------+--------+--------+--------+
|                                   | 4KB    | 64KB   | 2MB    |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_map_pages()   | 687    | 3890   | 114749 |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_unmap_pages() | 621    | 11180  | 334472 |
+-----------------------------------+--------+--------+--------+

# drm-tip + revert of IOMMU changes
+-----------------------------------+--------+--------+--------+
|                                   | 4KB    | 64KB   | 2MB    |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_map_pages()   | 355    | 3545   | 102706 |
+-----------------------------------+--------+--------+--------+
| drm_pagemap_migrate_unmap_pages() | 305    | 4341   | 125919 |
+-----------------------------------+--------+--------+--------+

> tell me if you have the non-cache iommu

The setup used in this test has a non-cache-coherent IOMMU.

> and I will give you another one to try.

Sure, please do.

Francois

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Xe performance regression with recent IOMMU changes
  2026-01-23 16:27 ` Francois Dugast
@ 2026-01-23 19:07   ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 19:07 UTC (permalink / raw)
  To: Francois Dugast
  Cc: Matthew Brost, iommu, intel-xe, Joerg Roedel, Calvin Owens,
      David Woodhouse, Will Deacon, Robin Murphy, Samiullah Khawaja,
      Thomas Hellström, Tina Zhang, Lu Baolu, Kevin Tian

On Fri, Jan 23, 2026 at 05:27:24PM +0100, Francois Dugast wrote:
> On Thu, Jan 22, 2026 at 09:31:31AM -0400, Jason Gunthorpe wrote:
> > Try the patches, give me the new numbers,
>
> Thanks for the suggestion, but they do not seem to help. See the new
> execution times below in ns, collected this time without kprobes to
> reduce variation:
>
> # iommu-tip + https://patch.msgid.link/r/0-v2-973a6bdc820f+693-iommpt_map_direct_jgg@nvidia.com
> +-----------------------------------+--------+--------+--------+
> |                                   | 4KB    | 64KB   | 2MB    |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_map_pages()   | 660    | 3951   | 113813 |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_unmap_pages() | 610    | 11136  | 322802 |
> +-----------------------------------+--------+--------+--------+
>
> # drm-tip
> +-----------------------------------+--------+--------+--------+
> |                                   | 4KB    | 64KB   | 2MB    |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_map_pages()   | 687    | 3890   | 114749 |
> +-----------------------------------+--------+--------+--------+
> | drm_pagemap_migrate_unmap_pages() | 621    | 11180  | 334472 |
> +-----------------------------------+--------+--------+--------+

It is not nothing; that looks like about a 4% gain, which matches the
lower bound of what I was measuring for those patches as well.

There are two mysteries in your report. First, compared to my
measurements:

https://lore.kernel.org/linux-iommu/5-v3-634ccd3efce0+16d38-iommu_pt_vtd_jgg@nvidia.com/

iommu_map()
  pgsz      , avg new,old ns , min new,old ns , min % (+ve is better)
  2^12      , 53,66          , 50,64          , 21.21
  256*2^12  , 384,524        , 337,516        , 34.34

iommu_unmap()
  pgsz      , avg new,old ns , min new,old ns , min % (+ve is better)
  2^12      , 67,86          , 63,84          , 25.25
  256*2^12  , 216,335        , 198,317        , 37.37

Yours are about 10x higher. Granted, they are not exactly the same
thing, but I'm measuring the actual page table code as 20% faster, not
slower. So I'm really wondering what is so different in your situation.
Is the cache flushing causing the 10x delta?

Second, it is normal for map and unmap to be approximately the same;
your results have unmap being 165% slower. This surely must be a bug;
I have a guess that some cache flush has the incorrect length.

Still, that 10x difference is confusing. Are you running with debug
options in your kernel config? I wouldn't be surprised at all to be
told KASAN/gcov/etc. reacts much differently.

> > tell me if you have the non-cache iommu
>
> The setup used in this test has a non-cache-coherent IOMMU.

That helps a lot. The non-coherent case disables a meaningful
optimization for the 4k page map case and triggers a bunch of
hard-to-test cache flushing code that we can look at.

Any chance you can run this on a system that has a coherent IOMMU? That
would really help narrow things down.

Can you measure the iommu_map()/iommu_unmap() calls directly under the
DMA API?

Another thought is that something related to the gather outside the
actual page table is acting differently.

I will attempt to run some benchmarking here specifically with the
non-coherent mode enabled to see if I can find a bug.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread
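A rough sketch of the direct iommu_map()/iommu_unmap() timing Jason
asks for, as a throwaway kernel-module snippet. The IOVA and the
physically contiguous buffer are assumed to be supplied by the caller,
and on a live DMA domain the IOVA would have to be reserved from the
dma-iommu allocator first; a dedicated test domain is cleaner:

	#include <linux/iommu.h>
	#include <linux/ktime.h>
	#include <linux/printk.h>

	/* Time one map/unmap round trip of 'size' bytes at 'iova' in the
	 * device's current domain. */
	static void time_map_unmap(struct device *dev, unsigned long iova,
				   phys_addr_t paddr, size_t size)
	{
		struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
		ktime_t t0, t1, t2;

		if (!domain)
			return;

		t0 = ktime_get();
		if (iommu_map(domain, iova, paddr, size,
			      IOMMU_READ | IOMMU_WRITE, GFP_KERNEL))
			return;
		t1 = ktime_get();
		iommu_unmap(domain, iova, size);
		t2 = ktime_get();

		pr_info("map %zu bytes: %lld ns, unmap: %lld ns\n", size,
			ktime_to_ns(ktime_sub(t1, t0)),
			ktime_to_ns(ktime_sub(t2, t1)));
	}

Comparing these numbers against the drm_pagemap_migrate_*() times would
show how much of the gap is page-table code versus DMA API overhead.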
end of thread, other threads:[~2026-01-23 19:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-01-21 13:02 Xe performance regression with recent IOMMU changes Francois Dugast
2026-01-21 13:11 ` Jason Gunthorpe
2026-01-21 18:04   ` Jason Gunthorpe
2026-01-22  6:15     ` Matthew Brost
2026-01-22  7:29       ` Leon Romanovsky
2026-01-22  7:36         ` Matthew Brost
2026-01-22 10:26           ` Leon Romanovsky
2026-01-22 13:31       ` Jason Gunthorpe
2026-01-23 16:27         ` Francois Dugast
2026-01-23 19:07           ` Jason Gunthorpe