linux-mm.kvack.org archive mirror
* [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO
@ 2025-05-29 21:44 Alex Mastro
  2025-05-30 13:10 ` Jason Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Mastro @ 2025-05-29 21:44 UTC (permalink / raw)
  To: linux-pci; +Cc: alex.williamson, jgg, peterx, kbusch, linux-mm

Hello,

We are running user space drivers in production on top of VFIO, and after
upgrading from v6.9.0 to v6.13.2 we noticed intermittently slow performance,
leading to "rcu_sched self-detected stall", when issuing VFIO_IOMMU_MAP_DMA
on ~64 GiB BAR regions mmap-ed from a VFIO device fd. When doing this on
enough devices concurrently, we triggered softlockup_panic.

We map regions > 1 GiB which sometimes do not start at 1 GiB-aligned BAR
offsets, but they are always aligned to at least 2 MiB.

We determined that slow, stalling runs were correlated with 4 KiB-aligned
addresses returned by mmap, and normal runs with >= 2 MiB alignment.

Inspired by QEMU's mmap-alloc.c, we are handling this by reserving VA with
an oversized anonymous mmap, then clobbering a well-aligned address inside
the reservation with a MAP_FIXED mmap of the VFIO device fd.

At first we settled for aligning the mmap address to {1 GiB, 2 MiB} exactly,
and the stalls disappeared, but then we further improved performance as
follows.

We found that the best addresses to pass to VFIO_IOMMU_MAP_DMA have the
following properties, where va_align and va_offset are chosen based on the
size and BAR offset of the desired mapping:

va_align = {1 GiB, 2 MiB, 4 KiB}
va_offset = mmap_offset % va_align
(addr_to_mmap % va_align) == va_offset

Using addresses with the above properties seems to optimize the count and
granularity of faults as confirmed by bpftrace-ing vfio_pci_mmap_huge_fault.
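[Editor's note: the congruence above can be expressed as a small helper.
This is a hypothetical illustration; the function names are not from this
thread.]

```c
#include <assert.h>
#include <stdint.h>

/* True if addr satisfies (addr % va_align) == va_offset, where
 * va_offset = mmap_offset % va_align. */
static int addr_is_good(uint64_t addr, uint64_t mmap_offset,
                        uint64_t va_align)
{
    return addr % va_align == mmap_offset % va_align;
}

/* Smallest address >= addr satisfying the congruence. */
static uint64_t next_good_addr(uint64_t addr, uint64_t mmap_offset,
                               uint64_t va_align)
{
    uint64_t va_offset = mmap_offset % va_align;
    uint64_t cand = addr - addr % va_align + va_offset;
    return cand >= addr ? cand : cand + va_align;
}
```

With va_align = 1 GiB and a BAR mmap offset of 2 MiB, for example, a hint
address of 0x10000000 would be bumped up to 0x40200000, which is congruent
to the offset modulo 1 GiB.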

We then backported "Improve DMA mapping performance for huge pfnmaps" [1] to
our 6.13 tree, and saw further performance improvements consistent with those
described in the patch (thank you!). However, with the backport, we still need
to align mmap addresses manually, otherwise we see stalls.

We are wondering the following:
- Is all of the above expected behavior, and usage of VFIO?
- Is there an expected minimum alignment greater than 4K (our system page size)
  for non-MAP_FIXED mmap on a VFIO device fd?
- Was there an unintended regression to our use-case in between 6.9 and 6.13?

Thanks,
Alex Mastro

[1] https://lore.kernel.org/all/20250205231728.2527186-1-alex.williamson@redhat.com/


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO
  2025-05-29 21:44 [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO Alex Mastro
@ 2025-05-30 13:10 ` Jason Gunthorpe
  2025-05-30 14:25   ` Peter Xu
  2025-06-06 18:49   ` Alex Mastro
  0 siblings, 2 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2025-05-30 13:10 UTC (permalink / raw)
  To: Alex Mastro; +Cc: linux-pci, alex.williamson, peterx, kbusch, linux-mm

On Thu, May 29, 2025 at 02:44:14PM -0700, Alex Mastro wrote:

> We are wondering the following:
> - Is all of the above expected behavior, and usage of VFIO?
> - Is there an expected minimum alignment greater than 4K (our system page size)
>   for non-MAP_FIXED mmap on a VFIO device fd?
> - Was there an unintended regression to our use-case in between 6.9 and 6.13?

I think this is something we have missed. VFIO should automatically
align the VMA's address if not MAP_FIXED, otherwise it can't use the
efficient huge page sizes anymore. qemu uses MAP_FIXED so we've left
out the non-qemu users from this performance optimization.

To fix it, the flow from the mm side is something like what
shmem_get_unmapped_area() does. VFIO would probably want to align all
BARs to their size.

Which seems to me probably wants some refactoring and a core helper
'mm_get_aligned_unmapped_area()'..
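[Editor's note: in userspace terms, "align BARs to their size" might look
like the following helper. This is an editor's sketch, not code from the
thread; the 1 GiB cap is an assumption based on the {1 GiB, 2 MiB, 4 KiB}
alignment classes discussed above.]

```c
#include <assert.h>
#include <stddef.h>

/* Suggest a VA alignment for a BAR mapping of `size` bytes: the BAR's
 * size (BAR sizes are powers of two per the PCI spec), clamped between
 * the 4 KiB page size and 1 GiB, the largest mapping granule the
 * MMU/IOMMU can typically exploit. */
static size_t bar_va_align(size_t size)
{
    const size_t min_align = (size_t)1 << 12;  /* 4 KiB */
    const size_t max_align = (size_t)1 << 30;  /* 1 GiB */
    size_t align = min_align;

    while (align < size && align < max_align)
        align <<= 1;
    return align;
}
```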

I think if you are mmaping a huge huge BAR it is not surprising that
it will take a huge amount of time to write out all of the 4K
PTEs. The stalls on old kernels should probably be addressed by having
cond_resched() inside the remap_pfnmap().

Jason



* Re: [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO
  2025-05-30 13:10 ` Jason Gunthorpe
@ 2025-05-30 14:25   ` Peter Xu
  2025-05-30 23:05     ` Alex Mastro
  2025-06-06 18:49   ` Alex Mastro
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Xu @ 2025-05-30 14:25 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Mastro, linux-pci, alex.williamson, kbusch, linux-mm

On Fri, May 30, 2025 at 10:10:50AM -0300, Jason Gunthorpe wrote:
> On Thu, May 29, 2025 at 02:44:14PM -0700, Alex Mastro wrote:
> 
> > We are wondering the following:
> > - Is all of the above expected behavior, and usage of VFIO?
> > - Is there an expected minimum alignment greater than 4K (our system page size)
> >   for non-MAP_FIXED mmap on a VFIO device fd?
> > - Was there an unintended regression to our use-case in between 6.9 and 6.13?

Probably due to aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()").  IIUC
the plan was that huge faults could bring back the lost perf, but indeed the
alignment is still a challenge to always get right.

> 
> I think this is something we have missed. VFIO should automatically
> align the VMA's address if not MAP_FIXED, otherwise it can't use the
> efficient huge page sizes anymore. qemu uses MAP_FIXED so we've left
> out the non-qemu users from this performance optimization.
> 
> To fix it, the flow from the mm side is something like what
> shmem_get_unmapped_area() does. VFIO would probably want to align all
> BAR's to their size.

Good point!  I overlooked the VA hints when QEMU doesn't need it.  I can
have a closer look if nobody else will.

> 
> Which seems to me probably wants some refactoring and a core helper
> 'mm_get_aligned_unmapped_area()'..
> 
> I think if you are mmaping a huge huge BAR it is not surprising that
> it will take a huge amount of time to write out all of the 4K
> PTEs. The stalls on old kernels should probably be addressed by having
> cond_resched() inside the remap_pfnmap().

Right, but then that'll be a stable-only fix.

If VFIO can provide a valid get_unmapped_area(), then with huge faults maybe
we don't even need it, and such a change can be copied to stable too.

Meanwhile, just to mention, there's one more commit that the vfio huge_fault
stable branches would like to have soon, in which Alex fixed yet another
alignment-related issue needed for reliable huge faults:

commit c1d9dac0db168198b6f63f460665256dedad9b6e
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Fri May 2 16:40:31 2025 -0600

    vfio/pci: Align huge faults to order

I think if your trace shows correct huge faults once you use correct
alignment, it should mean this doesn't affect your case (likely your app
faults in the BAR region sequentially; meanwhile there are likely no
concurrent, especially unaligned, faults when pre-faulting everything).
But just an FYI, and IIUC that commit will land in 6.13.z soon.

Thanks,

-- 
Peter Xu




* Re: [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO
  2025-05-30 14:25   ` Peter Xu
@ 2025-05-30 23:05     ` Alex Mastro
  0 siblings, 0 replies; 6+ messages in thread
From: Alex Mastro @ 2025-05-30 23:05 UTC (permalink / raw)
  To: peterx; +Cc: alex.williamson, jgg, kbusch, linux-mm, linux-pci

On Fri, 30 May 2025 10:25:01 -0400 Peter Xu <peterx@redhat.com> wrote:
> On Fri, May 30, 2025 at 10:10:50AM -0300, Jason Gunthorpe wrote:

> Probably due to aac6db75a9fc vfio/pci: Use unmap_mapping_range().

Ack.

> > I think this is something we have missed. VFIO should automatically
> > align the VMA's address if not MAP_FIXED, otherwise it can't use the
> > efficient huge page sizes anymore. qemu uses MAP_FIXED so we've left
> > out the non-qemu users from this performance optimization.

Thanks for confirming.

> Good point!  I overlooked the VA hints when QEMU doesn't need it.  I can
> have a closer look if nobody else will.

This would be appreciated -- thank you!

> > I think if you are mmaping a huge huge BAR it is not surprising that
> > it will take a huge amount of time to write out all of the 4K
> > PTEs. 

Agreed. This matches what we observed.

> I think if your trace shows correct huge faults when you did correct
> alignment, it should mean it doesn't affect your case (likely your app
> sequentially fault in the bar region.

Yes, this is the faulting triggered by the call stack below, downstream from
VFIO_IOMMU_MAP_DMA, which faults in the entire VA range to be mapped.

vfio_pci_mmap_huge_fault+0xf5/0x1b0 [vfio_pci_core]
__do_fault+0x3f/0x130
do_pte_missing+0x363/0xf40
handle_mm_fault+0x6d2/0x1200
fixup_user_fault+0x121/0x280
vaddr_get_pfns+0x185/0x3c0 [vfio_iommu_type1]
vfio_pin_pages_remote+0x1a1/0x590 [vfio_iommu_type1]
vfio_pin_map_dma+0xe6/0x2c0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xd32/0xea0 [vfio_iommu_type1]

I also confirmed that cherry picking "vfio/pci: Align huge faults to order"
does not affect our usage of this path (manual mmap alignment is still
required).

Thanks,
Alex Mastro



* Re: [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO
  2025-05-30 13:10 ` Jason Gunthorpe
  2025-05-30 14:25   ` Peter Xu
@ 2025-06-06 18:49   ` Alex Mastro
  2025-06-09  0:20     ` Jason Gunthorpe
  1 sibling, 1 reply; 6+ messages in thread
From: Alex Mastro @ 2025-06-06 18:49 UTC (permalink / raw)
  To: jgg
  Cc: linux-pci, alex.williamson, peterx, kbusch, linux-mm, leon,
	vivek.kasireddy, wguay, yilun.xu

Hi Jason,

By the way, we have been following progress on IOMMUFD, and would be interested
in dogfooding it for our use case when ready. The main blocker is IOMMUFD's
current lack of P2P support (IOMMU_IOAS_MAP fails when the VA range is backed
by MMIO).

Using dma-buf as a less ambiguous way of communicating this intent (rather
than struggling to infer what kind of memory is behind some VA range) makes
a lot of sense.

Based on tidbits we have gleaned, IOMMUFD P2P support intends to be built on
top of "Provide a new two step DMA mapping API" [1] and "vfio/pci: Allow MMIO
regions to be exported through dma-buf" [2].

Item [2] appears to have been picked up by "Host side (KVM/VFIO/IOMMUFD) support
for TDISP using TSM" [3].

Is the above understanding correct?

On top of this, there would need to be a new IOMMUFD uapi, or extension to
existing, which would accept an input dma-buf to map. Are there any patches in
progress which include this?

[1] https://lore.kernel.org/all/cover.1746424934.git.leon@kernel.org/
[2] https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.com
[3] https://lore.kernel.org/all/20250529053513.1592088-1-yilun.xu@linux.intel.com

Thanks,
Alex



* Re: [BUG?] vfio/pci: VA alignment sensitivity of VFIO_IOMMU_MAP_DMA which target MMIO
  2025-06-06 18:49   ` Alex Mastro
@ 2025-06-09  0:20     ` Jason Gunthorpe
  0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2025-06-09  0:20 UTC (permalink / raw)
  To: Alex Mastro
  Cc: linux-pci, alex.williamson, peterx, kbusch, linux-mm, leon,
	vivek.kasireddy, wguay, yilun.xu, Nicolin Chen

On Fri, Jun 06, 2025 at 11:49:46AM -0700, Alex Mastro wrote:
> Hi Jason,
> 
> By the way, we have been following progress on IOMMUFD, and would be interested
> in dogfooding it for our use case when ready. The main blocker is IOMMUFD's
> current lack of P2P support (IOMMU_IOAS_MAP fails when the VA range is backed
> by MMIO).

Nicolin has an out-of-tree patch that makes iommufd work in the same
insecure way as VFIO. Several places are using that for now.

> dma-buf as a less ambiguous semantic for communicating this intent (rather than
> the struggles of inferring what kind of memory is behind some VA range) makes a
> lot of sense.

Yes, I hope so.

> Based on tidbits we have gleaned, IOMMUFD P2P support intends to be built on
> top of "Provide a new two step DMA mapping API" [1] and "vfio/pci: Allow MMIO
> regions to be exported through dma-buf" [2].

Yes, that would be the first basis.

We also need to enhance DMABUF to add a 'revoke' semantic, and to allow the
physical page list to be handed to iommufd instead of a scatterlist.

> Item [2] appears to have been picked up by "Host side (KVM/VFIO/IOMMUFD) support
> for TDISP using TSM" [3].

I hope to see #2 redone on top of #1 this next cycle, Leon is working
on the incremental changes to allow the new DMA API to work without
struct page.

> On top of this, there would need to be a new IOMMUFD uapi, or extension to
> existing, which would accept an input dma-buf to map. Are there any patches in
> progress which include this?

No, I haven't seen patches for dmabuf revoke, or iommufd. The patches
for the physical address list need redoing to try again following
Sima's directions.

There are lots of steps here for someone to contribute to :)

Jason


