* [PATCH] vfio: Request THP-aligned mmap for device fds
From: Anthony Pighin @ 2026-06-16 18:01 UTC
To: Alex Williamson
Cc: linux-kernel, stable, Kefeng Wang, Vlastimil Babka, Andrew Morton,
kvm
VFIO PCI devices support PMD-sized page table entries for BAR mappings
via their huge_fault handler (vfio_pci_mmap_huge_fault). However, the
VFIO device file_operations never provided a get_unmapped_area callback
to request PMD-aligned virtual address placement from the mmap address
allocator.

Before commit 34d7cf637c43 ("mm: don't try THP alignment for FS without
get_unmapped_area"), this was masked by a bug introduced in commit
ed48e87c7df3 ("thp: add thp_get_unmapped_area_vmflags()"), which
inadvertently applied THP alignment to all file-backed mappings,
regardless of whether they provided a get_unmapped_area callback.

When commit 34d7cf637c43 ("mm: don't try THP alignment for FS without
get_unmapped_area") correctly restricted THP alignment to anonymous
mappings and files that explicitly opt in via get_unmapped_area, VFIO BAR
mappings lost their PMD-aligned placement. Since the huge_fault handler
requires both the VMA start address and the physical PFN to be
PMD-aligned, unaligned VMAs force a fallback to 4KB page faults.

For example, a 2GiB BAR results in 524,288 individual page faults
instead of 1,024 PMD-sized faults, increasing the VFIO_IOMMU_MAP_DMA
pinning time by orders of magnitude -- a regression directly visible to
KVM guests during PCI device initialization.

Fix this by providing a get_unmapped_area callback in vfio_device_fops,
following the same pattern used by ext4, xfs, btrfs, fuse, and other
subsystems that benefit from THP-aligned placement.

Fixes: 34d7cf637c43 ("mm: don't try THP alignment for FS without get_unmapped_area")
Cc: stable@vger.kernel.org
Cc: Alex Williamson <alex@shazbot.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kvm@vger.kernel.org
Signed-off-by: Anthony Pighin <anthony.pighin@nokia.com>
---
drivers/vfio/vfio_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6222376ab6ab..2dbb1a84dbac 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -40,6 +40,7 @@
 #include <linux/interval_tree.h>
 #include <linux/iova_bitmap.h>
 #include <linux/iommufd.h>
+#include <linux/huge_mm.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION "0.3"
@@ -1461,6 +1462,7 @@ const struct file_operations vfio_device_fops = {
 	.unlocked_ioctl = vfio_device_fops_unl_ioctl,
 	.compat_ioctl = compat_ptr_ioctl,
 	.mmap = vfio_device_fops_mmap,
+	.get_unmapped_area = thp_get_unmapped_area,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = vfio_device_show_fdinfo,
 #endif
--
2.43.0
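The fault counts quoted in the commit message can be checked with a small user-space model. This is a sketch, not kernel code: `can_map_at()`, `faults_needed()` and the `SZ_*` constants are illustrative names, the sizes assume x86-64 with 4 KiB base pages, and it encodes only the simplified rule stated in the patch (a leaf entry of a given size needs both the virtual and the physical address aligned to that size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86-64 leaf sizes with 4 KiB base pages (assumed). */
#define SZ_4K (UINT64_C(1) << 12)	/* PTE */
#define SZ_2M (UINT64_C(1) << 21)	/* PMD */
#define SZ_1G (UINT64_C(1) << 30)	/* PUD */

/*
 * Simplified rule from the commit message: a leaf entry of size 'sz' can
 * map 'va' to 'pa' only if both addresses are aligned to 'sz'.
 */
static bool can_map_at(uint64_t va, uint64_t pa, uint64_t sz)
{
	assert(sz && (sz & (sz - 1)) == 0);	/* leaf sizes are powers of two */
	return (va & (sz - 1)) == 0 && (pa & (sz - 1)) == 0;
}

/* Faults needed to populate 'len' bytes when every fault maps 'sz' bytes. */
static uint64_t faults_needed(uint64_t len, uint64_t sz)
{
	assert(sz && (sz & (sz - 1)) == 0);
	return (len + sz - 1) / sz;
}
```

For the 2 GiB BAR in the example, `faults_needed()` gives 524,288 with 4 KiB leaves, 1,024 with 2 MiB leaves and 2 with 1 GiB leaves. Moving the VMA start by a single 4 KiB page is enough for `can_map_at()` to reject every 2 MiB leaf.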
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Alex Williamson @ 2026-06-16 22:30 UTC
To: Anthony Pighin
Cc: linux-kernel, stable, Kefeng Wang, Vlastimil Babka, Andrew Morton,
kvm, alex, Matthew Wilcox, Jason Gunthorpe, Peter Xu
On Tue, 16 Jun 2026 14:01:29 -0400
Anthony Pighin <anthony.pighin@nokia.com> wrote:
> VFIO PCI devices support PMD-sized page table entries for BAR mappings
> via their huge_fault handler (vfio_pci_mmap_huge_fault). However, the
> VFIO device file_operations never provided a get_unmapped_area callback
> to request PMD-aligned virtual address placement from the mmap address
> allocator.
>
> Before commit 34d7cf637c43 ("mm: don't try THP alignment for FS without
> get_unmapped_area"), this was masked by a bug introduced in commit
> ed48e87c7df3 ("thp: add thp_get_unmapped_area_vmflags()") which
> inadvertently applied THP alignment to all file-backed mappings,
> regardless of whether they provided a get_unmapped_area callback.
>
> When commit 34d7cf637c43 ("mm: don't try THP alignment for FS without
> get_unmapped_area") correctly restricted THP alignment to anonymous
> mappings and files that explicitly opt in via get_unmapped_area, VFIO BAR
> mappings lost their PMD-aligned placement. Since the huge_fault handler
> requires both the VMA start address and the physical PFN to be
> PMD-aligned, unaligned VMAs force a fallback to 4KB page faults.
>
> For example, a 2GiB BAR results in 524,288 individual page faults
> instead of 1,024 PMD-sized faults, increasing the VFIO_IOMMU_MAP_DMA
> pinning time by orders of magnitude -- a regression directly visible to
> KVM guests during PCI device initialization.
>
> Fix this by providing a get_unmapped_area callback in vfio_device_fops,
> following the same pattern used by ext4, xfs, btrfs, fuse, and other
> subsystems that benefit from THP-aligned placement.
The trouble is that PMD alignment isn't right either, your 1024 PMD
faults on a 2GiB BAR would be 2 faults on x86_64 with PUD mappings.
QEMU has forced the alignment to make it optimal for some time[1], so
there are userspace VMM options. Seems like you were previously
getting lucky.
Peter Xu was working on a more comprehensive solution[2] late last
year, but it seems there was an objection to the
file_operations.get_mapping_order() proposal before Plumbers and the
thread hasn't rekindled.
Gentle bump to Peter and Willy that maybe we could resurrect that
effort. Thanks,
Alex
[1]https://gitlab.com/qemu-project/qemu/-/commit/00b519c0bca0
[2]https://lore.kernel.org/all/20251204151003.171039-1-peterx@redhat.com/
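Alex notes that userspace can force the alignment itself. The sketch below shows one common way to do that: over-reserve address space with an anonymous `PROT_NONE` mapping, then trim it to an aligned window. It illustrates the general technique and is not the code from the QEMU commit in [1]. The helper name `reserve_aligned()` is hypothetical, and it assumes `len` is a multiple of the page size and `align` is a power of two no smaller than the page size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve 'len' bytes of address space starting at a multiple of 'align'.
 * The window is PROT_NONE; the caller maps the real object over it with
 * MAP_FIXED. Returns NULL if the reservation fails.
 */
static void *reserve_aligned(size_t len, size_t align)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	assert(align && (align & (align - 1)) == 0);	/* power of two */
	assert(align >= page && len % page == 0);

	/* Over-reserve so that an aligned window of 'len' bytes must fit. */
	size_t span = len + align;
	uint8_t *raw = mmap(NULL, span, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	/* Round the start up to the next multiple of 'align'. */
	uintptr_t start = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
	uint8_t *aligned = (uint8_t *)start;

	/* Hand the unused head and tail of the reservation back. */
	size_t head = (size_t)(aligned - raw);
	size_t tail = span - head - len;
	if (head)
		munmap(raw, head);
	if (tail)
		munmap(aligned + len, tail);
	return aligned;
}
```

A VMM would then map the device region over the window with `mmap(p, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, device_fd, region_offset)`, so the placement no longer depends on a `get_unmapped_area` hook. Choosing `align` as 1 GiB rather than 2 MiB for a large BAR is what makes the PUD-sized faults Alex mentions possible.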
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Peter Xu @ 2026-06-17 14:21 UTC
To: Alex Williamson
Cc: Anthony Pighin, linux-kernel, stable, Kefeng Wang,
Vlastimil Babka, Andrew Morton, kvm, Matthew Wilcox,
Jason Gunthorpe
On Tue, Jun 16, 2026 at 04:30:54PM -0600, Alex Williamson wrote:
> On Tue, 16 Jun 2026 14:01:29 -0400
> Anthony Pighin <anthony.pighin@nokia.com> wrote:
>
> > [...]
>
> The trouble is that PMD alignment isn't right either, your 1024 PMD
> faults on a 2GiB BAR would be 2 faults on x86_64 with PUD mappings.
> QEMU has forced the alignment to make it optimal for some time[1], so
> there are userspace VMM options. Seems like you were previously
> getting lucky.
>
> Peter Xu was working on a more comprehensive solution[2] late last
> year, but it seems there was an objection to the
> file_operations.get_mapping_order() proposal before Plumbers and the
> thread hasn't rekindled.
>
> Gentle bump to Peter and Willy that maybe we could resurrect that
> effort. Thanks,
Yes, since QEMU doesn't need it, it was low priority on my list (also due
to much more downstream work recently, and a lot has happened).
I can definitely try again.
I'll wait for another 1-2 weeks in case Matthew would like to provide a
better suggestion, otherwise I can send a new version based on what we
have, and we can start from there.
Thanks,
--
Peter Xu
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Matthew Wilcox @ 2026-06-17 18:34 UTC
To: Peter Xu
Cc: Alex Williamson, Anthony Pighin, linux-kernel, Kefeng Wang, kvm,
Jason Gunthorpe, linux-mm, Lorenzo Stoakes, Liam R. Howlett
[why on earth was stable@ cc'd? adding/removing various other email
addresses]
On Wed, Jun 17, 2026 at 10:21:40AM -0400, Peter Xu wrote:
> On Tue, Jun 16, 2026 at 04:30:54PM -0600, Alex Williamson wrote:
> > On Tue, 16 Jun 2026 14:01:29 -0400
> > Anthony Pighin <anthony.pighin@nokia.com> wrote:
> >
> > > [...]
> >
> > The trouble is that PMD alignment isn't right either, your 1024 PMD
> > faults on a 2GiB BAR would be 2 faults on x86_64 with PUD mappings.
> > QEMU has forced the alignment to make it optimal for some time[1], so
> > there are userspace VMM options. Seems like you were previously
> > getting lucky.
> >
> > Peter Xu was working on a more comprehensive solution[2] late last
> > year, but it seems there was an objection to the
> > file_operations.get_mapping_order() proposal before Plumbers and the
> > thread hasn't rekindled.
> >
> > Gentle bump to Peter and Willy that maybe we could resurrect that
> > effort. Thanks,
>
> Yes, since QEMU doesn't need it, it was low priority on my list (also due
> to much more downstream work recently, and a lot has happened).
>
> I can definitely try again.
I don't see this as being something that drivers should be involved with
at all. The MM should be able to get this right without any hints from
the file-provider. Yes, that means I also want to get rid of the setting
of get_unmapped_area in ext4/xfs/other filesystems.
Looking at generic_get_unmapped_area_topdown(), I think we can do this by
making an additional call to vm_unmapped_area() before the existing two,
setting info.align_mask and info.align_offset appropriately.
Now, what's "appropriately"? I think it's based on length (>= PMD_SIZE,
then >= PUD_SIZE), but we should also take CONTPTE architectures into
account. And maybe there's a CONTPMD architecture we should also consider?
Anyway, that's my initial thoughts. Perhaps others have feedback.
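Matthew's suggestion can be modelled in user space to see how the pieces fit: pick an alignment from the mapping length, try an aligned top-down placement first, and fall back to an unaligned one. This is a simplified sketch, not `generic_get_unmapped_area_topdown()`: it considers a single free gap, and `pick_align_mask()`, `topdown_aligned()` and `place()` are hypothetical helpers. The mask/offset rule is assumed to mirror `info.align_mask` and `info.align_offset` (the chosen start satisfies `(start - align_offset) & align_mask == 0`), and the PMD/PUD sizes assume x86-64 with 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SZ (UINT64_C(1) << 21)	/* 2 MiB, x86-64 with 4 KiB pages */
#define PUD_SZ (UINT64_C(1) << 30)	/* 1 GiB */

/* Hypothetical policy: align to the largest leaf size the length can use. */
static uint64_t pick_align_mask(uint64_t len)
{
	if (len >= PUD_SZ)
		return PUD_SZ - 1;
	if (len >= PMD_SZ)
		return PMD_SZ - 1;
	return 0;	/* no extra alignment beyond the page size */
}

/*
 * Top-down placement inside one free gap [gap_lo, gap_hi): return the
 * highest start with start >= gap_lo, start + len <= gap_hi and
 * (start - align_offset) & align_mask == 0. Returns 0 if nothing fits
 * (the model assumes gap_lo > 0).
 */
static uint64_t topdown_aligned(uint64_t gap_lo, uint64_t gap_hi, uint64_t len,
				uint64_t align_mask, uint64_t align_offset)
{
	if (gap_hi < gap_lo || gap_hi - gap_lo < len)
		return 0;
	uint64_t start = gap_hi - len;
	uint64_t excess = (start - align_offset) & align_mask;
	if (start - gap_lo < excess)
		return 0;	/* rounding down would leave the gap */
	return start - excess;
}

/* The extra aligned attempt first, then the existing unaligned search. */
static uint64_t place(uint64_t gap_lo, uint64_t gap_hi, uint64_t len,
		      uint64_t align_offset)
{
	uint64_t addr = topdown_aligned(gap_lo, gap_hi, len,
					pick_align_mask(len), align_offset);
	return addr ? addr : topdown_aligned(gap_lo, gap_hi, len, 0, 0);
}
```

Because the aligned attempt only ever runs first, a failed attempt costs one extra search and the mapping still succeeds wherever it would have before. Note that the model leaves `align_offset` as an input; what it should be set to is exactly the open question in the replies that follow.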
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Jason Gunthorpe @ 2026-06-17 19:29 UTC
To: Matthew Wilcox
Cc: Peter Xu, Alex Williamson, Anthony Pighin, linux-kernel,
Kefeng Wang, kvm, linux-mm, Lorenzo Stoakes, Liam R. Howlett
On Wed, Jun 17, 2026 at 07:34:06PM +0100, Matthew Wilcox wrote:
> I don't see this as being something that drivers should be involved with
> at all. The MM should be able to get this right without any hints from
> the file-provider. Yes, that means I also want to get rid of the setting
> of get_unmapped_area in ext4/xfs/other filesystems.
>
> Looking at generic_get_unmapped_area_topdown(), I think we can do this by
> making an additional call to vm_unmapped_area() before the existing two,
> setting info.align_mask and info.align_offset appropriately.
>
> Now, what's "appropriately"? I think it's based on length (>= PMD_SIZE,
> then >= PUD_SIZE), but we should also take CONTPTE architectures into
> account.
The info.align_mask and info.align_offset do need information from the
driver based on what it intends to map into the VMA that is being
created.
Filesystems probably have quite different requirements than drivers
using remap_pfn() or vmf_insert_pfn() that have locked down pfn's.
A pfn driver often has a single already known physical range that it
will use for the VMA and that range should drive the alignment
decision of the VMA.
vfio in particular has common use cases where you want to mmap from
weird offsets, but we still want to achieve a VMA starting point that
has pa % PUD_SIZE == va % PUD_SIZE. It is impossible to do this if the
thing building info does not know pa.
I do think it makes sense that no file provider should be computing
the VA area itself, I think I made that case when Peter was last
working on this. Now that we have Lorenzo's mmap changes maybe we
should be talking about supporting VFIO by having a callback to obtain
the starting pfn for the VMA. Usable only by drivers like VFIO that
are working with the pfn functions.
The starting pfn and VMA size is enough for the mm to setup info
suitably.
Maybe other users would prefer a 'max order' callback and then the mm
would assume the VMA will be populated with pgoff aligned folios up
to that highest order?
> And maybe there's a CONTPMD architecture we should also consider?
ARM HW supports "CONTPMD" but I suppose it is not implemented.
Jason
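Jason's point is that the right alignment offset comes from the physical address, not only from the length. A minimal user-space sketch, assuming the driver can report `pa`, the physical address behind the first byte of the mapping, and assuming 1 GiB PUD leaves; `va_congruent_to_pa()` and `leaves_usable()` are hypothetical names, and the BAR address in the usage note is made up. It places the VMA so that `va % PUD_SIZE == pa % PUD_SIZE` and counts how many 1 GiB leaves the result can use.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Highest start address in the free gap [gap_lo, gap_hi) for a mapping of
 * 'len' bytes such that start % sz == pa % sz, where 'pa' is the physical
 * address behind the first mapped byte. Returns 0 if nothing fits (the
 * model assumes gap_lo > 0).
 */
static uint64_t va_congruent_to_pa(uint64_t gap_lo, uint64_t gap_hi,
				   uint64_t len, uint64_t pa, uint64_t sz)
{
	assert(sz && (sz & (sz - 1)) == 0);	/* leaf sizes are powers of two */
	if (gap_hi < gap_lo || gap_hi - gap_lo < len)
		return 0;
	uint64_t start = gap_hi - len;
	uint64_t excess = (start - pa) & (sz - 1);	/* distance past congruence */
	if (start - gap_lo < excess)
		return 0;
	return start - excess;
}

/*
 * Number of whole sz-sized leaves inside [va, va + len). Only meaningful
 * when va % sz == pa % sz, because then every sz-aligned virtual address
 * in the range is backed by an sz-aligned physical address.
 */
static uint64_t leaves_usable(uint64_t va, uint64_t len, uint64_t sz)
{
	assert(sz && (sz & (sz - 1)) == 0);
	uint64_t first = (va + sz - 1) & ~(sz - 1);	/* round up */
	uint64_t end = (va + len) & ~(sz - 1);		/* round down */
	return end > first ? (end - first) / sz : 0;
}
```

For a 4 GiB BAR at physical 0x3800000000 mapped from BAR offset 4 KiB (one of the "weird offsets"), the chosen VMA starts 4 KiB past a 1 GiB boundary, and three of the four 1 GiB chunks can still be mapped with PUD entries. A length-only policy that aligned the VMA start itself to 1 GiB would make every PUD entry impossible here.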
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Lorenzo Stoakes @ 2026-06-18 14:55 UTC
To: Jason Gunthorpe
Cc: Matthew Wilcox, Peter Xu, Alex Williamson, Anthony Pighin,
linux-kernel, Kefeng Wang, kvm, linux-mm, Liam R. Howlett,
Ryan Roberts
+cc Ryan for contPMD
On Wed, Jun 17, 2026 at 04:29:28PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 17, 2026 at 07:34:06PM +0100, Matthew Wilcox wrote:
>
> > [...]
>
> The info.align_mask and info.align_offset do need information from the
> driver based on what it intends to map into the VMA that is being
> created.
>
> Filesystems probably have quite different requirements than drivers
> using remap_pfn() or vmf_insert_pfn() that have locked down pfn's.
I think part of the problem here is that we don't differentiate between
drivers and filesystems, and what might be sensible for one is perhaps not
sensible for another.
We're too generic really.
With mmap_prepare we have a lot of flexibility as to what we do. That
callback is idempotent and as limited as possible, and actions like remap
are achieved through calling a kernel function like mmap_action_remap().
With that interface, drivers are declarative more than imperative, and
_they can tell us stuff_ :)
That seems pertinent here.
And I'm more than happy to have features that _require_ mmap_prepare.
>
> A pfn driver often has a single already known physical range that it
> will use for the VMA and that range should drive the alignment
> decision of the VMA.
>
> vfio in particular has common use cases where you want to mmap from
> weird offsets, but we still want to achieve a VMA starting point that
> has pa % PUD_SIZE == va % PUD_SIZE. It is impossible to do this if the
> thing building info does not know pa.
>
> I do think it makes sense that no file provider should be computing
> the VA area itself, I think I made that case when Peter was last
> working on this. Now that we have Lorenzo's mmap changes maybe we
> should be talking about supporting VFIO by having a callback to obtain
> the starting pfn for the VMA. Usable only by drivers like VFIO that
> are working with the pfn functions.
Can't we figure this out from what the driver tells us when it invokes an
mmap_prepare action?
In general I have zero trust in drivers, the right basis for dealing with
them is that they will do the most insane thing possible (and you're
pleasantly surprised if they don't :)
Callbacks are problematic, they're not neutral (context/lock state/etc.)
and you can't necessarily make assumptions in calling code after a callback
is called that you could before.
Can't we figure it out from the PFN the driver tells mmap_prepare about?
>
> The starting pfn and VMA size is enough for the mm to setup info
> suitably.
seems so? :)
>
> Maybe other users would prefer a 'max order' callback and then the mm
> would assume the VMA will be populated with pgoff aligned folios up
> to that highest order?
Not in favour of that, fear it'll be seen as a new go-faster stripe. Ask
somebody how many free pints they want and they may veer rather towards the
upper bound :)
>
> > And maybe there's a CONTPMD architecture we should also consider?
>
> ARM HW supports "CONTPMD" but I suppose it is not implemented.
Maybe Ryan has thoughts?
>
> Jason
Thanks, Lorenzo
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Matthew Wilcox @ 2026-06-18 15:04 UTC
To: Lorenzo Stoakes
Cc: Jason Gunthorpe, Peter Xu, Alex Williamson, Anthony Pighin,
linux-kernel, Kefeng Wang, kvm, linux-mm, Liam R. Howlett,
Ryan Roberts
On Thu, Jun 18, 2026 at 03:55:58PM +0100, Lorenzo Stoakes wrote:
> On Wed, Jun 17, 2026 at 04:29:28PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 17, 2026 at 07:34:06PM +0100, Matthew Wilcox wrote:
> >
> > > [...]
> >
> > The info.align_mask and info.align_offset do need information from the
> > driver based on what it intends to map into the VMA that is being
> > created.
What you're saying is that offset 0 of the opened file might correspond
to a PFN that is not aligned in any way? I had assumed that when trying
to do the mapping of (2MB+4KiB to 64MB), that the offset specified to
mmap was 2MB+4KiB. But you seem to be saying that the offset in that
case would be 0 and someone needs to know that it corresponds to a PFN
that is misaligned?
> > Filesystems probably have quite different requirements than drivers
> > using remap_pfn() or vmf_insert_pfn() that have locked down pfn's.
>
> I think part of the problem here is that we don't differentiate between
> drivers and filesystems, and what might be sensible for one is perhaps not
> sensible for another.
>
> We're too generic really.
>
> With mmap_prepare we have a lot of flexibility as to what we do. That
> callback is idempotent and as limited as possible, and actions like remap
> are achieved through calling a kernel function like mmap_action_remap().
mmap_prepare() is called too late. We've already assigned the virtual
address range before we call __mmap_region(), and there's no attempt to
adjust 'addr' in __mmap_region() after calling mmap_prepare().
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Jason Gunthorpe @ 2026-06-18 15:28 UTC
To: Lorenzo Stoakes
Cc: Matthew Wilcox, Peter Xu, Alex Williamson, Anthony Pighin,
linux-kernel, Kefeng Wang, kvm, linux-mm, Liam R. Howlett,
Ryan Roberts
On Thu, Jun 18, 2026 at 03:55:58PM +0100, Lorenzo Stoakes wrote:
> > A pfn driver often has a single already known physical range that it
> > will use for the VMA and that range should drive the alignment
> > decision of the VMA.
> >
> > vfio in particular has common use cases where you want to mmap from
> > weird offsets, but we still want to achieve a VMA starting point that
> > has pa % PUD_SIZE == va % PUD_SIZE. It is impossible to do this if the
> > thing building info does not know pa.
> >
> > I do think it makes sense that no file provider should be computing
> > the VA area itself, I think I made that case when Peter was last
> > working on this. Now that we have Lorenzo's mmap changes maybe we
> > should be talking about supporting VFIO by having a callback to obtain
> > the starting pfn for the VMA. Usable only by drivers like VFIO that
> > are working with the pfn functions.
>
> Can't we figure this out from what the driver tells us when it invokes an
> mmap_prepare action?
VFIO installs the pages via fault handler so there is not a naturally
existing way to pass in the pfn?
> Can't we figure it out from the PFN the driver tells mmap_prepare about?
Maybe it can pass the pfn anyhow and not have the mmap logic map
anything?
> > Maybe other users would prefer a 'max order' callback and then the mm
> > would assume the VMA will be populated with pgoff aligned folios up
> > to that highest order?
>
> Not in favour of that, fear it'll be seen as a new go-faster stripe. Ask
> somebody how many free pints they want and they may veer rather towards the
> upper bound :)
I think you need something, otherwise we will be aligning VMAs that
never have anything larger than a 2M THP to 1GB boundaries, doesn't
seem good.
Jason
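To illustrate Jason's concern, here is a hypothetical policy that combines the length-based choice with a provider-supplied upper bound, the "max order" idea raised earlier in the thread. It is a sketch under stated assumptions: `pick_alignment()` and its `max_order` parameter are hypothetical, base pages are assumed to be 4 KiB, and only the x86-64 PUD and PMD leaf sizes are considered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE_PAGE_SZ UINT64_C(4096)	/* assumed 4 KiB base pages */

/* x86-64 leaf sizes above the base page, largest first (assumed). */
static const uint64_t leaf_sizes[] = {
	UINT64_C(1) << 30,	/* PUD: 1 GiB */
	UINT64_C(1) << 21,	/* PMD: 2 MiB */
};

/*
 * Hypothetical policy: align to the largest leaf size that fits in 'len'
 * and that the provider may actually map, i.e. no bigger than
 * 2^max_order base pages. Returns 0 when extra alignment cannot help.
 */
static uint64_t pick_alignment(uint64_t len, unsigned int max_order)
{
	assert(max_order < 52);		/* keep the shift below 2^64 */
	uint64_t cap = BASE_PAGE_SZ << max_order;

	for (size_t i = 0; i < sizeof(leaf_sizes) / sizeof(leaf_sizes[0]); i++)
		if (leaf_sizes[i] <= len && leaf_sizes[i] <= cap)
			return leaf_sizes[i];
	return 0;
}
```

With `max_order` of 9 (a 2 MiB THP), a 4 GiB VMA is only PMD-aligned, whereas a length-only policy would pick 1 GiB and consume address space for no benefit. With `max_order` of 18 the same VMA gets 1 GiB alignment.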
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Jason Gunthorpe @ 2026-06-18 15:30 UTC
To: Matthew Wilcox
Cc: Lorenzo Stoakes, Peter Xu, Alex Williamson, Anthony Pighin,
linux-kernel, Kefeng Wang, kvm, linux-mm, Liam R. Howlett,
Ryan Roberts
On Thu, Jun 18, 2026 at 04:04:06PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 18, 2026 at 03:55:58PM +0100, Lorenzo Stoakes wrote:
> > On Wed, Jun 17, 2026 at 04:29:28PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 17, 2026 at 07:34:06PM +0100, Matthew Wilcox wrote:
> > >
> > > [...]
>
> What you're saying is that offset 0 of the opened file might correspond
> to a PFN that is not aligned in any way? I had assumed that when trying
> to do the mapping of (2MB+4KiB to 64MB), that the offset specified to
> mmap was 2MB+4KiB. But you seem to be saying that the offset in that
> case would be 0 and someone needs to know that it corresponds to a PFN
> that is misaligned?
I do expect that the pgoff space is usually aligned to the pfn space,
most drivers do that or could be improved to do that. There will be
some off cases, but maybe we don't care, and VFIO should be fine.
That is certainly an easier place to start.
Jason
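When the driver keeps its file-offset space aligned with its pfn space, the mm can use the file offset as the alignment offset without ever learning the physical address, which is the easier starting point described above. A minimal sketch, assuming a VFIO-style layout in which each region starts at file offset `index << 40` (the shift is an illustrative assumption) and in which the region's physical base is aligned at least to the leaf size; `region_file_offset()`, `pa_for_offset()` and `offset_tracks_pa()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical VFIO-style layout: region i starts at file offset i << 40. */
#define REGION_SHIFT 40

static uint64_t region_file_offset(unsigned int index)
{
	return (uint64_t)index << REGION_SHIFT;
}

/* Physical address behind file offset 'off' of a region starting at 'region_pa'. */
static uint64_t pa_for_offset(uint64_t off, uint64_t region_off, uint64_t region_pa)
{
	assert(off >= region_off);
	return region_pa + (off - region_off);
}

/* True when aligning on the file offset gives the same residue as aligning on pa. */
static bool offset_tracks_pa(uint64_t off, uint64_t region_off,
			     uint64_t region_pa, uint64_t sz)
{
	assert(sz && (sz & (sz - 1)) == 0);	/* leaf sizes are powers of two */
	return (off & (sz - 1)) ==
	       (pa_for_offset(off, region_off, region_pa) & (sz - 1));
}
```

For a naturally aligned BAR the offset and the physical address agree modulo 1 GiB at any offset within the region, so an offset-based `align_offset` is sufficient. For a region whose physical base is only 4 KiB aligned they diverge, which is the kind of "off case" that would still need the pfn itself.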
* Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Lorenzo Stoakes @ 2026-06-18 15:56 UTC
To: Jason Gunthorpe
Cc: Matthew Wilcox, Peter Xu, Alex Williamson, Anthony Pighin,
linux-kernel, Kefeng Wang, kvm, linux-mm, Liam R. Howlett,
Ryan Roberts
On Thu, Jun 18, 2026 at 12:30:49PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 18, 2026 at 04:04:06PM +0100, Matthew Wilcox wrote:
> > On Thu, Jun 18, 2026 at 03:55:58PM +0100, Lorenzo Stoakes wrote:
> > > On Wed, Jun 17, 2026 at 04:29:28PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jun 17, 2026 at 07:34:06PM +0100, Matthew Wilcox wrote:
> > > >
> > > > [...]
> >
> > What you're saying is that offset 0 of the opened file might correspond
> > to a PFN that is not aligned in any way? I had assumed that when trying
> > to do the mapping of (2MB+4KiB to 64MB), that the offset specified to
> > mmap was 2MB+4KiB. But you seem to be saying that the offset in that
> > case would be 0 and someone needs to know that it corresponds to a PFN
> > that is misaligned?
>
> I do expect that the pgoff space is usually aligned to the pfn space,
> most drivers do that or could be improved to do that. There will be
> some off cases, but maybe we don't care, and VFIO should be fine.
Some stuff has weird assumptions about pfn=0 at start of the range (DMA for
instance).
Presumably not applicable to VFIO but that's a thing we need to stop
doing... (I have some patches I deferred from a while back changing the DMA
stuff).
>
> That is certainly an easier place to start.
>
> Jason
Thanks, Lorenzo
Thread overview:
2026-06-16 18:01 [PATCH] vfio: Request THP-aligned mmap for device fds Anthony Pighin
2026-06-16 22:30 ` Alex Williamson
2026-06-17 14:21 ` Peter Xu
2026-06-17 18:34 ` Matthew Wilcox
2026-06-17 19:29 ` Jason Gunthorpe
2026-06-18 14:55 ` Lorenzo Stoakes
2026-06-18 15:04 ` Matthew Wilcox
2026-06-18 15:30 ` Jason Gunthorpe
2026-06-18 15:56 ` Lorenzo Stoakes
2026-06-18 15:28 ` Jason Gunthorpe