Kernel KVM virtualization development
From: Matthew Wilcox <willy@infradead.org>
To: Peter Xu <peterx@redhat.com>
Cc: Alex Williamson <alex@shazbot.org>,
	Anthony Pighin <anthony.pighin@nokia.com>,
	linux-kernel@vger.kernel.org,
	Kefeng Wang <wangkefeng.wang@huawei.com>,
	kvm@vger.kernel.org, Jason Gunthorpe <jgg@ziepe.ca>,
	linux-mm@kvack.org, Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>
Subject: Re: [PATCH] vfio: Request THP-aligned mmap for device fds
Date: Wed, 17 Jun 2026 19:34:06 +0100
Message-ID: <ajLontmFa1oRXeoK@casper.infradead.org>
In-Reply-To: <ajKtdCN0AlbmBnAj@x1.local>

[why on earth was stable@ cc'd?  adding/removing various other email
addresses]

On Wed, Jun 17, 2026 at 10:21:40AM -0400, Peter Xu wrote:
> On Tue, Jun 16, 2026 at 04:30:54PM -0600, Alex Williamson wrote:
> > On Tue, 16 Jun 2026 14:01:29 -0400
> > Anthony Pighin <anthony.pighin@nokia.com> wrote:
> > 
> > > VFIO PCI devices support PMD-sized page table entries for BAR mappings
> > > via their huge_fault handler (vfio_pci_mmap_huge_fault).  However, the
> > > VFIO device file_operations never provided a get_unmapped_area callback
> > > to request PMD-aligned virtual address placement from the mmap address
> > > allocator.
> > > 
> > > Before commit 34d7cf637c43 ("mm: don't try THP alignment for FS without
> > > get_unmapped_area"), this was masked by a bug introduced in commit
> > > ed48e87c7df3 ("thp: add thp_get_unmapped_area_vmflags()") which
> > > inadvertently applied THP alignment to all file-backed mappings,
> > > regardless of whether they provided a get_unmapped_area callback.
> > > 
> > > When commit 34d7cf637c43 ("mm: don't try THP alignment for FS without
> > > get_unmapped_area") correctly restricted THP alignment to anonymous
> > > mappings and files that explicitly opt in via get_unmapped_area, VFIO BAR
> > > mappings lost their PMD-aligned placement.  Since the huge_fault handler
> > > requires both the VMA start address and the physical PFN to be
> > > PMD-aligned, unaligned VMAs force a fallback to 4KB page faults.
> > > 
> > > For example, a 2GiB BAR results in 524,288 individual page faults
> > > instead of 1,024 PMD-sized faults, increasing the VFIO_IOMMU_MAP_DMA
> > > pinning time by orders of magnitude -- a regression directly visible to
> > > KVM guests during PCI device initialization.
> > > 
> > > Fix this by providing a get_unmapped_area callback in vfio_device_fops,
> > > following the same pattern used by ext4, xfs, btrfs, fuse, and other
> > > subsystems that benefit from THP-aligned placement.
> > 
> > The trouble is that PMD alignment isn't right either, your 1024 PMD
> > faults on a 2GiB BAR would be 2 faults on x86_64 with PUD mappings.
> > QEMU has forced the alignment to make it optimal for some time[1], so
> > there are userspace VMM options.  Seems like you were previously
> > getting lucky.
> > 
> > Peter Xu was working on a more comprehensive solution[2] late last
> > year, but it seems there was an objection to the
> > file_operations.get_mapping_order() proposal before Plumbers and the
> > thread hasn't rekindled.
> > 
> > Gentle bump to Peter and Willy that maybe we could resurrect that
> > effort.  Thanks,
> 
> Yes, since QEMU doesn't need it, it was low priority on my list (also due
> to much more downstream work recently, and a lot of things happening).
> 
> I can definitely try again.

I don't see this as being something that drivers should be involved with
at all.  The MM should be able to get this right without any hints from
the file-provider.  Yes, that means I also want to get rid of the setting
of get_unmapped_area in ext4/xfs/other filesystems.

Looking at generic_get_unmapped_area_topdown(), I think we can do this by
making an additional call to vm_unmapped_area() before the existing two,
setting info.align_mask and info.align_offset appropriately.

Now, what's "appropriately"?  I think it's based on length (>= PMD_SIZE,
then >= PUD_SIZE), but we should also take CONTPTE architectures into
account.  And maybe there's a CONTPMD architecture we should also consider?
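
As a rough illustration of the length-based choice described above, here is
a userspace sketch, not kernel code. It assumes x86_64 sizes with 4 KiB base
pages, and every sketch_* name is hypothetical. Only align_mask and
align_offset correspond to the fields named above. It also assumes that
align_offset should follow the file offset so that virtual addresses and
offsets stay congruent modulo the chosen size. CONTPTE and CONTPMD sizes are
not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 values with 4 KiB base pages; other architectures differ. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_PMD_SIZE   (1ULL << 21)  /* 2 MiB */
#define SKETCH_PUD_SIZE   (1ULL << 30)  /* 1 GiB */

/* Hypothetical stand-in for the two alignment fields discussed above. */
struct sketch_align {
	uint64_t align_mask;
	uint64_t align_offset;
};

/* Pick the largest mapping size that fits at least once in the request. */
static uint64_t sketch_pick_size(uint64_t len)
{
	if (len >= SKETCH_PUD_SIZE)
		return SKETCH_PUD_SIZE;
	if (len >= SKETCH_PMD_SIZE)
		return SKETCH_PMD_SIZE;
	return SKETCH_PAGE_SIZE;
}

/* Derive the alignment request from the length and file page offset. */
static struct sketch_align sketch_align_for(uint64_t len, uint64_t pgoff)
{
	uint64_t size = sketch_pick_size(len);
	struct sketch_align a;

	assert((size & (size - 1)) == 0);	/* sizes are powers of two */

	/* Bits above the page offset and below the chosen size. */
	a.align_mask = (size - 1) & ~(SKETCH_PAGE_SIZE - 1);
	/* Keep the address congruent to the file offset modulo the size. */
	a.align_offset = (pgoff << SKETCH_PAGE_SHIFT) & a.align_mask;
	return a;
}

/* Number of faults needed to populate len bytes at a given mapping size. */
static uint64_t sketch_fault_count(uint64_t len, uint64_t size)
{
	return (len + size - 1) / size;
}
```

With these assumed sizes, sketch_fault_count() reproduces the numbers quoted
earlier in the thread for a 2GiB BAR: 524,288 faults at 4KB, 1,024 at PMD
size, and 2 at PUD size.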

Anyway, those are my initial thoughts.  Perhaps others have feedback.


Thread overview: 5+ messages
2026-06-16 18:01 [PATCH] vfio: Request THP-aligned mmap for device fds Anthony Pighin
2026-06-16 22:30 ` Alex Williamson
2026-06-17 14:21   ` Peter Xu
2026-06-17 18:34     ` Matthew Wilcox [this message]
2026-06-17 19:29       ` Jason Gunthorpe
