* Revisiting WC mmap Support for VFIO PCI
@ 2025-07-16 17:11 Mahmoud Nagy Adam
2025-07-16 18:40 ` Jason Gunthorpe
0 siblings, 1 reply; 2+ messages in thread
From: Mahmoud Nagy Adam @ 2025-07-16 17:11 UTC (permalink / raw)
To: kvm; +Cc: Jason Gunthorpe, Alex Williamson, Praveen Kumar, nagy
Hi all,
I’d like to revisit the topic of adding Write Combining (WC) mmap
support for VFIO PCI. We (Amazon Linux) are currently working on
allowing this feature, and I wanted to share the current approach and
gather feedback from the community.
From earlier discussions and RFCs, there are several considerations that
need to be addressed for a clean implementation, which includes:
- Receiving mmap flags from userspace which was previously explored[0],
and proposed approach is to expose an ioctl uAPI to allow userspace to
set flags for a range. There was also discussion around
differentiating WC from prefetchable memory — which makes sense — and
letting userspace explicitly choose which region to use WC.
- Dealing with legacy regions & drivers created regions, this could be
handled as suggested[1] from Jason using maple tree, which I'm
implementing to insert flags entry of the range to be mmapped, since
this would give us the flexibility to set the flags of any ranges.
- Scoping the mmap flags locally per request instead of defining it
globally on vfio_device/vfio_pci_core_device. This afaict from the
code could be handled if vfio_device_file struct is used with the
vfio_device_ops instead of the vfio_device, specifically the mmap &
ioctl since these the ops of interest here, so that we could access it
there and have a per fd maple tree to keep the flags in. This will
also keep the life time of the flags to the FD not to the device which
I think is better in this case.
Since I'm in the middle of investigating & implementing this topic, I
would like to collect opinions on the approach so far, specially the
last point. better ideas or objections with dealing with local flags
using vfio_device_file or other points would be appreciated.
[0]: https://lore.kernel.org/kvm/20240731155352.3973857-1-kbusch@meta.com/
[1]: https://lore.kernel.org/kvm/20240801141914.GC3030761@ziepe.ca/
Regards,
MNAdam
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: Revisiting WC mmap Support for VFIO PCI
2025-07-16 17:11 Revisiting WC mmap Support for VFIO PCI Mahmoud Nagy Adam
@ 2025-07-16 18:40 ` Jason Gunthorpe
0 siblings, 0 replies; 2+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 18:40 UTC (permalink / raw)
To: Mahmoud Nagy Adam; +Cc: kvm, Alex Williamson, Praveen Kumar, nagy
On Wed, Jul 16, 2025 at 07:11:50PM +0200, Mahmoud Nagy Adam wrote:
> - Dealing with legacy regions & drivers created regions, this could be
> handled as suggested[1] from Jason using maple tree, which I'm
> implementing to insert flags entry of the range to be mmapped, since
> this would give us the flexibility to set the flags of any ranges.
I think we agreed that the mmap offset we use today is not ABI so the
maple tree should be filled in either during file open time or
probably simpler done during the regions info ioctl.
> - Scoping the mmap flags locally per request instead of defining it
> globally on vfio_device/vfio_pci_core_device. This afaict from the
> code could be handled if vfio_device_file struct is used with the
> vfio_device_ops instead of the vfio_device, specifically the mmap &
> ioctl since these the ops of interest here, so that we could access it
> there and have a per fd maple tree to keep the flags in. This will
> also keep the life time of the flags to the FD not to the device which
> I think is better in this case.
It would be best to put the maple tree in the vfio_device_file.
The core code should just do the maple tree lookup and pass the struct
down to the driver, sort of like:
static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
{
struct vfio_device_file *df = filep->private_data;
[..]
struct vfio_mmap *mmap_obj = mtree_load(&df->mt_mmap, vma->vm_pgoff << PAGE_SHIFT);
if (!mmap_obj)
return -ENXIO;
/* mmap must be contained to a single object */
if (vma->vm_start < mmap_obj->vm_pgoff || vma->vm_end > (mmap_obj->vm_pgoff + mmap_obj->length))
return -ENXIO;
return device->ops->mmap_obj(device, vma, mmap_obj);
And the ioctls would need the df as well, but that is trivial.
Then all the pgoff touching in the driver around mmap goes away.
Drivers implementing the mmap ops should be structured to sub-struct
the vfio_mmap to store their own information:
struct vfio_pci_mmap {
struct vfio_mmap core;
unsigned int bar_index;
phys_addr_t phys_start;
unsigned int pgprot;
};
Or similar perhaps.
> Since I'm in the middle of investigating & implementing this topic, I
> would like to collect opinions on the approach so far, specially the
> last point. better ideas or objections with dealing with local flags
> using vfio_device_file or other points would be appreciated.
I strongly feel using a maple tree to translate pgoff's back into
kernel objects associated with the mmap is the right design for every
subsystem to follow.
Nicolin implemented this for iommufd:
https://lore.kernel.org/linux-iommu/9a888a326b12aa5fe940083eae1156304e210fe0.1752126748.git.nicolinc@nvidia.com/
Which shows the basic shape of the idea. It is quite easy to implement
on top of maple tree.
We are getting close to having problems in VFIO. The 40bits allocated
in VFIO is only about a 1TB of MMIO which is now a real-world size for
CXL devices. Making pgoff dynamic is the best solution to that
problem. It avoids running out of pgoff space as long as possible and
maximally retains compatibility with 32 bit systems.
Just sort of looking at it, I would suggest trying a sequence of
patches like:
1) Make a new function op for VFIO_DEVICE_GET_REGION_INFO. It looks
like there is quite some duplication here that can be improved on
regardless.
2) N patches converting drivers to use the new op.
3) Add the basic maple tree stuff and a new mmap_obj op and new
VFIO_DEVICE_GET_REGION_INFO that returns a 'struct vfio_mmap *'
4) N patches moving to use the vfio_mmap ops
5) Remove the unusued old APIs.
From there you can figure out the right uapi to get another maple tree
entry that selects WC/cachable/etc mode.
Note though compared to what Nicolin did VFIO will need pgoff
alignment during the allocations. BARs should be aligned to
min(PUD_SIZE, bar_size)/PAGE_SIZE within the pgoff space to make
things work well.
I think this is a striaghtforward project, though the rework of
REGION_INFO will be some typing.
Jason
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-07-16 18:40 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-16 17:11 Revisiting WC mmap Support for VFIO PCI Mahmoud Nagy Adam
2025-07-16 18:40 ` Jason Gunthorpe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).