* [RFC PATCH 0/9] vfio: Introduce mmap maple tree
@ 2025-08-04 10:39 Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 1/9] vfio: add mmap maple tree to vfio Mahmoud Adam
` (9 more replies)
0 siblings, 10 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
This RFC series proposes the implementation of a new mechanism for
region mmap attributes using maple trees, based on Jason's suggested
maple tree and offset cookie approach[0]. The primary motivation is to
enable userspace applications to specify mmap attributes—such as Write
Combining (WC)—prior to invoking mmap on a VFIO region. While the
initial focus is on WC support, this framework can be extended to
support additional attributes (e.g., cachable) in the future.
Core concept is: a maple_tree instance is introduced per file
descriptor within vfio_device_file, allowing per-request ownership and
control of mmap attributes. Via new VFIO device operations (ioctl &
mmap), each vfio device populates its maple_tree, primarily during the
DEVICE_GET_REGION_INFO ioctl. The kernel returns a unique offset key
to userspace; userspace can then pass this offset to mmap, at which
point the kernel retrieves the correct maple_tree entry and invokes
the new mmap op on the vfio device to map the region with the desired
attributes.
This model also enables a new UAPI for userspace to set attributes on
a given mmap offset, allowing flexibility and room for future feature
expansion.
Because these changes alter both internal region offset handling and
the ioctl/mmap interfaces, a staged approach is necessary to manage
the large scope of the update.
This RFC implements:
- Integration of the maple_tree mechanism and new VFIO ops, along
with required helpers.
- Initial onboard support for vfio-pci.
- Introduction of the new UAPI supporting WC.
Planned follow-up work:
- Extending new ops support to all vfio-pci devices.
- Updating usages of VFIO_PCI_OFFSET_TO_INDEX and VFIO_PCI_INDEX_TO_OFFSET.
- Migrating additional VFIO devices to the new ops.
- Fully removing legacy ioctl and mmap ops, renaming the new ops
in their place once migration is complete.
For now, legacy and new VFIO ops coexist. Legacy ops will be removed
following full migration across all relevant devices.
This RFC marks the start of this transition. I am seeking feedback on
the core implementation to ensure the direction and design are correct
before proceeding with further conversion and cleanup work. Thank you
for your review and guidance.
[0]: https://lore.kernel.org/kvm/20240801141914.GC3030761@ziepe.ca/
references:
- https://lore.kernel.org/kvm/20240731155352.3973857-1-kbusch@meta.com/T/#u
- https://lore.kernel.org/kvm/lrkyq4ivccb6x.fsf@dev-dsk-mngyadam-1c-cb3f7548.eu-west-1.amazon.com/
Mahmoud Adam (9):
vfio: add mmap maple tree to vfio
vfio: add transient ops to support vfio mmap mt
vfio-pci-core: rename vm operations
vfio-pci-core: remove redundant offset calculations
vfio-pci-core: add vfio_pci_mmap & helpers
vfio-pci-core: support the new vfio ops
vfio-pci: use the new vfio ops
vfio: UAPI for setting mmap attributes
vfio_pci_core: support mmap attrs uapi & WC
drivers/vfio/pci/vfio_pci.c | 4 +-
drivers/vfio/pci/vfio_pci_core.c | 165 +++++++++++++++++++++++++++----
drivers/vfio/vfio.h | 1 +
drivers/vfio/vfio_main.c | 42 ++++++++
include/linux/vfio.h | 22 +++++
include/linux/vfio_pci_core.h | 14 +++
include/uapi/linux/vfio.h | 19 ++++
7 files changed, 248 insertions(+), 19 deletions(-)
--
2.47.3
thanks,
MNAdam
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC PATCH 1/9] vfio: add mmap maple tree to vfio
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
@ 2025-08-04 10:39 ` Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 2/9] vfio: add transient ops to support vfio mmap mt Mahmoud Adam
` (8 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
add mmap maple tree for vfio_device_file, this allows vfio devices to
create per mmap request options. the vfio device needs to
insert/allocate the region range offset & size and make it accessible
for the user, probably when the user is calling
DEVICE_GET_REGION_INFO, and then vfio uses the maple_tree to find the
entry (vfio_mmap) needed for mmap op, this adds the vfio_mmap_init &
vfio_mmap_free for initialization and freeing the entry, the freeing
is done through the free callback in the vfio_mmap_ops, which
vfio_devices should implement if they are allocating an entry.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
I didn't find a situation where we would need to use ref counting for
now, so I didn't implement it, I think most cases are already handled
by file ref counting, but maybe I'm overlooking something here.
drivers/vfio/vfio.h | 1 +
drivers/vfio/vfio_main.c | 29 +++++++++++++++++++++++++++++
include/linux/vfio.h | 17 +++++++++++++++++
3 files changed, 47 insertions(+)
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 50128da18bcaf..3f0cf2dd41116 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -19,6 +19,7 @@ struct vfio_container;
struct vfio_device_file {
struct vfio_device *device;
struct vfio_group *group;
+ struct maple_tree mmap_mt;
u8 access_granted;
u32 devid; /* only valid when iommufd is valid */
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582d..4c4af4de60d12 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -39,6 +39,7 @@
#include <linux/interval_tree.h>
#include <linux/iova_bitmap.h>
#include <linux/iommufd.h>
+#include <linux/maple_tree.h>
#include "vfio.h"
#define DRIVER_VERSION "0.3"
@@ -498,6 +499,7 @@ vfio_allocate_device_file(struct vfio_device *device)
df->device = device;
spin_lock_init(&df->kvm_ref_lock);
+ mt_init_flags(&df->mmap_mt, MT_FLAGS_ALLOC_RANGE);
return df;
}
@@ -622,6 +624,25 @@ static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
pm_runtime_put(dev);
}
+void vfio_mmap_init(struct vfio_device *vdev, struct vfio_mmap *vmmap,
+ u32 region_flags, u64 offset, u64 size,
+ struct vfio_mmap_ops *ops)
+{
+ vmmap->owner = vdev;
+ vmmap->offset = offset;
+ vmmap->ops = ops;
+ vmmap->size = size;
+ vmmap->region_flags = region_flags;
+}
+EXPORT_SYMBOL_GPL(vfio_mmap_init);
+
+void vfio_mmap_free(struct vfio_mmap *vmmap)
+{
+ if (vmmap->ops && vmmap->ops->free)
+ vmmap->ops->free(vmmap);
+}
+EXPORT_SYMBOL_GPL(vfio_mmap_free);
+
/*
* VFIO Device fd
*/
@@ -629,14 +650,22 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
{
struct vfio_device_file *df = filep->private_data;
struct vfio_device *device = df->device;
+ struct vfio_mmap *vmmap;
+ unsigned long index = 0;
if (df->group)
vfio_df_group_close(df);
else
vfio_df_unbind_iommufd(df);
+ mt_for_each(&df->mmap_mt, vmmap, index, ULONG_MAX) {
+ mtree_erase(&df->mmap_mt, index);
+ vfio_mmap_free(vmmap);
+ }
+
vfio_device_put_registration(device);
+ mtree_destroy(&df->mmap_mt);
kfree(df);
return 0;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 707b00772ce1f..6e0aca05aa406 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -80,6 +80,19 @@ struct vfio_device {
#endif
};
+struct vfio_mmap {
+ struct vfio_device *owner;
+ u64 offset;
+ u64 size;
+ u32 region_flags;
+ struct vfio_mmap_ops *ops;
+};
+
+struct vfio_mmap_ops {
+ void (*free)(struct vfio_mmap *vmmap);
+};
+
+
/**
* struct vfio_device_ops - VFIO bus driver device callbacks
*
@@ -338,6 +351,10 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage);
int vfio_dma_rw(struct vfio_device *device, dma_addr_t iova,
void *data, size_t len, bool write);
+void vfio_mmap_init(struct vfio_device *vdev, struct vfio_mmap *vmmap,
+ u32 region_flags, u64 offset, u64 size,
+ struct vfio_mmap_ops *ops);
+void vfio_mmap_free(struct vfio_mmap *vmmap);
/*
* Sub-module helpers
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 2/9] vfio: add transient ops to support vfio mmap mt
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 1/9] vfio: add mmap maple tree to vfio Mahmoud Adam
@ 2025-08-04 10:39 ` Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 3/9] vfio-pci-core: rename vm operations Mahmoud Adam
` (7 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
add a new transient operations for ioctl & mmap that allows using the
new mmap maple tree, since these operations are used extensively, it's
better to transition into new temporary ops, then after onboarding all
the users to the new ops we will drop the old legacy ops proto and
replace it with the new ones. Having new ops allows us to enforce the
vmmap existence check when mmap ops is called, and make the migration
more stable, and reviewable.
ioctl needs to have access over the whole mt to add/query entries when
needed, this allows inserting new range in the mt for example when
DEVICE_GET_REGION_INFO is called by the user, this also enabled us to
add other uapi to change mmap attrs in a certain range. When mmapping
there must be a vmmap entry for that offset otherwise return -EINVAL.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
This names is only used for the migration period that was I used 2 as a suffix,
maybe _vmmap could also be used or similar.
drivers/vfio/vfio_main.c | 12 ++++++++++++
include/linux/vfio.h | 4 ++++
2 files changed, 16 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 4c4af4de60d12..3275ff56eef47 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1324,6 +1324,10 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
break;
default:
+ if (device->ops->ioctl2) {
+ ret = device->ops->ioctl2(device, cmd, arg, &df->mmap_mt);
+ break;
+ }
if (unlikely(!device->ops->ioctl))
ret = -EINVAL;
else
@@ -1372,11 +1376,19 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
{
struct vfio_device_file *df = filep->private_data;
struct vfio_device *device = df->device;
+ struct vfio_mmap *vmmap;
/* Paired with smp_store_release() following vfio_df_open() */
if (!smp_load_acquire(&df->access_granted))
return -EINVAL;
+ if (device->ops->mmap2) {
+ vmmap = mtree_load(&df->mmap_mt, (vma->vm_pgoff << PAGE_SHIFT));
+ if (!vmmap)
+ return -EINVAL;
+ return device->ops->mmap2(device, vma, vmmap);
+ }
+
if (unlikely(!device->ops->mmap))
return -EINVAL;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 6e0aca05aa406..836ef72a38104 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -142,7 +142,11 @@ struct vfio_device_ops {
size_t count, loff_t *size);
long (*ioctl)(struct vfio_device *vdev, unsigned int cmd,
unsigned long arg);
+ long (*ioctl2)(struct vfio_device *vdev, unsigned int cmd,
+ unsigned long arg, struct maple_tree *mmap_mt);
int (*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
+ int (*mmap2)(struct vfio_device *vdev, struct vm_area_struct *vma,
+ struct vfio_mmap *vmmap);
void (*request)(struct vfio_device *vdev, unsigned int count);
int (*match)(struct vfio_device *vdev, char *buf);
void (*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 3/9] vfio-pci-core: rename vm operations
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 1/9] vfio: add mmap maple tree to vfio Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 2/9] vfio: add transient ops to support vfio mmap mt Mahmoud Adam
@ 2025-08-04 10:39 ` Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 4/9] vfio-pci-core: remove redundant offset calculations Mahmoud Adam
` (6 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
rename vm operations from mmap to vm to avoid confusion with the vfio
mmap naming. mainly because we will reuse the name vfio_pci_mmap_ops
for the following patches.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
drivers/vfio/pci/vfio_pci_core.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcdd..9a22969607bfe 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1641,7 +1641,7 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
}
-static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
+static vm_fault_t vfio_pci_vm_huge_fault(struct vm_fault *vmf,
unsigned int order)
{
struct vm_area_struct *vma = vmf->vma;
@@ -1696,15 +1696,15 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
return ret;
}
-static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
+static vm_fault_t vfio_pci_vm_page_fault(struct vm_fault *vmf)
{
- return vfio_pci_mmap_huge_fault(vmf, 0);
+ return vfio_pci_vm_huge_fault(vmf, 0);
}
-static const struct vm_operations_struct vfio_pci_mmap_ops = {
- .fault = vfio_pci_mmap_page_fault,
+static const struct vm_operations_struct vfio_pci_vm_ops = {
+ .fault = vfio_pci_vm_page_fault,
#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
- .huge_fault = vfio_pci_mmap_huge_fault,
+ .huge_fault = vfio_pci_vm_huge_fault,
#endif
};
@@ -1792,7 +1792,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
*/
vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
VM_DONTEXPAND | VM_DONTDUMP);
- vma->vm_ops = &vfio_pci_mmap_ops;
+ vma->vm_ops = &vfio_pci_vm_ops;
return 0;
}
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 4/9] vfio-pci-core: remove redundant offset calculations
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (2 preceding siblings ...)
2025-08-04 10:39 ` [RFC PATCH 3/9] vfio-pci-core: rename vm operations Mahmoud Adam
@ 2025-08-04 10:39 ` Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 5/9] vfio-pci-core: add vfio_pci_mmap & helpers Mahmoud Adam
` (5 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
all switch cases are calculating the offset similarly, and they are
not used until the end of the function, just calculate the offset once
at the end, which makes it simpler to change the offset in one spot in
the following patch.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
drivers/vfio/pci/vfio_pci_core.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 9a22969607bfe..467466a0b619f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1014,13 +1014,11 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
switch (info.index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
- info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = pdev->cfg_size;
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
break;
case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
- info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = pci_resource_len(pdev, info.index);
if (!info.size) {
info.flags = 0;
@@ -1044,7 +1042,6 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
size_t size;
u16 cmd;
- info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.flags = 0;
info.size = 0;
@@ -1074,7 +1071,6 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
if (!vdev->has_vga)
return -EINVAL;
- info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = 0xc0000;
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
@@ -1093,7 +1089,6 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
i = info.index - VFIO_PCI_NUM_REGIONS;
- info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = vdev->region[i].size;
info.flags = vdev->region[i].flags;
@@ -1131,6 +1126,7 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
kfree(caps.buf);
}
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
}
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 5/9] vfio-pci-core: add vfio_pci_mmap & helpers
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (3 preceding siblings ...)
2025-08-04 10:39 ` [RFC PATCH 4/9] vfio-pci-core: remove redundant offset calculations Mahmoud Adam
@ 2025-08-04 10:39 ` Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 6/9] vfio-pci-core: support the new vfio ops Mahmoud Adam
` (4 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
support the new vmmap solution to vfio-pci-core, this adds the
vfio_pci_mmap struct. the core already keeps the offset and size of
the region, extend it with bar_index.
Add alloc helper funciton for vfio_pci, which allocates and insert
vmmap to the mt, for the transitioning period the mtree_insert_range
is used with the same offset calculation as the legacy solution, so
that we don't break VFIO_PCI_OFFSET_TO_INDEX usages, eventually after
all the vfio_pci_devices are migrated to the new ops, these macros
will be replaced with mtree_load or similar, then maple tree
allocation could be used instead of direct insertions.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
drivers/vfio/pci/vfio_pci_core.c | 44 ++++++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 10 ++++++++
2 files changed, 54 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 467466a0b619f..7a431a03bd850 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -882,6 +882,50 @@ static int msix_mmappable_cap(struct vfio_pci_core_device *vdev,
return vfio_info_add_capability(caps, &header, sizeof(header));
}
+static void vfio_pci_mmap_free(struct vfio_mmap *core_vmmap)
+{
+ struct vfio_pci_mmap *vmmap = container_of(core_vmmap,
+ struct vfio_pci_mmap,
+ core);
+ kfree(vmmap);
+}
+
+static struct vfio_mmap_ops vfio_pci_mmap_ops = {
+ .free = vfio_pci_mmap_free,
+};
+
+int vfio_pci_mmap_alloc(struct vfio_pci_core_device *vdev,
+ struct maple_tree *mmap_mt, u32 region_flags,
+ size_t bar_size, unsigned int bar_index,
+ unsigned long *offset)
+{
+ struct vfio_pci_mmap *vmmap;
+ int ret;
+ unsigned long alloc_size;
+ vmmap = kzalloc(sizeof(*vmmap), GFP_KERNEL);
+ if (!vmmap)
+ return -ENOMEM;
+
+ alloc_size = PAGE_ALIGN(bar_size);
+ /* keep the offset aligned to the current usage for now, so we
+ * don't break VFIO_PCI_OFFSET_TO_INDEX */
+ *offset = VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+ vmmap->bar_index = bar_index;
+ vfio_mmap_init(&vdev->vdev, &vmmap->core, region_flags,
+ *offset, alloc_size, &vfio_pci_mmap_ops);
+ ret = mtree_insert_range(mmap_mt, *offset,
+ *offset + alloc_size - 1,
+ &vmmap->core, GFP_KERNEL);
+ if (ret) {
+ vfio_mmap_free(&vmmap->core);
+ /* for now if it exists reuse it */
+ if (ret != -EEXIST)
+ return ret;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(vfio_pci_mmap_alloc);
+
int vfio_pci_core_register_dev_region(struct vfio_pci_core_device *vdev,
unsigned int type, unsigned int subtype,
const struct vfio_pci_regops *ops,
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b36..532d2914a9c2e 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -49,6 +49,11 @@ struct vfio_pci_region {
u32 flags;
};
+struct vfio_pci_mmap {
+ struct vfio_mmap core;
+ unsigned int bar_index;
+};
+
struct vfio_pci_core_device {
struct vfio_device vdev;
struct pci_dev *pdev;
@@ -137,6 +142,11 @@ bool vfio_pci_core_range_intersect_range(loff_t buf_start, size_t buf_cnt,
loff_t *buf_offset,
size_t *intersect_count,
size_t *register_offset);
+int vfio_pci_mmap_alloc(struct vfio_pci_core_device *vdev,
+ struct maple_tree *mmap_mt, u32 region_flags,
+ size_t bar_size, unsigned int bar_index,
+ unsigned long *offset);
+
#define VFIO_IOWRITE_DECLARATION(size) \
int vfio_pci_core_iowrite##size(struct vfio_pci_core_device *vdev, \
bool test_mem, u##size val, void __iomem *io);
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 6/9] vfio-pci-core: support the new vfio ops
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (4 preceding siblings ...)
2025-08-04 10:39 ` [RFC PATCH 5/9] vfio-pci-core: add vfio_pci_mmap & helpers Mahmoud Adam
@ 2025-08-04 10:39 ` Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 7/9] vfio-pci: use " Mahmoud Adam
` (3 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:39 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
This use the new mt to create the mmap offset entries when user calls
get_region_info, and return the vmmap range offset, and when the user
use the same region range for mmaping, we have access on the vmmap
entry, where we could know which bar from the bar_index, and later
change mmaping attributes like WC. On top of that since we use the mt
range offset, eventually we will not need this legacy
VFIO_PCI_INDEX_TO_OFFSET, which means more dynamic offset calculation
that remove the limitation of the legacy system. To avoid duplicating
the functions for mmap & get_region_info, this create a common
function and use mmap_mt/vmmap if not null, for ioctl2 just override
VFIO_DEVICE_GET_REGION_INFO only with the new function.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
This follow the same temprory suffix "2", but also this is only for
the migration period, the other function will be dropped and replace
eventually.
drivers/vfio/pci/vfio_pci_core.c | 72 +++++++++++++++++++++++++++++---
include/linux/vfio_pci_core.h | 4 ++
2 files changed, 71 insertions(+), 5 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7a431a03bd850..8418d98ac66ce 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1041,8 +1041,10 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
}
-static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
- struct vfio_region_info __user *arg)
+
+static int _vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct maple_tree *mmap_mt,
+ struct vfio_region_info __user *arg)
{
unsigned long minsz = offsetofend(struct vfio_region_info, offset);
struct pci_dev *pdev = vdev->pdev;
@@ -1170,10 +1172,32 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
kfree(caps.buf);
}
- info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ if (mmap_mt) {
+ ret = vfio_pci_mmap_alloc(vdev, mmap_mt,
+ info.flags, info.size, info.index,
+ (unsigned long *) &info.offset);
+ if (ret)
+ return ret;
+ } else {
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ }
+
return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
}
+static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info __user *arg)
+{
+ return _vfio_pci_ioctl_get_region_info(vdev, NULL, arg);
+}
+
+static int vfio_pci_ioctl_get_region_info2(struct vfio_pci_core_device *vdev,
+ struct maple_tree *mmap_mt,
+ struct vfio_region_info __user *arg)
+{
+ return _vfio_pci_ioctl_get_region_info(vdev, mmap_mt, arg);
+}
+
static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
struct vfio_irq_info __user *arg)
{
@@ -1514,6 +1538,23 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
+
+long vfio_pci_core_ioctl2(struct vfio_device *core_vdev, unsigned int cmd,
+ unsigned long arg, struct maple_tree *mmap_mt)
+{
+ struct vfio_pci_core_device *vdev =
+ container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ void __user *uarg = (void __user *)arg;
+
+ switch (cmd) {
+ case VFIO_DEVICE_GET_REGION_INFO:
+ return vfio_pci_ioctl_get_region_info2(vdev, mmap_mt, uarg);
+ default:
+ return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+ }
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl2);
+
static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
uuid_t __user *arg, size_t argsz)
{
@@ -1748,16 +1789,24 @@ static const struct vm_operations_struct vfio_pci_vm_ops = {
#endif
};
-int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
+static int _vfio_pci_core_mmap(struct vfio_device *core_vdev,
+ struct vm_area_struct *vma,
+ struct vfio_mmap *core_vmmap)
{
struct vfio_pci_core_device *vdev =
container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_pci_mmap *vmmap = NULL;
struct pci_dev *pdev = vdev->pdev;
unsigned int index;
u64 phys_len, req_len, pgoff, req_start;
int ret;
- index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+ if (core_vmmap) {
+ vmmap = container_of(core_vmmap, struct vfio_pci_mmap, core);
+ index = vmmap->bar_index;
+ } else {
+ index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+ }
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;
@@ -1836,8 +1885,21 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
return 0;
}
+
+int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
+{
+ return _vfio_pci_core_mmap(core_vdev, vma, NULL);
+}
EXPORT_SYMBOL_GPL(vfio_pci_core_mmap);
+int vfio_pci_core_mmap2(struct vfio_device *core_vdev,
+ struct vm_area_struct *vma,
+ struct vfio_mmap *core_vmmap)
+{
+ return _vfio_pci_core_mmap(core_vdev, vma, core_vmmap);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_mmap2);
+
void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count)
{
struct vfio_pci_core_device *vdev =
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 532d2914a9c2e..cb52b92340451 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -118,6 +118,8 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
int nr_virtfn);
long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
unsigned long arg);
+long vfio_pci_core_ioctl2(struct vfio_device *core_vdev, unsigned int cmd,
+ unsigned long arg, struct maple_tree *mmap_attrs_mt);
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz);
ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
@@ -125,6 +127,8 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
size_t count, loff_t *ppos);
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
+int vfio_pci_core_mmap2(struct vfio_device *core_vdev, struct vm_area_struct *vma,
+ struct vfio_mmap *core_vmmap);
void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 7/9] vfio-pci: use the new vfio ops
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (5 preceding siblings ...)
2025-08-04 10:39 ` [RFC PATCH 6/9] vfio-pci-core: support the new vfio ops Mahmoud Adam
@ 2025-08-04 10:40 ` Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 8/9] vfio: UAPI for setting mmap attributes Mahmoud Adam
` (2 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:40 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
changing vfio-pci vfio ops to use the new ioctl and mmap is enough to
make this migrated, and it gives a initial example of the migration of
vfio-pci devices.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
drivers/vfio/pci/vfio_pci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5ba39f7623bb7..73d3eded7f95d 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -131,11 +131,11 @@ static const struct vfio_device_ops vfio_pci_ops = {
.release = vfio_pci_core_release_dev,
.open_device = vfio_pci_open_device,
.close_device = vfio_pci_core_close_device,
- .ioctl = vfio_pci_core_ioctl,
+ .ioctl2 = vfio_pci_core_ioctl2,
.device_feature = vfio_pci_core_ioctl_feature,
.read = vfio_pci_core_read,
.write = vfio_pci_core_write,
- .mmap = vfio_pci_core_mmap,
+ .mmap2 = vfio_pci_core_mmap2,
.request = vfio_pci_core_request,
.match = vfio_pci_core_match,
.bind_iommufd = vfio_iommufd_physical_bind,
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 8/9] vfio: UAPI for setting mmap attributes
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (6 preceding siblings ...)
2025-08-04 10:40 ` [RFC PATCH 7/9] vfio-pci: use " Mahmoud Adam
@ 2025-08-04 10:40 ` Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 9/9] vfio_pci_core: support mmap attrs uapi & WC Mahmoud Adam
2025-08-04 18:49 ` [RFC PATCH 0/9] vfio: Introduce mmap maple tree Alex Williamson
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:40 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
This uapi allows setting mmap attributes using a specified region
offset, region offset is expected to be used from returned value of
VFIO_DEVICE_GET_REGION_INFO or similar, where vmmap mt entry was
created, start with write_combine attribute, which the user can use to
request mmap to use wc. vfio devices expected to load the vmmap entry
from mt and do the needed region specific checks, and sets the
attributes accordingly.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
drivers/vfio/vfio_main.c | 1 +
include/linux/vfio.h | 1 +
include/uapi/linux/vfio.h | 19 +++++++++++++++++++
3 files changed, 21 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 3275ff56eef47..58c3cf12a5317 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -633,6 +633,7 @@ void vfio_mmap_init(struct vfio_device *vdev, struct vfio_mmap *vmmap,
vmmap->ops = ops;
vmmap->size = size;
vmmap->region_flags = region_flags;
+ vmmap->attrs = 0;
}
EXPORT_SYMBOL_GPL(vfio_mmap_init);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 836ef72a38104..5885df1729183 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -85,6 +85,7 @@ struct vfio_mmap {
u64 offset;
u64 size;
u32 region_flags;
+ u32 attrs;
struct vfio_mmap_ops *ops;
};
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5764f315137f9..2e3fa90eef5a3 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1831,6 +1831,25 @@ struct vfio_iommu_spapr_tce_remove {
};
#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)
+/**
+ * VFIO_DEVICE_SET_MMAP_ATTRS - _IOW(VFIO_TYPE, VFIO_BASE + 21, struct vfio_mmap_attrs)
+ *
+ * Set memory mapping attributes for a specified region offset before
+ * calling mmap, it expects that the offset used was fetched by
+ * calling VFIO_DEVICE_GET_REGION_INFO.
+ *
+ * Attributes supported:
+ * - VFIO_MMAP_ATTR_WRITE_COMBINE: use write-combine when requested to mmap this offset.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_mmap_attrs {
+ __u64 offset; /* Region offset */
+ __u32 attrs;
+#define VFIO_MMAP_ATTR_WRITE_COMBINE (1 << 0)
+};
+#define VFIO_DEVICE_SET_MMAP_ATTRS _IO(VFIO_TYPE, VFIO_BASE + 21)
+
/* ***************************************************************** */
#endif /* _UAPIVFIO_H */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 9/9] vfio_pci_core: support mmap attrs uapi & WC
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (7 preceding siblings ...)
2025-08-04 10:40 ` [RFC PATCH 8/9] vfio: UAPI for setting mmap attributes Mahmoud Adam
@ 2025-08-04 10:40 ` Mahmoud Adam
2025-08-04 18:49 ` [RFC PATCH 0/9] vfio: Introduce mmap maple tree Alex Williamson
9 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Adam @ 2025-08-04 10:40 UTC (permalink / raw)
To: kvm; +Cc: alex.williamson, jgg, benh, David Woodhouse, pravkmr, nagy
Now we established the required dependencies to support WC through the
new vmmap in vfio_pci, this implements the new uapi & checks if the WC
attr is set while mmaping, then calls pgprot_writecombine.
Signed-off-by: Mahmoud Adam <mngyadam@amazon.de>
---
drivers/vfio/pci/vfio_pci_core.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 8418d98ac66ce..461440700af75 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1198,6 +1198,30 @@ static int vfio_pci_ioctl_get_region_info2(struct vfio_pci_core_device *vdev,
return _vfio_pci_ioctl_get_region_info(vdev, mmap_mt, arg);
}
+static int vfio_pci_ioctl_set_mmap_attrs(struct vfio_pci_core_device *vdev,
+ struct maple_tree *mmap_mt,
+ struct vfio_irq_info __user *arg)
+{
+ struct vfio_mmap_attrs vmmap_attrs;
+ struct vfio_pci_mmap *vmmap;
+
+ if (copy_from_user(&vmmap_attrs, arg, sizeof(vmmap_attrs)))
+ return -EFAULT;
+
+ if (vmmap_attrs.attrs & ~VFIO_MMAP_ATTR_WRITE_COMBINE)
+ return -EINVAL;
+
+ vmmap = mtree_load(mmap_mt, vmmap_attrs.offset);
+ if (!vmmap)
+ return -EINVAL;
+
+ if (!(vmmap->core.region_flags & VFIO_REGION_INFO_FLAG_MMAP))
+ return -EINVAL;
+
+ vmmap->core.attrs = vmmap_attrs.attrs;
+ return 0;
+}
+
static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
struct vfio_irq_info __user *arg)
{
@@ -1549,6 +1573,8 @@ long vfio_pci_core_ioctl2(struct vfio_device *core_vdev, unsigned int cmd,
switch (cmd) {
case VFIO_DEVICE_GET_REGION_INFO:
return vfio_pci_ioctl_get_region_info2(vdev, mmap_mt, uarg);
+ case VFIO_DEVICE_SET_MMAP_ATTRS:
+ return vfio_pci_ioctl_set_mmap_attrs(vdev, mmap_mt, uarg);
default:
return vfio_pci_core_ioctl(core_vdev, cmd, arg);
}
@@ -1855,7 +1881,10 @@ static int _vfio_pci_core_mmap(struct vfio_device *core_vdev,
}
vma->vm_private_data = vdev;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ if (vmmap && vmmap->core.attrs & VFIO_MMAP_ATTR_WRITE_COMBINE)
+ vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+ else
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
/*
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
` (8 preceding siblings ...)
2025-08-04 10:40 ` [RFC PATCH 9/9] vfio_pci_core: support mmap attrs uapi & WC Mahmoud Adam
@ 2025-08-04 18:49 ` Alex Williamson
2025-08-04 20:09 ` Mahmoud Nagy Adam
9 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2025-08-04 18:49 UTC (permalink / raw)
To: Mahmoud Adam; +Cc: kvm, jgg, benh, David Woodhouse, pravkmr, nagy
On Mon, 4 Aug 2025 12:39:53 +0200
Mahmoud Adam <mngyadam@amazon.de> wrote:
> This RFC series proposes the implementation of a new mechanism for
> region mmap attributes using maple trees, based on Jason's suggested
> maple tree and offset cookie approach[0]. The primary motivation is to
> enable userspace applications to specify mmap attributes—such as Write
> Combining (WC)—prior to invoking mmap on a VFIO region. While the
> initial focus is on WC support, this framework can be extended to
> support additional attributes (e.g., cachable) in the future.
>
> Core concept is: a maple_tree instance is introduced per file
> descriptor within vfio_device_file, allowing per-request ownership and
> control of mmap attributes. Via new VFIO device operations (ioctl &
> mmap), each vfio device populates its maple_tree, primarily during the
> DEVICE_GET_REGION_INFO ioctl. The kernel returns a unique offset key
> to userspace; userspace can then pass this offset to mmap, at which
> point the kernel retrieves the correct maple_tree entry and invokes
> the new mmap op on the vfio device to map the region with the desired
> attributes.
>
> This model also enables a new UAPI for userspace to set attributes on
> a given mmap offset, allowing flexibility and room for future feature
> expansion.
>
> Because these changes alter both internal region offset handling and
> the ioctl/mmap interfaces, a staged approach is necessary to manage
> the large scope of the update.
>
> This RFC implements:
> - Integration of the maple_tree mechanism and new VFIO ops, along
> with required helpers.
> - Initial onboard support for vfio-pci.
> - Introduction of the new UAPI supporting WC.
>
> Planned follow-up work:
> - Extending new ops support to all vfio-pci devices.
> - Updating usages of VFIO_PCI_OFFSET_TO_INDEX and VFIO_PCI_INDEX_TO_OFFSET.
> - Migrating additional VFIO devices to the new ops.
> - Fully removing legacy ioctl and mmap ops, renaming the new ops
> in their place once migration is complete.
>
>
> For now, legacy and new VFIO ops coexist. Legacy ops will be removed
> following full migration across all relevant devices.
>
> This RFC marks the start of this transition. I am seeking feedback on
> the core implementation to ensure the direction and design are correct
> before proceeding with further conversion and cleanup work. Thank you
> for your review and guidance.
I'm lost. AIUI, there was a proposal to manage region offsets via a
maple tree, specifically to get a more compact mapping, which I think
is meant to allow new regions (mmap cookies) to be created which are
effectively aliases to other regions with different mapping attributes.
Here we have a partial conversion to a maple tree, but the proposed
ioctl is only specifying a mapping attribute for an existing offset.
How does this require or take advantage of the maple tree?
We should be able to convert to a maple tree without introducing these
"legacy" ops. Ideally we'd have a compatibility mode where we retain
the existing offsets, but otherwise the maple tree should compact the
region layout within the device fd and be used ubiquitously, mmap and
read/write. That's only useful though if we decide on a uAPI that
actually wants to allocate ranges of the device fd for use with
alternative mapping attributes. Thanks,
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-04 18:49 ` [RFC PATCH 0/9] vfio: Introduce mmap maple tree Alex Williamson
@ 2025-08-04 20:09 ` Mahmoud Nagy Adam
2025-08-05 14:31 ` Jason Gunthorpe
0 siblings, 1 reply; 29+ messages in thread
From: Mahmoud Nagy Adam @ 2025-08-04 20:09 UTC (permalink / raw)
To: Alex Williamson; +Cc: kvm, jgg, benh, David Woodhouse, pravkmr, nagy
Alex Williamson <alex.williamson@redhat.com> writes:
>
> I'm lost. AIUI, there was a proposal to manage region offsets via a
> maple tree, specifically to get a more compact mapping, which I think
> is meant to allow new regions (mmap cookies) to be created which are
> effectively aliases to other regions with different mapping attributes.
>
> Here we have a partial conversion to a maple tree, but the proposed
> ioctl is only specifying a mapping attribute for an existing offset.
> How does this require or take advantage of the maple tree?
>
I think you are mentioning the vfio-pci-core ioctl implementation, as I
mentioned in the proposed patch I'm using region insert with mt which is
using the same offset calculation for the transitioning period, but this
will change to allocating a free region in maple_tree after moving the
vfio-pci devices to the new ops and changing the usage of the
OFFSET_TO_INDEX macros.
I wanted first to move all the vfio-pci devices to the new ops first,
then changing the OFFSET_TO_INDEX macro afterwards, to avoid duplicating
all the codes that do this calculations which includes page faults code
etc..
But so far we are advantaging from per FD range attributes for mmaping,
and extra regions can already be used by design.
> We should be able to convert to a maple tree without introducing these
> "legacy" ops.
This technically be done directly by changing the current ops (ioctl,
mmap, read & write) to add vmmap argument in place and call the ops with
NULL for vmmap entry if the entry not found in the mt, instead of
returning -EINVAL in vfio (vfio-devices could check that
themselves). instead of this transitioning stage that this RFC is
implementing.
I probably need to also update read & write ops to use the vmmap in the
same patch as well.
Thanks,
MNAdam
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-04 20:09 ` Mahmoud Nagy Adam
@ 2025-08-05 14:31 ` Jason Gunthorpe
2025-08-05 15:48 ` Mahmoud Nagy Adam
0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 14:31 UTC (permalink / raw)
To: Mahmoud Nagy Adam
Cc: Alex Williamson, kvm, benh, David Woodhouse, pravkmr, nagy
On Mon, Aug 04, 2025 at 10:09:44PM +0200, Mahmoud Nagy Adam wrote:
> > We should be able to convert to a maple tree without introducing these
> > "legacy" ops.
>
> This technically be done directly by changing the current ops (ioctl,
> mmap, read & write) to add vmmap argument in place and call the ops with
> NULL for vmmap entry if the entry not found in the mt
You have gone very far away from my original suggestion:
https://lore.kernel.org/kvm/20250716184028.GA2177603@ziepe.ca/
map2 should not exist, once you introduced a vfio_mmap_ops for free
then the mmap op should be placed into that structure.
ioctl2 is nasty, that should be the "new function op for
VFIO_DEVICE_GET_REGION_INFO" instead. We have been slowly moving
towards the core code doing more decode and dispatch of ioctls instead
of duplicating in drivers.
I still stand by the patch plan I gave in the above email. Clean up
how VFIO_DEVICE_GET_REGION_INFO is handled by drivers and a maple tree
change will trivially drop on top.
The basic way you've used the maple tree here looks plausible, but I'm
still against some VFIO_DEVICE_SET_MMAP_ATTRS as a uapi. mmap cookies
should be immutable, once they are issued they are for a certain thing
and never change.
The entire point of all this infrastructure is to allow more mmap
cookies to be issued. The new UAPI should be some
'VFIO_DEVICE_GET_REGION_INFO2' or whatever that generates an entirely
new mmap cookie with write combining semantics.
Jason
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re:[RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-05 14:31 ` Jason Gunthorpe
@ 2025-08-05 15:48 ` Mahmoud Nagy Adam
2025-08-05 18:50 ` [RFC " Jason Gunthorpe
2025-08-05 19:00 ` Alex Williamson
0 siblings, 2 replies; 29+ messages in thread
From: Mahmoud Nagy Adam @ 2025-08-05 15:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, benh, David Woodhouse, pravkmr, nagy
Jason Gunthorpe <jgg@ziepe.ca> writes:
> map2 should not exist, once you introduced a vfio_mmap_ops for free
> then the mmap op should be placed into that structure.
Does this mean dropping mmap op completely from vfio ops? I think we
could update mmap op in vfio to have the vmmap structure, no?
> ioctl2 is nasty, that should be the "new function op for
> VFIO_DEVICE_GET_REGION_INFO" instead. We have been slowly moving
> towards the core code doing more decode and dispatch of ioctls instead
> of duplicating in drivers.
ack.
> I still stand by the patch plan I gave in the above email. Clean up
> how VFIO_DEVICE_GET_REGION_INFO is handled by drivers and a maple tree
> change will trivially drop on top.
>
but I think also prior of migrating to use packed offsets, we would need
to fix the previous offset calculations, which means read & write ops
also need to access the vmmap object to convert the offset.
Another question is: since VFIO_DEVICE_GET_REGION_INFO will use mt and
technically will create and return different cookie offset everytime
it's called for the same region, do we expect that this will not break
any userspace usage? I'm not sure but could be that some user usage
relying on calling the get_region_info to produce the same offset as the
initial call, instead of caching the region offset for example?
Thanks a lot,
MNAdam
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-05 15:48 ` Mahmoud Nagy Adam
@ 2025-08-05 18:50 ` Jason Gunthorpe
2025-08-05 19:00 ` Alex Williamson
1 sibling, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 18:50 UTC (permalink / raw)
To: Mahmoud Nagy Adam
Cc: Alex Williamson, kvm, benh, David Woodhouse, pravkmr, nagy
On Tue, Aug 05, 2025 at 05:48:05PM +0200, Mahmoud Nagy Adam wrote:
>
> Jason Gunthorpe <jgg@ziepe.ca> writes:
>
> > map2 should not exist, once you introduced a vfio_mmap_ops for free
> > then the mmap op should be placed into that structure.
>
> Does this mean dropping mmap op completely from vfio ops?
Yes, I would aim to that. Once you add the new ops it is much cleaner
to have per-vfio_mmaps functions.
> but I think also prior of migrating to use packed offsets, we would need
> to fix the previous offset calculations, which means read & write ops
> also need to access the vmmap object to convert the offset.
I imagine you'd do a conversion where the first step is reorganizing
the function calls to add a new get_region_info op, while retaining
the existing driver assignment of the offsets.
Then you'd add the maple tree and continue to use the driver
offsets. Record the fixed pgoff ranges in the maple tree.
Then I think you'd need to teach read/write to do mt lookups and call
out to vfio_mmap_ops read/write functions as well.
Now nothing should be relying on the hardwired offsets
Finally you could drop the fixed driver assigned offsets and be fully
dynamic.
> Another question is: since VFIO_DEVICE_GET_REGION_INFO will use mt and
> technically will create and return different cookie offset everytime
> it's called for the same region,
It should not do this. The new infrastructure for a get_region_info op
should also take care of de-duplicating the cookies.
Jason
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-05 15:48 ` Mahmoud Nagy Adam
2025-08-05 18:50 ` [RFC " Jason Gunthorpe
@ 2025-08-05 19:00 ` Alex Williamson
[not found] ` <80dc87730f694b2d6e6aabbd29df49cf3c7c44fb.camel@amazon.com>
1 sibling, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2025-08-05 19:00 UTC (permalink / raw)
To: Mahmoud Nagy Adam
Cc: Jason Gunthorpe, kvm, benh, David Woodhouse, pravkmr, nagy
On Tue, 5 Aug 2025 17:48:05 +0200
Mahmoud Nagy Adam <mngyadam@amazon.de> wrote:
> Jason Gunthorpe <jgg@ziepe.ca> writes:
>
> > map2 should not exist, once you introduced a vfio_mmap_ops for free
> > then the mmap op should be placed into that structure.
>
> Does this mean dropping mmap op completely from vfio ops? I think we
> could update mmap op in vfio to have the vmmap structure, no?
>
> > ioctl2 is nasty, that should be the "new function op for
> > VFIO_DEVICE_GET_REGION_INFO" instead. We have been slowly moving
> > towards the core code doing more decode and dispatch of ioctls instead
> > of duplicating in drivers.
>
> ack.
>
> > I still stand by the patch plan I gave in the above email. Clean up
> > how VFIO_DEVICE_GET_REGION_INFO is handled by drivers and a maple tree
> > change will trivially drop on top.
> >
>
> but I think also prior of migrating to use packed offsets, we would need
> to fix the previous offset calculations, which means read & write ops
> also need to access the vmmap object to convert the offset.
>
>
>
> Another question is: since VFIO_DEVICE_GET_REGION_INFO will use mt and
> technically will create and return different cookie offset everytime
> it's called for the same region, do we expect that this will not break
> any userspace usage? I'm not sure but could be that some user usage
> relying on calling the get_region_info to produce the same offset as the
> initial call, instead of caching the region offset for example?
If we're returning a different cookie every time region_info is called
we're already on the wrong track. Userspace can call region_info an
arbitrary number of times and should get the same offset for the same
region every time. IMO we need an interface that triggers a new
mapping cookie to be created with specific mmap attributes. If we're
using region_info then we're dynamically introducing device specific
regions.
I think we're too concerned about this vmmap object, which ought to be
obtained via an offset into the maple tree in place of how we're using
fixed region indexes today. Some offsets will be pre-populated to
provide the existing regions we have today, optionally those
pre-populated entries may be at compatible offsets for broken
userspaces. New cookies will be dynamically created alongside. Thanks,
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
[not found] ` <20250806115224.GB377696@ziepe.ca>
@ 2025-08-07 8:12 ` Herrenschmidt, Benjamin
2025-08-07 19:06 ` Alex Williamson
2025-08-07 8:13 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 29+ messages in thread
From: Herrenschmidt, Benjamin @ 2025-08-07 8:12 UTC (permalink / raw)
To: kvm@vger.kernel.org
Cc: Kumar, Praveen, Adam, Mahmoud, Woodhouse, David,
nagy@khwaternagy.com, alex.williamson@redhat.com, jgg@ziepe.ca
Re-sending with the list this time. Also I've fixed my email so back to
benh@kernel.crashing.org :)
1. On Wed, 2025-08-06 at 08:52 -0300, Jason Gunthorpe wrote:
>
>
> On Tue, Aug 05, 2025 at 10:19:29PM +0000, Herrenschmidt, Benjamin
> wrote:
> >
> > Right, that's my main objection here too. The way I see things is
> > that
> > we are somewhat conflating two different issues:
> >
> > * Wanting to change the index -> offset "cooking" mapping to
> > create a
> > more dense cookie space
> >
> > * Wanting to define attributes for mappings...
>
> I don't think that is right. The first is, yes, creating more dense
> cookie space which I think we need to do the second item, which is
> really 'add 2x more cookies, or maybe more'.
Hrm... the end *goal* is to define attribute for mappings. This whole
conversation is "we want to map things WC". "defining more cookies" is
a possible *how* here.
The reason I don't like presenting it that way is that you conflate
what we are trying to achieve with an implementation detail.
By defining more cookies, what you mean is creating multiple cookies
pointing to the same physical regions with different mapping
attributes, correct ? This is very similar to my sub-region idea in
practice I think unless I misunderstand.
> One of the things we want to do with more cookies is to create unique
> cookies for WC mmap requests.
Well, we want to be able to WC map. Introducing "more cookies" again is
just one way to get there. How do you create those cookies ? Upon
request or each region automatically gets multiple with different
attributes ? Do they represent entire regions or subsets ? etc...
Here too, I'd rather we start from a clear idea of the actual UAPI we
want, then we can look at how it gets plumbed under the hood.
> I'm not convinced we should try to introduce more cookies without
> improving the infrastructure. It is already pretty weak, I think
> trying to mangle the indexes or something "clever" would result in
> something even harder to maintain even if it might be less patches to
> write in the first place.
VFIO-pci already allocates dynamic indexes but yes, I don't disagree
that cleaning up the infrastructure first is a good idea. However, I
would prefer if we had a clearer idea of the end goal first,
specifically the details/shape of the UAPI and underlying structure
*before* we start hacking at this all.
> > * What I originally proposed ages ago when we discussed it at LPC
> > which is to have an ioctl to create "subregions" of a region with
> > different attributes. This means creating a new index (and a new
> > corresponding "cookie") that represents a portion of an existing
> > region
> > with a different attribute set to be later used by mmap.
>
> I never liked this, and it is still creating more cookies but now
> with
> weird hacky looking uapi.
"weird hacky looking" isn't a great way of understanding your
objections :-) Yes, it's a bit more complicated, it allows to break
down BARs basically into sub-regions with different attributes which is
fairly close to what users really want to do, but it does complexify
the UAPI a bit.
That said I'm not married to the idea, I just don't completely
understand the alternative you are proposing...
> > * A simpler approach which is to have an ioctl to change the
> > attributes over a range of a region, so that *any* mapping of that
> > range now uses that attribute. This can be done completely
> > dynamically
> > (see below).
>
> And I really, really don't like this. The meaning of a memmap cookie
> should not be changing dynamically. That is a bad precedent.
What do you mean "a cookie changing dynamically" ? The cookie doesn't
change. The attributes of a portion of a mapping change based on what
userspace requested. It is also a pretty close match to the use case
which is to define that a given part of the device MMIO space is meant
to be accessed WC. There is no fundamental change to the "cookies".
The biggest advantage of that approach is that it completely precludes
multiple conflicting mappings for a given region (at least within a
given process, though it might be possible to extend it globally if we
want) which has been a concern expressed last time we talked about this
by the folks from the ARM world.
> The uAPI I prefer is a 'get region info2' which would be simplified
> and allow userspace to pass in some flags. One of those flags can be
> 'request wc'.
So talking of "weird hacky looking" having something called "get_info"
taking "request" flags ... ahem ... :-)
If what you really mean is that under the hood, a given "index"
produces multiple "regions" (cookies) and "get_info2" allows to specify
which one you want to get info about ?
I mean ... this would work. But I don't see how it's superior to any of
the above. I don't care *that much*, but it does make it a bit harder
to avoid the multiple mappings issue and will waste much more cookie
space than any of the alternatives.
> This is logical and future extensible.
I disagree ... I find it actually cumbersome but we can just agree to
disagree here and as I said, I don't care that much as long as there is
no fundamental reason why it wouldn't work in the end.
Talking of cookie space, one thing we do need to preserve is the
natural alignment to the BAR size. Userspace *will* do "|" instead of
"+" on top of a cookie when mmap'ing (beyond mmap own alignment
requirements).
> > Now, within the context of VFIO PCI, since we use a fault handler,
> > it
> > doesn't even have to be done before mmap, we can just dynamically
> > fault
> > a given page with the right attributes (and zap mappings on
> > attributes
> > changes). I would go down that path personally and generalize the
> > fault
> > handler to all VFIO implementations.
>
> Even worse. fault by fault changing of attributes? No way, that's a
> completely crazy uAPI!
Why ? first userspace doesn't know or see it happens fault by fault
(and it could be full established at mmap time in fact, but fault time
makes the implementation a lot easier).
Here you are conflating implementation details with "uAPI" ... uAPI is
what is presented to userspace, which in this case is "set the
attributes for this portion of a region". Then at map time, that
portion of the region gets the requested attributes. There is nothing
in the uAPI that carries the fact that it happens at fault time.
You keep coming up with "ugly", "crazy" etc... without every actually
spelling out the technical pro/cons that would actually substantiate
those adjectives. Basically anything that isn't your
get_region2 is "crazy" or "ugly" ... NIH syndrome ?
> > For example, if I put a 4k WC page in a middle of index 1, then the
> > tree would have an entry for before the WC page, an entry for the
> > WC
> > page and an entry for after. They all point to region index 1, but
> > the
> > fault handler can use the maple tree to find the right attributes
> > at
> > fault time for the page.
>
> I don't know what this is about, I don't want to see APIs to slice up
> regions. The vfio driver should report full regions in the pgoff
> space, and VMAs should be linked to single maple tree ranges only. If
> userspace wants something weird it can call mmap multiple times and
> get different VMAs.
Why ? What are the pros/cons of each approach ?
> Each VMA should refer to a single vfio_mmap struct with the same
> pgprot for every page. It is easy and simple.
Maybe ... we have precedents of doing differently but I don't have big
beef in that game.
The big advantages of that latter approach is that it's very clear and
simple for userspace (this bit of the BAR is meant to be WC) and
completely
avoids the multiple mapping problem. The inconvenient is that we have
to
generalize the fault handler mechanism to all backends, and it makes
the
multi-mapping impossible (in the maybe possible case where might want
to
allow it in some cases).
Overall I find a lot of putting cart before horses here. Can we first
agree
on a clear definition of the uAPI first, then we can figure out the
implementation details ?
Cheers,
Ben.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
[not found] ` <20250806115224.GB377696@ziepe.ca>
2025-08-07 8:12 ` Herrenschmidt, Benjamin
@ 2025-08-07 8:13 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2025-08-07 8:13 UTC (permalink / raw)
To: kvm; +Cc: Adam, Mahmoud
Re-sending with the list this time. Also I've fixed my email so back to
benh@kernel.crashing.org :)
1. On Wed, 2025-08-06 at 08:52 -0300, Jason Gunthorpe wrote:
>
>
> On Tue, Aug 05, 2025 at 10:19:29PM +0000, Herrenschmidt, Benjamin
> wrote:
> >
> > Right, that's my main objection here too. The way I see things is
> > that
> > we are somewhat conflating two different issues:
> >
> > * Wanting to change the index -> offset "cooking" mapping to
> > create a
> > more dense cookie space
> >
> > * Wanting to define attributes for mappings...
>
> I don't think that is right. The first is, yes, creating more dense
> cookie space which I think we need to do the second item, which is
> really 'add 2x more cookies, or maybe more'.
Hrm... the end *goal* is to define attribute for mappings. This whole
conversation is "we want to map things WC". "defining more cookies" is
a possible *how* here.
The reason I don't like presenting it that way is that you conflate
what we are trying to achieve with an implementation detail.
By defining more cookies, what you mean is creating multiple cookies
pointing to the same physical regions with different mapping
attributes, correct ? This is very similar to my sub-region idea in
practice I think unless I misunderstand.
> One of the things we want to do with more cookies is to create unique
> cookies for WC mmap requests.
Well, we want to be able to WC map. Introducing "more cookies" again is
just one way to get there. How do you create those cookies ? Upon
request or each region automatically gets multiple with different
attributes ? Do they represent entire regions or subsets ? etc...
Here too, I'd rather we start from a clear idea of the actual UAPI we
want, then we can look at how it gets plumbed under the hood.
> I'm not convinced we should try to introduce more cookies without
> improving the infrastructure. It is already pretty weak, I think
> trying to mangle the indexes or something "clever" would result in
> something even harder to maintain even if it might be less patches to
> write in the first place.
VFIO-pci already allocates dynamic indexes but yes, I don't disagree
that cleaning up the infrastructure first is a good idea. However, I
would prefer if we had a clearer idea of the end goal first,
specifically the details/shape of the UAPI and underlying structure
*before* we start hacking at this all.
> > * What I originally proposed ages ago when we discussed it at LPC
> > which is to have an ioctl to create "subregions" of a region with
> > different attributes. This means creating a new index (and a new
> > corresponding "cookie") that represents a portion of an existing
> > region
> > with a different attribute set to be later used by mmap.
>
> I never liked this, and it is still creating more cookies but now
> with
> weird hacky looking uapi.
"weird hacky looking" isn't a great way of understanding your
objections :-) Yes, it's a bit more complicated, it allows to break
down BARs basically into sub-regions with different attributes which is
fairly close to what users really want to do, but it does complexify
the UAPI a bit.
That said I'm not married to the idea, I just don't completely
understand the alternative you are proposing...
> > * A simpler approach which is to have an ioctl to change the
> > attributes over a range of a region, so that *any* mapping of that
> > range now uses that attribute. This can be done completely
> > dynamically
> > (see below).
>
> And I really, really don't like this. The meaning of a memmap cookie
> should not be changing dynamically. That is a bad precedent.
What do you mean "a cookie changing dynamically" ? The cookie doesn't
change. The attributes of a portion of a mapping change based on what
userspace requested. It is also a pretty close match to the use case
which is to define that a given part of the device MMIO space is meant
to be accessed WC. There is no fundamental change to the "cookies".
The biggest advantage of that approach is that it completely precludes
multiple conflicting mappings for a given region (at least within a
given process, though it might be possible to extend it globally if we
want) which has been a concern expressed last time we talked about this
by the folks from the ARM world.
> The uAPI I prefer is a 'get region info2' which would be simplified
> and allow userspace to pass in some flags. One of those flags can be
> 'request wc'.
So talking of "weird hacky looking" having something called "get_info"
taking "request" flags ... ahem ... :-)
If what you really mean is that under the hood, a given "index"
produces multiple "regions" (cookies) and "get_info2" allows to specify
which one you want to get info about ?
I mean ... this would work. But I don't see how it's superior to any of
the above. I don't care *that much*, but it does make it a bit harder
to avoid the multiple mappings issue and will waste much more cookie
space than any of the alternatives.
> This is logical and future extensible.
I disagree ... I find it actually cumbersome but we can just agree to
disagree here and as I said, I don't care that much as long as there is
no fundamental reason why it wouldn't work in the end.
Talking of cookie space, one thing we do need to preserve is the
natural alignment to the BAR size. Userspace *will* do "|" instead of
"+" on top of a cookie when mmap'ing (beyond mmap own alignment
requirements).
> > Now, within the context of VFIO PCI, since we use a fault handler,
> > it
> > doesn't even have to be done before mmap, we can just dynamically
> > fault
> > a given page with the right attributes (and zap mappings on
> > attributes
> > changes). I would go down that path personally and generalize the
> > fault
> > handler to all VFIO implementations.
>
> Even worse. fault by fault changing of attributes? No way, that's a
> completely crazy uAPI!
Why ? first userspace doesn't know or see it happens fault by fault
(and it could be full established at mmap time in fact, but fault time
makes the implementation a lot easier).
Here you are conflating implementation details with "uAPI" ... uAPI is
what is presented to userspace, which in this case is "set the
attributes for this portion of a region". Then at map time, that
portion of the region gets the requested attributes. There is nothing
in the uAPI that carries the fact that it happens at fault time.
You keep coming up with "ugly", "crazy" etc... without every actually
spelling out the technical pro/cons that would actually substantiate
those adjectives. Basically anything that isn't your
get_region2 is "crazy" or "ugly" ... NIH syndrome ?
> > For example, if I put a 4k WC page in a middle of index 1, then the
> > tree would have an entry for before the WC page, an entry for the
> > WC
> > page and an entry for after. They all point to region index 1, but
> > the
> > fault handler can use the maple tree to find the right attributes
> > at
> > fault time for the page.
>
> I don't know what this is about, I don't want to see APIs to slice up
> regions. The vfio driver should report full regions in the pgoff
> space, and VMAs should be linked to single maple tree ranges only. If
> userspace wants something weird it can call mmap multiple times and
> get different VMAs.
Why ? What are the pros/cons of each approach ?
> Each VMA should refer to a single vfio_mmap struct with the same
> pgprot for every page. It is easy and simple.
Maybe ... we have precedents of doing differently but I don't have big
beef in that game.
The big advantages of that latter approach is that it's very clear and
simple for userspace (this bit of the BAR is meant to be WC) and
completely
avoids the multiple mapping problem. The inconvenient is that we have
to
generalize the fault handler mechanism to all backends, and it makes
the
multi-mapping impossible (in the maybe possible case where might want
to
allow it in some cases).
Overall I find a lot of putting cart before horses here. Can we first
agree
on a clear definition of the uAPI first, then we can figure out the
implementation details ?
Cheers,
Ben.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-07 8:12 ` Herrenschmidt, Benjamin
@ 2025-08-07 19:06 ` Alex Williamson
2025-08-11 15:55 ` Jason Gunthorpe
0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2025-08-07 19:06 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: kvm@vger.kernel.org, Kumar, Praveen, Adam, Mahmoud,
Woodhouse, David, nagy@khwaternagy.com, jgg@ziepe.ca
On Thu, 7 Aug 2025 08:12:04 +0000
"Herrenschmidt, Benjamin" <benh@amazon.com> wrote:
> Re-sending with the list this time. Also I've fixed my email so back to
> benh@kernel.crashing.org :)
Replacing the To: in this reply as well, hopefully it reaches you.
> 1. On Wed, 2025-08-06 at 08:52 -0300, Jason Gunthorpe wrote:
> >
> >
> > On Tue, Aug 05, 2025 at 10:19:29PM +0000, Herrenschmidt, Benjamin
> > wrote:
> > >
> > > Right, that's my main objection here too. The way I see things is
> > > that
> > > we are somewhat conflating two different issues:
> > >
> > > * Wanting to change the index -> offset "cooking" mapping to
> > > create a
> > > more dense cookie space
> > >
> > > * Wanting to define attributes for mappings...
> >
> > I don't think that is right. The first is, yes, creating more dense
> > cookie space which I think we need to do the second item, which is
> > really 'add 2x more cookies, or maybe more'.
>
> Hrm... the end *goal* is to define attribute for mappings. This whole
> conversation is "we want to map things WC". "defining more cookies" is
> a possible *how* here.
>
> The reason I don't like presenting it that way is that you conflate
> what we are trying to achieve with an implementation detail.
>
> By defining more cookies, what you mean is creating multiple cookies
> pointing to the same physical regions with different mapping
> attributes, correct ? This is very similar to my sub-region idea in
> practice I think unless I misunderstand.
I agree and I think this is likely heading down the path of rehashing
the previous thread
https://lore.kernel.org/kvm/20240801141914.GC3030761@ziepe.ca/
I honestly still don't have a firm idea how "mmap cookies" are
different from some sort of extended region ID space, whether it's
sub-regions, device specific regions, or just dynamic regions. I think
there is an underlying desire to say, for example, that "region 0" is
BAR0, period, full stop, but there should be a way to get multiple mmap
cookies to region 0 with different mmap semantics. Of course
REGION_INFO provides the default mmap cookie via the offset field, but
since we don't want to dynamically create region N+1 (under this
interpretation) as an alias of region 0 with different mmap semantics,
now we're stuck in a conundrum of how to describe different mmap cookies
for one region.
So to a large extent I think we're causing our own grief here and I do
agree that if we created an API that allows new regions to be created
as lightweight (mmap-only) aliases of other regions, then we could just
re-use REGION_INFO with the new region index, get the offset/mmap
cookie, return -ENOSPC when we run out of indexes, and implement a
maple tree to get a more compact region space as a follow-on.
> > One of the things we want to do with more cookies is to create unique
> > cookies for WC mmap requests.
>
> Well, we want to be able to WC map. Introducing "more cookies" again is
> just one way to get there. How do you create those cookies ? Upon
> request or each region automatically gets multiple with different
> attributes ? Do they represent entire regions or subsets ? etc...
>
> Here too, I'd rather we start from a clear idea of the actual UAPI we
> want, then we can look at how it gets plumbed under the hood.
>
> > I'm not convinced we should try to introduce more cookies without
> > improving the infrastructure. It is already pretty weak, I think
> > trying to mangle the indexes or something "clever" would result in
> > something even harder to maintain even if it might be less patches to
> > write in the first place.
>
> VFIO-pci already allocates dynamic indexes but yes, I don't disagree
> that cleaning up the infrastructure first is a good idea. However, I
> would prefer if we had a clearer idea of the end goal first,
> specifically the details/shape of the UAPI and underlying structure
> *before* we start hacking at this all.
Yes, the UAPI here is shadowed by a maple tree conversion that isn't
really required by the UAPI. We need to lead with a good UAPI.
> > > * What I originally proposed ages ago when we discussed it at LPC
> > > which is to have an ioctl to create "subregions" of a region with
> > > different attributes. This means creating a new index (and a new
> > > corresponding "cookie") that represents a portion of an existing
> > > region
> > > with a different attribute set to be later used by mmap.
> >
> > I never liked this, and it is still creating more cookies but now
> > with
> > weird hacky looking uapi.
>
> "weird hacky looking" isn't a great way of understanding your
> objections :-) Yes, it's a bit more complicated, it allows to break
> down BARs basically into sub-regions with different attributes which is
> fairly close to what users really want to do, but it does complexify
> the UAPI a bit.
>
> That said I'm not married to the idea, I just don't completely
> understand the alternative you are proposing...
Obviously Jason can correct if the interpretation I gleaned from the
previous thread above is incorrect, but I don't really see that new
region indexes are weird or hacky. They fit into the REGION_INFO
scheme, they could be managed via DEVICE_FEATURE.
> > > * A simpler approach which is to have an ioctl to change the
> > > attributes over a range of a region, so that *any* mapping of that
> > > range now uses that attribute. This can be done completely
> > > dynamically
> > > (see below).
> >
> > And I really, really don't like this. The meaning of a memmap cookie
> > should not be changing dynamically. That is a bad precedent.
>
> What do you mean "a cookie changing dynamically" ? The cookie doesn't
> change. The attributes of a portion of a mapping change based on what
> userspace requested. It is also a pretty close match to the use case
> which is to define that a given part of the device MMIO space is meant
> to be accessed WC. There is no fundamental change to the "cookies".
>
> The biggest advantage of that approach is that it completely precludes
> multiple conflicting mappings for a given region (at least within a
> given process, though it might be possible to extend it globally if we
> want) which has been a concern expressed last time we talked about this
> by the folks from the ARM world.
This precludes that a user could simultaneously have mappings to the
same device memory with different mmap attributes, but otherwise it
honestly sounds like the most consistent mechanism if we don't want to
expand the region index space.
> > The uAPI I prefer is a 'get region info2' which would be simplified
> > and allow userspace to pass in some flags. One of those flags can be
> > 'request wc'.
>
> So talking of "weird hacky looking" having something called "get_info"
> taking "request" flags ... ahem ... :-)
Yup...
> If what you really mean is that under the hood, a given "index"
> produces multiple "regions" (cookies) and "get_info2" allows to specify
> which one you want to get info about ?
>
> I mean ... this would work. But I don't see how it's superior to any of
> the above. I don't care *that much*, but it does make it a bit harder
> to avoid the multiple mappings issue and will waste much more cookie
> space than any of the alternatives.
>
> > This is logical and future extensible.
>
> I disagree ... I find it actually cumbersome but we can just agree to
> disagree here and as I said, I don't care that much as long as there is
> no fundamental reason why it wouldn't work in the end.
I'm not a fan of the REGION_INFO2 idea either. A new region index that
provides an alias to another region with different mmap semantics is
much more intuitive in our existing UAPI, imo.
> Talking of cookie space, one thing we do need to preserve is the
> natural alignment to the BAR size. Userspace *will* do "|" instead of
> "+" on top of a cookie when mmap'ing (beyond mmap own alignment
> requirements).
>
> > > Now, within the context of VFIO PCI, since we use a fault handler,
> > > it
> > > doesn't even have to be done before mmap, we can just dynamically
> > > fault
> > > a given page with the right attributes (and zap mappings on
> > > attributes
> > > changes). I would go down that path personally and generalize the
> > > fault
> > > handler to all VFIO implementations.
> >
> > Even worse. fault by fault changing of attributes? No way, that's a
> > completely crazy uAPI!
>
> Why ? first userspace doesn't know or see it happens fault by fault
> (and it could be full established at mmap time in fact, but fault time
> makes the implementation a lot easier).
>
> Here you are conflating implementation details with "uAPI" ... uAPI is
> what is presented to userspace, which in this case is "set the
> attributes for this portion of a region". Then at map time, that
> portion of the region gets the requested attributes. There is nothing
> in the uAPI that carries the fact that it happens at fault time.
>
> You keep coming up with "ugly", "crazy" etc... without every actually
> spelling out the technical pro/cons that would actually substantiate
> those adjectives. Basically anything that isn't your
> get_region2 is "crazy" or "ugly" ... NIH syndrome ?
>
> > > For example, if I put a 4k WC page in a middle of index 1, then the
> > > tree would have an entry for before the WC page, an entry for the
> > > WC
> > > page and an entry for after. They all point to region index 1, but
> > > the
> > > fault handler can use the maple tree to find the right attributes
> > > at
> > > fault time for the page.
> >
> > I don't know what this is about, I don't want to see APIs to slice up
> > regions. The vfio driver should report full regions in the pgoff
> > space, and VMAs should be linked to single maple tree ranges only. If
> > userspace wants something weird it can call mmap multiple times and
> > get different VMAs.
>
> Why ? What are the pros/cons of each approach ?
>
> > Each VMA should refer to a single vfio_mmap struct with the same
> > pgprot for every page. It is easy and simple.
>
> Maybe ... we have precedents of doing differently but I don't have big
> beef in that game.
>
> The big advantages of that latter approach is that it's very clear and
> simple for userspace (this bit of the BAR is meant to be WC) and
> completely
> avoids the multiple mapping problem. The inconvenient is that we have
> to
> generalize the fault handler mechanism to all backends, and it makes
> the
> multi-mapping impossible (in the maybe possible case where might want
> to
> allow it in some cases).
>
> Overall I find a lot of putting cart before horses here. Can we first
> agree
> on a clear definition of the uAPI first, then we can figure out the
> implementation details ?
+1. Thanks,
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-07 19:06 ` Alex Williamson
@ 2025-08-11 15:55 ` Jason Gunthorpe
2025-08-11 22:07 ` Alex Williamson
0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-08-11 15:55 UTC (permalink / raw)
To: Alex Williamson
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Kumar, Praveen,
Adam, Mahmoud, Woodhouse, David, nagy@khwaternagy.com
On Thu, Aug 07, 2025 at 01:06:05PM -0600, Alex Williamson wrote:
> So to a large extent I think we're causing our own grief here and I do
> agree that if we created an API that allows new regions to be created
> as lightweight (mmap-only) aliases of other regions, then we could just
> re-use REGION_INFO with the new region index, get the offset/mmap
> cookie, return -ENOSPC when we run out of indexes, and implement a
> maple tree to get a more compact region space as a follow-on.
That still needs to dynamically create mmap cookies - which is really the
whole problem here. It is not so easy to just change the existing code
with hardwired math converting pgoff to indexes to support multiple
indexes.
> > Well, we want to be able to WC map. Introducing "more cookies" again is
> > just one way to get there. How do you create those cookies ? Upon
> > request or each region automatically gets multiple with different
> > attributes ? Do they represent entire regions or subsets ? etc...
It doesn't matter. Fixing how mmap works internally lets you use all
of those options.
>
> > > > * What I originally proposed ages ago when we discussed it at LPC
> > > > which is to have an ioctl to create "subregions" of a region with
> > > > different attributes. This means creating a new index (and a new
> > > > corresponding "cookie") that represents a portion of an existing
> > > > region
> > > > with a different attribute set to be later used by mmap.
> > >
> > > I never liked this, and it is still creating more cookies but now
> > > with
> > > weird hacky looking uapi.
> >
> > "weird hacky looking" isn't a great way of understanding your
> > objections :-) Yes, it's a bit more complicated, it allows to break
> > down BARs basically into sub-regions with different attributes which is
> > fairly close to what users really want to do, but it does complexify
> > the UAPI a bit.
I don't think we need sub-regions, it is too complicated in the kernel
and pretty much useless.
> > That said I'm not married to the idea, I just don't completely
> > understand the alternative you are proposing...
>
> Obviously Jason can correct if the interpretation I gleaned from the
> previous thread above is incorrect, but I don't really see that new
> region indexes are weird or hacky. They fit into the REGION_INFO
> scheme, they could be managed via DEVICE_FEATURE.
I think I said before, adding more region indexes just makes a PITA
for the driver to manage this.
Indexes should refer to the physical object, the BAR in PCI. This is
easy for the drivers to manage, any easy for userspace to understand.
If you make dynamic indexes then every driver needs its own scheme to
map them to physical indexes and we end up re-inventing the maple tree
stuff without the cleanup or generality it brings.
> > > > * A simpler approach which is to have an ioctl to change the
> > > > attributes over a range of a region, so that *any* mapping of that
> > > > range now uses that attribute. This can be done completely
> > > > dynamically
> > > > (see below).
> > >
> > > And I really, really don't like this. The meaning of a memmap cookie
> > > should not be changing dynamically. That is a bad precedent.
> >
> > What do you mean "a cookie changing dynamically" ? The cookie doesn't
> > change. The attributes of a portion of a mapping change based on what
> > userspace requested.
Exactly. You have changed the underlying meaning of the cookie
dynamicaly.
> > The biggest advantage of that approach is that it completely precludes
> > multiple conflicting mappings for a given region (at least within a
> > given process, though it might be possible to extend it globally
> > if we
It doesn't. It just makes a messy uapi. At the time of mmap the vma
would stil have to capture the attributes (no fault by fault!) into
the VMA so we will see real users doing things like:
set to wc(cookie)
mmap(cookie + XXX)
set to !wc(cookie)
mmap(cookie + YY)
And then if you try to debug this all our file/vma debug tools will
just show cookie everywhere with no distinction that some VMAs are WC
and some VMAs are !WC.
Basically, it fundamentally breaks how pgoff is supposed to work here
by making its meaning unstable.
> > want) which has been a concern expressed last time we talked about this
> > by the folks from the ARM world.
>
> This precludes that a user could simultaneously have mappings to the
> same device memory with different mmap attributes,
Indeed, that is required for most HW. mlx5 for example has BARs that
mix WC and non WC access modes. There are too few BARs for most HW to
be able to dedicate an entire BAR to WC only.
> > > The uAPI I prefer is a 'get region info2' which would be simplified
> > > and allow userspace to pass in some flags. One of those flags can be
> > > 'request wc'.
> >
> > So talking of "weird hacky looking" having something called "get_info"
> > taking "request" flags ... ahem ... :-)
>
> Yup...
We've already confused returning the mmap cookie and the other
information about the region. If you want to have flags to customize
what mmap cookie is returned, this is the simplest answer.
> > If what you really mean is that under the hood, a given "index"
> > produces multiple "regions" (cookies) and "get_info2" allows to specify
> > which one you want to get info about ?
There is only one index. You can ask for different mmap options, and
pgoff space for them is created dynamically.
A "region" is not a mmap cookie, a region can have multiple mmap cookies.
> > I disagree ... I find it actually cumbersome but we can just agree to
> > disagree here and as I said, I don't care that much as long as there is
> > no fundamental reason why it wouldn't work in the end.
>
> I'm not a fan of the REGION_INFO2 idea either. A new region index that
> provides an alias to another region with different mmap semantics is
> much more intuitive in our existing UAPI, imo.
It would not be another region index. That is the whole point. It is
another pgoff for an existing index.
> > Talking of cookie space, one thing we do need to preserve is the
> > natural alignment to the BAR size. Userspace *will* do "|" instead of
> > "+" on top of a cookie when mmap'ing (beyond mmap own alignment
> > requirements).
I think that's just wrong userspace, sorry. :(
Still, we want to do this anyhow as the VMA alignment stuff Peter was
working on also requires pgoff alignment.
> > > > Now, within the context of VFIO PCI, since we use a fault handler,
> > > > it
> > > > doesn't even have to be done before mmap, we can just dynamically
> > > > fault
> > > > a given page with the right attributes (and zap mappings on
> > > > attributes
> > > > changes). I would go down that path personally and generalize the
> > > > fault
> > > > handler to all VFIO implementations.
> > >
> > > Even worse. fault by fault changing of attributes? No way, that's a
> > > completely crazy uAPI!
> >
> > Why ? first userspace doesn't know or see it happens fault by fault
> > (and it could be full established at mmap time in fact, but fault time
> > makes the implementation a lot easier).
> >
> > Here you are conflating implementation details with "uAPI" ... uAPI is
> > what is presented to userspace, which in this case is "set the
> > attributes for this portion of a region". Then at map time, that
> > portion of the region gets the requested attributes. There is nothing
> > in the uAPI that carries the fact that it happens at fault time.
> >
> > You keep coming up with "ugly", "crazy" etc... without every actually
> > spelling out the technical pro/cons that would actually substantiate
> > those adjectives. Basically anything that isn't your
> > get_region2 is "crazy" or "ugly" ... NIH syndrome ?
This is not how the mm ever works *anywhere* in the main kernel, the
only reason I can see you are propsing this because it avoids doing
the cleanup work.
It makes it impossible for userspace to get both WC and non-WC
mappings. It makes it indeterminate what behavior userspace gets at
every store operation. Real HW like mlx5 would need mixed WC/!WC on
the same bar, so I view this as an entirely bad uAPI.
> > > I don't know what this is about, I don't want to see APIs to slice up
> > > regions. The vfio driver should report full regions in the pgoff
> > > space, and VMAs should be linked to single maple tree ranges only. If
> > > userspace wants something weird it can call mmap multiple times and
> > > get different VMAs.
> >
> > Why ? What are the pros/cons of each approach ?
Extra complexity in the kernel, no usecase that isn't already handled
by mmap directly.
> > The big advantages of that latter approach is that it's very clear and
> > simple for userspace (this bit of the BAR is meant to be WC) and
> > completely
> > avoids the multiple mapping problem.
What is the "multiple mapping problem" ? We've discussed this
extensively with ARM when we added WC support to KVM and I think we
have a general agreement that multiple mappings are not something the
kernel needs to actively prevent, in the limited case of WC and !WC.
> >The inconvenient is that we have to generalize the fault handler
> > mechanism to all backends, and it makes the multi-mapping
> > impossible (in the maybe possible case where might want to allow
> > it in some cases).
> >
> > Overall I find a lot of putting cart before horses here. Can we first
> > agree
> > on a clear definition of the uAPI first, then we can figure out the
> > implementation details ?
>
> +1. Thanks,
All the uAPI proposals I've seen are all trying to work around the
current state of the code.
This is broadly what I've proposed consistently since the beginning,
adjusted for the various remarks since:
struct vfio_region_get_mmap {
__u32 argsz;
__u32 region_index; // only one, no aliases
__u32 mmap_flags; // Set WC here
__aligned_u64 region_size;
__aligned_u64 fd_offset;
};
struct vfio_region_get_caps {
__u32 argsz;
__u32 region_index;
__u32 region_flags; // READ/WRITE/etc
__aligned_u64 region_size;
__u32 cap_offset; /* Offset within info struct of first cap */
};
Alex, you pointed out that the parsing of the existing
VFIO_DEVICE_GET_REGION_INFO has made it non-extendable. So the above
two are creating a new extendable version that are replacements.
To avoid the naming confusion we have a specific ioctl to get
mmap'able access, and another one for the cap list. I guess this also
gives access to read/write so maybe the name needs more bikeshedding.
Compared to the existing we add a single new concept, 'mmap_flags',
which customizes how the fd_offset will behave with mmap. The kernel
will internally de-duplicate region_index/mmap_flags -> fd_offset.
There is still one index per physical object (ie BAR) in the uAPI.
We get one cookie that describes the VMA behavior exactly and immutably.
The existing VFIO_DEVICE_GET_REGION_INFO is expressed in terms of the
above two operations with mmap_flags = 0.
No new subregion concept. No alias region indexes. No changing the
meaning of returned cookies.
We simply allow the userspace to request a mmap cookie for a region
index that will always cause mmap to create a VMA with WC pgrot.
If we someday want cachable, or ARM's DEVICE_GRE or whatever we can
trivially add more flags to mmap_flags and userspace can get
more cookies.
If we later decide we need to solve the ARM multi-device issue then we
can cleanly extend an additional start/len to vfio_region_get_mmap
which can ensure mmaps cookies are disjoint. This is not subregions,
or new regions, this is just a cookie with a restriction.
In terms of implementation once you do the maple tree work that
Mahmoud started we end up with vfio_region_get_mmap allocating a
struct vfio_mmap for each unique region_index/mmap_flags and returning
it to userspace.
All the drivers will use the struct vfio_mmap everywhere instead of
trying to decode pgoff to index with macros. Inside the vfio_mmap we
store the index for the driver to use directly. The driver has a
simple implementation, one index mapped to one physical resource. A
pgprot value in the struct vfio_mmap to specify how to create the VMA.
No dynamic driver aware regions or subregions.
Jason
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-11 15:55 ` Jason Gunthorpe
@ 2025-08-11 22:07 ` Alex Williamson
2025-08-12 0:30 ` Jason Gunthorpe
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Alex Williamson @ 2025-08-11 22:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Kumar, Praveen,
Adam, Mahmoud, Woodhouse, David, nagy@khwaternagy.com
On Mon, 11 Aug 2025 12:55:58 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 07, 2025 at 01:06:05PM -0600, Alex Williamson wrote:
>
> > So to a large extent I think we're causing our own grief here and I do
> > agree that if we created an API that allows new regions to be created
> > as lightweight (mmap-only) aliases of other regions, then we could just
> > re-use REGION_INFO with the new region index, get the offset/mmap
> > cookie, return -ENOSPC when we run out of indexes, and implement a
> > maple tree to get a more compact region space as a follow-on.
>
> That still needs to dynamically create mmap cookies - which is really the
> whole problem here. It is not so easy to just change the existing code
> with hardwired math converting pgoff to indexes to support multiple
> indexes.
We do this today with device specific regions, see
vfio_pci_core_register_dev_region(). We use this to provide several
additional regions for IGD. If we had an interface for users to
trigger new regions we'd need some protection for exceeding the index
space (-ENOSPC), but adding a small number of regions is not a problem.
> > > Well, we want to be able to WC map. Introducing "more cookies"
> > > again is just one way to get there. How do you create those
> > > cookies ? Upon request or each region automatically gets multiple
> > > with different attributes ? Do they represent entire regions or
> > > subsets ? etc...
>
> It doesn't matter. Fixing how mmap works internally lets you use all
> of those options.
What exactly is the "fix how mmap works internally" proposal?
> > > > > * What I originally proposed ages ago when we discussed it
> > > > > at LPC which is to have an ioctl to create "subregions" of a
> > > > > region with different attributes. This means creating a new
> > > > > index (and a new corresponding "cookie") that represents a
> > > > > portion of an existing region
> > > > > with a different attribute set to be later used by mmap.
> > > >
> > > > I never liked this, and it is still creating more cookies but
> > > > now with
> > > > weird hacky looking uapi.
> > >
> > > "weird hacky looking" isn't a great way of understanding your
> > > objections :-) Yes, it's a bit more complicated, it allows to
> > > break down BARs basically into sub-regions with different
> > > attributes which is fairly close to what users really want to do,
> > > but it does complexify the UAPI a bit.
>
> I don't think we need sub-regions, it is too complicated in the kernel
> and pretty much useless.
So we infer that mmap cookies are an alias to an entire region.
> > > That said I'm not married to the idea, I just don't completely
> > > understand the alternative you are proposing...
> >
> > Obviously Jason can correct if the interpretation I gleaned from the
> > previous thread above is incorrect, but I don't really see that new
> > region indexes are weird or hacky. They fit into the REGION_INFO
> > scheme, they could be managed via DEVICE_FEATURE.
>
> I think I said before, adding more region indexes just makes a PITA
> for the driver to manage this.
>
> Indexes should refer to the physical object, the BAR in PCI. This is
> easy for the drivers to manage, any easy for userspace to understand.
>
> If you make dynamic indexes then every driver needs its own scheme to
> map them to physical indexes and we end up re-inventing the maple tree
> stuff without the cleanup or generality it brings.
I'd like to better understand why you believe this to be the case.
We have an existing ABI that maps BARs, config space, and VGA spaces to
fixed region indexes. That "region index" to "device space" mapping
cannot change, but the offset of a given region and the total number of
regions is not ABI. Therefore we can introduce an API where a user
says "give me a new region index that aliases region 0 with mmap
attribute FOO". Maybe this returns to them region index 10. The
DEVICE_INFO ioctl now reports more regions and REGION_INFO for index 10
provides the "mmap cookie", ie. offset, for this new region.
To a limited extent we can provide this within fixed index/pgoff
implementation (not ABI) we use now, but AIUI we could use a maple tree
to block out ranges and get more dense packing of regions across the
device fd.
This is compatible with existing userspace drivers, they don't invoke
any new APIs to create new regions and ideally they shouldn't have any
ABI dependencies on specific offsets (but we might provide limited
compatibility if that proves otherwise). Drivers invoke the new API
specifically to get a region index with the desired mmap attribute and
use the existing REGION_INFO ioctl to get the offset of the region.
I don't understand how this introduces so much complication to drivers
that, for example, BAR0 might be accessible through region index 0 for
legacy mappings and index 10 for modified mmap attributes. It might
make our unmap_mapping_range() call more complicated if we have a
non-contiguous mmap space, but I suspect we can just call it across the
furthest region extent.
Can you describe the driver scenario where having two different
mmap cookies for region index 0 makes things significantly easier for
drivers?
It seems like Ben's suggestion below of a call that modifies the mmap
attributes of an existing region is the least overall change to
existing drivers, though I'm not sure if that's what we should be
optimizing for.
> > > > > * A simpler approach which is to have an ioctl to change the
> > > > > attributes over a range of a region, so that *any* mapping of
> > > > > that range now uses that attribute. This can be done
> > > > > completely dynamically
> > > > > (see below).
> > > >
> > > > And I really, really don't like this. The meaning of a memmap
> > > > cookie should not be changing dynamically. That is a bad
> > > > precedent.
> > >
> > > What do you mean "a cookie changing dynamically" ? The cookie
> > > doesn't change. The attributes of a portion of a mapping change
> > > based on what userspace requested.
>
> Exactly. You have changed the underlying meaning of the cookie
> dynamicaly.
At the direction of the user. Why is that a problem?
> > > The biggest advantage of that approach is that it completely
> > > precludes multiple conflicting mappings for a given region (at
> > > least within a given process, though it might be possible to
> > > extend it globally if we
>
> It doesn't. It just makes a messy uapi. At the time of mmap the vma
> would stil have to capture the attributes (no fault by fault!) into
> the VMA so we will see real users doing things like:
>
> set to wc(cookie)
> mmap(cookie + XXX)
> set to !wc(cookie)
> mmap(cookie + YY)
>
> And then if you try to debug this all our file/vma debug tools will
> just show cookie everywhere with no distinction that some VMAs are WC
> and some VMAs are !WC.
>
> Basically, it fundamentally breaks how pgoff is supposed to work here
> by making its meaning unstable.
We could require the mmap attribute is set before mmap and not changed
after, but yes, we don't get simultaneous mmaps with different
attributes without different cookies.
> > > want) which has been a concern expressed last time we talked
> > > about this by the folks from the ARM world.
> >
> > This precludes that a user could simultaneously have mappings to the
> > same device memory with different mmap attributes,
>
> Indeed, that is required for most HW. mlx5 for example has BARs that
> mix WC and non WC access modes. There are too few BARs for most HW to
> be able to dedicate an entire BAR to WC only.
So do we want to revisit whether an mmap attribute applies to a whole
region or only part of a region? If the WC/UC ranges are spatially
separate, we could still handle that via a "modify mmap attribute
across sub-range of region" type API and userspace would be required to
do separate mmaps for the ranges.
> > > > The uAPI I prefer is a 'get region info2' which would be
> > > > simplified and allow userspace to pass in some flags. One of
> > > > those flags can be 'request wc'.
> > >
> > > So talking of "weird hacky looking" having something called
> > > "get_info" taking "request" flags ... ahem ... :-)
> >
> > Yup...
>
> We've already confused returning the mmap cookie and the other
> information about the region. If you want to have flags to customize
> what mmap cookie is returned, this is the simplest answer.
I think either mechanism presented above is easier and more consistent
with the existing API.
> > > If what you really mean is that under the hood, a given "index"
> > > produces multiple "regions" (cookies) and "get_info2" allows to
> > > specify which one you want to get info about ?
>
> There is only one index. You can ask for different mmap options, and
> pgoff space for them is created dynamically.
>
> A "region" is not a mmap cookie, a region can have multiple mmap
> cookies.
Our ABI requires that BAR0 is region 0, but our API does not require
that BAR0 is only region 0. Therefore I would say that a region is an
mmap cookie is more correct than a region can have multiple mmap
cookies in our current API.
> > > I disagree ... I find it actually cumbersome but we can just
> > > agree to disagree here and as I said, I don't care that much as
> > > long as there is no fundamental reason why it wouldn't work in
> > > the end.
> >
> > I'm not a fan of the REGION_INFO2 idea either. A new region index
> > that provides an alias to another region with different mmap
> > semantics is much more intuitive in our existing UAPI, imo.
>
> It would not be another region index. That is the whole point. It is
> another pgoff for an existing index.
I think this is turning a region index into something it was not meant
to be.
> > > Talking of cookie space, one thing we do need to preserve is the
> > > natural alignment to the BAR size. Userspace *will* do "|"
> > > instead of "+" on top of a cookie when mmap'ing (beyond mmap own
> > > alignment requirements).
>
> I think that's just wrong userspace, sorry. :(
>
> Still, we want to do this anyhow as the VMA alignment stuff Peter was
> working on also requires pgoff alignment.
>
> > > > > Now, within the context of VFIO PCI, since we use a fault
> > > > > handler, it
> > > > > doesn't even have to be done before mmap, we can just
> > > > > dynamically fault
> > > > > a given page with the right attributes (and zap mappings on
> > > > > attributes
> > > > > changes). I would go down that path personally and generalize
> > > > > the fault
> > > > > handler to all VFIO implementations.
> > > >
> > > > Even worse. fault by fault changing of attributes? No way,
> > > > that's a completely crazy uAPI!
> > >
> > > Why ? first userspace doesn't know or see it happens fault by
> > > fault (and it could be full established at mmap time in fact, but
> > > fault time makes the implementation a lot easier).
> > >
> > > Here you are conflating implementation details with "uAPI" ...
> > > uAPI is what is presented to userspace, which in this case is
> > > "set the attributes for this portion of a region". Then at map
> > > time, that portion of the region gets the requested attributes.
> > > There is nothing in the uAPI that carries the fact that it
> > > happens at fault time.
> > >
> > > You keep coming up with "ugly", "crazy" etc... without every
> > > actually spelling out the technical pro/cons that would actually
> > > substantiate those adjectives. Basically anything that isn't your
> > > get_region2 is "crazy" or "ugly" ... NIH syndrome ?
>
> This is not how the mm ever works *anywhere* in the main kernel, the
> only reason I can see you are propsing this because it avoids doing
> the cleanup work.
>
> It makes it impossible for userspace to get both WC and non-WC
> mappings. It makes it indeterminate what behavior userspace gets at
> every store operation. Real HW like mlx5 would need mixed WC/!WC on
> the same bar, so I view this as an entirely bad uAPI.
>
> > > > I don't know what this is about, I don't want to see APIs to
> > > > slice up regions. The vfio driver should report full regions in
> > > > the pgoff space, and VMAs should be linked to single maple tree
> > > > ranges only. If userspace wants something weird it can call
> > > > mmap multiple times and get different VMAs.
> > >
> > > Why ? What are the pros/cons of each approach ?
>
> Extra complexity in the kernel, no usecase that isn't already handled
> by mmap directly.
>
> > > The big advantages of that latter approach is that it's very
> > > clear and simple for userspace (this bit of the BAR is meant to
> > > be WC) and completely
> > > avoids the multiple mapping problem.
>
> What is the "multiple mapping problem" ? We've discussed this
> extensively with ARM when we added WC support to KVM and I think we
> have a general agreement that multiple mappings are not something the
> kernel needs to actively prevent, in the limited case of WC and !WC.
>
> > >The inconvenient is that we have to generalize the fault handler
> > > mechanism to all backends, and it makes the multi-mapping
> > > impossible (in the maybe possible case where might want to allow
> > > it in some cases).
> > >
> > > Overall I find a lot of putting cart before horses here. Can we
> > > first agree
> > > on a clear definition of the uAPI first, then we can figure out
> > > the implementation details ?
> >
> > +1. Thanks,
>
> All the uAPI proposals I've seen are all trying to work around the
> current state of the code.
>
> This is broadly what I've proposed consistently since the beginning,
> adjusted for the various remarks since:
>
> struct vfio_region_get_mmap {
> __u32 argsz;
> __u32 region_index; // only one, no aliases
> __u32 mmap_flags; // Set WC here
>
> __aligned_u64 region_size;
> __aligned_u64 fd_offset;
> };
>
> struct vfio_region_get_caps {
> __u32 argsz;
> __u32 region_index;
>
> __u32 region_flags; // READ/WRITE/etc
> __aligned_u64 region_size;
> __u32 cap_offset; /* Offset within info struct
> of first cap */ };
>
> Alex, you pointed out that the parsing of the existing
> VFIO_DEVICE_GET_REGION_INFO has made it non-extendable. So the above
> two are creating a new extendable version that are replacements.
Can you be more specific on this claim? We are no longer creating
static region indexes after the introduction of device specific
regions, but I don't see why we're not using the mechanisms of the
device specific region to create new region indexes with new offsets
that have specified mmap attributes here. I imagine a DEVICE_FEATURE
that creates a new region, returning at least the region index,
DEVICE_INFO and REGION_INFO are updated to describe the new region, ie.
mmap-only, new offset/cookie, likely a capability embedded in the
REGION_INFO to provide introspection that this regions is an alias of
another.
> To avoid the naming confusion we have a specific ioctl to get
> mmap'able access, and another one for the cap list. I guess this also
> gives access to read/write so maybe the name needs more bikeshedding.
Largely duplicating REGION_INFO.
> Compared to the existing we add a single new concept, 'mmap_flags',
> which customizes how the fd_offset will behave with mmap. The kernel
> will internally de-duplicate region_index/mmap_flags -> fd_offset.
Ok. The above could also de-duplicate.
> There is still one index per physical object (ie BAR) in the uAPI.
This is a non-requirement.
> We get one cookie that describes the VMA behavior exactly and
> immutably.
So does the above.
> The existing VFIO_DEVICE_GET_REGION_INFO is expressed in terms of the
> above two operations with mmap_flags = 0.
Still more complicated that new region index and existing ioctls.
> No new subregion concept. No alias region indexes. No changing the
> meaning of returned cookies.
The above doesn't change the meaning of returned cookies and I don't
see why the rest is an issue.
> We simply allow the userspace to request a mmap cookie for a region
> index that will always cause mmap to create a VMA with WC pgrot.
Or we allow userspace to request a region index that will always cause
mmap to create a vma with wc pgprot.
I see the device fd as segmented into regions. The base set of regions
happen to have fixed definitions relative to device objects.
Introducing mmap cookies as a new mapping to a region where we can have
N:1 cookies to region really seems unnecessarily complicated vs a 1:1
cookie to region space.
> If we someday want cachable, or ARM's DEVICE_GRE or whatever we can
> trivially add more flags to mmap_flags and userspace can get
> more cookies.
Yup, same same.
> If we later decide we need to solve the ARM multi-device issue then we
> can cleanly extend an additional start/len to vfio_region_get_mmap
> which can ensure mmaps cookies are disjoint. This is not subregions,
> or new regions, this is just a cookie with a restriction.
No different if REGION_INFO supplies disjoint offset/length.
> In terms of implementation once you do the maple tree work that
> Mahmoud started we end up with vfio_region_get_mmap allocating a
> struct vfio_mmap for each unique region_index/mmap_flags and returning
> it to userspace.
And we get an entirely disjoint API from legacy vfio.
> All the drivers will use the struct vfio_mmap everywhere instead of
> trying to decode pgoff to index with macros. Inside the vfio_mmap we
> store the index for the driver to use directly. The driver has a
> simple implementation, one index mapped to one physical resource. A
> pgprot value in the struct vfio_mmap to specify how to create the VMA.
>
> No dynamic driver aware regions or subregions.
Currently we have a struct vfio_pci_region stored in an array that we
dynamically resize for device specific regions and the offset is
determined statically from the array index. We could easily specify an
offset and alias field on that object if we wanted to make the address
space more compact (without a maple tree) and facilitate multiple
regions referencing the same device resource. This is all just
implementation decisions. We also don't need to support read/write on
new regions, we could have them exist advertising only mmap support via
REGION_INFO, which simplifies and is consistent with the existing API.
There are a lot of new APIs being proposed here in the name of this
idea that we shouldn't create new regions/sub-regions/alias-regions,
which ultimately seems like a non-issue to me. Thanks,
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-11 22:07 ` Alex Williamson
@ 2025-08-12 0:30 ` Jason Gunthorpe
2025-08-12 19:26 ` Alex Williamson
2025-08-14 8:39 ` Mahmoud Nagy Adam
2025-08-14 9:52 ` Mahmoud Nagy Adam
2 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-08-12 0:30 UTC (permalink / raw)
To: Alex Williamson
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Kumar, Praveen,
Adam, Mahmoud, Woodhouse, David, nagy@khwaternagy.com
On Mon, Aug 11, 2025 at 04:07:10PM -0600, Alex Williamson wrote:
> We do this today with device specific regions, see
> vfio_pci_core_register_dev_region(). We use this to provide several
> additional regions for IGD. If we had an interface for users to
> trigger new regions we'd need some protection for exceeding the index
> space (-ENOSPC), but adding a small number of regions is not a problem.
That is pretty incomplete..
If we go down the maple tree direction I expect to eliminate
vfio_pci_core_register_dev_region() and replace it with core code
handling the dispatch of mmap and rw through the struct vfio_mmap.
What it is now isn't locked properly to be dynamic, and it's operation
is different from the actual physical regions.
> > > > Well, we want to be able to WC map. Introducing "more cookies"
> > > > again is just one way to get there. How do you create those
> > > > cookies ? Upon request or each region automatically gets multiple
> > > > with different attributes ? Do they represent entire regions or
> > > > subsets ? etc...
> >
> > It doesn't matter. Fixing how mmap works internally lets you use all
> > of those options.
>
> What exactly is the "fix how mmap works internally" proposal?
I explained it to Mahmoud previously:
https://lore.kernel.org/kvm/20250716184028.GA2177603@ziepe.ca/
> > I don't think we need sub-regions, it is too complicated in the kernel
> > and pretty much useless.
>
> So we infer that mmap cookies are an alias to an entire region.
Ideally
> We have an existing ABI that maps BARs, config space, and VGA spaces to
> fixed region indexes. That "region index" to "device space" mapping
> cannot change, but the offset of a given region and the total number of
> regions is not ABI.
Yes.
> Therefore we can introduce an API where a user
> says "give me a new region index that aliases region 0 with mmap
> attribute FOO".
I know, but I really dislike this as a uAPI. It becomes confusing for
the user to get a list of, what should be, physical regions and now
suddenly has to deal with non-physial alias regions it may have
created.
Our uAPI has a simple input to describe the region:
__u32 region_index;
Which I think should always describe the physical uAPI region number
and never something else. Drivers like igd have more "physical"
regions, but they are still ultimately physical regions decided by the
kernel, set in a fixed list.
The region_index is effectively uAPI with things like 0 always being
BAR 0 of a PCI function.
I don't think we should be changing that property..
> To a limited extent we can provide this within fixed index/pgoff
> implementation (not ABI) we use now, but AIUI we could use a maple tree
> to block out ranges and get more dense packing of regions across the
> device fd.
It is not regions, the maple tree packs mmap cookies - the pgoff.
Today we algorithmically derive pgoff from region_index in the kernel
but this is not ABI and is exactly what I want to divorce here. pgoff
goes through the maple tree to obtain the region index, not the other
way around.
> I don't understand how this introduces so much complication to drivers
> that, for example, BAR0 might be accessible through region index 0 for
> legacy mappings and index 10 for modified mmap attributes.
Because all the drivers assume the pgoff encoding:
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
{
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
And index == region_index which is uAPI that defines the physical BAR:
if (!vdev->bar_mmap_supported[index])
return -EINVAL;
So this all needs remapping logic if you want to make index dynamic.
Yes you can abuse the vfio_pci_core_register_dev_region() to make
alias regions and somehow provide ops or special cases to handle the
aliases and probably make it work for PCI.
But I'm saying to just have to core handle it:
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma, struct vfio_mmap *mmap)
{
index = mmap->index;
The code is simpler and cleaner, it generalizes outside of PCI. No
more open coding vm_pgoff shifting. We get nice things like pgoff
packing, better 32 bit compatability, and a huge number of mmap
cookies for whatever we need.
> Can you describe the driver scenario where having two different
> mmap cookies for region index 0 makes things significantly easier for
> drivers?
Above
> It seems like Ben's suggestion below of a call that modifies the mmap
> attributes of an existing region is the least overall change to
> existing drivers, though I'm not sure if that's what we should be
> optimizing for.
I agreee it is the least overall code change, but it is a bad uAPI
design since it violates the principle that what pgoff points should
behave consistently.
> > > > The biggest advantage of that approach is that it completely
> > > > precludes multiple conflicting mappings for a given region (at
> > > > least within a given process, though it might be possible to
> > > > extend it globally if we
> >
> > It doesn't. It just makes a messy uapi. At the time of mmap the vma
> > would stil have to capture the attributes (no fault by fault!) into
> > the VMA so we will see real users doing things like:
> >
> > set to wc(cookie)
> > mmap(cookie + XXX)
> > set to !wc(cookie)
> > mmap(cookie + YY)
> >
> > And then if you try to debug this all our file/vma debug tools will
> > just show cookie everywhere with no distinction that some VMAs are WC
> > and some VMAs are !WC.
> >
> > Basically, it fundamentally breaks how pgoff is supposed to work here
> > by making its meaning unstable.
>
> We could require the mmap attribute is set before mmap and not changed
> after, but yes, we don't get simultaneous mmaps with different
> attributes without different cookies.
Not allowing WC and !WC is fatal to any proposal, IMHO.
So as above, it is messy and poor to make the pgoff unstable, and it
will be abused by userspace as I showed if that is the uAPI.
> > Indeed, that is required for most HW. mlx5 for example has BARs that
> > mix WC and non WC access modes. There are too few BARs for most HW to
> > be able to dedicate an entire BAR to WC only.
>
> So do we want to revisit whether an mmap attribute applies to a whole
> region or only part of a region?
So long as mmap() can take a slice out of a cookie I don't know of a
functional reason to do more..
> > It would not be another region index. That is the whole point. It is
> > another pgoff for an existing index.
>
> I think this is turning a region index into something it was not meant
> to be.
What do you mean? Region index is the uAPI we have to refer to a fixed
physical part of the device.
Reallly, what makes more sense as userspace operations:
'give me a WC mapping for region index 0 which is BAR 0'
'Make a new region index for region index 0 which is bar 0 and then give me a mapping for region X"
The first is very logical, the second is pretty obfuscated.
> > This is broadly what I've proposed consistently since the beginning,
> > adjusted for the various remarks since:
> >
> > struct vfio_region_get_mmap {
> > __u32 argsz;
> > __u32 region_index; // only one, no aliases
> > __u32 mmap_flags; // Set WC here
> >
> > __aligned_u64 region_size;
> > __aligned_u64 fd_offset;
> > };
> >
> > struct vfio_region_get_caps {
> > __u32 argsz;
> > __u32 region_index;
> >
> > __u32 region_flags; // READ/WRITE/etc
> > __aligned_u64 region_size;
> > __u32 cap_offset; /* Offset within info struct
> > of first cap */ };
> >
> > Alex, you pointed out that the parsing of the existing
> > VFIO_DEVICE_GET_REGION_INFO has made it non-extendable. So the above
> > two are creating a new extendable version that are replacements.
>
> Can you be more specific on this claim?
I don't remember exactly. I think you said something about argsz isn't
parsed right by the kernel so we can't make struct vfio_region_info
any bigger to add something like mmap_flags in a backwards compatible
way because the old kernel wouldn't check for 0 in the expanded
structure.
> We are no longer creating static region indexes after the
> introduction of device specific regions, but I don't see why we're
> not using the mechanisms of the device specific region to create new
> region indexes with new offsets that have specified mmap attributes
> here.
Well, because that is not my proposal.
My proposal is the above, where region_index only takes on values that
the current uAPI defines and nothing more.
The driver continues to use region_index for all its internal
operations, like when PCI does this:
if (!vdev->bar_mmap_supported[index])
And we don't mess with that stuff at all. This is what I'm proposing,
concretely.
> I imagine a DEVICE_FEATURE that creates a new region, returning at
> least the region index, DEVICE_INFO and REGION_INFO are updated to
> describe the new region, ie. mmap-only, new offset/cookie, likely a
> capability embedded in the REGION_INFO to provide introspection that
> this regions is an alias of another.
And this is what you've been suggesting for a while, and I still
continue to dislike it for the reasons given. :)
> > To avoid the naming confusion we have a specific ioctl to get
> > mmap'able access, and another one for the cap list. I guess this also
> > gives access to read/write so maybe the name needs more bikeshedding.
>
> Largely duplicating REGION_INFO.
As I said, we can't extend REGION_INFO so to make changes to it we
need new functions.
> > There is still one index per physical object (ie BAR) in the uAPI.
>
> This is a non-requirement.
Disagree!
> > We get one cookie that describes the VMA behavior exactly and
> > immutably.
>
> So does the above.
Yes
> > The existing VFIO_DEVICE_GET_REGION_INFO is expressed in terms of the
> > above two operations with mmap_flags = 0.
>
> Still more complicated that new region index and existing ioctls.
I think it would simplify the drivers. Inside the drivers we can split
the cap and mmap return paths into seperate function ops. Currently
this is all open coded inside switch statements and it is pretty
duplicitive.
> I see the device fd as segmented into regions. The base set of regions
> happen to have fixed definitions relative to device objects.
I don't - the device fd is segmented in to pgoff spaces which are
managed by "mmap cookies". Physical device regions map into many mmap
cookies within the pgoff number space.
> Introducing mmap cookies as a new mapping to a region where we can have
> N:1 cookies to region really seems unnecessarily complicated vs a 1:1
> cookie to region space.
Your version is creating a 1:N:1 mapping, which I think is more
complicated. One physical region, N virtual regions, one pgoff.
> > If we later decide we need to solve the ARM multi-device issue then we
> > can cleanly extend an additional start/len to vfio_region_get_mmap
> > which can ensure mmaps cookies are disjoint. This is not subregions,
> > or new regions, this is just a cookie with a restriction.
>
> No different if REGION_INFO supplies disjoint offset/length.
We can't extend REGION_INFO so you'd have to create a new region index
which is a slice of a physical index, further complicating the
implementation :(
> > In terms of implementation once you do the maple tree work that
> > Mahmoud started we end up with vfio_region_get_mmap allocating a
> > struct vfio_mmap for each unique region_index/mmap_flags and returning
> > it to userspace.
>
> And we get an entirely disjoint API from legacy vfio.
I don't see how you can say that. Arguably it is closer to legacy
VIFIO because we don't create new region_index's that are not defined
in the uAPI header! Instead we add a single new ioctl.
> Currently we have a struct vfio_pci_region stored in an array that we
> dynamically resize for device specific regions and the offset is
> determined statically from the array index. We could easily specify an
> offset and alias field on that object if we wanted to make the address
> space more compact (without a maple tree) and facilitate multiple
> regions referencing the same device resource.
vfio_pci_region isn't shared outside PCI, so it doesn't improve the
core/driver API. It isn't locked so it can't be changed dynmically.
It is based on region index so we still have the pgoff shifting and
poor pgoff untilization.
> There are a lot of new APIs being proposed here in the name of this
> idea that we shouldn't create new regions/sub-regions/alias-regions,
> which ultimately seems like a non-issue to me. Thanks,
Two APIs, and the CAPS one is more for illustration - like if we had a
time machine how could we have desinged this from day 0 to be
extensible.
So I see add vfio_region_get_mmap or vfio_feature_create_region - same
number of APIs. <shrug>
Jason
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-12 0:30 ` Jason Gunthorpe
@ 2025-08-12 19:26 ` Alex Williamson
2025-08-13 0:17 ` Jason Gunthorpe
0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2025-08-12 19:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Kumar, Praveen,
Adam, Mahmoud, Woodhouse, David, nagy@khwaternagy.com
On Mon, 11 Aug 2025 21:30:53 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Aug 11, 2025 at 04:07:10PM -0600, Alex Williamson wrote:
> > We do this today with device specific regions, see
> > vfio_pci_core_register_dev_region(). We use this to provide several
> > additional regions for IGD. If we had an interface for users to
> > trigger new regions we'd need some protection for exceeding the index
> > space (-ENOSPC), but adding a small number of regions is not a problem.
>
> That is pretty incomplete..
>
> If we go down the maple tree direction I expect to eliminate
> vfio_pci_core_register_dev_region() and replace it with core code
> handling the dispatch of mmap and rw through the struct vfio_mmap.
Do note the inconsistency of having a vfio_mmap object that's used for
read/write.
> What it is now isn't locked properly to be dynamic, and it's operation
> is different from the actual physical regions.
Of course it's not locked to be used dynamically now, it's not used
dynamically now.
> > > > > Well, we want to be able to WC map. Introducing "more cookies"
> > > > > again is just one way to get there. How do you create those
> > > > > cookies ? Upon request or each region automatically gets multiple
> > > > > with different attributes ? Do they represent entire regions or
> > > > > subsets ? etc...
> > >
> > > It doesn't matter. Fixing how mmap works internally lets you use all
> > > of those options.
> >
> > What exactly is the "fix how mmap works internally" proposal?
>
> I explained it to Mahmoud previously:
> https://lore.kernel.org/kvm/20250716184028.GA2177603@ziepe.ca/
Ok, I think you're just referring to a mechanism by which we use the
pgoff to get an object that tracks the mmap attributes.
> > > I don't think we need sub-regions, it is too complicated in the kernel
> > > and pretty much useless.
> >
> > So we infer that mmap cookies are an alias to an entire region.
>
> Ideally
>
> > We have an existing ABI that maps BARs, config space, and VGA spaces to
> > fixed region indexes. That "region index" to "device space" mapping
> > cannot change, but the offset of a given region and the total number of
> > regions is not ABI.
>
> Yes.
>
> > Therefore we can introduce an API where a user
> > says "give me a new region index that aliases region 0 with mmap
> > attribute FOO".
>
> I know, but I really dislike this as a uAPI. It becomes confusing for
> the user to get a list of, what should be, physical regions and now
> suddenly has to deal with non-physial alias regions it may have
> created.
This is subjective though. Personally I find it confusing to have this
fixation that BAR0 is _only_ accessed through region index 0.
> Our uAPI has a simple input to describe the region:
>
> __u32 region_index;
>
> Which I think should always describe the physical uAPI region number
> and never something else. Drivers like igd have more "physical"
> regions, but they are still ultimately physical regions decided by the
> kernel, set in a fixed list.
This is not entirely true. When we expanded to device specific regions
for IGD we changed the API. We defined that there are no static region
numbers after VGA. Device specific regions, such as those used by IGD,
provide introspection via a capability exposed in the REGION_INFO
return buffer. Therefore there is no implicit meaning of a given
region number after VGA_REGION_INDEX. It is only the implementation of
those IGD regions that present them in a fixed order, the ABI does not
require it.
> The region_index is effectively uAPI with things like 0 always being
> BAR 0 of a PCI function.
>
> I don't think we should be changing that property..
The vfio-pci uAPI maps region numbers up through VGA_REGION_INDEX to
specific device resources. It does not in any way suggest that those
regions numbers exclusively map those device resources nor do higher
region numbers have any implicit mapping to other device resources.
Also note that fixed region indexes are really only a convention
provided by the vfio device type, ie. the "bus driver", for
convenience. vfio-platform for instance has no such standard device
resources, aiui the userspace driver needs to implicitly understand the
region number to resource mapping. If implemented today, we'd probably
include a capability in the region info buffer that describes the
region via a fdt compatibility ID or something.
> > To a limited extent we can provide this within fixed index/pgoff
> > implementation (not ABI) we use now, but AIUI we could use a maple tree
> > to block out ranges and get more dense packing of regions across the
> > device fd.
>
> It is not regions, the maple tree packs mmap cookies - the pgoff.
>
> Today we algorithmically derive pgoff from region_index in the kernel
> but this is not ABI and is exactly what I want to divorce here. pgoff
> goes through the maple tree to obtain the region index, not the other
> way around.
But it doesn't matter. There is no requirement that a given region
index uniquely maps a given device resource and there is no implicit
definition of region numbers beyond VGA_REGION_INDEX for vfio-pci.
Whether we use high order bits of pgoff to index into an array or
lookup the pgoff in a maple tree, the result is we get an object that
describes the access. The region index is yet another cookie for
userspace that has no implicit meaning in the generic vfio sense, but
has limited conventions for specific vfio device types.
> > I don't understand how this introduces so much complication to drivers
> > that, for example, BAR0 might be accessible through region index 0 for
> > legacy mappings and index 10 for modified mmap attributes.
>
> Because all the drivers assume the pgoff encoding:
>
> int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
> {
> index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> And index == region_index which is uAPI that defines the physical BAR:
>
> if (!vdev->bar_mmap_supported[index])
> return -EINVAL;
>
> So this all needs remapping logic if you want to make index dynamic.
> Yes you can abuse the vfio_pci_core_register_dev_region() to make
> alias regions and somehow provide ops or special cases to handle the
> aliases and probably make it work for PCI.
>
> But I'm saying to just have to core handle it:
>
> int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma, struct vfio_mmap *mmap)
> {
> index = mmap->index;
>
> The code is simpler and cleaner, it generalizes outside of PCI. No
> more open coding vm_pgoff shifting. We get nice things like pgoff
> packing, better 32 bit compatability, and a huge number of mmap
> cookies for whatever we need.
We're in agreement that we need to derive an object from a pgoff and
that will universally replace embedding the region index in the high
order bits of the pgoff. However everything else here is just the
implementation of that returned object. The object could easily have a
region field that maps to the uAPI region and a device region field
that maps to the physical resource. Some objects have the same value
for these fields, some objects don't. It's not fundamentally important.
> > Can you describe the driver scenario where having two different
> > mmap cookies for region index 0 makes things significantly easier for
> > drivers?
>
> Above
Unconvinced.
> > It seems like Ben's suggestion below of a call that modifies the mmap
> > attributes of an existing region is the least overall change to
> > existing drivers, though I'm not sure if that's what we should be
> > optimizing for.
>
> I agreee it is the least overall code change, but it is a bad uAPI
> design since it violates the principle that what pgoff points should
> behave consistently.
I think we could enforce ordering such that it does behave
consistently, but if simultaneous mappings to the same pgoff with
different attributes is important, it can't do that.
> > > > > The biggest advantage of that approach is that it completely
> > > > > precludes multiple conflicting mappings for a given region (at
> > > > > least within a given process, though it might be possible to
> > > > > extend it globally if we
> > >
> > > It doesn't. It just makes a messy uapi. At the time of mmap the vma
> > > would stil have to capture the attributes (no fault by fault!) into
> > > the VMA so we will see real users doing things like:
> > >
> > > set to wc(cookie)
> > > mmap(cookie + XXX)
> > > set to !wc(cookie)
> > > mmap(cookie + YY)
> > >
> > > And then if you try to debug this all our file/vma debug tools will
> > > just show cookie everywhere with no distinction that some VMAs are WC
> > > and some VMAs are !WC.
> > >
> > > Basically, it fundamentally breaks how pgoff is supposed to work here
> > > by making its meaning unstable.
> >
> > We could require the mmap attribute is set before mmap and not changed
> > after, but yes, we don't get simultaneous mmaps with different
> > attributes without different cookies.
>
> Not allowing WC and !WC is fatal to any proposal, IMHO.
>
> So as above, it is messy and poor to make the pgoff unstable, and it
> will be abused by userspace as I showed if that is the uAPI.
Ok, we can kill that idea.
> > > Indeed, that is required for most HW. mlx5 for example has BARs that
> > > mix WC and non WC access modes. There are too few BARs for most HW to
> > > be able to dedicate an entire BAR to WC only.
> >
> > So do we want to revisit whether an mmap attribute applies to a whole
> > region or only part of a region?
>
> So long as mmap() can take a slice out of a cookie I don't know of a
> functional reason to do more..
If we're creating a new region, the uAPI is simpler if it's an alias of
the whole region rather than specifying an offset and length.
> > > It would not be another region index. That is the whole point. It is
> > > another pgoff for an existing index.
> >
> > I think this is turning a region index into something it was not meant
> > to be.
>
> What do you mean? Region index is the uAPI we have to refer to a fixed
> physical part of the device.
>
> Reallly, what makes more sense as userspace operations:
>
> 'give me a WC mapping for region index 0 which is BAR 0'
>
> 'Make a new region index for region index 0 which is bar 0 and then give me a mapping for region X"
>
> The first is very logical, the second is pretty obfuscated.
Current situation: We already have an API that says give me a mapping
for region X.
Therefore a new operation that says "give me a new region index for
mapping index 0 with attribute FOO" is much more congruent to our
current API.
> > > This is broadly what I've proposed consistently since the beginning,
> > > adjusted for the various remarks since:
> > >
> > > struct vfio_region_get_mmap {
> > > __u32 argsz;
> > > __u32 region_index; // only one, no aliases
> > > __u32 mmap_flags; // Set WC here
> > >
> > > __aligned_u64 region_size;
> > > __aligned_u64 fd_offset;
> > > };
> > >
> > > struct vfio_region_get_caps {
> > > __u32 argsz;
> > > __u32 region_index;
> > >
> > > __u32 region_flags; // READ/WRITE/etc
> > > __aligned_u64 region_size;
> > > __u32 cap_offset; /* Offset within info struct
> > > of first cap */ };
> > >
> > > Alex, you pointed out that the parsing of the existing
> > > VFIO_DEVICE_GET_REGION_INFO has made it non-extendable. So the above
> > > two are creating a new extendable version that are replacements.
> >
> > Can you be more specific on this claim?
>
> I don't remember exactly. I think you said something about argsz isn't
> parsed right by the kernel so we can't make struct vfio_region_info
> any bigger to add something like mmap_flags in a backwards compatible
> way because the old kernel wouldn't check for 0 in the expanded
> structure.
Oh! This was the proposal to use flags as an input value when we call
REGION_INFO. Yeah, that's not possible. But the above two new APIs
don't logically then follow as our only path forward though.
> > We are no longer creating static region indexes after the
> > introduction of device specific regions, but I don't see why we're
> > not using the mechanisms of the device specific region to create new
> > region indexes with new offsets that have specified mmap attributes
> > here.
>
> Well, because that is not my proposal.
>
> My proposal is the above, where region_index only takes on values that
> the current uAPI defines and nothing more.
The vfio-pci uAPI only has conventions for a specific subset of the
region index space and already provides indexes beyond that space using
device specific regions with types defined within a capability buffer.
> The driver continues to use region_index for all its internal
> operations, like when PCI does this:
>
> if (!vdev->bar_mmap_supported[index])
>
> And we don't mess with that stuff at all. This is what I'm proposing,
> concretely.
This is just a matter of having a pgoff to object helper and the fact
that that provided object might have a BAR index reference, which is not
implicitly the region index.
> > I imagine a DEVICE_FEATURE that creates a new region, returning at
> > least the region index, DEVICE_INFO and REGION_INFO are updated to
> > describe the new region, ie. mmap-only, new offset/cookie, likely a
> > capability embedded in the REGION_INFO to provide introspection that
> > this regions is an alias of another.
>
> And this is what you've been suggesting for a while, and I still
> continue to dislike it for the reasons given. :)
Likewise, I dislike the REGION_INFO2 proposal ;)
> > > To avoid the naming confusion we have a specific ioctl to get
> > > mmap'able access, and another one for the cap list. I guess this also
> > > gives access to read/write so maybe the name needs more bikeshedding.
> >
> > Largely duplicating REGION_INFO.
>
> As I said, we can't extend REGION_INFO so to make changes to it we
> need new functions.
>
> > > There is still one index per physical object (ie BAR) in the uAPI.
> >
> > This is a non-requirement.
>
> Disagree!
Clearly this feeling is strong, I still don't understand why. I don't
buy that it's confusing and the remaining arguments seem to be trivial
implementation choices.
> > > We get one cookie that describes the VMA behavior exactly and
> > > immutably.
> >
> > So does the above.
>
> Yes
>
> > > The existing VFIO_DEVICE_GET_REGION_INFO is expressed in terms of the
> > > above two operations with mmap_flags = 0.
> >
> > Still more complicated that new region index and existing ioctls.
>
> I think it would simplify the drivers. Inside the drivers we can split
> the cap and mmap return paths into seperate function ops. Currently
> this is all open coded inside switch statements and it is pretty
> duplicitive.
>
> > I see the device fd as segmented into regions. The base set of regions
> > happen to have fixed definitions relative to device objects.
>
> I don't - the device fd is segmented in to pgoff spaces which are
> managed by "mmap cookies". Physical device regions map into many mmap
> cookies within the pgoff number space.
>
> > Introducing mmap cookies as a new mapping to a region where we can have
> > N:1 cookies to region really seems unnecessarily complicated vs a 1:1
> > cookie to region space.
>
> Your version is creating a 1:N:1 mapping, which I think is more
> complicated. One physical region, N virtual regions, one pgoff.
No, the region index to pgoff is 1:1, REGION_INFO returns 1 offset per
region. Therefore I think it's 1:N:N in this denotation vs I think you
want 1:1:N, but here you've referred to it as a _virtual_ region, which
is exactly what it is, there is no 1:1 requirement for physical regions
to virtual regions.
> > > If we later decide we need to solve the ARM multi-device issue then we
> > > can cleanly extend an additional start/len to vfio_region_get_mmap
> > > which can ensure mmaps cookies are disjoint. This is not subregions,
> > > or new regions, this is just a cookie with a restriction.
> >
> > No different if REGION_INFO supplies disjoint offset/length.
>
> We can't extend REGION_INFO so you'd have to create a new region index
> which is a slice of a physical index, further complicating the
> implementation :(
You argued previously we don't need slices of physical indexes, now
you're arguing that if we do need it the implementation is more
complicated. Pick one.
> > > In terms of implementation once you do the maple tree work that
> > > Mahmoud started we end up with vfio_region_get_mmap allocating a
> > > struct vfio_mmap for each unique region_index/mmap_flags and returning
> > > it to userspace.
> >
> > And we get an entirely disjoint API from legacy vfio.
>
> I don't see how you can say that. Arguably it is closer to legacy
> VIFIO because we don't create new region_index's that are not defined
> in the uAPI header! Instead we add a single new ioctl.
DEVICE_INFO.num_regions defines the region space, not the vfio-pci
convention of low order region indexes. The uAPI makes no claims that
num_regions is immutable and I very much see a path in line with our
uAPI where a user could invoke a DEVICE_FEATURE (or pick your own
IOCTL, but DEVICE_FEATURE is very extensible) to dynamically generate a
new region with specified mmap attributes, DEVICE_INFO.num_regions is
updated, the offset/mmap_cookie is obtained via REGION_INFO, along with
a capability in the return buffer allowing introspection. New ioctls:
potentially zero.
> > Currently we have a struct vfio_pci_region stored in an array that we
> > dynamically resize for device specific regions and the offset is
> > determined statically from the array index. We could easily specify an
> > offset and alias field on that object if we wanted to make the address
> > space more compact (without a maple tree) and facilitate multiple
> > regions referencing the same device resource.
>
> vfio_pci_region isn't shared outside PCI, so it doesn't improve the
> core/driver API. It isn't locked so it can't be changed dynmically.
> It is based on region index so we still have the pgoff shifting and
> poor pgoff untilization.
Obviously it would be great to remove barriers at the core rather than
just vfio-pci, and also obviously the current implementation doesn't
directly support these new features. That doesn't mean there's not an
incremental path by which we adapt vfio_pci_region to support this and
come back to extend it with a maple tree, and again to push it into
vfio-core. I think this is where BenH and I are confused by the
insistence to start with the maple tree rather than focusing on the API.
> > There are a lot of new APIs being proposed here in the name of this
> > idea that we shouldn't create new regions/sub-regions/alias-regions,
> > which ultimately seems like a non-issue to me. Thanks,
>
> Two APIs, and the CAPS one is more for illustration - like if we had a
> time machine how could we have desinged this from day 0 to be
> extensible.
>
> So I see add vfio_region_get_mmap or vfio_feature_create_region - same
> number of APIs. <shrug>
One works within the API, one tries to extend the API and impose
requirements that are not compelling to me. Thanks,
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-12 19:26 ` Alex Williamson
@ 2025-08-13 0:17 ` Jason Gunthorpe
0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2025-08-13 0:17 UTC (permalink / raw)
To: Alex Williamson
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Kumar, Praveen,
Adam, Mahmoud, Woodhouse, David, nagy@khwaternagy.com
On Tue, Aug 12, 2025 at 01:26:42PM -0600, Alex Williamson wrote:
> > On Mon, Aug 11, 2025 at 04:07:10PM -0600, Alex Williamson wrote:
> > > We do this today with device specific regions, see
> > > vfio_pci_core_register_dev_region(). We use this to provide several
> > > additional regions for IGD. If we had an interface for users to
> > > trigger new regions we'd need some protection for exceeding the index
> > > space (-ENOSPC), but adding a small number of regions is not a problem.
> >
> > That is pretty incomplete..
> >
> > If we go down the maple tree direction I expect to eliminate
> > vfio_pci_core_register_dev_region() and replace it with core code
> > handling the dispatch of mmap and rw through the struct vfio_mmap.
>
> Do note the inconsistency of having a vfio_mmap object that's used for
> read/write.
We can bikeshed the naming at some later point.
> > I know, but I really dislike this as a uAPI. It becomes confusing for
> > the user to get a list of, what should be, physical regions and now
> > suddenly has to deal with non-physial alias regions it may have
> > created.
>
> This is subjective though. Personally I find it confusing to have this
> fixation that BAR0 is _only_ accessed through region index 0.
I don't see why. The user story here is "give me a WC mapping for 4096
bytes at offset 0x1245000 in BAR 0"
My version of the uAPI is very simple and logical:
ioctl(fd, GET_MMAP, {.region_index = VFIO_PCI_BAR0_REGION_INDEX,
.map_flags = WC,
.mmap_pgoff = &pgoff})
reg = mmap(NULL, 4098, prot, MMAP_SHARED, fd, pgoff + 0x1245000)
Simple direct, exactly matches the user story.
Why force the user to mess with some virtual region index, I can't see
any justification for that, and it only makes the user's job more
complicated.
Even if you take the position that region indexes are dynamic, it
still makes sense, for a user story, to have an API flow like
region_index = get_region_by_name("registers");
ioctl(fd, GET_MMAP, {.region_index = region_index,
.map_flags = WC,
.mmap_pgoff = &pgoff})
reg = mmap(NULL, 4098, prot, MMAP_SHARED, fd, pgoff + 0x1245000)
And I can easially imagine something like the VFIO abstraction layer
in DPDK happily working like the above.
In terms of uAPI usage from a user there is zero value in adding a
virtual region index to this flow. It does nothing to help the user at
all, it just some weird hoop to jump through.
> > int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma, struct vfio_mmap *mmap)
> > {
> > index = mmap->index;
> >
> > The code is simpler and cleaner, it generalizes outside of PCI. No
> > more open coding vm_pgoff shifting. We get nice things like pgoff
> > packing, better 32 bit compatability, and a huge number of mmap
> > cookies for whatever we need.
>
> We're in agreement that we need to derive an object from a pgoff and
> that will universally replace embedding the region index in the high
> order bits of the pgoff.
Ok
> However everything else here is just the implementation of that
> returned object. The object could easily have a region field that
> maps to the uAPI region and a device region field that maps to the
> physical resource. Some objects have the same value for these
> fields, some objects don't. It's not fundamentally important.
Again, I agree we can implement what you are proposing, and it can be
implemented with the maple tree pgoff cleanups too (maybe even
requires it to really be a clean implementation as well).
I just don't think it is a good uapi design for what is ultimately a
very simple user action.
> > > > It would not be another region index. That is the whole point. It is
> > > > another pgoff for an existing index.
> > >
> > > I think this is turning a region index into something it was not meant
> > > to be.
> >
> > What do you mean? Region index is the uAPI we have to refer to a fixed
> > physical part of the device.
> >
> > Reallly, what makes more sense as userspace operations:
> >
> > 'give me a WC mapping for region index 0 which is BAR 0'
> >
> > 'Make a new region index for region index 0 which is bar 0 and then give me a mapping for region X"
> >
> > The first is very logical, the second is pretty obfuscated.
>
> Current situation: We already have an API that says give me a mapping
> for region X.
>
> Therefore a new operation that says "give me a new region index for
> mapping index 0 with attribute FOO" is much more congruent to our
> current API.
I think that is extremely subjective.
I view it as we have a api to get a mapping for region X, that is
nonextensible and lacks a feature we need.
The logical answer is to improve the API to get a mapping for region
X.
There is many precendents for this in Linux. We have an endless number
of systemcall extensions, eg open, openat(), openat2().
I view my suggestion of GET_REGION_MMAP as the openat() to
GET_REGION_INFO as open().
So I don't agree with your leap that GET_REGION_INFO is perfect and
the issue is we need an entirely new conceputal model for region_index
to keep GET_REGION_INFO unchanged. :(
> Oh! This was the proposal to use flags as an input value when we call
> REGION_INFO. Yeah, that's not possible. But the above two new APIs
> don't logically then follow as our only path forward though.
I'm not fussy here, any option to let us extend GET_REGION_INFO gets
my vote :)
> > > Introducing mmap cookies as a new mapping to a region where we can have
> > > N:1 cookies to region really seems unnecessarily complicated vs a 1:1
> > > cookie to region space.
> >
> > Your version is creating a 1:N:1 mapping, which I think is more
> > complicated. One physical region, N virtual regions, one pgoff.
>
> No, the region index to pgoff is 1:1, REGION_INFO returns 1 offset per
> region. Therefore I think it's 1:N:N in this denotation vs I think you
> want 1:1:N, but here you've referred to it as a _virtual_ region, which
> is exactly what it is, there is no 1:1 requirement for physical regions
> to virtual regions.
Ultimately the user is only dealing in the physical regions. That is
how it knows what MMIO to reach.
Adding an indirection where the user has to go from the physical
region, be it statically set as uAPI or discovered once dynamically,
to some other virtual region, and then to a pgoff, is confusing and
uncessary. It is not 1:1:N, I'm aiming for a simple 1:N with no new
virtual region concept at all.
> uAPI where a user could invoke a DEVICE_FEATURE (or pick your own
> IOCTL, but DEVICE_FEATURE is very extensible) to dynamically generate a
> new region with specified mmap attributes, DEVICE_INFO.num_regions is
> updated, the offset/mmap_cookie is obtained via REGION_INFO, along with
> a capability in the return buffer allowing introspection. New ioctls:
> potentially zero.
DEVICE_FEATURE_XX is basically a new ioctl, just multiplexed.
> vfio-core. I think this is where BenH and I are confused by the
> insistence to start with the maple tree rather than focusing on the API.
I guided working on the maple tree because doing so makes it trivial
and clean to implement the GET_MMAP style API which is what I have
been consistently advocating for.
Anyhow, I think we should stop writing these emails, this has gone on
long enough. If this last round hasn't convinced either one of us lets
call it quits. Tell them to do the extra region and help them make a
clean implementation if that is what you want to do.
Jason
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-11 22:07 ` Alex Williamson
2025-08-12 0:30 ` Jason Gunthorpe
@ 2025-08-14 8:39 ` Mahmoud Nagy Adam
2025-08-14 9:52 ` Mahmoud Nagy Adam
2 siblings, 0 replies; 29+ messages in thread
From: Mahmoud Nagy Adam @ 2025-08-14 8:39 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Benjamin Herrenschmidt, kvm@vger.kernel.org,
Kumar, Praveen, Woodhouse, David, nagy@khwaternagy.com
Alex Williamson <alex.williamson@redhat.com> writes:
>
> Currently we have a struct vfio_pci_region stored in an array that we
> dynamically resize for device specific regions and the offset is
> determined statically from the array index. We could easily specify an
> offset and alias field on that object if we wanted to make the address
> space more compact (without a maple tree) and facilitate multiple
> regions referencing the same device resource. This is all just
> implementation decisions. We also don't need to support read/write on
> new regions, we could have them exist advertising only mmap support via
> REGION_INFO, which simplifies and is consistent with the existing API.
Are proposing an API that creates a new region, which the user then uses
the return index with REGION_INFO to get the pgoff? It seems we are just
adding one more syscall the user have to call before getting the
pgoff. But the end goal is technically the same...
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-11 22:07 ` Alex Williamson
2025-08-12 0:30 ` Jason Gunthorpe
2025-08-14 8:39 ` Mahmoud Nagy Adam
@ 2025-08-14 9:52 ` Mahmoud Nagy Adam
2025-08-14 17:52 ` Alex Williamson
2 siblings, 1 reply; 29+ messages in thread
From: Mahmoud Nagy Adam @ 2025-08-14 9:52 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Benjamin Herrenschmidt, kvm@vger.kernel.org,
Kumar, Praveen, Woodhouse, David, nagy@khwaternagy.com
The last email was a draft sent by mistake. This is the full version.
Alex Williamson <alex.williamson@redhat.com> writes:
> Currently we have a struct vfio_pci_region stored in an array that we
> dynamically resize for device specific regions and the offset is
> determined statically from the array index. We could easily specify an
> offset and alias field on that object if we wanted to make the address
> space more compact (without a maple tree) and facilitate multiple
> regions referencing the same device resource. This is all just
> implementation decisions. We also don't need to support read/write on
> new regions, we could have them exist advertising only mmap support via
> REGION_INFO, which simplifies and is consistent with the existing API.
>
What I understand is that you’re proposing an API to create a new
region. The user would then fetch a new index and use it with
REGION_INFO to obtain the pgoff. This feels like adding another layer
on top of the pgoff, while the end goal remains the same.
I'm not sure an alias region offers more value than simply creating an
alias pgoff. It may even be more confusing, since—AFAIU—users expect
indexes to align with PCI BAR indexes in the PCI case. We would also
need either a new API or an additional REGION_INFO member to tell the
user which index the alias refers to and what extra attributes it has.
Ultimately, both approaches are very similar: one creates a full alias
region, the other just a pgoff alias, but both would require nearly the
same internal implementation for pgoff handling.
The key question is: does a full region alias provide any tangible
benefits over a pgoff alias?
In my opinion, it’s clearer to simply have the user call e.g
REQUEST_REGION_MMAP (which returns a pgoff for mmap) rather than request
full region creation.
- MNAdam
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-14 9:52 ` Mahmoud Nagy Adam
@ 2025-08-14 17:52 ` Alex Williamson
2025-08-28 8:53 ` Mahmoud Nagy Adam
0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2025-08-14 17:52 UTC (permalink / raw)
To: Mahmoud Nagy Adam
Cc: Jason Gunthorpe, Benjamin Herrenschmidt, kvm@vger.kernel.org,
Kumar, Praveen, Woodhouse, David, nagy@khwaternagy.com
On Thu, 14 Aug 2025 11:52:17 +0200
Mahmoud Nagy Adam <mngyadam@amazon.de> wrote:
> The last email was a draft sent by mistake. This is the full version.
>
> Alex Williamson <alex.williamson@redhat.com> writes:
>
> > Currently we have a struct vfio_pci_region stored in an array that we
> > dynamically resize for device specific regions and the offset is
> > determined statically from the array index. We could easily specify an
> > offset and alias field on that object if we wanted to make the address
> > space more compact (without a maple tree) and facilitate multiple
> > regions referencing the same device resource. This is all just
> > implementation decisions. We also don't need to support read/write on
> > new regions, we could have them exist advertising only mmap support via
> > REGION_INFO, which simplifies and is consistent with the existing API.
> >
>
> What I understand is that you’re proposing an API to create a new
> region. The user would then fetch a new index and use it with
> REGION_INFO to obtain the pgoff. This feels like adding another layer
> on top of the pgoff, while the end goal remains the same.
>
> I'm not sure an alias region offers more value than simply creating an
> alias pgoff. It may even be more confusing, since—AFAIU—users expect
> indexes to align with PCI BAR indexes in the PCI case. We would also
> need either a new API or an additional REGION_INFO member to tell the
> user which index the alias refers to and what extra attributes it has.
>
> Ultimately, both approaches are very similar: one creates a full alias
> region, the other just a pgoff alias, but both would require nearly the
> same internal implementation for pgoff handling.
>
> The key question is: does a full region alias provide any tangible
> benefits over a pgoff alias?
>
> In my opinion, it’s clearer to simply have the user call e.g
> REQUEST_REGION_MMAP (which returns a pgoff for mmap) rather than request
> full region creation.
In part this is the argument we've already discussed, creating a new
region and then getting REGION_INFO adds a step for the user, but we
already have REGION_INFO as the standard mechanism for introspection of
regions. We also have capabilities available as a mechanism within the
REGION_INFO ioctl to describe the mapping flags or region alias
relationship.
If we're this concerned about one additional step for the user, design
the DEVICE_FEATURE ioctl to return both the new region index and the
file offset, the user can ignore the region index if they choose.
To me, regions are just segments of the device fd address space,
regions have a unique offset. Regions have a vfio-pci specific
convention where a fixed set of region indexes refer to fixed device
resources but never is there a statement in the uAPI that region 0
_uniquely_ indexes BAR 0 and there will never be another region index
mapping this device resource. The convention exists only to bootstrap
standard device resources. Adding a mechanism to get a file offset
which is an alias of a region, but not itself reported as a region is
to me, splitting up the device fd address space into two different
allocation methods.
The argument that userspace drivers will get confused if region N
aliases region 0 makes no sense to me. The user has actively brought
region N into existence knowing in advance that it's addressing the
same device resource as region 0. Exactly in the same way they'd know
the file offset they get back from REQUEST_REGION_MMAP is an alias to
the requested region. BUT, since we're invoking a new region, we have
mechanisms to allow persistent introspection of that new region.
The tangible benefit to me is that a new region better aligns with the
existing API and has that introspection/debug'ability aspect, versus
creating an alternate mechanism for making allocations from the device
fd address space. Thanks,
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-14 17:52 ` Alex Williamson
@ 2025-08-28 8:53 ` Mahmoud Nagy Adam
2025-08-28 19:17 ` Alex Williamson
0 siblings, 1 reply; 29+ messages in thread
From: Mahmoud Nagy Adam @ 2025-08-28 8:53 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Benjamin Herrenschmidt, kvm@vger.kernel.org,
Kumar, Praveen, Woodhouse, David, nagy@khwaternagy.com
Hi all,
Since it looks like creating alias regions is the path forward, I’d like
to summarize the discussion to make sure we’re all aligned. From my
understanding, the main steps are:
- Introduce helpers to create compact offsets, likely leveraging
mt. No changes to the vfio ops APIs are required, since the mt should
live within the vfio struct itself.
- Add a WC flag to regions.
- Define a new UAPI for creating alias regions with new
offsets. This UAPI should support aliasing existing regions as well
as specifying additional flags such as WC.
- Enable WC support for mmap.
I plan to start working on an RFC covering these points over the next
1–2 weeks.
Thanks,
MNAdam
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
2025-08-28 8:53 ` Mahmoud Nagy Adam
@ 2025-08-28 19:17 ` Alex Williamson
0 siblings, 0 replies; 29+ messages in thread
From: Alex Williamson @ 2025-08-28 19:17 UTC (permalink / raw)
To: Mahmoud Nagy Adam
Cc: Jason Gunthorpe, Benjamin Herrenschmidt, kvm@vger.kernel.org,
Kumar, Praveen, Woodhouse, David, nagy@khwaternagy.com
On Thu, 28 Aug 2025 10:53:24 +0200
Mahmoud Nagy Adam <mngyadam@amazon.de> wrote:
> Hi all,
>
> Since it looks like creating alias regions is the path forward, I’d like
> to summarize the discussion to make sure we’re all aligned. From my
> understanding, the main steps are:
>
> - Introduce helpers to create compact offsets, likely leveraging
> mt. No changes to the vfio ops APIs are required, since the mt should
> live within the vfio struct itself.
I think we also established, or at least there were arguments in the
direction, that the maple tree is just an implementation detail of the
address space management. Obviously it contributes to compact and
efficient use of the address space, but for small numbers of alias
regions, we could continue to use the upper bits of the offset,
returning -ENOSPC if exceeded. I don't want to discourage the maple
tree, but it seems like it could be split to a separate series to me.
With the maple tree, an opt-in to continue to use the existing offsets
for the pre-defined vfio-pci region indexes could prove to be a useful
feature as well as we potentially learn about userspace drivers that
don't follow the ABI.
> - Add a WC flag to regions.
If we go the route of the DEVICE_FEATURE, we have the PROBE as an
option here. This might make sense if we to avoid redefining flags for
specific attributes between REGION_INFO and the new DEVICE_FEATURE uAPI.
> - Define a new UAPI for creating alias regions with new
> offsets. This UAPI should support aliasing existing regions as well
> as specifying additional flags such as WC.
I thought we'd discussed de-duplication, ie. if a user asks for an alias
with the same attributes as already defined, they get back the existing
region index/offset. I don't see the utility in creating an arbitrary
number of aliases with the same attributes. Thanks,
Alex
> - Enable WC support for mmap.
>
> I plan to start working on an RFC covering these points over the next
> 1–2 weeks.
>
> Thanks,
> MNAdam
>
>
>
> Amazon Web Services Development Center Germany GmbH
> Tamara-Danz-Str. 13
> 10243 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
> Sitz: Berlin
> Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-08-28 19:18 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 1/9] vfio: add mmap maple tree to vfio Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 2/9] vfio: add transient ops to support vfio mmap mt Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 3/9] vfio-pci-core: rename vm operations Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 4/9] vfio-pci-core: remove redundant offset calculations Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 5/9] vfio-pci-core: add vfio_pci_mmap & helpers Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 6/9] vfio-pci-core: support the new vfio ops Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 7/9] vfio-pci: use " Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 8/9] vfio: UAPI for setting mmap attributes Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 9/9] vfio_pci_core: support mmap attrs uapi & WC Mahmoud Adam
2025-08-04 18:49 ` [RFC PATCH 0/9] vfio: Introduce mmap maple tree Alex Williamson
2025-08-04 20:09 ` Mahmoud Nagy Adam
2025-08-05 14:31 ` Jason Gunthorpe
2025-08-05 15:48 ` Mahmoud Nagy Adam
2025-08-05 18:50 ` [RFC " Jason Gunthorpe
2025-08-05 19:00 ` Alex Williamson
[not found] ` <80dc87730f694b2d6e6aabbd29df49cf3c7c44fb.camel@amazon.com>
[not found] ` <20250806115224.GB377696@ziepe.ca>
2025-08-07 8:12 ` Herrenschmidt, Benjamin
2025-08-07 19:06 ` Alex Williamson
2025-08-11 15:55 ` Jason Gunthorpe
2025-08-11 22:07 ` Alex Williamson
2025-08-12 0:30 ` Jason Gunthorpe
2025-08-12 19:26 ` Alex Williamson
2025-08-13 0:17 ` Jason Gunthorpe
2025-08-14 8:39 ` Mahmoud Nagy Adam
2025-08-14 9:52 ` Mahmoud Nagy Adam
2025-08-14 17:52 ` Alex Williamson
2025-08-28 8:53 ` Mahmoud Nagy Adam
2025-08-28 19:17 ` Alex Williamson
2025-08-07 8:13 ` Benjamin Herrenschmidt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).