From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"Kumar, Praveen" <pravkmr@amazon.de>,
"Adam, Mahmoud" <mngyadam@amazon.de>,
"Woodhouse, David" <dwmw@amazon.co.uk>,
"nagy@khwaternagy.com" <nagy@khwaternagy.com>
Subject: Re: [RFC PATCH 0/9] vfio: Introduce mmap maple tree
Date: Mon, 11 Aug 2025 16:07:10 -0600 [thread overview]
Message-ID: <20250811160710.174ca708.alex.williamson@redhat.com> (raw)
In-Reply-To: <20250811155558.GF377696@ziepe.ca>
On Mon, 11 Aug 2025 12:55:58 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 07, 2025 at 01:06:05PM -0600, Alex Williamson wrote:
>
> > So to a large extent I think we're causing our own grief here and I do
> > agree that if we created an API that allows new regions to be created
> > as lightweight (mmap-only) aliases of other regions, then we could just
> > re-use REGION_INFO with the new region index, get the offset/mmap
> > cookie, return -ENOSPC when we run out of indexes, and implement a
> > maple tree to get a more compact region space as a follow-on.
>
> That still needs to dynamically create mmap cookies - which is really the
> whole problem here. It is not so easy to just change the existing code
> with hardwired math converting pgoff to indexes to support multiple
> indexes.
We do this today with device specific regions, see
vfio_pci_core_register_dev_region(). We use this to provide several
additional regions for IGD. If we had an interface for users to
trigger new regions we'd need some protection for exceeding the index
space (-ENOSPC), but adding a small number of regions is not a problem.
> > > Well, we want to be able to WC map. Introducing "more cookies"
> > > again is just one way to get there. How do you create those
> > > cookies ? Upon request or each region automatically gets multiple
> > > with different attributes ? Do they represent entire regions or
> > > subsets ? etc...
>
> It doesn't matter. Fixing how mmap works internally lets you use all
> of those options.
What exactly is the "fix how mmap works internally" proposal?
> > > > > * What I originally proposed ages ago when we discussed it
> > > > > at LPC which is to have an ioctl to create "subregions" of a
> > > > > region with different attributes. This means creating a new
> > > > > index (and a new corresponding "cookie") that represents a
> > > > > portion of an existing region
> > > > > with a different attribute set to be later used by mmap.
> > > >
> > > > I never liked this, and it is still creating more cookies but
> > > > now with
> > > > weird hacky looking uapi.
> > >
> > > "weird hacky looking" isn't a great way of understanding your
> > > objections :-) Yes, it's a bit more complicated, it allows to
> > > break down BARs basically into sub-regions with different
> > > attributes which is fairly close to what users really want to do,
> > > but it does complexify the UAPI a bit.
>
> I don't think we need sub-regions, it is too complicated in the kernel
> and pretty much useless.
So we infer that mmap cookies are an alias to an entire region.
> > > That said I'm not married to the idea, I just don't completely
> > > understand the alternative you are proposing...
> >
> > Obviously Jason can correct if the interpretation I gleaned from the
> > previous thread above is incorrect, but I don't really see that new
> > region indexes are weird or hacky. They fit into the REGION_INFO
> > scheme, they could be managed via DEVICE_FEATURE.
>
> I think I said before, adding more region indexes just makes a PITA
> for the driver to manage this.
>
> Indexes should refer to the physical object, the BAR in PCI. This is
> easy for the drivers to manage, any easy for userspace to understand.
>
> If you make dynamic indexes then every driver needs its own scheme to
> map them to physical indexes and we end up re-inventing the maple tree
> stuff without the cleanup or generality it brings.
I'd like to better understand why you believe this to be the case.
We have an existing ABI that maps BARs, config space, and VGA spaces to
fixed region indexes. That "region index" to "device space" mapping
cannot change, but the offset of a given region and the total number of
regions is not ABI. Therefore we can introduce an API where a user
says "give me a new region index that aliases region 0 with mmap
attribute FOO". Maybe this returns to them region index 10. The
DEVICE_INFO ioctl now reports more regions and REGION_INFO for index 10
provides the "mmap cookie", ie. offset, for this new region.
To a limited extent we can provide this within fixed index/pgoff
implementation (not ABI) we use now, but AIUI we could use a maple tree
to block out ranges and get more dense packing of regions across the
device fd.
This is compatible with existing userspace drivers, they don't invoke
any new APIs to create new regions and ideally they shouldn't have any
ABI dependencies on specific offsets (but we might provide limited
compatibility if that proves otherwise). Drivers invoke the new API
specifically to get a region index with the desired mmap attribute and
use the existing REGION_INFO ioctl to get the offset of the region.
I don't understand how this introduces so much complication to drivers
that, for example, BAR0 might be accessible through region index 0 for
legacy mappings and index 10 for modified mmap attributes. It might
make our unmap_mapping_range() call more complicated if we have a
non-contiguous mmap space, but I suspect we can just call it across the
furthest region extent.
Can you describe the driver scenario where having two different
mmap cookies for region index 0 makes things significantly easier for
drivers?
It seems like Ben's suggestion below of a call that modifies the mmap
attributes of an existing region is the least overall change to
existing drivers, though I'm not sure if that's what we should be
optimizing for.
> > > > > * A simpler approach which is to have an ioctl to change the
> > > > > attributes over a range of a region, so that *any* mapping of
> > > > > that range now uses that attribute. This can be done
> > > > > completely dynamically
> > > > > (see below).
> > > >
> > > > And I really, really don't like this. The meaning of a memmap
> > > > cookie should not be changing dynamically. That is a bad
> > > > precedent.
> > >
> > > What do you mean "a cookie changing dynamically" ? The cookie
> > > doesn't change. The attributes of a portion of a mapping change
> > > based on what userspace requested.
>
> Exactly. You have changed the underlying meaning of the cookie
> dynamicaly.
At the direction of the user. Why is that a problem?
> > > The biggest advantage of that approach is that it completely
> > > precludes multiple conflicting mappings for a given region (at
> > > least within a given process, though it might be possible to
> > > extend it globally if we
>
> It doesn't. It just makes a messy uapi. At the time of mmap the vma
> would stil have to capture the attributes (no fault by fault!) into
> the VMA so we will see real users doing things like:
>
> set to wc(cookie)
> mmap(cookie + XXX)
> set to !wc(cookie)
> mmap(cookie + YY)
>
> And then if you try to debug this all our file/vma debug tools will
> just show cookie everywhere with no distinction that some VMAs are WC
> and some VMAs are !WC.
>
> Basically, it fundamentally breaks how pgoff is supposed to work here
> by making its meaning unstable.
We could require the mmap attribute is set before mmap and not changed
after, but yes, we don't get simultaneous mmaps with different
attributes without different cookies.
> > > want) which has been a concern expressed last time we talked
> > > about this by the folks from the ARM world.
> >
> > This precludes that a user could simultaneously have mappings to the
> > same device memory with different mmap attributes,
>
> Indeed, that is required for most HW. mlx5 for example has BARs that
> mix WC and non WC access modes. There are too few BARs for most HW to
> be able to dedicate an entire BAR to WC only.
So do we want to revisit whether an mmap attribute applies to a whole
region or only part of a region? If the WC/UC ranges are spatially
separate, we could still handle that via a "modify mmap attribute
across sub-range of region" type API and userspace would be required to
do separate mmaps for the ranges.
> > > > The uAPI I prefer is a 'get region info2' which would be
> > > > simplified and allow userspace to pass in some flags. One of
> > > > those flags can be 'request wc'.
> > >
> > > So talking of "weird hacky looking" having something called
> > > "get_info" taking "request" flags ... ahem ... :-)
> >
> > Yup...
>
> We've already confused returning the mmap cookie and the other
> information about the region. If you want to have flags to customize
> what mmap cookie is returned, this is the simplest answer.
I think either mechanism presented above is easier and more consistent
with the existing API.
> > > If what you really mean is that under the hood, a given "index"
> > > produces multiple "regions" (cookies) and "get_info2" allows to
> > > specify which one you want to get info about ?
>
> There is only one index. You can ask for different mmap options, and
> pgoff space for them is created dynamically.
>
> A "region" is not a mmap cookie, a region can have multiple mmap
> cookies.
Our ABI requires that BAR0 is region 0, but our API does not require
that BAR0 is only region 0. Therefore I would say that a region is an
mmap cookie is more correct than a region can have multiple mmap
cookies in our current API.
> > > I disagree ... I find it actually cumbersome but we can just
> > > agree to disagree here and as I said, I don't care that much as
> > > long as there is no fundamental reason why it wouldn't work in
> > > the end.
> >
> > I'm not a fan of the REGION_INFO2 idea either. A new region index
> > that provides an alias to another region with different mmap
> > semantics is much more intuitive in our existing UAPI, imo.
>
> It would not be another region index. That is the whole point. It is
> another pgoff for an existing index.
I think this is turning a region index into something it was not meant
to be.
> > > Talking of cookie space, one thing we do need to preserve is the
> > > natural alignment to the BAR size. Userspace *will* do "|"
> > > instead of "+" on top of a cookie when mmap'ing (beyond mmap own
> > > alignment requirements).
>
> I think that's just wrong userspace, sorry. :(
>
> Still, we want to do this anyhow as the VMA alignment stuff Peter was
> working on also requires pgoff alignment.
>
> > > > > Now, within the context of VFIO PCI, since we use a fault
> > > > > handler, it
> > > > > doesn't even have to be done before mmap, we can just
> > > > > dynamically fault
> > > > > a given page with the right attributes (and zap mappings on
> > > > > attributes
> > > > > changes). I would go down that path personally and generalize
> > > > > the fault
> > > > > handler to all VFIO implementations.
> > > >
> > > > Even worse. fault by fault changing of attributes? No way,
> > > > that's a completely crazy uAPI!
> > >
> > > Why ? first userspace doesn't know or see it happens fault by
> > > fault (and it could be full established at mmap time in fact, but
> > > fault time makes the implementation a lot easier).
> > >
> > > Here you are conflating implementation details with "uAPI" ...
> > > uAPI is what is presented to userspace, which in this case is
> > > "set the attributes for this portion of a region". Then at map
> > > time, that portion of the region gets the requested attributes.
> > > There is nothing in the uAPI that carries the fact that it
> > > happens at fault time.
> > >
> > > You keep coming up with "ugly", "crazy" etc... without every
> > > actually spelling out the technical pro/cons that would actually
> > > substantiate those adjectives. Basically anything that isn't your
> > > get_region2 is "crazy" or "ugly" ... NIH syndrome ?
>
> This is not how the mm ever works *anywhere* in the main kernel, the
> only reason I can see you are propsing this because it avoids doing
> the cleanup work.
>
> It makes it impossible for userspace to get both WC and non-WC
> mappings. It makes it indeterminate what behavior userspace gets at
> every store operation. Real HW like mlx5 would need mixed WC/!WC on
> the same bar, so I view this as an entirely bad uAPI.
>
> > > > I don't know what this is about, I don't want to see APIs to
> > > > slice up regions. The vfio driver should report full regions in
> > > > the pgoff space, and VMAs should be linked to single maple tree
> > > > ranges only. If userspace wants something weird it can call
> > > > mmap multiple times and get different VMAs.
> > >
> > > Why ? What are the pros/cons of each approach ?
>
> Extra complexity in the kernel, no usecase that isn't already handled
> by mmap directly.
>
> > > The big advantages of that latter approach is that it's very
> > > clear and simple for userspace (this bit of the BAR is meant to
> > > be WC) and completely
> > > avoids the multiple mapping problem.
>
> What is the "multiple mapping problem" ? We've discussed this
> extensively with ARM when we added WC support to KVM and I think we
> have a general agreement that multiple mappings are not something the
> kernel needs to actively prevent, in the limited case of WC and !WC.
>
> > >The inconvenient is that we have to generalize the fault handler
> > > mechanism to all backends, and it makes the multi-mapping
> > > impossible (in the maybe possible case where might want to allow
> > > it in some cases).
> > >
> > > Overall I find a lot of putting cart before horses here. Can we
> > > first agree
> > > on a clear definition of the uAPI first, then we can figure out
> > > the implementation details ?
> >
> > +1. Thanks,
>
> All the uAPI proposals I've seen are all trying to work around the
> current state of the code.
>
> This is broadly what I've proposed consistently since the beginning,
> adjusted for the various remarks since:
>
> struct vfio_region_get_mmap {
> __u32 argsz;
> __u32 region_index; // only one, no aliases
> __u32 mmap_flags; // Set WC here
>
> __aligned_u64 region_size;
> __aligned_u64 fd_offset;
> };
>
> struct vfio_region_get_caps {
> __u32 argsz;
> __u32 region_index;
>
> __u32 region_flags; // READ/WRITE/etc
> __aligned_u64 region_size;
> __u32 cap_offset; /* Offset within info struct
> of first cap */ };
>
> Alex, you pointed out that the parsing of the existing
> VFIO_DEVICE_GET_REGION_INFO has made it non-extendable. So the above
> two are creating a new extendable version that are replacements.
Can you be more specific on this claim? We are no longer creating
static region indexes after the introduction of device specific
regions, but I don't see why we're not using the mechanisms of the
device specific region to create new region indexes with new offsets
that have specified mmap attributes here. I imagine a DEVICE_FEATURE
that creates a new region, returning at least the region index,
DEVICE_INFO and REGION_INFO are updated to describe the new region, ie.
mmap-only, new offset/cookie, likely a capability embedded in the
REGION_INFO to provide introspection that this regions is an alias of
another.
> To avoid the naming confusion we have a specific ioctl to get
> mmap'able access, and another one for the cap list. I guess this also
> gives access to read/write so maybe the name needs more bikeshedding.
Largely duplicating REGION_INFO.
> Compared to the existing we add a single new concept, 'mmap_flags',
> which customizes how the fd_offset will behave with mmap. The kernel
> will internally de-duplicate region_index/mmap_flags -> fd_offset.
Ok. The above could also de-duplicate.
> There is still one index per physical object (ie BAR) in the uAPI.
This is a non-requirement.
> We get one cookie that describes the VMA behavior exactly and
> immutably.
So does the above.
> The existing VFIO_DEVICE_GET_REGION_INFO is expressed in terms of the
> above two operations with mmap_flags = 0.
Still more complicated that new region index and existing ioctls.
> No new subregion concept. No alias region indexes. No changing the
> meaning of returned cookies.
The above doesn't change the meaning of returned cookies and I don't
see why the rest is an issue.
> We simply allow the userspace to request a mmap cookie for a region
> index that will always cause mmap to create a VMA with WC pgrot.
Or we allow userspace to request a region index that will always cause
mmap to create a vma with wc pgprot.
I see the device fd as segmented into regions. The base set of regions
happen to have fixed definitions relative to device objects.
Introducing mmap cookies as a new mapping to a region where we can have
N:1 cookies to region really seems unnecessarily complicated vs a 1:1
cookie to region space.
> If we someday want cachable, or ARM's DEVICE_GRE or whatever we can
> trivially add more flags to mmap_flags and userspace can get
> more cookies.
Yup, same same.
> If we later decide we need to solve the ARM multi-device issue then we
> can cleanly extend an additional start/len to vfio_region_get_mmap
> which can ensure mmaps cookies are disjoint. This is not subregions,
> or new regions, this is just a cookie with a restriction.
No different if REGION_INFO supplies disjoint offset/length.
> In terms of implementation once you do the maple tree work that
> Mahmoud started we end up with vfio_region_get_mmap allocating a
> struct vfio_mmap for each unique region_index/mmap_flags and returning
> it to userspace.
And we get an entirely disjoint API from legacy vfio.
> All the drivers will use the struct vfio_mmap everywhere instead of
> trying to decode pgoff to index with macros. Inside the vfio_mmap we
> store the index for the driver to use directly. The driver has a
> simple implementation, one index mapped to one physical resource. A
> pgprot value in the struct vfio_mmap to specify how to create the VMA.
>
> No dynamic driver aware regions or subregions.
Currently we have a struct vfio_pci_region stored in an array that we
dynamically resize for device specific regions and the offset is
determined statically from the array index. We could easily specify an
offset and alias field on that object if we wanted to make the address
space more compact (without a maple tree) and facilitate multiple
regions referencing the same device resource. This is all just
implementation decisions. We also don't need to support read/write on
new regions, we could have them exist advertising only mmap support via
REGION_INFO, which simplifies and is consistent with the existing API.
There are a lot of new APIs being proposed here in the name of this
idea that we shouldn't create new regions/sub-regions/alias-regions,
which ultimately seems like a non-issue to me. Thanks,
Alex
next prev parent reply other threads:[~2025-08-11 22:07 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-04 10:39 [RFC PATCH 0/9] vfio: Introduce mmap maple tree Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 1/9] vfio: add mmap maple tree to vfio Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 2/9] vfio: add transient ops to support vfio mmap mt Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 3/9] vfio-pci-core: rename vm operations Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 4/9] vfio-pci-core: remove redundant offset calculations Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 5/9] vfio-pci-core: add vfio_pci_mmap & helpers Mahmoud Adam
2025-08-04 10:39 ` [RFC PATCH 6/9] vfio-pci-core: support the new vfio ops Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 7/9] vfio-pci: use " Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 8/9] vfio: UAPI for setting mmap attributes Mahmoud Adam
2025-08-04 10:40 ` [RFC PATCH 9/9] vfio_pci_core: support mmap attrs uapi & WC Mahmoud Adam
2025-08-04 18:49 ` [RFC PATCH 0/9] vfio: Introduce mmap maple tree Alex Williamson
2025-08-04 20:09 ` Mahmoud Nagy Adam
2025-08-05 14:31 ` Jason Gunthorpe
2025-08-05 15:48 ` Mahmoud Nagy Adam
2025-08-05 18:50 ` [RFC " Jason Gunthorpe
2025-08-05 19:00 ` Alex Williamson
[not found] ` <80dc87730f694b2d6e6aabbd29df49cf3c7c44fb.camel@amazon.com>
[not found] ` <20250806115224.GB377696@ziepe.ca>
2025-08-07 8:12 ` Herrenschmidt, Benjamin
2025-08-07 19:06 ` Alex Williamson
2025-08-11 15:55 ` Jason Gunthorpe
2025-08-11 22:07 ` Alex Williamson [this message]
2025-08-12 0:30 ` Jason Gunthorpe
2025-08-12 19:26 ` Alex Williamson
2025-08-13 0:17 ` Jason Gunthorpe
2025-08-14 8:39 ` Mahmoud Nagy Adam
2025-08-14 9:52 ` Mahmoud Nagy Adam
2025-08-14 17:52 ` Alex Williamson
2025-08-28 8:53 ` Mahmoud Nagy Adam
2025-08-28 19:17 ` Alex Williamson
2025-08-07 8:13 ` Benjamin Herrenschmidt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250811160710.174ca708.alex.williamson@redhat.com \
--to=alex.williamson@redhat.com \
--cc=benh@kernel.crashing.org \
--cc=dwmw@amazon.co.uk \
--cc=jgg@ziepe.ca \
--cc=kvm@vger.kernel.org \
--cc=mngyadam@amazon.de \
--cc=nagy@khwaternagy.com \
--cc=pravkmr@amazon.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).