* RFC: A KVM-specific alternative to UserfaultFD
@ 2023-11-06 18:25 David Matlack
2023-11-06 20:23 ` Peter Xu
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-06 18:25 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm list, Sean Christopherson, James Houghton, Oliver Upton,
Peter Xu, Axel Rasmussen
Hi Paolo,
I'd like your feedback on whether you would merge a KVM-specific
alternative to UserfaultFD.
Within Google we have a feature called "KVM Demand Paging" that we
have been using for post-copy live migration since 2014 and, more
recently, for memory poisoning emulation. The high-level design is:
(a) A bitmap that tracks which GFNs are present, along with a UAPI
to enable/disable the present bitmap.
(b) UAPIs for marking GFNs present and non-present.
(c) KVM_RUN support for returning to userspace on guest page faults
to non-present GFNs.
(d) A notification mechanism and wait queue to coordinate KVM
accesses to non-present GFNs.
(e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
memory becomes present.
The actual implementation within Google has a lot of warts that I
won't get into... but I think we could have a pretty clean upstream
solution.
In fact, a lot of the infrastructure needed to support this design is
already in-flight upstream. e.g. (a) and (b) could be built on top of
the new memory attributes (although I have concerns about the
performance of using xarray vs. bitmaps), (c) can be built on top of
the memory-fault exiting. The most complex piece of new code would be
the notification mechanism for (d). Within Google we've been using a
netlink socket, but I think we should use a custom file descriptor
instead.
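To make (a)-(d) a bit more concrete, here is a rough sketch of what the
userspace side could look like. Every KVM_DP_* name and struct below is
a made-up placeholder (as is fetch_and_install_page()); this is not the
Google implementation and not a concrete UAPI proposal:

#include <linux/types.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical UAPI sketch; all KVM_DP_* names are placeholders. */
struct kvm_dp_fault {                /* record read from the notify fd */
        __u64 gfn;
        __u64 flags;
};

struct kvm_dp_range {
        __u64 gfn;
        __u64 nr_pages;
};

static void postcopy_listener(int vm_fd)
{
        /* (a) enable the present bitmap (starts all-clear). */
        ioctl(vm_fd, KVM_DP_ENABLE, 0);

        /* (d) fd that KVM pushes fault records to when it touches a
         * non-present GFN outside of a vCPU context. */
        int notify_fd = ioctl(vm_fd, KVM_DP_GET_NOTIFY_FD, 0);

        struct kvm_dp_fault fault;
        while (read(notify_fd, &fault, sizeof(fault)) == sizeof(fault)) {
                /* Fetch the page from the source and write it into the
                 * backing memory (tmpfs, HugeTLB, ...); hypothetical. */
                fetch_and_install_page(fault.gfn);

                /* (b) mark the GFN present: wakes any KVM waiters and
                 * lets vCPUs map the page on their next fault. */
                struct kvm_dp_range r = { .gfn = fault.gfn, .nr_pages = 1 };
                ioctl(vm_fd, KVM_DP_MARK_PRESENT, &r);
        }
}

(c) is then just a memory-fault-style KVM_RUN exit: the faulting vCPU
thread fetches the GFN itself (or waits for the listener), marks it
present, and re-enters the guest.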
If we do it right, almost no architecture-specific support is needed.
Just a small bit in the page fault path (for (c) and to account for
the present bitmap when determining what (huge)page size to map).
The most painful part of carrying KVM Demand Paging out-of-tree has
been maintaining the hooks for (d). But this has been mostly
self-inflicted. We started out by manually annotating all of the code
where KVM reads/writes guest memory. But there are more core routines
that all guest-memory accesses go through (e.g. __gfn_to_hva_many())
where we could put a single hook, and then KVM just has to make sure
to invalidate any gfn-to-hva/pfn caches and SPTEs when a page becomes
non-present (which is rare and typically only happens before a vCPU
starts running). And hooking KVM accesses to guest memory isn't
exactly new: KVM already manually tracks all writes to keep the dirty
log up to date.
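For illustration only, the single hook could look something like the
sketch below. The dp_* fields and helpers are made up; the shape (test
a per-slot present bitmap, notify userspace, sleep on a wait queue) is
the point:

/* Hypothetical kernel-side sketch; the dp_* names are made up. */
static int kvm_dp_wait_present(struct kvm *kvm,
                               struct kvm_memory_slot *slot, gfn_t gfn)
{
        unsigned long idx = gfn - slot->base_gfn;

        if (!kvm->dp_enabled || test_bit(idx, slot->dp_present_bitmap))
                return 0;

        /* Push a fault record to the notification fd for (d)... */
        kvm_dp_notify(kvm, gfn);

        /* ...and sleep until userspace marks the GFN present via (b),
         * which also wakes kvm->dp_waitq. */
        return wait_event_interruptible(kvm->dp_waitq,
                        test_bit(idx, slot->dp_present_bitmap));
}

This would be called from the common gfn-to-hva/pfn path, with vCPU
faults instead taking the return-to-userspace route in (c).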
So why merge a KVM-specific alternative to UserfaultFD?
Taking a step back, let's look at what UserfaultFD is actually
providing for KVM VMs:
1. Coordination of userspace accesses to guest memory.
2. Coordination of KVM+guest accesses to guest memory.
(1.) technically does not need kernel support. It's possible to solve
this problem in userspace, and it can likely be solved more efficiently
there because userspace has more flexibility and can avoid bouncing
through the kernel page fault handler. And it's not unreasonable to
expect VMMs to support this. VMMs already need to manually intercept
userspace _writes_ to guest memory to implement dirty tracking
efficiently. It's a small step beyond that to intercept both reads and
writes for post-copy. And VMMs are increasingly multi-process:
UserfaultFD provides coordination within a process, but VMMs already
need to coordinate across processes anyway. i.e. UserfaultFD only
solves part of the problem for (1.).
The KVM-specific approach is basically to provide kernel support for
(2) and let userspace solve (1) however it likes.
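Purely as an illustration of what solving (1) in userspace could look
like (not how any particular VMM does it), the VMM can keep a per-GFN
state array in shared memory and have every userspace accessor go
through something like:

#include <stdatomic.h>
#include <stdint.h>

enum gfn_state { GFN_ABSENT, GFN_FETCHING, GFN_PRESENT };

/* One byte of VMM-private state per GFN, living in memory shared by
 * all VMM processes. The fetch/wait/wake helpers are hypothetical. */
static _Atomic uint8_t *gfn_state;

static void vmm_ensure_present(uint64_t gfn)
{
        uint8_t expected = GFN_ABSENT;

        if (atomic_load(&gfn_state[gfn]) == GFN_PRESENT)
                return;

        if (atomic_compare_exchange_strong(&gfn_state[gfn], &expected,
                                           GFN_FETCHING)) {
                fetch_page_from_source(gfn);   /* hypothetical */
                atomic_store(&gfn_state[gfn], GFN_PRESENT);
                wake_waiters(gfn);             /* hypothetical, e.g. a futex */
        } else {
                wait_until_present(gfn);       /* hypothetical */
        }
}

Because the state array lives in shared memory, the same check works
across the processes of a multi-process VMM, which is exactly the part
UserfaultFD doesn't cover.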
But if UserfaultFD solves (1) and (2), why introduce a KVM feature
that solves only (2)?
Well, UserfaultFD has some very real downsides:
* Lack of sufficient HugeTLB Support: The most recent and glaring
problem is upstream's NACK of HugeTLB High Granularity Mapping [1].
Without HGM, UserfaultFD can only handle HugeTLB faults at huge
page granularity. i.e. If a VM is backed with 1GiB HugeTLB, then
UserfaultFD can only handle 1GiB faults. Demand-fetching 1GiB of
memory from a remote host during the post-copy phase of live
migration is untenable, and even 2MiB fetches are painful with most
current NICs (see the rough fetch-time arithmetic after this list). In
effect, there is no line-of-sight on an upstream solution for
post-copy live migration for VMs backed with HugeTLB.
* Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
guest memory for the userspace page table entries.
* CPU Overhead: UserfaultFD has to manipulate userspace page tables to
split mappings down to PAGE_SIZE, handle PAGE_SIZE'd faults, and,
later, collapse mappings back into huge pages. These manipulations take
locks like mmap_lock, page locks, and page table locks.
* Complexity: UserfaultFD-based demand paging depends on functionality
across multiple subsystems in the kernel including Core MM, KVM, as
well as each of the memory filesystems (tmpfs, HugeTLB, and
eventually guest_memfd). Debugging problems requires
knowledge across many domains that many engineers do not have. And
solving problems requires getting buy-in from multiple subsystem
maintainers that may not all be aligned (see: HGM).
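To put rough numbers on the fetch-granularity point (my arithmetic,
assuming a 100 Gbps NIC at line rate, i.e. ~12.5 GB/s, and ignoring
RTT and protocol overhead):

  4 KiB fetch:  4096 B / 12.5 GB/s  ~ 0.3 us
  2 MiB fetch:  2 MiB  / 12.5 GB/s  ~ 170 us
  1 GiB fetch:  1 GiB  / 12.5 GB/s  ~ 86 ms

A vCPU stalled for tens of milliseconds on every first touch of a 1GiB
region is effectively a guest hang during post-copy.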
All of these are addressed with a KVM-specific solution. A
KVM-specific solution can have:
* Transparent support for any backing memory subsystem (tmpfs,
HugeTLB, and even guest_memfd).
* Only 1 bit of overhead per page of guest memory.
* No need to modify host page tables.
* All code contained within KVM.
* Significantly fewer LOC than UserfaultFD.
Ok, that's the pitch. What are your thoughts?
[1] https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-06 18:25 RFC: A KVM-specific alternative to UserfaultFD David Matlack
@ 2023-11-06 20:23 ` Peter Xu
2023-11-06 22:24 ` Axel Rasmussen
2023-11-07 16:25 ` Paolo Bonzini
0 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2023-11-06 20:23 UTC (permalink / raw)
To: David Matlack
Cc: Paolo Bonzini, kvm list, Sean Christopherson, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
Hi, David,
Before Paolo shares his opinion, I can provide some quick comments.
On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> Hi Paolo,
>
> I'd like your feedback on whether you would merge a KVM-specific
> alternative to UserfaultFD.
>
> Within Google we have a feature called "KVM Demand Paging" that we
> have been using for post-copy live migration since 2014 and memory
> poisoning emulation more recently. The high-level design is:
I have no immediate comment on the proposal yet, but I can list how uffd
handles each of the points, as a comparison, inline below.
>
> (a) A bitmap that tracks which GFNs are present, along with a UAPI
> to enable/disable the present bitmap.
Uffd uses the pgtable (anon) or page cache (shmem/hugetlb) directly.
Slight win, IMHO, because the bitmap would be an extra structure
maintaining the same information, IIUC.
> (b) UAPIs for marking GFNs present and non-present.
Similar: this is something bound to the above bitmap design, and not
needed for uffd. Extra interface?
> (c) KVM_RUN support for returning to userspace on guest page faults
> to non-present GFNs.
Uffd has the wait queues, so this would be an extra KVM interface to
maintain, but it's not easy to judge because it may indeed bring
benefits, like better concurrency. Personally I'm just not sure (1) how
important concurrency is in this use case, and (2) whether we can
improve the scalability of the general uffd code.
For (1), if the time to resolve a remote page fault is bottlenecked on
the network, concurrency may not matter a huge deal, IMHO. But I
haven't really done enough testing in this area.
For (2), something like:
https://lore.kernel.org/r/20230905214235.320571-1-peterx@redhat.com
I didn't continue that thread because QEMU doesn't use uffd as heavily,
so there was no rush to push it further from QEMU's perspective.
However, IMHO it'll always be valuable to profile userfault to see
whether the issues can be resolved in more general ways. It may turn
out that, once we've explored uffd enough, we find a bottleneck we
cannot avoid due to uffd's design, but IIUC that work hasn't been done
by anyone yet; IOW there's still a chance to provide a generic
solution.
> (d) A notification mechanism and wait queue to coordinate KVM
> accesses to non-present GFNs.
This would probably reimplement uffd's wait queue, more or less.
Is it only used when there's no vcpu thread context? I remember Anish's
other proposal on vcpu exits can already achieve something similar
without the queue.
> (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
> memory becomes present.
This interface will also be needed with userfaultfd, but with uffd it
would be a common interface that can also be used outside the VM
context.
>
> The actual implementation within Google has a lot of warts that I
> won't get into... but I think we could have a pretty clean upstream
> solution.
>
> In fact, a lot of the infrastructure needed to support this design is
> already in-flight upstream. e.g. (a) and (b) could be built on top of
> the new memory attributes (although I have concerns about the
> performance of using xarray vs. bitmaps), (c) can be built on top of
> the memory-fault exiting. The most complex piece of new code would be
> the notification mechanism for (d). Within Google we've been using a
> netlink socket, but I think we should use a custom file descriptor
> instead.
>
> If we do it right, almost no architecture-specific support is needed.
> Just a small bit in the page fault path (for (c) and to account for
> the present bitmap when determining what (huge)page size to map).
>
> The most painful part of carrying KVM Demand Paging out-of-tree has
> been maintaining the hooks for (d). But this has been mostly
> self-inflicted. We started out by manually annotating all of the code
> where KVM reads/writes guest memory. But there are more core routines
> that all guest-memory accesses go through (e.g. __gfn_to_hva_many())
> where we could put a single hook, and then KVM just has to make sure
It's great to know (d) is actually not a real problem, however..
> to invalidate an gfn-to-hva/pfn caches and SPTEs when a page becomes
> non-present (which is rare and typically only happens before a vCPU
> starts running). And hooking KVM accesses to guest memory isn't
> exactly new, KVM already manually tracks all writes to keep the dirty
> log up to date.
.. what about all the other kernel modules that can directly access
guest memory without going through KVM APIs, like vhost? Do all of
them need to implement something similar?
>
> So why merge a KVM-specific alternative to UserfaultFD?
>
> Taking a step back, let's look at what UserfaultFD is actually
> providing for KVM VMs:
>
> 1. Coordination of userspace accesses to guest memory.
> 2. Coordination of KVM+guest accesses to guest memory.
>
> (1.) technically does not need kernel support. It's possible to solve
> this problem in userspace, and likely can be more efficient to solve
> it in userspace because you have more flexibility and can avoid
> bouncing through the kernel page fault handler. And it's not
> unreasonable to expect VMMs to support this. VMMs already need to
> manually intercept userspace _writes_ to guest memory to implement
> dirty tracking efficiently. It's a small step beyond that to intercept
> both reads and writes for post-copy. And VMMs are increasingly
> multi-process. UserfaultFD provides coordination within a process but
> VMMs already need to deal with coordinating across processes already.
> i.e. UserfaultFD is only solving part of the problem for (1.).
>
> The KVM-specific approach is basically to provide kernel support for
> (2) and let userspace solve (1) however it likes.
It's slightly unfortunate for QEMU and other userspace hypervisors in
this case, because it means that even if the new interface is merged,
each community will need to add support for it for the same postcopy
feature, and I'm not 100% sure about the complexity, at least for QEMU.
This is not a major concern if judging purely from the KVM perspective,
as long as the solution supersedes the current one, so I think it's
still okay to do so. Meanwhile, there is also the option of letting
userspace (QEMU, etc.) keep using userfaultfd, so KVM would have two
solutions for VM postcopy, which is not completely unacceptable either.
In all cases, before deciding to go this way, IMHO it would be a nice
gesture to consider the side effects on other communities, like QEMU,
that heavily consume KVM.
>
> But if UserfaultFD solves (1) and (2), why introduce a KVM feature
> that solves only (2)?
>
> Well, UserfaultFD has some very real downsides:
>
> * Lack of sufficient HugeTLB Support: The most recent and glaring
> problem is upstream's NACK of HugeTLB High Granularity Mapping [1].
> Without HGM, UserfaultFD can only handle HugeTLB faults at huge
> page granularity. i.e. If a VM is backed with 1GiB HugeTLB, then
> UserfaultFD can only handle 1GiB faults. Demand-fetching 1GiB of
> memory from a remote host during the post-copy phase of live
> migration is untenable. Even 2MiB fetches are painful with most
> current NICs. In effect, there is no line-of-sight on an upstream
> solution for post-copy live migration for VMs backed with HugeTLB.
Indeed I haven't seen any more patches from James for that support.
Does that mean the project will be discontinued?
Personally I still think this is the right way to go; it would be good
if hugetlb pages could be split in any context, even outside
hypervisors.
I don't think the community is NACKing the idea of allowing hugetlb
mappings to be split; it wasn't merged solely because of the
special-casing of hugetlbfs, and HGM just happened to arrive at that
point in time.
I had a feeling it would be resolved sooner or later, maybe even
without a hugetlb v2? That "v2" suggestion seems to be the final
conclusion from the last LSF/MM 2023 conference; however, I don't know
whether the discussion continued anywhere else, and I think not all
avenues have been explored. Maybe we can work it out with the current
hugetlb code base in some form, so that the community will be happy to
let hugetlb add new features again.
>
> * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> guest memory for the userspace page table entries.
What is this one?
>
> * CPU Overhead: UserfaultFD has to manipulate userspace page tables to
> split mappings down to PAGE_SIZE, handle PAGE_SIZE'd faults, and,
> later, collapse mappings back into huge pages. These manipulations take
> locks like mmap_lock, page locks, and page table locks.
Indeed this can be a problem; however, this is also the best part of
userfaultfd in terms of the cleanness of the whole design: it unifies
everything in the core mm, and it's already there. That's also why it
works with kvm/vhost/userspace apps/whatever, no matter how the page
is accessed.
>
> * Complexity: UserfaultFD-based demand paging depends on functionality
> across multiple subsystems in the kernel including Core MM, KVM, as
> well as the each of the memory filesystems (tmpfs, HugeTLB, and
> eventually guest_memfd). Debugging problems requires
> knowledge across many domains that many engineers do not have. And
> solving problems requires getting buy-in from multiple subsystem
> maintainers that may not all be aligned (see: HGM).
I'll keep the HGM-related discussion in the bullet above. OTOH, I
don't consider "debugging problems requires knowledge across many
domains" to be valid reasoning. At least it's not a purely technical
point, and I think we should stick to technical comparisons of the
solutions.
>
> All of these are addressed with a KVM-specific solution. A
> KVM-specific solution can have:
>
> * Transparent support for any backing memory subsystem (tmpfs,
> HugeTLB, and even guest_memfd).
I'm curious how hard it would be to allow guest_memfd to support
userfaultfd. David, do you know?
The rest are already supported by uffd, so I assume they're not a major
problem.
> * Only 1 bit of overhead per page of guest memory.
> * No need to modify host page tables.
> * All code contained within KVM.
> * Significantly fewer LOC than UserfaultFD.
We're not planning to remove uffd from mm even if this is merged.. right?
Could you elaborate?
>
> Ok, that's the pitch. What are your thoughts?
>
> [1] https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
How about hwpoison in non-KVM contexts? I remember Mike Kravetz
mentioned that for database usage, and IIRC Google has also expressed
interest in that area before.
Copy Mike for that; copy Andrea for everything.
Thanks,
--
Peter Xu
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-06 20:23 ` Peter Xu
@ 2023-11-06 22:24 ` Axel Rasmussen
2023-11-06 23:03 ` Peter Xu
2023-11-07 16:25 ` Paolo Bonzini
1 sibling, 1 reply; 34+ messages in thread
From: Axel Rasmussen @ 2023-11-06 22:24 UTC (permalink / raw)
To: Peter Xu
Cc: David Matlack, Paolo Bonzini, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
On Mon, Nov 6, 2023 at 12:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> Before Paolo shares his opinion, I can provide some quick comments.
>
> On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > Hi Paolo,
> >
> > I'd like your feedback on whether you would merge a KVM-specific
> > alternative to UserfaultFD.
> >
> > Within Google we have a feature called "KVM Demand Paging" that we
> > have been using for post-copy live migration since 2014 and memory
> > poisoning emulation more recently. The high-level design is:
>
> I have no immediate comment on the proposal yet, but I can list how uffd
> handles below as comparisons, inline below.
>
> >
> > (a) A bitmap that tracks which GFNs are present, along with a UAPI
> > to enable/disable the present bitmap.
>
> Uffd uses the pgtable (anon) or page cache (shmem/hugetlb) directly.
> Slight win, IMHO, because bitmap will be extra structure to maintain the
> same information, IIUC.
>
> > (b) UAPIs for marking GFNs present and non-present.
>
> Similar, this is something bound to above bitmap design, and not needed for
> uffd. Extra interface?
>
> > (c) KVM_RUN support for returning to userspace on guest page faults
> > to non-present GFNs.
>
> Uffd has the wait queues, so this will be extra kvm interface to maintain,
> but not easy to judge because it may bring benefits indeed, like better
> concurrency. Personally I'm just not sure (1) how important concurrency is
> in this use case, and (2) whether we can improve uffd general code on
> scalability.
>
> For (1), if the time to resolve a remote page fault is bottlenecked on the
> network, concurrency may not matter a huge deal, IMHO. But I didn't really
> do enough test over this area.
>
> For (2), something like:
>
> https://lore.kernel.org/r/20230905214235.320571-1-peterx@redhat.com
>
> I didn't continue that thread because QEMU doesn't use uffd as heavy, so no
> rush to push that further from QEMU's perspective. However IMHO it'll
> always be valuable to profile userfault to see whether the issues can be
> resolved in more general ways. One thing that can happen is that we
> explored uffd enough so we may find out the bottleneck that we cannot avoid
> due to uffd's design, but IIUC that work hasn't yet been done by anyone,
> IOW there's still chance to me to provide a generic solution.
>
> > (d) A notification mechanism and wait queue to coordinate KVM
> > accesses to non-present GFNs.
>
> Probably uffd's wait queue to be reimplemented more or less.
>
> Is this only used when there's no vcpu thread context? I remember Anish's
> other proposal on vcpu exit can already achieve similar without the queue.
>
> > (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
> > memory becomes present.
>
> This interface will also be needed if with userfaultfd, but if with uffd
> it'll be a common interface so can be used outside VM context.
>
> >
> > The actual implementation within Google has a lot of warts that I
> > won't get into... but I think we could have a pretty clean upstream
> > solution.
> >
> > In fact, a lot of the infrastructure needed to support this design is
> > already in-flight upstream. e.g. (a) and (b) could be built on top of
> > the new memory attributes (although I have concerns about the
> > performance of using xarray vs. bitmaps), (c) can be built on top of
> > the memory-fault exiting. The most complex piece of new code would be
> > the notification mechanism for (d). Within Google we've been using a
> > netlink socket, but I think we should use a custom file descriptor
> > instead.
> >
> > If we do it right, almost no architecture-specific support is needed.
> > Just a small bit in the page fault path (for (c) and to account for
> > the present bitmap when determining what (huge)page size to map).
> >
> > The most painful part of carrying KVM Demand Paging out-of-tree has
> > been maintaining the hooks for (d). But this has been mostly
> > self-inflicted. We started out by manually annotating all of the code
> > where KVM reads/writes guest memory. But there are more core routines
> > that all guest-memory accesses go through (e.g. __gfn_to_hva_many())
> > where we could put a single hook, and then KVM just has to make sure
>
> It's great to know (d) is actually not a real problem, however..
>
> > to invalidate an gfn-to-hva/pfn caches and SPTEs when a page becomes
> > non-present (which is rare and typically only happens before a vCPU
> > starts running). And hooking KVM accesses to guest memory isn't
> > exactly new, KVM already manually tracks all writes to keep the dirty
> > log up to date.
>
> .. what about all the other kernel modules that can directly access the
> guest memory without KVM APIs, like, vhost? Does all of them need to
> implement similar things?
>
> >
> > So why merge a KVM-specific alternative to UserfaultFD?
> >
> > Taking a step back, let's look at what UserfaultFD is actually
> > providing for KVM VMs:
> >
> > 1. Coordination of userspace accesses to guest memory.
> > 2. Coordination of KVM+guest accesses to guest memory.
> >
> > (1.) technically does not need kernel support. It's possible to solve
> > this problem in userspace, and likely can be more efficient to solve
> > it in userspace because you have more flexibility and can avoid
> > bouncing through the kernel page fault handler. And it's not
> > unreasonable to expect VMMs to support this. VMMs already need to
> > manually intercept userspace _writes_ to guest memory to implement
> > dirty tracking efficiently. It's a small step beyond that to intercept
> > both reads and writes for post-copy. And VMMs are increasingly
> > multi-process. UserfaultFD provides coordination within a process but
> > VMMs already need to deal with coordinating across processes already.
> > i.e. UserfaultFD is only solving part of the problem for (1.).
> >
> > The KVM-specific approach is basically to provide kernel support for
> > (2) and let userspace solve (1) however it likes.
>
> It's slightly unfortunate to QEMU and other userspace hypervisors in this
> case, because it means even if the new interface will be merged, each
> community will need to add support for it for the same postcopy feature,
> and I'm not 100% sure on the complexity at least for QEMU.
>
> I think this is not a major concern if purely judging from KVM perspective,
> indeed, as long as the solution supercedes the current one, so I think it's
> still okay to do so. Meanwhile there is also the other option to let
> whatever userspace (QEMU, etc.) keeps using userfaultfd, so KVM will have
> two solutions for VM postcopy, which is also not fully unacceptable,
> either. In all cases, before deciding to go this way, IMHO it'll be a nice
> gesture to consider the side effects to other communities, like QEMU, that
> heavily consumes KVM.
>
> >
> > But if UserfaultFD solves (1) and (2), why introduce a KVM feature
> > that solves only (2)?
> >
> > Well, UserfaultFD has some very real downsides:
> >
> > * Lack of sufficient HugeTLB Support: The most recent and glaring
> > problem is upstream's NACK of HugeTLB High Granularity Mapping [1].
> > Without HGM, UserfaultFD can only handle HugeTLB faults at huge
> > page granularity. i.e. If a VM is backed with 1GiB HugeTLB, then
> > UserfaultFD can only handle 1GiB faults. Demand-fetching 1GiB of
> > memory from a remote host during the post-copy phase of live
> > migration is untenable. Even 2MiB fetches are painful with most
> > current NICs. In effect, there is no line-of-sight on an upstream
> > solution for post-copy live migration for VMs backed with HugeTLB.
>
> Indeed I didn't see any patch from James anymore for that support. Does it
> mean that the project will be discontinued?
>
> Personally I still think this is the right way to go, that it'll be good if
> hugetlb pages can be split in any context even outside hypervisors.
>
> I don't think the community is NACKing the solution to allow hugetlb to
> split, it's not merged sololy because the specialty of hugetlbfs, and HGM
> just happened at that stage of time.
>
> I had a feeling that it'll be resolved sooner or later, even without a
> hugetlb v2, maybe? That "v2" suggestion seems to be the final conclusion
> per the last lsfmm 2023 conference; however I don't know whether the
> discussion continued anywhere else, and I think that not all the ways are
> explored, and maybe we can work it out with current hugetlb code base in
> some form so that the community will be happy to allow hugetlb add new
> features again.
I can at least add my perspective here. Note it may be quite different
from James'; I'm only speaking for myself.
I agree HGM wasn't explicitly NACKed, but at the same time I don't
feel we ever reached consensus on a path for it to be merged. There
were some good suggestions from you, Peter, and others around
improvements in that direction, but it remains unclear to me at least
a) at what point things will be "unified enough" that new features can
be merged, and b) what path towards unification we should take -
incremental improvements? building a hugetlbfs v2? something else?
In other words, I think getting HGM merged upstream involves a
potentially large / unbounded amount of work, and the outcome is
uncertain (e.g. one can imagine a scenario where some incremental
improvements are made, but HGM still seems largely unpalatable to the
community despite this).
So, I think the HGM state of affairs made alternatives like what David
is talking about in this thread much more attractive than they were
when we started working on HGM. :/
>
> >
> > * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > guest memory for the userspace page table entries.
>
> What is this one?
In the way we use userfaultfd, there are two shared userspace mappings:
one non-UFFD-registered mapping which is used to resolve demand paging
faults, and another, UFFD-registered, mapping which is handed to KVM et
al. for the guest to use. I think David is talking about the "second"
mapping as overhead here, since with the KVM-based approach he's
describing we don't need that mapping.
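For concreteness, a minimal sketch of that two-mapping setup
(minor-fault registration on a memfd/shmem backing; error handling
omitted, and the exact feature/flag choices depend on the backing
memory):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch only: sets up the two shared mappings described above. */
static int setup_two_mappings(size_t len, void **install, void **guest)
{
        int memfd = memfd_create("guest-mem", MFD_CLOEXEC);

        ftruncate(memfd, len);

        /* Mapping 1: not UFFD-registered, used to install pages. */
        *install = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, memfd, 0);

        /* Mapping 2: UFFD-registered for minor faults, given to KVM. */
        *guest = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, memfd, 0);

        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_SHMEM,
        };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)*guest, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* A fault is later resolved by copying data in via *install and
         * issuing UFFDIO_CONTINUE on the faulting range in *guest. */
        return uffd;
}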
>
> >
> > * CPU Overhead: UserfaultFD has to manipulate userspace page tables to
> > split mappings down to PAGE_SIZE, handle PAGE_SIZE'd faults, and,
> > later, collapse mappings back into huge pages. These manipulations take
> > locks like mmap_lock, page locks, and page table locks.
>
> Indeed this can be a problem, however this is also the best part of
> userfaultfd on the cleaness of the whole design, that it unifies everything
> into the core mm, and it's already there. That's also why it can work with
> kvm/vhost/userapp/whatever as long as the page is accessed in whatever
> form.
>
> >
> > * Complexity: UserfaultFD-based demand paging depends on functionality
> > across multiple subsystems in the kernel including Core MM, KVM, as
> > well as the each of the memory filesystems (tmpfs, HugeTLB, and
> > eventually guest_memfd). Debugging problems requires
> > knowledge across many domains that many engineers do not have. And
> > solving problems requires getting buy-in from multiple subsystem
> > maintainers that may not all be aligned (see: HGM).
>
> I'll put HGM related discussion in above bullet. OTOH, I'd consider
> "debugging problem requires knowledge across many domains" not as valid as
> reasoning. At least this is not a pure technical point, and I think we
> should still stick with technical comparisons on the solutions.
>
> >
> > All of these are addressed with a KVM-specific solution. A
> > KVM-specific solution can have:
> >
> > * Transparent support for any backing memory subsystem (tmpfs,
> > HugeTLB, and even guest_memfd).
>
> I'm curious how hard would it be to allow guest_memfd support userfaultfd.
> David, do you know?
>
> The rest are already supported by uffd so I assume not a major problem.
>
> > * Only 1 bit of overhead per page of guest memory.
> > * No need to modify host page tables.
> > * All code contained within KVM.
> > * Significantly fewer LOC than UserfaultFD.
>
> We're not planning to remove uffd from mm even if this is merged.. right?
> Could you elaborate?
Also, if we did make improvements for UFFD for this use case, other
(non-VM) use cases may also benefit from those same improvements,
since UFFD is so general.
>
> >
> > Ok, that's the pitch. What are your thoughts?
> >
> > [1] https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
>
> How about hwpoison in no-KVM contexts? I remember Mike Kravetz mentioned
> about that in the database usages, and IIRC Google also mentioned the
> interest in that area at least before.
>
> Copy Mike for that; copy Andrea for everything.
>
> Thanks,
>
> --
> Peter Xu
>
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-06 22:24 ` Axel Rasmussen
@ 2023-11-06 23:03 ` Peter Xu
2023-11-06 23:22 ` David Matlack
0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-06 23:03 UTC (permalink / raw)
To: Axel Rasmussen
Cc: David Matlack, Paolo Bonzini, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
Hello, Axel,
On Mon, Nov 06, 2023 at 02:24:13PM -0800, Axel Rasmussen wrote:
> On Mon, Nov 6, 2023 at 12:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > Before Paolo shares his opinion, I can provide some quick comments.
> >
> > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > Hi Paolo,
> > >
> > > I'd like your feedback on whether you would merge a KVM-specific
> > > alternative to UserfaultFD.
> > >
> > > Within Google we have a feature called "KVM Demand Paging" that we
> > > have been using for post-copy live migration since 2014 and memory
> > > poisoning emulation more recently. The high-level design is:
> >
> > I have no immediate comment on the proposal yet, but I can list how uffd
> > handles below as comparisons, inline below.
> >
> > >
> > > (a) A bitmap that tracks which GFNs are present, along with a UAPI
> > > to enable/disable the present bitmap.
> >
> > Uffd uses the pgtable (anon) or page cache (shmem/hugetlb) directly.
> > Slight win, IMHO, because bitmap will be extra structure to maintain the
> > same information, IIUC.
> >
> > > (b) UAPIs for marking GFNs present and non-present.
> >
> > Similar, this is something bound to above bitmap design, and not needed for
> > uffd. Extra interface?
> >
> > > (c) KVM_RUN support for returning to userspace on guest page faults
> > > to non-present GFNs.
> >
> > Uffd has the wait queues, so this will be extra kvm interface to maintain,
> > but not easy to judge because it may bring benefits indeed, like better
> > concurrency. Personally I'm just not sure (1) how important concurrency is
> > in this use case, and (2) whether we can improve uffd general code on
> > scalability.
> >
> > For (1), if the time to resolve a remote page fault is bottlenecked on the
> > network, concurrency may not matter a huge deal, IMHO. But I didn't really
> > do enough test over this area.
> >
> > For (2), something like:
> >
> > https://lore.kernel.org/r/20230905214235.320571-1-peterx@redhat.com
> >
> > I didn't continue that thread because QEMU doesn't use uffd as heavy, so no
> > rush to push that further from QEMU's perspective. However IMHO it'll
> > always be valuable to profile userfault to see whether the issues can be
> > resolved in more general ways. One thing that can happen is that we
> > explored uffd enough so we may find out the bottleneck that we cannot avoid
> > due to uffd's design, but IIUC that work hasn't yet been done by anyone,
> > IOW there's still chance to me to provide a generic solution.
> >
> > > (d) A notification mechanism and wait queue to coordinate KVM
> > > accesses to non-present GFNs.
> >
> > Probably uffd's wait queue to be reimplemented more or less.
> >
> > Is this only used when there's no vcpu thread context? I remember Anish's
> > other proposal on vcpu exit can already achieve similar without the queue.
> >
> > > (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
> > > memory becomes present.
> >
> > This interface will also be needed if with userfaultfd, but if with uffd
> > it'll be a common interface so can be used outside VM context.
> >
> > >
> > > The actual implementation within Google has a lot of warts that I
> > > won't get into... but I think we could have a pretty clean upstream
> > > solution.
> > >
> > > In fact, a lot of the infrastructure needed to support this design is
> > > already in-flight upstream. e.g. (a) and (b) could be built on top of
> > > the new memory attributes (although I have concerns about the
> > > performance of using xarray vs. bitmaps), (c) can be built on top of
> > > the memory-fault exiting. The most complex piece of new code would be
> > > the notification mechanism for (d). Within Google we've been using a
> > > netlink socket, but I think we should use a custom file descriptor
> > > instead.
> > >
> > > If we do it right, almost no architecture-specific support is needed.
> > > Just a small bit in the page fault path (for (c) and to account for
> > > the present bitmap when determining what (huge)page size to map).
> > >
> > > The most painful part of carrying KVM Demand Paging out-of-tree has
> > > been maintaining the hooks for (d). But this has been mostly
> > > self-inflicted. We started out by manually annotating all of the code
> > > where KVM reads/writes guest memory. But there are more core routines
> > > that all guest-memory accesses go through (e.g. __gfn_to_hva_many())
> > > where we could put a single hook, and then KVM just has to make sure
> >
> > It's great to know (d) is actually not a real problem, however..
> >
> > > to invalidate an gfn-to-hva/pfn caches and SPTEs when a page becomes
> > > non-present (which is rare and typically only happens before a vCPU
> > > starts running). And hooking KVM accesses to guest memory isn't
> > > exactly new, KVM already manually tracks all writes to keep the dirty
> > > log up to date.
> >
> > .. what about all the other kernel modules that can directly access the
> > guest memory without KVM APIs, like, vhost? Does all of them need to
> > implement similar things?
> >
> > >
> > > So why merge a KVM-specific alternative to UserfaultFD?
> > >
> > > Taking a step back, let's look at what UserfaultFD is actually
> > > providing for KVM VMs:
> > >
> > > 1. Coordination of userspace accesses to guest memory.
> > > 2. Coordination of KVM+guest accesses to guest memory.
> > >
> > > (1.) technically does not need kernel support. It's possible to solve
> > > this problem in userspace, and likely can be more efficient to solve
> > > it in userspace because you have more flexibility and can avoid
> > > bouncing through the kernel page fault handler. And it's not
> > > unreasonable to expect VMMs to support this. VMMs already need to
> > > manually intercept userspace _writes_ to guest memory to implement
> > > dirty tracking efficiently. It's a small step beyond that to intercept
> > > both reads and writes for post-copy. And VMMs are increasingly
> > > multi-process. UserfaultFD provides coordination within a process but
> > > VMMs already need to deal with coordinating across processes already.
> > > i.e. UserfaultFD is only solving part of the problem for (1.).
> > >
> > > The KVM-specific approach is basically to provide kernel support for
> > > (2) and let userspace solve (1) however it likes.
> >
> > It's slightly unfortunate to QEMU and other userspace hypervisors in this
> > case, because it means even if the new interface will be merged, each
> > community will need to add support for it for the same postcopy feature,
> > and I'm not 100% sure on the complexity at least for QEMU.
> >
> > I think this is not a major concern if purely judging from KVM perspective,
> > indeed, as long as the solution supercedes the current one, so I think it's
> > still okay to do so. Meanwhile there is also the other option to let
> > whatever userspace (QEMU, etc.) keeps using userfaultfd, so KVM will have
> > two solutions for VM postcopy, which is also not fully unacceptable,
> > either. In all cases, before deciding to go this way, IMHO it'll be a nice
> > gesture to consider the side effects to other communities, like QEMU, that
> > heavily consumes KVM.
> >
> > >
> > > But if UserfaultFD solves (1) and (2), why introduce a KVM feature
> > > that solves only (2)?
> > >
> > > Well, UserfaultFD has some very real downsides:
> > >
> > > * Lack of sufficient HugeTLB Support: The most recent and glaring
> > > problem is upstream's NACK of HugeTLB High Granularity Mapping [1].
> > > Without HGM, UserfaultFD can only handle HugeTLB faults at huge
> > > page granularity. i.e. If a VM is backed with 1GiB HugeTLB, then
> > > UserfaultFD can only handle 1GiB faults. Demand-fetching 1GiB of
> > > memory from a remote host during the post-copy phase of live
> > > migration is untenable. Even 2MiB fetches are painful with most
> > > current NICs. In effect, there is no line-of-sight on an upstream
> > > solution for post-copy live migration for VMs backed with HugeTLB.
> >
> > Indeed I didn't see any patch from James anymore for that support. Does it
> > mean that the project will be discontinued?
> >
> > Personally I still think this is the right way to go, that it'll be good if
> > hugetlb pages can be split in any context even outside hypervisors.
> >
> > I don't think the community is NACKing the solution to allow hugetlb to
> > split, it's not merged sololy because the specialty of hugetlbfs, and HGM
> > just happened at that stage of time.
> >
> > I had a feeling that it'll be resolved sooner or later, even without a
> > hugetlb v2, maybe? That "v2" suggestion seems to be the final conclusion
> > per the last lsfmm 2023 conference; however I don't know whether the
> > discussion continued anywhere else, and I think that not all the ways are
> > explored, and maybe we can work it out with current hugetlb code base in
> > some form so that the community will be happy to allow hugetlb add new
> > features again.
>
> I can at least add my perspective here. Note it may be quite different
> from James', I'm only speaking for myself here.
>
> I agree HGM wasn't explicitly NACKed, but at the same time I don't
> feel we ever got to consensus around a path for it to be merged. There
> were some good suggestions from you Peter and others around
> improvements to move in that direction, but it remains unclear to me
> at least a) at what point will things be "unified enough" that new
> features can be merged? and b) what path towards unification should we
> take - incremental improvements? build a hugetlbfs v2? something else?
Incremental should always be preferred if possible.
For a), my answer would be to start with pgtable management. I
actually commented on this a few months ago and didn't get a response:
https://lore.kernel.org/all/ZItfYxrTNL4Mu%2Flo@x1n/#r
I haven't read HGM for a long time, but AFAIR a few hundred LOC lie in
the pgtable walking changes, which should probably be counted as
"adding complexity" if we say hugetlb will one day converge with core
mm. That's one (out of the many issues) that Matthew listed in his
slides yesterday as something hugetlb needs an eye kept on for
convergence.
Does that mean this might be a good spot to pay some more attention
to? I know this goes back to the very early stage where we were
discussing the best way to walk hugetlb pgtables knowing that we can
map 4K over a 2M, but I think it may be slightly different now: at
least we're clearer that we want to merge this with core mm.
I think it implies the possibility of mostly deprecating
huge_pte_offset(). James/Mike/anyone, have any of you looked into that
area? Would the above make any sense at all?
But that's really only my 2 cents. It could be that I'm missing
something important and this just won't work. If James has stopped
working on it, I'm happy to spend some time exploring in this direction
and see what I can get. It'll still take some time, though.. so
hopefully nothing needs to be blocked on this.
>
> In other words, I think getting HGM merged upstream involves a
> potentially large / unbounded amount of work, and the outcome is
> uncertain (e.g. one can imagine a scenario where some incremental
> improvements are made, but HGM still seems largely unpalatable to the
> community despite this).
>
> So, I think the HGM state of affairs made alternatives like what David
> is talking about in this thread much more attractive than they were
> when we started working on HGM. :/
Yeah, understood.
>
> >
> > >
> > > * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > > guest memory for the userspace page table entries.
> >
> > What is this one?
>
> In the way we use userfaultfd, there are two shared userspace mappings
> - one non-UFFD registered one which is used to resolve demand paging
> faults, and another UFFD-registered one which is handed to KVM et al
> for the guest to use. I think David is talking about the "second"
> mapping as overhead here, since with the KVM-based approach he's
> describing we don't need that mapping.
I see, but then isn't that a userspace implementation detail? IMHO we
should discuss the proposal based only on the design itself, rather
than relying on details of possible userspace implementations, if the
two mappings are optional rather than required.
>
> >
> > >
> > > * CPU Overhead: UserfaultFD has to manipulate userspace page tables to
> > > split mappings down to PAGE_SIZE, handle PAGE_SIZE'd faults, and,
> > > later, collapse mappings back into huge pages. These manipulations take
> > > locks like mmap_lock, page locks, and page table locks.
> >
> > Indeed this can be a problem, however this is also the best part of
> > userfaultfd on the cleaness of the whole design, that it unifies everything
> > into the core mm, and it's already there. That's also why it can work with
> > kvm/vhost/userapp/whatever as long as the page is accessed in whatever
> > form.
> >
> > >
> > > * Complexity: UserfaultFD-based demand paging depends on functionality
> > > across multiple subsystems in the kernel including Core MM, KVM, as
> > > well as the each of the memory filesystems (tmpfs, HugeTLB, and
> > > eventually guest_memfd). Debugging problems requires
> > > knowledge across many domains that many engineers do not have. And
> > > solving problems requires getting buy-in from multiple subsystem
> > > maintainers that may not all be aligned (see: HGM).
> >
> > I'll put HGM related discussion in above bullet. OTOH, I'd consider
> > "debugging problem requires knowledge across many domains" not as valid as
> > reasoning. At least this is not a pure technical point, and I think we
> > should still stick with technical comparisons on the solutions.
> >
> > >
> > > All of these are addressed with a KVM-specific solution. A
> > > KVM-specific solution can have:
> > >
> > > * Transparent support for any backing memory subsystem (tmpfs,
> > > HugeTLB, and even guest_memfd).
> >
> > I'm curious how hard would it be to allow guest_memfd support userfaultfd.
> > David, do you know?
> >
> > The rest are already supported by uffd so I assume not a major problem.
> >
> > > * Only 1 bit of overhead per page of guest memory.
> > > * No need to modify host page tables.
> > > * All code contained within KVM.
> > > * Significantly fewer LOC than UserfaultFD.
> >
> > We're not planning to remove uffd from mm even if this is merged.. right?
> > Could you elaborate?
>
> Also, if we did make improvements for UFFD for this use case, other
> (non-VM) use cases may also benefit from those same improvements,
> since UFFD is so general.
Agreed. Providing alternative solutions definitely carries the risk of
a maintenance burden in the future.
>
> >
> > >
> > > Ok, that's the pitch. What are your thoughts?
> > >
> > > [1] https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
> >
> > How about hwpoison in no-KVM contexts? I remember Mike Kravetz mentioned
> > about that in the database usages, and IIRC Google also mentioned the
> > interest in that area at least before.
> >
> > Copy Mike for that; copy Andrea for everything.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
>
--
Peter Xu
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-06 23:03 ` Peter Xu
@ 2023-11-06 23:22 ` David Matlack
2023-11-07 14:21 ` Peter Xu
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-06 23:22 UTC (permalink / raw)
To: Peter Xu
Cc: Axel Rasmussen, Paolo Bonzini, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
On Mon, Nov 6, 2023 at 3:03 PM Peter Xu <peterx@redhat.com> wrote:
> On Mon, Nov 06, 2023 at 02:24:13PM -0800, Axel Rasmussen wrote:
> > On Mon, Nov 6, 2023 at 12:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > >
> > > > * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > > > guest memory for the userspace page table entries.
> > >
> > > What is this one?
> >
> > In the way we use userfaultfd, there are two shared userspace mappings
> > - one non-UFFD registered one which is used to resolve demand paging
> > faults, and another UFFD-registered one which is handed to KVM et al
> > for the guest to use. I think David is talking about the "second"
> > mapping as overhead here, since with the KVM-based approach he's
> > describing we don't need that mapping.
>
> I see, but then is it userspace relevant? IMHO we should discuss the
> proposal based only on the design itself, rather than relying on any
> details on possible userspace implementations if two mappings are not
> required but optional.
What I mean here is that for UserfaultFD to track accesses at
PAGE_SIZE granularity, it requires 1 PTE per page, i.e. 8 bytes per
page, versus the KVM-based approach, which only requires 1 bit per page
for the present bitmap. This is inherent in the design of UserfaultFD,
because it uses PTEs to track what is present; it is not specific to
how we use UserfaultFD.
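For scale (rough arithmetic, assuming 4KiB base pages): a 1TiB guest
has 2^28 pages, so the last-level PTEs alone cost 2^28 * 8B = 2GiB per
mapping, while a present bitmap costs 2^28 bits = 32MiB.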
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-06 23:22 ` David Matlack
@ 2023-11-07 14:21 ` Peter Xu
2023-11-07 16:11 ` James Houghton
0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-07 14:21 UTC (permalink / raw)
To: David Matlack
Cc: Axel Rasmussen, Paolo Bonzini, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
On Mon, Nov 06, 2023 at 03:22:05PM -0800, David Matlack wrote:
> On Mon, Nov 6, 2023 at 3:03 PM Peter Xu <peterx@redhat.com> wrote:
> > On Mon, Nov 06, 2023 at 02:24:13PM -0800, Axel Rasmussen wrote:
> > > On Mon, Nov 6, 2023 at 12:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > > >
> > > > > * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > > > > guest memory for the userspace page table entries.
> > > >
> > > > What is this one?
> > >
> > > In the way we use userfaultfd, there are two shared userspace mappings
> > > - one non-UFFD registered one which is used to resolve demand paging
> > > faults, and another UFFD-registered one which is handed to KVM et al
> > > for the guest to use. I think David is talking about the "second"
> > > mapping as overhead here, since with the KVM-based approach he's
> > > describing we don't need that mapping.
> >
> > I see, but then is it userspace relevant? IMHO we should discuss the
> > proposal based only on the design itself, rather than relying on any
> > details on possible userspace implementations if two mappings are not
> > required but optional.
>
> What I mean here is that for UserfaultFD to track accesses at
> PAGE_SIZE granularity, that requires 1 PTE per page, i.e. 8 bytes per
> page. Versus the KVM-based approach which only requires 1 bit per page
> for the present bitmap. This is inherent in the design of UserfaultFD
> because it uses PTEs to track what is present, not specific to how we
> use UserfaultFD.
Shouldn't userspace normally still maintain one virtual mapping for
the guest address range anyway? IIUC, KVM still relies a lot on HVAs
to work (at least before guest_memfd), e.g. KVM_SET_USER_MEMORY_REGION,
or mmu notifiers. If so, those 8 bytes should be there with or without
userfaultfd, IIUC.
Also, I think that's not strictly needed for any kind of file-backed
memory, as in those cases userfaultfd works with the page cache.
Thanks,
--
Peter Xu
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 14:21 ` Peter Xu
@ 2023-11-07 16:11 ` James Houghton
2023-11-07 17:24 ` Peter Xu
0 siblings, 1 reply; 34+ messages in thread
From: James Houghton @ 2023-11-07 16:11 UTC (permalink / raw)
To: Peter Xu
Cc: David Matlack, Axel Rasmussen, Paolo Bonzini, kvm list,
Sean Christopherson, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
On Tue, Nov 7, 2023 at 6:22 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Nov 06, 2023 at 03:22:05PM -0800, David Matlack wrote:
> > On Mon, Nov 6, 2023 at 3:03 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Mon, Nov 06, 2023 at 02:24:13PM -0800, Axel Rasmussen wrote:
> > > > On Mon, Nov 6, 2023 at 12:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > > > >
> > > > > > * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > > > > > guest memory for the userspace page table entries.
> > > > >
> > > > > What is this one?
> > > >
> > > > In the way we use userfaultfd, there are two shared userspace mappings
> > > > - one non-UFFD registered one which is used to resolve demand paging
> > > > faults, and another UFFD-registered one which is handed to KVM et al
> > > > for the guest to use. I think David is talking about the "second"
> > > > mapping as overhead here, since with the KVM-based approach he's
> > > > describing we don't need that mapping.
> > >
> > > I see, but then is it userspace relevant? IMHO we should discuss the
> > > proposal based only on the design itself, rather than relying on any
> > > details on possible userspace implementations if two mappings are not
> > > required but optional.
> >
> > What I mean here is that for UserfaultFD to track accesses at
> > PAGE_SIZE granularity, that requires 1 PTE per page, i.e. 8 bytes per
> > page. Versus the KVM-based approach which only requires 1 bit per page
> > for the present bitmap. This is inherent in the design of UserfaultFD
> > because it uses PTEs to track what is present, not specific to how we
> > use UserfaultFD.
>
> Shouldn't the userspace normally still maintain one virtual mapping anyway
> for the guest address range? As IIUC kvm still relies a lot on HVA to work
> (at least before guest memfd)? E.g., KVM_SET_USER_MEMORY_REGION, or mmu
> notifiers. If so, that 8 bytes should be there with/without userfaultfd,
> IIUC.
>
> Also, I think that's not strictly needed for any kind of file memories, as
> in those case userfaultfd works with page cache.
This extra ~8 bytes per page overhead is real, and it is the
theoretical maximum additional overhead that userfaultfd would require
over a KVM-based demand paging alternative when we are using
hugepages. Consider the case where we are using THPs and have just
finished post-copy, and we haven't done any collapsing yet:
For userfaultfd: because we have UFFDIO_COPY'd or UFFDIO_CONTINUE'd at
4K (because we demand-fetched at 4K), the userspace page tables are
entirely shattered. KVM has no choice but to have an entirely
shattered second-stage page table as well.
For KVM demand paging: the userspace page tables can remain entirely
populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
because we have only just finished post-copy and haven't started
collapsing yet.
So both systems end up with a shattered second stage page table, but
userfaultfd has a shattered userspace page table as well (+8 bytes/4K
if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.) and that
is where the extra overhead comes from.
The second mapping of guest memory that we use today (through which we
install memory), given that we are using hugepages, will use PMDs and
PUDs, so the overhead is minimal.
Hope that clears things up!
Thanks,
James
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-06 20:23 ` Peter Xu
2023-11-06 22:24 ` Axel Rasmussen
@ 2023-11-07 16:25 ` Paolo Bonzini
2023-11-07 20:04 ` David Matlack
2023-11-07 22:29 ` Peter Xu
1 sibling, 2 replies; 34+ messages in thread
From: Paolo Bonzini @ 2023-11-07 16:25 UTC (permalink / raw)
To: Peter Xu, David Matlack
Cc: kvm list, Sean Christopherson, James Houghton, Oliver Upton,
Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On 11/6/23 21:23, Peter Xu wrote:
> On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
>> Hi Paolo,
>>
>> I'd like your feedback on whether you would merge a KVM-specific
>> alternative to UserfaultFD.
I'm replying to Peter's message because he already brought up some
points that I'd have made...
>> (b) UAPIs for marking GFNs present and non-present.
>
> Similar, this is something bound to above bitmap design, and not needed for
> uffd. Extra interface?
We already use fallocate APIs to mark GFNs non-present in guest_memfd;
and we also use them to mark GFNs present, but that would not work for
atomic copy-and-allocate. This UAPI could be pwrite() or an ioctl().
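For reference, the existing path is just fallocate() on the guest_memfd
fd, roughly (the fd/offset/len names are illustrative):

/* Mark a range non-present by punching a hole in the guest_memfd: */
fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
          offset, len);

/* Mark it allocated again (but with no way to atomically provide the
 * data, hence the need for a pwrite()/ioctl()-style UAPI): */
fallocate(gmem_fd, 0, offset, len);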
>> (c) KVM_RUN support for returning to userspace on guest page faults
>> to non-present GFNs.
>
> For (1), if the time to resolve a remote page fault is bottlenecked on the
> network, concurrency may not matter a huge deal, IMHO.
That's likely, and it means we could simply extend
KVM_EXIT_MEMORY_FAULT. However, we need to be careful not to have a
maze of twisty APIs, all different.
>> (d) A notification mechanism and wait queue to coordinate KVM
>> accesses to non-present GFNs.
>
> Probably uffd's wait queue to be reimplemented more or less.
> Is this only used when there's no vcpu thread context? I remember Anish's
> other proposal on vcpu exit can already achieve similar without the queue.
I think this synchronization can be done mostly in userspace, at least
on x86 (just like we got rid of the global VM-level dirty ring). But it
remains a problem on Arm.
>> (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
>> memory becomes present.
>
> This interface will also be needed if with userfaultfd, but if with uffd
> it'll be a common interface so can be used outside VM context.
And it can be a generic API anyway (could be fadvise).
>> So why merge a KVM-specific alternative to UserfaultFD?
>>
>> Taking a step back, let's look at what UserfaultFD is actually
>> providing for KVM VMs:
>>
>> 1. Coordination of userspace accesses to guest memory.
>> 2. Coordination of KVM+guest accesses to guest memory.
>>
>> VMMs already need to
>> manually intercept userspace _writes_ to guest memory to implement
>> dirty tracking efficiently. It's a small step beyond that to intercept
>> both reads and writes for post-copy. And VMMs are increasingly
>> multi-process. UserfaultFD provides coordination within a process but
>> VMMs already need to deal with coordinating across processes already.
>> i.e. UserfaultFD is only solving part of the problem for (1.).
This is partly true but it is missing non-vCPU kernel accesses, and it's
what worries me the most if you propose this as a generic mechanism. My
gut feeling even without reading everything was (and it was confirmed
after): I am open to merging some specific features that close holes in
the userfaultfd API, but in general I like the unification between
guest, userspace *and kernel* accesses that userfaultfd brings. The fact
that it includes VGIC on Arm is a cherry on top. :)
For things other than guest_memfd, I want to ask Peter & co. if there
could be a variant of userfaultfd that is better integrated with memfd,
and solve the multi-process VMM issue. For example, maybe a
userfaultfd-like mechanism for memfd could handle missing faults from
_any_ VMA for the memfd.
However, guest_memfd could be a good usecase for the mechanism that you
suggest. Currently guest_memfd cannot be mapped in userspace pages. As
such it cannot be used with userfaultfd. Furthermore, because it is
only mapped by hypervisor page tables, or written via hypervisor APIs,
guest_memfd can easily track presence at 4KB granularity even if backed
by huge pages. That could be a point in favor of a KVM-specific solution.
Also, even if we envision mmap() support as one of the future extensions
of guest_memfd, that does not mean you can use it together with
userfaultfd. For example, if we had restrictedmem-backed guest_memfd,
or non-struct-page-backed guest_memfd, mmap() would be creating a
VM_PFNMAP area.
Once you have the implementation done for guest_memfd, it is interesting
to see how easily it extends to other, userspace-mappable kinds of
memory. But I still dislike the fact that you need some kind of extra
protocol in userspace, for multi-process VMMs. This is the kind of
thing that the kernel is supposed to facilitate. I'd like it to do
_more_ of that (see above memfd pseudo-suggestion), not less.
>> All of these are addressed with a KVM-specific solution. A
>> KVM-specific solution can have:
>>
>> * Transparent support for any backing memory subsystem (tmpfs,
>> HugeTLB, and even guest_memfd).
>
> I'm curious how hard would it be to allow guest_memfd support userfaultfd.
> David, do you know?
Did I answer above? I suppose you'd have something along the lines of
vma_is_shmem() added to vma_can_userfault; or possibly add something to
vm_ops to bridge the differences.
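(A minimal sketch of that direction, for concreteness; vma_is_guest_memfd()
is hypothetical, and the existing check lives in vma_can_userfault() in
include/linux/userfaultfd_k.h.)
/* Sketch only: vma_is_guest_memfd() does not exist; a real version would
 * check whether the VMA's backing file is a guest_memfd (or, per the
 * alternative above, a new vm_ops hook would let the backing file answer). */
#include <linux/hugetlb.h>
#include <linux/mm.h>
#include <linux/shmem_fs.h>

static inline bool vma_is_guest_memfd(struct vm_area_struct *vma)
{
	return false;	/* placeholder for a real file-type check */
}

static inline bool vma_can_userfault_sketch(struct vm_area_struct *vma,
					    unsigned long vm_flags)
{
	/* Roughly what the existing check boils down to today... */
	if (vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
	    vma_is_shmem(vma))
		return true;

	/* ...plus the hypothetical guest_memfd case. */
	return vma_is_guest_memfd(vma);
}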
> The rest are already supported by uffd so I assume not a major problem.
Userfaultfd is kinda unusable for 1GB pages so I'm not sure I'd include
it in the "already works" side, but yeah.
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 16:11 ` James Houghton
@ 2023-11-07 17:24 ` Peter Xu
2023-11-07 19:08 ` James Houghton
0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-07 17:24 UTC (permalink / raw)
To: James Houghton
Cc: David Matlack, Axel Rasmussen, Paolo Bonzini, kvm list,
Sean Christopherson, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
On Tue, Nov 07, 2023 at 08:11:09AM -0800, James Houghton wrote:
> This extra ~8 bytes per page overhead is real, and it is the
> theoretical maximum additional overhead that userfaultfd would require
> over a KVM-based demand paging alternative when we are using
> hugepages. Consider the case where we are using THPs and have just
> finished post-copy, and we haven't done any collapsing yet:
>
> For userfaultfd: because we have UFFDIO_COPY'd or UFFDIO_CONTINUE'd at
> 4K (because we demand-fetched at 4K), the userspace page tables are
> entirely shattered. KVM has no choice but to have an entirely
> shattered second-stage page table as well.
>
> For KVM demand paging: the userspace page tables can remain entirely
> populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
> because we have only just finished post-copy and haven't started
> collapsing yet.
>
> So both systems end up with a shattered second stage page table, but
> userfaultfd has a shattered userspace page table as well (+8 bytes/4K
> if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.) and that
> is where the extra overhead comes from.
>
> The second mapping of guest memory that we use today (through which we
> install memory), given that we are using hugepages, will use PMDs and
> PUDs, so the overhead is minimal.
>
> Hope that clears things up!
Ah I see, thanks James. Though, is this a real concern in production use,
considering the worst case is 0.2% overhead (all THP backed), it only exists
during postcopy, and only on the destination host?
In any case, I agree that's still a valid point, compared to a constant
1/32k consumption with a bitmap.
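(For anyone wanting to sanity-check the numbers: the 0.2% and 1/32k figures
fall straight out of one 8-byte PTE per 4 KiB page versus 1 bit per 4 KiB
page; a throwaway calculation:)
/* Throwaway arithmetic behind the figures above: a fully shattered page
 * table costs one 8-byte PTE per 4 KiB page, while a present bitmap costs
 * one bit per 4 KiB page. */
#include <stdio.h>

int main(void)
{
	const double page_bytes = 4096.0;

	/* Shattered last-level page table: 8 / 4096 ~= 0.2% of guest memory. */
	printf("shattered PTE overhead: %.3f%%\n", 100.0 * 8.0 / page_bytes);

	/* Present bitmap: 1 bit per page, i.e. 1/32768 of guest memory. */
	printf("present bitmap overhead: 1/%.0f (%.4f%%)\n",
	       page_bytes * 8.0, 100.0 / (page_bytes * 8.0));
	return 0;
}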
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 17:24 ` Peter Xu
@ 2023-11-07 19:08 ` James Houghton
0 siblings, 0 replies; 34+ messages in thread
From: James Houghton @ 2023-11-07 19:08 UTC (permalink / raw)
To: Peter Xu
Cc: David Matlack, Axel Rasmussen, Paolo Bonzini, kvm list,
Sean Christopherson, Oliver Upton, Mike Kravetz, Andrea Arcangeli,
Frank van der Linden
On Tue, Nov 7, 2023 at 9:24 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Nov 07, 2023 at 08:11:09AM -0800, James Houghton wrote:
> > This extra ~8 bytes per page overhead is real, and it is the
> > theoretical maximum additional overhead that userfaultfd would require
> > over a KVM-based demand paging alternative when we are using
> > hugepages. Consider the case where we are using THPs and have just
> > finished post-copy, and we haven't done any collapsing yet:
> >
> > For userfaultfd: because we have UFFDIO_COPY'd or UFFDIO_CONTINUE'd at
> > 4K (because we demand-fetched at 4K), the userspace page tables are
> > entirely shattered. KVM has no choice but to have an entirely
> > shattered second-stage page table as well.
> >
> > For KVM demand paging: the userspace page tables can remain entirely
> > populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
> > because we have only just finished post-copy and haven't started
> > collapsing yet.
> >
> > So both systems end up with a shattered second stage page table, but
> > userfaultfd has a shattered userspace page table as well (+8 bytes/4K
> > if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.) and that
> > is where the extra overhead comes from.
> >
> > The second mapping of guest memory that we use today (through which we
> > install memory), given that we are using hugepages, will use PMDs and
> > PUDs, so the overhead is minimal.
> >
> > Hope that clears things up!
>
> Ah I see, thanks James. Though, is this a real concern in production use,
> considering the worst case is 0.2% overhead (all THP backed), it only exists
> during postcopy, and only on the destination host?
Good question. In an ideal world, 0.2% of lost memory isn't a huge
deal, but it would be nice to save as much memory as possible. So I
see this overhead point as a nice win for a KVM-based solution, but it
is not a key deciding factor in what the right move is. (I think the
key deciding factor is: what is the best way to make post-copy work
for 1G pages?)
To elaborate a little more: For Google, I don't think the 0.2% loss is
a huge deal by itself (though I am not exactly an authority here).
There are other memory overheads like this that we have to deal with
anyway. The real challenge for us comes from the fact that we already
have a post-copy system that works and has less overhead. If we were
to replace KVM demand paging with userfaultfd, that means *regressing*
in efficiency/performance. That's the main practical challenge:
dealing with the regression. We have to make sure that VMs can still
be packed to the appropriate efficiency, things like that. At this
moment *I think* this is a solvable problem, but it would be nice to
avoid the problem entirely. But this is Google's problem; I don't
think this point should be the deciding factor here.
- James
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 16:25 ` Paolo Bonzini
@ 2023-11-07 20:04 ` David Matlack
2023-11-07 21:10 ` Oliver Upton
2023-11-07 22:29 ` Peter Xu
1 sibling, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-07 20:04 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Peter Xu, kvm list, Sean Christopherson, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Tue, Nov 7, 2023 at 8:25 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/6/23 21:23, Peter Xu wrote:
> > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> >>
> >> So why merge a KVM-specific alternative to UserfaultFD?
> >>
> >> Taking a step back, let's look at what UserfaultFD is actually
> >> providing for KVM VMs:
> >>
> >> 1. Coordination of userspace accesses to guest memory.
> >> 2. Coordination of KVM+guest accesses to guest memory.
> >>
> >> VMMs already need to
> >> manually intercept userspace _writes_ to guest memory to implement
> >> dirty tracking efficiently. It's a small step beyond that to intercept
> >> both reads and writes for post-copy. And VMMs are increasingly
> >> multi-process. UserfaultFD provides coordination within a process but
> >> VMMs already need to deal with coordinating across processes already.
> >> i.e. UserfaultFD is only solving part of the problem for (1.).
>
> This is partly true but it is missing non-vCPU kernel accesses, and it's
> what worries me the most if you propose this as a generic mechanism.
Non-vCPU accesses in KVM could still be handled with my proposal. But
I agree that non-KVM kernel accesses are a gap.
> My
> gut feeling even without reading everything was (and it was confirmed
> after): I am open to merging some specific features that close holes in
> the userfaultfd API, but in general I like the unification between
> guest, userspace *and kernel* accesses that userfaultfd brings. The fact
> that it includes VGIC on Arm is a cherry on top. :)
Can you explain how VGIC interacts with UFFD? I'd like to understand
if/how that could work with a KVM-specific solution.
>
> For things other than guest_memfd, I want to ask Peter & co. if there
> could be a variant of userfaultfd that is better integrated with memfd,
> and solve the multi-process VMM issue. For example, maybe a
> userfaultfd-like mechanism for memfd could handle missing faults from
> _any_ VMA for the memfd.
>
> However, guest_memfd could be a good usecase for the mechanism that you
> suggest. Currently guest_memfd cannot be mapped in userspace pages. As
> such it cannot be used with userfaultfd. Furthermore, because it is
> only mapped by hypervisor page tables, or written via hypervisor APIs,
> guest_memfd can easily track presence at 4KB granularity even if backed
> by huge pages. That could be a point in favor of a KVM-specific solution.
>
> Also, even if we envision mmap() support as one of the future extensions
> of guest_memfd, that does not mean you can use it together with
> userfaultfd. For example, if we had restrictedmem-backed guest_memfd,
> or non-struct-page-backed guest_memfd, mmap() would be creating a
> VM_PFNMAP area.
>
> Once you have the implementation done for guest_memfd, it is interesting
> to see how easily it extends to other, userspace-mappable kinds of
> memory. But I still dislike the fact that you need some kind of extra
> protocol in userspace, for multi-process VMMs. This is the kind of
> thing that the kernel is supposed to facilitate. I'd like it to do
> _more_ of that (see above memfd pseudo-suggestion), not less.
I was also thinking guest_memfd could be an avenue to solve the
multi-process issue. But a little differently from the way you described
(because I still want to find an upstream solution for HugeTLB-backed
VMs, if possible).
What I was thinking was that my proposal could be extended to
guest_memfd VMAs. The way my proposal works is that all KVM and guest
accesses would be guaranteed to go through the VM's present bitmaps,
but accesses through VMAs are not. But with guest_memfd, once we add
mmap() support, we have access to the struct kvm at the time that
mmap() is called and when handling page faults on the guest_memfd VMA.
So it'd be possible for guest_memfd to consult the present bitmap,
notify userspace on non-present pages, and wait for pages to become
present when handling faults. This means we could funnel all accesses
that go through VMAs (multi-process and non-KVM kernel accesses) into a
single notification mechanism. i.e. It solves the multi-process issue
and unifies guest, kernel, and userspace accesses. BUT, only for
guest_memfd.
So in the short term we could provide a partial solution for
HugeTLB-backed VMs (at least unblocking Google's use-case) and in the
long-term there's line of sight of a unified solution.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 20:04 ` David Matlack
@ 2023-11-07 21:10 ` Oliver Upton
2023-11-07 21:34 ` David Matlack
0 siblings, 1 reply; 34+ messages in thread
From: Oliver Upton @ 2023-11-07 21:10 UTC (permalink / raw)
To: David Matlack
Cc: Paolo Bonzini, Peter Xu, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Tue, Nov 07, 2023 at 12:04:21PM -0800, David Matlack wrote:
> On Tue, Nov 7, 2023 at 8:25 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
[...]
> > My
> > gut feeling even without reading everything was (and it was confirmed
> > after): I am open to merging some specific features that close holes in
> > the userfaultfd API, but in general I like the unification between
> > guest, userspace *and kernel* accesses that userfaultfd brings. The fact
> > that it includes VGIC on Arm is a cherry on top. :)
>
> Can you explain how VGIC interacts with UFFD? I'd like to understand
> if/how that could work with a KVM-specific solution.
The VGIC implementation is completely unaware of the existence of UFFD,
which is rather elegant.
There is no ioctl that allows userspace to directly get/set the VGIC
state. Instead, when userspace wants to migrate a VM it needs to flush
the cached state out of KVM's representation into guest memory. I would
expect the VMM to do this right before collecting the final dirty
bitmap.
If UFFD is off the table then it would appear there are two options:
- Instrument these ioctls to request pages not marked as present in the
theorized KVM-owned demand paging interface
- Mandate that userspace has transferred all of the required VGIC / ITS
pages before resuming on the target
The former increases the maintenance burden of supporting post-copy
upstream and the latter *will* fail spectacularly. Ideally we use a
mechanism that doesn't require us to think about instrumenting
post-copy for every new widget that we will want to virtualize.
> So in the short term we could provide a partial solution for
> HugeTLB-backed VMs (at least unblocking Google's use-case) and in the
> long-term there's line of sight of a unified solution.
Who do we expect to look after the upstreamed short-term solution once
Google has moved on to something else?
--
Thanks,
Oliver
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 21:10 ` Oliver Upton
@ 2023-11-07 21:34 ` David Matlack
2023-11-08 1:27 ` Oliver Upton
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-07 21:34 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Peter Xu, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Tue, Nov 7, 2023 at 1:10 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Nov 07, 2023 at 12:04:21PM -0800, David Matlack wrote:
> > On Tue, Nov 7, 2023 at 8:25 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> [...]
>
> > > My
> > > gut feeling even without reading everything was (and it was confirmed
> > > after): I am open to merging some specific features that close holes in
> > > the userfaultfd API, but in general I like the unification between
> > > guest, userspace *and kernel* accesses that userfaultfd brings. The fact
> > > that it includes VGIC on Arm is a cherry on top. :)
> >
> > Can you explain how VGIC interacts with UFFD? I'd like to understand
> > if/how that could work with a KVM-specific solution.
>
> The VGIC implementation is completely unaware of the existence of UFFD,
> which is rather elegant.
>
> There is no ioctl that allows userspace to directly get/set the VGIC
> state. Instead, when userspace wants to migrate a VM it needs to flush
> the cached state out of KVM's representation into guest memory. I would
> expect the VMM to do this right before collecting the final dirty
> bitmap.
Thanks Oliver. Maybe I'm being dense but I'm still not understanding
how VGIC and UFFD interact :). I understand that VGIC is unaware of
UFFD, but fundamentally they must interact in some way during
post-copy. Can you spell out the sequence of events?
>
> If UFFD is off the table then it would appear there are two options:
>
> - Instrument these ioctls to request pages not marked as present in the
> theorized KVM-owned demand paging interface
>
> - Mandate that userspace has transferred all of the required VGIC / ITS
> pages before resuming on the target
>
> The former increases the maintenance burden of supporting post-copy
> upstream and the latter *will* fail spectacularly. Ideally we use a
> mechanism that doesn't require us to think about instrumenting
> post-copy for every new widget that we will want to virtualize.
>
> > So in the short term we could provide a partial solution for
> > HugeTLB-backed VMs (at least unblocking Google's use-case) and in the
> > long-term there's line of sight of a unified solution.
>
> Who do we expect to look after the upstreamed short-term solution once
> Google has moved on to something else?
Note, the proposed long-term solution you are replying to is an
extension of the short-term solution, not something else.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 16:25 ` Paolo Bonzini
2023-11-07 20:04 ` David Matlack
@ 2023-11-07 22:29 ` Peter Xu
2023-11-09 16:41 ` David Matlack
1 sibling, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-07 22:29 UTC (permalink / raw)
To: Paolo Bonzini
Cc: David Matlack, kvm list, Sean Christopherson, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
Paolo,
On Tue, Nov 07, 2023 at 05:25:06PM +0100, Paolo Bonzini wrote:
> On 11/6/23 21:23, Peter Xu wrote:
> > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > Hi Paolo,
> > >
> > > I'd like your feedback on whether you would merge a KVM-specific
> > > alternative to UserfaultFD.
>
> I'm replying to Peter's message because he already brought up some points that
> I'd have made...
>
> > > (b) UAPIs for marking GFNs present and non-present.
> >
> > Similar, this is something bound to above bitmap design, and not needed for
> > uffd. Extra interface?
>
> We already use fallocate APIs to mark GFNs non-present in guest_memfd; and
> we also use them to mark GFNs present but it would not work to do that for
> atomic copy-and-allocate. This UAPI could be pwrite() or an ioctl().
Agree.
>
> > > (c) KVM_RUN support for returning to userspace on guest page faults
> > > to non-present GFNs.
> >
> > For (1), if the time to resolve a remote page fault is bottlenecked on the
> > network, concurrency may not matter a huge deal, IMHO.
>
> That's likely, and it means we could simply extend KVM_EXIT_MEMORY_FAULT.
> However, we need to be careful not to have a maze of twisty APIs, all
> different.
>
> > > (d) A notification mechanism and wait queue to coordinate KVM
> > > accesses to non-present GFNs.
> >
> > Probably uffd's wait queue to be reimplemented more or less.
> > Is this only used when there's no vcpu thread context? I remember Anish's
> > other proposal on vcpu exit can already achieve similar without the queue.
>
> I think this synchronization can be done mostly in userspace, at least on
> x86 (just like we got rid of the global VM-level dirty ring). But it remains
> a problem on Arm.
My memory was that ARM was not the only outlier? We probably need to
reach a consensus on whether we should consider ARM and no-vcpu context
from the start of the design.
>
> > > (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
> > > memory becomes present.
> >
> > This interface will also be needed if with userfaultfd, but if with uffd
> > it'll be a common interface so can be used outside VM context.
>
> And it can be a generic API anyway (could be fadvise).
IMHO it should depend on whether multiple KVM instances will have the same
requirement when they are attached to the same gmemfd and we want to
collapse small folios into a large one for that gmemfd.
If my understanding is correct that the "N kvm <-> 1 gmemfd" idea was mostly
planned for kvm live upgrade across old<->new modules, then all KVMs do have
the same goal for the large folio, and fadvise() seems proper to me.
>
> > > So why merge a KVM-specific alternative to UserfaultFD?
> > >
> > > Taking a step back, let's look at what UserfaultFD is actually
> > > providing for KVM VMs:
> > >
> > > 1. Coordination of userspace accesses to guest memory.
> > > 2. Coordination of KVM+guest accesses to guest memory.
> > >
> > > VMMs already need to
> > > manually intercept userspace _writes_ to guest memory to implement
> > > dirty tracking efficiently. It's a small step beyond that to intercept
> > > both reads and writes for post-copy. And VMMs are increasingly
> > > multi-process. UserfaultFD provides coordination within a process but
> > > VMMs already need to deal with coordinating across processes already.
> > > i.e. UserfaultFD is only solving part of the problem for (1.).
>
> This is partly true but it is missing non-vCPU kernel accesses, and it's
> what worries me the most if you propose this as a generic mechanism. My gut
> feeling even without reading everything was (and it was confirmed after): I
> am open to merging some specific features that close holes in the
> userfaultfd API, but in general I like the unification between guest,
> userspace *and kernel* accesses that userfaultfd brings. The fact that it
> includes VGIC on Arm is a cherry on top. :)
>
> For things other than guest_memfd, I want to ask Peter & co. if there could
> be a variant of userfaultfd that is better integrated with memfd, and solve
> the multi-process VMM issue. For example, maybe a userfaultfd-like
> mechanism for memfd could handle missing faults from _any_ VMA for the
> memfd.
On "why uffd is per-vma": I never confirmed with Andrea on the current
design of uffd, but IMO it makes sense to make it per-process from security
pov, otherwise it's more a risk to the whole system: consider an attacker
open a fake logfile under shmem, waiting for another proc to open it and
control page fault of it. The current uffd design requires each process to
be voluntary into userfaultfd involvements by either invoking syscall(uffd)
or open(/dev/userfaultfd). People are obviously worried about uffd safety
already even with current conservative per-mm design, so as to introduce
unprileged_userfaultfd, /dev/userfaultfd, selinux rules, etc..
I think memfd can be a good candidate for starting to support file-based
uffd if we want to go that far, because memfd is anonymous by default, which
at least makes hijacking much harder. The example above won't apply to memfd
due to that anonymity: it requires the target to take the memfd and mmap()
it into its address space, which is much harder. IOW, the "voluntary" part
moves from the uffd desc to the memfd desc.
What I don't know is whether it'd be worthwhile to go with such a design.
Currently, even with per-mm uffd, the complexity is mostly manageable to me:
each proc needs to allocate its own uffd, register it, and deliver it to the
host process (e.g. in QEMU's case, openvswitch delivers that to QEMU).
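(To make that flow concrete, a minimal sketch, error handling elided: a
non-vCPU process creates its own uffd, registers its mapping of guest memory
for missing faults, and hands the descriptor over a UNIX socket to the
process that actually serves the faults.)
/* Minimal sketch (error handling elided): a helper process creates its own
 * userfaultfd, registers its mapping of guest memory for MISSING faults,
 * and passes the fd to the fault-serving process over a UNIX socket. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static void share_uffd(void *guest_mem, size_t len, int unix_sock)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)guest_mem, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Hand the fd to the process that serves the faults (SCM_RIGHTS). */
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type  = SCM_RIGHTS;
	cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &uffd, sizeof(int));
	sendmsg(unix_sock, &msg, 0);
}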
>
> However, guest_memfd could be a good usecase for the mechanism that you
> suggest. Currently guest_memfd cannot be mapped in userspace pages. As
> such it cannot be used with userfaultfd. Furthermore, because it is only
> mapped by hypervisor page tables, or written via hypervisor APIs,
> guest_memfd can easily track presence at 4KB granularity even if backed by
> huge pages. That could be a point in favor of a KVM-specific solution.
>
> Also, even if we envision mmap() support as one of the future extensions of
> guest_memfd, that does not mean you can use it together with userfaultfd.
> For example, if we had restrictedmem-backed guest_memfd, or
> non-struct-page-backed guest_memfd, mmap() would be creating a VM_PFNMAP
> area.
I think it's doable to support userfaultfd (or at least something like
userfaultfd) as long as we can trap at the fault time (e.g., the PFN should
be faulted dynamically in some form, rather than pfn mapped in mmap()).
AFAIK userfaultfd doesn't require page struct on its own.
>
> Once you have the implementation done for guest_memfd, it is interesting to
> see how easily it extends to other, userspace-mappable kinds of memory. But
> I still dislike the fact that you need some kind of extra protocol in
> userspace, for multi-process VMMs. This is the kind of thing that the
> kernel is supposed to facilitate. I'd like it to do _more_ of that (see
> above memfd pseudo-suggestion), not less.
Is that our future plan to extend gmemfd to normal memories?
I see that gmemfd manages folio on its own. I think it'll make perfect
sense if it's for use in CoCo context, where the memory is so special to be
generic anyway.
However, if we extend it to generic memory, I'm wondering how we support
the existing memory features of such memory, which already work with
KVM_SET_USER_MEMORY_REGION v1. To name some:
- numa awareness
- swapping
- cgroup
- punch hole (in a huge page, aka, thp split)
- cma allocations for huge pages / page migrations
- ...
I also haven't thought all the way through how other modules should consume
gmemfd yet; I raised that question previously regarding vhost.
E.g. AFAICT, vhost at least will also need an ioctl(VHOST_SET_MEM_TABLE2).
Does it mean that most of these features will be reimplemented in kvm in
some form? Or is there an easy way?
>
> > > All of these are addressed with a KVM-specific solution. A
> > > KVM-specific solution can have:
> > >
> > > * Transparent support for any backing memory subsystem (tmpfs,
> > > HugeTLB, and even guest_memfd).
> >
> > I'm curious how hard would it be to allow guest_memfd support userfaultfd.
> > David, do you know?
>
> Did I answer above? I suppose you'd have something along the lines of
> vma_is_shmem() added to vma_can_userfault; or possibly add something to
> vm_ops to bridge the differences.
Relying on "whether folio existed in gmemfd mapping" sounds very sane and
natural to identify data presence. However not yet clear on the rest to me.
Currently kvm_gmem_allocate() does look like the place where gmemfd folios
will be allocated, so that can be an unified hook point where we'd want
something besides a zeroed page, huge or small. But I'm not sure that's
the long term plan: it seems to me current fallocate(gmemfd) approach is
more for populating the mem object when VM starts.
A "can be relevant" question could be: what will happen if vhost would like
to DMA to a guest page that hasn't yet been transferred to the CoCo VM,
assuming we're during a postcopy process? It seems to me there's some page
request interface still missing for gmemfd, but I'm also not sure.
>
> > The rest are already supported by uffd so I assume not a major problem.
>
> Userfaultfd is kinda unusable for 1GB pages so I'm not sure I'd include it
> in the "already works" side, but yeah.
Ah, I never meant that 1G works when I left that comment. :) It's a
long-known issue that userfaultfd / postcopy is not usable on 1G pages.
I think it's indeed fair to assume it's just broken and won't be fixed, at
least in the near future.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 21:34 ` David Matlack
@ 2023-11-08 1:27 ` Oliver Upton
2023-11-08 16:56 ` David Matlack
0 siblings, 1 reply; 34+ messages in thread
From: Oliver Upton @ 2023-11-08 1:27 UTC (permalink / raw)
To: David Matlack
Cc: Paolo Bonzini, Peter Xu, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Tue, Nov 07, 2023 at 01:34:34PM -0800, David Matlack wrote:
> On Tue, Nov 7, 2023 at 1:10 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> Thanks Oliver. Maybe I'm being dense but I'm still not understanding
> how VGIC and UFFD interact :). I understand that VGIC is unaware of
> UFFD, but fundamentally they must interact in some way during
> post-copy. Can you spell out the sequence of events?
Well it doesn't help that my abbreviated explanation glosses over some
details. So here's the verbose explanation, and I'm sure Marc will have
a set of corrections too :) I meant there's no _explicit_ interaction
between UFFD and the various bits of GIC that need to touch guest
memory.
The GIC redistributors contain a set of MMIO registers that are
accessible through the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR
ioctls. Writes to these are reflected directly into the KVM
representation, no biggie there.
One of the registers (GICR_PENDBASER) is a pointer to guest memory,
containing a bitmap of pending LPIs managed by the redistributor. The
ITS takes this to the extreme, as it is effectively a bunch of page
tables for interrupts. All of this state actually lives in a KVM
representation, and is only flushed out to guest memory when explicitly
told to do so by userspace.
On the target, we reread all the info when rebuilding interrupt
translations when userspace calls KVM_DEV_ARM_ITS_RESTORE_TABLES. All of
these guest memory accesses go through kvm_read_guest() and I expect the
usual UFFD handling for non-present pages kicks in from there.
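(For reference, the flush/rebuild described above is driven from userspace by
a device-attribute ioctl on the ITS device fd; a sketch, using the arm64 KVM
uapi constants, of roughly what the VMM does on each side:)
/* Sketch of the VMM side of the ITS flush/rebuild. The guest-memory reads
 * triggered by the restore are where userfaultfd (or any demand-paging
 * scheme) has to kick in. Constants come from the arm64 KVM uapi, so this
 * only builds on arm64. */
#include <linux/kvm.h>
#include <sys/ioctl.h>

static int its_ctrl(int its_dev_fd, __u64 what)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_ARM_VGIC_GRP_CTRL,
		.attr  = what,
	};

	return ioctl(its_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}

/* Source, right before the final dirty-bitmap pass:
 *	its_ctrl(its_dev_fd, KVM_DEV_ARM_ITS_SAVE_TABLES);
 * Target, before resuming vCPUs:
 *	its_ctrl(its_dev_fd, KVM_DEV_ARM_ITS_RESTORE_TABLES);
 */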
> >
> > If UFFD is off the table then it would appear there are two options:
> >
> > - Instrument these ioctls to request pages not marked as present in the
> > theorized KVM-owned demand paging interface
> >
> > - Mandate that userspace has transferred all of the required VGIC / ITS
> > pages before resuming on the target
> >
> > The former increases the maintenance burden of supporting post-copy
> > upstream and the latter *will* fail spectacularly. Ideally we use a
> > mechanism that doesn't require us to think about instrumenting
> > post-copy for every new widget that we will want to virtualize.
> >
> > > So in the short term we could provide a partial solution for
> > > HugeTLB-backed VMs (at least unblocking Google's use-case) and in the
> > > long-term there's line of sight of a unified solution.
> >
> > Who do we expect to look after the upstreamed short-term solution once
> > Google has moved on to something else?
>
> Note, the proposed long-term solution you are replying to is an
> extension of the short-term solution, not something else.
Ack, I just feel rather strongly that the priority should be making
guest_memfd work with whatever post-copy scheme we devise. Once we settle
on a UAPI that works for the new and shiny thing, it's easier to
rationalize applying the UAPI change to other memory backing types.
--
Thanks,
Oliver
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 1:27 ` Oliver Upton
@ 2023-11-08 16:56 ` David Matlack
2023-11-08 17:34 ` Peter Xu
2023-11-08 20:33 ` Paolo Bonzini
0 siblings, 2 replies; 34+ messages in thread
From: David Matlack @ 2023-11-08 16:56 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Peter Xu, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Tue, Nov 7, 2023 at 5:27 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Nov 07, 2023 at 01:34:34PM -0800, David Matlack wrote:
> > On Tue, Nov 7, 2023 at 1:10 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > Thanks Oliver. Maybe I'm being dense but I'm still not understanding
> > how VGIC and UFFD interact :). I understand that VGIC is unaware of
> > UFFD, but fundamentally they must interact in some way during
> > post-copy. Can you spell out the sequence of events?
>
> Well it doesn't help that my abbreviated explanation glosses over some
> details. So here's the verbose explanation, and I'm sure Marc will have
> a set of corrections too :) I meant there's no _explicit_ interaction
> between UFFD and the various bits of GIC that need to touch guest
> memory.
>
> The GIC redistributors contain a set of MMIO registers that are
> accessible through the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR
> ioctls. Writes to these are reflected directly into the KVM
> representation, no biggie there.
>
> One of the registers (GICR_PENDBASER) is a pointer to guest memory,
> containing a bitmap of pending LPIs managed by the redistributor. The
> ITS takes this to the extreme, as it is effectively a bunch of page
> tables for interrupts. All of this state actually lives in a KVM
> representation, and is only flushed out to guest memory when explicitly
> told to do so by userspace.
>
> On the target, we reread all the info when rebuilding interrupt
> translations when userspace calls KVM_DEV_ARM_ITS_RESTORE_TABLES. All of
> these guest memory accesses go through kvm_read_guest() and I expect the
> usual UFFD handling for non-present pages kicks in from there.
Thanks for the longer explanation. Yes kvm_read_guest() eventually
calls __copy_from_user() which will trigger a page fault and
UserfaultFD will notify userspace and wait for the page to become
present. In the KVM-specific proposal I outlined, calling
kvm_read_guest() will ultimately result in a check of the VM's present
bitmap and KVM will notify userspace and wait for the page to become
present if it's not, before calling __copy_from_user(). So I don't
expect a KVM-specific solution to have any increased maintenance
burden for VGIC (or any other widgets).
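(A rough sketch of what such a choke point could look like; none of the
names or struct kvm fields below exist today, they simply stand in for parts
(a), (b), and (d) of the proposal.)
/* Hypothetical choke point in KVM's common guest-access path. None of the
 * demand-paging fields or helpers exist; they stand in for the present
 * bitmap (a), the mark-present UAPI (b), and the notification mechanism (d). */
#include <linux/kvm_host.h>
#include <linux/wait.h>

static int kvm_wait_until_present(struct kvm *kvm, gfn_t gfn)
{
	if (!kvm->demand_paging_enabled || test_bit(gfn, kvm->present_bitmap))
		return 0;

	/* Tell userspace which GFN is missing (part (d))... */
	kvm_demand_paging_notify(kvm, gfn);

	/* ...and block until the mark-present UAPI (part (b)) sets the bit. */
	return wait_event_interruptible(kvm->demand_paging_wq,
					test_bit(gfn, kvm->present_bitmap));
}

/* Called from the common gfn-to-hva path, so kvm_read_guest(),
 * kvm_write_guest(), the gfn-to-pfn caches, etc. all pass through it before
 * touching the userspace mapping. */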
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 16:56 ` David Matlack
@ 2023-11-08 17:34 ` Peter Xu
2023-11-08 20:10 ` Sean Christopherson
2023-11-08 20:49 ` David Matlack
2023-11-08 20:33 ` Paolo Bonzini
1 sibling, 2 replies; 34+ messages in thread
From: Peter Xu @ 2023-11-08 17:34 UTC (permalink / raw)
To: David Matlack
Cc: Oliver Upton, Paolo Bonzini, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Wed, Nov 08, 2023 at 08:56:22AM -0800, David Matlack wrote:
> Thanks for the longer explanation. Yes kvm_read_guest() eventually
> calls __copy_from_user() which will trigger a page fault and
> UserfaultFD will notify userspace and wait for the page to become
> present. In the KVM-specific proposal I outlined, calling
> kvm_read_guest() will ultimately result in a check of the VM's present
> bitmap and KVM will notify userspace and wait for the page to become
> present if it's not, before calling __copy_from_user(). So I don't
> expect a KVM-specific solution to have any increased maintenance
> burden for VGIC (or any other widgets).
The question is how to support modules that do not use kvm apis at all,
like vhost. I raised the question in my initial reply, too.
I think if vhost is going to support gmemfd, it'll need new apis so maybe
there'll be a chance to take that into account, but I'm not 100% sure it'll
be the same complexity, also not sure if that's the plan even for CoCo.
Or is anything like vhost not considered to be supported for gmemfd at all?
Is there any plan for the new postcopy proposal then for generic mem (!CoCo)?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 17:34 ` Peter Xu
@ 2023-11-08 20:10 ` Sean Christopherson
2023-11-08 20:36 ` Peter Xu
2023-11-08 20:47 ` Axel Rasmussen
2023-11-08 20:49 ` David Matlack
1 sibling, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-11-08 20:10 UTC (permalink / raw)
To: Peter Xu
Cc: David Matlack, Oliver Upton, Paolo Bonzini, kvm list,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Wed, Nov 08, 2023, Peter Xu wrote:
> On Wed, Nov 08, 2023 at 08:56:22AM -0800, David Matlack wrote:
> > Thanks for the longer explanation. Yes kvm_read_guest() eventually
> > calls __copy_from_user() which will trigger a page fault and
> > UserfaultFD will notify userspace and wait for the page to become
> > present. In the KVM-specific proposal I outlined, calling
> > kvm_read_guest() will ultimately result in a check of the VM's present
> > bitmap and KVM will notify userspace and wait for the page to become
> > present if it's not, before calling __copy_from_user(). So I don't
> > expect a KVM-specific solution to have any increased maintenance
> > burden for VGIC (or any other widgets).
>
> The question is how to support modules that do not use kvm apis at all,
> like vhost. I raised the question in my initial reply, too.
>
> I think if vhost is going to support gmemfd, it'll need new apis so maybe
> there'll be a chance to take that into account, but I'm not 100% sure it'll
> be the same complexity, also not sure if that's the plan even for CoCo.
>
> Or is anything like vhost not considered to be supported for gmemfd at all?
vhost shouldn't require new APIs. To support vhost, guest_memfd would first need
to support virtio for host userspace, i.e. would need to support .mmap(). At that
point, all of the uaccess and gup() stuff in vhost should work without modification.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 16:56 ` David Matlack
2023-11-08 17:34 ` Peter Xu
@ 2023-11-08 20:33 ` Paolo Bonzini
2023-11-08 20:43 ` David Matlack
1 sibling, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2023-11-08 20:33 UTC (permalink / raw)
To: David Matlack, Oliver Upton
Cc: Peter Xu, kvm list, Sean Christopherson, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On 11/8/23 17:56, David Matlack wrote:
> Thanks for the longer explanation. Yes kvm_read_guest() eventually calls
> __copy_from_user() which will trigger a page fault and UserfaultFD will
> notify userspace and wait for the page to become present. In the
> KVM-specific proposal I outlined, calling kvm_read_guest() will
> ultimately result in a check of the VM's present bitmap and KVM will
> notify userspace and wait for the page to become present if it's not,
> before calling __copy_from_user(). So I don't expect a KVM-specific
> solution to have any increased maintenance burden for VGIC (or any other
> widgets).
It does mean however that we need a cross-thread notification mechanism,
instead of just relying on KVM_EXIT_MEMORY_FAULT (or another KVM_EXIT_*).
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 20:10 ` Sean Christopherson
@ 2023-11-08 20:36 ` Peter Xu
2023-11-08 20:47 ` Axel Rasmussen
1 sibling, 0 replies; 34+ messages in thread
From: Peter Xu @ 2023-11-08 20:36 UTC (permalink / raw)
To: Sean Christopherson
Cc: David Matlack, Oliver Upton, Paolo Bonzini, kvm list,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Wed, Nov 08, 2023 at 12:10:15PM -0800, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Peter Xu wrote:
> > On Wed, Nov 08, 2023 at 08:56:22AM -0800, David Matlack wrote:
> > > Thanks for the longer explanation. Yes kvm_read_guest() eventually
> > > calls __copy_from_user() which will trigger a page fault and
> > > UserfaultFD will notify userspace and wait for the page to become
> > > present. In the KVM-specific proposal I outlined, calling
> > > kvm_read_guest() will ultimately result in a check of the VM's present
> > > bitmap and KVM will notify userspace and wait for the page to become
> > > present if it's not, before calling __copy_from_user(). So I don't
> > > expect a KVM-specific solution to have any increased maintenance
> > > burden for VGIC (or any other widgets).
> >
> > The question is how to support modules that do not use kvm apis at all,
> > like vhost. I raised the question in my initial reply, too.
> >
> > I think if vhost is going to support gmemfd, it'll need new apis so maybe
> > there'll be a chance to take that into account, but I'm not 100% sure it'll
> > be the same complexity, also not sure if that's the plan even for CoCo.
> >
> > Or is anything like vhost not considered to be supported for gmemfd at all?
>
> vhost shouldn't require new APIs. To support vhost, guest_memfd would first need
> to support virtio for host userspace, i.e. would need to support .mmap(). At that
> point, all of the uaccess and gup() stuff in vhost should work without modification.
Then I suppose it means we will treat QEMU, vhost, and probably the whole
host hypervisor stack at the same trust level from gmemfd's perspective.
But then it'll be a harder question for a new demand paging scheme, as the
new interface would need to be proposed separately. Another option is to
only support kvm-api-based virt modules, but it may then become slightly
less attractive.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 20:33 ` Paolo Bonzini
@ 2023-11-08 20:43 ` David Matlack
0 siblings, 0 replies; 34+ messages in thread
From: David Matlack @ 2023-11-08 20:43 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Oliver Upton, Peter Xu, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Wed, Nov 8, 2023 at 12:33 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/8/23 17:56, David Matlack wrote:
> > Thanks for the longer explanation. Yes kvm_read_guest() eventually calls
> > __copy_from_user() which will trigger a page fault and UserfaultFD will
> > notify userspace and wait for the page to become present. In the
> > KVM-specific proposal I outlined, calling kvm_read_guest() will
> > ultimately result in a check of the VM's present bitmap and KVM will
> > notify userspace and wait for the page to become present if it's not,
> > before calling __copy_from_user(). So I don't expect a KVM-specific
> > solution to have any increased maintenance burden for VGIC (or any other
> > widgets).
>
> It does mean however that we need a cross-thread notification mechanism,
> instead of just relying on KVM_EXIT_MEMORY_FAULT (or another KVM_EXIT_*).
Yes. Any time KVM directly accesses guest memory (e.g.
kvm_read/write_guest()), it would use a blocking notification
mechanism (part (d) in the proposal). Google uses a netlink socket for
this, but a custom file descriptor would be more reliable.
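(A sketch of what the userspace side of such a descriptor could look like;
the record layout and fetch_and_mark_present() are invented for illustration,
only the read -> fetch -> mark-present shape matters.)
/* Hypothetical listener for a KVM demand-paging notification fd. The record
 * layout and fetch_and_mark_present() are invented; the point is the shape:
 * read a missing-GFN record, fetch the data, mark it present, which wakes
 * the KVM access blocked on that GFN. */
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

struct demand_fault_rec {		/* hypothetical record format */
	uint64_t gfn;
	uint64_t nr_pages;
};

extern void fetch_and_mark_present(uint64_t gfn, uint64_t nr_pages); /* placeholder */

static void serve_demand_faults(int notif_fd)
{
	struct pollfd pfd = { .fd = notif_fd, .events = POLLIN };
	struct demand_fault_rec rec;

	for (;;) {
		poll(&pfd, 1, -1);
		while (read(notif_fd, &rec, sizeof(rec)) == sizeof(rec))
			fetch_and_mark_present(rec.gfn, rec.nr_pages);
	}
}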
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 20:10 ` Sean Christopherson
2023-11-08 20:36 ` Peter Xu
@ 2023-11-08 20:47 ` Axel Rasmussen
2023-11-08 21:05 ` David Matlack
1 sibling, 1 reply; 34+ messages in thread
From: Axel Rasmussen @ 2023-11-08 20:47 UTC (permalink / raw)
To: Sean Christopherson
Cc: Peter Xu, David Matlack, Oliver Upton, Paolo Bonzini, kvm list,
James Houghton, Oliver Upton, Mike Kravetz, Andrea Arcangeli
On Wed, Nov 8, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Nov 08, 2023, Peter Xu wrote:
> > On Wed, Nov 08, 2023 at 08:56:22AM -0800, David Matlack wrote:
> > > Thanks for the longer explanation. Yes kvm_read_guest() eventually
> > > calls __copy_from_user() which will trigger a page fault and
> > > UserfaultFD will notify userspace and wait for the page to become
> > > present. In the KVM-specific proposal I outlined, calling
> > > kvm_read_guest() will ultimately result in a check of the VM's present
> > > bitmap and KVM will notify userspace and wait for the page to become
> > > present if it's not, before calling __copy_from_user(). So I don't
> > > expect a KVM-specific solution to have any increased maintenance
> > > burden for VGIC (or any other widgets).
> >
> > The question is how to support modules that do not use kvm apis at all,
> > like vhost. I raised the question in my initial reply, too.
> >
> > I think if vhost is going to support gmemfd, it'll need new apis so maybe
> > there'll be a chance to take that into account, but I'm not 100% sure it'll
> > be the same complexity, also not sure if that's the plan even for CoCo.
> >
> > Or is anything like vhost not considered to be supported for gmemfd at all?
>
> vhost shouldn't require new APIs. To support vhost, guest_memfd would first need
> to support virtio for host userspace, i.e. would need to support .mmap(). At that
> point, all of the uaccess and gup() stuff in vhost should work without modification.
Does this imply the need for some case-specific annotations for the
proposed KVM demand paging implementation to "catch" these accesses?
IIUC this was one of the larger downsides to our internal
implementation, and something David had hoped to avoid in his RFC
proposal. Whereas, I think this is a point in UFFD's favor, where all
accesses "just work" with one centralized check in the fault handler
path.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 17:34 ` Peter Xu
2023-11-08 20:10 ` Sean Christopherson
@ 2023-11-08 20:49 ` David Matlack
1 sibling, 0 replies; 34+ messages in thread
From: David Matlack @ 2023-11-08 20:49 UTC (permalink / raw)
To: Peter Xu
Cc: Oliver Upton, Paolo Bonzini, kvm list, Sean Christopherson,
James Houghton, Oliver Upton, Axel Rasmussen, Mike Kravetz,
Andrea Arcangeli
On Wed, Nov 8, 2023 at 9:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 08, 2023 at 08:56:22AM -0800, David Matlack wrote:
> > Thanks for the longer explanation. Yes kvm_read_guest() eventually
> > calls __copy_from_user() which will trigger a page fault and
> > UserfaultFD will notify userspace and wait for the page to become
> > present. In the KVM-specific proposal I outlined, calling
> > kvm_read_guest() will ultimately result in a check of the VM's present
> > bitmap and KVM will notify userspace and wait for the page to become
> > present if it's not, before calling __copy_from_user(). So I don't
> > expect a KVM-specific solution to have any increased maintenance
> > burden for VGIC (or any other widgets).
>
> The question is how to support modules that do not use kvm apis at all,
> like vhost. I raised the question in my initial reply, too.
Yes you are correct, my proposal does not provide a solution for guest
memory accesses made by vhost. That is admittedly a gap. Google does
not do virtio emulation in-kernel so this isn't a problem we had to
solve and I wasn't aware of it until you pointed it out.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-08 20:47 ` Axel Rasmussen
@ 2023-11-08 21:05 ` David Matlack
0 siblings, 0 replies; 34+ messages in thread
From: David Matlack @ 2023-11-08 21:05 UTC (permalink / raw)
To: Axel Rasmussen
Cc: Sean Christopherson, Peter Xu, Oliver Upton, Paolo Bonzini,
kvm list, James Houghton, Oliver Upton, Mike Kravetz,
Andrea Arcangeli
On Wed, Nov 8, 2023 at 12:47 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Wed, Nov 8, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Nov 08, 2023, Peter Xu wrote:
> > > On Wed, Nov 08, 2023 at 08:56:22AM -0800, David Matlack wrote:
> > > > Thanks for the longer explanation. Yes kvm_read_guest() eventually
> > > > calls __copy_from_user() which will trigger a page fault and
> > > > UserfaultFD will notify userspace and wait for the page to become
> > > > present. In the KVM-specific proposal I outlined, calling
> > > > kvm_read_guest() will ultimately result in a check of the VM's present
> > > > bitmap and KVM will notify userspace and wait for the page to become
> > > > present if it's not, before calling __copy_from_user(). So I don't
> > > > expect a KVM-specific solution to have any increased maintenance
> > > > burden for VGIC (or any other widgets).
> > >
> > > The question is how to support modules that do not use kvm apis at all,
> > > like vhost. I raised the question in my initial reply, too.
> > >
> > > I think if vhost is going to support gmemfd, it'll need new apis so maybe
> > > there'll be a chance to take that into account, but I'm not 100% sure it'll
> > > be the same complexity, also not sure if that's the plan even for CoCo.
> > >
> > > Or is anything like vhost not considered to be supported for gmemfd at all?
> >
> > vhost shouldn't require new APIs. To support vhost, guest_memfd would first need
> > to support virtio for host userspace, i.e. would need to support .mmap(). At that
> > point, all of the uaccess and gup() stuff in vhost should work without modification.
>
> Does this imply the need for some case-specific annotations for the
> proposed KVM demand paging implementation to "catch" these accesses?
>
> IIUC this was one of the larger downsides to our internal
> implementation, and something David had hoped to avoid in his RFC
> proposal. Whereas, I think this is a point in UFFD's favor, where all
> accesses "just work" with one centralized check in the fault handler
> path.
Yes this is definitely a point in UFFD's favor. The KVM-specific
solution would require adding hooks in KVM in the core routines that
KVM uses to access guest memory, or in the gfn-to-hva conversion
routine.
But the number of hooks needed is small, and this isn't exactly a new
problem. KVM already needs similar hooks so that it can manually keep
the dirty log up-to-date. Most of the pain we've experienced in our
internal implementation is because (1) it's not upstream so it's easy
for changes to slip in that don't go through the core routines (e.g.
see record_steal_time() which hand-rolls its own guest memory access
via user_access_begin/end()), and (2) we used to manually annotate all
guest memory accesses instead of hooking the core routines.
It's still possible for there to be correctness bugs (more likely than
with UFFD), but I think they're about as likely as KVM missing a dirty log
update, which I haven't noticed being a major source of upstream bugs
or maintenance burden.
But again, yes this is definitely a point in favor of a
page-table-based approach. :)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-07 22:29 ` Peter Xu
@ 2023-11-09 16:41 ` David Matlack
2023-11-09 17:58 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-09 16:41 UTC (permalink / raw)
To: Peter Xu
Cc: Paolo Bonzini, kvm list, Sean Christopherson, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Tue, Nov 7, 2023 at 2:29 PM Peter Xu <peterx@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 05:25:06PM +0100, Paolo Bonzini wrote:
> > On 11/6/23 21:23, Peter Xu wrote:
> > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > >
> >
> > Once you have the implementation done for guest_memfd, it is interesting to
> > see how easily it extends to other, userspace-mappable kinds of memory. But
> > I still dislike the fact that you need some kind of extra protocol in
> > userspace, for multi-process VMMs. This is the kind of thing that the
> > kernel is supposed to facilitate. I'd like it to do _more_ of that (see
> > above memfd pseudo-suggestion), not less.
>
> Is that our future plan to extend gmemfd to normal memories?
>
> I see that gmemfd manages folio on its own. I think it'll make perfect
> sense if it's for use in CoCo context, where the memory is so special to be
> generic anyway.
>
> However, if we extend it to generic memory, I'm wondering how we support
> the existing memory features of such memory, which already work with
> KVM_SET_USER_MEMORY_REGION v1. To name some:
>
> - numa awareness
> - swapping
> - cgroup
> - punch hole (in a huge page, aka, thp split)
> - cma allocations for huge pages / page migrations
> - ...
Sean has stated that he doesn't want guest_memfd to support swap. So I
don't think guest_memfd will one day replace all guest memory
use-cases. That also means that my idea to extend my proposal to
guest_memfd VMAs has limited value. VMs that do not use guest_memfd
would not be able to use it.
Paolo, it sounds like overall my proposal has limited value outside of
GCE's use-case. And even if it landed upstream, it would bifurcate KVM
VM post-copy support. So I think it's probably not worth pursuing
further. Do you think that's a fair assessment? Getting a clear NACK
on pushing this proposal upstream would be a nice outcome here since
it helps inform our next steps.
That being said, we still don't have an upstream solution for 1G
post-copy, which James pointed out is really the core issue. But there
are other avenues we can explore in that direction such as cleaning up
HugeTLB (very nebulous) or adding 1G+mmap()+userfaultfd support to
guest_memfd. The latter seems promising.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-09 16:41 ` David Matlack
@ 2023-11-09 17:58 ` Sean Christopherson
2023-11-09 18:33 ` David Matlack
2023-11-09 19:20 ` Peter Xu
0 siblings, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-11-09 17:58 UTC (permalink / raw)
To: David Matlack
Cc: Peter Xu, Paolo Bonzini, kvm list, James Houghton, Oliver Upton,
Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Thu, Nov 09, 2023, David Matlack wrote:
> On Tue, Nov 7, 2023 at 2:29 PM Peter Xu <peterx@redhat.com> wrote:
> > On Tue, Nov 07, 2023 at 05:25:06PM +0100, Paolo Bonzini wrote:
> > > On 11/6/23 21:23, Peter Xu wrote:
> > > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > >
> > >
> > > Once you have the implementation done for guest_memfd, it is interesting to
> > > see how easily it extends to other, userspace-mappable kinds of memory. But
> > > I still dislike the fact that you need some kind of extra protocol in
> > > userspace, for multi-process VMMs. This is the kind of thing that the
> > > kernel is supposed to facilitate. I'd like it to do _more_ of that (see
> > > above memfd pseudo-suggestion), not less.
> >
> > Is that our future plan to extend gmemfd to normal memories?
> >
> > I see that gmemfd manages folio on its own. I think it'll make perfect
> > sense if it's for use in CoCo context, where the memory is so special to be
> > generic anyway.
> >
> > However, if we extend it to generic memory, I'm wondering how we support
> > the existing memory features of such memory, which already work with
> > KVM_SET_USER_MEMORY_REGION v1. To name some:
> >
> > - numa awareness
The plan is to add fbind() to mirror mbind().
> > - swapping
> > - cgroup
Accounting is already supported. Fine-grained reclaim will likely never be
supported (see below re: swap).
> > - punch hole (in a huge page, aka, thp split)
Already works. What doesn't work is reconstituting a hugepage, but like swap,
I think that's something KVM should deliberately not support.
> > - cma allocations for huge pages / page migrations
I suspect the direction guest_memfd will take will be to support a dedicated pool
of memory, a la hugetlbfs reservations.
> > - ...
>
> Sean has stated that he doesn't want guest_memfd to support swap. So I
> don't think guest_memfd will one day replace all guest memory
> use-cases. That also means that my idea to extend my proposal to
> guest_memfd VMAs has limited value. VMs that do not use guest_memfd
> would not be able to use it.
Yep. This is a hill I'm extremely willing to die on. I feel very, very strongly
that we should put a stake in the ground regarding swap and other traditional memory
management stuff. The intent of guest_memfd is that it's a vehicle for supporting
use cases that don't fit into generic memory subsystems, e.g. CoCo VMs, and/or where
making guest memory inaccessible by default adds a lot of value at minimal cost.
guest_memfd isn't intended to be a wholesale replacement of VMA-based memory.
IMO, use cases that want to dynamically manage guest memory should be firmly
out-of-scope for guest_memfd.
> Paolo, it sounds like overall my proposal has limited value outside of
> GCE's use-case. And even if it landed upstream, it would bifurcate KVM
> VM post-copy support. So I think it's probably not worth pursuing
> further. Do you think that's a fair assessment? Getting a clear NACK
> on pushing this proposal upstream would be a nice outcome here since
> it helps inform our next steps.
>
> That being said, we still don't have an upstream solution for 1G
> post-copy, which James pointed out is really the core issue. But there
> are other avenues we can explore in that direction such as cleaning up
> HugeTLB (very nebulous) or adding 1G+mmap()+userfaultfd support to
> guest_memfd. The latter seems promising.
mmap()+userfaultfd is the answer for userspace and vhost, but it is most definitely
not the answer for guest_memfd within KVM. The main selling point of guest_memfd
is that it doesn't require mapping the memory into userspace, i.e. userfaultfd
can't be the answer for KVM accesses unless we bastardize the entire concept of
guest_memfd.
And as I've proposed internally, the other thing related to live migration that I
think KVM should support is the ability to performantly and non-destructively freeze
guest memory, e.g. to allow blocking KVM accesses to guest memory during blackout
without requiring userspace to destroy memslots to harden against memory corruption
due to KVM writing guest memory after userspace has taken the final snapshot of the
dirty bitmap.
For both cases, KVM will need choke points on all accesses to guest memory. Once
the choke points exist and we have signed up to maintain them, the extra burden of
gracefully handling "missing" memory versus frozen memory should be relatively
small, e.g. it'll mainly be the notify-and-wait uAPI.
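To make the shape of that concrete, here's a rough sketch of what a single
choke point could look like; every identifier below is made up purely for
illustration, none of it is existing KVM code:

        static int kvm_prepare_guest_mem_access(struct kvm *kvm, gfn_t gfn,
                                                bool can_block)
        {
                /* Frozen memory: fail the access outright, no waiting. */
                if (READ_ONCE(kvm->guest_mem_frozen))
                        return -EFAULT;

                /* Fast path: the page is already present. */
                if (test_bit(gfn, kvm->present_bitmap))
                        return 0;

                /* Atomic contexts can't wait; let the caller retry or degrade. */
                if (!can_block)
                        return -EAGAIN;

                /* "Missing" memory: notify userspace and wait for a response. */
                kvm_notify_missing_gfn(kvm, gfn);
                return wait_event_interruptible(kvm->demand_paging_wq,
                                test_bit(gfn, kvm->present_bitmap));
        }

The frozen case just fails fast; the missing case is where the notify-and-wait
uAPI comes in.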
Don't get me wrong, I think Google's demand paging implementation should die a slow,
horrible death. But I don't think userfaultfd is the answer for guest_memfd.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-09 17:58 ` Sean Christopherson
@ 2023-11-09 18:33 ` David Matlack
2023-11-09 22:44 ` David Matlack
2023-11-09 19:20 ` Peter Xu
1 sibling, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-09 18:33 UTC (permalink / raw)
To: Sean Christopherson
Cc: Peter Xu, Paolo Bonzini, kvm list, James Houghton, Oliver Upton,
Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Thu, Nov 9, 2023 at 9:58 AM Sean Christopherson <seanjc@google.com> wrote:
> On Thu, Nov 09, 2023, David Matlack wrote:
> > Paolo, it sounds like overall my proposal has limited value outside of
> > GCE's use-case. And even if it landed upstream, it would bifurcate KVM
> > VM post-copy support. So I think it's probably not worth pursuing
> > further. Do you think that's a fair assessment? Getting a clear NACK
> > on pushing this proposal upstream would be a nice outcome here since
> > it helps inform our next steps.
> >
> > That being said, we still don't have an upstream solution for 1G
> > post-copy, which James pointed out is really the core issue. But there
> > are other avenues we can explore in that direction such as cleaning up
> > HugeTLB (very nebulous) or adding 1G+mmap()+userfaultfd support to
> > guest_memfd. The latter seems promising.
>
> mmap()+userfaultfd is the answer for userspace and vhost, but it is most definitely
> not the answer for guest_memfd within KVM. The main selling point of guest_memfd
> is that it doesn't require mapping the memory into userspace, i.e. userfaultfd
> can't be the answer for KVM accesses unless we bastardize the entire concept of
> guest_memfd.
>
> And as I've proposed internally, the other thing related to live migration that I
> think KVM should support is the ability to performantly and non-destructively freeze
> guest memory, e.g. to allow blocking KVM accesses to guest memory during blackout
> without requiring userspace to destroy memslots to harden against memory corruption
> due to KVM writing guest memory after userspace has taken the final snapshot of the
> dirty bitmap.
>
> For both cases, KVM will need choke points on all accesses to guest memory. Once
> the choke points exist and we have signed up to maintain them, the extra burden of
> gracefully handling "missing" memory versus frozen memory should be relatively
> small, e.g. it'll mainly be the notify-and-wait uAPI.
To be honest, the choke points are a relatively small part of any
KVM-based demand paging scheme. We still need (a)-(e) from my original
email.
>
> Don't get me wrong, I think Google's demand paging implementation should die a slow,
> horrible death. But I don't think userfaultfd is the answer for guest_memfd.
I'm a bit confused. Yes, Google's implementation is not good, I said
the same in my original email. But if userfaultfd is not the answer
for guest_memfd, are you saying that KVM _does_ need a KVM-based demand
paging UAPI like I proposed?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-09 17:58 ` Sean Christopherson
2023-11-09 18:33 ` David Matlack
@ 2023-11-09 19:20 ` Peter Xu
2023-11-11 16:23 ` David Matlack
1 sibling, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-09 19:20 UTC (permalink / raw)
To: Sean Christopherson
Cc: David Matlack, Paolo Bonzini, kvm list, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Thu, Nov 09, 2023 at 09:58:49AM -0800, Sean Christopherson wrote:
> guest_memfd isn't intended to be a wholesale replacement of VMA-based memory.
> IMO, use cases that want to dynamically manage guest memory should be firmly
> out-of-scope for guest_memfd.
I'm not sure whether that will hold true over a longer period (e.g. 5-10
years, or more?), but it makes sense to me for now, at least as long as we
haven't already decided to reimplement everything.
If the use case grows and CoCo becomes the de-facto standard, hopefully
there's always the possibility of refactoring the mm features that CoCo
needs so they cooperate with gmemfd, I guess.
>
> > Paolo, it sounds like overall my proposal has limited value outside of
> > GCE's use-case. And even if it landed upstream, it would bifurcate KVM
> > VM post-copy support. So I think it's probably not worth pursuing
> > further. Do you think that's a fair assessment? Getting a clear NACK
> > on pushing this proposal upstream would be a nice outcome here since
> > it helps inform our next steps.
> >
> > That being said, we still don't have an upstream solution for 1G
> > post-copy, which James pointed out is really the core issue. But there
> > are other avenues we can explore in that direction such as cleaning up
> > HugeTLB (very nebulous) or adding 1G+mmap()+userfaultfd support to
> > guest_memfd. The latter seems promising.
>
> mmap()+userfaultfd is the answer for userspace and vhost, but it is most definitely
> not the answer for guest_memfd within KVM. The main selling point of guest_memfd
> is that it doesn't require mapping the memory into userspace, i.e. userfaultfd
> can't be the answer for KVM accesses unless we bastardize the entire concept of
> guest_memfd.
Note that I don't think userfaultfd needs to be bound to VAs, even if it is
for now...
> And as I've proposed internally, the other thing related to live migration that I
> think KVM should support is the ability to performantly and non-destructively freeze
> guest memory, e.g. to allow blocking KVM accesses to guest memory during blackout
> without requiring userspace to destroy memslots to harden against memory corruption
> due to KVM writing guest memory after userspace has taken the final snapshot of the
> dirty bitmap.
Any pointer to the problem you're describing? Why can't userspace have
full control over when to quiesce guest memory accesses (probably by
kicking all vCPUs out)?
> For both cases, KVM will need choke points on all accesses to guest memory. Once
> the choke points exist and we have signed up to maintain them, the extra burden of
> gracefully handling "missing" memory versus frozen memory should be relatively
> small, e.g. it'll mainly be the notify-and-wait uAPI.
>
> Don't get me wrong, I think Google's demand paging implementation should die a slow,
> horrible death. But I don't think userfaultfd is the answer for guest_memfd.
As I replied in the other thread, I see the possibility of implementing
userfaultfd on gmemfd, especially now that I know your plan treats
user/kernel accesses the same way.
But I don't know whether I have missed something here and there, and I'd
like to first read about the problem above to understand the relationship
between that "freeze guest mem" idea and the demand paging scheme.
One thing I'd agree with is that we don't necessarily need to squash
userfaultfd into gmemfd's support for demand paging. If gmemfd will only be
used in a KVM context then indeed it at least won't make a major
difference; but it would still be good if the messaging framework can be
leveraged, so that userspace that already supports userfaultfd can
cooperate with gmemfd much more easily.
In general, to me a major part of userfaultfd is really a messaging
interface for faults. A fault trap mechanism will be needed for gmemfd
anyway, AFAIU. When that comes, maybe we can have a clearer idea of what's
next.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-09 18:33 ` David Matlack
@ 2023-11-09 22:44 ` David Matlack
2023-11-09 23:54 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-09 22:44 UTC (permalink / raw)
To: Sean Christopherson
Cc: Peter Xu, Paolo Bonzini, kvm list, James Houghton, Oliver Upton,
Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Thu, Nov 9, 2023 at 10:33 AM David Matlack <dmatlack@google.com> wrote:
> On Thu, Nov 9, 2023 at 9:58 AM Sean Christopherson <seanjc@google.com> wrote:
> > For both cases, KVM will need choke points on all accesses to guest memory. Once
> > the choke points exist and we have signed up to maintain them, the extra burden of
> > gracefully handling "missing" memory versus frozen memory should be relatively
> > small, e.g. it'll mainly be the notify-and-wait uAPI.
>
> To be honest, the choke points are a relatively small part of any
> KVM-based demand paging scheme. We still need (a)-(e) from my original
> email.
Another small thing here: I think we can find clean choke point(s)
that fit both freezing and demand paging (aka "missing" pages), but
there is a difference to keep in mind. To freeze guest memory KVM only
needs to return an error at the choke point(s). Whereas handling
"missing" pages may require blocking, which adds constraints on where
the choke point(s) can be placed.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-09 22:44 ` David Matlack
@ 2023-11-09 23:54 ` Sean Christopherson
0 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-11-09 23:54 UTC (permalink / raw)
To: David Matlack
Cc: Peter Xu, Paolo Bonzini, kvm list, James Houghton, Oliver Upton,
Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Thu, Nov 09, 2023, David Matlack wrote:
> On Thu, Nov 9, 2023 at 10:33 AM David Matlack <dmatlack@google.com> wrote:
> > On Thu, Nov 9, 2023 at 9:58 AM Sean Christopherson <seanjc@google.com> wrote:
> > > For both cases, KVM will need choke points on all accesses to guest memory. Once
> > > the choke points exist and we have signed up to maintain them, the extra burden of
> > > gracefully handling "missing" memory versus frozen memory should be relatively
> > > small, e.g. it'll mainly be the notify-and-wait uAPI.
> >
> > To be honest, the choke points are a relatively small part of any
> > KVM-based demand paging scheme. We still need (a)-(e) from my original
> > email.
>
> Another small thing here: I think we can find clean choke point(s)
> that fit both freezing and demand paging (aka "missing" pages), but
> there is a difference to keep in mind. To freeze guest memory KVM only
> needs to return an error at the choke point(s). Whereas handling
> "missing" pages may require blocking, which adds constraints on where
> the choke point(s) can be placed.
Rats, I didn't think about not being able to block. Luckily, that's *almost* a
non-issue as user accesses already might_sleep(). At a glance, it's only x86's
shadow paging that uses kvm_vcpu_read_guest_atomic(), everything else either can
sleep or uses a gfn_to_pfn_cache or kvm_host_map cache. Aha! And all of x86's
usage can fail gracefully (for some definitions of gracefully), i.e. will either
result in the access being retried after dropping mmu_lock or will cause KVM to
zap a SPTE instead of doing something more optimal.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-09 19:20 ` Peter Xu
@ 2023-11-11 16:23 ` David Matlack
2023-11-11 17:30 ` Peter Xu
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-11 16:23 UTC (permalink / raw)
To: Peter Xu
Cc: Sean Christopherson, Paolo Bonzini, kvm list, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Thu, Nov 9, 2023 at 11:20 AM Peter Xu <peterx@redhat.com> wrote:
> On Thu, Nov 09, 2023 at 09:58:49AM -0800, Sean Christopherson wrote:
> >
> > For both cases, KVM will need choke points on all accesses to guest memory. Once
> > the choke points exist and we have signed up to maintain them, the extra burden of
> > gracefully handling "missing" memory versus frozen memory should be relatively
> > small, e.g. it'll mainly be the notify-and-wait uAPI.
> >
> > Don't get me wrong, I think Google's demand paging implementation should die a slow,
> > horrible death. But I don't think userfaultfd is the answer for guest_memfd.
>
> As I replied in the other thread, I see possibility implementing
> userfaultfd on gmemfd, especially after I know your plan now treating
> user/kernel the same way.
>
> But I don't know whether I could have missed something here and there, and
> I'd like to read the problem first on above to understand the relationship
> between that "freeze guest mem" idea and the demand paging scheme.
>
> One thing I'd agree is we don't necessarily need to squash userfaultfd into
> gmemfd support of demand paging. If gmemfd will only be used in KVM
> context then indeed it at least won't make a major difference; but still
> good if the messaging framework can be leveraged, meanwhile userspace that
> already support userfaultfd can cooperate with gmemfd much easier.
>
> In general, a major part of userfaultfd is really a messaging interface for
> faults to me. A fault trap mechanism will be needed anyway for gmemfd,
> AFAIU. When that comes maybe we can have a clearer mind of what's next.
The idea to re-use userfaultfd as a notification mechanism is really
interesting.
I'm almost certain that guest page faults on missing pages can re-use
the KVM_CAP_EXIT_ON_MISSING UAPI that Anish is adding for UFFD [1]. So
that will be the same between VMA-based UserfaultFD and
KVM/guest_memfd-based demand paging.
And for the blocking notification in KVM, re-using the userfaultfd
file descriptor seems like a neat idea. We could have a KVM ioctl to
register the fd with KVM, and then KVM can notify when it needs to
block on a missing page. The uffd_msg struct can be extended to
support a new "gfn" type or "guest_memfd" type fault info. I'm not
quite sure how the wait-queuing will work, but I'm sure it's solvable.
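Just to make that concrete, here's a sketch of what the message could look
like -- nothing below exists today; the event number, struct layout, and
ioctl are all made up for illustration:

        /* Hypothetical new userfaultfd event for a guest-physical fault. */
        #define UFFD_EVENT_GFN_FAULT    0x17    /* made-up event number */

        struct uffd_msg_gfn {           /* would be a new arm of uffd_msg.arg */
                __u64   flags;
                __u64   gfn;            /* guest frame number KVM is waiting on */
                __u32   memslot;        /* memslot id, if that turns out to be useful */
                __u32   reserved;
        };

        /* Hypothetical KVM ioctl to hand the userfaultfd to KVM. */
        #define KVM_SET_USERFAULT_FD    _IOW(KVMIO, 0xd4, int)

Userspace would read() these off the same fd it already uses for VMA-based
faults and respond with whatever "mark present" operation we end up with.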
With these 2 together, the UAPI for notifying userspace would be the
same for UserfaultFD and KVM.
As for whether to integrate the "missing" page support in guest_memfd
or KVM... I'm obviously partial to the latter because then Google can
also use it for HugeTLB.
But now that I think about it, isn't the KVM-based approach useful to
the broader community as well? For example, QEMU could use the
KVM-based demand paging for all KVM-generated accesses to guest
memory. This would provide 4K-granular demand paging for _most_ memory
accesses. Then for vhost and userspace accesses, QEMU can set up a
separate VMA mapping of guest memory and use UserfaultFD. The
userspace/vhost accesses would have to be done at the huge page size
granularity (if using HugeTLB). But most accesses should still come
from KVM, so this would be a real improvement over a pure UserfaultFD
approach.
And on the more practical side... If we integrate missing page support
directly into guest_memfd, I'm not sure how one part would even work.
Userspace would need a way to write to missing pages before marking
them present. So we'd need some sort of special flag to mmap() to
bypass the missing page interception? I'm sure it's solvable, but the
KVM-based does not have this problem.
[1] https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-11 16:23 ` David Matlack
@ 2023-11-11 17:30 ` Peter Xu
2023-11-13 16:43 ` David Matlack
0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-11 17:30 UTC (permalink / raw)
To: David Matlack
Cc: Sean Christopherson, Paolo Bonzini, kvm list, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Sat, Nov 11, 2023 at 08:23:57AM -0800, David Matlack wrote:
> On Thu, Nov 9, 2023 at 11:20 AM Peter Xu <peterx@redhat.com> wrote:
> > On Thu, Nov 09, 2023 at 09:58:49AM -0800, Sean Christopherson wrote:
> > >
> > > For both cases, KVM will need choke points on all accesses to guest memory. Once
> > > the choke points exist and we have signed up to maintain them, the extra burden of
> > > gracefully handling "missing" memory versus frozen memory should be relatively
> > > small, e.g. it'll mainly be the notify-and-wait uAPI.
> > >
> > > Don't get me wrong, I think Google's demand paging implementation should die a slow,
> > > horrible death. But I don't think userfaultfd is the answer for guest_memfd.
> >
> > As I replied in the other thread, I see possibility implementing
> > userfaultfd on gmemfd, especially after I know your plan now treating
> > user/kernel the same way.
> >
> > But I don't know whether I could have missed something here and there, and
> > I'd like to read the problem first on above to understand the relationship
> > between that "freeze guest mem" idea and the demand paging scheme.
> >
> > One thing I'd agree is we don't necessarily need to squash userfaultfd into
> > gmemfd support of demand paging. If gmemfd will only be used in KVM
> > context then indeed it at least won't make a major difference; but still
> > good if the messaging framework can be leveraged, meanwhile userspace that
> > already support userfaultfd can cooperate with gmemfd much easier.
> >
> > In general, a major part of userfaultfd is really a messaging interface for
> > faults to me. A fault trap mechanism will be needed anyway for gmemfd,
> > AFAIU. When that comes maybe we can have a clearer mind of what's next.
>
> The idea to re-use userfaultfd as a notification mechanism is really
> interesting.
>
> I'm almost certain that guest page faults on missing pages can re-use
> the KVM_CAP_EXIT_ON_MISSING UAPI that Anish is adding for UFFD [1]. So
> that will be the same between VMA-based UserfaultFD and
> KVM/guest_memfd-based demand paging.
Right. I think we may need to decide whether to use the same interface to
handle no-vcpu contexts, though, and whether that's even possible (e.g. is
reusing vcpu0 possible? if not, consider the ugliness). For example,
Anish's proposal still needs to rely on userfaults for a minority of
faults, IIUC.
>
> And for the blocking notification in KVM, re-using the userfaultfd
> file descriptor seems like a neat idea. We could have a KVM ioctl to
> register the fd with KVM, and then KVM can notify when it needs to
> block on a missing page. The uffd_msg struct can be extended to
> support a new "gfn" type or "guest_memfd" type fault info. I'm not
> quite sure how the wait-queuing will work, but I'm sure it's solvable.
If so, IMHO it'll be great if the interface is userfault-based. Gmemfd can
be just a special form of memfd in general, and there's also the potential
of supporting generic memfds, which may not require a KVM context.
It hopefully shouldn't affect how KVM consumes it, though, so from that
angle it should be as straightforward as a KVM interface, just cleaner and
fully separated.
Let me expand a bit on a possible userfault design; maybe it can make the
discussion clearer.
Currently userfaultfd registers VA ranges, cutting them into new VMAs and
setting flags on those VMAs. Files don't have those.
To envision an fd-based uffd (which has never existed), going on pure gut
feeling, we only need to move the address space from VAs to the file (or,
to be explicit, the inode), meaning addresses become offsets into a file.
It can apply to gmemfd or to generic memfds (shmem, hugetlb, etc.). Generic
memfd support can be for later, but the interface could hopefully stay
unchanged even if someone would like to add support for those.
I need to think more about where to put the uffd context, probably on the
inode, but it may also be possible to put it into gmemfd-only fields before
we want to make it more generic. That's a problem for later.
It could be something like ioctl(UFFDIO_REGISTER_FD) to register a normal
memory fd (gmemfd) against a userfault descriptor. AFAIU it may not even
require a range of file offsets to register; we can start by always
registering the file over its full range, however large it is (0->fsize,
with fsize under the control of normal file/inode semantics).
It means gmemfd code will still need to check the fsize limit for accesses
beyond the file size; as long as that check passes, and if the file is
registered in UFFD-fd mode, it reports the fault just like the VMA-based
case, but with a file offset this time. The file size will need to be
rechecked when injecting into the page cache, failing on both sides.
We'll need a corresponding new ioctl for files, say ioctl(UFFDIO_INSERT),
which fills in the mapping entry / page cache only for the inode without
touching any page tables. The kicking (wakeup) side is the same as the
VMA-based case.
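Purely as a strawman, the uAPI could look something like below -- neither
ioctl exists today, and the numbers and fields are only to illustrate the
idea:

        /* Register a whole memory fd (gmemfd, maybe memfd later) with a uffd. */
        struct uffdio_register_fd {
                __s32   fd;     /* the guest_memfd / memfd to watch */
                __u32   mode;   /* e.g. UFFDIO_REGISTER_MODE_MISSING */
        };
        #define UFFDIO_REGISTER_FD      _IOWR(UFFDIO, 0x30, struct uffdio_register_fd)

        /* Fill pages into the inode's page cache by file offset, without
         * touching any page tables; wake up waiters like UFFDIO_COPY does. */
        struct uffdio_insert {
                __u64   dst_offset;     /* file offset to fill */
                __u64   src;            /* userspace buffer holding the contents */
                __u64   len;
                __u64   mode;
        };
        #define UFFDIO_INSERT           _IOWR(UFFDIO, 0x31, struct uffdio_insert)

Faults would then be reported as file offsets rather than VAs, but otherwise
flow through the same uffd message/wakeup machinery.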
> With these 2 together, the UAPI for notifying userspace would be the
> same for UserfaultFD and KVM.
>
> As for whether to integrate the "missing" page support in guest_memfd
> or KVM... I'm obviously partial to the latter because then Google can
> also use it for HugeTLB.
Right, hugetlb 1G is still the major pain, and we could indeed resolve that
with a KVM proposal that bypasses whatever issues lie with the mm
community, even if that's not something I prefer... I sincerely hope we can
see some progress on that side, if possible.
>
> But now that I think about it, isn't the KVM-based approach useful to
> the broader community as well? For example, QEMU could use the
> KVM-based demand paging for all KVM-generated accesses to guest
> memory. This would provide 4K-granular demand paging for _most_ memory
> accesses. Then for vhost and userspace accesses, QEMU can set up a
> separate VMA mapping of guest memory and use UserfaultFD. The
> userspace/vhost accesses would have to be done at the huge page size
> granularity (if using HugeTLB). But most accesses should still come
> from KVM, so this would be a real improvement over a pure UserfaultFD
> approach.
I fully understand why you propose that, but not the one I prefer. That
means KVM is leaving other modules behind. :( And that's not even the same
as the case where KVM wants to resolve hugetlb over 1G, because at least we
tried :) it's just that the proposal got rejected, unfortunately, so far.
IMHO we should still consider virt the whole community, not KVM separately,
even if KVM is indeed a separate module.
If we still have any form of the 1G problem (including DMA, in the case of
vhost), IMHO the VM still cannot be declared usable with 1G pages. This
"optimization" doesn't solve the problem if it doesn't consider the rest.
So if we're going to propose a new solution to replace userfault, I'd
rather we add support separately for everything, at least still in public,
even if it'll take more work compared to making it KVM-only.
>
> And on the more practical side... If we integrate missing page support
> directly into guest_memfd, I'm not sure how one part would even work.
> Userspace would need a way to write to missing pages before marking
> them present. So we'd need some sort of special flag to mmap() to
> bypass the missing page interception? I'm sure it's solvable, but the
> KVM-based does not have this problem.
Userfaults rely on the temp buffer. Take UFFDIO_COPY as an example,
uffdio_copy.src|len describes that. Then the kernel does the atomicity
work.
I'm not sure why the KVM-based approach doesn't have that problem. IIUC
it'll be the same? We can't make the folio present in the gmemfd mapping if
it doesn't yet contain the full data copied over. Having a special flag for
the mapping so that each folio can allow different access permissions also
looks fine, but that sounds overcomplicated to me.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-11 17:30 ` Peter Xu
@ 2023-11-13 16:43 ` David Matlack
2023-11-20 18:32 ` James Houghton
0 siblings, 1 reply; 34+ messages in thread
From: David Matlack @ 2023-11-13 16:43 UTC (permalink / raw)
To: Peter Xu
Cc: Sean Christopherson, Paolo Bonzini, kvm list, James Houghton,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Sat, Nov 11, 2023 at 9:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Sat, Nov 11, 2023 at 08:23:57AM -0800, David Matlack wrote:
> >
> > But now that I think about it, isn't the KVM-based approach useful to
> > the broader community as well? For example, QEMU could use the
> > KVM-based demand paging for all KVM-generated accesses to guest
> > memory. This would provide 4K-granular demand paging for _most_ memory
> > accesses. Then for vhost and userspace accesses, QEMU can set up a
> > separate VMA mapping of guest memory and use UserfaultFD. The
> > userspace/vhost accesses would have to be done at the huge page size
> > granularity (if using HugeTLB). But most accesses should still come
> > from KVM, so this would be a real improvement over a pure UserfaultFD
> > approach.
>
> I fully understand why you propose that, but not the one I prefer. That
> means KVM is leaving other modules behind. :( And that's not even the same
> as the case where KVM wants to resolve hugetlb over 1G, because at least we
> tried :) it's just that the proposal got rejected, unfortunately, so far.
>
> IMHO we should still consider virt the whole community, not KVM separately,
> even if KVM is indeed a separate module.
KVM is not just any module, though. It is the _only_ module that
mediates _guest_ access to host memory. KVM is also a constant: Any
Linux-based VM that cares about performance is using KVM. guest_memfd,
on the other hand, is not unique and not constant. It's just one way
to back guest memory.
The way I see it, we are going to have one of the 2 following outcomes:
1. VMMs use KVM-based demand paging to mediate guest accesses, and
UserfaultFD to mediate userspace and vhost accesses.
2. VMMs use guest_memfd-based demand paging for guest_memfd, and
UserfaultFD for everything else.
I think there are many advantages of (1) over (2). (1) means that VMMs
can have a common software architecture for post-copy across any
memory types. Any optimizations we implement will apply to _all_
memory types, not just guest_memfd.
Mediating guest accesses _in KVM_ also has practical benefits. It
gives us more flexibility to solve problems that are specific to
virtual machines that other parts of the kernel don't care about. For
example, there's value in being able to preemptively mark memory as
present so that guest accesses don't have to notify userspace. During
a Live Migration, at the beginning of post-copy, there might be a
large number of guest pages that are present and don't need to be
fetched. The set might also be sparse. With KVM mediating access to
guest memory, we can just add a bitmap-based UAPI to KVM to mark
memory as present.
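For example, a sketch only (this ioctl and struct don't exist; the names
and number are made up), loosely modeled on the dirty log interface:

        struct kvm_gfn_present_log {
                __u32   slot;           /* memslot to operate on */
                __u32   flags;
                __u64   first_gfn;      /* start of the range covered by the bitmap */
                __u64   num_pages;
                __u64   bitmap;         /* userspace pointer; set bit => GFN is present */
        };
        #define KVM_MARK_GFNS_PRESENT   _IOW(KVMIO, 0xd5, struct kvm_gfn_present_log)

A single call could flip a large, sparse set of GFNs to present without any
per-page faults or notifications.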
Sure we could technically add a bitmap-based API to guest_memfd, but
that would only solve the problem _for guest_memfd_.
Then there's the bounce-buffering problem. With a guest_memfd-based
scheme, there's no way for userspace to bypass the kernel's notion of
what's present. That means all of guest memory has to be
bounce-buffered. (More on this below.)
And even if we generalize (2) to all memfds, that's still not covering
all ways of backing guest memory.
Having KVM-specific UAPIs is also not new. Consider how KVM implements
its own dirty tracking.
And all of that is independent of the short-term HugeTLB benefit for Google.
>
> So if we're going to propose the new solution to replace userfault, I'd
> rather we add support separately for everything at least still public, even
> if it'll take more work, comparing to make it kvm-only.
To be clear, it's not a replacement for UserfaultFD. It would work in
conjunction with UserfaultFD.
>
> >
> > And on the more practical side... If we integrate missing page support
> > directly into guest_memfd, I'm not sure how one part would even work.
> > Userspace would need a way to write to missing pages before marking
> > them present. So we'd need some sort of special flag to mmap() to
> > bypass the missing page interception? I'm sure it's solvable, but the
> > KVM-based does not have this problem.
>
> Userfaults rely on the temp buffer. Take UFFDIO_COPY as an example,
> uffdio_copy.src|len describes that. Then the kernel does the atomicity
> work.
Any solution that requires bounce-buffering (memcpy) is unlikely to be
tenable. The performance implications and CPU overhead required to
bounce-buffer _every_ page of guest memory during post-copy are too
much. That's why Google maintains a second mapping when using
UserfaultFD.
>
> I'm not sure why KVM-based doesn't have that problem. IIUC it'll be the
> same? We can't make the folio present in the gmemfd mapping if it doesn't
> yet contain full data copied over. Having a special flag for the mapping
> is fine for each folio to allow different access permissions looks also
> fine, but that sounds like over complicated to me.
UserfaultFD and KVM-based demand paging both avoid the
bounce-buffering problem because they mediate a specific _view_ of
guest memory, not the underlying memory. i.e. Neither mechanism
prevents userspace from creating a separate mapping where it can
access guest memory independent of the "present set", e.g. to RDMA
guest pages directly from the source.
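As a rough userspace-side illustration (mark_gfn_present() here stands in
for whatever "mark present" uAPI we'd end up with; it is not a real call):

        #include <stdint.h>
        #include <sys/socket.h>
        #include <sys/types.h>

        int mark_gfn_present(uint64_t gfn);     /* hypothetical, see above */

        /*
         * Receive a page straight into its final location via a second,
         * un-intercepted mapping of guest memory, then flip it to present.
         * No temporary buffer, no extra memcpy.
         */
        static int demand_fetch_page(void *second_mapping, uint64_t offset,
                                     size_t page_size, int src_sock, uint64_t gfn)
        {
                ssize_t n = recv(src_sock, (char *)second_mapping + offset,
                                 page_size, MSG_WAITALL);
                if (n != (ssize_t)page_size)
                        return -1;
                return mark_gfn_present(gfn);
        }

The same applies to RDMA: the NIC can target the second mapping directly.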
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: A KVM-specific alternative to UserfaultFD
2023-11-13 16:43 ` David Matlack
@ 2023-11-20 18:32 ` James Houghton
0 siblings, 0 replies; 34+ messages in thread
From: James Houghton @ 2023-11-20 18:32 UTC (permalink / raw)
To: David Matlack
Cc: Peter Xu, Sean Christopherson, Paolo Bonzini, kvm list,
Oliver Upton, Axel Rasmussen, Mike Kravetz, Andrea Arcangeli
On Mon, Nov 13, 2023 at 8:44 AM David Matlack <dmatlack@google.com> wrote:
>
> On Sat, Nov 11, 2023 at 9:30 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Sat, Nov 11, 2023 at 08:23:57AM -0800, David Matlack wrote:
> > >
> > > But now that I think about it, isn't the KVM-based approach useful to
> > > the broader community as well? For example, QEMU could use the
> > > KVM-based demand paging for all KVM-generated accesses to guest
> > > memory. This would provide 4K-granular demand paging for _most_ memory
> > > accesses. Then for vhost and userspace accesses, QEMU can set up a
> > > separate VMA mapping of guest memory and use UserfaultFD. The
> > > userspace/vhost accesses would have to be done at the huge page size
> > > granularity (if using HugeTLB). But most accesses should still come
> > > from KVM, so this would be a real improvement over a pure UserfaultFD
> > > approach.
> >
> > I fully understand why you propose that, but not the one I prefer. That
> > means KVM is leaving other modules behind. :( And that's not even the same
> > as the case where KVM wants to resolve hugetlb over 1G, because at least we
> > tried :) it's just that the proposal got rejected, unfortunately, so far.
> >
> > IMHO we should still consider virt the whole community, not KVM separately,
> > even if KVM is indeed a separate module.
>
> KVM is not just any module, though. It is the _only_ module that
> mediates _guest_ access to host memory. KVM is also a constant: Any
> Linux-based VM that cares about performance is using KVM. guest_memfd,
> on the other hand, is not unique and not constant. It's just one way
> to back guest memory.
>
> The way I see it, we are going to have one of the 2 following outcomes:
>
> 1. VMMs use KVM-based demand paging to mediate guest accesses, and
> UserfaultFD to mediate userspace and vhost accesses.
> 2. VMMs use guest_memfd-based demand paging for guest_memfd, and
> UserfaultFD for everything else.
>
> I think there are many advantages of (1) over (2). (1) means that VMMs
> can have a common software architecture for post-copy across any
> memory types. Any optimizations we implement will apply to _all_
> memory types, not just guest_memfd.
>
> Mediating guest accesses _in KVM_ also has practical benefits. It
> gives us more flexibility to solve problems that are specific to
> virtual machines that other parts of the kernel don't care about. For
> example, there's value in being able to preemptively mark memory as
> present so that guest accesses don't have to notify userspace. During
> a Live Migration, at the beginning of post-copy, there might be a
> large number of guest pages that are present and don't need to be
> fetched. The set might also be sparse. With KVM mediating access to
> guest memory, we can just add a bitmap-based UAPI to KVM to mark
> memory as present.
I'm not sure if this has been mentioned yet, but making guest_memfd
work at all for non-private memory would require changing all the
places where we do uaccess to go through guest_memfd logic instead
(provided the goal is to have no dependency on the userspace page
tables, i.e., we don't need to mmap the guest_memfd). Those uaccess
places are most of the places where KVM demand paging hooks would need
to be added (plus GUP, and maybe a few more, like if KVM uses kmap to
access memory). This isn't considering making post-copy work with
guest_memfd, just making guest_memfd work at all for non-private
memory.
So the way I see it, making guest_memfd work more completely would be
a decent chunk (most?) of the work for KVM demand paging. I think a
more generic solution akin to userfaultfd would be better if it can
cleanly solve our problem.
> Sure we could technically add a bitmap-based API to guest_memfd, but
> that would only solve the problem _for guest_memfd_.
I think this could be done in a more generic way. If we had some kind
of file-based userfaultfd, we could use that to mediate KVM's use of
the guest_memfd, instead of building the API into guest_memfd itself.
Guest_memfd would be the only useful application of this for KVM (as
using tmpfs/etc. memfds directly would still require uaccess =>
regular userfaultfd). Maybe there could be other applications for
file-based userfaultfd someday; I haven't thought of any yet.
It also doesn't have to be some kind of bitmap-based API either
(though that might be the right approach). Certainly UFFDIO_INSERT
(the analogue of UFFDIO_COPY, IIUC) wouldn't be enough either; we
would need an analogue for UFFDIO_CONTINUE (see [1] for the original
motivation).
What about something like this:
When an mm tries to grab a page from the file (via vm_ops->fault or
the appropriate equivalent operation, like for guest_memfd,
kvm_gmem_get_folio), notify userspace via userfaultfd that this is
about to happen, and userspace can delay it for as long as necessary,
then issue a UFFDIO_WAKE_FILE (or something).
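Something like this, roughly (all of the gmem_uffd_* helpers are made up;
this is only to show where the hook would sit relative to the folio lookup):

        /* Pseudo-kernel-code sketch, not a real patch. */
        static struct folio *gmem_get_folio_notify(struct inode *inode, pgoff_t index)
        {
                struct gmem_uffd_ctx *ctx = gmem_uffd_ctx(inode);

                if (ctx && !gmem_offset_is_ready(ctx, index)) {
                        /* Tell userspace which file offset is needed... */
                        gmem_uffd_report_fault(ctx, index);
                        /* ...and wait until it responds with UFFDIO_WAKE_FILE. */
                        wait_event(ctx->wq, gmem_offset_is_ready(ctx, index));
                }
                return kvm_gmem_get_folio(inode, index);
        }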
This approach doesn't solve the problem of telling KVM if a 2M or 1G
mapping is ok, though. :( I'm also not sure how to translate "when an
mm tries to grab a page from the file" into something that makes sense
to userspace, so maybe that's another reason this idea doesn't work,
and a bitmap would be better.
If we go with a file-based userfaultfd system, I think we're basically
stuck with using guest_memfd if we want to avoid the memory + CPU
overhead that comes from normal userfaultfd for post-copy (due to
managing the userspace page tables; not needed with KVM demand
paging). So we'd need to build everything we need into guest_memfd,
like 1G page support, and maybe even things like a
hugetlb-vmemmap-optimization equivalent for guest_memfd.
Given that guest_memfd doesn't support swapping, we wouldn't be able
to reap the benefits of file-based userfaultfd over regular
userfaultfd for tmpfs-backed VMs that rely on swapping. :( So that's
another point in favor of KVM demand paging.
>
> Then there's the bounce-buffering problem. With a guest_memfd-based
> scheme, there's no way for userspace to bypass the kernel's notion of
> what's present. That means all of guest memory has to be
> bounce-buffered. (More on this below.)
I think this is definitely solvable; its solution probably depends on
the mechanics of a file-based userfaultfd, though.
> And even if we generalize (2) to all memfds, that's still not covering
> all ways of backing guest memory.
I think this is a fair point; we can't use file-based userfaultfd for
anon-backed memslots. I don't think it matters that much though:
KVM demand paging + userfaultfd for anon-backed memslots is a little
cumbersome; we're basically stuck with the single mapping. If the VMM
needs to use userfaultfd, there's no reason to switch to KVM demand
paging, because we'll still need to UFFDIO_COPY (or UFFDIO_UNREGISTER)
no matter what.
So KVM demand paging is only helpful for VMMs that use anon-backed
memslots if they've already annotated everywhere where something
non-KVM may access guest memory (and therefore don't need userfaultfd
except to catch KVM's accesses). So userfaultfd will be the easiest
post-copy system to use for a VMM that has used anonymous memory for
the memslots even in a world where KVM demand paging exists.
This is all just my two cents. I'm not sure what the best solution is
yet, but the fact that KVM demand paging has many fewer unknowns than
file-based userfaultfd makes me want to prefer it. And it's
unfortunate that a file-based userfaultfd solution wouldn't help us at
all for VMs that can have their memory swapped out.
- James
[1]: https://lore.kernel.org/linux-mm/20210225002658.2021807-1-axelrasmussen@google.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
Thread overview: 34+ messages
2023-11-06 18:25 RFC: A KVM-specific alternative to UserfaultFD David Matlack
2023-11-06 20:23 ` Peter Xu
2023-11-06 22:24 ` Axel Rasmussen
2023-11-06 23:03 ` Peter Xu
2023-11-06 23:22 ` David Matlack
2023-11-07 14:21 ` Peter Xu
2023-11-07 16:11 ` James Houghton
2023-11-07 17:24 ` Peter Xu
2023-11-07 19:08 ` James Houghton
2023-11-07 16:25 ` Paolo Bonzini
2023-11-07 20:04 ` David Matlack
2023-11-07 21:10 ` Oliver Upton
2023-11-07 21:34 ` David Matlack
2023-11-08 1:27 ` Oliver Upton
2023-11-08 16:56 ` David Matlack
2023-11-08 17:34 ` Peter Xu
2023-11-08 20:10 ` Sean Christopherson
2023-11-08 20:36 ` Peter Xu
2023-11-08 20:47 ` Axel Rasmussen
2023-11-08 21:05 ` David Matlack
2023-11-08 20:49 ` David Matlack
2023-11-08 20:33 ` Paolo Bonzini
2023-11-08 20:43 ` David Matlack
2023-11-07 22:29 ` Peter Xu
2023-11-09 16:41 ` David Matlack
2023-11-09 17:58 ` Sean Christopherson
2023-11-09 18:33 ` David Matlack
2023-11-09 22:44 ` David Matlack
2023-11-09 23:54 ` Sean Christopherson
2023-11-09 19:20 ` Peter Xu
2023-11-11 16:23 ` David Matlack
2023-11-11 17:30 ` Peter Xu
2023-11-13 16:43 ` David Matlack
2023-11-20 18:32 ` James Houghton