* [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
@ 2024-08-01 22:43 Sean Christopherson
2024-08-07 17:21 ` James Houghton
0 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2024-08-01 22:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Xu, James Houghton, Paolo Bonzini,
Oliver Upton, Axel Rasmussen, David Matlack
Early warning for next week's PUCK since there's actually a topic this time.
James is going to lead a discussion on KVM userfault[*] (name subject to change).
I Cc'd a few folks that I know are interested; please forward this on
as needed.
Early warning #2, PUCK is canceled for August 14th, as I'll be traveling, though
y'all are welcome to meet without me.
[*] https://lore.kernel.org/all/20240710234222.2333120-1-jthoughton@google.com
Time: 6am PDT
Video: https://meet.google.com/vdb-aeqo-knk
Phone: https://tel.meet/vdb-aeqo-knk?pin=3003112178656
Calendar: https://calendar.google.com/calendar/u/0?cid=Y182MWE1YjFmNjQ0NzM5YmY1YmVkN2U1ZWE1ZmMzNjY5Y2UzMmEyNTQ0YzVkYjFjN2M4OTE3MDJjYTUwOTBjN2Q1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20
Drive: https://drive.google.com/drive/folders/1aTqCrvTsQI9T4qLhhLs_l986SngGlhPH?resourcekey=0-FDy0ykM3RerZedI8R-zj4A&usp=drive_link
Future Schedule:
August 7th - KVM userfault
August 14th - Canceled (Sean unavailable)
August 21st - Available
August 28th - Available
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-01 22:43 [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy) Sean Christopherson
@ 2024-08-07 17:21 ` James Houghton
2024-08-08 0:17 ` Sean Christopherson
2024-08-08 12:15 ` Wang, Wei W
0 siblings, 2 replies; 9+ messages in thread
From: James Houghton @ 2024-08-07 17:21 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Xu, Paolo Bonzini, Oliver Upton,
Axel Rasmussen, David Matlack, Anish Moorthy
On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Early warning for next week's PUCK since there's actually a topic this time.
> James is going to lead a discussion on KVM userfault[*](name subject to change).
Thanks for attending, everyone!
We seemed to arrive at the following conclusions:
1. For guest_memfd, stage 2 mapping installation will never go through
GUP / virtual addresses to do the GFN --> PFN translation, including
when it supports non-private memory.
2. Something like KVM Userfault is indeed necessary to handle
post-copy for guest_memfd VMs, especially when guest_memfd supports
non-private memory.
3. We should not hook into the overall GFN --> HVA translation, we
should only be hooking the GFN --> PFN translation steps to figure out
how to create stage 2 mappings. That is, KVM's own accesses to guest
memory should just go through mm/userfaultfd.
4. We don't need the concept of "async userfaults" (making KVM block
when attempting to access userfault memory) in KVM Userfault.
So I need to think more about what exactly the API should look like
for controlling whether a page should exit to userspace before KVM is
allowed to map it into stage 2, and whether this should apply to all of
guest memory or only to guest_memfd.
It sounds like it will most likely be something like a per-VM bitmap
that describes which pages are allowed to be mapped into stage 2,
applying to all memory, not just guest_memfd memory. Even though it is
solving a problem for guest_memfd specifically, it is slightly cleaner
to have it apply to all memory.
If this per-VM bitmap applies to all memory, then we don't need to
wait for guest_memfd to support non-private memory before working on a
full implementation. But if not, perhaps it makes sense to wait.
There will be a 30 minute session at LPC to discuss this topic more. I
hope to see you there!
Here are the slides[2].
Thanks!
PS: I'll be away from August 9 - 25.
[2]: https://docs.google.com/presentation/d/1Al9amGumF3ZPX2Wu50mQ4nkPRZZdBJitXmMH3n7j_RE/edit?usp=sharing
> I Cc'd folks a few folks that I know are interested, please forward this on
> as needed.
>
> Early warning #2, PUCK is canceled for August 14th, as I'll be traveling, though
> y'all are welcome to meet without me.
>
> [*] https://lore.kernel.org/all/20240710234222.2333120-1-jthoughton@google.com
>
> Time: 6am PDT
> Video: https://meet.google.com/vdb-aeqo-knk
> Phone: https://tel.meet/vdb-aeqo-knk?pin=3003112178656
>
> Calendar: https://calendar.google.com/calendar/u/0?cid=Y182MWE1YjFmNjQ0NzM5YmY1YmVkN2U1ZWE1ZmMzNjY5Y2UzMmEyNTQ0YzVkYjFjN2M4OTE3MDJjYTUwOTBjN2Q1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20
> Drive: https://drive.google.com/drive/folders/1aTqCrvTsQI9T4qLhhLs_l986SngGlhPH?resourcekey=0-FDy0ykM3RerZedI8R-zj4A&usp=drive_link
>
> Future Schedule:
> Augst 7th - KVM userfault
> August 14th - Canceled (Sean unavailable)
> August 21st - Available
> August 28th - Available
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-07 17:21 ` James Houghton
@ 2024-08-08 0:17 ` Sean Christopherson
2024-08-08 12:15 ` Wang, Wei W
1 sibling, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2024-08-08 0:17 UTC (permalink / raw)
To: James Houghton
Cc: kvm, linux-kernel, Peter Xu, Paolo Bonzini, Oliver Upton,
Axel Rasmussen, David Matlack, Anish Moorthy
On Wed, Aug 07, 2024, James Houghton wrote:
> On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Early warning for next week's PUCK since there's actually a topic this time.
> > James is going to lead a discussion on KVM userfault[*](name subject to change).
>
> Thanks for attending, everyone!
>
> We seemed to arrive at the following conclusions:
>
> 1. For guest_memfd, stage 2 mapping installation will never go through
> GUP / virtual addresses to do the GFN --> PFN translation, including
> when it supports non-private memory.
> 2. Something like KVM Userfault is indeed necessary to handle
> post-copy for guest_memfd VMs, especially when guest_memfd supports
> non-private memory.
> 3. We should not hook into the overall GFN --> HVA translation, we
> should only be hooking the GFN --> PFN translation steps to figure out
> how to create stage 2 mappings. That is, KVM's own accesses to guest
> memory should just go through mm/userfaultfd.
> 4. We don't need the concept of "async userfaults" (making KVM block
> when attempting to access userfault memory) in KVM Userfault.
>
> So I need to think more about what exactly the API should look like
> for controlling if a page should exit to userspace before KVM is
> allowed to map it into stage 2 and if this should apply to all of
> guest memory or only guest_memfd.
>
> It sounds like it may most likely be something like a per-VM bitmap
> that describes which pages are allowed to be mapped into stage 2,
> applying to all memory, not just guest_memfd memory. Even though it is
> solving a problem for guest_memfd specifically, it is slightly cleaner
> to have it apply to all memory.
>
> If this per-VM bitmap applies to all memory, then we don't need to
> wait for guest_memfd to support non-private memory before working on a
> full implementation. But if not, perhaps it makes sense to wait.
Per-memslot likely makes more sense. Unlike attributes, the bitmap only needs
to exist during post-copy, and unless we do something clever, i.e. use something
other than a bitmap, the bitmap needs to be fully allocated, which would result
in unnecessary overhead if there are gaps in guest physical memory.
The other hiccup with a per-VM bitmap is that it would force us to define ABI
for things we don't care about. E.g. what happens if the local APIC is in-kernel
and userspace marks the APIC page as USERFAULT? Ditto for gfns without memslots.
E.g. add a KVM_MEM_USERFAULT flag along with a userfault_bitmap user pointer
that is valid when the flag is set. Unlike dirty logging, KVM is only a reader
of the bitmap, so I'm pretty sure we don't need a copy in KVM.
When userspace creates the VM on the target, it allocates a bitmap for each
memslot and sets KVM_MEM_USERFAULT. When migration completes, userspace clears
KVM_MEM_USERFAULT for each memslot, and then deletes the associated bitmap.
* RE: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-07 17:21 ` James Houghton
2024-08-08 0:17 ` Sean Christopherson
@ 2024-08-08 12:15 ` Wang, Wei W
2024-08-08 19:04 ` James Houghton
1 sibling, 1 reply; 9+ messages in thread
From: Wang, Wei W @ 2024-08-08 12:15 UTC (permalink / raw)
To: James Houghton, Sean Christopherson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Peter Xu,
Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack,
Anish Moorthy
On Thursday, August 8, 2024 1:22 AM, James Houghton wrote:
> On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@google.com>
> wrote:
> >
> > Early warning for next week's PUCK since there's actually a topic this time.
> > James is going to lead a discussion on KVM userfault[*](name subject to
> change).
>
> Thanks for attending, everyone!
>
> We seemed to arrive at the following conclusions:
>
> 1. For guest_memfd, stage 2 mapping installation will never go through GUP /
> virtual addresses to do the GFN --> PFN translation, including when it supports
> non-private memory.
> 2. Something like KVM Userfault is indeed necessary to handle post-copy for
> guest_memfd VMs, especially when guest_memfd supports non-private
> memory.
> 3. We should not hook into the overall GFN --> HVA translation, we should
> only be hooking the GFN --> PFN translation steps to figure out how to create
> stage 2 mappings. That is, KVM's own accesses to guest memory should just go
> through mm/userfaultfd.
Sorry.. still a bit confused about this one: will gmem finally support GUP and VMA?
For 1. above, seems no, but for 3. here, KVM's own accesses to gmem will go
through userfaultfd via GUP?
Also, how would vhost's access to gmem get faulted to userspace?
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-08 12:15 ` Wang, Wei W
@ 2024-08-08 19:04 ` James Houghton
2024-08-09 13:51 ` Wang, Wei W
0 siblings, 1 reply; 9+ messages in thread
From: James Houghton @ 2024-08-08 19:04 UTC (permalink / raw)
To: Wang, Wei W
Cc: Sean Christopherson, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini,
Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy
On Thu, Aug 8, 2024 at 5:15 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> On Thursday, August 8, 2024 1:22 AM, James Houghton wrote:
> > 1. For guest_memfd, stage 2 mapping installation will never go through GUP /
> > virtual addresses to do the GFN --> PFN translation, including when it supports
> > non-private memory.
> > 2. Something like KVM Userfault is indeed necessary to handle post-copy for
> > guest_memfd VMs, especially when guest_memfd supports non-private
> > memory.
> > 3. We should not hook into the overall GFN --> HVA translation, we should
> > only be hooking the GFN --> PFN translation steps to figure out how to create
> > stage 2 mappings. That is, KVM's own accesses to guest memory should just go
> > through mm/userfaultfd.
>
> Sorry.. still a bit confused about this one: will gmem finally support GUP and VMA?
> For 1. above, seems no, but for 3. here, KVM's own accesses to gmem will go
> through userfaultfd via GUP?
> Also, how would vhost's access to gmem get faulted to userspace?
Hi Wei,
From what we discussed in the meeting, guest_memfd will be mappable
into userspace (so VMAs can be created for it), and so GUP will be
able to work on it. However, KVM will *not* use GUP for doing gfn ->
pfn translations for installing stage 2 mappings. (For guest-private
memory, GUP cannot be used, but the claim is that GUP will never be
used, no matter if it's guest-private or guest-shared.)
KVM's own accesses to guest memory (i.e., places where it does
copy_to/from_user) will go through GUP. By default, that's just how it
would work. What I'm saying is that we aren't going to add anything
extra to have "KVM Userfault" prevent KVM from doing a
copy_to/from_user (like how I had it in the RFC, where KVM Userfault
can block the translation of gfn -> hva).
vhost's accesses to guest memory will be the same as KVM's: it will go
through copy_to/from_user.
Hopefully that's a little clearer. :)
* RE: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-08 19:04 ` James Houghton
@ 2024-08-09 13:51 ` Wang, Wei W
2024-08-09 19:04 ` Sean Christopherson
0 siblings, 1 reply; 9+ messages in thread
From: Wang, Wei W @ 2024-08-09 13:51 UTC (permalink / raw)
To: James Houghton
Cc: Sean Christopherson, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini,
Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy
On Friday, August 9, 2024 3:05 AM, James Houghton wrote:
> On Thu, Aug 8, 2024 at 5:15 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
> >
> > On Thursday, August 8, 2024 1:22 AM, James Houghton wrote:
> > > 1. For guest_memfd, stage 2 mapping installation will never go
> > > through GUP / virtual addresses to do the GFN --> PFN translation,
> > > including when it supports non-private memory.
> > > 2. Something like KVM Userfault is indeed necessary to handle
> > > post-copy for guest_memfd VMs, especially when guest_memfd supports
> > > non-private memory.
> > > 3. We should not hook into the overall GFN --> HVA translation, we
> > > should only be hooking the GFN --> PFN translation steps to figure
> > > out how to create stage 2 mappings. That is, KVM's own accesses to
> > > guest memory should just go through mm/userfaultfd.
> >
> > Sorry.. still a bit confused about this one: will gmem finally support GUP and
> VMA?
> > For 1. above, seems no, but for 3. here, KVM's own accesses to gmem
> > will go through userfaultfd via GUP?
> > Also, how would vhost's access to gmem get faulted to userspace?
>
> Hi Wei,
>
> From what we discussed in the meeting, guest_memfd will be mappable into
> userspace (so VMAs can be created for it), and so GUP will be able to work on
> it. However, KVM will *not* use GUP for doing gfn -> pfn translations for
> installing stage 2 mappings. (For guest-private memory, GUP cannot be used,
> but the claim is that GUP will never be used, no matter if it's guest-private or
> guest-shared.)
OK. For KVM userfault on a guest-shared page, how does a physical page get filled
with the data (received from the source) and installed into the host CR3 and guest
stage-2 page tables? Would a new gmem uAPI be added to achieve this?
There also seems to be a race condition between KVM userfault and userfaultfd.
For example, a guest access to a guest-shared page triggers KVM userfault to
userspace, while vhost (or KVM) could access the same page during the window
in which KVM userfault is handling the page; there would then be two simultaneous
faults on the same page.
I'm thinking about how this case would be handled. (Leaving it to userspace to
detect and handle such cases would be complex.)
>
> KVM's own accesses to guest memory (i.e., places where it does
> copy_to/from_user) will go through GUP. By default, that's just how it would
> work. What I'm saying is that we aren't going to add anything extra to have
> "KVM Userfault" prevent KVM from doing a copy_to/from_user (like how I had
> it in the RFC, where KVM Userfault can block the translation of gfn -> hva).
>
> vhost's accesses to guest memory will be the same as KVM's: it will go through
> copy_to/from_user.
>
> Hopefully that's a little clearer. :)
Yeah, thanks for the explanation.
Enjoy your vacation. We can continue the discussion after that :)
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-09 13:51 ` Wang, Wei W
@ 2024-08-09 19:04 ` Sean Christopherson
2024-08-12 14:12 ` Wang, Wei W
0 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:04 UTC (permalink / raw)
To: Wei W Wang
Cc: James Houghton, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen,
David Matlack, Anish Moorthy
On Fri, Aug 09, 2024, Wei W Wang wrote:
> On Friday, August 9, 2024 3:05 AM, James Houghton wrote:
> > On Thu, Aug 8, 2024 at 5:15 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
> There also seems to be a race condition between KVM userfault and userfaultfd.
> For example, guest access to a guest-shared page triggers KVM userfault to
> userspace while vhost (or KVM) could access to the same page during the window
> that KVM userfault is handling the page, then there will be two simultaneous faults
> on the same page.
> I'm thinking how would this case be handled? (leaving it to userspace to detect and
> handle such cases would be an complex)
Userspace is going to have to handle racing "faults" no matter what, e.g. if
multiple vCPUs hit the same fault and exit at the same time. I don't think it'll
be too complex to detect spurious/fixed faults and retry.
* RE: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-09 19:04 ` Sean Christopherson
@ 2024-08-12 14:12 ` Wang, Wei W
2024-08-12 15:24 ` Peter Xu
0 siblings, 1 reply; 9+ messages in thread
From: Wang, Wei W @ 2024-08-12 14:12 UTC (permalink / raw)
To: Sean Christopherson
Cc: James Houghton, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen,
David Matlack, Anish Moorthy
On Saturday, August 10, 2024 3:05 AM, Sean Christopherson wrote:
> On Fri, Aug 09, 2024, Wei W Wang wrote:
> > On Friday, August 9, 2024 3:05 AM, James Houghton wrote:
> > > On Thu, Aug 8, 2024 at 5:15 AM Wang, Wei W <wei.w.wang@intel.com>
> wrote:
> > There also seems to be a race condition between KVM userfault and
> userfaultfd.
> > For example, guest access to a guest-shared page triggers KVM
> > userfault to userspace while vhost (or KVM) could access to the same
> > page during the window that KVM userfault is handling the page, then
> > there will be two simultaneous faults on the same page.
> > I'm thinking how would this case be handled? (leaving it to userspace
> > to detect and handle such cases would be an complex)
>
> Userspace is going to have to handle racing "faults" no matter what, e.g. if
> multiple vCPUs hit the same fault and exit at the same time. I don't think it'll
> be too complex to detect spurious/fixed faults and retry.
Yes, the case of multiple vCPUs hitting the same fault shouldn't be difficult
to handle as they fall into the same handling path (i.e., KVM userfault). But if
vCPUs and vhost hit the same faults, the two types of fault exit (i.e., KVM
userfault and userfaultfd) will occur at the same time (IIUC, the vCPU access
triggers KVM userfault and the vhost access triggers userfaultfd).
So, the userspace VMM would be required to coordinate between the two types of
userfault. For example, when the page data is fetched from the source, VMM first
needs to determine whether the page should be installed via UFFDIO_COPY (for the
userfaultfd case) and/or the new uAPI, say KVM_USERFAULT_COPY (for the KVM
userfault case).
In the example above, both UFFDIO_COPY and KVM_USERFAULT_COPY need to be
invoked, e.g.:
#1 invoke KVM_USERFAULT_COPY
#2 invoke UFFDIO_COPY
This requires that UFFDIO_COPY not conflict with KVM_USERFAULT_COPY. The current
UFFDIO_COPY will fail (thus not waking up the threads on the waitq) when it fails to
install the PTE into the page table (in the above example, the PTE has already been
installed into the page table by KVM_USERFAULT_COPY at #1).
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
2024-08-12 14:12 ` Wang, Wei W
@ 2024-08-12 15:24 ` Peter Xu
0 siblings, 0 replies; 9+ messages in thread
From: Peter Xu @ 2024-08-12 15:24 UTC (permalink / raw)
To: Wang, Wei W
Cc: Sean Christopherson, James Houghton, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Paolo Bonzini, Oliver Upton,
Axel Rasmussen, David Matlack, Anish Moorthy
On Mon, Aug 12, 2024 at 02:12:29PM +0000, Wang, Wei W wrote:
> In the example above, both UFFDIO_COPY and KVM_USERFAULT_COPY need to be
> invoked, e.g.:
> #1 invoke KVM_USERFAULT_COPY
> #2 invoke UFFDIO_COPY
>
> This requires that UFFDIO_COPY does not conflict with KVM_USERFAULT_COPY. Current
> UFFDIO_COPY will fail (thus not waking up the threads on the waitq) when it fails to
> install the PTE into the page table (in the above example the PTE has already been
> installed into the page table by KVM_USERFAULT_COPY at #1).
Indeed, maybe we can fix that with an explicit UFFDIO_WAKE upon UFFDIO_COPY
failures iff -EEXIST (in this case, it should fall into "page cache exists"
category, even if pgtable can still be missing).
I assume OTOH a racy KVM_USERFAULT_COPY in whatever form doesn't need
anything but to kick the vCPU, regardless of whether the copy succeeded or
not.
Thanks,
--
Peter Xu