* [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Sean Christopherson @ 2024-08-01 22:43 UTC
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Xu, James Houghton, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack

Early warning for next week's PUCK since there's actually a topic this time.
James is going to lead a discussion on KVM userfault[*] (name subject to change).
I Cc'd a few folks that I know are interested, please forward this on as needed.

Early warning #2: PUCK is canceled for August 14th, as I'll be traveling, though
y'all are welcome to meet without me.

[*] https://lore.kernel.org/all/20240710234222.2333120-1-jthoughton@google.com

Time: 6am PDT
Video: https://meet.google.com/vdb-aeqo-knk
Phone: https://tel.meet/vdb-aeqo-knk?pin=3003112178656

Calendar: https://calendar.google.com/calendar/u/0?cid=Y182MWE1YjFmNjQ0NzM5YmY1YmVkN2U1ZWE1ZmMzNjY5Y2UzMmEyNTQ0YzVkYjFjN2M4OTE3MDJjYTUwOTBjN2Q1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20
Drive: https://drive.google.com/drive/folders/1aTqCrvTsQI9T4qLhhLs_l986SngGlhPH?resourcekey=0-FDy0ykM3RerZedI8R-zj4A&usp=drive_link

Future Schedule:
August 7th - KVM userfault
August 14th - Canceled (Sean unavailable)
August 21st - Available
August 28th - Available
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: James Houghton @ 2024-08-07 17:21 UTC
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Early warning for next week's PUCK since there's actually a topic this time.
> James is going to lead a discussion on KVM userfault[*] (name subject to change).

Thanks for attending, everyone!

We seemed to arrive at the following conclusions:

1. For guest_memfd, stage 2 mapping installation will never go through
GUP / virtual addresses to do the GFN --> PFN translation, including
when it supports non-private memory.
2. Something like KVM Userfault is indeed necessary to handle
post-copy for guest_memfd VMs, especially when guest_memfd supports
non-private memory.
3. We should not hook into the overall GFN --> HVA translation; we
should only be hooking the GFN --> PFN translation steps to figure out
how to create stage 2 mappings. That is, KVM's own accesses to guest
memory should just go through mm/userfaultfd.
4. We don't need the concept of "async userfaults" (making KVM block
when attempting to access userfault memory) in KVM Userfault.

So I need to think more about what exactly the API should look like
for controlling if a page should exit to userspace before KVM is
allowed to map it into stage 2, and whether this should apply to all of
guest memory or only guest_memfd.

It sounds like it will most likely be something like a per-VM bitmap
that describes which pages are allowed to be mapped into stage 2,
applying to all memory, not just guest_memfd memory. Even though it is
solving a problem for guest_memfd specifically, it is slightly cleaner
to have it apply to all memory.

If this per-VM bitmap applies to all memory, then we don't need to
wait for guest_memfd to support non-private memory before working on a
full implementation. But if not, perhaps it makes sense to wait.

There will be a 30 minute session at LPC to discuss this topic more. I
hope to see you there! Here are the slides[2]. Thanks!

PS: I'll be away from August 9 - 25.

[2]: https://docs.google.com/presentation/d/1Al9amGumF3ZPX2Wu50mQ4nkPRZZdBJitXmMH3n7j_RE/edit?usp=sharing

> I Cc'd a few folks that I know are interested, please forward this on
> as needed.
>
> Early warning #2: PUCK is canceled for August 14th, as I'll be traveling, though
> y'all are welcome to meet without me.
>
> [*] https://lore.kernel.org/all/20240710234222.2333120-1-jthoughton@google.com
>
> Time: 6am PDT
> Video: https://meet.google.com/vdb-aeqo-knk
> Phone: https://tel.meet/vdb-aeqo-knk?pin=3003112178656
>
> Calendar: https://calendar.google.com/calendar/u/0?cid=Y182MWE1YjFmNjQ0NzM5YmY1YmVkN2U1ZWE1ZmMzNjY5Y2UzMmEyNTQ0YzVkYjFjN2M4OTE3MDJjYTUwOTBjN2Q1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20
> Drive: https://drive.google.com/drive/folders/1aTqCrvTsQI9T4qLhhLs_l986SngGlhPH?resourcekey=0-FDy0ykM3RerZedI8R-zj4A&usp=drive_link
>
> Future Schedule:
> August 7th - KVM userfault
> August 14th - Canceled (Sean unavailable)
> August 21st - Available
> August 28th - Available
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Sean Christopherson @ 2024-08-08 0:17 UTC
To: James Houghton
Cc: kvm, linux-kernel, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Wed, Aug 07, 2024, James Houghton wrote:
> So I need to think more about what exactly the API should look like
> for controlling if a page should exit to userspace before KVM is
> allowed to map it into stage 2 and if this should apply to all of
> guest memory or only guest_memfd.
>
> It sounds like it may most likely be something like a per-VM bitmap
> that describes which pages are allowed to be mapped into stage 2,
> applying to all memory, not just guest_memfd memory. Even though it is
> solving a problem for guest_memfd specifically, it is slightly cleaner
> to have it apply to all memory.
>
> If this per-VM bitmap applies to all memory, then we don't need to
> wait for guest_memfd to support non-private memory before working on a
> full implementation. But if not, perhaps it makes sense to wait.

Per-memslot likely makes more sense. Unlike attributes, the bitmap only needs
to exist during post-copy, and unless we do something clever, i.e. use
something other than a bitmap, the bitmap needs to be fully allocated, which
would result in unnecessary overhead if there are gaps in guest physical
memory.

The other hiccup with a per-VM bitmap is that it would force us to define ABI
for things we don't care about. E.g. what happens if the local APIC is
in-kernel and userspace marks the APIC page as USERFAULT? Ditto for gfns
without memslots.

E.g. add a KVM_MEM_USERFAULT flag along with a userfault_bitmap user pointer
that is valid when the flag is set. Unlike dirty logging, KVM is only a reader
of the bitmap, so I'm pretty sure we don't need a copy in KVM.

When userspace creates the VM on the target, it allocates a bitmap for each
memslot and sets KVM_MEM_USERFAULT. When migration completes, userspace clears
KVM_MEM_USERFAULT for each memslot, and then deletes the associated bitmap.
* RE: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Wang, Wei W @ 2024-08-08 12:15 UTC
To: James Houghton, Sean Christopherson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Thursday, August 8, 2024 1:22 AM, James Houghton wrote:
> 1. For guest_memfd, stage 2 mapping installation will never go through
> GUP / virtual addresses to do the GFN --> PFN translation, including
> when it supports non-private memory.
> 2. Something like KVM Userfault is indeed necessary to handle
> post-copy for guest_memfd VMs, especially when guest_memfd supports
> non-private memory.
> 3. We should not hook into the overall GFN --> HVA translation, we
> should only be hooking the GFN --> PFN translation steps to figure out
> how to create stage 2 mappings. That is, KVM's own accesses to guest
> memory should just go through mm/userfaultfd.

Sorry.. still a bit confused about this one: will gmem finally support GUP and
VMA? For 1. above, seems no, but for 3. here, KVM's own accesses to gmem will
go through userfaultfd via GUP?
Also, how would vhost's access to gmem get faulted to userspace?
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: James Houghton @ 2024-08-08 19:04 UTC
To: Wang, Wei W
Cc: Sean Christopherson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Thu, Aug 8, 2024 at 5:15 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> Sorry.. still a bit confused about this one: will gmem finally support GUP
> and VMA? For 1. above, seems no, but for 3. here, KVM's own accesses to gmem
> will go through userfaultfd via GUP?
> Also, how would vhost's access to gmem get faulted to userspace?

Hi Wei,

From what we discussed in the meeting, guest_memfd will be mappable into
userspace (so VMAs can be created for it), and so GUP will be able to work on
it. However, KVM will *not* use GUP for doing gfn -> pfn translations for
installing stage 2 mappings. (For guest-private memory, GUP cannot be used,
but the claim is that GUP will never be used, no matter if it's guest-private
or guest-shared.)

KVM's own accesses to guest memory (i.e., places where it does
copy_to/from_user) will go through GUP. By default, that's just how it would
work. What I'm saying is that we aren't going to add anything extra to have
"KVM Userfault" prevent KVM from doing a copy_to/from_user (like how I had it
in the RFC, where KVM Userfault can block the translation of gfn -> hva).

vhost's accesses to guest memory will be the same as KVM's: they will go
through copy_to/from_user.

Hopefully that's a little clearer. :)
* RE: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Wang, Wei W @ 2024-08-09 13:51 UTC
To: James Houghton
Cc: Sean Christopherson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Friday, August 9, 2024 3:05 AM, James Houghton wrote:
> From what we discussed in the meeting, guest_memfd will be mappable into
> userspace (so VMAs can be created for it), and so GUP will be able to work
> on it. However, KVM will *not* use GUP for doing gfn -> pfn translations for
> installing stage 2 mappings. (For guest-private memory, GUP cannot be used,
> but the claim is that GUP will never be used, no matter if it's
> guest-private or guest-shared.)

OK. For KVM userfault on a guest-shared page, how does a physical page get
filled with the data (received from the source) and installed into the host
cr3 and guest stage-2 page tables? Add a new gmem uAPI to achieve this?

There also seems to be a race condition between KVM userfault and userfaultfd.
For example, a guest access to a guest-shared page triggers KVM userfault to
userspace, while vhost (or KVM) could access the same page during the window
in which KVM userfault is handling the page; then there will be two
simultaneous faults on the same page.
I'm thinking how this case would be handled? (Leaving it to userspace to
detect and handle such cases would be complex.)

> KVM's own accesses to guest memory (i.e., places where it does
> copy_to/from_user) will go through GUP. By default, that's just how it would
> work. What I'm saying is that we aren't going to add anything extra to have
> "KVM Userfault" prevent KVM from doing a copy_to/from_user (like how I had
> it in the RFC, where KVM Userfault can block the translation of gfn -> hva).
>
> vhost's accesses to guest memory will be the same as KVM's: it will go
> through copy_to/from_user.
>
> Hopefully that's a little clearer. :)

Yeah, thanks for the explanation. Enjoy your vacation. We can continue the
discussion after that :)
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Sean Christopherson @ 2024-08-09 19:04 UTC
To: Wei W Wang
Cc: James Houghton, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Fri, Aug 09, 2024, Wei W Wang wrote:
> There also seems to be a race condition between KVM userfault and
> userfaultfd. For example, a guest access to a guest-shared page triggers KVM
> userfault to userspace, while vhost (or KVM) could access the same page
> during the window in which KVM userfault is handling the page; then there
> will be two simultaneous faults on the same page.
> I'm thinking how this case would be handled? (Leaving it to userspace to
> detect and handle such cases would be complex.)

Userspace is going to have to handle racing "faults" no matter what, e.g. if
multiple vCPUs hit the same fault and exit at the same time. I don't think
it'll be too complex to detect spurious/fixed faults and retry.
* RE: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Wang, Wei W @ 2024-08-12 14:12 UTC
To: Sean Christopherson
Cc: James Houghton, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Peter Xu, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Saturday, August 10, 2024 3:05 AM, Sean Christopherson wrote:
> Userspace is going to have to handle racing "faults" no matter what, e.g. if
> multiple vCPUs hit the same fault and exit at the same time. I don't think
> it'll be too complex to detect spurious/fixed faults and retry.

Yes, the case of multiple vCPUs hitting the same fault shouldn't be difficult
to handle, as they fall into the same handling path (i.e., KVM userfault). But
if vCPUs and vhost hit the same faults, the two types of fault exit (i.e., KVM
userfault and userfaultfd) will occur at the same time (IIUC, the vCPU access
triggers KVM userfault and the vhost access triggers userfaultfd). So the
userspace VMM would be required to coordinate between the two types of
userfault.

For example, when the page data is fetched from the source, the VMM first
needs to determine whether the page should be installed via UFFDIO_COPY (for
the userfaultfd case) and/or a new uAPI, say KVM_USERFAULT_COPY (for the KVM
userfault case). In the example above, both UFFDIO_COPY and KVM_USERFAULT_COPY
need to be invoked, e.g.:
#1 invoke KVM_USERFAULT_COPY
#2 invoke UFFDIO_COPY

This requires that UFFDIO_COPY not conflict with KVM_USERFAULT_COPY. The
current UFFDIO_COPY will fail (thus not waking up the threads on the waitq)
when it fails to install the PTE into the page table (in the above example,
the PTE has already been installed into the page table by KVM_USERFAULT_COPY
at #1).
* Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Peter Xu @ 2024-08-12 15:24 UTC
To: Wang, Wei W
Cc: Sean Christopherson, James Houghton, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Paolo Bonzini, Oliver Upton, Axel Rasmussen, David Matlack, Anish Moorthy

On Mon, Aug 12, 2024 at 02:12:29PM +0000, Wang, Wei W wrote:
> In the example above, both UFFDIO_COPY and KVM_USERFAULT_COPY need to be
> invoked, e.g.:
> #1 invoke KVM_USERFAULT_COPY
> #2 invoke UFFDIO_COPY
>
> This requires that UFFDIO_COPY not conflict with KVM_USERFAULT_COPY. The
> current UFFDIO_COPY will fail (thus not waking up the threads on the waitq)
> when it fails to install the PTE into the page table (in the above example,
> the PTE has already been installed into the page table by KVM_USERFAULT_COPY
> at #1).

Indeed. Maybe we can fix that with an explicit UFFDIO_WAKE upon UFFDIO_COPY
failures iff -EEXIST (in this case, it should fall into the "page cache
exists" category, even if the pgtable can still be missing). I assume OTOH a
racy KVM_USERFAULT_COPY in whatever form doesn't need anything but to kick the
vCPU, regardless of whether the copy succeeded or not.

Thanks,

--
Peter Xu