* [RFC PATCH 0/6] KVM: x86: async PF user
@ 2024-11-18 12:39 Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 1/6] Documentation: KVM: add userfault KVM exit flag Nikita Kalyazin
` (7 more replies)
0 siblings, 8 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
Async PF [1] allows other processes to run on a vCPU while the host
handles a stage-2 fault caused by a process on that vCPU. When using
VM-exit-based stage-2 fault handling [2], async PF functionality is lost
because KVM does not run the vCPU while a fault is being handled, so no
other process can execute on the vCPU. This patch series extends
VM-exit-based stage-2 fault handling with async PF support by letting
userspace handle faults instead of the kernel, hence the name
"async PF user".
I circulated the idea with Paolo, Sean, David H, and James H at LPC,
and the only concern I heard was about injecting the "page not present"
event via the #PF exception in the CoCo case, where it may not work. In my
implementation, I reused the existing code for doing that, so the async
PF user implementation is on par with the present async PF
implementation in this regard, and support for the CoCo case can be
added separately.
Please note that this series is applied on top of the VM-exit-based
stage-2 fault handling RFC [2].
Implementation
The following workflow is implemented:
- A process in the guest causes a stage-2 fault.
- KVM checks whether the fault can be handled asynchronously. If it
can, KVM prepares the VM exit info with a newly added "async PF" flag
set and an async PF token value corresponding to the fault.
- Userspace reads the VM exit info and resumes the vCPU immediately.
Meanwhile it processes the fault.
- When the fault is resolved, userspace calls a new async ioctl using
the token to notify KVM.
- KVM communicates to the guest that the process can be resumed.
Notes:
- No changes to the x86 async PF PV interface are required
- The series does not introduce new dependencies on x86 compared to the
existing async PF
Testing
Inspired by [3], I built a Firecracker-based setup, where Firecracker
implemented the VM-exit-based fault handling. I observed that a workload
consisting of a CPU-bound thread and a memory-bound thread running
concurrently executed faster with async PF user enabled: with 10 ms-long
fault processing, it was 26% faster.
It is difficult to provide an objective performance comparison between
async PF kernel and async PF user, because async PF user can only work
with VM-exit-based fault handling, which has its own performance
characteristics compared to in-kernel fault handling or UserfaultFD.
The patch series is built on top of the VM-exit-based stage-2 fault
handling RFC [2].
Patch 1 updates the documentation to reflect the changes in [2].
Patches 2-6 add the implementation of async PF user.
Questions:
- Are there any general concerns about the approach?
- Can we leave the CoCo use case aside for now, or do we need to
support it straight away?
- What is the desired level of coupling between async PF and async PF
user? For now, I kept the coupling to the bare minimum (only the
PV-related data structure is shared between the two).
[1] https://kvm-forum.qemu.org/2021/sdei_apf_for_arm64_gavin.pdf
[2] https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
[3] https://lore.kernel.org/all/20200508032919.52147-1-gshan@redhat.com/
Nikita
Nikita Kalyazin (6):
Documentation: KVM: add userfault KVM exit flag
Documentation: KVM: add async pf user doc
KVM: x86: add async ioctl support
KVM: trace events: add type argument to async pf
KVM: x86: async_pf_user: add infrastructure
KVM: x86: async_pf_user: hook to fault handling and add ioctl
Documentation/virt/kvm/api.rst | 35 ++++++
arch/x86/include/asm/kvm_host.h | 12 +-
arch/x86/kvm/Kconfig | 7 ++
arch/x86/kvm/lapic.c | 2 +
arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
arch/x86/kvm/x86.c | 101 +++++++++++++++-
arch/x86/kvm/x86.h | 2 +
include/linux/kvm_host.h | 30 +++++
include/linux/kvm_types.h | 1 +
include/trace/events/kvm.h | 50 +++++---
include/uapi/linux/kvm.h | 12 +-
virt/kvm/Kconfig | 3 +
virt/kvm/Makefile.kvm | 1 +
virt/kvm/async_pf.c | 2 +-
virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
virt/kvm/async_pf_user.h | 24 ++++
virt/kvm/kvm_main.c | 14 +++
17 files changed, 535 insertions(+), 26 deletions(-)
create mode 100644 virt/kvm/async_pf_user.c
create mode 100644 virt/kvm/async_pf_user.h
base-commit: 15f01813426bf9672e2b24a5bac7b861c25de53b
--
2.40.1
* [RFC PATCH 1/6] Documentation: KVM: add userfault KVM exit flag
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
Update the KVM documentation to reflect the change made in [1]:
add the KVM_MEMORY_EXIT_FLAG_USERFAULT flag to struct memory_fault.
[1] https://lore.kernel.org/lkml/20240710234222.2333120-7-jthoughton@google.com/
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
Documentation/virt/kvm/api.rst | 3 +++
1 file changed, 3 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 26a98fea718c..ffe9a2d0e525 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6996,6 +6996,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
/* KVM_EXIT_MEMORY_FAULT */
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
+ #define KVM_MEMORY_EXIT_FLAG_USERFAULT (1ULL << 4)
__u64 flags;
__u64 gpa;
__u64 size;
@@ -7009,6 +7010,8 @@ describes properties of the faulting access that are likely pertinent:
- KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
on a private memory access. When clear, indicates the fault occurred on a
shared access.
+ - KVM_MEMORY_EXIT_FLAG_USERFAULT - When set, indicates the memory fault
+ occurred because the vCPU attempted to access a gfn marked as userfault.
Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
--
2.40.1
* [RFC PATCH 2/6] Documentation: KVM: add async pf user doc
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ffe9a2d0e525..b30f9989f5c1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,32 @@ a single guest_memfd file, but the bound ranges must not overlap).
See KVM_SET_USER_MEMORY_REGION2 for additional details.
+4.143 KVM_ASYNC_PF_USER_READY
+-----------------------------
+
+:Capability: KVM_CAP_USERFAULT
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_async_pf_user_ready (in)
+:Returns: 0 on success, <0 on error
+
+KVM_ASYNC_PF_USER_READY notifies the kernel that the fault corresponding to
+'token' has been resolved by userspace. Userspace is expected to issue this
+ioctl when processing an async PF in response to a VM exit with the
+KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER flag set. The 'token' must match the value
+supplied by the kernel in the 'async_pf_user_token' field of
+struct memory_fault. When handling the ioctl, the kernel injects the
+'page present' event into the guest and wakes the vCPU up if it is halted,
+as it would when completing a regular (kernel) async PF.
+
+::
+
+ struct kvm_async_pf_user_ready {
+ __u32 token;
+ };
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
5. The kvm_run structure
========================
@@ -6997,9 +7023,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
#define KVM_MEMORY_EXIT_FLAG_USERFAULT (1ULL << 4)
+ #define KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER (1ULL << 5)
__u64 flags;
__u64 gpa;
__u64 size;
+ __u32 async_pf_user_token;
} memory_fault;
KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
@@ -7012,6 +7040,10 @@ describes properties of the faulting access that are likely pertinent:
shared access.
- KVM_MEMORY_EXIT_FLAG_USERFAULT - When set, indicates the memory fault
occurred because the vCPU attempted to access a gfn marked as userfault.
+ - KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER - When set, indicates the memory fault can
+ be processed asynchronously and 'async_pf_user_token' contains the token to
+ be used when notifying KVM of the completion via the KVM_ASYNC_PF_USER_READY
+ ioctl.
Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
--
2.40.1
* [RFC PATCH 3/6] KVM: x86: add async ioctl support
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
x86 has not supported asynchronous vCPU ioctls so far.
This patch adds a stub arch implementation, but does not add any such
ioctls just yet.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/x86.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ebd1ec6600bc..191dfba3e27a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
select KVM_GENERIC_HARDWARE_ENABLING
select KVM_WERROR if WERROR
select KVM_USERFAULT
+ select HAVE_KVM_VCPU_ASYNC_IOCTL
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ba0ad76f53bc..800493739043 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13619,6 +13619,12 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
}
#endif
+long kvm_arch_vcpu_async_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ return -ENOIOCTLCMD;
+}
+
int kvm_spec_ctrl_test_value(u64 value)
{
/*
--
2.40.1
* [RFC PATCH 4/6] KVM: trace events: add type argument to async pf
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
With async PF user being added, a new int argument `type` is added to the
existing tracepoint definitions so that async PF user can be distinguished
from async PF kernel: the value is either 0 ("kernel") or 1 ("user").
For now, all users of these tracepoints pass 0 ("kernel"), as async PF
user is not yet implemented. The following commits set it to 1 ("user")
where appropriate.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
arch/x86/kvm/mmu/mmu.c | 4 +--
arch/x86/kvm/x86.c | 4 +--
include/trace/events/kvm.h | 50 ++++++++++++++++++++++++--------------
virt/kvm/async_pf.c | 2 +-
4 files changed, 37 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f0dbc3c68e5c..004e068cabae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4395,9 +4395,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_CONTINUE; /* *pfn has correct page already */
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
- trace_kvm_try_async_get_page(fault->addr, fault->gfn);
+ trace_kvm_try_async_get_page(fault->addr, fault->gfn, 0);
if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
- trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
+ trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn, 0);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 800493739043..0a04de5dbada 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13408,7 +13408,7 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
{
struct x86_exception fault;
- trace_kvm_async_pf_not_present(work->arch.token, work->cr2_or_gpa);
+ trace_kvm_async_pf_not_present(work->arch.token, work->cr2_or_gpa, 0);
kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
if (kvm_can_deliver_async_pf(vcpu) &&
@@ -13447,7 +13447,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
work->arch.token = ~0; /* broadcast wakeup */
else
kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
- trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
+ trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa, 0);
if ((work->wakeup_all || work->notpresent_injected) &&
kvm_pv_async_pf_enabled(vcpu) &&
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af4..a7731b62863b 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -256,90 +256,104 @@ TRACE_EVENT(kvm_fpu,
);
#ifdef CONFIG_KVM_ASYNC_PF
+#define kvm_async_pf_type_symbol \
+ {0, "kernel"}, \
+ {1, "user"}
+
DECLARE_EVENT_CLASS(kvm_async_get_page_class,
- TP_PROTO(u64 gva, u64 gfn),
+ TP_PROTO(u64 gva, u64 gfn, int type),
- TP_ARGS(gva, gfn),
+ TP_ARGS(gva, gfn, type),
TP_STRUCT__entry(
__field(__u64, gva)
__field(u64, gfn)
+ __field(int, type)
),
TP_fast_assign(
__entry->gva = gva;
__entry->gfn = gfn;
+ __entry->type = type;
),
- TP_printk("gva = %#llx, gfn = %#llx", __entry->gva, __entry->gfn)
+ TP_printk("gva = %#llx, gfn = %#llx, type = %s", __entry->gva,
+ __entry->gfn, __print_symbolic(__entry->type,
+ kvm_async_pf_type_symbol))
);
DEFINE_EVENT(kvm_async_get_page_class, kvm_try_async_get_page,
- TP_PROTO(u64 gva, u64 gfn),
+ TP_PROTO(u64 gva, u64 gfn, int type),
- TP_ARGS(gva, gfn)
+ TP_ARGS(gva, gfn, type)
);
DEFINE_EVENT(kvm_async_get_page_class, kvm_async_pf_repeated_fault,
- TP_PROTO(u64 gva, u64 gfn),
+ TP_PROTO(u64 gva, u64 gfn, int type),
- TP_ARGS(gva, gfn)
+ TP_ARGS(gva, gfn, type)
);
DECLARE_EVENT_CLASS(kvm_async_pf_nopresent_ready,
- TP_PROTO(u64 token, u64 gva),
+ TP_PROTO(u64 token, u64 gva, int type),
- TP_ARGS(token, gva),
+ TP_ARGS(token, gva, type),
TP_STRUCT__entry(
__field(__u64, token)
__field(__u64, gva)
+ __field(int, type)
),
TP_fast_assign(
__entry->token = token;
__entry->gva = gva;
+ __entry->type = type;
),
- TP_printk("token %#llx gva %#llx", __entry->token, __entry->gva)
+ TP_printk("token %#llx gva %#llx type %s", __entry->token, __entry->gva,
+ __print_symbolic(__entry->type, kvm_async_pf_type_symbol))
);
DEFINE_EVENT(kvm_async_pf_nopresent_ready, kvm_async_pf_not_present,
- TP_PROTO(u64 token, u64 gva),
+ TP_PROTO(u64 token, u64 gva, int type),
- TP_ARGS(token, gva)
+ TP_ARGS(token, gva, type)
);
DEFINE_EVENT(kvm_async_pf_nopresent_ready, kvm_async_pf_ready,
- TP_PROTO(u64 token, u64 gva),
+ TP_PROTO(u64 token, u64 gva, int type),
- TP_ARGS(token, gva)
+ TP_ARGS(token, gva, type)
);
TRACE_EVENT(
kvm_async_pf_completed,
- TP_PROTO(unsigned long address, u64 gva),
- TP_ARGS(address, gva),
+ TP_PROTO(unsigned long address, u64 gva, int type),
+ TP_ARGS(address, gva, type),
TP_STRUCT__entry(
__field(unsigned long, address)
__field(u64, gva)
+ __field(int, type)
),
TP_fast_assign(
__entry->address = address;
__entry->gva = gva;
+ __entry->type = type;
),
- TP_printk("gva %#llx address %#lx", __entry->gva,
- __entry->address)
+ TP_printk("gva %#llx address %#lx type %s", __entry->gva,
+ __entry->address, __print_symbolic(__entry->type,
+ kvm_async_pf_type_symbol))
);
#endif
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 99a63bad0306..77c689a9b585 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -92,7 +92,7 @@ static void async_pf_execute(struct work_struct *work)
if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
kvm_arch_async_page_present_queued(vcpu);
- trace_kvm_async_pf_completed(addr, cr2_or_gpa);
+ trace_kvm_async_pf_completed(addr, cr2_or_gpa, 0);
__kvm_vcpu_wake_up(vcpu);
}
--
2.40.1
* [RFC PATCH 5/6] KVM: x86: async_pf_user: add infrastructure
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
Add both generic and x86-specific infrastructure for async PF user.
The functionality is gated by the KVM_ASYNC_PF_USER config option.
The async PF user implementation is mostly isolated from the original
(kernel) implementation. The only piece shared between the two is the
struct apf within struct kvm_vcpu_arch (x86), which tracks guest-facing
state.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
arch/x86/include/asm/kvm_host.h | 12 +-
arch/x86/kvm/Kconfig | 6 +
arch/x86/kvm/lapic.c | 2 +
arch/x86/kvm/mmu/mmu.c | 19 +++
arch/x86/kvm/x86.c | 75 ++++++++++++
include/linux/kvm_host.h | 30 +++++
include/linux/kvm_types.h | 1 +
include/uapi/linux/kvm.h | 8 ++
virt/kvm/Kconfig | 3 +
virt/kvm/Makefile.kvm | 1 +
virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
virt/kvm/async_pf_user.h | 24 ++++
virt/kvm/kvm_main.c | 14 +++
13 files changed, 391 insertions(+), 1 deletion(-)
create mode 100644 virt/kvm/async_pf_user.c
create mode 100644 virt/kvm/async_pf_user.h
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9bb2e164c523..36cea4c9000f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -122,6 +122,7 @@
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
+#define KVM_REQ_APF_USER_READY KVM_ARCH_REQ(29)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -164,6 +165,7 @@
#define KVM_NR_VAR_MTRR 8
#define ASYNC_PF_PER_VCPU 64
+#define ASYNC_PF_USER_PER_VCPU 64
enum kvm_reg {
VCPU_REGS_RAX = __VCPU_REGS_RAX,
@@ -973,7 +975,7 @@ struct kvm_vcpu_arch {
struct {
bool halted;
- gfn_t gfns[ASYNC_PF_PER_VCPU];
+ gfn_t gfns[ASYNC_PF_PER_VCPU + ASYNC_PF_USER_PER_VCPU];
struct gfn_to_hva_cache data;
u64 msr_en_val; /* MSR_KVM_ASYNC_PF_EN */
u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
@@ -983,6 +985,7 @@ struct kvm_vcpu_arch {
u32 host_apf_flags;
bool delivery_as_pf_vmexit;
bool pageready_pending;
+ bool pageready_user_pending;
} apf;
/* OSVW MSRs (AMD only) */
@@ -2266,11 +2269,18 @@ void kvm_make_scan_ioapic_request_mask(struct kvm *kvm,
bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
+bool kvm_arch_async_page_not_present_user(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf_user *apf);
void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
+void kvm_arch_async_page_present_user(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf_user *apf);
void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
+void kvm_arch_async_page_ready_user(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf_user *apf);
void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu);
+void kvm_arch_async_page_present_user_queued(struct kvm_vcpu *vcpu);
bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu);
extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 191dfba3e27a..255597942d59 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -209,4 +209,10 @@ config KVM_MAX_NR_VCPUS
the memory footprint of each KVM guest, regardless of how many vCPUs are
created for a given VM.
+config KVM_ASYNC_PF_USER
+ bool "Support for async PF handled by userspace"
+ depends on KVM && KVM_USERFAULT && KVM_ASYNC_PF && X86_64
+ help
+ Support for async PF handled by userspace.
+
endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index acd7d48100a1..723c9584d47a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -515,6 +515,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
/* Check if there are APF page ready requests pending */
if (enabled) {
kvm_make_request(KVM_REQ_APF_READY, apic->vcpu);
+ kvm_make_request(KVM_REQ_APF_USER_READY, apic->vcpu);
kvm_xen_sw_enable_lapic(apic->vcpu);
}
}
@@ -2560,6 +2561,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
static_branch_slow_dec_deferred(&apic_hw_disabled);
/* Check if there are APF page ready requests pending */
kvm_make_request(KVM_REQ_APF_READY, vcpu);
+ kvm_make_request(KVM_REQ_APF_USER_READY, vcpu);
} else {
static_branch_inc(&apic_hw_disabled.key);
atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 004e068cabae..adf0161af894 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4304,6 +4304,25 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
}
+void kvm_arch_async_page_ready_user(struct kvm_vcpu *vcpu, struct kvm_async_pf_user *apf)
+{
+ int r;
+
+ if ((vcpu->arch.mmu->root_role.direct != apf->arch.direct_map) ||
+ apf->wakeup_all)
+ return;
+
+ r = kvm_mmu_reload(vcpu);
+ if (unlikely(r))
+ return;
+
+ if (!vcpu->arch.mmu->root_role.direct &&
+ apf->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
+ return;
+
+ kvm_mmu_do_page_fault(vcpu, apf->cr2_or_gpa, apf->arch.error_code, true, NULL);
+}
+
static inline u8 kvm_max_level_for_order(int order)
{
BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a04de5dbada..2b8cd3af326b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -942,6 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
if ((cr0 ^ old_cr0) & X86_CR0_PG) {
kvm_clear_async_pf_completion_queue(vcpu);
+ kvm_clear_async_pf_user_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
/*
@@ -3569,6 +3570,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
if (!kvm_pv_async_pf_enabled(vcpu)) {
kvm_clear_async_pf_completion_queue(vcpu);
+ kvm_clear_async_pf_user_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
return 0;
}
@@ -3581,6 +3583,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
kvm_async_pf_wakeup_all(vcpu);
+ kvm_async_pf_user_wakeup_all(vcpu);
return 0;
}
@@ -4019,6 +4022,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (data & 0x1) {
vcpu->arch.apf.pageready_pending = false;
kvm_check_async_pf_completion(vcpu);
+ vcpu->arch.apf.pageready_user_pending = false;
+ kvm_check_async_pf_user_completion(vcpu);
}
break;
case MSR_KVM_STEAL_TIME:
@@ -10924,6 +10929,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_vcpu_update_apicv(vcpu);
if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
kvm_check_async_pf_completion(vcpu);
+ if (kvm_check_request(KVM_REQ_APF_USER_READY, vcpu))
+ kvm_check_async_pf_user_completion(vcpu);
if (kvm_check_request(KVM_REQ_MSR_FILTER_CHANGED, vcpu))
static_call(kvm_x86_msr_filter_changed)(vcpu);
@@ -12346,6 +12353,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvmclock_reset(vcpu);
kvm_clear_async_pf_completion_queue(vcpu);
+ kvm_clear_async_pf_user_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;
@@ -12671,6 +12679,7 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
kvm_for_each_vcpu(i, vcpu, kvm) {
kvm_clear_async_pf_completion_queue(vcpu);
+ kvm_clear_async_pf_user_completion_queue(vcpu);
kvm_unload_vcpu_mmu(vcpu);
}
}
@@ -13119,6 +13128,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
if (!list_empty_careful(&vcpu->async_pf.done))
return true;
+ if (!list_empty_careful(&vcpu->async_pf_user.done))
+ return true;
+
if (kvm_apic_has_pending_init_or_sipi(vcpu) &&
kvm_apic_init_sipi_allowed(vcpu))
return true;
@@ -13435,6 +13447,37 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
}
}
+bool kvm_arch_async_page_not_present_user(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf_user *apf)
+{
+ struct x86_exception fault;
+
+ trace_kvm_async_pf_not_present(apf->arch.token, apf->cr2_or_gpa, 1);
+ kvm_add_async_pf_gfn(vcpu, apf->arch.gfn);
+
+ if (!apf_put_user_notpresent(vcpu)) {
+ fault.vector = PF_VECTOR;
+ fault.error_code_valid = true;
+ fault.error_code = 0;
+ fault.nested_page_fault = false;
+ fault.address = apf->arch.token;
+ fault.async_page_fault = true;
+ kvm_inject_page_fault(vcpu, &fault);
+ return true;
+ } else {
+ /*
+ * It is not possible to deliver a paravirtualized asynchronous
+ * page fault, but putting the guest in an artificial halt state
+ * can be beneficial nevertheless: if an interrupt arrives, we
+ * can deliver it timely and perhaps the guest will schedule
+ * another process. When the instruction that triggered a page
+ * fault is retried, hopefully the page will be ready in the host.
+ */
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return false;
+ }
+}
+
void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
{
@@ -13460,6 +13503,31 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
}
+void kvm_arch_async_page_present_user(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf_user *apf)
+{
+ struct kvm_lapic_irq irq = {
+ .delivery_mode = APIC_DM_FIXED,
+ .vector = vcpu->arch.apf.vec
+ };
+
+ if (apf->wakeup_all)
+ apf->arch.token = ~0; /* broadcast wakeup */
+ else
+ kvm_del_async_pf_gfn(vcpu, apf->arch.gfn);
+ trace_kvm_async_pf_ready(apf->arch.token, apf->cr2_or_gpa, 1);
+
+ if ((apf->wakeup_all || apf->notpresent_injected) &&
+ kvm_pv_async_pf_enabled(vcpu) &&
+ !apf_put_user_ready(vcpu, apf->arch.token)) {
+ vcpu->arch.apf.pageready_user_pending = true;
+ kvm_apic_set_irq(vcpu, &irq, NULL);
+ }
+
+ vcpu->arch.apf.halted = false;
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+}
+
void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
{
kvm_make_request(KVM_REQ_APF_READY, vcpu);
@@ -13467,6 +13535,13 @@ void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
kvm_vcpu_kick(vcpu);
}
+void kvm_arch_async_page_present_user_queued(struct kvm_vcpu *vcpu)
+{
+ kvm_make_request(KVM_REQ_APF_USER_READY, vcpu);
+ if (!vcpu->arch.apf.pageready_user_pending)
+ kvm_vcpu_kick(vcpu);
+}
+
bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
{
if (!kvm_pv_async_pf_enabled(vcpu))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b9780d85877..d0aa0680127a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -257,6 +257,27 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif
+#ifdef CONFIG_KVM_ASYNC_PF_USER
+struct kvm_async_pf_user {
+ struct list_head link;
+ struct list_head queue;
+ gpa_t cr2_or_gpa;
+ struct kvm_arch_async_pf arch;
+ bool wakeup_all;
+ bool resolved;
+ bool notpresent_injected;
+};
+
+void kvm_clear_async_pf_user_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_user_completion(struct kvm_vcpu *vcpu);
+bool kvm_setup_async_pf_user(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ unsigned long hva, struct kvm_arch_async_pf *arch);
+int kvm_async_pf_user_wakeup_all(struct kvm_vcpu *vcpu);
+#endif
+
+int kvm_async_pf_user_ready(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf_user_ready *apf_ready);
+
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
union kvm_mmu_notifier_arg {
unsigned long attributes;
@@ -368,6 +389,15 @@ struct kvm_vcpu {
} async_pf;
#endif
+#ifdef CONFIG_KVM_ASYNC_PF_USER
+ struct {
+ u32 queued;
+ struct list_head queue;
+ struct list_head done;
+ spinlock_t lock;
+ } async_pf_user;
+#endif
+
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
/*
* Cpu relax intercept or pause loop exit optimization
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 827ecc0b7e10..149c7e48b2fb 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -5,6 +5,7 @@
struct kvm;
struct kvm_async_pf;
+struct kvm_async_pf_user;
struct kvm_device_ops;
struct kvm_gfn_range;
struct kvm_interrupt;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8cd8e08f11e1..ef3840a1c5e9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1561,4 +1561,12 @@ struct kvm_fault {
#define KVM_READ_USERFAULT _IOR(KVMIO, 0xd5, struct kvm_fault)
+/* for KVM_ASYNC_PF_USER_READY */
+struct kvm_async_pf_user_ready {
+ /* in */
+ __u32 token;
+};
+
+#define KVM_ASYNC_PF_USER_READY _IOW(KVMIO, 0xd6, struct kvm_async_pf_user_ready)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index f1b660d593e4..91abbd9a8e70 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -45,6 +45,9 @@ config KVM_MMIO
config KVM_ASYNC_PF
bool
+config KVM_ASYNC_PF_USER
+ bool
+
# Toggle to switch between direct notification and batch job
config KVM_ASYNC_PF_SYNC
bool
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 724c89af78af..980217e0b03a 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -9,6 +9,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
+kvm-$(CONFIG_KVM_ASYNC_PF_USER) += $(KVM)/async_pf_user.o
kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
diff --git a/virt/kvm/async_pf_user.c b/virt/kvm/async_pf_user.c
new file mode 100644
index 000000000000..d72ce5733e1a
--- /dev/null
+++ b/virt/kvm/async_pf_user.c
@@ -0,0 +1,197 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * kvm support for asynchronous fault in userspace
+ *
+ * Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Author:
+ * Nikita Kalyazin <kalyazin@amazon.com>
+ */
+
+#include <uapi/linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+
+#include "async_pf_user.h"
+#include <trace/events/kvm.h>
+
+static struct kmem_cache *async_pf_user_cache;
+
+int kvm_async_pf_user_init(void)
+{
+ async_pf_user_cache = KMEM_CACHE(kvm_async_pf_user, 0);
+
+ if (!async_pf_user_cache)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void kvm_async_pf_user_deinit(void)
+{
+ kmem_cache_destroy(async_pf_user_cache);
+ async_pf_user_cache = NULL;
+}
+
+void kvm_async_pf_user_vcpu_init(struct kvm_vcpu *vcpu)
+{
+ INIT_LIST_HEAD(&vcpu->async_pf_user.done);
+ INIT_LIST_HEAD(&vcpu->async_pf_user.queue);
+ spin_lock_init(&vcpu->async_pf_user.lock);
+}
+
+int kvm_async_pf_user_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf_user_ready *apf_ready)
+{
+ struct kvm_async_pf_user *apf, *found = NULL;
+ bool first;
+
+ spin_lock(&vcpu->async_pf_user.lock);
+ list_for_each_entry(apf, &vcpu->async_pf_user.queue, queue) {
+ if (apf->arch.token == apf_ready->token) {
+ found = apf;
+ break;
+ }
+ }
+
+ if (unlikely(!found)) {
+ spin_unlock(&vcpu->async_pf_user.lock);
+ return -EINVAL;
+ }
+
+ first = list_empty(&vcpu->async_pf_user.done);
+ found->resolved = true;
+ list_add_tail(&found->link, &vcpu->async_pf_user.done);
+ spin_unlock(&vcpu->async_pf_user.lock);
+
+ if (first)
+ kvm_arch_async_page_present_user_queued(vcpu);
+
+ trace_kvm_async_pf_completed(0, found->cr2_or_gpa, 1);
+
+ __kvm_vcpu_wake_up(vcpu);
+
+ return 0;
+}
+
+void kvm_clear_async_pf_user_completion_queue(struct kvm_vcpu *vcpu)
+{
+ spin_lock(&vcpu->async_pf_user.lock);
+
+ /* cancel outstanding work queue item */
+ while (!list_empty(&vcpu->async_pf_user.queue)) {
+ struct kvm_async_pf_user *apf =
+ list_first_entry(&vcpu->async_pf_user.queue,
+ typeof(*apf), queue);
+ list_del(&apf->queue);
+
+ /*
+ * If userspace has already notified us that the fault
+ * has been resolved, we will delete the item when
+ * iterating over the `done` list.
+ * Otherwise, we free it now; if userspace later comes
+ * back regarding this fault, it will be rejected due
+ * to a nonexistent token.
+ * Note that we do not have a way to "cancel" the work
+ * as with traditional (kernel) async PF.
+ */
+ if (!apf->resolved)
+ kmem_cache_free(async_pf_user_cache, apf);
+ }
+
+ while (!list_empty(&vcpu->async_pf_user.done)) {
+ struct kvm_async_pf_user *apf =
+ list_first_entry(&vcpu->async_pf_user.done,
+ typeof(*apf), link);
+ list_del(&apf->link);
+
+ /*
+ * Unlike with traditional (kernel) async PF, we know
+ * that once the work has reached the `done` list,
+ * userspace is done with it and KVM holds no residual
+ * resources for it.
+ */
+ kmem_cache_free(async_pf_user_cache, apf);
+ }
+ spin_unlock(&vcpu->async_pf_user.lock);
+
+ vcpu->async_pf_user.queued = 0;
+}
+
+void kvm_check_async_pf_user_completion(struct kvm_vcpu *vcpu)
+{
+ struct kvm_async_pf_user *apf;
+
+ while (!list_empty_careful(&vcpu->async_pf_user.done) &&
+ kvm_arch_can_dequeue_async_page_present(vcpu)) {
+ spin_lock(&vcpu->async_pf_user.lock);
+ apf = list_first_entry(&vcpu->async_pf_user.done, typeof(*apf),
+ link);
+ list_del(&apf->link);
+ spin_unlock(&vcpu->async_pf_user.lock);
+
+ kvm_arch_async_page_ready_user(vcpu, apf);
+ kvm_arch_async_page_present_user(vcpu, apf);
+
+ list_del(&apf->queue);
+ vcpu->async_pf_user.queued--;
+ }
+}
+
+/*
+ * Try to schedule a job to handle page fault asynchronously. Returns 'true' on
+ * success, 'false' on failure (page fault has to be handled synchronously).
+ */
+bool kvm_setup_async_pf_user(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ unsigned long hva, struct kvm_arch_async_pf *arch)
+{
+ struct kvm_async_pf_user *apf;
+
+ if (vcpu->async_pf_user.queued >= ASYNC_PF_USER_PER_VCPU)
+ return false;
+
+ /*
+ * do alloc nowait since if we are going to sleep anyway we
+ * may as well sleep faulting in page
+ */
+ apf = kmem_cache_zalloc(async_pf_user_cache, GFP_NOWAIT | __GFP_NOWARN);
+ if (!apf)
+ return false;
+
+ apf->wakeup_all = false;
+ apf->cr2_or_gpa = cr2_or_gpa;
+ apf->arch = *arch;
+
+ list_add_tail(&apf->queue, &vcpu->async_pf_user.queue);
+ vcpu->async_pf_user.queued++;
+ apf->notpresent_injected = kvm_arch_async_page_not_present_user(vcpu, apf);
+
+ return true;
+}
+
+int kvm_async_pf_user_wakeup_all(struct kvm_vcpu *vcpu)
+{
+ struct kvm_async_pf_user *apf;
+ bool first;
+
+ if (!list_empty_careful(&vcpu->async_pf_user.done))
+ return 0;
+
+ apf = kmem_cache_zalloc(async_pf_user_cache, GFP_ATOMIC);
+ if (!apf)
+ return -ENOMEM;
+
+ apf->wakeup_all = true;
+ INIT_LIST_HEAD(&apf->queue); /* for list_del to work */
+
+ spin_lock(&vcpu->async_pf_user.lock);
+ first = list_empty(&vcpu->async_pf_user.done);
+ list_add_tail(&apf->link, &vcpu->async_pf_user.done);
+ spin_unlock(&vcpu->async_pf_user.lock);
+
+ if (first)
+ kvm_arch_async_page_present_user_queued(vcpu);
+
+ vcpu->async_pf_user.queued++;
+ return 0;
+}
diff --git a/virt/kvm/async_pf_user.h b/virt/kvm/async_pf_user.h
new file mode 100644
index 000000000000..35fa12858c05
--- /dev/null
+++ b/virt/kvm/async_pf_user.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * kvm support for asynchronous fault in userspace
+ *
+ * Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Author:
+ * Nikita Kalyazin <kalyazin@amazon.com>
+ */
+
+#ifndef __KVM_ASYNC_PF_USER_H__
+#define __KVM_ASYNC_PF_USER_H__
+
+#ifdef CONFIG_KVM_ASYNC_PF_USER
+int kvm_async_pf_user_init(void);
+void kvm_async_pf_user_deinit(void);
+void kvm_async_pf_user_vcpu_init(struct kvm_vcpu *vcpu);
+#else
+#define kvm_async_pf_user_init() (0)
+#define kvm_async_pf_user_deinit() do {} while (0)
+#define kvm_async_pf_user_vcpu_init(C) do {} while (0)
+#endif
+
+#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 90ce6b8ff0ab..a1a122acf93a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -59,6 +59,7 @@
#include "coalesced_mmio.h"
#include "async_pf.h"
+#include "async_pf_user.h"
#include "kvm_mm.h"
#include "vfio.h"
@@ -493,6 +494,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
rcuwait_init(&vcpu->wait);
#endif
kvm_async_pf_vcpu_init(vcpu);
+ kvm_async_pf_user_vcpu_init(vcpu);
kvm_vcpu_set_in_spin_loop(vcpu, false);
kvm_vcpu_set_dy_eligible(vcpu, false);
@@ -4059,6 +4061,11 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
return true;
#endif
+#ifdef CONFIG_KVM_ASYNC_PF_USER
+ if (!list_empty_careful(&vcpu->async_pf_user.done))
+ return true;
+#endif
+
return false;
}
@@ -6613,6 +6620,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
if (r)
goto err_async_pf;
+ r = kvm_async_pf_user_init();
+ if (r)
+ goto err_async_pf_user;
+
kvm_chardev_ops.owner = module;
kvm_vm_fops.owner = module;
kvm_vcpu_fops.owner = module;
@@ -6644,6 +6655,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
err_register:
kvm_vfio_ops_exit();
err_vfio:
+ kvm_async_pf_user_deinit();
+err_async_pf_user:
kvm_async_pf_deinit();
err_async_pf:
kvm_irqfd_exit();
@@ -6677,6 +6690,7 @@ void kvm_exit(void)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
kmem_cache_destroy(kvm_vcpu_cache);
kvm_vfio_ops_exit();
+ kvm_async_pf_user_deinit();
kvm_async_pf_deinit();
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
unregister_syscore_ops(&kvm_syscore_ops);
--
2.40.1
* [RFC PATCH 6/6] KVM: x86: async_pf_user: hook to fault handling and add ioctl
2024-11-18 12:39 [RFC PATCH 0/6] KVM: x86: async PF user Nikita Kalyazin
` (4 preceding siblings ...)
2024-11-18 12:39 ` [RFC PATCH 5/6] KVM: x86: async_pf_user: add infrastructure Nikita Kalyazin
@ 2024-11-18 12:39 ` Nikita Kalyazin
2024-11-19 1:26 ` [RFC PATCH 0/6] KVM: x86: async PF user James Houghton
2025-02-11 21:17 ` Sean Christopherson
7 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-18 12:39 UTC (permalink / raw)
To: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel
Cc: jthoughton, david, peterx, oleg, vkuznets, gshan, graf, jgowans,
roypat, derekmn, nsaenz, xmarcalx, kalyazin
This patch adds interception in __kvm_faultin_pfn() to handle faults
that cause an exit to userspace asynchronously. If the kernel expects
userspace to handle the fault asynchronously (i.e. it can resume the
vCPU while the fault is being processed), it sets the
KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER flag and supplies the async PF token
in struct memory_fault in the VM exit info.

The patch also adds the KVM_ASYNC_PF_USER_READY ioctl, which userspace
uses to notify the kernel that the fault has been processed, passing
the token corresponding to the fault.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
arch/x86/kvm/mmu/mmu.c | 45 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 16 +++++++++++++-
arch/x86/kvm/x86.h | 2 ++
include/uapi/linux/kvm.h | 4 +++-
4 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index adf0161af894..a2b024ccbbe1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4282,6 +4282,22 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
}
+static bool kvm_arch_setup_async_pf_user(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault, u32 *token)
+{
+ struct kvm_arch_async_pf arch;
+
+ arch.token = alloc_apf_token(vcpu);
+ arch.gfn = fault->gfn;
+ arch.error_code = fault->error_code;
+ arch.direct_map = vcpu->arch.mmu->root_role.direct;
+ arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
+
+ *token = arch.token;
+
+ return kvm_setup_async_pf_user(vcpu, 0, fault->addr, &arch);
+}
+
void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
int r;
@@ -4396,6 +4412,35 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
{
bool async;
+ /* Pre-check for userfault and bail out early. */
+ if (gfn_has_userfault(fault->slot->kvm, fault->gfn)) {
+ bool report_async = false;
+ u32 token = 0;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ !fault->prefetch && kvm_can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(fault->addr, fault->gfn, 1);
+ if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
+ trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn, 1);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return RET_PF_RETRY;
+ } else if (kvm_can_deliver_async_pf(vcpu) &&
+ kvm_arch_setup_async_pf_user(vcpu, fault, &token)) {
+ report_async = true;
+ }
+ }
+
+ fault->pfn = KVM_PFN_ERR_USERFAULT;
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+
+ if (report_async) {
+ vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER;
+ vcpu->run->memory_fault.async_pf_user_token = token;
+ }
+
+ return -EFAULT;
+ }
+
if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2b8cd3af326b..30b22904859f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13372,7 +13372,7 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
return !val;
}
-static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
+bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
{
if (!kvm_pv_async_pf_enabled(vcpu))
@@ -13697,6 +13697,20 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
long kvm_arch_vcpu_async_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
+ void __user *argp = (void __user *)arg;
+ struct kvm_vcpu *vcpu = filp->private_data;
+
+#ifdef CONFIG_KVM_ASYNC_PF_USER
+ if (ioctl == KVM_ASYNC_PF_USER_READY) {
+ struct kvm_async_pf_user_ready apf_ready;
+
+ if (copy_from_user(&apf_ready, argp, sizeof(apf_ready)))
+ return -EFAULT;
+
+ return kvm_async_pf_user_ready(vcpu, &apf_ready);
+ }
+#endif
+
return -ENOIOCTLCMD;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d80a4c6b5a38..66ece51ee94b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -325,6 +325,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len);
fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
+bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu);
+
extern u64 host_xcr0;
extern u64 host_xss;
extern u64 host_arch_capabilities;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ef3840a1c5e9..8aa5ce347bdf 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -430,12 +430,14 @@ struct kvm_run {
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
#define KVM_MEMORY_EXIT_FLAG_USERFAULT (1ULL << 4)
+#define KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER (1ULL << 5)
__u64 flags;
__u64 gpa;
__u64 size;
+ __u32 async_pf_user_token;
} memory_fault;
/* Fix the size of the union. */
- char padding[256];
+ char padding[252];
};
/* 2048 is the size of the char array used to bound/pad the size
--
2.40.1
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2024-11-18 12:39 [RFC PATCH 0/6] KVM: x86: async PF user Nikita Kalyazin
` (5 preceding siblings ...)
2024-11-18 12:39 ` [RFC PATCH 6/6] KVM: x86: async_pf_user: hook to fault handling and add ioctl Nikita Kalyazin
@ 2024-11-19 1:26 ` James Houghton
2024-11-19 16:19 ` Nikita Kalyazin
2025-02-11 21:17 ` Sean Christopherson
7 siblings, 1 reply; 20+ messages in thread
From: James Houghton @ 2024-11-19 1:26 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Mon, Nov 18, 2024 at 4:40 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> Async PF [1] allows to run other processes on a vCPU while the host
> handles a stage-2 fault caused by a process on that vCPU. When using
> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
> because KVM does not run the vCPU while a fault is being handled so no
> other process can execute on the vCPU. This patch series extends
> VM-exit-based stage-2 fault handling with async PF support by letting
> userspace handle faults instead of the kernel, hence the "async PF user"
> name.
>
> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
> and the only concern I heard was about injecting the "page not present"
> event via #PF exception in the CoCo case, where it may not work. In my
> implementation, I reused the existing code for doing that, so the async
> PF user implementation is on par with the present async PF
> implementation in this regard, and support for the CoCo case can be
> added separately.
>
> Please note that this series is applied on top of the VM-exit-based
> stage-2 fault handling RFC [2].
Thanks, Nikita! I'll post a new version of [2] very soon. The new
version contains the simplifications we talked about at LPC but is
conceptually the same (so this async PF series is motivated the same
way), and it shouldn't have many/any conflicts with the main bits of
this series.
> [2] https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2024-11-19 1:26 ` [RFC PATCH 0/6] KVM: x86: async PF user James Houghton
@ 2024-11-19 16:19 ` Nikita Kalyazin
0 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-19 16:19 UTC (permalink / raw)
To: James Houghton
Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, hpa,
rostedt, mhiramat, mathieu.desnoyers, kvm, linux-doc,
linux-kernel, linux-trace-kernel, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On 19/11/2024 01:26, James Houghton wrote:
>> Please note that this series is applied on top of the VM-exit-based
>> stage-2 fault handling RFC [2].
>
> Thanks, Nikita! I'll post a new version of [2] very soon. The new
> version contains the simplifications we talked about at LPC but is
> conceptually the same (so this async PF series is motivated the same
> way), and it shouldn't have many/any conflicts with the main bits of
> this series.
Great news, looking forward to seeing it!
>
>> [2] https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2024-11-18 12:39 [RFC PATCH 0/6] KVM: x86: async PF user Nikita Kalyazin
` (6 preceding siblings ...)
2024-11-19 1:26 ` [RFC PATCH 0/6] KVM: x86: async PF user James Houghton
@ 2025-02-11 21:17 ` Sean Christopherson
2025-02-12 18:14 ` Nikita Kalyazin
7 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2025-02-11 21:17 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> Async PF [1] allows to run other processes on a vCPU while the host
> handles a stage-2 fault caused by a process on that vCPU. When using
> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
> because KVM does not run the vCPU while a fault is being handled so no
> other process can execute on the vCPU. This patch series extends
> VM-exit-based stage-2 fault handling with async PF support by letting
> userspace handle faults instead of the kernel, hence the "async PF user"
> name.
>
> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
> and the only concern I heard was about injecting the "page not present"
> event via #PF exception in the CoCo case, where it may not work. In my
> implementation, I reused the existing code for doing that, so the async
> PF user implementation is on par with the present async PF
> implementation in this regard, and support for the CoCo case can be
> added separately.
>
> Please note that this series is applied on top of the VM-exit-based
> stage-2 fault handling RFC [2].
...
> Nikita Kalyazin (6):
> Documentation: KVM: add userfault KVM exit flag
> Documentation: KVM: add async pf user doc
> KVM: x86: add async ioctl support
> KVM: trace events: add type argument to async pf
> KVM: x86: async_pf_user: add infrastructure
> KVM: x86: async_pf_user: hook to fault handling and add ioctl
>
> Documentation/virt/kvm/api.rst | 35 ++++++
> arch/x86/include/asm/kvm_host.h | 12 +-
> arch/x86/kvm/Kconfig | 7 ++
> arch/x86/kvm/lapic.c | 2 +
> arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
> arch/x86/kvm/x86.c | 101 +++++++++++++++-
> arch/x86/kvm/x86.h | 2 +
> include/linux/kvm_host.h | 30 +++++
> include/linux/kvm_types.h | 1 +
> include/trace/events/kvm.h | 50 +++++---
> include/uapi/linux/kvm.h | 12 +-
> virt/kvm/Kconfig | 3 +
> virt/kvm/Makefile.kvm | 1 +
> virt/kvm/async_pf.c | 2 +-
> virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
> virt/kvm/async_pf_user.h | 24 ++++
> virt/kvm/kvm_main.c | 14 +++
> 17 files changed, 535 insertions(+), 26 deletions(-)
I am supportive of the idea, but there is way too much copy+paste in this series.
And it's not just the code itself, it's all the structures and concepts. Off the
top of my head, I can't think of any reason there needs to be a separate queue,
separate lock(s), etc. The only difference between kernel APF and user APF is
what chunk of code is responsible for faulting in the page.
I suspect a good place to start would be something along the lines of the below
diff, and go from there. Given that KVM already needs to special case the fake
"wake all" items, I'm guessing it won't be terribly difficult to teach the core
flows about userspace async #PF.
I'm also not sure that injecting async #PF for all userfaults is desirable. For
in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
be handled via async #PF. The obvious answer is to have userspace only enable
userspace async #PF when it's useful, but "an all or nothing" approach isn't
great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
doesn't make sense either :-/
Exiting to userspace in vCPU context is also kludgy. It makes sense for base
userfault, because the vCPU can't make forward progress until the fault is
resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
James' series. Anyways, it definitely doesn't make sense for async #PF, because
the whole point is to let the vCPU run. Signalling userspace would definitely
add complexity, but only because of the need to communicate the token and wait
for userspace to consume said token. I'll think more on that.
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 0ee4816b079a..fc31b47cf9c5 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
* success, 'false' on failure (page fault has to be handled synchronously).
*/
bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- unsigned long hva, struct kvm_arch_async_pf *arch)
+ unsigned long hva, struct kvm_arch_async_pf *arch,
+ bool userfault)
{
struct kvm_async_pf *work;
@@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
work->addr = hva;
work->arch = *arch;
- INIT_WORK(&work->work, async_pf_execute);
-
list_add_tail(&work->queue, &vcpu->async_pf.queue);
vcpu->async_pf.queued++;
work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
- schedule_work(&work->work);
+ if (userfault) {
+ work->userfault = true;
+ } else {
+ INIT_WORK(&work->work, async_pf_execute);
+ schedule_work(&work->work);
+ }
return true;
}
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-11 21:17 ` Sean Christopherson
@ 2025-02-12 18:14 ` Nikita Kalyazin
2025-02-19 15:17 ` Sean Christopherson
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-02-12 18:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On 11/02/2025 21:17, Sean Christopherson wrote:
> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>> Async PF [1] allows to run other processes on a vCPU while the host
>> handles a stage-2 fault caused by a process on that vCPU. When using
>> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
>> because KVM does not run the vCPU while a fault is being handled so no
>> other process can execute on the vCPU. This patch series extends
>> VM-exit-based stage-2 fault handling with async PF support by letting
>> userspace handle faults instead of the kernel, hence the "async PF user"
>> name.
>>
>> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
>> and the only concern I heard was about injecting the "page not present"
>> event via #PF exception in the CoCo case, where it may not work. In my
>> implementation, I reused the existing code for doing that, so the async
>> PF user implementation is on par with the present async PF
>> implementation in this regard, and support for the CoCo case can be
>> added separately.
>>
>> Please note that this series is applied on top of the VM-exit-based
>> stage-2 fault handling RFC [2].
>
> ...
>
>> Nikita Kalyazin (6):
>> Documentation: KVM: add userfault KVM exit flag
>> Documentation: KVM: add async pf user doc
>> KVM: x86: add async ioctl support
>> KVM: trace events: add type argument to async pf
>> KVM: x86: async_pf_user: add infrastructure
>> KVM: x86: async_pf_user: hook to fault handling and add ioctl
>>
>> Documentation/virt/kvm/api.rst | 35 ++++++
>> arch/x86/include/asm/kvm_host.h | 12 +-
>> arch/x86/kvm/Kconfig | 7 ++
>> arch/x86/kvm/lapic.c | 2 +
>> arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
>> arch/x86/kvm/x86.c | 101 +++++++++++++++-
>> arch/x86/kvm/x86.h | 2 +
>> include/linux/kvm_host.h | 30 +++++
>> include/linux/kvm_types.h | 1 +
>> include/trace/events/kvm.h | 50 +++++---
>> include/uapi/linux/kvm.h | 12 +-
>> virt/kvm/Kconfig | 3 +
>> virt/kvm/Makefile.kvm | 1 +
>> virt/kvm/async_pf.c | 2 +-
>> virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
>> virt/kvm/async_pf_user.h | 24 ++++
>> virt/kvm/kvm_main.c | 14 +++
>> 17 files changed, 535 insertions(+), 26 deletions(-)
>
> I am supportive of the idea, but there is way too much copy+paste in this series.
Hi Sean,
Yes, like I mentioned in the cover letter, I left the new implementation
isolated on purpose to make the scope of the change clear. There is
certainly lots of duplication that should be removed later on.
> And it's not just the code itself, it's all the structures and concepts. Off the
> top of my head, I can't think of any reason there needs to be a separate queue,
> separate lock(s), etc. The only difference between kernel APF and user APF is
> what chunk of code is responsible for faulting in the page.
There are two queues involved:
- "queue": stores in-flight faults. APF-kernel uses it to cancel all
work items if needed. APF-user has no way to "cancel" userspace work,
but it uses the queue to look up the struct by the token when
userspace reports a completion.
- "ready": stores completed faults until KVM finds a chance to tell the
guest about them.
I agree that the "ready" queue can be shared between APF-kernel and
-user as it's used in the same way. As for the "queue" queue, do you
think it's ok to process its elements differently based on the "type" of
them in a single loop [1] instead of having two separate queues?
[1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
> I suspect a good place to start would be something along the lines of the below
> diff, and go from there. Given that KVM already needs to special case the fake
> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> flows about userspace async #PF.
That sounds sensible. I can certainly approach it in a "bottom up" way
by sparingly adding handling where it's different in APF-user rather
than adding it side by side and trying to merge common parts.
> I'm also not sure that injecting async #PF for all userfaults is desirable. For
> in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> be handled via async #PF. The obvious answer is to have userspace only enable
> userspace async #PF when it's useful, but "an all or nothing" approach isn't
> great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
> doesn't make sense either :-/
I wasn't able to locate the code that would check whether faulting would
sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]).
Please let me know if I'm missing something here.
[2]
https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
> Exiting to userspace in vCPU context is also kludgy. It makes sense for base
> userfault, because the vCPU can't make forward progress until the fault is
> resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
Even though we exit to userspace, in case of APF-user, userspace is
supposed to VM enter straight after scheduling the async job, which is
then executed concurrently with the vCPU.
> James' series. Anyways, it definitely doesn't make sense for async #PF, because
> the whole point is to let the vCPU run. Signalling userspace would definitely
> add complexity, but only because of the need to communicate the token and wait
> for userspace to consume said token. I'll think more on that.
By signalling userspace you mean a new non-exit-to-userspace mechanism
similar to UFFD? What advantage can you see in it over exiting to
userspace (which already exists in James's series)?
Thanks,
Nikita
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 0ee4816b079a..fc31b47cf9c5 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> * success, 'false' on failure (page fault has to be handled synchronously).
> */
> bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - unsigned long hva, struct kvm_arch_async_pf *arch)
> + unsigned long hva, struct kvm_arch_async_pf *arch,
> + bool userfault)
> {
> struct kvm_async_pf *work;
>
> @@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->addr = hva;
> work->arch = *arch;
>
> - INIT_WORK(&work->work, async_pf_execute);
> -
> list_add_tail(&work->queue, &vcpu->async_pf.queue);
> vcpu->async_pf.queued++;
> work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
>
> - schedule_work(&work->work);
> + if (userfault) {
> + work->userfault = true;
> + } else {
> + INIT_WORK(&work->work, async_pf_execute);
> + schedule_work(&work->work);
> + }
>
> return true;
> }
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-12 18:14 ` Nikita Kalyazin
@ 2025-02-19 15:17 ` Sean Christopherson
2025-02-20 18:29 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2025-02-19 15:17 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> On 11/02/2025 21:17, Sean Christopherson wrote:
> > On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> > And it's not just the code itself, it's all the structures and concepts. Off the
> > top of my head, I can't think of any reason there needs to be a separate queue,
> > separate lock(s), etc. The only difference between kernel APF and user APF is
> > what chunk of code is responsible for faulting in the page.
>
> There are two queues involved:
> - "queue": stores in-flight faults. APF-kernel uses it to cancel all works
> if needed. APF-user does not have a way to "cancel" userspace works, but it
> uses the queue to look up the struct by the token when userspace reports a
> completion.
> - "ready": stores completed faults until KVM finds a chance to tell guest
> about them.
>
> I agree that the "ready" queue can be shared between APF-kernel and -user as
> it's used in the same way. As for the "queue" queue, do you think it's ok
> to process its elements differently based on the "type" of them in a single
> loop [1] instead of having two separate queues?
Yes.
> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
>
> > I suspect a good place to start would be something along the lines of the below
> > diff, and go from there. Given that KVM already needs to special case the fake
> > "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> > flows about userspace async #PF.
>
> That sounds sensible. I can certainly approach it in a "bottom up" way by
> sparingly adding handling where it's different in APF-user rather than
> adding it side by side and trying to merge common parts.
>
> > I'm also not sure that injecting async #PF for all userfaults is desirable. For
> > in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
> > userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> > be handled via async #PF. The obvious answer is to have userspace only enable
> > userspace async #PF when it's useful, but "an all or nothing" approach isn't
> > great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
> > doesn't make sense either :-/
>
> I wasn't able to locate the code that would check whether faulting would
> sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]). Please let
> me know if I'm missing something here.
kvm_can_do_async_pf() will be reached if and only if faulting in the memory
requires waiting. If a page is swapped out, but faulting it back in doesn't
require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
then the initial __kvm_faultin_pfn() with FOLL_NO_WAIT will succeed.
/*
* If resolving the page failed because I/O is needed to fault-in the
* page, then either set up an asynchronous #PF to do the I/O, or if
* doing an async #PF isn't possible, retry with I/O allowed. All
* other failures are terminal, i.e. retrying won't help.
*/
if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
return RET_PF_CONTINUE;
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
return RET_PF_RETRY;
}
}
The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
userspace will hand off to an I/O task. If the fault is handled synchronously,
KVM will needlessly inject a not-present #PF and a present IRQ.
But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
userspace faults are either synchronous or asynchronous.
> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
>
> > Exiting to userspace in vCPU context is also kludgy. It makes sense for base
> > userfault, because the vCPU can't make forward progress until the fault is
> > resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
>
> Even though we exit to userspace, in case of APF-user, userspace is supposed
to re-enter the VM straight after scheduling the async job, which is then executed
> concurrently with the vCPU.
>
> > James' series. Anyways, it definitely doesn't make sense for async #PF, because
> > the whole point is to let the vCPU run. Signalling userspace would definitely
> > add complexity, but only because of the need to communicate the token and wait
> > for userspace to consume said token. I'll think more on that.
>
> By signalling userspace you mean a new non-exit-to-userspace mechanism
> similar to UFFD?
Yes.
> What advantage can you see in it over exiting to userspace (which already exists
> in James's series)?
It doesn't exit to userspace :-)
If userspace simply wakes a different task in response to the exit, then KVM
should be able to wake said task, e.g. by signalling an eventfd, and resume the
guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
or not such an optimization is worth the complexity is an entirely different
question though.
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-19 15:17 ` Sean Christopherson
@ 2025-02-20 18:29 ` Nikita Kalyazin
2025-02-20 18:49 ` Sean Christopherson
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-02-20 18:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On 19/02/2025 15:17, Sean Christopherson wrote:
> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>> On 11/02/2025 21:17, Sean Christopherson wrote:
>>> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>>> And it's not just the code itself, it's all the structures and concepts. Off the
>>> top of my head, I can't think of any reason there needs to be a separate queue,
>>> separate lock(s), etc. The only difference between kernel APF and user APF is
>>> what chunk of code is responsible for faulting in the page.
>>
>> There are two queues involved:
>> - "queue": stores in-flight faults. APF-kernel uses it to cancel all works
>> if needed. APF-user does not have a way to "cancel" userspace works, but it
>> uses the queue to look up the struct by the token when userspace reports a
>> completion.
>> - "ready": stores completed faults until KVM finds a chance to tell guest
>> about them.
>>
>> I agree that the "ready" queue can be shared between APF-kernel and -user as
>> it's used in the same way. As for the "queue" queue, do you think it's ok
>> to process its elements differently based on the "type" of them in a single
>> loop [1] instead of having two separate queues?
>
> Yes.
>
>> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
>>
>>> I suspect a good place to start would be something along the lines of the below
>>> diff, and go from there. Given that KVM already needs to special case the fake
>>> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
>>> flows about userspace async #PF.
>>
>> That sounds sensible. I can certainly approach it in a "bottom up" way by
>> sparingly adding handling where it's different in APF-user rather than
>> adding it side by side and trying to merge common parts.
>>
>>> I'm also not sure that injecting async #PF for all userfaults is desirable. For
>>> in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
>>> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
>>> be handled via async #PF. The obvious answer is to have userspace only enable
>>> userspace async #PF when it's useful, but "an all or nothing" approach isn't
>>> great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
>>> doesn't make sense either :-/
>>
>> I wasn't able to locate the code that would check whether faulting would
>> sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]). Please let
>> me know if I'm missing something here.
>
> kvm_can_do_async_pf() will be reached if and only if faulting in the memory
> requires waiting. If a page is swapped out, but faulting it back in doesn't
> require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
> then the initial __kvm_faultin_pfn() with FOLL_NO_WAIT will succeed.
>
> /*
> * If resolving the page failed because I/O is needed to fault-in the
> * page, then either set up an asynchronous #PF to do the I/O, or if
> * doing an async #PF isn't possible, retry with I/O allowed. All
> * other failures are terminal, i.e. retrying won't help.
> */
> if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> return RET_PF_CONTINUE;
>
> if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
> trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
> trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
> kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> return RET_PF_RETRY;
> } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> return RET_PF_RETRY;
> }
> }
>
> The conundrum with userspace async #PF is that if userspace is given only a single
> bit per gfn to force an exit, then KVM won't be able to differentiate between
> "faults" that will be handled synchronously by the vCPU task, and faults that
> userspace will hand off to an I/O task. If the fault is handled synchronously,
> KVM will needlessly inject a not-present #PF and a present IRQ.
Right, but from the guest's point of view, async PF means "it will
probably take a while for the host to get the page, so I may consider
doing something else in the meantime (ie schedule another process if
available)". If we are exiting to userspace, it isn't going to be quick
anyway, so we can consider all such faults "long" and warranting the
execution of the async PF protocol. So always injecting a not-present
#PF and page ready IRQ doesn't look too wrong in that case.
> But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
> userspace faults are either synchronous or asynchronous.
Yes, pretty much. The user will be choosing the extreme that is more
performant for their specific usecase.
>> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
>>
>>> Exiting to userspace in vCPU context is also kludgy. It makes sense for base
>>> userfault, because the vCPU can't make forward progress until the fault is
>>> resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
>>
>> Even though we exit to userspace, in case of APF-user, userspace is supposed
>> to re-enter the VM straight after scheduling the async job, which is then executed
>> concurrently with the vCPU.
>>
>>> James' series. Anyways, it definitely doesn't make sense for async #PF, because
>>> the whole point is to let the vCPU run. Signalling userspace would definitely
>>> add complexity, but only because of the need to communicate the token and wait
>>> for userspace to consume said token. I'll think more on that.
>>
>> By signalling userspace you mean a new non-exit-to-userspace mechanism
>> similar to UFFD?
>
> Yes.
>
>> What advantage can you see in it over exiting to userspace (which already exists
>> in James's series)?
>
> It doesn't exit to userspace :-)
>
> If userspace simply wakes a different task in response to the exit, then KVM
> should be able to wake said task, e.g. by signalling an eventfd, and resume the
> guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
> or not such an optimization is worth the complexity is an entirely different
> question though.
This reminds me of the discussion about VMA-less UFFD that was coming up
several times, such as [1], but AFAIK hasn't materialised into something
actionable. I may be wrong, but James was looking into that and
couldn't figure out a way to scale it sufficiently for his use case and
had to stick with the VM-exit-based approach. Can you see a world where
VM-exit userfaults coexist with no-VM-exit way of handling async PFs?
[1]: https://lore.kernel.org/kvm/ZqwKuzfAs7pvdHAN@x1n/
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-20 18:29 ` Nikita Kalyazin
@ 2025-02-20 18:49 ` Sean Christopherson
2025-02-21 11:02 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2025-02-20 18:49 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
> On 19/02/2025 15:17, Sean Christopherson wrote:
> > On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> > The conundrum with userspace async #PF is that if userspace is given only a single
> > bit per gfn to force an exit, then KVM won't be able to differentiate between
> > "faults" that will be handled synchronously by the vCPU task, and faults that
> > userspace will hand off to an I/O task. If the fault is handled synchronously,
> > KVM will needlessly inject a not-present #PF and a present IRQ.
>
> Right, but from the guest's point of view, async PF means "it will probably
> take a while for the host to get the page, so I may consider doing something
> else in the meantime (ie schedule another process if available)".
Except in this case, the guest never gets a chance to run, i.e. it can't do
something else. From the guest point of view, if KVM doesn't inject what is
effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
long time to execute.
> If we are exiting to userspace, it isn't going to be quick anyway, so we can
> consider all such faults "long" and warranting the execution of the async PF
> protocol. So always injecting a not-present #PF and page ready IRQ doesn't
> look too wrong in that case.
There is no "wrong", it's simply wasteful. The fact that the userspace exit is
"long" is completely irrelevant. Decompressing zswap is also slow, but it is
done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
#PFs.
In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
of that #PF.
> > > What advantage can you see in it over exiting to userspace (which already exists
> > > in James's series)?
> >
> > It doesn't exit to userspace :-)
> >
> > If userspace simply wakes a different task in response to the exit, then KVM
> > should be able to wake said task, e.g. by signalling an eventfd, and resume the
> > guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
> > or not such an optimization is worth the complexity is an entirely different
> > question though.
>
> This reminds me of the discussion about VMA-less UFFD that was coming up
> several times, such as [1], but AFAIK hasn't materialised into something
> actionable. I may be wrong, but James was looking into that and couldn't
> figure out a way to scale it sufficiently for his use case and had to stick
> with the VM-exit-based approach. Can you see a world where VM-exit
> userfaults coexist with no-VM-exit way of handling async PFs?
The issue with UFFD is that it's difficult to provide a generic "point of contact",
whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
per-vCPU buffers/structures to aid communication.
That said, supporting "exitless" KVM userfault would most definitely be premature
optimization without strong evidence it would benefit a real world use case.
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-20 18:49 ` Sean Christopherson
@ 2025-02-21 11:02 ` Nikita Kalyazin
2025-02-26 0:58 ` Sean Christopherson
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-02-21 11:02 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On 20/02/2025 18:49, Sean Christopherson wrote:
> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>> userspace will hand off to an I/O task. If the fault is handled synchronously,
>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>
>> Right, but from the guest's point of view, async PF means "it will probably
>> take a while for the host to get the page, so I may consider doing something
>> else in the meantime (ie schedule another process if available)".
>
> Except in this case, the guest never gets a chance to run, i.e. it can't do
> something else. From the guest point of view, if KVM doesn't inject what is
> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> long time to execute.
Sorry, I didn't get that. If userspace learns from the
kvm_run::memory_fault::flags that the exit is due to an async PF, it
should call KVM_RUN immediately, inject the not-present #PF and allow the
guest to reschedule. What do you mean by "the guest never gets a chance
to run"?
>> If we are exiting to userspace, it isn't going to be quick anyway, so we can
>> consider all such faults "long" and warranting the execution of the async PF
>> protocol. So always injecting a not-present #PF and page ready IRQ doesn't
>> look too wrong in that case.
>
> There is no "wrong", it's simply wasteful. The fact that the userspace exit is
> "long" is completely irrelevant. Decompressing zswap is also slow, but it is
> done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
> #PFs.
>
> In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
> vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
> of that #PF.
Is this practically likely? At least in our scenario (Firecracker
snapshot restore) and probably in live migration postcopy, if a vCPU
hits a fault, it's probably because the content of the page is somewhere
remote (eg on the source machine or wherever the snapshot data is
stored) and isn't going to be available quickly. Conversely, if the
page content is available, it must have already been prepopulated into
guest memory pagecache, the bit in the bitmap is cleared and no exit to
userspace occurs.
>>>> What advantage can you see in it over exiting to userspace (which already exists
>>>> in James's series)?
>>>
>>> It doesn't exit to userspace :-)
>>>
>>> If userspace simply wakes a different task in response to the exit, then KVM
>>> should be able to wake said task, e.g. by signalling an eventfd, and resume the
>>> guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
>>> or not such an optimization is worth the complexity is an entirely different
>>> question though.
>>
>> This reminds me of the discussion about VMA-less UFFD that was coming up
>> several times, such as [1], but AFAIK hasn't materialised into something
>> actionable. I may be wrong, but James was looking into that and couldn't
>> figure out a way to scale it sufficiently for his use case and had to stick
>> with the VM-exit-based approach. Can you see a world where VM-exit
>> userfaults coexist with no-VM-exit way of handling async PFs?
>
> The issue with UFFD is that it's difficult to provide a generic "point of contact",
> whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
> per-vCPU buffers/structures to aid communication.
>
> That said, supporting "exitless" KVM userfault would most definitely be premature
> optimization without strong evidence it would benefit a real world use case.
Does that mean that the "exitless" solution for async PF is a long-term
one (if required), while the short-term would still be "exitful" (if we
find a way to do it sensibly)?
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-21 11:02 ` Nikita Kalyazin
@ 2025-02-26 0:58 ` Sean Christopherson
2025-02-26 17:07 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2025-02-26 0:58 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
> On 20/02/2025 18:49, Sean Christopherson wrote:
> > On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
> > > On 19/02/2025 15:17, Sean Christopherson wrote:
> > > > On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> > > > The conundrum with userspace async #PF is that if userspace is given only a single
> > > > bit per gfn to force an exit, then KVM won't be able to differentiate between
> > > > "faults" that will be handled synchronously by the vCPU task, and faults that
> > > > userspace will hand off to an I/O task. If the fault is handled synchronously,
> > > > KVM will needlessly inject a not-present #PF and a present IRQ.
> > >
> > > Right, but from the guest's point of view, async PF means "it will probably
> > > take a while for the host to get the page, so I may consider doing something
> > > else in the meantime (ie schedule another process if available)".
> >
> > Except in this case, the guest never gets a chance to run, i.e. it can't do
> > something else. From the guest point of view, if KVM doesn't inject what is
> > effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> > long time to execute.
>
> Sorry, I didn't get that. If userspace learns from the
> kvm_run::memory_fault::flags that the exit is due to an async PF, it should
> call KVM_RUN immediately, inject the not-present #PF and allow the guest to
> reschedule. What do you mean by "the guest never gets a chance to run"?
What I'm saying is that, as proposed, the API doesn't precisely tell userspace
an exit happened due to an "async #PF". KVM has absolutely zero clue as to
whether or not userspace is going to do an async #PF, or if userspace wants to
intercept the fault for some entirely different purpose.
> > > If we are exiting to userspace, it isn't going to be quick anyway, so we can
> > > consider all such faults "long" and warranting the execution of the async PF
> > > protocol. So always injecting a not-present #PF and page ready IRQ doesn't
> > > look too wrong in that case.
> >
> > There is no "wrong", it's simply wasteful. The fact that the userspace exit is
> > "long" is completely irrelevant. Decompressing zswap is also slow, but it is
> > done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
> > #PFs.
> >
> > In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
> > vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
> > of that #PF.
>
> Is this practically likely?
Yes, I think it's quite possible.
> At least in our scenario (Firecracker snapshot restore) and probably in live
> migration postcopy, if a vCPU hits a fault, it's probably because the content
> of the page is somewhere remote (eg on the source machine or wherever the
> snapshot data is stored) and isn't going to be available quickly.
Unless the remote page was already requested, e.g. by a different vCPU, or by a
prefetching algorithm.
> Conversely, if the page content is available, it must have already been
> prepopulated into guest memory pagecache, the bit in the bitmap is cleared
> and no exit to userspace occurs.
But that doesn't happen instantaneously. Even if the VMM somehow atomically
receives the page and marks it present, it's still possible for marking the page
present to race with KVM checking the bitmap.
> > > > > What advantage can you see in it over exiting to userspace (which already exists
> > > > > in James's series)?
> > > >
> > > > It doesn't exit to userspace :-)
> > > >
> > > > If userspace simply wakes a different task in response to the exit, then KVM
> > > > should be able to wake said task, e.g. by signalling an eventfd, and resume the
> > > > guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
> > > > or not such an optimization is worth the complexity is an entirely different
> > > > question though.
> > >
> > > This reminds me of the discussion about VMA-less UFFD that was coming up
> > > several times, such as [1], but AFAIK hasn't materialised into something
> > > actionable. I may be wrong, but James was looking into that and couldn't
> > > figure out a way to scale it sufficiently for his use case and had to stick
> > > with the VM-exit-based approach. Can you see a world where VM-exit
> > > userfaults coexist with no-VM-exit way of handling async PFs?
> >
> > The issue with UFFD is that it's difficult to provide a generic "point of contact",
> > whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
> > per-vCPU buffers/structures to aid communication.
> >
> > That said, supporting "exitless" KVM userfault would most definitely be premature
> > optimization without strong evidence it would benefit a real world use case.
>
> Does that mean that the "exitless" solution for async PF is a long-term one
> (if required), while the short-term would still be "exitful" (if we find a
> way to do it sensibly)?
My question on exitless support was purely exploratory, just ignore it for now.
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-26 0:58 ` Sean Christopherson
@ 2025-02-26 17:07 ` Nikita Kalyazin
2025-02-27 16:44 ` Sean Christopherson
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-02-26 17:07 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On 26/02/2025 00:58, Sean Christopherson wrote:
> On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
>> On 20/02/2025 18:49, Sean Christopherson wrote:
>>> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>>>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>>>> userspace will hand off to an I/O task. If the fault is handled synchronously,
>>>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>>>
>>>> Right, but from the guest's point of view, async PF means "it will probably
>>>> take a while for the host to get the page, so I may consider doing something
>>>> else in the meantime (ie schedule another process if available)".
>>>
>>> Except in this case, the guest never gets a chance to run, i.e. it can't do
>>> something else. From the guest point of view, if KVM doesn't inject what is
>>> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
>>> long time to execute.
>>
>> Sorry, I didn't get that. If userspace learns from the
>> kvm_run::memory_fault::flags that the exit is due to an async PF, it should
>> call KVM_RUN immediately, inject the not-present #PF and allow the guest to
>> reschedule. What do you mean by "the guest never gets a chance to run"?
>
> What I'm saying is that, as proposed, the API doesn't precisely tell userspace
> an exit happened due to an "async #PF". KVM has absolutely zero clue as to
> whether or not userspace is going to do an async #PF, or if userspace wants to
> intercept the fault for some entirely different purpose.
Userspace is supposed to know whether the PF is async from the dedicated
flag added in the memory_fault structure:
KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM has managed
to inject the page-not-present event. Are you saying that isn't sufficient?
@@ -4396,6 +4412,35 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
{
bool async;
+ /* Pre-check for userfault and bail out early. */
+ if (gfn_has_userfault(fault->slot->kvm, fault->gfn)) {
+ bool report_async = false;
+ u32 token = 0;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ !fault->prefetch && kvm_can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(fault->addr, fault->gfn, 1);
+ if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
+ trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn, 1);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return RET_PF_RETRY;
+ } else if (kvm_can_deliver_async_pf(vcpu) &&
+ kvm_arch_setup_async_pf_user(vcpu, fault, &token)) {
+ report_async = true;
+ }
+ }
+
+ fault->pfn = KVM_PFN_ERR_USERFAULT;
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+
+ if (report_async) {
+ vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER;
+ vcpu->run->memory_fault.async_pf_user_token = token;
+ }
+
+ return -EFAULT;
+ }
+
>>>> If we are exiting to userspace, it isn't going to be quick anyway, so we can
>>>> consider all such faults "long" and warranting the execution of the async PF
>>>> protocol. So always injecting a not-present #PF and page ready IRQ doesn't
>>>> look too wrong in that case.
>>>
>>> There is no "wrong", it's simply wasteful. The fact that the userspace exit is
>>> "long" is completely irrelevant. Decompressing zswap is also slow, but it is
>>> done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
>>> #PFs.
>>>
>>> In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
>>> vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
>>> of that #PF.
>>
>> Is this practically likely?
>
> Yes, I think's it's quite possible.
>
>> At least in our scenario (Firecracker snapshot restore) and probably in live
>> migration postcopy, if a vCPU hits a fault, it's probably because the content
>> of the page is somewhere remote (eg on the source machine or wherever the
>> snapshot data is stored) and isn't going to be available quickly.
>
> Unless the remote page was already requested, e.g. by a different vCPU, or by a
> prefetching algorithim.
>
>> Conversely, if the page content is available, it must have already been
>> prepopulated into guest memory pagecache, the bit in the bitmap is cleared
>> and no exit to userspace occurs.
>
> But that doesn't happen instantaneously. Even if the VMM somehow atomically
> receives the page and marks it present, it's still possible for marking the page
> present to race with KVM checking the bitmap.
That looks like a generic problem of VM-exit-based fault handling. E.g.
when one vCPU exits, userspace handles the fault and races setting the
bitmap with another vCPU that is about to fault on the same page, which
may cause a spurious exit.
On the other hand, is it actually harmful? The only downside is the
additional overhead of the async PF protocol, but if the race occurs
infrequently, it shouldn't be a problem.
>>>>>> What advantage can you see in it over exiting to userspace (which already exists
>>>>>> in James's series)?
>>>>>
>>>>> It doesn't exit to userspace :-)
>>>>>
>>>>> If userspace simply wakes a different task in response to the exit, then KVM
>>>>> should be able to wake said task, e.g. by signalling an eventfd, and resume the
>>>>> guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
>>>>> or not such an optimization is worth the complexity is an entirely different
>>>>> question though.
>>>>
>>>> This reminds me of the discussion about VMA-less UFFD that was coming up
>>>> several times, such as [1], but AFAIK hasn't materialised into something
>>>> actionable. I may be wrong, but James was looking into that and couldn't
>>>> figure out a way to scale it sufficiently for his use case and had to stick
>>>> with the VM-exit-based approach. Can you see a world where VM-exit
>>>> userfaults coexist with no-VM-exit way of handling async PFs?
>>>
>>> The issue with UFFD is that it's difficult to provide a generic "point of contact",
>>> whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
>>> per-vCPU buffers/structures to aid communication.
>>>
>>> That said, supporting "exitless" KVM userfault would most definitely be premature
>>> optimization without strong evidence it would benefit a real world use case.
>>
>> Does that mean that the "exitless" solution for async PF is a long-term one
>> (if required), while the short-term would still be "exitful" (if we find a
>> way to do it sensibly)?
>
> My question on exitless support was purely exploratory, just ignore it for now.
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-26 17:07 ` Nikita Kalyazin
@ 2025-02-27 16:44 ` Sean Christopherson
2025-02-27 18:24 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2025-02-27 16:44 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Wed, Feb 26, 2025, Nikita Kalyazin wrote:
> On 26/02/2025 00:58, Sean Christopherson wrote:
> > On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
> > > On 20/02/2025 18:49, Sean Christopherson wrote:
> > > > On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
> > > > > On 19/02/2025 15:17, Sean Christopherson wrote:
> > > > > > On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> > > > > > The conundrum with userspace async #PF is that if userspace is given only a single
> > > > > > bit per gfn to force an exit, then KVM won't be able to differentiate between
> > > > > > "faults" that will be handled synchronously by the vCPU task, and faults that
> > > > > > userspace will hand off to an I/O task. If the fault is handled synchronously,
> > > > > > KVM will needlessly inject a not-present #PF and a present IRQ.
> > > > >
> > > > > Right, but from the guest's point of view, async PF means "it will probably
> > > > > take a while for the host to get the page, so I may consider doing something
> > > > > else in the meantime (ie schedule another process if available)".
> > > >
> > > > Except in this case, the guest never gets a chance to run, i.e. it can't do
> > > > something else. From the guest point of view, if KVM doesn't inject what is
> > > > effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> > > > long time to execute.
> > >
> > > Sorry, I didn't get that. If userspace learns from the
> > > kvm_run::memory_fault::flags that the exit is due to an async PF, it should
> > > call KVM_RUN immediately, inject the not-present PF and allow the guest to
> > > reschedule. What do you mean by "the guest never gets a chance to run"?
> >
> > What I'm saying is that, as proposed, the API doesn't precisely tell userspace
^^^^^^^^^
KVM
> > an exit happened due to an "async #PF". KVM has absolutely zero clue as to
> > whether or not userspace is going to do an async #PF, or if userspace wants to
> > intercept the fault for some entirely different purpose.
>
> Userspace is supposed to know whether the PF is async from the dedicated
> flag added in the memory_fault structure:
> KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM managed to
> inject page-not-present. Are you saying it isn't sufficient?
Gah, sorry, typo. The API doesn't tell *KVM* that userfault exit is due to an
async #PF.
> > Unless the remote page was already requested, e.g. by a different vCPU, or by a
> > prefetching algorithm.
> >
> > > Conversely, if the page content is available, it must have already been
> > > prepopulated into guest memory pagecache, the bit in the bitmap is cleared
> > > and no exit to userspace occurs.
> >
> > But that doesn't happen instantaneously. Even if the VMM somehow atomically
> > receives the page and marks it present, it's still possible for marking the page
> > present to race with KVM checking the bitmap.
>
> That looks like a generic problem of the VM-exit fault handling. Eg when
Heh, it's a generic "problem" for faults in general. E.g. modern x86 CPUs will
take "spurious" page faults on write accesses if a PTE is writable in memory but
the CPU has a read-only mapping cached in its TLB.
It's all a matter of cost. E.g. pre-Nehalem Intel CPUs didn't take such spurious
read-only faults as they would re-walk the in-memory page tables, but that ended
up being a net negative because the cost of re-walking for all read-only faults
outweighed the benefits of avoiding spurious faults in the unlikely scenario the
fault had already been fixed.
For a spurious async #PF + IRQ, the cost could be significant, e.g. due to causing
unwanted context switches in the guest, in addition to the raw overhead of the
faults, interrupts, and exits.
> one vCPU exits, userspace handles the fault and races setting the bitmap
> with another vCPU that is about to fault the same page, which may cause a
> spurious exit.
>
> On the other hand, is it malignant? The only downside is additional
> overhead of the async PF protocol, but if the race occurs infrequently, it
> shouldn't be a problem.
When it comes to uAPI, I want to try and avoid statements along the lines of
"IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI,
I wouldn't care as much, i.e. I'd be much more willing to iterate as needed.
I'm not saying we should go straight for a complex implementation. Quite the
opposite. But I do want us to consider the possible ramifications of using a
single bit for all userfaults, so that we can at least try to design something
that is extensible and won't be a pain to maintain.
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-27 16:44 ` Sean Christopherson
@ 2025-02-27 18:24 ` Nikita Kalyazin
2025-02-27 23:47 ` Sean Christopherson
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-02-27 18:24 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On 27/02/2025 16:44, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Nikita Kalyazin wrote:
>> On 26/02/2025 00:58, Sean Christopherson wrote:
>>> On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
>>>> On 20/02/2025 18:49, Sean Christopherson wrote:
>>>>> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>>>>>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>>>>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>>>>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>>>>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>>>>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>>>>>> userspace will hand off to an I/O task. If the fault is handled synchronously,
>>>>>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>>>>>
>>>>>> Right, but from the guest's point of view, async PF means "it will probably
>>>>>> take a while for the host to get the page, so I may consider doing something
>>>>>> else in the meantime (ie schedule another process if available)".
>>>>>
>>>>> Except in this case, the guest never gets a chance to run, i.e. it can't do
>>>>> something else. From the guest point of view, if KVM doesn't inject what is
>>>>> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
>>>>> long time to execute.
>>>>
>>>> Sorry, I didn't get that. If userspace learns from the
>>>> kvm_run::memory_fault::flags that the exit is due to an async PF, it should
>>>> call KVM_RUN immediately, inject the not-present PF and allow the guest to
>>>> reschedule. What do you mean by "the guest never gets a chance to run"?
>>>
>>> What I'm saying is that, as proposed, the API doesn't precisely tell userspace
> ^^^^^^^^^
> KVM
>>> an exit happened due to an "async #PF". KVM has absolutely zero clue as to
>>> whether or not userspace is going to do an async #PF, or if userspace wants to
>>> intercept the fault for some entirely different purpose.
>>
>> Userspace is supposed to know whether the PF is async from the dedicated
>> flag added in the memory_fault structure:
>> KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM managed to
>> inject page-not-present. Are you saying it isn't sufficient?
>
> Gah, sorry, typo. The API doesn't tell *KVM* that userfault exit is due to an
> async #PF.
>
>>> Unless the remote page was already requested, e.g. by a different vCPU, or by a
>>> prefetching algorithm.
>>>
>>>> Conversely, if the page content is available, it must have already been
>>>> prepopulated into guest memory pagecache, the bit in the bitmap is cleared
>>>> and no exit to userspace occurs.
>>>
>>> But that doesn't happen instantaneously. Even if the VMM somehow atomically
>>> receives the page and marks it present, it's still possible for marking the page
>>> present to race with KVM checking the bitmap.
>>
>> That looks like a generic problem of the VM-exit fault handling. Eg when
>
> Heh, it's a generic "problem" for faults in general. E.g. modern x86 CPUs will
> take "spurious" page faults on write accesses if a PTE is writable in memory but
> the CPU has a read-only mapping cached in its TLB.
>
> It's all a matter of cost. E.g. pre-Nehalem Intel CPUs didn't take such spurious
> read-only faults as they would re-walk the in-memory page tables, but that ended
> up being a net negative because the cost of re-walking for all read-only faults
> outweighed the benefits of avoiding spurious faults in the unlikely scenario the
> fault had already been fixed.
>
> For a spurious async #PF + IRQ, the cost could be significant, e.g. due to causing
> unwanted context switches in the guest, in addition to the raw overhead of the
> faults, interrupts, and exits.
>
>> one vCPU exits, userspace handles the fault and races setting the bitmap
>> with another vCPU that is about to fault the same page, which may cause a
>> spurious exit.
>>
>> On the other hand, is it malignant? The only downside is additional
>> overhead of the async PF protocol, but if the race occurs infrequently, it
>> shouldn't be a problem.
>
> When it comes to uAPI, I want to try and avoid statements along the lines of
> "IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI,
> I wouldn't care as much, i.e. I'd be much more willing to iterate as needed.
>
> I'm not saying we should go straight for a complex implementation. Quite the
> opposite. But I do want us to consider the possible ramifications of using a
> single bit for all userfaults, so that we can at least try to design something
> that is extensible and won't be a pain to maintain.
So you would have preferred the "two-bit per gfn" approach, i.e. providing
two interception points, for sync and async exits, with the former chosen
by userspace when it "knows" that the content is already in memory?
What makes it a conundrum then? It looks like an incremental change to
what has already been proposed. There is a complication that 2-bit
operations aren't atomic, but even 1 bit is racy between KVM and userspace.
* Re: [RFC PATCH 0/6] KVM: x86: async PF user
2025-02-27 18:24 ` Nikita Kalyazin
@ 2025-02-27 23:47 ` Sean Christopherson
0 siblings, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2025-02-27 23:47 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, hpa, rostedt,
mhiramat, mathieu.desnoyers, kvm, linux-doc, linux-kernel,
linux-trace-kernel, jthoughton, david, peterx, oleg, vkuznets,
gshan, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx
On Thu, Feb 27, 2025, Nikita Kalyazin wrote:
> On 27/02/2025 16:44, Sean Christopherson wrote:
> > When it comes to uAPI, I want to try and avoid statements along the lines of
> > "IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI,
> > I wouldn't care as much, i.e. I'd be much more willing to iterate as needed.
> >
> > I'm not saying we should go straight for a complex implementation. Quite the
> > opposite. But I do want us to consider the possible ramifications of using a
> > single bit for all userfaults, so that we can at least try to design something
> > that is extensible and won't be a pain to maintain.
>
> So you would have preferred the "two-bit per gfn" approach, i.e. providing two
> interception points, for sync and async exits, with the former chosen by
> userspace when it "knows" that the content is already in memory?
No, all I'm saying is I want people to think about what the future will look like,
to minimize the chances of ending up with a mess.
Thread overview: 20+ messages
2024-11-18 12:39 [RFC PATCH 0/6] KVM: x86: async PF user Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 1/6] Documentation: KVM: add userfault KVM exit flag Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 2/6] Documentation: KVM: add async pf user doc Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 3/6] KVM: x86: add async ioctl support Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 4/6] KVM: trace events: add type argument to async pf Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 5/6] KVM: x86: async_pf_user: add infrastructure Nikita Kalyazin
2024-11-18 12:39 ` [RFC PATCH 6/6] KVM: x86: async_pf_user: hook to fault handling and add ioctl Nikita Kalyazin
2024-11-19 1:26 ` [RFC PATCH 0/6] KVM: x86: async PF user James Houghton
2024-11-19 16:19 ` Nikita Kalyazin
2025-02-11 21:17 ` Sean Christopherson
2025-02-12 18:14 ` Nikita Kalyazin
2025-02-19 15:17 ` Sean Christopherson
2025-02-20 18:29 ` Nikita Kalyazin
2025-02-20 18:49 ` Sean Christopherson
2025-02-21 11:02 ` Nikita Kalyazin
2025-02-26 0:58 ` Sean Christopherson
2025-02-26 17:07 ` Nikita Kalyazin
2025-02-27 16:44 ` Sean Christopherson
2025-02-27 18:24 ` Nikita Kalyazin
2025-02-27 23:47 ` Sean Christopherson