kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults
@ 2023-11-09 21:03 Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
                   ` (14 more replies)
  0 siblings, 15 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

This series adds an option to cause stage-2 fault handlers to
KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
the userspace mappings. Doing so allows userspace to receive stage-2
faults directly from KVM_RUN instead of through userfaultfd, which
suffers from serious contention issues as the number of vCPUs scales.

Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
demand_paging_test, which demonstrates the scalability improvements:
the following data was collected using [2] on an x86 machine with 256
cores.

vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
1       150     340
2       191     477
4       210     809
8       155     1239
16      130     1595
32      108     2299
64      86      3482
128     62      4134
256     36      4012

TODO
~~~~
No known issues/things to resolve. However, documentation/commit logs
merit a close look given how much feedback I've received on those :/

Base Commit
~~~~~~~~~~~
This series is based off of kvm/next (45b890f7689e) with v14 of the
guest_memfd series applied, with some fixes on top [3].

Links
~~~~~
[1] Original RFC from James Houghton:
    https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/

[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r species the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series

[3] https://lore.kernel.org/kvm/20231105163040.14904-1-pbonzini@redhat.com/T/#m56361120ee1dd5265a5710e6a814906cda8e1020
    The following fixes are required to get the KVM selftests to compile
    on arm64
    - https://lore.kernel.org/kvm/20231108233723.3380042-1-amoorthy@google.com/
    - https://lore.kernel.org/kvm/affca7a8-116e-4b0f-9edf-6cdc05ba65ca@redhat.com/
    - Unguarding the definitions of MEM_REGION_GPA/SLOT in set_memory_region_test
      (not sure if this is the "right" fix for that test, but it compiles)

---

v6
  - Rebase onto guest_memfd series [Anish/Sean]
  - Set write fault flag properly in user_mem_abort() [Oliver]
  - Reformat unnecessarily multi-line comments [Sean]
  - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
  - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
  - Remove unnecessary rounding in user_mem_abort() annotation [David]
  - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
    them with the stage-2 fault annotation patches [Sean]
  - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
    add another boolean parameter instead [Sean]
  - Better shortlog for the hva_to_pfn_fast() change [Anish]

v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
  - Rename APIs (again) [Sean]
  - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
  - Reword hva_to_pfn_fast() change commit message [Sean]
  - Correct style on terminal if statements [Sean]
  - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
  - Add read fault flag for annotated faults [Sean]
  - read/write_guest_page() changes
      - Move the annotations into vcpu wrapper fns [Sean]
      - Reorder parameters [Robert]
  - Rename kvm_populate_efault_info() to
    kvm_handle_guest_uaccess_fault() [Sean]
  - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
  - Correct description of the faults which hva_to_pfn_fast() can now
    resolve [Sean]
  - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
  - Magnanimously accept Sean's rewrite of the handle_error_pfn()
    annotation [Anish]
  - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]

v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyways, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (14):
  KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic'
    parameter
  KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
    userspace
  KVM: Try using fast GUP to resolve read faults
  KVM: Add memslot flag to let userspace force an exit on missing hva
    mappings
  KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
    stage-2 fault handler
  KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from
    stage-2 fault-handler
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  33 +-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |   8 +-
 include/linux/kvm_host.h                      |  21 +-
 include/uapi/linux/kvm.h                      |   5 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  46 ++-
 22 files changed, 444 insertions(+), 175 deletions(-)

-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-02-07 15:26   ` Sean Christopherson
  2023-11-09 21:03 ` [PATCH v6 02/14] KVM: Documentation: Add docstrings for __kvm_read/write_guest_page() Anish Moorthy
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

The current docstring can be read as "atomic -> allowed to sleep," when
in fact the intended statement is "atomic -> NOT allowed to sleep." Make
that clearer in the docstring.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9170a61ea99f..687374138cfd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2983,7 +2983,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 /*
  * Pin guest page in memory and return its pfn.
  * @addr: host virtual address which maps memory to the guest
- * @atomic: whether this function can sleep
+ * @atomic: whether this function is forbidden from sleeping
  * @interruptible: whether the process can be interrupted by non-fatal signals
  * @async: whether this function need to wait IO complete if the
  *         host page is not in the memory
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 02/14] KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-02-07 15:30   ` Sean Christopherson
  2023-11-09 21:03 ` [PATCH v6 03/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

The (gfn, data, offset, len) order of parameters is a little strange
since "offset" applies to "gfn" rather than to "data". Add docstrings to
make things perfectly clear.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 687374138cfd..f521b6fd808f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3328,6 +3328,7 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
 {
@@ -3429,6 +3430,7 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
+/* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
 static int __kvm_write_guest_page(struct kvm *kvm,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 03/14] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 02/14] KVM: Documentation: Add docstrings for __kvm_read/write_guest_page() Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to
duplicate the "if (writable)" block. Fix this by bringing all
kvm_is_error_hva() cases under one conditional.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f521b6fd808f..88946d5d102b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3055,15 +3055,13 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 	if (hva)
 		*hva = addr;
 
-	if (addr == KVM_HVA_ERR_RO_BAD) {
-		if (writable)
-			*writable = false;
-		return KVM_PFN_ERR_RO_FAULT;
-	}
-
 	if (kvm_is_error_hva(addr)) {
 		if (writable)
 			*writable = false;
+
+		if (addr == KVM_HVA_ERR_RO_BAD)
+			return KVM_PFN_ERR_RO_FAULT;
+
 		return KVM_PFN_NOSLOT;
 	}
 
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (2 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 03/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-02-07 15:30   ` Sean Christopherson
  2023-11-09 21:03 ` [PATCH v6 05/14] KVM: Try using fast GUP to resolve read faults Anish Moorthy
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 5 +++++
 include/linux/kvm_host.h       | 9 ++++++++-
 include/uapi/linux/kvm.h       | 3 +++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c13ede498369..a07964f601de 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6979,6 +6979,9 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+  #define KVM_MEMORY_EXIT_FLAG_READ     (1ULL << 0)
+  #define KVM_MEMORY_EXIT_FLAG_WRITE    (1ULL << 1)
+  #define KVM_MEMORY_EXIT_FLAG_EXEC     (1ULL << 2)
   #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
@@ -6990,6 +6993,8 @@ could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
 guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
 describes properties of the faulting access that are likely pertinent:
 
+ - KVM_MEMORY_EXIT_FLAG_READ/WRITE/EXEC - When set, indicates that the memory
+   fault occurred on a read/write/exec access respectively.
  - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
    on a private memory access.  When clear, indicates the fault occurred on a
    shared access.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4d5d139b0bde..5201400358da 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2372,8 +2372,15 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.gpa = gpa;
 	vcpu->run->memory_fault.size = size;
 
-	/* RWX flags are not (yet) defined or communicated to userspace. */
 	vcpu->run->memory_fault.flags = 0;
+
+	if (is_write)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_WRITE;
+	else if (is_exec)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_EXEC;
+	else
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_READ;
+
 	if (is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b4ba4b53b834..bda5622a9c68 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -535,6 +535,9 @@ struct kvm_run {
 		} notify;
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+#define KVM_MEMORY_EXIT_FLAG_READ       (1ULL << 0)
+#define KVM_MEMORY_EXIT_FLAG_WRITE      (1ULL << 1)
+#define KVM_MEMORY_EXIT_FLAG_EXEC       (1ULL << 2)
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 05/14] KVM: Try using fast GUP to resolve read faults
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (3 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-02-05 21:55   ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

hva_to_pfn_fast() currently just fails for faults where establishing
writable mappings is forbidden, which is unnecessary. Instead, try
getting the page without passing FOLL_WRITE. This allows the
aforementioned faults to (potentially) be resolved without falling back
to slow GUP.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 88946d5d102b..725191333c4e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2811,7 +2811,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 }
 
 /*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
@@ -2825,10 +2825,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
-		return false;
+	unsigned int gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(addr, gup_flags, page)) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (4 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 05/14] KVM: Try using fast GUP to resolve read faults Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-01-31  0:25   ` James Houghton
  2023-11-09 21:03 ` [PATCH v6 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler Anish Moorthy
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Allowing KVM to fault in pages during vcpu-context guest memory accesses
can be undesirable: during userfaultfd-based postcopy, it can cause
significant performance issues due to vCPUs contending for
userfaultfd-internal locks.

Add a new memslot flag (KVM_MEM_EXIT_ON_MISSING) through which userspace
can indicate that KVM_RUN should exit instead of faulting in pages
during vcpu-context guest memory accesses. The unfaulted pages are
reported by the accompanying KVM_EXIT_MEMORY_FAULT_INFO, allowing
userspace to determine and take appropriate action.

The basic implementation strategy is to check the memslot flag from
within __gfn_to_pfn_memslot() and override the caller-provided arguments
accordingly. Some callers (such as kvm_vcpu_map()) must be able to opt
out of this behavior, and do so by passing can_exit_on_missing=false.

No functional change intended: nothing sets KVM_MEM_EXIT_ON_MISSING or
passes can_exit_on_missing=true to __gfn_to_pfn_memslot().

Suggested-by: James Houghton <jthoughton@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst         | 28 +++++++++++++++++++++++---
 arch/arm64/kvm/mmu.c                   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/kvm/mmu/mmu.c                 |  4 ++--
 include/linux/kvm_host.h               | 12 ++++++++++-
 include/uapi/linux/kvm.h               |  2 ++
 virt/kvm/Kconfig                       |  3 +++
 virt/kvm/kvm_main.c                    | 25 ++++++++++++++++++-----
 9 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a07964f601de..1457865f6e98 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1365,6 +1365,8 @@ yet and must be cleared on entry.
   /* for kvm_userspace_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
+  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1395,12 +1397,16 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+The flags field supports four flags
+
+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
 to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
+3.  KVM_MEM_GUEST_MEMFD
+4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -8059,6 +8065,22 @@ error/annotated fault.
 
 See KVM_EXIT_MEMORY_FAULT for more information.
 
+7.35 KVM_CAP_EXIT_ON_MISSING
+----------------------------
+
+:Architectures: None
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that userspace may set the
+KVM_MEM_EXIT_ON_MISSING flag on memslots. Said flag will cause KVM_RUN to fail
+(-EFAULT) in response to guest-context memory accesses which would require KVM
+to page fault on the userspace mapping.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO. Userspace should take appropriate action.
+This could mean, for instance, checking that the fault is resolvable, faulting
+in the relevant userspace mapping, then retrying KVM_RUN.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4e41ceed5468..13066a6fdfff 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1486,7 +1486,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmap_read_unlock(current->mm);
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+				   write_fault, &writable, false, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index efd0ebf70a5e..2ce0e1d3f597 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -613,7 +613,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	} else {
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, &write_ok, NULL);
+					   writing, &write_ok, false, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 572707858d65..9d40ca02747f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -847,7 +847,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, upgrade_p, NULL);
+					   writing, upgrade_p, false, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4de7670d5976..b1e5e42bdeb4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4375,7 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
-					  &fault->hva);
+					  false, &fault->hva);
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4397,7 +4397,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 */
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
 					  fault->write, &fault->map_writable,
-					  &fault->hva);
+					  false, &fault->hva);
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5201400358da..e8e30088289e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1219,7 +1219,8 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva);
+			       bool write_fault, bool *writable,
+			       bool can_exit_on_missing, hva_t *hva);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
@@ -2423,4 +2424,13 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+/*
+ * Whether vCPUs should exit upon trying to access memory for which the
+ * userspace mappings are missing.
+ */
+static inline bool kvm_is_slot_exit_on_missing(const struct kvm_memory_slot *slot)
+{
+	return slot && slot->flags & KVM_MEM_EXIT_ON_MISSING;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bda5622a9c68..18546cbada61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -116,6 +116,7 @@ struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_EXIT_ON_MISSING	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1231,6 +1232,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_EXIT_ON_MISSING 236
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 2c964586aa14..241f524a4e9d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,3 +109,6 @@ config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_PRIVATE_MEM
        bool
+
+config HAVE_KVM_EXIT_ON_MISSING
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 725191333c4e..faaccdba179c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1614,7 +1614,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
  * only allows these.
  */
 #define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
-	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | KVM_MEM_EXIT_ON_MISSING)
 
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
@@ -1632,6 +1632,9 @@ static int check_memory_region_flags(struct kvm *kvm,
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING))
+		valid_flags |= KVM_MEM_EXIT_ON_MISSING;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -3047,7 +3050,8 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva)
+			       bool write_fault, bool *writable,
+			       bool can_exit_on_missing, hva_t *hva)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		writable = NULL;
 	}
 
+	if (!atomic && can_exit_on_missing
+	    && kvm_is_slot_exit_on_missing(slot)) {
+		atomic = true;
+		if (async) {
+			*async = false;
+			async = NULL;
+		}
+	}
+
 	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
 			  writable);
 }
@@ -3079,21 +3092,21 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
-				    NULL, write_fault, writable, NULL);
+				    NULL, write_fault, writable, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
-				    NULL, NULL);
+				    NULL, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
-				    NULL, NULL);
+				    NULL, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
@@ -4898,6 +4911,8 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
+	case KVM_CAP_EXIT_ON_MISSING:
+		return IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING);
 	default:
 		break;
 	}
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (5 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Prevent the stage-2 fault handler from faulting in pages when
KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
calls to check the memslot flag.

To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
when the stage-2 handler returns EFAULT, e.g. when it cannot resolve the
pfn. With KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of
stage-2 faults as vCPU exits, which userspace can attempt to resolve
without terminating the guest.

Delivering stage-2 faults to userspace in this way sidesteps the
significant scalabiliy issues associated with using userfaultfd for the
same purpose.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/x86/kvm/Kconfig           | 1 +
 arch/x86/kvm/mmu/mmu.c         | 8 ++++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1457865f6e98..fd87bbfbfdf2 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8068,7 +8068,7 @@ See KVM_EXIT_MEMORY_FAULT for more information.
 7.35 KVM_CAP_EXIT_ON_MISSING
 ----------------------------
 
-:Architectures: None
+:Architectures: x86
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that userspace may set the
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index c1716e83d176..97b16be349a2 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@ config KVM
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
+        select HAVE_KVM_EXIT_ON_MISSING
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b1e5e42bdeb4..bc978260d2be 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3309,6 +3309,10 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
+	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);
+
+	kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
+				      fault->write, fault->exec, fault->is_private);
 	return -EFAULT;
 }
 
@@ -4375,7 +4379,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
-					  false, &fault->hva);
+					  true, &fault->hva);
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4397,7 +4401,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 */
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
 					  fault->write, &fault->map_writable,
-					  false, &fault->hva);
+					  true, &fault->hva);
 	return RET_PF_CONTINUE;
 }
 
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (6 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:07   ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler Anish Moorthy
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

TODO: Changelog -- and possibly just merge into the "god" arm commit?

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/arm64/kvm/arm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 317964bad1e1..b5c1d1fb77d0 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -241,6 +241,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (7 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-01-30 23:58   ` James Houghton
  2023-11-09 21:03 ` [PATCH v6 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Prevent the stage-2 fault handler from faulting in pages when
KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
calls to check the memslot flag.

To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
when the stage-2 handler cannot resolve the pfn for a fault. With
KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
faults as vCPU exits, which userspace can attempt to resolve without
terminating the guest.

Delivering stage-2 faults to userspace in this way sidesteps the
significant scalabiliy issues associated with using userfaultfd for the
same purpose.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/arm64/kvm/Kconfig         | 1 +
 arch/arm64/kvm/mmu.c           | 7 +++++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index fd87bbfbfdf2..67fcb9dbe855 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8068,7 +8068,7 @@ See KVM_EXIT_MEMORY_FAULT for more information.
 7.35 KVM_CAP_EXIT_ON_MISSING
 ----------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that userspace may set the
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 1a777715199f..d6fae31f7e1a 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -43,6 +43,7 @@ menuconfig KVM
 	select GUEST_PERF_EVENTS if PERF_EVENTS
 	select INTERVAL_TREE
 	select XARRAY_MULTI
+        select HAVE_KVM_EXIT_ON_MISSING
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 13066a6fdfff..3b9fb80672ac 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1486,13 +1486,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmap_read_unlock(current->mm);
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, false, NULL);
+				   write_fault, &writable, true, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
 	}
-	if (is_error_noslot_pfn(pfn))
+	if (is_error_noslot_pfn(pfn)) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
+					      write_fault, exec_fault, false);
 		return -EFAULT;
+	}
 
 	if (kvm_is_device_pfn(pfn)) {
 		/*
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (8 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Using the overall demand paging rate to measure performance can be
slightly misleading when vCPU accesses are not overlapped. Adding more
vCPUs will (usually) increase the overall demand paging rate even
if performance remains constant or even degrades on a per-vcpu basis. As
such, it makes sense to report both the total and per-vcpu paging rates.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 09c116a82a84..6dc823fa933a 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -135,6 +135,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
 	int i;
+	double vcpu_paging_rate;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -191,11 +192,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
-	pr_info("Total guest execution time: %ld.%.9lds\n",
+	pr_info("Total guest execution time:\t%ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
-	pr_info("Overall demand paging rate: %f pgs/sec\n",
-		memstress_args.vcpu_args[0].pages * nr_vcpus /
-		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));
+
+	vcpu_paging_rate =
+		memstress_args.vcpu_args[0].pages
+		/ ((double)ts_diff.tv_sec
+			+ (double)ts_diff.tv_nsec / NSEC_PER_SEC);
+	pr_info("Per-vcpu demand paging rate:\t%f pgs/sec/vcpu\n",
+		vcpu_paging_rate);
+	pr_info("Overall demand paging rate:\t%f pgs/sec\n",
+		vcpu_paging_rate * nr_vcpus);
 
 	memstress_destroy_vm(vm);
 
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (9 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because

    (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
        region, so that each uffd services a single vCPU thread.
    (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
        simply doesn't work: the test tries to register the same memory
        to multiple uffds, causing an error.

Add support for many vcpus per uffd by
    (1) Keeping "-u" behavior unchanged.
    (2) Making "-u -a" create a single uffd for all of guest memory.
    (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.

With potentially multiple vCPUs per UFFD, it makes sense to allow
configuring the number of reader threads per UFFD as well: add the "-r"
flag to do so.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 76 +++++++++++++---
 .../selftests/kvm/include/userfaultfd_util.h  | 17 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 87 +++++++++++++------
 4 files changed, 137 insertions(+), 47 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index 08a5ca5bed56..dad1fb338f36 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -375,14 +375,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 6dc823fa933a..f7897a951f90 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -77,8 +77,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppress any EEXISTs occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -89,8 +101,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppress any EEXISTs occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -110,7 +134,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -134,8 +160,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
 	double vcpu_paging_rate;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -148,7 +175,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		for (i = 0; i < nr_vcpus; i++) {
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		for (i = 0; i < num_uffds; i++) {
 			vcpu_args = &memstress_args.vcpu_args[i];
 			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
 				     vcpu_args->pages * memstress_args.guest_page_size);
@@ -156,9 +184,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	}
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
+			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 
 			vcpu_args = &memstress_args.vcpu_args[i];
@@ -171,7 +203,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -188,7 +221,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -214,15 +247,20 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
 	kvm_print_vcpu_pinning_help();
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+	       "     creating one for each region paged by a unique vCPU\n"
+	       "     Set implicitly with -o, and no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -241,12 +279,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:c:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -258,6 +298,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -278,6 +321,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+				    "Invalid number of readers per uffd %d: must be >=1",
+				    p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c34592..af83a437e74a 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,27 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
-struct uffd_desc {
+struct uffd_reader_args {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
+struct uffd_desc {
+	int uffd;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 271f63891581..6f220aa4fb08 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,28 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
-	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
-	uffd_desc->delay = delay;
-	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+			    i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+			       &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+			       i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +187,30 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0, "Pthread_join failed on reader %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (10 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
the polling mechanism to EPOLL + EPOLLEXCLUSIVE.

Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, ie one triggered by a
write to the exit pipe.

Performance samples generated by the command in [2] are given below.

Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1      249k      185k
2      201k      235k
4      186k      155k
16     150k      217k
32     89k       198k

[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 74 +++++++++----------
 2 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index f7897a951f90..0455347f932a 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 6f220aa4fb08..2a179133645a 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+		    "Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+		    "Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(r == 1,
+			    "Unexpected number of events (%d) from epoll, errno = %d",
+			    r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+				    "Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+				    "Error reading pipefd in uffd reader thread");
 			break;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+			    "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+				    "Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+			    "Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(r >= 0,
+			    "Reader thread handler fn returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (11 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2023-11-09 21:03 ` [PATCH v6 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
  2024-02-07 15:46 ` [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Sean Christopherson
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c       | 2 +-
 tools/testing/selftests/kvm/demand_paging_test.c              | 2 +-
 tools/testing/selftests/kvm/dirty_log_perf_test.c             | 2 +-
 tools/testing/selftests/kvm/include/memstress.h               | 2 +-
 tools/testing/selftests/kvm/lib/memstress.c                   | 4 ++--
 .../testing/selftests/kvm/memslot_modification_stress_test.c  | 2 +-
 .../selftests/kvm/x86_64/dirty_log_page_splitting_test.c      | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..b51656b408b8 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
 				 params->backing_src, !overlap_memory_access);
 
 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 0455347f932a..61bb2e23bef0 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -163,7 +163,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	double vcpu_paging_rate;
 	uint64_t uffd_region_size;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index d374dbcf9a53..8b1a84a4db3b 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -153,7 +153,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				 p->slots, p->backing_src,
+				 p->slots, 0, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
 	pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index ce4e603050ea..8be9609d3ca0 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
 extern struct memstress_args memstress_args;
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index d05487e5a371..e74b09f39769 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -123,7 +123,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
@@ -212,7 +212,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 
 		vm_userspace_mem_region_add(vm, backing_src, region_start,
 					    MEMSTRESS_MEM_SLOT_INDEX + i,
-					    region_pages, 0);
+					    region_pages, slot_flags);
 	}
 
 	/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811..0b19ec3ecc9c 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
index 634c6bfcd572..a770d7fa469a 100644
--- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
+++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
@@ -100,7 +100,7 @@ static void run_test(enum vm_guest_mode mode, void *unused)
 	struct kvm_page_stats stats_dirty_logging_disabled;
 	struct kvm_page_stats stats_repopulated;
 
-	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size,
+	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size, 0,
 				 SLOTS, backing_src, false);
 
 	guest_num_pages = (VCPUS * guest_percpu_mem_size) >> vm->page_shift;
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v6 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (12 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
@ 2023-11-09 21:03 ` Anish Moorthy
  2024-02-07 15:46 ` [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Sean Christopherson
  14 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:03 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	amoorthy, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Demonstrate a (very basic) scheme for supporting memory fault exits.

From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
   with the purpose of establishing the absent mappings. Do so with
   wake_waiters=false to avoid serializing on the userfaultfd wait queue
   locks.

2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
   assume that the mapping was already established but is currently
   absent [A] and attempt to populate it using MADV_POPULATE_WRITE.

Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.

A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
via a bitmap) to avoid calls destined to EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.

[A] In reality it is much likelier that the vCPU thread simply lost a
    race to establish the mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 245 +++++++++++++-----
 1 file changed, 173 insertions(+), 72 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 61bb2e23bef0..44bdcc7aad87 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <linux/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,36 +32,102 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of an annotated memory fault.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(r == 0,
+		    "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
+		    (uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(uffd_mode, uffd,
+					     (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
 	int vcpu_idx = vcpu_args->vcpu_idx;
 	struct kvm_run *run = vcpu->run;
-	struct timespec start;
-	struct timespec ts_diff;
+	struct timespec last_start;
+	struct timespec total_runtime = {};
 	int ret;
-
-	clock_gettime(CLOCK_MONOTONIC, &start);
-
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
+	u64 num_memory_fault_exits = 0;
+	bool annotated_memory_fault = false;
+
+	while (true) {
+		clock_gettime(CLOCK_MONOTONIC, &last_start);
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		annotated_memory_fault = errno == EFAULT
+					 && run->exit_reason == KVM_EXIT_MEMORY_FAULT;
+		TEST_ASSERT(ret == 0 || annotated_memory_fault,
+			    "vcpu_run failed: %d\n", ret);
+
+		total_runtime = timespec_add(total_runtime,
+					     timespec_elapsed(last_start));
+		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (annotated_memory_fault) {
+				++num_memory_fault_exits;
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+				    "Invalid guest sync status: exit_reason=%s\n",
+				    exit_reason_str(run->exit_reason));
+		}
+		break;
 	}
-
-	ts_diff = timespec_elapsed(start);
-	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds, %d memory fault exits\n",
+		       vcpu_idx, total_runtime.tv_sec, total_runtime.tv_nsec,
+		       num_memory_fault_exits);
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,16 +138,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = is_vcpu ? UFFDIO_COPY_MODE_DONTWAKE : 0;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppress any EEXISTs occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
@@ -88,23 +154,24 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = is_vcpu ? UFFDIO_CONTINUE_MODE_DONTWAKE : 0;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppress any EEXISTs occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
@@ -112,32 +179,54 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE failed with EEXIST, waiting threads
+	 * will not have been woken: wake them here.
+	 */
+	if (!is_vcpu && r != 0) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(r == 0,
+			    "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
+					      struct uffd_msg *msg)
+{
+	TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
+		    "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		    msg->event);
+	return handle_uffd_page_request(uffd_mode, uffd,
+					msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -155,16 +244,22 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct memstress_vcpu_args *vcpu_args;
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
+	int i;
 	double vcpu_paging_rate;
-	uint64_t uffd_region_size;
+	uint32_t slot_flags = 0;
+	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
-				 p->src_type, p->partition_vcpu_memory_access);
+	if (uffd_memfault_exits) {
+		TEST_ASSERT(kvm_has_cap(KVM_CAP_EXIT_ON_MISSING) > 0,
+					"KVM does not have KVM_CAP_EXIT_ON_MISSING");
+		slot_flags = KVM_MEM_EXIT_ON_MISSING;
+	}
+
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
@@ -173,21 +268,21 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		num_uffds = p->single_uffd ? 1 : nr_vcpus;
-		for (i = 0; i < num_uffds; i++) {
-			vcpu_args = &memstress_args.vcpu_args[i];
-			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
-				     vcpu_args->pages * memstress_args.guest_page_size);
-		}
-	}
-
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
+		if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+			for (i = 0; i < num_uffds; i++) {
+				vcpu_args = &memstress_args.vcpu_args[i];
+				prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
+					     uffd_region_size);
+			}
+		}
+
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate uffd descriptors");
+
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
@@ -201,10 +296,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -218,7 +313,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -239,7 +334,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -248,7 +343,8 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
+		   "          [-s type] [-v vcpus] [-c cpu_list] [-o] [-w] \n",
+	       name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -260,6 +356,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -280,29 +377,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -328,6 +426,9 @@ int main(int argc, char *argv[])
 				    "Invalid number of readers per uffd %d: must be >=1",
 				    p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -335,7 +436,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2023-11-09 21:03 ` [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-11-09 21:07   ` Anish Moorthy
  2024-02-07 15:39     ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2023-11-09 21:07 UTC (permalink / raw)
  To: seanjc, kvm, kvmarm
  Cc: oliver.upton, pbonzini, maz, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Nov 9, 2023 at 1:03 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> TODO: Changelog -- and possibly just merge into the "god" arm commit?

*Facepalm*

Well as you can tell, I wasn't sure if there was anything to actually
put in the long-form log. Lmk if you have suggestions

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
  2023-11-09 21:03 ` [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler Anish Moorthy
@ 2024-01-30 23:58   ` James Houghton
  2024-01-31 22:38     ` Anish Moorthy
  0 siblings, 1 reply; 44+ messages in thread
From: James Houghton @ 2024-01-30 23:58 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Hi Anish,

Sorry to get back to you so late. :) I was hoping others would provide
more feedback, but I have a little bit to give anyway. Overall the
series looks good to me.

On Thu, Nov 9, 2023 at 1:03 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Prevent the stage-2 fault handler from faulting in pages when
> KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> calls to check the memslot flag.
>
> To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
> when the stage-2 handler cannot resolve the pfn for a fault. With
> KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
> faults as vCPU exits, which userspace can attempt to resolve without
> terminating the guest.
>
> Delivering stage-2 faults to userspace in this way sidesteps the
> significant scalabiliy issues associated with using userfaultfd for the
> same purpose.
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 2 +-
>  arch/arm64/kvm/Kconfig         | 1 +
>  arch/arm64/kvm/mmu.c           | 7 +++++--
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index fd87bbfbfdf2..67fcb9dbe855 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8068,7 +8068,7 @@ See KVM_EXIT_MEMORY_FAULT for more information.
>  7.35 KVM_CAP_EXIT_ON_MISSING
>  ----------------------------
>
> -:Architectures: x86
> +:Architectures: x86, arm64
>  :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
>
>  The presence of this capability indicates that userspace may set the
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 1a777715199f..d6fae31f7e1a 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -43,6 +43,7 @@ menuconfig KVM
>         select GUEST_PERF_EVENTS if PERF_EVENTS
>         select INTERVAL_TREE
>         select XARRAY_MULTI
> +        select HAVE_KVM_EXIT_ON_MISSING
>         help
>           Support hosting virtualized guest machines.
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 13066a6fdfff..3b9fb80672ac 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1486,13 +1486,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         mmap_read_unlock(current->mm);
>
>         pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> -                                  write_fault, &writable, false, NULL);
> +                                  write_fault, &writable, true, NULL);
>         if (pfn == KVM_PFN_ERR_HWPOISON) {
>                 kvm_send_hwpoison_signal(hva, vma_shift);
>                 return 0;
>         }
> -       if (is_error_noslot_pfn(pfn))
> +       if (is_error_noslot_pfn(pfn)) {
> +               kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> +                                             write_fault, exec_fault, false);

I think that either (1) we move this kvm_prepare_memory_fault_exit
logic into the previous patch[1], or (2) we merge this patch with the
previous one. IIUC, we can only advertise KVM_CAP_MEMORY_FAULT_INFO on
arm64 if this logic is present.

As for the changelog in the previous patch[1], if you leave it
unmerged with this one, something like "Enable
KVM_CAP_MEMORY_FAULT_INFO to make KVM_CAP_EXIT_ON_MISSING useful, as
without it, userspace doesn't know which page(s) of memory it needs to
fix" works for me.

Also, I think we need to update the documentation for
KVM_CAP_MEMORY_FAULT_INFO to say that it is available for arm64 now
(just like you have done for KVM_CAP_EXIT_ON_MISSING).

Thanks!

[1]: https://lore.kernel.org/kvm/20231109210325.3806151-9-amoorthy@google.com/

>                 return -EFAULT;
> +       }
>
>         if (kvm_is_device_pfn(pfn)) {
>                 /*
> --
> 2.42.0.869.gea05f2083d-goog
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2023-11-09 21:03 ` [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
@ 2024-01-31  0:25   ` James Houghton
  2024-01-31 21:59     ` Anish Moorthy
  0 siblings, 1 reply; 44+ messages in thread
From: James Houghton @ 2024-01-31  0:25 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Nov 9, 2023 at 1:03 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Allowing KVM to fault in pages during vcpu-context guest memory accesses
> can be undesirable: during userfaultfd-based postcopy, it can cause
> significant performance issues due to vCPUs contending for
> userfaultfd-internal locks.
>
> Add a new memslot flag (KVM_MEM_EXIT_ON_MISSING) through which userspace
> can indicate that KVM_RUN should exit instead of faulting in pages
> during vcpu-context guest memory accesses. The unfaulted pages are
> reported by the accompanying KVM_EXIT_MEMORY_FAULT_INFO, allowing
> userspace to determine and take appropriate action.
>
> The basic implementation strategy is to check the memslot flag from
> within __gfn_to_pfn_memslot() and override the caller-provided arguments
> accordingly. Some callers (such as kvm_vcpu_map()) must be able to opt
> out of this behavior, and do so by passing can_exit_on_missing=false.
>
> No functional change intended: nothing sets KVM_MEM_EXIT_ON_MISSING or
> passes can_exit_on_missing=true to __gfn_to_pfn_memslot().
>
> Suggested-by: James Houghton <jthoughton@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>

Feel free to add:

Reviewed-by: James Houghton <jthoughton@google.com>

> ---
>  Documentation/virt/kvm/api.rst         | 28 +++++++++++++++++++++++---
>  arch/arm64/kvm/mmu.c                   |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
>  arch/x86/kvm/mmu/mmu.c                 |  4 ++--
>  include/linux/kvm_host.h               | 12 ++++++++++-
>  include/uapi/linux/kvm.h               |  2 ++
>  virt/kvm/Kconfig                       |  3 +++
>  virt/kvm/kvm_main.c                    | 25 ++++++++++++++++++-----
>  9 files changed, 66 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a07964f601de..1457865f6e98 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1365,6 +1365,8 @@ yet and must be cleared on entry.
>    /* for kvm_userspace_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
>
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1395,12 +1397,16 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> +The flags field supports four flags
> +
> +1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>  writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> +use it.
> +2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
>  to make a new slot read-only.  In this case, writes to this memory will be
>  posted to userspace as KVM_EXIT_MMIO exits.
> +3.  KVM_MEM_GUEST_MEMFD

If we include KVM_MEM_GUEST_MEMFD here, we should point the reader to
KVM_SET_USER_MEMORY_REGION2 and explain that using
KVM_SET_USER_MEMORY_REGION with this flag will always fail.

> +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
>
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> @@ -8059,6 +8065,22 @@ error/annotated fault.
>
>  See KVM_EXIT_MEMORY_FAULT for more information.
>
> +7.35 KVM_CAP_EXIT_ON_MISSING
> +----------------------------
> +
> +:Architectures: None
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that userspace may set the
> +KVM_MEM_EXIT_ON_MISSING flag on memslots. Said flag will cause KVM_RUN to fail
> +(-EFAULT) in response to guest-context memory accesses which would require KVM
> +to page fault on the userspace mapping.
> +
> +The range of guest physical memory causing the fault is advertised to userspace
> +through KVM_CAP_MEMORY_FAULT_INFO. Userspace should take appropriate action.
> +This could mean, for instance, checking that the fault is resolvable, faulting
> +in the relevant userspace mapping, then retrying KVM_RUN.
> +
>  8. Other capabilities.
>  ======================
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 4e41ceed5468..13066a6fdfff 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1486,7 +1486,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         mmap_read_unlock(current->mm);
>
>         pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> -                                  write_fault, &writable, NULL);
> +                                  write_fault, &writable, false, NULL);
>         if (pfn == KVM_PFN_ERR_HWPOISON) {
>                 kvm_send_hwpoison_signal(hva, vma_shift);
>                 return 0;
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index efd0ebf70a5e..2ce0e1d3f597 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -613,7 +613,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
>         } else {
>                 /* Call KVM generic code to do the slow-path check */
>                 pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> -                                          writing, &write_ok, NULL);
> +                                          writing, &write_ok, false, NULL);
>                 if (is_error_noslot_pfn(pfn))
>                         return -EFAULT;
>                 page = NULL;
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 572707858d65..9d40ca02747f 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -847,7 +847,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
>
>                 /* Call KVM generic code to do the slow-path check */
>                 pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> -                                          writing, upgrade_p, NULL);
> +                                          writing, upgrade_p, false, NULL);
>                 if (is_error_noslot_pfn(pfn))
>                         return -EFAULT;
>                 page = NULL;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4de7670d5976..b1e5e42bdeb4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4375,7 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>         async = false;
>         fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
>                                           fault->write, &fault->map_writable,
> -                                         &fault->hva);
> +                                         false, &fault->hva);
>         if (!async)
>                 return RET_PF_CONTINUE; /* *pfn has correct page already */
>
> @@ -4397,7 +4397,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>          */
>         fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
>                                           fault->write, &fault->map_writable,
> -                                         &fault->hva);
> +                                         false, &fault->hva);
>         return RET_PF_CONTINUE;
>  }
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 5201400358da..e8e30088289e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1219,7 +1219,8 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
>  kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                bool atomic, bool interruptible, bool *async,
> -                              bool write_fault, bool *writable, hva_t *hva);
> +                              bool write_fault, bool *writable,
> +                              bool can_exit_on_missing, hva_t *hva);
>
>  void kvm_release_pfn_clean(kvm_pfn_t pfn);
>  void kvm_release_pfn_dirty(kvm_pfn_t pfn);
> @@ -2423,4 +2424,13 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +/*
> + * Whether vCPUs should exit upon trying to access memory for which the
> + * userspace mappings are missing.
> + */
> +static inline bool kvm_is_slot_exit_on_missing(const struct kvm_memory_slot *slot)
> +{
> +       return slot && slot->flags & KVM_MEM_EXIT_ON_MISSING;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bda5622a9c68..18546cbada61 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -116,6 +116,7 @@ struct kvm_userspace_memory_region2 {
>  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
>  #define KVM_MEM_READONLY       (1UL << 1)
>  #define KVM_MEM_GUEST_MEMFD    (1UL << 2)
> +#define KVM_MEM_EXIT_ON_MISSING        (1UL << 3)
>
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> @@ -1231,6 +1232,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_MEMORY_ATTRIBUTES 233
>  #define KVM_CAP_GUEST_MEMFD 234
>  #define KVM_CAP_VM_TYPES 235
> +#define KVM_CAP_EXIT_ON_MISSING 236
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 2c964586aa14..241f524a4e9d 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,3 +109,6 @@ config KVM_GENERIC_PRIVATE_MEM
>         select KVM_GENERIC_MEMORY_ATTRIBUTES
>         select KVM_PRIVATE_MEM
>         bool
> +
> +config HAVE_KVM_EXIT_ON_MISSING
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 725191333c4e..faaccdba179c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1614,7 +1614,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>   * only allows these.
>   */
>  #define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
> -       (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
> +       (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | KVM_MEM_EXIT_ON_MISSING)
>
>  static int check_memory_region_flags(struct kvm *kvm,
>                                      const struct kvm_userspace_memory_region2 *mem)
> @@ -1632,6 +1632,9 @@ static int check_memory_region_flags(struct kvm *kvm,
>         valid_flags |= KVM_MEM_READONLY;
>  #endif
>
> +       if (IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING))
> +               valid_flags |= KVM_MEM_EXIT_ON_MISSING;
> +
>         if (mem->flags & ~valid_flags)
>                 return -EINVAL;
>
> @@ -3047,7 +3050,8 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
>  kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                bool atomic, bool interruptible, bool *async,
> -                              bool write_fault, bool *writable, hva_t *hva)
> +                              bool write_fault, bool *writable,
> +                              bool can_exit_on_missing, hva_t *hva)
>  {
>         unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
>
> @@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                 writable = NULL;
>         }
>
> +       if (!atomic && can_exit_on_missing
> +           && kvm_is_slot_exit_on_missing(slot)) {
> +               atomic = true;
> +               if (async) {
> +                       *async = false;
> +                       async = NULL;
> +               }
> +       }
> +

Perhaps we should have a comment for this? Maybe something like: "If
we want to exit-on-missing, we want to bail out if fast GUP fails, and
we do not want to go into slow GUP. Setting atomic=true does exactly
this."

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-01-31  0:25   ` James Houghton
@ 2024-01-31 21:59     ` Anish Moorthy
  2024-02-01  0:26       ` James Houghton
  2024-02-01 16:09       ` Sean Christopherson
  0 siblings, 2 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-01-31 21:59 UTC (permalink / raw)
  To: James Houghton
  Cc: seanjc, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Tue, Jan 30, 2024 at 4:26 PM James Houghton <jthoughton@google.com> wrote:
>
> Feel free to add:
>
> Reviewed-by: James Houghton <jthoughton@google.com>

> If we include KVM_MEM_GUEST_MEMFD here, we should point the reader to
> KVM_SET_USER_MEMORY_REGION2 and explain that using
> KVM_SET_USER_MEMORY_REGION with this flag will always fail.

Done and done (I've split the guest memfd doc update off into its own
commit too).

> > @@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> >                 writable = NULL;
> >         }
> >
> > +       if (!atomic && can_exit_on_missing
> > +           && kvm_is_slot_exit_on_missing(slot)) {
> > +               atomic = true;
> > +               if (async) {
> > +                       *async = false;
> > +                       async = NULL;
> > +               }
> > +       }
> > +
>
> Perhaps we should have a comment for this? Maybe something like: "If
> we want to exit-on-missing, we want to bail out if fast GUP fails, and
> we do not want to go into slow GUP. Setting atomic=true does exactly
> this."

I was going to push back on the use of "we" but I see that it's all
over kvm_main.c :).

I agree that a comment would be good, but isn't the "fast GUP only iff
atomic=true" statement a tautology? That's an actual question, my
memory's fuzzy. What about

> When the slot is exit-on-missing (and when we should respect that)
> set atomic=true to prevent GUP from faulting in the userspace
> mappings.

instead?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
  2024-01-30 23:58   ` James Houghton
@ 2024-01-31 22:38     ` Anish Moorthy
  2024-02-09  1:21       ` Anish Moorthy
  0 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2024-01-31 22:38 UTC (permalink / raw)
  To: James Houghton
  Cc: seanjc, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Tue, Jan 30, 2024 at 3:58 PM James Houghton <jthoughton@google.com> wrote:
>
> Hi Anish,
>
> Sorry to get back to you so late. :) I was hoping others would provide
> more feedback, but I have a little bit to give anyway. Overall the
> series looks good to me.

Thanks James. I'm just happy to have some review :D

> On Thu, Nov 9, 2023 at 1:03 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > Prevent the stage-2 fault handler from faulting in pages when
> > KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> > calls to check the memslot flag.
> >
> > To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
> > when the stage-2 handler cannot resolve the pfn for a fault. With
> > KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
> > faults as vCPU exits, which userspace can attempt to resolve without
> > terminating the guest.
> >
> > Delivering stage-2 faults to userspace in this way sidesteps the
> > significant scalabiliy issues associated with using userfaultfd for the
> > same purpose.
> >
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 2 +-
> >  arch/arm64/kvm/Kconfig         | 1 +
> >  arch/arm64/kvm/mmu.c           | 7 +++++--
> >  3 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index fd87bbfbfdf2..67fcb9dbe855 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -8068,7 +8068,7 @@ See KVM_EXIT_MEMORY_FAULT for more information.
> >  7.35 KVM_CAP_EXIT_ON_MISSING
> >  ----------------------------
> >
> > -:Architectures: x86
> > +:Architectures: x86, arm64
> >  :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> >
> >  The presence of this capability indicates that userspace may set the
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 1a777715199f..d6fae31f7e1a 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -43,6 +43,7 @@ menuconfig KVM
> >         select GUEST_PERF_EVENTS if PERF_EVENTS
> >         select INTERVAL_TREE
> >         select XARRAY_MULTI
> > +        select HAVE_KVM_EXIT_ON_MISSING
> >         help
> >           Support hosting virtualized guest machines.
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 13066a6fdfff..3b9fb80672ac 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1486,13 +1486,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >         mmap_read_unlock(current->mm);
> >
> >         pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> > -                                  write_fault, &writable, false, NULL);
> > +                                  write_fault, &writable, true, NULL);
> >         if (pfn == KVM_PFN_ERR_HWPOISON) {
> >                 kvm_send_hwpoison_signal(hva, vma_shift);
> >                 return 0;
> >         }
> > -       if (is_error_noslot_pfn(pfn))
> > +       if (is_error_noslot_pfn(pfn)) {
> > +               kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > +                                             write_fault, exec_fault, false);
>
> I think that either (1) we move this kvm_prepare_memory_fault_exit
> logic into the previous patch[1], or (2) we merge this patch with the
> previous one. IIUC, we can only advertise KVM_CAP_MEMORY_FAULT_INFO on
> arm64 if this logic is present.

Actually (sorry, about-face from our off-list chat), *does* it make
sense to merge these two patches? arm64 could benefit from annotated
EFAULTs even without KVM_CAP_EXIT_ON_MISSING: for instance if there
were spots outside the stage-2 handler if EFAULTs were annotated [a].
And if this patch was for some reason reverted in the future, then we
probably wouldn't want that to entail silencing any other
KVM_EXIT_MEMORY_FAULTs that arm64 might also get.

Sean did also ask me to merge some patches back on v5 [b], but I think
the point there was to enable KVM_CAP_EXIT_ON_MISSING at the same time
as adding the stage-2 annotation, which I'm doing here.

[a] Theoretically, anyways. Atm only x86 code uses
kvm_prepare_memory_fault_exit()
[b] https://lore.kernel.org/kvm/ZR4WzE1JOvq_0dhE@google.com/

> As for the changelog in the previous patch[1], if you leave it
> unmerged with this one, something like "Enable
> KVM_CAP_MEMORY_FAULT_INFO to make KVM_CAP_EXIT_ON_MISSING useful, as
> without it, userspace doesn't know which page(s) of memory it needs to
> fix" works for me.

For that previous patch, I updated the description to the following

> Advertise to arm64 userspaces that KVM_RUN may return annotated EFAULTs
> (see KVM_EXIT_MEMORY_FAULT). In fact KVM_RUN might already be annotating
> some EFAULTs, but this capability is necessary for userspace to know
> that.

Which I think makes more sense than sort of forward-declaring
KVM_CAP_EXIT_ON_MISSING on arm64.

On Tue, Jan 30, 2024 at 3:58 PM James Houghton <jthoughton@google.com> wrote:
>
> Also, I think we need to update the documentation for
> KVM_CAP_MEMORY_FAULT_INFO to say that it is available for arm64 now
> (just like you have done for KVM_CAP_EXIT_ON_MISSING).

Done, ty

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-01-31 21:59     ` Anish Moorthy
@ 2024-02-01  0:26       ` James Houghton
  2024-02-01  1:19         ` Oliver Upton
  2024-02-01 16:09       ` Sean Christopherson
  1 sibling, 1 reply; 44+ messages in thread
From: James Houghton @ 2024-02-01  0:26 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Jan 31, 2024 at 2:00 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Tue, Jan 30, 2024 at 4:26 PM James Houghton <jthoughton@google.com> wrote:
> > > @@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > >                 writable = NULL;
> > >         }
> > >
> > > +       if (!atomic && can_exit_on_missing
> > > +           && kvm_is_slot_exit_on_missing(slot)) {
> > > +               atomic = true;
> > > +               if (async) {
> > > +                       *async = false;
> > > +                       async = NULL;
> > > +               }
> > > +       }
> > > +
> >
> > Perhaps we should have a comment for this? Maybe something like: "If
> > we want to exit-on-missing, we want to bail out if fast GUP fails, and
> > we do not want to go into slow GUP. Setting atomic=true does exactly
> > this."
>
> I was going to push back on the use of "we" but I see that it's all
> over kvm_main.c :).
>
> I agree that a comment would be good, but isn't the "fast GUP only iff
> atomic=true" statement a tautology? That's an actual question, my
> memory's fuzzy.
>
> What about
>
> > When the slot is exit-on-missing (and when we should respect that)
> > set atomic=true to prevent GUP from faulting in the userspace
> > mappings.
>
> instead?

This is much better than what I wrote, thanks! We merely want GUP not
to fault the page in; we don't actually care about fast GUP vs. slow
GUP.

On that note, I think we need to drop the patch that causes
read-faults in RO memslots to go through fast GUP. KVM didn't do that
for a good reason[1].

That would break KVM_EXIT_ON_MISSING for RO memslots, so I think that
the right way to implement KVM_EXIT_ON_MISSING is to have
hva_to_pfn_slow pass FOLL_NOFAULT, at least for the RO memslot case.
We still get the win we're looking for: don't grab the userfaultfd
locks.

[1]: Commit 17839856fd5 ("gup: document and work around "COW can break
either way" issue")

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01  0:26       ` James Houghton
@ 2024-02-01  1:19         ` Oliver Upton
  2024-02-01 16:28           ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2024-02-01  1:19 UTC (permalink / raw)
  To: James Houghton
  Cc: Anish Moorthy, seanjc, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Jan 31, 2024 at 04:26:08PM -0800, James Houghton wrote:
> On Wed, Jan 31, 2024 at 2:00 PM Anish Moorthy <amoorthy@google.com> wrote:

[...]

> On that note, I think we need to drop the patch that causes
> read-faults in RO memslots to go through fast GUP. KVM didn't do that
> for a good reason[1].
> 
> That would break KVM_EXIT_ON_MISSING for RO memslots, so I think that
> the right way to implement KVM_EXIT_ON_MISSING is to have
> hva_to_pfn_slow pass FOLL_NOFAULT, at least for the RO memslot case.
> We still get the win we're looking for: don't grab the userfaultfd
> locks.

Is there even a legitimate use case that warrants optimizing faults on
RO memslots? My expectation is that the VMM uses these to back things
like guest ROM, or at least that's what QEMU does. In that case I'd
expect faults to be exceedingly rare, and if the VMM actually cared it
could just pre-populate the primary mapping.

I understand why we would want to support KVM_EXIT_ON_MISSING for the
sake of completeness of the UAPI, but it'd be surprising if this
mattered in the context of post-copy.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-01-31 21:59     ` Anish Moorthy
  2024-02-01  0:26       ` James Houghton
@ 2024-02-01 16:09       ` Sean Christopherson
  2024-02-01 19:53         ` Anish Moorthy
  1 sibling, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2024-02-01 16:09 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: James Houghton, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Jan 31, 2024, Anish Moorthy wrote:
> On Tue, Jan 30, 2024 at 4:26 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Feel free to add:
> >
> > Reviewed-by: James Houghton <jthoughton@google.com>
> 
> > If we include KVM_MEM_GUEST_MEMFD here, we should point the reader to
> > KVM_SET_USER_MEMORY_REGION2 and explain that using
> > KVM_SET_USER_MEMORY_REGION with this flag will always fail.
> 
> Done and done (I've split the guest memfd doc update off into its own
> commit too).
> 
> > > @@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > >                 writable = NULL;
> > >         }
> > >
> > > +       if (!atomic && can_exit_on_missing
> > > +           && kvm_is_slot_exit_on_missing(slot)) {

Operators go on the preceding line:

	if (!atomic && can_exit_on_missing &&
	    kvm_is_slot_exit_on_missing(slot))

> > > +               atomic = true;
> > > +               if (async) {
> > > +                       *async = false;
> > > +                       async = NULL;
> > > +               }
> > > +       }
> > > +
> >
> > Perhaps we should have a comment for this? Maybe something like: "If
> > we want to exit-on-missing, we want to bail out if fast GUP fails, and
> > we do not want to go into slow GUP. Setting atomic=true does exactly
> > this."
> 
> I was going to push back on the use of "we" but I see that it's all
> over kvm_main.c :).

Ignore the ancient art, push back.  The KVM code base is 15 years old at this
point.  A _lot_ of has changed in 15 years.  The fact that KVM has existing code
that's in violation of current standards doesn't justify ignoring the standards
when adding new code.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01  1:19         ` Oliver Upton
@ 2024-02-01 16:28           ` Sean Christopherson
  2024-02-01 19:24             ` Anish Moorthy
  2024-02-02  1:01             ` Oliver Upton
  0 siblings, 2 replies; 44+ messages in thread
From: Sean Christopherson @ 2024-02-01 16:28 UTC (permalink / raw)
  To: Oliver Upton
  Cc: James Houghton, Anish Moorthy, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Feb 01, 2024, Oliver Upton wrote:
> On Wed, Jan 31, 2024 at 04:26:08PM -0800, James Houghton wrote:
> > On Wed, Jan 31, 2024 at 2:00 PM Anish Moorthy <amoorthy@google.com> wrote:
> 
> [...]
> 
> > On that note, I think we need to drop the patch that causes
> > read-faults in RO memslots to go through fast GUP. KVM didn't do that
> > for a good reason[1].
> > 
> > That would break KVM_EXIT_ON_MISSING for RO memslots, so I think that
> > the right way to implement KVM_EXIT_ON_MISSING is to have
> > hva_to_pfn_slow pass FOLL_NOFAULT, at least for the RO memslot case.
> > We still get the win we're looking for: don't grab the userfaultfd
> > locks.
> 
> Is there even a legitimate use case that warrants optimizing faults on
> RO memslots? My expectation is that the VMM uses these to back things
> like guest ROM, or at least that's what QEMU does. In that case I'd
> expect faults to be exceedingly rare, and if the VMM actually cared it
> could just pre-populate the primary mapping.
> 
> I understand why we would want to support KVM_EXIT_ON_MISSING for the
> sake of completeness of the UAPI, but it'd be surprising if this
> mattered in the context of post-copy.

Yeah, I let's just make KVM_EXIT_ON_MISSING mutually exclusive with
KVM_MEM_READONLY.  KVM (oviously) honors the primary MMU protections, so userspace
can (and does) map read-only memory into the guest without READONLY.  As Oliver
pointed out, making the *memslot* RO is intended for use cases where userspace
wants writes to be treated like emulated MMIO.

We can always add support in the future in the extremely unlikely event someone
comes along with a legitimate reason for KVM_EXIT_ON_MISSING to play nice with
KVM_MEM_READONLY.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01 16:28           ` Sean Christopherson
@ 2024-02-01 19:24             ` Anish Moorthy
  2024-02-02  1:03               ` Oliver Upton
  2024-02-02  1:01             ` Oliver Upton
  1 sibling, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2024-02-01 19:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, James Houghton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Jan 31, 2024 at 4:26 PM James Houghton <jthoughton@google.com> wrote:
>
> On that note, I think we need to drop the patch that causes
> read-faults in RO memslots to go through fast GUP. KVM didn't do that
> for a good reason[1].

Thanks a ton for catching this James! That's some informative reading,
and a good reminder to be extra careful messing with stuff I don't
quite understand.

On Thu, Feb 1, 2024 at 8:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Feb 01, 2024, Oliver Upton wrote:
> > On Wed, Jan 31, 2024 at 04:26:08PM -0800, James Houghton wrote:
> > > On Wed, Jan 31, 2024 at 2:00 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > [...]
> >
> > > On that note, I think we need to drop the patch that causes
> > > read-faults in RO memslots to go through fast GUP. KVM didn't do that
> > > for a good reason[1].
> > >
> > > That would break KVM_EXIT_ON_MISSING for RO memslots, so I think that
> > > the right way to implement KVM_EXIT_ON_MISSING is to have
> > > hva_to_pfn_slow pass FOLL_NOFAULT, at least for the RO memslot case.
> > > We still get the win we're looking for: don't grab the userfaultfd
> > > locks.
> >
> > Is there even a legitimate use case that warrants optimizing faults on
> > RO memslots? My expectation is that the VMM uses these to back things
> > like guest ROM, or at least that's what QEMU does. In that case I'd
> > expect faults to be exceedingly rare, and if the VMM actually cared it
> > could just pre-populate the primary mapping.
> >
> > I understand why we would want to support KVM_EXIT_ON_MISSING for the
> > sake of completeness of the UAPI, but it'd be surprising if this
> > mattered in the context of post-copy.
>
> Yeah, I let's just make KVM_EXIT_ON_MISSING mutually exclusive with
> KVM_MEM_READONLY.  KVM (oviously) honors the primary MMU protections, so userspace
> can (and does) map read-only memory into the guest without READONLY.  As Oliver
> pointed out, making the *memslot* RO is intended for use cases where userspace
> wants writes to be treated like emulated MMIO.
>
> We can always add support in the future in the extremely unlikely event someone
> comes along with a legitimate reason for KVM_EXIT_ON_MISSING to play nice with
> KVM_MEM_READONLY.

Gotcha, I'll go ahead and make the flags incompatible for the next
version. Thanks for the tidbit on how RO memslots are used Oliver- I
didn't know that we expect faults on these to be so rare.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01 16:09       ` Sean Christopherson
@ 2024-02-01 19:53         ` Anish Moorthy
  2024-02-07 15:35           ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Anish Moorthy @ 2024-02-01 19:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Feb 1, 2024 at 8:09 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jan 31, 2024, Anish Moorthy wrote:
> > On Tue, Jan 30, 2024 at 4:26 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > Feel free to add:
> > >
> > > Reviewed-by: James Houghton <jthoughton@google.com>
> >
> > > If we include KVM_MEM_GUEST_MEMFD here, we should point the reader to
> > > KVM_SET_USER_MEMORY_REGION2 and explain that using
> > > KVM_SET_USER_MEMORY_REGION with this flag will always fail.
> >
> > Done and done (I've split the guest memfd doc update off into its own
> > commit too).
> >
> > > > @@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > >                 writable = NULL;
> > > >         }
> > > >
> > > > +       if (!atomic && can_exit_on_missing
> > > > +           && kvm_is_slot_exit_on_missing(slot)) {
>
> Operators go on the preceding line:

Thanks. On a side note, is this actually documented anywhere? I
searched coding-style.rst but couldn't find it.

> > > Perhaps we should have a comment for this? Maybe something like: "If
> > > we want to exit-on-missing, we want to bail out if fast GUP fails, and
> > > we do not want to go into slow GUP. Setting atomic=true does exactly
> > > this."
> >
> > I was going to push back on the use of "we" but I see that it's all
> > over kvm_main.c :).
>
> Ignore the ancient art, push back.  The KVM code base is 15 years old at this
> point.  A _lot_ of has changed in 15 years.  The fact that KVM has existing code
> that's in violation of current standards doesn't justify ignoring the standards
> when adding new code.

Well, actually the idea to push back here was more mechanical- I'm a
fan of "we" in comments myself. Somehow I got the idea that it's
discouraged, but that might just be leakage from comments on my past
commit messages.

Again I don't see anything in coding-style.rst, but I wonder if I'm
missing something.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01 16:28           ` Sean Christopherson
  2024-02-01 19:24             ` Anish Moorthy
@ 2024-02-02  1:01             ` Oliver Upton
  1 sibling, 0 replies; 44+ messages in thread
From: Oliver Upton @ 2024-02-02  1:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Anish Moorthy, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Feb 01, 2024 at 08:28:19AM -0800, Sean Christopherson wrote:

[...]

> Yeah, I let's just make KVM_EXIT_ON_MISSING mutually exclusive with
> KVM_MEM_READONLY.  KVM (oviously) honors the primary MMU protections, so userspace
> can (and does) map read-only memory into the guest without READONLY.  As Oliver
> pointed out, making the *memslot* RO is intended for use cases where userspace
> wants writes to be treated like emulated MMIO.

Well, it was clear enough to me what open source VMMs are doing, I was
curious if Google's VMM is doing something strange with RO memslots that
made the optimization desirable. But it sounds like we're all in
agreement that RO memslots aren't consequential for post-copy.

> We can always add support in the future in the extremely unlikely event someone
> comes along with a legitimate reason for KVM_EXIT_ON_MISSING to play nice with
> KVM_MEM_READONLY.

+1, the less stupid things we let userspace do the better.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01 19:24             ` Anish Moorthy
@ 2024-02-02  1:03               ` Oliver Upton
  0 siblings, 0 replies; 44+ messages in thread
From: Oliver Upton @ 2024-02-02  1:03 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, James Houghton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Feb 01, 2024 at 11:24:54AM -0800, Anish Moorthy wrote:

[...]

> Gotcha, I'll go ahead and make the flags incompatible for the next
> version. Thanks for the tidbit on how RO memslots are used Oliver- I
> didn't know that we expect faults on these to be so rare.

You should definitely go and check your userspace to make sure that is
actually the case, but at least for open source VMMs that's the case :)

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 05/14] KVM: Try using fast GUP to resolve read faults
  2023-11-09 21:03 ` [PATCH v6 05/14] KVM: Try using fast GUP to resolve read faults Anish Moorthy
@ 2024-02-05 21:55   ` Anish Moorthy
  0 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-05 21:55 UTC (permalink / raw)
  To: kvm, kvmarm

Dropped as per https://lore.kernel.org/kvm/CADrL8HXSzm_C9UwUb8-H_c6-TRgpkKLE+qeXfyN-X_rHGj2vuw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  2023-11-09 21:03 ` [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
@ 2024-02-07 15:26   ` Sean Christopherson
  2024-02-07 18:44     ` Anish Moorthy
  0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 15:26 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: kvm, kvmarm, oliver.upton, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

This is not a Documentation change.  The comment might make its way to generated
docs, but this is not Documentation/ and I most definitely did not expect a change
to kvm_main.c based on the scope.

On Thu, Nov 09, 2023, Anish Moorthy wrote:
> The current docstring can be read as "atomic -> allowed to sleep," when
> in fact the intended statement is "atomic -> NOT allowed to sleep." Make
> that clearer in the docstring.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9170a61ea99f..687374138cfd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2983,7 +2983,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  /*
>   * Pin guest page in memory and return its pfn.
>   * @addr: host virtual address which maps memory to the guest
> - * @atomic: whether this function can sleep
> + * @atomic: whether this function is forbidden from sleeping
>   * @interruptible: whether the process can be interrupted by non-fatal signals
>   * @async: whether this function need to wait IO complete if the
>   *         host page is not in the memory
> -- 
> 2.42.0.869.gea05f2083d-goog
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 02/14] KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
  2023-11-09 21:03 ` [PATCH v6 02/14] KVM: Documentation: Add docstrings for __kvm_read/write_guest_page() Anish Moorthy
@ 2024-02-07 15:30   ` Sean Christopherson
  0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 15:30 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: kvm, kvmarm, oliver.upton, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Same "Documentation" issue.  And the purpose of the change isn't to add docstrings,
the purpose is to clarify the goofy/strange parameters:

  KVM: Add comment to clarify param usage for __kvm{read,_write}_guest_page()

On Thu, Nov 09, 2023, Anish Moorthy wrote:
> The (gfn, data, offset, len) order of parameters is a little strange
> since "offset" applies to "gfn" rather than to "data". Add docstrings to

s/docstrings/function comments/.  This change has value and impact even if the
kernel suddenly stops generating documentation from code.

> make things perfectly clear.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 687374138cfd..f521b6fd808f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3328,6 +3328,7 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> +/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  				 void *data, int offset, int len)
>  {
> @@ -3429,6 +3430,7 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
> +/* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
>  static int __kvm_write_guest_page(struct kvm *kvm,
>  				  struct kvm_memory_slot *memslot, gfn_t gfn,
>  			          const void *data, int offset, int len)
> -- 
> 2.42.0.869.gea05f2083d-goog
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
  2023-11-09 21:03 ` [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
@ 2024-02-07 15:30   ` Sean Christopherson
  2024-02-07 18:57     ` Anish Moorthy
  0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 15:30 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: kvm, kvmarm, oliver.upton, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Nov 09, 2023, Anish Moorthy wrote:

Needs a changelog.

> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-01 19:53         ` Anish Moorthy
@ 2024-02-07 15:35           ` Sean Christopherson
  0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 15:35 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: James Houghton, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Feb 01, 2024, Anish Moorthy wrote:
> On Thu, Feb 1, 2024 at 8:09 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Jan 31, 2024, Anish Moorthy wrote:
> > > On Tue, Jan 30, 2024 at 4:26 PM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > Feel free to add:
> > > >
> > > > Reviewed-by: James Houghton <jthoughton@google.com>
> > >
> > > > If we include KVM_MEM_GUEST_MEMFD here, we should point the reader to
> > > > KVM_SET_USER_MEMORY_REGION2 and explain that using
> > > > KVM_SET_USER_MEMORY_REGION with this flag will always fail.
> > >
> > > Done and done (I've split the guest memfd doc update off into its own
> > > commit too).
> > >
> > > > > @@ -3070,6 +3074,15 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > > >                 writable = NULL;
> > > > >         }
> > > > >
> > > > > +       if (!atomic && can_exit_on_missing
> > > > > +           && kvm_is_slot_exit_on_missing(slot)) {
> >
> > Operators go on the preceding line:
> 
> Thanks. On a side note, is this actually documented anywhere? I
> searched coding-style.rst but couldn't find it.

Maybe?  But the fact there are very few, if any, patterns like this in KVM should
be a big clue that it's not the One True Way.  The formal docs will never be 100%
complete, and preferences do evolve and change, but if your code sticks out like
a sore thumb, odds are good you're doing something wrong.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2023-11-09 21:07   ` Anish Moorthy
@ 2024-02-07 15:39     ` Sean Christopherson
  2024-02-07 16:41       ` Oliver Upton
  0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 15:39 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: kvm, kvmarm, oliver.upton, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Nov 09, 2023, Anish Moorthy wrote:
> On Thu, Nov 9, 2023 at 1:03 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > TODO: Changelog -- and possibly just merge into the "god" arm commit?
> 
> *Facepalm*
> 
> Well as you can tell, I wasn't sure if there was anything to actually
> put in the long-form log. Lmk if you have suggestions

I think the right way to organize things is to have this chunk:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b1e5e42bdeb4..bc978260d2be 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3309,6 +3309,10 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
                return RET_PF_RETRY;
        }

+       WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);
+
+       kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
+                                     fault->write, fault->exec, fault->is_private);
        return -EFAULT;
 }


be part of this patch.  Because otherwise, advertising KVM_CAP_MEMORY_FAULT_INFO
is a lie.  Userspace can't catch KVM in the lie, but that doesn't make it right.

That should in turn make it easier to write a useful changelog.

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults
  2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
                   ` (13 preceding siblings ...)
  2023-11-09 21:03 ` [PATCH v6 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2024-02-07 15:46 ` Sean Christopherson
  2024-02-09 16:00   ` Anish Moorthy
  14 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 15:46 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: kvm, kvmarm, oliver.upton, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Nov 09, 2023, Anish Moorthy wrote:
> Base Commit
> ~~~~~~~~~~~
> This series is based off of kvm/next (45b890f7689e) with v14 of the
> guest_memfd series applied, with some fixes on top [3].

Please use `--base`.  I have gotten spoiled by git appending the object ID at the
bottom, and get annoyed every time I have to go spelunking for the base :-)

Also, in the future, when posting a series that has multiple dependencies, it is
*very* helpful to reviewers and maintainers to provide a full branch somewhere,
e.g. on github, gitlab, etc.  That way someone that wants to actually test things
doesn't need to hunt down and splice together a bunch of different assets.

From Documentation/process/maintainer-kvm-x86.rst:

Git Base
~~~~~~~~
If you are using git version 2.9.0 or later (Googlers, this is all of you!),
use ``git format-patch`` with the ``--base`` flag to automatically include the
base tree information in the generated patches.

Note, ``--base=auto`` works as expected if and only if a branch's upstream is
set to the base topic branch, e.g. it will do the wrong thing if your upstream
is set to your personal repository for backup purposes.  An alternative "auto"
solution is to derive the names of your development branches based on their
KVM x86 topic, and feed that into ``--base``.  E.g. ``x86/pmu/my_branch_name``,
and then write a small wrapper to extract ``pmu`` from the current branch name
to yield ``--base=x/pmu``, where ``x`` is whatever name your repository uses to
track the KVM x86 remote.

> Anish Moorthy (14):
>   KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic'
>     parameter
>   KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
>   KVM: Simplify error handling in __gfn_to_pfn_memslot()
>   KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
>     userspace
>   KVM: Try using fast GUP to resolve read faults
>   KVM: Add memslot flag to let userspace force an exit on missing hva
>     mappings
>   KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
>     stage-2 fault handler
>   KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
>   KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from
>     stage-2 fault-handler
>   KVM: selftests: Report per-vcpu demand paging rate from demand paging
>     test
>   KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
>     paging test
>   KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
>     signal errors via TEST_ASSERT
>   KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
>   KVM: selftests: Handle memory fault exits in demand_paging_test
> 
>  Documentation/virt/kvm/api.rst                |  33 +-
>  arch/arm64/kvm/Kconfig                        |   1 +
>  arch/arm64/kvm/arm.c                          |   1 +
>  arch/arm64/kvm/mmu.c                          |   7 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
>  arch/x86/kvm/Kconfig                          |   1 +
>  arch/x86/kvm/mmu/mmu.c                        |   8 +-
>  include/linux/kvm_host.h                      |  21 +-
>  include/uapi/linux/kvm.h                      |   5 +
>  .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
>  .../selftests/kvm/access_tracking_perf_test.c |   2 +-
>  .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
>  .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
>  .../testing/selftests/kvm/include/memstress.h |   2 +-
>  .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
>  tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
>  .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
>  .../kvm/memslot_modification_stress_test.c    |   2 +-
>  .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
>  virt/kvm/Kconfig                              |   3 +
>  virt/kvm/kvm_main.c                           |  46 ++-
>  22 files changed, 444 insertions(+), 175 deletions(-)

A few nits throughout, but this is looking good for 6.9.

Oliver / Marc,

Any objection to taking this through kvm-x86? (when you feel it's ready, obviously)
My plan is to put it in a dedicated topic branch, with a massaged cover letter as
the tag used for the pull request so that we can capture the motivation/benefits.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2024-02-07 15:39     ` Sean Christopherson
@ 2024-02-07 16:41       ` Oliver Upton
  2024-02-07 21:21         ` Anish Moorthy
  0 siblings, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2024-02-07 16:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Feb 07, 2024 at 07:39:50AM -0800, Sean Christopherson wrote:
> On Thu, Nov 09, 2023, Anish Moorthy wrote:
> > On Thu, Nov 9, 2023 at 1:03 PM Anish Moorthy <amoorthy@google.com> wrote:
> > >
> > > TODO: Changelog -- and possibly just merge into the "god" arm commit?
> > 
> > *Facepalm*
> > 
> > Well as you can tell, I wasn't sure if there was anything to actually
> > put in the long-form log. Lmk if you have suggestions
> 
> I think the right way to organize things is to have this chunk:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b1e5e42bdeb4..bc978260d2be 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3309,6 +3309,10 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>                 return RET_PF_RETRY;
>         }
> 
> +       WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);
> +
> +       kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
> +                                     fault->write, fault->exec, fault->is_private);
>         return -EFAULT;
>  }

Err.. This is the arm64 patch. x86 already advertises KVM_CAP_MEMORY_FAULT_INFO.
The rest of the advertisement happens over in the arch-neutral code when
the arch selects CONFIG_HAVE_KVM_EXIT_ON_MISSING.

Having said that...

> be part of this patch.  Because otherwise, advertising KVM_CAP_MEMORY_FAULT_INFO
> is a lie.  Userspace can't catch KVM in the lie, but that doesn't make it right.
> 
> That should in turn make it easier to write a useful changelog.

The feedback still stands. The capability needs to be squashed into the
patch that actually introduces the functionality.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  2024-02-07 15:26   ` Sean Christopherson
@ 2024-02-07 18:44     ` Anish Moorthy
  0 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-07 18:44 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, kvmarm

On Wed, Feb 7, 2024 at 7:26 AM Sean Christopherson <seanjc@google.com> wrote:
>
> This is not a Documentation change.  The comment might make its way to generated
> docs, but this is not Documentation/ and I most definitely did not expect a change
> to kvm_main.c based on the scope.

Whoops, didn't realize that Documentation: carried that specific
meaning- thanks for letting me know

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
  2024-02-07 15:30   ` Sean Christopherson
@ 2024-02-07 18:57     ` Anish Moorthy
  0 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-07 18:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, kvmarm, oliver.upton, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Feb 7, 2024 at 7:31 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Needs a changelog.

Urgh, that's embarrassing: I'll use the following absent any objections

> kvm_prepare_memory_fault_exit() already takes parameters describing the R/W/X
> nature of the relevant access but doesn't actually do anything with them. Define
> and utilize the flags necessary to pass this information on to userspace.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2024-02-07 16:41       ` Oliver Upton
@ 2024-02-07 21:21         ` Anish Moorthy
  2024-02-07 21:41           ` Sean Christopherson
  2024-02-07 22:07           ` Oliver Upton
  0 siblings, 2 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-07 21:21 UTC (permalink / raw)
  To: Oliver Upton, Sean Christopherson
  Cc: kvm, kvmarm, pbonzini, maz, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Wed, Feb 7, 2024 at 8:41 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Wed, Feb 07, 2024 at 07:39:50AM -0800, Sean Christopherson wrote:
>
> Having said that...
>
> > be part of this patch.  Because otherwise, advertising KVM_CAP_MEMORY_FAULT_INFO
> > is a lie.  Userspace can't catch KVM in the lie, but that doesn't make it right.
> >
> > That should in turn make it easier to write a useful changelog.
>
> The feedback still stands. The capability needs to be squashed into the
> patch that actually introduces the functionality.
>
> --
> Thanks,
> Oliver

Hold on, I think there may be confusion here.
KVM_CAP_MEMORY_FAULT_INFO is the mechanism for reporting annotated
EFAULTs. These are generic in that other things (such as the guest
memfd stuff) may also report information to userspace using annotated
EFAULTs.

KVM_CAP_EXIT_ON_MISSING is the thing that says "do an annotated EFAULT
when a stage-2 violation would require faulting in host mapping" On
both x86 and arm64, the relevant functionality is added and the cap is
advertised in a single patch.

I think it makes sense to enable/advertise the two caps separately (as
I've done here). The former, after all, just says that userspace "may
get annotated EFAULTs for whatever reason" (as opposed to the latter
cap, which says that userspace *will* get annotated EFAULTs when the
stage-2 handler is failed). So even if arm64 userspaces never get
annotated EFAULTs as of this patch, I don't think we're "lying" to
them.

Consider a related problem: suppose that code is added in core KVM
which also generates annotated EFAULTs, and that later the arm64
"Enable KVM_CAP_EXIT_ON_MISSING"  patch [1] ends up needing to be
reverted for some reason. If the two patches were merged, that revert
would silence *all* annotated EFAULTs for arm64, which seems
incorrect.

James and I had this discussion in [2] (I propose an updated
description there too).

[1] https://lore.kernel.org/kvm/20231109210325.3806151-10-amoorthy@google.com/
[2] https://lore.kernel.org/kvm/CAF7b7mrALBBWCg+ctU867BjQhtLQNuX=Yo8u9TZEuDTEtCV6qw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2024-02-07 21:21         ` Anish Moorthy
@ 2024-02-07 21:41           ` Sean Christopherson
  2024-02-07 22:07           ` Oliver Upton
  1 sibling, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2024-02-07 21:41 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Oliver Upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Feb 07, 2024, Anish Moorthy wrote:
> On Wed, Feb 7, 2024 at 8:41 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Wed, Feb 07, 2024 at 07:39:50AM -0800, Sean Christopherson wrote:
> >
> > Having said that...
> >
> > > be part of this patch.  Because otherwise, advertising KVM_CAP_MEMORY_FAULT_INFO
> > > is a lie.  Userspace can't catch KVM in the lie, but that doesn't make it right.
> > >
> > > That should in turn make it easier to write a useful changelog.
> >
> > The feedback still stands. The capability needs to be squashed into the
> > patch that actually introduces the functionality.
> >
> > --
> > Thanks,
> > Oliver
> 
> Hold on, I think there may be confusion here.
> KVM_CAP_MEMORY_FAULT_INFO is the mechanism for reporting annotated
> EFAULTs. These are generic in that other things (such as the guest
> memfd stuff) may also report information to userspace using annotated
> EFAULTs.
> 
> KVM_CAP_EXIT_ON_MISSING is the thing that says "do an annotated EFAULT
> when a stage-2 violation would require faulting in host mapping" On
> both x86 and arm64, the relevant functionality is added and the cap is
> advertised in a single patch.
> 
> I think it makes sense to enable/advertise the two caps separately (as
> I've done here). The former, after all, just says that userspace "may
> get annotated EFAULTs for whatever reason" (as opposed to the latter
> cap, which says that userspace *will* get annotated EFAULTs when the
> stage-2 handler is failed). So even if arm64 userspaces never get
> annotated EFAULTs as of this patch, I don't think we're "lying" to
> them.

Neither Oliver nor I are advocating you smush the two together.  We're saying
the code that actually fills memory_fault should be either (a) a separate patch
(x86) or (b) in the patch that advertises KVM_CAP_MEMORY_FAULT_INFO (arm).

I *highly* doubt it will matter in practice, but if there was a problem with
filling memory_fault, it would be nice to isolate that to a standalone patch,
and not the EXIT_ON_MISSING_CHANGE.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2024-02-07 21:21         ` Anish Moorthy
  2024-02-07 21:41           ` Sean Christopherson
@ 2024-02-07 22:07           ` Oliver Upton
  2024-02-09  1:13             ` Anish Moorthy
  1 sibling, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2024-02-07 22:07 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Feb 07, 2024 at 01:21:05PM -0800, Anish Moorthy wrote:
> On Wed, Feb 7, 2024 at 8:41 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Wed, Feb 07, 2024 at 07:39:50AM -0800, Sean Christopherson wrote:
> >
> > Having said that...
> >
> > > be part of this patch.  Because otherwise, advertising KVM_CAP_MEMORY_FAULT_INFO
> > > is a lie.  Userspace can't catch KVM in the lie, but that doesn't make it right.
> > >
> > > That should in turn make it easier to write a useful changelog.
> >
> > The feedback still stands. The capability needs to be squashed into the
> > patch that actually introduces the functionality.
> >
> > --
> > Thanks,
> > Oliver
> 
> Hold on, I think there may be confusion here.

No, there isn't.

> KVM_CAP_MEMORY_FAULT_INFO is the mechanism for reporting annotated
> EFAULTs. These are generic in that other things (such as the guest
> memfd stuff) may also report information to userspace using annotated
> EFAULTs.
> 
> KVM_CAP_EXIT_ON_MISSING is the thing that says "do an annotated EFAULT
> when a stage-2 violation would require faulting in host mapping" On
> both x86 and arm64, the relevant functionality is added and the cap is
> advertised in a single patch.
> 
> I think it makes sense to enable/advertise the two caps separately (as
> I've done here). The former, after all, just says that userspace "may
> get annotated EFAULTs for whatever reason" (as opposed to the latter
> cap, which says that userspace *will* get annotated EFAULTs when the
> stage-2 handler is failed). So even if arm64 userspaces never get
> annotated EFAULTs as of this patch, I don't think we're "lying" to
> them.

I don't know about you, but I find describing UAPI in terms of "may" and
"whatever reason" quite unsettling. I like to keep my interactions with
userspace deterministic.

Overall, I find the informational capability to be quite superfluous as
it pertains to this feature. Userspace has *explicitly* opted in to a
specific behavior, and the side band capability provides no useful
information. You can easily document KVM_CAP_MEMORY_FAULT_INFO in
such a way that userspace expects to take this sort of exit.

Nobody has presented a use case for annotated EFAULTs on arm64 beyond this
opt-in and there is zero interest in predefining UAPI for anything else.
x86 may've done this a different way, but that's their business.

We're not making UAPI out of any of our other EFAULT returns right now.

> Consider a related problem: suppose that code is added in core KVM
> which also generates annotated EFAULTs, and that later the arm64
> "Enable KVM_CAP_EXIT_ON_MISSING"  patch [1] ends up needing to be
> reverted for some reason.

The single rule we try to uphold in the kernel is to *never break
userspace*, so I don't see this being in the realm of possibility. The
moment we expose a feature to userspace we're on the hook for it in
perpetuity, and if we break that then you're welcome to send a nastygram
to Marc or I.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  2024-02-07 22:07           ` Oliver Upton
@ 2024-02-09  1:13             ` Anish Moorthy
  0 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-09  1:13 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Sean Christopherson, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Ah, I finally see what I'm being beaten over the head with- the
kvm_prepare_memory_fault_exit() is a feature of
KVM_CAP_MEMORY_FAULT_INFO, not EXIT_ON_MISSING. You're right of course

Ok, so for this commit in the next version of the series I'm thinking
the following description

> KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
> stage-2 fault handler
>
> At the moment the only intended use case for KVM_CAP_MEMORY_FAULT_INFO
> on arm64 is to annotate EFAULTs from the stage-2 fault handler, so
> add that annotation now.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
  2024-01-31 22:38     ` Anish Moorthy
@ 2024-02-09  1:21       ` Anish Moorthy
  0 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-09  1:21 UTC (permalink / raw)
  To: James Houghton
  Cc: seanjc, kvm, kvmarm, oliver.upton, pbonzini, maz,
	robert.hoo.linux, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Jan 31, 2024 at 2:38 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Tue, Jan 30, 2024 at 3:58 PM James Houghton <jthoughton@google.com> wrote:
> >
> > I think that either (1) we move this kvm_prepare_memory_fault_exit
> > logic into the previous patch[1], or (2) we merge this patch with the
> > previous one. IIUC, we can only advertise KVM_CAP_MEMORY_FAULT_INFO on
> > arm64 if this logic is present.
>
> Actually (sorry, about-face from our off-list chat), *does* it make
> sense to merge these two patches?

As per [1]: yes, it makes sense to move the kvm_prepare_memory_fault_exit().

So for the next version, the description is also going to change a bit to

> KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
>
> Prevent the stage-2 fault handler from faulting in pages when
> KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> call to check the memslot flag. This effects the delivery of stage-2
> faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
> can attempt to resolve without terminating the guest.

> Delivering stage-2 faults to userspace in this way sidesteps the
> significant scalabiliy issues associated with using userfaultfd for the
> same purpose.

[1] https://lore.kernel.org/kvm/ZcP_JHsMJUlvjAs1@linux.dev/#t

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults
  2024-02-07 15:46 ` [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Sean Christopherson
@ 2024-02-09 16:00   ` Anish Moorthy
  0 siblings, 0 replies; 44+ messages in thread
From: Anish Moorthy @ 2024-02-09 16:00 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton
  Cc: kvm, kvmarm, pbonzini, robert.hoo.linux, jthoughton, dmatlack,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

On Wed, Feb 7, 2024 at 7:46 AM Sean Christopherson <seanjc@google.com> wrote:
>
> A few nits throughout, but this is looking good for 6.9.
>
> Oliver / Marc,
>
> Any objection to taking this through kvm-x86? (when you feel it's ready, obviously)
> My plan is to put it in a dedicated topic branch, with a massaged cover letter as
> the tag used for the pull request so that we can capture the motivation/benefits.

Oliver and Marc,

I have a v7 ready based on the feedback I've received so far- please
let me know if I should send it or wait for you to take a look at this
version first.

On the one hand I obviously want to incorporate any feedback you have
for the next version, but on the other I suspect that if/when you look
at this you'll want to see a version with as few (known) flaws as
possible

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2024-02-09 16:01 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-09 21:03 [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 01/14] KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
2024-02-07 15:26   ` Sean Christopherson
2024-02-07 18:44     ` Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 02/14] KVM: Documentation: Add docstrings for __kvm_read/write_guest_page() Anish Moorthy
2024-02-07 15:30   ` Sean Christopherson
2023-11-09 21:03 ` [PATCH v6 03/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 04/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
2024-02-07 15:30   ` Sean Christopherson
2024-02-07 18:57     ` Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 05/14] KVM: Try using fast GUP to resolve read faults Anish Moorthy
2024-02-05 21:55   ` Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
2024-01-31  0:25   ` James Houghton
2024-01-31 21:59     ` Anish Moorthy
2024-02-01  0:26       ` James Houghton
2024-02-01  1:19         ` Oliver Upton
2024-02-01 16:28           ` Sean Christopherson
2024-02-01 19:24             ` Anish Moorthy
2024-02-02  1:03               ` Oliver Upton
2024-02-02  1:01             ` Oliver Upton
2024-02-01 16:09       ` Sean Christopherson
2024-02-01 19:53         ` Anish Moorthy
2024-02-07 15:35           ` Sean Christopherson
2023-11-09 21:03 ` [PATCH v6 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
2023-11-09 21:07   ` Anish Moorthy
2024-02-07 15:39     ` Sean Christopherson
2024-02-07 16:41       ` Oliver Upton
2024-02-07 21:21         ` Anish Moorthy
2024-02-07 21:41           ` Sean Christopherson
2024-02-07 22:07           ` Oliver Upton
2024-02-09  1:13             ` Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 09/14] KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler Anish Moorthy
2024-01-30 23:58   ` James Houghton
2024-01-31 22:38     ` Anish Moorthy
2024-02-09  1:21       ` Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
2023-11-09 21:03 ` [PATCH v6 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2024-02-07 15:46 ` [PATCH v6 00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults Sean Christopherson
2024-02-09 16:00   ` Anish Moorthy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).