* [PATCH v2 00/13] KVM: Introduce KVM Userfault
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

This is a v2 of KVM Userfault, mostly unchanged from v1[5]. Changelog here:

v1->v2:
- For arm64, no longer zap stage 2 when disabling KVM_MEM_USERFAULT
  (thanks Oliver).
- Fix the userfault_bitmap validation and casts (thanks kernel test
  robot).
- Fix _Atomic cast for the userfault bitmap in the selftest (thanks
  kernel test robot).
- Pick up Reviewed-by on doc changes (thanks Bagas).

And here is a trimmed down cover letter from v1, slightly modified
given the small arm64 change:

Please see the RFC[1] for the problem description. In summary,
guest_memfd VMs have no mechanism for doing post-copy live migration.
KVM Userfault provides such a mechanism.

There is a second problem that KVM Userfault solves: userfaultfd-based
post-copy doesn't scale very well. When used with userfaultfd, KVM
Userfault scales much better in the common case where most post-copy
demand fetches are the result of vCPU access violations. This is a
continuation of the solution Anish was working on[3]. This aspect of
KVM Userfault is important for userfaultfd-based live migration when
scaling up to hundreds of vCPUs with ~30us network latency for a
PAGE_SIZE demand-fetch.

The implementation in this series differs from the RFC[1]. It adds:
 1. a new memslot flag: KVM_MEM_USERFAULT,
 2. a new field, userfault_bitmap, in struct kvm_memory_slot,
 3. a new KVM_EXIT_MEMORY_FAULT flag: KVM_MEMORY_EXIT_FLAG_USERFAULT,
 4. a new KVM capability: KVM_CAP_USERFAULT.

KVM Userfault does not attempt to catch KVM's own accesses to guest
memory. That is left up to userfaultfd.

When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
are zapped, and new faults will check `userfault_bitmap` to see if the
fault should exit to userspace.

When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
permitted.

When KVM_MEM_USERFAULT is disabled, huge mappings are recovered in the
same way as when dirty logging is disabled: on x86, huge mappings are
eagerly reconstructed, but on arm64, they are not.
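
As a rough illustration, a VMM might drive this uAPI as follows. This
is a minimal sketch against the ioctls/flags added in this series;
error handling, slot setup, and the demand-fetch transport are elided,
and fetch_page()/clear_bit_release() are hypothetical helpers
(DIV_ROUND_UP/BITS_PER_LONG as in the selftest headers):

	/* One bit per page; all pages userfault initially. */
	size_t bitmap_bytes = DIV_ROUND_UP(npages, BITS_PER_LONG) * sizeof(long);
	unsigned long *bitmap = malloc(bitmap_bytes);

	memset(bitmap, 0xff, bitmap_bytes);

	struct kvm_userspace_memory_region2 region = {
		.slot = slot,
		.flags = KVM_MEM_USERFAULT,
		.guest_phys_addr = gpa,
		.memory_size = npages * PAGE_SIZE,
		.userspace_addr = (__u64)hva,
		.userfault_bitmap = (__u64)bitmap,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	/* In the vCPU run loop: */
	if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
	    (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_USERFAULT)) {
		uint64_t page = (run->memory_fault.gpa - gpa) / PAGE_SIZE;

		fetch_page(run->memory_fault.gpa);
		/* Publish contents, then clear the bit (store-release). */
		clear_bit_release(bitmap, page);
	}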

KVM Userfault is not compatible with async page faults. Nikita has
proposed a new, more userspace-driven implementation of async page
faults that *is* compatible with KVM Userfault[4].

See v1 for more performance details[5]. They are unchanged in this v2.

This series is based on the latest kvm/next.

[1]: https://lore.kernel.org/kvm/20240710234222.2333120-1-jthoughton@google.com/
[2]: https://lpc.events/event/18/contributions/1757/
[3]: https://lore.kernel.org/all/20240215235405.368539-1-amoorthy@google.com/
[4]: https://lore.kernel.org/kvm/20241118123948.4796-1-kalyazin@amazon.com/#t
[5]: https://lore.kernel.org/kvm/20241204191349.1730936-1-jthoughton@google.com/

James Houghton (13):
  KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
  KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  KVM: arm64: Add support for KVM_MEM_USERFAULT
  KVM: selftests: Fix vm_mem_region_set_flags docstring
  KVM: selftests: Fix prefault_mem logic
  KVM: selftests: Add va_start/end into uffd_desc
  KVM: selftests: Add KVM Userfault mode to demand_paging_test
  KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
  KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
  KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT
    details

 Documentation/virt/kvm/api.rst                |  33 +++-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/mmu.c                          |  26 +++-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  27 +++-
 arch/x86/kvm/mmu/mmu_internal.h               |  20 ++-
 arch/x86/kvm/x86.c                            |  36 +++--
 include/linux/kvm_host.h                      |  19 ++-
 include/uapi/linux/kvm.h                      |   6 +-
 .../selftests/kvm/demand_paging_test.c        | 145 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   5 +
 .../selftests/kvm/include/userfaultfd_util.h  |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  42 ++++-
 .../selftests/kvm/lib/userfaultfd_util.c      |   2 +
 .../selftests/kvm/set_memory_region_test.c    |  33 ++++
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  54 ++++++-
 17 files changed, 419 insertions(+), 36 deletions(-)


base-commit: 10b2c8a67c4b8ec15f9d07d177f63b563418e948
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Use one of the 14 reserved u64s in struct kvm_userspace_memory_region2
for the user to provide `userfault_bitmap`.

The memslot flag indicates whether KVM should read from the
`userfault_bitmap` field of the memslot. The user is permitted to
provide a bogus pointer; if the pointer cannot be read from at fault
time, KVM returns -EFAULT (with no other information) to the user.
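
A hedged sketch of the sizing rule this implies for userspace, matching
the access_ok() check below (one bit per gfn, rounded up to whole
longs; npages/memory_size are hypothetical locals):

	size_t npages = memory_size >> PAGE_SHIFT;
	size_t bitmap_bytes = DIV_ROUND_UP(npages, BITS_PER_LONG) * sizeof(long);
	unsigned long *bitmap = calloc(1, bitmap_bytes);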

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/kvm_host.h | 14 ++++++++++++++
 include/uapi/linux/kvm.h |  4 +++-
 virt/kvm/Kconfig         |  3 +++
 virt/kvm/kvm_main.c      | 35 +++++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3..f7a3dfd5e224 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,7 @@ struct kvm_memory_slot {
 	unsigned long *dirty_bitmap;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
+	unsigned long __user *userfault_bitmap;
 	u32 flags;
 	short id;
 	u16 as_id;
@@ -724,6 +725,11 @@ static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 }
 #endif
 
+static inline bool kvm_has_userfault(struct kvm *kvm)
+{
+	return IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT);
+}
+
 struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
@@ -2553,4 +2559,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		      gfn_t gfn);
+
+static inline bool kvm_memslot_userfault(struct kvm_memory_slot *memslot)
+{
+	return memslot->flags & KVM_MEM_USERFAULT;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 343de0a51797..7ade5169d373 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -40,7 +40,8 @@ struct kvm_userspace_memory_region2 {
 	__u64 guest_memfd_offset;
 	__u32 guest_memfd;
 	__u32 pad1;
-	__u64 pad2[14];
+	__u64 userfault_bitmap;
+	__u64 pad2[13];
 };
 
 /*
@@ -51,6 +52,7 @@ struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_USERFAULT	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..9eb1fae238b1 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,6 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_USERFAULT
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..4bceae6a6401 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1541,6 +1541,9 @@ static int check_memory_region_flags(struct kvm *kvm,
 	    !(mem->flags & KVM_MEM_GUEST_MEMFD))
 		valid_flags |= KVM_MEM_READONLY;
 
+	if (kvm_has_userfault(kvm))
+		valid_flags |= KVM_MEM_USERFAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -1974,6 +1977,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_USERFAULT &&
+	    ((mem->userfault_bitmap != untagged_addr(mem->userfault_bitmap)) ||
+	     !access_ok((void __user *)(unsigned long)mem->userfault_bitmap,
+			DIV_ROUND_UP(mem->memory_size >> PAGE_SHIFT, BITS_PER_LONG)
+			 * sizeof(long))))
+		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
 
@@ -2042,6 +2051,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if (r)
 			goto out;
 	}
+	if (mem->flags & KVM_MEM_USERFAULT)
+		new->userfault_bitmap =
+		  (unsigned long __user *)(unsigned long)mem->userfault_bitmap;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
@@ -6426,3 +6438,26 @@ void kvm_exit(void)
 	kvm_irqfd_exit();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);
+
+int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		       gfn_t gfn)
+{
+	unsigned long bitmap_chunk = 0;
+	off_t offset;
+
+	if (!kvm_memslot_userfault(memslot))
+		return 0;
+
+	if (WARN_ON_ONCE(!memslot->userfault_bitmap))
+		return 0;
+
+	offset = gfn - memslot->base_gfn;
+
+	if (copy_from_user(&bitmap_chunk,
+			   memslot->userfault_bitmap + offset / BITS_PER_LONG,
+			   sizeof(bitmap_chunk)))
+		return -EFAULT;
+
+	/* Set in the bitmap means that the gfn is userfault */
+	return !!(bitmap_chunk & (1ul << (offset % BITS_PER_LONG)));
+}
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 02/13] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

This flag is used for vCPU memory faults caused by KVM Userfault; i.e.,
the bit in `userfault_bitmap` corresponding to the faulting gfn was set.
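
For example, a VMM's run-loop handler might check (sketch only;
handle_userfault() is a hypothetical callback):

	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
	    (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_USERFAULT))
		return handle_userfault(run->memory_fault.gpa,
					run->memory_fault.size);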

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7ade5169d373..c302edf1c984 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -444,6 +444,7 @@ struct kvm_run {
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+#define KVM_MEMORY_EXIT_FLAG_USERFAULT	(1ULL << 4)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Currently guest_memfd memslots can only be deleted. Slightly change the
logic to allow KVM_MR_FLAGS_ONLY changes when the only flag being
changed is KVM_MEM_USERFAULT.
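
A condensed sketch of the flags-only toggle this permits, mirroring the
selftest added later in this series (vm/slot/memfd setup elided):

	/* Turn KVM_MEM_USERFAULT on, then back off, on a live slot. */
	vm_set_user_memory_region2(vm, slot,
				   KVM_MEM_GUEST_MEMFD | KVM_MEM_USERFAULT,
				   gpa, size, 0, memfd, 0);
	vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD,
				   gpa, size, 0, memfd, 0);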

Signed-off-by: James Houghton <jthoughton@google.com>
---
 virt/kvm/kvm_main.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4bceae6a6401..882c1f7b4aa8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2015,9 +2015,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
-		/* Private memslots are immutable, they can only be deleted. */
-		if (mem->flags & KVM_MEM_GUEST_MEMFD)
-			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2031,6 +2028,16 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			return 0;
 	}
 
+	/*
+	 * Except for being able to set KVM_MEM_USERFAULT, private memslots are
+	 * immutable, they can only be deleted.
+	 */
+	if (mem->flags & KVM_MEM_GUEST_MEMFD &&
+	    !(change == KVM_MR_CREATE ||
+	      (change == KVM_MR_FLAGS_ONLY &&
+	       (mem->flags ^ old->flags) == KVM_MEM_USERFAULT)))
+		return -EINVAL;
+
 	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
 	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
 		return -EEXIST;
@@ -2046,7 +2053,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
-	if (mem->flags & KVM_MEM_GUEST_MEMFD) {
+	if (mem->flags & KVM_MEM_GUEST_MEMFD && change == KVM_MR_CREATE) {
 		r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
 		if (r)
 			goto out;
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 04/13] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Advertise support for KVM_CAP_USERFAULT when kvm_has_userfault() returns
true. Currently this is merely IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT), so
it is somewhat redundant.
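
Userspace is expected to gate its use of the flag on the capability,
e.g. (sketch):

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_USERFAULT) > 0)
		flags |= KVM_MEM_USERFAULT;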

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c302edf1c984..defcad38d423 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -936,6 +936,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_USERFAULT 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 882c1f7b4aa8..30f09141df64 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4811,6 +4811,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_HAVE_KVM_USERFAULT
+	case KVM_CAP_USERFAULT:
+		return kvm_has_userfault(kvm);
 #endif
 	default:
 		break;
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Adhering to the requirements of KVM Userfault:

1. Zap all sptes for the memslot when KVM_MEM_USERFAULT is toggled on
   with kvm_arch_flush_shadow_memslot().
2. Only allow PAGE_SIZE sptes when KVM_MEM_USERFAULT is enabled (for both
   normal/GUP memory and guest_memfd memory).
3. Reconstruct huge mappings when KVM_MEM_USERFAULT is toggled off with
   kvm_mmu_recover_huge_pages(). This is the behavior when dirty logging
   is disabled; remain consistent with it.

With the new logic in kvm_mmu_slot_apply_flags(), I've simplified the
two dirty-logging-toggle checks into one, and I have dropped the
WARN_ON() that was there.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h | 20 +++++++++++++++---
 arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++---------
 include/linux/kvm_host.h        |  5 ++++-
 5 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..286c6825cd1c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -47,6 +47,7 @@ config KVM_X86
 	select KVM_GENERIC_PRE_FAULT_MEMORY
 	select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
 	select KVM_WERROR if WERROR
+	select HAVE_KVM_USERFAULT
 
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2401606db260..5cab2785b97f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4280,14 +4280,19 @@ static inline u8 kvm_max_level_for_order(int order)
 	return PG_LEVEL_4K;
 }
 
-static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
-					u8 max_level, int gmem_order)
+static u8 kvm_max_private_mapping_level(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					kvm_pfn_t pfn, u8 max_level,
+					int gmem_order)
 {
 	u8 req_max_level;
 
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_memslot_userfault(slot))
+		return PG_LEVEL_4K;
+
 	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
@@ -4324,8 +4329,10 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	}
 
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
-	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
-							 fault->max_level, max_order);
+	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->slot,
+							 fault->pfn,
+							 fault->max_level,
+							 max_order);
 
 	return RET_PF_CONTINUE;
 }
@@ -4334,6 +4341,18 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 				 struct kvm_page_fault *fault)
 {
 	unsigned int foll = fault->write ? FOLL_WRITE : 0;
+	int userfault;
+
+	userfault = kvm_gfn_userfault(vcpu->kvm, fault->slot, fault->gfn);
+	if (userfault < 0)
+		return userfault;
+	if (userfault) {
+		kvm_mmu_prepare_userfault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
+	if (kvm_memslot_userfault(fault->slot))
+		fault->max_level = PG_LEVEL_4K;
 
 	if (fault->is_private)
 		return kvm_mmu_faultin_pfn_private(vcpu, fault);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b00abbe3f6cf..15705faa3b67 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -282,12 +282,26 @@ enum {
 	RET_PF_SPURIOUS,
 };
 
-static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
-						     struct kvm_page_fault *fault)
+static inline void __kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						       struct kvm_page_fault *fault,
+						       bool is_userfault)
 {
 	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
 				      PAGE_SIZE, fault->write, fault->exec,
-				      fault->is_private);
+				      fault->is_private,
+				      is_userfault);
+}
+
+static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						     struct kvm_page_fault *fault)
+{
+	__kvm_mmu_prepare_memory_fault_exit(vcpu, fault, false);
+}
+
+static inline void kvm_mmu_prepare_userfault_exit(struct kvm_vcpu *vcpu,
+						  struct kvm_page_fault *fault)
+{
+	__kvm_mmu_prepare_memory_fault_exit(vcpu, fault, true);
 }
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b04092ec76a..2abb425a6514 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13053,12 +13053,36 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	u32 new_flags = new ? new->flags : 0;
 	bool log_dirty_pages = new_flags & KVM_MEM_LOG_DIRTY_PAGES;
 
+	/*
+	 * When toggling KVM Userfault on, zap all sptes so that userfault-ness
+	 * will be respected at refault time. All new faults will only install
+	 * small sptes. Therefore, when toggling it off, recover hugepages.
+	 *
+	 * For MOVE and DELETE, there will be nothing to do, as the old
+	 * mappings will have already been deleted by
+	 * kvm_arch_flush_shadow_memslot().
+	 *
+	 * For CREATE, no mappings will have been created yet.
+	 */
+	if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
+	    (change == KVM_MR_FLAGS_ONLY)) {
+		if (old_flags & KVM_MEM_USERFAULT)
+			kvm_mmu_recover_huge_pages(kvm, new);
+		else
+			kvm_arch_flush_shadow_memslot(kvm, old);
+	}
+
+	/*
+	 * Nothing more to do if dirty logging isn't being toggled.
+	 */
+	if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
+
 	/*
 	 * Update CPU dirty logging if dirty logging is being toggled.  This
 	 * applies to all operations.
 	 */
-	if ((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES)
-		kvm_mmu_update_cpu_dirty_logging(kvm, log_dirty_pages);
+	kvm_mmu_update_cpu_dirty_logging(kvm, log_dirty_pages);
 
 	/*
 	 * Nothing more to do for RO slots (which can't be dirtied and can't be
@@ -13078,14 +13102,6 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
 		return;
 
-	/*
-	 * READONLY and non-flags changes were filtered out above, and the only
-	 * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty
-	 * logging isn't being toggled on or off.
-	 */
-	if (WARN_ON_ONCE(!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES)))
-		return;
-
 	if (!log_dirty_pages) {
 		/*
 		 * Recover huge page mappings in the slot now that dirty logging
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f7a3dfd5e224..9e8a8dcf2b73 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2465,7 +2465,8 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 						 gpa_t gpa, gpa_t size,
 						 bool is_write, bool is_exec,
-						 bool is_private)
+						 bool is_private,
+						 bool is_userfault)
 {
 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
 	vcpu->run->memory_fault.gpa = gpa;
@@ -2475,6 +2476,8 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.flags = 0;
 	if (is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+	if (is_userfault)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_USERFAULT;
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Adhering to the requirements of KVM Userfault:
1. When it is toggled on, zap the second stage with
   kvm_arch_flush_shadow_memslot(). This is to respect userfault-ness.
2. When KVM_MEM_USERFAULT is enabled, restrict new second-stage mappings
   to be PAGE_SIZE, just like when dirty logging is enabled.

Do not zap the second stage when KVM_MEM_USERFAULT is disabled to remain
consistent with the behavior when dirty logging is disabled.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/Kconfig |  1 +
 arch/arm64/kvm/mmu.c   | 26 +++++++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ead632ad01b4..d89b4088b580 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select HAVE_KVM_USERFAULT
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c9d46ad57e52..e099bdcfac42 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1493,7 +1493,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * logging_active is guaranteed to never be true for VM_PFNMAP
 	 * memslots.
 	 */
-	if (logging_active) {
+	if (logging_active || kvm_memslot_userfault(memslot)) {
 		force_pte = true;
 		vma_shift = PAGE_SHIFT;
 	} else {
@@ -1582,6 +1582,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
+	if (kvm_gfn_userfault(kvm, memslot, gfn)) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn << PAGE_SHIFT,
+					      PAGE_SIZE, write_fault,
+					      exec_fault, false, true);
+		return -EFAULT;
+	}
+
 	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
 				&writable, &page);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
@@ -2073,6 +2080,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   enum kvm_mr_change change)
 {
 	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+	u32 new_flags = new ? new->flags : 0;
+	u32 changed_flags = (new_flags) ^ (old ? old->flags : 0);
+
+	/*
+	 * If KVM_MEM_USERFAULT has been enabled, drop all the stage-2 mappings
+	 * so that we can respect userfault-ness.
+	 */
+	if ((changed_flags & KVM_MEM_USERFAULT) &&
+	    (new_flags & KVM_MEM_USERFAULT) &&
+	    change == KVM_MR_FLAGS_ONLY)
+		kvm_arch_flush_shadow_memslot(kvm, old);
+
+	/*
+	 * Nothing left to do if not toggling dirty logging.
+	 */
+	if (!(changed_flags & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
 
 	/*
 	 * At this point memslot has been committed and there is an
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 07/13] KVM: selftests: Fix vm_mem_region_set_flags docstring
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

`flags` is what region->region.flags gets set to.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 33fefeb3ca44..a87988a162f1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1124,7 +1124,7 @@ memslot2region(struct kvm_vm *vm, uint32_t memslot)
  *
  * Input Args:
  *   vm - Virtual Machine
- *   flags - Starting guest physical address
+ *   flags - Flags for the memslot
  *
  * Output Args: None
  *
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 08/13] KVM: selftests: Fix prefault_mem logic
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

The previous logic didn't handle the case where memory was partitioned
AND we were using a single userfaultfd. It would only prefault the first
vCPU's memory and not the rest.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 0202b78f8680..315f5c9037b4 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -172,11 +172,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		num_uffds = p->single_uffd ? 1 : nr_vcpus;
-		for (i = 0; i < num_uffds; i++) {
+		for (i = 0; i < nr_vcpus; i++) {
 			vcpu_args = &memstress_args.vcpu_args[i];
 			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
 				     vcpu_args->pages * memstress_args.guest_page_size);
+			if (!p->partition_vcpu_memory_access)
+				/* We prefaulted everything */
+				break;
 		}
 	}
 
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 09/13] KVM: selftests: Add va_start/end into uffd_desc
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

This will be used by the self-test to look up which userfaultfd should
handle a given KVM Userfault (in the event KVM Userfault and
userfaultfd are being used together).

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/include/userfaultfd_util.h | 2 ++
 tools/testing/selftests/kvm/lib/userfaultfd_util.c     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 60f7f9d435dc..b62fecdfe745 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -30,6 +30,8 @@ struct uffd_desc {
 	int *pipefds;
 	pthread_t *readers;
 	struct uffd_reader_args *reader_args;
+	void *va_start;
+	void *va_end;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 7c9de8414462..93004c85bcdc 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -152,6 +152,8 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 		    expected_ioctls, "missing userfaultfd ioctls");
 
 	uffd_desc->uffd = uffd;
+	uffd_desc->va_start = hva;
+	uffd_desc->va_end = (char *)hva + len;
 	for (i = 0; i < uffd_desc->num_readers; ++i) {
 		int pipes[2];
 
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Add a way for the KVM_RUN loop to handle -EFAULT exits when they are for
KVM_MEMORY_EXIT_FLAG_USERFAULT. In this case, preemptively handle the
UFFDIO_COPY or UFFDIO_CONTINUE if userfaultfd is also in use. This saves
the trip through the userfaultfd poll/read/WAKE loop.

When preemptively handling UFFDIO_COPY/CONTINUE, do so with
MODE_DONTWAKE, as there will not be a thread to wake. If a thread *does*
take the userfaultfd slow path, we will get a regular userfault, and we
will call handle_uffd_page_request() which will do a full wake-up. In
the EEXIST case, a wake-up will not occur. Make sure to call UFFDIO_WAKE
explicitly in this case.

When handling KVM userfaults, make sure to clear bits in the bitmap
with memory_order_release. Although it wouldn't affect the
functionality of the test (because memstress doesn't actually require
any particular guest memory contents), it is what userspace normally
needs to do.
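
Concretely, the clear looks like this (as in resolve_kvm_userfault()
in the diff below; `page` is the page offset within the region):

	atomic_fetch_and_explicit((_Atomic unsigned long *)bitmap_chunk,
				  ~(1ul << (page % BITS_PER_LONG)),
				  memory_order_release);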

Add a `-k` flag to make the test use KVM Userfault.

Add the vm_mem_region_set_flags_userfault() helper for setting
`userfault_bitmap` and KVM_MEM_USERFAULT at the same time.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 139 +++++++++++++++++-
 .../testing/selftests/kvm/include/kvm_util.h  |   5 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  40 ++++-
 3 files changed, 176 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 315f5c9037b4..183c70731093 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -12,7 +12,9 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <linux/bitmap.h>
 #include <sys/syscall.h>
+#include <stdatomic.h>
 
 #include "kvm_util.h"
 #include "test_util.h"
@@ -24,11 +26,21 @@
 #ifdef __NR_userfaultfd
 
 static int nr_vcpus = 1;
+static int num_uffds;
 static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 
 static size_t demand_paging_size;
+static size_t host_page_size;
 static char *guest_data_prototype;
 
+static struct {
+	bool enabled;
+	int uffd_mode; /* set if userfaultfd is also in use */
+	struct uffd_desc **uffd_descs;
+} kvm_userfault_data;
+
+static void resolve_kvm_userfault(u64 gpa, u64 size);
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
@@ -41,8 +53,22 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 	clock_gettime(CLOCK_MONOTONIC, &start);
 
 	/* Let the guest access its memory */
+restart:
 	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+	if (ret < 0 && errno == EFAULT && kvm_userfault_data.enabled) {
+		/* Check for userfault. */
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_MEMORY_FAULT,
+			    "Got invalid exit reason: %x", run->exit_reason);
+		TEST_ASSERT(run->memory_fault.flags ==
+			    KVM_MEMORY_EXIT_FLAG_USERFAULT,
+			    "Got invalid memory fault exit: %llx",
+			    run->memory_fault.flags);
+		resolve_kvm_userfault(run->memory_fault.gpa,
+				      run->memory_fault.size);
+		goto restart;
+	} else
+		TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+
 	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
 		TEST_ASSERT(false,
 			    "Invalid guest sync status: exit_reason=%s",
@@ -54,11 +80,10 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 		       ts_diff.tv_sec, ts_diff.tv_nsec);
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int resolve_uffd_page_request(int uffd_mode, int uffd, uint64_t addr,
+				     bool wake)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,7 +96,7 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.src = (uint64_t)guest_data_prototype;
 		copy.dst = addr;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
@@ -96,6 +121,7 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 		cont.range.start = addr;
 		cont.range.len = demand_paging_size;
+		cont.mode = wake ? 0 : UFFDIO_CONTINUE_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
 		/*
@@ -119,6 +145,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	if (r < 0 && wake) {
+		/*
+		 * No wake-up occurs when UFFDIO_COPY/CONTINUE fails, but we
+		 * have a thread waiting. Wake it up.
+		 */
+		struct uffdio_range range = {0};
+
+		range.start = addr;
+		range.len = demand_paging_size;
+
+		TEST_ASSERT(ioctl(uffd, UFFDIO_WAKE, &range) == 0,
+			    "UFFDIO_WAKE failed: 0x%lx", addr);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
@@ -129,6 +169,58 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 	return 0;
 }
 
+static int handle_uffd_page_request(int uffd_mode, int uffd,
+				    struct uffd_msg *msg)
+{
+	uint64_t addr = msg->arg.pagefault.address;
+
+	return resolve_uffd_page_request(uffd_mode, uffd, addr, true);
+}
+
+static void resolve_kvm_userfault(u64 gpa, u64 size)
+{
+	struct kvm_vm *vm = memstress_args.vm;
+	struct userspace_mem_region *region;
+	unsigned long *bitmap_chunk;
+	u64 page, gpa_offset;
+
+	region = (struct userspace_mem_region *) userspace_mem_region_find(
+		vm, gpa, (gpa + size - 1));
+
+	if (kvm_userfault_data.uffd_mode) {
+		/*
+		 * Resolve userfaults early, without needing to read them
+		 * off the userfaultfd.
+		 */
+		uint64_t hva = (uint64_t)addr_gpa2hva(vm, gpa);
+		struct uffd_desc **descs = kvm_userfault_data.uffd_descs;
+		int i, fd;
+
+		for (i = 0; i < num_uffds; ++i)
+			if (hva >= (uint64_t)descs[i]->va_start &&
+			    hva < (uint64_t)descs[i]->va_end)
+				break;
+
+		TEST_ASSERT(i < num_uffds,
+			    "Did not find userfaultfd for hva: %lx", hva);
+
+		fd = kvm_userfault_data.uffd_descs[i]->uffd;
+		resolve_uffd_page_request(kvm_userfault_data.uffd_mode, fd,
+					  hva, false);
+	} else {
+		uint64_t hva = (uint64_t)addr_gpa2hva(vm, gpa);
+
+		memcpy((char *)hva, guest_data_prototype, demand_paging_size);
+	}
+
+	gpa_offset = gpa - region->region.guest_phys_addr;
+	page = gpa_offset / host_page_size;
+	bitmap_chunk = (unsigned long *)region->region.userfault_bitmap +
+		       page / BITS_PER_LONG;
+	atomic_fetch_and_explicit((_Atomic unsigned long *)bitmap_chunk,
+			~(1ul << (page % BITS_PER_LONG)), memory_order_release);
+}
+
 struct test_params {
 	int uffd_mode;
 	bool single_uffd;
@@ -136,6 +228,7 @@ struct test_params {
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool kvm_userfault;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -149,6 +242,25 @@ static void prefault_mem(void *alias, uint64_t len)
 	}
 }
 
+static void enable_userfault(struct kvm_vm *vm, int slots)
+{
+	for (int i = 0; i < slots; ++i) {
+		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
+		struct userspace_mem_region *region;
+		unsigned long *userfault_bitmap;
+		int flags = KVM_MEM_USERFAULT;
+
+		region = memslot2region(vm, slot);
+		userfault_bitmap = bitmap_zalloc(region->mmap_size /
+						 host_page_size);
+		/* everything is userfault initially */
+		memset(userfault_bitmap, -1, region->mmap_size / host_page_size / CHAR_BIT);
+		printf("Setting bitmap: %p\n", userfault_bitmap);
+		vm_mem_region_set_flags_userfault(vm, slot, flags,
+						  userfault_bitmap);
+	}
+}
+
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct memstress_vcpu_args *vcpu_args;
@@ -159,12 +271,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec ts_diff;
 	double vcpu_paging_rate;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
+	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
+	host_page_size = getpagesize();
 
 	guest_data_prototype = malloc(demand_paging_size);
 	TEST_ASSERT(guest_data_prototype,
@@ -208,6 +321,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		}
 	}
 
+	if (p->kvm_userfault) {
+		TEST_REQUIRE(kvm_has_cap(KVM_CAP_USERFAULT));
+		kvm_userfault_data.enabled = true;
+		kvm_userfault_data.uffd_mode = p->uffd_mode;
+		kvm_userfault_data.uffd_descs = uffd_descs;
+		enable_userfault(vm, 1);
+	}
+
 	pr_info("Finished creating vCPUs and starting uffd threads\n");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
@@ -265,6 +386,7 @@ static void help(char *name)
 	printf(" -v: specify the number of vCPUs to run.\n");
 	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
 	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -k: Use KVM Userfault\n");
 	puts("");
 	exit(0);
 }
@@ -283,7 +405,7 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahokm:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -326,6 +448,9 @@ int main(int argc, char *argv[])
 				    "Invalid number of readers per uffd %d: must be >=1",
 				    p.readers_per_uffd);
 			break;
+		case 'k':
+			p.kvm_userfault = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 4c4e5a847f67..0d49a9ce832a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -582,6 +582,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
 		uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset);
+struct userspace_mem_region *
+userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end);
 
 #ifndef vm_arch_has_protected_memory
 static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
@@ -591,6 +593,9 @@ static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
 #endif
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
+void vm_mem_region_set_flags_userfault(struct kvm_vm *vm, uint32_t slot,
+				       uint32_t flags,
+				       unsigned long *userfault_bitmap);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
 void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot);
 struct kvm_vcpu *__vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a87988a162f1..a8f6b949ac59 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -634,7 +634,7 @@ void kvm_parse_vcpu_pinning(const char *pcpus_string, uint32_t vcpu_to_pcpu[],
  * of the regions is returned.  Null is returned only when no overlapping
  * region exists.
  */
-static struct userspace_mem_region *
+struct userspace_mem_region *
 userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
 {
 	struct rb_node *node;
@@ -1149,6 +1149,44 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags)
 		ret, errno, slot, flags);
 }
 
+/*
+ * VM Memory Region Flags Set with a userfault bitmap
+ *
+ * Input Args:
+ *   vm - Virtual Machine
+ *   flags - Flags for the memslot
+ *   userfault_bitmap - The bitmap to use for KVM_MEM_USERFAULT
+ *
+ * Output Args: None
+ *
+ * Return: None
+ *
+ * Sets the flags of the memory region specified by the value of slot,
+ * to the values given by flags. This helper adds a way to provide a
+ * userfault_bitmap.
+ */
+void vm_mem_region_set_flags_userfault(struct kvm_vm *vm, uint32_t slot,
+				       uint32_t flags,
+				       unsigned long *userfault_bitmap)
+{
+	int ret;
+	struct userspace_mem_region *region;
+
+	region = memslot2region(vm, slot);
+
+	TEST_ASSERT(!userfault_bitmap ^ (flags & KVM_MEM_USERFAULT),
+		    "KVM_MEM_USERFAULT must be specified with a bitmap");
+
+	region->region.flags = flags;
+	region->region.userfault_bitmap = (__u64)userfault_bitmap;
+
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
+		"  rc: %i errno: %i slot: %u flags: 0x%x",
+		ret, errno, slot, flags);
+}
+
 /*
  * VM Memory Region Move
  *
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 11/13] KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

The KVM_MEM_USERFAULT flag is supported iff KVM_CAP_USERFAULT is
available.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/set_memory_region_test.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index 86ee3385e860..adce75720cc1 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -364,6 +364,9 @@ static void test_invalid_memory_region_flags(void)
 	if (kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES) & KVM_MEMORY_ATTRIBUTE_PRIVATE)
 		supported_flags |= KVM_MEM_GUEST_MEMFD;
 
+	if (kvm_check_cap(KVM_CAP_USERFAULT))
+		supported_flags |= KVM_MEM_USERFAULT;
+
 	for (i = 0; i < 32; i++) {
 		if ((supported_flags & BIT(i)) && !(v2_only_flags & BIT(i)))
 			continue;
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 12/13] KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Make sure KVM_MEM_USERFAULT can be toggled on and off for
KVM_MEM_GUEST_MEMFD memslots.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/set_memory_region_test.c    | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index adce75720cc1..1fea8ff0fe74 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -556,6 +556,35 @@ static void test_add_overlapping_private_memory_regions(void)
 	close(memfd);
 	kvm_vm_free(vm);
 }
+
+static void test_private_memory_region_userfault(void)
+{
+	struct kvm_vm *vm;
+	int memfd;
+
+	pr_info("Testing toggling KVM_MEM_USERFAULT on KVM_MEM_GUEST_MEMFD memory regions\n");
+
+	vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM);
+
+	test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+	test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT,
+				   KVM_MEM_GUEST_MEMFD | KVM_MEM_USERFAULT,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+
+	close(memfd);
+
+	kvm_vm_free(vm);
+}
 #endif
 
 int main(int argc, char *argv[])
@@ -582,6 +611,7 @@ int main(int argc, char *argv[])
 	    (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
 		test_add_private_memory_region();
 		test_add_overlapping_private_memory_regions();
+		test_private_memory_region_userfault();
 	} else {
 		pr_info("Skipping tests for KVM_MEM_GUEST_MEMFD memory regions\n");
 	}
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v2 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details
From: James Houghton @ 2025-01-09 20:49 UTC
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, Bagas Sanjaya

Include the note about memory ordering when clearing bits in
userfault_bitmap, as it may not be obvious to users.

Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
 Documentation/virt/kvm/api.rst | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 454c2aaa155e..eec485dcf0bc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6281,7 +6281,8 @@ bounds checks apply (use common sense).
 	__u64 guest_memfd_offset;
 	__u32 guest_memfd;
 	__u32 pad1;
-	__u64 pad2[14];
+	__u64 userfault_bitmap;
+	__u64 pad2[13];
   };
 
 A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and
@@ -6297,6 +6298,25 @@ state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
 is '0' for all gfns.  Userspace can control whether memory is shared/private by
 toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
 
+When the KVM_MEM_USERFAULT flag is set, userfault_bitmap points to the starting
+address for the bitmap that controls if vCPU memory faults should immediately
+exit to userspace. If an invalid pointer is provided, at fault time, KVM_RUN
+will return -EFAULT. KVM_MEM_USERFAULT is only supported when
+KVM_CAP_USERFAULT is supported.
+
+userfault_bitmap should point to an array of longs where each bit in the array
+linearly corresponds to a single gfn. Bit 0 in userfault_bitmap corresponds to
+guest_phys_addr, bit 1 corresponds to guest_phys_addr + PAGE_SIZE, etc. If the
+bit for a page is set, any vCPU access to that page will exit to userspace with
+KVM_MEMORY_EXIT_FLAG_USERFAULT.
+
+Setting bits in userfault_bitmap has no effect on pages that have already
+been mapped by KVM until KVM_MEM_USERFAULT is disabled and re-enabled.
+
+Clearing bits in userfault_bitmap should usually be done with a store-release
+if changes to guest memory are being made available to the guest via
+userfault_bitmap.
+
 S390:
 ^^^^^
 
@@ -8251,6 +8271,17 @@ KVM exits with the register state of either the L1 or L2 guest
 depending on which executed at the time of an exit. Userspace must
 take care to differentiate between these cases.
 
+7.37 KVM_CAP_USERFAULT
+----------------------
+
+:Architectures: x86, arm64
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_SET_USER_MEMORY_REGION2 will
+accept KVM_MEM_USERFAULT as a valid memslot flag.
+
+See KVM_SET_USER_MEMORY_REGION2 for more details.
+
 8. Other capabilities.
 ======================
 
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 34+ messages in thread
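
For illustration, here is a minimal userspace sketch of the uAPI documented
above. The flag, struct, and ioctl names follow this series; the GPA/size
constants, the vm_fd/hva arguments, and the helper names are placeholders,
not part of the posted patches.

	#include <linux/kvm.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>

	#define GPA		0x10000000ULL
	#define MEM_SIZE	(512ULL * 4096)		/* 512 pages */
	#define PAGE_SZ		4096ULL
	#define BITS_PER_LONG	(8 * sizeof(unsigned long))

	static unsigned long bitmap[MEM_SIZE / PAGE_SZ / BITS_PER_LONG];

	/* Assumes KVM_CHECK_EXTENSION(KVM_CAP_USERFAULT) returned 1. */
	static void enable_userfault(int vm_fd, void *hva)
	{
		struct kvm_userspace_memory_region2 region = {
			.slot = 0,
			.flags = KVM_MEM_USERFAULT,
			.guest_phys_addr = GPA,
			.memory_size = MEM_SIZE,
			.userspace_addr = (uintptr_t)hva,
			.userfault_bitmap = (uintptr_t)bitmap,
		};

		/* All bits set: every vCPU fault in the slot exits to userspace. */
		memset(bitmap, 0xff, sizeof(bitmap));
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
	}

	/*
	 * Once a page's contents have been copied in, clear its bit with a
	 * store-release (per the documentation above) so the contents are
	 * visible to any vCPU that observes the cleared bit.
	 */
	static void page_ready(uint64_t gpa)
	{
		uint64_t bit = (gpa - GPA) / PAGE_SZ;

		__atomic_fetch_and(&bitmap[bit / BITS_PER_LONG],
				   ~(1UL << (bit % BITS_PER_LONG)),
				   __ATOMIC_RELEASE);
	}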

* Re: [PATCH v2 00/13] KVM: Introduce KVM Userfault
  2025-01-09 20:49 [PATCH v2 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (12 preceding siblings ...)
  2025-01-09 20:49 ` [PATCH v2 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details James Houghton
@ 2025-05-06 23:48 ` Sean Christopherson
  2025-05-07  0:13 ` Sean Christopherson
  14 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2025-05-06 23:48 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 09, 2025, James Houghton wrote:
> James Houghton (13):
>   KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
>   KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
>   KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
>   KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION

Starting with some series-wide feedback, the granularity of these first few
patches is too fine.  I normally like to split things up, but honestly, this is
such a small feature that I don't see much point in separating the uAPI from the
infrastructure.

To avoid cyclical dependencies between common KVM and arch code, we can do all
the prep, but not fully enable+advertise support on any architecture until all
targeted architectures are fully ready.

In other words, I think we should squish these into one patch, minus this bit at
the very end of the series (spoiler alert):

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ce7bf5de6d72..0106d6d461a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1545,6 +1545,9 @@ static int check_memory_region_flags(struct kvm *kvm,
            !(mem->flags & KVM_MEM_GUEST_MEMFD))
                valid_flags |= KVM_MEM_READONLY;
 
+       if (IS_ENABLED(CONFIG_KVM_GENERIC_PAGE_FAULT))
+               valid_flags |= KVM_MEM_USERFAULT;
+
        if (mem->flags & ~valid_flags)
                return -EINVAL;
 
@@ -4824,6 +4827,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
        case KVM_CAP_CHECK_EXTENSION_VM:
        case KVM_CAP_ENABLE_CAP_VM:
        case KVM_CAP_HALT_POLL:
+#ifdef CONFIG_KVM_GENERIC_PAGE_FAULT
+       case KVM_CAP_USERFAULT:
+#endif
                return 1;
 #ifdef CONFIG_KVM_MMIO
        case KVM_CAP_COALESCED_MMIO:

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  2025-01-09 20:49 ` [PATCH v2 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
@ 2025-05-07  0:01   ` Sean Christopherson
  2025-05-28 15:21     ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-07  0:01 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 09, 2025, James Houghton wrote:
> Use one of the 14 reserved u64s in struct kvm_userspace_memory_region2
> for the user to provide `userfault_bitmap`.
> 
> The memslot flag indicates if KVM should be reading from the
> `userfault_bitmap` field from the memslot. The user is permitted to
> provide a bogus pointer. If the pointer cannot be read from, we will
> return -EFAULT (with no other information) back to the user.

For the uAPI+infrastructure changelog, please elaborate on the design goals and
choices.  The "what" is pretty obvious from the patch; describe why this is being
added.

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/kvm_host.h | 14 ++++++++++++++
>  include/uapi/linux/kvm.h |  4 +++-
>  virt/kvm/Kconfig         |  3 +++
>  virt/kvm/kvm_main.c      | 35 +++++++++++++++++++++++++++++++++++
>  4 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 401439bb21e3..f7a3dfd5e224 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -590,6 +590,7 @@ struct kvm_memory_slot {
>  	unsigned long *dirty_bitmap;
>  	struct kvm_arch_memory_slot arch;
>  	unsigned long userspace_addr;
> +	unsigned long __user *userfault_bitmap;
>  	u32 flags;
>  	short id;
>  	u16 as_id;
> @@ -724,6 +725,11 @@ static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
>  }
>  #endif
>  
> +static inline bool kvm_has_userfault(struct kvm *kvm)
> +{
> +	return IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT);
> +}

Eh, don't think we need this wrapper.  Just check the CONFIG_xxx manually in the
one or two places where code isn't guarded by an #ifdef.

>  struct kvm_memslots {
>  	u64 generation;
>  	atomic_long_t last_used_slot;
> @@ -2553,4 +2559,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range);
>  #endif
>  
> +int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +		      gfn_t gfn);
> +
> +static inline bool kvm_memslot_userfault(struct kvm_memory_slot *memslot)

I strongly prefer kvm_is_userfault_memslot().  KVM's weird kvm_memslot_<flag>()
nomenclature comes from ancient code, i.e. isn't something I would follow.

> +{
> +	return memslot->flags & KVM_MEM_USERFAULT;

I think it's worth checking for a non-NULL memslot, even if all current callers
pre-check for a slot.

> @@ -2042,6 +2051,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		if (r)
>  			goto out;
>  	}
> +	if (mem->flags & KVM_MEM_USERFAULT)
> +		new->userfault_bitmap =
> +		  (unsigned long __user *)(unsigned long)mem->userfault_bitmap;

	if (mem->flags & KVM_MEM_USERFAULT)
		new->userfault_bitmap = u64_to_user_ptr(mem->userfault_bitmap);

>  	r = kvm_set_memslot(kvm, old, new, change);
>  	if (r)
> @@ -6426,3 +6438,26 @@ void kvm_exit(void)
>  	kvm_irqfd_exit();
>  }
>  EXPORT_SYMBOL_GPL(kvm_exit);
> +
> +int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +		       gfn_t gfn)

I think this series is the perfect opportunity (read: victim) to introduce a
common "struct kvm_page_fault".  With a common structure to provide the gfn, slot,
write, exec, and is_private fields, this helper can handle the checks and the call
to kvm_prepare_memory_fault_exit().

And with that in place, I would vote to name this something like kvm_do_userfault(),
return a boolean, and let the caller return -EFAULT.

For making "struct kvm_page_fault" common, one thought would be to have arch code
define the entire struct, and simply assert on the few fields that common KVM needs
being defined by arch code.  And wrap all references in CONFIG_KVM_GENERIC_PAGE_FAULT.

I don't expect there will be a huge number of fields that common KVM needs, i.e. I
don't think the maintenance burden of punting to arch code will be high.  And letting
arch code own the entire struct means we don't need to have e.g. fault->arch.present
vs. fault->write in KVM x86, which to me is a big net negative for readability.

I'll respond to the cover letter with an attachment of seven patches to sketch out
the idea.

> +{
> +	unsigned long bitmap_chunk = 0;
> +	off_t offset;
> +
> +	if (!kvm_memslot_userfault(memslot))
> +		return 0;
> +
> +	if (WARN_ON_ONCE(!memslot->userfault_bitmap))
> +		return 0;

'0' is technically a valid userspace address.  I'd just drop this.  If we have a
KVM bug that results in failure to generate usefaults, we'll notice quite quickly.

> +
> +	offset = gfn - memslot->base_gfn;
> +
> +	if (copy_from_user(&bitmap_chunk,
> +			   memslot->userfault_bitmap + offset / BITS_PER_LONG,
> +			   sizeof(bitmap_chunk)))

Since the address is checked during memslot creation, I'm pretty sure this can
use __get_user().  At the very least, it should be get_user().

> +		return -EFAULT;
> +
> +	/* Set in the bitmap means that the gfn is userfault */
> +	return !!(bitmap_chunk & (1ul << (offset % BITS_PER_LONG)));

test_bit()?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
  2025-01-09 20:49 ` [PATCH v2 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot James Houghton
@ 2025-05-07  0:03   ` Sean Christopherson
  0 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2025-05-07  0:03 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 09, 2025, James Houghton wrote:
> Currently guest_memfd memslots can only be deleted. Slightly change the
> logic to allow KVM_MR_FLAGS_ONLY changes when the only flag being
> changed is KVM_MEM_USERFAULT.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  virt/kvm/kvm_main.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4bceae6a6401..882c1f7b4aa8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2015,9 +2015,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>  			return -EINVAL;
>  	} else { /* Modify an existing slot. */
> -		/* Private memslots are immutable, they can only be deleted. */
> -		if (mem->flags & KVM_MEM_GUEST_MEMFD)
> -			return -EINVAL;
>  		if ((mem->userspace_addr != old->userspace_addr) ||
>  		    (npages != old->npages) ||
>  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2031,6 +2028,16 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  			return 0;
>  	}
>  
> +	/*
> +	 * Except for being able to set KVM_MEM_USERFAULT, private memslots are
> +	 * immutable, they can only be deleted.
> +	 */
> +	if (mem->flags & KVM_MEM_GUEST_MEMFD &&
> +	    !(change == KVM_MR_CREATE ||
> +	      (change == KVM_MR_FLAGS_ONLY &&
> +	       (mem->flags ^ old->flags) == KVM_MEM_USERFAULT)))
> +		return -EINVAL;

Oof.  I don't even want to decipher this.  Let's just drop the blanket immutable
restriction, and simply say guest_memfd slots can't be MOVED.  guest_memfd doesn't
support RO memslots (and never will), and doesn't support dirty logging, so this
is effectively dead code anyways.  Then the diff is much nicer:

@@ -2012,14 +2027,15 @@ static int kvm_set_memory_region(struct kvm *kvm,
                if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
                        return -EINVAL;
        } else { /* Modify an existing slot. */
-               /* Private memslots are immutable, they can only be deleted. */
-               if (mem->flags & KVM_MEM_GUEST_MEMFD)
-                       return -EINVAL;
                if ((mem->userspace_addr != old->userspace_addr) ||
                    (npages != old->npages) ||
                    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
                        return -EINVAL;
 
+               /* Moving a guest_memfd memslot isn't supported. */
+               if (base_gfn != old->base_gfn && mem->flags & KVM_MEM_GUEST_MEMFD)
+                       return -EINVAL;
+
                if (base_gfn != old->base_gfn)
                        change = KVM_MR_MOVE;
                else if (mem->flags != old->flags)

> +
>  	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
>  	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
>  		return -EEXIST;
> @@ -2046,7 +2053,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	new->npages = npages;
>  	new->flags = mem->flags;
>  	new->userspace_addr = mem->userspace_addr;
> -	if (mem->flags & KVM_MEM_GUEST_MEMFD) {
> +	if (mem->flags & KVM_MEM_GUEST_MEMFD && change == KVM_MR_CREATE) {
>  		r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
>  		if (r)
>  			goto out;
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  2025-01-09 20:49 ` [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT James Houghton
@ 2025-05-07  0:05   ` Sean Christopherson
  2025-05-28 20:21     ` Oliver Upton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-07  0:05 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 09, 2025, James Houghton wrote:
> Adhering to the requirements of KVM Userfault:
> 
> 1. Zap all sptes for the memslot when KVM_MEM_USERFAULT is toggled on
>    with kvm_arch_flush_shadow_memslot().
> 2. Only allow PAGE_SIZE sptes when KVM_MEM_USERFAULT is enabled (for both
>    normal/GUP memory and guest_memfd memory).
> 3. Reconstruct huge mappings when KVM_MEM_USERFAULT is toggled off with
>    kvm_mmu_recover_huge_pages(). This is the behavior when dirty logging
>    is disabled; remain consistent with it.
> 
> With the new logic in kvm_mmu_slot_apply_flags(), I've simplified the
> two dirty-logging-toggle checks into one, and I have dropped the
> WARN_ON() that was there.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/Kconfig            |  1 +
>  arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++++++++++++----
>  arch/x86/kvm/mmu/mmu_internal.h | 20 +++++++++++++++---
>  arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++---------
>  include/linux/kvm_host.h        |  5 ++++-
>  5 files changed, 71 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ea2c4f21c1ca..286c6825cd1c 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -47,6 +47,7 @@ config KVM_X86
>  	select KVM_GENERIC_PRE_FAULT_MEMORY
>  	select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
>  	select KVM_WERROR if WERROR
> +	select HAVE_KVM_USERFAULT
>  
>  config KVM
>  	tristate "Kernel-based Virtual Machine (KVM) support"
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2401606db260..5cab2785b97f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4280,14 +4280,19 @@ static inline u8 kvm_max_level_for_order(int order)
>  	return PG_LEVEL_4K;
>  }
>  
> -static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
> -					u8 max_level, int gmem_order)
> +static u8 kvm_max_private_mapping_level(struct kvm *kvm,
> +					struct kvm_memory_slot *slot,
> +					kvm_pfn_t pfn, u8 max_level,
> +					int gmem_order)
>  {
>  	u8 req_max_level;
>  
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>  
> +	if (kvm_memslot_userfault(slot))

Unless I'm missing something, this can go in kvm_mmu_hugepage_adjust():

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a4439e9e0726..49eb6b9b268c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3304,7 +3304,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
        if (is_error_noslot_pfn(fault->pfn))
                return;
 
-       if (kvm_slot_dirty_track_enabled(slot))
+       if (kvm_slot_dirty_track_enabled(slot) || kvm_is_userfault_memslot(slot))
                return;
 
>  static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b04092ec76a..2abb425a6514 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13053,12 +13053,36 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  	u32 new_flags = new ? new->flags : 0;
>  	bool log_dirty_pages = new_flags & KVM_MEM_LOG_DIRTY_PAGES;
>  
> +	/*
> +	 * When toggling KVM Userfault on, zap all sptes so that userfault-ness
> +	 * will be respected at refault time. All new faults will only install
> +	 * small sptes. Therefore, when toggling it off, recover hugepages.
> +	 *
> +	 * For MOVE and DELETE, there will be nothing to do, as the old
> +	 * mappings will have already been deleted by
> +	 * kvm_arch_flush_shadow_memslot().
> +	 *
> +	 * For CREATE, no mappings will have been created yet.
> +	 */

Eh, trim this down and reference the comment below to explain why FLAGS_ONLY
is the only case that needs to be handled.

> +	if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
> +	    (change == KVM_MR_FLAGS_ONLY)) {
> +		if (old_flags & KVM_MEM_USERFAULT)
> +			kvm_mmu_recover_huge_pages(kvm, new);
> +		else
> +			kvm_arch_flush_shadow_memslot(kvm, old);

The call to kvm_arch_flush_shadow_memslot() should definitely go in common code.
The fancy recovery logic is arch specific, but blasting the memslot when userfault
is toggled on is not.

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-01-09 20:49 ` [PATCH v2 06/13] KVM: arm64: " James Houghton
@ 2025-05-07  0:06   ` Sean Christopherson
  2025-05-28 15:09     ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-07  0:06 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 09, 2025, James Houghton wrote:
> @@ -2073,6 +2080,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  				   enum kvm_mr_change change)
>  {
>  	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +	u32 new_flags = new ? new->flags : 0;
> +	u32 changed_flags = (new_flags) ^ (old ? old->flags : 0);

This is a bit hard to read, and there's only one use of log_dirty_pages.  With
zapping handled in common KVM, just do:

@@ -2127,14 +2131,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
                                   const struct kvm_memory_slot *new,
                                   enum kvm_mr_change change)
 {
-       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+       u32 old_flags = old ? old->flags : 0;
+       u32 new_flags = new ? new->flags : 0;
+
+       /* Nothing to do if not toggling dirty logging. */
+       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+               return;
 
        /*
         * At this point memslot has been committed and there is an
         * allocated dirty_bitmap[], dirty pages will be tracked while the
         * memory slot is write protected.
         */
-       if (log_dirty_pages) {
+       if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
 
                if (change == KVM_MR_DELETE)
                        return;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 00/13] KVM: Introduce KVM Userfault
  2025-01-09 20:49 [PATCH v2 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (13 preceding siblings ...)
  2025-05-06 23:48 ` [PATCH v2 00/13] KVM: Introduce KVM Userfault Sean Christopherson
@ 2025-05-07  0:13 ` Sean Christopherson
  2025-05-28 15:48   ` James Houghton
  14 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-07  0:13 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

[-- Attachment #1: Type: text/plain, Size: 2521 bytes --]

On Thu, Jan 09, 2025, James Houghton wrote:
>   KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
>   KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
>   KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
>   KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
>   KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
>   KVM: arm64: Add support for KVM_MEM_USERFAULT
>   KVM: selftests: Fix vm_mem_region_set_flags docstring
>   KVM: selftests: Fix prefault_mem logic
>   KVM: selftests: Add va_start/end into uffd_desc
>   KVM: selftests: Add KVM Userfault mode to demand_paging_test
>   KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
>   KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
>   KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT
>     details
> 
>  Documentation/virt/kvm/api.rst                |  33 +++-
>  arch/arm64/kvm/Kconfig                        |   1 +
>  arch/arm64/kvm/mmu.c                          |  26 +++-
>  arch/x86/kvm/Kconfig                          |   1 +
>  arch/x86/kvm/mmu/mmu.c                        |  27 +++-
>  arch/x86/kvm/mmu/mmu_internal.h               |  20 ++-
>  arch/x86/kvm/x86.c                            |  36 +++--
>  include/linux/kvm_host.h                      |  19 ++-
>  include/uapi/linux/kvm.h                      |   6 +-
>  .../selftests/kvm/demand_paging_test.c        | 145 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   5 +
>  .../selftests/kvm/include/userfaultfd_util.h  |   2 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  42 ++++-
>  .../selftests/kvm/lib/userfaultfd_util.c      |   2 +
>  .../selftests/kvm/set_memory_region_test.c    |  33 ++++
>  virt/kvm/Kconfig                              |   3 +
>  virt/kvm/kvm_main.c                           |  54 ++++++-
>  17 files changed, 419 insertions(+), 36 deletions(-)

I didn't look at the selftests changes, but nothing in this series scares me.  We
bikeshedded most of this to death in the "exit on missing" series, so for me at
least, the only real question is whether or not we want to add the uAPI.  AFAIK,
this is the best proposal for post-copy guest_memfd support (and not just because it's
the only proposal :-D).

So... yes?

Attached are a variation on the series using the common "struct kvm_page_fault"
idea.  The documentation change could be squashed with the final enablement patch.

Compile tested only.  I would not be the least bit surprised if I completely
butchered something.

[-- Attachment #2: 0001-KVM-x86-mmu-Move-struct-kvm_page_fault-definition-to.patch --]
[-- Type: text/x-diff, Size: 5054 bytes --]

From 763674ca414c8d54f914b21b9d86ba6bb304d294 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 6 May 2025 14:40:54 -0700
Subject: [PATCH 1/7] KVM: x86/mmu: Move "struct kvm_page_fault" definition to
 asm/kvm_host.h

Make "struct kvm_page_fault" globally visible via asm/kvm_host.h so that
the structure can be referenced by common KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 68 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h | 67 --------------------------------
 2 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4c27f213ea55..ae61a4687d38 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -443,7 +443,73 @@ struct kvm_mmu_root_info {
 #define KVM_HAVE_MMU_RWLOCK
 
 struct kvm_mmu_page;
-struct kvm_page_fault;
+
+struct kvm_page_fault {
+	/* arguments to kvm_mmu_do_page_fault.  */
+	const gpa_t addr;
+	const u64 error_code;
+	const bool prefetch;
+
+	/* Derived from error_code.  */
+	const bool exec;
+	const bool write;
+	const bool present;
+	const bool rsvd;
+	const bool user;
+
+	/* Derived from mmu and global state.  */
+	const bool is_tdp;
+	const bool is_private;
+	const bool nx_huge_page_workaround_enabled;
+
+	/*
+	 * Whether a >4KB mapping can be created or is forbidden due to NX
+	 * hugepages.
+	 */
+	bool huge_page_disallowed;
+
+	/*
+	 * Maximum page size that can be created for this fault; input to
+	 * FNAME(fetch), direct_map() and kvm_tdp_mmu_map().
+	 */
+	u8 max_level;
+
+	/*
+	 * Page size that can be created based on the max_level and the
+	 * page size used by the host mapping.
+	 */
+	u8 req_level;
+
+	/*
+	 * Page size that will be created based on the req_level and
+	 * huge_page_disallowed.
+	 */
+	u8 goal_level;
+
+	/*
+	 * Shifted addr, or result of guest page table walk if addr is a gva. In
+	 * the case of VM where memslot's can be mapped at multiple GPA aliases
+	 * (i.e. TDX), the gfn field does not contain the bit that selects between
+	 * the aliases (i.e. the shared bit for TDX).
+	 */
+	gfn_t gfn;
+
+	/* The memslot containing gfn. May be NULL. */
+	struct kvm_memory_slot *slot;
+
+	/* Outputs of kvm_mmu_faultin_pfn().  */
+	unsigned long mmu_seq;
+	kvm_pfn_t pfn;
+	struct page *refcounted_page;
+	bool map_writable;
+
+	/*
+	 * Indicates the guest is trying to write a gfn that contains one or
+	 * more of the PTEs used to translate the write itself, i.e. the access
+	 * is changing its own translation in the guest page tables.
+	 */
+	bool write_fault_to_shadow_pgtable;
+};
 
 /*
  * x86 supports 4 paging modes (5-level 64-bit, 4-level 64-bit, 3-level 32-bit,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..384fc4d0bfec 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -230,73 +230,6 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
 	return READ_ONCE(nx_huge_pages) && !kvm->arch.disable_nx_huge_pages;
 }
 
-struct kvm_page_fault {
-	/* arguments to kvm_mmu_do_page_fault.  */
-	const gpa_t addr;
-	const u64 error_code;
-	const bool prefetch;
-
-	/* Derived from error_code.  */
-	const bool exec;
-	const bool write;
-	const bool present;
-	const bool rsvd;
-	const bool user;
-
-	/* Derived from mmu and global state.  */
-	const bool is_tdp;
-	const bool is_private;
-	const bool nx_huge_page_workaround_enabled;
-
-	/*
-	 * Whether a >4KB mapping can be created or is forbidden due to NX
-	 * hugepages.
-	 */
-	bool huge_page_disallowed;
-
-	/*
-	 * Maximum page size that can be created for this fault; input to
-	 * FNAME(fetch), direct_map() and kvm_tdp_mmu_map().
-	 */
-	u8 max_level;
-
-	/*
-	 * Page size that can be created based on the max_level and the
-	 * page size used by the host mapping.
-	 */
-	u8 req_level;
-
-	/*
-	 * Page size that will be created based on the req_level and
-	 * huge_page_disallowed.
-	 */
-	u8 goal_level;
-
-	/*
-	 * Shifted addr, or result of guest page table walk if addr is a gva. In
-	 * the case of VM where memslot's can be mapped at multiple GPA aliases
-	 * (i.e. TDX), the gfn field does not contain the bit that selects between
-	 * the aliases (i.e. the shared bit for TDX).
-	 */
-	gfn_t gfn;
-
-	/* The memslot containing gfn. May be NULL. */
-	struct kvm_memory_slot *slot;
-
-	/* Outputs of kvm_mmu_faultin_pfn().  */
-	unsigned long mmu_seq;
-	kvm_pfn_t pfn;
-	struct page *refcounted_page;
-	bool map_writable;
-
-	/*
-	 * Indicates the guest is trying to write a gfn that contains one or
-	 * more of the PTEs used to translate the write itself, i.e. the access
-	 * is changing its own translation in the guest page tables.
-	 */
-	bool write_fault_to_shadow_pgtable;
-};
-
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
 /*

base-commit: 94da2b969670d100730b5537f20523e49e989920
-- 
2.49.0.967.g6a0df3ecc3-goog


[-- Attachment #3: 0002-KVM-arm64-Add-struct-kvm_page_fault-to-gather-common.patch --]
[-- Type: text/x-diff, Size: 5301 bytes --]

From 20b0bb4681a05d36c52fce90c932cd239b604d42 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 6 May 2025 14:43:11 -0700
Subject: [PATCH 2/7] KVM: arm64: Add "struct kvm_page_fault" to gather common
 fault variables

Introduce "struct kvm_page_fault" and use it in user_mem_abort() in lieu
of a collection of local variables.  Providing "struct kvm_page_fault"
will allow common KVM to provide APIs to take in said structure, e.g. when
preparing memory fault exits.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  9 +++++++++
 arch/arm64/kvm/mmu.c              | 32 +++++++++++++++++--------------
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 08ba91e6fb03..50a04b86baaa 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -403,6 +403,15 @@ struct kvm_vcpu_fault_info {
 	u64 disr_el1;		/* Deferred [SError] Status Register */
 };
 
+struct kvm_page_fault {
+	const bool exec;
+	const bool write;
+	const bool is_private;
+
+	gfn_t gfn;
+	struct kvm_memory_slot *slot;
+};
+
 /*
  * VNCR() just places the VNCR_capable registers in the enum after
  * __VNCR_START__, and the value (after correction) to be an 8-byte offset
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 754f2fe0cc67..c5d21bcfa3ed 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1472,8 +1472,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  bool fault_is_perm)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
-	bool exec_fault, mte_allowed;
+	bool writable, force_pte = false;
+	bool mte_allowed;
 	bool device = false, vfio_allow_any_uc = false;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
@@ -1481,7 +1481,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct vm_area_struct *vma;
 	short vma_shift;
 	void *memcache;
-	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
 	long vma_pagesize, fault_granule;
@@ -1490,13 +1489,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct page *page;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
 
+	struct kvm_page_fault fault = {
+		.write = kvm_is_write_fault(vcpu),
+		.exec = kvm_vcpu_trap_is_exec_fault(vcpu),
+
+		.slot = memslot,
+	};
+
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
-	VM_BUG_ON(write_fault && exec_fault);
+	VM_BUG_ON(fault.write && fault.exec);
 
-	if (fault_is_perm && !write_fault && !exec_fault) {
+	if (fault_is_perm && !fault.write && !fault.exec) {
 		kvm_err("Unexpected L2 read permission error\n");
 		return -EFAULT;
 	}
@@ -1507,7 +1511,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * only exception to this is when dirty logging is enabled at runtime
 	 * and a write fault needs to collapse a block entry into a table.
 	 */
-	if (!fault_is_perm || (logging_active && write_fault)) {
+	if (!fault_is_perm || (logging_active && fault.write)) {
 		int min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu);
 
 		if (!is_protected_kvm_enabled()) {
@@ -1607,7 +1611,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		ipa &= ~(vma_pagesize - 1);
 	}
 
-	gfn = ipa >> PAGE_SHIFT;
+	fault.gfn = ipa >> PAGE_SHIFT;
 	mte_allowed = kvm_vma_mte_allowed(vma);
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
@@ -1626,7 +1630,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
-	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
+	pfn = __kvm_faultin_pfn(memslot, fault.gfn, fault.write ? FOLL_WRITE : 0,
 				&writable, &page);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
@@ -1647,7 +1651,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		 * change things at the last minute.
 		 */
 		device = true;
-	} else if (logging_active && !write_fault) {
+	} else if (logging_active && !fault.write) {
 		/*
 		 * Only actually map the page as writable if this was a write
 		 * fault.
@@ -1655,7 +1659,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		writable = false;
 	}
 
-	if (exec_fault && device)
+	if (fault.exec && device)
 		return -ENOEXEC;
 
 	/*
@@ -1715,7 +1719,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
-	if (exec_fault)
+	if (fault.exec)
 		prot |= KVM_PGTABLE_PROT_X;
 
 	if (device) {
@@ -1752,7 +1756,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret)
-		mark_page_dirty_in_slot(kvm, memslot, gfn);
+		mark_page_dirty_in_slot(kvm, memslot, fault.gfn);
 
 	return ret != -EAGAIN ? ret : 0;
 }
-- 
2.49.0.967.g6a0df3ecc3-goog


[-- Attachment #4: 0003-KVM-arm64-x86-Require-struct-kvm_page_fault-for-memo.patch --]
[-- Type: text/x-diff, Size: 6451 bytes --]

From 44ec300f1e47fce3ac3893d5fbd8834705db8d58 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 6 May 2025 14:49:12 -0700
Subject: [PATCH 3/7] KVM: arm64: x86: Require "struct kvm_page_fault" for
 memory fault exits

Now that both arm64 and x86 define "struct kvm_page_fault" with a base set
of fields, rework kvm_prepare_memory_fault_exit() to take a kvm_page_fault
structure instead of passing in a pile of parameters.  Guard the related
code with CONFIG_KVM_GENERIC_PAGE_FAULT to play nice with architectures
that don't yet support kvm_page_fault.

Rather than define a common kvm_page_fault and kvm_arch_page_fault child,
simply assert that the handful of required fields are provided by the
arch-defined structure.  Unlike vCPU and VMs, the number of common fields
is expected to be small, and letting arch code fully define the structure
allows for maximum flexibility with respect to const, layout, etc.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/Kconfig          |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          |  8 ++++----
 arch/x86/kvm/mmu/mmu_internal.h | 10 +---------
 include/linux/kvm_host.h        | 26 ++++++++++++++++++++------
 virt/kvm/Kconfig                |  3 +++
 6 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 096e45acadb2..35b18f77afc4 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select KVM_GENERIC_PAGE_FAULT
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 2eeffcec5382..2d5966f15738 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM_X86
 	select KVM_GENERIC_PRE_FAULT_MEMORY
 	select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
 	select KVM_WERROR if WERROR
+	select KVM_GENERIC_PAGE_FAULT
 
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..a4439e9e0726 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3429,7 +3429,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
 	gva_t gva = fault->is_tdp ? 0 : fault->addr;
 
 	if (fault->is_private) {
-		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		kvm_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}
 
@@ -4499,14 +4499,14 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	int max_order, r;
 
 	if (!kvm_slot_can_be_private(fault->slot)) {
-		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		kvm_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}
 
 	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
 			     &fault->refcounted_page, &max_order);
 	if (r) {
-		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		kvm_prepare_memory_fault_exit(vcpu, fault);
 		return r;
 	}
 
@@ -4586,7 +4586,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * private vs. shared mismatch.
 	 */
 	if (fault->is_private != kvm_mem_is_private(kvm, fault->gfn)) {
-		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		kvm_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 384fc4d0bfec..c15060ed6e8b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -269,14 +269,6 @@ enum {
  */
 static_assert(RET_PF_CONTINUE == 0);
 
-static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
-						     struct kvm_page_fault *fault)
-{
-	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
-				      PAGE_SIZE, fault->write, fault->exec,
-				      fault->is_private);
-}
-
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u64 err, bool prefetch,
 					int *emulation_type, u8 *level)
@@ -329,7 +321,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 */
 	if (r == RET_PF_EMULATE && fault.is_private) {
 		pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
-		kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
+		kvm_prepare_memory_fault_exit(vcpu, &fault);
 		return -EFAULT;
 	}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c685fb417e92..adece3cbfb02 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2493,20 +2493,34 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_KVM_GENERIC_PAGE_FAULT
+
+#define KVM_ASSERT_TYPE_IS(type_t, x)					\
+do {									\
+	type_t __maybe_unused tmp;					\
+									\
+	BUILD_BUG_ON(!__types_ok(tmp, x) || !__typecheck(tmp, x));	\
} while (0)
+
 static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
-						 gpa_t gpa, gpa_t size,
-						 bool is_write, bool is_exec,
-						 bool is_private)
+						 struct kvm_page_fault *fault)
 {
+	KVM_ASSERT_TYPE_IS(gfn_t, fault->gfn);
+	KVM_ASSERT_TYPE_IS(bool, fault->exec);
+	KVM_ASSERT_TYPE_IS(bool, fault->write);
+	KVM_ASSERT_TYPE_IS(bool, fault->is_private);
+	KVM_ASSERT_TYPE_IS(struct kvm_memory_slot *, fault->slot);
+
 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
-	vcpu->run->memory_fault.gpa = gpa;
-	vcpu->run->memory_fault.size = size;
+	vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
+	vcpu->run->memory_fault.size = PAGE_SIZE;
 
 	/* RWX flags are not (yet) defined or communicated to userspace. */
 	vcpu->run->memory_fault.flags = 0;
-	if (is_private)
+	if (fault->is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
+#endif
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 727b542074e7..28ed6b241578 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -128,3 +128,6 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config KVM_GENERIC_PAGE_FAULT
+       bool
-- 
2.49.0.967.g6a0df3ecc3-goog


[-- Attachment #5: 0004-KVM-Add-common-infrastructure-for-KVM-Userfaults.patch --]
[-- Type: text/x-diff, Size: 5763 bytes --]

From 3fe5acc137b6b3ec7b87dcd266084f0072db934a Mon Sep 17 00:00:00 2001
From: James Houghton <jthoughton@google.com>
Date: Tue, 6 May 2025 15:35:24 -0700
Subject: [PATCH 4/7] KVM: Add common infrastructure for KVM Userfaults

<to be written by James>

Signed-off-by: James Houghton <jthoughton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h |  9 +++++++
 include/uapi/linux/kvm.h |  5 +++-
 virt/kvm/kvm_main.c      | 52 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index adece3cbfb02..73e6ec4eae78 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -597,6 +597,7 @@ struct kvm_memory_slot {
 	unsigned long *dirty_bitmap;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
+	unsigned long __user *userfault_bitmap;
 	u32 flags;
 	short id;
 	u16 as_id;
@@ -2520,6 +2521,14 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	if (fault->is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
+
+bool kvm_do_userfault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
+static inline bool kvm_is_userfault_memslot(struct kvm_memory_slot *memslot)
+{
+	return memslot && memslot->flags & KVM_MEM_USERFAULT;
+}
+
 #endif
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c6988e2c68d5..af1fc86ddbe0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -40,7 +40,8 @@ struct kvm_userspace_memory_region2 {
 	__u64 guest_memfd_offset;
 	__u32 guest_memfd;
 	__u32 pad1;
-	__u64 pad2[14];
+	__u64 userfault_bitmap;
+	__u64 pad2[13];
 };
 
 /*
@@ -51,6 +52,7 @@ struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_USERFAULT	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -443,6 +445,7 @@ struct kvm_run {
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+#define KVM_MEMORY_EXIT_FLAG_USERFAULT	(1ULL << 4)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 16fe54cf2808..ca08075a9b7b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1688,6 +1688,15 @@ static void kvm_commit_memory_region(struct kvm *kvm,
 		if (old->dirty_bitmap && !new->dirty_bitmap)
 			kvm_destroy_dirty_bitmap(old);
 
+		/*
+		 * If userfault is being enabled for the slot, assume userspace
+		 * wants to intercept the majority of accesses, and simply drop
+		 * all stage-2 mappings for the slot.
+		 */
+		if (!(old_flags & KVM_MEM_USERFAULT) &&
+		    (new_flags & KVM_MEM_USERFAULT))
+			kvm_arch_flush_shadow_memslot(kvm, old);
+
 		/*
 		 * The final quirk.  Free the detached, old slot, but only its
 		 * memory, not any metadata.  Metadata, including arch specific
@@ -1980,6 +1989,12 @@ static int kvm_set_memory_region(struct kvm *kvm,
 	if (id < KVM_USER_MEM_SLOTS &&
 	    (mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_USERFAULT &&
+	    ((mem->userfault_bitmap != untagged_addr(mem->userfault_bitmap)) ||
+	     !access_ok((void __user *)(unsigned long)mem->userfault_bitmap,
+			DIV_ROUND_UP(mem->memory_size >> PAGE_SHIFT, BITS_PER_LONG)
+			 * sizeof(long))))
+		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
 
@@ -2012,14 +2027,15 @@ static int kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
-		/* Private memslots are immutable, they can only be deleted. */
-		if (mem->flags & KVM_MEM_GUEST_MEMFD)
-			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
 			return -EINVAL;
 
+		/* Moving a guest_memfd memslot isn't supported. */
+		if (base_gfn != old->base_gfn && mem->flags & KVM_MEM_GUEST_MEMFD)
+			return -EINVAL;
+
 		if (base_gfn != old->base_gfn)
 			change = KVM_MR_MOVE;
 		else if (mem->flags != old->flags)
@@ -2043,11 +2059,13 @@ static int kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
-	if (mem->flags & KVM_MEM_GUEST_MEMFD) {
+	if (mem->flags & KVM_MEM_GUEST_MEMFD && change == KVM_MR_CREATE) {
 		r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
 		if (r)
 			goto out;
 	}
+	if (mem->flags & KVM_MEM_USERFAULT)
+		new->userfault_bitmap = u64_to_user_ptr(mem->userfault_bitmap);
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
@@ -4921,6 +4939,32 @@ static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
 	return cleared;
 }
 
+#ifdef CONFIG_KVM_GENERIC_PAGE_FAULT
+bool kvm_do_userfault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	struct kvm_memory_slot *slot = fault->slot;
+	unsigned long __user *user_chunk;
+	unsigned long chunk;
+	gfn_t offset;
+
+	if (!kvm_is_userfault_memslot(slot))
+		return false;
+
+	offset = fault->gfn - slot->base_gfn;
+	user_chunk = slot->userfault_bitmap + (offset / BITS_PER_LONG);
+
+	if (__get_user(chunk, user_chunk))
+		return true;
+
+	if (!test_bit(offset % BITS_PER_LONG, &chunk))
+		return false;
+
+	kvm_prepare_memory_fault_exit(vcpu, fault);
+	vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_USERFAULT;
+	return true;
+}
+#endif
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
-- 
2.49.0.967.g6a0df3ecc3-goog


[-- Attachment #6: 0005-KVM-x86-Add-support-for-KVM-userfault-exits.patch --]
[-- Type: text/x-diff, Size: 3084 bytes --]

From e8cf843428aadfcec342f5e84ba4fd6ad7f16aa1 Mon Sep 17 00:00:00 2001
From: James Houghton <jthoughton@google.com>
Date: Tue, 6 May 2025 15:37:14 -0700
Subject: [PATCH 5/7] KVM: x86: Add support for KVM userfault exits

<to be written by James>

Signed-off-by: James Houghton <jthoughton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c |  5 ++++-
 arch/x86/kvm/x86.c     | 27 +++++++++++++++++----------
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a4439e9e0726..49eb6b9b268c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3304,7 +3304,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (is_error_noslot_pfn(fault->pfn))
 		return;
 
-	if (kvm_slot_dirty_track_enabled(slot))
+	if (kvm_slot_dirty_track_enabled(slot) || kvm_is_userfault_memslot(slot))
 		return;
 
 	/*
@@ -4522,6 +4522,9 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 {
 	unsigned int foll = fault->write ? FOLL_WRITE : 0;
 
+	if (kvm_do_userfault(vcpu, fault))
+		return -EFAULT;
+
 	if (fault->is_private)
 		return kvm_mmu_faultin_pfn_private(vcpu, fault);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 75c0a934556d..4f9edda47782 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13150,12 +13150,27 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	u32 new_flags = new ? new->flags : 0;
 	bool log_dirty_pages = new_flags & KVM_MEM_LOG_DIRTY_PAGES;
 
+	/*
+	 * Recover hugepages when userfault is toggled off, as KVM forces 4KiB
+	 * mappings when userfault is enabled.  See below for why CREATE, MOVE,
+	 * and DELETE don't need special handling.  Note, common KVM handles
+	 * zapping SPTEs when userfault is toggled on.
+	 */
+	if (change == KVM_MR_FLAGS_ONLY && (old_flags & KVM_MEM_USERFAULT) &&
+	    !(new_flags & KVM_MEM_USERFAULT))
+		kvm_mmu_recover_huge_pages(kvm, new);
+
+	/*
+	 * Nothing more to do if dirty logging isn't being toggled.
+	 */
+	if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
+
 	/*
 	 * Update CPU dirty logging if dirty logging is being toggled.  This
 	 * applies to all operations.
 	 */
-	if ((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES)
-		kvm_mmu_update_cpu_dirty_logging(kvm, log_dirty_pages);
+	kvm_mmu_update_cpu_dirty_logging(kvm, log_dirty_pages);
 
 	/*
 	 * Nothing more to do for RO slots (which can't be dirtied and can't be
@@ -13175,14 +13190,6 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
 		return;
 
-	/*
-	 * READONLY and non-flags changes were filtered out above, and the only
-	 * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty
-	 * logging isn't being toggled on or off.
-	 */
-	if (WARN_ON_ONCE(!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES)))
-		return;
-
 	if (!log_dirty_pages) {
 		/*
 		 * Recover huge page mappings in the slot now that dirty logging
-- 
2.49.0.967.g6a0df3ecc3-goog


[-- Attachment #7: 0006-KVM-arm64-Add-support-for-KVM-userfault-exits.patch --]
[-- Type: text/x-diff, Size: 2128 bytes --]

From b9300a363d5517098bde6c97b8292a71092aa455 Mon Sep 17 00:00:00 2001
From: James Houghton <jthoughton@google.com>
Date: Tue, 6 May 2025 15:38:31 -0700
Subject: [PATCH 6/7] KVM: arm64: Add support for KVM userfault exits

<to be written by James>

Signed-off-by: James Houghton <jthoughton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/mmu.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d21bcfa3ed..5e2ccde66f43 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1541,7 +1541,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * logging_active is guaranteed to never be true for VM_PFNMAP
 	 * memslots.
 	 */
-	if (logging_active || is_protected_kvm_enabled()) {
+	if (logging_active || is_protected_kvm_enabled() ||
+	    kvm_is_userfault_memslot(memslot)) {
 		force_pte = true;
 		vma_shift = PAGE_SHIFT;
 	} else {
@@ -1630,6 +1631,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
+	if (kvm_do_userfault(vcpu, &fault))
+		return -EFAULT;
+
 	pfn = __kvm_faultin_pfn(memslot, fault.gfn, fault.write ? FOLL_WRITE : 0,
 				&writable, &page);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
@@ -2127,14 +2131,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_memory_slot *new,
 				   enum kvm_mr_change change)
 {
-	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+	u32 old_flags = old ? old->flags : 0;
+	u32 new_flags = new ? new->flags : 0;
+
+	/* Nothing to do if not toggling dirty logging. */
+	if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
 
 	/*
 	 * At this point memslot has been committed and there is an
 	 * allocated dirty_bitmap[], dirty pages will be tracked while the
 	 * memory slot is write protected.
 	 */
-	if (log_dirty_pages) {
+	if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
 
 		if (change == KVM_MR_DELETE)
 			return;
-- 
2.49.0.967.g6a0df3ecc3-goog


[-- Attachment #8: 0007-KVM-Enable-and-advertise-support-for-KVM-userfault-e.patch --]
[-- Type: text/x-diff, Size: 1880 bytes --]

From 81b07458648d57cb532b64178dcfd2aeba81db95 Mon Sep 17 00:00:00 2001
From: James Houghton <jthoughton@google.com>
Date: Tue, 6 May 2025 15:40:57 -0700
Subject: [PATCH 7/7] KVM: Enable and advertise support for KVM userfault exits

Now that all architectures (arm64 and x86) that utilize "generic" page
faults also support userfault exits, advertise support for
KVM_CAP_USERFAULT and let userspace set KVM_MEM_USERFAULT in memslots.

Signed-off-by: James Houghton <jthoughton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index af1fc86ddbe0..262e2bde9b3b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -934,6 +934,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
+#define KVM_CAP_USERFAULT 240
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ca08075a9b7b..c3384847dd5b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1545,6 +1545,9 @@ static int check_memory_region_flags(struct kvm *kvm,
 	    !(mem->flags & KVM_MEM_GUEST_MEMFD))
 		valid_flags |= KVM_MEM_READONLY;
 
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PAGE_FAULT))
+		valid_flags |= KVM_MEM_USERFAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -4823,6 +4826,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_CHECK_EXTENSION_VM:
 	case KVM_CAP_ENABLE_CAP_VM:
 	case KVM_CAP_HALT_POLL:
+#ifdef CONFIG_KVM_GENERIC_PAGE_FAULT
+	case KVM_CAP_USERFAULT:
+#endif
 		return 1;
 #ifdef CONFIG_KVM_MMIO
 	case KVM_CAP_COALESCED_MMIO:
-- 
2.49.0.967.g6a0df3ecc3-goog


^ permalink raw reply related	[flat|nested] 34+ messages in thread
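
On the userspace side, the new exit surfaces through KVM_EXIT_MEMORY_FAULT,
which KVM_RUN delivers together with an -EFAULT return. Below is a rough
sketch of a vCPU thread consuming it; fetch_page() and page_ready() are
placeholder demand-fetch helpers along the lines sketched earlier, and the
USERFAULT flag comes from this series (headers without it lack the define).

	#include <errno.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	extern void fetch_page(uint64_t gpa);	/* copy contents from the source */
	extern void page_ready(uint64_t gpa);	/* clear the bit w/ store-release */

	/* 'run' is the vCPU's mmap()ed struct kvm_run. */
	static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
	{
		for (;;) {
			int r = ioctl(vcpu_fd, KVM_RUN, 0);

			if (r < 0 && errno == EFAULT &&
			    run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
			    (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_USERFAULT)) {
				fetch_page(run->memory_fault.gpa);
				page_ready(run->memory_fault.gpa);
				continue;	/* re-enter; the fault is resolved */
			}

			/* ... handle other exit reasons ... */
			break;
		}
	}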

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-05-07  0:06   ` Sean Christopherson
@ 2025-05-28 15:09     ` James Houghton
  2025-05-28 15:25       ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: James Houghton @ 2025-05-28 15:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Tue, May 6, 2025 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 09, 2025, James Houghton wrote:
> > @@ -2073,6 +2080,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> >                                  enum kvm_mr_change change)
> >  {
> >       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > +     u32 new_flags = new ? new->flags : 0;
> > +     u32 changed_flags = (new_flags) ^ (old ? old->flags : 0);
>
> This is a bit hard to read, and there's only one use of log_dirty_pages.  With
> zapping handled in common KVM, just do:

Thanks, Sean. Yeah, what you have below looks a lot better, thanks for
applying it for me. I'll post a new version soon. One note below.

>
> @@ -2127,14 +2131,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>                                    const struct kvm_memory_slot *new,
>                                    enum kvm_mr_change change)
>  {
> -       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +       u32 old_flags = old ? old->flags : 0;
> +       u32 new_flags = new ? new->flags : 0;
> +
> +       /* Nothing to do if not toggling dirty logging. */
> +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> +               return;

This is my bug, not yours, but I think this condition must also check
that `change == KVM_MR_FLAGS_ONLY` for it to be correct. This, for
example, will break the case where we are deleting a memslot that
still has KVM_MEM_LOG_DIRTY_PAGES enabled. Will fix in the next
version.

>
>         /*
>          * At this point memslot has been committed and there is an
>          * allocated dirty_bitmap[], dirty pages will be tracked while the
>          * memory slot is write protected.
>          */
> -       if (log_dirty_pages) {
> +       if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
>
>                 if (change == KVM_MR_DELETE)
>                         return;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  2025-05-07  0:01   ` Sean Christopherson
@ 2025-05-28 15:21     ` James Houghton
  0 siblings, 0 replies; 34+ messages in thread
From: James Houghton @ 2025-05-28 15:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Tue, May 6, 2025 at 8:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 09, 2025, James Houghton wrote:
> > Use one of the 14 reserved u64s in struct kvm_userspace_memory_region2
> > for the user to provide `userfault_bitmap`.
> >
> > The memslot flag indicates whether KVM should read the
> > `userfault_bitmap` field of the memslot. The user is permitted to
> > provide a bogus pointer. If the pointer cannot be read from, we will
> > return -EFAULT (with no other information) back to the user.
>
> For the uAPI+infrastructure changelog, please elaborate on the design goals and
> choices.  The "what" is pretty obvious from the patch; describe why this is being
> added.
>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/kvm_host.h | 14 ++++++++++++++
> >  include/uapi/linux/kvm.h |  4 +++-
> >  virt/kvm/Kconfig         |  3 +++
> >  virt/kvm/kvm_main.c      | 35 +++++++++++++++++++++++++++++++++++
> >  4 files changed, 55 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 401439bb21e3..f7a3dfd5e224 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -590,6 +590,7 @@ struct kvm_memory_slot {
> >       unsigned long *dirty_bitmap;
> >       struct kvm_arch_memory_slot arch;
> >       unsigned long userspace_addr;
> > +     unsigned long __user *userfault_bitmap;
> >       u32 flags;
> >       short id;
> >       u16 as_id;
> > @@ -724,6 +725,11 @@ static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
> >  }
> >  #endif
> >
> > +static inline bool kvm_has_userfault(struct kvm *kvm)
> > +{
> > +     return IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT);
> > +}
>
> Eh, don't think we need this wrapper.  Just check the CONFIG_xxx manually in the
> one or two places where code isn't guarded by an #ifdef.
>
> >  struct kvm_memslots {
> >       u64 generation;
> >       atomic_long_t last_used_slot;
> > @@ -2553,4 +2559,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                   struct kvm_pre_fault_memory *range);
> >  #endif
> >
> > +int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > +                   gfn_t gfn);
> > +
> > +static inline bool kvm_memslot_userfault(struct kvm_memory_slot *memslot)
>
> I strongly prefer kvm_is_userfault_memslot().  KVM's weird kvm_memslot_<flag>()
> nomenclature comes from ancient code, i.e. isn't something I would follow.
>
> > +{
> > +     return memslot->flags & KVM_MEM_USERFAULT;
>
> I think it's worth checking for a non-NULL memslot, even if all current callers
> pre-check for a slot.
>
> > @@ -2042,6 +2051,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >               if (r)
> >                       goto out;
> >       }
> > +     if (mem->flags & KVM_MEM_USERFAULT)
> > +             new->userfault_bitmap =
> > +               (unsigned long __user *)(unsigned long)mem->userfault_bitmap;
>
>         if (mem->flags & KVM_MEM_USERFAULT)
>                 new->userfault_bitmap = u64_to_user_ptr(mem->userfault_bitmap);

Applied this change to the other cast (where we do access_ok()) as well, thanks!
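
For reference, enabling this from userspace ends up looking something
like the below (a sketch; the flag, the userfault_bitmap field, and
the 4 KiB page size are per this series' proposal, error handling
omitted):

  #include <linux/kvm.h>
  #include <string.h>
  #include <sys/ioctl.h>

  /* Mark every gfn in an existing memslot as userfault (all bits set). */
  static void enable_userfault(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
                               void *hva, unsigned long *bitmap)
  {
          struct kvm_userspace_memory_region2 region = {
                  .slot = slot,
                  .flags = KVM_MEM_USERFAULT,
                  .guest_phys_addr = gpa,
                  .memory_size = size,
                  .userspace_addr = (__u64)hva,
                  .userfault_bitmap = (__u64)bitmap,  /* one bit per gfn */
          };

          memset(bitmap, 0xff, size / 4096 / 8);
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
  }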

>
> >       r = kvm_set_memslot(kvm, old, new, change);
> >       if (r)
> > @@ -6426,3 +6438,26 @@ void kvm_exit(void)
> >       kvm_irqfd_exit();
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_exit);
> > +
> > +int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > +                    gfn_t gfn)
>
> I think this series is the perfect opportunity (read: victim) to introduce a
> common "struct kvm_page_fault".  With a common structure to provide the gfn, slot,
> write, exec, and is_private fields, this helper can handle the checks and the call
> to kvm_prepare_memory_fault_exit().
>
> And with that in place, I would vote to name this something like kvm_do_userfault(),
> return a boolean, and let the caller return -EFAULT.

Returning 'true' from kvm_do_userfault() without a
kvm_prepare_memory_fault_exit() looked a bit strange at first, but I
don't have strong feelings. I'll add a small comment there.

>
> For making "struct kvm_page_fault" common, one thought would be to have arch code
> define the entire struct, and simply assert on the few fields that common KVM needs
> being defined by arch code.  And wrap all references in CONFIG_KVM_GENERIC_PAGE_FAULT.
>
> I don't expect there will be a huge number of fields that common KVM needs, i.e. I
> don't think the maintenance burden of punting to arch code will be high.  And letting
> arch code own the entire struct means we don't need to have e.g. fault->arch.present
> vs. fault->write in KVM x86, which to me is a big net negative for readability.
>
> I'll respond to the cover letter with an attachment of seven patches to sketch out
> the idea.

Looks great! Thanks very much!

>
> > +{
> > +     unsigned long bitmap_chunk = 0;
> > +     off_t offset;
> > +
> > +     if (!kvm_memslot_userfault(memslot))
> > +             return 0;
> > +
> > +     if (WARN_ON_ONCE(!memslot->userfault_bitmap))
> > +             return 0;
>
> '0' is technically a valid userspace address.  I'd just drop this.  If we have a
> KVM bug that results in failure to generate usefaults, we'll notice quite quickly.
>
> > +
> > +     offset = gfn - memslot->base_gfn;
> > +
> > +     if (copy_from_user(&bitmap_chunk,
> > +                        memslot->userfault_bitmap + offset / BITS_PER_LONG,
> > +                        sizeof(bitmap_chunk)))
>
> Since the address is checked during memslot creation, I'm pretty sure this can
> use __get_user().  At the very least, it should be get_user().

Thanks! I agree, __get_user() should be fine.

>
> > +             return -EFAULT;
> > +
> > +     /* Set in the bitmap means that the gfn is userfault */
> > +     return !!(bitmap_chunk & (1ul << (offset % BITS_PER_LONG)));
>
> test_bit()?

Thanks for all the feedback and applying it for me in those patches
you sent back. :)
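
Putting the feedback together, the kernel side ends up looking roughly
like this (a sketch assuming the common struct kvm_page_fault carries
slot/gfn/write/exec/is_private, per the patches you attached; not
necessarily the final code):

  /* Returns true if the caller should exit to userspace with -EFAULT. */
  bool kvm_do_userfault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  {
          unsigned long bitmap_chunk;
          off_t offset;

          if (!fault->slot || !(fault->slot->flags & KVM_MEM_USERFAULT))
                  return false;

          offset = fault->gfn - fault->slot->base_gfn;

          /* The bitmap pointer was access_ok()'d at memslot creation. */
          if (__get_user(bitmap_chunk,
                         fault->slot->userfault_bitmap + offset / BITS_PER_LONG))
                  /* Bogus pointer; exit without a prepared memory fault. */
                  return true;

          /* Set in the bitmap means the gfn is userfault. */
          if (!test_bit(offset % BITS_PER_LONG, &bitmap_chunk))
                  return false;

          kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
                                        fault->write, fault->exec,
                                        fault->is_private);
          return true;
  }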

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-05-28 15:09     ` James Houghton
@ 2025-05-28 15:25       ` James Houghton
  2025-05-28 17:30         ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: James Houghton @ 2025-05-28 15:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Wed, May 28, 2025 at 11:09 AM James Houghton <jthoughton@google.com> wrote:
>
> On Tue, May 6, 2025 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 09, 2025, James Houghton wrote:
> > > @@ -2073,6 +2080,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> > >                                  enum kvm_mr_change change)
> > >  {
> > >       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > > +     u32 new_flags = new ? new->flags : 0;
> > > +     u32 changed_flags = (new_flags) ^ (old ? old->flags : 0);
> >
> > This is a bit hard to read, and there's only one use of log_dirty_pages.  With
> > zapping handled in common KVM, just do:
>
> Thanks, Sean. Yeah what you have below looks a lot better, thanks for
> applying it for me. I'll post a new version soon. One note below.
>
> >
> > @@ -2127,14 +2131,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> >                                    const struct kvm_memory_slot *new,
> >                                    enum kvm_mr_change change)
> >  {
> > -       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > +       u32 old_flags = old ? old->flags : 0;
> > +       u32 new_flags = new ? new->flags : 0;
> > +
> > +       /* Nothing to do if not toggling dirty logging. */
> > +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > +               return;
>
> This is my bug, not yours, but I think this condition must also check
> that `change == KVM_MR_FLAGS_ONLY` for it to be correct. This, for
> example, will break the case where we are deleting a memslot that
> still has KVM_MEM_LOG_DIRTY_PAGES enabled. Will fix in the next
> version.

Ah it wouldn't break that example, as `new` would be NULL. But I think
it would break the case where we are moving a memslot that keeps
`KVM_MEM_LOG_DIRTY_PAGES`.

>
> >
> >         /*
> >          * At this point memslot has been committed and there is an
> >          * allocated dirty_bitmap[], dirty pages will be tracked while the
> >          * memory slot is write protected.
> >          */
> > -       if (log_dirty_pages) {
> > +       if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
> >
> >                 if (change == KVM_MR_DELETE)
> >                         return;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 00/13] KVM: Introduce KVM Userfault
  2025-05-07  0:13 ` Sean Christopherson
@ 2025-05-28 15:48   ` James Houghton
  2025-05-29 15:28     ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: James Houghton @ 2025-05-28 15:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, Jiaqi Yan

On Tue, May 6, 2025 at 8:13 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 09, 2025, James Houghton wrote:
> >   KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
> >   KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
> >   KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
> >   KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
> >   KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
> >   KVM: arm64: Add support for KVM_MEM_USERFAULT
> >   KVM: selftests: Fix vm_mem_region_set_flags docstring
> >   KVM: selftests: Fix prefault_mem logic
> >   KVM: selftests: Add va_start/end into uffd_desc
> >   KVM: selftests: Add KVM Userfault mode to demand_paging_test
> >   KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
> >   KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
> >   KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT
> >     details
> >
> >  Documentation/virt/kvm/api.rst                |  33 +++-
> >  arch/arm64/kvm/Kconfig                        |   1 +
> >  arch/arm64/kvm/mmu.c                          |  26 +++-
> >  arch/x86/kvm/Kconfig                          |   1 +
> >  arch/x86/kvm/mmu/mmu.c                        |  27 +++-
> >  arch/x86/kvm/mmu/mmu_internal.h               |  20 ++-
> >  arch/x86/kvm/x86.c                            |  36 +++--
> >  include/linux/kvm_host.h                      |  19 ++-
> >  include/uapi/linux/kvm.h                      |   6 +-
> >  .../selftests/kvm/demand_paging_test.c        | 145 ++++++++++++++++--
> >  .../testing/selftests/kvm/include/kvm_util.h  |   5 +
> >  .../selftests/kvm/include/userfaultfd_util.h  |   2 +
> >  tools/testing/selftests/kvm/lib/kvm_util.c    |  42 ++++-
> >  .../selftests/kvm/lib/userfaultfd_util.c      |   2 +
> >  .../selftests/kvm/set_memory_region_test.c    |  33 ++++
> >  virt/kvm/Kconfig                              |   3 +
> >  virt/kvm/kvm_main.c                           |  54 ++++++-
> >  17 files changed, 419 insertions(+), 36 deletions(-)
>
> I didn't look at the selftests changes, but nothing in this series scares me.  We
> bikeshedded most of this death this in the "exit on missing" series, so for me at
> least, the only real question is whether or not we want to add the uAPI.  AFAIK,
> this is best proposal for post-copy guest_memfd support (and not just because it's
> the only proposal :-D).

The only thing that I want to call out again is that this UAPI works
great for when we are going from userfault --> !userfault. That is, it
works well for postcopy (both for guest_memfd and for standard
memslots where userfaultfd scalability is a concern).

But there is another use case worth bringing up: unmapping pages that
the VMM is emulating as poisoned.

Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for
4K poison within a HugeTLB-backed memslot (if the HugeTLB page remains
mapped in userspace), KVM Userfault is the only option (if we don't
want to punch holes in memslots). This leaves us with three problems:

1. If using KVM Userfault to emulate poison, we are stuck with small
pages in stage 2 for the entire memslot.
2. We must unmap everything when toggling on KVM Userfault just to
unmap a single page.
3. If KVM Userfault is already enabled, we have no choice but to
toggle KVM Userfault off and on again to unmap the newly poisoned
pages (i.e., there is no ioctl to scan the bitmap and unmap
newly-userfault pages).

All of these are non-issues if we emulate poison by removing memslots,
and I think that's possible. But if that proves too slow, we'd need to
be a little bit more clever with hugepage recovery and with unmapping
newly-userfault pages, both of which I think can be solved by adding
some kind of bitmap re-scan ioctl. We can do that later if the need
arises.

> So... yes?

Thanks Sean!

> Attached are a variation on the series using the common "struct kvm_page_fault"
> idea.  The documentation change could be squashed with the final enablement patch.
>
> Compile tested only.  I would not be the least bit surprised if I completely
> butchered something.

Looks good! The new selftests work just fine.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-05-28 15:25       ` James Houghton
@ 2025-05-28 17:30         ` Sean Christopherson
  2025-05-28 20:17           ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-28 17:30 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Wed, May 28, 2025, James Houghton wrote:
> On Wed, May 28, 2025 at 11:09 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Tue, May 6, 2025 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 09, 2025, James Houghton wrote:
> > > > @@ -2073,6 +2080,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> > > >                                  enum kvm_mr_change change)
> > > >  {
> > > >       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > > > +     u32 new_flags = new ? new->flags : 0;
> > > > +     u32 changed_flags = (new_flags) ^ (old ? old->flags : 0);
> > >
> > > This is a bit hard to read, and there's only one use of log_dirty_pages.  With
> > > zapping handled in common KVM, just do:
> >
> > Thanks, Sean. Yeah what you have below looks a lot better, thanks for
> > applying it for me. I'll post a new version soon. One note below.
> >
> > >
> > > @@ -2127,14 +2131,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> > >                                    const struct kvm_memory_slot *new,
> > >                                    enum kvm_mr_change change)
> > >  {
> > > -       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > > +       u32 old_flags = old ? old->flags : 0;
> > > +       u32 new_flags = new ? new->flags : 0;
> > > +
> > > +       /* Nothing to do if not toggling dirty logging. */
> > > +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > > +               return;
> >
> > This is my bug, not yours, but I think this condition must also check
> > that `change == KVM_MR_FLAGS_ONLY` for it to be correct. This, for
> > example, will break the case where we are deleting a memslot that
> > still has KVM_MEM_LOG_DIRTY_PAGES enabled. Will fix in the next
> > version.
> 
> Ah it wouldn't break that example, as `new` would be NULL. But I think
> it would break the case where we are moving a memslot that keeps
> `KVM_MEM_LOG_DIRTY_PAGES`.

Can you elaborate?  Maybe with the full snippet of the final code that's broken.
I'm not entirely following what path you're referring to.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-05-28 17:30         ` Sean Christopherson
@ 2025-05-28 20:17           ` James Houghton
  2025-05-28 23:25             ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: James Houghton @ 2025-05-28 20:17 UTC (permalink / raw)
  To: seanjc
  Cc: amoorthy, corbet, dmatlack, jthoughton, kalyazin, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel, maz, oliver.upton,
	pbonzini, peterx, pgonda, wei.w.wang, yan.y.zhao

On Wed, May 28, 2025 at 1:30 PM Sean Christopherson <seanjc@google.com> wrote:
> On Wed, May 28, 2025, James Houghton wrote:
> > On Wed, May 28, 2025 at 11:09 AM James Houghton <jthoughton@google.com> wrote:
> > > On Tue, May 6, 2025 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > @@ -2127,14 +2131,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> > > >                                    const struct kvm_memory_slot *new,
> > > >                                    enum kvm_mr_change change)
> > > >  {
> > > > -       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > > > +       u32 old_flags = old ? old->flags : 0;
> > > > +       u32 new_flags = new ? new->flags : 0;
> > > > +
> > > > +       /* Nothing to do if not toggling dirty logging. */
> > > > +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > > > +               return;
> > >
> > > This is my bug, not yours, but I think this condition must also check
> > > that `change == KVM_MR_FLAGS_ONLY` for it to be correct. This, for
> > > example, will break the case where we are deleting a memslot that
> > > still has KVM_MEM_LOG_DIRTY_PAGES enabled. Will fix in the next
> > > version.
> >
> > Ah it wouldn't break that example, as `new` would be NULL. But I think
> > it would break the case where we are moving a memslot that keeps
> > `KVM_MEM_LOG_DIRTY_PAGES`.
>
> Can you elaborate?  Maybe with the full snippet of the final code that's broken.
> I'm not entirely following what path you're referring to.

This is even more broken than I realized.

I mean that this diff should be applied on top of your patch:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5e2ccde66f43c..f1db3f7742b28 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2134,8 +2134,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	u32 old_flags = old ? old->flags : 0;
 	u32 new_flags = new ? new->flags : 0;
 
-	/* Nothing to do if not toggling dirty logging. */
-	if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+	/*
+	 * If only changing flags, nothing to do if not toggling
+	 * dirty logging.
+	 */
+	if (change == KVM_MR_FLAGS_ONLY &&
+	    !((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
 		return;
 
 	/*

So the final commit looks like:

commit 3c4b57b25b1123629c5f2b64065d51ecdadb6771
Author: James Houghton <jthoughton@google.com>
Date:   Tue May 6 15:38:31 2025 -0700

    KVM: arm64: Add support for KVM userfault exits
    
    <to be written by James>
    
    Signed-off-by: James Houghton <jthoughton@google.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d21bcfa3ed4..f1db3f7742b28 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2127,15 +2131,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_memory_slot *new,
 				   enum kvm_mr_change change)
 {
-	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+	u32 old_flags = old ? old->flags : 0;
+	u32 new_flags = new ? new->flags : 0;
+
+	/*
+	 * If only changing flags, nothing to do if not toggling
+	 * dirty logging.
+	 */
+	if (change == KVM_MR_FLAGS_ONLY &&
+	    !((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
 
 	/*
 	 * At this point memslot has been committed and there is an
 	 * allocated dirty_bitmap[], dirty pages will be tracked while the
 	 * memory slot is write protected.
 	 */
-	if (log_dirty_pages) {
-
+	if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
 		if (change == KVM_MR_DELETE)
 			return;
 

So we need to bail out early if we are enabling KVM_MEM_USERFAULT but
KVM_MEM_LOG_DIRTY_PAGES is already enabled, otherwise we'll be
write-protecting a bunch of PTEs that we don't need or want to WP.

When *disabling* KVM_MEM_USERFAULT, we definitely don't want to WP
things, as we aren't going to get the unmap afterwards anyway.

So the check we started with handles this:
> > > > +       u32 old_flags = old ? old->flags : 0;
> > > > +       u32 new_flags = new ? new->flags : 0;
> > > > +
> > > > +       /* Nothing to do if not toggling dirty logging. */
> > > > +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > > > +               return;

So why also check for `change == KVM_MR_FLAGS_ONLY` as well? Everything I just
said doesn't really apply when the memslot is being created, moved, or
destroyed. Otherwise, consider the case where we never enable dirty logging:

 - Memslot deletion would be totally broken; we'll see that
   KVM_MEM_LOG_DIRTY_PAGES is not getting toggled and then bail out, skipping
   some freeing.

 - Memslot creation would be broken in a similar way; we'll skip a bunch of
   setup work.

 - For memslot moving, the only case that we could possibly be leaving
   KVM_MEM_LOG_DIRTY_PAGES set without the change being KVM_MR_FLAGS_ONLY,
   I think we still need to do the split and WP stuff.

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  2025-05-07  0:05   ` Sean Christopherson
@ 2025-05-28 20:21     ` Oliver Upton
  2025-05-28 21:22       ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Oliver Upton @ 2025-05-28 20:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Tue, May 06, 2025 at 05:05:50PM -0700, Sean Christopherson wrote:
> > +	if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
> > +	    (change == KVM_MR_FLAGS_ONLY)) {
> > +		if (old_flags & KVM_MEM_USERFAULT)
> > +			kvm_mmu_recover_huge_pages(kvm, new);
> > +		else
> > +			kvm_arch_flush_shadow_memslot(kvm, old);
> 
> The call to kvm_arch_flush_shadow_memslot() should definitely go in common code.
> The fancy recovery logic is arch specific, but blasting the memslot when userfault
> is toggled on is not.

Not like anything in KVM is consistent but sprinkling translation
changes / invalidations between arch and generic code feels
error-prone. Especially if there isn't clear ownership of a particular
flag, e.g. 0 -> 1 transitions happen in generic code and 1 -> 0 happens
in arch code.

Even in the case of KVM_MEM_USERFAULT, an architecture could potentially
preserve the stage-2 translations but reap access permissions without
modifying page tables / TLBs.

I'm happy with arch interfaces that clearly express intent (make this
memslot inaccessible), then the architecture can make an informed
decision about how to best achieve that. Otherwise we're always going to
use the largest possible hammer and potentially overinvalidate.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  2025-05-28 20:21     ` Oliver Upton
@ 2025-05-28 21:22       ` Sean Christopherson
  2025-05-29 14:56         ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-28 21:22 UTC (permalink / raw)
  To: Oliver Upton
  Cc: James Houghton, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Wed, May 28, 2025, Oliver Upton wrote:
> On Tue, May 06, 2025 at 05:05:50PM -0700, Sean Christopherson wrote:
> > > +	if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
> > > +	    (change == KVM_MR_FLAGS_ONLY)) {
> > > +		if (old_flags & KVM_MEM_USERFAULT)
> > > +			kvm_mmu_recover_huge_pages(kvm, new);
> > > +		else
> > > +			kvm_arch_flush_shadow_memslot(kvm, old);
> > 
> > The call to kvm_arch_flush_shadow_memslot() should definitely go in common code.
> > The fancy recovery logic is arch specific, but blasting the memslot when userfault
> > is toggled on is not.
> 
> Not like anything in KVM is consistent but sprinkling translation
> changes / invalidations between arch and generic code feels
> error-prone.

Eh, leaving critical operations to arch code isn't exactly error free either :-)

> Especially if there isn't clear ownership of a particular flag, e.g. 0 -> 1
> transitions happen in generic code and 1 -> 0 happens in arch code.

The difference I see is that removing access to the memslot on 0=>1 is mandatory,
whereas any action on 1=>0 is not.  So IMO it's not arbitrary sprinkling of
invalidations, it's deliberately putting the common, mandatory logic in generic
code, while leaving optional performance tweaks to arch code.

> Even in the case of KVM_MEM_USERFAULT, an architecture could potentially
> preserve the stage-2 translations but reap access permissions without
> modifying page tables / TLBs.

Yes, but that wouldn't be strictly unique to KVM_MEM_USERFAULT.

E.g. for NUMA balancing faults (or rather, the PROT_NONE conversions), KVM could
handle the mmu_notifier invalidations by removing access while keeping the PTEs,
so that faulting the memory back would be a lighter weight operation.  Ditto for
reacting to other protection changes that come through mmu_notifiers.

If we want to go down that general path, my preference would be to put the control
logic in generic code, and then call dedicated arch APIs for removing protections.

> I'm happy with arch interfaces that clearly express intent (make this
> memslot inaccessible), then the architecture can make an informed
> decision about how to best achieve that. Otherwise we're always going to
> use the largest possible hammer and potentially overinvalidate.

Yeah, definitely no argument there given x86's history in this area.  Though if
we want to tackle that problem straightaway, I think I'd vote to add the
aforementioned dedicated APIs for removing protections, with a generic default
implementation that simply invokes kvm_arch_flush_shadow_memslot().


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-05-28 20:17           ` James Houghton
@ 2025-05-28 23:25             ` Sean Christopherson
  2025-06-09 23:04               ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-28 23:25 UTC (permalink / raw)
  To: James Houghton
  Cc: amoorthy, corbet, dmatlack, kalyazin, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel, maz, oliver.upton,
	pbonzini, peterx, pgonda, wei.w.wang, yan.y.zhao

On Wed, May 28, 2025, James Houghton wrote:
> On Wed, May 28, 2025 at 1:30 PM Sean Christopherson <seanjc@google.com> wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d21bcfa3ed4..f1db3f7742b28 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -2127,15 +2131,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  				   const struct kvm_memory_slot *new,
>  				   enum kvm_mr_change change)
>  {
> -	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +	u32 old_flags = old ? old->flags : 0;
> +	u32 new_flags = new ? new->flags : 0;
> +
> +	/*
> +	 * If only changing flags, nothing to do if not toggling
> +	 * dirty logging.
> +	 */
> +	if (change == KVM_MR_FLAGS_ONLY &&
> +	    !((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> +		return;
>  
>  	/*
>  	 * At this point memslot has been committed and there is an
>  	 * allocated dirty_bitmap[], dirty pages will be tracked while the
>  	 * memory slot is write protected.
>  	 */
> -	if (log_dirty_pages) {
> -
> +	if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
>  		if (change == KVM_MR_DELETE)
>  			return;
>  
> 
> So we need to bail out early if we are enabling KVM_MEM_USERFAULT but
> KVM_MEM_LOG_DIRTY_PAGES is already enabled, otherwise we'll be
> write-protecting a bunch of PTEs that we don't need or want to WP.
> 
> When *disabling* KVM_MEM_USERFAULT, we definitely don't want to WP
> things, as we aren't going to get the unmap afterwards anyway.
> 
> So the check we started with handles this:
> > > > > +       u32 old_flags = old ? old->flags : 0;
> > > > > +       u32 new_flags = new ? new->flags : 0;
> > > > > +
> > > > > +       /* Nothing to do if not toggling dirty logging. */
> > > > > +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > > > > +               return;
> 
> So why also check for `change == KVM_MR_FLAGS_ONLY` as well? Everything I just
> said doesn't really apply when the memslot is being created, moved, or
> destroyed. Otherwise, consider the case where we never enable dirty logging:
> 
>  - Memslot deletion would be totally broken; we'll see that
>    KVM_MEM_LOG_DIRTY_PAGES is not getting toggled and then bail out, skipping
>    some freeing.

No, because @new and thus new_flags will be 0.  If dirty logging wasn't enabled,
then there's nothing to be done.

>  - Memslot creation would be broken in a similar way; we'll skip a bunch of
>    setup work.

No, because @old and thus old_flags will be 0.  If dirty logging isn't being
enabled, then there's nothing to be done.

>  - For memslot moving, the only case that we could possibly be leaving
>    KVM_MEM_LOG_DIRTY_PAGES set without the change being KVM_MR_FLAGS_ONLY,
>    I think we still need to do the split and WP stuff.

No, because KVM invokes kvm_arch_flush_shadow_memslot() on the memslot and marks
it invalid prior to installing the new, moved memslot.  See kvm_invalidate_memslot().

So I'm still not seeing what's buggy.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  2025-05-28 21:22       ` Sean Christopherson
@ 2025-05-29 14:56         ` Sean Christopherson
  2025-05-29 15:37           ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-29 14:56 UTC (permalink / raw)
  To: Oliver Upton
  Cc: James Houghton, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Wed, May 28, 2025, Sean Christopherson wrote:
> On Wed, May 28, 2025, Oliver Upton wrote:
> > On Tue, May 06, 2025 at 05:05:50PM -0700, Sean Christopherson wrote:
> > > > +	if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
> > > > +	    (change == KVM_MR_FLAGS_ONLY)) {
> > > > +		if (old_flags & KVM_MEM_USERFAULT)
> > > > +			kvm_mmu_recover_huge_pages(kvm, new);
> > > > +		else
> > > > +			kvm_arch_flush_shadow_memslot(kvm, old);
> > > 
> > > The call to kvm_arch_flush_shadow_memslot() should definitely go in common code.
> > > The fancy recovery logic is arch specific, but blasting the memslot when userfault
> > > is toggled on is not.
> > 
> > Not like anything in KVM is consistent but sprinkling translation
> > changes / invalidations between arch and generic code feels
> > error-prone.
> 
> Eh, leaving critical operations to arch code isn't exactly error free either :-)
> 
> > Especially if there isn't clear ownership of a particular flag, e.g. 0 -> 1
> > transitions happen in generic code and 1 -> 0 happens in arch code.
> 
> The difference I see is that removing access to the memslot on 0=>1 is mandatory,
> whereas any action on 1=>0 is not.  So IMO it's not arbitrary sprinkling of
> invalidations, it's deliberately putting the common, mandatory logic in generic
> code, while leaving optional performance tweaks to arch code.
> 
> > Even in the case of KVM_MEM_USERFAULT, an architecture could potentially
> > preserve the stage-2 translations but reap access permissions without
> > modifying page tables / TLBs.
> 
> Yes, but that wouldn't be strictly unique to KVM_MEM_USERFAULT.
> 
> E.g. for NUMA balancing faults (or rather, the PROT_NONE conversions), KVM could
> handle the mmu_notifier invalidations by removing access while keeping the PTEs,
> so that faulting the memory back would be a lighter weight operation.  Ditto for
> reacting to other protection changes that come through mmu_notifiers.
> 
> If we want to go down that general path, my preference would be to put the control
> logic in generic code, and then call dedicated arch APIs for removing protections.
> 
> > I'm happy with arch interfaces that clearly express intent (make this
> > memslot inaccessible), then the architecture can make an informed
> > decision about how to best achieve that. Otherwise we're always going to
> > use the largest possible hammer and potentially overinvalidate.
> 
> Yeah, definitely no argument there given x86's history in this area.  Though if
> we want to tackle that problem straightaway, I think I'd vote to add the
> aforementioned dedicated APIs for removing protections, with a generic default
> implementation that simply invokes kvm_arch_flush_shadow_memslot().

Alternatively, we could punt on this issue entirely by not allowing userspace to
set KVM_MEM_USERFAULT on anything but KVM_MR_CREATE.  I.e. allow a FLAGS_ONLY
update to clear USERFAULT, but not set USERFAULT.

Other than emulating poisoned pages, is there a (potential) use case for setting
KVM_MEM_USERFAULT after a VM has been created?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 00/13] KVM: Introduce KVM Userfault
  2025-05-28 15:48   ` James Houghton
@ 2025-05-29 15:28     ` Sean Christopherson
  2025-05-29 16:17       ` James Houghton
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2025-05-29 15:28 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, Jiaqi Yan

On Wed, May 28, 2025, James Houghton wrote:
> The only thing that I want to call out again is that this UAPI works
> great for when we are going from userfault --> !userfault. That is, it
> works well for postcopy (both for guest_memfd and for standard
> memslots where userfaultfd scalability is a concern).
> 
> But there is another use case worth bringing up: unmapping pages that
> the VMM is emulating as poisoned.
> 
> Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for
> 4K poison within a HugeTLB-backed memslot (if the HugeTLB page remains
> mapped in userspace), KVM Userfault is the only option (if we don't
> want to punch holes in memslots). This leaves us with three problems:
> 
> 1. If using KVM Userfault to emulate poison, we are stuck with small
> pages in stage 2 for the entire memslot.
> 2. We must unmap everything when toggling on KVM Userfault just to
> unmap a single page.
> 3. If KVM Userfault is already enabled, we have no choice but to
> toggle KVM Userfault off and on again to unmap the newly poisoned
> pages (i.e., there is no ioctl to scan the bitmap and unmap
> newly-userfault pages).
> 
> All of these are non-issues if we emulate poison by removing memslots,
> and I think that's possible. But if that proves too slow, we'd need to
> be a little bit more clever with hugepage recovery and with unmapping
> newly-userfault pages, both of which I think can be solved by adding
> some kind of bitmap re-scan ioctl. We can do that later if the need
> arises.

Hmm.

On the one hand, punching a hole in a memslot is generally gross, e.g. requires
deleting the entire memslot and thus unmapping large swaths of guest memory (or
all of guest memory for most x86 VMs).

On the other hand, unless userspace sets KVM_MEM_USERFAULT from time zero, KVM
will need to unmap guest memory (or demote the mapping size a la eager page
splitting?) when KVM_MEM_USERFAULT is toggled from 0=>1.

One thought would be to change the behavior of KVM's processing of the userfault
bitmap, such that KVM doesn't infer *anything* about the mapping sizes, and instead
give userspace more explicit control over the mapping size.  However, on non-x86
architectures, implementing such a control would require a non-trivial amount of
code and complexity, and would incur overhead that doesn't exist today (i.e. we'd
need to implement equivalent infrastructure to x86's disallow_lpage tracking).

And IIUC, another problem with KVM Userfault is that it wouldn't Just Work for
KVM accesses to guest memory.  E.g. if the HugeTLB page is still mapped into
userspace, then depending on the flow that gets hit, I'm pretty sure that emulating
an access to the poisoned memory would result in KVM_EXIT_INTERNAL_ERROR, whereas
punching a hole in a memslot would result in a much more friendly KVM_EXIT_MMIO.

All in all, given that KVM needs to correctly handle hugepage vs. memslot
alignment/size issues no matter what, and that KVM has well-established behavior
for handling no-memslot accesses, I'm leaning towards saying userspace should
punch a hole in the memslot in order to emulate a poisoned page.  The only reason
I can think of for preferring a different approach is if userspace can't provide
the desired latency/performance characteristics when punching a hole in a memslot.
Hopefully reacting to a poisoned page is a fairly slow path?
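
For reference, "punching a hole" here is the usual delete-and-re-add
dance, roughly like the below (a userspace sketch; the new slot id and
the 4 KiB page size are assumptions):

  /*
   * Delete the old slot (memory_size = 0 deletes), then re-add it in two
   * pieces around the poisoned page so that guest accesses to that page
   * hit no memslot and become KVM_EXIT_MMIO.
   */
  static void punch_hole(int vm_fd, struct kvm_userspace_memory_region2 *s,
                         __u32 new_slot_id, __u64 poison_gpa)
  {
          struct kvm_userspace_memory_region2 del = *s, lo = *s, hi = *s;
          __u64 hi_off = poison_gpa + 4096 - s->guest_phys_addr;

          del.memory_size = 0;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &del);

          lo.memory_size = poison_gpa - s->guest_phys_addr;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &lo);

          hi.slot = new_slot_id;
          hi.guest_phys_addr = s->guest_phys_addr + hi_off;
          hi.userspace_addr = s->userspace_addr + hi_off;
          hi.memory_size = s->memory_size - hi_off;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &hi);
  }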

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  2025-05-29 14:56         ` Sean Christopherson
@ 2025-05-29 15:37           ` James Houghton
  0 siblings, 0 replies; 34+ messages in thread
From: James Houghton @ 2025-05-29 15:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, May 29, 2025 at 10:56 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 28, 2025, Sean Christopherson wrote:
> > On Wed, May 28, 2025, Oliver Upton wrote:
> > > On Tue, May 06, 2025 at 05:05:50PM -0700, Sean Christopherson wrote:
> > > > > +       if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
> > > > > +           (change == KVM_MR_FLAGS_ONLY)) {
> > > > > +               if (old_flags & KVM_MEM_USERFAULT)
> > > > > +                       kvm_mmu_recover_huge_pages(kvm, new);
> > > > > +               else
> > > > > +                       kvm_arch_flush_shadow_memslot(kvm, old);
> > > >
> > > > The call to kvm_arch_flush_shadow_memslot() should definitely go in common code.
> > > > The fancy recovery logic is arch specific, but blasting the memslot when userfault
> > > > is toggled on is not.
> > >
> > > Not like anything in KVM is consistent but sprinkling translation
> > > changes / invalidations between arch and generic code feels
> > > error-prone.
> >
> > Eh, leaving critical operations to arch code isn't exactly error free either :-)
> >
> > > Especially if there isn't clear ownership of a particular flag, e.g. 0 -> 1
> > > transitions happen in generic code and 1 -> 0 happens in arch code.
> >
> > The difference I see is that removing access to the memslot on 0=>1 is mandatory,
> > whereas any action on 1=>0 is not.  So IMO it's not arbitrary sprinkling of
> > invalidations, it's deliberately putting the common, mandatory logic in generic
> > code, while leaving optional performance tweaks to arch code.
> >
> > > Even in the case of KVM_MEM_USERFAULT, an architecture could potentially
> > > preserve the stage-2 translations but reap access permissions without
> > > modifying page tables / TLBs.
> >
> > Yes, but that wouldn't be strictly unique to KVM_MEM_USERFAULT.
> >
> > E.g. for NUMA balancing faults (or rather, the PROT_NONE conversions), KVM could
> > handle the mmu_notifier invalidations by removing access while keeping the PTEs,
> > so that faulting the memory back would be a lighter weight operation.  Ditto for
> > reacting to other protection changes that come through mmu_notifiers.
> >
> > If we want to go down that general path, my preference would be to put the control
> > logic in generic code, and then call dedicated arch APIs for removing protections.
> >
> > > I'm happy with arch interfaces that clearly express intent (make this
> > > memslot inaccessible), then the architecture can make an informed
> > > decision about how to best achieve that. Otherwise we're always going to
> > > use the largest possible hammer potentially overinvalidate.
> >
> > Yeah, definitely no argument there given x86's history in this area.  Though if
> > we want to tackle that problem straightaway, I think I'd vote to add the
> > aforementioned dedicated APIs for removing protections, with a generic default
> > implementation that simply invokes kvm_arch_flush_shadow_memslot().

I'm happy to add something like kvm_arch_invalidate_shadow_memslot()
which invokes kvm_arch_flush_shadow_memslot() by default (and has a
lockdep assertion for holding the slots lock), with no architectures
currently providing a specialization. Feel free to suggest better
names.

Or we could do kvm_arch_userfault_changed(/* ... */, bool enabled),
and, for the default implementation, if `enabled == true`, do
kvm_arch_invalidate_shadow_memslot(), else do nothing. Then x86 can
specialize this. This arguably still leaves the responsibility of
unmapping/invalidating everything to arch code...

Let me know your preferences, Sean and Oliver.
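
Concretely, something like this for the first option (the name and the
__weak default are placeholders for this proposal, not existing
kvm/next code):

  /* Default: use the big hammer; arch code may override with something
   * more surgical (e.g. reap permissions but keep the page tables). */
  void __weak kvm_arch_invalidate_shadow_memslot(struct kvm *kvm,
                                                 struct kvm_memory_slot *slot)
  {
          lockdep_assert_held(&kvm->slots_lock);

          kvm_arch_flush_shadow_memslot(kvm, slot);
  }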

>
> Alternatively, we could punt on this issue entirely by not allowing userspace to
> set KVM_MEM_USERFAULT on anything but KVM_MR_CREATE.  I.e. allow a FLAGS_ONLY
> update to clear USERFAULT, but not set USERFAULT.
>
> Other than emulating poisoned pages, is there a (potential) use case for setting
> KVM_MEM_USERFAULT after a VM has been created?

Today, Google's userspace does not know when creating memslots that we
will need to enable KVM_MEM_USERFAULT. We could delete and re-add the
memslots of course, but overall, for any userspace, I think adding
this restriction (for what seems to be a non-issue :)) isn't worth it.

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 00/13] KVM: Introduce KVM Userfault
  2025-05-29 15:28     ` Sean Christopherson
@ 2025-05-29 16:17       ` James Houghton
  0 siblings, 0 replies; 34+ messages in thread
From: James Houghton @ 2025-05-29 16:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, wei.w.wang, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, Jiaqi Yan

On Thu, May 29, 2025 at 11:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 28, 2025, James Houghton wrote:
> > The only thing that I want to call out again is that this UAPI works
> > great for when we are going from userfault --> !userfault. That is, it
> > works well for postcopy (both for guest_memfd and for standard
> > memslots where userfaultfd scalability is a concern).
> >
> > But there is another use case worth bringing up: unmapping pages that
> > the VMM is emulating as poisoned.
> >
> > Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for
> > 4K poison within a HugeTLB-backed memslot (if the HugeTLB page remains
> > mapped in userspace), KVM Userfault is the only option (if we don't
> > want to punch holes in memslots). This leaves us with three problems:
> >
> > 1. If using KVM Userfault to emulate poison, we are stuck with small
> > pages in stage 2 for the entire memslot.
> > 2. We must unmap everything when toggling on KVM Userfault just to
> > unmap a single page.
> > 3. If KVM Userfault is already enabled, we have no choice but to
> > toggle KVM Userfault off and on again to unmap the newly poisoned
> > pages (i.e., there is no ioctl to scan the bitmap and unmap
> > newly-userfault pages).
> >
> > All of these are non-issues if we emulate poison by removing memslots,
> > and I think that's possible. But if that proves too slow, we'd need to
> > be a little bit more clever with hugepage recovery and with unmapping
> > newly-userfault pages, both of which I think can be solved by adding
> > some kind of bitmap re-scan ioctl. We can do that later if the need
> > arises.
>
> Hmm.
>
> On the one hand, punching a hole in a memslot is generally gross, e.g. requires
> deleting the entire memslot and thus unmapping large swaths of guest memory (or
> all of guest memory for most x86 VMs).
>
> On the other hand, unless userspace sets KVM_MEM_USERFAULT from time zero, KVM
> will need to unmap guest memory (or demote the mapping size a la eager page
> splitting?) when KVM_MEM_USERFAULT is toggled from 0=>1.
>
> One thought would be to change the behavior of KVM's processing of the userfault
> bitmap, such that KVM doesn't infer *anything* about the mapping sizes, and instead
> give userspace more explicit control over the mapping size.  However, on non-x86
> architectures, implementing such a control would require a non-trivial amount of
> code and complexity, and would incur overhead that doesn't exist today (i.e. we'd
> need to implement equivalent infrastructure to x86's disallow_lpage tracking).
>
> And IIUC, another problem with KVM Userfault is that it wouldn't Just Work for
> KVM accesses to guest memory.  E.g. if the HugeTLB page is still mapped into
> userspace, then depending on the flow that gets hit, I'm pretty sure that emulating
> an access to the poisoned memory would result in KVM_EXIT_INTERNAL_ERROR, whereas
> punching a hole in a memslot would result in a much more friendly KVM_EXIT_MMIO.

Oh, yes, of course. KVM Userfault is not enough for memory poison
emulation for non-guest-memfd memslots. Just as these memslots need
userfaultfd to do post-copy properly, they still need userfaultfd for
memory poison (so 4K emulated poison within a HugeTLB memslot is not
possible).

So yeah in this case (4K poison in a still-mapped HugeTLB page), we
would need to punch a hole and get KVM_EXIT_MMIO. SGTM.

For guest_memfd memslots, we can handle uaccess to emulated poison
like tmpfs: with UFFDIO_POISON (Nikita has already started on
UFFDIO_CONTINUE support[1]). We *could* make the gmem page fault
handler (what Fuad is implementing) respect KVM Userfault, but that
isn't necessary (and would look quite like a reimplementation of
userfaultfd).

[1]: https://lore.kernel.org/kvm/20250404154352.23078-1-kalyazin@amazon.com/
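
(For reference, the UFFDIO_POISON side of this looks roughly like the
below; a sketch assuming uffd is already registered over the range and
a 4 KiB page size:)

  /* Install a poison marker for the page; later guest/user access to it
   * SIGBUSes as if the page were hardware-poisoned. */
  struct uffdio_poison poison = {
          .range = { .start = hva & ~4095UL, .len = 4096 },
          .mode = 0,
  };
  if (ioctl(uffd, UFFDIO_POISON, &poison) || poison.updated != 4096)
          err(1, "UFFDIO_POISON");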

> All in all, given that KVM needs to correctly handle hugepage vs. memslot
> alignment/size issues no matter what, and that KVM has well-established behavior
> for handling no-memslot accesses, I'm leaning towards saying userspace should
> punch a hole in the memslot in order to emulate a poisoned page.  The only reason
> I can think of for preferring a different approach is if userspace can't provide
> the desired latency/performance characteristics when punching a hole in a memslot.
> Hopefully reacting to a poisoned page is a fairly slow path?

In general, yes it is. Memory poison is rare.

For non-HugeTLB (tmpfs or guest_memfd), I don't think we need to punch
a hole, so that's good. For HugeTLB, there are two circumstances that
are perhaps concerning:

1. Learning about poison during post-copy? This should be vanishingly
rare, as most poison is discovered in the first pre-copy pass. If we
didn't do *any* pre-copy passes, then it could be a concern.
2. Learning about poison during pre-copy after shattering? If doing
lazy page splitting with incremental dirty log clearing, this isn't a
*huge* problem, otherwise it could be.

I think userspace has two ways out: (1) don't make super large
memslots, or (2) don't use HugeTLB.

Just to be clear, this isn't really an issue with KVM Userfault -- in
its current form (not preventing KVM's uaccess), it cannot help here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2025-05-28 23:25             ` Sean Christopherson
@ 2025-06-09 23:04               ` James Houghton
  0 siblings, 0 replies; 34+ messages in thread
From: James Houghton @ 2025-06-09 23:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: amoorthy, corbet, dmatlack, kalyazin, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel, maz, oliver.upton,
	pbonzini, peterx, pgonda, wei.w.wang, yan.y.zhao

On Wed, May 28, 2025 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 28, 2025, James Houghton wrote:
> > On Wed, May 28, 2025 at 1:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c5d21bcfa3ed4..f1db3f7742b28 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -2127,15 +2131,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> >                                  const struct kvm_memory_slot *new,
> >                                  enum kvm_mr_change change)
> >  {
> > -     bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > +     u32 old_flags = old ? old->flags : 0;
> > +     u32 new_flags = new ? new->flags : 0;
> > +
> > +     /*
> > +      * If only changing flags, nothing to do if not toggling
> > +      * dirty logging.
> > +      */
> > +     if (change == KVM_MR_FLAGS_ONLY &&
> > +         !((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > +             return;
> >
> >       /*
> >        * At this point memslot has been committed and there is an
> >        * allocated dirty_bitmap[], dirty pages will be tracked while the
> >        * memory slot is write protected.
> >        */
> > -     if (log_dirty_pages) {
> > -
> > +     if (new_flags & KVM_MEM_LOG_DIRTY_PAGES) {
> >               if (change == KVM_MR_DELETE)
> >                       return;
> >
> >
> > So we need to bail out early if we are enabling KVM_MEM_USERFAULT but
> > KVM_MEM_LOG_DIRTY_PAGES is already enabled, otherwise we'll be
> > write-protecting a bunch of PTEs that we don't need or want to WP.
> >
> > When *disabling* KVM_MEM_USERFAULT, we definitely don't want to WP
> > things, as we aren't going to get the unmap afterwards anyway.
> >
> > So the check we started with handles this:
> > > > > > +       u32 old_flags = old ? old->flags : 0;
> > > > > > +       u32 new_flags = new ? new->flags : 0;
> > > > > > +
> > > > > > +       /* Nothing to do if not toggling dirty logging. */
> > > > > > +       if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
> > > > > > +               return;
> >
> > So why also check for `change == KVM_MR_FLAGS_ONLY` as well? Everything I just
> > said doesn't really apply when the memslot is being created, moved, or
> > destroyed. Otherwise, consider the case where we never enable dirty logging:
> >
> >  - Memslot deletion would be totally broken; we'll see that
> >    KVM_MEM_LOG_DIRTY_PAGES is not getting toggled and then bail out, skipping
> >    some freeing.
>
> No, because @new and thus new_flags will be 0.  If dirty logging wasn't enabled,
> then there's nothing to be done.
>
> >  - Memslot creation would be broken in a similar way; we'll skip a bunch of
> >    setup work.
>
> No, because @old and thus old_flags will be 0.  If dirty logging isn't being
> enabled, then there's nothing to be done.
>
> >  - For memslot moving, the only case that we could possibly be leaving
> >    KVM_MEM_LOG_DIRTY_PAGES set without the change being KVM_MR_FLAGS_ONLY,
> >    I think we still need to do the split and WP stuff.
>
> No, because KVM invokes kvm_arch_flush_shadow_memslot() on the memslot and marks
> it invalid prior to installing the new, moved memslot.  See kvm_invalidate_memslot().
>
> So I'm still not seeing what's buggy.

Sorry, I didn't see your reply, Sean. :(

You're right, I was confusing the KVM_MEM_USERFAULT and
KVM_MEM_LOG_DIRTY_PAGES. I'll undo the little change I said I was
going to make.

Thank you!

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-06-09 23:05 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-09 20:49 [PATCH v2 00/13] KVM: Introduce KVM Userfault James Houghton
2025-01-09 20:49 ` [PATCH v2 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
2025-05-07  0:01   ` Sean Christopherson
2025-05-28 15:21     ` James Houghton
2025-01-09 20:49 ` [PATCH v2 02/13] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT James Houghton
2025-01-09 20:49 ` [PATCH v2 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot James Houghton
2025-05-07  0:03   ` Sean Christopherson
2025-01-09 20:49 ` [PATCH v2 04/13] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION James Houghton
2025-01-09 20:49 ` [PATCH v2 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT James Houghton
2025-05-07  0:05   ` Sean Christopherson
2025-05-28 20:21     ` Oliver Upton
2025-05-28 21:22       ` Sean Christopherson
2025-05-29 14:56         ` Sean Christopherson
2025-05-29 15:37           ` James Houghton
2025-01-09 20:49 ` [PATCH v2 06/13] KVM: arm64: " James Houghton
2025-05-07  0:06   ` Sean Christopherson
2025-05-28 15:09     ` James Houghton
2025-05-28 15:25       ` James Houghton
2025-05-28 17:30         ` Sean Christopherson
2025-05-28 20:17           ` James Houghton
2025-05-28 23:25             ` Sean Christopherson
2025-06-09 23:04               ` James Houghton
2025-01-09 20:49 ` [PATCH v2 07/13] KVM: selftests: Fix vm_mem_region_set_flags docstring James Houghton
2025-01-09 20:49 ` [PATCH v2 08/13] KVM: selftests: Fix prefault_mem logic James Houghton
2025-01-09 20:49 ` [PATCH v2 09/13] KVM: selftests: Add va_start/end into uffd_desc James Houghton
2025-01-09 20:49 ` [PATCH v2 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
2025-01-09 20:49 ` [PATCH v2 11/13] KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT James Houghton
2025-01-09 20:49 ` [PATCH v2 12/13] KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests James Houghton
2025-01-09 20:49 ` [PATCH v2 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details James Houghton
2025-05-06 23:48 ` [PATCH v2 00/13] KVM: Introduce KVM Userfault Sean Christopherson
2025-05-07  0:13 ` Sean Christopherson
2025-05-28 15:48   ` James Houghton
2025-05-29 15:28     ` Sean Christopherson
2025-05-29 16:17       ` James Houghton
