linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v1 00/13] KVM: Introduce KVM Userfault
@ 2024-12-04 19:13 James Houghton
  2024-12-04 19:13 ` [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

This is a continuation of the original KVM Userfault RFC[1] from July.
It contains the simplifications we talked about at LPC[2].

Please see the RFC[1] for the problem description. In summary,
guest_memfd VMs have no mechanism for doing post-copy live migration.
KVM Userfault provides such a mechanism. Today there is no upstream
mechanism for installing memory into a guest_memfd, but there will
be one soon (e.g. [3]).

There is a second problem that KVM Userfault solves: userfaultfd-based
post-copy doesn't scale very well. KVM Userfault when used with
userfaultfd can scale much better in the common case that most post-copy
demand fetches are a result of vCPU access violations. This is a
continuation of the solution Anish was working on[4]. This aspect of
KVM Userfault is important for userfaultfd-based live migration when
scaling up to hundreds of vCPUs with ~30us network latency for a
PAGE_SIZE demand-fetch.

The implementation in this series is a simplified version of the RFC[1].
It adds...
 1. a new memslot flag: KVM_MEM_USERFAULT,
 2. a new field, userfault_bitmap, in struct kvm_memory_slot,
 3. a new KVM_RUN memory fault exit flag: KVM_MEMORY_EXIT_FLAG_USERFAULT,
 4. a new KVM capability: KVM_CAP_USERFAULT.

KVM Userfault does not attempt to catch KVM's own accesses to guest
memory. That is left up to userfaultfd.

When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
are zapped, and new faults will check `userfault_bitmap` to see if the
fault should exit to userspace.
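
To illustrate the bitmap convention, here is a userspace-side sketch (not
code from this series; the helper names and the BITS_PER_LONG definition
are stand-ins) of the same indexing that KVM performs when it consults
`userfault_bitmap`: one bit per gfn, relative to the memslot's base gfn:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Userspace mirror of the check the kernel does on a stage-2 fault:
 * a gfn faults out to userspace iff its bit in the memslot's
 * userfault bitmap is set. `base_gfn` is the memslot's first gfn.
 */
static int gfn_is_userfault(const unsigned long *bitmap,
			    unsigned long base_gfn, unsigned long gfn)
{
	unsigned long offset = gfn - base_gfn;

	return !!(bitmap[offset / BITS_PER_LONG] &
		  (1ul << (offset % BITS_PER_LONG)));
}

/* Mark a gfn so that the next vCPU fault on it exits to userspace. */
static void set_userfault(unsigned long *bitmap,
			  unsigned long base_gfn, unsigned long gfn)
{
	unsigned long offset = gfn - base_gfn;

	bitmap[offset / BITS_PER_LONG] |= 1ul << (offset % BITS_PER_LONG);
}
```

In a post-copy flow, userspace would start with the bitmap all-set and
clear a gfn's bit (then wake the vCPU) once the page has been installed.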

When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
permitted.

When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
(either eagerly or on-demand; the architecture can decide).

KVM Userfault is not compatible with async page faults. Nikita has
proposed a new, more userspace-driven implementation of async page
faults that *is* compatible with KVM Userfault[5].

Performance
===========

The takeaways I have are:

1. For cases where lock contention is not a concern, there is a
   discernible win because KVM Userfault saves the trip through the
   userfaultfd poll/read/WAKE cycle.

2. Using a single userfaultfd without KVM Userfault gets very slow as
   the number of vCPUs increases, and it gets even slower when you add
   more reader threads. This is due to contention on the userfaultfd
   wait_queue locks. This is the contention that KVM Userfault avoids.
   Compare this to the multiple-userfaultfd runs; they are much faster
   because the wait_queue locks are sharded perfectly (1 per vCPU).
   Perfect sharding is only possible because the vCPUs are configured to
   touch only their own chunk of memory.

Config:
 - 64M per vcpu
 - vcpus only touch their 64M (`-b 64M -a`)
 - THPs disabled
 - MGLRU disabled

Each run used the following command:

./demand_paging_test -b 64M -a -v <#vcpus>  \
	-s shmem		     \ # if using shmem
	-r <#readers> -u <uffd_mode> \ # if using userfaultfd
	-k \			     \ # if using KVM Userfault
	-m 3			       # when on arm64

note: I patched demand_paging_test so that, when using shmem, the page
      cache will always be preallocated, not only in the `-u MINOR`
      case. Otherwise the comparison would be unfair. I left this patch
      out in the selftest commits, but I am happy to add it if it would
      be useful.

== x86 (96 LPUs, 48 cores, TDP MMU enabled) ==

-- Anonymous memory, single userfaultfd

	userfault mode
vcpus				2	8	64
	no userfault		306845	220402	47720
	MISSING (single reader)	90724	26059	3029
	MISSING			86840	37912	1664
	MISSING + KVM UF	143021	104385	34910
	KVM UF			160326	128247	39913

-- shmem (preallocated), single userfaultfd

vcpus				2	8	64
	no userfault		375130	214635	54420
	MINOR (single reader)	102336	31704	3244
	MINOR 			97981	36982	1673
	MINOR + KVM UF		161835	113716	33577
	KVM UF			181972	131204	42914

-- shmem (preallocated), multiple userfaultfds

vcpus				2	8	64
	no userfault		374060	216108	54433
	MINOR			102661	56978	11300
	MINOR + KVM UF		167080	123461	48382
	KVM UF			180439	122310	42539

== arm64 (96 PEs, AmpereOne) ==

-- shmem (preallocated), single userfaultfd

vcpus:				2	8	64
	no userfault		419069	363081	34348
	MINOR (single reader)	87421	36147	3764
	MINOR			84953	43444	1323
	MINOR + KVM UF		164509	139986	12373
	KVM UF			185706	122153	12153

-- shmem (preallocated), multiple userfaultfds

vcpus:				2	8	64
	no userfault		401931	334142	36117
	MINOR			83696	75617	15996
	MINOR + KVM UF		176327	115784	12198
	KVM UF			190074	126966	12084

This series is based on the latest kvm/next.

[1]: https://lore.kernel.org/kvm/20240710234222.2333120-1-jthoughton@google.com/
[2]: https://lpc.events/event/18/contributions/1757/
[3]: https://lore.kernel.org/kvm/20241112073837.22284-1-yan.y.zhao@intel.com/
[4]: https://lore.kernel.org/all/20240215235405.368539-1-amoorthy@google.com/
[5]: https://lore.kernel.org/kvm/20241118123948.4796-1-kalyazin@amazon.com/#t

James Houghton (13):
  KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
  KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  KVM: arm64: Add support for KVM_MEM_USERFAULT
  KVM: selftests: Fix vm_mem_region_set_flags docstring
  KVM: selftests: Fix prefault_mem logic
  KVM: selftests: Add va_start/end into uffd_desc
  KVM: selftests: Add KVM Userfault mode to demand_paging_test
  KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
  KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
  KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT
    details

 Documentation/virt/kvm/api.rst                |  33 +++-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/mmu.c                          |  23 ++-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  27 +++-
 arch/x86/kvm/mmu/mmu_internal.h               |  20 ++-
 arch/x86/kvm/x86.c                            |  36 +++--
 include/linux/kvm_host.h                      |  19 ++-
 include/uapi/linux/kvm.h                      |   6 +-
 .../selftests/kvm/demand_paging_test.c        | 145 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   5 +
 .../selftests/kvm/include/userfaultfd_util.h  |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  42 ++++-
 .../selftests/kvm/lib/userfaultfd_util.c      |   2 +
 .../selftests/kvm/set_memory_region_test.c    |  33 ++++
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  47 +++++-
 17 files changed, 409 insertions(+), 36 deletions(-)


base-commit: 4d911c7abee56771b0219a9fbf0120d06bdc9c14
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-05 11:52   ` kernel test robot
  2024-12-05 14:22   ` kernel test robot
  2024-12-04 19:13 ` [PATCH v1 02/13] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT James Houghton
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Use one of the 14 reserved u64s in struct kvm_userspace_memory_region2
for the user to provide `userfault_bitmap`.

The memslot flag indicates whether KVM should read from the memslot's
`userfault_bitmap` field. The user is permitted to provide a bogus
pointer; if the bitmap cannot be read from, we will return -EFAULT
(with no other information) back to the user.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/kvm_host.h | 14 ++++++++++++++
 include/uapi/linux/kvm.h |  4 +++-
 virt/kvm/Kconfig         |  3 +++
 virt/kvm/kvm_main.c      | 28 ++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3..f7a3dfd5e224 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,7 @@ struct kvm_memory_slot {
 	unsigned long *dirty_bitmap;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
+	unsigned long __user *userfault_bitmap;
 	u32 flags;
 	short id;
 	u16 as_id;
@@ -724,6 +725,11 @@ static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 }
 #endif
 
+static inline bool kvm_has_userfault(struct kvm *kvm)
+{
+	return IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT);
+}
+
 struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
@@ -2553,4 +2559,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		      gfn_t gfn);
+
+static inline bool kvm_memslot_userfault(struct kvm_memory_slot *memslot)
+{
+	return memslot->flags & KVM_MEM_USERFAULT;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 502ea63b5d2e..94be7e8b46a4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -40,7 +40,8 @@ struct kvm_userspace_memory_region2 {
 	__u64 guest_memfd_offset;
 	__u32 guest_memfd;
 	__u32 pad1;
-	__u64 pad2[14];
+	__u64 userfault_bitmap;
+	__u64 pad2[13];
 };
 
 /*
@@ -51,6 +52,7 @@ struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_USERFAULT	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..9eb1fae238b1 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,6 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_USERFAULT
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..23fa3e911c4e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1541,6 +1541,9 @@ static int check_memory_region_flags(struct kvm *kvm,
 	    !(mem->flags & KVM_MEM_GUEST_MEMFD))
 		valid_flags |= KVM_MEM_READONLY;
 
+	if (kvm_has_userfault(kvm))
+		valid_flags |= KVM_MEM_USERFAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -2042,6 +2045,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if (r)
 			goto out;
 	}
+	if (mem->flags & KVM_MEM_USERFAULT)
+		new->userfault_bitmap = (unsigned long *)mem->userfault_bitmap;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
@@ -6426,3 +6431,26 @@ void kvm_exit(void)
 	kvm_irqfd_exit();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);
+
+int kvm_gfn_userfault(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		       gfn_t gfn)
+{
+	unsigned long bitmap_chunk = 0;
+	off_t offset;
+
+	if (!kvm_memslot_userfault(memslot))
+		return 0;
+
+	if (WARN_ON_ONCE(!memslot->userfault_bitmap))
+		return 0;
+
+	offset = gfn - memslot->base_gfn;
+
+	if (copy_from_user(&bitmap_chunk,
+			   memslot->userfault_bitmap + offset / BITS_PER_LONG,
+			   sizeof(bitmap_chunk)))
+		return -EFAULT;
+
+	/* Set in the bitmap means that the gfn is userfault */
+	return !!(bitmap_chunk & (1ul << (offset % BITS_PER_LONG)));
+}
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v1 02/13] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
  2024-12-04 19:13 ` [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot James Houghton
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

This flag is used for vCPU memory faults caused by KVM Userfault; i.e.,
the bit in `userfault_bitmap` corresponding to the faulting gfn was set.
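
As a sketch of how a VMM might consume this (illustrative userspace
code, not part of the series; the constants are copied from the patch
below, where PRIVATE is bit 3 and the new USERFAULT flag is bit 4 of
kvm_run.memory_fault.flags):

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits in kvm_run.memory_fault.flags, as defined by this series. */
#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
#define KVM_MEMORY_EXIT_FLAG_USERFAULT	(1ULL << 4)

/*
 * Returns nonzero if a KVM_EXIT_MEMORY_FAULT was raised because the
 * faulting gfn's bit was set in the memslot's userfault bitmap, as
 * opposed to (e.g.) a private/shared conversion fault.
 */
static int fault_is_userfault(uint64_t flags)
{
	return !!(flags & KVM_MEMORY_EXIT_FLAG_USERFAULT);
}
```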

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 94be7e8b46a4..641a2e580441 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -444,6 +444,7 @@ struct kvm_run {
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+#define KVM_MEMORY_EXIT_FLAG_USERFAULT	(1ULL << 4)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v1 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
  2024-12-04 19:13 ` [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
  2024-12-04 19:13 ` [PATCH v1 02/13] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 04/13] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION James Houghton
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Currently guest_memfd memslots can only be deleted. Slightly change the
logic to allow KVM_MR_FLAGS_ONLY changes when the only flag being
changed is KVM_MEM_USERFAULT.
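
The new rule can be sketched in isolation like this (illustrative
userspace C, not kernel code; `is_create` stands in for
change == KVM_MR_CREATE, and the flag values are copied from the uapi):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
#define KVM_MEM_READONLY	(1UL << 1)
#define KVM_MEM_GUEST_MEMFD	(1UL << 2)
#define KVM_MEM_USERFAULT	(1UL << 3)

/*
 * A guest_memfd memslot may be created, or may take a flags-only
 * update whose sole difference is KVM_MEM_USERFAULT; any other
 * modification is rejected (deletion is handled separately).
 */
static int gmem_change_allowed(uint32_t old_flags, uint32_t new_flags,
			       int is_create)
{
	if (!(new_flags & KVM_MEM_GUEST_MEMFD))
		return 1;	/* rule only constrains guest_memfd slots */
	if (is_create)
		return 1;
	return (old_flags ^ new_flags) == KVM_MEM_USERFAULT;
}
```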

Signed-off-by: James Houghton <jthoughton@google.com>
---
 virt/kvm/kvm_main.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 23fa3e911c4e..fa851704db94 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2009,9 +2009,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
-		/* Private memslots are immutable, they can only be deleted. */
-		if (mem->flags & KVM_MEM_GUEST_MEMFD)
-			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2025,6 +2022,16 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			return 0;
 	}
 
+	/*
+	 * Except for being able to set KVM_MEM_USERFAULT, private memslots are
+	 * immutable, they can only be deleted.
+	 */
+	if (mem->flags & KVM_MEM_GUEST_MEMFD &&
+	    !(change == KVM_MR_CREATE ||
+	      (change == KVM_MR_FLAGS_ONLY &&
+	       (mem->flags ^ old->flags) == KVM_MEM_USERFAULT)))
+		return -EINVAL;
+
 	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
 	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
 		return -EEXIST;
@@ -2040,7 +2047,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
-	if (mem->flags & KVM_MEM_GUEST_MEMFD) {
+	if (mem->flags & KVM_MEM_GUEST_MEMFD && change == KVM_MR_CREATE) {
 		r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
 		if (r)
 			goto out;
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v1 04/13] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (2 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT James Houghton
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Advertise support for KVM_CAP_USERFAULT when kvm_has_userfault() returns
true. Currently this is merely IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT), so
it is somewhat redundant.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 641a2e580441..d9a135c895d7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -936,6 +936,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_USERFAULT 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa851704db94..b552cdef2850 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4804,6 +4804,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_HAVE_KVM_USERFAULT
+	case KVM_CAP_USERFAULT:
+		return kvm_has_userfault(kvm);
 #endif
 	default:
 		break;
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v1 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (3 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 04/13] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 06/13] KVM: arm64: " James Houghton
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Adhering to the requirements of KVM Userfault:

1. Zap all sptes for the memslot when KVM_MEM_USERFAULT is toggled on
   with kvm_arch_flush_shadow_memslot().
2. Only allow PAGE_SIZE sptes while KVM_MEM_USERFAULT is enabled (for
   both normal/GUP memory and guest_memfd memory).
3. Reconstruct huge mappings when KVM_MEM_USERFAULT is toggled off with
   kvm_mmu_recover_huge_pages().

With the new logic in kvm_mmu_slot_apply_flags(), I've simplified the
two dirty-logging-toggle checks into one, and I have dropped the
WARN_ON() that was there.
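
The KVM_MR_FLAGS_ONLY decision this adds can be sketched standalone
(illustrative C, not kernel code; the enum names are stand-ins for the
kvm_arch_flush_shadow_memslot()/kvm_mmu_recover_huge_pages() calls):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_MEM_USERFAULT	(1UL << 3)

enum uf_action { UF_NONE, UF_ZAP_SLOT, UF_RECOVER_HUGE };

/*
 * Turning userfault on zaps the slot so that refaults consult the
 * bitmap (and install only 4K sptes); turning it off recovers huge
 * pages. If the flag didn't change, there is nothing to do here.
 */
static enum uf_action userfault_toggle_action(uint32_t old_flags,
					      uint32_t new_flags)
{
	if (!((old_flags ^ new_flags) & KVM_MEM_USERFAULT))
		return UF_NONE;
	return (old_flags & KVM_MEM_USERFAULT) ? UF_RECOVER_HUGE
					       : UF_ZAP_SLOT;
}
```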

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h | 20 +++++++++++++++---
 arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++---------
 include/linux/kvm_host.h        |  5 ++++-
 5 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..286c6825cd1c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -47,6 +47,7 @@ config KVM_X86
 	select KVM_GENERIC_PRE_FAULT_MEMORY
 	select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
 	select KVM_WERROR if WERROR
+	select HAVE_KVM_USERFAULT
 
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22e7ad235123..2f7381255d11 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4292,14 +4292,19 @@ static inline u8 kvm_max_level_for_order(int order)
 	return PG_LEVEL_4K;
 }
 
-static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
-					u8 max_level, int gmem_order)
+static u8 kvm_max_private_mapping_level(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					kvm_pfn_t pfn, u8 max_level,
+					int gmem_order)
 {
 	u8 req_max_level;
 
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_memslot_userfault(slot))
+		return PG_LEVEL_4K;
+
 	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
@@ -4336,8 +4341,10 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	}
 
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
-	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
-							 fault->max_level, max_order);
+	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->slot,
+							 fault->pfn,
+							 fault->max_level,
+							 max_order);
 
 	return RET_PF_CONTINUE;
 }
@@ -4346,6 +4353,18 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 				 struct kvm_page_fault *fault)
 {
 	unsigned int foll = fault->write ? FOLL_WRITE : 0;
+	int userfault;
+
+	userfault = kvm_gfn_userfault(vcpu->kvm, fault->slot, fault->gfn);
+	if (userfault < 0)
+		return userfault;
+	if (userfault) {
+		kvm_mmu_prepare_userfault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
+	if (kvm_memslot_userfault(fault->slot))
+		fault->max_level = PG_LEVEL_4K;
 
 	if (fault->is_private)
 		return kvm_mmu_faultin_pfn_private(vcpu, fault);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b00abbe3f6cf..15705faa3b67 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -282,12 +282,26 @@ enum {
 	RET_PF_SPURIOUS,
 };
 
-static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
-						     struct kvm_page_fault *fault)
+static inline void __kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						       struct kvm_page_fault *fault,
+						       bool is_userfault)
 {
 	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
 				      PAGE_SIZE, fault->write, fault->exec,
-				      fault->is_private);
+				      fault->is_private,
+				      is_userfault);
+}
+
+static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						     struct kvm_page_fault *fault)
+{
+	__kvm_mmu_prepare_memory_fault_exit(vcpu, fault, false);
+}
+
+static inline void kvm_mmu_prepare_userfault_exit(struct kvm_vcpu *vcpu,
+						  struct kvm_page_fault *fault)
+{
+	__kvm_mmu_prepare_memory_fault_exit(vcpu, fault, true);
 }
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e713480933a..2f7080fd6218 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13053,12 +13053,36 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	u32 new_flags = new ? new->flags : 0;
 	bool log_dirty_pages = new_flags & KVM_MEM_LOG_DIRTY_PAGES;
 
+	/*
+	 * When toggling KVM Userfault on, zap all sptes so that userfault-ness
+	 * will be respected at refault time. All new faults will only install
+	 * small sptes. Therefore, when toggling it off, recover hugepages.
+	 *
+	 * For MOVE and DELETE, there will be nothing to do, as the old
+	 * mappings will have already been deleted by
+	 * kvm_arch_flush_shadow_memslot().
+	 *
+	 * For CREATE, no mappings will have been created yet.
+	 */
+	if ((old_flags ^ new_flags) & KVM_MEM_USERFAULT &&
+	    (change == KVM_MR_FLAGS_ONLY)) {
+		if (old_flags & KVM_MEM_USERFAULT)
+			kvm_mmu_recover_huge_pages(kvm, new);
+		else
+			kvm_arch_flush_shadow_memslot(kvm, old);
+	}
+
+	/*
+	 * Nothing more to do if dirty logging isn't being toggled.
+	 */
+	if (!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
+
 	/*
 	 * Update CPU dirty logging if dirty logging is being toggled.  This
 	 * applies to all operations.
 	 */
-	if ((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES)
-		kvm_mmu_update_cpu_dirty_logging(kvm, log_dirty_pages);
+	kvm_mmu_update_cpu_dirty_logging(kvm, log_dirty_pages);
 
 	/*
 	 * Nothing more to do for RO slots (which can't be dirtied and can't be
@@ -13078,14 +13102,6 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
 		return;
 
-	/*
-	 * READONLY and non-flags changes were filtered out above, and the only
-	 * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty
-	 * logging isn't being toggled on or off.
-	 */
-	if (WARN_ON_ONCE(!((old_flags ^ new_flags) & KVM_MEM_LOG_DIRTY_PAGES)))
-		return;
-
 	if (!log_dirty_pages) {
 		/*
 		 * Recover huge page mappings in the slot now that dirty logging
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f7a3dfd5e224..9e8a8dcf2b73 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2465,7 +2465,8 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 						 gpa_t gpa, gpa_t size,
 						 bool is_write, bool is_exec,
-						 bool is_private)
+						 bool is_private,
+						 bool is_userfault)
 {
 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
 	vcpu->run->memory_fault.gpa = gpa;
@@ -2475,6 +2476,8 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.flags = 0;
 	if (is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+	if (is_userfault)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_USERFAULT;
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v1 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (4 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 23:07   ` Oliver Upton
  2024-12-04 19:13 ` [PATCH v1 07/13] KVM: selftests: Fix vm_mem_region_set_flags docstring James Houghton
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Adhering to the requirements of KVM Userfault:

1. When it is toggled (either on or off), zap the second stage with
   kvm_arch_flush_shadow_memslot(). This is done to (1) respect
   userfault-ness and (2) reconstruct block mappings.
2. While KVM_MEM_USERFAULT is enabled, restrict new second-stage mappings
   to be PAGE_SIZE, just like when dirty logging is enabled.

Signed-off-by: James Houghton <jthoughton@google.com>
---
  I'm not 100% sure if kvm_arch_flush_shadow_memslot() is correct in
  this case (like if the host does not have S2FWB).
---
 arch/arm64/kvm/Kconfig |  1 +
 arch/arm64/kvm/mmu.c   | 23 ++++++++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ead632ad01b4..d89b4088b580 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select HAVE_KVM_USERFAULT
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a71fe6f6bd90..53cee0bacb75 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1482,7 +1482,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * logging_active is guaranteed to never be true for VM_PFNMAP
 	 * memslots.
 	 */
-	if (logging_active) {
+	if (logging_active || kvm_memslot_userfault(memslot)) {
 		force_pte = true;
 		vma_shift = PAGE_SHIFT;
 	} else {
@@ -1571,6 +1571,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
+	if (kvm_gfn_userfault(kvm, memslot, gfn)) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn << PAGE_SHIFT,
+					      PAGE_SIZE, write_fault,
+					      exec_fault, false, true);
+		return -EFAULT;
+	}
+
 	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
 				&writable, &page);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
@@ -2062,6 +2069,20 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   enum kvm_mr_change change)
 {
 	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+	u32 changed_flags = (new ? new->flags : 0) ^ (old ? old->flags : 0);
+
+	/*
+	 * If KVM_MEM_USERFAULT changed, drop all the stage-2 mappings so that
+	 * we can (1) respect userfault-ness or (2) create block mappings.
+	 */
+	if ((changed_flags & KVM_MEM_USERFAULT) && change == KVM_MR_FLAGS_ONLY)
+		kvm_arch_flush_shadow_memslot(kvm, old);
+
+	/*
+	 * Nothing left to do if not toggling dirty logging.
+	 */
+	if (!(changed_flags & KVM_MEM_LOG_DIRTY_PAGES))
+		return;
 
 	/*
 	 * At this point memslot has been committed and there is an
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v1 07/13] KVM: selftests: Fix vm_mem_region_set_flags docstring
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (5 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 06/13] KVM: arm64: " James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 08/13] KVM: selftests: Fix prefault_mem logic James Houghton
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

`flags` is what region->region.flags gets set to.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 480e3a40d197..9603f99d3247 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1124,7 +1124,7 @@ memslot2region(struct kvm_vm *vm, uint32_t memslot)
  *
  * Input Args:
  *   vm - Virtual Machine
- *   flags - Starting guest physical address
+ *   flags - Flags for the memslot
  *
  * Output Args: None
  *
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v1 08/13] KVM: selftests: Fix prefault_mem logic
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (6 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 07/13] KVM: selftests: Fix vm_mem_region_set_flags docstring James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 09/13] KVM: selftests: Add va_start/end into uffd_desc James Houghton
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

The previous logic didn't handle the case where memory was partitioned
AND we were using a single userfaultfd. It would only prefault the first
vCPU's memory and not the rest.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 0202b78f8680..315f5c9037b4 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -172,11 +172,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		num_uffds = p->single_uffd ? 1 : nr_vcpus;
-		for (i = 0; i < num_uffds; i++) {
+		for (i = 0; i < nr_vcpus; i++) {
 			vcpu_args = &memstress_args.vcpu_args[i];
 			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
 				     vcpu_args->pages * memstress_args.guest_page_size);
+			if (!p->partition_vcpu_memory_access)
+				/* We prefaulted everything */
+				break;
 		}
 	}
 
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v1 09/13] KVM: selftests: Add va_start/end into uffd_desc
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (7 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 08/13] KVM: selftests: Fix prefault_mem logic James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

This will be used by the self-test to look up which userfaultfd should
be used when handling a KVM Userfault (in the event that KVM Userfault
and userfaultfd are used together).

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/include/userfaultfd_util.h | 2 ++
 tools/testing/selftests/kvm/lib/userfaultfd_util.c     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 60f7f9d435dc..b62fecdfe745 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -30,6 +30,8 @@ struct uffd_desc {
 	int *pipefds;
 	pthread_t *readers;
 	struct uffd_reader_args *reader_args;
+	void *va_start;
+	void *va_end;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 7c9de8414462..93004c85bcdc 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -152,6 +152,8 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 		    expected_ioctls, "missing userfaultfd ioctls");
 
 	uffd_desc->uffd = uffd;
+	uffd_desc->va_start = hva;
+	uffd_desc->va_end = (char *)hva + len;
 	for (i = 0; i < uffd_desc->num_readers; ++i) {
 		int pipes[2];
 
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (8 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 09/13] KVM: selftests: Add va_start/end into uffd_desc James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-14 22:46   ` kernel test robot
  2024-12-04 19:13 ` [PATCH v1 11/13] KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT James Houghton
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Add a way for the KVM_RUN loop to handle -EFAULT exits when they are for
KVM_MEMORY_EXIT_FLAG_USERFAULT. In this case, preemptively handle the
UFFDIO_COPY or UFFDIO_CONTINUE if userfaultfd is also in use. This saves
the trip through the userfaultfd poll/read/WAKE loop.

When preemptively handling UFFDIO_COPY/CONTINUE, do so with
MODE_DONTWAKE, as there will not be a thread to wake. If a thread *does*
take the userfaultfd slow path, we will get a regular userfault, and we
will call handle_uffd_page_request() which will do a full wake-up. In
the EEXIST case, a wake-up will not occur. Make sure to call UFFDIO_WAKE
explicitly in this case.

When handling KVM userfaults, make sure to clear bits in the bitmap with
memory_order_release. Although it wouldn't affect the functionality of
the test (because memstress doesn't actually require any particular
guest memory contents), it is what userspace normally needs to do.

Add `-k` to set the test to use KVM Userfault.

Add the vm_mem_region_set_flags_userfault() helper for setting
`userfault_bitmap` and KVM_MEM_USERFAULT at the same time.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 139 +++++++++++++++++-
 .../testing/selftests/kvm/include/kvm_util.h  |   5 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  40 ++++-
 3 files changed, 176 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 315f5c9037b4..e7ea1c57264d 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -12,7 +12,9 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <linux/bitmap.h>
 #include <sys/syscall.h>
+#include <stdatomic.h>
 
 #include "kvm_util.h"
 #include "test_util.h"
@@ -24,11 +26,21 @@
 #ifdef __NR_userfaultfd
 
 static int nr_vcpus = 1;
+static int num_uffds;
 static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 
 static size_t demand_paging_size;
+static size_t host_page_size;
 static char *guest_data_prototype;
 
+static struct {
+	bool enabled;
+	int uffd_mode; /* set if userfaultfd is also in use */
+	struct uffd_desc **uffd_descs;
+} kvm_userfault_data;
+
+static void resolve_kvm_userfault(u64 gpa, u64 size);
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
@@ -41,8 +53,22 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 	clock_gettime(CLOCK_MONOTONIC, &start);
 
 	/* Let the guest access its memory */
+restart:
 	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+	if (ret < 0 && errno == EFAULT && kvm_userfault_data.enabled) {
+		/* Check for userfault. */
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_MEMORY_FAULT,
+			    "Got invalid exit reason: %x", run->exit_reason);
+		TEST_ASSERT(run->memory_fault.flags ==
+			    KVM_MEMORY_EXIT_FLAG_USERFAULT,
+			    "Got invalid memory fault exit: %llx",
+			    run->memory_fault.flags);
+		resolve_kvm_userfault(run->memory_fault.gpa,
+				      run->memory_fault.size);
+		goto restart;
+	} else
+		TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+
 	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
 		TEST_ASSERT(false,
 			    "Invalid guest sync status: exit_reason=%s",
@@ -54,11 +80,10 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 		       ts_diff.tv_sec, ts_diff.tv_nsec);
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int resolve_uffd_page_request(int uffd_mode, int uffd, uint64_t addr,
+				     bool wake)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,7 +96,7 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.src = (uint64_t)guest_data_prototype;
 		copy.dst = addr;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
@@ -96,6 +121,7 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 		cont.range.start = addr;
 		cont.range.len = demand_paging_size;
+		cont.mode = wake ? 0 : UFFDIO_CONTINUE_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
 		/*
@@ -119,6 +145,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	if (r < 0 && wake) {
+		/*
+		 * No wake-up occurs when UFFDIO_COPY/CONTINUE fails, but we
+		 * have a thread waiting. Wake it up.
+		 */
+		struct uffdio_range range = {0};
+
+		range.start = addr;
+		range.len = demand_paging_size;
+
+		TEST_ASSERT(ioctl(uffd, UFFDIO_WAKE, &range) == 0,
+			    "UFFDIO_WAKE failed: 0x%lx", addr);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
@@ -129,6 +169,58 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 	return 0;
 }
 
+static int handle_uffd_page_request(int uffd_mode, int uffd,
+				    struct uffd_msg *msg)
+{
+	uint64_t addr = msg->arg.pagefault.address;
+
+	return resolve_uffd_page_request(uffd_mode, uffd, addr, true);
+}
+
+static void resolve_kvm_userfault(u64 gpa, u64 size)
+{
+	struct kvm_vm *vm = memstress_args.vm;
+	struct userspace_mem_region *region;
+	unsigned long *bitmap_chunk;
+	u64 page, gpa_offset;
+
+	region = (struct userspace_mem_region *) userspace_mem_region_find(
+		vm, gpa, (gpa + size - 1));
+
+	if (kvm_userfault_data.uffd_mode) {
+		/*
+		 * Resolve userfaults early, without needing to read them
+		 * off the userfaultfd.
+		 */
+		uint64_t hva = (uint64_t)addr_gpa2hva(vm, gpa);
+		struct uffd_desc **descs = kvm_userfault_data.uffd_descs;
+		int i, fd;
+
+		for (i = 0; i < num_uffds; ++i)
+			if (hva >= (uint64_t)descs[i]->va_start &&
+			    hva < (uint64_t)descs[i]->va_end)
+				break;
+
+		TEST_ASSERT(i < num_uffds,
+			    "Did not find userfaultfd for hva: %lx", hva);
+
+		fd = kvm_userfault_data.uffd_descs[i]->uffd;
+		resolve_uffd_page_request(kvm_userfault_data.uffd_mode, fd,
+					  hva, false);
+	} else {
+		uint64_t hva = (uint64_t)addr_gpa2hva(vm, gpa);
+
+		memcpy((char *)hva, guest_data_prototype, demand_paging_size);
+	}
+
+	gpa_offset = gpa - region->region.guest_phys_addr;
+	page = gpa_offset / host_page_size;
+	bitmap_chunk = (unsigned long *)region->region.userfault_bitmap +
+		       page / BITS_PER_LONG;
+	atomic_fetch_and_explicit(bitmap_chunk,
+			~(1ul << (page % BITS_PER_LONG)), memory_order_release);
+}
+
 struct test_params {
 	int uffd_mode;
 	bool single_uffd;
@@ -136,6 +228,7 @@ struct test_params {
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool kvm_userfault;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -149,6 +242,25 @@ static void prefault_mem(void *alias, uint64_t len)
 	}
 }
 
+static void enable_userfault(struct kvm_vm *vm, int slots)
+{
+	for (int i = 0; i < slots; ++i) {
+		int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
+		struct userspace_mem_region *region;
+		unsigned long *userfault_bitmap;
+		int flags = KVM_MEM_USERFAULT;
+
+		region = memslot2region(vm, slot);
+		userfault_bitmap = bitmap_zalloc(region->mmap_size /
+						 host_page_size);
+		/* everything is userfault initially */
+		memset(userfault_bitmap, -1, region->mmap_size / host_page_size / CHAR_BIT);
+		printf("Setting bitmap: %p\n", userfault_bitmap);
+		vm_mem_region_set_flags_userfault(vm, slot, flags,
+						  userfault_bitmap);
+	}
+}
+
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct memstress_vcpu_args *vcpu_args;
@@ -159,12 +271,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec ts_diff;
 	double vcpu_paging_rate;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
+	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
+	host_page_size = getpagesize();
 
 	guest_data_prototype = malloc(demand_paging_size);
 	TEST_ASSERT(guest_data_prototype,
@@ -208,6 +321,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		}
 	}
 
+	if (p->kvm_userfault) {
+		TEST_REQUIRE(kvm_has_cap(KVM_CAP_USERFAULT));
+		kvm_userfault_data.enabled = true;
+		kvm_userfault_data.uffd_mode = p->uffd_mode;
+		kvm_userfault_data.uffd_descs = uffd_descs;
+		enable_userfault(vm, 1);
+	}
+
 	pr_info("Finished creating vCPUs and starting uffd threads\n");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
@@ -265,6 +386,7 @@ static void help(char *name)
 	printf(" -v: specify the number of vCPUs to run.\n");
 	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
 	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -k: Use KVM Userfault\n");
 	puts("");
 	exit(0);
 }
@@ -283,7 +405,7 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahokm:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -326,6 +448,9 @@ int main(int argc, char *argv[])
 				    "Invalid number of readers per uffd %d: must be >=1",
 				    p.readers_per_uffd);
 			break;
+		case 'k':
+			p.kvm_userfault = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index bc7c242480d6..7fec3559aa64 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -582,6 +582,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
 		uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset);
+struct userspace_mem_region *
+userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end);
 
 #ifndef vm_arch_has_protected_memory
 static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
@@ -591,6 +593,9 @@ static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
 #endif
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
+void vm_mem_region_set_flags_userfault(struct kvm_vm *vm, uint32_t slot,
+				       uint32_t flags,
+				       unsigned long *userfault_bitmap);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
 void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot);
 struct kvm_vcpu *__vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9603f99d3247..7195dd3db5df 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -634,7 +634,7 @@ void kvm_parse_vcpu_pinning(const char *pcpus_string, uint32_t vcpu_to_pcpu[],
  * of the regions is returned.  Null is returned only when no overlapping
  * region exists.
  */
-static struct userspace_mem_region *
+struct userspace_mem_region *
 userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
 {
 	struct rb_node *node;
@@ -1149,6 +1149,44 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags)
 		ret, errno, slot, flags);
 }
 
+/*
+ * VM Memory Region Flags Set with a userfault bitmap
+ *
+ * Input Args:
+ *   vm - Virtual Machine
+ *   flags - Flags for the memslot
+ *   userfault_bitmap - The bitmap to use for KVM_MEM_USERFAULT
+ *
+ * Output Args: None
+ *
+ * Return: None
+ *
+ * Sets the flags of the memory region specified by the value of slot,
+ * to the values given by flags. This helper adds a way to provide a
+ * userfault_bitmap.
+ */
+void vm_mem_region_set_flags_userfault(struct kvm_vm *vm, uint32_t slot,
+				       uint32_t flags,
+				       unsigned long *userfault_bitmap)
+{
+	int ret;
+	struct userspace_mem_region *region;
+
+	region = memslot2region(vm, slot);
+
+	TEST_ASSERT(!userfault_bitmap ^ !!(flags & KVM_MEM_USERFAULT),
+		    "KVM_MEM_USERFAULT must be specified with a bitmap");
+
+	region->region.flags = flags;
+	region->region.userfault_bitmap = (__u64)userfault_bitmap;
+
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
+		"  rc: %i errno: %i slot: %u flags: 0x%x",
+		ret, errno, slot, flags);
+}
+
 /*
  * VM Memory Region Move
  *
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v1 11/13] KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (9 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 12/13] KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests James Houghton
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

The KVM_MEM_USERFAULT flag is supported iff KVM_CAP_USERFAULT is
available.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/set_memory_region_test.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index a8267628e9ed..d233cdfb0241 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -364,6 +364,9 @@ static void test_invalid_memory_region_flags(void)
 	if (kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES) & KVM_MEMORY_ATTRIBUTE_PRIVATE)
 		supported_flags |= KVM_MEM_GUEST_MEMFD;
 
+	if (kvm_check_cap(KVM_CAP_USERFAULT))
+		supported_flags |= KVM_MEM_USERFAULT;
+
 	for (i = 0; i < 32; i++) {
 		if ((supported_flags & BIT(i)) && !(v2_only_flags & BIT(i)))
 			continue;
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v1 12/13] KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (10 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 11/13] KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-04 19:13 ` [PATCH v1 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details James Houghton
  2024-12-24 21:07 ` [PATCH v1 00/13] KVM: Introduce KVM Userfault Peter Xu
  13 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Make sure KVM_MEM_USERFAULT can be toggled on and off for
KVM_MEM_GUEST_MEMFD memslots.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/set_memory_region_test.c    | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index d233cdfb0241..57b7032d7cc3 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -556,6 +556,35 @@ static void test_add_overlapping_private_memory_regions(void)
 	close(memfd);
 	kvm_vm_free(vm);
 }
+
+static void test_private_memory_region_userfault(void)
+{
+	struct kvm_vm *vm;
+	int memfd;
+
+	pr_info("Testing toggling KVM_MEM_USERFAULT on KVM_MEM_GUEST_MEMFD memory regions\n");
+
+	vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM);
+
+	test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+	test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT,
+				   KVM_MEM_GUEST_MEMFD | KVM_MEM_USERFAULT,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+
+	close(memfd);
+
+	kvm_vm_free(vm);
+}
 #endif
 
 int main(int argc, char *argv[])
@@ -582,6 +611,7 @@ int main(int argc, char *argv[])
 	    (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
 		test_add_private_memory_region();
 		test_add_overlapping_private_memory_regions();
+		test_private_memory_region_userfault();
 	} else {
 		pr_info("Skipping tests for KVM_MEM_GUEST_MEMFD memory regions\n");
 	}
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v1 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (11 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 12/13] KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests James Houghton
@ 2024-12-04 19:13 ` James Houghton
  2024-12-07  1:38   ` Bagas Sanjaya
  2024-12-24 21:07 ` [PATCH v1 00/13] KVM: Introduce KVM Userfault Peter Xu
  13 siblings, 1 reply; 32+ messages in thread
From: James Houghton @ 2024-12-04 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	James Houghton, Nikita Kalyazin, Anish Moorthy, Peter Gonda,
	Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Include the note about memory ordering when clearing bits in
userfault_bitmap, as it may not be obvious to users.

Signed-off-by: James Houghton <jthoughton@google.com>
---
  I would like to include the new -EFAULT reason in the documentation for
  KVM_RUN (the case where userfault_bitmap could not be read), as -EFAULT
  usually means that GUP failed.
---
 Documentation/virt/kvm/api.rst | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 454c2aaa155e..eec485dcf0bc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6281,7 +6281,8 @@ bounds checks apply (use common sense).
 	__u64 guest_memfd_offset;
 	__u32 guest_memfd;
 	__u32 pad1;
-	__u64 pad2[14];
+	__u64 userfault_bitmap;
+	__u64 pad2[13];
   };
 
 A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and
@@ -6297,6 +6298,25 @@ state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
 is '0' for all gfns.  Userspace can control whether memory is shared/private by
 toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
 
+When the KVM_MEM_USERFAULT flag is set, userfault_bitmap points to the starting
+address of the bitmap that controls whether vCPU memory faults should
+immediately exit to userspace. If an invalid pointer is provided, KVM_RUN will
+return -EFAULT at fault time. KVM_MEM_USERFAULT is only supported when
+KVM_CAP_USERFAULT is supported.
+
+userfault_bitmap should point to an array of longs where each bit in the array
+linearly corresponds to a single gfn. Bit 0 in userfault_bitmap corresponds to
+guest_phys_addr, bit 1 corresponds to guest_phys_addr + PAGE_SIZE, etc. If the
+bit for a page is set, any vCPU access to that page will exit to userspace with
+KVM_MEMORY_EXIT_FLAG_USERFAULT.
+
+Setting bits in userfault_bitmap has no effect on pages that have already been
+mapped by KVM until KVM_MEM_USERFAULT is disabled and then re-enabled.
+
+Clearing bits in userfault_bitmap should usually be done with a store-release
+if changes to guest memory are being made available to the guest via
+userfault_bitmap.
+
 S390:
 ^^^^^
 
@@ -8251,6 +8271,17 @@ KVM exits with the register state of either the L1 or L2 guest
 depending on which executed at the time of an exit. Userspace must
 take care to differentiate between these cases.
 
+7.37 KVM_CAP_USERFAULT
+----------------------
+
+:Architectures: x86, arm64
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_SET_USER_MEMORY_REGION2 will
+accept KVM_MEM_USERFAULT as a valid memslot flag.
+
+See KVM_SET_USER_MEMORY_REGION2 for more details.
+
 8. Other capabilities.
 ======================
 
-- 
2.47.0.338.g60cca15819-goog



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2024-12-04 19:13 ` [PATCH v1 06/13] KVM: arm64: " James Houghton
@ 2024-12-04 23:07   ` Oliver Upton
  2024-12-05 23:31     ` James Houghton
  0 siblings, 1 reply; 32+ messages in thread
From: Oliver Upton @ 2024-12-04 23:07 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, Wang, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

Hi James,

On Wed, Dec 04, 2024 at 07:13:41PM +0000, James Houghton wrote:
> Adhering to the requirements of KVM Userfault:
> 
> 1. When it is toggled (either on or off), zap the second stage with
>    kvm_arch_flush_shadow_memslot(). This is to (1) respect
>    userfault-ness and (2) to reconstruct block mappings.
> 2. While KVM_MEM_USERFAULT is enabled, restrict new second-stage mappings
>    to be PAGE_SIZE, just like when dirty logging is enabled.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   I'm not 100% sure if kvm_arch_flush_shadow_memslot() is correct in
>   this case (like if the host does not have S2FWB).

Invalidating the stage-2 entries is of course necessary for correctness
on the !USERFAULT -> USERFAULT transition, and the MMU will do the right
thing regardless of whether hardware implements FEAT_S2FWB.

What I think you may be getting at is the *performance* implications are
quite worrying without FEAT_S2FWB due to the storm of CMOs, and I'd
definitely agree with that.

> @@ -2062,6 +2069,20 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  				   enum kvm_mr_change change)
>  {
>  	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +	u32 changed_flags = (new ? new->flags : 0) ^ (old ? old->flags : 0);
> +
> +	/*
> +	 * If KVM_MEM_USERFAULT changed, drop all the stage-2 mappings so that
> +	 * we can (1) respect userfault-ness or (2) create block mappings.
> +	 */
> +	if ((changed_flags & KVM_MEM_USERFAULT) && change == KVM_MR_FLAGS_ONLY)
> +		kvm_arch_flush_shadow_memslot(kvm, old);

I'd strongly prefer that we make (2) a userspace problem and don't
eagerly invalidate stage-2 mappings on the USERFAULT -> !USERFAULT
change.

Having implied user-visible behaviors on ioctls is never good, and for
systems without FEAT_S2FWB you might be better off avoiding the unmap in
the first place.

So, if userspace decides there's a benefit to invalidating the stage-2
MMU, it can just delete + recreate the memslot.

-- 
Thanks,
Oliver


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  2024-12-04 19:13 ` [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
@ 2024-12-05 11:52   ` kernel test robot
  2024-12-05 14:22   ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2024-12-05 11:52 UTC (permalink / raw)
  To: James Houghton, Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, James Houghton, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Hi James,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 4d911c7abee56771b0219a9fbf0120d06bdc9c14]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Add-KVM_MEM_USERFAULT-memslot-flag-and-bitmap/20241205-032516
base:   4d911c7abee56771b0219a9fbf0120d06bdc9c14
patch link:    https://lore.kernel.org/r/20241204191349.1730936-2-jthoughton%40google.com
patch subject: [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
config: x86_64-randconfig-121 (https://download.01.org/0day-ci/archive/20241205/202412051904.GNL7BE1X-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241205/202412051904.GNL7BE1X-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412051904.GNL7BE1X-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   arch/x86/kvm/../../../virt/kvm/kvm_main.c: note: in included file:
   include/linux/kvm_host.h:2080:54: sparse: sparse: array of flexible structures
   include/linux/kvm_host.h:2082:56: sparse: sparse: array of flexible structures
>> arch/x86/kvm/../../../virt/kvm/kvm_main.c:2049:39: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned long [noderef] __user *userfault_bitmap @@     got unsigned long * @@
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:2049:39: sparse:     expected unsigned long [noderef] __user *userfault_bitmap
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:2049:39: sparse:     got unsigned long *
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:626:25: sparse: sparse: context imbalance in 'kvm_mmu_notifier_invalidate_range_start' - different lock contexts for basic block
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:626:25: sparse: sparse: context imbalance in 'kvm_mmu_notifier_invalidate_range_end' - different lock contexts for basic block
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:626:25: sparse: sparse: context imbalance in 'kvm_mmu_notifier_clear_flush_young' - different lock contexts for basic block
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:626:25: sparse: sparse: context imbalance in 'kvm_mmu_notifier_clear_young' - different lock contexts for basic block
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:626:25: sparse: sparse: context imbalance in 'kvm_mmu_notifier_test_young' - different lock contexts for basic block
   arch/x86/kvm/../../../virt/kvm/kvm_main.c: note: in included file (through include/linux/mutex.h, include/linux/kvm_types.h, include/kvm/iodev.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:1960:49: sparse: sparse: self-comparison always evaluates to false
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true

vim +2049 arch/x86/kvm/../../../virt/kvm/kvm_main.c

  1931	
  1932	/*
  1933	 * Allocate some memory and give it an address in the guest physical address
  1934	 * space.
  1935	 *
  1936	 * Discontiguous memory is allowed, mostly for framebuffers.
  1937	 *
  1938	 * Must be called holding kvm->slots_lock for write.
  1939	 */
  1940	int __kvm_set_memory_region(struct kvm *kvm,
  1941				    const struct kvm_userspace_memory_region2 *mem)
  1942	{
  1943		struct kvm_memory_slot *old, *new;
  1944		struct kvm_memslots *slots;
  1945		enum kvm_mr_change change;
  1946		unsigned long npages;
  1947		gfn_t base_gfn;
  1948		int as_id, id;
  1949		int r;
  1950	
  1951		r = check_memory_region_flags(kvm, mem);
  1952		if (r)
  1953			return r;
  1954	
  1955		as_id = mem->slot >> 16;
  1956		id = (u16)mem->slot;
  1957	
  1958		/* General sanity checks */
  1959		if ((mem->memory_size & (PAGE_SIZE - 1)) ||
  1960		    (mem->memory_size != (unsigned long)mem->memory_size))
  1961			return -EINVAL;
  1962		if (mem->guest_phys_addr & (PAGE_SIZE - 1))
  1963			return -EINVAL;
  1964		/* We can read the guest memory with __xxx_user() later on. */
  1965		if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
  1966		    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
  1967		     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
  1968				mem->memory_size))
  1969			return -EINVAL;
  1970		if (mem->flags & KVM_MEM_GUEST_MEMFD &&
  1971		    (mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
  1972		     mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
  1973			return -EINVAL;
  1974		if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
  1975			return -EINVAL;
  1976		if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
  1977			return -EINVAL;
  1978		if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
  1979			return -EINVAL;
  1980	
  1981		slots = __kvm_memslots(kvm, as_id);
  1982	
  1983		/*
  1984		 * Note, the old memslot (and the pointer itself!) may be invalidated
  1985		 * and/or destroyed by kvm_set_memslot().
  1986		 */
  1987		old = id_to_memslot(slots, id);
  1988	
  1989		if (!mem->memory_size) {
  1990			if (!old || !old->npages)
  1991				return -EINVAL;
  1992	
  1993			if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
  1994				return -EIO;
  1995	
  1996			return kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE);
  1997		}
  1998	
  1999		base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
  2000		npages = (mem->memory_size >> PAGE_SHIFT);
  2001	
  2002		if (!old || !old->npages) {
  2003			change = KVM_MR_CREATE;
  2004	
  2005			/*
  2006			 * To simplify KVM internals, the total number of pages across
  2007			 * all memslots must fit in an unsigned long.
  2008			 */
  2009			if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
  2010				return -EINVAL;
  2011		} else { /* Modify an existing slot. */
  2012			/* Private memslots are immutable, they can only be deleted. */
  2013			if (mem->flags & KVM_MEM_GUEST_MEMFD)
  2014				return -EINVAL;
  2015			if ((mem->userspace_addr != old->userspace_addr) ||
  2016			    (npages != old->npages) ||
  2017			    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
  2018				return -EINVAL;
  2019	
  2020			if (base_gfn != old->base_gfn)
  2021				change = KVM_MR_MOVE;
  2022			else if (mem->flags != old->flags)
  2023				change = KVM_MR_FLAGS_ONLY;
  2024			else /* Nothing to change. */
  2025				return 0;
  2026		}
  2027	
  2028		if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
  2029		    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
  2030			return -EEXIST;
  2031	
  2032		/* Allocate a slot that will persist in the memslot. */
  2033		new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
  2034		if (!new)
  2035			return -ENOMEM;
  2036	
  2037		new->as_id = as_id;
  2038		new->id = id;
  2039		new->base_gfn = base_gfn;
  2040		new->npages = npages;
  2041		new->flags = mem->flags;
  2042		new->userspace_addr = mem->userspace_addr;
  2043		if (mem->flags & KVM_MEM_GUEST_MEMFD) {
  2044			r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
  2045			if (r)
  2046				goto out;
  2047		}
  2048		if (mem->flags & KVM_MEM_USERFAULT)
> 2049			new->userfault_bitmap = (unsigned long *)mem->userfault_bitmap;
  2050	
  2051		r = kvm_set_memslot(kvm, old, new, change);
  2052		if (r)
  2053			goto out_unbind;
  2054	
  2055		return 0;
  2056	
  2057	out_unbind:
  2058		if (mem->flags & KVM_MEM_GUEST_MEMFD)
  2059			kvm_gmem_unbind(new);
  2060	out:
  2061		kfree(new);
  2062		return r;
  2063	}
  2064	EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
  2065	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  2024-12-04 19:13 ` [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
  2024-12-05 11:52   ` kernel test robot
@ 2024-12-05 14:22   ` kernel test robot
  2024-12-06 22:46     ` James Houghton
  1 sibling, 1 reply; 32+ messages in thread
From: kernel test robot @ 2024-12-05 14:22 UTC (permalink / raw)
  To: James Houghton, Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, James Houghton, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Hi James,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 4d911c7abee56771b0219a9fbf0120d06bdc9c14]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Add-KVM_MEM_USERFAULT-memslot-flag-and-bitmap/20241205-032516
base:   4d911c7abee56771b0219a9fbf0120d06bdc9c14
patch link:    https://lore.kernel.org/r/20241204191349.1730936-2-jthoughton%40google.com
patch subject: [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
config: i386-buildonly-randconfig-006 (https://download.01.org/0day-ci/archive/20241205/202412052133.pTg3UAQm-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241205/202412052133.pTg3UAQm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412052133.pTg3UAQm-lkp@intel.com/

All warnings (new ones prefixed by >>):

   arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_set_memory_region':
>> arch/x86/kvm/../../../virt/kvm/kvm_main.c:2049:41: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    2049 |                 new->userfault_bitmap = (unsigned long *)mem->userfault_bitmap;
         |                                         ^


vim +2049 arch/x86/kvm/../../../virt/kvm/kvm_main.c


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2024-12-04 23:07   ` Oliver Upton
@ 2024-12-05 23:31     ` James Houghton
  2024-12-06  0:45       ` Oliver Upton
  0 siblings, 1 reply; 32+ messages in thread
From: James Houghton @ 2024-12-05 23:31 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Wed, Dec 4, 2024 at 3:07 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi James,

Hi Oliver!

>
> On Wed, Dec 04, 2024 at 07:13:41PM +0000, James Houghton wrote:
> > Adhering to the requirements of KVM Userfault:
> >
> > 1. When it is toggled (either on or off), zap the second stage with
> >    kvm_arch_flush_shadow_memslot(). This is to (1) respect
> >    userfault-ness and (2) to reconstruct block mappings.
> > 2. While KVM_MEM_USERFAULT is enabled, restrict new second-stage mappings
> >    to be PAGE_SIZE, just like when dirty logging is enabled.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   I'm not 100% sure if kvm_arch_flush_shadow_memslot() is correct in
> >   this case (like if the host does not have S2FWB).
>
> Invalidating the stage-2 entries is of course necessary for correctness
> on the !USERFAULT -> USERFAULT transition, and the MMU will do the right
> thing regardless of whether hardware implements FEAT_S2FWB.
>
> What I think you may be getting at is the *performance* implications are
> quite worrying without FEAT_S2FWB due to the storm of CMOs, and I'd
> definitely agree with that.

Thanks for clarifying that for me.

> > @@ -2062,6 +2069,20 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> >                                  enum kvm_mr_change change)
> >  {
> >       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > +     u32 changed_flags = (new ? new->flags : 0) ^ (old ? old->flags : 0);
> > +
> > +     /*
> > +      * If KVM_MEM_USERFAULT changed, drop all the stage-2 mappings so that
> > +      * we can (1) respect userfault-ness or (2) create block mappings.
> > +      */
> > +     if ((changed_flags & KVM_MEM_USERFAULT) && change == KVM_MR_FLAGS_ONLY)
> > +             kvm_arch_flush_shadow_memslot(kvm, old);
>
> I'd strongly prefer that we make (2) a userspace problem and don't
> eagerly invalidate stage-2 mappings on the USERFAULT -> !USERFAULT
> change.
>
> Having implied user-visible behaviors on ioctls is never good, and for
> systems without FEAT_S2FWB you might be better off avoiding the unmap in
> the first place.
>
> So, if userspace decides there's a benefit to invalidating the stage-2
> MMU, it can just delete + recreate the memslot.

OK, I think that's reasonable. So for USERFAULT -> !USERFAULT, I'll
just follow the precedent set by dirty logging. For x86 today, we
collapse the mappings, and for arm64 we do not.

Is arm64 ever going to support collapsing back to huge mappings after
dirty logging is disabled?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 06/13] KVM: arm64: Add support for KVM_MEM_USERFAULT
  2024-12-05 23:31     ` James Houghton
@ 2024-12-06  0:45       ` Oliver Upton
  0 siblings, 0 replies; 32+ messages in thread
From: Oliver Upton @ 2024-12-06  0:45 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Yan Zhao, Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Dec 05, 2024 at 03:31:05PM -0800, James Houghton wrote:
> > > @@ -2062,6 +2069,20 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> > >                                  enum kvm_mr_change change)
> > >  {
> > >       bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > > +     u32 changed_flags = (new ? new->flags : 0) ^ (old ? old->flags : 0);
> > > +
> > > +     /*
> > > +      * If KVM_MEM_USERFAULT changed, drop all the stage-2 mappings so that
> > > +      * we can (1) respect userfault-ness or (2) create block mappings.
> > > +      */
> > > +     if ((changed_flags & KVM_MEM_USERFAULT) && change == KVM_MR_FLAGS_ONLY)
> > > +             kvm_arch_flush_shadow_memslot(kvm, old);
> >
> > I'd strongly prefer that we make (2) a userspace problem and don't
> > eagerly invalidate stage-2 mappings on the USERFAULT -> !USERFAULT
> > change.
> >
> > Having implied user-visible behaviors on ioctls is never good, and for
> > systems without FEAT_S2FWB you might be better off avoiding the unmap in
> > the first place.
> >
> > So, if userspace decides there's a benefit to invalidating the stage-2
> > MMU, it can just delete + recreate the memslot.
> 
> Ok I think that's reasonable. So for USERFAULT -> !USERFAULT, I'll
> just follow the precedent set by dirty logging. For x86 today, we
> collapse the mappings, and for arm64 we do not.
> 
> Is arm64 ever going to support collapsing back to huge mappings after
> dirty logging is disabled?

Patches on list is always a good place to start :)

What I'd expect on FEAT_S2FWB hardware is that invalidating the whole
stage-2 and faulting back in block entries would give the best
experience.

Only in the case of !FWB would a literal table -> block collapse be
beneficial, as the MMU could potentially elide CMOs when remapping. But
that assumes you're starting with a fully-mapped table and there are no
holes that are "out of sync" with the guest.

-- 
Thanks,
Oliver


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  2024-12-05 14:22   ` kernel test robot
@ 2024-12-06 22:46     ` James Houghton
  0 siblings, 0 replies; 32+ messages in thread
From: James Houghton @ 2024-12-06 22:46 UTC (permalink / raw)
  To: lkp
  Cc: amoorthy, corbet, dmatlack, jthoughton, kalyazin, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel, maz, oe-kbuild-all,
	oliver.upton, pbonzini, peterx, pgonda, seanjc, wei.w.wang,
	yan.y.zhao

>    arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_set_memory_region':
> >> arch/x86/kvm/../../../virt/kvm/kvm_main.c:2049:41: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>     2049 |                 new->userfault_bitmap = (unsigned long *)mem->userfault_bitmap;
>          |                                         ^

I realize that not only have I done this cast slightly wrong, but I'm
also missing a few checks on userfault_bitmap that I should have.
Applying this diff, or at least something like it, to fix it:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b552cdef2850..30f09141df64 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1977,6 +1977,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_USERFAULT &&
+	    ((mem->userfault_bitmap != untagged_addr(mem->userfault_bitmap)) ||
+	     !access_ok((void __user *)(unsigned long)mem->userfault_bitmap,
+			DIV_ROUND_UP(mem->memory_size >> PAGE_SHIFT, BITS_PER_LONG)
+			 * sizeof(long))))
+		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
 
@@ -2053,7 +2059,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			goto out;
 	}
 	if (mem->flags & KVM_MEM_USERFAULT)
-		new->userfault_bitmap = (unsigned long *)mem->userfault_bitmap;
+		new->userfault_bitmap =
+		  (unsigned long __user *)(unsigned long)mem->userfault_bitmap;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
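
The bitmap length that the added access_ok() check enforces -
DIV_ROUND_UP(npages, BITS_PER_LONG) * sizeof(long) - is the same sizing
userspace has to use when allocating the bitmap. A minimal sketch of the
userspace side (helper names are illustrative, not from the series):

```c
#include <assert.h>
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Bytes userspace must keep accessible for a memslot of `npages` pages,
 * mirroring the access_ok() length in the kernel-side check above. */
static size_t userfault_bitmap_bytes(unsigned long npages)
{
	return DIV_ROUND_UP(npages, BITS_PER_LONG) * sizeof(long);
}

/* Mark a page (gfn relative to the memslot base) as userfault. */
static void set_userfault(unsigned long *bitmap, unsigned long rel_gfn)
{
	bitmap[rel_gfn / BITS_PER_LONG] |= 1UL << (rel_gfn % BITS_PER_LONG);
}

/* Test whether a relative gfn is currently marked userfault. */
static int test_userfault(const unsigned long *bitmap, unsigned long rel_gfn)
{
	return !!(bitmap[rel_gfn / BITS_PER_LONG] &
		  (1UL << (rel_gfn % BITS_PER_LONG)));
}
```

A bitmap allocated with userfault_bitmap_bytes(npages) and fully set
before enabling KVM_MEM_USERFAULT would make every page in the slot exit
to userspace on first vCPU access.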


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details
  2024-12-04 19:13 ` [PATCH v1 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details James Houghton
@ 2024-12-07  1:38   ` Bagas Sanjaya
  0 siblings, 0 replies; 32+ messages in thread
From: Bagas Sanjaya @ 2024-12-07  1:38 UTC (permalink / raw)
  To: James Houghton, Paolo Bonzini, Sean Christopherson
  Cc: Jonathan Corbet, Marc Zyngier, Oliver Upton, Yan Zhao,
	Nikita Kalyazin, Anish Moorthy, Peter Gonda, Peter Xu,
	David Matlack, Wang, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

[-- Attachment #1: Type: text/plain, Size: 326 bytes --]

On Wed, Dec 04, 2024 at 07:13:48PM +0000, James Houghton wrote:
> Include the note about memory ordering when clearing bits in
> userfault_bitmap, as it may not be obvious for users.

The doc LGTM, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test
  2024-12-04 19:13 ` [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
@ 2024-12-14 22:46   ` kernel test robot
  0 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2024-12-14 22:46 UTC (permalink / raw)
  To: James Houghton, Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Yan Zhao, James Houghton, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, Peter Xu, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

Hi James,

kernel test robot noticed the following build errors:

[auto build test ERROR on 4d911c7abee56771b0219a9fbf0120d06bdc9c14]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Add-KVM_MEM_USERFAULT-memslot-flag-and-bitmap/20241205-032516
base:   4d911c7abee56771b0219a9fbf0120d06bdc9c14
patch link:    https://lore.kernel.org/r/20241204191349.1730936-11-jthoughton%40google.com
patch subject: [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241215/202412150600.wLk8rmZf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412150600.wLk8rmZf-lkp@intel.com/

All errors (new ones prefixed by >>):

>> demand_paging_test.c:220:2: error: address argument to atomic operation must be a pointer to _Atomic type ('unsigned long *' invalid)
     220 |         atomic_fetch_and_explicit(bitmap_chunk,
         |         ^                         ~~~~~~~~~~~~
   /opt/cross/clang-ab51eccf88/lib/clang/19/include/stdatomic.h:169:35: note: expanded from macro 'atomic_fetch_and_explicit'
     169 | #define atomic_fetch_and_explicit __c11_atomic_fetch_and
         |                                   ^
   1 error generated.
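
The error is that C11's atomic_fetch_and_explicit() requires an
_Atomic-qualified pointer, while the selftest's bitmap chunk is a plain
`unsigned long *`. One possible fix (a sketch, not necessarily what the
selftest ended up using) is the gcc/clang __atomic builtins, which
operate on ordinary, non-_Atomic objects:

```c
#include <assert.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Atomically clear the bit for `rel_gfn` in a plain unsigned long bitmap
 * and report whether it was previously set. __atomic_fetch_and() accepts
 * an ordinary pointer, unlike C11 atomic_fetch_and_explicit(), which
 * demands an _Atomic-qualified one. */
static int clear_userfault(unsigned long *bitmap, unsigned long rel_gfn)
{
	unsigned long mask = 1UL << (rel_gfn % BITS_PER_LONG);
	unsigned long old;

	old = __atomic_fetch_and(&bitmap[rel_gfn / BITS_PER_LONG], ~mask,
				 __ATOMIC_ACQ_REL);
	return !!(old & mask);
}
```

The acquire-release ordering here is a guess at what the documentation
patch's memory-ordering note calls for; the important part is replacing
the C11 call with a builtin that type-checks against a plain pointer.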

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
                   ` (12 preceding siblings ...)
  2024-12-04 19:13 ` [PATCH v1 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details James Houghton
@ 2024-12-24 21:07 ` Peter Xu
  2025-01-02 17:53   ` James Houghton
  13 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2024-12-24 21:07 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wang, Wei W, kvm, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

James,

On Wed, Dec 04, 2024 at 07:13:35PM +0000, James Houghton wrote:
> This is a continuation of the original KVM Userfault RFC[1] from July.
> It contains the simplifications we talked about at LPC[2].
> 
> Please see the RFC[1] for the problem description. In summary,
> guest_memfd VMs have no mechanism for doing post-copy live migration.
> KVM Userfault provides such a mechanism. Today there is no upstream
> mechanism for installing memory into a guest_memfd, but there will
> be one soon (e.g. [3]).
> 
> There is a second problem that KVM Userfault solves: userfaultfd-based
> post-copy doesn't scale very well. KVM Userfault when used with
> userfaultfd can scale much better in the common case that most post-copy
> demand fetches are a result of vCPU access violations. This is a
> continuation of the solution Anish was working on[4]. This aspect of
> KVM Userfault is important for userfaultfd-based live migration when
> scaling up to hundreds of vCPUs with ~30us network latency for a
> PAGE_SIZE demand-fetch.

I think it would be clearer to nail down the goal of the feature.  If it's
a perf-oriented feature we don't need to mention gmem, but maybe it's not.

> 
> The implementation in this series is simpler than the RFC[1]. It adds:
>  1. a new memslot flag: KVM_MEM_USERFAULT,
>  2. a new field, userfault_bitmap, in struct kvm_memory_slot,
>  3. a new KVM_RUN exit reason: KVM_MEMORY_EXIT_FLAG_USERFAULT,
>  4. a new KVM capability KVM_CAP_USERFAULT.
> 
> KVM Userfault does not attempt to catch KVM's own accesses to guest
> memory. That is left up to userfaultfd.

I assume it means this is a "perf optimization" feature then?  As it
doesn't work for remote-fault processes like firecracker, or
remote-emulated processes like QEMU's vhost-user?

Even though it could still 100% cover x86_64's setup if it's not as
complicated as above?  I mean, I assumed the above sentence was for archs
like ARM, which I remember having no-vcpu-context accesses, so things like
that are not covered either.  Perhaps x86_64 is the goal?  If so, it would
also be good to
mention some details.

> 
> When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
> are zapped, and new faults will check `userfault_bitmap` to see if the
> fault should exit to userspace.
> 
> When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
> permitted.
> 
> When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
> (either eagerly or on-demand; the architecture can decide).
> 
> KVM Userfault is not compatible with async page faults. Nikita has
> proposed a new implementation of async page faults that is more
> userspace-driven that *is* compatible with KVM Userfault[5].
> 
> Performance
> ===========
> 
> The takeaways I have are:
> 
> 1. For cases where lock contention is not a concern, there is a
>    discernable win because KVM Userfault saves the trip through the
>    userfaultfd poll/read/WAKE cycle.
> 
> 2. Using a single userfaultfd without KVM Userfault gets very slow as
>    the number of vCPUs increases, and it gets even slower when you add
>    more reader threads. This is due to contention on the userfaultfd
>    wait_queue locks. This is the contention that KVM Userfault avoids.
>    Compare this to the multiple-userfaultfd runs; they are much faster
>    because the wait_queue locks are sharded perfectly (1 per vCPU).
>    Perfect sharding is only possible because the vCPUs are configured to
>    touch only their own chunk of memory.

I'll try to spend some more time after the holidays on this perf issue,
but that will still come after 1G support on !CoCo gmem, if it works out.
The 1G function is still missing in QEMU, so it has higher priority
compared to either perf or downtime (e.g. I'll also need to measure
whether QEMU will need minor faults, or stick with missing faults as of
now).

Maybe I'll also start to explore [g]memfd support in userfault; I'm not
sure whether anyone has started working on a generic solution for CoCo /
gmem postcopy before - we still need a solution for either firecracker or
OVS/vhost-user.  I feel like we need that sooner or later, one way or
another.  I think I'll start without minor fault support until it's
justified, and that's if I'll be able to start it at all in a few months
next year.

Let me know if there's any comment on above thoughts.

I guess this feature might be useful to QEMU too, but QEMU always needs
uffd or something similar, so we'd need to measure and justify that this
one is useful in a real QEMU setup.  For example, we'd need to see how the
page transfer overhead compares with lock contention when there are, say,
400 vcpus.  If the speedup of userfault plus the transfer overhead is
close to what we can get with vcpu exits, then QEMU may still stick with
the simple model.  But I'm not sure.

When integrated with this feature, it also means some other overheads, at
least for QEMU.  E.g., trapping / resolving a page fault needs two ops now
(uffd and the bitmap).  Meanwhile, even if vcpus can get rid of uffd's one
big
spinlock, it may contend again in userspace, either on page resolution or
on similar queuing.  I think I mentioned it previously but I guess it's
nontrivial to justify.  In all cases, I trust that you should have better
judgement on this.  It's just that QEMU can at least behave differently, so
not sure how it'll go there.

Happy holidays. :)

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2024-12-24 21:07 ` [PATCH v1 00/13] KVM: Introduce KVM Userfault Peter Xu
@ 2025-01-02 17:53   ` James Houghton
  2025-01-16 20:19     ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: James Houghton @ 2025-01-02 17:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Tue, Dec 24, 2024 at 4:07 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,

Hi Peter!

> On Wed, Dec 04, 2024 at 07:13:35PM +0000, James Houghton wrote:
> > This is a continuation of the original KVM Userfault RFC[1] from July.
> > It contains the simplifications we talked about at LPC[2].
> >
> > Please see the RFC[1] for the problem description. In summary,
> > guest_memfd VMs have no mechanism for doing post-copy live migration.
> > KVM Userfault provides such a mechanism. Today there is no upstream
> > mechanism for installing memory into a guest_memfd, but there will
> > be one soon (e.g. [3]).
> >
> > There is a second problem that KVM Userfault solves: userfaultfd-based
> > post-copy doesn't scale very well. KVM Userfault when used with
> > userfaultfd can scale much better in the common case that most post-copy
> > demand fetches are a result of vCPU access violations. This is a
> > continuation of the solution Anish was working on[4]. This aspect of
> > KVM Userfault is important for userfaultfd-based live migration when
> > scaling up to hundreds of vCPUs with ~30us network latency for a
> > PAGE_SIZE demand-fetch.
>
> I think it would be clearer to nail down the goal of the feature.  If it's
> a perf-oriented feature we don't need to mention gmem, but maybe it's not.

In my mind, both the gmem aspect and the performance aspect are
important. I don't think one is more important than the other, though
the performance win aspect of this is more immediately useful.

>
> >
> > The implementation in this series is a simplified version of the RFC[1].
> > It adds...
> >  1. a new memslot flag: KVM_MEM_USERFAULT,
> >  2. a new field, userfault_bitmap, in struct kvm_memory_slot,
> >  3. a new KVM_RUN exit reason: KVM_MEMORY_EXIT_FLAG_USERFAULT,
> >  4. a new KVM capability KVM_CAP_USERFAULT.
> >
> > KVM Userfault does not attempt to catch KVM's own accesses to guest
> > memory. That is left up to userfaultfd.
>
> I assume it means this is a "perf optimization" feature then?  As it
> doesn't work for remote-fault processes like firecracker, or
> remote-emulated processes like QEMU's vhost-user?

For the !gmem case, yes KVM Userfault is not a replacement for
userfaultfd; for post-copy to function properly, we need to catch all
attempted accesses to guest memory (including from things like
vhost-net and KVM itself). It is indeed a performance optimization on
top of userfaultfd.

For setups where userfaultfd reader threads are running in a separate
process from the vCPU threads, yes KVM Userfault will make it so that
guest-mode vCPU faults will have to be initially handled by the vCPU
threads themselves. The fault information can always be forwarded to a
separate process afterwards, and the communication mechanism is
totally up to userspace (so they can optimize for their exact
use-case).
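The vCPU-thread flow described above can be sketched roughly as below.
The exit flag name comes from this series, but everything else here is an
illustrative assumption, not the series' actual ABI: the bitmap layout
(one bit per 4K page, relative to the memslot base), the helper names,
and the demand-fetch hook are all made up for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Illustrative bitmap layout: one bit per PAGE_SIZE page of the memslot.
 * Bit set means "exit to userspace on fault"; bit clear means "let KVM
 * install the mapping".  See the series for the real ABI.
 */
static size_t userfault_bit(uint64_t gpa, uint64_t slot_base_gpa)
{
	return (gpa - slot_base_gpa) >> PAGE_SHIFT;
}

static int userfault_test_bit(const _Atomic uint64_t *bitmap, size_t bit)
{
	return (atomic_load(&bitmap[bit / 64]) >> (bit % 64)) & 1;
}

static void userfault_clear_bit(_Atomic uint64_t *bitmap, size_t bit)
{
	atomic_fetch_and(&bitmap[bit / 64], ~(1ull << (bit % 64)));
}

/*
 * Hypothetical vCPU-thread handler, called when KVM_RUN exits with
 * KVM_MEMORY_EXIT_FLAG_USERFAULT set for fault_gpa.  fetch_page() stands
 * in for whatever demand-fetch transport the VMM uses (it may be NULL in
 * this sketch); it is not part of the series.
 */
static void handle_userfault_exit(uint64_t fault_gpa, uint64_t slot_base_gpa,
				  _Atomic uint64_t *bitmap,
				  void (*fetch_page)(uint64_t gpa))
{
	size_t bit = userfault_bit(fault_gpa, slot_base_gpa);

	if (userfault_test_bit(bitmap, bit)) {
		if (fetch_page)
			fetch_page(fault_gpa);	  /* pull contents over */
		userfault_clear_bit(bitmap, bit); /* future faults succeed */
	}
	/* ...then re-enter KVM_RUN; the fault is retried and now succeeds. */
}
```

The point of the sketch is that the fault never leaves the vCPU thread:
fetching and clearing the bit happen in-line, and any forwarding to a
separate process is whatever userspace wants it to be.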

> Even though it could still 100% cover x86_64's setup if it's not as
> complicated as above?  I mean, I assumed the above sentence was for archs
> like ARM that I remember having no-vcpu-context accesses, so things like
> that are not covered too.  Perhaps x86_64 is the goal?  If so, it would
> also be good to mention some details.

(In the !gmem case) It can't replace userfaultfd for any architecture
as long as there are kernel components that want to read or write to
guest memory.

The only real reason to make KVM Userfault try to completely replace
userfaultfd is to make post-copy with 1G pages work, but that requires
(1) userspace to catch its own accesses and (2) non-KVM components to
catch their own accesses.

(1) is a pain (IIUC, infeasible for QEMU) and error-prone even if it
can be done, and (2) can essentially never be done upstream.

So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
worth the extra/duplicated complexity. And at LPC, Paolo and Sean
indicated that this direction was indeed wrong. I have another way to
make this work in mind. :)

For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
replacing it. And as of right now anyway, KVM Userfault *does* provide
a complete post-copy system for gmem.

When gmem pages can be mapped into userspace, for post-copy to remain
functional, userspace-mapped gmem will need userfaultfd integration.
Keep in mind that even after this integration happens, userfaultfd
alone will *not* be a complete post-copy solution, as vCPU faults
won't be resolved via the userspace page tables.

> >
> > When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
> > are zapped, and new faults will check `userfault_bitmap` to see if the
> > fault should exit to userspace.
> >
> > When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
> > permitted.
> >
> > When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
> > (either eagerly or on-demand; the architecture can decide).
> >
> > KVM Userfault is not compatible with async page faults. Nikita has
> > proposed a new implementation of async page faults that is more
> > userspace-driven that *is* compatible with KVM Userfault[5].
> >
> > Performance
> > ===========
> >
> > The takeaways I have are:
> >
> > 1. For cases where lock contention is not a concern, there is a
> >    discernable win because KVM Userfault saves the trip through the
> >    userfaultfd poll/read/WAKE cycle.
> >
> > 2. Using a single userfaultfd without KVM Userfault gets very slow as
> >    the number of vCPUs increases, and it gets even slower when you add
> >    more reader threads. This is due to contention on the userfaultfd
> >    wait_queue locks. This is the contention that KVM Userfault avoids.
> >    Compare this to the multiple-userfaultfd runs; they are much faster
> >    because the wait_queue locks are sharded perfectly (1 per vCPU).
> >    Perfect sharding is only possible because the vCPUs are configured to
> >    touch only their own chunk of memory.
>
> I'll try to spend some more time after holidays on this perf issue. But
> will still be after the 1g support on !coco gmem if it works out.  The
> 1g function is still missing in QEMU, so that one has higher priority
> compared to either perf or downtime (e.g. I'll also need to measure
> whether QEMU will need minor fault, or stick with missing as of now).
>
> Maybe I'll also start to explore a bit on [g]memfd support on userfault,
> I'm not sure whether anyone started working on some generic solution before
> for CoCo / gmem postcopy - we need to still have a solution for either
> firecrackers or OVS/vhost-user.  I feel like we need that sooner or later,
> one way or another.  I think I'll start without minor faults support until
> justified, and if I'll ever be able to start it at all in a few months next
> year..

Yeah we'll need userfaultfd support for !coco gmem when gmem supports
mapping into userspace to support setups that use OVS/vhost-user.

And feel free to start with MISSING or MINOR, I don't mind. Eventually
I'll probably need MINOR support; I'm happy to work on it when the
time comes (I'm waiting for KVM Userfault to get merged and then for
userspace mapping of gmem to get merged).

FWIW, I think userspace mapping of gmem + userfaultfd support for
userspace-mapped gmem + 1G page support for gmem = good 1G post-copy
for QEMU (i.e., use gmem instead of hugetlbfs after gmem supports 1G
pages).

Remember the feedback I got from LSFMM a while ago? "don't use
hugetlbfs." gmem seems like the natural replacement.

> Let me know if there's any comment on above thoughts.
>
> I guess this feature might be useful to QEMU too, but QEMU always needs
> uffd or something similar, so we would need to measure whether this one is
> useful in a real QEMU setup.  For example, we need to see how the page
> transfer overhead compares with lock contentions when there're, say, 400
> vcpus.  If some speedup on userfault + the transfer overhead is close to
> what we can get with vcpu exits, then QEMU may still stick with a simple
> model.  But not sure.

It would be nice to integrate this into QEMU and see if there's a
measurable win. I'll try to find some time to look into this.

For GCE's userspace, there is a measurable win. Anish posted some
results a while ago here[1].

[1]: https://lore.kernel.org/kvm/CAF7b7mqrLP1VYtwB4i0x5HC1eYen9iMvZbKerCKWrCFv7tDg5Q@mail.gmail.com/

> When integrated with this feature, it also means some other overheads, at
> least for QEMU.  E.g., trapping / resolving a page fault needs two ops now
> (uffd and the bitmap).

I think about it like this: instead of UFFDIO_CONTINUE + implicit wake
(or even UFFDIO_CONTINUE + UFFDIO_WAKE; this is how my userspace does
it actually), it is UFFDIO_CONTINUE (no wake) + bitmap set. So it's
kind of still two "ops" either way... :) and updating the bitmap is
really cheap compared to a userfaultfd wake-up.

(Having two things that describe something like "we should fault on
this page" is a little bit confusing.)
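Concretely, the two-op resolution path described above might look like the
sketch below.  UFFDIO_CONTINUE and UFFDIO_CONTINUE_MODE_DONTWAKE are
existing userfaultfd ABI; the atomic bitmap clear is the piece this series
adds, and the helper names and word-based layout are illustrative
assumptions.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Illustrative helper: clear one bit in a 64-bit-word bitmap. */
static void clear_userfault_bit(_Atomic uint64_t *bitmap, size_t bit)
{
	atomic_fetch_and(&bitmap[bit / 64], ~(1ull << (bit % 64)));
}

/*
 * Resolve one demand-fetched page: install the (already-populated)
 * page-cache page without waking any uffd waiters, then flip the KVM
 * Userfault bit so vCPU faults stop exiting for this page.
 */
static int resolve_page(int uffd, unsigned long hva, unsigned long page_size,
			_Atomic uint64_t *bitmap, size_t bit)
{
	struct uffdio_continue cont = {
		.range = { .start = hva, .len = page_size },
		.mode  = UFFDIO_CONTINUE_MODE_DONTWAKE,	/* no wake-up op */
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
		return -1;

	clear_userfault_bit(bitmap, bit);
	return 0;
}
```

So the "bitmap set" (here, clear) simply replaces the wake: an atomic
store instead of a wait-queue walk under the uffd spinlock.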

> Meanwhile even if vcpu can get rid of uffd's one big
> spinlock, it may contend again in userspace, either on page resolution or
> on similar queuing.  I think I mentioned it previously but I guess it's
> nontrivial to justify.  In all cases, I trust that you should have better
> judgement on this.  It's just that QEMU can at least behave differently, so
> not sure how it'll go there.

The contention that you may run into after you take away the uffd
spinlock depends on what userspace is doing, yeah. For example in GCE,
before the TDP MMU, we ran into KVM MMU lock contention, and then
later we ran into eventfd issues in our network request
submission/completion queues.

>
> Happy holidays. :)

You too!

Thanks for the feedback, Peter! Let me know if I've misunderstood any
of the points you were making.

> Thanks,
>
> --
> Peter Xu
>



* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-02 17:53   ` James Houghton
@ 2025-01-16 20:19     ` Peter Xu
  2025-01-16 20:32       ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2025-01-16 20:19 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

James,

Sorry for a late reply.

I still do have one or two pure questions, but nothing directly relevant to
your series.

On Thu, Jan 02, 2025 at 12:53:11PM -0500, James Houghton wrote:
> So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
> worth the extra/duplicated complexity. And at LPC, Paolo and Sean
> indicated that this direction was indeed wrong. I have another way to
> make this work in mind. :)

Do you still want to share it, more or less? :)

> 
> For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> replacing it. And as of right now anyway, KVM Userfault *does* provide
> a complete post-copy system for gmem.
> 
> When gmem pages can be mapped into userspace, for post-copy to remain
> functional, userspace-mapped gmem will need userfaultfd integration.
> Keep in mind that even after this integration happens, userfaultfd
> alone will *not* be a complete post-copy solution, as vCPU faults
> won't be resolved via the userspace page tables.

Do you know in context of CoCo, whether a private page can be accessed at
all outside of KVM?

I think I'm pretty sure now a private page can never be mapped to
userspace.  However, can another module like vhost-kernel access it during
postcopy?  My impression of that is still a yes, but then how about
vhost-user?

Here, the "vhost-kernel" part represents a question on whether private
pages can be accessed at all outside KVM.  While "vhost-user" part
represents a question on whether, if the previous vhost-kernel question
answers as "yes it can", such access attempt can happen in another
process/task (hence, not only does it lack KVM context, but also not
sharing the same task context).

Thanks,

-- 
Peter Xu




* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 20:19     ` Peter Xu
@ 2025-01-16 20:32       ` Peter Xu
  2025-01-16 22:16         ` Sean Christopherson
  2025-01-16 22:51         ` James Houghton
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Xu @ 2025-01-16 20:32 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> James,
> 
> Sorry for a late reply.
> 
> I still do have one or two pure questions, but nothing directly relevant to
> your series.
> 
> On Thu, Jan 02, 2025 at 12:53:11PM -0500, James Houghton wrote:
> > So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
> > worth the extra/duplicated complexity. And at LPC, Paolo and Sean
> > indicated that this direction was indeed wrong. I have another way to
> > make this work in mind. :)
> 
> Do you still want to share it, more or less? :)
> 
> > 
> > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > a complete post-copy system for gmem.
> > 
> > When gmem pages can be mapped into userspace, for post-copy to remain
> > functional, userspace-mapped gmem will need userfaultfd integration.
> > Keep in mind that even after this integration happens, userfaultfd
> > alone will *not* be a complete post-copy solution, as vCPU faults
> > won't be resolved via the userspace page tables.
> 
> Do you know in context of CoCo, whether a private page can be accessed at
> all outside of KVM?
> 
> I think I'm pretty sure now a private page can never be mapped to
> userspace.  However, can another module like vhost-kernel access it during
> postcopy?  My impression of that is still a yes, but then how about
> vhost-user?
> 
> Here, the "vhost-kernel" part represents a question on whether private
> pages can be accessed at all outside KVM.  While "vhost-user" part
> represents a question on whether, if the previous vhost-kernel question
> answers as "yes it can", such access attempt can happen in another
> process/task (hence, not only does it lack KVM context, but also not
> sharing the same task context).

Right after I sent it, I just recalled whenever a device needs to access
the page, it needs to be converted to shared pages first..

So I suppose the questions were not valid at all!  It is not about the
context, but that the pages will always be shared whenever a device, in
whatever form, accesses them..

Fundamentally I'm thinking about whether userfaultfd must support an (fd,
offset) tuple.  Now I suppose it doesn't need to, because vCPUs accessing
private/shared memory will all exit to userspace, while all non-vCPU users
/ devices can access shared pages only.

In that case, looks like userfaultfd can support CoCo on device emulations
by sticking with virtual-address traps like before, at least from that
specific POV.

-- 
Peter Xu




* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 20:32       ` Peter Xu
@ 2025-01-16 22:16         ` Sean Christopherson
  2025-01-16 23:04           ` James Houghton
  2025-01-16 22:51         ` James Houghton
  1 sibling, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2025-01-16 22:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025, Peter Xu wrote:
> On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > a complete post-copy system for gmem.
> > > 
> > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > Keep in mind that even after this integration happens, userfaultfd
> > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > won't be resolved via the userspace page tables.
> > 
> > Do you know in context of CoCo, whether a private page can be accessed at
> > all outside of KVM?
> > 
> > I think I'm pretty sure now a private page can never be mapped to
> > userspace.  However, can another module like vhost-kernel access it during
> > postcopy?  My impression of that is still a yes, but then how about
> > vhost-user?
> > 
> > Here, the "vhost-kernel" part represents a question on whether private
> > pages can be accessed at all outside KVM.  While "vhost-user" part
> > represents a question on whether, if the previous vhost-kernel question
> > answers as "yes it can", such access attempt can happen in another
> > process/task (hence, not only does it lack KVM context, but also not
> > sharing the same task context).
> 
> Right after I sent it, I just recalled whenever a device needs to access
> the page, it needs to be converted to shared pages first..

FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
private memory.  The basic gist is that the IOMMU will enforce access to private
memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.

[*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf

> So I suppose the questions were not valid at all!  It is not about the
> context but that the pages will be shared always whenever a device in
> whatever form will access it..
> 
> Fundamentally I'm thinking about whether userfaultfd must support (fd,
> offset) tuple.  Now I suppose it's not, because vCPUs accessing
> private/shared will all exit to userspace, while all non-vCPU / devices can
> access shared pages only.
> 
> In that case, looks like userfaultfd can support CoCo on device emulations
> by sticking with virtual-address traps like before, at least from that
> specific POV.
> 
> -- 
> Peter Xu
> 



* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 20:32       ` Peter Xu
  2025-01-16 22:16         ` Sean Christopherson
@ 2025-01-16 22:51         ` James Houghton
  2025-01-16 23:31           ` Peter Xu
  1 sibling, 1 reply; 32+ messages in thread
From: James Houghton @ 2025-01-16 22:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > James,
> >
> > Sorry for a late reply.
> >
> > I still do have one or two pure questions, but nothing directly relevant to
> > your series.
> >
> > On Thu, Jan 02, 2025 at 12:53:11PM -0500, James Houghton wrote:
> > > So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
> > > worth the extra/duplicated complexity. And at LPC, Paolo and Sean
> > > indicated that this direction was indeed wrong. I have another way to
> > > make this work in mind. :)
> >
> > Do you still want to share it, more or less? :)

I think I'm referring to how to make 4K demand fetches for 1G-backed
guest memory work, and I kind of said what I was thinking a little
further down:

On Thu, Jan 2, 2025 at 9:53 AM James Houghton <jthoughton@google.com> wrote:
>
> FWIW, I think userspace mapping of gmem + userfaultfd support for
> userspace-mapped gmem + 1G page support for gmem = good 1G post-copy
> for QEMU (i.e., use gmem instead of hugetlbfs after gmem supports 1G
> pages).
>
> Remember the feedback I got from LSFMM a while ago? "don't use
> hugetlbfs." gmem seems like the natural replacement.

I guess this might not work if QEMU *needs* to use HugeTLB for
whatever reason, but Google's hypervisor just needs 1G pages; it
doesn't matter where they come from really.

> > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > a complete post-copy system for gmem.
> > >
> > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > Keep in mind that even after this integration happens, userfaultfd
> > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > won't be resolved via the userspace page tables.
> >
> > Do you know in context of CoCo, whether a private page can be accessed at
> > all outside of KVM?
> >
> > I think I'm pretty sure now a private page can never be mapped to
> > userspace.  However, can another module like vhost-kernel access it during
> > postcopy?  My impression of that is still a yes, but then how about
> > vhost-user?
> >
> > Here, the "vhost-kernel" part represents a question on whether private
> > pages can be accessed at all outside KVM.  While "vhost-user" part
> > represents a question on whether, if the previous vhost-kernel question
> > answers as "yes it can", such access attempt can happen in another
> > process/task (hence, not only does it lack KVM context, but also not
> > sharing the same task context).
>
> Right after I sent it, I just recalled whenever a device needs to access
> the page, it needs to be converted to shared pages first..

Yep! This is my understanding anyway. Devices will need to GUP or use
the userspace page tables to access guest memory, both of which will
go to userfaultfd. And if userspace hasn't told KVM to make some pages
shared, then these GUPs/faults will fail.

Maybe Trusted I/O changes some things here... let me reply to Sean. :)

> So I suppose the questions were not valid at all!  It is not about the
> context but that the pages will be shared always whenever a device in
> whatever form will access it..
>
> Fundamentally I'm thinking about whether userfaultfd must support (fd,
> offset) tuple.  Now I suppose it's not, because vCPUs accessing
> private/shared will all exit to userspace, while all non-vCPU / devices can
> access shared pages only.
>
> In that case, looks like userfaultfd can support CoCo on device emulations
> by sticking with virtual-address traps like before, at least from that
> specific POV.

Yeah, I don't think the userfaultfd API needs to change to support
gmem, because it's going to be using the VMAs / user mappings of gmem.
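In other words, once gmem is mappable, registering it would presumably
look like any other file-backed registration.  A hedged sketch, where the
one real assumption is that the gmem fd can be mmap()ed at all; MINOR mode
is shown because the pages would already exist in the file, but as noted
above the MISSING-vs-MINOR choice is still open:

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/*
 * Build the registration request for a guest-memory mapping.  This is
 * plain existing userfaultfd ABI -- nothing gmem-specific.
 */
static struct uffdio_register guest_mem_register_req(void *addr, size_t len)
{
	struct uffdio_register reg = {
		.range = {
			.start = (unsigned long)addr,
			.len   = len,
		},
		/* MINOR: pages exist in the file; faults gate pte install. */
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	return reg;
}

/* Hypothetical setup: map the gmem fd, then register the VMA with uffd. */
static void *map_and_register(int uffd, int gmem_fd, size_t len)
{
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 gmem_fd, 0);
	if (mem == MAP_FAILED)
		return NULL;

	struct uffdio_register reg = guest_mem_register_req(mem, len);
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		munmap(mem, len);
		return NULL;
	}
	return mem;
}
```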



* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 22:16         ` Sean Christopherson
@ 2025-01-16 23:04           ` James Houghton
  2025-01-16 23:17             ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: James Houghton @ 2025-01-16 23:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025 at 2:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Peter Xu wrote:
> > On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > > a complete post-copy system for gmem.
> > > >
> > > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > > Keep in mind that even after this integration happens, userfaultfd
> > > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > > won't be resolved via the userspace page tables.
> > >
> > > Do you know in context of CoCo, whether a private page can be accessed at
> > > all outside of KVM?
> > >
> > > I think I'm pretty sure now a private page can never be mapped to
> > > userspace.  However, can another module like vhost-kernel access it during
> > > postcopy?  My impression of that is still a yes, but then how about
> > > vhost-user?
> > >
> > > Here, the "vhost-kernel" part represents a question on whether private
> > > pages can be accessed at all outside KVM.  While "vhost-user" part
> > > represents a question on whether, if the previous vhost-kernel question
> > > answers as "yes it can", such access attempt can happen in another
> > > process/task (hence, not only does it lack KVM context, but also not
> > > sharing the same task context).
> >
> > Right after I sent it, I just recalled whenever a device needs to access
> > the page, it needs to be converted to shared pages first..
>
> FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
> private memory.  The basic gist is that the IOMMU will enforce access to private
> memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
> TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.
>
> [*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf

Hi Sean,

Do you know what API the IOMMU driver would use to get the private
pages to map? Normally it'd use GUP, but GUP would/should fail for
guest-private pages, right?



* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 23:04           ` James Houghton
@ 2025-01-16 23:17             ` Peter Xu
  2025-01-16 23:46               ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2025-01-16 23:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025 at 03:04:45PM -0800, James Houghton wrote:
> On Thu, Jan 16, 2025 at 2:16 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 16, 2025, Peter Xu wrote:
> > > On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > > > a complete post-copy system for gmem.
> > > > >
> > > > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > > > Keep in mind that even after this integration happens, userfaultfd
> > > > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > > > won't be resolved via the userspace page tables.
> > > >
> > > > Do you know in context of CoCo, whether a private page can be accessed at
> > > > all outside of KVM?
> > > >
> > > > I think I'm pretty sure now a private page can never be mapped to
> > > > userspace.  However, can another module like vhost-kernel access it during
> > > > postcopy?  My impression of that is still a yes, but then how about
> > > > vhost-user?
> > > >
> > > > Here, the "vhost-kernel" part represents a question on whether private
> > > > pages can be accessed at all outside KVM.  While "vhost-user" part
> > > > represents a question on whether, if the previous vhost-kernel question
> > > > answers as "yes it can", such access attempt can happen in another
> > > > process/task (hence, not only does it lack KVM context, but also not
> > > > sharing the same task context).
> > >
> > > Right after I sent it, I just recalled whenever a device needs to access
> > > the page, it needs to be converted to shared pages first..
> >
> > FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
> > private memory.  The basic gist is that the IOMMU will enforce access to private
> > memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
> > TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.
> >
> > [*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf

Thanks, Sean.  This is interesting to know..

> 
> Hi Sean,
> 
> Do you know what API the IOMMU driver would use to get the private
> pages to map? Normally it'd use GUP, but GUP would/should fail for
> guest-private pages, right?

James,

I'm still reading the link Sean shared; it looks like there's an answer in
the white paper on this for assigned devices:

        TDIs access memory via either guest virtual address (GVA) space or
        guest physical address (GPA) space.  The I/O Memory Management Unit
        (IOMMU) in the host hardware is responsible for translating the
        provided GVAs or GPAs into system physical addresses
        (SPAs). Because SEV-SNP enforces access control at the time of
        translation, the IOMMU performs RMP entry lookups on translation

So I suppose after the device is attested and trusted, it can directly map
everything if wanted, and DMA directly to the encrypted pages.

OTOH, for my specific question (on vhost-kernel, or vhost-user), I suppose
they cannot be attested since they are still part of host software.. so I'm
guessing they'll need to still stick with shared pages, and use bounce
buffers to do DMAs..

-- 
Peter Xu




* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 22:51         ` James Houghton
@ 2025-01-16 23:31           ` Peter Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Xu @ 2025-01-16 23:31 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Sean Christopherson, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025 at 02:51:11PM -0800, James Houghton wrote:
> I guess this might not work if QEMU *needs* to use HugeTLB for
> whatever reason, but Google's hypervisor just needs 1G pages; it
> doesn't matter where they come from really.

I see now.  Yes I suppose it works for QEMU too.

[...]

> > In that case, looks like userfaultfd can support CoCo on device emulations
> > by sticking with virtual-address traps like before, at least from that
> > specific POV.
> 
> Yeah, I don't think the userfaultfd API needs to change to support
> gmem, because it's going to be using the VMAs / user mappings of gmem.

There are other things I am still thinking about on how the notification
could happen when CoCo is enabled, especially when there's no vcpu context.

The first thing is any PV interfaces, and what's currently in my mind is
kvmclock.  I suppose that could work like untrusted dmas, so that when the
hypervisor wants to read/update the clock struct, it'll access a shared
page and then the guest can move it from/to a private page.  Or I'm not
sure whether such information is proven to be not sensitive data, so the
guest can directly use a permanent shared page for such purpose (in which
case should still be part of guest memory, hence access to it can be
trapped just like other shared pages via userfaultfd).

The other thing is, after reading the SEV-TIO white paper, I found it could
be easy to implement page faults for trusted devices now.  For example, the
white paper says the host IOMMU will be responsible for translating trusted
devices' DMAs into GPA/GVA; I think it means KVM would somehow share the
secondary pgtable with the IOMMU, and probably when a DMA sees a missing
page it can now easily generate a page fault against the secondary page
table.  However, this is a DMA op and it definitely also doesn't have a
vcpu context.  So the question is how to trap it.

So maybe (fd, offset) support will still be needed at some point, as it
would be more future-proof.  But I don't have a solid opinion yet.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 00/13] KVM: Introduce KVM Userfault
  2025-01-16 23:17             ` Peter Xu
@ 2025-01-16 23:46               ` Sean Christopherson
  0 siblings, 0 replies; 32+ messages in thread
From: Sean Christopherson @ 2025-01-16 23:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, Paolo Bonzini, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Yan Zhao, Nikita Kalyazin, Anish Moorthy,
	Peter Gonda, David Matlack, Wei W, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm

On Thu, Jan 16, 2025, Peter Xu wrote:
> On Thu, Jan 16, 2025 at 03:04:45PM -0800, James Houghton wrote:
> > On Thu, Jan 16, 2025 at 2:16 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025, Peter Xu wrote:
> > > > On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > > > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > > > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > > > > a complete post-copy system for gmem.
> > > > > >
> > > > > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > > > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > > > > Keep in mind that even after this integration happens, userfaultfd
> > > > > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > > > > won't be resolved via the userspace page tables.
> > > > >
> > > > > Do you know whether, in the context of CoCo, a private page can be
> > > > > accessed at all outside of KVM?
> > > > >
> > > > > I'm now pretty sure that a private page can never be mapped into
> > > > > userspace.  However, can another module like vhost-kernel access it
> > > > > during post-copy?  My impression is still yes, but then what about
> > > > > vhost-user?
> > > > >
> > > > > Here, the "vhost-kernel" part asks whether private pages can be
> > > > > accessed at all outside KVM.  The "vhost-user" part asks whether, if
> > > > > the answer to the vhost-kernel question is "yes, it can", such an
> > > > > access attempt can happen in another process/task (hence lacking not
> > > > > only KVM context but also a shared task context).
> > > >
> > > > Right after I sent it, I recalled that whenever a device needs to
> > > > access a page, the page first needs to be converted to shared..
> > >
> > > FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
> > > private memory.  The basic gist is that the IOMMU will enforce access to private
> > > memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
> > > TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.
> > >
> > > [*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf
> 
> Thanks, Sean.  This is interesting to know..
> 
> > 
> > Hi Sean,
> > 
> > Do you know what API the IOMMU driver would use to get the private
> > pages to map? Normally it'd use GUP, but GUP would/should fail for
> > guest-private pages, right?
> 
> James,
> 
I'm still reading the link Sean shared; it looks like the white paper has
an answer on this for assigned devices:
> 
>         TDIs access memory via either guest virtual address (GVA) space or
>         guest physical address (GPA) space.  The I/O Memory Management Unit
>         (IOMMU) in the host hardware is responsible for translating the
>         provided GVAs or GPAs into system physical addresses
>         (SPAs). Because SEV-SNP enforces access control at the time of
>         translation, the IOMMU performs RMP entry lookups on translation
> 
> So I suppose that after the device is attested and trusted, it can map
> everything directly if desired, and DMA directly to the encrypted pages.

But as James called out, the kernel still needs to actually map guest_memfd
memory (all other memory is shared), and guest_memfd does not and will not ever
support GUP/mmap() of *private* memory.

There's an RFC under heavy discussion that I assume will handle some (all?)
of this (I have largely ignored the thread).

https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com

> OTOH, for my specific question (on vhost-kernel, or vhost-user), I suppose
> they cannot be attested but are still part of host software.. so I'm
> guessing they'll need to stick with shared pages and use a bounce buffer
> to do DMAs..

Yep.  There's no sane way to attest software that runs in "regular" mode on the
CPU, and so things like device emulation and vhost will always be restricted to
shared memory.
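The shared-page bounce-buffer flow discussed above can be modeled as a toy
sketch.  This is purely illustrative (the class and function names are
invented for this example, not kernel code): an unattested component such as
vhost may only touch shared pages, so the guest explicitly shares a page,
lets the untrusted DMA land there, then copies the data back into private
memory.

```python
# Toy model of the shared-page bounce-buffer flow for untrusted DMA
# into a CoCo guest.  All names are illustrative, not real kernel APIs.

class Guest:
    def __init__(self):
        self.private = {}   # gfn -> bytes; inaccessible to the host
        self.shared = {}    # gfn -> bytes; host-visible

    def share(self, gfn):
        # Guest converts a page to shared so the host/device may write it.
        self.shared[gfn] = self.private.pop(gfn, b"")

    def absorb(self, gfn):
        # Guest copies the DMA'd data from the shared bounce page back
        # into private memory, re-converting the page to private.
        self.private[gfn] = self.shared.pop(gfn)

def untrusted_dma(guest, gfn, data):
    # An unattested component (e.g. vhost) can only touch shared pages;
    # a write to a private gfn is rejected, modeling the RMP/S-EPT check.
    if gfn not in guest.shared:
        raise PermissionError("private page: DMA blocked")
    guest.shared[gfn] = data

g = Guest()
g.private[0x40] = b"old"
g.share(0x40)                   # guest opts the page into the shared state
untrusted_dma(g, 0x40, b"new")  # device DMAs into the bounce page
g.absorb(0x40)                  # data now lives in private memory again
```

The point of the model is the extra copy: unlike a trusted (attested)
device that could DMA straight into encrypted pages via the IOMMU, an
untrusted one always goes through an explicitly shared page plus a
guest-side copy.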


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-01-16 23:52 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-04 19:13 [PATCH v1 00/13] KVM: Introduce KVM Userfault James Houghton
2024-12-04 19:13 ` [PATCH v1 01/13] KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap James Houghton
2024-12-05 11:52   ` kernel test robot
2024-12-05 14:22   ` kernel test robot
2024-12-06 22:46     ` James Houghton
2024-12-04 19:13 ` [PATCH v1 02/13] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT James Houghton
2024-12-04 19:13 ` [PATCH v1 03/13] KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot James Houghton
2024-12-04 19:13 ` [PATCH v1 04/13] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION James Houghton
2024-12-04 19:13 ` [PATCH v1 05/13] KVM: x86/mmu: Add support for KVM_MEM_USERFAULT James Houghton
2024-12-04 19:13 ` [PATCH v1 06/13] KVM: arm64: " James Houghton
2024-12-04 23:07   ` Oliver Upton
2024-12-05 23:31     ` James Houghton
2024-12-06  0:45       ` Oliver Upton
2024-12-04 19:13 ` [PATCH v1 07/13] KVM: selftests: Fix vm_mem_region_set_flags docstring James Houghton
2024-12-04 19:13 ` [PATCH v1 08/13] KVM: selftests: Fix prefault_mem logic James Houghton
2024-12-04 19:13 ` [PATCH v1 09/13] KVM: selftests: Add va_start/end into uffd_desc James Houghton
2024-12-04 19:13 ` [PATCH v1 10/13] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
2024-12-14 22:46   ` kernel test robot
2024-12-04 19:13 ` [PATCH v1 11/13] KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT James Houghton
2024-12-04 19:13 ` [PATCH v1 12/13] KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests James Houghton
2024-12-04 19:13 ` [PATCH v1 13/13] KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details James Houghton
2024-12-07  1:38   ` Bagas Sanjaya
2024-12-24 21:07 ` [PATCH v1 00/13] KVM: Introduce KVM Userfault Peter Xu
2025-01-02 17:53   ` James Houghton
2025-01-16 20:19     ` Peter Xu
2025-01-16 20:32       ` Peter Xu
2025-01-16 22:16         ` Sean Christopherson
2025-01-16 23:04           ` James Houghton
2025-01-16 23:17             ` Peter Xu
2025-01-16 23:46               ` Sean Christopherson
2025-01-16 22:51         ` James Houghton
2025-01-16 23:31           ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).