* [RFC PATCH 01/18] KVM: Add KVM_USERFAULT build option
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 02/18] KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT James Houghton
` (20 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Some architectures will not support KVM_USERFAULT, so we need to have a
build option to avoid including it for those architectures.
Signed-off-by: James Houghton <jthoughton@google.com>
---
virt/kvm/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 754c6c923427..f1b660d593e4 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -117,3 +117,7 @@ config HAVE_KVM_GMEM_PREPARE
config HAVE_KVM_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config KVM_USERFAULT
+ select KVM_GENERIC_MEMORY_ATTRIBUTES
+ bool
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 02/18] KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
2024-07-10 23:42 ` [RFC PATCH 01/18] KVM: Add KVM_USERFAULT build option James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-15 21:37 ` Anish Moorthy
2024-07-10 23:42 ` [RFC PATCH 03/18] KVM: Put struct kvm pointer in memslot James Houghton
` (19 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Add the ability to enable and disable KVM Userfault, and add
KVM_MEMORY_ATTRIBUTE_USERFAULT to control whether or not pages should
trigger userfaults.
The presence of a kvm_userfault_ctx in the struct kvm is what signifies
whether KVM Userfault is enabled or not. To make sure that this struct
is non-empty, include a struct eventfd_ctx pointer, although it is not
used in this patch.
Signed-off-by: James Houghton <jthoughton@google.com>
---
Documentation/virt/kvm/api.rst | 23 ++++++++
include/linux/kvm_host.h | 14 +++++
include/uapi/linux/kvm.h | 5 ++
virt/kvm/kvm_main.c | 96 +++++++++++++++++++++++++++++++++-
4 files changed, 136 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a71d91978d9e..26a98fea718c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8070,6 +8070,29 @@ error/annotated fault.
See KVM_EXIT_MEMORY_FAULT for more information.
+7.35 KVM_CAP_USERFAULT
+------------------------------
+
+:Architectures: none
+:Parameters: args[0] - whether or not to enable KVM Userfault. To enable,
+ pass KVM_USERFAULT_ENABLE, and to disable pass
+ KVM_USERFAULT_DISABLE.
+ args[1] - the eventfd to be notified when asynchronous userfaults
+ occur.
+
+:Returns: 0 on success, -EINVAL if args[0] is not KVM_USERFAULT_ENABLE
+ or KVM_USERFAULT_DISABLE, or if KVM Userfault is not supported.
+
+This capability, if enabled with KVM_ENABLE_CAP, allows userspace to mark
+regions of memory as KVM_MEMORY_ATTRIBUTE_USERFAULT, in which case, attempted
+accesses to these regions of memory by KVM_RUN will fail with
+KVM_EXIT_MEMORY_FAULT. Attempted accesses by other ioctls will fail with
+EFAULT.
+
+Enabling this capability will cause all future faults to create
+small-page-sized sptes. Collapsing these sptes back into their optimal size
+is done with KVM_COLLAPSE_PAGE_TABLES.
+
8. Other capabilities.
======================
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b57878c8c18..f0d4db2d64af 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -730,6 +730,10 @@ struct kvm_memslots {
int node_idx;
};
+struct kvm_userfault_ctx {
+ struct eventfd_ctx *ev_fd;
+};
+
struct kvm {
#ifdef KVM_HAVE_MMU_RWLOCK
rwlock_t mmu_lock;
@@ -831,6 +835,7 @@ struct kvm {
bool dirty_ring_with_bitmap;
bool vm_bugged;
bool vm_dead;
+ struct kvm_userfault_ctx __rcu *userfault_ctx;
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
@@ -2477,4 +2482,13 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
#endif
+static inline bool kvm_userfault_enabled(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_USERFAULT
+ return !!rcu_access_pointer(kvm->userfault_ctx);
+#else
+ return false;
+#endif
+}
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d03842abae57..c84c24a9678e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -917,6 +917,7 @@ struct kvm_enable_cap {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_USERFAULT 236
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1539,6 +1540,7 @@ struct kvm_memory_attributes {
};
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+#define KVM_MEMORY_ATTRIBUTE_USERFAULT (1ULL << 4)
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
@@ -1548,4 +1550,7 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};
+#define KVM_USERFAULT_ENABLE (1ULL << 0)
+#define KVM_USERFAULT_DISABLE (1ULL << 1)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8e422c2c9450..fb7972e61439 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2430,10 +2430,16 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
static u64 kvm_supported_mem_attributes(struct kvm *kvm)
{
+ u64 attributes = 0;
if (!kvm || kvm_arch_has_private_mem(kvm))
- return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+ attributes |= KVM_MEMORY_ATTRIBUTE_PRIVATE;
- return 0;
+#ifdef CONFIG_KVM_USERFAULT
+ if (!kvm || kvm_userfault_enabled(kvm))
+ attributes |= KVM_MEMORY_ATTRIBUTE_USERFAULT;
+#endif
+
+ return attributes;
}
static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
@@ -4946,6 +4952,84 @@ bool kvm_are_all_memslots_empty(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_are_all_memslots_empty);
+#ifdef CONFIG_KVM_USERFAULT
+static int kvm_disable_userfault(struct kvm *kvm)
+{
+ struct kvm_userfault_ctx *ctx;
+
+ mutex_lock(&kvm->slots_lock);
+
+ ctx = rcu_replace_pointer(kvm->userfault_ctx, NULL,
+ mutex_is_locked(&kvm->slots_lock));
+
+ mutex_unlock(&kvm->slots_lock);
+
+ if (!ctx)
+ return 0;
+
+ /* Wait for everyone to stop using userfault. */
+ synchronize_srcu(&kvm->srcu);
+
+ eventfd_ctx_put(ctx->ev_fd);
+ kfree(ctx);
+ return 0;
+}
+
+static int kvm_enable_userfault(struct kvm *kvm, int event_fd)
+{
+ struct kvm_userfault_ctx *userfault_ctx;
+ struct eventfd_ctx *ev_fd;
+ int ret;
+
+ mutex_lock(&kvm->slots_lock);
+
+ ret = -EEXIST;
+ if (kvm_userfault_enabled(kvm))
+ goto out;
+
+ ret = -ENOMEM;
+ userfault_ctx = kmalloc(sizeof(*userfault_ctx), GFP_KERNEL);
+ if (!userfault_ctx)
+ goto out;
+
+ ev_fd = eventfd_ctx_fdget(event_fd);
+ if (IS_ERR(ev_fd)) {
+ ret = PTR_ERR(ev_fd);
+ kfree(userfault_ctx);
+ goto out;
+ }
+
+ ret = 0;
+ userfault_ctx->ev_fd = ev_fd;
+
+ rcu_assign_pointer(kvm->userfault_ctx, userfault_ctx);
+out:
+ mutex_unlock(&kvm->slots_lock);
+ return ret;
+}
+
+static int kvm_vm_ioctl_enable_userfault(struct kvm *kvm, int options,
+ int event_fd)
+{
+ u64 allowed_options = KVM_USERFAULT_ENABLE |
+ KVM_USERFAULT_DISABLE;
+ bool enable;
+
+ if (options & ~allowed_options)
+ return -EINVAL;
+ /* Exactly one of ENABLE or DISABLE must be set. */
+ if (options == allowed_options || !options)
+ return -EINVAL;
+
+ enable = options & KVM_USERFAULT_ENABLE;
+
+ if (enable)
+ return kvm_enable_userfault(kvm, event_fd);
+ else
+ return kvm_disable_userfault(kvm);
+}
+#endif
+
static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
@@ -5009,6 +5093,14 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
return r;
}
+#ifdef CONFIG_KVM_USERFAULT
+ case KVM_CAP_USERFAULT:
+ if (cap->flags)
+ return -EINVAL;
+
+ return kvm_vm_ioctl_enable_userfault(kvm, cap->args[0],
+ cap->args[1]);
+#endif
default:
return kvm_vm_ioctl_enable_cap(kvm, cap);
}
--
2.45.2.993.g49e7a77208-goog
* Re: [RFC PATCH 02/18] KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT
2024-07-10 23:42 ` [RFC PATCH 02/18] KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT James Houghton
@ 2024-07-15 21:37 ` Anish Moorthy
0 siblings, 0 replies; 40+ messages in thread
From: Anish Moorthy @ 2024-07-15 21:37 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Peter Xu, Axel Rasmussen, David Matlack, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Not to nitpick an RFC, but since the stuff in this patch seems less
likely to change I think you should avoid using #ifdefs
For instance
On Wed, Jul 10, 2024 at 4:43 PM James Houghton <jthoughton@google.com> wrote:
> +static inline bool kvm_userfault_enabled(struct kvm *kvm)
> +{
> +#ifdef CONFIG_KVM_USERFAULT
> + return !!rcu_access_pointer(kvm->userfault_ctx);
> +#else
> + return false;
> +#endif
> +}
can be
> +static inline bool kvm_userfault_enabled(struct kvm *kvm)
> +{
> + if (IS_ENABLED(CONFIG_KVM_USERFAULT)) {
> + return !!rcu_access_pointer(kvm->userfault_ctx);
> + } else {
> + return false;
> + }
> +}
(however kernel style tells you to write that :), the cap-supported
check can be moved into kvm_vm_ioctl_enable_userfault(), etc
* [RFC PATCH 03/18] KVM: Put struct kvm pointer in memslot
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
2024-07-10 23:42 ` [RFC PATCH 01/18] KVM: Add KVM_USERFAULT build option James Houghton
2024-07-10 23:42 ` [RFC PATCH 02/18] KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 04/18] KVM: Fail __gfn_to_hva_many for userfault gfns James Houghton
` (18 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Because a gfn having userfault enabled is tied to a struct kvm, we need
a pointer to it. We could pass the kvm pointer around to the routines
that need it, but that is a lot of churn, and there isn't much of a
downside to simply storing the pointer in the memslot.
Signed-off-by: James Houghton <jthoughton@google.com>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f0d4db2d64af..c1eb59a3141b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -596,6 +596,8 @@ struct kvm_memory_slot {
pgoff_t pgoff;
} gmem;
#endif
+
+ struct kvm *kvm;
};
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb7972e61439..ffa452a13672 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1769,6 +1769,7 @@ static void kvm_copy_memslot(struct kvm_memory_slot *dest,
dest->flags = src->flags;
dest->id = src->id;
dest->as_id = src->as_id;
+ dest->kvm = src->kvm;
}
static void kvm_invalidate_memslot(struct kvm *kvm,
@@ -2078,6 +2079,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->npages = npages;
new->flags = mem->flags;
new->userspace_addr = mem->userspace_addr;
+ new->kvm = kvm;
if (mem->flags & KVM_MEM_GUEST_MEMFD) {
r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
if (r)
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 04/18] KVM: Fail __gfn_to_hva_many for userfault gfns.
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (2 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 03/18] KVM: Put struct kvm pointer in memslot James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-11 23:40 ` David Matlack
2024-07-10 23:42 ` [RFC PATCH 05/18] KVM: Add KVM_PFN_ERR_USERFAULT James Houghton
` (17 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Add gfn_has_userfault(), which checks (1) that KVM Userfault is enabled,
and (2) that the particular gfn is a userfault gfn.
Check gfn_has_userfault() as part of __gfn_to_hva_many() to prevent
gfn->hva translations for userfault gfns.
Signed-off-by: James Houghton <jthoughton@google.com>
---
include/linux/kvm_host.h | 12 ++++++++++++
virt/kvm/kvm_main.c | 3 +++
2 files changed, 15 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c1eb59a3141b..4cca896fb44a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -140,6 +140,7 @@ static inline bool is_noslot_pfn(kvm_pfn_t pfn)
#define KVM_HVA_ERR_BAD (PAGE_OFFSET)
#define KVM_HVA_ERR_RO_BAD (PAGE_OFFSET + PAGE_SIZE)
+#define KVM_HVA_ERR_USERFAULT (PAGE_OFFSET + 2 * PAGE_SIZE)
static inline bool kvm_is_error_hva(unsigned long addr)
{
@@ -2493,4 +2494,15 @@ static inline bool kvm_userfault_enabled(struct kvm *kvm)
#endif
}
+static inline bool gfn_has_userfault(struct kvm *kvm, gfn_t gfn)
+{
+#ifdef CONFIG_KVM_USERFAULT
+ return kvm_userfault_enabled(kvm) &&
+ (kvm_get_memory_attributes(kvm, gfn) &
+ KVM_MEMORY_ATTRIBUTE_USERFAULT);
+#else
+ return false;
+#endif
+}
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ffa452a13672..758deb90a050 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2686,6 +2686,9 @@ static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t
if (memslot_is_readonly(slot) && write)
return KVM_HVA_ERR_RO_BAD;
+ if (gfn_has_userfault(slot->kvm, gfn))
+ return KVM_HVA_ERR_USERFAULT;
+
if (nr_pages)
*nr_pages = slot->npages - (gfn - slot->base_gfn);
--
2.45.2.993.g49e7a77208-goog
* Re: [RFC PATCH 04/18] KVM: Fail __gfn_to_hva_many for userfault gfns.
2024-07-10 23:42 ` [RFC PATCH 04/18] KVM: Fail __gfn_to_hva_many for userfault gfns James Houghton
@ 2024-07-11 23:40 ` David Matlack
0 siblings, 0 replies; 40+ messages in thread
From: David Matlack @ 2024-07-11 23:40 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Peter Xu, Axel Rasmussen, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
>
> Add gfn_has_userfault() that (1) checks that KVM Userfault is enabled,
> and (2) that our particular gfn is a userfault gfn.
>
> Check gfn_has_userfault() as part of __gfn_to_hva_many to prevent
> gfn->hva translations for userfault gfns.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
> include/linux/kvm_host.h | 12 ++++++++++++
> virt/kvm/kvm_main.c | 3 +++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c1eb59a3141b..4cca896fb44a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -140,6 +140,7 @@ static inline bool is_noslot_pfn(kvm_pfn_t pfn)
>
> #define KVM_HVA_ERR_BAD (PAGE_OFFSET)
> #define KVM_HVA_ERR_RO_BAD (PAGE_OFFSET + PAGE_SIZE)
> +#define KVM_HVA_ERR_USERFAULT (PAGE_OFFSET + 2 * PAGE_SIZE)
>
> static inline bool kvm_is_error_hva(unsigned long addr)
> {
> @@ -2493,4 +2494,15 @@ static inline bool kvm_userfault_enabled(struct kvm *kvm)
> #endif
> }
>
> +static inline bool gfn_has_userfault(struct kvm *kvm, gfn_t gfn)
> +{
> +#ifdef CONFIG_KVM_USERFAULT
> + return kvm_userfault_enabled(kvm) &&
> + (kvm_get_memory_attributes(kvm, gfn) &
> + KVM_MEMORY_ATTRIBUTE_USERFAULT);
> +#else
> + return false;
> +#endif
> +}
> +
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ffa452a13672..758deb90a050 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2686,6 +2686,9 @@ static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t
> if (memslot_is_readonly(slot) && write)
> return KVM_HVA_ERR_RO_BAD;
>
> + if (gfn_has_userfault(slot->kvm, gfn))
> + return KVM_HVA_ERR_USERFAULT;
You missed the "many" part :)
Speaking of, to do this you'll need to convert all callers that pass
in nr_pages to actually set the number of pages they need. Today KVM
just checks from gfn to the end of the slot and returns the total
number of pages via nr_pages. i.e. We could end up checking (and async
fetching) the entire slot!
> +
> if (nr_pages)
> *nr_pages = slot->npages - (gfn - slot->base_gfn);
>
> --
> 2.45.2.993.g49e7a77208-goog
>
* [RFC PATCH 05/18] KVM: Add KVM_PFN_ERR_USERFAULT
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (3 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 04/18] KVM: Fail __gfn_to_hva_many for userfault gfns James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 06/18] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT James Houghton
` (16 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
When a GFN -> HVA translation (__gfn_to_hva_many()) fails with
KVM_HVA_ERR_USERFAULT as part of a GFN -> PFN conversion
(__gfn_to_pfn_memslot()), we need to return some kind of KVM_PFN_ERR.
Introduce a new KVM_PFN_ERR_USERFAULT so that callers know that it is a
userfault.
Signed-off-by: James Houghton <jthoughton@google.com>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 8 +++++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4cca896fb44a..2005906c78c8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_USERFAULT (KVM_PFN_ERR_MASK + 4)
/*
* error pfns indicate that the gfn is in slot but faild to
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 758deb90a050..840e02c75fe3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3009,9 +3009,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
if (kvm_is_error_hva(addr)) {
if (writable)
*writable = false;
-
- return addr == KVM_HVA_ERR_RO_BAD ? KVM_PFN_ERR_RO_FAULT :
- KVM_PFN_NOSLOT;
+ if (addr == KVM_HVA_ERR_RO_BAD)
+ return KVM_PFN_ERR_RO_FAULT;
+ if (addr == KVM_HVA_ERR_USERFAULT)
+ return KVM_PFN_ERR_USERFAULT;
+ return KVM_PFN_NOSLOT;
}
/* Do not map writable pfn in the readonly memslot. */
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 06/18] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (4 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 05/18] KVM: Add KVM_PFN_ERR_USERFAULT James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 07/18] KVM: Provide attributes to kvm_arch_pre_set_memory_attributes James Houghton
` (15 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
This is needed to support returning to userspace when a vCPU attempts to
access a userfault gfn.
Signed-off-by: James Houghton <jthoughton@google.com>
---
include/uapi/linux/kvm.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c84c24a9678e..6aa99b4587c6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -429,6 +429,7 @@ struct kvm_run {
/* KVM_EXIT_MEMORY_FAULT */
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
+#define KVM_MEMORY_EXIT_FLAG_USERFAULT (1ULL << 4)
__u64 flags;
__u64 gpa;
__u64 size;
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 07/18] KVM: Provide attributes to kvm_arch_pre_set_memory_attributes
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (5 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 06/18] KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support James Houghton
` (14 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
This is needed so that architectures can do the right thing (including
at least unmapping newly userfault-enabled gfns) when KVM Userfault is
enabled for a particular range.
Signed-off-by: James Houghton <jthoughton@google.com>
---
virt/kvm/kvm_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 840e02c75fe3..77eb9f0de02d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2514,6 +2514,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
.end = end,
+ .arg.attributes = attributes,
.handler = kvm_pre_set_memory_attributes,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (6 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 07/18] KVM: Provide attributes to kvm_arch_pre_set_memory_attributes James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-17 15:34 ` Wang, Wei W
2024-07-10 23:42 ` [RFC PATCH 09/18] KVM: x86: Add vCPU fault fast-path for Userfault James Houghton
` (13 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
The first prong for enabling KVM Userfault support for x86 is being able
to inform userspace of userfaults. We know a userfault has occurred when
fault->pfn comes back as KVM_PFN_ERR_USERFAULT, so in
kvm_mmu_prepare_memory_fault_exit(), simply check whether fault->pfn is
indeed KVM_PFN_ERR_USERFAULT. This means always setting fault->pfn to a
known value (I have chosen KVM_PFN_ERR_FAULT) before calling
kvm_mmu_prepare_memory_fault_exit().
The next prong is to unmap pages that are newly userfault-enabled. Do
this in kvm_arch_pre_set_memory_attributes().
The final prong is to only allow PAGE_SIZE mappings when KVM Userfault
is enabled. This prevents us from mapping a userfault-enabled gfn with a
fault on a non-userfault-enabled gfn.
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 60 ++++++++++++++++++++++++++-------
arch/x86/kvm/mmu/mmu_internal.h | 3 +-
include/linux/kvm_host.h | 5 ++-
4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 80e5afde69f4..ebd1ec6600bc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -45,6 +45,7 @@ config KVM
select HAVE_KVM_PM_NOTIFIER if PM
select KVM_GENERIC_HARDWARE_ENABLING
select KVM_WERROR if WERROR
+ select KVM_USERFAULT
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1432deb75cbb..6b6a053758ec 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3110,6 +3110,13 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
struct kvm_lpage_info *linfo;
int host_level;
+ /*
+ * KVM Userfault requires new mappings to be 4K, as the userfault check
+ * is done only for the particular page that was faulted on.
+ */
+ if (kvm_userfault_enabled(kvm))
+ return PG_LEVEL_4K;
+
max_level = min(max_level, max_huge_page_level);
for ( ; max_level > PG_LEVEL_4K; max_level--) {
linfo = lpage_info_slot(gfn, slot, max_level);
@@ -3265,6 +3272,9 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
return RET_PF_RETRY;
}
+ if (fault->pfn == KVM_PFN_ERR_USERFAULT)
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+
return -EFAULT;
}
@@ -4316,6 +4326,9 @@ static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
{
u8 req_max_level;
+ if (kvm_userfault_enabled(kvm))
+ return PG_LEVEL_4K;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
@@ -4335,6 +4348,12 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
{
int max_order, r;
+ /*
+ * Make sure a pfn is set so that kvm_mmu_prepare_memory_fault_exit
+ * doesn't read uninitialized memory.
+ */
+ fault->pfn = KVM_PFN_ERR_FAULT;
+
if (!kvm_slot_can_be_private(fault->slot)) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return -EFAULT;
@@ -7390,21 +7409,37 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range)
{
+ unsigned long attrs = range->arg.attributes;
+
/*
- * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
- * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
- * can simply ignore such slots. But if userspace is making memory
- * PRIVATE, then KVM must prevent the guest from accessing the memory
- * as shared. And if userspace is making memory SHARED and this point
- * is reached, then at least one page within the range was previously
- * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
- * Zapping SPTEs in this case ensures KVM will reassess whether or not
- * a hugepage can be used for affected ranges.
+ * For KVM_MEMORY_ATTRIBUTE_PRIVATE:
+ * Zap SPTEs even if the slot can't be mapped PRIVATE. It *seems* like
+ * KVM can simply ignore such slots. But if userspace is making memory
+ * PRIVATE, then KVM must prevent the guest from accessing the memory
+ * as shared. And if userspace is making memory SHARED and this point
+ * is reached, then at least one page within the range was previously
+ * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
+ * Zapping SPTEs in this case ensures KVM will reassess whether or not
+ * a hugepage can be used for affected ranges.
+ *
+ * For KVM_MEMORY_ATTRIBUTE_USERFAULT:
+ * When enabling, we want to zap the mappings that land in @range,
+ * otherwise we will not be able to trap vCPU accesses.
+ * When disabling, we don't need to zap anything.
*/
- if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+ if (WARN_ON_ONCE(!kvm_userfault_enabled(kvm) &&
+ !kvm_arch_has_private_mem(kvm)))
return false;
- return kvm_unmap_gfn_range(kvm, range);
+ if (kvm_arch_has_private_mem(kvm) ||
+ (attrs & KVM_MEMORY_ATTRIBUTE_USERFAULT))
+ return kvm_unmap_gfn_range(kvm, range);
+
+ /*
+ * We are disabling USERFAULT. No unmap necessary. An unmap to get
+ * huge mappings again will come later.
+ */
+ return false;
}
static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
@@ -7458,7 +7493,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
* a range that has PRIVATE GFNs, and conversely converting a range to
* SHARED may now allow hugepages.
*/
- if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+ if (WARN_ON_ONCE(!kvm_userfault_enabled(kvm) &&
+ !kvm_arch_has_private_mem(kvm)))
return false;
/*
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ce2fcd19ba6b..9d8c8c3e00a1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -284,7 +284,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
{
kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
PAGE_SIZE, fault->write, fault->exec,
- fault->is_private);
+ fault->is_private,
+ fault->pfn == KVM_PFN_ERR_USERFAULT);
}
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2005906c78c8..dc12d0a5498b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2400,7 +2400,8 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
gpa_t gpa, gpa_t size,
bool is_write, bool is_exec,
- bool is_private)
+ bool is_private,
+ bool is_userfault)
{
vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
vcpu->run->memory_fault.gpa = gpa;
@@ -2410,6 +2411,8 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
vcpu->run->memory_fault.flags = 0;
if (is_private)
vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+ if (is_userfault)
+ vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_USERFAULT;
}
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
--
2.45.2.993.g49e7a77208-goog
* RE: [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support
2024-07-10 23:42 ` [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support James Houghton
@ 2024-07-17 15:34 ` Wang, Wei W
2024-07-18 17:08 ` James Houghton
0 siblings, 1 reply; 40+ messages in thread
From: Wang, Wei W @ 2024-07-17 15:34 UTC (permalink / raw)
To: James Houghton, Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev
On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> The first prong for enabling KVM Userfault support for x86 is to be able to
> inform userspace of userfaults. We know when userfaults occurs when
> fault->pfn comes back as KVM_PFN_ERR_FAULT, so in
> kvm_mmu_prepare_memory_fault_exit(), simply check if fault->pfn is indeed
> KVM_PFN_ERR_FAULT. This means always setting fault->pfn to a known value (I
> have chosen KVM_PFN_ERR_FAULT) before calling
> kvm_mmu_prepare_memory_fault_exit().
>
> The next prong is to unmap pages that are newly userfault-enabled. Do this in
> kvm_arch_pre_set_memory_attributes().
Why is there a need to unmap it?
I think a userfault is triggered on a page during postcopy when its data has not yet
been fetched from the source; that is, the page has never been accessed by the guest
on the destination and the page table leaf entry is empty.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support
2024-07-17 15:34 ` Wang, Wei W
@ 2024-07-18 17:08 ` James Houghton
2024-07-19 14:44 ` Wang, Wei W
0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-18 17:08 UTC (permalink / raw)
To: Wang, Wei W
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
Peter Xu
On Wed, Jul 17, 2024 at 8:34 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> > The first prong for enabling KVM Userfault support for x86 is to be able to
> > inform userspace of userfaults. We know a userfault has occurred when
> > fault->pfn comes back as KVM_PFN_ERR_FAULT, so in
> > kvm_mmu_prepare_memory_fault_exit(), simply check if fault->pfn is indeed
> > KVM_PFN_ERR_FAULT. This means always setting fault->pfn to a known value (I
> > have chosen KVM_PFN_ERR_FAULT) before calling
> > kvm_mmu_prepare_memory_fault_exit().
> >
> > The next prong is to unmap pages that are newly userfault-enabled. Do this in
> > kvm_arch_pre_set_memory_attributes().
>
> Why is there a need to unmap it?
> I think a userfault is triggered on a page during postcopy when its data has not yet
> been fetched from the source, that is, the page is never accessed by guest on the
> destination and the page table leaf entry is empty.
>
You're right that it's not strictly necessary for implementing
post-copy. This just comes down to the UAPI we want: does
ATTRIBUTE_USERFAULT mean "KVM will be unable to access this memory;
any attempt to access it will generate a userfault" or does it mean
"accesses to never-accessed, non-prefaulted memory will generate a
userfault."
I think the former (i.e., the one implemented in this RFC) is slightly
clearer and slightly more useful.
Userfaultfd does the latter:
1. MAP_PRIVATE|MAP_ANONYMOUS + UFFDIO_REGISTER_MODE_MISSING: if
nothing is mapped (i.e., major page fault)
2. non-anonymous VMA + UFFDIO_REGISTER_MODE_MISSING: if the page cache
does not contain a page
3. MAP_SHARED + UFFDIO_REGISTER_MODE_MINOR: if the page cache
*contains* a page, but we got a fault anyway
But in all of these cases, we have a way to start getting userfaults
for already-accessed memory: for (1) and (3), MADV_DONTNEED, and for
(2), fallocate(FALLOC_FL_PUNCH_HOLE).
Even if we didn't have MADV_DONTNEED (as used to be the case with
HugeTLB), we can use PROT_NONE to prevent anyone from mapping anything
in between an mmap() and a UFFDIO_REGISTER. This has been useful for
me.
With KVM, we have neither of these tools (unless we include them here), AFAIA.
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support
2024-07-18 17:08 ` James Houghton
@ 2024-07-19 14:44 ` Wang, Wei W
0 siblings, 0 replies; 40+ messages in thread
From: Wang, Wei W @ 2024-07-19 14:44 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
Peter Xu
On Friday, July 19, 2024 1:09 AM, James Houghton wrote:
> On Wed, Jul 17, 2024 at 8:34 AM Wang, Wei W <wei.w.wang@intel.com>
> wrote:
> >
> > On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> > > The first prong for enabling KVM Userfault support for x86 is to be
> > > able to inform userspace of userfaults. We know a userfault has
> > > occurred when
> > > fault->pfn comes back as KVM_PFN_ERR_FAULT, so in
> > > kvm_mmu_prepare_memory_fault_exit(), simply check if fault->pfn is
> > > indeed KVM_PFN_ERR_FAULT. This means always setting fault->pfn to a
> > > known value (I have chosen KVM_PFN_ERR_FAULT) before calling
> > > kvm_mmu_prepare_memory_fault_exit().
> > >
> > > The next prong is to unmap pages that are newly userfault-enabled.
> > > Do this in kvm_arch_pre_set_memory_attributes().
> >
> > Why is there a need to unmap it?
> > I think a userfault is triggered on a page during postcopy when its
> > data has not yet been fetched from the source, that is, the page is
> > never accessed by guest on the destination and the page table leaf entry is
> empty.
> >
>
> You're right that it's not strictly necessary for implementing post-copy. This just
> comes down to the UAPI we want: does ATTRIBUTE_USERFAULT mean "KVM
> will be unable to access this memory; any attempt to access it will generate a
> userfault" or does it mean "accesses to never-accessed, non-prefaulted
> memory will generate a userfault."
>
> I think the former (i.e., the one implemented in this RFC) is slightly clearer and
> slightly more useful.
>
> Userfaultfd does the latter:
> 1. MAP_PRIVATE|MAP_ANONYMOUS + UFFDIO_REGISTER_MODE_MISSING: if
> nothing is mapped (i.e., major page fault) 2. non-anonymous VMA +
> UFFDIO_REGISTER_MODE_MISSING: if the page cache does not contain a page
> 3. MAP_SHARED + UFFDIO_REGISTER_MODE_MINOR: if the page cache
> *contains* a page, but we got a fault anyway
>
Yeah, you pointed to a good reference here. I think it's beneficial to have different
"modes" for a feature, so we don't need to force everybody to carry the
unnecessary burden (e.g. the unmap()).
The design currently targets postcopy support, so we could start with what
postcopy needs (similar to UFFDIO_REGISTER_MODE_MISSING) while leaving
space for incremental addition of new modes for new usages. When there is a
clear need for the unmap(), it could be added under a new flag.
> But in all of these cases, we have a way to start getting userfaults for already-
> accessed memory: for (1) and (3), MADV_DONTNEED, and for (2),
> fallocate(FALLOC_FL_PUNCH_HOLE).
Those cases don't seem related to postcopy (i.e., the faults are not brand new, and
they need to be handled locally). They could be added under a new flag later, when
the related usage is clear.
>
> Even if we didn't have MADV_DONTNEED (as used to be the case with
> HugeTLB), we can use PROT_NONE to prevent anyone from mapping anything
> in between an mmap() and a UFFDIO_REGISTER. This has been useful for me.
Same for this case as well, plus gmem doesn't support mmap() at present.
>
> With KVM, we have neither of these tools (unless we include them here),
> AFAIA.
^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH 09/18] KVM: x86: Add vCPU fault fast-path for Userfault
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (7 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 08/18] KVM: x86: Add KVM Userfault support James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 10/18] KVM: arm64: Add KVM Userfault support James Houghton
` (12 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Without this fast-path, we will take the asynchronous userfault path
every time, which is inefficient.
As implemented today, KVM Userfault isn't well optimized at all, but I'm
providing this optimization because something like this will be required
to significantly improve post-copy performance. Memory fault exits for
userfaultfd were proposed for the same reason[1].
[1]: https://lore.kernel.org/kvm/20240215235405.368539-7-amoorthy@google.com/
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/x86/kvm/mmu/mmu.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6b6a053758ec..f0dbc3c68e5c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4380,6 +4380,13 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);
+ /* Pre-check for userfault and bail out early. */
+ if (gfn_has_userfault(fault->slot->kvm, fault->gfn)) {
+ fault->pfn = KVM_PFN_ERR_USERFAULT;
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(fault->slot, fault->gfn, false, false,
&async, fault->write,
--
2.45.2.993.g49e7a77208-goog
^ permalink raw reply related [flat|nested] 40+ messages in thread

* [RFC PATCH 10/18] KVM: arm64: Add KVM Userfault support
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (8 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 09/18] KVM: x86: Add vCPU fault fast-path for Userfault James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 11/18] KVM: arm64: Add vCPU memory fault fast-path for Userfault James Houghton
` (11 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Support comes in three parts:
1. When KVM Userfault is enabled, only install PAGE_SIZE PTEs. This
prevents us from being able to map a userfault-enabled pfn with a
huge PTE in response to a fault on a non-userfault pfn.
2. When we get KVM_PFN_ERR_USERFAULT from __gfn_to_pfn_memslot, return a
memory fault to userspace.
3. When KVM Userfault is enabled for a particular kvm_gfn_range, unmap
it, so that we can get faults on it.
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/mmu.c | 36 ++++++++++++++++++++++++++++++++++--
2 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..358153d91d58 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -37,6 +37,7 @@ menuconfig KVM
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
select GUEST_PERF_EVENTS if PERF_EVENTS
+ select KVM_USERFAULT
help
Support hosting virtualized guest machines.
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8bcab0cc3fe9..ac283e606516 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1434,7 +1434,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* logging_active is guaranteed to never be true for VM_PFNMAP
* memslots.
*/
- if (logging_active) {
+ if (logging_active || kvm->userfault) {
force_pte = true;
vma_shift = PAGE_SHIFT;
} else {
@@ -1494,8 +1494,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
}
- if (is_error_noslot_pfn(pfn))
+ if (is_error_noslot_pfn(pfn)) {
+ if (pfn == KVM_PFN_ERR_USERFAULT)
+ kvm_prepare_memory_fault_exit(vcpu, gfn << PAGE_SHIFT,
+ PAGE_SIZE, write_fault,
+ /*exec=*/false,
+ /*private=*/false,
+ /*userfault=*/true);
return -EFAULT;
+ }
if (kvm_is_device_pfn(pfn)) {
/*
@@ -2105,3 +2112,28 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
}
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ unsigned long attrs = range->arg.attributes;
+
+ /*
+ * We only need to unmap if we're enabling userfault. Disabling it
+ * does not need an unmap. An unmap to get huge mappings will come
+ * later.
+ */
+ if (attrs & KVM_MEMORY_ATTRIBUTE_USERFAULT)
+ kvm_unmap_gfn_range(kvm, range);
+
+ return false;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ /* Nothing to do! */
+ return false;
+}
+#endif
--
2.45.2.993.g49e7a77208-goog
^ permalink raw reply related [flat|nested] 40+ messages in thread

* [RFC PATCH 11/18] KVM: arm64: Add vCPU memory fault fast-path for Userfault
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (9 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 10/18] KVM: arm64: Add KVM Userfault support James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 12/18] KVM: arm64: Add userfault support for steal-time James Houghton
` (10 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Make this optimization for the same reason we make it for x86: because
it is necessary for sufficient post-copy performance when scaling up to
hundreds of cores (even though KVM Userfault today doesn't scale very
well).
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/arm64/kvm/mmu.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ac283e606516..c84633c9ab98 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1488,6 +1488,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);
+ if (gfn_has_userfault(memslot->kvm, gfn)) {
+ kvm_prepare_memory_fault_exit(vcpu, gfn << PAGE_SHIFT,
+ PAGE_SIZE, write_fault,
+ /*exec=*/false,
+ /*private=*/false,
+ /*userfault=*/true);
+ return -EFAULT;
+ }
+
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
write_fault, &writable, NULL);
if (pfn == KVM_PFN_ERR_HWPOISON) {
--
2.45.2.993.g49e7a77208-goog
^ permalink raw reply related [flat|nested] 40+ messages in thread

* [RFC PATCH 12/18] KVM: arm64: Add userfault support for steal-time
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (10 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 11/18] KVM: arm64: Add vCPU memory fault fast-path for Userfault James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 13/18] KVM: Add atomic parameter to __gfn_to_hva_many James Houghton
` (9 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
As part of KVM_RUN, we may need to write steal-time information to guest
memory. In the case that the gfn we are writing to is userfault-enabled,
we should return to userspace with fault information.
With asynchronous userfaults, this change is not necessary and merely
acts as an optimization.
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/arm64/include/asm/kvm_host.h | 2 +-
arch/arm64/kvm/arm.c | 8 ++++++--
arch/arm64/kvm/pvtime.c | 11 +++++++++--
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 36b8e97bf49e..4c7bd72ba9e8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1166,7 +1166,7 @@ static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
-void kvm_update_stolen_time(struct kvm_vcpu *vcpu);
+int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
bool kvm_arm_pvtime_supported(void);
int kvm_arm_pvtime_set_attr(struct kvm_vcpu *vcpu,
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 59716789fe0f..4c7994e44217 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -974,8 +974,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
*/
kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
- if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
- kvm_update_stolen_time(vcpu);
+ if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu)) {
+ int ret = kvm_update_stolen_time(vcpu);
+
+ if (ret <= 0)
+ return ret;
+ }
if (kvm_check_request(KVM_REQ_RELOAD_GICv4, vcpu)) {
/* The distributor enable bits were changed */
diff --git a/arch/arm64/kvm/pvtime.c b/arch/arm64/kvm/pvtime.c
index 4ceabaa4c30b..ba0164726310 100644
--- a/arch/arm64/kvm/pvtime.c
+++ b/arch/arm64/kvm/pvtime.c
@@ -10,7 +10,7 @@
#include <kvm/arm_hypercalls.h>
-void kvm_update_stolen_time(struct kvm_vcpu *vcpu)
+int kvm_update_stolen_time(struct kvm_vcpu *vcpu)
{
struct kvm *kvm = vcpu->kvm;
u64 base = vcpu->arch.steal.base;
@@ -20,9 +20,14 @@ void kvm_update_stolen_time(struct kvm_vcpu *vcpu)
int idx;
if (base == INVALID_GPA)
- return;
+ return 1;
idx = srcu_read_lock(&kvm->srcu);
+ if (gfn_to_hva(kvm, gpa_to_gfn(base + offset)) == KVM_HVA_ERR_USERFAULT) {
+ kvm_prepare_memory_fault_exit(vcpu, base + offset, PAGE_SIZE,
+ true, false, false, true);
+ return -EFAULT;
+ }
if (!kvm_get_guest(kvm, base + offset, steal)) {
steal = le64_to_cpu(steal);
vcpu->arch.steal.last_steal = READ_ONCE(current->sched_info.run_delay);
@@ -30,6 +35,8 @@ void kvm_update_stolen_time(struct kvm_vcpu *vcpu)
kvm_put_guest(kvm, base + offset, cpu_to_le64(steal));
}
srcu_read_unlock(&kvm->srcu, idx);
+
+ return 1;
}
long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu)
--
2.45.2.993.g49e7a77208-goog
^ permalink raw reply related [flat|nested] 40+ messages in thread

* [RFC PATCH 13/18] KVM: Add atomic parameter to __gfn_to_hva_many
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (11 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 12/18] KVM: arm64: Add userfault support for steal-time James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT James Houghton
` (8 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Atomic translations cannot wait for userspace to resolve a userfault.
Add an atomic parameter for later use with KVM Userfault.
Signed-off-by: James Houghton <jthoughton@google.com>
---
virt/kvm/kvm_main.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 77eb9f0de02d..4ac018cac704 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2679,7 +2679,7 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
}
static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
- gfn_t *nr_pages, bool write)
+ gfn_t *nr_pages, bool write, bool atomic)
{
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
return KVM_HVA_ERR_BAD;
@@ -2699,7 +2699,7 @@ static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t
static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages)
{
- return __gfn_to_hva_many(slot, gfn, nr_pages, true);
+ return __gfn_to_hva_many(slot, gfn, nr_pages, true, false);
}
unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot,
@@ -2732,7 +2732,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_hva);
unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot,
gfn_t gfn, bool *writable)
{
- unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);
+ unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false, false);
if (!kvm_is_error_hva(hva) && writable)
*writable = !memslot_is_readonly(slot);
@@ -3002,7 +3002,8 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool interruptible, bool *async,
bool write_fault, bool *writable, hva_t *hva)
{
- unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+ unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault,
+ atomic);
if (hva)
*hva = addr;
@@ -3074,7 +3075,7 @@ int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
unsigned long addr;
gfn_t entry = 0;
- addr = gfn_to_hva_many(slot, gfn, &entry);
+ addr = __gfn_to_hva_many(slot, gfn, &entry, true, true);
if (kvm_is_error_hva(addr))
return -1;
--
2.45.2.993.g49e7a77208-goog
^ permalink raw reply related [flat|nested] 40+ messages in thread

* [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (12 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 13/18] KVM: Add atomic parameter to __gfn_to_hva_many James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-11 23:52 ` David Matlack
2024-07-26 16:50 ` Nikita Kalyazin
2024-07-10 23:42 ` [RFC PATCH 15/18] KVM: guest_memfd: Add KVM Userfault support James Houghton
` (7 subsequent siblings)
21 siblings, 2 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
It is possible that KVM wants to access a userfault-enabled GFN in a
path where it is difficult to return out to userspace with the fault
information. For these cases, add a mechanism for KVM to wait for a GFN
to not be userfault-enabled.
The mechanism introduced in this patch uses an eventfd to signal that a
userfault is ready to be read. Userspace then reads the userfault with
KVM_READ_USERFAULT. The fault itself is stored in a list, and KVM will
busy-wait for the gfn to not be userfault-enabled.
The implementation of this mechanism is certain to change before KVM
Userfault could possibly be merged. The main open questions are whether
this kind of asynchronous userfault system is required at all and
whether the UAPI for reading faults is the right one.
Signed-off-by: James Houghton <jthoughton@google.com>
---
include/linux/kvm_host.h | 7 +++
include/uapi/linux/kvm.h | 7 +++
virt/kvm/kvm_main.c | 92 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 104 insertions(+), 2 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dc12d0a5498b..3b9780d85877 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -734,8 +734,15 @@ struct kvm_memslots {
int node_idx;
};
+struct kvm_userfault_list_entry {
+ struct list_head list;
+ gfn_t gfn;
+};
+
struct kvm_userfault_ctx {
struct eventfd_ctx *ev_fd;
+ spinlock_t list_lock;
+ struct list_head gfn_list;
};
struct kvm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6aa99b4587c6..8cd8e08f11e1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1554,4 +1554,11 @@ struct kvm_create_guest_memfd {
#define KVM_USERFAULT_ENABLE (1ULL << 0)
#define KVM_USERFAULT_DISABLE (1ULL << 1)
+struct kvm_fault {
+ __u64 address;
+ /* TODO: reserved fields */
+};
+
+#define KVM_READ_USERFAULT _IOR(KVMIO, 0xd5, struct kvm_fault)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4ac018cac704..d2ca16ddcaa1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2678,6 +2678,43 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
return slot->flags & KVM_MEM_READONLY;
}
+static int read_userfault(struct kvm_userfault_ctx *ctx, gfn_t *gfn)
+{
+ struct kvm_userfault_list_entry *entry;
+
+ spin_lock(&ctx->list_lock);
+
+ entry = list_first_entry_or_null(&ctx->gfn_list,
+ struct kvm_userfault_list_entry,
+ list);
+ if (entry)
+ list_del(&entry->list);
+
+ spin_unlock(&ctx->list_lock);
+
+ if (!entry)
+ return -ENOENT;
+
+ *gfn = entry->gfn;
+ return 0;
+}
+
+static void signal_userfault(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_userfault_ctx *ctx =
+ srcu_dereference(kvm->userfault_ctx, &kvm->srcu);
+ struct kvm_userfault_list_entry entry;
+
+ entry.gfn = gfn;
+ INIT_LIST_HEAD(&entry.list);
+
+ spin_lock(&ctx->list_lock);
+ list_add(&entry.list, &ctx->gfn_list);
+ spin_unlock(&ctx->list_lock);
+
+ eventfd_signal(ctx->ev_fd);
+}
+
static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write, bool atomic)
{
@@ -2687,8 +2724,14 @@ static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t
if (memslot_is_readonly(slot) && write)
return KVM_HVA_ERR_RO_BAD;
- if (gfn_has_userfault(slot->kvm, gfn))
- return KVM_HVA_ERR_USERFAULT;
+ if (gfn_has_userfault(slot->kvm, gfn)) {
+ if (atomic)
+ return KVM_HVA_ERR_USERFAULT;
+ signal_userfault(slot->kvm, gfn);
+ while (gfn_has_userfault(slot->kvm, gfn))
+ /* TODO: don't busy-wait */
+ cpu_relax();
+ }
if (nr_pages)
*nr_pages = slot->npages - (gfn - slot->base_gfn);
@@ -5009,6 +5052,10 @@ static int kvm_enable_userfault(struct kvm *kvm, int event_fd)
}
ret = 0;
+
+ INIT_LIST_HEAD(&userfault_ctx->gfn_list);
+ spin_lock_init(&userfault_ctx->list_lock);
+
userfault_ctx->ev_fd = ev_fd;
rcu_assign_pointer(kvm->userfault_ctx, userfault_ctx);
@@ -5037,6 +5084,27 @@ static int kvm_vm_ioctl_enable_userfault(struct kvm *kvm, int options,
else
return kvm_disable_userfault(kvm);
}
+
+static int kvm_vm_ioctl_read_userfault(struct kvm *kvm, gfn_t *gfn)
+{
+ int ret;
+ int idx;
+ struct kvm_userfault_ctx *ctx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ ctx = srcu_dereference(kvm->userfault_ctx, &kvm->srcu);
+
+ ret = -ENOENT;
+ if (!ctx)
+ goto out;
+
+ ret = read_userfault(ctx, gfn);
+
+out:
+ srcu_read_unlock(&kvm->srcu, idx);
+ return ret;
+}
#endif
static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
@@ -5403,6 +5471,26 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_gmem_create(kvm, &guest_memfd);
break;
}
+#endif
+#ifdef CONFIG_KVM_USERFAULT
+ case KVM_READ_USERFAULT: {
+ struct kvm_fault fault;
+ gfn_t gfn;
+
+ r = kvm_vm_ioctl_read_userfault(kvm, &gfn);
+ if (r)
+ goto out;
+
+ fault.address = gfn;
+
+ /* TODO: if this fails, this gfn is lost. */
+ r = -EFAULT;
+ if (copy_to_user(argp, &fault, sizeof(fault)))
+ goto out;
+
+ r = 0;
+ break;
+ }
#endif
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
--
2.45.2.993.g49e7a77208-goog
^ permalink raw reply related [flat|nested] 40+ messages in thread

* Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-10 23:42 ` [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT James Houghton
@ 2024-07-11 23:52 ` David Matlack
2024-07-26 16:50 ` Nikita Kalyazin
1 sibling, 0 replies; 40+ messages in thread
From: David Matlack @ 2024-07-11 23:52 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Peter Xu, Axel Rasmussen, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
>
> + case KVM_READ_USERFAULT: {
> + struct kvm_fault fault;
> + gfn_t gfn;
> +
> + r = kvm_vm_ioctl_read_userfault(kvm, &gfn);
> + if (r)
> + goto out;
> +
> + fault.address = gfn;
> +
> + /* TODO: if this fails, this gfn is lost. */
> + r = -EFAULT;
> + if (copy_to_user(&fault, argp, sizeof(fault)))
You could do the copy under the spin_lock() with
copy_to_user_nofault() to avoid losing the gfn.
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-10 23:42 ` [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT James Houghton
2024-07-11 23:52 ` David Matlack
@ 2024-07-26 16:50 ` Nikita Kalyazin
2024-07-26 18:00 ` James Houghton
1 sibling, 1 reply; 40+ messages in thread
From: Nikita Kalyazin @ 2024-07-26 16:50 UTC (permalink / raw)
To: James Houghton
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, roypat, kalyazin, Paolo Bonzini
Hi James,
On 11/07/2024 00:42, James Houghton wrote:
> It is possible that KVM wants to access a userfault-enabled GFN in a
> path where it is difficult to return out to userspace with the fault
> information. For these cases, add a mechanism for KVM to wait for a GFN
> to not be userfault-enabled.
In this patch series, an asynchronous notification mechanism is used
only in cases "where it is difficult to return out to userspace with the
fault information". However, we (AWS) have a use case where we would
like to be notified asynchronously about _all_ faults. Firecracker can
restore a VM from a memory snapshot where the guest memory is supplied
via a Userfaultfd by a process separate from the VMM itself [1]. While
it looks technically possible for the VMM process to handle exits via
forwarding the faults to the other process, that would require building
a complex userspace protocol on top and likely introduce extra latency
on the critical path. This also implies that a KVM API
(KVM_READ_USERFAULT) is not suitable, because KVM checks that the ioctls
are performed specifically by the VMM process [2]:
if (kvm->mm != current->mm || kvm->vm_dead)
return -EIO;
> The implementation of this mechanism is certain to change before KVM
> Userfault could possibly be merged.
How do you envision resolving faults in userspace? Copying the page in
(provided that userspace mapping of guest_memfd is supported [3]) and
clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
sufficient to resolve the fault because an attempt to copy the page
directly in userspace will trigger a fault on its own and may lead to a
deadlock in the case where the original fault was caused by the VMM. An
interface similar to UFFDIO_COPY is needed that would allocate a page,
copy the content in and update page tables.
[1] Firecracker snapshot restore via UserfaultFD:
https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
[2] KVM ioctl check for the address space:
https://elixir.bootlin.com/linux/v6.10.1/source/virt/kvm/kvm_main.c#L5083
[3] mmap() of guest_memfd:
https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
Thanks,
Nikita
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-26 16:50 ` Nikita Kalyazin
@ 2024-07-26 18:00 ` James Houghton
2024-07-29 17:17 ` Nikita Kalyazin
0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-26 18:00 UTC (permalink / raw)
To: kalyazin
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, roypat, Paolo Bonzini
On Fri, Jul 26, 2024 at 9:50 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> Hi James,
>
> On 11/07/2024 00:42, James Houghton wrote:
> > It is possible that KVM wants to access a userfault-enabled GFN in a
> > path where it is difficult to return out to userspace with the fault
> > information. For these cases, add a mechanism for KVM to wait for a GFN
> > to not be userfault-enabled.
> In this patch series, an asynchronous notification mechanism is used
> only in cases "where it is difficult to return out to userspace with the
> fault information". However, we (AWS) have a use case where we would
> like to be notified asynchronously about _all_ faults. Firecracker can
> restore a VM from a memory snapshot where the guest memory is supplied
> via a Userfaultfd by a process separate from the VMM itself [1]. While
> it looks technically possible for the VMM process to handle exits via
> forwarding the faults to the other process, that would require building
> a complex userspace protocol on top and likely introduce extra latency
> on the critical path.
> This also implies that a KVM API
> (KVM_READ_USERFAULT) is not suitable, because KVM checks that the ioctls
> are performed specifically by the VMM process [2]:
> if (kvm->mm != current->mm || kvm->vm_dead)
> return -EIO;
If it would be useful, we could absolutely have a flag to have all
faults go through the asynchronous mechanism. :) It's meant to just be
an optimization. For me, it is a necessary optimization.
Userfaultfd doesn't scale particularly well: we have to grab two locks
to work with the wait_queues. You could create several userfaultfds,
but the underlying issue is still there. KVM Userfault, if it uses a
wait_queue for the async fault mechanism, will have the same
bottleneck. Anish and I worked on making userfaults more scalable for
KVM[1], and we ended up with a scheme very similar to what we have in
this KVM Userfault series.
My use case already requires using a reasonably complex API for
interacting with a separate userland process for fetching memory, and
it's really fast. I've never tried to hook userfaultfd into this other
process, but I'm quite certain that [1] + this process's interface
scale better than userfaultfd does. Perhaps userfaultfd, for
not-so-scaled-up cases, could be *slightly* faster, but I mostly care
about what happens when we scale to hundreds of vCPUs.
[1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/
>
> > The implementation of this mechanism is certain to change before KVM
> > Userfault could possibly be merged.
> How do you envision resolving faults in userspace? Copying the page in
> (provided that userspace mapping of guest_memfd is supported [3]) and
> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
> sufficient to resolve the fault because an attempt to copy the page
> directly in userspace will trigger a fault on its own
This is not true for KVM Userfault, at least for right now. Userspace
accesses to guest memory will not trigger KVM Userfaults. (I know this
name is terrible -- regular old userfaultfd() userfaults will indeed
get triggered, provided you've set things up properly.)
KVM Userfault is merely meant to catch KVM's own accesses to guest
memory (including vCPU accesses). For non-guest_memfd memslots,
userspace can totally just write through the VMA it has made (KVM
Userfault *cannot*, by virtue of being completely divorced from mm,
intercept this access). For guest_memfd, userspace could write to
guest memory through a VMA if that's where guest_memfd is headed, but
the exact mechanism will depend on how userspace is meant to
populate guest_memfd memory.
You're totally right that, in essence, we will need some kind of
non-faulting way to interact with guest memory. With traditional
memslots and VMAs, we have that already; for guest_memfd memslots and
VMAs, I think we will have that eventually.
> and may lead to a
> deadlock in the case where the original fault was caused by the VMM. An
> interface similar to UFFDIO_COPY is needed that would allocate a page,
> copy the content in and update page tables.
In case it's interesting or useful at all, we actually use
UFFDIO_CONTINUE for our live migration use case. We mmap() memory
twice -- one of them we register with userfaultfd and also give to
KVM. The other one we use to install memory -- our non-faulting view
of guest memory!
>
> [1] Firecracker snapshot restore via UserfaultFD:
> https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
> [2] KVM ioctl check for the address space:
> https://elixir.bootlin.com/linux/v6.10.1/source/virt/kvm/kvm_main.c#L5083
> [3] mmap() of guest_memfd:
> https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
>
> Thanks,
> Nikita
Thanks for the feedback!
* Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-26 18:00 ` James Houghton
@ 2024-07-29 17:17 ` Nikita Kalyazin
2024-07-29 21:09 ` James Houghton
0 siblings, 1 reply; 40+ messages in thread
From: Nikita Kalyazin @ 2024-07-29 17:17 UTC (permalink / raw)
To: James Houghton
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, roypat, Paolo Bonzini, kalyazin
On 26/07/2024 19:00, James Houghton wrote:
> If it would be useful, we could absolutely have a flag to have all
> faults go through the asynchronous mechanism. :) It's meant to just be
> an optimization. For me, it is a necessary optimization.
>
> Userfaultfd doesn't scale particularly well: we have to grab two locks
> to work with the wait_queues. You could create several userfaultfds,
> but the underlying issue is still there. KVM Userfault, if it uses a
> wait_queue for the async fault mechanism, will have the same
> bottleneck. Anish and I worked on making userfaults more scalable for
> KVM[1], and we ended up with a scheme very similar to what we have in
> this KVM Userfault series.
Yes, I see your motivation. Does this approach support async pagefaults
[1]? I.e. would all the guest processes on the vCPU need to stall until a
fault is resolved or is there a way to let the vCPU run and only block
the faulted process?
A more general question is, it looks like Userfaultfd's main purpose was
to support the postcopy use case [2], yet it fails to do that
efficiently for large VMs. Would it be ideologically better to try to
improve Userfaultfd's performance (similar to how it was attempted in
[3]) or is that something you have already looked into and reached a
dead end as a part of [4]?
[1] https://lore.kernel.org/lkml/4AEFB823.4040607@redhat.com/T/
[2] https://lwn.net/Articles/636226/
[3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@redhat.com/
[4]
https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> My use case already requires using a reasonably complex API for
> interacting with a separate userland process for fetching memory, and
> it's really fast. I've never tried to hook userfaultfd into this other
> process, but I'm quite certain that [1] + this process's interface
> scale better than userfaultfd does. Perhaps userfaultfd, for
> not-so-scaled-up cases, could be *slightly* faster, but I mostly care
> about what happens when we scale to hundreds of vCPUs.
>
> [1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/
Do I understand it right that in your setup, when an EPT violation occurs,
- VMM shares the fault information with the other process via a
userspace protocol
- the process fetches the memory, installs it (?) and notifies VMM
- VMM calls KVM run to resume execution
?
Would you be ok to share an outline of the API you mentioned?
>> How do you envision resolving faults in userspace? Copying the page in
>> (provided that userspace mapping of guest_memfd is supported [3]) and
>> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
>> sufficient to resolve the fault because an attempt to copy the page
>> directly in userspace will trigger a fault on its own
>
> This is not true for KVM Userfault, at least for right now. Userspace
> accesses to guest memory will not trigger KVM Userfaults. (I know this
> name is terrible -- regular old userfaultfd() userfaults will indeed
> get triggered, provided you've set things up properly.)
>
> KVM Userfault is merely meant to catch KVM's own accesses to guest
> memory (including vCPU accesses). For non-guest_memfd memslots,
> userspace can totally just write through the VMA it has made (KVM
> Userfault *cannot*, by virtue of being completely divorced from mm,
> intercept this access). For guest_memfd, userspace could write to
> guest memory through a VMA if that's where guest_memfd is headed, but
> the exact mechanism will depend on how userspace is meant to
> populate guest_memfd memory.
True, it isn't the case right now. I think I fast-forwarded to a state
where notifications about VMM-triggered faults to the guest_memfd are
also sent asynchronously.
> In case it's interesting or useful at all, we actually use
> UFFDIO_CONTINUE for our live migration use case. We mmap() memory
> twice -- one of them we register with userfaultfd and also give to
> KVM. The other one we use to install memory -- our non-faulting view
> of guest memory!
That is interesting. You're replacing UFFDIO_COPY (vma1) with a memcpy
(vma2) + UFFDIO_CONTINUE (vma1), IIUC. Are both mappings created by the
same process? What benefits does it bring?
* Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-29 17:17 ` Nikita Kalyazin
@ 2024-07-29 21:09 ` James Houghton
2024-08-01 22:22 ` Peter Xu
0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-29 21:09 UTC (permalink / raw)
To: kalyazin
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Axel Rasmussen,
David Matlack, kvm, linux-doc, linux-kernel, linux-arm-kernel,
kvmarm, roypat, Paolo Bonzini, Peter Xu
On Mon, Jul 29, 2024 at 10:17 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> On 26/07/2024 19:00, James Houghton wrote:
> > If it would be useful, we could absolutely have a flag to have all
> > faults go through the asynchronous mechanism. :) It's meant to just be
> > an optimization. For me, it is a necessary optimization.
> >
> > Userfaultfd doesn't scale particularly well: we have to grab two locks
> > to work with the wait_queues. You could create several userfaultfds,
> > but the underlying issue is still there. KVM Userfault, if it uses a
> > wait_queue for the async fault mechanism, will have the same
> > bottleneck. Anish and I worked on making userfaults more scalable for
> > KVM[1], and we ended up with a scheme very similar to what we have in
> > this KVM Userfault series.
> Yes, I see your motivation. Does this approach support async pagefaults
> [1]? I.e. would all the guest processes on the vCPU need to stall until a
> fault is resolved or is there a way to let the vCPU run and only block
> the faulted process?
As implemented, it didn't hook into the async page faults stuff. I
think it's technically possible to do that, but we didn't explore it.
> A more general question is, it looks like Userfaultfd's main purpose was
> to support the postcopy use case [2], yet it fails to do that
> efficiently for large VMs. Would it be ideologically better to try to
> improve Userfaultfd's performance (similar to how it was attempted in
> [3]) or is that something you have already looked into and reached a
> dead end as a part of [4]?
My end goal with [4] was to take contention out of the vCPU +
userfault path completely (so, if we are taking a lock exclusively, we
are the only one taking it). I came to the conclusion that the way to
do this that made the most sense was Anish's memory fault exits idea.
I think it's possible to make userfaults scale better themselves, but
it's much more challenging than the memory fault exits approach for
KVM (and I don't have a good way to do it in mind).
> [1] https://lore.kernel.org/lkml/4AEFB823.4040607@redhat.com/T/
> [2] https://lwn.net/Articles/636226/
> [3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@redhat.com/
> [4]
> https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
>
> > My use case already requires using a reasonably complex API for
> > interacting with a separate userland process for fetching memory, and
> > it's really fast. I've never tried to hook userfaultfd into this other
> > process, but I'm quite certain that [1] + this process's interface
> > scale better than userfaultfd does. Perhaps userfaultfd, for
> > not-so-scaled-up cases, could be *slightly* faster, but I mostly care
> > about what happens when we scale to hundreds of vCPUs.
> >
> > [1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/
> Do I understand it right that in your setup, when an EPT violation occurs,
> - VMM shares the fault information with the other process via a
> userspace protocol
> - the process fetches the memory, installs it (?) and notifies VMM
> - VMM calls KVM run to resume execution
> ?
That's right.
> Would you be ok to share an outline of the API you mentioned?
I can share some information. The source (remote) and target (local)
VMMs register guest memory (shared memory) with this network worker
process. On the target during post-copy, the gfn of a fault is
converted into its corresponding local and remote offsets. The API for
then fetching the memory is basically something like
CopyFromRemote(remote_offset, local_offset, length), and the
communication with the process to handle this command is done just
with shared memory. After memory is copied, the faulting thread does a
UFFDIO_CONTINUE (with MODE_DONTWAKE) to map the page, and then we
KVM_RUN to resume. This will make more sense with the description of
UFFDIO_CONTINUE below.
Let me know if you'd like to know more, though I'm not intimately
familiar with all the details of this network worker process.
> >> How do you envision resolving faults in userspace? Copying the page in
> >> (provided that userspace mapping of guest_memfd is supported [3]) and
> >> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
> >> sufficient to resolve the fault because an attempt to copy the page
> >> directly in userspace will trigger a fault on its own
> >
> > This is not true for KVM Userfault, at least for right now. Userspace
> > accesses to guest memory will not trigger KVM Userfaults. (I know this
> > name is terrible -- regular old userfaultfd() userfaults will indeed
> > get triggered, provided you've set things up properly.)
> >
> > KVM Userfault is merely meant to catch KVM's own accesses to guest
> > memory (including vCPU accesses). For non-guest_memfd memslots,
> > userspace can totally just write through the VMA it has made (KVM
> > Userfault *cannot*, by virtue of being completely divorced from mm,
> > intercept this access). For guest_memfd, userspace could write to
> > guest memory through a VMA if that's where guest_memfd is headed, but
> > the exact mechanism will depend on how userspace is meant to
> > populate guest_memfd memory.
> True, it isn't the case right now. I think I fast-forwarded to a state
> where notifications about VMM-triggered faults to the guest_memfd are
> also sent asynchronously.
>
> > In case it's interesting or useful at all, we actually use
> > UFFDIO_CONTINUE for our live migration use case. We mmap() memory
> > twice -- one of them we register with userfaultfd and also give to
> > KVM. The other one we use to install memory -- our non-faulting view
> > of guest memory!
> That is interesting. You're replacing UFFDIO_COPY (vma1) with a memcpy
> (vma2) + UFFDIO_CONTINUE (vma1), IIUC. Are both mappings created by the
> same process? What benefits does it bring?
The cover letter for the patch series where UFFDIO_CONTINUE was
introduced does a good job at explaining why it's useful for live
migration[5]. But I can summarize it here: when doing pre-copy, we
send many copies of memory to the target. Upon resuming on the target,
we want to get faults on the pages with stale content. It may take a
while to send the final dirty bitmap to the target, and we don't want
to leave the VM paused for that long (i.e., treat everything as
stale). When the dirty bitmap arrives, we want to be able to quickly
(like, without having to copy anything) say "stop getting faults on
these pages, they are in fact clean." Using shared memory (i.e.,
having a page cache) with UFFDIO_CONTINUE (well, really
UFFD_FEATURE_MINOR*) allows us to do this.
It also turns out that it is basically necessary if we want our
network API of choice to be able to directly write into guest memory.
[5]: https://lore.kernel.org/linux-mm/20210225002658.2021807-1-axelrasmussen@google.com/
* Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
2024-07-29 21:09 ` James Houghton
@ 2024-08-01 22:22 ` Peter Xu
0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2024-08-01 22:22 UTC (permalink / raw)
To: James Houghton
Cc: kalyazin, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, roypat, Paolo Bonzini
On Mon, Jul 29, 2024 at 02:09:16PM -0700, James Houghton wrote:
> > A more general question is, it looks like Userfaultfd's main purpose was
> > to support the postcopy use case [2], yet it fails to do that
> > efficiently for large VMs. Would it be ideologically better to try to
> > improve Userfaultfd's performance (similar to how it was attempted in
> > [3]) or is that something you have already looked into and reached a
> > dead end as a part of [4]?
>
> My end goal with [4] was to take contention out of the vCPU +
> userfault path completely (so, if we are taking a lock exclusively, we
> are the only one taking it). I came to the conclusion that the way to
> do this that made the most sense was Anish's memory fault exits idea.
> I think it's possible to make userfaults scale better themselves, but
> it's much more challenging than the memory fault exits approach for
> KVM (and I don't have a good way to do it in mind).
>
> > [1] https://lore.kernel.org/lkml/4AEFB823.4040607@redhat.com/T/
> > [2] https://lwn.net/Articles/636226/
> > [3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@redhat.com/
> > [4]
> > https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
Thanks for the link here on [3]. Just to mention, I still remember I had
more thoughts at the time on userfault-generic optimizations on top of
that one, like using more than one queue rather than a single queue.
Maybe that could also help, maybe not.
Even with that I think it'll be less scalable than vcpu exits for
sure.. but still, I am not yet convinced that this kind of speed is
strictly necessary, because the postcopy overhead should be the page
movements, IMHO. Maybe there's a scalability issue in the locks with
userfault right now, but maybe that's fixable?
I'm not sure whether I'm right, but IMHO the perf here isn't the
critical part. Rather, IMHO it's that guest_memfd is not aligned with
how userfault is defined (which expects a mapping first, absent an
fd-based extension), so I think it can indeed make sense, or say, there
is a choice, to implement that in KVM if that's easier. So maybe things
other than the perf point matter more here.
Thanks,
--
Peter Xu
* [RFC PATCH 15/18] KVM: guest_memfd: Add KVM Userfault support
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (13 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 16/18] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION James Houghton
` (6 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
We now have to pass our struct kvm into __kvm_gmem_get_pfn to know if a
gfn is userfault-enabled or not.
For faults on userfault-enabled gfns, indicate this to the caller by
setting *pfn to KVM_PFN_ERR_USERFAULT. Architectures may use this to
know to return a userfault to userspace, though they should be careful
to set a value for *pfn before calling (e.g. KVM_PFN_ERR_FAULT).
While we're at it, set *pfn to KVM_PFN_ERR_HWPOISON for accesses to
poisoned gfns.
Signed-off-by: James Houghton <jthoughton@google.com>
---
virt/kvm/guest_memfd.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9148b9679bb1..ba7a981e3396 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -542,8 +542,9 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
fput(file);
}
-static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
+static int __kvm_gmem_get_pfn(struct kvm *kvm, struct file *file,
+ struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t *pfn,
+ int *max_order, bool prepare)
{
pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
struct kvm_gmem *gmem = file->private_data;
@@ -551,6 +552,11 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
struct page *page;
int r;
+ if (gfn_has_userfault(kvm, gfn)) {
+ *pfn = KVM_PFN_ERR_USERFAULT;
+ return -EFAULT;
+ }
+
if (file != slot->gmem.file) {
WARN_ON_ONCE(slot->gmem.file);
return -EFAULT;
@@ -567,6 +573,7 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
return PTR_ERR(folio);
if (folio_test_hwpoison(folio)) {
+ *pfn = KVM_PFN_ERR_HWPOISON;
folio_unlock(folio);
folio_put(folio);
return -EHWPOISON;
@@ -594,7 +601,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!file)
return -EFAULT;
- r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, true);
+ r = __kvm_gmem_get_pfn(kvm, file, slot, gfn, pfn, max_order, true);
fput(file);
return r;
}
@@ -634,7 +641,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
break;
}
- ret = __kvm_gmem_get_pfn(file, slot, gfn, &pfn, &max_order, false);
+ ret = __kvm_gmem_get_pfn(kvm, file, slot, gfn, &pfn,
+ &max_order, false);
if (ret)
break;
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 16/18] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (14 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 15/18] KVM: guest_memfd: Add KVM Userfault support James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 17/18] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
` (5 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Advertise support for KVM_CAP_USERFAULT when CONFIG_KVM_USERFAULT is
enabled.
Signed-off-by: James Houghton <jthoughton@google.com>
---
virt/kvm/kvm_main.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d2ca16ddcaa1..90ce6b8ff0ab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4916,6 +4916,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#ifdef CONFIG_KVM_PRIVATE_MEM
case KVM_CAP_GUEST_MEMFD:
return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_KVM_USERFAULT
+ case KVM_CAP_USERFAULT:
+ return 1;
#endif
default:
break;
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 17/18] KVM: selftests: Add KVM Userfault mode to demand_paging_test
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (15 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 16/18] KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:42 ` [RFC PATCH 18/18] KVM: selftests: Remove restriction in vm_set_memory_attributes James Houghton
` (4 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
The KVM Userfault mode checks that we are able to resolve KVM Userfaults
and the vCPUs will continue to make progress. It initially sets all of
guest memory as KVM_MEMORY_ATTRIBUTE_USERFAULT, then, as the test runs,
clears the attribute from pages as they are faulted on.
This test does not currently check for asynchronous page faults.
Signed-off-by: James Houghton <jthoughton@google.com>
---
.../selftests/kvm/demand_paging_test.c | 46 ++++++++++++++++++-
1 file changed, 44 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 0202b78f8680..8654b58091b2 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -28,6 +28,13 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
static size_t demand_paging_size;
static char *guest_data_prototype;
+bool userfault_enabled = false;
+
+static void resolve_kvm_userfault(u64 gpa, u64 size)
+{
+ /* Toggle KVM_MEMORY_ATTRIBUTE_USERFAULT off */
+ vm_set_memory_attributes(memstress_args.vm, gpa, size, 0);
+}
static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
{
@@ -41,8 +48,22 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
clock_gettime(CLOCK_MONOTONIC, &start);
/* Let the guest access its memory */
+restart:
ret = _vcpu_run(vcpu);
- TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+ if (ret < 0 && errno == EFAULT && userfault_enabled) {
+ /* Check for userfault. */
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_MEMORY_FAULT,
+ "Got invalid exit reason: %llx", run->exit_reason);
+ TEST_ASSERT(run->memory_fault.flags ==
+ KVM_MEMORY_EXIT_FLAG_USERFAULT,
+ "Got invalid memory fault exit: %llx",
+ run->memory_fault.flags);
+ resolve_kvm_userfault(run->memory_fault.gpa,
+ run->memory_fault.size);
+ goto restart;
+ } else
+ TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+
if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
TEST_ASSERT(false,
"Invalid guest sync status: exit_reason=%s",
@@ -136,6 +157,7 @@ struct test_params {
int readers_per_uffd;
enum vm_mem_backing_src_type src_type;
bool partition_vcpu_memory_access;
+ bool kvm_userfault;
};
static void prefault_mem(void *alias, uint64_t len)
@@ -206,6 +228,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
}
}
+ if (p->kvm_userfault) {
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_USERFAULT));
+ vm_enable_cap(vm, KVM_CAP_USERFAULT, KVM_USERFAULT_ENABLE);
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES) &
+ KVM_MEMORY_ATTRIBUTE_USERFAULT);
+ vm_set_memory_attributes(vm, memstress_args.gpa,
+ memstress_args.size,
+ KVM_MEMORY_ATTRIBUTE_USERFAULT);
+ userfault_enabled = true;
+ }
+
pr_info("Finished creating vCPUs and starting uffd threads\n");
clock_gettime(CLOCK_MONOTONIC, &start);
@@ -232,6 +265,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
pr_info("Overall demand paging rate:\t%f pgs/sec\n",
vcpu_paging_rate * nr_vcpus);
+ if (p->kvm_userfault) {
+ vm_enable_cap(vm, KVM_CAP_USERFAULT, KVM_USERFAULT_DISABLE);
+ userfault_enabled = false;
+ }
+
memstress_destroy_vm(vm);
free(guest_data_prototype);
@@ -263,6 +301,7 @@ static void help(char *name)
printf(" -v: specify the number of vCPUs to run.\n");
printf(" -o: Overlap guest memory accesses instead of partitioning\n"
" them into a separate region of memory for each vCPU.\n");
+ printf(" -k: Use KVM Userfault\n");
puts("");
exit(0);
}
@@ -281,7 +320,7 @@ int main(int argc, char *argv[])
guest_modes_append_default();
- while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
+ while ((opt = getopt(argc, argv, "ahokm:u:d:b:s:v:c:r:")) != -1) {
switch (opt) {
case 'm':
guest_modes_cmdline(optarg);
@@ -324,6 +363,9 @@ int main(int argc, char *argv[])
"Invalid number of readers per uffd %d: must be >=1",
p.readers_per_uffd);
break;
+ case 'k':
+ p.kvm_userfault = true;
+ break;
case 'h':
default:
help(argv[0]);
--
2.45.2.993.g49e7a77208-goog
* [RFC PATCH 18/18] KVM: selftests: Remove restriction in vm_set_memory_attributes
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (16 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 17/18] KVM: selftests: Add KVM Userfault mode to demand_paging_test James Houghton
@ 2024-07-10 23:42 ` James Houghton
2024-07-10 23:48 ` [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (3 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, James Houghton, kvm, linux-doc,
linux-kernel, linux-arm-kernel, kvmarm
Allow the test to run with a new attribute (USERFAULT). The flows could
very well need changing, but we can at least demonstrate functionality
without further changes.
Signed-off-by: James Houghton <jthoughton@google.com>
---
tools/testing/selftests/kvm/include/kvm_util.h | 7 -------
1 file changed, 7 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 63c2aaae51f3..12876268780c 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -384,13 +384,6 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
.flags = 0,
};
- /*
- * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes. These flows
- * need significant enhancements to support multiple attributes.
- */
- TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
- "Update me to support multiple attributes!");
-
vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
}
--
2.45.2.993.g49e7a77208-goog
* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (17 preceding siblings ...)
2024-07-10 23:42 ` [RFC PATCH 18/18] KVM: selftests: Remove restriction in vm_set_memory_attributes James Houghton
@ 2024-07-10 23:48 ` James Houghton
2024-08-01 22:12 ` Peter Xu
2024-07-11 17:54 ` James Houghton
` (2 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-10 23:48 UTC (permalink / raw)
To: Paolo Bonzini, Peter Xu
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Axel Rasmussen,
David Matlack, kvm, linux-doc, linux-kernel, linux-arm-kernel,
kvmarm
Ah, I put the wrong email for Peter! I'm so sorry!
^ permalink raw reply [flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-10 23:48 ` [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
@ 2024-08-01 22:12 ` Peter Xu
0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2024-08-01 22:12 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm
On Wed, Jul 10, 2024 at 04:48:36PM -0700, James Houghton wrote:
> Ah, I put the wrong email for Peter! I'm so sorry!
So I have a pure (and even stupid) question to ask before the rest of
details.. it's pure question because I know little on guest_memfd,
especially on the future plans.
So.. Is there any chance guest_memfd can in the future provide 1G normal
(!CoCo) pages? If yes, are these pages GUP-able, and mapp-able?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (18 preceding siblings ...)
2024-07-10 23:48 ` [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
@ 2024-07-11 17:54 ` James Houghton
2024-07-11 23:37 ` David Matlack
2024-07-15 15:25 ` Wang, Wei W
21 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-11 17:54 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Axel Rasmussen,
David Matlack, kvm, linux-doc, linux-kernel, linux-arm-kernel,
kvmarm, Peter Xu
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
> Solution: hook into the gfn -> pfn translation
> ==============================================
>
> The only way to implement post-copy with a non-KVM-specific
> userfaultfd-like system would be to introduce the concept of a
> file-userfault[2] to intercept faults on a guest_memfd.
>
> Instead, we take the simpler approach of adding a KVM-specific API, and
> we hook into the GFN -> HVA or GFN -> PFN translation steps (for
> traditional memslots and for guest_memfd respectively).
>
> I have intentionally added support for traditional memslots, as the
> complexity that it adds is minimal, and it is useful for some VMMs, as
> it can be used to fully implement post-copy live migration.
I want to clarify this sentence a little.
Today, because guest_memfd is only accessed by vCPUs (and is only ever
used for guest-private memory), the concept of "asynchronous
userfaults" isn't exactly necessary. However, when guest_memfd
supports shared memory and KVM is itself able to access it,
asynchronous userfaults become useful in the same way that they are
useful for the non-guest_memfd case.
In a world where guest_memfd requires asynchronous userfaults, adding
support for traditional memslots on top of that is quite simple, and
it somewhat simplifies the UAPI.
And for why it is useful for userspace to be able to use KVM Userfault
to implement post-copy live migration, David mentioned this in his
initial RFC[1].
[1]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/#t
^ permalink raw reply [flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (19 preceding siblings ...)
2024-07-11 17:54 ` James Houghton
@ 2024-07-11 23:37 ` David Matlack
2024-07-18 1:59 ` James Houghton
2024-07-15 15:25 ` Wang, Wei W
21 siblings, 1 reply; 40+ messages in thread
From: David Matlack @ 2024-07-11 23:37 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, kvm, linux-doc, linux-kernel, linux-arm-kernel,
kvmarm, Peter Xu
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
>
> --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
>
> The most straightforward way to inform KVM of userfault-enabled pages is
> to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
>
> There is already infrastructure in place for modifying and checking
> memory attributes. Using this interface is slightly challenging, as there
> is no UAPI for setting/clearing particular attributes; we must set the
> exact attributes we want.
The thing we'll want to optimize specifically is clearing
ATTRIBUTE_USERFAULT. During post-copy migration, there will be
potentially hundreds of vCPUs in a single VM concurrently
demand-fetching memory. Clearing ATTRIBUTE_USERFAULT for each page
fetched is on the critical path of getting the vCPU back into
guest-mode.
Clearing ATTRIBUTE_USERFAULT just needs to clear the attribute. It
doesn't need to modify page tables or update any data structures other
than the attribute itself. But the existing UAPI takes both mmu_lock
and slots_lock IIRC.
I'm also concerned that the existing UAPI could lead to userspace
accidentally clearing ATTRIBUTE_USERFAULT when it goes to set
ATTRIBUTE_PRIVATE (or any other potential future attribute). Sure that
could be solved but that means centrally tracking attributes in
userspace and issuing one ioctl per contiguous region of guest memory
with matching attributes. Imagine a scenario where ~every other page
of guest memory has ATTRIBUTE_USERFAULT and then userspace wants to set
a different attribute on a large region of memory. That's going to
take a _lot_ of ioctls.
Having a UAPI to set (attributes |= delta) and clear (attributes &=
~delta) attributes on a range of GFNs would solve both these problems.
>
> The synchronization that is in place for updating memory attributes is
> not suitable for post-copy live migration either, which will require
> updating memory attributes (from userfault to no-userfault) very
> frequently.
There is also the xarray. I imagine that will trigger a lot of dynamic
memory allocations during post-copy, which will slowly increase the total
time a vCPU is paused due to a USERFAULT page.
Is it feasible to convert attributes to a bitmap?
>
> Another potential interface could be to use something akin to a dirty
> bitmap, where a bitmap describes which pages within a memslot (or VM)
> should trigger userfaults. This way, it is straightforward to make
> updates to the userfault status of a page cheap.
Taking a similar approach to dirty logging is attractive for several reasons.
1. The infrastructure to manage per-memslot bitmaps already exists for
dirty logging.
2. It avoids the performance problems with xarrays by using a bitmap.
3. It avoids the performance problems with setting all attributes at once.
However it will require new specific UAPIs to set/clear. And it's
probably possible to optimize attributes to meet our needs, and those
changes will benefit all attributes.
>
> When KVM Userfault is enabled, we need to be careful not to map a
> userfault page in response to a fault on a non-userfault page. In this
> RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.
>
> --- Page fault notifications ---
>
> For page faults generated by vCPUs running in guest mode, if the page
> the vCPU is trying to access is a userfault-enabled page, we use
> KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
>
> For arm64, I believe this is actually all we need, provided we handle
> steal_time properly.
There's steal time, and also the GIC pages. Steal time can use
KVM_EXIT_MEMORY_FAULT, but that requires special casing in the ARM
code. Alternatively, both can use the async mechanism to avoid
special handling in the ARM code.
>
> For x86, where returning from deep within the instruction emulator (or
> other non-trivial execution paths) is infeasible, being able to pause
> execution while userspace fetches the page, just as userfaultfd would
> do, is necessary. Let's call these "asynchronous userfaults."
>
> A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
> userfaults, and an eventfd is used to signal that new faults are
> available for reading.
>
> Today, we busy-wait for a gfn to have userfault disabled. This will
> change in the future.
>
> --- Fault resolution ---
>
> Resolving userfaults today is as simple as removing the USERFAULT memory
> attribute on the faulting gfn. This will change if we do not end up
> using memory attributes for KVM Userfault. Having a range-based wake-up
> like userfaultfd (see UFFDIO_WAKE) might also be helpful for
> performance.
>
> Problems with this series
> =========================
> - This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
> - Memory attribute modification doesn't scale well.
> - We busy-wait for pages to not be userfault-enabled.
Async faults are a slow path so I think a wait queue would suffice.
> - gfn_to_hva and gfn_to_pfn caches are not invalidated.
> - Page tables are not collapsed when KVM Userfault is disabled.
> - There is no self-test for asynchronous userfaults.
> - Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.
Userspace would probably treat this as fatal anyway right?
^ permalink raw reply [flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-11 23:37 ` David Matlack
@ 2024-07-18 1:59 ` James Houghton
0 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-18 1:59 UTC (permalink / raw)
To: David Matlack
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, kvm, linux-doc, linux-kernel, linux-arm-kernel,
kvmarm, Peter Xu
On Thu, Jul 11, 2024 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is
> > to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking
> > memory attributes. Using this interface is slightly challenging, as there
> > is no UAPI for setting/clearing particular attributes; we must set the
> > exact attributes we want.
>
> The thing we'll want to optimize specifically is clearing
> ATTRIBUTE_USERFAULT. During post-copy migration, there will be
> potentially hundreds of vCPUs in a single VM concurrently
> demand-fetching memory. Clearing ATTRIBUTE_USERFAULT for each page
> fetched is on the critical path of getting the vCPU back into
> guest-mode.
>
> Clearing ATTRIBUTE_USERFAULT just needs to clear the attribute. It
> doesn't need to modify page tables or update any data structures other
> than the attribute itself. But the existing UAPI takes both mmu_lock
> and slots_lock IIRC.
>
> I'm also concerned that the existing UAPI could lead to userspace
> accidentally clearing ATTRIBUTE_USERFAULT when it goes to set
> ATTRIBUTE_PRIVATE (or any other potential future attribute). Sure that
> could be solved but that means centrally tracking attributes in
> userspace and issuing one ioctl per contiguous region of guest memory
> with matching attributes. Imagine a scenario where ~every other page
> of guest memory has ATTRIBUTE_USERFAULT and then userspace wants to set
> a different attribute on a large region of memory. That's going to
> take a _lot_ of ioctls.
>
> Having a UAPI to set (attributes |= delta) and clear (attributes &=
> ~delta) attributes on a range of GFNs would solve both these problems.
Hi David, sorry for the delay getting back to you.
I agree with all of these points.
>
> >
> > The synchronization that is in place for updating memory attributes is
> > not suitable for post-copy live migration either, which will require
> > updating memory attributes (from userfault to no-userfault) very
> > frequently.
>
> There is also the xarray. I imagine that will trigger a lot of dynamic
> memory allocations during post-copy, which will slowly increase the total
> time a vCPU is paused due to a USERFAULT page.
>
> Is it feasible to convert attributes to a bitmap?
I don't see any reason why we couldn't convert attributes to be a
bitmap (or to have some attributes be stored in bitmaps and others be
stored in the xarray).
>
> >
> > Another potential interface could be to use something akin to a dirty
> > bitmap, where a bitmap describes which pages within a memslot (or VM)
> > should trigger userfaults. This way, it is straightforward to make
> > updates to the userfault status of a page cheap.
>
> Taking a similar approach to dirty logging is attractive for several reasons.
>
> 1. The infrastructure to manage per-memslot bitmaps already exists for
> dirty logging.
> 2. It avoids the performance problems with xarrays by using a bitmap.
> 3. It avoids the performance problems with setting all attributes at once.
>
> However it will require new specific UAPIs to set/clear. And it's
> probably possible to optimize attributes to meet our needs, and those
> changes will benefit all attributes.
Ok so the three options in my head are:
1. Add an attribute diff UAPI and track the USERFAULT attribute in the xarray.
2. Add an attribute diff UAPI and track the USERFAULT attribute with a bitmap.
3. Add a new UAPI to enable KVM userfaults on gfns according to a
particular bitmap, similar to dirty logging.
(1) is problematic because it is valid to have every page (or, say,
every other page) have ATTRIBUTE_USERFAULT.
(2) seems ok to me.
(3) would be great, but maybe the much more complicated UAPI is not
worth it. (We get the ability to mark many different regions as
USERFAULT in one syscall, and KVM has a lot of code for handling
bitmap arguments.)
I'm hoping others will weigh in here.
> >
> > When KVM Userfault is enabled, we need to be careful not to map a
> > userfault page in response to a fault on a non-userfault page. In this
> > RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page
> > the vCPU is trying to access is a userfault-enabled page, we use
> > KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
> >
> > For arm64, I believe this is actually all we need, provided we handle
> > steal_time properly.
>
> There's steal time, and also the GIC pages. Steal time can use
> KVM_EXIT_MEMORY_FAULT, but that requires special casing in the ARM
> code. Alternatively, both can use the async mechanism to avoid
> special handling in the ARM code.
Oh, of course, I forgot about the GIC. Thanks. And yes, if the async
userfault mechanism is acceptable, using that would be better than
adding the special cases.
>
> >
> > For x86, where returning from deep within the instruction emulator (or
> > other non-trivial execution paths) is infeasible, being able to pause
> > execution while userspace fetches the page, just as userfaultfd would
> > do, is necessary. Let's call these "asynchronous userfaults."
> >
> > A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
> > userfaults, and an eventfd is used to signal that new faults are
> > available for reading.
> >
> > Today, we busy-wait for a gfn to have userfault disabled. This will
> > change in the future.
> >
> > --- Fault resolution ---
> >
> > Resolving userfaults today is as simple as removing the USERFAULT memory
> > attribute on the faulting gfn. This will change if we do not end up
> > using memory attributes for KVM Userfault. Having a range-based wake-up
> > like userfaultfd (see UFFDIO_WAKE) might also be helpful for
> > performance.
> >
> > Problems with this series
> > =========================
> > - This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
> > - Memory attribute modification doesn't scale well.
> > - We busy-wait for pages to not be userfault-enabled.
>
> Async faults are a slow path so I think a wait queue would suffice.
I think a wait queue seems like a good fit too. (It's what userfaultfd uses.)
>
> > - gfn_to_hva and gfn_to_pfn caches are not invalidated.
> > - Page tables are not collapsed when KVM Userfault is disabled.
> > - There is no self-test for asynchronous userfaults.
> > - Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.
>
> Userspace would probably treat this as fatal anyway right?
Yes, but I still think dropping the gfn isn't great. I'll fix this
when I change from using the hacky list-based thing to something more
sophisticated (like a wait_queue).
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-10 23:42 [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd James Houghton
` (20 preceding siblings ...)
2024-07-11 23:37 ` David Matlack
@ 2024-07-15 15:25 ` Wang, Wei W
2024-07-16 17:10 ` James Houghton
21 siblings, 1 reply; 40+ messages in thread
From: Wang, Wei W @ 2024-07-15 15:25 UTC (permalink / raw)
To: James Houghton, Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Shuah Khan, Peter Xu,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev
On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> This patch series implements the KVM-based demand paging system that was
> first introduced back in November[1] by David Matlack.
>
> The working name for this new system is KVM Userfault, but that name is very
> confusing so it will not be the final name.
>
Hi James,
I had implemented a similar approach for TDX post-copy migration, though there
are quite a few differences. Got some questions about your design below.
> Problem: post-copy with guest_memfd
> ===================================
>
> Post-copy live migration makes it possible to migrate VMs from one host to
> another no matter how fast they are writing to memory while keeping the VM
> paused for a minimal amount of time. For post-copy to work, we
> need:
> 1. to be able to prevent KVM from being able to access particular pages
> of guest memory until we have populated it
> 2. for userspace to know when KVM is trying to access a particular page.
> 3. a way to allow the access to proceed.
>
> Traditionally, post-copy live migration is implemented using userfaultfd, which
> hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> PFN translations (with GUP) or when it itself attempts to access guest memory.
> Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
>
> Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> access guest memory will block the same way.
>
> However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> (nor is there an intermediate HVA).
>
> So userfaultfd in its current form cannot be used to support post-copy live
> migration with guest_memfd-backed VMs.
>
> Solution: hook into the gfn -> pfn translation
> ==============================================
>
> The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> system would be to introduce the concept of a file-userfault[2] to intercept
> faults on a guest_memfd.
>
> Instead, we take the simpler approach of adding a KVM-specific API, and we
> hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> memslots and for guest_memfd respectively).
Why taking KVM_EXIT_MEMORY_FAULT faults for the traditional shared
pages (i.e. GFN -> HVA)?
It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only, leaving
shared pages to go through the existing userfaultfd mechanism:
- The need for “asynchronous userfaults,” introduced by patch 14, could be eliminated.
- The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for private page
faults exiting to userspace for postcopy might not be necessary, because all pages on the
destination side are initially “shared,” and the guest’s first access will always cause an
exit to userspace for shared->private conversion. So the VMM is able to leverage the exit to
fetch the page data from the source (the VMM can know whether a page's data has been fetched
from the source or not).
>
> I have intentionally added support for traditional memslots, as the complexity
> that it adds is minimal, and it is useful for some VMMs, as it can be used to
> fully implement post-copy live migration.
>
> Implementation Details
> ======================
>
> Let's break down how KVM implements each of the three core requirements
> for implementing post-copy as laid out above:
>
> --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
>
> The most straightforward way to inform KVM of userfault-enabled pages is to
> use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
>
> There is already infrastructure in place for modifying and checking memory
> attributes. Using this interface is slightly challenging, as there is no UAPI for
> setting/clearing particular attributes; we must set the exact attributes we want.
>
> The synchronization that is in place for updating memory attributes is not
> suitable for post-copy live migration either, which will require updating
> memory attributes (from userfault to no-userfault) very frequently.
>
> Another potential interface could be to use something akin to a dirty bitmap,
> where a bitmap describes which pages within a memslot (or VM) should trigger
> userfaults. This way, it is straightforward to make updates to the userfault
> status of a page cheap.
>
> When KVM Userfault is enabled, we need to be careful not to map a userfault
> page in response to a fault on a non-userfault page. In this RFC, I've taken the
> simplest approach: force new PTEs to be PAGE_SIZE.
>
> --- Page fault notifications ---
>
> For page faults generated by vCPUs running in guest mode, if the page the
> vCPU is trying to access is a userfault-enabled page, we use
Why is it necessary to add the per-page control (with uAPIs for VMM to set/clear)?
Any functional issues if we just have all the page faults exit to userspace during the
post-copy period?
- As also mentioned above, userspace can easily know if a page needs to be
fetched from the source or not, so upon a fault exit to userspace, VMM can
decide to block the faulting vcpu thread or return back to KVM immediately.
- If improvement is really needed (would need profiling first) to reduce number
of exits to userspace, a KVM internal status (bitmap or xarray) seems sufficient.
Each page only needs to exit to userspace once for the purpose of fetching its data
from the source in postcopy. It doesn't seem to need userspace to enable the exit
again for the page (via a new uAPI), right?
^ permalink raw reply [flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-15 15:25 ` Wang, Wei W
@ 2024-07-16 17:10 ` James Houghton
2024-07-17 15:03 ` Wang, Wei W
0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-16 17:10 UTC (permalink / raw)
To: Wang, Wei W
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
Peter Xu
On Mon, Jul 15, 2024 at 8:28 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> > This patch series implements the KVM-based demand paging system that was
> > first introduced back in November[1] by David Matlack.
> >
> > The working name for this new system is KVM Userfault, but that name is very
> > confusing so it will not be the final name.
> >
> Hi James,
> I had implemented a similar approach for TDX post-copy migration, though there
> are quite a few differences. Got some questions about your design below.
Thanks for the feedback!!
>
> > Problem: post-copy with guest_memfd
> > ===================================
> >
> > Post-copy live migration makes it possible to migrate VMs from one host to
> > another no matter how fast they are writing to memory while keeping the VM
> > paused for a minimal amount of time. For post-copy to work, we
> > need:
> > 1. to be able to prevent KVM from being able to access particular pages
> > of guest memory until we have populated it
> > 2. for userspace to know when KVM is trying to access a particular page.
> > 3. a way to allow the access to proceed.
> >
> > Traditionally, post-copy live migration is implemented using userfaultfd, which
> > hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> > PFN translations (with GUP) or when it itself attempts to access guest memory.
> > Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
> >
> > Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> > access guest memory will block the same way.
> >
> > However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> > (nor is there an intermediate HVA).
> >
> > So userfaultfd in its current form cannot be used to support post-copy live
> > migration with guest_memfd-backed VMs.
> >
> > Solution: hook into the gfn -> pfn translation
> > ==============================================
> >
> > The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> > system would be to introduce the concept of a file-userfault[2] to intercept
> > faults on a guest_memfd.
> >
> > Instead, we take the simpler approach of adding a KVM-specific API, and we
> > hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> > memslots and for guest_memfd respectively).
>
>
> Why taking KVM_EXIT_MEMORY_FAULT faults for the traditional shared
> pages (i.e. GFN -> HVA)?
> It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only, leaving
> shared pages to go through the existing userfaultfd mechanism:
> - The need for “asynchronous userfaults,” introduced by patch 14, could be eliminated.
> - The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for private page
> faults exiting to userspace for postcopy might not be necessary, because all pages on the
> destination side are initially “shared,” and the guest’s first access will always cause an
> > exit to userspace for shared->private conversion. So the VMM is able to leverage the exit to
> > fetch the page data from the source (the VMM can know whether a page's data has been fetched
> > from the source or not).
You're right that, today, including support for guest-private memory
*only* indeed simplifies things (no async userfaults). I think your
strategy for implementing post-copy would work (so, shared->private
conversion faults for vCPU accesses to private memory, and userfaultfd
for everything else).
I'm not 100% sure what should happen in the case of a non-vCPU access
to should-be-private memory; today it seems like KVM just provides the
shared version of the page, so conventional use of userfaultfd
shouldn't break anything.
But eventually guest_memfd itself will support "shared" memory, and
(IIUC) it won't use VMAs, so userfaultfd won't be usable (without
changes anyway). For a non-confidential VM, all memory will be
"shared", so shared->private conversions can't help us there either.
Starting everything as private almost works (so using private->shared
conversions as a notification mechanism), but if the first time KVM
attempts to use a page is not from a vCPU (and is from a place where
we cannot easily return to userspace), the need for "async userfaults"
comes back.
For this use case, it seems cleaner to have a new interface. (And, as
far as I can tell, we would at least need some kind of "async
userfault"-like mechanism.)
Another reason why, today, KVM Userfault is helpful is that
userfaultfd has a couple drawbacks. Userfaultfd migration with
HugeTLB-1G is basically unusable, as HugeTLB pages cannot be mapped at
PAGE_SIZE. Some discussion here[1][2].
Moving the implementation of post-copy to KVM means that, throughout
post-copy, we can avoid changes to the main mm page tables, and we
only need to modify the second stage page tables. This saves the
memory needed to store the extra set of shattered page tables, and we
save the performance overhead of the page table modifications and
accounting that mm does.
There's some more discussion about these points in David's RFC[3].
[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/ZdcKwK7CXgEsm-Co@x1n/
[3]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/
>
> >
> > I have intentionally added support for traditional memslots, as the complexity
> > that it adds is minimal, and it is useful for some VMMs, as it can be used to
> > fully implement post-copy live migration.
> >
> > Implementation Details
> > ======================
> >
> > Let's break down how KVM implements each of the three core requirements
> > for implementing post-copy as laid out above:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is to
> > use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking memory
> > attributes. Using this interface is slightly challenging, as there is no UAPI for
> > setting/clearing particular attributes; we must set the exact attributes we want.
> >
> > The synchronization that is in place for updating memory attributes is not
> > suitable for post-copy live migration either, which will require updating
> > memory attributes (from userfault to no-userfault) very frequently.
> >
> > Another potential interface could be to use something akin to a dirty bitmap,
> > where a bitmap describes which pages within a memslot (or VM) should trigger
> > userfaults. This way, it is straightforward to make updates to the userfault
> > status of a page cheap.
> >
> > When KVM Userfault is enabled, we need to be careful not to map a userfault
> > page in response to a fault on a non-userfault page. In this RFC, I've taken the
> > simplest approach: force new PTEs to be PAGE_SIZE.
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page the
> > vCPU is trying to access is a userfault-enabled page, we use
>
> Why is it necessary to add the per-page control (with uAPIs for VMM to set/clear)?
> Any functional issues if we just have all the page faults exit to userspace during the
> post-copy period?
> - As also mentioned above, userspace can easily know if a page needs to be
> fetched from the source or not, so upon a fault exit to userspace, VMM can
> decide to block the faulting vcpu thread or return back to KVM immediately.
> - If improvement is really needed (would need profiling first) to reduce number
> of exits to userspace, a KVM internal status (bitmap or xarray) seems sufficient.
> Each page only needs to exit to userspace once for the purpose of fetching its data
> from the source in postcopy. It doesn't seem to need userspace to enable the exit
> again for the page (via a new uAPI), right?
We don't necessarily need a way to go from no-fault -> fault for a
page, that's right[4]. But we do need a way for KVM to be able to
allow the access to proceed (i.e., go from fault -> no-fault). IOW, if
we get a fault and come out to userspace, we need a way to tell KVM
not to do that again. In the case of shared->private conversions, that
mechanism is toggling the memory attributes for a gfn. For
conventional userfaultfd, that's using UFFDIO_COPY/CONTINUE/POISON.
Maybe I'm misunderstanding your question.
[4]: It is helpful for poison emulation for HugeTLB-backed VMs today,
but this is not important.
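The bitmap-style interface floated above could look roughly like the following sketch. This is entirely hypothetical: none of these names exist in KVM's uAPI, and in-kernel code would use set_bit()/clear_bit()/test_bit() on unsigned longs rather than C11 atomics; the point is only that per-page set/clear/test is lock-free and cheap, unlike the memory-attributes xarray update path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per gfn in the memslot: 1 = userfault-enabled, 0 = resolved.
 * (Hypothetical structure, not part of any KVM uAPI.) */
struct userfault_bitmap {
	_Atomic unsigned long *bits;
	uint64_t nr_pages;
};

/* Mark a gfn as userfault-enabled (e.g. for all pages at the start of
 * post-copy). */
static void userfault_set(struct userfault_bitmap *b, uint64_t gfn)
{
	atomic_fetch_or(&b->bits[gfn / BITS_PER_LONG],
			1UL << (gfn % BITS_PER_LONG));
}

/* Clear a gfn after its data has been fetched from the source; a single
 * atomic AND, no global synchronization needed. */
static void userfault_clear(struct userfault_bitmap *b, uint64_t gfn)
{
	atomic_fetch_and(&b->bits[gfn / BITS_PER_LONG],
			 ~(1UL << (gfn % BITS_PER_LONG)));
}

/* Checked on the KVM fault path to decide whether to exit to userspace. */
static int userfault_test(struct userfault_bitmap *b, uint64_t gfn)
{
	return !!(atomic_load(&b->bits[gfn / BITS_PER_LONG]) &
		  (1UL << (gfn % BITS_PER_LONG)));
}
```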
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-16 17:10 ` James Houghton
@ 2024-07-17 15:03 ` Wang, Wei W
2024-07-18 1:09 ` James Houghton
0 siblings, 1 reply; 40+ messages in thread
From: Wang, Wei W @ 2024-07-17 15:03 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
Peter Xu
On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> You're right that, today, including support for guest-private memory
> *only* indeed simplifies things (no async userfaults). I think your strategy for
> implementing post-copy would work (so, shared->private conversion faults for
> vCPU accesses to private memory, and userfaultfd for everything else).
Yes, it works and has been used for our internal tests.
>
> I'm not 100% sure what should happen in the case of a non-vCPU access to
> should-be-private memory; today it seems like KVM just provides the shared
> version of the page, so conventional use of userfaultfd shouldn't break
> anything.
This seems to be the trusted I/O usage (I'm not aware of other usages; emulated device
backends, such as vhost, work with shared pages). Migration support for trusted device
passthrough doesn't seem to be architecturally ready yet. Especially for postcopy,
AFAIK, even the legacy VM case lacks support for device passthrough (not sure if
you've made it work internally). So it seems too early to discuss this in detail.
>
> But eventually guest_memfd itself will support "shared" memory,
OK, I thought of this. I'm not sure how feasible it would be to extend gmem for
shared memory. I think questions like the below need to be investigated:
#1 What are the tangible benefits of gmem-based shared memory, compared to the
legacy shared memory that we have now?
#2 There would be some gaps in making gmem usable for shared pages. For
example, would it allow userspace to map it (without security concerns)?
#3 If gmem gets extended to support something like hugetlb (e.g. 1GB pages), would it
result in the same issue as hugetlb?
The support for using gmem for shared memory isn't in place yet, and this seems
to be a dependency for the support being added here.
> and
> (IIUC) it won't use VMAs, so userfaultfd won't be usable (without changes
> anyway). For a non-confidential VM, all memory will be "shared", so shared-
> >private conversions can't help us there either.
> Starting everything as private almost works (so using private->shared
> conversions as a notification mechanism), but if the first time KVM attempts to
> use a page is not from a vCPU (and is from a place where we cannot easily
> return to userspace), the need for "async userfaults"
> comes back.
Yeah, this needs to be resolved for KVM userfaults. If gmem is used for private
pages only, this wouldn't be an issue (it will be covered by userfaultfd).
>
> For this use case, it seems cleaner to have a new interface. (And, as far as I can
> tell, we would at least need some kind of "async userfault"-like mechanism.)
>
> Another reason why, today, KVM Userfault is helpful is that userfaultfd has a
> couple drawbacks. Userfaultfd migration with HugeTLB-1G is basically
> unusable, as HugeTLB pages cannot be mapped at PAGE_SIZE. Some discussion
> here[1][2].
>
> Moving the implementation of post-copy to KVM means that, throughout
> post-copy, we can avoid changes to the main mm page tables, and we only
> need to modify the second stage page tables. This saves the memory needed
> to store the extra set of shattered page tables, and we save the performance
> overhead of the page table modifications and accounting that mm does.
It would be nice to see some data comparing KVM faults and userfaultfd,
e.g., the end-to-end latency of handling a page fault by fetching its data from the source.
(I didn't find data in the link you shared. Please correct me if I missed it.)
> We don't necessarily need a way to go from no-fault -> fault for a page, that's
> right[4]. But we do need a way for KVM to be able to allow the access to
> proceed (i.e., go from fault -> no-fault). IOW, if we get a fault and come out to
> userspace, we need a way to tell KVM not to do that again.
> In the case of shared->private conversions, that mechanism is toggling the memory
> attributes for a gfn. For conventional userfaultfd, that's using
> UFFDIO_COPY/CONTINUE/POISON.
> Maybe I'm misunderstanding your question.
We can come back to this after the dependency discussion above is done. (If gmem is only
used for private pages, the support for postcopy, including the changes required for VMMs,
would be simpler.)
* Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-17 15:03 ` Wang, Wei W
@ 2024-07-18 1:09 ` James Houghton
2024-07-19 14:47 ` Wang, Wei W
0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-18 1:09 UTC (permalink / raw)
To: Wang, Wei W
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
Peter Xu
On Wed, Jul 17, 2024 at 8:03 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> > You're right that, today, including support for guest-private memory
> > *only* indeed simplifies things (no async userfaults). I think your strategy for
> > implementing post-copy would work (so, shared->private conversion faults for
> > vCPU accesses to private memory, and userfaultfd for everything else).
>
> Yes, it works and has been used for our internal tests.
>
> >
> > I'm not 100% sure what should happen in the case of a non-vCPU access to
> > should-be-private memory; today it seems like KVM just provides the shared
> > version of the page, so conventional use of userfaultfd shouldn't break
> > anything.
>
> This seems to be the trusted IO usage (not aware of other usages, emulated device
> backends, such as vhost, work with shared pages). Migration support for trusted device
> passthrough doesn't seem to be architecturally ready yet. Especially for postcopy,
> AFAIK, even the legacy VM case lacks the support for device passthrough (not sure if
> you've made it internally). So it seems too early to discuss this in detail.
We don't migrate VMs with passthrough devices.
I still think the way KVM handles non-vCPU accesses to private memory
is wrong: surely it is an error, yet we simply provide the shared
version of the page. *shrug*
>
> >
> > But eventually guest_memfd itself will support "shared" memory,
>
> OK, I thought of this. Not sure how feasible it would be to extend gmem for
> shared memory. I think questions like below need to be investigated:
An RFC for it got posted recently[1]. :)
> #1 what are the tangible benefits of gmem based shared memory, compared to the
> legacy shared memory that we have now?
For [1], unmapping guest memory from the direct map.
> #2 There would be some gaps to make gmem usable for shared pages. For
> example, would it support userspace to map (without security concerns)?
At least in [1], userspace would be able to mmap it, but KVM would
still not be able to GUP it (instead going through the normal
guest_memfd path).
> #3 if gmem gets extended to be something like hugetlb (e.g. 1GB), would it result
> in the same issue as hugetlb?
Good question. At the end of the day, the problem is that GUP relies
on host mm page table mappings, and HugeTLB can't map things with
PAGE_SIZE PTEs.
At least as of [1], given that KVM doesn't GUP guest_memfd memory, we
don't rely on the host mm page table layout, so we don't have the same
problem.
For VMMs that want to catch userspace (or non-GUP kernel) accesses via
a guest_memfd VMA, then it's possible it has the same issue. But for
VMMs that don't care to catch these kinds of accesses (the kind of
user that would use KVM Userfault to implement post-copy), it doesn't
matter.
[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
>
> The support of using gmem for shared memory isn't in place yet, and this seems
> to be a dependency for the support being added here.
Perhaps I've been slightly preemptive. :) I still think there's useful
discussion here.
> > and
> > (IIUC) it won't use VMAs, so userfaultfd won't be usable (without changes
> > anyway). For a non-confidential VM, all memory will be "shared", so shared-
> > >private conversions can't help us there either.
> > Starting everything as private almost works (so using private->shared
> > conversions as a notification mechanism), but if the first time KVM attempts to
> > use a page is not from a vCPU (and is from a place where we cannot easily
> > return to userspace), the need for "async userfaults"
> > comes back.
>
> Yeah, this needs to be resolved for KVM userfaults. If gmem is used for private
> pages only, this wouldn't be an issue (it will be covered by userfaultfd).
We're on the same page here.
>
>
> >
> > For this use case, it seems cleaner to have a new interface. (And, as far as I can
> > tell, we would at least need some kind of "async userfault"-like mechanism.)
> >
> > Another reason why, today, KVM Userfault is helpful is that userfaultfd has a
> > couple drawbacks. Userfaultfd migration with HugeTLB-1G is basically
> > unusable, as HugeTLB pages cannot be mapped at PAGE_SIZE. Some discussion
> > here[1][2].
> >
> > Moving the implementation of post-copy to KVM means that, throughout
> > post-copy, we can avoid changes to the main mm page tables, and we only
> > need to modify the second stage page tables. This saves the memory needed
> > to store the extra set of shattered page tables, and we save the performance
> > overhead of the page table modifications and accounting that mm does.
>
> It would be nice to see some data for comparisons between kvm faults and userfaultfd
> e.g., end to end latency of handling a page fault via getting data from the source.
> (I didn't find data from the link you shared. Please correct me if I missed it)
I don't have an A/B comparison for kernel end-to-end fault latency. :(
But I can tell you that with 32us or so network latency, it's not a
huge difference (assuming Anish's series[2]).
The real performance issue comes when we are collapsing the page
tables at the end. We basically have to do ~2x of everything (TLB
flushes, etc.), plus additional accounting that HugeTLB/THP does
(adjusting refcount/mapcount), etc. And one must optimize how the
unmap MMU notifiers are called so as to not stall vCPUs unnecessarily.
[2]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/
>
>
> > We don't necessarily need a way to go from no-fault -> fault for a page, that's
> > right[4]. But we do need a way for KVM to be able to allow the access to
> > proceed (i.e., go from fault -> no-fault). IOW, if we get a fault and come out to
> > userspace, we need a way to tell KVM not to do that again.
> > In the case of shared->private conversions, that mechanism is toggling the memory
> > attributes for a gfn. For conventional userfaultfd, that's using
> > UFFDIO_COPY/CONTINUE/POISON.
> > Maybe I'm misunderstanding your question.
>
> We can come back to this after the dependency discussion above is done. (If gmem is only
> used for private pages, the support for postcopy, including changes required for VMMs, would
> be simpler)
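To make the fault -> no-fault transition discussed above concrete, here is a minimal model of the VMM-side handling. It is a sketch only: handle_userfault_exit() and the arrays are made up for illustration, and in a real VMM the resolution step would be UFFDIO_COPY (for userfaultfd) or whatever uAPI clears the userfault state for the gfn (for KVM Userfault). The key property is that each page needs at most one fetch, and a second exit for the same gfn is a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 8

/* Hypothetical VMM-side post-copy state: which gfns have been fetched. */
static bool fetched[NPAGES];
static uint8_t guest_mem[NPAGES];               /* stand-in for guest memory */
static const uint8_t source_mem[NPAGES] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Called on a userfault exit for @gfn: fetch the page's data from the
 * source (modeled here as a plain copy), then perform the
 * fault -> no-fault transition so KVM never exits for this gfn again. */
static void handle_userfault_exit(uint64_t gfn)
{
	if (fetched[gfn])
		return;	/* raced with another vCPU's exit; nothing to do */
	guest_mem[gfn] = source_mem[gfn];	/* the "network fetch" */
	fetched[gfn] = true;	/* resolution: no further faults on this gfn */
}
```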
* RE: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
2024-07-18 1:09 ` James Houghton
@ 2024-07-19 14:47 ` Wang, Wei W
0 siblings, 0 replies; 40+ messages in thread
From: Wang, Wei W @ 2024-07-19 14:47 UTC (permalink / raw)
To: James Houghton
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Shuah Khan,
Axel Rasmussen, David Matlack, kvm@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
Peter Xu
On Thursday, July 18, 2024 9:09 AM, James Houghton wrote:
> On Wed, Jul 17, 2024 at 8:03 AM Wang, Wei W <wei.w.wang@intel.com>
> wrote:
> >
> > On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> > > You're right that, today, including support for guest-private memory
> > > *only* indeed simplifies things (no async userfaults). I think your
> > > strategy for implementing post-copy would work (so, shared->private
> > > conversion faults for vCPU accesses to private memory, and userfaultfd for
> everything else).
> >
> > Yes, it works and has been used for our internal tests.
> >
> > >
> > > I'm not 100% sure what should happen in the case of a non-vCPU
> > > access to should-be-private memory; today it seems like KVM just
> > > provides the shared version of the page, so conventional use of
> > > userfaultfd shouldn't break anything.
> >
> > This seems to be the trusted IO usage (not aware of other usages,
> > emulated device backends, such as vhost, work with shared pages).
> > Migration support for trusted device passthrough doesn't seem to be
> > architecturally ready yet. Especially for postcopy, AFAIK, even the
> > legacy VM case lacks the support for device passthrough (not sure if you've
> made it internally). So it seems too early to discuss this in detail.
>
> We don't migrate VMs with passthrough devices.
>
> I still think the way KVM handles non-vCPU accesses to private memory is
> wrong: surely it is an error, yet we simply provide the shared version of the
> page. *shrug*
>
> >
> > >
> > > But eventually guest_memfd itself will support "shared" memory,
> >
> > OK, I thought of this. Not sure how feasible it would be to extend
> > gmem for shared memory. I think questions like below need to be
> investigated:
>
> An RFC for it got posted recently[1]. :)
>
> > #1 what are the tangible benefits of gmem based shared memory, compared
> to the
> > legacy shared memory that we have now?
>
> For [1], unmapping guest memory from the direct map.
>
> > #2 There would be some gaps to make gmem usable for shared pages. For
> > example, would it support userspace to map (without security concerns)?
>
> At least in [1], userspace would be able to mmap it, but KVM would still not be
> able to GUP it (instead going through the normal guest_memfd path).
>
> > #3 if gmem gets extended to be something like hugetlb (e.g. 1GB), would it
> result
> > in the same issue as hugetlb?
>
> Good question. At the end of the day, the problem is that GUP relies on host
> mm page table mappings, and HugeTLB can't map things with PAGE_SIZE PTEs.
>
> At least as of [1], given that KVM doesn't GUP guest_memfd memory, we don't
> rely on the host mm page table layout, so we don't have the same problem.
>
> For VMMs that want to catch userspace (or non-GUP kernel) accesses via a
> guest_memfd VMA, then it's possible it has the same issue. But for VMMs that
> don't care to catch these kinds of accesses (the kind of user that would use
> KVM Userfault to implement post-copy), it doesn't matter.
>
> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-
> roypat@amazon.co.uk/
Ah, I overlooked this series, thanks for the reminder.
Let me check the details first.