* [PATCH 00/13] KVM: guest_memfd fixes
@ 2023-09-21 20:33 Sean Christopherson
2023-09-21 20:33 ` [PATCH 01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
` (13 more replies)
0 siblings, 14 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Fix a variety of bugs in the guest_memfd series, almost all of which are
my fault, and add assertions and testcases to detect future regressions.
The last patch, renaming guest_mem.c to guest_memfd.c, is obviously not a
bug fix; I included it here so that if we want to go with guest_memfd.c,
squashing everything will be straightforward.
Note, the truncate fix and test conflict with Isaku's series[*]. My
fix is more correct (knock wood), and my test is slightly more comprehensive
(though arguably not really all that more interesting).
Note #2, this is based on kvm-x86/guest_memfd, to which I force-pushed v12.
Note #3, the patches are organized so that they can be squashed with their
Fixes, i.e. the splits are more than a bit odd in some places.
[*] https://lore.kernel.org/all/cover.1695327124.git.isaku.yamahata@intel.com
Sean Christopherson (13):
KVM: Assert that mmu_invalidate_in_progress *never* goes negative
KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd
KVM: WARN if *any* MMU invalidation sequence doesn't add a range
KVM: WARN if there are dangling MMU invalidations at VM destruction
KVM: Fix MMU invalidation bookkeeping in guest_memfd
KVM: Disallow hugepages for incompatible gmem bindings, but let 'em
succeed
KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all
memslots
KVM: x86/mmu: Zap shared-only memslots when private attribute changes
KVM: Always add relevant ranges to invalidation set when changing
attributes
KVM: x86/mmu: Drop repeated add() of to-be-invalidated range
KVM: selftests: Refactor private mem conversions to prep for
punch_hole test
KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase
KVM: Rename guest_mem.c to guest_memfd.c
arch/x86/kvm/mmu/mmu.c | 25 ++--
.../kvm/x86_64/private_mem_conversions_test.c | 112 ++++++++++++++----
virt/kvm/Makefile.kvm | 2 +-
virt/kvm/{guest_mem.c => guest_memfd.c} | 84 +++++++------
virt/kvm/kvm_main.c | 40 +++++--
5 files changed, 184 insertions(+), 79 deletions(-)
rename virt/kvm/{guest_mem.c => guest_memfd.c} (92%)
base-commit: 7af66fbd6d89b159acc359895449b5940b6e4fdb
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 02/13] KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd Sean Christopherson
` (12 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.
Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel. A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings. But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end(). And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().
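For reference, a rough sketch of why a non-zero (negative) count wedges the
VM; the function and field names exist upstream (include/linux/kvm_host.h),
but the body below is simplified for illustration, not copied verbatim:

/*
 * Hedged sketch of mmu_invalidate_retry(): any non-zero value of
 * mmu_invalidate_in_progress, including a corrupted negative count, tells
 * the caller that an invalidation is in flight, so page fault handlers
 * retry forever and never install new mappings.
 */
static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	if (kvm->mmu_invalidate_in_progress)
		return 1;
	if (kvm->mmu_invalidate_seq != mmu_seq)
		return 1;
	return 0;
}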
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a83dfef1316e..30708e460568 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -870,6 +870,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm)
* in conjunction with the smp_rmb in mmu_invalidate_retry().
*/
kvm->mmu_invalidate_in_progress--;
+ KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
/*
* Assert that at least one range must be added between start() and
@@ -906,8 +907,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
*/
if (wake)
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
- BUG_ON(kvm->mmu_invalidate_in_progress < 0);
}
static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 02/13] KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
2023-09-21 20:33 ` [PATCH 01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 03/13] KVM: WARN if *any* MMU invalidation sequence doesn't add a range Sean Christopherson
` (11 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Restore the call to truncate_inode_pages_range() in guest_memfd's handling
of PUNCH_HOLE that was unintentionally removed in a rebase gone bad.
Reported-by: Michael Roth <michael.roth@amd.com>
Fixes: 1d46f95498c5 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/guest_mem.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a819367434e9..3c9e83a596fe 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -140,10 +140,13 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
*/
filemap_invalidate_lock(inode->i_mapping);
- list_for_each_entry(gmem, gmem_list, entry) {
+ list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_begin(gmem, start, end);
+
+ truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
+
+ list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_end(gmem, start, end);
- }
filemap_invalidate_unlock(inode->i_mapping);
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 03/13] KVM: WARN if *any* MMU invalidation sequence doesn't add a range
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
2023-09-21 20:33 ` [PATCH 01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
2023-09-21 20:33 ` [PATCH 02/13] KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
` (10 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Tweak the assertion in kvm_mmu_invalidate_end() to unconditionally require
a range to be added between start() and end(). Asserting if and only if
kvm->mmu_invalidate_in_progress is non-zero makes the assertion all but
useless as it would fire only when there are multiple invalidations in
flight, which is not common and would also get a false negative if one or
more sequences, but not all, added a range.
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Fixes: 145725d1542a ("KVM: Use gfn instead of hva for mmu_notifier_retry")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 30708e460568..54480655bcce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -873,11 +873,10 @@ void kvm_mmu_invalidate_end(struct kvm *kvm)
KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
/*
- * Assert that at least one range must be added between start() and
- * end(). Not adding a range isn't fatal, but it is a KVM bug.
+ * Assert that at least one range was added between start() and end().
+ * Not adding a range isn't fatal, but it is a KVM bug.
*/
- WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
- kvm->mmu_invalidate_range_start == INVALID_GPA);
+ WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA);
}
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (2 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 03/13] KVM: WARN if *any* MMU invalidation sequence doesn't add a range Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-27 3:16 ` Binbin Wu
2023-09-21 20:33 ` [PATCH 05/13] KVM: Fix MMU invalidation bookkeeping in guest_memfd Sean Christopherson
` (9 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().
KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as the
bug that *almost* escaped initial guest_memfd development.
Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 54480655bcce..277afeedd670 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1381,9 +1381,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
* No threads can be waiting in kvm_swap_active_memslots() as the
* last reference on KVM has been dropped, but freeing
* memslots would deadlock without this manual intervention.
+ *
+ * If the count isn't unbalanced, i.e. KVM did NOT unregister between
+ * a start() and end(), then there shouldn't be any in-progress
+ * invalidations.
*/
WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
- kvm->mn_active_invalidate_count = 0;
+ if (kvm->mn_active_invalidate_count)
+ kvm->mn_active_invalidate_count = 0;
+ else
+ WARN_ON(kvm->mmu_invalidate_in_progress);
#else
kvm_flush_shadow_all(kvm);
#endif
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 05/13] KVM: Fix MMU invalidation bookkeeping in guest_memfd
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (3 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed Sean Christopherson
` (8 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Acquire mmu_lock and do invalidate_{begin,end}() if and only if there is
at least one memslot that overlaps the to-be-invalidated range. This
fixes a bug where KVM would leave a dangling in-progress invalidation as
the begin() call was unconditional, but the end() was not (only performed
if there was overlap).
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Fixes: 1d46f95498c5 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/guest_mem.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 3c9e83a596fe..68528e9cddd7 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -88,14 +88,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
pgoff_t end)
{
+ bool flush = false, found_memslot = false;
struct kvm_memory_slot *slot;
struct kvm *kvm = gmem->kvm;
unsigned long index;
- bool flush = false;
-
- KVM_MMU_LOCK(kvm);
-
- kvm_mmu_invalidate_begin(kvm);
xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
pgoff_t pgoff = slot->gmem.pgoff;
@@ -107,13 +103,21 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
.may_block = true,
};
+ if (!found_memslot) {
+ found_memslot = true;
+
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_begin(kvm);
+ }
+
flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
}
if (flush)
kvm_flush_remote_tlbs(kvm);
- KVM_MMU_UNLOCK(kvm);
+ if (found_memslot)
+ KVM_MMU_UNLOCK(kvm);
}
static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
@@ -121,10 +125,11 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
{
struct kvm *kvm = gmem->kvm;
- KVM_MMU_LOCK(kvm);
- if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
+ if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+ KVM_MMU_LOCK(kvm);
kvm_mmu_invalidate_end(kvm);
- KVM_MMU_UNLOCK(kvm);
+ KVM_MMU_UNLOCK(kvm);
+ }
}
static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (4 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 05/13] KVM: Fix MMU invalidation bookkeeping in guest_memfd Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-22 22:42 ` Michael Roth
2023-09-21 20:33 ` [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots Sean Christopherson
` (7 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Remove the restriction that a guest_memfd instance that supports hugepages
can *only* be bound by memslots that are 100% compatible with hugepage
mappings, and instead force KVM to use an order-0 mapping if the binding
isn't compatible with hugepages.
The intent of the draconian binding restriction was purely to simplify the
guest_memfd implementation, e.g. to avoid repeating the existing logic in
KVM x86 for precisely tracking which GFNs support hugepages. But
checking that the binding's offset and size are compatible is just as easy
to do when KVM wants to create a mapping.
And on the other hand, completely rejecting bindings that are incompatible
with hugepages makes it practically impossible for userspace to use a
single guest_memfd instance for all guest memory, e.g. on x86 it would be
impossible to skip the legacy VGA hole while still allowing hugepage
mappings for the rest of guest memory.
Suggested-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/all/20230918163647.m6bjgwusc7ww5tyu@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/guest_mem.c | 54 ++++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 27 deletions(-)
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 68528e9cddd7..4f3a313f5532 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -434,20 +434,6 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
return err;
}
-static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
-{
- if (size < 0 || !PAGE_ALIGNED(size))
- return false;
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
- !IS_ALIGNED(size, HPAGE_PMD_SIZE))
- return false;
-#endif
-
- return true;
-}
-
int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
loff_t size = args->size;
@@ -460,9 +446,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
if (flags & ~valid_flags)
return -EINVAL;
- if (!kvm_gmem_is_valid_size(size, flags))
+ if (size < 0 || !PAGE_ALIGNED(size))
return -EINVAL;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+ !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+ return -EINVAL;
+#endif
+
return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
}
@@ -470,7 +462,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
unsigned int fd, loff_t offset)
{
loff_t size = slot->npages << PAGE_SHIFT;
- unsigned long start, end, flags;
+ unsigned long start, end;
struct kvm_gmem *gmem;
struct inode *inode;
struct file *file;
@@ -489,16 +481,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
goto err;
inode = file_inode(file);
- flags = (unsigned long)inode->i_private;
- /*
- * For simplicity, require the offset into the file and the size of the
- * memslot to be aligned to the largest possible page size used to back
- * the file (same as the size of the file itself).
- */
- if (!kvm_gmem_is_valid_size(offset, flags) ||
- !kvm_gmem_is_valid_size(size, flags))
- goto err;
+ if (offset < 0 || !PAGE_ALIGNED(offset))
+ return -EINVAL;
if (offset + size > i_size_read(inode))
goto err;
@@ -599,8 +584,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
page = folio_file_page(folio, index);
*pfn = page_to_pfn(page);
- if (max_order)
- *max_order = compound_order(compound_head(page));
+ if (!max_order)
+ goto success;
+
+ *max_order = compound_order(compound_head(page));
+ if (!*max_order)
+ goto success;
+
+ /*
+ * For simplicity, allow mapping a hugepage if and only if the entire
+ * binding is compatible, i.e. don't bother supporting mapping interior
+ * sub-ranges with hugepages (unless userspace comes up with a *really*
+ * strong use case for needing hugepages within unaligned bindings).
+ */
+ if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
+ !IS_ALIGNED(slot->npages, 1ull << *max_order))
+ *max_order = 0;
+success:
r = 0;
out_unlock:
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (5 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-27 6:01 ` Binbin Wu
2023-09-21 20:33 ` [PATCH 08/13] KVM: x86/mmu: Zap shared-only memslots when private attribute changes Sean Christopherson
` (6 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Track the effects of private attributes on potential hugepage mappings if
the VM supports private memory, i.e. even if the target memslot can only
ever be mapped shared. If userspace configures a chunk of memory as
private, KVM must not allow that memory to be mapped shared regardless of
whether or not the *current* memslot can be mapped private. E.g. if the
guest accesses a private range using a shared memslot, then KVM must exit
to userspace.
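For context, below is a hedged sketch of the fault-path check that consumes
this tracking; the helpers named (kvm_mem_is_private() and
kvm_mmu_prepare_memory_fault_exit()) follow the guest_memfd series, but the
snippet is a simplification for illustration, not the exact upstream code:

	/*
	 * In the x86 page fault path: if the access type disagrees with the
	 * userspace-configured attribute, e.g. a shared access hits a GFN
	 * marked PRIVATE, KVM exits to userspace instead of installing a
	 * mapping.  This is why the PRIVATE attribute must be tracked even
	 * for memslots that can only ever be mapped shared.
	 */
	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return -EFAULT;
	}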
Fixes: 5bb0b4e162d1 ("KVM: x86: Disallow hugepages when memory attributes are mixed")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 269d4dc47c98..148931cf9dba 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7314,10 +7314,12 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
lockdep_assert_held(&kvm->slots_lock);
/*
- * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
- * the slot if the slot will never consume the PRIVATE attribute.
+ * Calculate which ranges can be mapped with hugepages even if the slot
+ * can't map memory PRIVATE. KVM mustn't create a SHARED hugepage over
+ * a range that has PRIVATE GFNs, and conversely converting a range to
+ * SHARED may now allow hugepages.
*/
- if (!kvm_slot_can_be_private(slot))
+ if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
return false;
/*
@@ -7372,7 +7374,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
{
int level;
- if (!kvm_slot_can_be_private(slot))
+ if (!kvm_arch_has_private_mem(kvm))
return;
for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 08/13] KVM: x86/mmu: Zap shared-only memslots when private attribute changes
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (6 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 09/13] KVM: Always add relevant ranges to invalidation set when changing attributes Sean Christopherson
` (5 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Zap all relevant memslots, including shared-only memslots, if the private
memory attribute is being changed. If userspace converts a range to
private, KVM must zap shared SPTEs to prevent the guest from accessing
the memory as shared. If userspace converts a range to shared, zapping
SPTEs for shared-only memslots isn't strictly necessary, but doing so
ensures that KVM will install a hugepage mapping if possible, e.g. if a
2MiB range that was mixed is converted to be 100% shared.
Fixes: dcde045383f3 ("KVM: x86/mmu: Handle page fault for private memory")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 148931cf9dba..aa67d9d6fcf8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7259,10 +7259,17 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range)
{
/*
- * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
- * the slot if the slot will never consume the PRIVATE attribute.
+ * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
+ * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
+ * can simply ignore such slots. But if userspace is making memory
+ * PRIVATE, then KVM must prevent the guest from accessing the memory
+ * as shared. And if userspace is making memory SHARED and this point
+ * is reached, then at least one page within the range was previously
+ * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
+ * Zapping SPTEs in this case ensures KVM will reassess whether or not
+ * a hugepage can be used for affected ranges.
*/
- if (!kvm_slot_can_be_private(range->slot))
+ if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
return false;
return kvm_mmu_unmap_gfn_range(kvm, range);
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 09/13] KVM: Always add relevant ranges to invalidation set when changing attributes
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (7 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 08/13] KVM: x86/mmu: Zap shared-only memslots when private attribute changes Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 10/13] KVM: x86/mmu: Drop repeated add() of to-be-invalidated range Sean Christopherson
` (4 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
When setting memory attributes, add all affected memslot ranges to the set
of invalidation ranges before calling into arch code. Even if the change
in attributes doesn't strictly require zapping, it's not at all obvious
that letting arch code establish new mappings while the attributes are in
flux is safe and/or desirable. Unconditionally adding ranges allows KVM
to keep its sanity check that at least one range is added between begin()
and end(), e.g. to guard against a missed add() call, without needing
complex code to condition the begin()/end() on arch behavior.
Fixes: 9a327182447a ("KVM: Introduce per-page memory attributes")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 277afeedd670..96fc609459e3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2529,6 +2529,25 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
KVM_MMU_UNLOCK(kvm);
}
+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ /*
+ * Unconditionally add the range to the invalidation set, regardless of
+ * whether or not the arch callback actually needs to zap SPTEs. E.g.
+ * if KVM supports RWX attributes in the future and the attributes are
+ * going from R=>RW, zapping isn't strictly necessary. Unconditionally
+ * adding the range allows KVM to require that MMU invalidations add at
+ * least one range between begin() and end(), e.g. allows KVM to detect
+ * bugs where the add() is missed. Relaxing the rule *might* be safe,
+ * but it's not obvious that allowing new mappings while the attributes
+ * are in flux is desirable or worth the complexity.
+ */
+ kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+ return kvm_arch_pre_set_memory_attributes(kvm, range);
+}
+
/* Set @attributes for the gfn range [@start, @end). */
static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
unsigned long attributes)
@@ -2536,7 +2555,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
.end = end,
- .handler = kvm_arch_pre_set_memory_attributes,
+ .handler = kvm_pre_set_memory_attributes,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
.may_block = true,
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 10/13] KVM: x86/mmu: Drop repeated add() of to-be-invalidated range
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (8 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 09/13] KVM: Always add relevant ranges to invalidation set when changing attributes Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 11/13] KVM: selftests: Refactor private mem conversions to prep for punch_hole test Sean Christopherson
` (3 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Use kvm_unmap_gfn_range() instead of kvm_mmu_unmap_gfn_range() when
handling memory attribute ranges now that common KVM adds the target range
to the invalidation set, i.e. calls kvm_mmu_invalidate_range_add() before
invoking the arch callback.
Fixes: dcde045383f3 ("KVM: x86/mmu: Handle page fault for private memory")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aa67d9d6fcf8..bcb812a7f563 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7272,7 +7272,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
return false;
- return kvm_mmu_unmap_gfn_range(kvm, range);
+ return kvm_unmap_gfn_range(kvm, range);
}
static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 11/13] KVM: selftests: Refactor private mem conversions to prep for punch_hole test
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (9 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 10/13] KVM: x86/mmu: Drop repeated add() of to-be-invalidated range Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 12/13] KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase Sean Christopherson
` (2 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Refactor the private memory conversions test to prepare for adding a test
to verify PUNCH_HOLE functionality *without* actually doing a proper
conversion, i.e. without calling KVM_SET_MEMORY_ATTRIBUTES. Make setting
attributes optional, rename the guest code to be more descriptive, and
extract the ranges to a global variable (iterating over multiple ranges is
less interesting for PUNCH_HOLE, but with a common array it's trivially
easy to do so).
Fixes: 90535ca08f76 ("KVM: selftests: Add x86-only selftest for private memory conversions")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../kvm/x86_64/private_mem_conversions_test.c | 51 ++++++++++---------
1 file changed, 27 insertions(+), 24 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index 50541246d6fd..b80cf7342d0d 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -83,13 +83,14 @@ static void guest_sync_private(uint64_t gpa, uint64_t size, uint8_t pattern)
}
/* Arbitrary values, KVM doesn't care about the attribute flags. */
-#define MAP_GPA_SHARED BIT(0)
-#define MAP_GPA_DO_FALLOCATE BIT(1)
+#define MAP_GPA_SET_ATTRIBUTES BIT(0)
+#define MAP_GPA_SHARED BIT(1)
+#define MAP_GPA_DO_FALLOCATE BIT(2)
static void guest_map_mem(uint64_t gpa, uint64_t size, bool map_shared,
bool do_fallocate)
{
- uint64_t flags = 0;
+ uint64_t flags = MAP_GPA_SET_ATTRIBUTES;
if (map_shared)
flags |= MAP_GPA_SHARED;
@@ -108,19 +109,19 @@ static void guest_map_private(uint64_t gpa, uint64_t size, bool do_fallocate)
guest_map_mem(gpa, size, false, do_fallocate);
}
-static void guest_run_test(uint64_t base_gpa, bool do_fallocate)
+struct {
+ uint64_t offset;
+ uint64_t size;
+} static const test_ranges[] = {
+ GUEST_STAGE(0, PAGE_SIZE),
+ GUEST_STAGE(0, SZ_2M),
+ GUEST_STAGE(PAGE_SIZE, PAGE_SIZE),
+ GUEST_STAGE(PAGE_SIZE, SZ_2M),
+ GUEST_STAGE(SZ_2M, PAGE_SIZE),
+};
+
+static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
{
- struct {
- uint64_t offset;
- uint64_t size;
- uint8_t pattern;
- } stages[] = {
- GUEST_STAGE(0, PAGE_SIZE),
- GUEST_STAGE(0, SZ_2M),
- GUEST_STAGE(PAGE_SIZE, PAGE_SIZE),
- GUEST_STAGE(PAGE_SIZE, SZ_2M),
- GUEST_STAGE(SZ_2M, PAGE_SIZE),
- };
const uint8_t init_p = 0xcc;
uint64_t j;
int i;
@@ -130,9 +131,9 @@ static void guest_run_test(uint64_t base_gpa, bool do_fallocate)
guest_sync_shared(base_gpa, PER_CPU_DATA_SIZE, (uint8_t)~init_p, init_p);
memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
- for (i = 0; i < ARRAY_SIZE(stages); i++) {
- uint64_t gpa = base_gpa + stages[i].offset;
- uint64_t size = stages[i].size;
+ for (i = 0; i < ARRAY_SIZE(test_ranges); i++) {
+ uint64_t gpa = base_gpa + test_ranges[i].offset;
+ uint64_t size = test_ranges[i].size;
uint8_t p1 = 0x11;
uint8_t p2 = 0x22;
uint8_t p3 = 0x33;
@@ -214,11 +215,11 @@ static void guest_run_test(uint64_t base_gpa, bool do_fallocate)
static void guest_code(uint64_t base_gpa)
{
/*
- * Run everything twice, with and without doing fallocate() on the
- * guest_memfd backing when converting between shared and private.
+ * Run the conversion test twice, with and without doing fallocate() on
+ * the guest_memfd backing when converting between shared and private.
*/
- guest_run_test(base_gpa, false);
- guest_run_test(base_gpa, true);
+ guest_test_explicit_conversion(base_gpa, false);
+ guest_test_explicit_conversion(base_gpa, true);
GUEST_DONE();
}
@@ -227,6 +228,7 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
struct kvm_run *run = vcpu->run;
uint64_t gpa = run->hypercall.args[0];
uint64_t size = run->hypercall.args[1] * PAGE_SIZE;
+ bool set_attributes = run->hypercall.args[2] & MAP_GPA_SET_ATTRIBUTES;
bool map_shared = run->hypercall.args[2] & MAP_GPA_SHARED;
bool do_fallocate = run->hypercall.args[2] & MAP_GPA_DO_FALLOCATE;
struct kvm_vm *vm = vcpu->vm;
@@ -238,8 +240,9 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
if (do_fallocate)
vm_guest_mem_fallocate(vm, gpa, size, map_shared);
- vm_set_memory_attributes(vm, gpa, size,
- map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+ if (set_attributes)
+ vm_set_memory_attributes(vm, gpa, size,
+ map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
run->hypercall.ret = 0;
}
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 12/13] KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (10 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 11/13] KVM: selftests: Refactor private mem conversions to prep for punch_hole test Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 13/13] KVM: Rename guest_mem.c to guest_memfd.c Sean Christopherson
2023-09-29 2:22 ` [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Add a PUNCH_HOLE testcase to the private memory conversions test that
verifies PUNCH_HOLE actually frees memory. Directly verifying that KVM
frees memory is impractical, if it's even possible, so instead indirectly
verify memory is freed by asserting that the guest reads zeroes after a
PUNCH_HOLE. E.g. if KVM zaps SPTEs but doesn't actually punch a hole in
the inode, the subsequent read will still see the previous value. And
obviously punching a hole shouldn't cause explosions.
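For reference, the host-side operation the test ultimately exercises is a
plain fallocate() hole punch on the guest_memfd file descriptor; a minimal
sketch from the VMM's point of view (the helper name is made up purely for
illustration):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Hypothetical helper: punch a hole in a guest_memfd fd, freeing the backing
 * pages for [offset, offset + len).  Per fallocate(2), PUNCH_HOLE must be
 * paired with KEEP_SIZE.
 */
static int gmem_punch_hole(int gmem_fd, off_t offset, off_t len)
{
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}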
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../kvm/x86_64/private_mem_conversions_test.c | 61 +++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index b80cf7342d0d..c04e7d61a585 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -212,6 +212,60 @@ static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
}
}
+static void guest_punch_hole(uint64_t gpa, uint64_t size)
+{
+ /* "Mapping" memory shared via fallocate() is done via PUNCH_HOLE. */
+ uint64_t flags = MAP_GPA_SHARED | MAP_GPA_DO_FALLOCATE;
+
+ kvm_hypercall_map_gpa_range(gpa, size, flags);
+}
+
+/*
+ * Test that PUNCH_HOLE actually frees memory by punching holes without doing a
+ * proper conversion. Freeing (PUNCH_HOLE) should zap SPTEs, and reallocating
+ * (subsequent fault) should zero memory.
+ */
+static void guest_test_punch_hole(uint64_t base_gpa, bool precise)
+{
+ const uint8_t init_p = 0xcc;
+ int i;
+
+ /*
+ * Convert the entire range to private, this testcase is all about
+ * punching holes in guest_memfd, i.e. shared mappings aren't needed.
+ */
+ guest_map_private(base_gpa, PER_CPU_DATA_SIZE, false);
+
+ for (i = 0; i < ARRAY_SIZE(test_ranges); i++) {
+ uint64_t gpa = base_gpa + test_ranges[i].offset;
+ uint64_t size = test_ranges[i].size;
+
+ /*
+ * Free all memory before each iteration, even for the !precise
+ * case where the memory will be faulted back in. Freeing and
+ * reallocating should obviously work, and freeing all memory
+ * minimizes the probability of cross-testcase influence.
+ */
+ guest_punch_hole(base_gpa, PER_CPU_DATA_SIZE);
+
+ /* Fault-in and initialize memory, and verify the pattern. */
+ if (precise) {
+ memset((void *)gpa, init_p, size);
+ memcmp_g(gpa, init_p, size);
+ } else {
+ memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
+ memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
+ }
+
+ /*
+ * Punch a hole at the target range and verify that reads from
+ * the guest succeed and return zeroes.
+ */
+ guest_punch_hole(gpa, size);
+ memcmp_g(gpa, 0, size);
+ }
+}
+
static void guest_code(uint64_t base_gpa)
{
/*
@@ -220,6 +274,13 @@ static void guest_code(uint64_t base_gpa)
*/
guest_test_explicit_conversion(base_gpa, false);
guest_test_explicit_conversion(base_gpa, true);
+
+ /*
+ * Run the PUNCH_HOLE test twice too, once with the entire guest_memfd
+ * faulted in, once with only the target range faulted in.
+ */
+ guest_test_punch_hole(base_gpa, false);
+ guest_test_punch_hole(base_gpa, true);
GUEST_DONE();
}
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH 13/13] KVM: Rename guest_mem.c to guest_memfd.c
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (11 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 12/13] KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase Sean Christopherson
@ 2023-09-21 20:33 ` Sean Christopherson
2023-09-29 2:22 ` [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-21 20:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
Use guest_memfd.c for the KVM_CREATE_GUEST_MEMFD implementation to make it
more obvious that the file holds more than generic "guest memory" APIs,
and to provide a stronger conceptual connection with memfd.c.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/Makefile.kvm | 2 +-
virt/kvm/{guest_mem.c => guest_memfd.c} | 0
2 files changed, 1 insertion(+), 1 deletion(-)
rename virt/kvm/{guest_mem.c => guest_memfd.c} (100%)
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index a5a61bbe7f4c..724c89af78af 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -12,4 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
-kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_mem.o
+kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_memfd.c
similarity index 100%
rename from virt/kvm/guest_mem.c
rename to virt/kvm/guest_memfd.c
--
2.42.0.515.g380fc7ccd1-goog
* Re: [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
2023-09-21 20:33 ` [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed Sean Christopherson
@ 2023-09-22 22:42 ` Michael Roth
2023-09-28 18:31 ` Sean Christopherson
0 siblings, 1 reply; 23+ messages in thread
From: Michael Roth @ 2023-09-22 22:42 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Binbin Wu
On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> Remove the restriction that a guest_memfd instance that supports hugepages
> can *only* be bound by memslots that are 100% compatible with hugepage
> mappings, and instead force KVM to use an order-0 mapping if the binding
> isn't compatible with hugepages.
>
> The intent of the draconian binding restriction was purely to simplify the
> guest_memfd implementation, e.g. to avoid repeating the existing logic in
> KVM x86 for precisely tracking which GFNs support hugepages. But
> checking that the binding's offset and size are compatible is just as easy
> to do when KVM wants to create a mapping.
>
> And on the other hand, completely rejecting bindings that are incompatible
> with hugepages makes it practically impossible for userspace to use a
> single guest_memfd instance for all guest memory, e.g. on x86 it would be
> impossible to skip the legacy VGA hole while still allowing hugepage
> mappings for the rest of guest memory.
>
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Link: https://lore.kernel.org/all/20230918163647.m6bjgwusc7ww5tyu@amd.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> virt/kvm/guest_mem.c | 54 ++++++++++++++++++++++----------------------
> 1 file changed, 27 insertions(+), 27 deletions(-)
>
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index 68528e9cddd7..4f3a313f5532 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -434,20 +434,6 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
> return err;
> }
>
> -static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
> -{
> - if (size < 0 || !PAGE_ALIGNED(size))
> - return false;
> -
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> - !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> - return false;
> -#endif
> -
> - return true;
> -}
> -
> int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> {
> loff_t size = args->size;
> @@ -460,9 +446,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> if (flags & ~valid_flags)
> return -EINVAL;
>
> - if (!kvm_gmem_is_valid_size(size, flags))
> + if (size < 0 || !PAGE_ALIGNED(size))
> return -EINVAL;
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> + !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> + return -EINVAL;
> +#endif
> +
> return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
> }
>
> @@ -470,7 +462,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> unsigned int fd, loff_t offset)
> {
> loff_t size = slot->npages << PAGE_SHIFT;
> - unsigned long start, end, flags;
> + unsigned long start, end;
> struct kvm_gmem *gmem;
> struct inode *inode;
> struct file *file;
> @@ -489,16 +481,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> goto err;
>
> inode = file_inode(file);
> - flags = (unsigned long)inode->i_private;
>
> - /*
> - * For simplicity, require the offset into the file and the size of the
> - * memslot to be aligned to the largest possible page size used to back
> - * the file (same as the size of the file itself).
> - */
> - if (!kvm_gmem_is_valid_size(offset, flags) ||
> - !kvm_gmem_is_valid_size(size, flags))
> - goto err;
> + if (offset < 0 || !PAGE_ALIGNED(offset))
> + return -EINVAL;
>
> if (offset + size > i_size_read(inode))
> goto err;
> @@ -599,8 +584,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> page = folio_file_page(folio, index);
>
> *pfn = page_to_pfn(page);
> - if (max_order)
> - *max_order = compound_order(compound_head(page));
> + if (!max_order)
> + goto success;
> +
> + *max_order = compound_order(compound_head(page));
> + if (!*max_order)
> + goto success;
> +
> + /*
> + * For simplicity, allow mapping a hugepage if and only if the entire
> + * binding is compatible, i.e. don't bother supporting mapping interior
> + * sub-ranges with hugepages (unless userspace comes up with a *really*
> + * strong use case for needing hugepages within unaligned bindings).
> + */
> + if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> + !IS_ALIGNED(slot->npages, 1ull << *max_order))
> + *max_order = 0;
Thanks for working this in. Unfortunately on x86 the bulk of guest memory
ends up getting slotted directly above legacy regions at GFN 0x100, so the
associated slot still ends up failing these alignment checks if it tries to
match the gmemfd offsets up with the shared RAM/memfd offsets.
I tried to work around it in userspace by padding the gmemfd offset of
each slot to the next 2M boundary, but that also requires dynamically
growing the gmemfd inode to account for the padding of each new slot and
it gets ugly enough that I'm not sure it's any better than your
suggested alternative of using a unique gmemfd for each slot.
But what if we relax the check to simply make sure that any large folio
is fully contained by the range of the slot it is bound to? It *seems*
like that would still avoid stuff like mapping 2M pages in the NPT (or
setting up 2M RMP table entries) that aren't fully contained by a slot
while still allowing the bulk of guest memory to get mapped as 2M. Are
there other edge cases to consider?
The following seems to work for a basic 16GB SNP guest at least:
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 9109bf5751ee..e73128d4ebc2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -618,6 +618,7 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
{
pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+ pgoff_t huge_index;
struct kvm_gmem *gmem;
struct folio *folio;
struct page *page;
@@ -662,9 +663,12 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
* sub-ranges with hugepages (unless userspace comes up with a *really*
* strong use case for needing hugepages within unaligned bindings).
*/
- if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
- !IS_ALIGNED(slot->npages, 1ull << *max_order))
+ huge_index = round_down(index, 1ull << *max_order);
+ if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+ huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {
+ pr_debug("%s: GFN %llx failed alignment checks\n", __func__, gfn);
*max_order = 0;
+ }
success:
r = 0;
-Mike
> +success:
> r = 0;
>
> out_unlock:
> --
> 2.42.0.515.g380fc7ccd1-goog
>
* Re: [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction
2023-09-21 20:33 ` [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
@ 2023-09-27 3:16 ` Binbin Wu
2023-09-28 18:11 ` Sean Christopherson
0 siblings, 1 reply; 23+ messages in thread
From: Binbin Wu @ 2023-09-27 3:16 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm, linux-kernel, Paolo Bonzini, Michael Roth
On 9/22/2023 4:33 AM, Sean Christopherson wrote:
> Add an assertion that there are no in-progress MMU invalidations when a
> VM is being destroyed, with the exception of the scenario where KVM
> unregisters its MMU notifier between an .invalidate_range_start() call and
> the corresponding .invalidate_range_end().
>
> KVM can't detect unpaired calls from the mmu_notifier due to the above
> exception waiver, but the assertion can detect KVM bugs, e.g. such as the
> bug that *almost* escaped initial guest_memfd development.
>
> Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> virt/kvm/kvm_main.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 54480655bcce..277afeedd670 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1381,9 +1381,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
> * No threads can be waiting in kvm_swap_active_memslots() as the
> * last reference on KVM has been dropped, but freeing
> * memslots would deadlock without this manual intervention.
> + *
> + * If the count isn't unbalanced, i.e. KVM did NOT unregister between
Nit: readers can infer it from the code context, but would it be better to
add "MMU notifier" to clarify what is being unregistered and make the comment
easier to understand?
> + * a start() and end(), then there shouldn't be any in-progress
> + * invalidations.
> */
> WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
> - kvm->mn_active_invalidate_count = 0;
> + if (kvm->mn_active_invalidate_count)
> + kvm->mn_active_invalidate_count = 0;
> + else
> + WARN_ON(kvm->mmu_invalidate_in_progress);
> #else
> kvm_flush_shadow_all(kvm);
> #endif
* Re: [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots
2023-09-21 20:33 ` [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots Sean Christopherson
@ 2023-09-27 6:01 ` Binbin Wu
2023-09-27 14:25 ` Sean Christopherson
0 siblings, 1 reply; 23+ messages in thread
From: Binbin Wu @ 2023-09-27 6:01 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm, linux-kernel, Paolo Bonzini, Michael Roth
On 9/22/2023 4:33 AM, Sean Christopherson wrote:
> Track the effects of private attributes on potential hugepage mappings if
> the VM supports private memory, i.e. even if the target memslot can only
> ever be mapped shared. If userspace configures a chunk of memory as
> private, KVM must not allow that memory to be mapped shared regardless of
> whether or not the *current* memslot can be mapped private. E.g. if the
> guest accesses a private range using a shared memslot, then KVM must exit
> to userspace.
How does userspace handle this case?
IIRC, the gmem v12 patch set says a memslot cannot be converted from shared to
private.
So, should userspace delete the old memslot and add a new one?
> Fixes: 5bb0b4e162d1 ("KVM: x86: Disallow hugepages when memory attributes are mixed")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 269d4dc47c98..148931cf9dba 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7314,10 +7314,12 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> lockdep_assert_held(&kvm->slots_lock);
>
> /*
> - * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
> - * the slot if the slot will never consume the PRIVATE attribute.
> + * Calculate which ranges can be mapped with hugepages even if the slot
> + * can't map memory PRIVATE. KVM mustn't create a SHARED hugepage over
> + * a range that has PRIVATE GFNs, and conversely converting a range to
> + * SHARED may now allow hugepages.
> */
> - if (!kvm_slot_can_be_private(slot))
> + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
> return false;
>
> /*
> @@ -7372,7 +7374,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
> {
> int level;
>
> - if (!kvm_slot_can_be_private(slot))
> + if (!kvm_arch_has_private_mem(kvm))
> return;
>
> for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
* Re: [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots
2023-09-27 6:01 ` Binbin Wu
@ 2023-09-27 14:25 ` Sean Christopherson
0 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-27 14:25 UTC (permalink / raw)
To: Binbin Wu; +Cc: kvm, linux-kernel, Paolo Bonzini, Michael Roth
On Wed, Sep 27, 2023, Binbin Wu wrote:
>
> On 9/22/2023 4:33 AM, Sean Christopherson wrote:
> > Track the effects of private attributes on potential hugepage mappings if
> > the VM supports private memory, i.e. even if the target memslot can only
> > ever be mapped shared. If userspace configures a chunk of memory as
> > private, KVM must not allow that memory to be mapped shared regardless of
> > whether or not the *current* memslot can be mapped private. E.g. if the
> > guest accesses a private range using a shared memslot, then KVM must exit
> > to userspace.
> How does userspace handle this case?
> IIRC, the gmem v12 patch set says a memslot cannot be converted from shared to
> private.
> So, should userspace delete the old memslot and add a new one?
That depends on the contract between userspace and the VM, e.g. if the access is
to a range that the VMM and the guest have agreed is shared-only, then the VMM
could also "resolve" the access by injecting an error or killing the VM.
But yes, deleting the memslot and creating a new one is the only approach that
will work if userspace wants to map the gfn as private. I don't actually expect
any real world VMMs to actually do anything like this, the purpose of this change
is mainly to ensure KVM has robust, consistent behavior.
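For illustration, a hedged sketch of that delete-and-recreate flow from the
VMM side; the struct, flag, and ioctl names (kvm_userspace_memory_region2,
KVM_MEM_GUEST_MEMFD, KVM_SET_USER_MEMORY_REGION2) follow the guest_memfd uAPI
and may differ slightly between series revisions, and error handling is
omitted:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void recreate_slot_as_private(int vm_fd, __u32 slot, __u64 gpa,
				     __u64 size, __u64 uva, int gmem_fd)
{
	struct kvm_userspace_memory_region2 region = {
		.slot = slot,
		.guest_phys_addr = gpa,
		.memory_size = 0,	/* memory_size == 0 deletes the slot */
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	/* Recreate the slot with a guest_memfd binding so it can map private. */
	region.flags = KVM_MEM_GUEST_MEMFD;
	region.memory_size = size;
	region.userspace_addr = uva;
	region.guest_memfd = gmem_fd;
	region.guest_memfd_offset = 0;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}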
* Re: [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction
2023-09-27 3:16 ` Binbin Wu
@ 2023-09-28 18:11 ` Sean Christopherson
0 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-28 18:11 UTC (permalink / raw)
To: Binbin Wu; +Cc: kvm, linux-kernel, Paolo Bonzini, Michael Roth
On Wed, Sep 27, 2023, Binbin Wu wrote:
>
>
> On 9/22/2023 4:33 AM, Sean Christopherson wrote:
> > Add an assertion that there are no in-progress MMU invalidations when a
> > VM is being destroyed, with the exception of the scenario where KVM
> > unregisters its MMU notifier between an .invalidate_range_start() call and
> > the corresponding .invalidate_range_end().
> >
> > KVM can't detect unpaired calls from the mmu_notifier due to the above
> > exception waiver, but the assertion can detect KVM bugs, e.g. such as the
> > bug that *almost* escaped initial guest_memfd development.
> >
> > Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > virt/kvm/kvm_main.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 54480655bcce..277afeedd670 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1381,9 +1381,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > * No threads can be waiting in kvm_swap_active_memslots() as the
> > * last reference on KVM has been dropped, but freeing
> > * memslots would deadlock without this manual intervention.
> > + *
> > + * If the count isn't unbalanced, i.e. KVM did NOT unregister between
> Nit: readers can infer it from the code context, but would it be better to
> add "MMU notifier" to clarify what is being unregistered and make the comment
> easier to understand?
Agreed, I'll add that when applying.
* Re: [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
2023-09-22 22:42 ` Michael Roth
@ 2023-09-28 18:31 ` Sean Christopherson
2023-10-02 15:53 ` Michael Roth
0 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2023-09-28 18:31 UTC (permalink / raw)
To: Michael Roth; +Cc: Paolo Bonzini, kvm, linux-kernel, Binbin Wu
On Fri, Sep 22, 2023, Michael Roth wrote:
> On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> > + /*
> > + * For simplicity, allow mapping a hugepage if and only if the entire
> > + * binding is compatible, i.e. don't bother supporting mapping interior
> > + * sub-ranges with hugepages (unless userspace comes up with a *really*
> > + * strong use case for needing hugepages within unaligned bindings).
> > + */
> > + if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> > + !IS_ALIGNED(slot->npages, 1ull << *max_order))
> > + *max_order = 0;
>
> Thanks for working this in. Unfortunately on x86 the bulk of guest memory
> ends up getting slotted directly above legacy regions at GFN 0x100,
Can you provide an example? I'm struggling to understand what the layout actually
is. I don't think it changes the story for the kernel, but it sounds like there
might be room for optimization in QEMU? Or more likely, I just don't understand
what you're saying :-)
> so the associated slot still ends up failing these alignment checks if it
> tries to match the gmemfd offsets up with the shared RAM/memfd offsets.
>
> I tried to work around it in userspace by padding the gmemfd offset of
> each slot to the next 2M boundary, but that also requires dynamically
> growing the gmemfd inode to account for the padding of each new slot and
> it gets ugly enough that I'm not sure it's any better than your
> suggested alternative of using a unique gmemfd for each slot.
>
> But what if we relax the check to simply make sure that any large folio
> is fully contained by the range of the slot it is bound to? It *seems*
> like that would still avoid stuff like mapping 2M pages in the NPT (or
> setting up 2M RMP table entries) that aren't fully contained by a slot
> while still allowing the bulk of guest memory to get mapped as 2M. Are
> there other edge cases to consider?
>
> The following seems to work for a basic 16GB SNP guest at least:
>
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index 9109bf5751ee..e73128d4ebc2 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -618,6 +618,7 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
> {
> pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> + pgoff_t huge_index;
> struct kvm_gmem *gmem;
> struct folio *folio;
> struct page *page;
> @@ -662,9 +663,12 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> * sub-ranges with hugepages (unless userspace comes up with a *really*
> * strong use case for needing hugepages within unaligned bindings).
> */
> - if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> - !IS_ALIGNED(slot->npages, 1ull << *max_order))
> + huge_index = round_down(index, 1ull << *max_order);
Why not use ALIGN() here? The size is obviously a power-of-2. Or is my math
even worse than I thought?
> + if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
> + huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {
Argh, I keep forgetting that the MMU is responsible for handling misaligned gfns.
Yeah, this looks right.
Can you post this as a proper patch, on top of my fixes? And without the pr_debug().
That'll be the easiest for me to apply+squash when the time comes.
Thanks much!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 00/13] KVM: guest_memfd fixes
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
` (12 preceding siblings ...)
2023-09-21 20:33 ` [PATCH 13/13] KVM: Rename guest_mem.c to guest_memfd.c Sean Christopherson
@ 2023-09-29 2:22 ` Sean Christopherson
13 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-09-29 2:22 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Michael Roth, Binbin Wu
On Thu, 21 Sep 2023 13:33:17 -0700, Sean Christopherson wrote:
> Fix a variety of bugs in the guest_memfd series, almost all of which are
> my fault, and add assertions and testcases to detect future regressions.
>
> The last patch, renaming guest_mem.c to guest_memfd.c, is obviously not a
> bug fix, I included it here so that if we want to go with guest_memfd.c,
> squashing everything will be straightforward.
>
> [...]
Applied to kvm-x86 guest_memfd. I'll apply Mike's hugepage fix on top (when it
arrives), will send out a patch to fix the off-by-one reported by Binbin, and
will post a miniseries to clean up KVM_EXIT_MEMORY_FAULT.
[01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
https://github.com/kvm-x86/linux/commit/46c10adeda81
[02/13] KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd
https://github.com/kvm-x86/linux/commit/936144404469
[03/13] KVM: WARN if *any* MMU invalidation sequence doesn't add a range
https://github.com/kvm-x86/linux/commit/1912c5dff3ac
[04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction
https://github.com/kvm-x86/linux/commit/37bbf72db864
[05/13] KVM: Fix MMU invalidation bookkeeping in guest_memfd
https://github.com/kvm-x86/linux/commit/b25ab9cae30f
[06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
https://github.com/kvm-x86/linux/commit/1c297b84f3a4
[07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots
https://github.com/kvm-x86/linux/commit/26cf4453d2d9
[08/13] KVM: x86/mmu: Zap shared-only memslots when private attribute changes
https://github.com/kvm-x86/linux/commit/fb6f779719ca
[09/13] KVM: Always add relevant ranges to invalidation set when changing attributes
https://github.com/kvm-x86/linux/commit/69c7916df569
[10/13] KVM: x86/mmu: Drop repeated add() of to-be-invalidated range
https://github.com/kvm-x86/linux/commit/e6b1a6922470
[11/13] KVM: selftests: Refactor private mem conversions to prep for punch_hole test
https://github.com/kvm-x86/linux/commit/5782107f5d2b
[12/13] KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase
https://github.com/kvm-x86/linux/commit/848d5faa2099
[13/13] KVM: Rename guest_mem.c to guest_memfd.c
https://github.com/kvm-x86/linux/commit/6a92dc57b0e6
--
https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
2023-09-28 18:31 ` Sean Christopherson
@ 2023-10-02 15:53 ` Michael Roth
2023-10-02 16:49 ` Sean Christopherson
0 siblings, 1 reply; 23+ messages in thread
From: Michael Roth @ 2023-10-02 15:53 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Binbin Wu
On Thu, Sep 28, 2023 at 11:31:51AM -0700, Sean Christopherson wrote:
> On Fri, Sep 22, 2023, Michael Roth wrote:
> > On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> > > + /*
> > > + * For simplicity, allow mapping a hugepage if and only if the entire
> > > + * binding is compatible, i.e. don't bother supporting mapping interior
> > > + * sub-ranges with hugepages (unless userspace comes up with a *really*
> > > + * strong use case for needing hugepages within unaligned bindings).
> > > + */
> > > + if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> > > + !IS_ALIGNED(slot->npages, 1ull << *max_order))
> > > + *max_order = 0;
> >
> > Thanks for working this in. Unfortunately on x86 the bulk of guest memory
> > ends up getting slotted directly above legacy regions at GFN 0x100,
>
> Can you provide an example? I'm struggling to understand what the layout actually
> is. I don't think it changes the story for the kernel, but it sounds like there
> might be room for optimization in QEMU? Or more likely, I just don't understand
> what you're saying :-)
Here's one example, which seems to be fairly normal for an x86 boot:
kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
^ QEMU creates Slot 0 for all of main guest RAM
kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x0 gpa=0x0 size=0x0 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0xc0000 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
kvm_set_user_memory AddrSpace#0 Slot#3 flags=0x6 gpa=0xc0000 size=0x20000 ua=0x7f2575000000 ret=0 restricted_fd=33 restricted_offset=0x0
kvm_set_user_memory AddrSpace#0 Slot#4 flags=0x6 gpa=0xe0000 size=0x20000 ua=0x7f2575400000 ret=0 restricted_fd=31 restricted_offset=0x0
^ legacy regions are created and mapped on top of GPA ranges [0xc0000:0xe0000) and [0xe0000:0x100000)
kvm_set_user_memory AddrSpace#0 Slot#5 flags=0x4 gpa=0x100000 size=0x7ff00000 ua=0x7f24afd00000 ret=0 restricted_fd=19 restricted_offset=0x100000
^ QEMU divides Slot 0 into Slot 0 at [0x0:0xc0000) and Slot 5 at [0x100000:0x80000000)
Both slots still share the same backing memory allocation, so the same gmem
fd 19 is used, but Slot 5 is assigned to offset 0x100000, which is not
2M-aligned.
I tried messing with the QEMU handling to pad out guest_memfd offsets to 2MB
boundaries, but then the inode size needs to be enlarged to account for the
padding and things get a bit messy. I'm not sure there are alternative
approaches that can be taken from userspace; with normal malloc()'d or
mmap()'d backing memory the kernel can still allocate a 2MB backing page for
the [0x0:0x200000) range, and I think KVM still handles that when setting up
the NPT for sub-ranges, so there might not be much room for further
optimization there.
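
To make the numbers concrete, here's a small standalone sketch of the
alignment math for Slot 5 above; the macros are simplified, power-of-two-only
stand-ins for the kernel ones, purely for illustration:

  #include <stdio.h>

  /* Simplified, power-of-two-only stand-ins for the kernel macros. */
  #define ALIGN_UP(x, a)    (((x) + (a) - 1) & ~((unsigned long long)(a) - 1))
  #define ROUND_DOWN(x, a)  ((x) & ~((unsigned long long)(a) - 1))
  #define IS_ALIGNED(x, a)  (((x) & ((a) - 1)) == 0)

  int main(void)
  {
          /* Slot 5: gpa 0x100000, size 0x7ff00000, gmem offset 0x100000. */
          unsigned long long pgoff = 0x100;      /* 0x100000 >> 12 */
          unsigned long long npages = 0x7ff00;   /* 0x7ff00000 >> 12 */
          unsigned long long nr_huge = 512;      /* 2MiB in 4KiB pages */
          unsigned long long index;

          /* The current check disables hugepages for the entire slot. */
          printf("pgoff aligned: %d, npages aligned: %d\n",
                 IS_ALIGNED(pgoff, nr_huge), IS_ALIGNED(npages, nr_huge));

          /*
           * The relaxed check only falls back to 4KiB where the 2MiB folio
           * would spill outside the binding, i.e. the first 1MiB of the slot.
           */
          for (index = pgoff; index < pgoff + 0x400; index += 0x100) {
                  unsigned long long huge_index = ROUND_DOWN(index, nr_huge);
                  int ok = huge_index >= ALIGN_UP(pgoff, nr_huge) &&
                           huge_index + nr_huge <= pgoff + npages;

                  printf("gmem index 0x%llx -> %s\n",
                         index, ok ? "2MiB ok" : "4KiB only");
          }
          return 0;
  }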
>
> > so the associated slot still ends up failing these alignment checks if it
> > tries to match the gmemfd offsets up with the shared RAM/memfd offsets.
> >
> > I tried to work around it in userspace by padding the gmemfd offset of
> > each slot to the next 2M boundary, but that also requires dynamically
> > growing the gmemfd inode to account for the padding of each new slot and
> > it gets ugly enough that I'm not sure it's any better than your
> > suggested alternative of using a unique gmemfd for each slot.
> >
> > But what if we relax the check to simply make sure that any large folio
> > is fully contained by the range of the slot it is bound to? It *seems*
> > like that would still avoid stuff like mapping 2M pages in the NPT (or
> > setting up 2M RMP table entries) that aren't fully contained by a slot
> > while still allowing the bulk of guest memory to get mapped as 2M. Are
> > there other edge cases to consider?
> >
> > The following seems to work for a basic 16GB SNP guest at least:
> >
> > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > index 9109bf5751ee..e73128d4ebc2 100644
> > --- a/virt/kvm/guest_mem.c
> > +++ b/virt/kvm/guest_mem.c
> > @@ -618,6 +618,7 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
> > {
> > pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> > + pgoff_t huge_index;
> > struct kvm_gmem *gmem;
> > struct folio *folio;
> > struct page *page;
> > @@ -662,9 +663,12 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > * sub-ranges with hugepages (unless userspace comes up with a *really*
> > * strong use case for needing hugepages within unaligned bindings).
> > */
> > - if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> > - !IS_ALIGNED(slot->npages, 1ull << *max_order))
> > + huge_index = round_down(index, 1ull << *max_order);
>
> Why not use ALIGN() here? The size is obviously a power-of-2. Or is my math
> even worse than I thought?
I originally used round_down() only because kvm_gmem_get_huge_folio() took
that approach, but I still ended up using ALIGN() below, so sorry if the
inconsistency caused any confusion. I've switched to using ALIGN() above
and it works fine.
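
For reference, a tiny sketch of the rounding directions involved, using
simplified power-of-two-only stand-ins for the kernel macros (ALIGN() rounds
up, ALIGN_DOWN()/round_down() round down):

  #include <assert.h>

  /* Simplified, power-of-two-only stand-ins for the kernel macros. */
  #define ALIGN(x, a)       (((x) + (a) - 1) & ~((unsigned long long)(a) - 1))
  #define ALIGN_DOWN(x, a)  ((x) & ~((unsigned long long)(a) - 1))
  #define round_down(x, y)  ALIGN_DOWN(x, y)  /* equivalent for power-of-two y */

  int main(void)
  {
          unsigned long long nr = 1ull << 9;  /* 2MiB order in 4KiB pages */

          assert(round_down(0x200ull, nr) == 0x200);  /* already aligned */
          assert(round_down(0x2ffull, nr) == 0x200);  /* rounds down */
          assert(ALIGN(0x100ull, nr) == 0x200);       /* rounds up */
          assert(ALIGN_DOWN(0x100ull, nr) == 0x0);    /* rounds down */
          return 0;
  }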
>
> > + if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
> > + huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {
>
> Argh, I keep forgetting that the MMU is responsible for handling misaligned gfns.
> Yeah, this looks right.
>
> Can you post this as a proper patch, on top of my fixes? And without the pr_debug().
> That'll be the easiest for me to apply+squash when the time comes.
Sure, I submitted a revised patch on top of kvm-x86:
https://lore.kernel.org/lkml/20231002133342.195882-1-michael.roth@amd.com/T/#u
I ran into a separate issue trying to test it and submitted a patch for that
here:
https://lore.kernel.org/lkml/20231002133230.195738-1-michael.roth@amd.com/T/#u
-Mike
>
> Thanks much!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
2023-10-02 15:53 ` Michael Roth
@ 2023-10-02 16:49 ` Sean Christopherson
0 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2023-10-02 16:49 UTC (permalink / raw)
To: Michael Roth; +Cc: Paolo Bonzini, kvm, linux-kernel, Binbin Wu
On Mon, Oct 02, 2023, Michael Roth wrote:
> On Thu, Sep 28, 2023 at 11:31:51AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 22, 2023, Michael Roth wrote:
> > > On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> > > > + /*
> > > > + * For simplicity, allow mapping a hugepage if and only if the entire
> > > > + * binding is compatible, i.e. don't bother supporting mapping interior
> > > > + * sub-ranges with hugepages (unless userspace comes up with a *really*
> > > > + * strong use case for needing hugepages within unaligned bindings).
> > > > + */
> > > > + if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> > > > + !IS_ALIGNED(slot->npages, 1ull << *max_order))
> > > > + *max_order = 0;
> > >
> > > Thanks for working this in. Unfortunately on x86 the bulk of guest memory
> > > ends up getting slotted directly above legacy regions at GFN 0x100,
> >
> > Can you provide an example? I'm struggling to understand what the layout actually
> > is. I don't think it changes the story for the kernel, but it sounds like there
> > might be room for optimization in QEMU? Or more likely, I just don't understand
> > what you're saying :-)
>
> Here's one example, which seems to be fairly normal for an x86 boot:
>
> kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
> ^ QEMU creates Slot 0 for all of main guest RAM
> kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x0 gpa=0x0 size=0x0 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
> kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0xc0000 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
> kvm_set_user_memory AddrSpace#0 Slot#3 flags=0x6 gpa=0xc0000 size=0x20000 ua=0x7f2575000000 ret=0 restricted_fd=33 restricted_offset=0x0
> kvm_set_user_memory AddrSpace#0 Slot#4 flags=0x6 gpa=0xe0000 size=0x20000 ua=0x7f2575400000 ret=0 restricted_fd=31 restricted_offset=0x0
> ^ legacy regions are created and mapped on top of GPA ranges [0xc0000:0xe0000) and [0xe0000:0x100000)
> kvm_set_user_memory AddrSpace#0 Slot#5 flags=0x4 gpa=0x100000 size=0x7ff00000 ua=0x7f24afd00000 ret=0 restricted_fd=19 restricted_offset=0x100000
> ^ QEMU divides Slot 0 into Slot 0 at [0x0:0xc0000) and Slot 5 at [0x100000:0x80000000)
> Both slots still share the same backing memory allocation, so the same gmem
> fd 19 is used, but Slot 5 is assigned to offset 0x100000, which is not
> 2M-aligned.
>
> I tried messing with the QEMU handling to pad out guest_memfd offsets to 2MB
> boundaries, but then the inode size needs to be enlarged to account for the
> padding and things get a bit messy. I'm not sure there are alternative
> approaches that can be taken from userspace; with normal malloc()'d or
> mmap()'d backing memory the kernel can still allocate a 2MB backing page for
> the [0x0:0x200000) range, and I think KVM still handles that when setting up
> the NPT for sub-ranges, so there might not be much room for further
> optimization there.
Oooh, duh. QEMU intentionally creates a gap for the VGA and/or BIOS holes, and
so the lower DRAM chunk that goes from the end of the system reserved chunk to
TOLUD starts at an unaligned offset, even though 99% of the slot is properly
aligned.
Yeah, KVM definitely needs to support that. Requiring userspace to align based
on the hugepage size could work, e.g. QEMU could divide slot 5 into N slots, to
end up with a series of slots to get from 4KiB aligned => 2MiB aligned => 1GiB
aligned. But pushing for that would be beyond stubborn.
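Purely as a hypothetical sketch (the boundaries and the 4KiB/2MiB/1GiB split
are made up for illustration, and again, I'm not suggesting we actually
require this), the Slot 5 range above would end up carved into something like:

  /* Hypothetical userspace-side split of [0x100000, 0x80000000) by alignment. */
  struct slot_plan {
          unsigned long long gpa, size;
          const char *max_page;
  };

  static const struct slot_plan plan[] = {
          { 0x00100000ULL, 0x00100000ULL, "4KiB" },  /* up to the first 2MiB boundary */
          { 0x00200000ULL, 0x3fe00000ULL, "2MiB" },  /* up to the first 1GiB boundary */
          { 0x40000000ULL, 0x40000000ULL, "1GiB" },  /* the remainder, 1GiB-aligned */
  };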
Thanks for being patient :-)
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads: [~2023-10-02 16:49 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
2023-09-21 20:33 ` [PATCH 01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
2023-09-21 20:33 ` [PATCH 02/13] KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd Sean Christopherson
2023-09-21 20:33 ` [PATCH 03/13] KVM: WARN if *any* MMU invalidation sequence doesn't add a range Sean Christopherson
2023-09-21 20:33 ` [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
2023-09-27 3:16 ` Binbin Wu
2023-09-28 18:11 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 05/13] KVM: Fix MMU invalidation bookkeeping in guest_memfd Sean Christopherson
2023-09-21 20:33 ` [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed Sean Christopherson
2023-09-22 22:42 ` Michael Roth
2023-09-28 18:31 ` Sean Christopherson
2023-10-02 15:53 ` Michael Roth
2023-10-02 16:49 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots Sean Christopherson
2023-09-27 6:01 ` Binbin Wu
2023-09-27 14:25 ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 08/13] KVM: x86/mmu: Zap shared-only memslots when private attribute changes Sean Christopherson
2023-09-21 20:33 ` [PATCH 09/13] KVM: Always add relevant ranges to invalidation set when changing attributes Sean Christopherson
2023-09-21 20:33 ` [PATCH 10/13] KVM: x86/mmu: Drop repeated add() of to-be-invalidated range Sean Christopherson
2023-09-21 20:33 ` [PATCH 11/13] KVM: selftests: Refactor private mem conversions to prep for punch_hole test Sean Christopherson
2023-09-21 20:33 ` [PATCH 12/13] KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase Sean Christopherson
2023-09-21 20:33 ` [PATCH 13/13] KVM: Rename guest_mem.c to guest_memfd.c Sean Christopherson
2023-09-29 2:22 ` [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).