* [PATCH 0/3] KVM: x86/mmu: Don't zap "direct" non-leaf SPTEs on memslot removal
@ 2024-10-09 19:23 Sean Christopherson
2024-10-09 19:23 ` [PATCH 1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Sean Christopherson @ 2024-10-09 19:23 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel, Yan Zhao
When doing a targeted zap during memslot removal (DELETE or MOVE), zap only
leaf SPTEs and non-leaf SPTEs with gPTEs, as zapping non-leaf SPTEs without
gPTEs is unnecessary and weird because sp->gfn for such SPTEs is rounded,
i.e. not tightly coupled to the memslot.
Massage the related documentation so that KVM doesn't get stuck maintaining
undesirable ABI (again), and opportunistically add a lockdep assertion in
kvm_unmap_gfn_range() (largely because I keep forgetting that memslot updates
are special).
Sean Christopherson (3):
KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
KVM: x86/mmu: Add lockdep assert to enforce safe usage of
kvm_unmap_gfn_range()
KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
Documentation/virt/kvm/api.rst | 16 +++++++++-------
arch/x86/kvm/mmu/mmu.c | 26 ++++++++++++++++----------
2 files changed, 25 insertions(+), 17 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
--
2.47.0.rc1.288.g06298d1525-goog
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
2024-10-09 19:23 [PATCH 0/3] KVM: x86/mmu: Don't zap "direct" non-leaf SPTEs on memslot removal Sean Christopherson
@ 2024-10-09 19:23 ` Sean Christopherson
2024-10-10 7:59 ` Yan Zhao
2024-10-09 19:23 ` [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() Sean Christopherson
2024-10-09 19:23 ` [PATCH 3/3] KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL Sean Christopherson
2 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2024-10-09 19:23 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel, Yan Zhao
When performing a targeted zap on memslot removal, zap only MMU pages that
shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't
exist, because it doesn't do what most people would expect it to do.
The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
means that the exact gfn comparison will not get a match, even when a SP
does "cover" a gfn, or was even created specifically for a gfn.
For memslot deletion specifically, KVM's behavior will vary significantly
based on the size and alignment of a memslot, and in weird ways. E.g. for
a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
it's only 4KiB aligned. And as described below, zapping SPs in the
aligned case overzaps for direct MMUs, as odds are good the upper-level
SPs are serving other memslots.
To iterate over all potentially-relevant gfns, KVM would need to make a
pass over the hash table for each level, with the gfn used for lookup
rounded for said level. And then check that the SP is of the correct
level, too, e.g. to avoid over-zapping.
But even then, KVM would massively overzap, as processing every level is
all but guaranteed to zap SPs that serve other memslots, especially if the
memslot being removed is relatively small. KVM could mitigate that issue
by processing only levels that can be possible guest huge pages, i.e. are
less likely to be re-used for other memslots, but while somewhat logical,
that's quite arbitrary and would be a bit of a mess to implement.
So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
absolutely must be zapped.
Cc: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a9a23e058555..09494d01c38e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1884,14 +1884,10 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
if (is_obsolete_sp((_kvm), (_sp))) { \
} else
-#define for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
+#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
for_each_valid_sp(_kvm, _sp, \
&(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
- if ((_sp)->gfn != (_gfn)) {} else
-
-#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
- for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
- if (!sp_has_gptes(_sp)) {} else
+ if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
@@ -7063,15 +7059,15 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
/*
* Since accounting information is stored in struct kvm_arch_memory_slot,
- * shadow pages deletion (e.g. unaccount_shadowed()) requires that all
- * gfns with a shadow page have a corresponding memslot. Do so before
- * the memslot goes away.
+ * all MMU pages that are shadowing guest PTEs must be zapped before the
+ * memslot is deleted, as freeing such pages after the memslot is freed
+ * will result in use-after-free, e.g. in unaccount_shadowed().
*/
for (i = 0; i < slot->npages; i++) {
struct kvm_mmu_page *sp;
gfn_t gfn = slot->base_gfn + i;
- for_each_gfn_valid_sp(kvm, sp, gfn)
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
--
2.47.0.rc1.288.g06298d1525-goog
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
2024-10-09 19:23 [PATCH 0/3] KVM: x86/mmu: Don't zap "direct" non-leaf SPTEs on memslot removal Sean Christopherson
2024-10-09 19:23 ` [PATCH 1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot Sean Christopherson
@ 2024-10-09 19:23 ` Sean Christopherson
2024-10-10 7:01 ` Yan Zhao
2024-10-09 19:23 ` [PATCH 3/3] KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL Sean Christopherson
2 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2024-10-09 19:23 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel, Yan Zhao
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
mmu_invalidate_in_progress is elevated, or that the range is being zapped
due to memslot removal (loosely detected by slots_lock being held).
Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
as KVM's page fault path snapshots state before acquiring mmu_lock, and
thus can create SPTEs with stale information if vCPUs aren't forced to
retry faults (due to seeing an in-progress or past MMU invalidation).
Memslot removal is a special case, as the memslot is retrieved outside of
mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
instead relies on SRCU synchronization to ensure any in-flight page faults
are fully resolved before zapping SPTEs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 09494d01c38e..c6716fd3666f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1556,6 +1556,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
+ /*
+ * To prevent races with vCPUs faulting in a gfn using stale data,
+ * zapping a gfn range must be protected by mmu_invalidate_in_progress
+ * (and mmu_invalidate_seq). The only exception is memslot deletion,
+ * in which case SRCU synchronization ensures SPTEs are zapped after all
+ * vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
+ */
+ lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+ lockdep_is_held(&kvm->slots_lock));
+
if (kvm_memslots_have_rmaps(kvm))
flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
range->start, range->end,
--
2.47.0.rc1.288.g06298d1525-goog
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/3] KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
2024-10-09 19:23 [PATCH 0/3] KVM: x86/mmu: Don't zap "direct" non-leaf SPTEs on memslot removal Sean Christopherson
2024-10-09 19:23 ` [PATCH 1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot Sean Christopherson
2024-10-09 19:23 ` [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() Sean Christopherson
@ 2024-10-09 19:23 ` Sean Christopherson
2 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2024-10-09 19:23 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel, Yan Zhao
Massage the documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL to call out that
it applies to moved memslots as well as deleted memslots, to avoid KVM's
"fast zap" terminology (which has no meaning for userspace), and to reword
the documented targeted zap behavior to specifically say that KVM _may_
zap a subset of all SPTEs. As evidenced by the fix to zap non-leaf SPTEs
with gPTEs, formally documenting KVM's exact internal behavior is risky
and unnecessary.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
Documentation/virt/kvm/api.rst | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e32471977d0a..edc070c6e19b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8098,13 +8098,15 @@ KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
disabled.
-KVM_X86_QUIRK_SLOT_ZAP_ALL By default, KVM invalidates all SPTEs in
- fast way for memslot deletion when VM type
- is KVM_X86_DEFAULT_VM.
- When this quirk is disabled or when VM type
- is other than KVM_X86_DEFAULT_VM, KVM zaps
- only leaf SPTEs that are within the range of
- the memslot being deleted.
+KVM_X86_QUIRK_SLOT_ZAP_ALL By default, for KVM_X86_DEFAULT_VM VMs, KVM
+ invalidates all SPTEs in all memslots and
+ address spaces when a memslot is deleted or
+ moved. When this quirk is disabled (or the
+ VM type isn't KVM_X86_DEFAULT_VM), KVM only
+ ensures the backing memory of the deleted
+ or moved memslot isn't reachable, i.e. KVM
+ _may_ invalidate only SPTEs related to the
+ memslot.
=================================== ============================================
7.32 KVM_CAP_MAX_VCPU_ID
--
2.47.0.rc1.288.g06298d1525-goog
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
2024-10-09 19:23 ` [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() Sean Christopherson
@ 2024-10-10 7:01 ` Yan Zhao
2024-10-10 16:14 ` Sean Christopherson
0 siblings, 1 reply; 9+ messages in thread
From: Yan Zhao @ 2024-10-10 7:01 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel
On Wed, Oct 09, 2024 at 12:23:44PM -0700, Sean Christopherson wrote:
> Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
> mmu_invalidate_in_progress is elevated, or that the range is being zapped
> due to memslot removal (loosely detected by slots_lock being held).
> Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
> as KVM's page fault path snapshots state before acquiring mmu_lock, and
> thus can create SPTEs with stale information if vCPUs aren't forced to
> retry faults (due to seeing an in-progress or past MMU invalidation).
>
> Memslot removal is a special case, as the memslot is retrieved outside of
> mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
> instead relies on SRCU synchronization to ensure any in-flight page faults
> are fully resolved before zapping SPTEs.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 09494d01c38e..c6716fd3666f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1556,6 +1556,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool flush = false;
>
> + /*
> + * To prevent races with vCPUs faulting in a gfn using stale data,
> + * zapping a gfn range must be protected by mmu_invalidate_in_progress
> + * (and mmu_invalidate_seq). The only exception is memslot deletion,
> + * in which case SRCU synchronization ensures SPTEs are zapped after all
> + * vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
> + */
> + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> + lockdep_is_held(&kvm->slots_lock));
> +
Is the detection of slots_lock too loose?
If a caller just holds slots_lock without calling
"synchronize_srcu_expedited(&kvm->srcu)" as that in kvm_swap_active_memslots()
to ensure the old slot is retired, stale data may still be encountered.
> if (kvm_memslots_have_rmaps(kvm))
> flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
> range->start, range->end,
> --
> 2.47.0.rc1.288.g06298d1525-goog
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
2024-10-09 19:23 ` [PATCH 1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot Sean Christopherson
@ 2024-10-10 7:59 ` Yan Zhao
0 siblings, 0 replies; 9+ messages in thread
From: Yan Zhao @ 2024-10-10 7:59 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel
Tests of "normal VM + nested VM + 3 selftests" passed on the 3 configs
1) modprobe kvm_intel ept=0,
2) modprobe kvm tdp_mmu=0
modprobe kvm_intel ept=1
3) modprobe kvm tdp_mmu=1
modprobe kvm_intel ept=1
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
On Wed, Oct 09, 2024 at 12:23:43PM -0700, Sean Christopherson wrote:
> When performing a targeted zap on memslot removal, zap only MMU pages that
> shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
> unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't
> exist, because it doesn't do what most people would expect it to do.
> The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
> means that the exact gfn comparison will not get a match, even when a SP
> does "cover" a gfn, or was even created specifically for a gfn.
>
> For memslot deletion specifically, KVM's behavior will vary significantly
> based on the size and alignment of a memslot, and in weird ways. E.g. for
> a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
> it's only 4KiB aligned. And as described below, zapping SPs in the
> aligned case overzaps for direct MMUs, as odds are good the upper-level
> SPs are serving other memslots.
>
> To iterate over all potentially-relevant gfns, KVM would need to make a
> pass over the hash table for each level, with the gfn used for lookup
> rounded for said level. And then check that the SP is of the correct
> level, too, e.g. to avoid over-zapping.
>
> But even then, KVM would massively overzap, as processing every level is
> all but guaranteed to zap SPs that serve other memslots, especially if the
> memslot being removed is relatively small. KVM could mitigate that issue
> by processing only levels that can be possible guest huge pages, i.e. are
> less likely to be re-used for other memslots, but while somewhat logical,
> that's quite arbitrary and would be a bit of a mess to implement.
>
> So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
> is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
> absolutely must be zapped.
>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a9a23e058555..09494d01c38e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1884,14 +1884,10 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
> if (is_obsolete_sp((_kvm), (_sp))) { \
> } else
>
> -#define for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
> +#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
> for_each_valid_sp(_kvm, _sp, \
> &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
> - if ((_sp)->gfn != (_gfn)) {} else
> -
> -#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
> - for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
> - if (!sp_has_gptes(_sp)) {} else
> + if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
>
> static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> {
> @@ -7063,15 +7059,15 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
>
> /*
> * Since accounting information is stored in struct kvm_arch_memory_slot,
> - * shadow pages deletion (e.g. unaccount_shadowed()) requires that all
> - * gfns with a shadow page have a corresponding memslot. Do so before
> - * the memslot goes away.
> + * all MMU pages that are shadowing guest PTEs must be zapped before the
> + * memslot is deleted, as freeing such pages after the memslot is freed
> + * will result in use-after-free, e.g. in unaccount_shadowed().
> */
> for (i = 0; i < slot->npages; i++) {
> struct kvm_mmu_page *sp;
> gfn_t gfn = slot->base_gfn + i;
>
> - for_each_gfn_valid_sp(kvm, sp, gfn)
> + for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> --
> 2.47.0.rc1.288.g06298d1525-goog
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
2024-10-10 7:01 ` Yan Zhao
@ 2024-10-10 16:14 ` Sean Christopherson
2024-10-11 5:37 ` Yan Zhao
0 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2024-10-10 16:14 UTC (permalink / raw)
To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel
On Thu, Oct 10, 2024, Yan Zhao wrote:
> On Wed, Oct 09, 2024 at 12:23:44PM -0700, Sean Christopherson wrote:
> > Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
> > mmu_invalidate_in_progress is elevated, or that the range is being zapped
> > due to memslot removal (loosely detected by slots_lock being held).
> > Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
> > as KVM's page fault path snapshots state before acquiring mmu_lock, and
> > thus can create SPTEs with stale information if vCPUs aren't forced to
> > retry faults (due to seeing an in-progress or past MMU invalidation).
> >
> > Memslot removal is a special case, as the memslot is retrieved outside of
> > mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
> > instead relies on SRCU synchronization to ensure any in-flight page faults
> > are fully resolved before zapping SPTEs.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 09494d01c38e..c6716fd3666f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1556,6 +1556,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool flush = false;
> >
> > + /*
> > + * To prevent races with vCPUs faulting in a gfn using stale data,
> > + * zapping a gfn range must be protected by mmu_invalidate_in_progress
> > + * (and mmu_invalidate_seq). The only exception is memslot deletion,
> > + * in which case SRCU synchronization ensures SPTEs are zapped after all
> > + * vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
> > + */
> > + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > + lockdep_is_held(&kvm->slots_lock));
> > +
> Is the detection of slots_lock too loose?
Yes, but I can't think of an easy way to tighten it. My original thought was to
require range->slot to be invalid, but KVM (correctly) passes in the old, valid
memslot to kvm_arch_flush_shadow_memslot().
The goal with the assert is to detect as many bugs as possible, without adding
too much complexity, and also to document the rules for using kvm_unmap_gfn_range().
Actually, we can tighten the check, by verifying that the slot being unmapped is
valid, but that the slot that KVM sees is invalid. I'm not sure I love it though,
as it's absurdly specific.
(untested)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c6716fd3666f..12b87b746b59 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1552,6 +1552,17 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
start, end - 1, can_yield, true, flush);
}
+static bool kvm_memslot_is_being_invalidated(struct kvm *kvm, const struct kvm_memory_slot *old)
+{
+ const struct kvm_memory_slot *new;
+
+ if (old->flags & KVM_MEMSLOT_INVALID)
+ return false;
+
+ new = id_to_memslot(__kvm_memslots(kvm, old->as_id), old->id);
+ return new && new->flags & KVM_MEMSLOT_INVALID;
+}
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
@@ -1564,7 +1575,8 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
* vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
*/
lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
- lockdep_is_held(&kvm->slots_lock));
+ (lockdep_is_held(&kvm->slots_lock) &&
+ kvm_memslot_is_being_invalidated(kvm, range->slot)));
if (kvm_memslots_have_rmaps(kvm))
flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
> If a caller just holds slots_lock without calling
> "synchronize_srcu_expedited(&kvm->srcu)" as that in kvm_swap_active_memslots()
> to ensure the old slot is retired, stale data may still be encountered.
>
> > if (kvm_memslots_have_rmaps(kvm))
> > flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
> > range->start, range->end,
> > --
> > 2.47.0.rc1.288.g06298d1525-goog
> >
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
2024-10-10 16:14 ` Sean Christopherson
@ 2024-10-11 5:37 ` Yan Zhao
2024-10-11 21:22 ` Sean Christopherson
0 siblings, 1 reply; 9+ messages in thread
From: Yan Zhao @ 2024-10-11 5:37 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel
On Thu, Oct 10, 2024 at 09:14:41AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Wed, Oct 09, 2024 at 12:23:44PM -0700, Sean Christopherson wrote:
> > > Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
> > > mmu_invalidate_in_progress is elevated, or that the range is being zapped
> > > due to memslot removal (loosely detected by slots_lock being held).
> > > Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
> > > as KVM's page fault path snapshots state before acquiring mmu_lock, and
> > > thus can create SPTEs with stale information if vCPUs aren't forced to
> > > retry faults (due to seeing an in-progress or past MMU invalidation).
> > >
> > > Memslot removal is a special case, as the memslot is retrieved outside of
> > > mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
> > > instead relies on SRCU synchronization to ensure any in-flight page faults
> > > are fully resolved before zapping SPTEs.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> > > 1 file changed, 10 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 09494d01c38e..c6716fd3666f 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1556,6 +1556,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > bool flush = false;
> > >
> > > + /*
> > > + * To prevent races with vCPUs faulting in a gfn using stale data,
> > > + * zapping a gfn range must be protected by mmu_invalidate_in_progress
> > > + * (and mmu_invalidate_seq). The only exception is memslot deletion,
> > > + * in which case SRCU synchronization ensures SPTEs are zapped after all
> > > + * vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
> > > + */
> > > + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > > + lockdep_is_held(&kvm->slots_lock));
> > > +
> > Is the detection of slots_lock too loose?
>
> Yes, but I can't think of an easy way to tighten it. My original thought was to
> require range->slot to be invalid, but KVM (correctly) passes in the old, valid
> memslot to kvm_arch_flush_shadow_memslot().
>
> The goal with the assert is to detect as many bugs as possible, without adding
> too much complexity, and also to document the rules for using kvm_unmap_gfn_range().
>
> Actually, we can tighten the check, by verifying that the slot being unmapped is
> valid, but that the slot that KVM sees is invalid. I'm not sure I love it though,
> as it's absurdly specific.
Right. It doesn't reflect the wait in kvm_swap_active_memslots() for the old
slot.
CPU 0 CPU 1
1. fault on old begins
2. swap to new
3. zap old
4. fault on old ends
Without CPU 1 waiting for 1 & 4 to complete between 2 & 3, stale data is
still possible.
So, the detection in kvm_memslot_is_being_invalidated() only indicates the
caller is from kvm_arch_flush_shadow_memslot() with current code.
Given that, how do you feel about passing in a "bool is_flush_slot" to indicate
the caller and asserting?
> (untested)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c6716fd3666f..12b87b746b59 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1552,6 +1552,17 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> start, end - 1, can_yield, true, flush);
> }
>
> +static bool kvm_memslot_is_being_invalidated(struct kvm *kvm, const struct kvm_memory_slot *old)
> +{
> + const struct kvm_memory_slot *new;
> +
> + if (old->flags & KVM_MEMSLOT_INVALID)
> + return false;
> +
> + new = id_to_memslot(__kvm_memslots(kvm, old->as_id), old->id);
> + return new && new->flags & KVM_MEMSLOT_INVALID;
> +}
> +
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool flush = false;
> @@ -1564,7 +1575,8 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> * vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
> */
> lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> - lockdep_is_held(&kvm->slots_lock));
> + (lockdep_is_held(&kvm->slots_lock) &&
> + kvm_memslot_is_being_invalidated(kvm, range->slot)));
>
> if (kvm_memslots_have_rmaps(kvm))
> flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
>
>
> > If a caller just holds slots_lock without calling
> > "synchronize_srcu_expedited(&kvm->srcu)" as that in kvm_swap_active_memslots()
> > to ensure the old slot is retired, stale data may still be encountered.
> >
> > > if (kvm_memslots_have_rmaps(kvm))
> > > flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
> > > range->start, range->end,
> > > --
> > > 2.47.0.rc1.288.g06298d1525-goog
> > >
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
2024-10-11 5:37 ` Yan Zhao
@ 2024-10-11 21:22 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2024-10-11 21:22 UTC (permalink / raw)
To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel
On Fri, Oct 11, 2024, Yan Zhao wrote:
> On Thu, Oct 10, 2024 at 09:14:41AM -0700, Sean Christopherson wrote:
> > On Thu, Oct 10, 2024, Yan Zhao wrote:
> > > On Wed, Oct 09, 2024 at 12:23:44PM -0700, Sean Christopherson wrote:
> > > > Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
> > > > mmu_invalidate_in_progress is elevated, or that the range is being zapped
> > > > due to memslot removal (loosely detected by slots_lock being held).
> > > > Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
> > > > as KVM's page fault path snapshots state before acquiring mmu_lock, and
> > > > thus can create SPTEs with stale information if vCPUs aren't forced to
> > > > retry faults (due to seeing an in-progress or past MMU invalidation).
> > > >
> > > > Memslot removal is a special case, as the memslot is retrieved outside of
> > > > mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
> > > > instead relies on SRCU synchronization to ensure any in-flight page faults
> > > > are fully resolved before zapping SPTEs.
> > > >
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> > > > 1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 09494d01c38e..c6716fd3666f 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1556,6 +1556,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > > bool flush = false;
> > > >
> > > > + /*
> > > > + * To prevent races with vCPUs faulting in a gfn using stale data,
> > > > + * zapping a gfn range must be protected by mmu_invalidate_in_progress
> > > > + * (and mmu_invalidate_seq). The only exception is memslot deletion,
> > > > + * in which case SRCU synchronization ensures SPTEs are zapped after all
> > > > + * vCPUs have unlocked SRCU and are guaranteed to see the invalid slot.
> > > > + */
> > > > + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > > > + lockdep_is_held(&kvm->slots_lock));
> > > > +
> > > Is the detection of slots_lock too loose?
> >
> > Yes, but I can't think of an easy way to tighten it. My original thought was to
> > require range->slot to be invalid, but KVM (correctly) passes in the old, valid
> > memslot to kvm_arch_flush_shadow_memslot().
> >
> > The goal with the assert is to detect as many bugs as possible, without adding
> > too much complexity, and also to document the rules for using kvm_unmap_gfn_range().
> >
> > Actually, we can tighten the check, by verifying that the slot being unmapped is
> > valid, but that the slot that KVM sees is invalid. I'm not sure I love it though,
> > as it's absurdly specific.
> Right. It doesn't reflect the wait in kvm_swap_active_memslots() for the old
> slot.
>
> CPU 0 CPU 1
> 1. fault on old begins
> 2. swap to new
> 3. zap old
> 4. fault on old ends
>
> Without CPU 1 waiting for 1&4 complete between 2&3, stale data is still
> possible.
>
> So, the detection in kvm_memslot_is_being_invalidated() only indicates the
> caller is from kvm_arch_flush_shadow_memslot() with current code.
Yep, which is why I don't love it.
> Given that, how do you feel about passing in a "bool is_flush_slot" to indicate
> the caller and asserting?
I like it even less than the ugliness I proposed :-) It'd basically be a "I pinky
swear I know what I'm doing" flag, and I think the downsides of having true/false
literals in the code would outweigh the upside of the precise assertion.
^ permalink raw reply [flat|nested] 9+ messages in thread