public inbox for linux-kernel@vger.kernel.org

* [PATCH v2 0/4] KVM: x86/mmu: Clean up indirect_shadow_pages usage
@ 2024-02-03  0:23 Sean Christopherson
From: Sean Christopherson @ 2024-02-03  0:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

Resurrect a 6-month-old patch from Mingwei, add a few cleanups on top, and
fix a largely theoretical race between emulating writes and write-protecting
shadow pages.  At least, I'm pretty sure there's a race.  Memory ordering
isn't exactly my strong suit.

v2:
 - Drop the unnecessary READ_ONCE(). [Jim]
 - Clean up more old crud in reexecute_instruction().
 - Fix the aforementioned race.

v1: https://lore.kernel.org/all/20230605004334.1930091-1-mizhang@google.com

Mingwei Zhang (1):
  KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages
    as a heuristic

Sean Christopherson (3):
  KVM: x86: Drop dedicated logic for direct MMUs in
    reexecute_instruction()
  KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
  KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()

 arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++++++---
 arch/x86/kvm/x86.c     | 35 ++++++++++++++---------------------
 2 files changed, 30 insertions(+), 24 deletions(-)


base-commit: 60eedcfceda9db46f1b333e5e1aa9359793f04fb
-- 
2.43.0.594.gd9cf4e227d-goog


* [PATCH v2 1/4] KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
@ 2024-02-03  0:23 ` Sean Christopherson
From: Sean Christopherson @ 2024-02-03  0:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

From: Mingwei Zhang <mizhang@google.com>

Drop KVM's completely pointless acquisition of mmu_lock when deciding
whether or not to unprotect any shadow pages residing at the gfn before
resuming the guest to let it retry an instruction that KVM failed to
emulate.  In this case, indirect_shadow_pages is used as a coarse-grained
heuristic to check if there is any chance of there being a relevant shadow
page to unprotect.  But acquiring mmu_lock largely defeats any benefit of
the heuristic, as taking mmu_lock for write is likely far more costly to
the VM as a whole than unnecessarily walking mmu_page_hash.

Furthermore, the current code is already prone to false negatives and
false positives, as it drops mmu_lock before checking the flag and
unprotecting shadow pages.  And as evidenced by the lack of bug reports,
neither false positives nor false negatives are problematic.  A false
positive simply means that KVM will try to unprotect shadow pages that
have already been zapped.  And a false negative means that KVM will
resume the guest without unprotecting the gfn, i.e. if a shadow page was
_just_ created, the vCPU will hit the same page fault and do the whole
dance all over again, and detect and unprotect the shadow page the second
time around (or not, if something else zaps it first).

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: drop READ_ONCE() and comment change, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c339d9f95b4b..2ec3e1851f2f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8787,13 +8787,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	/* The instructions are well-emulated on direct mmu. */
 	if (vcpu->arch.mmu->root_role.direct) {
-		unsigned int indirect_shadow_pages;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		write_unlock(&vcpu->kvm->mmu_lock);
-
-		if (indirect_shadow_pages)
+		if (vcpu->kvm->arch.indirect_shadow_pages)
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 		return true;
-- 
2.43.0.594.gd9cf4e227d-goog


* [PATCH v2 2/4] KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
@ 2024-02-03  0:23 ` Sean Christopherson
From: Sean Christopherson @ 2024-02-03  0:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop
the dedicated path entirely and always query indirect_shadow_pages when
deciding whether or not to try unprotecting the gfn.  For indirect, a.k.a.
shadow MMUs, checking indirect_shadow_pages is harmless; unless *every*
shadow page was somehow zapped while KVM was attempting to emulate the
instruction, indirect_shadow_pages is guaranteed to be non-zero.

Well, unless the instruction used a direct hugepage with 2-level paging
for its code page, but in that case, there's obviously nothing to
unprotect.  And in the extremely unlikely case all shadow pages were
zapped, there's again obviously nothing to unprotect.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ec3e1851f2f..c502121b7bee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8785,27 +8785,27 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	kvm_release_pfn_clean(pfn);
 
-	/* The instructions are well-emulated on direct mmu. */
-	if (vcpu->arch.mmu->root_role.direct) {
-		if (vcpu->kvm->arch.indirect_shadow_pages)
-			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
-
-		return true;
-	}
-
 	/*
-	 * if emulation was due to access to shadowed page table
-	 * and it failed try to unshadow page and re-enter the
-	 * guest to let CPU execute the instruction.
+	 * If emulation may have been triggered by a write to a shadowed page
+	 * table, unprotect the gfn (zap any relevant SPTEs) and re-enter the
+	 * guest to let the CPU re-execute the instruction in the hope that the
+	 * CPU can cleanly execute the instruction that KVM failed to emulate.
 	 */
-	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+	if (vcpu->kvm->arch.indirect_shadow_pages)
+		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 	/*
-	 * If the access faults on its page table, it can not
-	 * be fixed by unprotecting shadow page and it should
-	 * be reported to userspace.
+	 * If the failed instruction faulted on an access to page tables that
+	 * are used to translate any part of the instruction, KVM can't resolve
+	 * the issue by unprotecting the gfn, as zapping the shadow page will
+	 * result in the instruction taking a !PRESENT page fault and thus put
+	 * the vCPU into an infinite loop of page faults.  E.g. KVM will create
+	 * a SPTE and write-protect the gfn to resolve the !PRESENT fault, and
+	 * then zap the SPTE to unprotect the gfn, and then do it all over
+	 * again.  Report the error to userspace.
 	 */
-	return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP);
+	return vcpu->arch.mmu->root_role.direct ||
+	       !(emulation_type & EMULTYPE_WRITE_PF_TO_SP);
 }
 
 static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
-- 
2.43.0.594.gd9cf4e227d-goog


* [PATCH v2 3/4] KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
@ 2024-02-03  0:23 ` Sean Christopherson
From: Sean Christopherson @ 2024-02-03  0:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

Remove reexecute_instruction()'s final check on the MMU being direct, as
EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a
shadow MMU.  Prior to commit 93c05d3ef252 ("KVM: x86: improve
reexecute_instruction"), the flag simply didn't exist (and KVM actually
returned "true" unconditionally for both types of MMUs).  I.e. the
explicit check for a direct MMU is simply a leftover artifact from old code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c502121b7bee..5fe94b2de1dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8804,8 +8804,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * then zap the SPTE to unprotect the gfn, and then do it all over
 	 * again.  Report the error to userspace.
 	 */
-	return vcpu->arch.mmu->root_role.direct ||
-	       !(emulation_type & EMULTYPE_WRITE_PF_TO_SP);
+	return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP);
 }
 
 static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
-- 
2.43.0.594.gd9cf4e227d-goog


* [PATCH v2 4/4] KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()
@ 2024-02-03  0:23 ` Sean Christopherson
From: Sean Christopherson @ 2024-02-03  0:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
to plug a (very, very theoretical) race where kvm_mmu_track_write() could
miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
*stale* SPTEs.

Without the barriers, because modern x86 CPUs allow (per the SDM):

  Reads may be reordered with older writes to different locations but not
  with older writes to the same location.

it's (again, super theoretically) possible that the following could happen
(in terms of values being visible/resolved):

 CPU0                          CPU1
 read memory[gfn] (=Y)
                               memory[gfn] Y=>X
                               read indirect_shadow_pages (=0)
 indirect_shadow_pages 0=>1

or conversely:

 CPU0                          CPU1
 indirect_shadow_pages 0=>1
                               read indirect_shadow_pages (=0)
 read memory[gfn] (=Y)
                               memory[gfn] Y=>X
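
With the barriers in place, both sequences collapse into the classic
store-buffering pattern:

 CPU0 (account_shadowed)       CPU1 (kvm_mmu_track_write)
 indirect_shadow_pages 0=>1    memory[gfn] Y=>X
 smp_mb()                      smp_mb()
 read memory[gfn]              read indirect_shadow_pages

for which the bad outcome, CPU0 reading the stale value Y *and* CPU1
reading a zero count, is forbidden; at least one side is guaranteed to
observe the other's store.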

In practice, this bug is likely benign as both the 0=>1 transition and
reordering of this scope are extremely rare occurrences.

Note, if the cost of the barrier (which is simply a locked ADD, see commit
450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE"))
is problematic, KVM could avoid the barrier by bailing earlier if
kvm_memslots_have_rmaps() returns false.  But the odds of the barrier being
problematic are extremely low, *and* the odds of the extra check being
meaningfully faster overall are also low.
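
For reference, bailing early would look something like the below (an
untested sketch, not part of this patch):

	/*
	 * Write-protection requires rmaps, so if the VM has never used
	 * shadow paging, there can't be a shadow page to unprotect and
	 * both the barrier and the count check can be skipped.
	 */
	if (!kvm_memslots_have_rmaps(vcpu->kvm))
		return;

	smp_mb();
	if (!vcpu->kvm->arch.indirect_shadow_pages)
		return;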

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c193b096b45..86b85060534d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	struct kvm_memory_slot *slot;
 	gfn_t gfn;
 
+	/*
+	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
+	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
+	 * emulated writes are visible before re-reading guest PTEs, or that
+	 * an emulated write will see the elevated count and acquire mmu_lock
+	 * to update SPTEs.  Pairs with the smp_mb() in kvm_mmu_track_write().
+	 */
+	smp_mb();
 	kvm->arch.indirect_shadow_pages++;
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
@@ -5747,10 +5755,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 	bool flush = false;
 
 	/*
-	 * If we don't have indirect shadow pages, it means no page is
-	 * write-protected, so we can exit simply.
+	 * When emulating guest writes, ensure the written value is visible to
+	 * any task that is handling page faults before checking whether or not
+	 * KVM is shadowing a guest PTE.  This ensures either KVM will create
+	 * the correct SPTE in the page fault handler, or this task will see
+	 * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
+	 * account_shadowed().
 	 */
-	if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+	smp_mb();
+	if (!vcpu->kvm->arch.indirect_shadow_pages)
 		return;
 
 	write_lock(&vcpu->kvm->mmu_lock);
-- 
2.43.0.594.gd9cf4e227d-goog


* Re: [PATCH v2 0/4] KVM: x86/mmu: Clean up indirect_shadow_pages usage
@ 2024-02-23  1:35 ` Sean Christopherson
From: Sean Christopherson @ 2024-02-23  1:35 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

On Fri, 02 Feb 2024 16:23:39 -0800, Sean Christopherson wrote:
> Resurrect a 6-month-old patch from Mingwei, add a few cleanups on top, and
> fix a largely theoretical race between emulating writes and write-protecting
> shadow pages.  At least, I'm pretty sure there's a race.  Memory ordering
> isn't exactly my strong suit.
> 
> v2:
>  - Drop the unnecessary READ_ONCE(). [Jim]
>  - Clean up more old crud in reexecute_instruction().
>  - Fix the aforementioned race.
> 
> [...]

Applied 1-3 to kvm-x86 mmu.  I will likely apply patch 4 for 6.9 as well, but
assuming it doesn't get attention "soon", I'll apply it dead last in kvm-x86/mmu
so that it's easy to discard if my paranoia is unfounded.

[1/4] KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
      https://github.com/kvm-x86/linux/commit/474b99ed703b
[2/4] KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
      https://github.com/kvm-x86/linux/commit/515c18a64e70
[3/4] KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
      https://github.com/kvm-x86/linux/commit/dfeef3d3f310
[4/4] KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()
      (not applied, yet)

--
https://github.com/kvm-x86/linux/tree/next

* Re: [PATCH v2 4/4] KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()
@ 2024-02-23  8:09   ` Paolo Bonzini
From: Paolo Bonzini @ 2024-02-23  8:09 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

On 2/3/24 01:23, Sean Christopherson wrote:
> Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
> to plug a (very, very theoretical) race where kvm_mmu_track_write() could
> miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
> *stale* SPTEs.

Ok, so we have

emulator_write_phys
   overwrite PTE
   kvm_page_track_write
     kvm_mmu_track_write
       // memory barrier missing here
       if (indirect_shadow_pages)
         zap();

and on the other side

   FNAME(page_fault)
     FNAME(fetch)
       kvm_mmu_get_child_sp
         kvm_mmu_get_shadow_page
           __kvm_mmu_get_shadow_page
             kvm_mmu_alloc_shadow_page
               account_shadowed
                 indirect shadow pages++
                 // memory barrier missing here
       if (FNAME(gpte_changed)) // reads PTE
         goto out

If you can weave something like this into the commit message, the sequence
would be a bit clearer.

> In practice, this bug is likely benign as both the 0=>1 transition and
> reordering of this scope are extremely rare occurrences.

I wouldn't call it benign; it's more that it's unobservable in practice,
but the race is real.  However...

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c193b096b45..86b85060534d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>   	struct kvm_memory_slot *slot;
>   	gfn_t gfn;
>   
> +	/*
> +	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> +	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> +	 * emulated writes are visible before re-reading guest PTEs, or that
> +	 * an emulated write will see the elevated count and acquire mmu_lock
> +	 * to update SPTEs.  Pairs with the smp_mb() in kvm_mmu_track_write().
> +	 */
> +	smp_mb();

... this memory barrier needs to be after the increment (the desired 
ordering is store-before-read).
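
I.e. (illustrative only) account_shadowed() wants:

	kvm->arch.indirect_shadow_pages++;

	/*
	 * Order the increment before the re-read of the guest PTEs in
	 * FNAME(gpte_changed), i.e. store-before-load, to pair with the
	 * barrier in kvm_mmu_track_write().
	 */
	smp_mb();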

Paolo

>   	kvm->arch.indirect_shadow_pages++;
>   	gfn = sp->gfn;
>   	slots = kvm_memslots_for_spte_role(kvm, sp->role);
> @@ -5747,10 +5755,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>   	bool flush = false;
>   
>   	/*
> -	 * If we don't have indirect shadow pages, it means no page is
> -	 * write-protected, so we can exit simply.
> +	 * When emulating guest writes, ensure the written value is visible to
> +	 * any task that is handling page faults before checking whether or not
> +	 * KVM is shadowing a guest PTE.  This ensures either KVM will create
> +	 * the correct SPTE in the page fault handler, or this task will see
> +	 * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
> +	 * account_shadowed().
>   	 */
> -	if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> +	smp_mb();
> +	if (!vcpu->kvm->arch.indirect_shadow_pages)
>   		return;
>   
>   	write_lock(&vcpu->kvm->mmu_lock);


* Re: [PATCH v2 4/4] KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()
  2024-02-23  8:09   ` Paolo Bonzini
@ 2024-02-23 18:12     ` Sean Christopherson
  0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2024-02-23 18:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang

On Fri, Feb 23, 2024, Paolo Bonzini wrote:
> On 2/3/24 01:23, Sean Christopherson wrote:
> > Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
> > to plug a (very, very theoretical) race where kvm_mmu_track_write() could
> > miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
> > *stale* SPTEs.
> 
> Ok, so we have
> 
> emulator_write_phys
>   overwrite PTE
>   kvm_page_track_write
>     kvm_mmu_track_write
>       // memory barrier missing here
>       if (indirect_shadow_pages)
>         zap();
> 
> and on the other side
> 
>   FNAME(page_fault)
>     FNAME(fetch)
>       kvm_mmu_get_child_sp
>         kvm_mmu_get_shadow_page
>           __kvm_mmu_get_shadow_page
>             kvm_mmu_alloc_shadow_page
>               account_shadowed
>                 indirect shadow pages++
>                 // memory barrier missing here
>       if (FNAME(gpte_changed)) // reads PTE
>         goto out
> 
> If you can weave something like this into the commit message, the sequence
> would be a bit clearer.

Roger that.

> > In practice, this bug is likely benign as both the 0=>1 transition and
> > reordering of this scope are extremely rare occurrences.
> 
> I wouldn't call it benign; it's more that it's unobservable in practice,
> but the race is real.  However...
> 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c193b096b45..86b85060534d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> >   	struct kvm_memory_slot *slot;
> >   	gfn_t gfn;
> > +	/*
> > +	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> > +	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> > +	 * emulated writes are visible before re-reading guest PTEs, or that
> > +	 * an emulated write will see the elevated count and acquire mmu_lock
> > +	 * to update SPTEs.  Pairs with the smp_mb() in kvm_mmu_track_write().
> > +	 */
> > +	smp_mb();
> 
> ... this memory barrier needs to be after the increment (the desired
> ordering is store-before-read).

Doh.  I'll post a fixed version as a one-off v3.

Thanks!
