From: Yosry Ahmed <yosry@kernel.org>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org,  linux-kernel@vger.kernel.org,
	Jim Mattson <jmattson@google.com>,
	 James Houghton <jthoughton@google.com>
Subject: Re: [PATCH 1/2] KVM: x86/mmu: Recursively zap orphaned nested TDP shadow pages on emulated writes
Date: Mon, 8 Jun 2026 21:06:57 +0000
Message-ID: <aicuvY6FwIguzkg5@google.com>
In-Reply-To: <20260605174611.2222504-2-seanjc@google.com>

On Fri, Jun 05, 2026 at 10:46:10AM -0700, Sean Christopherson wrote:
> Recursively zap orphaned nested TDP shadow pages when emulating a guest
> write to a shadowed page table, regardless of whether or not the associated
> (parent) shadow page will be zapped, e.g. due to detected write-flooding.
> 
> This plugs a hole where KVM fails to reclaim defunct, unsync shadow pages
> for select L1 hypervisor patterns.  Commit 2de4085cccea ("KVM: x86/MMU:
> Recursively zap nested TDP SPs when zapping last/only parent") modified KVM
> to recursively zap synchronized shadow pages (KVM already recursively zaps
> unsync children) when a child is orphaned.  But the fix effectively only
> applied the logic to kvm_mmu_page_unlink_children(), i.e. only performs the
> recursive zap when KVM is already zapping a parent SP and processing its
> children.
> 
> If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does
> with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap
> invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will leak
> upwards of 4 shadow pages per GiB of L2 guest memory.  Over hundreds or
> thousands of L2 boots, if the VM is "lucky" enough to escape write-flooding
> detection, i.e. not trigger reclaim of the orphaned shadow pages by dumb
> luck, then it's possible to end up with tens or even hundreds of thousands
> of unsync shadow pages and associated rmap entries.
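
To put rough numbers on the paragraph above, here is a minimal back-of-the-envelope sketch. It takes the changelog's figure of about 4 leaked shadow pages per GiB of L2 memory at face value and assumes (pessimistically) that every L2 boot leaks the full amount and nothing is reclaimed in between. The helper name and parameters are hypothetical and not part of KVM.

```c
#include <stdint.h>

/*
 * Rough upper-bound estimate of leaked shadow pages.
 * Hypothetical helper for illustration only, not KVM code.
 *
 * l2_mem_gib:   size of L2 guest memory in GiB
 * sps_per_gib:  leaked shadow pages per GiB (the changelog cites ~4)
 * nr_l2_boots:  number of L2 boots over the lifetime of the L1 VM
 */
uint64_t estimate_leaked_sps(uint64_t l2_mem_gib, uint64_t sps_per_gib,
			     uint64_t nr_l2_boots)
{
	/* Assumes each boot leaks the full amount and nothing is reclaimed. */
	return l2_mem_gib * sps_per_gib * nr_l2_boots;
}
```

For example, estimate_leaked_sps(16, 4, 1000) gives 64000, i.e. a 16GiB L2 booted a thousand times lands in the "tens of thousands" range under these assumptions.
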
> 
> Polluting the hash table and rmap entries with a horde of stale entries
> can eventually degrade L2 guest boot time by an order of magnitude,
> especially if there is any antagonistic activity in the host, i.e. anything
> that will contend for mmu_lock and/or needs to walk rmaps.
> 
> With "top"-down zapping, where "top" is 1GiB or above, L0 KVM is
> effectively limited to leaking 4 shadow pages per 256 GiB of memory, as
> KVM's write-flooding detection will kick in on the third write to an L1
> TDP PUD, and thus recursively zap the entire 256 GiB range of the parent
> PGD.  I.e. even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs
> when zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs
> before dropping everything.  E.g. hacking tracing into L0 KVM's
> kvm_mmu_track_write(), the top-down zapping of L1's TDP MMU for an L2 with
> 16GiB of memory leads to:
> 
>   gpa = 107407000, old = 800000010741bd07, new = 8000000000000000, level = 3, flood = 0
>   gpa = 10741b000, old = 8000000112fb2d07, new = 80000000000001a0, level = 2, flood = 0
>   gpa = 10741b008, old = 800000012509cd07, new = 80000000000001a0, level = 2, flood = 1
>   gpa = 10741b010, old = 80000001114b9d07, new = 80000000000001a0, level = 2, flood = 2
>   gpa = 107407008, old = 8000000112fb5d07, new = 8000000000000000, level = 3, flood = 1
>   gpa = 112fb5298, old = 8000000106f43d07, new = 80000000000001a0, level = 2, flood = 0
>   gpa = 112fb52a0, old = 8000000106f4dd07, new = 80000000000001a0, level = 2, flood = 1
>   gpa = 112fb5ea0, old = 8000000120490d07, new = 80000000000001a0, level = 2, flood = 2
>   gpa = 107407010, old = 8000000106df2d07, new = 8000000000000000, level = 3, flood = 2
>   gpa = 107410000, old = 8000000107408d07, new = 8000000000000000, level = 5, flood = 0
>   gpa = 107408000, old = 8000000107407d07, new = 80000000000001a0, level = 4, flood = 0
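
When reading the trace, it can help to split each gpa into the page table page and the entry index, and to pull the child table address out of the old value. A small sketch, assuming the usual x86-64 layout where entries are 8 bytes and bits 51:12 of a non-leaf entry hold the page-frame address; the helper names are made up for illustration.

```c
#include <stdint.h>

#define TOY_PAGE_MASK	0xfffULL		/* offset within a 4KiB page */
#define TOY_PTE_SIZE	8			/* bytes per paging entry */
#define TOY_ADDR_MASK	0x000ffffffffff000ULL	/* bits 51:12 of an entry */

/* GPA of the 4KiB page table page containing the written entry. */
uint64_t toy_table_gpa(uint64_t gpa)
{
	return gpa & ~TOY_PAGE_MASK;
}

/* Index (0-511) of the written entry within that page table page. */
unsigned int toy_entry_index(uint64_t gpa)
{
	return (unsigned int)((gpa & TOY_PAGE_MASK) / TOY_PTE_SIZE);
}

/* GPA of the child page table that the old (non-leaf) entry pointed at. */
uint64_t toy_child_table_gpa(uint64_t old)
{
	return old & TOY_ADDR_MASK;
}
```

Applied to the first two lines of the trace: the old value 800000010741bd07 written at gpa 107407000 decodes to child table 10741b000, which is exactly the page whose entries 0, 1 and 2 (gpa 10741b000, 10741b008, 10741b010) are cleared next.
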
> 
> Contrast that with a bottom-up zap, which effectively allows all 2MiB SPTEs
> in L1 to leak their children:
> 
>   gpa = 167939000, old = 800000011c8f4d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 167939020, old = 8000000104407d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 167939028, old = 800000011ed20d07, new = 8000000000000000, level = 2, flood = 2
>   gpa = 118c70bb0, old = 8000000167ab9d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 118c70bb8, old = 8000000163913d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 118c70de8, old = 800000011cc9dd07, new = 8000000000000000, level = 2, flood = 2
>   gpa = 160be7fb0, old = 800000011d322d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 160be7fb8, old = 8000000126b1bd07, new = 8000000000000000, level = 2, flood = 2
>   gpa = 1634ab000, old = 800000010e984d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 1634ab008, old = 800000016879fd07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 1634ab010, old = 800000016879ed07, new = 8000000000000000, level = 2, flood = 2
>   gpa = 11e3f1e48, old = 8000000168a33d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 11e3f1e50, old = 80000001664dcd07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 1167eacb8, old = 8000000166544d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 1167eacc0, old = 800000015c16bd07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 1689e89b8, old = 800000015f296d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 1689e89c0, old = 8000000167ca8d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 107b35eb8, old = 8000000161e71d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 107b35ec0, old = 8000000118cf3d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 118cf2d48, old = 8000000118cf1d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 118cf2d50, old = 8000000118cf0d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 118dcb770, old = 8000000118dcad07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 118dcb778, old = 8000000118dc9d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 118dc87e8, old = 8000000126997d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 118dc87f0, old = 8000000126996d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 126995148, old = 8000000126994d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 126995150, old = 8000000103477d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 1034764c8, old = 8000000103475d07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 1034764d0, old = 8000000103474d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 10ea4b788, old = 800000010ea4ad07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 10ea4b790, old = 800000010ea49d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 10ea48928, old = 800000011a5bfd07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 10ea48930, old = 800000011a5bed07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 11a5bd0d8, old = 800000011a5bcd07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 11a5bd0e0, old = 800000011d323d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 122ce2b40, old = 800000011fe0bd07, new = 8000000000000000, level = 2, flood = 0
>   gpa = 122ce2b48, old = 800000010e985d07, new = 8000000000000000, level = 2, flood = 1
>   gpa = 122ce2b50, old = 8000000161c9dd07, new = 8000000000000000, level = 2, flood = 2
>   gpa = 16864c000, old = 8000000167939d07, new = 8000000000000000, level = 3, flood = 0
>   gpa = 16864c008, old = 8000000118c70d07, new = 8000000000000000, level = 3, flood = 1
>   gpa = 16864c010, old = 80000001688a6d07, new = 8000000000000000, level = 3, flood = 2
>   gpa = 11c8f7000, old = 80000001608a7d07, new = 8000000000000000, level = 5, flood = 0
>   gpa = 1608a7000, old = 800000016864cd07, new = 80000000000001a0, level = 4, flood = 0
> 
> Note, in the shadow MMU, "level" describes the level a shadow page "points"
> at, not the level of its associated SPTE.  I.e. when write-flooding of 1GiB
> PUD entries is detected, KVM recursively zaps shadow pages covering 256GiB
> worth of memory.  And as shown above, KVM's write-flooding detection
> operates at all levels, so a single PMD (in L1) can effectively only leak
> two unsync children (4KiB shadow pages) before it gets recursively zapped.
> As a result, for the top-down zap, L0 KVM will leak at most 4 unsync shadow
> pages per 256GiB of L2 memory.
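
The "two children before the parent gets zapped" behavior can be modeled with a per-page write counter. This is a toy model under the assumption, taken from the changelog, that detection fires on the third emulated write to a shadowed page table; the names and the threshold macro are illustrative and this is not the kernel's implementation (in particular, it ignores any resetting of the counter).

```c
#include <stdbool.h>

/* Assumption: flooding is declared on the third write, per the changelog. */
#define TOY_FLOOD_THRESHOLD	3

struct toy_flood_sp {
	int level;			/* level the shadow page "points" at */
	int write_flooding_count;	/* emulated writes seen so far */
};

/* Record one emulated write; returns true if the page should be zapped. */
bool toy_track_write(struct toy_flood_sp *sp)
{
	sp->write_flooding_count++;
	return sp->write_flooding_count >= TOY_FLOOD_THRESHOLD;
}

/*
 * Count how many entries can be cleared one at a time before flooding is
 * detected and the whole shadow page (plus its children) is dropped.
 */
int toy_writes_before_zap(void)
{
	struct toy_flood_sp sp = { .level = 2, .write_flooding_count = 0 };
	int survived = 0;

	while (!toy_track_write(&sp))
		survived++;

	return survived;
}
```

In this model toy_writes_before_zap() returns 2, which lines up with the flood = 0, 1, 2 triplets in the traces: two writes go through individually and the third takes out the whole page.
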
> 
> The top-down zap also makes it more likely that L1 will self-heal (to some
> extent), as any shadow pages that are "rediscovered" by future runs of L2
> can get reclaimed by a recursive zap, whereas bottom-up zapping orphans
> shadow pages over and over.
> 
> Note, in theory, there is some risk of over-zapping, e.g. due to zapping a
> large branch of the paging tree that L1 is only temporarily removing.  In
> practice, the usage patterns of hypervisors are highly unlikely to trigger
> false positives.  E.g. temporarily changing paging protections is typically
> done at the leaf, not on a non-leaf entry.  And if the L1 hypervisor is
> updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of
> memory from L2, then L0 KVM's write-flooding detection will kick in, and
> the children would be zapped anyways.
> 
> Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent")
> Cc: Yosry Ahmed <yosry@kernel.org>
> Cc: Jim Mattson <jmattson@google.com>
> Cc: James Houghton <jthoughton@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

The changelog (and problem in general) is quite the ride. Assuming I
followed along correctly, this LGTM:

Reviewed-by: Yosry Ahmed <yosry@kernel.org>

> ---
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b8f2edf2cfeb..9368a71336fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6376,7 +6376,7 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>  
>  		while (npte--) {
>  			entry = *spte;
> -			mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL);
> +			mmu_page_zap_pte(vcpu->kvm, sp, spte, &invalid_list);
>  			if (gentry && sp->role.level != PG_LEVEL_4K)
>  				++vcpu->kvm->stat.mmu_pde_zapped;
>  			if (is_shadow_present_pte(entry))
> -- 
> 2.54.0.1032.g2f8565e1d1-goog
> 
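
For anyone else following along, the effect of the one-line change can be sketched with a toy model. It assumes, based on the changelog and the diff, that the recursive zap of an orphaned child is only attempted when the caller passes a non-NULL invalid list. The toy_* names are hypothetical, and the real mmu_page_zap_pte() checks more conditions than shown here.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_child_sp {
	int nr_parents;		/* parent SPTEs still pointing at this page */
	bool zapped;		/* true once queued for freeing */
};

struct toy_invalid_list {
	int nr_queued;		/* shadow pages queued for freeing */
};

/*
 * Drop one parent link to @child.  If that orphans the child and the caller
 * supplied a list, queue the child for zapping.  With a NULL list the
 * orphaned child is left behind, which models the leak.
 */
bool toy_zap_pte(struct toy_child_sp *child,
		 struct toy_invalid_list *invalid_list)
{
	child->nr_parents--;

	if (invalid_list && child->nr_parents == 0) {
		child->zapped = true;
		invalid_list->nr_queued++;
		return true;
	}

	return false;
}
```

Calling toy_zap_pte() with NULL leaves a child with zero parents and zapped == false, while passing a list queues it, which is the difference between the old and new call in kvm_mmu_track_write().
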

Thread overview: 8+ messages
2026-06-05 17:46 [PATCH 0/2] KVM: x86/mmu: Plug an unsync shadow page leak Sean Christopherson
2026-06-05 17:46 ` [PATCH 1/2] KVM: x86/mmu: Recursively zap orphaned nested TDP shadow pages on emulated writes Sean Christopherson
2026-06-06 13:04   ` Jim Mattson
2026-06-08 21:06   ` Yosry Ahmed [this message]
2026-06-05 17:46 ` [PATCH 2/2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat Sean Christopherson
2026-06-05 18:06   ` sashiko-bot
2026-06-05 18:14     ` Sean Christopherson
2026-06-09 16:31 ` [PATCH 0/2] KVM: x86/mmu: Plug an unsync shadow page leak Sean Christopherson
