* [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync
@ 2026-01-23 9:03 Lai Jiangshan
2026-01-23 9:03 ` [PATCH 2/2] KVM: x86/mmu: Skip unsync when large pages are allowed Lai Jiangshan
2026-03-12 17:35 ` [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync Sean Christopherson
0 siblings, 2 replies; 5+ messages in thread
From: Lai Jiangshan @ 2026-01-23 9:03 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, kvm
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Commit ecc5589f19a5 ("KVM: MMU: optimize set_spte for page sync") added
a writable permission check on the old SPTE to avoid unnecessary calls
to mmu_try_to_unsync_pages() when syncing SPTEs.
Later, commit e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte
in FNAME(sync_spte)") indirectly achieved the same result by avoiding
some SPTE updates altogether, which makes the writable permission check
in make_spte() much less useful.
Remove the old-SPTE writable permission check from make_spte() to
simplify the code.
This may cause mmu_try_to_unsync_pages() to be called in a few
additional cases not covered by commit e6722d9211b2, such as when the
guest toggles the execute bit, which is expected to be rare.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.c | 12 ++----------
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
5 files changed, 6 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02c450686b4a..4535d2836004 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3073,7 +3073,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
was_rmapped = 1;
}
- wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
+ wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, prefetch,
false, host_writable, &spte);
if (*sptep == spte) {
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 901cd2bd40b8..95fccee63563 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -954,7 +954,7 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
host_writable = spte & shadow_host_writable_mask;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
make_spte(vcpu, sp, slot, pte_access, gfn,
- spte_to_pfn(spte), spte, true, true,
+ spte_to_pfn(spte), true, true,
host_writable, &spte);
/*
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 85a0473809b0..a8e2606ccd22 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -186,7 +186,7 @@ bool spte_needs_atomic_update(u64 spte)
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
- u64 old_spte, bool prefetch, bool synchronizing,
+ bool prefetch, bool synchronizing,
bool host_writable, u64 *new_spte)
{
int level = sp->role.level;
@@ -258,16 +258,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* SPTE. Write-protect the SPTE if the page can't be unsync'd,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
- *
- * When overwriting an existing leaf SPTE, and the old SPTE was
- * writable, skip trying to unsync shadow pages as any relevant
- * shadow pages must already be unsync, i.e. the hash lookup is
- * unnecessary (and expensive). Note, this relies on KVM not
- * changing PFNs without first zapping the old SPTE, which is
- * guaranteed by both the shadow MMU and the TDP MMU.
*/
- if ((!is_last_spte(old_spte, level) || !is_writable_pte(old_spte)) &&
- mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, synchronizing, prefetch))
+ if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, synchronizing, prefetch))
wrprot = true;
else
spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask |
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 91ce29fd6f1b..cf9cd27bcd4f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -543,7 +543,7 @@ bool spte_needs_atomic_update(u64 spte);
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
- u64 old_spte, bool prefetch, bool synchronizing,
+ bool prefetch, bool synchronizing,
bool host_writable, u64 *new_spte);
u64 make_small_spte(struct kvm *kvm, u64 huge_spte,
union kvm_mmu_page_role role, int index);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..8dfaab2a4fd9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1188,7 +1188,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch,
+ fault->pfn, fault->prefetch,
false, fault->map_writable, &new_spte);
if (new_spte == iter->old_spte)
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 2/2] KVM: x86/mmu: Skip unsync when large pages are allowed
2026-01-23 9:03 [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync Lai Jiangshan
@ 2026-01-23 9:03 ` Lai Jiangshan
2026-03-12 17:07 ` Sean Christopherson
2026-03-12 17:35 ` [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync Sean Christopherson
1 sibling, 1 reply; 5+ messages in thread
From: Lai Jiangshan @ 2026-01-23 9:03 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, kvm
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Use the large-page metadata to avoid pointless attempts to search for shadow pages (SPs).
If the target GFN falls within a range where a large page is allowed,
then there cannot be a shadow page for that GFN; a shadow page in the
range would itself disallow using a large page. In that case, there
is nothing to unsync and mmu_try_to_unsync_pages() can return
immediately.
This is always true for TDP MMU without nested TDP, and holds for a
significant fraction of cases with shadow paging even when all SPs are 4K.
For shadow paging, this optimization theoretically avoids work for about
1/e ~= 37% of GFNs, assuming one guest page table per 2M of memory and
that each GPT falls randomly into the 2M memory buckets. In a simple
test setup, it skipped unsync in a much higher percentage of cases,
mainly because the guest buddy allocator clusters GPTs into fewer
buckets.
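The 1/e figure above follows from a standard balls-into-bins argument: with n GPTs dropped uniformly at random into n buckets, the expected fraction of empty buckets is (1 - 1/n)^n, which tends to 1/e as n grows. This can be sketched numerically (an illustrative simulation, not part of the patch; the function name is hypothetical):

```python
# Illustrative check of the commit message's claim: with one guest page
# table (GPT) per 2 MiB of memory, each landing uniformly at random in a
# 2 MiB "bucket", about 1/e ~= 37% of buckets receive no GPT at all,
# i.e. still allow a large page.
import math
import random

def empty_bucket_fraction(num_buckets: int, seed: int = 0) -> float:
    """Drop num_buckets GPTs uniformly into num_buckets buckets and
    return the fraction of buckets left empty."""
    rng = random.Random(seed)
    occupied = {rng.randrange(num_buckets) for _ in range(num_buckets)}
    return 1 - len(occupied) / num_buckets

n = 100_000
print(f"simulated empty fraction: {empty_bucket_fraction(n):.4f}")
print(f"1/e                     : {1 / math.e:.4f}")  # ~0.3679
```

As the commit message notes, real guests do better than this bound because the buddy allocator clusters GPTs into fewer buckets, leaving more buckets empty.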
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4535d2836004..555075fb63d9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2932,6 +2932,14 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
struct kvm_mmu_page *sp;
bool locked = false;
+ /*
+ * If a large page is allowed, there is no shadow page in the GFN range,
+ * because the presence of a shadow page in that range would prevent
+ * using a large page.
+ */
+ if (!lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage)
+ return 0;
+
/*
* Force write-protection if the page is being tracked. Note, the page
* track machinery is used to write-protect upper-level shadow pages,
--
2.19.1.6.gb485710b
* Re: [PATCH 2/2] KVM: x86/mmu: Skip unsync when large pages are allowed
2026-01-23 9:03 ` [PATCH 2/2] KVM: x86/mmu: Skip unsync when large pages are allowed Lai Jiangshan
@ 2026-03-12 17:07 ` Sean Christopherson
2026-03-12 17:22 ` Sean Christopherson
0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2026-03-12 17:07 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm
On Fri, Jan 23, 2026, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Use the large-page metadata to avoid pointless attempts to search for shadow pages (SPs).
>
> If the target GFN falls within a range where a large page is allowed,
> then there cannot be a shadow page for that GFN; a shadow page in the
> range would itself disallow using a large page. In that case, there
> is nothing to unsync and mmu_try_to_unsync_pages() can return
> immediately.
>
> This is always true for TDP MMU without nested TDP,
I wouldn't expect this to be much of a performance optimization for this case
though, as kvm_get_mmu_page_hash() will return an empty list, i.e.
for_each_gfn_valid_sp_with_gptes() won't do meaningful work anyways.
> and holds for a significant fraction of cases with shadow paging even when
> all SPs are 4K.
>
> For shadow paging, this optimization theoretically avoids work for about
> 1/e ~= 37% of GFNs, assuming one guest page table per 2M of memory and
> that each GPT falls randomly into the 2M memory buckets. In a simple
> test setup, it skipped unsync in a much higher percentage of cases,
> mainly because the guest buddy allocator clusters GPTs into fewer
> buckets.
>
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4535d2836004..555075fb63d9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2932,6 +2932,14 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
> struct kvm_mmu_page *sp;
> bool locked = false;
>
> + /*
> + * If a large page is allowed, there is no shadow page in the GFN range,
> + * because the presence of a shadow page in that range would prevent
> + * using a large page.
> + */
> + if (!lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage)
> + return 0;
Hmm, I'd like to move this to after the write-tracking check, even though as
implemented in code today, the two are mutually exclusive. Specifically, I don't
want to rely on KVM not supporting write-tracking at 2MiB granularity, and also
to avoid confusing readers. E.g. a shallow read of account_shadowed() would lead
people to believe this code is wrong:
/* the non-leaf shadow pages are keeping readonly. */
if (sp->role.level > PG_LEVEL_4K)
return __kvm_write_track_add_gfn(kvm, slot, gfn);
kvm_mmu_gfn_disallow_lpage(slot, gfn);
if they didn't follow __kvm_write_track_add_gfn() to see:
/*
* new track stops large page mapping for the
* tracked page.
*/
kvm_mmu_gfn_disallow_lpage(slot, gfn);
From a performance perspective, kvm_gfn_is_write_tracked() is O(1) time, and
should be very fast for the "pure" TDP MMU case, so I don't think that's a
concern.
This is what I have locally, please holler if you object to landing the code
after the write-tracked check.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 363967a17069..3d0e0c1b5332 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2940,6 +2940,15 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
if (kvm_gfn_is_write_tracked(kvm, slot, gfn))
return -EPERM;
+ /*
+ * Only 4KiB mappings can become unsync, and KVM disallows hugepages
+ * for unsync gfns. Upper-level gPTEs (leaf or non-leaf) are always
+ * write-protected (see above), thus if the gfn can be mapped with a
+ * hugepage and isn't write-tracked, it can't be unsync.
+ */
+ if (!lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage)
+ return 0;
+
/*
* The page is not write-tracked, mark existing shadow pages unsync
* unless KVM is synchronizing an unsync SP. In that case, KVM must
> /*
> * Force write-protection if the page is being tracked. Note, the page
> * track machinery is used to write-protect upper-level shadow pages,
> --
> 2.19.1.6.gb485710b
>
* Re: [PATCH 2/2] KVM: x86/mmu: Skip unsync when large pages are allowed
2026-03-12 17:07 ` Sean Christopherson
@ 2026-03-12 17:22 ` Sean Christopherson
0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-03-12 17:22 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm
On Thu, Mar 12, 2026, Sean Christopherson wrote:
> On Fri, Jan 23, 2026, Lai Jiangshan wrote:
> This is what I have locally, please holler if you object to landing the code
> after the write-tracked check.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 363967a17069..3d0e0c1b5332 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2940,6 +2940,15 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
> if (kvm_gfn_is_write_tracked(kvm, slot, gfn))
> return -EPERM;
>
> + /*
> + * Only 4KiB mappings can become unsync, and KVM disallows hugepages
> + * for unsync gfns. Upper-level gPTEs (leaf or non-leaf) are always
> + * write-protected (see above), thus if the gfn can be mapped with a
> + * hugepage and isn't write-tracked, it can't be unsync.
Gah, I swapped the ordering of who is doing what. The comment should be this:
/*
* Only 4KiB mappings can become unsync, and KVM disallows hugepages
* when accounting 4KiB shadow pages. Upper-level gPTEs are always
* write-protected (see above), thus if the gfn can be mapped with a
* hugepage and isn't write-tracked, it can't have a shadow page.
*/
if (!lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage)
return 0;
> + */
> + if (!lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage)
> + return 0;
> +
> /*
> * The page is not write-tracked, mark existing shadow pages unsync
> * unless KVM is synchronizing an unsync SP. In that case, KVM must
>
>
> > /*
> > * Force write-protection if the page is being tracked. Note, the page
> > * track machinery is used to write-protect upper-level shadow pages,
> > --
> > 2.19.1.6.gb485710b
> >
* Re: [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync
2026-01-23 9:03 [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync Lai Jiangshan
2026-01-23 9:03 ` [PATCH 2/2] KVM: x86/mmu: KVM: x86/mmu: Skip unsync when large pages are allowed Lai Jiangshan
@ 2026-03-12 17:35 ` Sean Christopherson
1 sibling, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-03-12 17:35 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm
On Fri, Jan 23, 2026, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Commit ecc5589f19a5 ("KVM: MMU: optimize set_spte for page sync") added
> a writable permission check on the old SPTE to avoid unnecessary calls
> to mmu_try_to_unsync_pages() when syncing SPTEs.
>
> Later, commit e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte
> in FNAME(sync_spte)") indirectly achieved the same result by avoiding
> some SPTE updates altogether, which makes the writable permission check
> in make_spte() much less useful.
>
> Remove the old-SPTE writable permission check from make_spte() to
> simplify the code.
>
> This may cause mmu_try_to_unsync_pages() to be called in a few
> additional cases not covered by commit e6722d9211b2, such as when the
> guest toggles the execute bit, which is expected to be rare.
Hmm, but it would also apply to spurious faults. The TDP MMU largely guards
against that behavior thanks to commit 386d69f9f29b ("KVM: x86/mmu: Treat TDP MMU
faults as spurious if access is already allowed"), but the shadow MMU does not.
Booting a 24 vCPU VM with shadow paging gets ~3000 hits on the optimizations,
which isn't a ton, but it's definitely not 0 either. And while I'm generally all
about simplifying code, I'm also generally very hesitant to tweak shadow paging
optimizations without strong evidence for doing so.
Is there an ulterior motive to this change, e.g. does it allow for additional
cleanups, or is simplifying the code the main goal?
end of thread, other threads:[~2026-03-12 17:35 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-23 9:03 [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync Lai Jiangshan
2026-01-23 9:03 ` [PATCH 2/2] KVM: x86/mmu: Skip unsync when large pages are allowed Lai Jiangshan
2026-03-12 17:07 ` Sean Christopherson
2026-03-12 17:22 ` Sean Christopherson
2026-03-12 17:35 ` [PATCH 1/2] KVM: x86/mmu: Don't check old SPTE permissions when trying to unsync Sean Christopherson