[PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat

Kernel KVM virtualization development

* [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
@ 2026-06-12 13:37 Sean Christopherson
  2026-06-15 20:57 ` Yosry Ahmed
  0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2026-06-12 13:37 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Turn arch.n_used_mmu_pages into a stat, mmu_shadow_pages, as the number of
live shadow pages is arguably _the_ most critical datapoint when it comes
to analyzing the shadow MMU.  Before the TDP MMU came along, i.e. when the
shadow MMU was the only MMU, explicitly tracking the number of shadow pages
wasn't as interesting, because the same information could more or less be
gleaned from the pages_{1g,2m,4k} stats.  But with the TDP MMU, where the
shadow MMU is only used for nested TDP, it becomes extremely difficult, if
not impossible, to determine which SPTEs are coming from the TDP MMU, and
which are coming from the shadow MMU.

E.g. when triaging/debugging shadow MMU performance issues due to "too many
shadow pages", being able to observe that 99%+ of all shadow pages are
unsync is critical to being able to deduce that KVM is effectively leaking
shadow pages.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---

v2: Drop the definition of n_used_mmu_pages. [Sashiko]

v1: https://lore.kernel.org/all/20260605174611.2222504-3-seanjc@google.com
 
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 14 +++++++-------
 arch/x86/kvm/mmu/mmutrace.h     |  2 +-
 arch/x86/kvm/x86.c              |  1 +
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3886b536c8a5..c2b86b1420d1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1435,7 +1435,6 @@ enum kvm_mmu_type {
 };
 
 struct kvm_arch {
-	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
@@ -1701,6 +1700,7 @@ struct kvm_vm_stat {
 	u64 mmu_recycled;
 	u64 mmu_cache_miss;
 	u64 mmu_unsync;
+	u64 mmu_shadow_pages;
 	union {
 		struct {
 			atomic64_t pages_4k;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9368a71336fe..3839aef6819b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1801,13 +1801,13 @@ static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
 
 static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm->arch.n_used_mmu_pages++;
+	kvm->stat.mmu_shadow_pages++;
 	kvm_account_pgtable_pages((void *)sp->spt, +1);
 }
 
 static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm->arch.n_used_mmu_pages--;
+	kvm->stat.mmu_shadow_pages--;
 	kvm_account_pgtable_pages((void *)sp->spt, -1);
 }
 
@@ -2833,9 +2833,9 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
 
 static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
 {
-	if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages)
+	if (kvm->arch.n_max_mmu_pages > kvm->stat.mmu_shadow_pages)
 		return kvm->arch.n_max_mmu_pages -
-			kvm->arch.n_used_mmu_pages;
+			kvm->stat.mmu_shadow_pages;
 
 	return 0;
 }
@@ -2871,11 +2871,11 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 {
 	write_lock(&kvm->mmu_lock);
 
-	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
-		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
+	if (kvm->stat.mmu_shadow_pages > goal_nr_mmu_pages) {
+		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->stat.mmu_shadow_pages -
 						  goal_nr_mmu_pages);
 
-		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
+		goal_nr_mmu_pages = kvm->stat.mmu_shadow_pages;
 	}
 
 	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index fa01719baf8d..8354d9f39777 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -303,7 +303,7 @@ TRACE_EVENT(
 
 	TP_fast_assign(
 		__entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
-		__entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
+		__entry->mmu_used_pages = kvm->stat.mmu_shadow_pages;
 	),
 
 	TP_printk("kvm-mmu-valid-gen %u used_pages %x",
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf122b8c3210..9d43e476707e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -244,6 +244,7 @@ const struct kvm_stats_desc kvm_vm_stats_desc[] = {
 	STATS_DESC_COUNTER(VM, mmu_recycled),
 	STATS_DESC_COUNTER(VM, mmu_cache_miss),
 	STATS_DESC_ICOUNTER(VM, mmu_unsync),
+	STATS_DESC_ICOUNTER(VM, mmu_shadow_pages),
 	STATS_DESC_ICOUNTER(VM, pages_4k),
 	STATS_DESC_ICOUNTER(VM, pages_2m),
 	STATS_DESC_ICOUNTER(VM, pages_1g),

base-commit: de3a35be92d2391ece4bf3143ef2887192625fd0
-- 
2.54.0.1136.gdb2ca164c4-goog
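
The triage described in the commit message (comparing the number of unsync
pages against the number of live shadow pages) could be scripted from
userspace roughly as follows.  This is only a sketch: it assumes the per-VM
stats are readable as decimal text files, e.g. under a debugfs directory such
as /sys/kernel/debug/kvm/<pid>-<fd>/, and the helper names read_stat() and
unsync_percent() are hypothetical, not part of KVM or any existing tool.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read a single counter from a text file that holds one decimal number.
 * The path is an assumption of this sketch (see the lead-in).
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_stat(const char *path, uint64_t *val)
{
	FILE *f;
	int ok;

	assert(path && val);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%" SCNu64, val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Percentage (rounded down) of live shadow pages that are unsync.
 * The two counters are sampled separately, so clamp in case they raced.
 */
unsigned int unsync_percent(uint64_t unsync, uint64_t shadow_pages)
{
	if (!shadow_pages)
		return 0;
	if (unsync > shadow_pages)
		unsync = shadow_pages;
	return (unsigned int)(unsync * 100 / shadow_pages);
}
```

A caller would read mmu_unsync and mmu_shadow_pages with read_stat() and flag
the VM when unsync_percent() stays near 100 while mmu_shadow_pages keeps
growing.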


* Re: [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
  2026-06-12 13:37 [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat Sean Christopherson
@ 2026-06-15 20:57 ` Yosry Ahmed
  2026-06-15 23:46   ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: Yosry Ahmed @ 2026-06-15 20:57 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, Jun 12, 2026 at 6:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Turn arch.n_used_mmu_pages into a stat, mmu_shadow_pages, as the number of
> live shadow pages is arguably _the_ most critical datapoint when it comes
> to analyzing the shadow MMU.  Before the TDP MMU came along, i.e. when the
> shadow MMU was the only MMU, explicitly tracking the number of shadow pages
> wasn't as interesting, because the same information could more or less be
> gleaned from the pages_{1g,2m,4k} stats.  But with the TDP MMU, where the
> shadow MMU is only used for nested TDP, it becomes extremely difficult, if
> not impossible, to determine which SPTEs are coming from the TDP MMU, and
> which are coming from the shadow MMU.
>
> E.g. when triaging/debugging shadow MMU performance issues due to "too many
> shadow pages", being able to observe that 99%+ of all shadow pages are
> unsync is critical to being able to deduce that KVM is effectively leaking
> shadow pages.

Why not expose indirect_shadow_pages? IIRC that was also one of the
stats we (mostly you) used while debugging?

I guess for most cases, mmu_shadow_pages will represent either the MMU
pages used to shadow the VM's x86 page tables (with TDP off) or nested
TDP MMU pages (with TDP on and nested used) -- but I do remember some
interesting case about direct mappings in the shadow MMU or sth?

I could be hallucinating this (I was bit by an AI agent).

>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

* Re: [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
  2026-06-15 20:57 ` Yosry Ahmed
@ 2026-06-15 23:46   ` Sean Christopherson
  2026-06-15 23:57     ` Yosry Ahmed
  0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2026-06-15 23:46 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Jun 15, 2026, Yosry Ahmed wrote:
> On Fri, Jun 12, 2026 at 6:37 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Turn arch.n_used_mmu_pages into a stat, mmu_shadow_pages, as the number of
> > live shadow pages is arguably _the_ most critical datapoint when it comes
> > to analyzing the shadow MMU.  Before the TDP MMU came along, i.e. when the
> > shadow MMU was the only MMU, explicitly tracking the number of shadow pages
> > wasn't as interesting, because the same information could more or less be
> > gleaned from the pages_{1g,2m,4k} stats.  But with the TDP MMU, where the
> > shadow MMU is only used for nested TDP, it becomes extremely difficult, if
> > not impossible, to determine which SPTEs are coming from the TDP MMU, and
> > which are coming from the shadow MMU.
> >
> > E.g. when triaging/debugging shadow MMU performance issues due to "too many
> > shadow pages", being able to observe that 99%+ of all shadow pages are
> > unsync is critical to being able to deduce that KVM is effectively leaking
> > shadow pages.
> 
> Why not expose indirect_shadow_pages? IIRC that was also one of the
> stats we (mostly you) used while debugging?

Because it's a subset of mmu_shadow_pages, and I suspect mmu_shadow_pages will
be more helpful if we're only providing one of the two?  E.g. if the problem is
that KVM is leaking indirect shadow pages, then either number will suffice.  But
if KVM is zapping old SPTEs due to the KVM_SET_NR_MMU_PAGES limit, then we really
want to see mmu_shadow_pages, otherwise there will be a blind spot with respect
to direct shadow pages.  And if there's a bug that's specific to direct shadow pages,
then we're probably hosed either way, because it will be difficult to observe just
the direct shadow pages (unless they happen to be the _only_ pages, which is very
unlikely, but then we'd still want mmu_shadow_pages).

In practice, thanks to the TDP MMU deliberately _not_ accounting its pages as
shadow pages, the delta between the two values will be tiny on setups with TDP
enabled, i.e. on practically every modern deployment.  Because hypervisor page
tables are typically tree-like, and hugepages are, well, huge, the number of
direct shadow pages in indirect MMUs will be counted in tens or hundreds, out of
thousands or tens of thousands of total shadow pages.
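
To make the relationship between the counters concrete, here is a small
userspace model.  It is a sketch only: struct shadow_counts and both helpers
are hypothetical names, and it assumes, per the explanation above, that
indirect_shadow_pages is a subset of mmu_shadow_pages.  available_pages()
mirrors the saturating subtraction that kvm_mmu_available_pages() does in the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters being discussed. */
struct shadow_counts {
	uint64_t shadow_pages;   /* all shadow MMU pages (mmu_shadow_pages) */
	uint64_t indirect_pages; /* subset that shadows gPTEs */
	uint64_t max_pages;      /* limit, e.g. set via KVM_SET_NR_MMU_PAGES */
};

/* Direct shadow pages are whatever is left after removing the subset. */
uint64_t direct_pages(const struct shadow_counts *c)
{
	assert(c->indirect_pages <= c->shadow_pages);
	return c->shadow_pages - c->indirect_pages;
}

/* Headroom before the limit is hit; never underflows. */
uint64_t available_pages(const struct shadow_counts *c)
{
	if (c->max_pages > c->shadow_pages)
		return c->max_pages - c->shadow_pages;
	return 0;
}
```

With only indirect_pages visible, available_pages() cannot be computed, which
is the blind spot with respect to the limit described above.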

> I guess for most cases, mmu_shadow_pages will represent either the MMU
> pages used to shadow the VM's x86 page tables (with TDP off) or nested
> TDP MMU pages (with TDP on and nested used) -- but I do remember some
> interesting case about direct mappings in the shadow MMU or sth?

Yes, there can be direct shadow pages in an indirect mmu (guest is using a page size
that is larger than the host, in which case there are no gPTEs to shadow and thus
the gva=>gpa / l2_gpa=>l1_gpa translations in KVM's shadow pages are "direct").

* Re: [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
  2026-06-15 23:46   ` Sean Christopherson
@ 2026-06-15 23:57     ` Yosry Ahmed
  2026-06-16  0:20       ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: Yosry Ahmed @ 2026-06-15 23:57 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Jun 15, 2026 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jun 15, 2026, Yosry Ahmed wrote:
> > On Fri, Jun 12, 2026 at 6:37 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Turn arch.n_used_mmu_pages into a stat, mmu_shadow_pages, as the number of
> > > live shadow pages is arguably _the_ most critical datapoint when it comes
> > > to analyzing the shadow MMU.  Before the TDP MMU came along, i.e. when the
> > > shadow MMU was the only MMU, explicitly tracking the number of shadow pages
> > > wasn't as interesting, because the same information could more or less be
> > > gleaned from the pages_{1g,2m,4k} stats.  But with the TDP MMU, where the
> > > shadow MMU is only used for nested TDP, it becomes extremely difficult, if
> > > not impossible, to determine which SPTEs are coming from the TDP MMU, and
> > > which are coming from the shadow MMU.
> > >
> > > E.g. when triaging/debugging shadow MMU performance issues due to "too many
> > > shadow pages", being able to observe that 99%+ of all shadow pages are
> > > unsync is critical to being able to deduce that KVM is effectively leaking
> > > shadow pages.
> >
> > Why not expose indirect_shadow_pages? IIRC that was also one of the
> > stats we (mostly you) used while debugging?
>
> Because it's a subset of mmu_shadow_pages, and I suspect mmu_shadow_pages will
> be more helpful if we're only providing one of the two?  E.g. if the problem is
> that KVM is leaking indirect shadow pages, then either number will suffice.  But
> if KVM is zapping old SPTEs due to the KVM_SET_NR_MMU_PAGES limit, then we really
> want to see mmu_shadow_pages, otherwise there will be a blind spot with respect
> to direct shadow pages.  And if there's a bug that's specific to direct shadow pages,
> then we're probably hosed either way, because it will be difficult to observe just
> the direct shadow pages (unless they happen to be the _only_ pages, which is very
> unlikely, but then we'd still want mmu_shadow_pages).
>
> In practice, thanks to the TDP MMU deliberately _not_ accounting its pages as
> shadow pages, the delta between the two values will be tiny on setups with TDP
> enabled, i.e. on practically every modern deployment.  Because hypervisor page
> tables are typically tree-like, and hugepages are, well, huge, the number of
> direct shadow pages in indirect MMUs will be counted in tens or hundreds, out of
> thousands or tens of thousands of total shadow pages.

Yeah if we're choosing one, I think mmu_shadow_pages is more valuable.
What do we lose if we make both of them stats tho?

>
> > I guess for most cases, mmu_shadow_pages will represent either the MMU
> > pages used to shadow the VM's x86 page tables (with TDP off) or nested
> > TDP MMU pages (with TDP on and nested used) -- but I do remember some
> > interesting case about direct mappings in the shadow MMU or sth?
>
> Yes, there can be direct shadow pages in an indirect mmu (guest is using a page size
> that is larger than the host, in which case there are no gPTEs to shadow and thus
> the gva=>gpa / l2_gpa=>l1_gpa translations in KVM's shadow pages are "direct").

I never understood why these are "direct" tho. Sure, they do not
directly correspond to gPTEs, but they still are shadowing guest
mappings. IOW, if the guest has a PMD mapping and KVM has some PTE
mappings, aren't those PTE mappings still shadowing the PMD mapping,
and still need to be sync'd in the same way they would if they were
shadowing PTEs? The main difference I can think of is the 1:n
relationship?

* Re: [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
  2026-06-15 23:57     ` Yosry Ahmed
@ 2026-06-16  0:20       ` Sean Christopherson
  2026-06-16  0:29         ` Yosry Ahmed
  0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2026-06-16  0:20 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Jun 15, 2026, Yosry Ahmed wrote:
> On Mon, Jun 15, 2026 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > > I guess for most cases, mmu_shadow_pages will represent either the MMU
> > > pages used to shadow the VM's x86 page tables (with TDP off) or nested
> > > TDP MMU pages (with TDP on and nested used) -- but I do remember some
> > > interesting case about direct mappings in the shadow MMU or sth?
> >
> > Yes, there can be direct shadow pages in an indirect mmu (guest is using a page size
> > that is larger than the host, in which case there are no gPTEs to shadow and thus
> > the gva=>gpa / l2_gpa=>l1_gpa translations in KVM's shadow pages are "direct").
> 
> I never understood why these are "direct" tho. Sure, they do not
> directly correspond to gPTEs, but they still are shadowing guest
> mappings. 

No, the *parent* is shadowing a gPTE, the child is not.  Overall, KVM is shadowing
a mapping, but individual shadow pages are shadowing a specific gPTE in a specific
context, not a full mapping.

> IOW, if the guest has a PMD mapping and KVM has some PTE mappings, aren't
> those PTE mappings still shadowing the PMD mapping,

Nope, as above, there is an indirect shadow page with a base GFN that corresponds
to the guest PMD entry, i.e. that is shadowing the huge gPMD.  And so if that
gPMD is modified, KVM needs to adjust that parent shadow page.

But for the PG_LEVEL_4K shadow pages, i.e. the children, there is no corresponding
page table, no gPTEs to shadow, no gfn that needs write-protection, no protection
bits to account for, and no additional lookup to do when getting from the base GFN
to the final GFN.

See the use of sp->shadowed_translation, e.g. in kvm_mmu_page_get_gfn() and
kvm_mmu_page_get_access(), and also commit 9ecc1c119b28 ("KVM: x86/mmu: Only
allocate shadowed translation cache for sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL"),
which optimized KVM to avoid allocating unused metadata (KVM only ever needs to
query the gfn and access protections for leaf SPTEs).
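
As a rough illustration of the "no additional lookup" point, below is a
simplified userspace model of how a GFN can be derived for an entry in a
shadow page.  It is a sketch, not the kernel's code: struct sketch_sp, the
SKETCH_* constants and sketch_sp_get_gfn() are hypothetical, and the real
kvm_mmu_page_get_gfn() handles cases that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_LEVEL_BITS	9
#define SKETCH_ENTRIES		(1u << SKETCH_LEVEL_BITS)

typedef uint64_t gfn_t;

/* Hypothetical, heavily trimmed stand-in for struct kvm_mmu_page. */
struct sketch_sp {
	gfn_t gfn;				/* base GFN of the page */
	int level;				/* 1 == 4K, 2 == 2M, ... */
	const uint64_t *shadowed_translation;	/* NULL for direct pages */
};

gfn_t sketch_sp_get_gfn(const struct sketch_sp *sp, unsigned int index)
{
	assert(index < SKETCH_ENTRIES);

	/* Indirect: the GFN came from a gPTE and was cached per entry. */
	if (sp->shadowed_translation)
		return sp->shadowed_translation[index] >> SKETCH_PAGE_SHIFT;

	/* Direct: pure arithmetic on the base GFN, nothing to look up. */
	return sp->gfn + ((gfn_t)index << ((sp->level - 1) * SKETCH_LEVEL_BITS));
}
```

For a direct 4K child with base GFN 0x200, entry 5 maps GFN 0x205 without
consulting any guest page table, which is why there is nothing to
write-protect or sync for that child.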

> and still need to be sync'd in the same way they would if they were shadowing
> PTEs?

Ignoring for the moment that KVM never unsyncs PMD+ gPTEs, "yes", but only the
parents.

* Re: [PATCH v2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
  2026-06-16  0:20       ` Sean Christopherson
@ 2026-06-16  0:29         ` Yosry Ahmed
  0 siblings, 0 replies; 6+ messages in thread
From: Yosry Ahmed @ 2026-06-16  0:29 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Jun 15, 2026 at 5:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jun 15, 2026, Yosry Ahmed wrote:
> > On Mon, Jun 15, 2026 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > I guess for most cases, mmu_shadow_pages will represent either the MMU
> > > > pages used to shadow the VM's x86 page tables (with TDP off) or nested
> > > > TDP MMU pages (with TDP on and nested used) -- but I do remember some
> > > > interesting case about direct mappings in the shadow MMU or sth?
> > >
> > > Yes, there can be direct shadow pages in an indirect mmu (guest is using a page size
> > > that is larger than the host, in which case there are no gPTEs to shadow and thus
> > > the gva=>gpa / l2_gpa=>l1_gpa translations in KVM's shadow pages are "direct").
> >
> > I never understood why these are "direct" tho. Sure, they do not
> > directly correspond to gPTEs, but they still are shadowing guest
> > mappings.
>
> No, the *parent* is shadowing a gPTE, the child is not.  Overall, KVM is shadowing
> a mapping, but individual shadow pages are shadowing a specific gPTE in a specific
> context, not a full mapping.
>
> > IOW, if the guest has a PMD mapping and KVM has some PTE mappings, aren't
> > those PTE mappings still shadowing the PMD mapping,
>
> Nope, as above, there is an indirect shadow page with a base GFN that corresponds
> to the guest PMD entry, i.e. that is shadowing the huge gPMD.  And so if that
> gPMD is modified, KVM needs to adjust that parent shadow page.
>
> But for the PG_LEVEL_4K shadow pages, i.e. the children, there is no corresponding
> page table, no gPTEs to shadow, no gfn that needs write-protection, no protection
> bits to account for, and no additional lookup to do when getting from the base GFN
> to the final GFN.
>
> See the use of sp->shadowed_translation, e.g. in kvm_mmu_page_get_gfn() and
> kvm_mmu_page_get_access(), and also commit 9ecc1c119b28 ("KVM: x86/mmu: Only
> allocate shadowed translation cache for sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL"),
> which optimized KVM to avoid allocating unused metadata (KVM only ever needs to
> query the gfn and access protections for leaf SPTEs).

Yeah I guess the distinction makes sense from the shadow MMU
implementation perspective, what I was really getting at is that, to
your point, maybe it's best not to expose the direct/indirect
distinction to userspace (unless we already do elsewhere?). Seems like
the distinction (or rather the terminology) is implementation-specific
and can (in theory) change.

>
> > and still need to be sync'd in the same way they would if they were shadowing
> > PTEs?
>
> Ignoring for the moment that KVM never unsyncs PMD+ gPTEs, "yes", but only the
> parents.
