public inbox for linux-kernel@vger.kernel.org
From: Sean Christopherson <seanjc@google.com>
To: Kai Huang <kai.huang@intel.com>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>,
	Yan Y Zhao <yan.y.zhao@intel.com>,
	 "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Fan Du <fan.du@intel.com>,  Xiaoyao Li <xiaoyao.li@intel.com>,
	Chao Gao <chao.gao@intel.com>,
	 Dave Hansen <dave.hansen@intel.com>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	 "vbabka@suse.cz" <vbabka@suse.cz>,
	"tabba@google.com" <tabba@google.com>,
	"david@kernel.org" <david@kernel.org>,
	 "kas@kernel.org" <kas@kernel.org>,
	"michael.roth@amd.com" <michael.roth@amd.com>,
	Ira Weiny <ira.weiny@intel.com>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	 "binbin.wu@linux.intel.com" <binbin.wu@linux.intel.com>,
	 "ackerleytng@google.com" <ackerleytng@google.com>,
	"nik.borisov@suse.com" <nik.borisov@suse.com>,
	 Isaku Yamahata <isaku.yamahata@intel.com>,
	Chao P Peng <chao.p.peng@intel.com>,
	 "francescolavra.fl@gmail.com" <francescolavra.fl@gmail.com>,
	"sagis@google.com" <sagis@google.com>,
	 Vishal Annapurve <vannapurve@google.com>,
	Rick P Edgecombe <rick.p.edgecombe@intel.com>,
	 Jun Miao <jun.miao@intel.com>,
	"jgross@suse.com" <jgross@suse.com>,
	 "pgonda@google.com" <pgonda@google.com>,
	"x86@kernel.org" <x86@kernel.org>
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Date: Wed, 21 Jan 2026 09:30:28 -0800	[thread overview]
Message-ID: <aXENNKjAKTM9UJNH@google.com> (raw)
In-Reply-To: <b9487eba19c134c1801a536945e8ae57ea93032f.camel@intel.com>

On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> I have been thinking about whether we can simplify the solution, not only
> to avoid this complicated memory cache topup-then-consume mechanism under
> MMU read lock, but also to avoid the largely duplicated code (e.g. how to
> calculate how many DPAMT pages are needed for topup) between your next
> patch and the similar code in the DPAMT series for the per-vCPU cache.
> 
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
> 
> - For S-EPT pages: they are _ALWAYS_ 4K, so we can actually use
> tdx_alloc_page() directly, which also handles the DPAMT pages internally.
> 
> Here in tdp_mmu_alloc_sp_for_split():
> 
> 	sp->external_spt = tdx_alloc_page();
> 
> For the fault path we need to use the normal 'kvm_mmu_memory_cache', but
> that's a per-vCPU cache, which doesn't have the pain of the per-VM cache.
> As I mentioned in v3, I believe we can also hook in tdx_alloc_page() if we
> add two new obj_alloc()/obj_free() callbacks to 'kvm_mmu_memory_cache':
> 
> https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> 
> So we can get rid of the per-VM DPAMT cache for S-EPT pages.
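A self-contained sketch of this cache-hook idea is below.  All names and the
cache shape are illustrative only (not the actual kernel API), and
tdx_alloc_page()/tdx_free_page() are stubbed with plain allocations; the point
is that topup goes through the hook outside the lock, while consumption under
the lock never allocates:

```c
#include <assert.h>
#include <stdlib.h>

#define CACHE_CAPACITY 40

/* Hypothetical stand-in for kvm_mmu_memory_cache with the proposed hooks. */
struct mmu_memory_cache {
	int nobjs;
	void *(*obj_alloc)(void);	/* e.g. tdx_alloc_page() for S-EPT pages */
	void (*obj_free)(void *);	/* e.g. tdx_free_page() */
	void *objs[CACHE_CAPACITY];
};

/* Userspace stubs for the TDX allocation helpers. */
static void *stub_tdx_alloc_page(void) { return calloc(1, 4096); }
static void stub_tdx_free_page(void *p) { free(p); }

/* Fill the cache up to @min objects; called outside mmu_lock, may sleep. */
static int cache_topup(struct mmu_memory_cache *mc, int min)
{
	while (mc->nobjs < min) {
		void *obj = mc->obj_alloc();

		if (!obj)
			return -1;
		mc->objs[mc->nobjs++] = obj;
	}
	return 0;
}

/* Consume from the pre-topped-up stock; called under lock, never allocates. */
static void *cache_alloc(struct mmu_memory_cache *mc)
{
	return mc->nobjs ? mc->objs[--mc->nobjs] : NULL;
}

/* Drain unconsumed objects back through the matching free hook. */
static void cache_free(struct mmu_memory_cache *mc)
{
	while (mc->nobjs)
		mc->obj_free(mc->objs[--mc->nobjs]);
}
```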
> 
> - For the DPAMT pages backing TDX guest private memory, I think we can
> also get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry
> the needed DPAMT pages:
> 
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
>                  * Passed to TDX module, not accessed by KVM.
>                  */
>                 void *external_spt;
> +               void *leaf_level_private;
>         };

There's no need to put this in the union with external_spt; we could throw it in
a new union with unsync_child_bitmap (the TDP MMU can't have unsync children).
IIRC, the main reason I've never suggested unionizing unsync_child_bitmap is that
overloading the bitmap would risk corruption if KVM ever marked a TDP MMU page as
unsync, but that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-       struct kvm_mmu_page *sp;
+       struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-       sp = sptep_to_sp(spte);
+       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+               return;
        if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
                return;
        if (sp->unsync_children++)


I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.

> Then we can define a structure which contains the DPAMT pages for a given
> 2M range:
> 
> 	struct tdx_dpamt_metadata {
> 		struct page *page1;
> 		struct page *page2;
> 	};
> 
> Then when we allocate sp->external_spt, we can also allocate
> leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> last-level page table.
> 
> In this case, I think we can get rid of the per-VM DPAMT cache?
> 
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_metadata' if we utilize the two new
> obj_alloc()/obj_free() hooks.
> 
> The cost is that the new 'leaf_level_private' takes an additional 8 bytes
> for non-TDX guests even though it is never used, but if what I said above
> is feasible, maybe it's worth the cost.
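Roughly, the proposed layout and allocation flow could look like the sketch
below.  It is a userspace stand-in under stated assumptions: 'struct page' is
replaced by plain 4K buffers, tdx_alloc_page() by calloc(), and the last-level
check is simplified to a level-1 (4K-entry) table, which spans one 2M range:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct tdx_dpamt_metadata {
	void *page1;	/* the two 4K pages backing one 2M PAMT range */
	void *page2;
};

/* Simplified kvm_mmu_page carrying the proposed leaf_level_private field. */
struct kvm_mmu_page {
	int role_level;
	void *external_spt;		/* S-EPT page, passed to the TDX module */
	void *leaf_level_private;	/* struct tdx_dpamt_metadata, leaf SPs only */
};

#define PG_LEVEL_4K 1

/* A level-1 table maps 4K entries, i.e. it is the last-level table and
 * covers exactly one 2M range that needs DPAMT backing. */
static bool sp_is_last_level(const struct kvm_mmu_page *sp)
{
	return sp->role_level == PG_LEVEL_4K;
}

/* Allocate the S-EPT page, plus the DPAMT page pair for last-level SPs. */
static int alloc_external_state(struct kvm_mmu_page *sp)
{
	sp->external_spt = calloc(1, 4096);	/* stands in for tdx_alloc_page() */
	if (!sp->external_spt)
		return -1;

	if (sp_is_last_level(sp)) {
		struct tdx_dpamt_metadata *m = calloc(1, sizeof(*m));

		if (!m)
			return -1;
		m->page1 = calloc(1, 4096);
		m->page2 = calloc(1, 4096);
		sp->leaf_level_private = m;
	}
	return 0;
}

static void free_external_state(struct kvm_mmu_page *sp)
{
	struct tdx_dpamt_metadata *m = sp->leaf_level_private;

	if (m) {
		free(m->page1);
		free(m->page2);
		free(m);
		sp->leaf_level_private = NULL;
	}
	free(sp->external_spt);
	sp->external_spt = NULL;
}
```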
> 
> But it's completely possible that I missed something.  Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
track PAMT pages except for memory that is mapped into a guest, and we end up
with better symmetry and more consistency throughout TDX.  E.g. all pages that
KVM allocates and gifts to the TDX-Module will be allocated and freed via the
same TDX APIs.

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per vCPU that end up being freed without ever being gifted to the
TDX-Module.  But I doubt that will be a problem in practice, because odds are
good the adjacent pages/pfns will already have been consumed, i.e. the
"speculative" allocation is really just bumping the refcount.  And _if_ it's a
problem, e.g. results in too many wasted DPAMT entries, then it's one we can
solve in KVM by tuning the cache capacity to less aggressively allocate DPAMT
entries.
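The refcount behavior described above can be sketched as follows.  The
pamt_refcount/tdx_pamt_get()/tdx_pamt_put() names follow the DPAMT series, but
this is an illustrative userspace model: the PAMT page pair for a 2M region is
allocated only on the 0->1 transition and freed on 1->0, so a "speculative" get
for an already-populated region just bumps the count:

```c
#include <assert.h>
#include <stdlib.h>

#define NR_2M_REGIONS 8

/* Per-2M-region state; in the real series this lives in the PAMT itself. */
static int pamt_refcount[NR_2M_REGIONS];
static void *pamt_backing[NR_2M_REGIONS];

/* Allocate DPAMT backing on the first get, otherwise just bump the count. */
static int tdx_pamt_get(int region)
{
	if (pamt_refcount[region]++ == 0) {
		pamt_backing[region] = calloc(2, 4096);	/* the 2-page pair */
		if (!pamt_backing[region]) {
			pamt_refcount[region] = 0;
			return -1;
		}
	}
	return 0;
}

/* Free the backing only when the last user of the 2M region goes away. */
static void tdx_pamt_put(int region)
{
	assert(pamt_refcount[region] > 0);
	if (--pamt_refcount[region] == 0) {
		free(pamt_backing[region]);
		pamt_backing[region] = NULL;
	}
}
```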

I'll send a compile-tested v4 of the DPAMT series later today (I think I can
get it out today), as I have other non-trivial feedback that I've accumulated
while going through the patches.

