From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Kai Huang <kai.huang@intel.com>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
Fan Du <fan.du@intel.com>, Xiaoyao Li <xiaoyao.li@intel.com>,
Chao Gao <chao.gao@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
"vbabka@suse.cz" <vbabka@suse.cz>,
"tabba@google.com" <tabba@google.com>,
"david@kernel.org" <david@kernel.org>,
"kas@kernel.org" <kas@kernel.org>,
"michael.roth@amd.com" <michael.roth@amd.com>,
Ira Weiny <ira.weiny@intel.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"binbin.wu@linux.intel.com" <binbin.wu@linux.intel.com>,
"ackerleytng@google.com" <ackerleytng@google.com>,
"nik.borisov@suse.com" <nik.borisov@suse.com>,
Isaku Yamahata <isaku.yamahata@intel.com>,
Chao P Peng <chao.p.peng@intel.com>,
"francescolavra.fl@gmail.com" <francescolavra.fl@gmail.com>,
"sagis@google.com" <sagis@google.com>,
Vishal Annapurve <vannapurve@google.com>,
Rick P Edgecombe <rick.p.edgecombe@intel.com>,
Jun Miao <jun.miao@intel.com>,
"jgross@suse.com" <jgross@suse.com>,
"pgonda@google.com" <pgonda@google.com>,
"x86@kernel.org" <x86@kernel.org>
Subject: Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
Date: Tue, 20 Jan 2026 09:51:06 -0800 [thread overview]
Message-ID: <aW_Aith2qkYQ3fGY@google.com> (raw)
In-Reply-To: <aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com>
On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > So how about:
> > >
> > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > >
> > > ?
> >
> > I find the "cross_boundary" terminology extremely confusing. I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> > The other wart is that it's inefficient when punching a large hole. E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
userspace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> That's a reasonable concern. I actually thought about it.
> My consideration was as follows:
> Currently, we don't have such large areas. Usually, the conversion ranges are
> less than 1GB.
Nothing guarantees that behavior.
> Though the initial conversion which converts all memory from private to
> shared may be wide, there are usually no mappings at that stage. So, the
> traversal should be very fast (since the traversal doesn't even need to go
> down to the 2MB/1GB level).
>
> If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> very large range at runtime, it can optimize by invoking the API twice:
> once for range [start, ALIGN(start, 1GB)), and
> once for range [ALIGN_DOWN(end, 1GB), end).
>
> I can also implement this optimization within kvm_split_cross_boundary_leafs()
> by checking the range size if you think that would be better.
>
> > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> should add a better comment.
>
> There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> 1. PUNCH_HOLE
> 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
> private-to-shared conversions
> 3. tdx_honor_guest_accept_level()
> 4. kvm_gmem_error_folio()
>
> Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> and will be implemented in the next version (because guest_memfd may split
> folios without first splitting S-EPT).
>
> The 4 use cases can be divided into two categories:
>
> 1. Category 1: use cases 1, 2, 4
> We must ensure GFN start - 1 and GFN start are not mapped in a single
> mapping. However, for GFN start or GFN start - 1 specifically, we don't care
> about their actual mapping levels, which means they are free to be mapped at
> 2MB or 1GB. The same applies to GFN end - 1 and GFN end.
>
> --|------------------|-----------
>   ^                  ^
> start              end - 1
>
> 2. Category 2: use case 3
> It cares about the mapping level of the GFN, i.e., it must not be mapped
> above a certain level.
>
> -----|-------
>      ^
>     GFN
>
> So, to unify the two categories, I have tdx_honor_guest_accept_level() check
> the range [level-aligned GFN, level-aligned GFN + level size). e.g., if the
> accept level is 2MB, only a 1GB mapping can lie outside the range and need
> splitting.
But that overlooks the fact that Category 2 already fits the existing "category"
that is supported by the TDP MMU. I.e. Category 1 is (somewhat) new and novel,
Category 2 is not.
> -----|-------------|---
>      ^             ^
>      |             |
> level-aligned    level-aligned
> GFN              GFN + level size - 1
>
>
> > For the EPT violation case, the guest is accepting a page. Just split to the
> > guest's accepted level, I don't see any reason to make things more complicated
> > than that.
> This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> need a return value.
Just expose tdp_mmu_split_huge_pages_root(), the fault path only _needs_ to split
the current root, and in fact shouldn't even try to split other roots (ignoring
that no other relevant roots exist).
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..7d924da75106 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	return ret;
 }
 
-static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
-					 struct kvm_mmu_page *root,
-					 gfn_t start, gfn_t end,
-					 int target_level, bool shared)
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+				  gfn_t start, gfn_t end, int target_level,
+				  bool shared)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..ea9a509608fb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+				  gfn_t start, gfn_t end, int target_level,
+				  bool shared);
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 				      const struct kvm_memory_slot *slot,
 				      gfn_t start, gfn_t end,
> > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > and tail pages need to be split, and use the existing APIs to make that happen.
> This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
Modifying existing code is a non-issue, and you're already modifying TDP MMU
functions, so I don't see that as a reason for choosing X instead of Y.
> Or which existing APIs are you referring to?
See above.
> The cross_boundary information is still useful?
>
> BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> tdp_mmu_split_huge_pages_root() (as shown below).
>
> kvm_split_cross_boundary_leafs
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
> tdp_mmu_split_huge_pages_root
>
> However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> mappings in a wide range, so it temporarily releases mmu_lock for memory
> allocation for sp, since it can't predict how many pages to pre-allocate in the
> KVM mmu cache.
>
> For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().
That's completely orthogonal to the "only need to maybe split head and tail pages".
E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
case because that path can drop mmu_lock at will, unlike the full page fault path.
I.e. the complexity doesn't justify the benefits, especially since the max number
of pages is so large.
AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
because the allocation is done outside of KVM's control. But that's a solvable
problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
for a mismatched accept.