From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Kai Huang <kai.huang@intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	 "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Fan Du <fan.du@intel.com>,  Xiaoyao Li <xiaoyao.li@intel.com>,
	Chao Gao <chao.gao@intel.com>,
	 Dave Hansen <dave.hansen@intel.com>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	 "vbabka@suse.cz" <vbabka@suse.cz>,
	"tabba@google.com" <tabba@google.com>,
	"david@kernel.org" <david@kernel.org>,
	 "kas@kernel.org" <kas@kernel.org>,
	"michael.roth@amd.com" <michael.roth@amd.com>,
	Ira Weiny <ira.weiny@intel.com>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	 "binbin.wu@linux.intel.com" <binbin.wu@linux.intel.com>,
	 "ackerleytng@google.com" <ackerleytng@google.com>,
	"nik.borisov@suse.com" <nik.borisov@suse.com>,
	 Isaku Yamahata <isaku.yamahata@intel.com>,
	Chao P Peng <chao.p.peng@intel.com>,
	 "francescolavra.fl@gmail.com" <francescolavra.fl@gmail.com>,
	"sagis@google.com" <sagis@google.com>,
	 Vishal Annapurve <vannapurve@google.com>,
	Rick P Edgecombe <rick.p.edgecombe@intel.com>,
	 Jun Miao <jun.miao@intel.com>,
	"jgross@suse.com" <jgross@suse.com>,
	 "pgonda@google.com" <pgonda@google.com>,
	"x86@kernel.org" <x86@kernel.org>
Subject: Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
Date: Tue, 20 Jan 2026 09:51:06 -0800
Message-ID: <aW_Aith2qkYQ3fGY@google.com>
In-Reply-To: <aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com>

On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > So how about:
> > > 
> > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > 
> > > ?
> > 
> > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> That's a reasonable concern. I actually thought about it.
> My consideration was as follows:
> Currently, we don't have such large areas. Usually, the conversion ranges are
> less than 1GB. 

Nothing guarantees that behavior.

> Though the initial conversion which converts all memory from private to
> shared may be wide, there are usually no mappings at that stage. So, the
> traversal should be very fast (since the traversal doesn't even need to go
> down to the 2MB/1GB level).
> 
> If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> very large range at runtime, it can optimize by invoking the API twice:
> once for range [start, ALIGN(start, 1GB)), and
> once for range [ALIGN_DOWN(end, 1GB), end).
> 
> I can also implement this optimization within kvm_split_cross_boundary_leafs()
> by checking the range size if you think that would be better.
> 
> > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> should add a better comment.
> 
> There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> 1. PUNCH_HOLE
> 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
>    private-to-shared conversions
> 3. tdx_honor_guest_accept_level()
> 4. kvm_gmem_error_folio()
> 
> Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> and will be implemented in the next version (because guest_memfd may split
> folios without first splitting S-EPT).
> 
> The 4 use cases can be divided into two categories:
> 
> 1. Category 1: use cases 1, 2, 4
>    We must ensure GFN start - 1 and GFN start are not mapped in a single
>    mapping. However, for GFN start or GFN start - 1 specifically, we don't care
>    about their actual mapping levels, which means they are free to be mapped at
>    2MB or 1GB. The same applies to GFN end - 1 and GFN end.
> 
>    --|------------------|-----------
>      ^                  ^
>     start              end - 1 
> 
> 2. Category 2: use case 3
>    It cares about the mapping level of the GFN, i.e., it must not be mapped
>    above a certain level.
> 
>    -----|-------
>         ^
>        GFN
> 
>    So, to unify the two categories, I have tdx_honor_guest_accept_level() check
>    the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
>    If the accept level is 2MB, only 1GB mapping is possible to be outside the
>    range and needs splitting.

But that overlooks the fact that Category 2 already fits the existing "category"
that is supported by the TDP MMU.  I.e. Category 1 is (somewhat) new and novel,
Category 2 is not.

>    -----|-------------|---
>         ^             ^
>         |             |
>    level-aligned     level-aligned
>       GFN            GFN + level size - 1
> 
> 
> > For the EPT violation case, the guest is accepting a page.  Just split to the
> > guest's accepted level, I don't see any reason to make things more complicated
> > than that.
> This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> need a return value.

Just expose tdp_mmu_split_huge_pages_root(), the fault path only _needs_ to split
the current root, and in fact shouldn't even try to split other roots (ignoring
that no other relevant roots exist).

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..7d924da75106 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
        return ret;
 }
 
-static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
-                                        struct kvm_mmu_page *root,
-                                        gfn_t start, gfn_t end,
-                                        int target_level, bool shared)
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+                                 gfn_t start, gfn_t end, int target_level,
+                                 bool shared)
 {
        struct kvm_mmu_page *sp = NULL;
        struct tdp_iter iter;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..ea9a509608fb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
                                   struct kvm_memory_slot *slot, gfn_t gfn,
                                   int min_level);
 
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+                                 gfn_t start, gfn_t end, int target_level,
+                                 bool shared);
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
                                      const struct kvm_memory_slot *slot,
                                      gfn_t start, gfn_t end,
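
For the mismatched-accept case, the range handed to the exposed helper would
presumably be the block that contains the faulting GFN at the guest's accept
level.  A minimal userspace sketch of that math follows; the level constants,
pages_per_level() and accept_level_range() are hypothetical names for
illustration, not existing KVM symbols (in-kernel code would use
KVM_PAGES_PER_HPAGE() instead), and it assumes 4KiB base pages with 9 bits
per page-table level.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-ins for PG_LEVEL_4K/2M/1G. */
enum { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

struct gfn_range {
	gfn_t start;	/* inclusive */
	gfn_t end;	/* exclusive */
};

/* Number of 4KiB pages covered by one mapping at @level. */
static uint64_t pages_per_level(int level)
{
	assert(level >= LEVEL_4K && level <= LEVEL_1G);
	return 1ULL << ((level - 1) * 9);
}

/*
 * Return [level-aligned GFN, level-aligned GFN + level size) for the
 * block containing @gfn at the guest's accept @level.
 */
static struct gfn_range accept_level_range(gfn_t gfn, int level)
{
	uint64_t nr_pages = pages_per_level(level);
	struct gfn_range r;

	r.start = gfn & ~(nr_pages - 1);
	r.end = r.start + nr_pages;
	return r;
}
```

With such a range in hand, the fault path would call
tdp_mmu_split_huge_pages_root() on the current root with target_level set to
the accept level; e.g. accepting GFN 0x40233 at 2MB yields [0x40200, 0x40400).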

> > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > and tail pages need to be split, and use the existing APIs to make that happen.
> This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.

Modifying existing code is a non-issue, and you're already modifying TDP MMU
functions, so I don't see that as a reason for choosing X instead of Y.
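
One possible reading of "do the math" for PUNCH_HOLE, as a userspace sketch
rather than proposed kernel code: a leaf can only straddle a boundary GFN if
the leaf's level is higher than the largest level at which that GFN is
aligned, so each boundary yields at most one target level and no walk of the
hole's interior is needed.  The level constants and both helpers below are
hypothetical names, and the sketch assumes 4KiB base pages with 9 bits per
page-table level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-ins for PG_LEVEL_4K/2M/1G. */
enum { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

/* Number of 4KiB pages covered by one mapping at @level. */
static uint64_t pages_per_level(int level)
{
	assert(level >= LEVEL_4K && level <= LEVEL_1G);
	return 1ULL << ((level - 1) * 9);
}

/*
 * Largest level (capped at @max_level) at which @gfn is aligned.  A leaf
 * at or below this level cannot contain both @gfn - 1 and @gfn.
 */
static int boundary_target_level(gfn_t gfn, int max_level)
{
	int level = LEVEL_4K;

	while (level < max_level &&
	       !(gfn & (pages_per_level(level + 1) - 1)))
		level++;

	return level;
}

/* True if a leaf at up to @max_level could straddle the boundary @gfn. */
static bool boundary_needs_split(gfn_t gfn, int max_level)
{
	return boundary_target_level(gfn, max_level) < max_level;
}
```

For a hole covering GFNs [0x40200, 0x80000) with 1GB as the largest mapping
size, the start is 2MB-aligned but not 1GB-aligned, so only a 1GB leaf over
it would need splitting, and only down to 2MB; the end is 1GB-aligned and
needs nothing.  The caller would then split just the single GFN at each
boundary that needs it, to the computed target level.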

> Or which existing APIs are you referring to?

See above.

> The cross_boundary information is still useful?
> 
> BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> tdp_mmu_split_huge_pages_root() (as shown below).
> 
> kvm_split_cross_boundary_leafs
>   kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
>     tdp_mmu_split_huge_pages_root
> 
> However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> mappings in a wide range, so it temporarily releases mmu_lock for memory
> allocation for sp, since it can't predict how many pages to pre-allocate in the
> KVM mmu cache.
> 
> For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().

That's completely orthogonal to the "only need to maybe split head and tail pages".
E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
case because that path can drop mmu_lock at will, unlike the full page fault path.
I.e. the complexity doesn't justify the benefits, especially since the max number
of pages is so large.

AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
because the allocation is done outside of KVM's control.  But that's a solvable
problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
for a mismatched accept.
