From: Ackerley Tng <ackerleytng@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>, Sean Christopherson <seanjc@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>,
pbonzini@redhat.com, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com,
dave.hansen@intel.com, kas@kernel.org, tabba@google.com,
michael.roth@amd.com, david@kernel.org, sagis@google.com,
vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com,
pgonda@google.com, fan.du@intel.com, jun.miao@intel.com,
francescolavra.fl@gmail.com, jgross@suse.com,
ira.weiny@intel.com, isaku.yamahata@intel.com,
xiaoyao.li@intel.com, kai.huang@intel.com,
binbin.wu@linux.intel.com, chao.p.peng@intel.com,
chao.gao@intel.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Date: Thu, 08 Jan 2026 12:11:14 -0800
Message-ID: <diqzqzrzdfvh.fsf@google.com>
In-Reply-To: <aV4hAfPZXfKKB+7i@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
>> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > Sean Christopherson <seanjc@google.com> writes:
>> >
>> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> Vishal Annapurve <vannapurve@google.com> writes:
>> > >>
>> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > >> >>
>> > >> >> - EPT mapping size and folio size
>> > >> >>
>> > >> >> This series is built upon the rule in KVM that the mapping size in the
>> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
>> > >> >>
>> > >>
>> > >> I'm not familiar with this rule and would like to find out more. Why is
>> > >> this rule imposed?
>> > >
>> > > Because it's the only sane way to safely map memory into the guest? :-D
>> > >
>> > >> Is this rule there just because traditionally folio sizes also define the
>> > >> limit of contiguity, and so the mapping size must not be greater than folio
>> > >> size in case the block of memory represented by the folio is not contiguous?
>> > >
>> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
>> > > is) strictly bound by the host mapping size. That handles contiguous addresses,
>> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>> > >
>> > >> In guest_memfd's case, even if the folio is split (just for refcount
>> > >> tracking purposes on private-to-shared conversion), the memory is still
>> > >> contiguous up to the original folio's size. Will the contiguity address
>> > >> the concerns?
>> > >
>> > > Not really? Why would the folio be split if the memory _and its attributes_ are
>> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
>> > > spanning mixed ranges, i.e. with multiple folios.
>> >
>> > The folio can be split if any (or all) of the pages in a huge page range
>> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
>> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
>> > would be split, and the split folios are necessary for tracking users of
>> > shared pages using struct page refcounts.
>>
>> Ahh, that's what the refcounting was referring to. Gotcha.
>>
>> > However the split folios in that 1G range are still fully contiguous.
>> >
>> > The process of conversion will split the EPT entries soon after the
>> > folios are split so the rule remains upheld.
Correction here: if we split uniformly from 1G to 4K on sharing, only
the EPT entries around the shared 4K folio will have their page table
entries split, so many of the EPT entries will remain at 2M level even
though the folios are 4K sized. This would last beyond the conversion
process.
> Overall, I don't think allowing folios smaller than the mappings while
> conversion is in progress brings enough benefit.
>
I'll look into making the restructuring process always succeed, but off
the top of my head that's hard because

1. The HugeTLB Vmemmap Optimization code would have to be refactored to
   use pre-allocated pages, which means refactoring deep in HugeTLB
   code.
2. If we want to split non-uniformly such that only the folios that are
   shared are 4K, and the remaining folios are as large as possible
   (PMD-sized as much as possible), it gets complex to figure out how
   many pages to allocate ahead of time.

So it's complex and will probably delay HugeTLB+conversion support even
more!
> Cons:
> (1) TDX's zapping callback has no idea whether the zapping is caused by an
> in-progress private-to-shared conversion or other reasons. It also has no
> idea if the attributes of the underlying folios remain unchanged during an
> in-progress private-to-shared conversion. Even if the assertion Ackerley
> mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> callback for in-progress private-to-shared conversion alone (which would
> increase TDX's dependency on guest_memfd's specific implementation even if
> it's feasible).
>
> Removing the sanity checks entirely in TDX's zapping callback is confusing
> and would show a bad/false expectation from KVM -- what if a huge folio is
> incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> others) in other conditions? And then do we still need the check in TDX's
> mapping callback? If not, does it mean TDX huge pages can stop relying on
> guest_memfd's ability to allocate huge folios, as KVM could still create
> huge mappings as long as small folios are physically contiguous with
> homogeneous memory attributes?
>
> (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> invalidate lock held in __kvm_gmem_set_attributes() could guard against
> concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
>
I think the central question among all the above is what TDX actually
needs to care about (putting aside KVM's folio size/memory contiguity
vs mapping level rule for a while).

I think TDX code can check what it cares about (if required to aid
debugging, as Dave suggested). Does TDX actually care about folio sizes,
or does it care about memory contiguity and alignment?

Separately, KVM could also enforce the folio size/memory contiguity vs
mapping level rule, but TDX code shouldn't enforce KVM's rules. So even
if the check is deemed necessary, it still shouldn't live in TDX code, I
think.
> Pro: Preventing zapping private memory until conversion is successful is good.
>
> However, could we achieve this benefit in other ways? For example, is it
> possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> optimization? (hugetlb_vmemmap conversion is super slow according to my
> observation and I always disable it).
HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
huge VM, multiplied by a large number of hosts, this is not a trivial
amount of memory. It's one of the key reasons we are using HugeTLB in
guest_memfd in the first place, besides being able to get higher-level
page table mappings. We want this in production.
> Or pre-allocation for
> vmemmap_remap_alloc()?
>
Will investigate whether this is possible, as mentioned above. Thanks
again for the suggestion!
> Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> private memory before conversion succeeds is still better than introducing the
> mess between folio size and mapping size.
>
>> > I guess perhaps the question is, is it okay if the folios are smaller
>> > than the mapping while conversion is in progress? Does the order matter
>> > (split page table entries first vs split folios first)?
>>
>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
>> conceptually totally fine, i.e. I'm not totally opposed to adding support for
>> mapping multiple guest_memfd folios with a single hugepage. As to whether we
>> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
>> multiple folios in one page, probably comes down to which option provides "good
>> enough" performance without incurring too much complexity.