From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ackerley Tng <ackerleytng@google.com>,
Vishal Annapurve <vannapurve@google.com>,
pbonzini@redhat.com, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com,
dave.hansen@intel.com, kas@kernel.org, tabba@google.com,
michael.roth@amd.com, david@kernel.org, sagis@google.com,
vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com,
pgonda@google.com, fan.du@intel.com, jun.miao@intel.com,
francescolavra.fl@gmail.com, jgross@suse.com,
ira.weiny@intel.com, isaku.yamahata@intel.com,
xiaoyao.li@intel.com, kai.huang@intel.com,
binbin.wu@linux.intel.com, chao.p.peng@intel.com,
chao.gao@intel.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Date: Tue, 13 Jan 2026 17:24:36 -0800 [thread overview]
Message-ID: <aWbwVG8aZupbHBh4@google.com> (raw)
In-Reply-To: <aWbkcRshLiL4NWZg@yzhao56-desk.sh.intel.com>
On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> >
> > > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogeneous is
> > > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> >
> > Sean, I'd like to clarify this.
> >
> > > do (a) nothing,
> >
> > What does do nothing mean here?
Don't support hugepages for shared mappings, at least for now (as Rick pointed
out, doing nothing now doesn't mean we can't do something in the future).
> > In this patch series the TDX functions do sanity checks ensuring that
> > mapping size <= folio size. IIUC the checks at mapping time, like in
> > tdh_mem_page_aug() would be fine since at the time of mapping, the
> > mapping size <= folio size, but we'd be in trouble at the time of
> > zapping, since that's when mapping sizes > folio sizes get discovered.
> >
> > The sanity checks are in principle in direct conflict with allowing
> > mapping of multiple guest_memfd folios at hugepage level.
> >
> > > (b) change the refcounting, or
> >
> > I think this is pretty hard unless something changes in core MM that
> > allows refcounting to be customizable by the FS. guest_memfd would love
> > to have that, but customizable refcounting is going to hurt refcounting
> > performance throughout the kernel.
> >
> > > (c) add support for mapping multiple folios in one page,
> >
> > Where would the changes need to be made, IIUC there aren't any checks
> > currently elsewhere in KVM to ensure that mapping size <= folio size,
> > other than the sanity checks in the TDX code proposed in this series.
> >
> > Does any support need to be added, or is it about amending the
> > unenforced/unwritten rule from "mapping size <= folio size" to "mapping
> > size <= contiguous memory size"?
>
> The rule is not "unenforced/unwritten". In fact, it's the de facto standard in
> KVM.
Ya, more or less.
The rules aren't formally documented because the overarching rule is very
simple: KVM must not map memory into the guest that the guest shouldn't have
access to. That falls firmly into the "well, duh" category, and so it's not
written down anywhere :-)
How exactly KVM has honored that rule has varied over the years, and still varies
between architectures. In the past KVM x86 special cased HugeTLB and THP, but
that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice
with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU
page tables.
But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).
> For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> mapping size in the secondary MMU, while the primary MMU does not create a
> mapping larger than the backend folio size.
Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a
driver _could_ split a folio (no idea why it would) but map the entire thing into
userspace, and then userspace could hand off that memory to KVM.
So _KVM's_ rule isn't so much "mapping size <= folio size", it's that
"KVM mapping size <= primary MMU mapping size", at least for x86. Arm's
VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
But yes, ignoring VM_PFNMAP, AFAIK neither the primary MMU nor KVM maps larger
than the folio size.
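To make the rule concrete, here's a minimal sketch (illustrative only, not
actual KVM code; the function name and level convention are assumptions): the
secondary MMU's mapping level is capped by the primary MMU's level, and a huge
mapping additionally requires the gfn and pfn to be co-aligned within the page
size of that level. Levels follow the x86 convention: 1 = 4KiB, 2 = 2MiB,
3 = 1GiB.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of "KVM mapping size <= primary MMU mapping size".
 * Start at the primary MMU's level and shrink until the guest frame
 * number and host frame number are co-aligned for that level, since a
 * huge mapping must cover a naturally aligned range on both sides.
 */
static int max_secondary_level(uint64_t gfn, uint64_t pfn, int primary_level)
{
	int level;

	for (level = primary_level; level > 1; level--) {
		/* Low bits of the frame number covered by this level. */
		uint64_t mask = (1ULL << (9 * (level - 1))) - 1;

		if ((gfn & mask) == (pfn & mask))
			break;
	}
	return level;
}
```

E.g. a 2MiB-mapped primary range whose gfn and pfn are equally offset still
yields a 2MiB secondary mapping, but any misalignment between the two forces
4KiB.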
> When splitting the backend folio, the Linux kernel unmaps the folio from both
> the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
>
> On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> than folio sizes, splitting folios while they are still mapped in the IOMMU
> stage-2 page table is not permitted due to the extra folio refcount held by the
> IOMMU.
>
> For gmem cases, KVM also does not create mappings larger than the folio size
> allocated from gmem. This is why the TDX huge page series relies on gmem's
> ability to allocate huge folios.
>
> We really need to be careful if we hope to break this long-established rule.
+100 to being careful, but at the same time I don't think we should get _too_
fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there
might not be a folio, if guest_memfd stopped using folios, then the entire
discussion becomes moot.
And as above, the long-standing rule isn't about the implementation details so
much as it is about KVM's behavior. If the simplest solution to support huge
guest_memfd pages is to decouple the max order from the folio, then so be it.
That said, I'd very much like to get a sense of the alternatives, because at the
end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
and naively, tying that to the folio seems like an easy solution.
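As a strawman for what "tying it to the folio" would look like (purely
illustrative; the helper name and shape are assumptions, not guest_memfd code):
the max order handed back to the fault handler is the folio order, reduced
until the mapping is naturally aligned at the faulting gfn.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: cap the mapping order at the backing folio's
 * order, then shrink until a mapping of that order would be naturally
 * aligned at 'gfn'.  Order 0 = 4KiB, order 9 = 2MiB, etc.
 */
static int gmem_max_order(uint64_t gfn, int folio_order)
{
	int order = folio_order;

	while (order > 0 && (gfn & ((1ULL << order) - 1)))
		order--;
	return order;
}
```

Decoupling the max order from the folio would mean replacing `folio_order`
above with whatever other per-range tracking guest_memfd grows, without
changing the alignment logic.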