public inbox for linux-kernel@vger.kernel.org
From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ackerley Tng <ackerleytng@google.com>,
	Vishal Annapurve <vannapurve@google.com>,
	pbonzini@redhat.com,  linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, x86@kernel.org,  rick.p.edgecombe@intel.com,
	dave.hansen@intel.com, kas@kernel.org,  tabba@google.com,
	michael.roth@amd.com, david@kernel.org, sagis@google.com,
	 vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com,
	 pgonda@google.com, fan.du@intel.com, jun.miao@intel.com,
	 francescolavra.fl@gmail.com, jgross@suse.com,
	ira.weiny@intel.com,  isaku.yamahata@intel.com,
	xiaoyao.li@intel.com, kai.huang@intel.com,
	 binbin.wu@linux.intel.com, chao.p.peng@intel.com,
	chao.gao@intel.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Date: Wed, 14 Jan 2026 07:26:44 -0800	[thread overview]
Message-ID: <aWe1tKpFw-As6VKg@google.com> (raw)
In-Reply-To: <aWdgfXNdBuzpVE2Z@yzhao56-desk.sh.intel.com>

On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 14, 2026, Yan Zhao wrote:
> > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > > mapping size in the secondary MMU, while the primary MMU does not create a
> > > mapping larger than the backend folio size.
> > 
> > Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> > driver _could_ split a folio (no idea why it would) but map the entire thing into
> > userspace, and then userspace could hand that memory off to KVM.
> > 
> > So KVM's rule isn't so much "mapping size <= folio size" as it is "KVM mapping
> > size <= primary MMU mapping size", at least for x86.  Arm's VM_PFNMAP code
> > sketches me out a bit, but on the other hand, a driver mapping discontiguous
> > pages into a single VM_PFNMAP VMA would be even more sketch.
> > 
> > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
> > than the folio size.
> 
> Oh. I forgot about the VM_PFNMAP case, which allows folios to be provided as the
> backend. Indeed, a driver can create a huge mapping in the primary MMU for a
> VM_PFNMAP range spanning multiple discontiguous pages, if it wants.
> 
> But this occurs before KVM creates the mapping. Per my understanding, pages
> under VM_PFNMAP are pinned,

Nope.  Only the driver that owns the VMAs knows what sits behind the PFN and the
lifecycle rules for that memory.

That last point is *very* important.  Even if the PFNs shoved into VM_PFNMAP VMAs
have an associated "struct page", that doesn't mean the "struct page" is refcounted,
i.e. can be pinned.  That detail was the heart of the "KVM: Stop grabbing references
to PFNMAP'd pages" overhaul[*].

To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (privileged)
userspace to "do the right thing", the secondary MMU needs to be tied into
mmu_notifiers, so that modifications to the mappings in the primary MMU are
reflected into the secondary MMU.

[*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
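As an illustration of that mmu_notifier tie-in (a rough sketch with hypothetical
`demo_*` names, not code from any series): a secondary MMU mirroring a primary-MMU
range would typically use the interval-notifier API so that primary-MMU
invalidations are propagated before they complete.

```c
#include <linux/mmu_notifier.h>

/* Hypothetical secondary-MMU context mirroring one range of a VMA. */
struct demo_smmu {
	struct mmu_interval_notifier notifier;
	/* ... secondary MMU page tables, locks, etc. ... */
};

/* Invoked when the primary MMU invalidates any part of the range, e.g. on
 * unmap, folio split, or permission change; the secondary MMU must zap or
 * update its mappings before the primary-MMU change completes. */
static bool demo_invalidate(struct mmu_interval_notifier *mni,
			    const struct mmu_notifier_range *range,
			    unsigned long cur_seq)
{
	struct demo_smmu *smmu = container_of(mni, struct demo_smmu, notifier);

	mmu_interval_set_seq(mni, cur_seq);
	/* Zap smmu's entries overlapping [range->start, range->end) here. */
	return true;
}

static const struct mmu_interval_notifier_ops demo_mni_ops = {
	.invalidate = demo_invalidate,
};

/* Registration (for some mm, start, length):
 *   mmu_interval_notifier_insert(&smmu->notifier, mm, start, length,
 *                                &demo_mni_ops);
 */
```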

> so it looks like there are no splits after they are mapped into the primary MMU.
> 
> So, out of curiosity, do you know why linux kernel needs to unmap mappings from
> both primary and secondary MMUs, and check folio refcount before performing
> folio splitting?

Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
if something is going through the effort of splitting a folio, then odds are very,
very good that the new folios can't be safely mapped as a contiguous hugepage.
Limiting mapping sizes to folios makes the rules/behavior straightforward for core
MM to implement, and for drivers/users to understand.

Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
communicate the maximum mapping size; folios are the "currency" for doing so.

And then for edge cases that want to map a split folio as a hugepage (if any such
edge cases exist), and thus take on the responsibility of managing the lifecycle of
the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
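A sketch of that vmf_insert_pfn() path (hypothetical `demo_*` driver, assumed
fields like `base_pfn`; not code from any real driver): the driver inserts raw
PFNs on fault and thereby owns the mapping lifecycle itself.

```c
#include <linux/fs.h>
#include <linux/mm.h>

struct demo_dev {
	unsigned long base_pfn;	/* hypothetical: first PFN of the region */
};

/* Fault handler: insert a raw PFN; no struct page or folio is involved,
 * so the driver alone is responsible for the mapping's lifecycle. */
static vm_fault_t demo_fault(struct vm_fault *vmf)
{
	struct demo_dev *dev = vmf->vma->vm_private_data;

	return vmf_insert_pfn(vmf->vma, vmf->address,
			      dev->base_pfn + vmf->pgoff);
}

static const struct vm_operations_struct demo_vm_ops = {
	.fault = demo_fault,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_private_data = file->private_data;
	vma->vm_ops = &demo_vm_ops;
	return 0;
}
```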
 
> > > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > > 
> > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > > stage-2 page table is not permitted due to the extra folio refcount held by the
> > > IOMMU.
> > > 
> > > For gmem cases, KVM also does not create mappings larger than the folio size
> > > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > > ability to allocate huge folios.
> > > 
> > > We really need to be careful if we hope to break this long-established rule.
> > 
> > +100 to being careful, but at the same time I don't think we should get _too_
> > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> > might not be a folio, if guest_memfd stopped using folios, then the entire
> > discussion becomes moot.
> > 
> > And as above, the long-standing rule isn't about the implementation details so
> > much as it is about KVM's behavior.  If the simplest solution to support huge
> > guest_memfd pages is to decouple the max order from the folio, then so be it.
> > 
> > That said, I'd very much like to get a sense of the alternatives, because at the
> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> > and naively, tying that to the folio seems like an easy solution.
> Thanks for the explanation.
> 
> Alternatively, how do you feel about the approach of splitting S-EPT first
> before splitting folios?
> If guest_memfd always splits 1GB folios to 2MB first and only splits the
> converted range to 4KB, splitting S-EPT before splitting folios should not
> introduce too much overhead. Then, we can defer the folio size problem until
> guest_memfd stops using folios.
> 
> If the decision is to stop relying on folios for unmapping now, do you think
> the following changes are reasonable for the TDX huge page series?
> 
> - Add WARN_ON_ONCE() to assert that pages are in a single folio in
>   tdh_mem_page_aug().
> - Do not assert that pages are in a single folio in
>   tdh_phymem_page_wbinvd_hkid(). (or just assert pfn_valid() for each page?)
>   Could you please give me guidance on
>   https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
> - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.

Ok, with the disclaimer that I hadn't actually looked at the patches in this
series before now...

TDX absolutely should not be doing _anything_ with folios.  I am *very* strongly
opposed to TDX assuming that memory is backed by refcounted "struct page", and
thus can use folios to glean the maximum mapping size.

guest_memfd is _the_ owner of that information.  guest_memfd needs to explicitly
_tell_ the rest of KVM what the maximum mapping size is; arch code should not
infer that size from a folio.

And that code+behavior already exists in the form of kvm_gmem_mapping_order() and
its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level.  IIUC,
the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because
tdx_clflush_page() ultimately requires a "struct page".  That is absolutely
ridiculous and not acceptable.  CLFLUSH takes a virtual address, there is *zero*
reason tdh_mem_page_aug() needs to require/assume a struct page.

Dave may feel differently, but I am not going to budge on this.  I am not going
to bake in assumptions throughout KVM about memory being backed by page+folio.
We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to
PFNMAP'd pages" series, I am NOT reintroducing such assumptions.

NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.

Thread overview: 127+ messages
2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2026-01-06 21:08   ` Dave Hansen
2026-01-07  9:12     ` Yan Zhao
2026-01-07 16:39       ` Dave Hansen
2026-01-08 19:05         ` Ackerley Tng
2026-01-08 19:24           ` Dave Hansen
2026-01-09 16:21             ` Vishal Annapurve
2026-01-09  3:08         ` Yan Zhao
2026-01-09 18:29           ` Ackerley Tng
2026-01-12  2:41             ` Yan Zhao
2026-01-13 16:50               ` Vishal Annapurve
2026-01-14  1:48                 ` Yan Zhao
2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
2026-01-16  1:00   ` Huang, Kai
2026-01-16  8:35     ` Yan Zhao
2026-01-16 11:10       ` Huang, Kai
2026-01-16 11:22         ` Huang, Kai
2026-01-19  6:18           ` Yan Zhao
2026-01-19  6:15         ` Yan Zhao
2026-01-16 11:22   ` Huang, Kai
2026-01-19  5:55     ` Yan Zhao
2026-01-28 22:49   ` Sean Christopherson
2026-01-06 10:19 ` [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
2026-01-06 10:19 ` [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private " Yan Zhao
2026-01-06 10:20 ` [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
2026-01-06 10:20 ` [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
2026-01-15 22:49   ` Sean Christopherson
2026-01-16  7:54     ` Yan Zhao
2026-01-26 16:08       ` Sean Christopherson
2026-01-27  3:40         ` Yan Zhao
2026-01-28 19:51           ` Sean Christopherson
2026-01-06 10:20 ` [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock Yan Zhao
2026-01-28 22:38   ` Sean Christopherson
2026-01-06 10:20 ` [PATCH v3 08/24] KVM: TDX: Enable huge page splitting " Yan Zhao
2026-01-06 10:21 ` [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX Yan Zhao
2026-01-06 10:21 ` [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
2026-01-06 10:21 ` [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
2026-01-15 12:25   ` Huang, Kai
2026-01-16 23:39     ` Sean Christopherson
2026-01-19  1:28       ` Yan Zhao
2026-01-19  8:35         ` Huang, Kai
2026-01-19  8:49           ` Huang, Kai
2026-01-19 10:11             ` Yan Zhao
2026-01-19 10:40               ` Huang, Kai
2026-01-19 11:06                 ` Yan Zhao
2026-01-19 12:32                   ` Yan Zhao
2026-01-29 14:36                     ` Sean Christopherson
2026-01-20 17:51         ` Sean Christopherson
2026-01-22  6:27           ` Yan Zhao
2026-01-20 17:57       ` Vishal Annapurve
2026-01-20 18:02         ` Sean Christopherson
2026-01-22  6:33           ` Yan Zhao
2026-01-29 14:51             ` Sean Christopherson
2026-01-06 10:21 ` [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
2026-01-06 10:22 ` [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation Yan Zhao
2026-01-06 10:22 ` [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
2026-01-16  0:21   ` Sean Christopherson
2026-01-16  6:42     ` Yan Zhao
2026-01-06 10:22 ` [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
2026-01-06 10:22 ` [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
2026-01-28 22:39   ` Sean Christopherson
2026-01-06 10:23 ` [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB Yan Zhao
2026-01-06 10:23 ` [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails Yan Zhao
2026-01-06 10:23 ` [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting Yan Zhao
2026-01-21  1:54   ` Huang, Kai
2026-01-21 17:30     ` Sean Christopherson
2026-01-21 19:39       ` Edgecombe, Rick P
2026-01-21 23:01       ` Huang, Kai
2026-01-22  7:03       ` Yan Zhao
2026-01-22  7:30         ` Huang, Kai
2026-01-22  7:49           ` Yan Zhao
2026-01-22 10:33             ` Huang, Kai
2026-01-06 10:23 ` [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX Yan Zhao
2026-01-06 10:23 ` [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting Yan Zhao
2026-01-06 10:24 ` [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote Yan Zhao
2026-01-19 10:52   ` Huang, Kai
2026-01-19 11:11     ` Yan Zhao
2026-01-06 10:24 ` [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount Yan Zhao
2026-01-06 10:24 ` [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M Yan Zhao
2026-01-06 17:47 ` [PATCH v3 00/24] KVM: TDX huge page support for private memory Vishal Annapurve
2026-01-06 21:26   ` Ackerley Tng
2026-01-06 21:38     ` Sean Christopherson
2026-01-06 22:04       ` Ackerley Tng
2026-01-06 23:43         ` Sean Christopherson
2026-01-07  9:03           ` Yan Zhao
2026-01-08 20:11             ` Ackerley Tng
2026-01-09  9:18               ` Yan Zhao
2026-01-09 16:12                 ` Vishal Annapurve
2026-01-09 17:16                   ` Vishal Annapurve
2026-01-09 18:07                   ` Ackerley Tng
2026-01-12  1:39                     ` Yan Zhao
2026-01-12  2:12                       ` Yan Zhao
2026-01-12 19:56                         ` Ackerley Tng
2026-01-13  6:10                           ` Yan Zhao
2026-01-13 16:40                             ` Vishal Annapurve
2026-01-14  9:32                               ` Yan Zhao
2026-01-07 19:22           ` Edgecombe, Rick P
2026-01-07 20:27             ` Sean Christopherson
2026-01-12 20:15           ` Ackerley Tng
2026-01-14  0:33             ` Yan Zhao
2026-01-14  1:24               ` Sean Christopherson
2026-01-14  9:23                 ` Yan Zhao
2026-01-14 15:26                   ` Sean Christopherson [this message]
2026-01-14 18:45                     ` Ackerley Tng
2026-01-15  3:08                       ` Yan Zhao
2026-01-15 18:13                         ` Ackerley Tng
2026-01-14 18:56                     ` Dave Hansen
2026-01-15  0:19                       ` Sean Christopherson
2026-01-16 15:45                         ` Edgecombe, Rick P
2026-01-16 16:31                           ` Sean Christopherson
2026-01-16 16:58                             ` Edgecombe, Rick P
2026-01-19  5:53                               ` Yan Zhao
2026-01-30 15:32                                 ` Sean Christopherson
2026-02-03  9:18                                   ` Yan Zhao
2026-02-09 17:01                                     ` Sean Christopherson
2026-01-16 16:57                         ` Dave Hansen
2026-01-16 17:14                           ` Sean Christopherson
2026-01-16 17:45                             ` Dave Hansen
2026-01-16 19:59                               ` Sean Christopherson
2026-01-16 22:25                                 ` Dave Hansen
2026-01-15  1:41                     ` Yan Zhao
2026-01-15 16:26                       ` Sean Christopherson
2026-01-16  0:28 ` Sean Christopherson
2026-01-16 11:25   ` Yan Zhao
2026-01-16 14:46     ` Sean Christopherson
2026-01-19  1:25       ` Yan Zhao
