From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ackerley Tng <ackerleytng@google.com>,
	Vishal Annapurve <vannapurve@google.com>,
	pbonzini@redhat.com,  linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, x86@kernel.org,  rick.p.edgecombe@intel.com,
	dave.hansen@intel.com, kas@kernel.org,  tabba@google.com,
	michael.roth@amd.com, david@kernel.org, sagis@google.com,
	 vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com,
	 pgonda@google.com, fan.du@intel.com, jun.miao@intel.com,
	 francescolavra.fl@gmail.com, jgross@suse.com,
	ira.weiny@intel.com,  isaku.yamahata@intel.com,
	xiaoyao.li@intel.com, kai.huang@intel.com,
	 binbin.wu@linux.intel.com, chao.p.peng@intel.com,
	chao.gao@intel.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Date: Wed, 14 Jan 2026 07:26:44 -0800
Message-ID: <aWe1tKpFw-As6VKg@google.com>
In-Reply-To: <aWdgfXNdBuzpVE2Z@yzhao56-desk.sh.intel.com>

On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 14, 2026, Yan Zhao wrote:
> > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > > mapping size in the secondary MMU, while the primary MMU does not create a
> > > mapping larger than the backend folio size.
> > 
> > Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> > driver _could_ split a folio (no idea why it would) but map the entire thing into
> > userspace, and then userspace could hand off that memory to KVM.
> > 
> > So I'd say _KVM's_ rule isn't so much "mapping size <= folio size", it's
> > that "KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
> > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
> > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
> > 
> > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
> > than the folio size.
> 
> Oh. I forgot about the VM_PFNMAP case, which allows providing folios as the
> backend. Indeed, a driver can create a huge mapping in primary MMU for the
> VM_PFNMAP range with multiple discontiguous pages, if it wants.
> 
> But this occurs before KVM creates the mapping. Per my understanding, pages
> under VM_PFNMAP are pinned,

Nope.  Only the driver that owns the VMAs knows what sits behind the PFN and the
lifecycle rules for that memory.

That last point is *very* important.  Even if the PFNs shoved into VM_PFNMAP VMAs
have an associated "struct page", that doesn't mean the "struct page" is refcounted,
i.e. can be pinned.  That detail was the heart of "KVM: Stop grabbing references to
PFNMAP'd pages" overhaul[*].

To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (privileged)
userspace to "do the right thing", the secondary MMU needs to be tied into
mmu_notifiers, so that modifications to the mappings in the primary MMU are
reflected into the secondary MMU.

[*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
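
To illustrate the notifier requirement, here is a minimal userspace toy model, not
kernel code.  Every name in it (toy_notifier, toy_primary_zap(), etc.) is
hypothetical and only loosely mirrors the shape of the mmu_notifier contract: the
primary MMU tells the secondary MMU about a range before it changes it, so the
secondary MMU never relies on a pin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PFNS 8

/* Hypothetical secondary MMU: one "mapped" flag per PFN slot. */
struct toy_secondary_mmu {
	bool mapped[NR_PFNS];
};

/* Hypothetical notifier, invoked by the primary MMU before it changes a range. */
struct toy_notifier {
	void (*invalidate_range)(struct toy_notifier *n, size_t start, size_t end);
	struct toy_secondary_mmu *mmu;
};

/* Secondary MMU side: drop every mapping in [start, end). */
static void toy_secondary_invalidate(struct toy_notifier *n, size_t start, size_t end)
{
	assert(start <= end);
	for (size_t i = start; i < end && i < NR_PFNS; i++)
		n->mmu->mapped[i] = false;
}

/* Primary MMU side: notify first, then zap the primary mappings in [start, end). */
static void toy_primary_zap(bool *primary, struct toy_notifier *n, size_t start, size_t end)
{
	assert(start <= end);
	if (n)
		n->invalidate_range(n, start, end);
	for (size_t i = start; i < end && i < NR_PFNS; i++)
		primary[i] = false;
}
```

The point of the sketch is the ordering: the secondary mappings are gone by the
time the primary MMU changes what sits behind the PFNs, with no refcounting
involved.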

> so it looks like there're no splits after they are mapped into the primary MMU.
> 
> So, out of curiosity, do you know why the Linux kernel needs to unmap mappings from
> both primary and secondary MMUs, and check folio refcount before performing
> folio splitting?

Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
if something is going through the effort of splitting a folio, then odds are very,
very good that the new folios can't be safely mapped as a contiguous hugepage.
Limiting mapping sizes to folios makes the rules/behavior straightforward for core
MM to implement, and for drivers/users to understand.

Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
communicate the maximum mapping size; folios are the "currency" for doing so.

And then for edge cases that want to map a split folio as a hugepage (if any such
edge cases exist), thus take on the responsibility of managing the lifecycle of
the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
 
> > > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > > 
> > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > > stage-2 page table is not permitted due to the extra folio refcount held by the
> > > IOMMU.
> > > 
> > > For gmem cases, KVM also does not create mappings larger than the folio size
> > > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > > ability to allocate huge folios.
> > > 
> > > We really need to be careful if we hope to break this long-established rule.
> > 
> > +100 to being careful, but at the same time I don't think we should get _too_
> > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> > might not be a folio, if guest_memfd stopped using folios, then the entire
> > discussion becomes moot.
> > 
> > And as above, the long-standing rule isn't about the implementation details so
> > much as it is about KVM's behavior.  If the simplest solution to support huge
> > guest_memfd pages is to decouple the max order from the folio, then so be it.
> > 
> > That said, I'd very much like to get a sense of the alternatives, because at the
> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> > and naively, tying that to the folio seems like an easy solution.
> Thanks for the explanation.
> 
> Alternatively, how do you feel about the approach of splitting S-EPT first
> before splitting folios?
> If guest_memfd always splits 1GB folios to 2MB first and only splits the
> converted range to 4KB, splitting S-EPT before splitting folios should not
> introduce too much overhead. Then, we can defer the folio size problem until
> guest_memfd stops using folios.
> 
> If the decision is to stop relying on folios for unmapping now, do you think
> the following changes are reasonable for the TDX huge page series?
> 
> - Add WARN_ON_ONCE() to assert that pages are in a single folio in
>   tdh_mem_page_aug().
> - Do not assert that pages are in a single folio in
>   tdh_phymem_page_wbinvd_hkid(). (or just assert on pfn_valid() for each page?)
>   Could you please give me guidance on
>   https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
> - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.

Ok, with the disclaimer that I hadn't actually looked at the patches in this
series before now...

TDX absolutely should not be doing _anything_ with folios.  I am *very* strongly
opposed to TDX assuming that memory is backed by refcounted "struct page", and
thus can use folios to glean the maximum mapping size.

guest_memfd is _the_ owner of that information.  guest_memfd needs to explicitly
_tell_ the rest of KVM what the maximum mapping size is; arch code should not
infer that size from a folio.
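
As a rough sketch of that division of labor (a standalone toy, with hypothetical
names and assumed level/order values, not the actual KVM code): the owner of the
memory reports a maximum order, and arch code only clamps its own maximum level
against it, without ever touching a page or folio.

```c
#include <assert.h>

/* Hypothetical stand-ins for x86 page-table levels. */
enum toy_level { TOY_LEVEL_4K = 1, TOY_LEVEL_2M = 2, TOY_LEVEL_1G = 3 };

/* Orders are log2 of the number of 4KiB pages: 0 = 4KiB, 9 = 2MiB, 18 = 1GiB. */
static enum toy_level toy_order_to_level(int order)
{
	assert(order >= 0);
	if (order >= 18)
		return TOY_LEVEL_1G;
	if (order >= 9)
		return TOY_LEVEL_2M;
	return TOY_LEVEL_4K;
}

/*
 * Arch side: take the max order reported by the memory's owner and clamp the
 * arch's own max level to it.  Nothing here looks at a struct page or folio.
 */
static enum toy_level toy_max_mapping_level(enum toy_level arch_max, int owner_max_order)
{
	enum toy_level owner_max = toy_order_to_level(owner_max_order);

	return owner_max < arch_max ? owner_max : arch_max;
}
```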

And that code+behavior already exists in the form of kvm_gmem_mapping_order() and
its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level.  IIUC,
the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because
tdx_clflush_page() ultimately requires a "struct page".  That is absolutely
ridiculous and not acceptable.  CLFLUSH takes a virtual address, there is *zero*
reason tdh_mem_page_aug() needs to require/assume a struct page.
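
To make the "CLFLUSH takes a virtual address" point concrete, here is a userspace
sketch (hypothetical helper name, assumed 64-byte cache lines, compiler intrinsics
instead of kernel helpers) that flushes a range given nothing but a pointer and a
length.  It is only an illustration, not a proposed replacement for
tdx_clflush_page().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#endif

#define TOY_CACHE_LINE 64	/* assumption: 64-byte cache lines */

/*
 * Flush every cache line overlapping [va, va + len) and return the number of
 * lines visited.  The only inputs are a virtual address and a length; no
 * "struct page" (or anything like it) is needed.
 */
static size_t toy_clflush_range(const void *va, size_t len)
{
	uintptr_t p = (uintptr_t)va & ~(uintptr_t)(TOY_CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)va + len;
	size_t lines = 0;

	assert(va != NULL);
	if (!len)
		return 0;

	for (; p < end; p += TOY_CACHE_LINE, lines++) {
#if defined(__SSE2__)
		_mm_clflush((const void *)p);
#endif
	}
#if defined(__SSE2__)
	_mm_mfence();	/* order the flushes against later accesses */
#endif
	return lines;
}
```

On builds without SSE2 the flush itself compiles away and only the line walk
remains, which is enough to show the interface.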

Dave may feel differently, but I am not going to budge on this.  I am not going
to bake in assumptions throughout KVM about memory being backed by page+folio.
We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to
PFNMAP'd pages" series, I am NOT reintroducing such assumptions.

NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
