public inbox for kvm@vger.kernel.org
From: "Huang, Kai" <kai.huang@intel.com>
To: "Zhao, Yan Y" <yan.y.zhao@intel.com>
Cc: "Du, Fan" <fan.du@intel.com>,
	"Li, Xiaoyao" <xiaoyao.li@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Hansen, Dave" <dave.hansen@intel.com>,
	"david@redhat.com" <david@redhat.com>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"tabba@google.com" <tabba@google.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"seanjc@google.com" <seanjc@google.com>,
	"binbin.wu@linux.intel.com" <binbin.wu@linux.intel.com>,
	"kas@kernel.org" <kas@kernel.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"ackerleytng@google.com" <ackerleytng@google.com>,
	"michael.roth@amd.com" <michael.roth@amd.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	"Peng, Chao P" <chao.p.peng@intel.com>,
	"Yamahata, Isaku" <isaku.yamahata@intel.com>,
	"Annapurve, Vishal" <vannapurve@google.com>,
	"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
	"Miao, Jun" <jun.miao@intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"pgonda@google.com" <pgonda@google.com>
Subject: Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
Date: Tue, 18 Nov 2025 00:14:17 +0000	[thread overview]
Message-ID: <f2fb7c2ed74f37fdf8ce69f593e9436acbdd93ee.camel@intel.com> (raw)
In-Reply-To: <aRbHtnMcoqM1gmL9@yzhao56-desk.sh.intel.com>

On Fri, 2025-11-14 at 14:09 +0800, Yan Zhao wrote:
> On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> > On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > >  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > >  					 struct kvm_mmu_page *root,
> > > > >  					 gfn_t start, gfn_t end,
> > > > > -					 int target_level, bool shared)
> > > > > +					 int target_level, bool shared,
> > > > > +					 bool only_cross_bounday, bool *flush)
> > > > >  {
> > > > >  	struct kvm_mmu_page *sp = NULL;
> > > > >  	struct tdp_iter iter;
> > > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > >  	 * level into one lower level. For example, if we encounter a 1GB page
> > > > >  	 * we split it into 512 2MB pages.
> > > > >  	 *
> > > > > +	 * When only_cross_bounday is true, just split huge pages above the
> > > > > +	 * target level into one lower level if the huge pages cross the start
> > > > > +	 * or end boundary.
> > > > > +	 *
> > > > > +	 * No need to update @flush for !only_cross_bounday cases, which rely
> > > > > +	 * on the callers to do the TLB flush in the end.
> > > > > +	 *
> > > > 
> > > > s/only_cross_bounday/only_cross_boundary
> > > > 
> > > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > > update 'flush' when 'only_cross_bounday' is true, because
> > > > 'only_cross_bounday' can only results in less splitting.
> > > I have to say it's a reasonable point.
> > > 
> > > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > > before non-block DEMOTE is implemented?).  Would it better to also let the
> > > Actually the flush is only required for !TDX cases.
> > > 
> > > For TDX, either the flush has been performed internally within
> > > tdx_sept_split_private_spt() 
> > > 
> > 
> > AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> > still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> > TLB?
> tdx_sept_split_private_spt() actually invokes tdx_track(), which performs the
> kicking off all vCPUs by invoking
> "kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".

Oh thanks for the reminder.

Then I am kinda confused about why you need to return @flush, especially when
'only_cross_boundary' is true, which is the TDX case?

So, stepping back to why this 'flush' needs to be returned:

- For TDX ('only_cross_boundary == true'):

The caller doesn't need to flush the TLB because it has already been done when
the huge page is actually split.

- For non-TDX case ('only_cross_boundary == false'):

AFAICT the only user of tdp_mmu_split_huge_pages_root() is "eager hugepage
splitting" during dirty logging.  And per the current implementation there are
two callers of tdp_mmu_split_huge_pages_root():

  kvm_mmu_try_split_huge_pages()
  kvm_mmu_slot_try_split_huge_pages()

But they are both void functions, which neither return whether a TLB flush is
needed nor do the TLB flush internally.

So I am kinda confused.

Perhaps you mean that for "shared memory of a TDX guest", the caller will also
pass 'only_cross_boundary == true' and then needs to perform the TLB flush
itself?

[...]

> > 
> > Something like below:
> > 
> > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> > tdp_iter *iter,
> >  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> >                                          struct kvm_mmu_page *root,
> >                                          gfn_t start, gfn_t end,
> > -                                        int target_level, bool shared)
> > +                                        int target_level, bool shared,
> > +                                        bool only_cross_boundary,
> > +                                        bool *split)
> >  {
> >         struct kvm_mmu_page *sp = NULL;
> >         struct tdp_iter iter;
> > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> >                 if (!is_shadow_present_pte(iter.old_spte) ||
> > !is_large_pte(iter.old_spte))
> >                         continue;
> >  
> > +               if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> > end))
> > +                       continue;
> > +
> >                 if (!sp) {
> >                         rcu_read_unlock();
> >  
> > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> >                         goto retry;
> >  
> >                 sp = NULL;
> > +               *split = true;
> >         }
> >  
> >         rcu_read_unlock();
> This looks more reasonable for tdp_mmu_split_huge_pages_root();
> 
> Given that splitting only adds a new page to the paging structure (unlike page
> merging), I currently can't think of any current use cases that would be broken
> by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
> mmu_lock.
> 
> This is because:
> 1) if the split is triggered in a fault path, the hardware shouldn't have cached
>    the old huge translation.
> 2) if the split is triggered in a zap or convert path,
>    - there shouldn't be concurrent faults on the range due to the protection of
>      mmu_invalidate_range*.
>    - for concurrent splits on the same range, though the other vCPUs may
>      temporally see stale huge TLB entries after they believe they have
>      performed a split, they will be kicked off to flush the cache soon after
>      tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
>      This should be acceptable since I don't see any special guest needs that
>      rely on pure splits.

Perhaps we should just go straight to the point:

  What does a "hugepage split" do, and what's the consequence of not flushing
  the TLB?

Per make_small_spte(), the new child PTEs will carry all bits of the hugepage
PTE, except that they clear the hugepage bit (obviously) and set the 'X' bit
for the NX hugepage mitigation.

That means if we leave the stale hugepage TLB entry, the CPU is still able to
find the correct PFN, and AFAICT there shouldn't be any other problem here.
For any fault due to the stale hugepage TLB entry missing the 'X' permission,
AFAICT KVM will just treat it as a spurious fault, which isn't nice but should
do no harm.

> 
> So I tend to agree with your suggestion though the implementation in this patch
> is safer.

I am perhaps still missing something, as I am still trying to precisely
understand in which cases you want to flush the TLB when splitting a hugepage.

I kinda tend to think you eventually want to flush the TLB because eventually
you want to _ZAP_.  But needing to flush due to a zap and needing to flush due
to a split are kinda different, I think.

  reply	other threads:[~2025-11-18  0:14 UTC|newest]

Thread overview: 129+ messages
2025-08-07  9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
2025-08-07  9:41 ` [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2025-08-07  9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
2025-09-01  8:55   ` Binbin Wu
2025-09-01  9:08     ` Yan Zhao
2025-09-02 16:56       ` Edgecombe, Rick P
2025-09-02 17:37         ` Sean Christopherson
2025-09-02 17:45           ` Edgecombe, Rick P
2025-09-04  9:31             ` Yan Zhao
2025-11-11  9:15       ` Huang, Kai
2025-11-12  8:06         ` Yan Zhao
2025-11-14  9:14           ` Binbin Wu
2025-11-14  9:21             ` Yan Zhao
2025-08-07  9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
2025-11-11  9:23   ` Huang, Kai
2025-11-12  8:43     ` Yan Zhao
2025-11-12 10:29       ` Huang, Kai
2025-11-13  2:35         ` Yan Zhao
2025-11-13  7:37           ` Huang, Kai
2025-11-13  9:03             ` Yan Zhao
2025-11-13 15:26             ` Dave Hansen
2025-11-14  1:21               ` Yan Zhao
2025-12-10  1:14   ` Vishal Annapurve
2025-12-10  1:18     ` Yan Zhao
2025-12-10  1:30       ` Vishal Annapurve
2025-12-10  1:55         ` Yan Zhao
2025-12-31 19:37           ` Vishal Annapurve
2026-01-06 10:37             ` Yan Zhao
2025-08-07  9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
2025-09-02  2:56   ` Binbin Wu
2025-09-03  9:51     ` Yan Zhao
2025-09-03 11:19       ` Binbin Wu
2025-09-04  2:53         ` Yan Zhao
2025-08-07  9:42 ` [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
2025-11-17  2:09   ` Binbin Wu
2025-11-17  4:05     ` Yan Zhao
2025-08-07  9:42 ` [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages Yan Zhao
2025-08-07  9:42 ` [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
2025-08-07  9:43 ` [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
2025-11-11  9:52   ` Huang, Kai
2025-11-12  9:29     ` Yan Zhao
2025-08-07  9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
2025-11-11 10:06   ` Huang, Kai
2025-11-13  3:16     ` Yan Zhao
2025-11-17  8:53   ` Binbin Wu
2025-11-17  9:09     ` Yan Zhao
2025-08-07  9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
2025-11-11 10:20   ` Huang, Kai
2025-11-13  5:53     ` Yan Zhao
2025-11-17  9:17       ` Binbin Wu
2025-11-17  9:26         ` Yan Zhao
2025-12-09 23:49   ` Sagi Shahar
2025-12-09 23:54     ` Edgecombe, Rick P
2025-12-10  0:28       ` Sagi Shahar
2025-12-10  0:50         ` Yan Zhao
2025-12-10 17:16           ` Sagi Shahar
2025-12-10 19:49             ` Edgecombe, Rick P
2025-12-11  2:10               ` Yan Zhao
2025-08-07  9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
2025-09-03  3:30   ` Binbin Wu
2025-08-07  9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
2025-09-03  6:57   ` Binbin Wu
2025-09-03  9:44     ` Yan Zhao
2025-11-11 10:42   ` Huang, Kai
2025-11-13  8:54     ` Yan Zhao
2025-11-13 11:02       ` Huang, Kai
2025-11-13 11:40         ` Huang, Kai
2025-11-14  6:09         ` Yan Zhao
2025-11-18  0:14           ` Huang, Kai [this message]
2025-11-18  6:30             ` Yan Zhao
2025-11-18  8:59               ` Binbin Wu
2025-11-18 10:49               ` Huang, Kai
2025-11-19  3:41                 ` Yan Zhao
2026-01-06 10:35                   ` Yan Zhao
2025-11-19  6:23                 ` Yan Zhao
2025-11-19  6:31   ` Yan Zhao
2025-08-07  9:44 ` [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
2025-08-07  9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
2025-09-03  7:36   ` Binbin Wu
2025-09-03  9:37     ` Yan Zhao
2025-11-11 10:55   ` Huang, Kai
2025-11-14  1:42     ` Yan Zhao
2025-11-18  0:26       ` Huang, Kai
2025-11-18  2:44         ` Yan Zhao
2025-11-11 11:05   ` Huang, Kai
2025-11-14  7:22     ` Yan Zhao
2025-11-18  1:04       ` Huang, Kai
2025-11-18  2:20         ` Yan Zhao
2025-11-18  9:44           ` Huang, Kai
2025-11-19  2:58             ` Yan Zhao
2025-11-19  5:51   ` Binbin Wu
2025-11-19  6:29     ` Yan Zhao
2025-11-19  6:39       ` Binbin Wu
2025-08-07  9:44 ` [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
2025-08-07  9:44 ` [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
2025-08-07  9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
2025-09-04  7:58   ` Binbin Wu
2025-09-04  9:48     ` Yan Zhao
2025-09-04 11:07       ` Yan Zhao
2025-10-01  6:21   ` Ackerley Tng
2025-10-13  0:18     ` Yan Zhao
2025-10-01  8:00   ` Ackerley Tng
2025-10-13  0:45     ` Yan Zhao
2025-08-07  9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
2025-08-11 21:10   ` Sagi Shahar
2025-08-12  6:37     ` Yan Zhao
2025-09-04  8:16   ` Binbin Wu
2025-09-04  9:50     ` Yan Zhao
2025-09-05  9:05       ` Binbin Wu
2025-09-05 15:41   ` Edgecombe, Rick P
2025-09-15  6:05     ` Yan Zhao
2025-08-07  9:45 ` [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt() Yan Zhao
2025-09-04  8:30   ` Binbin Wu
2025-08-07  9:45 ` [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote() Yan Zhao
2025-08-07  9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
2025-09-04  9:17   ` Binbin Wu
2025-09-04  9:58     ` Yan Zhao
2025-12-05  6:14   ` Sagi Shahar
2025-12-08  5:49     ` Yan Zhao
2025-12-11  1:42       ` Vishal Annapurve
2025-12-11  2:36         ` Yan Zhao
2025-08-07  9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
2025-08-14  5:31   ` Vishal Annapurve
2025-08-14 18:29     ` Vishal Annapurve
2025-08-18  4:19     ` Yan Zhao
2025-08-07  9:46 ` [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE Yan Zhao
2025-11-11 11:25   ` Huang, Kai
2025-11-14  8:34     ` Yan Zhao
2025-11-18  0:56       ` Huang, Kai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f2fb7c2ed74f37fdf8ce69f593e9436acbdd93ee.camel@intel.com \
    --to=kai.huang@intel.com \
    --cc=ackerleytng@google.com \
    --cc=binbin.wu@linux.intel.com \
    --cc=chao.p.peng@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=david@redhat.com \
    --cc=fan.du@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=isaku.yamahata@intel.com \
    --cc=jun.miao@intel.com \
    --cc=kas@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=michael.roth@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=pgonda@google.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=seanjc@google.com \
    --cc=tabba@google.com \
    --cc=thomas.lendacky@amd.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=x86@kernel.org \
    --cc=xiaoyao.li@intel.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox