kvm.vger.kernel.org archive mirror
From: Sean Christopherson <seanjc@google.com>
To: Rick P Edgecombe <rick.p.edgecombe@intel.com>
Cc: "kas@kernel.org" <kas@kernel.org>, Chao Gao <chao.gao@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Kai Huang <kai.huang@intel.com>, "bp@alien8.de" <bp@alien8.de>,
	"x86@kernel.org" <x86@kernel.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	Yan Y Zhao <yan.y.zhao@intel.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"linux-coco@lists.linux.dev" <linux-coco@lists.linux.dev>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Isaku Yamahata <isaku.yamahata@intel.com>
Subject: Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Date: Mon, 11 Aug 2025 19:02:10 -0700	[thread overview]
Message-ID: <aJqgosNUjrCfH_WN@google.com> (raw)
In-Reply-To: <c2a62badf190717a251d269a6905872b01e8e340.camel@intel.com>

On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote:
> > > I don't see any other reason for the global spin lock. Kirill, was that
> > > it?  Did you consider also adding a lock per 2MB region, like the
> > > refcount? Or any other granularity of lock besides global? Not saying
> > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > above right.
> > 
> > We have discussed this before[1]. Global locking is problematic when you
> > actually hit contention. Let's not complicate things until we actually
> > see it. I failed to demonstrate contention without huge pages. With huge
> > pages it is even more dubious that we ever see it.
> > 
> > [1]
> > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
> 
> Ah, I see.
> 
> I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge

How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?

There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
private<=>shared conversions and thus interferes with PAMT allocations for other
VMs.

> pages) and then shutting them down. I saw 701 contentions on startup, and 53
> more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> But it probably wasn't much of a cacheline-bouncing worst case.

Isn't the SEAMCALL done while holding the spinlock?  I assume the latency of the
SEAMCALL is easily the long pole in the flow.

> And I guess this is on my latest changes not this exact v2, but it shouldn't
> have changed.
> 
> But hmm, it seems Dave's objection about maintaining the lock allocations would
> apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> issue for TDX because hotplug gets rejected if the allocations are not already
> there. So complexity of a per-2MB lock should be minimal, at least
> incrementally. The difference seems more about memory use vs performance.
> 
> What gives me pause is in the KVM TDX work we have really tried hard to not take
> exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> hard numbers.

Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
for write is problematic.  Even if TDX VMs don't exhibit the same patterns *today*
as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
that will always hold true.
 
> But an enormous amount of work went into letting KVM faults happen under the
> shared lock for normal VMs. So on one hand, yes it's premature optimization.
> But on the other hand, it's a maintainability concern about polluting the
> existing way things work in KVM with special TDX properties.
> 
> I think we need to at least call out loudly that the decision was to go with the
> simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> opinion is, but I wouldn't want him to first learn of it when he went digging
> and found a buried global spin lock in the fault path.

Heh, too late, I saw it when this was first posted.  And to be honest, my initial
reaction was very much "absolutely not" (though Rated R, not PG).  Now that I've
had time to think things through, I'm not _totally_ opposed to having a spinlock
in the page fault path, but my overall sentiment remains the same.

For mmu_lock and related SPTE operations, I was super adamant about not taking
exclusive locks because based on our experience with the TDP MMU, converting flows
from exclusive to shared is usually significantly more work than developing code
for "shared mode" straightaway (and as you note above, that wasn't trivial for TDX).
And importantly, those code paths were largely solved problems.  I.e. I didn't
want to get into a situation where TDX undid the parallelization of the TDP MMU,
and then had to add it back after the fact.

I think the same holds true here.  I'm not completely opposed to introducing a
spinlock, but I want to either have a very high level of confidence that the lock
won't introduce jitter/delay (I have low confidence on this front, at least in
the proposed patches), or have super clear line of sight to making the contention
irrelevant, without having to rip apart the code.

My biggest question at this point is: why is all of this being done on-demand?
IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K
pages at the last possible moment".  Neither of those seems ideal.

E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand
PAMT management seems reasonable.  But for PAMTs that are used to track guest-assigned
memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd?
I.e. setup PAMT crud when guest_memfd is populated, not when the memory is mapped
into the guest.  That way setups that care about guest boot time can preallocate
guest_memfd in order to get the PAMT stuff out of the way.
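Roughly what I have in mind (a non-buildable sketch; the hook points and the tdx_pamt_get()/tdx_pamt_put() helpers are hypothetical names for illustration, not existing guest_memfd or TDX code):

```c
/*
 * Hypothetical sketch only: hook names and PAMT helper signatures are
 * invented for illustration.
 */

/* When guest_memfd allocates backing memory, back it with PAMT_4K pages. */
static int kvm_gmem_prepare_folio(struct file *file, struct folio *folio)
{
	return tdx_pamt_get(folio_pfn(folio), folio_nr_pages(folio));
}

/* When guest_memfd truncates/frees backing memory, release the PAMT pages. */
static void kvm_gmem_free_folio(struct folio *folio)
{
	tdx_pamt_put(folio_pfn(folio), folio_nr_pages(folio));
}
```

The PAMT lifetime then follows the guest_memfd allocation, not the S-EPT mappings, so private<=>shared conversions never touch PAMT at all.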

You could do the same thing by prefaulting guest memory, but TDX has limitations
there, and I see very little value in precisely reclaiming PAMT memory when a
leaf S-EPT is zapped, i.e. when a page is converted from private=>shared.  As
above, that's just asking for noisy neighbor issues.

The complaint with static PAMT was that it required burning 0.4% of memory even
if the host isn't actively running TDX VMs.  Burning 0.4% of the memory assigned
to a guest, regardless of whether it's mapped private or shared, seems acceptable,
and I think would give us a lot more flexibility in avoiding locking issues.

Similarly, we could bind a PAMT to non-leaf S-EPT pages during mmu_topup_memory_caches(),
i.e. when arch.mmu_external_spt_cache is filled.  Then there would be no need for
a separate vcpu->arch.pamt_page_cache, and more work would be done outside of
mmu_lock.  Freeing SPTs would still be done under mmu_lock (I think), but that
should be a much rarer operation.
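As a sketch of that ordering (illustrative only; mmu_external_spt_cache is real, but the PAMT-binding helper shown is an assumed name, not actual KVM code):

```c
/*
 * Sketch only: binds PAMT at cache top-up time, outside mmu_lock, so
 * no separate vcpu->arch.pamt_page_cache is needed.  The helper name
 * tdx_pamt_bind_cached_pages() is hypothetical.
 */
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
{
	int r;

	/* Existing top-up of external (S-EPT) page table pages... */
	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_external_spt_cache,
				       PT64_ROOT_MAX_LEVEL);
	if (r)
		return r;

	/* ...then pair each cached SPT page with its PAMT, pre-mmu_lock. */
	return tdx_pamt_bind_cached_pages(&vcpu->arch.mmu_external_spt_cache);
}
```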

Thread overview: 90+ messages
2025-06-09 19:13 [PATCHv2 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 01/12] x86/tdx: Consolidate TDX error handling Kirill A. Shutemov
2025-06-25 17:58   ` Dave Hansen
2025-06-25 20:58     ` Edgecombe, Rick P
2025-06-25 21:27       ` Sean Christopherson
2025-06-25 21:46         ` Edgecombe, Rick P
2025-06-26  9:25         ` kirill.shutemov
2025-06-26 14:46           ` Dave Hansen
2025-06-26 15:51             ` Sean Christopherson
2025-06-26 16:59               ` Dave Hansen
2025-06-27 10:42                 ` kirill.shutemov
2025-07-30 18:32                 ` Edgecombe, Rick P
2025-07-31 23:31                   ` Sean Christopherson
2025-07-31 23:46                     ` Edgecombe, Rick P
2025-07-31 23:53                       ` Sean Christopherson
2025-08-01 15:03                         ` Edgecombe, Rick P
2025-08-06 15:19                           ` Sean Christopherson
2025-06-26  0:05     ` Huang, Kai
2025-07-30 18:33       ` Edgecombe, Rick P
2025-06-09 19:13 ` [PATCHv2 02/12] x86/virt/tdx: Allocate page bitmap for Dynamic PAMT Kirill A. Shutemov
2025-06-25 18:06   ` Dave Hansen
2025-06-26  9:25     ` Kirill A. Shutemov
2025-07-31  1:06     ` Edgecombe, Rick P
2025-07-31  4:10       ` Huang, Kai
2025-06-26 11:08   ` Huang, Kai
2025-06-27 10:42     ` kirill.shutemov
2025-06-09 19:13 ` [PATCHv2 03/12] x86/virt/tdx: Allocate reference counters for PAMT memory Kirill A. Shutemov
2025-06-25 19:26   ` Dave Hansen
2025-06-27 11:27     ` Kirill A. Shutemov
2025-06-27 14:03       ` Dave Hansen
2025-06-26  0:53   ` Huang, Kai
2025-06-26  4:48     ` Huang, Kai
2025-06-27 11:35     ` kirill.shutemov
2025-06-09 19:13 ` [PATCHv2 04/12] x86/virt/tdx: Add tdx_alloc/free_page() helpers Kirill A. Shutemov
2025-06-10  2:36   ` Chao Gao
2025-06-10 14:51     ` [PATCHv2.1 " Kirill A. Shutemov
2025-06-25 18:01       ` Dave Hansen
2025-06-25 20:09     ` [PATCHv2 " Dave Hansen
2025-06-26  0:46       ` Chao Gao
2025-06-25 20:02   ` Dave Hansen
2025-06-27 13:00     ` Kirill A. Shutemov
2025-06-27  7:49   ` Adrian Hunter
2025-06-27 13:03     ` Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 05/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init() Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 06/12] KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init() Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 07/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path Kirill A. Shutemov
2025-06-26 11:21   ` Huang, Kai
2025-07-10  1:34   ` Edgecombe, Rick P
2025-07-10  7:49     ` kirill.shutemov
2025-06-09 19:13 ` [PATCHv2 08/12] KVM: TDX: Handle PAMT allocation in " Kirill A. Shutemov
2025-06-12 12:19   ` Chao Gao
2025-06-12 13:05     ` [PATCHv2.1 " Kirill A. Shutemov
2025-06-25 22:38   ` [PATCHv2 " Edgecombe, Rick P
2025-07-09 14:29     ` kirill.shutemov
2025-07-10  1:33   ` Edgecombe, Rick P
2025-07-10  8:45     ` kirill.shutemov
2025-08-21 19:21   ` Sagi Shahar
2025-08-21 19:35     ` Edgecombe, Rick P
2025-08-21 19:53       ` Sagi Shahar
2025-06-09 19:13 ` [PATCHv2 09/12] KVM: TDX: Reclaim PAMT memory Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 10/12] [NOT-FOR-UPSTREAM] x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 11/12] x86/virt/tdx: Enable Dynamic PAMT Kirill A. Shutemov
2025-06-09 19:13 ` [PATCHv2 12/12] Documentation/x86: Add documentation for TDX's " Kirill A. Shutemov
2025-06-25 13:25 ` [PATCHv2 00/12] TDX: Enable " Kirill A. Shutemov
2025-06-25 22:49 ` Edgecombe, Rick P
2025-06-27 13:05   ` kirill.shutemov
2025-08-08 23:18 ` Edgecombe, Rick P
2025-08-11  6:31   ` kas
2025-08-11 22:30     ` Edgecombe, Rick P
2025-08-12  2:02       ` Sean Christopherson [this message]
2025-08-12  2:31         ` Vishal Annapurve
2025-08-12  8:04           ` kas
2025-08-12 15:12             ` Edgecombe, Rick P
2025-08-12 16:15               ` Sean Christopherson
2025-08-12 18:39                 ` Edgecombe, Rick P
2025-08-12 22:00                   ` Vishal Annapurve
2025-08-12 23:34                     ` Edgecombe, Rick P
2025-08-13  0:18                       ` Vishal Annapurve
2025-08-13  0:51                         ` Edgecombe, Rick P
2025-08-12 18:44                 ` Vishal Annapurve
2025-08-13  8:09                 ` Kiryl Shutsemau
2025-08-13  7:49               ` Kiryl Shutsemau
2025-08-12  8:03         ` kas
2025-08-13 22:43         ` Edgecombe, Rick P
2025-08-13 23:31           ` Dave Hansen
2025-08-14  0:14             ` Edgecombe, Rick P
2025-08-14 10:55               ` Kiryl Shutsemau
2025-08-15  1:03                 ` Edgecombe, Rick P
2025-08-20 15:31                   ` Sean Christopherson
2025-08-20 16:35                     ` Edgecombe, Rick P
