Date: Wed, 21 Jan 2026 09:30:28 -0800
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
From: Sean Christopherson
To: Kai Huang
Cc: "pbonzini@redhat.com", Yan Y Zhao, "kvm@vger.kernel.org", Fan Du, Xiaoyao Li,
	Chao Gao, Dave Hansen, "thomas.lendacky@amd.com", "vbabka@suse.cz",
	"tabba@google.com", "david@kernel.org", "kas@kernel.org",
	"michael.roth@amd.com", Ira Weiny, "linux-kernel@vger.kernel.org",
	"binbin.wu@linux.intel.com", "ackerleytng@google.com",
	"nik.borisov@suse.com", Isaku Yamahata, Chao P Peng,
	"francescolavra.fl@gmail.com", "sagis@google.com", Vishal Annapurve,
	Rick P Edgecombe, Jun Miao, "jgross@suse.com", "pgonda@google.com",
	"x86@kernel.org"
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
	<20260106102331.25244-1-yan.y.zhao@intel.com>

On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
>
> I have been thinking about whether we can simplify the solution, not just to
> avoid this complicated memory cache topup-then-consume mechanism under MMU
> read lock, but also to avoid the kinda duplicated code for calculating how
> many DPAMT pages need to be topped up, etc., between your next patch and the
> similar code in the DPAMT series for the per-vCPU cache.
>
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
>
> - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
>   tdx_alloc_page() directly, which also handles DPAMT pages internally.
>
>   Here in tdp_mmu_alloc_sp_for_split():
>
>       sp->external_spt = tdx_alloc_page();
>
>   For the fault path we need to use the normal 'kvm_mmu_memory_cache', but
>   that's a per-vCPU cache which doesn't have the pain of the per-VM cache.
>   As I mentioned in v3, I believe we can also hook in tdx_alloc_page() if we
>   add two new obj_alloc()/free() callbacks to 'kvm_mmu_memory_cache':
>
>   https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
>
>   So we can get rid of the per-VM DPAMT cache for S-EPT pages.
>
> - For DPAMT pages for TDX guest private memory, I think we can also get rid
>   of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the needed
>   DPAMT pages:
>
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
>  		 * Passed to TDX module, not accessed by KVM.
>  		 */
>  		void *external_spt;
> +		void *leaf_level_private;
>  	};

There's no need to put this in with external_spt, we could throw it in a new
union with unsync_child_bitmap (TDP MMU pages can't have unsync children).
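Something along these lines, e.g. (rough, untested sketch; the exact placement
in kvm_mmu_page and the comment are illustrative only, not a final form):

	union {
		DECLARE_BITMAP(unsync_child_bitmap, 512);
		/*
		 * TDP MMU pages never have unsync children, so TDX can
		 * safely reuse the storage for leaf-level DPAMT metadata.
		 */
		void *leaf_level_private;
	};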
IIRC, the main reason I've never suggested unionizing unsync_child_bitmap is
that overloading the bitmap would risk corruption if KVM ever marked a TDP MMU
page as unsync, but that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-	sp = sptep_to_sp(spte);
+	if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+		return;
 	if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
 		return;
 	if (sp->unsync_children++)

I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.

> Then we can define a structure which contains the DPAMT pages for a given 2M
> range:
>
> struct tdx_dpamt_metadata {
> 	struct page *page1;
> 	struct page *page2;
> };
>
> Then when we allocate sp->external_spt, we can also allocate it for
> leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> last-level page table.
>
> In this case, I think we can get rid of the per-VM DPAMT cache?
>
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> hooks.
>
> The cost is that the new 'leaf_level_private' takes an additional 8 bytes
> for non-TDX guests even though it is never used, but if what I said above is
> feasible, maybe it's worth the cost.
>
> But it's completely possible that I missed something.  Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP.

Then KVM doesn't need to track PAMT pages except for memory that is mapped into
a guest, and we end up with better symmetry and more consistency throughout TDX.
E.g. all pages that KVM allocates and gifts to the TDX-Module will be allocated
and freed via the same TDX APIs.

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per vCPU that end up being freed without ever being gifted to the
TDX-Module.  But I doubt that will be a problem in practice, because odds are
good the adjacent pages/pfns will already have been consumed, i.e. the
"speculative" allocation is really just bumping the refcount.  And _if_ it's a
problem, e.g. it results in too many wasted DPAMT entries, then it's one we can
solve in KVM by tuning the cache capacity to less aggressively allocate DPAMT
entries.

I'll send a compile-tested v4 of the DPAMT series later today (I think I can
get it out today), as I have other non-trivial feedback that I've accumulated
while going through the patches.
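To make the "DPAMT maintenance at top-up time" idea a bit more concrete, below
is a rough, hypothetical sketch of what the cache callbacks could boil down to.
The helper names and the tdx_pamt_get()/tdx_pamt_put() signatures are
assumptions for illustration only; the real code would presumably just route
through tdx_alloc_control_page() and its free counterpart.

static void *tdx_spt_cache_alloc(gfp_t gfp)
{
	struct page *page = alloc_page(gfp);

	if (!page)
		return NULL;

	/* Back the 4K page with DPAMT up front, i.e. at cache top-up time. */
	if (tdx_pamt_get(page)) {
		__free_page(page);
		return NULL;
	}
	return page_address(page);
}

static void tdx_spt_cache_free(void *obj)
{
	struct page *page = virt_to_page(obj);

	/* Never gifted to the TDX-Module; just drop the PAMT refcount. */
	tdx_pamt_put(page);
	__free_page(page);
}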