Date: Wed, 21 Jan 2026 09:30:28 -0800
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
From: Sean Christopherson
To: Kai Huang
Cc: "pbonzini@redhat.com", Yan Y Zhao, "kvm@vger.kernel.org", Fan Du, Xiaoyao Li,
	Chao Gao, Dave Hansen, "thomas.lendacky@amd.com", "vbabka@suse.cz",
	"tabba@google.com", "david@kernel.org", "kas@kernel.org",
	"michael.roth@amd.com", Ira Weiny, "linux-kernel@vger.kernel.org",
	"binbin.wu@linux.intel.com", "ackerleytng@google.com",
	"nik.borisov@suse.com", Isaku Yamahata, Chao P Peng,
	"francescolavra.fl@gmail.com", "sagis@google.com", Vishal Annapurve,
	Rick P Edgecombe, Jun Miao, "jgross@suse.com", "pgonda@google.com",
	"x86@kernel.org"
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
	<20260106102331.25244-1-yan.y.zhao@intel.com>

On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
>
> I have been thinking about whether we can simplify the solution, not just to
> avoid this complicated memory cache topup-then-consume mechanism under MMU
> read lock, but also to avoid the kinda duplicated code for calculating how
> many DPAMT pages need to be topped up, etc., between your next patch and the
> similar code in the DPAMT series for the per-vCPU cache.
>
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
>
> - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
>   tdx_alloc_page() directly, which also handles DPAMT pages internally.
>
>   Here in tdp_mmu_alloc_sp_for_split():
>
>       sp->external_spt = tdx_alloc_page();
>
>   For the fault path we need to use the normal 'kvm_mmu_memory_cache', but
>   that's a per-vCPU cache which doesn't have the pain of the per-VM cache.
>   As I mentioned in v3, I believe we can also hook in tdx_alloc_page() if we
>   add two new obj_alloc()/free() callbacks to 'kvm_mmu_memory_cache':
>
>   https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
>
>   So we can get rid of the per-VM DPAMT cache for S-EPT pages.
>
> - For DPAMT pages for TDX guest private memory, I think we can also get rid
>   of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the needed
>   DPAMT pages:
>
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
>  		 * Passed to TDX module, not accessed by KVM.
>  		 */
>  		void *external_spt;
> +		void *leaf_level_private;
>  	};

There's no need to put this in with external_spt, we could throw it in a new
union with unsync_child_bitmap (TDP MMU pages can't have unsync children).
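Something along these lines, e.g. (rough, untested sketch; the exact placement
in kvm_mmu_page and the comment are illustrative only, not a final form):

	union {
		DECLARE_BITMAP(unsync_child_bitmap, 512);
		/*
		 * TDP MMU pages never have unsync children, so TDX can
		 * safely reuse the storage for leaf-level DPAMT metadata.
		 */
		void *leaf_level_private;
	};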
IIRC, the main reason I've never suggested unionizing unsync_child_bitmap is
that overloading the bitmap would risk corruption if KVM ever marked a TDP MMU
page as unsync, but that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-	sp = sptep_to_sp(spte);
+	if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+		return;
 	if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
 		return;
 	if (sp->unsync_children++)

I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.

> Then we can define a structure which contains the DPAMT pages for a given 2M
> range:
>
> struct tdx_dpamt_metadata {
> 	struct page *page1;
> 	struct page *page2;
> };
>
> Then when we allocate sp->external_spt, we can also allocate it for
> leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> last-level page table.
>
> In this case, I think we can get rid of the per-VM DPAMT cache?
>
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> hooks.
>
> The cost is that the new 'leaf_level_private' takes an additional 8 bytes
> for non-TDX guests even though it is never used, but if what I said above is
> feasible, maybe it's worth the cost.
>
> But it's completely possible that I missed something.  Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP.

Then KVM doesn't need to track PAMT pages except for memory that is mapped into
a guest, and we end up with better symmetry and more consistency throughout TDX.
E.g. all pages that KVM allocates and gifts to the TDX-Module will be allocated
and freed via the same TDX APIs.

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per vCPU that end up being freed without ever being gifted to the
TDX-Module.  But I doubt that will be a problem in practice, because odds are
good the adjacent pages/pfns will already have been consumed, i.e. the
"speculative" allocation is really just bumping the refcount.  And _if_ it's a
problem, e.g. it results in too many wasted DPAMT entries, then it's one we can
solve in KVM by tuning the cache capacity to less aggressively allocate DPAMT
entries.

I'll send a compile-tested v4 of the DPAMT series later today (I think I can
get it out today), as I have other non-trivial feedback that I've accumulated
while going through the patches.
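To make the "DPAMT maintenance at top-up time" idea a bit more concrete, below
is a rough, hypothetical sketch of what the cache callbacks could boil down to.
The helper names and the tdx_pamt_get()/tdx_pamt_put() signatures are
assumptions for illustration only; the real code would presumably just route
through tdx_alloc_control_page() and its free counterpart.

static void *tdx_spt_cache_alloc(gfp_t gfp)
{
	struct page *page = alloc_page(gfp);

	if (!page)
		return NULL;

	/* Back the 4K page with DPAMT up front, i.e. at cache top-up time. */
	if (tdx_pamt_get(page)) {
		__free_page(page);
		return NULL;
	}
	return page_address(page);
}

static void tdx_spt_cache_free(void *obj)
{
	struct page *page = virt_to_page(obj);

	/* Never gifted to the TDX-Module; just drop the PAMT refcount. */
	tdx_pamt_put(page);
	__free_page(page);
}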