Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Kiryl Shutsemau @ 2026-06-24 12:29 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
	mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
	ackerleytng, linux-kernel, linux-coco, virtualization, x86,
	yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-3-zhenzhong.duan@intel.com>

On Tue, Jun 23, 2026 at 06:17:33AM -0400, Zhenzhong Duan wrote:
> In coco guests, hotpluggable memory ranges are initially unaccepted.
> While a previous change expanded the unaccepted memory bitmap boundaries
> to include these hotplug spaces, the actual bits inside the bitmap are
> not yet marked as unaccepted.
> 
> Walks SRAT a second time after the bitmap is allocated and sets the bits
> corresponding to hotpluggable ranges.
> 
> This ensures the bitmap state accurately reflects all static and hotplug
> memory ranges before booting kernel.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  .../firmware/efi/libstub/unaccepted_memory.c   | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> index bfbb78bd7b8a..01bed8e751ca 100644
> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
>  		*(ctx->mem_end) = range_end;
>  }
>  
> +static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
> +					   struct srat_parse_ctx *ctx)
> +{
> +	u64 unit_size = unaccepted_table->unit_size;
> +	u64 start, end;
> +
> +	start = round_up(mem->base_address, unit_size);
> +	end = round_down(mem->base_address + mem->length, unit_size);

We can get here with start > end if srat range is less then unit_size.

> +
> +	/* Translate to offsets from the beginning of the bitmap */
> +	start -= unaccepted_table->phys_base;
> +	end -= unaccepted_table->phys_base;
> +
> +	bitmap_set(unaccepted_table->bitmap,
> +		   start / unit_size, (end - start) / unit_size);
> +}
> +
>  efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
>  					struct efi_boot_memmap *map)
>  {
> @@ -169,6 +186,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
>  	unaccepted_table->phys_base = unaccepted_start;
>  	unaccepted_table->size = bitmap_size;
>  	memset(unaccepted_table->bitmap, 0, bitmap_size);
> +	parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
>  
>  	status = efi_bs_call(install_configuration_table,
>  			     &unaccepted_table_guid, unaccepted_table);
> -- 
> 2.52.0
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Kiryl Shutsemau @ 2026-06-24 12:25 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
	mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
	ackerleytng, linux-kernel, linux-coco, virtualization, x86,
	yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-2-zhenzhong.duan@intel.com>

On Tue, Jun 23, 2026 at 06:17:32AM -0400, Zhenzhong Duan wrote:
> Currently, allocate_unaccepted_bitmap() only scans the initial EFI
> boot memory map. This misses hotpluggable ranges described in the
> ACPI SRAT. Without early tracking, hotplug pages are accessed without
> acceptance and this triggers guest crash.
> 
> Introduce a lightweight ACPI SRAT parser to scan these regions early.
> If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
> flags, expand the tracking boundaries. This avoids pulling in the full
> ACPI subsystem while ensuring the bitmap covers both static memory and
> hotplug memory.

Ugh.. Parsing SRAT there is ugly. I would rather avoid it.

Do I understand correctly that we don't have a way represent pluggable,
but not present memory in EFI memory map?

IIUC, EFI_MEMORY_HOT_PLUGGABLE is actually present, but unpluggable
memory.

Maybe it would be better just allocate bitmap upto maxmem?

And fix EFI spec to add pluggable-but-not-present attribute.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v2 02/17] x86/virt/tdx: Configure add-on features on TDX module init and update
From: Xu Yilun @ 2026-06-24 12:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
	yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
	tony.lindgren, peter.fang, baolu.lu, zhenzhong.duan, dave.hansen,
	seanjc
In-Reply-To: <4f4b0f29-424b-45ed-8cfd-c77da2ea390f@intel.com>

> There's also zero stopping us from putting version in args:
> 
> 	struct tdx_module_args args = {};
>   	int ret;
> 
> 	if (tdx_addon_feature0) {
> 		args.r9 = tdx_addon_feature0;
> 		args.version = 1;
> 	}
> 
> 	ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
> 
> Eh?
> 
> That gives args.version==0 in all the normal cases which just happens to
> be the exact behavior we want. It also avoids having to plumb version
> through all the seamcall*() wrappers.

Ah, on 2nd reading, I'm pretty sure now I understand your logical argument in
patch 1 and 2. It's good to me. I append my diff at the end.

> 
> But this is *exactly* the kind of thing that shouldn't be a part of an
> attestation patch series. This could very much have been a separate
> discussion and happened a month or a year ago. But now it is blocking
> this DICE thing from getting done <grumble>.

Sorry, I should have been more active in searching for the solution
rather than sticking to "kernel never keeps versions", when I've found
the problem that public modules are not available.

----8<----

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index f20e91d7ac35..972880910a5e 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -143,6 +143,8 @@ struct tdx_module_args {
        u64 rbx;
        u64 rdi;
        u64 rsi;
+       /* for RAX encoding */
+       u8  version;
 };

 /* Used to communicate with the TDX module */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 081816888f7a..b3c00ff4d819 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -95,6 +95,7 @@ static void __used common(void)
        OFFSET(TDX_MODULE_rbx, tdx_module_args, rbx);
        OFFSET(TDX_MODULE_rdi, tdx_module_args, rdi);
        OFFSET(TDX_MODULE_rsi, tdx_module_args, rsi);
+       OFFSET(TDX_MODULE_version, tdx_module_args, version);

        BLANK();
        OFFSET(BP_scratch, boot_params, scratch);
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 016a2a1ec1d6..d1d3d40c5614 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -48,6 +48,14 @@
        /* Move Leaf ID to RAX */
        mov %rdi, %rax

+       /*
+        * Extract the version from 'struct tdx_module_args', append it to
+        * RAX[23:16]
+        */
+       movzbl  TDX_MODULE_version(%rsi), %ecx
+       shll    $16, %ecx
+       orq     %rcx, %rax
+
        /* Move other input regs from 'struct tdx_module_args' */
        movq    TDX_MODULE_rcx(%rsi), %rcx
        movq    TDX_MODULE_rdx(%rsi), %rdx
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a6f8fd0a3df0..bc3aa1f78fc8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1036,7 +1036,6 @@ static __init void set_tdx_addon_features(void)
 static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
                                    u64 global_keyid)
 {
-       u64 seamcall_fn = TDH_SYS_CONFIG_V0;
        struct tdx_module_args args = {};
        u64 *tdmr_pa_array;
        size_t array_sz;
@@ -1059,18 +1058,18 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
        for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
                tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));

+       set_tdx_addon_features();
+
        args.rcx = __pa(tdmr_pa_array);
        args.rdx = tdmr_list->nr_consumed_tdmrs;
        args.r8 = global_keyid;

-       set_tdx_addon_features();
-
        if (tdx_addon_feature0) {
                args.r9 = tdx_addon_feature0;
-               seamcall_fn = TDH_SYS_CONFIG;
+               args.version = 1;
        }

-       ret = seamcall_prerr(seamcall_fn, &args);
+       ret = seamcall_prerr(TDH_SYS_CONFIG, &args);

        /* Free the array as it is not required anymore. */
        kfree(tdmr_pa_array);
@@ -1761,16 +1760,15 @@ int tdx_module_shutdown(void)

 int tdx_module_run_update(void)
 {
-       u64 seamcall_fn = TDH_SYS_UPDATE_V0;
        struct tdx_module_args args = {};
        int ret;

        if (tdx_addon_feature0) {
                args.r9 = tdx_addon_feature0;
-               seamcall_fn = TDH_SYS_UPDATE;
+               args.version = 1;
        }

-       ret = seamcall_prerr(seamcall_fn, &args);
+       ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
        if (ret)
                return ret;

@@ -2353,6 +2351,7 @@ u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid)
                .rcx = vp->tdvpr_pa,
                .rdx = initial_rcx,
                .r8 = x2apicid,
+               .version = 1,
        };

        return seamcall(TDH_VP_INIT, &args);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 32b13b0c85f9..018988c25caa 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -44,7 +44,7 @@
 #define TDH_VP_CREATE                  10
 #define TDH_MNG_KEY_FREEID             20
 #define TDH_MNG_INIT                   21
-#define TDH_VP_INIT                    SEAMCALL_LEAF_VER(22, 1)
+#define TDH_VP_INIT                    22
 #define TDH_PHYMEM_PAGE_RDMD           24
 #define TDH_VP_RD                      26
 #define TDH_PHYMEM_PAGE_RECLAIM                28
@@ -58,11 +58,9 @@
 #define TDH_PHYMEM_CACHE_WB            40
 #define TDH_PHYMEM_PAGE_WBINVD         41
 #define TDH_VP_WR                      43
-#define TDH_SYS_CONFIG_V0              45
-#define TDH_SYS_CONFIG                 SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
+#define TDH_SYS_CONFIG                 45
 #define TDH_SYS_SHUTDOWN               52
-#define TDH_SYS_UPDATE_V0              53
-#define TDH_SYS_UPDATE                 SEAMCALL_LEAF_VER(TDH_SYS_UPDATE_V0, 1)
+#define TDH_SYS_UPDATE                 53
 #define TDH_EXT_INIT                   60
 #define TDH_EXT_MEM_ADD                        61
 #define TDH_SYS_DISABLE                        69


^ permalink raw reply related

* Re: [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Ackerley Tng @ 2026-06-24  0:14 UTC (permalink / raw)
  To: Sean Christopherson, Julian Braha
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajnQVkLvFl_lMuGB@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jun 19, 2026, Julian Braha wrote:
>> Hi Ackerley,
>>
>> On 6/19/26 01:31, Ackerley Tng via B4 Relay wrote:
>>
>> >  config KVM_VM_MEMORY_ATTRIBUTES
>> > -	bool
>> > +	depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
>> > +	bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
>>
>> Sorry for the style nitpick, but could you keep the type and prompt as
>> the first attribute in the Kconfig option definition (like the other
>> options do)?
>
> No need to be sorry, I've no idea why I put the "depends" first.  I don't even
> know if that qualifies as a nit :-)
>
> Ackerley, if you can provide your SoB (for Fuad's feedback), I can fixup when
> applying (assuming nothing else necessitates v9).

Thanks, didn't notice this when consolidating this revision.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Ackerley Tng @ 2026-06-24  0:13 UTC (permalink / raw)
  To: Binbin Wu
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <a21bfc05-787e-4cd8-89af-8579357e6a12@linux.intel.com>

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> When memory attributes become trackable in guest_memfd, the concept of
>> having private memory is no longer dependent on
>> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
>>
>> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
>> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
>> in.
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
>
> One nit below.
>
>> ---
>>  arch/x86/include/asm/kvm_host.h | 4 +++-
>>  include/linux/kvm_host.h        | 2 +-
>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>  		       int tdp_max_root_level, int tdp_huge_page_level);
>>
>>
>> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) ||	\
>> +	defined(CONFIG_KVM_INTEL_TDX) ||	\
>> +	defined(CONFIG_KVM_AMD_SEV)
>
> Nit:
> Vertically align the defined(XXX) statements for better readability?
>

Sean had this aligned with spaces, and checkpatch complained about
having no spaces before tabs, so I switched it to tabs instead since I
don't think alignment like that is officially documented either way.

Either way is fine :)

>>  #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>>  #endif
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 201d0f2143976..d370e834d619e 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>>  }
>>  #endif
>>
>> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>> +#ifndef kvm_arch_has_private_mem
>>  static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>>  {
>>  	return false;
>>

^ permalink raw reply

* Re: [PATCH v8 01/46] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng @ 2026-06-24  0:09 UTC (permalink / raw)
  To: Sean Christopherson, Binbin Wu
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajnjTJdQKD1Kz3tf@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Jun 22, 2026, Binbin Wu wrote:
>> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>>
>> [...]
>>
>> >
>> > +static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
>> > +{
>> > +	struct maple_tree *mt = &GMEM_I(inode)->attributes;
>> > +	void *entry = mtree_load(mt, index);
>> > +
>> > +	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
>>
>> If the entry is unexpectedly missing, returning 0 means the attribute would
>> be treated as shared.  And then in kvm_gmem_fault_user_mapping(), it would
>> allow the userspace to fault in the folio.
>>
>> Should gmem deny such edge case?
>
> After several bugs this year where a WARN_ON_ONCE() fired, but was entirely
> insufficient to prevent true badness, I'm definitely senstive to making the "bad"
> behavior as harmless as possible.
>

I guess both are indeed awkward.

> However, in this case I think we're just hosed.  If KVM treats the memory as
> private, KVM will incorrectly do prepare(), incorrectly allow populate(), and
> will caused missed invalidations (though I suppose __kvm_gmem_set_attributes()
> "only" lies to userspace in that case).
>
> That said, assuming SHARED is definitely odd for cases where guest_memfd *can't*
> hold shared memory.  Ditto for assuming PRIVATE.  What if we instead fall back to
> the "init" state, e.g.?
>
> static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
> {
> 	struct maple_tree *mt = &GMEM_I(inode)->attributes;
> 	void *entry = mtree_load(mt, index);
>
> 	if (WARN_ON_ONCE(!entry)) {
> 		bool shared = GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED;
>
> 		return shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;

I was wondering if we should not only return the init state but also set
the init state, but that would involve performing a conversion to the
init state... Too complicated for an edge case.

> 	}
>
> 	return xa_to_value(entry);
> }

Thanks Binbin and Sean!

^ permalink raw reply

* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-23 19:15 UTC (permalink / raw)
  To: Ackerley Tng, Jethro Beekman, tglx, mingo, bp, dave.hansen, x86,
	hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgHDNGCETxLsy0v-_cBO1=1U+tXtOXWEFrXLU7pYz7U9ow@mail.gmail.com>

Hello Ackerley,

On 6/23/2026 12:48 PM, Ackerley Tng wrote:
> Jethro Beekman <jethro@fortanix.com> writes:
> 
>> On 2026-06-15 21:49, Ashish Kalra wrote:
>>> From: Ashish Kalra <ashish.kalra@amd.com>
>>>
>>> The SEV firmware enumerates the CPUs at SNP initialization and is not
>>> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
>>> hotplug can diverge from the firmware's expectations and break SNP.
>>> Disable CPU hotplug while SNP is active.
>>
>> I think this is too broad. If I have a hypervisor that supports SNP virtualization, a (non-confidential) L1 guest running Linux should still support CPU hotplug while also running confidential L2 guests.
>>
>> --
>> Jethro Beekman | CTO | Fortanix
>>
> 
> Were any other solutions considered other than disabling CPU hotplug?
> 
> Is this temporary until something else is implemented?
> 
> I'm not sure how commonly CPU hotplug is used, and if people are okay
> with trading in CPU hotplug to get SNP.
> 
> Is it that fundamentally the SEV firmware can't support hotplug, so
> there's no point in keeping it enabled anyway?

Yes, essentially. The SEV firmware knows nothing about when the OS takes CPUs online or offline. At SNP_INIT it accounts for all
the CPUs enabled via the BIOS/UEFI and establishes the per-core SNP state for them; it has no notion of the OS bringing CPUs up or
down afterwards. So OS hotplug actions can diverge from the firmware's expectations and break SNP. Disabling hotplug just makes
that constraint explicit — there's nothing useful to keep it enabled for:  a hot-removed core still "exists" as far as
the firmware and the per-core RMP/RMPOPT state are concerned, and a core brought online later was never set up for SNP.

> 
> Is there some way of supporting hotplug for CPUs that won't be used with
> SNP, for serving non-SNP VMs on the same host as SNP VMs, or is that too
> complicated?
>

Not really. SNP's memory-integrity guarantee rests on a single invariant: every memory write is subject to RMP checks to protect
against corruption of SEV-SNP guest memory. The moment any CPU can issue writes that aren't RMP-checked, that protection is
broken for the whole system — it's not something that can be confined to "that one core."

That's because SNP isn't per-core in that sense — it's a system-wide mode. SYSCFG[SNP_EN] is set on every core, the RMP covers all
of physical memory, and once SNP is enabled every memory write is subject to RMP checks on every core. A non-SNP guest sharing the
host still runs on cores that are part of the SNP-enabled system.

By the SNP architecture there simply can't be a CPU that isn't doing RMP checks while SNP is active, so SNP_EN has to be enforced
on every core. RMP enforcement is gated per-core by SYSCFG[SNP_EN] and it must be set on every core before SNP_INIT; a core with
SNP_EN clear performs no RMP checks at all, which the architecture doesn't allow once SNP is up. A newly hotplugged CPU comes up
without SNP_EN (SNP not enabled on it), and since it wasn't present when SNP_INIT ran it isn't part of the initialized SNP
configuration either — so it does no RMP checking. And because an SNP guest's vCPUs (or any guest for that matter) can be scheduled
on any online CPU, the guest could end up running on that core, accessing memory with no RMP enforcement and breaking SNP's memory integrity. 
There's no way to prevent that: KVM doesn't fence SNP guests (or any guests for that matter) off particular online cores.
And carving out cores for non-SNP-only use isn't possible by the architecture: SNP requires RMP checks on every CPU, so there's no valid
configuration with SNP active and a subset of cores exempt.

Thanks,
Ashish

 
>>>
>>> [...snip...]
>>>

^ permalink raw reply

* SVSM Development Call June 24th, 2026
From: Jörg Rödel @ 2026-06-23 19:01 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this weeks SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting  can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ackerley Tng @ 2026-06-23 17:50 UTC (permalink / raw)
  To: Kalra, Ashish, K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86,
	hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
	babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
	linux-crypto, kvm, linux-coco
In-Reply-To: <8c5f4082-e3a5-4f65-b058-33938a7ee324@amd.com>

"Kalra, Ashish" <ashish.kalra@amd.com> writes:

>
> [...snip...]
>
>
> Yes, a simpler implementation will be like this:
> ...
>
>  	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))

Perhaps have a WARN_ON_ONCE() here so we know rmpopt was not performed?
Not a huge deal without though.

>                 return;
>
>  	cpumask_copy(follower_mask, &rmpopt_cpumask);
>
>         /*
>          * The current CPU's core always has RMPOPT_BASE programmed
>          * (snp_prepare() required all CPUs online at setup and CPU hotplug
>          * is disabled while SNP is active), so it can always be the leader.
>          * RMPOPT_BASE is per-core; exclude this core from the followers.
>          */
>         migrate_disable();
>         cpumask_andnot(follower_mask, follower_mask,
>                        topology_sibling_cpumask(smp_processor_id()));
>
>         for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>                 rmpopt(pa);
>                 cond_resched();
>         }
>         migrate_enable();
>
>         cpus_read_lock();
>         for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>                 on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
>                 cond_resched();
>         }
>         cpus_read_unlock();
>
>         free_cpumask_var(follower_mask);
>
>

Definitely better than the version in the original patch :) Thanks!

>  Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:
>
>         cpumask_andnot(follower_mask, follower_mask,
>                        topology_sibling_cpumask(smp_processor_id()));
>
>   - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
>
>   - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
>   follower_mask -> andnot still removes this core's primary.
>
>   So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
>   clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
>   both cases).
>
> Thanks,
> Ashish
>
>>> +		goto followers;
>>> +	}
>>> +
>>> +	migrate_enable();
>>> +

^ permalink raw reply

* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Ackerley Tng @ 2026-06-23 17:48 UTC (permalink / raw)
  To: Jethro Beekman, Ashish Kalra, tglx, mingo, bp, dave.hansen, x86,
	hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0df3b665-3a9c-4c46-a7aa-14388e8e1577@fortanix.com>

Jethro Beekman <jethro@fortanix.com> writes:

> On 2026-06-15 21:49, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> The SEV firmware enumerates the CPUs at SNP initialization and is not
>> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
>> hotplug can diverge from the firmware's expectations and break SNP.
>> Disable CPU hotplug while SNP is active.
>
> I think this is too broad. If I have a hypervisor that supports SNP virtualization, a (non-confidential) L1 guest running Linux should still support CPU hotplug while also running confidential L2 guests.
>
> --
> Jethro Beekman | CTO | Fortanix
>

Were any other solutions considered other than disabling CPU hotplug?

Is this temporary until something else is implemented?

I'm not sure how commonly CPU hotplug is used, and if people are okay
with trading in CPU hotplug to get SNP.

Is it that fundamentally the SEV firmware can't support hotplug, so
there's no point in keeping it enabled anyway?

Is there some way of supporting hotplug for CPUs that won't be used with
SNP, for serving non-SNP VMs on the same host as SNP VMs, or is that too
complicated?

>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH 1/4] kvm: sev: Fix user-space triggerable WARN_ON on snp_launch_update path
From: Sean Christopherson @ 2026-06-23 14:46 UTC (permalink / raw)
  To: Jörg Rödel
  Cc: Paolo Bonzini, x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky,
	Ashish Kalra, Michael Roth, kvm, linux-kernel, linux-coco,
	Joerg Roedel
In-Reply-To: <20260623091556.1500930-2-joro@8bytes.org>

Please capitalize the scope, i.e. "KVM: SEV:".

On Tue, Jun 23, 2026, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
> 
> Sashiko reported on an unrelated patch:
> 
>   [Severity: High]
>   This is a pre-existing issue, but can a host userspace process trigger a
>   kernel warning by passing a NULL user address (uaddr = 0) here?
> 
>   If params.uaddr is 0, src becomes NULL and passes the PAGE_ALIGNED(src)
>   check. kvm_gmem_populate() skips fetching the user page and passes
>   src_page = NULL to sev_gmem_post_populate().
> 
>   That function then unconditionally evaluates:
> 
>   WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
>                !src_page)
> 
>   Since the type isn't ZERO, won't this allow an unprivileged user to spam
>   the kernel log?

Use Reported-by: + Closes to capture Sashiko's effecitve bug report instead of
copy+pasting the finding.  There's no reason to treat Sashiko any differently
than any other bot.

> The assessment is correct, so check for this condition earlier in the
> snp_launch_update() path to avoid the WARN_ON_ONCE.
> 
> Fixes: dee5a47cc7a45 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
>  arch/x86/kvm/svm/sev.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 6c6a6d663e29..41dcba5180ca 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2438,6 +2438,13 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	if (!PAGE_ALIGNED(src))
>  		return -EINVAL;
>  
> +	/*
> +	 * Make sure user-mode did not pass NULL as src with
> +	 * type != KVM_SEV_SNP_PAGE_TYPE_ZERO.

Meh, that's pretty obvious from the code.

> +	 */
> +	if (src == NULL && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)

I think I'd prefer this over checking for KVM_SEV_SNP_PAGE_TYPE_ZERO twice,
especially since the PAGE_ALIGNED() check for the NULL pointer case is rather
weird.

	if (params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO)
		src = NULL;
	else if (!params.uaddr || !PAGE_ALIGNED(params.uaddr))
		return -EINVAL;
	else
		src = u64_to_user_ptr(params.uaddr);


> +		return -EINVAL;

Gah, we created quite the mess for ourselves.  TDX returns -EOPNOTSUPP instead
of -EINVAL, I guess as a placeholder for in-place conversion?  I don't care which
error code is returned, but I do want KVM to be consistent.

We should also adjust TDX to pre-check the source address, because checking only
in the post-populate flow subtly relies on tdx_vcpu_init_mem_region() returning
immediately on error.  If that weren't the case (ignoring for the moment that
continuing on would be nonsensical), KVM would advace the address by PAGE_SIZE
and suddenly a NULL userspace address becomes non-NULL.

I also think it makes sense to drop the WARN in sev_gmem_post_populate(), it's
completely redundant once this bug is fixed.

Ugh, and both SNP and TDX fail to account for tags, and fail to check for
striding into kernel space.  Which I suppose is fine, since gup() handles those
correctly.  And I don't see a strong argument for disallowing tagged addresses,
because unlike the userspace address for memslots, KVM doesn't keep the address
around long-term.

So over two patches, the below?  I can send a v2, I've already got changelogs
written (I was fiddling around with extracting and reusing kvm_set_memory_region()'s
checks on the userspace address+size, but as above, convinced myself that KVM
should continue to allow tagged addresses for SNP and TDX).

---
 arch/x86/kvm/svm/sev.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 74fb15551e83..621a2eaa58f2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2330,9 +2330,6 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	int level;
 	int ret;
 
-	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
-		return -EINVAL;
-
 	ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
 	if (ret || assigned) {
 		pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
@@ -2421,10 +2418,12 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	     params.type != KVM_SEV_SNP_PAGE_TYPE_CPUID))
 		return -EINVAL;
 
-	src = params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO ? NULL : u64_to_user_ptr(params.uaddr);
-
-	if (!PAGE_ALIGNED(src))
+	if (params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO)
+		src = NULL;
+	else if (!params.uaddr || !PAGE_ALIGNED(params.uaddr))
 		return -EINVAL;
+	else
+		src = u64_to_user_ptr(params.uaddr);
 
 	npages = params.len / PAGE_SIZE;

---
 arch/x86/kvm/vmx/tdx.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ffe9d0db58c5..b0ec054732b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3198,9 +3198,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
 		return -EIO;
 
-	if (!src_page)
-		return -EOPNOTSUPP;
-
 	kvm_tdx->page_add_src = src_page;
 	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
 	kvm_tdx->page_add_src = NULL;
@@ -3247,8 +3244,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	if (copy_from_user(&region, u64_to_user_ptr(cmd->data), sizeof(region)))
 		return -EFAULT;
 
-	if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
-	    !region.nr_pages ||
+	if (!PAGE_ALIGNED(region.source_addr) || !region.source_addr ||
+	    !PAGE_ALIGNED(region.gpa) || !region.nr_pages ||
 	    region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
 	    !vt_is_tdx_private_gpa(kvm, region.gpa) ||
 	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
-- 

^ permalink raw reply related

* Re: [PATCH 3/4] KVM: guest_memfd: Add `write` parameter to kvm_gmem_populate()
From: Sean Christopherson @ 2026-06-23 12:57 UTC (permalink / raw)
  To: Jörg Rödel
  Cc: Paolo Bonzini, x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky,
	Ashish Kalra, Michael Roth, kvm, linux-kernel, linux-coco,
	Joerg Roedel
In-Reply-To: <20260623091556.1500930-4-joro@8bytes.org>

On Tue, Jun 23, 2026, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
> 
> The call-path of kvm_gmem_populate() might subsequently write to the
> page provided by user-space. This is used to provide detailed error
> information in case the page population failed.
> 
> But since kvm_gmem_populate() only acquires a read-only reference to
> the user-space page via get_user_pages_fast(), the error information
> might be written to a read-only page later on.
> 
> Add a parameter to kvm_gmem_populate() to optionally acquire a
> writeable reference to the source page to make sure page permissions
> can be enforced.

Already fixed, commit f13e90059908 ("KVM: SEV: Pin source page for write when
adding CPUID data for SNP guest").

^ permalink raw reply

* [RFCv2 PATCH 6/6] virtio-mem: Support memory hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

Integrate coco memory management operations into the virtio-mem driver to
manage the state of hotplug memory.

In virtio_mem_send_plug_request(), once the host hypervisor acknowledges a
plug request, invoke coco_set_plugged_bitmap() to set the corresponding
bits in the plugged bitmap. Conversely, in virtio_mem_send_unplug_request()
and virtio_mem_send_unplug_all_request(), call unaccept_memory() to let the
guest autonomously transition the target private pages back to "unaccepted"
state before asking the VMM to unplug them. After the VMM acknowledges the
unplug request, clear the ranges from the plugged bitmap.

Note that memory block hotplug/unplug also sets or clears the plugged
bitmap at memory block granularity. While doing this at device block
granularity here creates a slight redundancy, it is completely harmless.

Additionally, update virtio_mem_fake_online() to explicitly invoke
accept_memory() when transitioning memory out of the fake-offline state and
back into service. This ensures that any pages returning to the buddy
system are cleanly accepted by the guest architecture before they are freed
back into the allocator via free_contig_range().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 drivers/virtio/virtio_mem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 48051e9e98ab..9f6e53df8caf 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1211,6 +1211,7 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 			generic_online_page(page, order);
 		} else {
 			virtio_mem_clear_fake_offline(pfn + i, 1 << order, true);
+			accept_memory(page_to_phys(page), PAGE_SIZE << order);
 			free_contig_range(pfn + i, 1 << order);
 			adjust_managed_page_count(page, 1 << order);
 		}
@@ -1436,6 +1437,7 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
 	switch (virtio_mem_send_request(vm, &req)) {
 	case VIRTIO_MEM_RESP_ACK:
 		vm->plugged_size += size;
+		WARN_ON(coco_set_plugged_bitmap(addr, size, true));
 		return 0;
 	case VIRTIO_MEM_RESP_NACK:
 		rc = -EAGAIN;
@@ -1471,9 +1473,12 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
 	dev_dbg(&vm->vdev->dev, "unplugging memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
 
+	unaccept_memory(addr, size);
+
 	switch (virtio_mem_send_request(vm, &req)) {
 	case VIRTIO_MEM_RESP_ACK:
 		vm->plugged_size -= size;
+		WARN_ON(coco_set_plugged_bitmap(addr, size, false));
 		return 0;
 	case VIRTIO_MEM_RESP_BUSY:
 		rc = -ETXTBSY;
@@ -1498,10 +1503,13 @@ static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
 
 	dev_dbg(&vm->vdev->dev, "unplugging all memory");
 
+	unaccept_memory(vm->addr, vm->region_size);
+
 	switch (virtio_mem_send_request(vm, &req)) {
 	case VIRTIO_MEM_RESP_ACK:
 		vm->unplug_all_required = false;
 		vm->plugged_size = 0;
+		WARN_ON(coco_set_plugged_bitmap(vm->addr, vm->region_size, false));
 		/* usable region might have shrunk */
 		atomic_set(&vm->config_changed, 1);
 		return 0;
-- 
2.52.0


^ permalink raw reply related

* [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

Integrate coco memory management operations into the core memory hotplug
subsystem to handle the lifecycle of hotplug memory.

In add_memory_resource(), invoke coco_set_plugged_bitmap(..., true) to mark
memory plugged before adding the memory block, because self hosted memmap
initialization needs their plugged bits set before acceptance. There is no
explicit call to accept_memory() for normal pages, because they can be
lazily accepted by the core memory management subsystem after the memory
block is onlined.

In try_remove_memory(), before finalizing the physical removal of the
memory blocks, invoke unaccept_memory(). This allows the guest to take
direct control of its own memory state and release the pages itself,
eliminating the dependency on the VMM to implicitly hole-punch the memory.
It loops through the targeted ranges using find_next_andnot_bit(), matching
pages that are marked plugged and accepted, and releases them back to the
host. Following the unacceptance step, clear the ranges from the plugged
bitmap.

These operations guarantee that both the unaccepted and plugged tracking
states stay completely synchronized with the actual dynamic memory
configurations of the guest.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/linux/mm.h                       | 11 +++
 drivers/firmware/efi/unaccepted_memory.c | 94 ++++++++++++++++++++++++
 mm/memory_hotplug.c                      | 16 ++++
 3 files changed, 121 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fc2acedf0b76..4c094038872a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5105,6 +5105,8 @@ int set_anon_vma_name(unsigned long addr, unsigned long size,
 
 bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size);
 void accept_memory(phys_addr_t start, unsigned long size);
+void unaccept_memory(phys_addr_t start, unsigned long size);
+int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set);
 
 #else
 
@@ -5118,6 +5120,15 @@ static inline void accept_memory(phys_addr_t start, unsigned long size)
 {
 }
 
+static inline void unaccept_memory(phys_addr_t start, unsigned long size)
+{
+}
+
+static inline int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index c290b16c5142..f35f7016af53 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -233,6 +233,100 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
 	return ret;
 }
 
+static int coco_hotplug_range_check(struct efi_unaccepted_memory *unaccepted,
+				    phys_addr_t start, unsigned long size)
+{
+	u64 unit_size = unaccepted->unit_size;
+	u64 phys_base = unaccepted->phys_base;
+	u64 phys_end = phys_base + unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	if (!IS_ALIGNED(start | size, unit_size))
+		return -EINVAL;
+
+	if (start < phys_base || start + size > phys_end)
+		return -EINVAL;
+
+	return 0;
+}
+
+/* Only used by hotplug memory, we don't unaccept static memory */
+void unaccept_memory(phys_addr_t start, unsigned long size)
+{
+	unsigned long range_start, range_end, bitmap_size, flags;
+	struct efi_unaccepted_memory *unaccepted;
+	void *plugged_bitmap;
+	u64 unit_size;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return;
+
+	if (WARN_ON(coco_hotplug_range_check(unaccepted, start, size)))
+		return;
+
+	unit_size = unaccepted->unit_size;
+	range_start = (start - unaccepted->phys_base) / unit_size;
+	bitmap_size = range_start + size / unit_size;
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for (; range_start < bitmap_size; range_start = range_end) {
+		unsigned long phys_start, phys_end;
+		unsigned long unaccepted_one, plugged_zero;
+
+		range_start = find_next_andnot_bit(plugged_bitmap, unaccepted->bitmap,
+						   bitmap_size, range_start);
+
+		if (range_start >= bitmap_size)
+			break;
+
+		unaccepted_one = find_next_bit(unaccepted->bitmap, bitmap_size, range_start);
+		plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size, range_start);
+		range_end = min(unaccepted_one, plugged_zero);
+
+		phys_start = range_start * unit_size + unaccepted->phys_base;
+		phys_end = range_end * unit_size + unaccepted->phys_base;
+
+		arch_unaccept_memory(phys_start, phys_end);
+		bitmap_set(unaccepted->bitmap, range_start, range_end - range_start);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+/*
+ * Only used by hotplug memory, plugged bits of static memory are handled
+ * in process_unaccepted_memory()
+ */
+int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long range_start, flags;
+	void *plugged_bitmap;
+	u64 unit_size;
+	int ret;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return 0;
+
+	ret = coco_hotplug_range_check(unaccepted, start, size);
+	if (ret)
+		return ret;
+
+	unit_size = unaccepted->unit_size;
+	range_start = (start - unaccepted->phys_base) / unit_size;
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	if (set)
+		bitmap_set(plugged_bitmap, range_start, size / unit_size);
+	else
+		bitmap_clear(plugged_bitmap, range_start, size / unit_size);
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_VMCORE
 static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
 						unsigned long pfn)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 40c7915dabe0..2f71514a0616 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1429,6 +1429,8 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
 
 		arch_remove_memory(cur_start, memblock_size, altmap);
 
+		unaccept_memory(cur_start, PFN_PHYS(altmap->free));
+
 		/* Verify that all vmemmap pages have actually been freed. */
 		WARN(altmap->alloc, "Altmap not fully unmapped");
 		kfree(altmap);
@@ -1459,9 +1461,13 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 			goto out;
 		}
 
+		/* Accept self hosted memmap array before access it */
+		accept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
+
 		/* call arch's memory hotadd */
 		ret = arch_add_memory(nid, cur_start, memblock_size, &params);
 		if (ret < 0) {
+			unaccept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
 			kfree(params.altmap);
 			goto out;
 		}
@@ -1471,6 +1477,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 						  params.altmap, group);
 		if (ret) {
 			arch_remove_memory(cur_start, memblock_size, NULL);
+			unaccept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
 			kfree(params.altmap);
 			goto out;
 		}
@@ -1540,6 +1547,10 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		new_node = true;
 	}
 
+	ret = coco_set_plugged_bitmap(start, size, true);
+	if (ret)
+		goto error_offline_node;
+
 	/*
 	 * Self hosted memmap array
 	 */
@@ -1584,6 +1595,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	return ret;
 error:
+	WARN_ON(coco_set_plugged_bitmap(start, size, false));
+error_offline_node:
 	if (new_node) {
 		node_set_offline(nid);
 		unregister_node(nid);
@@ -2282,6 +2295,9 @@ static int try_remove_memory(u64 start, u64 size)
 	if (nid != NUMA_NO_NODE)
 		try_offline_node(nid);
 
+	unaccept_memory(start, size);
+	WARN_ON(coco_set_plugged_bitmap(start, size, false));
+
 	mem_hotplug_done();
 	return 0;
 }
-- 
2.52.0


^ permalink raw reply related

* [RFCv2 PATCH 4/6] x86/tdx: Implement arch_unaccept_memory()
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

During memory hot-unplug, if the VMM does not punch hole the memory, the
memory stays in "accepted" state. Consequently, subsequent re-acceptance
of that same memory during a re-plug operation will trigger re-accept
failure. To guard this, a confidential guest must maintain control of
the memory state explicitly, e.g., setting memory to "unaccepted" state
during unplug.

In the context of TDX, the "unaccepted" state maps to the PENDING state,
while the "accepted" state maps to the MAPPED state. Implement
arch_unaccept_memory() for TDX guest via the TDG.MEM.PAGE.RELEASE TDCALL.
It uses 1G/2M/4K page size fallbacks and rolls back on partial failure. A
failure during this rollback step indicates severe corruption of the TDX
module state and triggers a kernel panic.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 arch/x86/include/asm/shared/tdx.h        |   2 +
 arch/x86/include/asm/tdx.h               |   2 +
 arch/x86/include/asm/unaccepted_memory.h |  11 +++
 arch/x86/coco/tdx/tdx.c                  | 120 +++++++++++++++++++++++
 4 files changed, 135 insertions(+)

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 049638e3da74..910ec1e57528 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -19,6 +19,7 @@
 #define TDG_MEM_PAGE_ACCEPT		6
 #define TDG_VM_RD			7
 #define TDG_VM_WR			8
+#define TDG_MEM_PAGE_RELEASE		30
 
 /* TDX TD attributes */
 #define TDX_TD_ATTR_DEBUG_BIT		0
@@ -54,6 +55,7 @@
 
 /* TDCS_CONFIG_FLAGS bits */
 #define TDCS_CONFIG_FLEXIBLE_PENDING_VE	BIT_ULL(1)
+#define TDCS_CONFIG_PAGE_RELEASE	BIT_ULL(6)
 
 /* TDCS_TD_CTLS bits */
 #define TD_CTLS_PENDING_VE_DISABLE_BIT	0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a149740b24e8..8608d33a7db6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -72,6 +72,8 @@ int tdx_mcall_extend_rtmr(u8 index, u8 *data);
 
 u64 tdx_hcall_get_quote(u8 *buf, size_t size);
 
+bool tdx_unaccept_memory(phys_addr_t start, phys_addr_t end);
+
 void __init tdx_dump_attributes(u64 td_attr);
 void __init tdx_dump_td_ctls(u64 td_ctls);
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index f5937e9866ac..9fd9411d2c44 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -18,6 +18,17 @@ static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
 	}
 }
 
+static inline void arch_unaccept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-unacceptance call goes here */
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		if (!tdx_unaccept_memory(start, end))
+			panic("TDX: Failed to unaccept memory\n");
+	} else {
+		panic("Cannot unaccept memory: unknown platform\n");
+	}
+}
+
 static inline struct efi_unaccepted_memory *efi_get_unaccepted_table(void)
 {
 	if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 186915a17c50..1bab8f4687bf 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -326,6 +326,124 @@ static void reduce_unnecessary_ve(void)
 	enable_cpu_topology_enumeration();
 }
 
+static bool tdx_page_release_supported;
+
+static void tdx_detect_page_release_support(void)
+{
+	u64 config = 0;
+
+	tdg_vm_rd(TDCS_CONFIG_FLAGS, &config);
+
+	tdx_page_release_supported = !!(config & TDCS_CONFIG_PAGE_RELEASE);
+}
+
+static unsigned long try_release_one(phys_addr_t start, unsigned long len,
+				     enum pg_level pg_level)
+{
+	unsigned long release_size = page_level_size(pg_level);
+	struct tdx_module_args args = {};
+	u8 page_size;
+	u64 ret;
+
+	if (!IS_ALIGNED(start, release_size))
+		return 0;
+
+	if (len < release_size)
+		return 0;
+
+	/*
+	 * Pass the page physical address to TDX module to release the
+	 * private page and to put it in PENDING state.
+	 *
+	 * Encode page size in RCX[2:0] using TDX_PS_*
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		page_size = TDX_PS_4K;
+		break;
+	case PG_LEVEL_2M:
+		page_size = TDX_PS_2M;
+		break;
+	case PG_LEVEL_1G:
+		page_size = TDX_PS_1G;
+		break;
+	default:
+		return 0;
+	}
+
+	args.rcx = start | page_size;
+	ret = __tdcall(TDG_MEM_PAGE_RELEASE, &args);
+	if (ret)
+		return 0;
+
+	return release_size;
+}
+
+static bool tdx_release_memory(phys_addr_t start, phys_addr_t end, phys_addr_t *cur)
+{
+	*cur = start;
+
+	while (*cur < end) {
+		unsigned long len = end - *cur;
+		unsigned long release_size;
+
+		/*
+		 * Try larger release first. It speeds up process by cutting
+		 * number of hypercalls (if successful).
+		 */
+
+		release_size = try_release_one(*cur, len, PG_LEVEL_1G);
+		if (!release_size)
+			release_size = try_release_one(*cur, len, PG_LEVEL_2M);
+		if (!release_size)
+			release_size = try_release_one(*cur, len, PG_LEVEL_4K);
+		if (!release_size)
+			return false;
+		*cur += release_size;
+	}
+
+	return true;
+}
+
+/**
+ * Release private memory and put it in PENDING state.
+ *
+ * @start: Physical start address of memory range to release
+ * @end:   Physical end address of memory range to release
+ *
+ * Uses TDG.MEM.PAGE.RELEASE TDCALL to transition private pages back to
+ * PENDING state. If PAGE_RELEASE is not supported by the TDX
+ * configuration, returns true (success) as no action is needed.
+ *
+ * On partial failure, automatically re-accepts any successfully released
+ * pages to restore consistent memory state. Re-acceptance failure is
+ * treated as a fatal error since it indicates severe TDX module issues.
+ *
+ * Returns: true on success, false on failure
+ */
+bool tdx_unaccept_memory(phys_addr_t start, phys_addr_t end)
+{
+	phys_addr_t released = start;
+	bool ret;
+
+	if (!tdx_page_release_supported)
+		return true;
+
+	ret = tdx_release_memory(start, end, &released);
+	if (!ret) {
+		pr_err("Failed to unaccept memory [%pa, %pa)\n", &start, &end);
+		/*
+		 * Re-accept any pages that were successfully released before
+		 * the failure occurred. This should never fail since we're
+		 * just restoring the previous MAPPED state.
+		 */
+		if (!tdx_accept_memory(start, released))
+			panic("%s: Failed to re-accept memory\n", __func__);
+	}
+
+	return ret;
+}
+
 static void tdx_setup(u64 *cc_mask)
 {
 	struct tdx_module_args args = {};
@@ -359,6 +477,8 @@ static void tdx_setup(u64 *cc_mask)
 	disable_sept_ve(td_attr);
 
 	reduce_unnecessary_ve();
+
+	tdx_detect_page_release_support();
 }
 
 /*
-- 
2.52.0


^ permalink raw reply related

* [RFCv2 PATCH 3/6] efi/unaccepted: Create plugged bitmap to support hotplug memory in coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

The load_unaligned_zeropad() function can cause unintended memory loads
across page boundaries. To safely handle these unaligned reads in a
confidential computing guest, the kernel implicitly accepts an extra
unit_size block of memory to serve as a safety guard.

However, near hotplug boundaries, this extra acceptance can fall within
unpopulated gaps between hotplugged memory ranges, triggering a guest
kernel crash.

To protect these boundaries against out-of-bounds access, introduce a
"plugged" bitmap positioned immediately following the unaccepted memory
bitmap.

Initial static boot memory ranges have their corresponding bits marked
as plugged by default during early initialization. For hotpluggable
memory ranges, the memory driver must explicitly set the proper bits
when a memory block is plugged, and clear them upon an unplug event.

Update accept_memory() and range_contains_unaccepted_memory() to check
the intersection of both bitmaps. The kernel now combines them to
determine exactly which plugged, unaccepted pages require acceptance.

Additionally, bump the unaccepted memory table layout version from 1
to 2. This strict layout enforcement guarantees that a version 1 table
passed to a new kernel, or a version 2 table passed to an old kernel,
will explicitly fail kexec early due to the version mismatch.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/linux/efi.h                           |  5 ++++
 arch/x86/boot/compressed/mem.c                |  2 +-
 drivers/firmware/efi/efi.c                    |  4 +--
 .../firmware/efi/libstub/unaccepted_memory.c  | 16 +++++++----
 drivers/firmware/efi/unaccepted_memory.c      | 28 +++++++++++++++----
 5 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/include/linux/efi.h b/include/linux/efi.h
index ccbc35479684..579d102f128a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -551,6 +551,11 @@ struct efi_unaccepted_memory {
 	unsigned long bitmap[];
 };
 
+static inline void *plugged_bitmap_of(struct efi_unaccepted_memory *u)
+{
+	return (void *)u->bitmap + u->size;
+}
+
 /*
  * Architecture independent structure for describing a memory map for the
  * benefit of efi_memmap_init_early(), and for passing context between
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 40e9c81a2206..61b8d0edd2f6 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -69,7 +69,7 @@ bool init_unaccepted_memory(void)
 	if (!table)
 		return false;
 
-	if (table->version != 1)
+	if (table->version != 2)
 		error("Unknown version of unaccepted memory table\n");
 
 	/*
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 318d1cc9a066..7f7341634c13 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -701,7 +701,7 @@ static __init void reserve_unaccepted(struct efi_unaccepted_memory *unaccepted)
 	phys_addr_t start, end;
 
 	start = PAGE_ALIGN_DOWN(efi.unaccepted);
-	end = PAGE_ALIGN(efi.unaccepted + sizeof(*unaccepted) + unaccepted->size);
+	end = PAGE_ALIGN(efi.unaccepted + sizeof(*unaccepted) + unaccepted->size * 2);
 
 	memblock_add(start, end - start);
 	memblock_reserve(start, end - start);
@@ -837,7 +837,7 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
 		unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
 		if (unaccepted) {
 
-			if (unaccepted->version == 1) {
+			if (unaccepted->version == 2) {
 				reserve_unaccepted(unaccepted);
 			} else {
 				efi.unaccepted = EFI_INVALID_TABLE_ADDR;
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index 01bed8e751ca..5b0deb6c91f1 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -113,7 +113,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
 	efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
-	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size, total_size;
 	struct srat_parse_ctx ctx;
 	efi_status_t status;
 	int i;
@@ -124,7 +124,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 	/* Check if the table is already installed */
 	unaccepted_table = get_efi_config_table(unaccepted_table_guid);
 	if (unaccepted_table) {
-		if (unaccepted_table->version != 1) {
+		if (unaccepted_table->version != 2) {
 			efi_err("Unknown version of unaccepted memory table\n");
 			return EFI_UNSUPPORTED;
 		}
@@ -173,19 +173,22 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 	bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
 				   EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
 
+	/* There is a plugged bitmap after unaccepted bitmap */
+	total_size = bitmap_size << 1;
+
 	status = efi_bs_call(allocate_pool, EFI_ACPI_RECLAIM_MEMORY,
-			     sizeof(*unaccepted_table) + bitmap_size,
+			     sizeof(*unaccepted_table) + total_size,
 			     (void **)&unaccepted_table);
 	if (status != EFI_SUCCESS) {
 		efi_err("Failed to allocate unaccepted memory config table\n");
 		return status;
 	}
 
-	unaccepted_table->version = 1;
+	unaccepted_table->version = 2;
 	unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
 	unaccepted_table->phys_base = unaccepted_start;
 	unaccepted_table->size = bitmap_size;
-	memset(unaccepted_table->bitmap, 0, bitmap_size);
+	memset(unaccepted_table->bitmap, 0, total_size);
 	parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
 
 	status = efi_bs_call(install_configuration_table,
@@ -287,6 +290,9 @@ void process_unaccepted_memory(u64 start, u64 end)
 	 */
 	bitmap_set(unaccepted_table->bitmap,
 		   start / unit_size, (end - start) / unit_size);
+	/* Set plugged bits for static memory and never unset */
+	bitmap_set(plugged_bitmap_of(unaccepted_table),
+		   start / unit_size, (end - start) / unit_size);
 }
 
 void accept_memory(phys_addr_t start, unsigned long size)
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 4a8ec8d6a571..c290b16c5142 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -38,6 +38,7 @@ void accept_memory(phys_addr_t start, unsigned long size)
 	unsigned long flags;
 	phys_addr_t end;
 	u64 unit_size;
+	void *plugged_bitmap;
 
 	unaccepted = efi_get_unaccepted_table();
 	if (!unaccepted)
@@ -126,12 +127,23 @@ void accept_memory(phys_addr_t start, unsigned long size)
 	 */
 	list_add(&range.list, &accepting_list);
 
-	range_start = range.start;
-	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
-				   range.end) {
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+	for (range_start = range.start; range_start < range.end; range_start = range_end) {
 		unsigned long phys_start, phys_end;
-		unsigned long len = range_end - range_start;
+		unsigned long len;
+		unsigned long unaccepted_zero, plugged_zero;
+
+		range_start = find_next_and_bit(plugged_bitmap, unaccepted->bitmap,
+						range.end, range_start);
+
+		if (range_start >= range.end)
+			break;
 
+		unaccepted_zero = find_next_zero_bit(unaccepted->bitmap, range.end, range_start);
+		plugged_zero = find_next_zero_bit(plugged_bitmap, range.end, range_start);
+		range_end = min(unaccepted_zero, plugged_zero);
+		len = range_end - range_start;
 		phys_start = range_start * unit_size + unaccepted->phys_base;
 		phys_end = range_end * unit_size + unaccepted->phys_base;
 
@@ -167,6 +179,7 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
 	bool ret = false;
 	phys_addr_t end;
 	u64 unit_size;
+	void *plugged_bitmap;
 
 	unaccepted = efi_get_unaccepted_table();
 	if (!unaccepted)
@@ -201,9 +214,14 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
 
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	while (start < end) {
-		if (test_bit(start / unit_size, unaccepted->bitmap)) {
+		unsigned long range_start = start / unit_size;
+
+		if (test_bit(range_start, plugged_bitmap) &&
+		    test_bit(range_start, unaccepted->bitmap)) {
 			ret = true;
 			break;
 		}
-- 
2.52.0


^ permalink raw reply related

* [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

In coco guests, hotpluggable memory ranges are initially unaccepted.
While a previous change expanded the unaccepted memory bitmap boundaries
to include these hotplug spaces, the actual bits inside the bitmap are
not yet marked as unaccepted.

Walks SRAT a second time after the bitmap is allocated and sets the bits
corresponding to hotpluggable ranges.

This ensures the bitmap state accurately reflects all static and hotplug
memory ranges before booting kernel.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 .../firmware/efi/libstub/unaccepted_memory.c   | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index bfbb78bd7b8a..01bed8e751ca 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
 		*(ctx->mem_end) = range_end;
 }
 
+static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
+					   struct srat_parse_ctx *ctx)
+{
+	u64 unit_size = unaccepted_table->unit_size;
+	u64 start, end;
+
+	start = round_up(mem->base_address, unit_size);
+	end = round_down(mem->base_address + mem->length, unit_size);
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted_table->phys_base;
+	end -= unaccepted_table->phys_base;
+
+	bitmap_set(unaccepted_table->bitmap,
+		   start / unit_size, (end - start) / unit_size);
+}
+
 efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
@@ -169,6 +186,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 	unaccepted_table->phys_base = unaccepted_start;
 	unaccepted_table->size = bitmap_size;
 	memset(unaccepted_table->bitmap, 0, bitmap_size);
+	parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
 
 	status = efi_bs_call(install_configuration_table,
 			     &unaccepted_table_guid, unaccepted_table);
-- 
2.52.0


^ permalink raw reply related

* [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

Currently, allocate_unaccepted_bitmap() only scans the initial EFI
boot memory map. This misses hotpluggable ranges described in the
ACPI SRAT. Without early tracking, hotplug pages are accessed without
acceptance and this triggers guest crash.

Introduce a lightweight ACPI SRAT parser to scan these regions early.
If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
flags, expand the tracking boundaries. This avoids pulling in the full
ACPI subsystem while ensuring the bitmap covers both static memory and
hotplug memory.

Bail out early with success on non-confidential guests to prevent
unnecessary bitmap allocation.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 drivers/firmware/efi/libstub/efistub.h        |  6 ++
 arch/x86/boot/compressed/mem.c                |  2 +-
 .../firmware/efi/libstub/unaccepted_memory.c  | 94 +++++++++++++++++++
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index fd91fc15ec81..fc0cd33a5962 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -1260,4 +1260,10 @@ void arch_accept_memory(phys_addr_t start, phys_addr_t end);
 efi_status_t efi_zboot_decompress_init(unsigned long *alloc_size);
 efi_status_t efi_zboot_decompress(u8 *out, unsigned long outlen);
 
+bool early_is_tdx_guest(void);
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool early_is_sevsnp_guest(void);
+#else
+static inline bool early_is_sevsnp_guest(void) { return false; }
+#endif
 #endif
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 0e9f84ab4bdc..40e9c81a2206 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -12,7 +12,7 @@
  *
  * Enumerate TDX directly from the early users.
  */
-static bool early_is_tdx_guest(void)
+bool early_is_tdx_guest(void)
 {
 	static bool once;
 	static bool is_tdx;
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index 757dbe734a47..bfbb78bd7b8a 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -1,19 +1,109 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/efi.h>
+#include <linux/acpi.h>
 #include <asm/efi.h>
 #include "efistub.h"
 
 struct efi_unaccepted_memory *unaccepted_table;
 
+struct srat_parse_ctx {
+	u64 *mem_start;
+	u64 *mem_end;
+};
+
+typedef void (*srat_region_handler_t)(struct acpi_srat_mem_affinity *mem,
+				      struct srat_parse_ctx *ctx);
+
+/*
+ * parse_acpi_srat_regions - Loop through ACPI SRAT tables to process
+ * hotpluggable memory regions via a custom callback handler.
+ */
+static void parse_acpi_srat_regions(srat_region_handler_t handler, struct srat_parse_ctx *ctx)
+{
+	u32 hotplug_mask = ACPI_SRAT_MEM_ENABLED | ACPI_SRAT_MEM_HOT_PLUGGABLE;
+	struct acpi_table_header *xsdt, *srat = NULL;
+	struct acpi_table_rsdp *rsdp = NULL;
+	u8 *current_ptr, *end_ptr;
+	u64 *table_pointers;
+	u32 entry_count;
+	unsigned long i;
+
+	rsdp = get_efi_config_table(ACPI_20_TABLE_GUID);
+
+	if (!rsdp || !ACPI_VALIDATE_RSDP_SIG(rsdp->signature))
+		return;
+
+	xsdt = (struct acpi_table_header *)(unsigned long)rsdp->xsdt_physical_address;
+	if (!xsdt || !ACPI_COMPARE_NAMESEG(xsdt->signature, ACPI_SIG_XSDT))
+		return;
+
+	if (xsdt->length < sizeof(struct acpi_table_header) + ACPI_XSDT_ENTRY_SIZE)
+		return;
+
+	entry_count = (xsdt->length - sizeof(struct acpi_table_header)) / ACPI_XSDT_ENTRY_SIZE;
+	table_pointers = (u64 *)((u8 *)xsdt + sizeof(struct acpi_table_header));
+
+	for (i = 0; i < entry_count; i++) {
+		struct acpi_table_header *tbl;
+
+		tbl = (struct acpi_table_header *)(unsigned long)table_pointers[i];
+		if (tbl && ACPI_COMPARE_NAMESEG(tbl->signature, ACPI_SIG_SRAT)) {
+			srat = tbl;
+			break;
+		}
+	}
+
+	if (!srat)
+		return;
+
+	current_ptr = (u8 *)srat + sizeof(struct acpi_table_srat);
+	end_ptr = (u8 *)srat + srat->length;
+
+	while (current_ptr < end_ptr) {
+		struct acpi_subtable_header *sub_header;
+		u64 range_end;
+
+		sub_header = (struct acpi_subtable_header *)current_ptr;
+		if (sub_header->length == 0)
+			break;
+
+		if (sub_header->type == ACPI_SRAT_TYPE_MEMORY_AFFINITY &&
+		    sub_header->length >= sizeof(struct acpi_srat_mem_affinity)) {
+			struct acpi_srat_mem_affinity *mem;
+
+			mem = (struct acpi_srat_mem_affinity *)current_ptr;
+			if ((mem->flags & hotplug_mask) == hotplug_mask &&
+			    !check_add_overflow(mem->base_address, mem->length, &range_end))
+				handler(mem, ctx);
+		}
+		current_ptr += sub_header->length;
+	}
+}
+
+static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct srat_parse_ctx *ctx)
+{
+	u64 range_end = mem->base_address + mem->length;
+
+	if (mem->base_address < *(ctx->mem_start))
+		*(ctx->mem_start) = mem->base_address;
+
+	if (range_end > *(ctx->mem_end))
+		*(ctx->mem_end) = range_end;
+}
+
 efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
 	efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
 	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+	struct srat_parse_ctx ctx;
 	efi_status_t status;
 	int i;
 
+	if (!early_is_tdx_guest() && !early_is_sevsnp_guest())
+		return EFI_SUCCESS;
+
 	/* Check if the table is already installed */
 	unaccepted_table = get_efi_config_table(unaccepted_table_guid);
 	if (unaccepted_table) {
@@ -38,6 +128,10 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 				     d->phys_addr + d->num_pages * PAGE_SIZE);
 	}
 
+	ctx.mem_start = &unaccepted_start;
+	ctx.mem_end = &unaccepted_end;
+	parse_acpi_srat_regions(update_mem_boundaries, &ctx);
+
 	if (unaccepted_start == ULLONG_MAX)
 		return EFI_SUCCESS;
 
-- 
2.52.0


^ permalink raw reply related

* [RFCv2 PATCH 0/6] Support memory hotplug/unplug for TDX CoCo guests
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng

This RFCv2 series implements comprehensive support for virtio-mem and ACPI
DIMM memory hotplug/unplug in Intel TDX confidential computing guests.
It explores the start-private memory approach utilizing the native
TDG.MEM.PAGE.RELEASE API.

We are seeking feedback from Kiryl on the CoCo guest implementation, MM
experts on DIMM & virio-mem memory hotplug integration and broader
virtio/CoCo community input on the overall approach. We are not seeking
x86 maintainer review at this stage.

== Changes from RFC v1 ==

- Eliminated callback infrastructure: Dropped plug callback and replaced
  unplug callback with platform-level unaccept function into core MM
  hotplug and virtio-mem subsystems.
- Added comprehensive bitmap tracking: Introduced a "plugged" bitmap
  alongside the unaccepted bitmap to track populated hotplug memory
  states to support load_unaligned_zeropad().
- Enhanced SRAT parsing: Extended the EFI stub to parse ACPI SRAT tables
  early, ensuring hotpluggable ranges are tracked from initial boot.

For more introduction about the background or other efforts in community,
please check the RFCv1 cover letter [1].

== Technical Approach ==

- Early SRAT Integration: A lightweight EFI stub parser scans ACPI SRAT
  tables to identify hotpluggable ranges and adjust bitmap boundaries
  early, avoiding the overhead of the full ACPI subsystem.
- Comprehensive Bitmap Tracking: Introduces a "plugged" bitmap right
  after the unaccepted bitmap. Both static and hotplugged memory are
  tracked, allowing the guest to map which ranges are populated by the
  VMM. This prevents acceptance beyond plugged memory boundaries due to
  load_unaligned_zeropad() operations.
- Platform Extensibility: Exposes generic CoCo memory interfaces. Other
  confidential platforms (like AMD SEV-SNP) can easily adopt this by
  hooking their specific mechanisms into arch_unaccept_memory().
- Hotplug & Guest Control: Integrates platform-level unaccept logic
  into ACPI hotplug and virtio-mem handlers. Uses TDG.MEM.PAGE.RELEASE
  for TDX to explicitly set memory to the "unaccepted" state during
  unplug, removing host hole-punching dependencies.
- Kexec Handover: Leverages existing EFI mechanisms to seamlessly hand
  over both the extended unaccepted bitmap and the new plugged bitmap
  across kexec boundaries.

== Testing ==

- dimm and virtio-mem memory hotplug/unplug
- lazy and eager accept
- kexec/kdump with hotplugged memory

This is tested with Marc-André Lureau's newest qemu series [2]

Comments appreciated, thanks.

Zhenzhong

[1] https://lore.kernel.org/all/20260604093551.1511079-1-zhenzhong.duan@intel.com/
[2] https://lore.kernel.org/all/20260604-rdm5-v5-0-5768e6a0943d@redhat.com/

Zhenzhong Duan (6):
  efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
  efi/unaccepted: Set unaccepted bits for all hotplug memory
  efi/unaccepted: Create plugged bitmap to support hotplug memory in
    coco guest
  x86/tdx: Implement arch_unaccept_memory()
  mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
  virtio-mem: Support memory hotplug/unplug for coco guest

 arch/x86/include/asm/shared/tdx.h             |   2 +
 arch/x86/include/asm/tdx.h                    |   2 +
 arch/x86/include/asm/unaccepted_memory.h      |  11 ++
 drivers/firmware/efi/libstub/efistub.h        |   6 +
 include/linux/efi.h                           |   5 +
 include/linux/mm.h                            |  11 ++
 arch/x86/boot/compressed/mem.c                |   4 +-
 arch/x86/coco/tdx/tdx.c                       | 120 ++++++++++++++++
 drivers/firmware/efi/efi.c                    |   4 +-
 .../firmware/efi/libstub/unaccepted_memory.c  | 128 +++++++++++++++++-
 drivers/firmware/efi/unaccepted_memory.c      | 122 ++++++++++++++++-
 drivers/virtio/virtio_mem.c                   |   8 ++
 mm/memory_hotplug.c                           |  16 +++
 13 files changed, 425 insertions(+), 14 deletions(-)

-- 
2.52.0


^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Binbin Wu @ 2026-06-23  9:48 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>  	next = start;
>  	while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
>  
> -		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +		for (i = 0; i < folio_batch_count(&fbatch);) {
>  			struct folio *folio = fbatch.folios[i];
>  
> -			if (folio_ref_count(folio) !=
> -			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> -				safe = false;
> +			safe = (folio_ref_count(folio) ==
> +				folio_nr_pages(folio) +
> +				filemap_get_folios_refcount);
> +
> +			if (safe) {
> +				++i;
> +			} else if (folio_may_be_lru_cached(folio) &&
> +				   !lru_drained) {
> +				lru_add_drain_all();

It seems unprivileged userspace is able to trigger lru_add_drain_all() repeatedly
by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop, which could lead to DoS risk?

> +				lru_drained = true;
> +			} else {
>  				*err_index = max(start, folio->index);
>  				break;
>  			}
> 


^ permalink raw reply

* Re: [PATCH v8 21/46] KVM: guest_memfd: Zero page while getting pfn
From: Yan Zhao @ 2026-06-23  8:56 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com>

On Thu, Jun 18, 2026 at 05:31:58PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Move the folio initialization logic from kvm_gmem_get_pfn() into
> __kvm_gmem_get_pfn() to also zero pages if the page is to be used in
> kvm_gmem_populate().
> 
> With in-place conversion, the existing data in a guest_memfd page can be
> populated into guest memory through platform-specific ioctls.
> 
> Without first zeroing the page obtained using __kvm_gmem_get_pfn(), it
> might contain uninitialized host memory, which would leak to the guest if
> the populate completes.
> 
> guest_memfd pages are zeroed at most once in the page's entire lifetime
> with guest_memfd, and that is tracked using the uptodate flag.
> 
> Zeroing the page in __kvm_gmem_get_pfn() is chosen over zeroing in
> kvm_gmem_get_folio() since other flows, such as a future write() syscall,
> can get a page, write to the page and then set page uptodate without
> zeroing.
> 
> This aligns with the concept of zeroing before first use - the other place
> where zeroing happens is in kvm_gmem_fault_user_mapping().
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 90bc1a26512b6..86c9f5b0863cb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1137,6 +1137,11 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
>  		return ERR_PTR(-EHWPOISON);
>  	}
>  
> +	if (!folio_test_uptodate(folio)) {
> +		clear_highpage(folio_page(folio, 0));
> +		folio_mark_uptodate(folio);
> +	}
Note:
In the __kvm_gmem_populate() path, this folio_mark_uptodate() call makes the
later one after post_populate() pointless.

__kvm_gmem_populate
    |1.__kvm_gmem_get_pfn
    |     |->folio = kvm_gmem_get_folio()
    |     |  if (!folio_test_uptodate(folio))
    |     |     folio_mark_uptodate(folio);
    |2. ret = post_populate()
    |3. if (!ret)
    |       folio_mark_uptodate(folio);

>  	*pfn = folio_file_pfn(folio, index);
>  	if (max_order)
>  		*max_order = 0;
> @@ -1166,11 +1171,6 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		goto out;
>  	}
>  
> -	if (!folio_test_uptodate(folio)) {
> -		clear_highpage(folio_page(folio, 0));
> -		folio_mark_uptodate(folio);
> -	}
> -
>  	if (kvm_gmem_is_private_mem(inode, index))
>  		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>  
>


^ permalink raw reply

* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-23  8:41 UTC (permalink / raw)
  To: Sean Christopherson, ackerleytng, aik, andrew.jones, binbin.wu,
	brauner, chao.p.peng, david, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <ajoWngKaZ+wfIyR+@yzhao56-desk.sh.intel.com>

On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> > On Mon, Jun 22, 2026, Yan Zhao wrote:
> > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> > > > From: Ackerley Tng <ackerleytng@google.com>
> > > > 
> > > > Update tdx_gmem_post_populate() to handle cases where a source page is
> > > > not explicitly provided. Instead of returning -EOPNOTSUPP when src_page
> > > > is NULL, default to using the page associated with the destination PFN.
> > > > 
> > > > This change allows for in-place memory conversion where the data is
> > > > already present in the target PFN, ensuring the TDX module has a valid
> > > > source page reference for the TDH.MEM.PAGE.ADD operation.
> > > > 
> > > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  Documentation/virt/kvm/x86/intel-tdx.rst |  4 ++++
> > > >  arch/x86/kvm/vmx/tdx.c                   | 11 ++++++++---
> > > >  2 files changed, 12 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > index 6a222e9d09541..74357fe87f9ec 100644
> > > > --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > @@ -158,6 +158,10 @@ KVM_TDX_INIT_MEM_REGION
> > > >  Initialize @nr_pages TDX guest private memory starting from @gpa with userspace
> > > >  provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned.
> > > >  
> > > > +If guest_memfd in-place conversion is enabled, pass NULL for @source_addr to
> > > > +initialize the memory region using memory contents already populated in
> > > > +guest_memfd memory.
> > > > +
> > > >  Note, before calling this sub command, memory attribute of the range
> > > >  [gpa, gpa + nr_pages] needs to be private.  Userspace can use
> > > >  KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index ffe9d0db58c59..56d10333c61a7 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> > > >  		return -EIO;
> > > >  
> > > > -	if (!src_page)
> > > > -		return -EOPNOTSUPP;
> > > > +	if (!src_page) {
> > > > +		if (!gmem_in_place_conversion)
> > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> > > without the MMAP flag, the absence of src_page should still be treated as an
> > > error.
> > 
> > Why MMAP?
> Hmm, I was showing a scenario that in-place conversion couldn't occur.
> I didn't mean that with the MMAP flag, mmap() and user write must occur.
> 
> > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
> > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> > and written memory.  And when write() lands, MMAP wouldn't be necessary to
> > initialize the memory.
> Do you mean using up-to-date flag as below?
> 
> if (!src_page) {
> 	src_page = pfn_to_page(pfn);
> 	if (!folio_test_uptodate(page_folio(src_page)))
> 		return -EOPNOTSUPP;
> }

Another concern with this fix is that:
commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
folio uptodate before reaching post_populate().

[1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/

> One concern is that TDX now does not much care about the up-to-date flag since
> TDX doesn't rely on the flag to clear pages on conversions.
> I'm not sure if the flag can be reliably checked in this case. e.g.,
> now the whole folio is marked up-to-date even if only part of it is faulted by
> user access.
> Ensuring that the up-to-date flag works correctly with huge page support seems
> to have more effort than introducing a dedicated flag for TDX.
> 
> > > Additionally, to properly enable in-place copying for the TDX initial memory
> > > region, userspace must not only specify source_addr to NULL, but also follow
> > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> > > 1. create guest_memfd with MMAP flag
> > > 2. mmap the guest_memfd.
> > > 3. convert the initial memory range to shared.
> > > 4. copy initial content to the source page.
> > > 5. convert the initial memory range to private
> > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> > > 7. do not unmap the source backend.
> > > 
> > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> > > to explicitly opt into the in-place copy functionality? e.g.,
> > 
> > Why?  It's userspace's responsibility to get the above right.  If userspace fails
> > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> I mean if userspace specifies a NULL source_addr by mistake, it's better for
> kernel to detect this mistake, similar to how it validates whether source_addr
> is PAGE_ALIGNED.
> Since userspace already needs to perform additional steps to enable in-place
> copy, specifying a dedicated flag to indicate that the NULL source_addr is
> intentional seems like a reasonable burden.

^ permalink raw reply

* [PATCH 3/4] KVM: guest_memfd: Add `write` parameter to kvm_gmem_populate()
From: Jörg Rödel @ 2026-06-23  9:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky, Ashish Kalra,
	Michael Roth, kvm, linux-kernel, linux-coco, Joerg Roedel
In-Reply-To: <20260623091556.1500930-1-joro@8bytes.org>

From: Joerg Roedel <joerg.roedel@amd.com>

The call-path of kvm_gmem_populate() might subsequently write to the
page provided by user-space. This is used to provide detailed error
information in case the page population failed.

But since kvm_gmem_populate() only acquires a read-only reference to
the user-space page via get_user_pages_fast(), the error information
might be written to a read-only page later on.

Add a parameter to kvm_gmem_populate() to optionally acquire a
writeable reference to the source page to make sure page permissions
can be enforced.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kvm/svm/sev.c   | 2 +-
 arch/x86/kvm/vmx/tdx.c   | 2 +-
 include/linux/kvm_host.h | 4 +++-
 virt/kvm/guest_memfd.c   | 4 ++--
 4 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f09d15f68964..dab8109edf26 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2475,7 +2475,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	sev_populate_args.sev_fd = argp->sev_fd;
 	sev_populate_args.type = params.type;
 
-	count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
+	count = kvm_gmem_populate(kvm, params.gfn_start, src, npages, 0,
 				  sev_gmem_post_populate, &sev_populate_args);
 	if (count < 0) {
 		argp->error = sev_populate_args.fw_error;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 04ce321ebdf3..46b1d84fddf2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3185,7 +3185,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 		};
 		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
 					     u64_to_user_ptr(region.source_addr),
-					     1, tdx_gmem_post_populate, &arg);
+					     1, 0, tdx_gmem_post_populate, &arg);
 		if (gmem_ret < 0) {
 			ret = gmem_ret;
 			break;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4c14aee1fb06..622c0b04d8c3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2581,6 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
  *       (passed to @post_populate, and incremented on each iteration
  *       if not NULL). Must be page-aligned.
  * @npages: number of pages to copy from userspace-buffer
+ * @write: user-space provided buffer must be writable. The function
+ *	 will acquire a writable reference when set to 1.
  * @post_populate: callback to issue for each gmem page that backs the GPA
  *                 range
  * @opaque: opaque data to pass to @post_populate callback
@@ -2597,7 +2599,7 @@ typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 				    struct page *page, void *opaque);
 
 long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
-		       kvm_gmem_populate_cb post_populate, void *opaque);
+		       int write, kvm_gmem_populate_cb post_populate, void *opaque);
 #endif
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 69c9d6d546b2..7a245a402a1b 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -859,7 +859,7 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
 }
 
 long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
-		       kvm_gmem_populate_cb post_populate, void *opaque)
+		       int write, kvm_gmem_populate_cb post_populate, void *opaque)
 {
 	struct kvm_memory_slot *slot;
 	int ret = 0;
@@ -893,7 +893,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		if (src) {
 			unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
 
-			ret = get_user_pages_fast(uaddr, 1, 0, &src_page);
+			ret = get_user_pages_fast(uaddr, 1, write, &src_page);
 			if (ret < 0)
 				break;
 			if (ret != 1) {
-- 
2.53.0


^ permalink raw reply related

* [PATCH 4/4] kvm: sev: Acquire a writeable page reference for CPUID pages
From: Jörg Rödel @ 2026-06-23  9:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky, Ashish Kalra,
	Michael Roth, kvm, linux-kernel, linux-coco, Joerg Roedel
In-Reply-To: <20260623091556.1500930-1-joro@8bytes.org>

From: Joerg Roedel <joerg.roedel@amd.com>

When the PSP checks on a user-provided CPUID page fail KVM will write
back the detailed error information to the user-provided buffer.

Make sure this buffer is actually writable to not write the errors to
a read-only page.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kvm/svm/sev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index dab8109edf26..5fd08d34be3f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2415,6 +2415,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	struct kvm_memory_slot *memslot;
 	long npages, count;
 	void __user *src;
+	int write;
 
 	if (!sev_snp_guest(kvm) || !sev->snp_context)
 		return -EINVAL;
@@ -2475,7 +2476,10 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	sev_populate_args.sev_fd = argp->sev_fd;
 	sev_populate_args.type = params.type;
 
-	count = kvm_gmem_populate(kvm, params.gfn_start, src, npages, 0,
+	/* Acquire a write-reference for CPUID pages as kernel might write to it */
+	write = params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID;
+
+	count = kvm_gmem_populate(kvm, params.gfn_start, src, npages, write,
 				  sev_gmem_post_populate, &sev_populate_args);
 	if (count < 0) {
 		argp->error = sev_populate_args.fw_error;
-- 
2.53.0


^ permalink raw reply related

* [PATCH 1/4] kvm: sev: Fix user-space triggerable WARN_ON on snp_launch_update path
From: Jörg Rödel @ 2026-06-23  9:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky, Ashish Kalra,
	Michael Roth, kvm, linux-kernel, linux-coco, Joerg Roedel
In-Reply-To: <20260623091556.1500930-1-joro@8bytes.org>

From: Joerg Roedel <joerg.roedel@amd.com>

Sashiko reported on an unrelated patch:

  [Severity: High]
  This is a pre-existing issue, but can a host userspace process trigger a
  kernel warning by passing a NULL user address (uaddr = 0) here?

  If params.uaddr is 0, src becomes NULL and passes the PAGE_ALIGNED(src)
  check. kvm_gmem_populate() skips fetching the user page and passes
  src_page = NULL to sev_gmem_post_populate().

  That function then unconditionally evaluates:

  WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
               !src_page)

  Since the type isn't ZERO, won't this allow an unprivileged user to spam
  the kernel log?

The assessment is correct, so check for this condition earlier in the
snp_launch_update() path to avoid the WARN_ON_ONCE.

Fixes: dee5a47cc7a45 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kvm/svm/sev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6c6a6d663e29..41dcba5180ca 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2438,6 +2438,13 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	if (!PAGE_ALIGNED(src))
 		return -EINVAL;
 
+	/*
+	 * Make sure user-mode did not pass NULL as src with
+	 * type != KVM_SEV_SNP_PAGE_TYPE_ZERO.
+	 */
+	if (src == NULL && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
+		return -EINVAL;
+
 	npages = params.len / PAGE_SIZE;
 
 	/*
-- 
2.53.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox