* Re: [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Pratik R. Sampat @ 2026-06-24 14:23 UTC (permalink / raw)
To: Kiryl Shutsemau, Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, pbonzini, mst, peterx,
chenyi.qiang, elena.reshetova, michael.roth, ackerleytng,
linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <ajvLaBs62bDoxC3W@thinkstation>
On 6/24/26 8:25 AM, Kiryl Shutsemau wrote:
> On Tue, Jun 23, 2026 at 06:17:32AM -0400, Zhenzhong Duan wrote:
>> Currently, allocate_unaccepted_bitmap() only scans the initial EFI
>> boot memory map. This misses hotpluggable ranges described in the
>> ACPI SRAT. Without early tracking, hotplug pages are accessed without
>> acceptance and this triggers guest crash.
>>
>> Introduce a lightweight ACPI SRAT parser to scan these regions early.
>> If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
>> flags, expand the tracking boundaries. This avoids pulling in the full
>> ACPI subsystem while ensuring the bitmap covers both static memory and
>> hotplug memory.
>
> Ugh.. Parsing SRAT there is ugly. I would rather avoid it.
>
I agree. Parsing it here means SRAT gets parsed twice, which doesn't make much
sense.
> Do I understand correctly that we don't have a way to represent pluggable,
> but not present memory in EFI memory map?
>
> IIUC, EFI_MEMORY_HOT_PLUGGABLE is actually present, but unpluggable
> memory.
>
Right. And repurposing EFI_MEMORY_HOT_PLUGGABLE (plus updating the spec) would
likely make this messier: by its current definition it describes cold-plugged
pages that may be removed, not pages that may be hot-added later.
> Maybe it would be better to just allocate the bitmap up to maxmem?
>
> And fix EFI spec to add pluggable-but-not-present attribute.
>
I am currently working with the UEFI community around two proposals for a spec
change:
1. Add a new attribute, as Kiryl suggested, or
2. Add a generic new hotplug memory type that represents all the memory that
could be added later.
In either case, we could then precisely allocate the bitmap by parsing the
region with the attribute/type.
I prefer (1), but I have RFC proposals, code-first edk2 changes, and the Linux
plumbing ready for both approaches, and plan to post them in the following week
after ironing out a few kinks.
Thanks,
--Pratik
^ permalink raw reply
* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Ackerley Tng @ 2026-06-24 14:18 UTC (permalink / raw)
To: Suzuki K Poulose, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <3ec15992-2a29-434b-8c99-8b86bfcf007e@arm.com>
Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>
> [...snip...]
>
>>>> @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>>>> bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>>>> struct kvm_gfn_range *range);
>>>>
>>>> -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>> +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>
>> Should have read the Sashiko review first, but where is this used?
>> It's not used at all in this series...
>
> See below:
>
>>
>> /fuad
>>
>>>> {
>>>> return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>>>> }
>>>> @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE,
>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE);
>>>> }
>>>> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>>>> +
>>>> +#ifdef kvm_arch_has_private_mem
>>>> +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
>>>> +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
>>>> +
>>>> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>> +{
>>>> + return static_call(__kvm_mem_is_private)(kvm, gfn);
>>>> +}
>>>> #else
>>>> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>> {
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 6669f1477013c..8b238e461b854 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>>>> }
>>>> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>>>>
>>>> +#ifdef kvm_arch_has_private_mem
>>>> +DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
>>>> +EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
>>>> +
>>>> +static void kvm_init_memory_attributes(void)
>>>> +{
>>>> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>>>> + static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
>>>> +#endif
>>>> +}
>
>
> Here ^^ as the static call update ?
>
>
> Suzuki
Thanks Suzuki, it is used here. kvm_mem_is_private() was and still is
the function used to check if some gfn is private or shared. Hence, in
this patch, the usages of kvm_mem_is_private() were not
updated. Instead, kvm_mem_is_private() is now set up as a static call,
and the static call is hard-wired to kvm_vm_mem_is_private() in this
patch.
In the later wiring patch, all the places where attributes are looked up
are updated all at once: if conversion is enabled, take the gmem route, else
take VM route.
kvm_mem_is_private() is special in that the if-else is done at KVM load
time rather than runtime, and I believe that's for performance reasons
since this is checked quite often from the KVM fault handling code.
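For illustration only, the load-time selection described above can be
pictured with a plain function pointer. This is a userspace sketch, not the
KVM code: the kernel's static_call() patches the call site instead of loading
a pointer, and every name below (struct vm, vm_mem_is_private(),
init_memory_attributes()) is a hypothetical stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm: one "private" flag per gfn. */
struct vm {
	const bool *private_map;
	size_t nr_gfns;
};

typedef bool (mem_is_private_t)(const struct vm *vm, uint64_t gfn);

/* Default target, analogous to DEFINE_STATIC_CALL_RET0(): always false. */
static bool mem_is_private_ret0(const struct vm *vm, uint64_t gfn)
{
	(void)vm;
	(void)gfn;
	return false;
}

/* The "VM route": look the attribute up in per-VM tracking. */
static bool vm_mem_is_private(const struct vm *vm, uint64_t gfn)
{
	assert(vm != NULL);
	return gfn < vm->nr_gfns && vm->private_map[gfn];
}

/* Chosen once at init; callers never branch on the configuration. */
static mem_is_private_t *mem_is_private_impl = mem_is_private_ret0;

static void init_memory_attributes(bool vm_attributes_enabled)
{
	if (vm_attributes_enabled)
		mem_is_private_impl = vm_mem_is_private;
}

/* The name every caller keeps using, whichever route was selected. */
static bool mem_is_private(const struct vm *vm, uint64_t gfn)
{
	return mem_is_private_impl(vm, gfn);
}
```

Callers only ever see mem_is_private(); before init_memory_attributes(true)
runs it reports false for every gfn, afterwards it consults the per-VM map.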
But I think perhaps Fuad was referring to kvm_mem_range_is_private(),
which is indeed not used anywhere. Binbin also asked about this; I think
we should drop kvm_mem_range_is_private(). My reply to Binbin is at [1].
[1] https://lore.kernel.org/all/CAEvNRgGbBcrX5Fw3vNTsTOBNC=Ypi=9-S07674yPxLU9i4akjA@mail.gmail.com/
^ permalink raw reply
* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Ackerley Tng @ 2026-06-24 13:44 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <96fb369d-dbff-4ed6-b1f9-0ce63d7d4ed0@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
>
> [...snip...]
>
>> +static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
>> + gfn_t end)
>> +{
>> + return kvm_range_has_vm_memory_attributes(kvm, start, end,
>> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
>> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
>> }
>
> This function is added, but never used in this patch series.
> Is it intended to be called only when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is
> enabled?
>
Thank you for catching this! I think in some earlier revision this was
meant to be used from the guest_memfd populate flow.
I think the version of kvm_gmem_range_is_private in this revision is
good because it is symmetric. If conversion is enabled, call the gmem
range-has-attributes function, and if conversion is disabled, use the VM
range-has-attributes function.
Sean, if no new revision is needed would you be able to drop
kvm_mem_range_is_private() while you're pulling it in?
>>
>> [...snip...]
>>
^ permalink raw reply
* [PATCH] virt: coco: harden TSM MR attribute allocation
From: Yousef Alhouseen @ 2026-06-24 13:00 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-coco, linux-kernel, Yousef Alhouseen
tsm_mr_create_attribute_group() combines the bin_attribute pointer table
and generated MR name strings into one allocation. It open-coded both the
aggregate name length calculation and the final allocation size as plain
additions and multiplication.
The current in-tree caller uses a small static MR table, but this helper is
exported for confidential-computing guest drivers. Reject impossible MR
definitions instead of allowing arithmetic wraparound to under-allocate the
combined attributes buffer.
Use size_add() and array_size() for the name-length accumulation and the
final allocation size.
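For illustration only, the saturating arithmetic can be sketched in userspace
as below. The helper names (sat_add(), sat_mul(), attrs_alloc_size(),
attrs_alloc()) are hypothetical stand-ins for size_add(), array_size() and the
allocation in the patch, and the sketch assumes a compiler that provides the
GCC/Clang __builtin_*_overflow() builtins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel helpers: saturate at SIZE_MAX. */
static size_t sat_add(size_t a, size_t b)
{
	size_t sum;

	return __builtin_add_overflow(a, b, &sum) ? SIZE_MAX : sum;
}

static size_t sat_mul(size_t a, size_t b)
{
	size_t prod;

	return __builtin_mul_overflow(a, b, &prod) ? SIZE_MAX : prod;
}

/* Bytes for (nr + 1) pointers followed by names_len bytes of names,
 * or SIZE_MAX if that does not fit in a size_t. */
static size_t attrs_alloc_size(size_t nr, size_t names_len)
{
	return sat_add(sat_mul(sat_add(nr, 1), sizeof(void *)), names_len);
}

/* Refuse saturated sizes instead of under-allocating. */
static void *attrs_alloc(size_t nr, size_t names_len)
{
	size_t size = attrs_alloc_size(nr, names_len);

	if (size == SIZE_MAX)
		return NULL;

	/* There is always room for at least the NULL terminator slot. */
	assert(size >= sizeof(void *));
	return calloc(1, size);
}
```

Because every step saturates, one SIZE_MAX comparison at the end is enough to
catch a wrap in any intermediate addition or multiplication.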
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
drivers/virt/coco/guest/tsm-mr.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/drivers/virt/coco/guest/tsm-mr.c b/drivers/virt/coco/guest/tsm-mr.c
index 657b9c573..789988111 100644
--- a/drivers/virt/coco/guest/tsm-mr.c
+++ b/drivers/virt/coco/guest/tsm-mr.c
@@ -140,7 +140,11 @@ static ssize_t tm_digest_write(struct file *filp, struct kobject *kobj,
const struct attribute_group *
tsm_mr_create_attribute_group(const struct tsm_measurements *tm)
{
+ const struct bin_attribute **attrs __free(kfree) = NULL;
+ struct tm_context *ctx __free(kfree) = NULL;
+ size_t attrs_size, name_len;
size_t nlen;
+ char *name, *end;
if (!tm || !tm->mrs)
return ERR_PTR(-EINVAL);
@@ -164,8 +168,12 @@ tsm_mr_create_attribute_group(const struct tsm_measurements *tm)
return ERR_PTR(-EINVAL);
/* MR sysfs attribute names have the form of MRNAME:HASH */
- nlen += strlen(tm->mrs[i].mr_name) + 1 +
- strlen(hash_algo_name[tm->mrs[i].mr_hash]) + 1;
+ name_len = size_add(strlen(tm->mrs[i].mr_name),
+ strlen(hash_algo_name[tm->mrs[i].mr_hash]));
+ name_len = size_add(name_len, 2);
+ nlen = size_add(nlen, name_len);
+ if (name_len == SIZE_MAX || nlen == SIZE_MAX)
+ return ERR_PTR(-EINVAL);
}
/*
@@ -173,11 +181,13 @@ tsm_mr_create_attribute_group(const struct tsm_measurements *tm)
* so that we don't have to free MR names one-by-one in
* tsm_mr_free_attribute_group()
*/
- const struct bin_attribute **attrs __free(kfree) =
- kzalloc(sizeof(*attrs) * (tm->nr_mrs + 1) + nlen, GFP_KERNEL);
- struct tm_context *ctx __free(kfree) =
- kzalloc_flex(*ctx, mrs, tm->nr_mrs);
- char *name, *end;
+ attrs_size = size_add(array_size(size_add(tm->nr_mrs, 1),
+ sizeof(*attrs)), nlen);
+ if (attrs_size == SIZE_MAX)
+ return ERR_PTR(-EINVAL);
+
+ attrs = kzalloc(attrs_size, GFP_KERNEL);
+ ctx = kzalloc_flex(*ctx, mrs, tm->nr_mrs);
if (!ctx || !attrs)
return ERR_PTR(-ENOMEM);
--
2.54.0
^ permalink raw reply related
* Re: [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Kiryl Shutsemau @ 2026-06-24 12:33 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
ackerleytng, linux-kernel, linux-coco, virtualization, x86,
yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-6-zhenzhong.duan@intel.com>
On Tue, Jun 23, 2026 at 06:17:36AM -0400, Zhenzhong Duan wrote:
> + spin_lock_irqsave(&unaccepted_memory_lock, flags);
> + for (; range_start < bitmap_size; range_start = range_end) {
> + unsigned long phys_start, phys_end;
> + unsigned long unaccepted_one, plugged_zero;
> +
> + range_start = find_next_andnot_bit(plugged_bitmap, unaccepted->bitmap,
> + bitmap_size, range_start);
> +
> + if (range_start >= bitmap_size)
> + break;
> +
> + unaccepted_one = find_next_bit(unaccepted->bitmap, bitmap_size, range_start);
> + plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size, range_start);
> + range_end = min(unaccepted_one, plugged_zero);
> +
> + phys_start = range_start * unit_size + unaccepted->phys_base;
> + phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> + arch_unaccept_memory(phys_start, phys_end);
> + bitmap_set(unaccepted->bitmap, range_start, range_end - range_start);
> + }
> + spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
Doing the accept TDCALL under the spin lock will kill scalability.
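For illustration only, one way to keep the slow call out of the critical
section is to claim a range under the lock, drop the lock, and only then do
the expensive work. The sketch below is a single-threaded userspace toy: the
lock merely records whether it is held, all names are hypothetical, and it
does not address what a real implementation must do about other CPUs touching
a range that is still in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_UNITS 16

/* Toy lock: only records whether it is held, so the sketch can check scope. */
static bool lock_held;
static void toy_lock(void)   { assert(!lock_held); lock_held = true; }
static void toy_unlock(void) { assert(lock_held);  lock_held = false; }

static bool pending[NR_UNITS];	/* units that still need the slow call */
static bool done[NR_UNITS];	/* units the slow call has processed */
static unsigned int slow_calls;

/* Stand-in for the expensive per-range operation; must run unlocked. */
static void slow_range_op(size_t start, size_t end)
{
	assert(!lock_held);
	for (size_t i = start; i < end; i++)
		done[i] = true;
	slow_calls++;
}

/* Under the lock: claim the next run of pending units; false if none left. */
static bool claim_next_range(size_t *start, size_t *end)
{
	size_t i = 0;
	bool found = false;

	toy_lock();
	while (i < NR_UNITS && !pending[i])
		i++;
	if (i < NR_UNITS) {
		*start = i;
		while (i < NR_UNITS && pending[i])
			pending[i++] = false;	/* claimed by this caller */
		*end = i;
		found = true;
	}
	toy_unlock();
	return found;
}

/* Bitmap bookkeeping is locked; the slow call for each range is not. */
static void process_all(void)
{
	size_t start, end;

	while (claim_next_range(&start, &end))
		slow_range_op(start, end);
}
```

The lock hold time then depends only on the bitmap scan, not on how long the
per-range call takes.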
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Kiryl Shutsemau @ 2026-06-24 12:29 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
ackerleytng, linux-kernel, linux-coco, virtualization, x86,
yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-3-zhenzhong.duan@intel.com>
On Tue, Jun 23, 2026 at 06:17:33AM -0400, Zhenzhong Duan wrote:
> In coco guests, hotpluggable memory ranges are initially unaccepted.
> While a previous change expanded the unaccepted memory bitmap boundaries
> to include these hotplug spaces, the actual bits inside the bitmap are
> not yet marked as unaccepted.
>
> Walks SRAT a second time after the bitmap is allocated and sets the bits
> corresponding to hotpluggable ranges.
>
> This ensures the bitmap state accurately reflects all static and hotplug
> memory ranges before booting kernel.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> .../firmware/efi/libstub/unaccepted_memory.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> index bfbb78bd7b8a..01bed8e751ca 100644
> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
> *(ctx->mem_end) = range_end;
> }
>
> +static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
> + struct srat_parse_ctx *ctx)
> +{
> + u64 unit_size = unaccepted_table->unit_size;
> + u64 start, end;
> +
> + start = round_up(mem->base_address, unit_size);
> + end = round_down(mem->base_address + mem->length, unit_size);
We can get here with start > end if the SRAT range is smaller than unit_size.
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted_table->phys_base;
> + end -= unaccepted_table->phys_base;
> +
> + bitmap_set(unaccepted_table->bitmap,
> + start / unit_size, (end - start) / unit_size);
> +}
> +
> efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> struct efi_boot_memmap *map)
> {
> @@ -169,6 +186,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> unaccepted_table->phys_base = unaccepted_start;
> unaccepted_table->size = bitmap_size;
> memset(unaccepted_table->bitmap, 0, bitmap_size);
> + parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
>
> status = efi_bs_call(install_configuration_table,
> &unaccepted_table_guid, unaccepted_table);
> --
> 2.52.0
>
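For illustration only, the start > end case noted above on
mark_hotplug_memory_unaccepted() is easy to reproduce with numbers. Assuming a
2 MiB unit_size, a range at base 1 MiB with length 512 KiB gives
start = round_up(1 MiB) = 2 MiB and end = round_down(1.5 MiB) = 0, so
(end - start) / unit_size wraps to a huge unsigned count. A guarded version
could look like the userspace sketch below; whole_units() and the rnd_*()
helpers are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two rounding, in the spirit of round_up()/round_down(). */
static uint64_t rnd_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

static uint64_t rnd_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/* Number of whole units inside [base, base + length); 0 if none fit.
 * *first_unit_addr is written only when the result is non-zero. */
static uint64_t whole_units(uint64_t base, uint64_t length, uint64_t unit_size,
			    uint64_t *first_unit_addr)
{
	uint64_t start, end;

	/* unit_size must be a non-zero power of two for the masks to work. */
	assert(unit_size && (unit_size & (unit_size - 1)) == 0);

	start = rnd_up(base, unit_size);
	end = rnd_down(base + length, unit_size);
	if (start >= end)
		return 0;	/* range smaller than one aligned unit */

	*first_unit_addr = start;
	return (end - start) / unit_size;
}
```

With the early return, a sub-unit range contributes no bits instead of an
out-of-range bitmap_set().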
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Kiryl Shutsemau @ 2026-06-24 12:25 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
ackerleytng, linux-kernel, linux-coco, virtualization, x86,
yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-2-zhenzhong.duan@intel.com>
On Tue, Jun 23, 2026 at 06:17:32AM -0400, Zhenzhong Duan wrote:
> Currently, allocate_unaccepted_bitmap() only scans the initial EFI
> boot memory map. This misses hotpluggable ranges described in the
> ACPI SRAT. Without early tracking, hotplug pages are accessed without
> acceptance and this triggers guest crash.
>
> Introduce a lightweight ACPI SRAT parser to scan these regions early.
> If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
> flags, expand the tracking boundaries. This avoids pulling in the full
> ACPI subsystem while ensuring the bitmap covers both static memory and
> hotplug memory.
Ugh.. Parsing SRAT there is ugly. I would rather avoid it.
Do I understand correctly that we don't have a way to represent pluggable,
but not present memory in EFI memory map?
IIUC, EFI_MEMORY_HOT_PLUGGABLE is actually present, but unpluggable
memory.
Maybe it would be better to just allocate the bitmap up to maxmem?
And fix EFI spec to add pluggable-but-not-present attribute.
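For illustration only, the cost of sizing the bitmap up to maxmem is small at
one bit per unit. The sketch below assumes a 2 MiB unit in its example
numbers; bitmap_bytes() is a hypothetical helper, not an existing function.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of bitmap needed to cover [phys_base, phys_end), one bit per unit. */
static uint64_t bitmap_bytes(uint64_t phys_base, uint64_t phys_end,
			     uint64_t unit_size)
{
	uint64_t units;

	assert(unit_size && phys_end >= phys_base);

	/* Round up so a partial trailing unit still gets a bit. */
	units = (phys_end - phys_base + unit_size - 1) / unit_size;
	return (units + 7) / 8;
}
```

For example, covering 1 TiB from address 0 with 2 MiB units needs 2^19 bits,
i.e. 64 KiB of bitmap.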
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v2 02/17] x86/virt/tdx: Configure add-on features on TDX module init and update
From: Xu Yilun @ 2026-06-24 12:00 UTC (permalink / raw)
To: Dave Hansen
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
tony.lindgren, peter.fang, baolu.lu, zhenzhong.duan, dave.hansen,
seanjc
In-Reply-To: <4f4b0f29-424b-45ed-8cfd-c77da2ea390f@intel.com>
> There's also zero stopping us from putting version in args:
>
> struct tdx_module_args args = {};
> int ret;
>
> if (tdx_addon_feature0) {
> args.r9 = tdx_addon_feature0;
> args.version = 1;
> }
>
> ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
>
> Eh?
>
> That gives args.version==0 in all the normal cases which just happens to
> be the exact behavior we want. It also avoids having to plumb version
> through all the seamcall*() wrappers.
Ah, on 2nd reading, I'm pretty sure I now understand your logical argument in
patch 1 and 2. It looks good to me. I've appended my diff at the end.
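For illustration only, the RAX encoding that the assembly in the diff performs
can be written out in C. seamcall_rax() and LEAF_VERSION_SHIFT are
hypothetical names used for this sketch; the leaf numbers in the example are
the ones from the tdx.h hunk.

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_VERSION_SHIFT 16	/* version lives in RAX[23:16] */

/* Combine a plain leaf number with the version byte from the args struct. */
static uint64_t seamcall_rax(uint64_t leaf, uint8_t version)
{
	/* The leaf constant itself must not already carry a version. */
	assert((leaf & (0xffULL << LEAF_VERSION_SHIFT)) == 0);

	return leaf | ((uint64_t)version << LEAF_VERSION_SHIFT);
}
```

With version 0 the leaf passes through unchanged (45 stays 45); with version 1
leaf 45 becomes 0x1002d and leaf 22 becomes 0x10016.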
>
> But this is *exactly* the kind of thing that shouldn't be a part of an
> attestation patch series. This could very much have been a separate
> discussion and happened a month or a year ago. But now it is blocking
> this DICE thing from getting done <grumble>.
Sorry, I should have been more active in searching for a solution
rather than sticking to "kernel never keeps versions" once I had found
the problem that public modules are not available.
----8<----
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index f20e91d7ac35..972880910a5e 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -143,6 +143,8 @@ struct tdx_module_args {
u64 rbx;
u64 rdi;
u64 rsi;
+ /* for RAX encoding */
+ u8 version;
};
/* Used to communicate with the TDX module */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 081816888f7a..b3c00ff4d819 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -95,6 +95,7 @@ static void __used common(void)
OFFSET(TDX_MODULE_rbx, tdx_module_args, rbx);
OFFSET(TDX_MODULE_rdi, tdx_module_args, rdi);
OFFSET(TDX_MODULE_rsi, tdx_module_args, rsi);
+ OFFSET(TDX_MODULE_version, tdx_module_args, version);
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 016a2a1ec1d6..d1d3d40c5614 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -48,6 +48,14 @@
/* Move Leaf ID to RAX */
mov %rdi, %rax
+ /*
+ * Extract the version from 'struct tdx_module_args', append it to
+ * RAX[23:16]
+ */
+ movzbl TDX_MODULE_version(%rsi), %ecx
+ shll $16, %ecx
+ orq %rcx, %rax
+
/* Move other input regs from 'struct tdx_module_args' */
movq TDX_MODULE_rcx(%rsi), %rcx
movq TDX_MODULE_rdx(%rsi), %rdx
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a6f8fd0a3df0..bc3aa1f78fc8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1036,7 +1036,6 @@ static __init void set_tdx_addon_features(void)
static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
u64 global_keyid)
{
- u64 seamcall_fn = TDH_SYS_CONFIG_V0;
struct tdx_module_args args = {};
u64 *tdmr_pa_array;
size_t array_sz;
@@ -1059,18 +1058,18 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
+ set_tdx_addon_features();
+
args.rcx = __pa(tdmr_pa_array);
args.rdx = tdmr_list->nr_consumed_tdmrs;
args.r8 = global_keyid;
- set_tdx_addon_features();
-
if (tdx_addon_feature0) {
args.r9 = tdx_addon_feature0;
- seamcall_fn = TDH_SYS_CONFIG;
+ args.version = 1;
}
- ret = seamcall_prerr(seamcall_fn, &args);
+ ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
/* Free the array as it is not required anymore. */
kfree(tdmr_pa_array);
@@ -1761,16 +1760,15 @@ int tdx_module_shutdown(void)
int tdx_module_run_update(void)
{
- u64 seamcall_fn = TDH_SYS_UPDATE_V0;
struct tdx_module_args args = {};
int ret;
if (tdx_addon_feature0) {
args.r9 = tdx_addon_feature0;
- seamcall_fn = TDH_SYS_UPDATE;
+ args.version = 1;
}
- ret = seamcall_prerr(seamcall_fn, &args);
+ ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
if (ret)
return ret;
@@ -2353,6 +2351,7 @@ u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid)
.rcx = vp->tdvpr_pa,
.rdx = initial_rcx,
.r8 = x2apicid,
+ .version = 1,
};
return seamcall(TDH_VP_INIT, &args);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 32b13b0c85f9..018988c25caa 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -44,7 +44,7 @@
#define TDH_VP_CREATE 10
#define TDH_MNG_KEY_FREEID 20
#define TDH_MNG_INIT 21
-#define TDH_VP_INIT SEAMCALL_LEAF_VER(22, 1)
+#define TDH_VP_INIT 22
#define TDH_PHYMEM_PAGE_RDMD 24
#define TDH_VP_RD 26
#define TDH_PHYMEM_PAGE_RECLAIM 28
@@ -58,11 +58,9 @@
#define TDH_PHYMEM_CACHE_WB 40
#define TDH_PHYMEM_PAGE_WBINVD 41
#define TDH_VP_WR 43
-#define TDH_SYS_CONFIG_V0 45
-#define TDH_SYS_CONFIG SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
+#define TDH_SYS_CONFIG 45
#define TDH_SYS_SHUTDOWN 52
-#define TDH_SYS_UPDATE_V0 53
-#define TDH_SYS_UPDATE SEAMCALL_LEAF_VER(TDH_SYS_UPDATE_V0, 1)
+#define TDH_SYS_UPDATE 53
#define TDH_EXT_INIT 60
#define TDH_EXT_MEM_ADD 61
#define TDH_SYS_DISABLE 69
^ permalink raw reply related
* Re: [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Ackerley Tng @ 2026-06-24 0:14 UTC (permalink / raw)
To: Sean Christopherson, Julian Braha
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajnQVkLvFl_lMuGB@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, Jun 19, 2026, Julian Braha wrote:
>> Hi Ackerley,
>>
>> On 6/19/26 01:31, Ackerley Tng via B4 Relay wrote:
>>
>> > config KVM_VM_MEMORY_ATTRIBUTES
>> > - bool
>> > + depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
>> > + bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
>>
>> Sorry for the style nitpick, but could you keep the type and prompt as
>> the first attribute in the Kconfig option definition (like the other
>> options do)?
>
> No need to be sorry, I've no idea why I put the "depends" first. I don't even
> know if that qualifies as a nit :-)
>
> Ackerley, if you can provide your SoB (for Fuad's feedback), I can fixup when
> applying (assuming nothing else necessitates v9).
Thanks, didn't notice this when consolidating this revision.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
^ permalink raw reply
* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Ackerley Tng @ 2026-06-24 0:13 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <a21bfc05-787e-4cd8-89af-8579357e6a12@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> When memory attributes become trackable in guest_memfd, the concept of
>> having private memory is no longer dependent on
>> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
>>
>> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
>> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
>> in.
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
>
> One nit below.
>
>> ---
>> arch/x86/include/asm/kvm_host.h | 4 +++-
>> include/linux/kvm_host.h | 2 +-
>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>> int tdp_max_root_level, int tdp_huge_page_level);
>>
>>
>> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) || \
>> + defined(CONFIG_KVM_INTEL_TDX) || \
>> + defined(CONFIG_KVM_AMD_SEV)
>
> Nit:
> Vertically align the defined(XXX) statements for better readability?
>
Sean had this aligned with spaces, and checkpatch complained about
having no spaces before tabs, so I switched it to tabs instead since I
don't think alignment like that is officially documented either way.
Either way is fine :)
>> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>> #endif
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 201d0f2143976..d370e834d619e 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>> }
>> #endif
>>
>> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>> +#ifndef kvm_arch_has_private_mem
>> static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>> {
>> return false;
>>
^ permalink raw reply
* Re: [PATCH v8 01/46] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng @ 2026-06-24 0:09 UTC (permalink / raw)
To: Sean Christopherson, Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajnjTJdQKD1Kz3tf@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Mon, Jun 22, 2026, Binbin Wu wrote:
>> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>>
>> [...]
>>
>> >
>> > +static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
>> > +{
>> > + struct maple_tree *mt = &GMEM_I(inode)->attributes;
>> > + void *entry = mtree_load(mt, index);
>> > +
>> > + return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
>>
>> If the entry is unexpectedly missing, returning 0 means the attribute would
>> be treated as shared. And then in kvm_gmem_fault_user_mapping(), it would
>> allow the userspace to fault in the folio.
>>
>> Should gmem deny such edge case?
>
> After several bugs this year where a WARN_ON_ONCE() fired, but was entirely
> insufficient to prevent true badness, I'm definitely sensitive to making the "bad"
> behavior as harmless as possible.
>
I guess both are indeed awkward.
> However, in this case I think we're just hosed. If KVM treats the memory as
> private, KVM will incorrectly do prepare(), incorrectly allow populate(), and
> will cause missed invalidations (though I suppose __kvm_gmem_set_attributes()
> "only" lies to userspace in that case).
>
> That said, assuming SHARED is definitely odd for cases where guest_memfd *can't*
> hold shared memory. Ditto for assuming PRIVATE. What if we instead fall back to
> the "init" state, e.g.?
>
> static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
> {
> struct maple_tree *mt = &GMEM_I(inode)->attributes;
> void *entry = mtree_load(mt, index);
>
> if (WARN_ON_ONCE(!entry)) {
> bool shared = GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED;
>
> return shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
I was wondering if we should not only return the init state but also set
the init state, but that would involve performing a conversion to the
init state... Too complicated for an edge case.
> }
>
> return xa_to_value(entry);
> }
Thanks Binbin and Sean!
^ permalink raw reply
* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-23 19:15 UTC (permalink / raw)
To: Ackerley Tng, Jethro Beekman, tglx, mingo, bp, dave.hansen, x86,
hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgHDNGCETxLsy0v-_cBO1=1U+tXtOXWEFrXLU7pYz7U9ow@mail.gmail.com>
Hello Ackerley,
On 6/23/2026 12:48 PM, Ackerley Tng wrote:
> Jethro Beekman <jethro@fortanix.com> writes:
>
>> On 2026-06-15 21:49, Ashish Kalra wrote:
>>> From: Ashish Kalra <ashish.kalra@amd.com>
>>>
>>> The SEV firmware enumerates the CPUs at SNP initialization and is not
>>> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
>>> hotplug can diverge from the firmware's expectations and break SNP.
>>> Disable CPU hotplug while SNP is active.
>>
>> I think this is too broad. If I have a hypervisor that supports SNP virtualization, a (non-confidential) L1 guest running Linux should still support CPU hotplug while also running confidential L2 guests.
>>
>> --
>> Jethro Beekman | CTO | Fortanix
>>
>
> Were any other solutions considered other than disabling CPU hotplug?
>
> Is this temporary until something else is implemented?
>
> I'm not sure how commonly CPU hotplug is used, and if people are okay
> with trading in CPU hotplug to get SNP.
>
> Is it that fundamentally the SEV firmware can't support hotplug, so
> there's no point in keeping it enabled anyway?
Yes, essentially. The SEV firmware knows nothing about when the OS takes CPUs online or offline. At SNP_INIT it accounts for all
the CPUs enabled via the BIOS/UEFI and establishes the per-core SNP state for them; it has no notion of the OS bringing CPUs up or
down afterwards. So OS hotplug actions can diverge from the firmware's expectations and break SNP. Disabling hotplug just makes
that constraint explicit — there's nothing useful to keep it enabled for: a hot-removed core still "exists" as far as
the firmware and the per-core RMP/RMPOPT state are concerned, and a core brought online later was never set up for SNP.
>
> Is there some way of supporting hotplug for CPUs that won't be used with
> SNP, for serving non-SNP VMs on the same host as SNP VMs, or is that too
> complicated?
>
Not really. SNP's memory-integrity guarantee rests on a single invariant: every memory write is subject to RMP checks to protect
against corruption of SEV-SNP guest memory. The moment any CPU can issue writes that aren't RMP-checked, that protection is
broken for the whole system — it's not something that can be confined to "that one core."
That's because SNP isn't per-core in that sense — it's a system-wide mode. SYSCFG[SNP_EN] is set on every core, the RMP covers all
of physical memory, and once SNP is enabled every memory write is subject to RMP checks on every core. A non-SNP guest sharing the
host still runs on cores that are part of the SNP-enabled system.
By the SNP architecture there simply can't be a CPU that isn't doing RMP checks while SNP is active, so SNP_EN has to be enforced
on every core. RMP enforcement is gated per-core by SYSCFG[SNP_EN] and it must be set on every core before SNP_INIT; a core with
SNP_EN clear performs no RMP checks at all, which the architecture doesn't allow once SNP is up. A newly hotplugged CPU comes up
without SNP_EN (SNP not enabled on it), and since it wasn't present when SNP_INIT ran it isn't part of the initialized SNP
configuration either — so it does no RMP checking. And because an SNP guest's vCPUs (or any guest for that matter) can be scheduled
on any online CPU, the guest could end up running on that core, accessing memory with no RMP enforcement and breaking SNP's memory integrity.
There's no way to prevent that: KVM doesn't fence SNP guests (or any guests for that matter) off particular online cores.
And carving out cores for non-SNP-only use isn't possible by the architecture: SNP requires RMP checks on every CPU, so there's no valid
configuration with SNP active and a subset of cores exempt.
Thanks,
Ashish
>>>
>>> [...snip...]
>>>
^ permalink raw reply
* SVSM Development Call June 24th, 2026
From: Jörg Rödel @ 2026-06-23 19:01 UTC (permalink / raw)
To: coconut-svsm, linux-coco
Hi,
Here is the call for agenda items for this week's SVSM development call. Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.
We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:
https://github.com/coconut-svsm/governance
The link to the COCONUT-SVSM calendar is:
https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week
The meeting will be recorded and the recording eventually published.
Regards,
Jörg
^ permalink raw reply
* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ackerley Tng @ 2026-06-23 17:50 UTC (permalink / raw)
To: Kalra, Ashish, K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86,
hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <8c5f4082-e3a5-4f65-b058-33938a7ee324@amd.com>
"Kalra, Ashish" <ashish.kalra@amd.com> writes:
>
> [...snip...]
>
>
> Yes, a simpler implementation will be like this:
> ...
>
> if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
Perhaps have a WARN_ON_ONCE() here so we know rmpopt was not performed?
Not a huge deal without though.
> return;
>
> cpumask_copy(follower_mask, &rmpopt_cpumask);
>
> /*
> * The current CPU's core always has RMPOPT_BASE programmed
> * (snp_prepare() required all CPUs online at setup and CPU hotplug
> * is disabled while SNP is active), so it can always be the leader.
> * RMPOPT_BASE is per-core; exclude this core from the followers.
> */
> migrate_disable();
> cpumask_andnot(follower_mask, follower_mask,
> topology_sibling_cpumask(smp_processor_id()));
>
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> rmpopt(pa);
> cond_resched();
> }
> migrate_enable();
>
> cpus_read_lock();
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
> cond_resched();
> }
> cpus_read_unlock();
>
> free_cpumask_var(follower_mask);
>
>
Definitely better than the version in the original patch :) Thanks!
> Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:
>
> cpumask_andnot(follower_mask, follower_mask,
> topology_sibling_cpumask(smp_processor_id()));
>
> - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
>
> - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
> follower_mask -> andnot still removes this core's primary.
>
> So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
> clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
> both cases).
>
> Thanks,
> Ashish
>
>>> + goto followers;
>>> + }
>>> +
>>> + migrate_enable();
>>> +
^ permalink raw reply
* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Ackerley Tng @ 2026-06-23 17:48 UTC (permalink / raw)
To: Jethro Beekman, Ashish Kalra, tglx, mingo, bp, dave.hansen, x86,
hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0df3b665-3a9c-4c46-a7aa-14388e8e1577@fortanix.com>
Jethro Beekman <jethro@fortanix.com> writes:
> On 2026-06-15 21:49, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> The SEV firmware enumerates the CPUs at SNP initialization and is not
>> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
>> hotplug can diverge from the firmware's expectations and break SNP.
>> Disable CPU hotplug while SNP is active.
>
> I think this is too broad. If I have a hypervisor that supports SNP virtualization, a (non-confidential) L1 guest running Linux should still support CPU hotplug while also running confidential L2 guests.
>
> --
> Jethro Beekman | CTO | Fortanix
>
Were any other solutions considered other than disabling CPU hotplug?
Is this temporary until something else is implemented?
I'm not sure how commonly CPU hotplug is used, and if people are okay
with trading in CPU hotplug to get SNP.
Is it that fundamentally the SEV firmware can't support hotplug, so
there's no point in keeping it enabled anyway?
Is there some way of supporting hotplug for CPUs that won't be used with
SNP, for serving non-SNP VMs on the same host as SNP VMs, or is that too
complicated?
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [PATCH 1/4] kvm: sev: Fix user-space triggerable WARN_ON on snp_launch_update path
From: Sean Christopherson @ 2026-06-23 14:46 UTC (permalink / raw)
To: Jörg Rödel
Cc: Paolo Bonzini, x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky,
Ashish Kalra, Michael Roth, kvm, linux-kernel, linux-coco,
Joerg Roedel
In-Reply-To: <20260623091556.1500930-2-joro@8bytes.org>
Please capitalize the scope, i.e. "KVM: SEV:".
On Tue, Jun 23, 2026, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
>
> Sashiko reported on an unrelated patch:
>
> [Severity: High]
> This is a pre-existing issue, but can a host userspace process trigger a
> kernel warning by passing a NULL user address (uaddr = 0) here?
>
> If params.uaddr is 0, src becomes NULL and passes the PAGE_ALIGNED(src)
> check. kvm_gmem_populate() skips fetching the user page and passes
> src_page = NULL to sev_gmem_post_populate().
>
> That function then unconditionally evaluates:
>
> WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
> !src_page)
>
> Since the type isn't ZERO, won't this allow an unprivileged user to spam
> the kernel log?
Use Reported-by: + Closes to capture Sashiko's effective bug report instead of
copy+pasting the finding. There's no reason to treat Sashiko any differently
than any other bot.
> The assessment is correct, so check for this condition earlier in the
> snp_launch_update() path to avoid the WARN_ON_ONCE.
>
> Fixes: dee5a47cc7a45 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
> arch/x86/kvm/svm/sev.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 6c6a6d663e29..41dcba5180ca 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2438,6 +2438,13 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (!PAGE_ALIGNED(src))
> return -EINVAL;
>
> + /*
> + * Make sure user-mode did not pass NULL as src with
> + * type != KVM_SEV_SNP_PAGE_TYPE_ZERO.
Meh, that's pretty obvious from the code.
> + */
> + if (src == NULL && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
I think I'd prefer this over checking for KVM_SEV_SNP_PAGE_TYPE_ZERO twice,
especially since the PAGE_ALIGNED() check for the NULL pointer case is rather
weird.
if (params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO)
src = NULL;
else if (!params.uaddr || !PAGE_ALIGNED(params.uaddr))
return -EINVAL;
else
src = u64_to_user_ptr(params.uaddr);
> + return -EINVAL;
Gah, we created quite the mess for ourselves. TDX returns -EOPNOTSUPP instead
of -EINVAL, I guess as a placeholder for in-place conversion? I don't care which
error code is returned, but I do want KVM to be consistent.
We should also adjust TDX to pre-check the source address, because checking only
in the post-populate flow subtly relies on tdx_vcpu_init_mem_region() returning
immediately on error. If that weren't the case (ignoring for the moment that
continuing on would be nonsensical), KVM would advance the address by PAGE_SIZE
and suddenly a NULL userspace address becomes non-NULL.
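
As a minimal illustration of that hazard, here is a user-space sketch (not KVM code). The names `populate_one()`, `populate_stop_on_error()`, `populate_keep_going()` and the 4096-byte `SKETCH_PAGE_SIZE` are hypothetical; the sketch only shows why a per-page NULL check depends on the caller stopping at the first error.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Per-page callback: rejects a NULL (zero) source address. */
static int populate_one(uint64_t src)
{
	return src ? 0 : -1;
}

/* Returns the number of pages accepted; stops at the first error. */
static unsigned int populate_stop_on_error(uint64_t src, unsigned int npages)
{
	unsigned int ok = 0;

	for (unsigned int i = 0; i < npages; i++, src += SKETCH_PAGE_SIZE) {
		if (populate_one(src))
			break;
		ok++;
	}
	return ok;
}

/* Same loop, but it (nonsensically) continues after an error. */
static unsigned int populate_keep_going(uint64_t src, unsigned int npages)
{
	unsigned int ok = 0;

	for (unsigned int i = 0; i < npages; i++, src += SKETCH_PAGE_SIZE) {
		if (!populate_one(src))
			ok++;
	}
	return ok;
}
```

With a NULL start address the first variant accepts nothing, while the second accepts every page after the first because the address is non-zero once it has been advanced. Rejecting a zero address once, before the loop, removes that dependency.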
I also think it makes sense to drop the WARN in sev_gmem_post_populate(), it's
completely redundant once this bug is fixed.
Ugh, and both SNP and TDX fail to account for tags, and fail to check for
striding into kernel space. Which I suppose is fine, since gup() handles those
correctly. And I don't see a strong argument for disallowing tagged addresses,
because unlike the userspace address for memslots, KVM doesn't keep the address
around long-term.
So over two patches, the below? I can send a v2, I've already got changelogs
written (I was fiddling around with extracting and reusing kvm_set_memory_region()'s
checks on the userspace address+size, but as above, convinced myself that KVM
should continue to allow tagged addresses for SNP and TDX).
---
arch/x86/kvm/svm/sev.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 74fb15551e83..621a2eaa58f2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2330,9 +2330,6 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
int level;
int ret;
- if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
- return -EINVAL;
-
ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
if (ret || assigned) {
pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
@@ -2421,10 +2418,12 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
params.type != KVM_SEV_SNP_PAGE_TYPE_CPUID))
return -EINVAL;
- src = params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO ? NULL : u64_to_user_ptr(params.uaddr);
-
- if (!PAGE_ALIGNED(src))
+ if (params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO)
+ src = NULL;
+ else if (!params.uaddr || !PAGE_ALIGNED(params.uaddr))
return -EINVAL;
+ else
+ src = u64_to_user_ptr(params.uaddr);
npages = params.len / PAGE_SIZE;
---
arch/x86/kvm/vmx/tdx.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ffe9d0db58c5..b0ec054732b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3198,9 +3198,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
return -EIO;
- if (!src_page)
- return -EOPNOTSUPP;
-
kvm_tdx->page_add_src = src_page;
ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
kvm_tdx->page_add_src = NULL;
@@ -3247,8 +3244,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
if (copy_from_user(&region, u64_to_user_ptr(cmd->data), sizeof(region)))
return -EFAULT;
- if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
- !region.nr_pages ||
+ if (!PAGE_ALIGNED(region.source_addr) || !region.source_addr ||
+ !PAGE_ALIGNED(region.gpa) || !region.nr_pages ||
region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
!vt_is_tdx_private_gpa(kvm, region.gpa) ||
!vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
--
^ permalink raw reply related
* Re: [PATCH 3/4] KVM: guest_memfd: Add `write` parameter to kvm_gmem_populate()
From: Sean Christopherson @ 2026-06-23 12:57 UTC (permalink / raw)
To: Jörg Rödel
Cc: Paolo Bonzini, x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky,
Ashish Kalra, Michael Roth, kvm, linux-kernel, linux-coco,
Joerg Roedel
In-Reply-To: <20260623091556.1500930-4-joro@8bytes.org>
On Tue, Jun 23, 2026, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
>
> The call-path of kvm_gmem_populate() might subsequently write to the
> page provided by user-space. This is used to provide detailed error
> information in case the page population failed.
>
> But since kvm_gmem_populate() only acquires a read-only reference to
> the user-space page via get_user_pages_fast(), the error information
> might be written to a read-only page later on.
>
> Add a parameter to kvm_gmem_populate() to optionally acquire a
> writeable reference to the source page to make sure page permissions
> can be enforced.
Already fixed, commit f13e90059908 ("KVM: SEV: Pin source page for write when
adding CPUID data for SNP guest").
^ permalink raw reply
* [RFCv2 PATCH 6/6] virtio-mem: Support memory hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
Integrate coco memory management operations into the virtio-mem driver to
manage the state of hotplug memory.
In virtio_mem_send_plug_request(), once the host hypervisor acknowledges a
plug request, invoke coco_set_plugged_bitmap() to set the corresponding
bits in the plugged bitmap. Conversely, in virtio_mem_send_unplug_request()
and virtio_mem_send_unplug_all_request(), call unaccept_memory() to let the
guest autonomously transition the target private pages back to "unaccepted"
state before asking the VMM to unplug them. After the VMM acknowledges the
unplug request, clear the ranges from the plugged bitmap.
Note that memory block hotplug/unplug also sets or clears the plugged
bitmap at memory block granularity. While doing this at device block
granularity here creates a slight redundancy, it is completely harmless.
Additionally, update virtio_mem_fake_online() to explicitly invoke
accept_memory() when transitioning memory out of the fake-offline state and
back into service. This ensures that any pages returning to the buddy
system are cleanly accepted by the guest architecture before they are freed
back into the allocator via free_contig_range().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
drivers/virtio/virtio_mem.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 48051e9e98ab..9f6e53df8caf 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1211,6 +1211,7 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
generic_online_page(page, order);
} else {
virtio_mem_clear_fake_offline(pfn + i, 1 << order, true);
+ accept_memory(page_to_phys(page), PAGE_SIZE << order);
free_contig_range(pfn + i, 1 << order);
adjust_managed_page_count(page, 1 << order);
}
@@ -1436,6 +1437,7 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
switch (virtio_mem_send_request(vm, &req)) {
case VIRTIO_MEM_RESP_ACK:
vm->plugged_size += size;
+ WARN_ON(coco_set_plugged_bitmap(addr, size, true));
return 0;
case VIRTIO_MEM_RESP_NACK:
rc = -EAGAIN;
@@ -1471,9 +1473,12 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
dev_dbg(&vm->vdev->dev, "unplugging memory: 0x%llx - 0x%llx\n", addr,
addr + size - 1);
+ unaccept_memory(addr, size);
+
switch (virtio_mem_send_request(vm, &req)) {
case VIRTIO_MEM_RESP_ACK:
vm->plugged_size -= size;
+ WARN_ON(coco_set_plugged_bitmap(addr, size, false));
return 0;
case VIRTIO_MEM_RESP_BUSY:
rc = -ETXTBSY;
@@ -1498,10 +1503,13 @@ static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
dev_dbg(&vm->vdev->dev, "unplugging all memory");
+ unaccept_memory(vm->addr, vm->region_size);
+
switch (virtio_mem_send_request(vm, &req)) {
case VIRTIO_MEM_RESP_ACK:
vm->unplug_all_required = false;
vm->plugged_size = 0;
+ WARN_ON(coco_set_plugged_bitmap(vm->addr, vm->region_size, false));
/* usable region might have shrunk */
atomic_set(&vm->config_changed, 1);
return 0;
--
2.52.0
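
The unplug ordering described in the commit message above can be sketched as a tiny state model. This is an illustration under assumptions, not driver code: `struct unit_state`, `enum resp` and `unplug_unit()` are hypothetical names, and the model tracks a single unit instead of bitmaps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical host answers, reduced to the two cases that matter here. */
enum resp { RESP_ACK, RESP_BUSY };

/* Hypothetical per-unit state standing in for the two bitmaps. */
struct unit_state {
	bool plugged;
	bool accepted;
};

static int unplug_unit(struct unit_state *u, enum resp host_answer)
{
	/* Step 1: the guest unaccepts the memory before asking the VMM. */
	u->accepted = false;

	/* Step 2: only an ACK clears the plugged state. */
	if (host_answer != RESP_ACK)
		return -1;
	u->plugged = false;
	return 0;
}
```

In this model a refused request leaves the unit plugged but unaccepted, and only an acknowledged request clears the plugged state.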
^ permalink raw reply related
* [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
Integrate coco memory management operations into the core memory hotplug
subsystem to handle the lifecycle of hotplug memory.
In add_memory_resource(), invoke coco_set_plugged_bitmap(..., true) to mark
memory plugged before adding the memory block, because self-hosted memmap
initialization needs its plugged bits set before acceptance. There is no
explicit call to accept_memory() for normal pages, because they can be
lazily accepted by the core memory management subsystem after the memory
block is onlined.
In try_remove_memory(), before finalizing the physical removal of the
memory blocks, invoke unaccept_memory(). This allows the guest to take
direct control of its own memory state and release the pages itself,
eliminating the dependency on the VMM to implicitly hole-punch the memory.
It loops through the targeted ranges using find_next_andnot_bit(), matching
pages that are marked plugged and accepted, and releases them back to the
host. Following the unacceptance step, clear the ranges from the plugged
bitmap.
These operations guarantee that both the unaccepted and plugged tracking
states stay completely synchronized with the actual dynamic memory
configurations of the guest.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/linux/mm.h | 11 +++
drivers/firmware/efi/unaccepted_memory.c | 94 ++++++++++++++++++++++++
mm/memory_hotplug.c | 16 ++++
3 files changed, 121 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fc2acedf0b76..4c094038872a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5105,6 +5105,8 @@ int set_anon_vma_name(unsigned long addr, unsigned long size,
bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size);
void accept_memory(phys_addr_t start, unsigned long size);
+void unaccept_memory(phys_addr_t start, unsigned long size);
+int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set);
#else
@@ -5118,6 +5120,15 @@ static inline void accept_memory(phys_addr_t start, unsigned long size)
{
}
+static inline void unaccept_memory(phys_addr_t start, unsigned long size)
+{
+}
+
+static inline int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set)
+{
+ return 0;
+}
+
#endif
static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index c290b16c5142..f35f7016af53 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -233,6 +233,100 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
return ret;
}
+static int coco_hotplug_range_check(struct efi_unaccepted_memory *unaccepted,
+ phys_addr_t start, unsigned long size)
+{
+ u64 unit_size = unaccepted->unit_size;
+ u64 phys_base = unaccepted->phys_base;
+ u64 phys_end = phys_base + unaccepted->size * unit_size * BITS_PER_BYTE;
+
+ if (!IS_ALIGNED(start | size, unit_size))
+ return -EINVAL;
+
+ if (start < phys_base || start + size > phys_end)
+ return -EINVAL;
+
+ return 0;
+}
+
+/* Only used by hotplug memory, we don't unaccept static memory */
+void unaccept_memory(phys_addr_t start, unsigned long size)
+{
+ unsigned long range_start, range_end, bitmap_size, flags;
+ struct efi_unaccepted_memory *unaccepted;
+ void *plugged_bitmap;
+ u64 unit_size;
+
+ unaccepted = efi_get_unaccepted_table();
+ if (!unaccepted)
+ return;
+
+ if (WARN_ON(coco_hotplug_range_check(unaccepted, start, size)))
+ return;
+
+ unit_size = unaccepted->unit_size;
+ range_start = (start - unaccepted->phys_base) / unit_size;
+ bitmap_size = range_start + size / unit_size;
+ plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ for (; range_start < bitmap_size; range_start = range_end) {
+ unsigned long phys_start, phys_end;
+ unsigned long unaccepted_one, plugged_zero;
+
+ range_start = find_next_andnot_bit(plugged_bitmap, unaccepted->bitmap,
+ bitmap_size, range_start);
+
+ if (range_start >= bitmap_size)
+ break;
+
+ unaccepted_one = find_next_bit(unaccepted->bitmap, bitmap_size, range_start);
+ plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size, range_start);
+ range_end = min(unaccepted_one, plugged_zero);
+
+ phys_start = range_start * unit_size + unaccepted->phys_base;
+ phys_end = range_end * unit_size + unaccepted->phys_base;
+
+ arch_unaccept_memory(phys_start, phys_end);
+ bitmap_set(unaccepted->bitmap, range_start, range_end - range_start);
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+/*
+ * Only used by hotplug memory, plugged bits of static memory are handled
+ * in process_unaccepted_memory()
+ */
+int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set)
+{
+ struct efi_unaccepted_memory *unaccepted;
+ unsigned long range_start, flags;
+ void *plugged_bitmap;
+ u64 unit_size;
+ int ret;
+
+ unaccepted = efi_get_unaccepted_table();
+ if (!unaccepted)
+ return 0;
+
+ ret = coco_hotplug_range_check(unaccepted, start, size);
+ if (ret)
+ return ret;
+
+ unit_size = unaccepted->unit_size;
+ range_start = (start - unaccepted->phys_base) / unit_size;
+ plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ if (set)
+ bitmap_set(plugged_bitmap, range_start, size / unit_size);
+ else
+ bitmap_clear(plugged_bitmap, range_start, size / unit_size);
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+ return 0;
+}
+
#ifdef CONFIG_PROC_VMCORE
static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
unsigned long pfn)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 40c7915dabe0..2f71514a0616 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1429,6 +1429,8 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
arch_remove_memory(cur_start, memblock_size, altmap);
+ unaccept_memory(cur_start, PFN_PHYS(altmap->free));
+
/* Verify that all vmemmap pages have actually been freed. */
WARN(altmap->alloc, "Altmap not fully unmapped");
kfree(altmap);
@@ -1459,9 +1461,13 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
goto out;
}
+ /* Accept self hosted memmap array before access it */
+ accept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
+
/* call arch's memory hotadd */
ret = arch_add_memory(nid, cur_start, memblock_size, &params);
if (ret < 0) {
+ unaccept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
kfree(params.altmap);
goto out;
}
@@ -1471,6 +1477,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
params.altmap, group);
if (ret) {
arch_remove_memory(cur_start, memblock_size, NULL);
+ unaccept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
kfree(params.altmap);
goto out;
}
@@ -1540,6 +1547,10 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
new_node = true;
}
+ ret = coco_set_plugged_bitmap(start, size, true);
+ if (ret)
+ goto error_offline_node;
+
/*
* Self hosted memmap array
*/
@@ -1584,6 +1595,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
return ret;
error:
+ WARN_ON(coco_set_plugged_bitmap(start, size, false));
+error_offline_node:
if (new_node) {
node_set_offline(nid);
unregister_node(nid);
@@ -2282,6 +2295,9 @@ static int try_remove_memory(u64 start, u64 size)
if (nid != NUMA_NO_NODE)
try_offline_node(nid);
+ unaccept_memory(start, size);
+ WARN_ON(coco_set_plugged_bitmap(start, size, false));
+
mem_hotplug_done();
return 0;
}
--
2.52.0
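
The run-finding loop in unaccept_memory() above can be modeled in user space as follows. This is a sketch under assumptions: it uses single 64-bit words instead of kernel bitmaps, `test_bit64()` and `unaccept_range()` are hypothetical helpers, and the arch hook is replaced by a comment. It only mirrors the selection rule "plugged and not yet unaccepted".

```c
#include <assert.h>
#include <stdint.h>

#define NBITS 64u	/* the sketch handles at most one 64-bit word */

static int test_bit64(uint64_t map, unsigned int bit)
{
	return (map >> bit) & 1;
}

/*
 * Walk [first, last) and mark every unit that is plugged and currently
 * accepted (unaccepted bit clear) as unaccepted. Returns the number of
 * contiguous runs that would be handed to the arch hook.
 */
static unsigned int unaccept_range(uint64_t plugged, uint64_t *unaccepted,
				   unsigned int first, unsigned int last)
{
	unsigned int runs = 0;
	unsigned int start = first;

	assert(first <= last && last <= NBITS);

	while (start < last) {
		unsigned int end;

		/* Like find_next_andnot_bit(): plugged && !unaccepted. */
		while (start < last &&
		       !(test_bit64(plugged, start) &&
			 !test_bit64(*unaccepted, start)))
			start++;
		if (start >= last)
			break;

		/* Extend the run until a unit is unplugged or unaccepted. */
		end = start;
		while (end < last && test_bit64(plugged, end) &&
		       !test_bit64(*unaccepted, end))
			end++;

		/* The real code would call arch_unaccept_memory() here. */
		for (unsigned int bit = start; bit < end; bit++)
			*unaccepted |= (uint64_t)1 << bit;
		runs++;
		start = end;
	}
	return runs;
}
```

A second call over the same range finds nothing left to do, because every plugged unit is already marked unaccepted.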
^ permalink raw reply related
* [RFCv2 PATCH 4/6] x86/tdx: Implement arch_unaccept_memory()
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
During memory hot-unplug, if the VMM does not punch a hole in the memory, the
memory stays in the "accepted" state. Consequently, re-accepting that same
memory during a later re-plug operation fails. To guard against this, a
confidential guest must control the memory state explicitly, e.g., by
setting memory to the "unaccepted" state during unplug.
In the context of TDX, the "unaccepted" state maps to the PENDING state,
while the "accepted" state maps to the MAPPED state. Implement
arch_unaccept_memory() for TDX guest via the TDG.MEM.PAGE.RELEASE TDCALL.
It uses 1G/2M/4K page size fallbacks and rolls back on partial failure. A
failure during this rollback step indicates severe corruption of the TDX
module state and triggers a kernel panic.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
arch/x86/include/asm/shared/tdx.h | 2 +
arch/x86/include/asm/tdx.h | 2 +
arch/x86/include/asm/unaccepted_memory.h | 11 +++
arch/x86/coco/tdx/tdx.c | 120 +++++++++++++++++++++++
4 files changed, 135 insertions(+)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 049638e3da74..910ec1e57528 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -19,6 +19,7 @@
#define TDG_MEM_PAGE_ACCEPT 6
#define TDG_VM_RD 7
#define TDG_VM_WR 8
+#define TDG_MEM_PAGE_RELEASE 30
/* TDX TD attributes */
#define TDX_TD_ATTR_DEBUG_BIT 0
@@ -54,6 +55,7 @@
/* TDCS_CONFIG_FLAGS bits */
#define TDCS_CONFIG_FLEXIBLE_PENDING_VE BIT_ULL(1)
+#define TDCS_CONFIG_PAGE_RELEASE BIT_ULL(6)
/* TDCS_TD_CTLS bits */
#define TD_CTLS_PENDING_VE_DISABLE_BIT 0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a149740b24e8..8608d33a7db6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -72,6 +72,8 @@ int tdx_mcall_extend_rtmr(u8 index, u8 *data);
u64 tdx_hcall_get_quote(u8 *buf, size_t size);
+bool tdx_unaccept_memory(phys_addr_t start, phys_addr_t end);
+
void __init tdx_dump_attributes(u64 td_attr);
void __init tdx_dump_td_ctls(u64 td_ctls);
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index f5937e9866ac..9fd9411d2c44 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -18,6 +18,17 @@ static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
}
}
+static inline void arch_unaccept_memory(phys_addr_t start, phys_addr_t end)
+{
+ /* Platform-specific memory-unacceptance call goes here */
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ if (!tdx_unaccept_memory(start, end))
+ panic("TDX: Failed to unaccept memory\n");
+ } else {
+ panic("Cannot unaccept memory: unknown platform\n");
+ }
+}
+
static inline struct efi_unaccepted_memory *efi_get_unaccepted_table(void)
{
if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 186915a17c50..1bab8f4687bf 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -326,6 +326,124 @@ static void reduce_unnecessary_ve(void)
enable_cpu_topology_enumeration();
}
+static bool tdx_page_release_supported;
+
+static void tdx_detect_page_release_support(void)
+{
+ u64 config = 0;
+
+ tdg_vm_rd(TDCS_CONFIG_FLAGS, &config);
+
+ tdx_page_release_supported = !!(config & TDCS_CONFIG_PAGE_RELEASE);
+}
+
+static unsigned long try_release_one(phys_addr_t start, unsigned long len,
+ enum pg_level pg_level)
+{
+ unsigned long release_size = page_level_size(pg_level);
+ struct tdx_module_args args = {};
+ u8 page_size;
+ u64 ret;
+
+ if (!IS_ALIGNED(start, release_size))
+ return 0;
+
+ if (len < release_size)
+ return 0;
+
+ /*
+ * Pass the page physical address to TDX module to release the
+ * private page and to put it in PENDING state.
+ *
+ * Encode page size in RCX[2:0] using TDX_PS_*
+ */
+ switch (pg_level) {
+ case PG_LEVEL_4K:
+ page_size = TDX_PS_4K;
+ break;
+ case PG_LEVEL_2M:
+ page_size = TDX_PS_2M;
+ break;
+ case PG_LEVEL_1G:
+ page_size = TDX_PS_1G;
+ break;
+ default:
+ return 0;
+ }
+
+ args.rcx = start | page_size;
+ ret = __tdcall(TDG_MEM_PAGE_RELEASE, &args);
+ if (ret)
+ return 0;
+
+ return release_size;
+}
+
+static bool tdx_release_memory(phys_addr_t start, phys_addr_t end, phys_addr_t *cur)
+{
+ *cur = start;
+
+ while (*cur < end) {
+ unsigned long len = end - *cur;
+ unsigned long release_size;
+
+ /*
+ * Try larger release first. It speeds up process by cutting
+ * number of hypercalls (if successful).
+ */
+
+ release_size = try_release_one(*cur, len, PG_LEVEL_1G);
+ if (!release_size)
+ release_size = try_release_one(*cur, len, PG_LEVEL_2M);
+ if (!release_size)
+ release_size = try_release_one(*cur, len, PG_LEVEL_4K);
+ if (!release_size)
+ return false;
+ *cur += release_size;
+ }
+
+ return true;
+}
+
+/**
+ * Release private memory and put it in PENDING state.
+ *
+ * @start: Physical start address of memory range to release
+ * @end: Physical end address of memory range to release
+ *
+ * Uses TDG.MEM.PAGE.RELEASE TDCALL to transition private pages back to
+ * PENDING state. If PAGE_RELEASE is not supported by the TDX
+ * configuration, returns true (success) as no action is needed.
+ *
+ * On partial failure, automatically re-accepts any successfully released
+ * pages to restore consistent memory state. Re-acceptance failure is
+ * treated as a fatal error since it indicates severe TDX module issues.
+ *
+ * Returns: true on success, false on failure
+ */
+bool tdx_unaccept_memory(phys_addr_t start, phys_addr_t end)
+{
+ phys_addr_t released = start;
+ bool ret;
+
+ if (!tdx_page_release_supported)
+ return true;
+
+ ret = tdx_release_memory(start, end, &released);
+ if (!ret) {
+ pr_err("Failed to unaccept memory [%pa, %pa)\n", &start, &end);
+ /*
+ * Re-accept any pages that were successfully released before
+ * the failure occurred. This should never fail since we're
+ * just restoring the previous MAPPED state.
+ */
+ if (!tdx_accept_memory(start, released))
+ panic("%s: Failed to re-accept memory\n", __func__);
+ }
+
+ return ret;
+}
+
static void tdx_setup(u64 *cc_mask)
{
struct tdx_module_args args = {};
@@ -359,6 +477,8 @@ static void tdx_setup(u64 *cc_mask)
disable_sept_ve(td_attr);
reduce_unnecessary_ve();
+
+ tdx_detect_page_release_support();
}
/*
--
2.52.0
^ permalink raw reply related
* [RFCv2 PATCH 3/6] efi/unaccepted: Create plugged bitmap to support hotplug memory in coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
The load_unaligned_zeropad() function can cause unintended memory loads
across page boundaries. To safely handle these unaligned reads in a
confidential computing guest, the kernel implicitly accepts an extra
unit_size block of memory to serve as a safety guard.
However, near hotplug boundaries, this extra acceptance can fall within
unpopulated gaps between hotplugged memory ranges, triggering a guest
kernel crash.
To protect these boundaries against out-of-bounds access, introduce a
"plugged" bitmap positioned immediately following the unaccepted memory
bitmap.
Initial static boot memory ranges have their corresponding bits marked
as plugged by default during early initialization. For hotpluggable
memory ranges, the memory driver must explicitly set the proper bits
when a memory block is plugged, and clear them upon an unplug event.
Update accept_memory() and range_contains_unaccepted_memory() to check
the intersection of both bitmaps. The kernel now combines them to
determine exactly which plugged, unaccepted pages require acceptance.
Additionally, bump the unaccepted memory table layout version from 1
to 2. This strict layout enforcement guarantees that a version 1 table
passed to a new kernel, or a version 2 table passed to an old kernel,
will explicitly fail kexec early due to the version mismatch.
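For readers following the accept_memory() change, the intersection walk described above can be sketched in plain C. This is a minimal userspace illustration under stated assumptions, not the kernel code: bit_is_set() and next_run_to_accept() are hypothetical names, and a simple linear scan stands in for the kernel's find_next_and_bit() and find_next_zero_bit() helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Test a single bit in a word-array bitmap (hypothetical helper). */
static int bit_is_set(const unsigned long *map, size_t bit)
{
	return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}

/*
 * Find the next run of units in [from, limit) that are both plugged and
 * unaccepted. Returns 1 and fills *run_start / *run_end (end exclusive)
 * when such a run exists, 0 otherwise.
 */
int next_run_to_accept(const unsigned long *unaccepted,
		       const unsigned long *plugged,
		       size_t from, size_t limit,
		       size_t *run_start, size_t *run_end)
{
	size_t i = from;

	assert(from <= limit);

	/* Skip units that are missing from either bitmap. */
	while (i < limit &&
	       !(bit_is_set(unaccepted, i) && bit_is_set(plugged, i)))
		i++;
	if (i >= limit)
		return 0;

	*run_start = i;

	/* The run ends at the first unit cleared in either bitmap. */
	while (i < limit &&
	       bit_is_set(unaccepted, i) && bit_is_set(plugged, i))
		i++;
	*run_end = i;

	return 1;
}
```

A caller would loop on this function, converting each run to a physical range with phys_base + index * unit_size before accepting it, so that an unaccepted but unplugged unit is never touched.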
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/linux/efi.h | 5 ++++
arch/x86/boot/compressed/mem.c | 2 +-
drivers/firmware/efi/efi.c | 4 +--
.../firmware/efi/libstub/unaccepted_memory.c | 16 +++++++----
drivers/firmware/efi/unaccepted_memory.c | 28 +++++++++++++++----
5 files changed, 42 insertions(+), 13 deletions(-)
diff --git a/include/linux/efi.h b/include/linux/efi.h
index ccbc35479684..579d102f128a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -551,6 +551,11 @@ struct efi_unaccepted_memory {
unsigned long bitmap[];
};
+static inline void *plugged_bitmap_of(struct efi_unaccepted_memory *u)
+{
+ return (void *)u->bitmap + u->size;
+}
+
/*
* Architecture independent structure for describing a memory map for the
* benefit of efi_memmap_init_early(), and for passing context between
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 40e9c81a2206..61b8d0edd2f6 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -69,7 +69,7 @@ bool init_unaccepted_memory(void)
if (!table)
return false;
- if (table->version != 1)
+ if (table->version != 2)
error("Unknown version of unaccepted memory table\n");
/*
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 318d1cc9a066..7f7341634c13 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -701,7 +701,7 @@ static __init void reserve_unaccepted(struct efi_unaccepted_memory *unaccepted)
phys_addr_t start, end;
start = PAGE_ALIGN_DOWN(efi.unaccepted);
- end = PAGE_ALIGN(efi.unaccepted + sizeof(*unaccepted) + unaccepted->size);
+ end = PAGE_ALIGN(efi.unaccepted + sizeof(*unaccepted) + unaccepted->size * 2);
memblock_add(start, end - start);
memblock_reserve(start, end - start);
@@ -837,7 +837,7 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
if (unaccepted) {
- if (unaccepted->version == 1) {
+ if (unaccepted->version == 2) {
reserve_unaccepted(unaccepted);
} else {
efi.unaccepted = EFI_INVALID_TABLE_ADDR;
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index 01bed8e751ca..5b0deb6c91f1 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -113,7 +113,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
struct efi_boot_memmap *map)
{
efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
- u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+ u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size, total_size;
struct srat_parse_ctx ctx;
efi_status_t status;
int i;
@@ -124,7 +124,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
/* Check if the table is already installed */
unaccepted_table = get_efi_config_table(unaccepted_table_guid);
if (unaccepted_table) {
- if (unaccepted_table->version != 1) {
+ if (unaccepted_table->version != 2) {
efi_err("Unknown version of unaccepted memory table\n");
return EFI_UNSUPPORTED;
}
@@ -173,19 +173,22 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
+ /* There is a plugged bitmap after unaccepted bitmap */
+ total_size = bitmap_size << 1;
+
status = efi_bs_call(allocate_pool, EFI_ACPI_RECLAIM_MEMORY,
- sizeof(*unaccepted_table) + bitmap_size,
+ sizeof(*unaccepted_table) + total_size,
(void **)&unaccepted_table);
if (status != EFI_SUCCESS) {
efi_err("Failed to allocate unaccepted memory config table\n");
return status;
}
- unaccepted_table->version = 1;
+ unaccepted_table->version = 2;
unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
unaccepted_table->phys_base = unaccepted_start;
unaccepted_table->size = bitmap_size;
- memset(unaccepted_table->bitmap, 0, bitmap_size);
+ memset(unaccepted_table->bitmap, 0, total_size);
parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
status = efi_bs_call(install_configuration_table,
@@ -287,6 +290,9 @@ void process_unaccepted_memory(u64 start, u64 end)
*/
bitmap_set(unaccepted_table->bitmap,
start / unit_size, (end - start) / unit_size);
+ /* Set plugged bits for static memory and never unset */
+ bitmap_set(plugged_bitmap_of(unaccepted_table),
+ start / unit_size, (end - start) / unit_size);
}
void accept_memory(phys_addr_t start, unsigned long size)
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 4a8ec8d6a571..c290b16c5142 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -38,6 +38,7 @@ void accept_memory(phys_addr_t start, unsigned long size)
unsigned long flags;
phys_addr_t end;
u64 unit_size;
+ void *plugged_bitmap;
unaccepted = efi_get_unaccepted_table();
if (!unaccepted)
@@ -126,12 +127,23 @@ void accept_memory(phys_addr_t start, unsigned long size)
*/
list_add(&range.list, &accepting_list);
- range_start = range.start;
- for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
- range.end) {
+ plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+ for (range_start = range.start; range_start < range.end; range_start = range_end) {
unsigned long phys_start, phys_end;
- unsigned long len = range_end - range_start;
+ unsigned long len;
+ unsigned long unaccepted_zero, plugged_zero;
+
+ range_start = find_next_and_bit(plugged_bitmap, unaccepted->bitmap,
+ range.end, range_start);
+
+ if (range_start >= range.end)
+ break;
+ unaccepted_zero = find_next_zero_bit(unaccepted->bitmap, range.end, range_start);
+ plugged_zero = find_next_zero_bit(plugged_bitmap, range.end, range_start);
+ range_end = min(unaccepted_zero, plugged_zero);
+ len = range_end - range_start;
phys_start = range_start * unit_size + unaccepted->phys_base;
phys_end = range_end * unit_size + unaccepted->phys_base;
@@ -167,6 +179,7 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
bool ret = false;
phys_addr_t end;
u64 unit_size;
+ void *plugged_bitmap;
unaccepted = efi_get_unaccepted_table();
if (!unaccepted)
@@ -201,9 +214,14 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
end = unaccepted->size * unit_size * BITS_PER_BYTE;
+ plugged_bitmap = plugged_bitmap_of(unaccepted);
+
spin_lock_irqsave(&unaccepted_memory_lock, flags);
while (start < end) {
- if (test_bit(start / unit_size, unaccepted->bitmap)) {
+ unsigned long range_start = start / unit_size;
+
+ if (test_bit(range_start, plugged_bitmap) &&
+ test_bit(range_start, unaccepted->bitmap)) {
ret = true;
break;
}
--
2.52.0
^ permalink raw reply related
* [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
In coco guests, hotpluggable memory ranges are initially unaccepted.
While a previous change expanded the unaccepted memory bitmap boundaries
to include these hotplug spaces, the actual bits inside the bitmap are
not yet marked as unaccepted.
Walk the SRAT a second time after the bitmap is allocated and set the bits
corresponding to hotpluggable ranges.
This ensures the bitmap state accurately reflects all static and hotplug
memory ranges before booting the kernel.
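The address-to-bit translation used when marking a hotpluggable range can be sketched as follows. This is a standalone illustration, assuming unit_size is a power of two; struct bit_range and range_to_bits() are hypothetical names, and the early return for ranges smaller than one unit is an extra guard added for the sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Result of translating a physical range to bitmap coordinates. */
struct bit_range {
	uint64_t first;	/* index of the first bit to set */
	uint64_t count;	/* number of bits to set */
};

struct bit_range range_to_bits(uint64_t base, uint64_t length,
			       uint64_t phys_base, uint64_t unit_size)
{
	struct bit_range r = { 0, 0 };
	uint64_t start, end;

	/* The masks below require a non-zero power-of-two unit size. */
	assert(unit_size && (unit_size & (unit_size - 1)) == 0);

	/* Shrink the range inward to whole units. */
	start = (base + unit_size - 1) & ~(unit_size - 1);	/* round up */
	end = (base + length) & ~(unit_size - 1);		/* round down */

	/* Sketch-only guard: no whole unit covered, or below the bitmap. */
	if (end <= start || start < phys_base)
		return r;

	/* Translate to offsets from the beginning of the bitmap. */
	r.first = (start - phys_base) / unit_size;
	r.count = (end - start) / unit_size;
	return r;
}
```

With a 2 MiB unit, a range starting 1 MiB past phys_base and spanning 5 MiB covers exactly two whole units, starting at bit 1.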
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
.../firmware/efi/libstub/unaccepted_memory.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index bfbb78bd7b8a..01bed8e751ca 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
*(ctx->mem_end) = range_end;
}
+static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
+ struct srat_parse_ctx *ctx)
+{
+ u64 unit_size = unaccepted_table->unit_size;
+ u64 start, end;
+
+ start = round_up(mem->base_address, unit_size);
+ end = round_down(mem->base_address + mem->length, unit_size);
+
+ /* Translate to offsets from the beginning of the bitmap */
+ start -= unaccepted_table->phys_base;
+ end -= unaccepted_table->phys_base;
+
+ bitmap_set(unaccepted_table->bitmap,
+ start / unit_size, (end - start) / unit_size);
+}
+
efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
struct efi_boot_memmap *map)
{
@@ -169,6 +186,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
unaccepted_table->phys_base = unaccepted_start;
unaccepted_table->size = bitmap_size;
memset(unaccepted_table->bitmap, 0, bitmap_size);
+ parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
status = efi_bs_call(install_configuration_table,
&unaccepted_table_guid, unaccepted_table);
--
2.52.0
^ permalink raw reply related
* [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
Currently, allocate_unaccepted_bitmap() only scans the initial EFI
boot memory map. This misses hotpluggable ranges described in the
ACPI SRAT. Without early tracking, hotplug pages are accessed without
acceptance, which triggers a guest crash.
Introduce a lightweight ACPI SRAT parser to scan these regions early.
If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
flags, expand the tracking boundaries. This avoids pulling in the full
ACPI subsystem while ensuring the bitmap covers both static memory and
hotplug memory.
Bail out early with success on non-confidential guests to prevent
unnecessary bitmap allocation.
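The boundary expansion described above can be sketched independently of the ACPI table walk. This is a minimal illustration, not the patch itself: struct sketch_region and expand_bounds() are hypothetical, and the two SKETCH_* flag macros are local stand-ins assumed to mirror the values of ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the SRAT memory affinity flags (assumed values). */
#define SKETCH_MEM_ENABLED       (1u << 0)
#define SKETCH_MEM_HOT_PLUGGABLE (1u << 1)

/* Hypothetical, simplified view of one SRAT memory affinity entry. */
struct sketch_region {
	uint64_t base;
	uint64_t length;
	uint32_t flags;
};

/*
 * Widen [*start, *end) so that it covers every region that is both
 * enabled and hotpluggable. Regions whose end would wrap around the
 * 64-bit address space are skipped.
 */
void expand_bounds(const struct sketch_region *regions, unsigned int nr,
		   uint64_t *start, uint64_t *end)
{
	const uint32_t mask = SKETCH_MEM_ENABLED | SKETCH_MEM_HOT_PLUGGABLE;
	unsigned int i;

	assert(start && end);

	for (i = 0; i < nr; i++) {
		const struct sketch_region *r = &regions[i];
		uint64_t range_end;

		/* Both flags must be set. */
		if ((r->flags & mask) != mask)
			continue;

		/* Equivalent of the overflow check on base + length. */
		if (r->length > UINT64_MAX - r->base)
			continue;

		range_end = r->base + r->length;
		if (r->base < *start)
			*start = r->base;
		if (range_end > *end)
			*end = range_end;
	}
}
```

Seeding the bounds with start = UINT64_MAX and end = 0, as allocate_unaccepted_bitmap() does with ULLONG_MAX and 0, means an untouched start still signals that no range was found.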
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
drivers/firmware/efi/libstub/efistub.h | 6 ++
arch/x86/boot/compressed/mem.c | 2 +-
.../firmware/efi/libstub/unaccepted_memory.c | 94 +++++++++++++++++++
3 files changed, 101 insertions(+), 1 deletion(-)
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index fd91fc15ec81..fc0cd33a5962 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -1260,4 +1260,10 @@ void arch_accept_memory(phys_addr_t start, phys_addr_t end);
efi_status_t efi_zboot_decompress_init(unsigned long *alloc_size);
efi_status_t efi_zboot_decompress(u8 *out, unsigned long outlen);
+bool early_is_tdx_guest(void);
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool early_is_sevsnp_guest(void);
+#else
+static inline bool early_is_sevsnp_guest(void) { return false; }
+#endif
#endif
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 0e9f84ab4bdc..40e9c81a2206 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -12,7 +12,7 @@
*
* Enumerate TDX directly from the early users.
*/
-static bool early_is_tdx_guest(void)
+bool early_is_tdx_guest(void)
{
static bool once;
static bool is_tdx;
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index 757dbe734a47..bfbb78bd7b8a 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -1,19 +1,109 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/efi.h>
+#include <linux/acpi.h>
#include <asm/efi.h>
#include "efistub.h"
struct efi_unaccepted_memory *unaccepted_table;
+struct srat_parse_ctx {
+ u64 *mem_start;
+ u64 *mem_end;
+};
+
+typedef void (*srat_region_handler_t)(struct acpi_srat_mem_affinity *mem,
+ struct srat_parse_ctx *ctx);
+
+/*
+ * parse_acpi_srat_regions - Loop through ACPI SRAT tables to process
+ * hotpluggable memory regions via a custom callback handler.
+ */
+static void parse_acpi_srat_regions(srat_region_handler_t handler, struct srat_parse_ctx *ctx)
+{
+ u32 hotplug_mask = ACPI_SRAT_MEM_ENABLED | ACPI_SRAT_MEM_HOT_PLUGGABLE;
+ struct acpi_table_header *xsdt, *srat = NULL;
+ struct acpi_table_rsdp *rsdp = NULL;
+ u8 *current_ptr, *end_ptr;
+ u64 *table_pointers;
+ u32 entry_count;
+ unsigned long i;
+
+ rsdp = get_efi_config_table(ACPI_20_TABLE_GUID);
+
+ if (!rsdp || !ACPI_VALIDATE_RSDP_SIG(rsdp->signature))
+ return;
+
+ xsdt = (struct acpi_table_header *)(unsigned long)rsdp->xsdt_physical_address;
+ if (!xsdt || !ACPI_COMPARE_NAMESEG(xsdt->signature, ACPI_SIG_XSDT))
+ return;
+
+ if (xsdt->length < sizeof(struct acpi_table_header) + ACPI_XSDT_ENTRY_SIZE)
+ return;
+
+ entry_count = (xsdt->length - sizeof(struct acpi_table_header)) / ACPI_XSDT_ENTRY_SIZE;
+ table_pointers = (u64 *)((u8 *)xsdt + sizeof(struct acpi_table_header));
+
+ for (i = 0; i < entry_count; i++) {
+ struct acpi_table_header *tbl;
+
+ tbl = (struct acpi_table_header *)(unsigned long)table_pointers[i];
+ if (tbl && ACPI_COMPARE_NAMESEG(tbl->signature, ACPI_SIG_SRAT)) {
+ srat = tbl;
+ break;
+ }
+ }
+
+ if (!srat)
+ return;
+
+ current_ptr = (u8 *)srat + sizeof(struct acpi_table_srat);
+ end_ptr = (u8 *)srat + srat->length;
+
+ while (current_ptr < end_ptr) {
+ struct acpi_subtable_header *sub_header;
+ u64 range_end;
+
+ sub_header = (struct acpi_subtable_header *)current_ptr;
+ if (sub_header->length == 0)
+ break;
+
+ if (sub_header->type == ACPI_SRAT_TYPE_MEMORY_AFFINITY &&
+ sub_header->length >= sizeof(struct acpi_srat_mem_affinity)) {
+ struct acpi_srat_mem_affinity *mem;
+
+ mem = (struct acpi_srat_mem_affinity *)current_ptr;
+ if ((mem->flags & hotplug_mask) == hotplug_mask &&
+ !check_add_overflow(mem->base_address, mem->length, &range_end))
+ handler(mem, ctx);
+ }
+ current_ptr += sub_header->length;
+ }
+}
+
+static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct srat_parse_ctx *ctx)
+{
+ u64 range_end = mem->base_address + mem->length;
+
+ if (mem->base_address < *(ctx->mem_start))
+ *(ctx->mem_start) = mem->base_address;
+
+ if (range_end > *(ctx->mem_end))
+ *(ctx->mem_end) = range_end;
+}
+
efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
struct efi_boot_memmap *map)
{
efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+ struct srat_parse_ctx ctx;
efi_status_t status;
int i;
+ if (!early_is_tdx_guest() && !early_is_sevsnp_guest())
+ return EFI_SUCCESS;
+
/* Check if the table is already installed */
unaccepted_table = get_efi_config_table(unaccepted_table_guid);
if (unaccepted_table) {
@@ -38,6 +128,10 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
d->phys_addr + d->num_pages * PAGE_SIZE);
}
+ ctx.mem_start = &unaccepted_start;
+ ctx.mem_end = &unaccepted_end;
+ parse_acpi_srat_regions(update_mem_boundaries, &ctx);
+
if (unaccepted_start == ULLONG_MAX)
return EFI_SUCCESS;
--
2.52.0
^ permalink raw reply related
* [RFCv2 PATCH 0/6] Support memory hotplug/unplug for TDX CoCo guests
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
This RFCv2 series implements comprehensive support for virtio-mem and ACPI
DIMM memory hotplug/unplug in Intel TDX confidential computing guests.
It explores the start-private memory approach utilizing the native
TDG.MEM.PAGE.RELEASE API.
We are seeking feedback from Kiryl on the CoCo guest implementation, from MM
experts on DIMM and virtio-mem memory hotplug integration, and from the broader
virtio/CoCo community on the overall approach. We are not seeking
x86 maintainer review at this stage.
== Changes from RFC v1 ==
- Eliminated callback infrastructure: Dropped the plug callback and replaced
the unplug callback with a platform-level unaccept function integrated into
the core MM hotplug and virtio-mem subsystems.
- Added comprehensive bitmap tracking: Introduced a "plugged" bitmap
alongside the unaccepted bitmap to track populated hotplug memory
states to support load_unaligned_zeropad().
- Enhanced SRAT parsing: Extended the EFI stub to parse ACPI SRAT tables
early, ensuring hotpluggable ranges are tracked from initial boot.
For more background and a summary of related community efforts,
see the RFCv1 cover letter [1].
== Technical Approach ==
- Early SRAT Integration: A lightweight EFI stub parser scans ACPI SRAT
tables to identify hotpluggable ranges and adjust bitmap boundaries
early, avoiding the overhead of the full ACPI subsystem.
- Comprehensive Bitmap Tracking: Introduces a "plugged" bitmap right
after the unaccepted bitmap. Both static and hotplugged memory are
tracked, allowing the guest to map which ranges are populated by the
VMM. This prevents acceptance beyond plugged memory boundaries due to
load_unaligned_zeropad() operations.
- Platform Extensibility: Exposes generic CoCo memory interfaces. Other
confidential platforms (like AMD SEV-SNP) can easily adopt this by
hooking their specific mechanisms into arch_unaccept_memory().
- Hotplug & Guest Control: Integrates platform-level unaccept logic
into ACPI hotplug and virtio-mem handlers. Uses TDG.MEM.PAGE.RELEASE
for TDX to explicitly set memory to the "unaccepted" state during
unplug, removing host hole-punching dependencies.
- Kexec Handover: Leverages existing EFI mechanisms to seamlessly hand
over both the extended unaccepted bitmap and the new plugged bitmap
across kexec boundaries.
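The unplug path in patch 4/6 releases memory largest-page-first and reports how far it got, so that the caller can re-accept the released prefix on failure. A rough userspace sketch of that loop follows. The release hook is a stand-in for the TDG.MEM.PAGE.RELEASE call; release_range(), stub_release(), sketch_calls and sketch_fail_from are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SZ_4K (1ULL << 12)
#define SKETCH_SZ_2M (1ULL << 21)
#define SKETCH_SZ_1G (1ULL << 30)

/* Hypothetical platform hook: returns 0 on success, non-zero on failure. */
typedef int (*release_fn)(uint64_t addr, uint64_t size);

/* Stub hook for demonstration: counts calls, fails at or above a limit. */
unsigned int sketch_calls;
uint64_t sketch_fail_from = UINT64_MAX;

int stub_release(uint64_t addr, uint64_t size)
{
	(void)size;
	sketch_calls++;
	return addr >= sketch_fail_from ? -1 : 0;
}

/*
 * Release [start, end), trying the largest aligned page size first.
 * Returns the address reached: equal to end on full success, otherwise
 * the caller would re-accept [start, returned address).
 */
uint64_t release_range(uint64_t start, uint64_t end, release_fn release)
{
	static const uint64_t sizes[] = {
		SKETCH_SZ_1G, SKETCH_SZ_2M, SKETCH_SZ_4K
	};
	uint64_t cur = start;

	assert(start <= end);

	while (cur < end) {
		uint64_t done = 0;
		unsigned int i;

		for (i = 0; i < 3 && !done; i++) {
			uint64_t sz = sizes[i];

			/* Skip sizes that are misaligned or too large. */
			if ((cur & (sz - 1)) || end - cur < sz)
				continue;
			if (release(cur, sz) == 0)
				done = sz;
		}
		if (!done)
			break;	/* every applicable size failed */
		cur += done;
	}

	return cur;
}
```

Returning the address reached, rather than a plain boolean, mirrors the role of the cur out-parameter in tdx_release_memory(): it tells the caller exactly which prefix needs to be restored.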
== Testing ==
- DIMM and virtio-mem memory hotplug/unplug
- lazy and eager accept
- kexec/kdump with hotplugged memory
This was tested with Marc-André Lureau's latest QEMU series [2].
Comments appreciated, thanks.
Zhenzhong
[1] https://lore.kernel.org/all/20260604093551.1511079-1-zhenzhong.duan@intel.com/
[2] https://lore.kernel.org/all/20260604-rdm5-v5-0-5768e6a0943d@redhat.com/
Zhenzhong Duan (6):
efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
efi/unaccepted: Set unaccepted bits for all hotplug memory
efi/unaccepted: Create plugged bitmap to support hotplug memory in
coco guest
x86/tdx: Implement arch_unaccept_memory()
mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
virtio-mem: Support memory hotplug/unplug for coco guest
arch/x86/include/asm/shared/tdx.h | 2 +
arch/x86/include/asm/tdx.h | 2 +
arch/x86/include/asm/unaccepted_memory.h | 11 ++
drivers/firmware/efi/libstub/efistub.h | 6 +
include/linux/efi.h | 5 +
include/linux/mm.h | 11 ++
arch/x86/boot/compressed/mem.c | 4 +-
arch/x86/coco/tdx/tdx.c | 120 ++++++++++++++++
drivers/firmware/efi/efi.c | 4 +-
.../firmware/efi/libstub/unaccepted_memory.c | 128 +++++++++++++++++-
drivers/firmware/efi/unaccepted_memory.c | 122 ++++++++++++++++-
drivers/virtio/virtio_mem.c | 8 ++
mm/memory_hotplug.c | 16 +++
13 files changed, 425 insertions(+), 14 deletions(-)
--
2.52.0
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Binbin Wu @ 2026-06-23 9:48 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>
On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> next = start;
> while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
>
> - for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + for (i = 0; i < folio_batch_count(&fbatch);) {
> struct folio *folio = fbatch.folios[i];
>
> - if (folio_ref_count(folio) !=
> - folio_nr_pages(folio) + filemap_get_folios_refcount) {
> - safe = false;
> + safe = (folio_ref_count(folio) ==
> + folio_nr_pages(folio) +
> + filemap_get_folios_refcount);
> +
> + if (safe) {
> + ++i;
> + } else if (folio_may_be_lru_cached(folio) &&
> + !lru_drained) {
> + lru_add_drain_all();
It seems unprivileged userspace can trigger lru_add_drain_all() repeatedly
by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop. Could that pose a DoS risk?
> + lru_drained = true;
> + } else {
> *err_index = max(start, folio->index);
> break;
> }
>
^ permalink raw reply