Linux Confidential Computing Development


* [PATCH v8 20/21] coco/tdx-host: Document TDX module update compatibility criteria
From: Chao Gao @ 2026-04-27 15:28 UTC
  To: kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Chao Gao, Dan Williams
In-Reply-To: <20260427152854.101171-1-chao.gao@intel.com>

The TDX module update protocol facilitates compatible runtime updates.

Document the compatibility criteria and indicators of update failures.

Note that runtime TDX module updates are an "update at your own risk"
operation; userspace is responsible for ensuring that the update meets
the compatibility criteria.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
v8:
 - Do not map -EIO and -ENOMEM to separate fw_upload errors. There is no
   current need to distinguish them in the userspace ABI, and fw_upload
   has no matching error code for -ENOMEM.
 - some wording changes.
---
 .../ABI/testing/sysfs-devices-faux-tdx-host   | 39 +++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-faux-tdx-host b/Documentation/ABI/testing/sysfs-devices-faux-tdx-host
index 65897fe6abc0..ff585c79aa6e 100644
--- a/Documentation/ABI/testing/sysfs-devices-faux-tdx-host
+++ b/Documentation/ABI/testing/sysfs-devices-faux-tdx-host
@@ -26,3 +26,42 @@ Description:	(RO) Report the number of remaining updates. TDX maintains a
 		See Intel® Trust Domain Extensions - SEAM Loader (SEAMLDR)
 		Interface Specification, Chapter "SEAMLDR_INFO" and Chapter
 		"SEAMLDR.INSTALL" for more information.
+
+What:		/sys/devices/faux/tdx_host/firmware/tdx_module
+Contact:	linux-coco@lists.linux.dev
+Description:	(Directory) The tdx_module directory implements the fw_upload
+		sysfs ABI, see Documentation/ABI/testing/sysfs-class-firmware
+		for the general description of the attributes @data, @cancel,
+		@error, @loading, @remaining_size, and @status. This ABI
+		facilitates "Compatible TDX module Updates". A compatible update
+		is one that meets the following criteria:
+
+		   Does not interrupt or interfere with any current TDX
+		   operation or TD VM.
+
+		   Does not invalidate any previously consumed module metadata
+		   values outside of the TEE_TCB_SVN_2 field (updated Security
+		   Version Number) in TD Quotes.
+
+		   Does not require validation of new module metadata fields. By
+		   implication, new module features and capabilities are only
+		   available by installing the module at reboot (BIOS or EFI
+		   helper loaded).
+
+		See tdx_host/firmware/tdx_module/error for information on
+		update failure indicators.
+
+What:		/sys/devices/faux/tdx_host/firmware/tdx_module/error
+Contact:	linux-coco@lists.linux.dev
+Description:	(RO) See Documentation/ABI/testing/sysfs-class-firmware for
+		baseline expectations for this file. The <ERROR> part in the
+		<STATUS>:<ERROR> format can be:
+
+		   "device-busy": The update conflicts with an in-progress TDX
+		   operation.
+
+		   "firmware-invalid": The update failed for any other reason.
+
+		A "firmware-invalid" result may be fatal. If the TDX module is
+		lost, further TDX operation is not possible, and reading
+		/sys/devices/faux/tdx_host/version returns -ENXIO.
-- 
2.47.1
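
A userspace tool that reads the error attribute documented above might
classify the result roughly as follows. This is a minimal sketch, assuming
the attribute contents have already been read into a string; the enum and
function names are hypothetical and are not part of any kernel or tool API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of the <ERROR> part; not a kernel API. */
enum tdx_update_result {
	TDX_UPDATE_BUSY,	/* "device-busy": conflicts with an in-progress TDX operation */
	TDX_UPDATE_FAILED,	/* "firmware-invalid": any other failure, possibly fatal */
	TDX_UPDATE_UNKNOWN,	/* malformed or unrecognized string */
};

/*
 * Classify the contents of the "error" attribute, which has the form
 * "<STATUS>:<ERROR>", optionally followed by a newline.
 */
enum tdx_update_result tdx_classify_error(const char *buf)
{
	const char *err;
	size_t len;

	assert(buf);

	/* Skip the <STATUS> part up to and including the colon. */
	err = strchr(buf, ':');
	if (!err)
		return TDX_UPDATE_UNKNOWN;
	err++;

	/* Compare the <ERROR> part without any trailing newline. */
	len = strcspn(err, "\n");
	if (len == strlen("device-busy") && !strncmp(err, "device-busy", len))
		return TDX_UPDATE_BUSY;
	if (len == strlen("firmware-invalid") && !strncmp(err, "firmware-invalid", len))
		return TDX_UPDATE_FAILED;
	return TDX_UPDATE_UNKNOWN;
}
```

Since a "firmware-invalid" result may be fatal, a caller would likely follow
up by reading /sys/devices/faux/tdx_host/version to check whether the TDX
module is still present.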



* [PATCH v8 21/21] x86/virt/tdx: Document TDX module update
From: Chao Gao @ 2026-04-27 15:28 UTC
  To: kvm, linux-coco, linux-kernel, x86, linux-doc
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Chao Gao, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <20260427152854.101171-1-chao.gao@intel.com>

Document TDX module update as a subsection of "TDX Host Kernel Support" to
provide background information and cover key points that developers and
users may need to know, for example:

 - update is done in stop_machine() context
 - update instructions and results
 - update policy and tooling

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 Documentation/arch/x86/tdx.rst | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index ff6b110291bc..de6b99d9afa7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -73,6 +73,42 @@ initialize::
 
   [..] virt/tdx: TDX-Module initialization failed ...
 
+TDX module Runtime Update
+-------------------------
+
+The TDX architecture includes a persistent SEAM loader (P-SEAMLDR) that
+runs in SEAM mode separately from the TDX module. The kernel can
+communicate with P-SEAMLDR to perform runtime updates of the TDX module.
+
+During update, the TDX module becomes unresponsive to other TDX operations.
+To prevent components using TDX (such as KVM) from experiencing unexpected
+errors during updates, updates are performed in stop_machine() context.
+
+TDX module update has complex compatibility requirements; the new module
+must be compatible with the current CPU, P-SEAMLDR, and running TDX module.
+Rather than implementing complex module selection and policy enforcement
+logic in the kernel, userspace is responsible for auditing and selecting
+appropriate updates.
+
+Updates use the standard firmware upload interface. See
+Documentation/driver-api/firmware/fw_upload.rst for detailed instructions.
+
+Successful updates are logged in dmesg:
+  [..] virt/tdx: version 1.5.20 -> 1.5.24
+
+If updates failed, running TDs may be killed and further TDX operations may
+not be possible until reboot. For detailed error information, see
+Documentation/ABI/testing/sysfs-devices-faux-tdx-host.
+
+Given the risk of losing existing TDs, userspace should verify that the
+update is compatible with the current system and properly validated before
+applying it.
+
+A reference userspace tool that implements necessary checks is available
+at:
+
+  https://github.com/intel/tdx-module-binaries
+
 TDX Interaction to Other Kernel Components
 ------------------------------------------
 
-- 
2.47.1



* Re: [PATCH v8 01/21] x86/virt/tdx: Move low level SEAMCALL helpers out of <asm/tdx.h>
From: Vishal Annapurve @ 2026-04-27 18:12 UTC
  To: Chao Gao
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao, Zhenzhong Duan,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-2-chao.gao@intel.com>

On Mon, Apr 27, 2026 at 8:30 AM Chao Gao <chao.gao@intel.com> wrote:
>
> From: Kai Huang <kai.huang@intel.com>
>
> TDX host core code implements three seamcall*() helpers to make SEAMCALLs
> to the TDX module.  Currently, they are implemented in <asm/tdx.h> and
> are exposed to other kernel code which includes <asm/tdx.h>.
>
> However, other than the TDX host core, seamcall*() are not expected to
> be used by other kernel code directly.  For instance, for all SEAMCALLs
> that are used by KVM, the TDX host core exports a wrapper function for
> each of them.
>
> Move seamcall*() and related code out of <asm/tdx.h> and make them only
> visible to TDX host core.
>
> Since TDX host core tdx.c is already very heavy, don't put low level
> seamcall*() code there but in a new dedicated "seamcall_internal.h".  Also,
> currently tdx.c has seamcall_prerr*() helpers which additionally print
> an error message when calling seamcall*() fails.  Move them to
> "seamcall_internal.h" as well. This way all low level SEAMCALL helpers
> are in a dedicated place, which is much more readable.
>
> Copy the copyright notice from the original files and consolidate the
> date ranges to:
>
>         Copyright (C) 2021-2023 Intel Corporation
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Vishal Annapurve <vannapurve@google.com>


* Re: [PATCH v4 2/3] swiotlb: dma: its: Enforce host page-size alignment for shared buffers
From: Aneesh Kumar K.V @ 2026-04-28 12:20 UTC
  To: Marc Zyngier
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marek Szyprowski, Robin Murphy,
	Steven Price, Suzuki K Poulose, Thomas Gleixner, Will Deacon
In-Reply-To: <86zf2ozrb8.wl-maz@kernel.org>

Marc Zyngier <maz@kernel.org> writes:

> On Mon, 27 Apr 2026 07:31:07 +0100,
> "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
>> 
>> When running private-memory guests, the guest kernel must apply additional
>> constraints when allocating buffers that are shared with the hypervisor.
>> 
>> These shared buffers are also accessed by the host kernel and therefore
>> must be aligned to the host’s page size, and have a size that is a multiple
>> of the host page size.
>> 
>> On non-secure hosts, set_guest_memory_attributes() tracks memory at the
>> host PAGE_SIZE granularity. This creates a mismatch when the guest applies
>> attributes at 4K boundaries while the host uses 64K pages. In such cases,
>> set_guest_memory_attributes() call returns -EINVAL, preventing the
>> conversion of memory regions from private to shared.
>> 
>> Architectures such as Arm can tolerate realm physical address space
>> (protected memory) PFNs being mapped as shared memory, as incorrect
>> accesses are detected and reported as GPC faults. However, relying on this
>> mechanism is unsafe and can still lead to kernel crashes.
>> 
>> This is particularly likely when guest_memfd allocations are mmapped and
>> accessed from userspace. Once exposed to userspace, we cannot guarantee
>> that applications will only access the intended 4K shared region rather
>> than the full 64K page mapped into their address space. Such userspace
>> addresses may also be passed back into the kernel and accessed via the
>> linear map, resulting in a GPC fault and a kernel crash.
>> 
>> With CCA, although Stage-2 mappings managed by the RMM still operate at a
>> 4K granularity, shared pages must nonetheless be aligned to the
>> host-managed page size and sized as whole host pages to avoid the issues
>> described above.
>
> I thought that was being fixed, and that there was now a strong
> guarantee that RMM and host are aligned on the page size. Even more,
> S2 is totally irrelevant here. The only thing that matters is the host
> page size vs the guest page size. Nothing else.
>

Yes, the latest RMM update includes the ability to change the granule
size.

The section above in the commit message was intended to explain that the
S2 mapping size is irrelevant. I agree it is not clear as written, so I
will reword it to improve clarity.

>
>> 
>> Introduce a new helper, mem_decrypt_align(), to allow callers to enforce
>> the required alignment and size constraints for shared buffers.
>> 
>> The architecture-specific implementation of mem_decrypt_align() will be
>> provided in a follow-up patch.
>> 
>> Note on restricted-dma-pool:
>> rmem_swiotlb_device_init() uses reserved-memory regions described by
>> firmware. Those regions are not changed in-kernel to satisfy host granule
>> alignment. This is intentional: we do not expect restricted-dma-pool
>> allocations to be used with CCA. If restricted-dma-pool is intended for CCA
>> shared use, firmware must provide base/size aligned to the host IPA-change
>> granule.
>> 
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>  arch/arm64/mm/mem_encrypt.c      | 19 +++++++++++++++----
>>  drivers/irqchip/irq-gic-v3-its.c | 20 +++++++++++++-------
>>  include/linux/mem_encrypt.h      | 14 ++++++++++++++
>>  kernel/dma/contiguous.c          | 10 ++++++++++
>>  kernel/dma/direct.c              | 16 ++++++++++++++--
>>  kernel/dma/pool.c                |  4 +++-
>>  kernel/dma/swiotlb.c             | 21 +++++++++++++--------
>>  7 files changed, 82 insertions(+), 22 deletions(-)
>> 
>
> [...]
>
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>> index 291d7668cc8d..239d7e3bc16f 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -213,16 +213,17 @@ static gfp_t gfp_flags_quirk;
>>  static struct page *its_alloc_pages_node(int node, gfp_t gfp,
>>  					 unsigned int order)
>>  {
>> +	unsigned int new_order;
>>  	struct page *page;
>>  	int ret = 0;
>>  
>> -	page = alloc_pages_node(node, gfp | gfp_flags_quirk, order);
>> -
>> +	new_order = get_order(mem_decrypt_align((PAGE_SIZE << order)));
>> +	page = alloc_pages_node(node, gfp | gfp_flags_quirk, new_order);
>>  	if (!page)
>>  		return NULL;
>>  
>>  	ret = set_memory_decrypted((unsigned long)page_address(page),
>> -				   1 << order);
>> +				   1 << new_order);
>>  	/*
>>  	 * If set_memory_decrypted() fails then we don't know what state the
>>  	 * page is in, so we can't free it. Instead we leak it.
>> @@ -241,13 +242,16 @@ static struct page *its_alloc_pages(gfp_t gfp, unsigned int order)
>>  
>>  static void its_free_pages(void *addr, unsigned int order)
>>  {
>> +	int new_order;
>> +
>> +	new_order = get_order(mem_decrypt_align((PAGE_SIZE << order)));
>>  	/*
>>  	 * If the memory cannot be encrypted again then we must leak the pages.
>>  	 * set_memory_encrypted() will already have WARNed.
>>  	 */
>> -	if (set_memory_encrypted((unsigned long)addr, 1 << order))
>> +	if (set_memory_encrypted((unsigned long)addr, 1 << new_order))
>>  		return;
>> -	free_pages((unsigned long)addr, order);
>> +	free_pages((unsigned long)addr, new_order);
>>  }
>>
>
> Here's the non-obfuscated version of the two hunks above (and let it
> be on the record that New Order is a terrible, overrated band):
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 291d7668cc8da..a4d555aaee241 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -216,6 +216,7 @@ static struct page *its_alloc_pages_node(int node, gfp_t gfp,
>  	struct page *page;
>  	int ret = 0;
>  
> +	order = get_order(mem_decrypt_align(PAGE_SIZE << order));
>  	page = alloc_pages_node(node, gfp | gfp_flags_quirk, order);
>  
>  	if (!page)
> @@ -245,6 +246,7 @@ static void its_free_pages(void *addr, unsigned int order)
>  	 * If the memory cannot be encrypted again then we must leak the pages.
>  	 * set_memory_encrypted() will already have WARNed.
>  	 */
> +	order = get_order(mem_decrypt_align(PAGE_SIZE << order));
>  	if (set_memory_encrypted((unsigned long)addr, 1 << order))
>  		return;
>  	free_pages((unsigned long)addr, order);
>

I will include this in the next revision.
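
To illustrate the effect of the suggested one-liner, here is a small
userspace model of the order computation. It is only a sketch: it assumes
4K guest pages, and the model_* names are hypothetical stand-ins for
get_order() and mem_decrypt_align(), not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4K guest pages */
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/* Round size up to a multiple of granule (granule must be a power of two). */
size_t model_decrypt_align(size_t size, size_t granule)
{
	assert(granule && !(granule & (granule - 1)));
	return (size + granule - 1) & ~(granule - 1);
}

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= size. */
unsigned int model_get_order(size_t size)
{
	unsigned int order = 0;

	while ((MODEL_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Mirrors: order = get_order(mem_decrypt_align(PAGE_SIZE << order)); */
unsigned int model_shared_order(unsigned int order, size_t granule)
{
	return model_get_order(model_decrypt_align(MODEL_PAGE_SIZE << order,
						   granule));
}
```

With a 64K granule, an order-0 request is raised to order 4 (one full 64K
host page), while a request that is already a multiple of the granule keeps
its order. Because the allocation and free paths apply the same expression
to the same input, they arrive at the same order.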


>>  static struct gen_pool *itt_pool;
>> @@ -268,11 +272,13 @@ static void *itt_alloc_pool(int node, int size)
>>  		if (addr)
>>  			break;
>>  
>> -		page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
>> +		page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
>> +					    get_order(mem_decrypt_granule_size()));
>
> You already taught its_alloc_pages_node() about the decrypt granule
> size stuff. I don't think we need to see more of it (and you don't
> mess with the call that is just above it).
>
>>  		if (!page)
>>  			break;
>>  
>> -		gen_pool_add(itt_pool, (unsigned long)page_address(page), PAGE_SIZE, node);
>> +		gen_pool_add(itt_pool, (unsigned long)page_address(page),
>> +			     mem_decrypt_granule_size(), node);
>
> I'd rather see something like mem_decrypt_align(PAGE_SIZE), which
> keeps the intent clear.
>

The helper was added based on feedback from a previous version. I assume
you are suggesting that only this caller should switch?


-aneesh


* Re: [PATCH v4 2/3] swiotlb: dma: its: Enforce host page-size alignment for shared buffers
From: Aneesh Kumar K.V @ 2026-04-28 12:22 UTC
  To: Jason Gunthorpe
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Marc Zyngier, Marek Szyprowski, Robin Murphy,
	Steven Price, Suzuki K Poulose, Thomas Gleixner, Will Deacon
In-Reply-To: <20260427134903.GA740385@ziepe.ca>

Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Mon, Apr 27, 2026 at 12:01:07PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> When running private-memory guests, the guest kernel must apply additional
>> constraints when allocating buffers that are shared with the hypervisor.
>
> This patch has way too much stuff in it.
>
> I think your patch structure should be changed around
>
> 1) Patch to add mem_decrypt_granule_size(), and explain it as
>    the alignment & size of what can be passed to
>    set_memory_encrypted/decrypted()
>
> 2) Add support for mem_decrypt_granule_size() to ARM
>
> Then patches going caller by caller of set_memory_decrypted() to make
> them follow the new rule:
>
> 3) its
>
> 4) swiotlb
>
> 5) dma_alloc_coherent
>
> etc.
>
> don't forget about the new dma buf heaps too:
>
> drivers/dma-buf/heaps/system_heap.c:    ret = set_memory_decrypted(addr, nr_pages);
>
> It is worth calling out in the cover letter that all the ARM CCA
> relevant places are fixed but drivers/hv/ is left for future.
>
>> @@ -33,18 +32,30 @@ int arm64_mem_crypt_ops_register(const struct arm64_mem_crypt_ops *ops)
>>  
>>  int set_memory_encrypted(unsigned long addr, int numpages)
>>  {
>> -	if (likely(!crypt_ops) || WARN_ON(!PAGE_ALIGNED(addr)))
>> +	if (likely(!crypt_ops))
>>  		return 0;
>>  
>> +	if (WARN_ON(!IS_ALIGNED(addr, mem_decrypt_granule_size())))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!IS_ALIGNED(numpages << PAGE_SHIFT, mem_decrypt_granule_size())))
>> +		return -EINVAL;
>> +
>>  	return crypt_ops->encrypt(addr, numpages);
>>  }
>>  EXPORT_SYMBOL_GPL(set_memory_encrypted);
>>  
>>  int set_memory_decrypted(unsigned long addr, int numpages)
>>  {
>> -	if (likely(!crypt_ops) || WARN_ON(!PAGE_ALIGNED(addr)))
>> +	if (likely(!crypt_ops))
>>  		return 0;
>>  
>> +	if (WARN_ON(!IS_ALIGNED(addr, mem_decrypt_granule_size())))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!IS_ALIGNED(numpages << PAGE_SHIFT, mem_decrypt_granule_size())))
>> +		return -EINVAL;
>> +
>>  	return crypt_ops->decrypt(addr, numpages);
>>  }
>>  EXPORT_SYMBOL_GPL(set_memory_decrypted);
>
> This should go in the ARM patch adding mem_decrypt_granule_size() to CCA
>
>> diff --git a/include/linux/mem_encrypt.h b/include/linux/mem_encrypt.h
>> index 07584c5e36fb..1e01c9ac697f 100644
>> --- a/include/linux/mem_encrypt.h
>> +++ b/include/linux/mem_encrypt.h
>> @@ -11,6 +11,8 @@
>>  #define __MEM_ENCRYPT_H__
>>  
>>  #ifndef __ASSEMBLY__
>> +#include <linux/align.h>
>> +#include <vdso/page.h>
>>  
>>  #ifdef CONFIG_ARCH_HAS_MEM_ENCRYPT
>>  
>> @@ -54,6 +56,18 @@
>>  #define dma_addr_canonical(x)		(x)
>>  #endif
>>  
>> +#ifndef mem_decrypt_granule_size
>> +static inline size_t mem_decrypt_granule_size(void)
>> +{
>> +	return PAGE_SIZE;
>> +}
>> +#endif
>> +
>> +static inline size_t mem_decrypt_align(size_t size)
>> +{
>> +	return ALIGN(size, mem_decrypt_granule_size());
>> +}
>> +
>>  #endif	/* __ASSEMBLY__ */
>>  
>>  #endif	/* __MEM_ENCRYPT_H__ */
>
> I know it seems a bit small, but put this in its own patch and explain
> how it works. I'd also like to see a kdoc here, and add a kdoc to
> set_memory_decrypted() that links back so people have a better chance
> to know about this.
>

Okay, I’ll update all the above in the next revision.

-aneesh


* Re: [PATCH v4 3/3] coco: guest: arm64: Query host IPA-change alignment via RHI
From: Aneesh Kumar K.V @ 2026-04-28 12:49 UTC
  To: Marc Zyngier
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marek Szyprowski, Robin Murphy,
	Steven Price, Suzuki K Poulose, Thomas Gleixner, Will Deacon
In-Reply-To: <86y0i8zo9f.wl-maz@kernel.org>

Marc Zyngier <maz@kernel.org> writes:

> On Mon, 27 Apr 2026 07:31:08 +0100,
> "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
>> 
>> Add the Realm Host Interface support needed to query host configuration
>> from a Realm guest. Define the RHI hostconf SMCs, add rsi_host_call(), and
>> use them during Realm initialization to retrieve the host IPA-change
>> alignment size.
>
> I don't understand what "IPA-change" means. What you are after is the
> host's sharing granule size.
>

This is part of the RHI specification, and the call is named
RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT. The intent is to determine the
alignment requirements for changing IPA attributes (protected vs.
unprotected IPA).

>
>> 
>> Expose that alignment through realm_get_hyp_pagesize() and
>> mem_decrypt_granule_size() so shared-buffer allocation and
>> encryption/decryption paths can honor the ipa change page-size requirement.
>> 
>> If the host reports an invalid alignment (when the alignment value is not
>> a multiple of 4K), do not enable Realm support.
>> 
>> This provides the host alignment information required by the shared buffer
>> alignment changes.
>> 
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>  arch/arm64/include/asm/mem_encrypt.h |  3 ++
>>  arch/arm64/include/asm/rhi.h         | 24 +++++++++++++
>>  arch/arm64/include/asm/rsi.h         |  2 ++
>>  arch/arm64/include/asm/rsi_cmds.h    | 10 ++++++
>>  arch/arm64/include/asm/rsi_smc.h     |  7 ++++
>>  arch/arm64/kernel/Makefile           |  2 +-
>>  arch/arm64/kernel/rhi.c              | 54 ++++++++++++++++++++++++++++
>>  arch/arm64/kernel/rsi.c              | 13 +++++++
>>  arch/arm64/mm/mem_encrypt.c          |  8 +++++
>>  9 files changed, 122 insertions(+), 1 deletion(-)
>>  create mode 100644 arch/arm64/include/asm/rhi.h
>>  create mode 100644 arch/arm64/kernel/rhi.c
>> 
>> diff --git a/arch/arm64/include/asm/mem_encrypt.h b/arch/arm64/include/asm/mem_encrypt.h
>> index 314b2b52025f..5541911eb028 100644
>> --- a/arch/arm64/include/asm/mem_encrypt.h
>> +++ b/arch/arm64/include/asm/mem_encrypt.h
>> @@ -16,6 +16,9 @@ int arm64_mem_crypt_ops_register(const struct arm64_mem_crypt_ops *ops);
>>  int set_memory_encrypted(unsigned long addr, int numpages);
>>  int set_memory_decrypted(unsigned long addr, int numpages);
>>  
>> +#define mem_decrypt_granule_size mem_decrypt_granule_size
>> +size_t mem_decrypt_granule_size(void);
>> +
>>  int realm_register_memory_enc_ops(void);
>>  
>>  static inline bool force_dma_unencrypted(struct device *dev)
>> diff --git a/arch/arm64/include/asm/rhi.h b/arch/arm64/include/asm/rhi.h
>> new file mode 100644
>> index 000000000000..0895dd92ea1d
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/rhi.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * Copyright (C) 2026 ARM Ltd.
>> + */
>> +
>> +#ifndef __ASM_RHI_H_
>> +#define __ASM_RHI_H_
>> +
>> +#include <linux/types.h>
>> +
>> +#define SMC_RHI_CALL(func)				\
>> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,		\
>> +			   ARM_SMCCC_SMC_64,		\
>> +			   ARM_SMCCC_OWNER_STANDARD_HYP,\
>> +			   (func))
>> +
>> +unsigned long rhi_get_ipa_change_alignment(void);
>> +#define RHI_HOSTCONF_VER_1_0		0x10000
>> +#define RHI_HOSTCONF_VERSION		SMC_RHI_CALL(0x004E)
>> +
>> +#define __RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT BIT(0)
>> +#define RHI_HOSTCONF_FEATURES		SMC_RHI_CALL(0x004F)
>> +#define RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT	SMC_RHI_CALL(0x0050)
>> +#endif
>> diff --git a/arch/arm64/include/asm/rsi.h b/arch/arm64/include/asm/rsi.h
>> index 88b50d660e85..ae54fb3b1429 100644
>> --- a/arch/arm64/include/asm/rsi.h
>> +++ b/arch/arm64/include/asm/rsi.h
>> @@ -67,4 +67,6 @@ static inline int rsi_set_memory_range_shared(phys_addr_t start,
>>  	return rsi_set_memory_range(start, end, RSI_RIPAS_EMPTY,
>>  				    RSI_CHANGE_DESTROYED);
>>  }
>> +
>> +unsigned long realm_get_hyp_pagesize(void);
>>  #endif /* __ASM_RSI_H_ */
>> diff --git a/arch/arm64/include/asm/rsi_cmds.h b/arch/arm64/include/asm/rsi_cmds.h
>> index 2c8763876dfb..a341ce0eeda1 100644
>> --- a/arch/arm64/include/asm/rsi_cmds.h
>> +++ b/arch/arm64/include/asm/rsi_cmds.h
>> @@ -159,4 +159,14 @@ static inline unsigned long rsi_attestation_token_continue(phys_addr_t granule,
>>  	return res.a0;
>>  }
>>  
>> +static inline unsigned long rsi_host_call(struct rsi_host_call *rhi_call)
>> +{
>> +	phys_addr_t addr = virt_to_phys(rhi_call);
>> +	struct arm_smccc_res res;
>> +
>> +	arm_smccc_1_1_invoke(SMC_RSI_HOST_CALL, addr, &res);
>
> Errr... What guarantees that *rhi_call is *IPA contiguous*? This is
> incredibly fragile. You should at the very least check that this isn't
> vmalloc'd.
>


I didn’t quite follow that. We have other RSI calls (even RMI calls)
that do similar things, and the caller understands that the address
should be IPA-contiguous. Are you suggesting that all RSI calls should
add checks for this? Or are you suggesting updating the API to

unsigned long rsi_host_call(unsigned long rhi_call_phys)?

>> +
>> +	return res.a0;
>> +}
>> +
>>  #endif /* __ASM_RSI_CMDS_H */
>> diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
>> index e19253f96c94..9ee8b5c7612e 100644
>> --- a/arch/arm64/include/asm/rsi_smc.h
>> +++ b/arch/arm64/include/asm/rsi_smc.h
>> @@ -182,6 +182,13 @@ struct realm_config {
>>   */
>>  #define SMC_RSI_IPA_STATE_GET			SMC_RSI_FID(0x198)
>>  
>> +struct rsi_host_call {
>> +	union {
>> +		u16 imm;
>> +		u64 padding0;
>> +	};
>> +	u64 gprs[31];
>> +} __aligned(0x100);
>>  /*
>>   * Make a Host call.
>>   *
>> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
>> index fe627100d199..3e72dd9584ed 100644
>> --- a/arch/arm64/kernel/Makefile
>> +++ b/arch/arm64/kernel/Makefile
>> @@ -34,7 +34,7 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
>>  			   cpufeature.o alternative.o cacheinfo.o		\
>>  			   smp.o smp_spin_table.o topology.o smccc-call.o	\
>>  			   syscall.o proton-pack.o idle.o patching.o pi/	\
>> -			   rsi.o jump_label.o
>> +			   rsi.o jump_label.o rhi.o
>>  
>>  obj-$(CONFIG_COMPAT)			+= sys32.o signal32.o			\
>>  					   sys_compat.o
>> diff --git a/arch/arm64/kernel/rhi.c b/arch/arm64/kernel/rhi.c
>> new file mode 100644
>> index 000000000000..7cd6c5102464
>> --- /dev/null
>> +++ b/arch/arm64/kernel/rhi.c
>> @@ -0,0 +1,54 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2026 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <asm/rsi.h>
>> +#include <asm/rhi.h>
>> +
>> +/* we need an aligned rhicall for rsi_host_call. slab is not yet ready */
>> +static struct rsi_host_call hyp_pagesize_rhicall;
>
> Why the "hyp_" prefix? This has absolutely nothing to do with the
> hypervisor.
>

Sure, I will change the "hyp_" references to refer to the host.


>> +unsigned long rhi_get_ipa_change_alignment(void)
>> +{
>> +	long ret;
>> +	unsigned long ipa_change_align;
>> +
>> +	hyp_pagesize_rhicall.imm = 0;
>> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_VERSION;
>> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
>> +	if (ret != RSI_SUCCESS)
>> +		goto err_out;
>> +
>> +	if (hyp_pagesize_rhicall.gprs[0] != RHI_HOSTCONF_VER_1_0)
>> +		goto err_out;
>> +
>> +	hyp_pagesize_rhicall.imm = 0;
>> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_FEATURES;
>> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
>> +	if (ret != RSI_SUCCESS)
>> +		goto err_out;
>> +
>> +	if (!(hyp_pagesize_rhicall.gprs[0] & __RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT))
>> +		goto err_out;
>> +
>> +	hyp_pagesize_rhicall.imm = 0;
>> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT;
>> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
>> +	if (ret != RSI_SUCCESS)
>> +		goto err_out;
>> +
>> +	ipa_change_align = hyp_pagesize_rhicall.gprs[0];
>> +	/* This error needs special handling in the caller */
>> +	if (ipa_change_align & (SZ_4K - 1))
>> +		return 0;
>> +
>> +	return ipa_change_align;
>> +
>> +err_out:
>> +	/*
>> +	 * For failure condition assume host is built with 4K page size
>> +	 * and hence ipa change alignment can be guest PAGE_SIZE.
>> +	 */
>> +	return PAGE_SIZE;
>> +}
>
> Why can't this be part of rsi.c? This is an RSI call, and it should be
> part of the RSI initialisation.
>

This is an RHI call as per the specification, hence it has been added to
rhi.c.

>> diff --git a/arch/arm64/kernel/rsi.c b/arch/arm64/kernel/rsi.c
>> index 9e846ce4ef9c..ff735c04e236 100644
>> --- a/arch/arm64/kernel/rsi.c
>> +++ b/arch/arm64/kernel/rsi.c
>> @@ -14,8 +14,10 @@
>>  #include <asm/mem_encrypt.h>
>>  #include <asm/pgtable.h>
>>  #include <asm/rsi.h>
>> +#include <asm/rhi.h>
>>  
>>  static struct realm_config config;
>> +static unsigned long ipa_change_alignment = PAGE_SIZE;
>>  
>>  unsigned long prot_ns_shared;
>>  EXPORT_SYMBOL(prot_ns_shared);
>> @@ -139,6 +141,11 @@ static int realm_ioremap_hook(phys_addr_t phys, size_t size, pgprot_t *prot)
>>  	return 0;
>>  }
>>  
>> +unsigned long realm_get_hyp_pagesize(void)
>> +{
>> +	return ipa_change_alignment;
>> +}
>
> Again, this has nothing to do with the hypervisor, but the host. And
> ipa_change_alignment is still a wording I can't wrap my small head
> around.
>
>> +
>>  void __init arm64_rsi_init(void)
>>  {
>>  	if (arm_smccc_1_1_get_conduit() != SMCCC_CONDUIT_SMC)
>> @@ -147,6 +154,12 @@ void __init arm64_rsi_init(void)
>>  		return;
>>  	if (WARN_ON(rsi_get_realm_config(&config)))
>>  		return;
>> +
>> +	ipa_change_alignment = rhi_get_ipa_change_alignment();
>> +	/* If we don't get a correct alignment response, don't enable realm */
>> +	if (!ipa_change_alignment)
>> +		return;
>
> But at the same time, you override a global value with an error, and
> then paper over it in mem_decrypt_granule_size()...
>


I believe I received similar feedback on my previous version as well,
which I didn’t quite follow.

rhi_get_ipa_change_alignment() only returns an error when the host
returns a size that is not 4K-aligned. Otherwise, it returns the
host-determined size, or defaults to guest PAGE_SIZE if the RHI call
itself is not supported.

	ipa_change_align = hyp_pagesize_rhicall.gprs[0];
	/* This error needs special handling in the caller */
	if (ipa_change_align & (SZ_4K - 1))
		return 0;

	return ipa_change_align;

err_out:
	/*
	 * For failure condition assume host is built with 4K page size
	 * and hence ipa change alignment can be guest PAGE_SIZE.
	 */
	return PAGE_SIZE;

>
>> +
>>  	prot_ns_shared = __phys_to_pte_val(BIT(config.ipa_bits - 1));
>>  
>>  	if (arm64_ioremap_prot_hook_register(realm_ioremap_hook))
>> diff --git a/arch/arm64/mm/mem_encrypt.c b/arch/arm64/mm/mem_encrypt.c
>> index 38c62c9e4e74..f5d64bc29c20 100644
>> --- a/arch/arm64/mm/mem_encrypt.c
>> +++ b/arch/arm64/mm/mem_encrypt.c
>> @@ -59,3 +59,11 @@ int set_memory_decrypted(unsigned long addr, int numpages)
>>  	return crypt_ops->decrypt(addr, numpages);
>>  }
>>  EXPORT_SYMBOL_GPL(set_memory_decrypted);
>> +
>> +size_t mem_decrypt_granule_size(void)
>> +{
>> +	if (is_realm_world())
>> +		return max(PAGE_SIZE, realm_get_hyp_pagesize());
>
> If you didn't mess with ipa_change_alignment above, you shouldn't need
> this max().
>

size_t mem_decrypt_granule_size(void)
{
	if (is_realm_world())
		return max(PAGE_SIZE, realm_get_hyp_pagesize());
	return PAGE_SIZE;
}

That needs to use max(), because we should align to the guest PAGE_SIZE
if it is larger than the host-specified alignment value.

-aneesh

^ permalink raw reply

* [PATCH v2 0/2] x86/tdx: Port I/O emulation fixes
From: Kiryl Shutsemau (Meta) @ 2026-04-28 12:56 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Borys Tsyrulnikov, linux-kernel, linux-coco, kvm,
	stable, Kiryl Shutsemau (Meta)

This series addresses two technical inaccuracies in the TDX guest port
I/O emulation code reported by Borys Tsyrulnikov.

The first patch fixes an off-by-one error in the GENMASK() macro usage
where the mask was being calculated as one bit too wide (e.g. 9 bits for
an 8-bit operation).

The second patch ensures that 32-bit port I/O operations (INL) correctly
zero-extend the result to the full 64-bit RAX register, as required by
the x86 architecture. Currently, the emulation preserves the upper 32
bits of RAX during such operations.

Both issues were introduced in the initial implementation of the runtime
hypercalls for port I/O.

v1: https://lore.kernel.org/all/20260331112430.71425-1-kas@kernel.org/

Changes in v2:
  - Rephrase the size check in handle_in() as "if (size == 4)" for
    readability (Kuppuswamy)
  - Add Link: to the bug report on both patches (Kuppuswamy)
  - Collect Reviewed-by tags (Kai Huang, Kuppuswamy Sathyanarayanan)
  - Rebase onto v7.1-rc1

Kiryl Shutsemau (Meta) (2):
  x86/tdx: Fix off-by-one in port I/O handling
  x86/tdx: Fix zero-extension for 32-bit port I/O

 arch/x86/coco/tdx/tdx.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.51.2

^ permalink raw reply

* [PATCH v2 1/2] x86/tdx: Fix off-by-one in port I/O handling
From: Kiryl Shutsemau (Meta) @ 2026-04-28 12:56 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Borys Tsyrulnikov, linux-kernel, linux-coco, kvm,
	stable, Kiryl Shutsemau (Meta)
In-Reply-To: <20260428125632.129770-1-kas@kernel.org>

handle_in() and handle_out() in arch/x86/coco/tdx/tdx.c use:

    u64 mask = GENMASK(BITS_PER_BYTE * size, 0);

GENMASK(h, l) includes bit h. For size=1 (INB), this produces
GENMASK(8, 0) = 0x1FF (9 bits) instead of GENMASK(7, 0) = 0xFF (8
bits). The mask is one bit too wide for all I/O sizes.

Fix the mask calculation.
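
For illustration only, a minimal userspace sketch of the width difference.
GENMASK_U64(), io_mask_old() and io_mask_new() are hypothetical stand-ins
written for this example; they are not the kernel's definitions and are not
part of the patch:

```c
#include <stdint.h>

#define BITS_PER_BYTE 8

/*
 * Userspace stand-in for the kernel's GENMASK(h, l): sets bits l..h,
 * both ends inclusive. Assumes 0 <= l <= h <= 63.
 */
#define GENMASK_U64(h, l) \
	((~UINT64_C(0) >> (63 - (h))) & (~UINT64_C(0) << (l)))

/* Mask as computed before the fix: covers size * 8 + 1 bits. */
static inline uint64_t io_mask_old(int size)
{
	return GENMASK_U64(BITS_PER_BYTE * size, 0);
}

/* Mask as computed after the fix: covers exactly size * 8 bits. */
static inline uint64_t io_mask_new(int size)
{
	return GENMASK_U64(BITS_PER_BYTE * size - 1, 0);
}
```

With size=1 the old helper yields 0x1FF and the new one 0xFF, matching the
values quoted above.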

Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/coco/tdx/tdx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 186915a17c50..65119362f9a2 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -693,7 +693,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 		.r13 = PORT_READ,
 		.r14 = port,
 	};
-	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
 	bool success;
 
 	/*
@@ -713,7 +713,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 
 static bool handle_out(struct pt_regs *regs, int size, int port)
 {
-	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
 
 	/*
 	 * Emulate the I/O write via hypercall. More info about ABI can be found
-- 
2.51.2


^ permalink raw reply related

* [PATCH v2 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Kiryl Shutsemau (Meta) @ 2026-04-28 12:56 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Borys Tsyrulnikov, linux-kernel, linux-coco, kvm,
	stable, Kiryl Shutsemau (Meta)
In-Reply-To: <20260428125632.129770-1-kas@kernel.org>

According to x86 architecture rules, 32-bit operations zero-extend the
result to 64 bits. The current implementation of handle_in() only masks
the lower 32 bits, which preserves the upper 32 bits of RAX when a
32-bit port IN instruction is emulated.

Update handle_in() to zero out the entire RAX register when the I/O size
is 4 bytes to ensure correct zero-extension. For smaller sizes (1 or 2
bytes), continue to preserve the unaffected upper bits.
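
As a rough sketch of the intended register update, modelled in userspace:
emulate_in_result() is a hypothetical helper written for this example, with
'ax' standing in for regs->ax and 'val' for the value returned by the
hypercall. It assumes size is 1, 2 or 4:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the RAX update done after an emulated IN.
 * Returns the new RAX value.
 */
static uint64_t emulate_in_result(uint64_t ax, uint64_t val, int size,
				  bool success)
{
	/* Low size * 8 bits set; valid for size 1, 2 or 4. */
	uint64_t mask = ~UINT64_C(0) >> (64 - 8 * size);

	if (size == 4)
		ax = 0;		/* 32-bit result is zero-extended to 64 bits */
	else
		ax &= ~mask;	/* 8/16-bit results leave the upper bits alone */

	if (success)
		ax |= val & mask;

	return ax;
}
```

For a 4-byte read the stale upper half of RAX is discarded, while 1-byte and
2-byte reads only replace the low 8 or 16 bits.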

Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/coco/tdx/tdx.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 65119362f9a2..e09636564237 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -703,8 +703,17 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 	 */
 	success = !__tdx_hypercall(&args);
 
-	/* Update part of the register affected by the emulated instruction */
-	regs->ax &= ~mask;
+	/*
+	 * Update part of the register affected by the emulated instruction.
+	 *
+	 * 32-bit operands generate a 32-bit result, zero-extended to a 64-bit
+	 * result.
+	 */
+	if (size == 4)
+		regs->ax = 0;
+	else
+		regs->ax &= ~mask;
+
 	if (success)
 		regs->ax |= args.r11 & mask;
 
-- 
2.51.2


^ permalink raw reply related

* Re: [PATCH v4 2/3] swiotlb: dma: its: Enforce host page-size alignment for shared buffers
From: Marc Zyngier @ 2026-04-28 13:31 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marek Szyprowski, Robin Murphy,
	Steven Price, Suzuki K Poulose, Thomas Gleixner, Will Deacon
In-Reply-To: <yq5aa4un1dju.fsf@kernel.org>

On Tue, 28 Apr 2026 13:20:53 +0100,
Aneesh Kumar K.V <aneesh.kumar@kernel.org> wrote:
> 
> Marc Zyngier <maz@kernel.org> writes:
> 
> > On Mon, 27 Apr 2026 07:31:07 +0100,
> > "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> >> 
> >> When running private-memory guests, the guest kernel must apply additional
> >> constraints when allocating buffers that are shared with the hypervisor.
> >> 
> >> These shared buffers are also accessed by the host kernel and therefore
> >> must be aligned to the host’s page size, and have a size that is a multiple
> >> of the host page size.
> >> 
> >> On non-secure hosts, set_guest_memory_attributes() tracks memory at the
> >> host PAGE_SIZE granularity. This creates a mismatch when the guest applies
> >> attributes at 4K boundaries while the host uses 64K pages. In such cases,
> >> set_guest_memory_attributes() call returns -EINVAL, preventing the
> >> conversion of memory regions from private to shared.
> >> 
> >> Architectures such as Arm can tolerate realm physical address space
> >> (protected memory) PFNs being mapped as shared memory, as incorrect
> >> accesses are detected and reported as GPC faults. However, relying on this
> >> mechanism is unsafe and can still lead to kernel crashes.
> >> 
> >> This is particularly likely when guest_memfd allocations are mmapped and
> >> accessed from userspace. Once exposed to userspace, we cannot guarantee
> >> that applications will only access the intended 4K shared region rather
> >> than the full 64K page mapped into their address space. Such userspace
> >> addresses may also be passed back into the kernel and accessed via the
> >> linear map, resulting in a GPC fault and a kernel crash.
> >> 
> >> With CCA, although Stage-2 mappings managed by the RMM still operate at a
> >> 4K granularity, shared pages must nonetheless be aligned to the
> >> host-managed page size and sized as whole host pages to avoid the issues
> >> described above.
> >
> > I thought that was being fixed, and that there was now a strong
> > guarantee that RMM and host are aligned on the page size. Even more,
> > S2 is totally irrelevant here. The only thing that matters is the host
> > page size vs the guest page size. Nothing else.
> >
> 
> Yes, the latest RMM update includes the ability to change the granule
> size.
> 
> The section above in the commit message was intended to explain that the
> S2 mapping size is irrelevant. I agree it is not clear as written, so I
> will reword it to improve clarity.

Even better, remove it. Nothing CCA-specific should be in this patch.

[...]

> >>  static struct gen_pool *itt_pool;
> >> @@ -268,11 +272,13 @@ static void *itt_alloc_pool(int node, int size)
> >>  		if (addr)
> >>  			break;
> >>  
> >> -		page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
> >> +		page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
> >> +					    get_order(mem_decrypt_granule_size()));
> >
> > You already taught its_alloc_pages_node() about the decrypt granule
> > size stuff. I don't think we need to see more of it (and you don't
> > mess with the call that is just above it).
> >
> >>  		if (!page)
> >>  			break;
> >>  
> >> -		gen_pool_add(itt_pool, (unsigned long)page_address(page), PAGE_SIZE, node);
> >> +		gen_pool_add(itt_pool, (unsigned long)page_address(page),
> >> +			     mem_decrypt_granule_size(), node);
> >
> > I'd rather see something like mem_decrypt_align(PAGE_SIZE), which
> > keeps the intent clear.
> >
> 
> The helper was added based on feedback from a previous version. I assume
> you are suggesting that only this caller should switch?

I don't know what you mean by 'this'. What I'd like to see is this
last hunk be changed to:

	gen_pool_add(itt_pool, (unsigned long)page_address(page),
		     mem_decrypt_align(PAGE_SIZE), node);

and the previous hunk simply dropped.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH v4 3/3] coco: guest: arm64: Query host IPA-change alignment via RHI
From: Marc Zyngier @ 2026-04-28 13:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marek Szyprowski, Robin Murphy,
	Steven Price, Suzuki K Poulose, Thomas Gleixner, Will Deacon
In-Reply-To: <yq5a4ikv1c7p.fsf@kernel.org>

On Tue, 28 Apr 2026 13:49:46 +0100,
Aneesh Kumar K.V <aneesh.kumar@kernel.org> wrote:
> 
> Marc Zyngier <maz@kernel.org> writes:
> 
> > On Mon, 27 Apr 2026 07:31:08 +0100,
> > "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> >> 
> >> Add the Realm Host Interface support needed to query host configuration
> >> from a Realm guest. Define the RHI hostconf SMCs, add rsi_host_call(), and
> >> use them during Realm initialization to retrieve the host IPA-change
> >> alignment size.
> >
> > I don't understand what "IPA-change" means. What you are after is the
> > host's sharing granule size.
> >
> 
> This is part of the RHI specification, and the call is named
> RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT. The intent is to determine the
> alignment requirements for changing IPA attributes (protected vs.
> unprotected IPA

This really is a terrible name. Why the 'change' part? It doesn't
change, it is a constant.

Oh well...

[...]

> >> +static inline unsigned long rsi_host_call(struct rsi_host_call *rhi_call)
> >> +{
> >> +	phys_addr_t addr = virt_to_phys(rhi_call);
> >> +	struct arm_smccc_res res;
> >> +
> >> +	arm_smccc_1_1_invoke(SMC_RSI_HOST_CALL, addr, &res);
> >
> > Errr... What guarantees that *rhi_call is *IPA contiguous*? This is
> > incredibly fragile. You should at the very least check that this isn't
> > vmalloc'd.
> >
> 
> 
> I didn’t quite follow that. We have other RSI calls (even RMI calls)
> that do similar things, and the caller understands that the address
> should be IPA-contiguous.

Does it? Where is it documented?  All you get is a pointer, so all
bets are off.

> Are you suggesting that all RSI calls should
> add checks for this? Or are you suggesting to update the API to
> 
> unsigned long rsi_host_call(unsigned long rhi_call_phys) ?

I'm suggesting that this API is subtly broken because it makes random
assumptions about the physical contiguity of the VA space. It does so
without any check, without any documentation.

Simply changing the parameter to phys_addr_t could at the very least
capture some of the requirements, but I'd like something in big bold
letters.

>
> >> +
> >> +	return res.a0;
> >> +}
> >> +
> >>  #endif /* __ASM_RSI_CMDS_H */
> >> diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
> >> index e19253f96c94..9ee8b5c7612e 100644
> >> --- a/arch/arm64/include/asm/rsi_smc.h
> >> +++ b/arch/arm64/include/asm/rsi_smc.h
> >> @@ -182,6 +182,13 @@ struct realm_config {
> >>   */
> >>  #define SMC_RSI_IPA_STATE_GET			SMC_RSI_FID(0x198)
> >>  
> >> +struct rsi_host_call {
> >> +	union {
> >> +		u16 imm;
> >> +		u64 padding0;
> >> +	};
> >> +	u64 gprs[31];
> >> +} __aligned(0x100);
> >>  /*
> >>   * Make a Host call.
> >>   *
> >> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> >> index fe627100d199..3e72dd9584ed 100644
> >> --- a/arch/arm64/kernel/Makefile
> >> +++ b/arch/arm64/kernel/Makefile
> >> @@ -34,7 +34,7 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
> >>  			   cpufeature.o alternative.o cacheinfo.o		\
> >>  			   smp.o smp_spin_table.o topology.o smccc-call.o	\
> >>  			   syscall.o proton-pack.o idle.o patching.o pi/	\
> >> -			   rsi.o jump_label.o
> >> +			   rsi.o jump_label.o rhi.o
> >>  
> >>  obj-$(CONFIG_COMPAT)			+= sys32.o signal32.o			\
> >>  					   sys_compat.o
> >> diff --git a/arch/arm64/kernel/rhi.c b/arch/arm64/kernel/rhi.c
> >> new file mode 100644
> >> index 000000000000..7cd6c5102464
> >> --- /dev/null
> >> +++ b/arch/arm64/kernel/rhi.c
> >> @@ -0,0 +1,54 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Copyright (C) 2026 ARM Ltd.
> >> + */
> >> +
> >> +#include <linux/mm.h>
> >> +#include <asm/rsi.h>
> >> +#include <asm/rhi.h>
> >> +
> >> +/* we need an aligned rhicall for rsi_host_call. slab is not yet ready */
> >> +static struct rsi_host_call hyp_pagesize_rhicall;
> >
> > > Why the "hyp_" prefix? This has absolutely nothing to do with the
> > > hypervisor.
> >
> 
> Sure will update "hyp_" reference to host. 
> 
> 
> >> +unsigned long rhi_get_ipa_change_alignment(void)
> >> +{
> >> +	long ret;
> >> +	unsigned long ipa_change_align;
> >> +
> >> +	hyp_pagesize_rhicall.imm = 0;
> >> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_VERSION;
> >> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
> >> +	if (ret != RSI_SUCCESS)
> >> +		goto err_out;
> >> +
> >> +	if (hyp_pagesize_rhicall.gprs[0] != RHI_HOSTCONF_VER_1_0)
> >> +		goto err_out;
> >> +
> >> +	hyp_pagesize_rhicall.imm = 0;
> >> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_FEATURES;
> >> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
> >> +	if (ret != RSI_SUCCESS)
> >> +		goto err_out;
> >> +
> >> +	if (!(hyp_pagesize_rhicall.gprs[0] & __RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT))
> >> +		goto err_out;
> >> +
> >> +	hyp_pagesize_rhicall.imm = 0;
> >> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT;
> >> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
> >> +	if (ret != RSI_SUCCESS)
> >> +		goto err_out;
> >> +
> >> +	ipa_change_align = hyp_pagesize_rhicall.gprs[0];
> >> +	/* This error needs special handling in the caller */
> >> +	if (ipa_change_align & (SZ_4K - 1))
> >> +		return 0;
> >> +
> >> +	return ipa_change_align;
> >> +
> >> +err_out:
> >> +	/*
> >> +	 * For failure condition assume host is built with 4K page size
> >> +	 * and hence ipa change alignment can be guest PAGE_SIZE.
> >> +	 */
> >> +	return PAGE_SIZE;
> >> +}
> >
> > Why can't this be part of rsi.c? This is an RSI call, and it should be
> > part of the RSI initialisation.
> >
> 
> This is an RHI call as per the specification, hence it has been added to
> rhi.c.

News flash: this is the Linux kernel, not an ARM spec. We organise
things based on the logical use, not on the TLA associated with it.

And RHI is implemented in terms of RSI. In rsi.c it goes. We don't
need this pointless proliferation of helper files that only result in
equally pointless global symbols.

> 
> >> diff --git a/arch/arm64/kernel/rsi.c b/arch/arm64/kernel/rsi.c
> >> index 9e846ce4ef9c..ff735c04e236 100644
> >> --- a/arch/arm64/kernel/rsi.c
> >> +++ b/arch/arm64/kernel/rsi.c
> >> @@ -14,8 +14,10 @@
> >>  #include <asm/mem_encrypt.h>
> >>  #include <asm/pgtable.h>
> >>  #include <asm/rsi.h>
> >> +#include <asm/rhi.h>
> >>  
> >>  static struct realm_config config;
> >> +static unsigned long ipa_change_alignment = PAGE_SIZE;
> >>  
> >>  unsigned long prot_ns_shared;
> >>  EXPORT_SYMBOL(prot_ns_shared);
> >> @@ -139,6 +141,11 @@ static int realm_ioremap_hook(phys_addr_t phys, size_t size, pgprot_t *prot)
> >>  	return 0;
> >>  }
> >>  
> >> +unsigned long realm_get_hyp_pagesize(void)
> >> +{
> >> +	return ipa_change_alignment;
> >> +}
> >
> > Again, this has nothing to do with the hypervisor, but the host. And
> > ipa_change_alignment is still a wording I can't wrap my small head
> > around.
> >
> >> +
> >>  void __init arm64_rsi_init(void)
> >>  {
> >>  	if (arm_smccc_1_1_get_conduit() != SMCCC_CONDUIT_SMC)
> >> @@ -147,6 +154,12 @@ void __init arm64_rsi_init(void)
> >>  		return;
> >>  	if (WARN_ON(rsi_get_realm_config(&config)))
> >>  		return;
> >> +
> >> +	ipa_change_alignment = rhi_get_ipa_change_alignment();
> >> +	/* If we don't get a correct alignment response, don't enable realm */
> >> +	if (!ipa_change_alignment)
> >> +		return;
> >
> > But at the same time, you override a global value with an error, and
> > then paper over it in mem_decrypt_granule_size()...
> >
> 
> 
> I believe I received similar feedback on my previous version as well,
> which I didn’t quite follow.

And you didn't think of asking? Sometimes I wonder what these patch
reviews are for... Just to waste some more electrons, I guess :-/.

> 
> rhi_get_ipa_change_alignment() only returns an error when the host
> returns a size that is not 4K-aligned. Otherwise, it returns the
> host-determined size, or defaults to guest PAGE_SIZE if the RHI call
> itself is not supported.

You encode the error as 0. You override ipa_change_alignment with 0.

Then...

> >> +size_t mem_decrypt_granule_size(void)
> >> +{
> >> +	if (is_realm_world())
> >> +		return max(PAGE_SIZE, realm_get_hyp_pagesize());
> >
> > If you didn't mess with ipa_change_alignment above, you shouldn't need
> > this max().
> >
> 
> size_t mem_decrypt_granule_size(void)
> {
> 	if (is_realm_world())
> 		return max(PAGE_SIZE, realm_get_hyp_pagesize());
> 	return PAGE_SIZE;
> }
> 
> That needs to use max(), because we should align to the guest PAGE_SIZE
> if it is larger than the host-specified alignment value.

... you need to correct that back to PAGE_SIZE because you have stored
something smaller than PAGE_SIZE.

Isn't the problem really obvious? ipa_change_alignment can *NEVER* go
down. It should never be allowed to reduce, because that's exactly
the property you are trying to enforce.
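
To make the invariant concrete, one possible shape, as a userspace sketch
only. GUEST_PAGE_SIZE, sharing_granule, record_host_alignment() and
decrypt_granule_size() are all hypothetical names made up for this example,
and the 4K guest page size is an assumption:

```c
#include <stddef.h>

#define SZ_4K		((size_t)4096)
#define GUEST_PAGE_SIZE	((size_t)4096)	/* assumption: 4K guest pages */

/* Starts at the guest page size and is only ever raised. */
static size_t sharing_granule = GUEST_PAGE_SIZE;

/*
 * Record the host-reported alignment. Returns 0 on success, or -1 if
 * the value is zero or not a multiple of 4K. The stored granule is
 * left untouched on error and is never lowered.
 */
static int record_host_alignment(size_t host_align)
{
	if (host_align == 0 || (host_align & (SZ_4K - 1)))
		return -1;

	if (host_align > sharing_granule)
		sharing_granule = host_align;

	return 0;
}

/* No max() needed here: the invariant is enforced where the value is set. */
static size_t decrypt_granule_size(void)
{
	return sharing_granule;
}
```

The error is reported through the return value instead of being stored in
the global, so readers of the global never have to correct it.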

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH v4 3/3] coco: guest: arm64: Query host IPA-change alignment via RHI
From: Will Deacon @ 2026-04-28 13:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Marek Szyprowski,
	Robin Murphy, Steven Price, Suzuki K Poulose, Thomas Gleixner,
	sebastianene
In-Reply-To: <20260427063108.909019-4-aneesh.kumar@kernel.org>

[+Seb for the ITS]

On Mon, Apr 27, 2026 at 12:01:08PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Add the Realm Host Interface support needed to query host configuration
> from a Realm guest. Define the RHI hostconf SMCs, add rsi_host_call(), and
> use them during Realm initialization to retrieve the host IPA-change
> alignment size.
> 
> Expose that alignment through realm_get_hyp_pagesize() and
> mem_decrypt_granule_size() so shared-buffer allocation and
> encryption/decryption paths can honor the ipa change page-size requirement.
> 
> If the host reports an invalid alignment (when the alignment value is not
> a multiple of 4K), do not enable Realm support.
> 
> This provides the host alignment information required by the shared buffer
> alignment changes.
> 
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  arch/arm64/include/asm/mem_encrypt.h |  3 ++
>  arch/arm64/include/asm/rhi.h         | 24 +++++++++++++
>  arch/arm64/include/asm/rsi.h         |  2 ++
>  arch/arm64/include/asm/rsi_cmds.h    | 10 ++++++
>  arch/arm64/include/asm/rsi_smc.h     |  7 ++++
>  arch/arm64/kernel/Makefile           |  2 +-
>  arch/arm64/kernel/rhi.c              | 54 ++++++++++++++++++++++++++++
>  arch/arm64/kernel/rsi.c              | 13 +++++++
>  arch/arm64/mm/mem_encrypt.c          |  8 +++++
>  9 files changed, 122 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/rhi.h
>  create mode 100644 arch/arm64/kernel/rhi.c

[...]

> diff --git a/arch/arm64/mm/mem_encrypt.c b/arch/arm64/mm/mem_encrypt.c
> index 38c62c9e4e74..f5d64bc29c20 100644
> --- a/arch/arm64/mm/mem_encrypt.c
> +++ b/arch/arm64/mm/mem_encrypt.c
> @@ -59,3 +59,11 @@ int set_memory_decrypted(unsigned long addr, int numpages)
>  	return crypt_ops->decrypt(addr, numpages);
>  }
>  EXPORT_SYMBOL_GPL(set_memory_decrypted);
> +
> +size_t mem_decrypt_granule_size(void)
> +{
> +	if (is_realm_world())
> +		return max(PAGE_SIZE, realm_get_hyp_pagesize());
> +	return PAGE_SIZE;

No, this should be indirected via 'struct arm64_mem_crypt_ops' because
there's nothing particularly unique to realms here. For pKVM protected
guests using a smaller page-size than the host, we'd presumably need
something similar for the ITS (where restricted-dma isn't used).
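
Roughly, the indirection could look like the userspace sketch below. The
struct layout, the granule_size member and every *_sketch name are
hypothetical and written only for this example; the real 'struct
arm64_mem_crypt_ops' has no such member in the code quoted here, and the
64K value is just a sample:

```c
#include <stddef.h>

#define GUEST_PAGE_SIZE	((size_t)4096)	/* assumption: 4K guest pages */

/* Hypothetical ops table with an optional granule-size callback. */
struct mem_crypt_ops_sketch {
	size_t (*granule_size)(void);
};

static const struct mem_crypt_ops_sketch *crypt_ops_sketch;

/* Example backend: pretends the host shares memory in 64K units. */
static size_t realm_granule_size_sketch(void)
{
	return 65536;
}

static const struct mem_crypt_ops_sketch realm_ops_sketch = {
	.granule_size = realm_granule_size_sketch,
};

static void register_crypt_ops_sketch(const struct mem_crypt_ops_sketch *ops)
{
	crypt_ops_sketch = ops;
}

/* Generic entry point: no backend-specific checks in here. */
static size_t mem_decrypt_granule_size_sketch(void)
{
	if (crypt_ops_sketch && crypt_ops_sketch->granule_size)
		return crypt_ops_sketch->granule_size();

	return GUEST_PAGE_SIZE;
}
```

Any backend (realm, pKVM, ...) would then register its own callback and the
generic helper would not need to know which one is active.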

Will

^ permalink raw reply

* Re: [PATCH v4 3/3] coco: guest: arm64: Query host IPA-change alignment via RHI
From: Suzuki K Poulose @ 2026-04-28 15:22 UTC (permalink / raw)
  To: Marc Zyngier, Aneesh Kumar K.V
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marek Szyprowski, Robin Murphy,
	Steven Price, Thomas Gleixner, Will Deacon
In-Reply-To: <86tssvyz2v.wl-maz@kernel.org>

On 28/04/2026 14:49, Marc Zyngier wrote:
> On Tue, 28 Apr 2026 13:49:46 +0100,
> Aneesh Kumar K.V <aneesh.kumar@kernel.org> wrote:
>>
>> Marc Zyngier <maz@kernel.org> writes:
>>
>>> On Mon, 27 Apr 2026 07:31:08 +0100,
>>> "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
>>>>
>>>> Add the Realm Host Interface support needed to query host configuration
>>>> from a Realm guest. Define the RHI hostconf SMCs, add rsi_host_call(), and
>>>> use them during Realm initialization to retrieve the host IPA-change
>>>> alignment size.
>>>
>>> I don't understand what "IPA-change" means. What you are after is the
>>> host's sharing granule size.
>>>
>>
>> This is part of the RHI specification, and the call is named
>> RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT. The intent is to determine the
>> alignment requirements for changing IPA attributes (protected vs.
>> unprotected IPA
> 
> This really is a terrible name. Why the 'change' part? It doesn't
> change, it is a constant.

Agreed, it was supposed to mean IPA_STATE_CHANGE.

> 
> Oh well...
> 
> [...]
> 

...

>>>> +unsigned long rhi_get_ipa_change_alignment(void)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long ipa_change_align;
>>>> +
>>>> +	hyp_pagesize_rhicall.imm = 0;
>>>> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_VERSION;
>>>> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
>>>> +	if (ret != RSI_SUCCESS)
>>>> +		goto err_out;
>>>> +
>>>> +	if (hyp_pagesize_rhicall.gprs[0] != RHI_HOSTCONF_VER_1_0)
>>>> +		goto err_out;
>>>> +
>>>> +	hyp_pagesize_rhicall.imm = 0;
>>>> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_FEATURES;
>>>> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
>>>> +	if (ret != RSI_SUCCESS)
>>>> +		goto err_out;
>>>> +
>>>> +	if (!(hyp_pagesize_rhicall.gprs[0] & __RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT))
>>>> +		goto err_out;
>>>> +
>>>> +	hyp_pagesize_rhicall.imm = 0;
>>>> +	hyp_pagesize_rhicall.gprs[0] = RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT;
>>>> +	ret = rsi_host_call(lm_alias(&hyp_pagesize_rhicall));
>>>> +	if (ret != RSI_SUCCESS)
>>>> +		goto err_out;
>>>> +
>>>> +	ipa_change_align = hyp_pagesize_rhicall.gprs[0];
>>>> +	/* This error needs special handling in the caller */
>>>> +	if (ipa_change_align & (SZ_4K - 1))
>>>> +		return 0;
>>>> +
>>>> +	return ipa_change_align;
>>>> +
>>>> +err_out:
>>>> +	/*
>>>> +	 * For failure condition assume host is built with 4K page size
>>>> +	 * and hence ipa change alignment can be guest PAGE_SIZE.
>>>> +	 */
>>>> +	return PAGE_SIZE;
>>>> +}
>>>
>>> Why can't this be part of rsi.c? This is an RSI call, and it should be
>>> part of the RSI initialisation.
>>>
>>
>> This is an RHI call as per the specification, hence it has been added to
>> rhi.c.
> 
> News flash: this is the Linux kernel, not an ARM spec. We organise
> things based on the logical use, not on the TLA associated with it.
> 
> And RHI is implemented in terms of RSI. In rsi.c it goes. We don't
> need this pointless proliferation of helper files that only result in
> equally pointless global symbols.

RHI (Realm Host Interface) is not really the same as RSI. The former is
a service mechanism for Realms with the "Non-secure Hypervisor".  And
this single call is just one of the services. Further services will
eventually come up (e.g., Device Assignment, Boot Sync Protocol,
Firmware Activity Log etc).

RSI (to be precise, RSI_HOST_CALL) is the transport to talk to the Host,
as that is the only way for the Realm to reach the Host. So, tbh, it
does make sense to keep this in rhi.c?

Suzuki

^ permalink raw reply

* SVSM Development Call April 29, 2026
From: Jörg Rödel @ 2026-04-28 16:11 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this week's SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

Currently on the agenda:

	- Usual TSC meeting update
	- Review of slides for CCC project status update

We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH v3 1/6] cpu/bugs: Allow forcing Automatic IBRS with SNP active using spectre_v2=eibrs
From: Pawan Gupta @ 2026-04-28 16:49 UTC (permalink / raw)
  To: Kim Phillips
  Cc: linux-kernel, kvm, linux-coco, x86, Sean Christopherson,
	Paolo Bonzini, K Prateek Nayak, Nikunj A Dadhania, Tom Lendacky,
	Michael Roth, Borislav Petkov, Borislav Petkov, Naveen Rao,
	David Kaplan, Dave Hansen, stable, kernel test robot
In-Reply-To: <20260402202558.195005-2-kim.phillips@amd.com>

On Thu, Apr 02, 2026 at 03:25:53PM -0500, Kim Phillips wrote:
> spectre_v2=eibrs currently enables retpolines when SNP is enabled,
> instead of AutoIBRS (EIBRS) because the commit that disabled
> AutoIBRS if SNP is enabled stopped short of enabling
> X86_FEATURE_IBRS_ENHANCED.
> 
> Change the logic to enable X86_FEATURE_IBRS_ENHANCED, and move the
> decision to switch to retpolines in the default/"auto" case in

The retpoline switch happens in force case(=on) too.

> spectre_v2_select_mitigation().  This allows the existing
> spectre_v2=eibrs logic to work as intended.
> 
> Also emit a performance loss warning for using AutoIBRS with
> SNP enabled.
> 
> Fixes: acaa4b5c4c85 ("x86/speculation: Do not enable Automatic IBRS if SEV-SNP is enabled")
> Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Borislav Petkov (AMD) <bp@alien8.de>
> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: stable@kernel.org
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202603121136.bc8zNsHS-lkp@intel.com/
> Signed-off-by: Kim Phillips <kim.phillips@amd.com>
> ---

Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

^ permalink raw reply

* Re: SVSM Development Call April 22, 2026
From: Jörg Rödel @ 2026-04-28 17:27 UTC (permalink / raw)
  To: coconut-svsm, linux-coco
In-Reply-To: <ymracc35f44xtep4zvexdvqxbh2sfudsnnvz6nrgj56wyxaekv@xrxkyzbpdohk>

Meeting minutes are now available, please review:

	https://github.com/coconut-svsm/governance/pull/105

-Joerg

^ permalink raw reply

* Re: [PATCH v3 1/6] cpu/bugs: Allow forcing Automatic IBRS with SNP active using spectre_v2=eibrs
From: Kim Phillips @ 2026-04-28 21:34 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: linux-kernel, kvm, linux-coco, x86, Sean Christopherson,
	Paolo Bonzini, K Prateek Nayak, Nikunj A Dadhania, Tom Lendacky,
	Michael Roth, Borislav Petkov, Borislav Petkov, Naveen Rao,
	David Kaplan, Dave Hansen, stable, kernel test robot
In-Reply-To: <20260428164909.xkxdqhdsguq2wpal@desk>

On 4/28/26 11:49 AM, Pawan Gupta wrote:
> On Thu, Apr 02, 2026 at 03:25:53PM -0500, Kim Phillips wrote:
>> spectre_v2=eibrs currently enables retpolines when SNP is enabled,
>> instead of AutoIBRS (EIBRS) because the commit that disabled
>> AutoIBRS if SNP is enabled stopped short of enabling
>> X86_FEATURE_IBRS_ENHANCED.
>>
>> Change the logic to enable X86_FEATURE_IBRS_ENHANCED, and move the
>> decision to switch to retpolines in the default/"auto" case in
> 
> The retpoline switch happens in force case(=on) too.

That's right, for default/"=auto"/"=on", if SNP is enabled, retpolines.
If SNP is not enabled, AutoIBRS.  I'm assuming that's the desired
behaviour.

I'll amend the commit text in the next version if more reasons
arise to submit one.

>> spectre_v2_select_mitigation().  This allows the existing
>> spectre_v2=eibrs logic to work as intended.
>>
>> Also emit a performance loss warning for using AutoIBRS with
>> SNP enabled.
>>
>> Fixes: acaa4b5c4c85 ("x86/speculation: Do not enable Automatic IBRS if SEV-SNP is enabled")
>> Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: Borislav Petkov (AMD) <bp@alien8.de>
>> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> Cc: stable@kernel.org
>> Reported-by: kernel test robot <lkp@intel.com>
>> Closes: https://lore.kernel.org/oe-kbuild-all/202603121136.bc8zNsHS-lkp@intel.com/
>> Signed-off-by: Kim Phillips <kim.phillips@amd.com>
>> ---
> 
> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

Thanks,

Kim

^ permalink raw reply

* [PATCH RFC v5 01/53] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:24 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Sean Christopherson <seanjc@google.com>

Start plumbing in guest_memfd support for in-place private<=>shared
conversions by tracking attributes via a maple tree.  KVM currently tracks
private vs. shared attributes on a per-VM basis, which made sense when a
guest_memfd _only_ supported private memory, but tracking per-VM simply
can't work for in-place conversions as the shareability of a given page
needs to be per-gmem_inode, not per-VM.
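
To illustrate the tracking scheme described above, here is a minimal
userspace sketch. It is not the kernel code: a small sorted array stands in
for the maple tree, and all model_* names, MODEL_ATTR_PRIVATE and
MODEL_MAX_RANGES are hypothetical. It only shows the two properties the
patch relies on: every index is covered from the start, and a store splits
whatever ranges it overlaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_MEMORY_ATTRIBUTE_PRIVATE. */
#define MODEL_ATTR_PRIVATE	(1ull << 3)
#define MODEL_MAX_RANGES	16

struct model_range {
	uint64_t first, last;	/* inclusive page indices */
	uint64_t attrs;
};

struct model_attrs {
	struct model_range r[MODEL_MAX_RANGES];	/* sorted, contiguous, no gaps */
	size_t nr;
};

/* Cover every index up front, mirroring the default store at inode init. */
static void model_init(struct model_attrs *m, uint64_t npages, int init_shared)
{
	assert(npages > 0);
	m->r[0] = (struct model_range){ 0, npages - 1,
					init_shared ? 0 : MODEL_ATTR_PRIVATE };
	m->nr = 1;
}

/* Overwrite [first, last], splitting whatever ranges it overlaps. */
static int model_store(struct model_attrs *m, uint64_t first, uint64_t last,
		       uint64_t attrs)
{
	struct model_range out[MODEL_MAX_RANGES + 2];
	size_t i, n = 0;
	int stored = 0;

	assert(m->nr > 0);
	if (first > last || last > m->r[m->nr - 1].last)
		return -1;

	for (i = 0; i < m->nr; i++) {
		struct model_range e = m->r[i];

		/* Untouched range: copy as-is. */
		if (e.last < first || e.first > last) {
			out[n++] = e;
			continue;
		}
		/* Keep the part of the old range left of the store. */
		if (e.first < first)
			out[n++] = (struct model_range){ e.first, first - 1, e.attrs };
		/* Emit the new range exactly once. */
		if (!stored) {
			out[n++] = (struct model_range){ first, last, attrs };
			stored = 1;
		}
		/* Keep the part of the old range right of the store. */
		if (e.last > last)
			out[n++] = (struct model_range){ last + 1, e.last, e.attrs };
	}
	if (n > MODEL_MAX_RANGES)
		return -1;	/* model limit only, tree is left unchanged */
	for (i = 0; i < n; i++)
		m->r[i] = out[i];
	m->nr = n;
	return 0;
}

/* Return 0 and fill *attrs if index is tracked, -1 otherwise. */
static int model_lookup(const struct model_attrs *m, uint64_t index,
			uint64_t *attrs)
{
	size_t i;

	for (i = 0; i < m->nr; i++) {
		if (index >= m->r[i].first && index <= m->r[i].last) {
			*attrs = m->r[i].attrs;
			return 0;
		}
	}
	return -1;
}
```

The sketch omits locking entirely and never merges adjacent ranges that end
up with equal attributes; both are left out to keep the model short.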

Use the filemap invalidation lock to protect the maple tree, as taking the
lock for read when faulting in memory (for userspace or the guest) isn't
expected to result in meaningful contention, and using a separate lock
would add significant complexity (avoiding deadlock is quite difficult).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 139 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 123 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 69c9d6d546b28..17e5a23fec0a1 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
 #include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
@@ -33,6 +34,12 @@ struct gmem_inode {
 	struct list_head gmem_file_list;
 
 	u64 flags;
+	/*
+	 * Every index in this inode, whether memory is populated or
+	 * not, is tracked in attributes. There are no gaps in this
+	 * maple tree.
+	 */
+	struct maple_tree attributes;
 };
 
 static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
@@ -60,6 +67,31 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
+{
+	struct maple_tree *mt = &GMEM_I(inode)->attributes;
+	void *entry = mtree_load(mt, index);
+
+	/*
+	 * The lock _must_ be held for lookups, as some maple tree operations,
+	 * e.g. append, are unsafe (return inaccurate information) with respect
+	 * to concurrent RCU-protected lookups.
+	 */
+	lockdep_assert(mt_lock_is_held(mt));
+
+	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
+}
+
+static bool kvm_gmem_is_private_mem(struct inode *inode, pgoff_t index)
+{
+	return kvm_gmem_get_attributes(inode, index) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+
+static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
+{
+	return !kvm_gmem_is_private_mem(inode, index);
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -397,10 +429,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
-		return VM_FAULT_SIGBUS;
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	else
+		folio = ERR_PTR(-EACCES);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		if (PTR_ERR(folio) == -EAGAIN)
 			return VM_FAULT_RETRY;
@@ -556,6 +591,51 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
 	return true;
 }
 
+static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
+{
+	struct gmem_inode *gi = GMEM_I(inode);
+	MA_STATE(mas, &gi->attributes, 0, (size >> PAGE_SHIFT) - 1);
+	u64 attrs;
+	int r;
+
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+
+	/*
+	 * guest_memfd memory is neither migratable nor swappable: set
+	 * inaccessible to gate off both.
+	 */
+	mapping_set_inaccessible(inode->i_mapping);
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	gi->flags = flags;
+
+	mt_set_external_lock(&gi->attributes,
+			     &inode->i_mapping->invalidate_lock);
+
+	/*
+	 * Store default attributes for the entire gmem instance. Ensuring every
+	 * index is represented in the maple tree at all times simplifies the
+	 * conversion and merging logic.
+	 */
+	attrs = gi->flags & GUEST_MEMFD_FLAG_INIT_SHARED ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	/*
+	 * Acquire the invalidation lock purely to make lockdep happy.  The
+	 * maple tree library expects all stores to be protected via the lock,
+	 * and the library can't know when the tree is reachable only by the
+	 * caller, as is the case here.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	static const char *name = "[kvm-gmem]";
@@ -586,16 +666,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fops;
 	}
 
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
-	GMEM_I(inode)->flags = flags;
+	err = kvm_gmem_init_inode(inode, size, flags);
+	if (err)
+		goto err_inode;
 
 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
 	if (IS_ERR(file)) {
@@ -797,9 +870,13 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
+	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
+	if (IS_ERR(folio)) {
+		r = PTR_ERR(folio);
+		goto out;
+	}
 
 	if (!folio_test_uptodate(folio)) {
 		clear_highpage(folio_page(folio, 0));
@@ -815,6 +892,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	else
 		folio_put(folio);
 
+out:
+	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	return r;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
@@ -944,6 +1023,15 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 	mpol_shared_policy_init(&gi->policy, NULL);
 
+	/*
+	 * Memory attributes are protected by the filemap invalidation lock, but
+	 * the lock structure isn't available at this time.  Immediately mark
+	 * maple tree as using external locking so that accessing the tree
+	 * before it's fully initialized results in NULL pointer dereferences
+	 * and not more subtle bugs.
+	 */
+	mt_init_flags(&gi->attributes, MT_FLAGS_LOCK_EXTERN);
+
 	gi->flags = 0;
 	INIT_LIST_HEAD(&gi->gmem_file_list);
 	return &gi->vfs_inode;
@@ -951,7 +1039,26 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 static void kvm_gmem_destroy_inode(struct inode *inode)
 {
-	mpol_free_shared_policy(&GMEM_I(inode)->policy);
+	struct gmem_inode *gi = GMEM_I(inode);
+
+	mpol_free_shared_policy(&gi->policy);
+
+	/*
+	 * Note!  Checking for an empty tree is functionally necessary
+	 * to avoid explosions if the tree hasn't been fully
+	 * initialized, i.e. if the inode is being destroyed before
+	 * guest_memfd can set the external lock, lockdep would find
+	 * that the tree's internal ma_lock was not held.
+	 */
+	if (!mtree_empty(&gi->attributes)) {
+		/*
+		 * Acquire the invalidation lock purely to make lockdep happy,
+		 * the inode is unreachable at this point.
+		 */
+		filemap_invalidate_lock(inode->i_mapping);
+		__mt_destroy(&gi->attributes);
+		filemap_invalidate_unlock(inode->i_mapping);
+	}
 }
 
 static void kvm_gmem_free_inode(struct inode *inode)

-- 
2.54.0.545.g6539524ca2-goog




* [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:24 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng

This is RFC v5 of guest_memfd in-place conversion support.

Up till now, guest_memfd supports the entire inode worth of memory being
used as all-shared, or all-private. CoCo VMs may request guest memory to be
converted between private and shared states, and the only way to support
that currently would be to have the userspace VMM provide two sources of
backing memory from completely different areas of physical memory.

pKVM has a use case for in-place sharing: the guest and host may be
cooperating on given data, and pKVM doesn't protect data through
encryption, so copying that given data between different areas of physical
memory as part of conversions would be unnecessary work.

This series also serves as a foundation for guest_memfd huge page
support. Currently, guest_memfd only supports PAGE_SIZE pages, so if two sources
of backing memory are used, the userspace VMM could maintain a steady total
memory utilized by punching out the pages that are not used. When huge
pages are available in guest_memfd, even if the backing memory source
supports hole punching within a huge page, punching out pages to maintain
the total memory utilized by a VM would be introducing lots of
fragmentation.

In-place conversion avoids fragmentation by allowing the same physical
memory to be used for both shared and private memory, with guest_memfd
tracking the shared/private status of all the pages at a per-page
granularity.

The central principle, which guest_memfd continues to uphold, is that any
guest-private page will not be mappable to host userspace. All pages will
be mmap()-able in host userspace, but accesses to guest-private pages (as
tracked by guest_memfd) will result in a SIGBUS.
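
As a rough illustration of that principle, the following userspace model
decides the outcome of a host userspace fault from a per-page private flag.
It is a sketch under assumptions, not the guest_memfd fault handler: the
model_* names and the enum are hypothetical, and the flag array stands in
for whatever guest_memfd tracks internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of a host userspace fault in this model. */
enum model_fault {
	MODEL_FAULT_OK,
	MODEL_FAULT_SIGBUS,
};

struct model_gmem {
	size_t npages;
	const bool *is_private;	/* one flag per page, true if guest-private */
};

static enum model_fault model_user_fault(const struct model_gmem *g,
					 size_t pgoff)
{
	assert(g && g->is_private);

	/* Faults beyond the end of the file are rejected. */
	if (pgoff >= g->npages)
		return MODEL_FAULT_SIGBUS;

	/* The mapping exists, but private pages are never handed out. */
	if (g->is_private[pgoff])
		return MODEL_FAULT_SIGBUS;

	return MODEL_FAULT_OK;
}
```

The point of the model is that mmap() itself always succeeds; only the
access to a page that is currently private fails.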

This series introduces a guest_memfd ioctl (not kvm, vm or vcpu, but
guest_memfd ioctl) that allows userspace to set memory
attributes (shared/private) directly through the guest_memfd. This is the
appropriate interface because shared/private-ness is a property of memory
and hence the request should be sent directly to the memory provider -
guest_memfd.

Tested with both CONFIG_KVM_VM_MEMORY_ATTRIBUTES enabled and disabled:

+ tools/testing/selftests/kvm/guest_memfd_test.c
+ tools/testing/selftests/kvm/pre_fault_memory_test.c
+ tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+ tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+ tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
+ tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c

Updates for this revision:

+ For TDX and SNP, PRESERVE is supported only before the VM is finalized, and
  only for to_private conversions.
    + This allows PRESERVE to be used as part of the VM memory
      loading/encryption flow
    + Only support PRESERVE for to_private conversions (to_shared on
      populated memory on TDX would cause zeroing)
    + Relaxed constraints for SNP and TDX to allow NULL to be passed as
      source address.
+ Dropped KVM_CAP_MEMORY_ATTRIBUTES2. KVM_CAP_MEMORY_ATTRIBUTES reports
  attributes supported by the KVM_SET_MEMORY_ATTRIBUTES VM ioctl, and
  KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reports attributes supported by the
  KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl.
    + KVM_SET_MEMORY_ATTRIBUTES2 is not supported by the VM ioctl
+ Resolve locking issue when kvm_gmem_get_attribute() is called from
  kvm_mmu_zap_collapsible_spte() by bugging the VM. guest_memfd memslots
  don't support dirty tracking, so the locking issue is not on an
  accessible code path.
+ Moved guest_memfd_conversions_test.c to only be compiled and tested for
  x86, since it depends so heavily on KVM_X86_SW_PROTECTED_VM as a
  testing vehicle

TODOs

+ Perhaps further clarify PRESERVE flag: [8]
+ Resolve issue where guest_memfd_conversions_test, which uses the
  kselftest framework, doesn't perform teardown on assertion
  failure. Please see proposal at [9]
+ Test with TDX selftests. We're in the process of rebasing TDX selftests
  on this series and will post updates when that's tested.

I would like feedback on:

+ Content modes: 0 (MODE_UNSPECIFIED), ZERO, and PRESERVE. Is that all
  good, or does anyone think there is a use case for something else?
+ Should the content modes apply even if no attribute changes are required?
    + See notes added in "KVM: guest_memfd: Apply content modes while
      setting memory attributes"
    + Possibly related: should setting attributes be allowed if some
      sub-range requested already has the requested attribute?
+ Structure of how various content modes are checked for support or
  applied? I used overridable weak functions for architectures that haven't
  defined support, and defined overrides for x86 to show how I think it would
  work. For CoCo platforms, I only implemented TDX for illustration purposes
  and might need help with the other platforms. Should I have used
  kvm_x86_ops? I tried and found myself defining lots of boilerplate.
+ The use of private_mem_conversions_test.sh to run different options in
  private_mem_conversions_test. If this makes sense, I'll adjust the
  Makefile to have private_mem_conversions_test tested only via the script.

This series is based on kvm/next, and here's the tree for your convenience:

https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v5

Older series:

+ RFCv4 is at [7]
+ RFCv3 is at [6]
+ RFCv2 is at [5]
+ RFCv1 is at [4]
+ Previous versions of this feature, part of other series, are available at
  [1][2][3].

[1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
[2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
[3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
[4] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
[5] https://lore.kernel.org/all/cover.1770071243.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/r/20260313-gmem-inplace-conversion-v3-0-5fc12a70ec89@google.com/T/
[7] https://lore.kernel.org/all/20260326-gmem-inplace-conversion-v4-0-e202fe950ffd@google.com/T/
[8] https://lore.kernel.org/all/CAEvNRgGbMhkX310CkFY_M5x-zod=BDTiuznrZ0XvFPUK7weL1A@mail.gmail.com/
[9] https://lore.kernel.org/all/20260414-selftest-global-metadata-v1-0-fd223922bc57@google.com/T/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
Ackerley Tng (34):
      KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
      KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
      KVM: guest_memfd: Only prepare folios for private pages
      KVM: Move kvm_supported_mem_attributes() to kvm_host.h
      KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
      KVM: guest_memfd: Ensure pages are not in use before conversion
      KVM: guest_memfd: Call arch invalidate hooks on conversion
      KVM: guest_memfd: Return early if range already has requested attributes
      KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
      KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
      KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
      KVM: guest_memfd: Determine invalidation filter from memory attributes
      KVM: guest_memfd: Introduce default handlers for content modes
      KVM: guest_memfd: Apply content modes while setting memory attributes
      KVM: x86: Support SW_PROTECTED_VM in applying content modes
      KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
      KVM: x86: Support SNP and TDX applying content modes
      KVM: x86: Bug CoCo VM on page fault before finalizing
      KVM: Add CAP to enumerate supported SET_MEMORY_ATTRIBUTES2 flags
      KVM: selftests: Test basic single-page conversion flow
      KVM: selftests: Test conversion flow when INIT_SHARED
      KVM: selftests: Test conversion precision in guest_memfd
      KVM: selftests: Test conversion before allocation
      KVM: selftests: Convert with allocated folios in different layouts
      KVM: selftests: Test that truncation does not change shared/private status
      KVM: selftests: Test conversion with elevated page refcount
      KVM: selftests: Test that conversion to private does not support ZERO
      KVM: selftests: Support checking that data not equal expected
      KVM: selftests: Test that not specifying a conversion flag scrambles memory contents
      KVM: selftests: Reset shared memory after hole-punching
      KVM: selftests: Provide function to look up guest_memfd details from gpa
      KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
      KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
      KVM: selftests: Add script to exercise private_mem_conversions_test

Michael Roth (1):
      KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE

Sean Christopherson (18):
      KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
      KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
      KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
      KVM: Stub in ability to disable per-VM memory attribute tracking
      KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
      KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
      KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes
      KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
      KVM: selftests: Create gmem fd before "regular" fd when adding memslot
      KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
      KVM: selftests: Add support for mmap() on guest_memfd in core library
      KVM: selftests: Add selftests global for guest memory attributes capability
      KVM: selftests: Add helpers for calling ioctls on guest_memfd
      KVM: selftests: Test that shared/private status is consistent across processes
      KVM: selftests: Provide common function to set memory attributes
      KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
      KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes
      KVM: selftests: Update private memory exits test to work with per-gmem attributes

 Documentation/virt/kvm/api.rst                     | 139 ++++-
 .../virt/kvm/x86/amd-memory-encryption.rst         |  19 +-
 Documentation/virt/kvm/x86/intel-tdx.rst           |   4 +
 arch/x86/include/asm/kvm_host.h                    |   2 +-
 arch/x86/kvm/Kconfig                               |  15 +-
 arch/x86/kvm/mmu/mmu.c                             |  20 +-
 arch/x86/kvm/svm/sev.c                             |  18 +-
 arch/x86/kvm/vmx/tdx.c                             |   8 +-
 arch/x86/kvm/x86.c                                 | 145 ++++-
 include/linux/kvm_host.h                           |  74 ++-
 include/trace/events/kvm.h                         |   4 +-
 include/uapi/linux/kvm.h                           |  21 +
 mm/swap.c                                          |   2 +
 tools/testing/selftests/kvm/Makefile.kvm           |   5 +
 tools/testing/selftests/kvm/include/kvm_util.h     | 141 ++++-
 tools/testing/selftests/kvm/include/test_util.h    |  34 +-
 .../selftests/kvm/kvm_has_gmem_attributes.c        |  17 +
 tools/testing/selftests/kvm/lib/kvm_util.c         | 130 +++--
 tools/testing/selftests/kvm/lib/test_util.c        |   7 -
 tools/testing/selftests/kvm/lib/x86/sev.c          |   2 +-
 .../testing/selftests/kvm/pre_fault_memory_test.c  |   4 +-
 .../kvm/x86/guest_memfd_conversions_test.c         | 552 +++++++++++++++++++
 .../kvm/x86/private_mem_conversions_test.c         |  55 +-
 .../kvm/x86/private_mem_conversions_test.sh        | 128 +++++
 .../selftests/kvm/x86/private_mem_kvm_exits_test.c |  38 +-
 virt/kvm/Kconfig                                   |   3 +-
 virt/kvm/guest_memfd.c                             | 591 ++++++++++++++++++++-
 virt/kvm/kvm_main.c                                |  87 ++-
 28 files changed, 2075 insertions(+), 190 deletions(-)
---
base-commit: 39f1c201b93f4ff71631bac72cff6eb155f976a4
change-id: 20260225-gmem-inplace-conversion-bd0dbd39753a

Best regards,
--
Ackerley Tng <ackerleytng@google.com>




* [PATCH RFC v5 02/53] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:24 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Sean Christopherson <seanjc@google.com>

Rename the per-VM memory attributes Kconfig to make it explicitly about
per-VM attributes in anticipation of adding memory attributes support to
guest_memfd, at which point it will be possible (and desirable) to have
memory attributes without the per-VM support, even in x86.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/Kconfig            |  6 +++---
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 include/linux/kvm_host.h        |  8 ++++----
 include/trace/events/kvm.h      |  4 ++--
 virt/kvm/Kconfig                |  2 +-
 virt/kvm/kvm_main.c             | 14 +++++++-------
 8 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c470e40a00aa4..60b997764beef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #endif
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 801bf9e520db3..26f6afd51bbdc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -84,7 +84,7 @@ config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
 	depends on KVM_X86 && X86_64
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	help
 	  Enable support for KVM software-protected VMs.  Currently, software-
 	  protected VMs are purely a development and testing vehicle for
@@ -135,7 +135,7 @@ config KVM_INTEL_TDX
 	bool "Intel Trust Domain Extensions (TDX) support"
 	default y
 	depends on INTEL_TDX_HOST
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_POPULATE
 	help
 	  Provides support for launching Intel Trust Domain Extensions (TDX)
@@ -159,7 +159,7 @@ config KVM_AMD_SEV
 	depends on KVM_AMD && X86_64
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
 	select HAVE_KVM_ARCH_GMEM_POPULATE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 24fbc9ea502a3..8276d7ca02036 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7906,7 +7906,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 		vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 				int level)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a1b63c63d1a9..1560de1e95be0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13625,7 +13625,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		}
 	}
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	kvm_mmu_init_memslot_memory_attributes(kvm, slot);
 #endif
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4c14aee1fb063..7b9faa3545300 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-#ifndef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;
@@ -871,7 +871,7 @@ struct kvm {
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
 #endif
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	/* Protected by slots_lock (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
 #endif
@@ -2528,7 +2528,7 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
@@ -2550,7 +2550,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_GUEST_MEMFD
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index b282e3a867696..1ba72bd73ea2f 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -358,7 +358,7 @@ TRACE_EVENT(kvm_dirty_ring_exit,
 	TP_printk("vcpu %d", __entry->vcpu_id)
 );
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 /*
  * @start:	Starting address of guest memory range
  * @end:	End address of guest memory range
@@ -383,7 +383,7 @@ TRACE_EVENT(kvm_vm_set_mem_attributes,
 	TP_printk("%#016llx -- %#016llx [0x%lx]",
 		  __entry->start, __entry->end, __entry->attr)
 );
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 TRACE_EVENT(kvm_unmap_hva_range,
 	TP_PROTO(unsigned long start, unsigned long end),
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 794976b88c6f9..5119cb37145fc 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,7 +100,7 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
 config KVM_MMU_LOCKLESS_AGING
        bool
 
-config KVM_GENERIC_MEMORY_ATTRIBUTES
+config KVM_VM_MEMORY_ATTRIBUTES
        bool
 
 config KVM_GUEST_MEMFD
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 89489996fbc1e..306153abbafa5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,7 +1115,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	xa_init(&kvm->mem_attr_array);
 #endif
 
@@ -1300,7 +1300,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	srcu_barrier(&kvm->srcu);
 	cleanup_srcu_struct(&kvm->srcu);
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	xa_destroy(&kvm->mem_attr_array);
 #endif
 	kvm_arch_free_vm(kvm);
@@ -2418,7 +2418,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
 	if (!kvm || kvm_arch_has_private_mem(kvm))
@@ -2623,7 +2623,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 
 	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
 }
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
@@ -4921,7 +4921,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 	case KVM_CAP_DEVICE_CTRL:
 		return 1;
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
 		return kvm_supported_mem_attributes(kvm);
 #endif
@@ -5325,7 +5325,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_SET_MEMORY_ATTRIBUTES: {
 		struct kvm_memory_attributes attrs;
 
@@ -5336,7 +5336,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
 		break;
 	}
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
 

-- 
2.54.0.545.g6539524ca2-goog




* [PATCH RFC v5 03/53] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:24 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Sean Christopherson <seanjc@google.com>

Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
on kvm_arch_has_private_mem being #defined in anticipation of decoupling
kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
guest_memfd support for memory attributes will be unconditional to avoid
yet more macros (all architectures that support guest_memfd are expected to
use per-gmem attributes at some point). Once that happens, enumerating
support for KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes
being supported _somewhere_ would result in KVM over-reporting support on
arm64.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/kvm_host.h | 2 +-
 virt/kvm/kvm_main.c      | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
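
The guard described above can be modelled in userspace. This is a minimal
sketch, not the kernel code: struct vm, arch_has_private_mem(),
supported_mem_attributes() and ATTR_PRIVATE are hypothetical stand-ins
for the names in the diff below. It only illustrates how "#ifdef <hook>"
keys the reported attribute on the arch having #defined the hook rather
than on a Kconfig symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the kernel's definitions. */
#define ATTR_PRIVATE	(1ULL << 3)

struct vm {
	bool has_private_mem;
};

/* "Arch header": opt in by #defining the hook.  Remove to model an arch without it. */
#define arch_has_private_mem(vm) ((vm)->has_private_mem)

/* "Generic header": the fallback exists only if the arch did not opt in. */
#ifndef arch_has_private_mem
static inline bool arch_has_private_mem(struct vm *vm)
{
	return false;
}
#endif

uint64_t supported_mem_attributes(struct vm *vm)
{
#ifdef arch_has_private_mem
	/* A NULL vm means "no specific VM": report what could be supported. */
	if (!vm || arch_has_private_mem(vm))
		return ATTR_PRIVATE;
#endif
	return 0;
}

void example_usage(void)
{
	struct vm td = { .has_private_mem = true };
	struct vm plain = { .has_private_mem = false };

	assert(supported_mem_attributes(&td) == ATTR_PRIVATE);
	assert(supported_mem_attributes(&plain) == 0);
}
```

With the "#define arch_has_private_mem" line removed, the fallback is a
function rather than a macro, so the #ifdef block compiles out and the
sketch reports 0 for every VM.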

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b9faa3545300..7d079f9701346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifndef kvm_arch_has_private_mem
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 306153abbafa5..abb9cfa3eb04d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2421,8 +2421,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
+#ifdef kvm_arch_has_private_mem
 	if (!kvm || kvm_arch_has_private_mem(kvm))
 		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+#endif
 
 	return 0;
 }

-- 
2.54.0.545.g6539524ca2-goog




* [PATCH RFC v5 04/53] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:24 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Sean Christopherson <seanjc@google.com>

Introduce the basic infrastructure to allow per-VM memory attribute
tracking to be disabled. This will be built upon in a later patch, which
adds a module param to disable per-VM memory attribute tracking.

Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
plumbing, while the latter enables the full per-VM tracking via an xarray
and the associated ioctls.

kvm_get_memory_attributes() now performs a static call that either looks up
kvm->mem_attr_array when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
just returns 0 otherwise. The static call target is patched at init time
depending on whether per-VM tracking is enabled.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 include/linux/kvm_host.h        | 23 ++++++++++++---------
 virt/kvm/Kconfig                |  4 ++++
 virt/kvm/kvm_main.c             | 44 ++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 62 insertions(+), 11 deletions(-)
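
The dispatch described above can be approximated in userspace with a
plain function pointer. This is a rough sketch under stated assumptions:
a real static call patches the call site instead of loading a pointer,
and struct vm, NR_GFNS and every function name here are hypothetical
stand-ins rather than the kernel's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_GFNS 8

/* Hypothetical stand-in for kvm->mem_attr_array: one entry per gfn. */
struct vm {
	unsigned long mem_attr[NR_GFNS];
};

typedef unsigned long (get_attrs_fn)(struct vm *vm, unsigned long gfn);

/* Models the "return 0" default target. */
static unsigned long get_attrs_return0(struct vm *vm, unsigned long gfn)
{
	(void)vm;
	(void)gfn;
	return 0;
}

/* Models the per-VM lookup. */
static unsigned long get_vm_attrs(struct vm *vm, unsigned long gfn)
{
	return gfn < NR_GFNS ? vm->mem_attr[gfn] : 0;
}

/* Function pointer standing in for the static call key. */
static get_attrs_fn *get_attrs_target = get_attrs_return0;

/* Models the one-time selection made at init. */
void init_memory_attributes(bool vm_memory_attributes)
{
	get_attrs_target = vm_memory_attributes ? get_vm_attrs
						: get_attrs_return0;
}

unsigned long get_memory_attributes(struct vm *vm, unsigned long gfn)
{
	assert(get_attrs_target);
	return get_attrs_target(vm, gfn);
}
```

Callers only ever see get_memory_attributes(); whether a lookup happens
is decided once, at init, rather than on every call.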

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 60b997764beef..c9aa50bcdac2d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #endif
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7d079f9701346..c5ba2cb34e45c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2528,19 +2528,15 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
 }
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
+typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
+DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
+
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+	return static_call(__kvm_get_memory_attributes)(kvm, gfn);
 }
 
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
-					struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range);
-
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
@@ -2550,6 +2546,15 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
+#endif
+
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long mask, unsigned long attrs);
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range);
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range);
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_GUEST_MEMFD
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 5119cb37145fc..3fea89c45cfb4 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,7 +100,11 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
 config KVM_MMU_LOCKLESS_AGING
        bool
 
+config KVM_MEMORY_ATTRIBUTES
+       bool
+
 config KVM_VM_MEMORY_ATTRIBUTES
+       select KVM_MEMORY_ATTRIBUTES
        bool
 
 config KVM_GUEST_MEMFD
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index abb9cfa3eb04d..ee26f1d9b5fda 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
 static bool __ro_after_init allow_unsafe_mappings;
 module_param(allow_unsafe_mappings, bool, 0444);
 
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+static bool vm_memory_attributes = true;
+#else
+#define vm_memory_attributes false
+#endif
+DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
+#endif
+
 /*
  * Ordering of locks:
  *
@@ -2418,7 +2429,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
 #ifdef kvm_arch_has_private_mem
@@ -2429,6 +2440,12 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 	return 0;
 }
 
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+}
+
 /*
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * such that the bits in @mask match @attrs.
@@ -2625,7 +2642,24 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 
 	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
 }
+#else  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
+static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	BUILD_BUG_ON(1);
+}
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
+static void kvm_init_memory_attributes(void)
+{
+	if (vm_memory_attributes)
+		static_call_update(__kvm_get_memory_attributes,
+				   kvm_get_vm_memory_attributes);
+	else
+		static_call_update(__kvm_get_memory_attributes,
+				   (void *)__static_call_return0);
+}
+#else  /* CONFIG_KVM_MEMORY_ATTRIBUTES */
+static void kvm_init_memory_attributes(void) { }
+#endif /* CONFIG_KVM_MEMORY_ATTRIBUTES */
 
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
@@ -4925,6 +4959,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return 1;
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
+		if (!vm_memory_attributes)
+			return 0;
+
 		return kvm_supported_mem_attributes(kvm);
 #endif
 #ifdef CONFIG_KVM_GUEST_MEMFD
@@ -5331,6 +5368,10 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_SET_MEMORY_ATTRIBUTES: {
 		struct kvm_memory_attributes attrs;
 
+		r = -ENOTTY;
+		if (!vm_memory_attributes)
+			goto out;
+
 		r = -EFAULT;
 		if (copy_from_user(&attrs, argp, sizeof(attrs)))
 			goto out;
@@ -6527,6 +6568,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	kvm_preempt_ops.sched_in = kvm_sched_in;
 	kvm_preempt_ops.sched_out = kvm_sched_out;
 
+	kvm_init_memory_attributes();
 	kvm_init_debug();
 
 	r = kvm_vfio_ops_init();

-- 
2.54.0.545.g6539524ca2-goog




* [PATCH RFC v5 05/53] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:25 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Sean Christopherson <seanjc@google.com>

Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
core and architecture code to query per-GFN memory attributes.

kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
queries the guest_memfd file's attributes to determine if the page is
marked as private.

If vm_memory_attributes is not enabled, there is no shared/private tracking
at the VM level. Install the guest_memfd implementation as long as
guest_memfd is enabled, so that guest_memfd can answer attribute queries.

guest_memfd should look up attributes regardless of whether this memslot is
gmem-only since attributes are now tracked by gmem regardless of whether
mmap() is enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/guest_memfd.c   | 37 +++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      |  3 +++
 3 files changed, 42 insertions(+)
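
The lookup path described above (gfn -> memslot -> guest_memfd ->
attribute at the file offset) can be sketched in userspace as follows.
All types and names are hypothetical stand-ins, the attribute store is a
flat array instead of the structure guest_memfd uses, and the locking
that the real function performs is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define GMEM_PAGES 16

/* Hypothetical stand-in for a guest_memfd inode's attribute store. */
struct gmem {
	unsigned long attrs[GMEM_PAGES];
};

/* Hypothetical stand-in for a memslot bound to a gmem at offset pgoff. */
struct memslot {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned long pgoff;
	struct gmem *gmem;	/* NULL if the slot has no guest_memfd */
};

struct vm {
	struct memslot *slots;
	size_t nr_slots;
};

static struct memslot *gfn_to_slot(struct vm *vm, unsigned long gfn)
{
	for (size_t i = 0; i < vm->nr_slots; i++) {
		struct memslot *s = &vm->slots[i];

		if (gfn >= s->base_gfn && gfn - s->base_gfn < s->npages)
			return s;
	}
	return NULL;
}

unsigned long gmem_get_memory_attributes(struct vm *vm, unsigned long gfn)
{
	struct memslot *slot;
	unsigned long index;

	assert(vm);
	slot = gfn_to_slot(vm, gfn);

	/* No slot, or no gmem behind the slot: cannot be private. */
	if (!slot || !slot->gmem)
		return 0;

	/* Translate the gfn into an offset within the gmem file. */
	index = gfn - slot->base_gfn + slot->pgoff;
	return index < GMEM_PAGES ? slot->gmem->attrs[index] : 0;
}
```

Both early returns model the "no chance of being private" cases; only a
gfn that maps to a slot with a guest_memfd reaches the attribute store.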

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c5ba2cb34e45c..28a54298d27db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2557,6 +2557,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
+unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
+
 #ifdef CONFIG_KVM_GUEST_MEMFD
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 17e5a23fec0a1..e56f89640d050 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -515,6 +515,43 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+	struct inode *inode;
+	unsigned long attrs;
+
+	/*
+	 * If this gfn has no associated memslot, there's no chance of the gfn
+	 * being backed by private memory, since guest_memfd must be used for
+	 * private memory, and guest_memfd must be associated with some memslot.
+	 */
+	if (!slot)
+		return 0;
+
+	CLASS(gmem_get_file, file)(slot);
+	if (!file)
+		return 0;
+
+	inode = file_inode(file);
+
+	/*
+	 * Acquire the filemap lock to ensure the mtree lookup gets a
+	 * stable result.  The caller _must_ still protect consumption
+	 * of private vs. shared by checking
+	 * mmu_invalidate_retry_gfn() under mmu_lock to serialize
+	 * against ongoing attribute updates.  Acquiring the filemap
+	 * lock only ensures a stable _lookup_, the result can become
+	 * stale as soon as the lock is dropped.
+	 */
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	attrs = kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
+	return attrs;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
+
 static struct file_operations kvm_gmem_fops = {
 	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee26f1d9b5fda..4139e903f756a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2653,6 +2653,9 @@ static void kvm_init_memory_attributes(void)
 	if (vm_memory_attributes)
 		static_call_update(__kvm_get_memory_attributes,
 				   kvm_get_vm_memory_attributes);
+	else if (IS_ENABLED(CONFIG_KVM_GUEST_MEMFD))
+		static_call_update(__kvm_get_memory_attributes,
+				   kvm_gmem_get_memory_attributes);
 	else
 		static_call_update(__kvm_get_memory_attributes,
 				   (void *)__static_call_return0);

-- 
2.54.0.545.g6539524ca2-goog




* [PATCH RFC v5 06/53] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:25 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Ackerley Tng <ackerleytng@google.com>

When the maximum mapping level is queried, KVM's MMU lock is held, and
while the MMU lock is held, guest_memfd cannot take the
filemap_invalidate_lock() to look up the current shared/private state of
the gfn, for these reasons:

+ The MMU lock is a spinlock or rwlock and cannot be held while taking a
  lock that can sleep.
+ In guest_memfd's code paths (such as truncate), the
  filemap_invalidate_lock() is held while taking the MMU lock, and taking
  the locks in reverse order would introduce an AB-BA deadlock.

Currently, the maximum mapping level is only queried from guest_memfd in
the process of recovering huge pages, if dirty logging is disabled on a
memslot. Dirty logging is not currently supported for guest_memfd, and
guest_memfd memslots also cannot be updated.

For now, bug the VM if guest_memfd needs to be queried to determine the
maximum mapping level. This guard can be removed if/when support is added.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 9 +++++++++
 1 file changed, 9 insertions(+)
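
The guard can be modelled in userspace as a comparison of the currently
installed handler against the one that would need a sleeping lock. This
is a sketch under assumptions: the function pointer stands in for the
static call, the "bugged" flag stands in for KVM_BUG_ON() marking the
VM, and all names and the MAX_LEVEL value are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LEVEL 5	/* hypothetical "no limit" level */

struct vm {
	bool bugged;
};

typedef unsigned long (get_attrs_fn)(struct vm *vm, unsigned long gfn);

/* Models the per-VM lookup, which is safe under a spinlock. */
static unsigned long vm_get_attrs(struct vm *vm, unsigned long gfn)
{
	(void)vm;
	(void)gfn;
	return 0;
}

/* Models the gmem lookup, which would need a lock that can sleep. */
static unsigned long gmem_get_attrs(struct vm *vm, unsigned long gfn)
{
	(void)vm;
	(void)gfn;
	return 1UL << 3;
}

static get_attrs_fn *attrs_target = vm_get_attrs;

/* Called with the (modelled) MMU lock held. */
int max_mapping_level(struct vm *vm)
{
	assert(vm);

	/* Refuse to call a handler that cannot run in this context. */
	if (attrs_target == gmem_get_attrs) {
		vm->bugged = true;
		return 0;
	}

	(void)attrs_target(vm, 0);
	return MAX_LEVEL;
}
```

Failing loudly here is a placeholder for real support: the sketch, like
the patch, prefers a marked-dead VM over taking locks in the wrong order.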

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8276d7ca02036..2cc848bddf190 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3364,6 +3364,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
 		max_level = fault->max_level;
 		is_private = fault->is_private;
 	} else {
+		/*
+		 * Memory attributes cannot be obtained from guest_memfd while
+		 * the MMU lock is held.
+		 */
+		if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
+			       kvm_gmem_get_memory_attributes, kvm)) {
+			return 0;
+		}
+
 		max_level = PG_LEVEL_NUM;
 		is_private = kvm_mem_is_private(kvm, gfn);
 	}

-- 
2.54.0.545.g6539524ca2-goog




* [PATCH RFC v5 07/53] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
From: Ackerley Tng via B4 Relay @ 2026-04-28 23:25 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Update the guest_memfd populate() flow to pull memory attributes from the
gmem instance instead of the VM when KVM is not configured to track
shared/private status in the VM.

Rename the per-VM API to make it clear that it retrieves per-VM
attributes, i.e. is not suitable for use outside of flows that are
specific to generic per-VM attributes.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 14 +++++++++++++-
 virt/kvm/guest_memfd.c   | 24 +++++++++++++++++++++---
 virt/kvm/kvm_main.c      |  8 +++-----
 4 files changed, 38 insertions(+), 10 deletions(-)
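
The per-gmem range check can be sketched in userspace as "every page in
[index, index + nr_pages) must carry exactly the private attribute".
This is an illustration only: it walks a flat array, whereas the patch
iterates the entries of gi->attributes, and range_is_private() and
ATTR_PRIVATE are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for KVM_MEMORY_ATTRIBUTE_PRIVATE. */
#define ATTR_PRIVATE	(1UL << 3)

/*
 * Returns true only if every page in [index, index + nr_pages) is
 * marked private and nothing else.  An empty range is trivially true.
 */
bool range_is_private(const unsigned long *attrs, size_t index,
		      size_t nr_pages)
{
	assert(attrs);

	for (size_t i = index; i < index + nr_pages; i++) {
		if (attrs[i] != ATTR_PRIVATE)
			return false;
	}

	return true;
}
```

The comparison is for equality, not a bit test, which mirrors the
"!= KVM_MEMORY_ATTRIBUTE_PRIVATE" check in the diff below.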

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2cc848bddf190..d3da387340a9d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -8004,7 +8004,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
 	if (level == PG_LEVEL_2M)
-		return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
+		return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 28a54298d27db..1deab76dc0a2c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2549,12 +2549,24 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #endif
 
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+extern bool vm_memory_attributes;
+bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long mask, unsigned long attrs);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
+#else
+#define vm_memory_attributes false
+static inline bool kvm_range_has_vm_memory_attributes(struct kvm *kvm,
+						      gfn_t start, gfn_t end,
+						      unsigned long mask,
+						      unsigned long attrs)
+{
+	WARN_ONCE(1, "Unexpected call to kvm_range_has_vm_memory_attributes()");
+
+	return false;
+}
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e56f89640d050..758ac24a0ff40 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -936,12 +936,31 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
+static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
+				      size_t nr_pages, struct kvm *kvm, gfn_t gfn)
+{
+	pgoff_t end = index + nr_pages - 1;
+	void *entry;
+
+	if (vm_memory_attributes)
+		return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
+						       KVM_MEMORY_ATTRIBUTE_PRIVATE,
+						       KVM_MEMORY_ATTRIBUTE_PRIVATE);
+
+	mt_for_each(&gi->attributes, entry, index, end) {
+		if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
+			return false;
+	}
+
+	return true;
+}
 
 static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
 				struct file *file, gfn_t gfn, struct page *src_page,
 				kvm_gmem_populate_cb post_populate, void *opaque)
 {
 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
+	struct gmem_inode *gi;
 	struct folio *folio;
 	kvm_pfn_t pfn;
 	int ret;
@@ -956,9 +975,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	folio_unlock(folio);
 
-	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
-					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
-					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+	gi = GMEM_I(file_inode(file));
+	if (!kvm_gmem_range_is_private(gi, index, 1, kvm, gfn)) {
 		ret = -EINVAL;
 		goto out_put_folio;
 	}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4139e903f756a..0a4024948711a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -103,9 +103,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
 
 #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
-static bool vm_memory_attributes = true;
-#else
-#define vm_memory_attributes false
+bool vm_memory_attributes = true;
 #endif
 DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
@@ -2450,7 +2448,7 @@ static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * such that the bits in @mask match @attrs.
  */
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long mask, unsigned long attrs)
 {
 	XA_STATE(xas, &kvm->mem_attr_array, start);
@@ -2584,7 +2582,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	mutex_lock(&kvm->slots_lock);
 
 	/* Nothing to do if the entire range has the desired attributes. */
-	if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
+	if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
 		goto out_unlock;
 
 	/*

-- 
2.54.0.545.g6539524ca2-goog



