LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2 0/5] KVM: PPC: Handle CPU compatibility mode for nested guests
From: Ritesh Harjani @ 2026-05-14  3:19 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Vaibhav Jain, Paolo Bonzini, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
	Shuah Khan, kvm, linux-kernel, linux-doc
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>


Hi Amit,

Amit Machhiwal <amachhiw@linux.ibm.com> writes:

> On POWER systems, newer processor generations can operate in compatibility
> modes corresponding to earlier generations (e.g., a Power11 system running
> in Power10 compatibility mode). In such cases, the effective CPU level
> exposed to guests differs from the physical processor generation.
>
> This creates a problem for nested virtualization. When booting a nested KVM
> guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
> userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
> and attempt to configure the nested guest accordingly. However, the L1
> partition is constrained by the compatibility level negotiated with the
> hypervisor (L0), and requests exceeding that level are rejected, leading to
> guest boot failures such as:
>
>   KVM-NESTEDv2: couldn't set guest wide elements
>
> This series addresses the issue in two steps:
>
> 1. Detect and reject invalid compatibility requests early in KVM to avoid
>    late failures.
>
> 2. Provide a mechanism for userspace to query the effective CPU
>    compatibility modes supported by the host, so it can select an
>    appropriate CPU model for nested guests.
>

Do we really need to add a uapi change for this? Tools like Qemu can
read the device tree info of the host, isn't it?

> To achieve this, the series introduces a new KVM capability and ioctl
> (KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
> compatibility modes supported by the host.
>
> The implementation supports both:
>
>   - PowerVM (nested API v2), where compatibility information is obtained
>     via the H_GUEST_GET_CAPABILITIES hypercall.
>   - PowerNV (nested API v1), where compatibility is derived from the device
>     tree ("cpu-version") representing the effective processor compatibility
>     level.

See there you go, for PowerNV if this info is provided in the device
tree, then Qemu could as well just read that info, no?

... yup, kvmppc_read_int_dt() can do that I guess.

So, my request is, can we look into this to see, if there is a possible
alternative to this? maybe we already have a mechanism which Qemu could
use to get this info already?

btw - I haven't given a full read of the patch series, but reading the
cover letter, I felt  we should atleast add this info to the cover
letter on, why a uapi change is really needed here, why can't the
existing alternatives work for us. 

-ritesh

>
> This allows userspace (e.g., QEMU) to select a CPU model consistent with
> the host compatibility mode, avoiding mismatches and enabling successful
> nested guest boot.
>


^ permalink raw reply

* Re: [PATCH v5] char/nvram: Remove redundant nvram_mutex
From: Venkat @ 2026-05-14  3:57 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Greg Kroah-Hartman, linux-kernel, linux-kbuild, linuxppc-dev,
	Arnd Bergmann, Christophe Leroy, Ritesh Harjani,
	Madhavan Srinivasan, Tellakula Yeswanth Krishna
In-Reply-To: <20260428061540.73668-1-venkat88@linux.ibm.com>


Hi,

Gentle ping on this patch.

This removes the unused global nvram_mutex and relies on the
existing per-architecture synchronization, as suggested earlier.

I’ve re-tested the change, and everything continues to work as expected.
No issues observed in validation.

Please let me know if any further changes are needed.

Thanks,
Venkat


> On 28 Apr 2026, at 11:45 AM, Venkat Rao Bagalkote <venkat88@linux.ibm.com> wrote:
> 
> The global nvram_mutex in drivers/char/nvram.c is redundant and unused,
> and this triggers compiler warnings on some configurations.
> 
> All platform-specific nvram operations already provide their own internal
> synchronization, meaning the wrapper-level mutex does not provide any
> additional safety.
> 
> Remove the nvram_mutex definition along with all remaining lock/unlock
> users across PPC32, x86, and m68k code paths, and rely entirely on the
> per-architecture nvram implementations for locking.
> 
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Tested-by: Tellakula Yeswanth Krishna <yeswanth@linux.ibm.com>
> Signed-off-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> ---
> Changes since v4:
> - No code changes
> - Resent after v7.1-rc1 as suggested by Arnd Bergmann
> 
> drivers/char/nvram.c | 16 +++-------------
> 1 file changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/char/nvram.c b/drivers/char/nvram.c
> index 9eff426a9286..e89cc1f1c89e 100644
> --- a/drivers/char/nvram.c
> +++ b/drivers/char/nvram.c
> @@ -53,7 +53,6 @@
> #include <asm/nvram.h>
> #endif
> 
> -static DEFINE_MUTEX(nvram_mutex);
> static DEFINE_SPINLOCK(nvram_state_lock);
> static int nvram_open_cnt; /* #times opened */
> static int nvram_open_mode; /* special open modes */
> @@ -310,11 +309,8 @@ static long nvram_misc_ioctl(struct file *file, unsigned int cmd,
> break;
> #ifdef CONFIG_PPC32
> case IOC_NVRAM_SYNC:
> - if (ppc_md.nvram_sync != NULL) {
> - mutex_lock(&nvram_mutex);
> + if (ppc_md.nvram_sync)
> ppc_md.nvram_sync();
> - mutex_unlock(&nvram_mutex);
> - }
> ret = 0;
> break;
> #endif
> @@ -324,11 +320,8 @@ static long nvram_misc_ioctl(struct file *file, unsigned int cmd,
> if (!capable(CAP_SYS_ADMIN))
> return -EACCES;
> 
> - if (arch_nvram_ops.initialize != NULL) {
> - mutex_lock(&nvram_mutex);
> + if (arch_nvram_ops.initialize)
> ret = arch_nvram_ops.initialize();
> - mutex_unlock(&nvram_mutex);
> - }
> break;
> case NVRAM_SETCKS:
> /* just set checksum, contents unchanged (maybe useful after
> @@ -336,11 +329,8 @@ static long nvram_misc_ioctl(struct file *file, unsigned int cmd,
> if (!capable(CAP_SYS_ADMIN))
> return -EACCES;
> 
> - if (arch_nvram_ops.set_checksum != NULL) {
> - mutex_lock(&nvram_mutex);
> + if (arch_nvram_ops.set_checksum)
> ret = arch_nvram_ops.set_checksum();
> - mutex_unlock(&nvram_mutex);
> - }
> break;
> #endif /* CONFIG_X86 || CONFIG_M68K */
> }
> -- 
> 2.45.2
> 



^ permalink raw reply

* Re: [PATCH v4 01/13] dma-direct: swiotlb: handle swiotlb alloc/free outside __dma_direct_alloc_pages
From: Aneesh Kumar K.V @ 2026-05-14  4:54 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agSDPJMZkcI-uH8f@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Tue, May 12, 2026 at 02:33:56PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> Move swiotlb allocation out of __dma_direct_alloc_pages() and handle it in
>> dma_direct_alloc() / dma_direct_alloc_pages().
>> 
>> This is needed for follow-up changes that simplify the handling of
>> memory encryption/decryption based on the DMA attribute flags.
>> 
>> swiotlb backing pages are already mapped decrypted by
>> swiotlb_update_mem_attributes() and rmem_swiotlb_device_init(), so
>> dma-direct should not call dma_set_decrypted() on allocation nor
>> dma_set_encrypted() on free for swiotlb-backed memory.
>> 
>> Update alloc/free paths to detect swiotlb-backed pages and skip
>> encrypt/decrypt transitions for those paths. Keep the existing highmem
>> rejection in dma_direct_alloc_pages() for swiotlb allocations.
>> 
>> Only for "restricted-dma-pool", we currently set `for_alloc = true`, while
>> rmem_swiotlb_device_init() decrypts the whole pool up front. This pool is
>> typically used together with "shared-dma-pool", where the shared region is
>> accessed after remap/ioremap and the returned address is suitable for
>> decrypted memory access. So existing code paths remain valid.
>> 
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>  kernel/dma/direct.c | 44 +++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 37 insertions(+), 7 deletions(-)
>> 
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index ec887f443741..b958f150718a 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -125,9 +125,6 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
>>  
>>  	WARN_ON_ONCE(!PAGE_ALIGNED(size));
>>  
>> -	if (is_swiotlb_for_alloc(dev))
>> -		return dma_direct_alloc_swiotlb(dev, size);
>> -
>>  	gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
>>  	page = dma_alloc_contiguous(dev, size, gfp);
>>  	if (page) {
>> @@ -204,6 +201,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>>  {
>>  	bool remap = false, set_uncached = false;
>> +	bool mark_mem_decrypt = true;
>>  	struct page *page;
>>  	void *ret;
>>  
>> @@ -250,11 +248,21 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  	    dma_direct_use_pool(dev, gfp))
>>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>>  
>> +	if (is_swiotlb_for_alloc(dev)) {
>> +		page = dma_direct_alloc_swiotlb(dev, size);
>> +		if (page) {
>> +			mark_mem_decrypt = false;
>> +			goto setup_page;
>> +		}
>> +		return NULL;
>> +	}
>> +
>>  	/* we always manually zero the memory once we are done */
>>  	page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO, true);
>>  	if (!page)
>>  		return NULL;
>>  
>> +setup_page:
>>  	/*
>>  	 * dma_alloc_contiguous can return highmem pages depending on a
>>  	 * combination the cma= arguments and per-arch setup.  These need to be
>> @@ -281,7 +289,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  			goto out_free_pages;
>>  	} else {
>>  		ret = page_address(page);
>> -		if (dma_set_decrypted(dev, ret, size))
>> +		if (mark_mem_decrypt && dma_set_decrypted(dev, ret, size))
>
> I am ok with that approach, but Jason was mentioning we shouldn’t
> special case swiotlb and make the allocator return the memory state
> (similar to the dma_page [1]) . I am also OK if you want to merge that
> part of my series with is.
>
> [1] https://lore.kernel.org/linux-iommu/20260408194750.2280873-1-smostafa@google.com/
>

I was not sure whether we need dma_page. As shown in this series, we can
simplify the allocation and free paths without adding new abstractions
like dma_page.

>
>>  			goto out_leak_pages;
>>  	}
>>  
>> @@ -298,7 +306,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  	return ret;
>>  
>>  out_encrypt_pages:
>> -	if (dma_set_encrypted(dev, page_address(page), size))
>> +	if (mark_mem_decrypt && dma_set_encrypted(dev, page_address(page), size))
>>  		return NULL;
>>  out_free_pages:
>>  	__dma_direct_free_pages(dev, page, size);
>> @@ -310,6 +318,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  void dma_direct_free(struct device *dev, size_t size,
>>  		void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
>>  {
>> +	bool mark_mem_encrypted = true;
>>  	unsigned int page_order = get_order(size);
>>  
>>  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
>> @@ -338,12 +347,15 @@ void dma_direct_free(struct device *dev, size_t size,
>>  	    dma_free_from_pool(dev, cpu_addr, PAGE_ALIGN(size)))
>>  		return;
>>  
>> +	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
>> +		mark_mem_encrypted = false;
>> +
>>  	if (is_vmalloc_addr(cpu_addr)) {
>>  		vunmap(cpu_addr);
>>  	} else {
>>  		if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED))
>>  			arch_dma_clear_uncached(cpu_addr, size);
>> -		if (dma_set_encrypted(dev, cpu_addr, size))
>> +		if (mark_mem_encrypted && dma_set_encrypted(dev, cpu_addr, size))
>>  			return;
>>  	}
>>  
>> @@ -359,6 +371,19 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>>  	if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
>>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>>  
>> +	if (is_swiotlb_for_alloc(dev)) {
>> +		page = dma_direct_alloc_swiotlb(dev, size);
>> +		if (!page)
>> +			return NULL;
>> +
>> +		if (PageHighMem(page)) {
>
> My understanding is that rmem_swiotlb_device_init() asserts that there
> is no PageHighMem()? Also a similar check doesn’t exist in
> dma_direct_alloc().
>

The reason I added that HighMem check is that __dma_direct_alloc_pages()
already has that check.

	page = dma_alloc_contiguous(dev, size, gfp);
	if (page) {
		if (dma_coherent_ok(dev, page_to_phys(page), size) &&
		    (allow_highmem || !PageHighMem(page)))
			return page;

		dma_free_contiguous(dev, page, size);
	}

I understand that the current usage of swiotlb alloc is restricted to
restricted memory, and it will not return HighMem pages. I will drop
this hunk from the patch.

-aneesh


^ permalink raw reply

* Re: [PATCH v4 02/13] dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
From: Aneesh Kumar K.V @ 2026-05-14  5:01 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agSDk0QsEM0ZBCTA@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Tue, May 12, 2026 at 02:33:57PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> Propagate force_dma_unencrypted() into DMA_ATTR_CC_SHARED in the
>> dma-direct allocation path and use the attribute to drive the related
>> decisions.
>> 
>> This updates dma_direct_alloc(), dma_direct_free(), and
>> dma_direct_alloc_pages() to fold the forced unencrypted case into attrs.
>> 
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>  kernel/dma/direct.c | 44 ++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 36 insertions(+), 8 deletions(-)
>> 
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index b958f150718a..0c2e1f8436ce 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -201,16 +201,31 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>>  {
>>  	bool remap = false, set_uncached = false;
>> -	bool mark_mem_decrypt = true;
>> +	bool mark_mem_decrypt = false;
>>  	struct page *page;
>>  	void *ret;
>>  
>> +	/*
>> +	 * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
>> +	 * attribute. The direct allocator uses it internally after it has
>> +	 * decided that the backing pages must be shared/decrypted, so the
>> +	 * rest of the allocation path can consistently select DMA addresses,
>> +	 * choose compatible pools and restore encryption on free.
>> +	 */
>> +	if (attrs & DMA_ATTR_CC_SHARED)
>> +		return NULL;
>> +
>> +	if (force_dma_unencrypted(dev)) {
>> +		attrs |= DMA_ATTR_CC_SHARED;
>> +		mark_mem_decrypt = true;
>> +	}
>> +
>>  	size = PAGE_ALIGN(size);
>>  	if (attrs & DMA_ATTR_NO_WARN)
>>  		gfp |= __GFP_NOWARN;
>>  
>> -	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
>> -	    !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev))
>> +	if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
>> +	     DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev))
>>  		return dma_direct_alloc_no_mapping(dev, size, dma_handle, gfp);
>>  
>>  	if (!dev_is_dma_coherent(dev)) {
>> @@ -244,7 +259,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  	 * Remapping or decrypting memory may block, allocate the memory from
>>  	 * the atomic pools instead if we aren't allowed block.
>>  	 */
>> -	if ((remap || force_dma_unencrypted(dev)) &&
>> +	if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
>>  	    dma_direct_use_pool(dev, gfp))
>>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>>  
>> @@ -318,11 +333,20 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  void dma_direct_free(struct device *dev, size_t size,
>>  		void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
>>  {
>> -	bool mark_mem_encrypted = true;
>> +	bool mark_mem_encrypted = false;
>>  	unsigned int page_order = get_order(size);
>>  
>> -	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
>> -	    !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev)) {
>> +	/*
>> +	 * if the device had requested for an unencrypted buffer,
>> +	 * convert it to encrypted on free
>> +	 */
>> +	if (force_dma_unencrypted(dev)) {
>> +		attrs |= DMA_ATTR_CC_SHARED;
>> +		mark_mem_encrypted = true;
>> +	}
>> +
>> +	if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
>> +	     DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev)) {
>>  		/* cpu_addr is a struct page cookie, not a kernel address */
>>  		dma_free_contiguous(dev, cpu_addr, size);
>>  		return;
>> @@ -365,10 +389,14 @@ void dma_direct_free(struct device *dev, size_t size,
>>  struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>>  		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
>>  {
>> +	unsigned long attrs = 0;
>>  	struct page *page;
>>  	void *ret;
>>  
>> -	if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
>> +	if (force_dma_unencrypted(dev))
>> +		attrs |= DMA_ATTR_CC_SHARED;
>> +
>> +	if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
>>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
> What about dma_direct_free_pages()? Nothing inside uses attrs, but
> that’s quite similar to dma_direct_alloc_pages()
>
> Also, at this point, shouldn’t this patch also remove
> force_dma_unencrypted() calls from dma_set_decrypted() and
> dma_set_encrypted()?
>

The final change have 

void dma_direct_free_pages(struct device *dev, size_t size,
...
{
	/*
	 * if the device had requested for an unencrypted buffer,
	 * convert it to encrypted on free
	 */
	bool mark_mem_encrypted =  force_dma_unencrypted(dev);

That got added by "dma-direct: select DMA address encoding from DMA_ATTR_CC_SHARED "

I'll move that hunk into this patch. The overall goal is 

dma_direct_alloc(.. attrs)
{

	if (force_dma_unencrypted(dev))
		attrs |= DMA_ATTR_CC_SHARED;
}

dma_direct_free(.. attrs)
{
	if (force_dma_unencrypted(dev)) {
		attrs |= DMA_ATTR_CC_SHARED;
		mark_mem_encrypted = true;
	}

	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
		mark_mem_encrypted = false;

}

dma_direct_alloc_pages()
{
	if (force_dma_unencrypted(dev))
		attrs |= DMA_ATTR_CC_SHARED;

}

dma_direct_free_pages()
{
	bool mark_mem_encrypted =  force_dma_unencrypted(dev);

	if (swiotlb_find_pool(dev, page_to_phys(page)))
		mark_mem_encrypted = false;

}

-aneesh


^ permalink raw reply

* Re: [PATCH] powerpc/hv-gpci: fix preempt count leak in sysfs show paths
From: Shrikanth Hegde @ 2026-05-14  5:20 UTC (permalink / raw)
  To: Aboorva Devarajan, Madhavan Srinivasan, Athira Rajeev,
	linuxppc-dev
  Cc: Christophe Leroy, Kajol Jain, linux-kernel
In-Reply-To: <20260508041256.3447113-1-aboorvad@linux.ibm.com>



On 5/8/26 9:42 AM, Aboorva Devarajan wrote:
> Four sysfs show() callbacks in hv-gpci take get_cpu_var(hv_gpci_reqb)
> (which calls preempt_disable()) but only call the matching put_cpu_var()
> on the error path under the 'out:' label. Every successful read leaks
> one preempt_disable():
> 
>    processor_bus_topology_show()
>    processor_config_show()
>    affinity_domain_via_virtual_processor_show()
>    affinity_domain_via_domain_show()
> 
> (affinity_domain_via_partition_show() was already correct.)
> 
> On a CONFIG_PREEMPT=y kernel, repeated reads raise preempt_count and
> eventually return to userspace with preemption still disabled. The
> next user-mode page fault then hits faulthandler_disabled() == 1,
> gets forced to SIGSEGV, and the resulting coredump trips
> 'BUG: scheduling while atomic' in call_usermodehelper_exec ->
> wait_for_completion_state -> schedule:
> 
>    BUG: scheduling while atomic: <task>/<pid>/0x00000004
>    ...
>    __schedule_bug+0x6c/0x90
>    __schedule+0x58c/0x13a0
>    schedule+0x48/0x1a0
>    schedule_timeout+0x104/0x170
>    wait_for_completion_state+0x16c/0x330
>    call_usermodehelper_exec+0x254/0x2d0
>    vfs_coredump+0x1050/0x2590
>    get_signal+0xb9c/0xc80
>    do_notify_resume+0xf8/0x470
> 
> Add an out_success label that calls put_cpu_var() before returning
> the byte count, mirroring affinity_domain_via_partition_show().
> 
> Fixes: 71f1c39647d8 ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show processor bus topology information")
> Fixes: 1a160c2a13c6 ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show processor config information")
> Fixes: 71a7ccb478fc ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity domain via virtual processor information")
> Fixes: a69a57cac1ec ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity domain via domain information")
> Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> ---
>   arch/powerpc/perf/hv-gpci.c | 24 ++++++++++++++++--------
>   1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
> index 5cac2cf3bd1e5..10c82cf8f5b39 100644
> --- a/arch/powerpc/perf/hv-gpci.c
> +++ b/arch/powerpc/perf/hv-gpci.c
> @@ -210,7 +210,7 @@ static ssize_t processor_bus_topology_show(struct device *dev, struct device_att
>   			0, 0, buf, &n, arg);
>   
>   	if (!ret)
> -		return n;
> +		goto out_success;
>   
>   	if (ret != H_PARAMETER)
>   		goto out;
> @@ -244,12 +244,14 @@ static ssize_t processor_bus_topology_show(struct device *dev, struct device_att
>   				starting_index, 0, buf, &n, arg);
>   
>   		if (!ret)
> -			return n;
> +			goto out_success;
>   
>   		if (ret != H_PARAMETER)
>   			goto out;
>   	}
>   
> +out_success:
> +	put_cpu_var(hv_gpci_reqb);
>   	return n;
>   
>   out:
> @@ -278,7 +280,7 @@ static ssize_t processor_config_show(struct device *dev, struct device_attribute
>   			0, 0, buf, &n, arg);
>   
>   	if (!ret)
> -		return n;
> +		goto out_success;
>   
>   	if (ret != H_PARAMETER)
>   		goto out;
> @@ -312,12 +314,14 @@ static ssize_t processor_config_show(struct device *dev, struct device_attribute
>   				starting_index, 0, buf, &n, arg);
>   
>   		if (!ret)
> -			return n;
> +			goto out_success;
>   
>   		if (ret != H_PARAMETER)
>   			goto out;
>   	}
>   
> +out_success:
> +	put_cpu_var(hv_gpci_reqb);
>   	return n;
>   
>   out:
> @@ -346,7 +350,7 @@ static ssize_t affinity_domain_via_virtual_processor_show(struct device *dev,
>   			0, 0, buf, &n, arg);
>   
>   	if (!ret)
> -		return n;
> +		goto out_success;
>   
>   	if (ret != H_PARAMETER)
>   		goto out;
> @@ -382,12 +386,14 @@ static ssize_t affinity_domain_via_virtual_processor_show(struct device *dev,
>   				starting_index, secondary_index, buf, &n, arg);
>   
>   		if (!ret)
> -			return n;
> +			goto out_success;
>   
>   		if (ret != H_PARAMETER)
>   			goto out;
>   	}
>   
> +out_success:
> +	put_cpu_var(hv_gpci_reqb);
>   	return n;
>   
>   out:
> @@ -416,7 +422,7 @@ static ssize_t affinity_domain_via_domain_show(struct device *dev, struct device
>   			0, 0, buf, &n, arg);
>   
>   	if (!ret)
> -		return n;
> +		goto out_success;
>   
>   	if (ret != H_PARAMETER)
>   		goto out;
> @@ -448,12 +454,14 @@ static ssize_t affinity_domain_via_domain_show(struct device *dev, struct device
>   					starting_index, 0, buf, &n, arg);
>   
>   		if (!ret)
> -			return n;
> +			goto out_success;
>   
>   		if (ret != H_PARAMETER)
>   			goto out;
>   	}
>   
> +out_success:
> +	put_cpu_var(hv_gpci_reqb);
>   	return n;
>   
>   out:


Thanks for fixing this. This change per se, look good to me. So,
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>


But, it would be good to move to scope based gaurds. That would
remove all the goto's and issues like this.


^ permalink raw reply

* Re: [PATCH] powerpc/hv-gpci: fix preempt count leak in sysfs show paths
From: Shrikanth Hegde @ 2026-05-14  5:24 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Christophe Leroy, Kajol Jain, linux-kernel, Aboorva Devarajan,
	Madhavan Srinivasan, Athira Rajeev, linuxppc-dev
In-Reply-To: <5x4usp43.ritesh.list@gmail.com>

Hi Ritesh.

> I guess, these would be the other kind of issues which might show up now,
> since powerpc is moved to CONFIG_PREEMPTION=y with lazy preempt as the
> default preemption mode. It will be nice to have some mechanism to catch
> these error handling paths where preempt counts might be getting leaked.
> 

There is CONFIG_DEBUG_PREEMPT=y which helps you to catch these issue
where count leak happens. Otherwise we only see end result, but not the culprit.

That said, we do have issues of not nesting with get_cpu/put_cpu in few places and
preempt_disable/preempt_enable in few return paths still.

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-14  5:54 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agSKQrSIhizCXKwx@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Tue, May 12, 2026 at 02:33:59PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> Teach swiotlb to distinguish between encrypted and decrypted bounce
>> buffer pools, and make allocation and mapping paths select a pool whose
>> state matches the requested DMA attributes.
>> 
>> Add a decrypted flag to io_tlb_mem, initialize it for the default and
>> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
>> allocation. Reject swiotlb alloc/map requests when the selected pool does
>> not match the required encrypted/decrypted state.
>> 
>> Also return DMA addresses with the matching phys_to_dma_{encrypted,
>> unencrypted} helper so the DMA address encoding stays consistent with the
>> chosen pool.
>> 
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>  include/linux/dma-direct.h |  10 ++++
>>  include/linux/swiotlb.h    |   8 ++-
>>  kernel/dma/direct.c        |  14 +++--
>>  kernel/dma/swiotlb.c       | 108 +++++++++++++++++++++++++++----------
>>  4 files changed, 107 insertions(+), 33 deletions(-)
>> 
>> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
>> index c249912456f9..94fad4e7c11e 100644
>> --- a/include/linux/dma-direct.h
>> +++ b/include/linux/dma-direct.h
>> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
>>  #ifndef phys_to_dma_unencrypted
>>  #define phys_to_dma_unencrypted		phys_to_dma
>>  #endif
>> +
>> +#ifndef phys_to_dma_encrypted
>> +#define phys_to_dma_encrypted		phys_to_dma
>> +#endif
>>  #else
>>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>>  {
>> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
>>  {
>>  	return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
>>  }
>> +
>> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
>> +		phys_addr_t paddr)
>> +{
>> +	return dma_addr_encrypted(__phys_to_dma(dev, paddr));
>> +}
>>  /*
>>   * If memory encryption is supported, phys_to_dma will set the memory encryption
>>   * bit in the DMA address, and dma_to_phys will clear it.
>> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
>> index 3dae0f592063..b3fa3c6e0169 100644
>> --- a/include/linux/swiotlb.h
>> +++ b/include/linux/swiotlb.h
>> @@ -81,6 +81,7 @@ struct io_tlb_pool {
>>  	struct list_head node;
>>  	struct rcu_head rcu;
>>  	bool transient;
>> +	bool unencrypted;
>>  #endif
>>  };
>>  
>> @@ -111,6 +112,7 @@ struct io_tlb_mem {
>>  	struct dentry *debugfs;
>>  	bool force_bounce;
>>  	bool for_alloc;
>> +	bool unencrypted;
>>  #ifdef CONFIG_SWIOTLB_DYNAMIC
>>  	bool can_grow;
>>  	u64 phys_limit;
>> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
>>  extern void swiotlb_print_info(void);
>>  
>>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> -struct page *swiotlb_alloc(struct device *dev, size_t size);
>> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> +		unsigned long attrs);
>>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>>  
>>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>> @@ -290,7 +293,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>>  	return dev->dma_io_tlb_mem->for_alloc;
>>  }
>>  #else
>> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
>> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
>> +		unsigned long attrs)
>>  {
>>  	return NULL;
>>  }
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index dc2907439b3d..97ae4fa10521 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -104,9 +104,10 @@ static void __dma_direct_free_pages(struct device *dev, struct page *page,
>>  	dma_free_contiguous(dev, page, size);
>>  }
>>  
>> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
>> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
>> +		unsigned long attrs)
>>  {
>> -	struct page *page = swiotlb_alloc(dev, size);
>> +	struct page *page = swiotlb_alloc(dev, size, attrs);
>>  
>>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
>>  		swiotlb_free(dev, page, size);
>> @@ -266,8 +267,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  						  gfp, attrs);
>>  
>>  	if (is_swiotlb_for_alloc(dev)) {
>> -		page = dma_direct_alloc_swiotlb(dev, size);
>> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>>  		if (page) {
>> +			/*
>> +			 * swiotlb allocations comes from pool already marked
>> +			 * decrypted
>> +			 */
>>  			mark_mem_decrypt = false;
>>  			goto setup_page;
>>  		}
>> @@ -374,6 +379,7 @@ void dma_direct_free(struct device *dev, size_t size,
>>  		return;
>>  
>>  	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
>> +		/* Swiotlb doesn't need a page attribute update on free */
>>  		mark_mem_encrypted = false;
>>  
>>  	if (is_vmalloc_addr(cpu_addr)) {
>> @@ -403,7 +409,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>>  						  gfp, attrs);
>>  
>>  	if (is_swiotlb_for_alloc(dev)) {
>> -		page = dma_direct_alloc_swiotlb(dev, size);
>> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>>  		if (!page)
>>  			return NULL;
>>  
>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> index ab4eccbaa076..065663be282c 100644
>> --- a/kernel/dma/swiotlb.c
>> +++ b/kernel/dma/swiotlb.c
>> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
>>  	struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
>>  	unsigned long bytes;
>>  
>> +	/*
>> +	 * if platform support memory encryption, swiotlb buffers are
>> +	 * decrypted by default.
>> +	 */
>> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> +		io_tlb_default_mem.unencrypted = true;
>> +	else
>> +		io_tlb_default_mem.unencrypted = false;
>> +
>>  	if (!mem->nslabs || mem->late_alloc)
>>  		return;
>>  	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
>> -	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> +
>> +	if (io_tlb_default_mem.unencrypted)
>> +		set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>>  }
>>  
>>  static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
>> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>>  	if (!mem->slots)
>>  		goto error_slots;
>>  
>> -	set_memory_decrypted((unsigned long)vstart,
>> -			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> +	if (io_tlb_default_mem.unencrypted)
>> +		set_memory_decrypted((unsigned long)vstart,
>> +				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> +
>>  	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
>>  				 nareas);
>>  	add_mem_pool(&io_tlb_default_mem, mem);
>> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
>>  	tbl_size = PAGE_ALIGN(mem->end - mem->start);
>>  	slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>>  
>> -	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> +	if (io_tlb_default_mem.unencrypted)
>> +		set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> +
>>  	if (mem->late_alloc) {
>>  		area_order = get_order(array_size(sizeof(*mem->areas),
>>  			mem->nareas));
>> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
>>   * @gfp:	GFP flags for the allocation.
>>   * @bytes:	Size of the buffer.
>>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
>>   *
>>   * Allocate pages from the buddy allocator. If successful, make the allocated
>>   * pages decrypted that they can be used for DMA.
>> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
>>   * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
>>   * if the allocated physical address was above @phys_limit.
>>   */
>> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
>> +		u64 phys_limit, bool unencrypted)
>>  {
>>  	unsigned int order = get_order(bytes);
>>  	struct page *page;
>> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>>  	}
>>  
>>  	vaddr = phys_to_virt(paddr);
>> -	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> +	if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>>  		goto error;
>>  	return page;
>>  
>>  error:
>>  	/* Intentional leak if pages cannot be encrypted again. */
>> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> +	if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>>  		__free_pages(page, order);
>>  	return NULL;
>>  }
>> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>>   * @dev:	Device for which a memory pool is allocated.
>>   * @bytes:	Size of the buffer.
>>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> + * @attrs:	DMA attributes for the allocation.
>>   * @gfp:	GFP flags for the allocation.
>>   *
>>   * Return: Allocated pages, or %NULL on allocation failure.
>>   */
>>  static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> -		u64 phys_limit, gfp_t gfp)
>> +		u64 phys_limit, unsigned long attrs, gfp_t gfp)
>>  {
>>  	struct page *page;
>> -	unsigned long attrs = 0;
>>  
>>  	/*
>>  	 * Allocate from the atomic pools if memory is encrypted and
>>  	 * the allocation is atomic, because decrypting may block.
>>  	 */
>> -	if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
>> +	if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
>>  		void *vaddr;
>>  
>>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
>>  			return NULL;
>>  
>> -		/* swiotlb considered decrypted by default */
>> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> -			attrs = DMA_ATTR_CC_SHARED;
>> -
>>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
>>  					   attrs, dma_coherent_ok);
>>  	}
>> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>>  	else if (phys_limit <= DMA_BIT_MASK(32))
>>  		gfp |= __GFP_DMA32;
>>  
>> -	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
>> +	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
>> +					     !!(attrs & DMA_ATTR_CC_SHARED)))) {
>>  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
>>  		    phys_limit < DMA_BIT_MASK(64) &&
>>  		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
>> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>>   * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
>>   * @vaddr:	Virtual address of the buffer.
>>   * @bytes:	Size of the buffer.
>> + * @unencrypted: true if @vaddr was allocated decrypted and must be
>> + *	re-encrypted before being freed
>>   */
>> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
>>  {
>>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>>  	    dma_free_from_pool(NULL, vaddr, bytes))
>>  		return;
>>  
>>  	/* Intentional leak if pages cannot be encrypted again. */
>> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> +	if (!unencrypted ||
>> +	    !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>>  		__free_pages(virt_to_page(vaddr), get_order(bytes));
>>  }
>>  
>> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>>   * @nslabs:	Desired (maximum) number of slabs.
>>   * @nareas:	Number of areas.
>>   * @phys_limit:	Maximum DMA buffer physical address.
>> + * @attrs:	DMA attributes for the allocation.
>>   * @gfp:	GFP flags for the allocations.
>>   *
>>   * Allocate and initialize a new IO TLB memory pool. The actual number of
>> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>>   */
>>  static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>>  		unsigned long minslabs, unsigned long nslabs,
>> -		unsigned int nareas, u64 phys_limit, gfp_t gfp)
>> +		unsigned int nareas, u64 phys_limit, unsigned long attrs,
>> +		gfp_t gfp)
>>  {
>>  	struct io_tlb_pool *pool;
>>  	unsigned int slot_order;
>> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>>  	if (!pool)
>>  		goto error;
>>  	pool->areas = (void *)pool + sizeof(*pool);
>> +	pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>>  
>>  	tlb_size = nslabs << IO_TLB_SHIFT;
>> -	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
>> +	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
>>  		if (nslabs <= minslabs)
>>  			goto error_tlb;
>>  		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
>> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>>  	return pool;
>>  
>>  error_slots:
>> -	swiotlb_free_tlb(page_address(tlb), tlb_size);
>> +	swiotlb_free_tlb(page_address(tlb), tlb_size,
>> +			 !!(attrs & DMA_ATTR_CC_SHARED));
>>  error_tlb:
>>  	kfree(pool);
>>  error:
>> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
>>  	struct io_tlb_pool *pool;
>>  
>>  	pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
>> -				  default_nareas, mem->phys_limit, GFP_KERNEL);
>> +				  default_nareas, mem->phys_limit,
>> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> +				  GFP_KERNEL);
>>  	if (!pool) {
>>  		pr_warn_ratelimited("Failed to allocate new pool");
>>  		return;
>> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
>>  	size_t tlb_size = pool->end - pool->start;
>>  
>>  	free_pages((unsigned long)pool->slots, get_order(slots_size));
>> -	swiotlb_free_tlb(pool->vaddr, tlb_size);
>> +	swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
>>  	kfree(pool);
>>  }
>>  
>> @@ -1232,6 +1255,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>>  	nslabs = nr_slots(alloc_size);
>>  	phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
>>  	pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
>> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>>  				  GFP_NOWAIT);
>>  	if (!pool)
>>  		return -1;
>> @@ -1394,6 +1418,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>>  		enum dma_data_direction dir, unsigned long attrs)
>>  {
>>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> +	bool require_decrypted = false;
>>  	unsigned int offset;
>>  	struct io_tlb_pool *pool;
>>  	unsigned int i;
>> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>>  	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>>  		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>>  
>> +	/*
>> +	 * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
>> +	 * but the device is forcing decryption, use decrypted io_tlb_mem
>> +	 */
>> +	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>> +		require_decrypted = true;
>> +
>> +	if (require_decrypted != mem->unencrypted)
>> +		return (phys_addr_t)DMA_MAPPING_ERROR;
>> +
>>  	/*
>>  	 * The default swiotlb memory pool is allocated with PAGE_SIZE
>>  	 * alignment. If a mapping is requested with larger alignment,
>> @@ -1608,8 +1643,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
>>  		return DMA_MAPPING_ERROR;
>>  
>> -	/* Ensure that the address returned is DMA'ble */
>> -	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> +	/*
>> +	 * Use the allocated io_tlb_mem encryption type to determine dma addr.
>> +	 */
>> +	if (dev->dma_io_tlb_mem->unencrypted)
>> +		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> +	else
>> +		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
>> +
>>  	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
>>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
>>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
>> @@ -1773,7 +1814,8 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>>  
>>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>>  
>> -struct page *swiotlb_alloc(struct device *dev, size_t size)
>> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> +		unsigned long attrs)
>>  {
>>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>>  	struct io_tlb_pool *pool;
>> @@ -1784,6 +1826,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>>  	if (!mem)
>>  		return NULL;
>>  
>> +	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
>> +		return NULL;
>> +
>>  	align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
>>  	index = swiotlb_find_slots(dev, 0, size, align, &pool);
>>  	if (index == -1)
>> @@ -1853,9 +1898,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>>  			kfree(mem);
>>  			return -ENOMEM;
>>  		}
>> +		/*
>> +		 * if platform supports memory encryption,
>> +		 * restricted mem pool is decrypted by default
>> +		 */
>> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> +			mem->unencrypted = true;
>> +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
>> +					     rmem->size >> PAGE_SHIFT);
>> +		} else {
>> +			mem->unencrypted = false;
>> +		}
>
> This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
> traffic now fails.
>
> Also, by design, some drivers are clueless about bouncing, so
> I believe that the pool should have a way to control it’s property
> (encrypted or decrypted) and that takes priority over whatever
> attributes comes from allocation.
> And that brings us to the same point whether it’s better to return
> the memory along with it’s state or we pass the requested state.
> I think for other cases it’s fine for the device/DMA-API to dictate
> the attrs, but not in restricted-dma case, the firmware just knows better.
>

Is it that the pKVM guest kernel does not have awareness of
encrypted/decrypted DMA allocations? Instead, the firmware attaches
hypervisor-shared pages to the device via restricted-dma-pool? The
kernel then has swiotlb->for_alloc = true, and hence all DMA allocations
go through the restricted-dma-pool?

Given that pKVM supports pkvm_set_memory_encrypted() and
pkvm_set_memory_decrypted(), can we consider adding CC_ATTR_MEM_ENCRYPT
support to pKVM? It would also be good to investigate whether we can set
force_dma_unencrypted(dev) to true where needed.

I agree that this patch, as it stands, can break pKVM because we are now
missing the set_memory_decrypted() call required for pKVM to work.

We now mark the swiotlb io_tlb_mem as unencrypted/encrypted in the guest
using struct io_tlb_mem->unencrypted. I am not clear what we can use for
pKVM to conditionalize this so that it works for both protected and
unprotected guests.

-aneesh



^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-14  6:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
	linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260513172450.GR7702@ziepe.ca>

Jason Gunthorpe <jgg@ziepe.ca> writes:

>> And that brings us to the same point whether it’s better to return
>> the memory along with it’s state or we pass the requested state.
>> I think for other cases it’s fine for the device/DMA-API to dictate
>> the attrs, but not in restricted-dma case, the firmware just knows better.
>
> The memory type must be returned back at some level so downstream
> things can do the right transformation of the phys_addr_t.
>
> One of the aspirational CC things that should work is a T=1 device
> tries to DMA from a decrypted page, finds the address is above the dma
> limit of the device, so it bounces it with SWIOTLB to an encrypted low
> address page and then the DMA API internal flow switiches from working
> with decrypted to encrypted phys_addr_t.
>
> If we can make that work then maybe the flows are designed correctly.
>

That is what this patch series aims to do. dma_direct_map_phys() will
derive the DMA address based on the attributes of the physical address:

if (attrs & DMA_ATTR_CC_SHARED)
        dma_addr = phys_to_dma_unencrypted(dev, phys);
else
        dma_addr = phys_to_dma_encrypted(dev, phys);

If that fails the dma_capable() check, we fall back to swiotlb_map():

if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)))
        return swiotlb_map(dev, phys, size, dir, attrs);

We currently do not have an encrypted SWIOTLB pool, but once that is
supported, swiotlb_map() should do the right thing and return the
correct encrypted dma_addr_t based on io_tlb_mem->unencrypted:

if (dev->dma_io_tlb_mem->unencrypted) {
        dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
        attrs |= DMA_ATTR_CC_SHARED;
} else {
        dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
}

-aneesh


^ permalink raw reply

* Re: [PATCH v4 03/13] dma-pool: track decrypted atomic pools and select them via attrs
From: Aneesh Kumar K.V @ 2026-05-14  7:00 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agSED6BdSXBhbDYI@google.com>

Mostafa Saleh <smostafa@google.com> writes:

...

>>  struct page *dma_alloc_from_pool(struct device *dev, size_t size,
>> -		void **cpu_addr, gfp_t gfp,
>> +		void **cpu_addr, gfp_t gfp, unsigned long attrs,
>>  		bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t))
>>  {
>> -	struct gen_pool *pool = NULL;
>> +	struct dma_gen_pool *dma_pool = NULL;
>>  	struct page *page;
>>  	bool pool_found = false;
>>  
>> -	while ((pool = dma_guess_pool(pool, gfp))) {
>> +	while ((dma_pool = dma_guess_pool(dma_pool, gfp))) {
>> +
>> +		if (dma_pool->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
>> +			continue;
>> +
>
> nit: If we fail to find a matching pool, a slightly misleading message
> is printed as pool_found = false
>

The message printed is

	WARN(1, "Failed to get suitable pool for %s\n", dev_name(dev));

That is correct, isn’t it? The kernel failed to find a pool with the
correct encryption attribute. For example, the request was for an
encrypted allocation from the pool, but no encrypted pool was available.

>
>>  		pool_found = true;
>> -		page = __dma_alloc_from_pool(dev, size, pool, cpu_addr,
>> +		page = __dma_alloc_from_pool(dev, size, dma_pool->pool, cpu_addr,
>>  					     phys_addr_ok);
>>  		if (page)
>>  			return page;
>> @@ -296,12 +345,14 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,

-aneesh


^ permalink raw reply

* Re: [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
From: Oscar Salvador @ 2026-05-14  7:51 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-2-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:29PM +0800, Muchun Song wrote:
> Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
> HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].
> 
> Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
> bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
> With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
> pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
> already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
> get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
> shared tail page and can panic during boot.
> 
> Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
> bootmem HugeTLB folios are processed, and drop the later initialization
> from hugetlb_vmemmap_init().
> 
> This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
> assertion is evaluated.
> 
> Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

For the correctness of the change

Acked-by: Oscar Salvador <osalvador@suse.de>

but I have a couple of comments:

> ---
>  mm/hugetlb.c         | 19 +++++++++++++++++++
>  mm/hugetlb_vmemmap.c | 17 -----------------
>  2 files changed, 19 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31b34ca0f402..d22683ab30a1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void)
>  		.max_threads	= num_node_state(N_MEMORY),
>  		.numa_aware	= true,
>  	};
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +	struct zone *zone;
> +
> +	for_each_zone(zone) {
> +		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
> +			struct page *tail, *p;
> +			unsigned int order;
> +
> +			tail = zone->vmemmap_tails[i];
> +			if (!tail)
> +				continue;
> +
> +			order = i + VMEMMAP_TAIL_MIN_ORDER;
> +			p = page_to_virt(tail);
> +			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
> +				init_compound_tail(p + j, NULL, order, zone);
> +		}
> +	}

This deserves a comment why do we need to initialize those pages here, no need for
a fat one but a hint, because everybody else looking at this will wonder why do not
have it in hugetlb_vmemmap_init(), as the name suggests.


> +#endif
>  
>  	padata_do_multithreaded(&job);
>  }
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4a077d231d3a..62e61af18c9a 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
>  static int __init hugetlb_vmemmap_init(void)

At this point, hugetlb_vmemmap_init only initialites the sysctls for
each hstate, should the name reflect that better?

Also, vmemmap_get_tail() still thinks that tail pages are initialized
here from this comment:

 "/*
   * Only allocate the page, but do not initialize it.
   *
   * Any initialization done here will be overwritten by memmap_init().
   *
   * hugetlb_vmemmap_init() will take care of initialization after
   * memmap_init().
   */"

so we might want to adjust that as well, and I am not sure if we have
more comments in the tree referencing hugetlb_vmemmap_init() as the init
point for those pages that need correction.

 
Having said that, and this is just a question, we cannot really make
gather_bootmem_prealloc() also do the initialization of the sysfs right?
I see that hugetlb sysfs gets initialized from hugetlb_init() a few
calls after, so.. meh.

Anyway, maybe just convert hugetlb_vmemmap_init to hugetlb_vmemmap_sysfs_create
(pick a better name), so it does not look like having two entry points
for vmemmap init stuff.

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
From: Oscar Salvador @ 2026-05-14  7:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-3-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:30PM +0800, Muchun Song wrote:
> __hugetlb_vmemmap_optimize_folios() uses incorrect arguments when handling
> bootmem HugeTLB folios.
> 
> The section number passed to register_page_bootmem_memmap() is derived from
> the vmemmap virtual address of folio->page instead of the folio PFN, so the
> bootmem memmap metadata can be registered against the wrong section. The
> helper is also given HUGETLB_VMEMMAP_RESERVE_SIZE even though it expects a
> page count, not a size in bytes. In addition, the write-protect range is
> based on pages_per_huge_page(h), which does not cover the full HugeTLB
> vmemmap area and can leave part of the shared tail vmemmap mapping writable.
> 
> Fix the section lookup to use folio_pfn(folio), use
> HUGETLB_VMEMMAP_RESERVE_PAGES when registering the reserved memmap pages, and
> use hugetlb_vmemmap_size(h) for the write-protect range.
> 
> Fixes: 752fe17af693 ("mm/hugetlb: add pre-HVO framework")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

did you observe any Ooops or malfunctioning or was just code reviewing?

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 03/69] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
From: Oscar Salvador @ 2026-05-14  8:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-4-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:31PM +0800, Muchun Song wrote:
> vmemmap_populate_compound_pages() uses addr_pfn to determine the PFN
> offset within a compound page and to decide whether the current
> vmemmap slot should be populated as a head page mapping or should reuse
> a tail page mapping.
> 
> However, addr_pfn is advanced manually in parallel with addr.  The loop
> itself progresses in vmemmap address space, so each PAGE_SIZE step in
> addr covers PAGE_SIZE / sizeof(struct page) struct page slots.  Since
> addr_pfn is compared against nr_pages in data-PFN units, it should
> advance by the same number of PFNs.  The existing manual increments do
> not match that and therefore do not reliably track the PFN
> corresponding to the current addr.
> 
> As a result, pfn_offset can be computed from the wrong PFN and the code
> can make the head/tail decision for the wrong compound-page position.
> 
> Fix this by deriving addr_pfn directly from the current vmemmap address
> instead of carrying it as loop state.
> 
> Fixes: f2b79c0d7968 ("powerpc/book3s64/radix: add support for vmemmap optimization for radix")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
From: Oscar Salvador @ 2026-05-14  8:05 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-5-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:32PM +0800, Muchun Song wrote:
> Gigantic bootmem HugeTLB pages are currently initialized from hugetlb_init(),
> but page_alloc_init_late() runs earlier and walks pageblocks to determine
> zone contiguity.
> 
> If a bootmem HugeTLB region is marked noinit, set_zone_contiguous() can
> observe still-uninitialized struct pages through __pageblock_pfn_to_page().
> This may not trigger an immediate failure, but it can make
> set_zone_contiguous() compute the wrong zone contiguity state. If extra
> poisoned-page checks are added in this path, such as PF_POISONED_CHECK()
> in page_zone_id(), it can also trigger an early boot panic.
> 
> Initialize gigantic bootmem HugeTLB struct pages from page_alloc_init_late(),
> before zone contiguity is evaluated, so later page allocator setup only
> sees valid struct page state. This also makes the initialization order
> more natural, as struct pages should be initialized before later code
> inspects them.
> 
> Fixes: fde1c4ecf916 ("mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

but

> ---
>  include/linux/hugetlb.h | 5 +++++
>  mm/hugetlb.c            | 3 +--
>  mm/mm_init.c            | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 93418625d3c5..52a2c30f866c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -173,6 +173,7 @@ extern int movable_gigantic_pages __read_mostly;
>  extern int sysctl_hugetlb_shm_group __read_mostly;
>  extern struct list_head huge_boot_pages[MAX_NUMNODES];
>  
> +void hugetlb_struct_page_init(void);

we lost the hint that this only takes care of bootmem pages.
So I think hugetlb_bootmem_struct_page_init or something like that would make
it more clear?


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 05/69] mm/mm_init: Simplify deferred_free_pages() migratetype init
From: Oscar Salvador @ 2026-05-14  8:12 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-6-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:33PM +0800, Muchun Song wrote:
> deferred_free_pages() open-codes two loops to initialize the pageblock
> migratetype for a range of pages.
> 
> Replace them with pageblock_migratetype_init_range() to remove the
> duplication and make the code clearer (Note that deferred_free_pages()
> may be called from atomic context).
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Acked-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
From: Muchun Song @ 2026-05-14  8:13 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Muchun Song, Andrew Morton, David Hildenbrand, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <agV-8Ng-E0tZ9G5k@localhost.localdomain>



> On May 14, 2026, at 15:51, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Wed, May 13, 2026 at 09:04:29PM +0800, Muchun Song wrote:
>> Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
>> HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].
>> 
>> Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
>> bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
>> With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
>> pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
>> already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
>> get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
>> shared tail page and can panic during boot.
>> 
>> Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
>> bootmem HugeTLB folios are processed, and drop the later initialization
>> from hugetlb_vmemmap_init().
>> 
>> This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
>> assertion is evaluated.
>> 
>> Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> For the correctness of the change
> 
> Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks.

> 
> but I have a couple of comments:
> 
>> ---
>> mm/hugetlb.c         | 19 +++++++++++++++++++
>> mm/hugetlb_vmemmap.c | 17 -----------------
>> 2 files changed, 19 insertions(+), 17 deletions(-)
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 31b34ca0f402..d22683ab30a1 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void)
>> .max_threads = num_node_state(N_MEMORY),
>> .numa_aware = true,
>> };
>> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> + 	struct zone *zone;
>> +
>> + 	for_each_zone(zone) {
>> + 		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
>> + 			struct page *tail, *p;
>> + 			unsigned int order;
>> +
>> + 			tail = zone->vmemmap_tails[i];
>> + 			if (!tail)
>> + 				continue;
>> +
>> + 			order = i + VMEMMAP_TAIL_MIN_ORDER;
>> + 			p = page_to_virt(tail);
>> + 			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
>> + 				init_compound_tail(p + j, NULL, order, zone);
>> + 		}
>> + 	}
> 
> This deserves a comment why do we need to initialize those pages here, no need for
> a fat one but a hint, because everybody else looking at this will wonder why do not
> have it in hugetlb_vmemmap_init(), as the name suggests.

Make sense.

> 
> 
>> +#endif
>> 
>> 	padata_do_multithreaded(&job);
>> }
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 4a077d231d3a..62e61af18c9a 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
>> static int __init hugetlb_vmemmap_init(void)
> 
> At this point, hugetlb_vmemmap_init only initialites the sysctls for
> each hstate, should the name reflect that better?

A good point. I can add a new cleanup patch.

> 
> Also, vmemmap_get_tail() still thinks that tail pages are initialized
> here from this comment:

Good catch. I'll fix it.

> 
> "/*
>   * Only allocate the page, but do not initialize it.
>   *
>   * Any initialization done here will be overwritten by memmap_init().
>   *
>   * hugetlb_vmemmap_init() will take care of initialization after
>   * memmap_init().
>   */"
> 
> so we might want to adjust that as well, and I am not sure if we have
> more comments in the tree referencing hugetlb_vmemmap_init() as the init
> point for those pages that need correction.

Only this one.

> 
> 
> Having said that, and this is just a question, we cannot really make
> gather_bootmem_prealloc() also do the initialization of the sysfs right?
> I see that hugetlb sysfs gets initialized from hugetlb_init() a few
> calls after, so.. meh.

hugetlb_vmemmap_init() is introduced by me. I want to hide some symbols
from mm/hugetlb_vmemmap.c at that time. So I used late_initcall() for that.

> 
> Anyway, maybe just convert hugetlb_vmemmap_init to hugetlb_vmemmap_sysfs_create
> (pick a better name), so it does not look like having two entry points
> for vmemmap init stuff.

This bug fix is a temporary change, the whole zone initialization will
removed later in patch 30. Then, there will be only one place for vmemmap init :)

I’m curious about your thoughts on the practice of minimizing external
symbols. Personally, I’d prefer to move these initializations into
their respective files:

    - hugetlb_sysfs_init();
    - hugetlb_cgroup_file_init();
    - hugetlb_sysctl_init();

Of course, that’s just my personal preference.

Muchun,
Thanks.


> 
> -- 
> Oscar Salvador
> SUSE Labs




^ permalink raw reply

* Re: [PATCH v2 06/69] mm/sparse: Panic on memmap and usemap allocation failure
From: Oscar Salvador @ 2026-05-14  8:15 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-7-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:34PM +0800, Muchun Song wrote:
> When vmemmap or usemap allocation fails, sparse_init_nid() currently
> marks the section non-present and continues. Later boot-time code can
> still walk PFNs in that section without checking for this partial setup,
> which leads to invalid accesses. subsection_map_init() can also touch an
> unallocated usemap.
> 
> Auditing and fixing all early PFN walkers for this case is not worth the
> complexity. These allocation failures are expected to be fatal anyway,
> and other memory models already treat them that way.
> 
> Make memmap and usemap allocation failures panic immediately instead of
> trying to recover and crashing later in less obvious ways. This is also
> consistent with how other memory model configurations handle memmap
> allocation failures.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Yes, one could argue that if we fail to allocate memory at that early
stage we are already screwed.

Acked-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
From: Muchun Song @ 2026-05-14  8:16 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Muchun Song, Andrew Morton, David Hildenbrand, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <agWCVeywSo5lk1Vs@localhost.localdomain>



> On May 14, 2026, at 16:05, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Wed, May 13, 2026 at 09:04:32PM +0800, Muchun Song wrote:
>> Gigantic bootmem HugeTLB pages are currently initialized from hugetlb_init(),
>> but page_alloc_init_late() runs earlier and walks pageblocks to determine
>> zone contiguity.
>> 
>> If a bootmem HugeTLB region is marked noinit, set_zone_contiguous() can
>> observe still-uninitialized struct pages through __pageblock_pfn_to_page().
>> This may not trigger an immediate failure, but it can make
>> set_zone_contiguous() compute the wrong zone contiguity state. If extra
>> poisoned-page checks are added in this path, such as PF_POISONED_CHECK()
>> in page_zone_id(), it can also trigger an early boot panic.
>> 
>> Initialize gigantic bootmem HugeTLB struct pages from page_alloc_init_late(),
>> before zone contiguity is evaluated, so later page allocator setup only
>> sees valid struct page state. This also makes the initialization order
>> more natural, as struct pages should be initialized before later code
>> inspects them.
>> 
>> Fixes: fde1c4ecf916 ("mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO")
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> Acked-by: Oscar Salvador <osalvador@suse.de>
> 

Thanks.

> but
> 
>> ---
>> include/linux/hugetlb.h | 5 +++++
>> mm/hugetlb.c            | 3 +--
>> mm/mm_init.c            | 1 +
>> 3 files changed, 7 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 93418625d3c5..52a2c30f866c 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -173,6 +173,7 @@ extern int movable_gigantic_pages __read_mostly;
>> extern int sysctl_hugetlb_shm_group __read_mostly;
>> extern struct list_head huge_boot_pages[MAX_NUMNODES];
>> 
>> +void hugetlb_struct_page_init(void);
> 
> we lost the hint that this only takes care of bootmem pages.

Right.

> So I think hugetlb_bootmem_struct_page_init or something like that would make
> it more clear?

Yes. Make sense.

Thanks,
Muchun

> 
> 
> -- 
> Oscar Salvador
> SUSE Labs




^ permalink raw reply

* Re: [PATCH v2 07/69] mm/sparse: Move subsection_map_init() into sparse_init()
From: Oscar Salvador @ 2026-05-14  8:19 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-8-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:35PM +0800, Muchun Song wrote:
> subsection_map_init() is part of sparse memory initialization, but it is
> currently called from free_area_init().
> 
> Move it into sparse_init() so the sparse-specific setup stays together
> instead of being split across the generic free_area_init() path.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Acked-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
From: Muchun Song @ 2026-05-14  8:19 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Muchun Song, Andrew Morton, David Hildenbrand, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, Ackerley Tng,
	Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel
In-Reply-To: <agWANe6i5gsfXFgP@localhost.localdomain>



> On May 14, 2026, at 15:56, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Wed, May 13, 2026 at 09:04:30PM +0800, Muchun Song wrote:
>> __hugetlb_vmemmap_optimize_folios() uses incorrect arguments when handling
>> bootmem HugeTLB folios.
>> 
>> The section number passed to register_page_bootmem_memmap() is derived from
>> the vmemmap virtual address of folio->page instead of the folio PFN, so the
>> bootmem memmap metadata can be registered against the wrong section. The
>> helper is also given HUGETLB_VMEMMAP_RESERVE_SIZE even though it expects a
>> page count, not a size in bytes. In addition, the write-protect range is
>> based on pages_per_huge_page(h), which does not cover the full HugeTLB
>> vmemmap area and can leave part of the shared tail vmemmap mapping writable.
>> 
>> Fix the section lookup to use folio_pfn(folio), use
>> HUGETLB_VMEMMAP_RESERVE_PAGES when registering the reserved memmap pages, and
>> use hugetlb_vmemmap_size(h) for the write-protect range.
>> 
>> Fixes: 752fe17af693 ("mm/hugetlb: add pre-HVO framework")
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks.

> 
> did you observe any Ooops or malfunctioning or was just code reviewing?

Found by LLM (Gemini Pro).

When I touched the code here, LLM will report the bug as well, even it is
not introduced by me :)

Thanks,
Muchun

> 
> 
> 
> -- 
> Oscar Salvador
> SUSE Labs




^ permalink raw reply

* Re: [PATCH v2 0/5] KVM: PPC: Handle CPU compatibility mode for nested guests
From: Amit Machhiwal @ 2026-05-14 10:04 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan, Vaibhav Jain,
	Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc
In-Reply-To: <o6iir82o.ritesh.list@gmail.com>

Hi Ritesh,

Thanks for taking a look at this series. Please find my comment inline below:

On 2026/05/14 08:49 AM, Ritesh Harjani wrote:
> 
> Hi Amit,
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > On POWER systems, newer processor generations can operate in compatibility
> > modes corresponding to earlier generations (e.g., a Power11 system running
> > in Power10 compatibility mode). In such cases, the effective CPU level
> > exposed to guests differs from the physical processor generation.
> >
> > This creates a problem for nested virtualization. When booting a nested KVM
> > guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
> > userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
> > and attempt to configure the nested guest accordingly. However, the L1
> > partition is constrained by the compatibility level negotiated with the
> > hypervisor (L0), and requests exceeding that level are rejected, leading to
> > guest boot failures such as:
> >
> >   KVM-NESTEDv2: couldn't set guest wide elements
> >
> > This series addresses the issue in two steps:
> >
> > 1. Detect and reject invalid compatibility requests early in KVM to avoid
> >    late failures.
> >
> > 2. Provide a mechanism for userspace to query the effective CPU
> >    compatibility modes supported by the host, so it can select an
> >    appropriate CPU model for nested guests.
> >
>
> Do we really need to add a uapi change for this? Tools like Qemu can
> read the device tree info of the host, isn't it?

While cpu-version is available in /proc/device-tree/cpus/<cpu#>/cpu-version on
both L1 booted on PowerNV and PowerVM LPARs, I believe the UAPI change is still
preferable for several reasons:

1. We would want to rely on the capabilities negotiated with pHYP (L0) in KVM on
   PowerVM case instead of device tree property. Also, the cpu-version property
   only depicts the current compat mode host (L1) is booted in but doesn't
   really point to what all compat modes are supported for the nested guest
   (L2).

2. procfs dependency: Not all systems run with procfs enabled (CONFIG_PROC_FS is
   optional). For example, minimal configurations (like buildroot) might disable
   it. The KVM ioctl works regardless of procfs availability since it accesses
   kernel data structures directly.

3. Kernel validation: The kernel validates and normalizes the compatibility
   information. For example, patch 1 adds validation logic that rejects invalid
   compatibility requests early. The ioctl ensures userspace gets validated,
   consistent data.

4. Abstraction & stability: While /proc/device-tree works today, it's an
   implementation detail. The UAPI provides a stable interface that won't break
   if the underlying mechanism changes.

5. Semantic clarity: KVM_PPC_GET_COMPAT_CAPS clearly expresses what
   compatibility modes can I use for KVM guests vs. parsing device tree which
   requires understanding the semantic meaning of cpu-version.

>
> > To achieve this, the series introduces a new KVM capability and ioctl
> > (KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
> > compatibility modes supported by the host.
> >
> > The implementation supports both:
> >
> >   - PowerVM (nested API v2), where compatibility information is obtained
> >     via the H_GUEST_GET_CAPABILITIES hypercall.
> >   - PowerNV (nested API v1), where compatibility is derived from the device
> >     tree ("cpu-version") representing the effective processor compatibility
> >     level.
> 
> See there you go, for PowerNV if this info is provided in the device
> tree, then Qemu could as well just read that info, no?
>
> ... yup, kvmppc_read_int_dt() can do that I guess.
> 
> So, my request is, can we look into this to see, if there is a possible
> alternative to this? maybe we already have a mechanism which Qemu could
> use to get this info already?

You're right that QEMU could read the device tree from procfs. We had discussed
this approach internally as well. However, we believe the UAPI approach offers
additional benefits and looks more robust and future proof as outlined above.

> 
> btw - I haven't given a full read of the patch series, but reading the
> cover letter, I felt  we should atleast add this info to the cover
> letter on, why a uapi change is really needed here, why can't the
> existing alternatives work for us.

I have described above why we did the UAPI change for the approach followed in
this series. Could you please suggest what else can be added?

Thanks,
Amit

> -ritesh
> 
> >
> > This allows userspace (e.g., QEMU) to select a CPU model consistent with
> > the host compatibility mode, avoiding mismatches and enabling successful
> > nested guest boot.
> >

^ permalink raw reply

* Re: [PATCH v4 03/13] dma-pool: track decrypted atomic pools and select them via attrs
From: Mostafa Saleh @ 2026-05-14  8:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5abjeia315.fsf@kernel.org>

On Thu, May 14, 2026 at 8:01 AM Aneesh Kumar K.V
<aneesh.kumar@kernel.org> wrote:
>
> Mostafa Saleh <smostafa@google.com> writes:
>
> ...
>
> >>  struct page *dma_alloc_from_pool(struct device *dev, size_t size,
> >> -            void **cpu_addr, gfp_t gfp,
> >> +            void **cpu_addr, gfp_t gfp, unsigned long attrs,
> >>              bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t))
> >>  {
> >> -    struct gen_pool *pool = NULL;
> >> +    struct dma_gen_pool *dma_pool = NULL;
> >>      struct page *page;
> >>      bool pool_found = false;
> >>
> >> -    while ((pool = dma_guess_pool(pool, gfp))) {
> >> +    while ((dma_pool = dma_guess_pool(dma_pool, gfp))) {
> >> +
> >> +            if (dma_pool->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> >> +                    continue;
> >> +
> >
> > nit: If we fail to find a matching pool, a slightly misleading message
> > is printed as pool_found = false
> >
>
> The message printed is
>
>         WARN(1, "Failed to get suitable pool for %s\n", dev_name(dev));
>
> That is correct, isn’t it? The kernel failed to find a pool with the
> correct encryption attribute. For example, the request was for an
> encrypted allocation from the pool, but no encrypted pool was available.
>

Sure, I’d prefer a clearer print in that case, especially since that’s new code:
“Only {encrypted/decrypted} pool found for a {encrypted/decrypted} alloction”

But no strong opinion.

Thanks,
Mostafa



> >
> >>              pool_found = true;
> >> -            page = __dma_alloc_from_pool(dev, size, pool, cpu_addr,
> >> +            page = __dma_alloc_from_pool(dev, size, dma_pool->pool, cpu_addr,
> >>                                           phys_addr_ok);
> >>              if (page)
> >>                      return page;
> >> @@ -296,12 +345,14 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,
>
> -aneesh


^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-14 11:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260513172450.GR7702@ziepe.ca>

On Wed, May 13, 2026 at 02:24:50PM -0300, Jason Gunthorpe wrote:
> On Wed, May 13, 2026 at 02:27:14PM +0000, Mostafa Saleh wrote:
> 
> > > +		/*
> > > +		 * if platform supports memory encryption,
> > > +		 * restricted mem pool is decrypted by default
> > > +		 */
> > > +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> > > +			mem->unencrypted = true;
> > > +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> > > +					     rmem->size >> PAGE_SHIFT);
> > > +		} else {
> > > +			mem->unencrypted = false;
> > > +		}
> >
> > This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
> > traffic now fails.
> 
> How will pKVM signal what kind of memory the DMA needs then?
> 
> Does it use set_memory_decrypted()? How can it use
> set_memory_decrypted() without offering CC_ATTR_MEM_ENCRYPT ?

pKVM (hypervisor) doesn’t signal anything.
The VMM when running protected guests will use restricted dma-pools
for emulated vritio devices in the guest, which gets decrypted by
the guest kernel and hence shared with the host kernel, and then
traffic is bounced via the pool.

It’s also worth noting that bouncing here isn't just about visibility.
Because memory sharing operates at page granularity, bouncing sub-page
allocations through the restricted pool prevents adjacent, sensitive
guest data from being exposed to the untrusted host.

> 
> > Also, by design, some drivers are clueless about bouncing, so
> 
> Oh? What does this mean? We take quite a dim view of drivers mis-using
> the DMA API..

Maybe clueless is not the right word, I mean when virtio drivers use
the DMA API they don’t know whether it’s going to bounce or not as
that is decided by dma-direct (and in other cases by dma-iommu,
but not for pKVM).

> 
> > I believe that the pool should have a way to control it’s property
> > (encrypted or decrypted) and that takes priority over whatever
> > attributes comes from allocation.
> 
> We should get here because dma_capable() fails, and then swiotlb needs
> to return something that makes dma_capable() succeed. Yes, it should
> return details about the thing it decided, but it shouldn't have been
> pre-created with some idea how to make dma_capable() work.

That sounds neat, but at the end we have force_dma_unencrypted() in
dma_capable() which is just hardcoded to true/false by the platform.
How is that different from having the state static by the pool?

> 
> If dma_capable() can fail, then swiotlb should know exactly what to do
> to fix it.

dma_capable() returns a bool, I don’t think it can know what exactly
went wrong (based on address, size, attrs, dev...)

> 
> If pkvm wants to use the hacky scheme where you force a swiotlb pool
> configuration during arch init with force swiotlb that's a somewhat
> different flow and, sure the forced pool should force do whatever it
> is forced to.
> 
> But lets try to keep them seperated in the discussion..

While we can debate the aesthetics of the setup , this is
the exisitng behaviour for Linux, which existed for years
and pKVM relies on and is used extensively.
And, this patch alters that long-standing logic and introduces
a functional regression.

We can address this by either adjusting this patch or by changing
pKVM guests to be more aligned with other CCA guests which is
something I have been wondering about if it would help reduce
bouncing.

> 
> > And that brings us to the same point whether it’s better to return
> > the memory along with it’s state or we pass the requested state.
> > I think for other cases it’s fine for the device/DMA-API to dictate
> > the attrs, but not in restricted-dma case, the firmware just knows better.
> 
> The memory type must be returned back at some level so downstream
> things can do the right transformation of the phys_addr_t.

Agreed, I believe that will be needed at least for
SWIOTLB/restricted-dma -> dma-API interactions.

> 
> One of the aspirational CC things that should work is a T=1 device
> tries to DMA from a decrypted page, finds the address is above the dma
> limit of the device, so it bounces it with SWIOTLB to an encrypted low
> address page and then the DMA API internal flow switiches from working
> with decrypted to encrypted phys_addr_t.
> 
> If we can make that work then maybe the flows are designed correctly.

Mmm, I am not sure I understand this one, shouldn’t the device also be
notified about the switch in memory state, if it expects to read/write
decrypted memory, how would that work if the kernel changes it to an
encrypted one?

Thanks,
Mostafa
> 
> Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-14 12:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5ah5oaa63h.fsf@kernel.org>

On Thu, May 14, 2026 at 11:24:42AM +0530, Aneesh Kumar K.V wrote:
> Mostafa Saleh <smostafa@google.com> writes:
> 
> > On Tue, May 12, 2026 at 02:33:59PM +0530, Aneesh Kumar K.V (Arm) wrote:
> >> Teach swiotlb to distinguish between encrypted and decrypted bounce
> >> buffer pools, and make allocation and mapping paths select a pool whose
> >> state matches the requested DMA attributes.
> >> 
> >> Add a decrypted flag to io_tlb_mem, initialize it for the default and
> >> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
> >> allocation. Reject swiotlb alloc/map requests when the selected pool does
> >> not match the required encrypted/decrypted state.
> >> 
> >> Also return DMA addresses with the matching phys_to_dma_{encrypted,
> >> unencrypted} helper so the DMA address encoding stays consistent with the
> >> chosen pool.
> >> 
> >> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> >> ---
> >>  include/linux/dma-direct.h |  10 ++++
> >>  include/linux/swiotlb.h    |   8 ++-
> >>  kernel/dma/direct.c        |  14 +++--
> >>  kernel/dma/swiotlb.c       | 108 +++++++++++++++++++++++++++----------
> >>  4 files changed, 107 insertions(+), 33 deletions(-)
> >> 
> >> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> >> index c249912456f9..94fad4e7c11e 100644
> >> --- a/include/linux/dma-direct.h
> >> +++ b/include/linux/dma-direct.h
> >> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
> >>  #ifndef phys_to_dma_unencrypted
> >>  #define phys_to_dma_unencrypted		phys_to_dma
> >>  #endif
> >> +
> >> +#ifndef phys_to_dma_encrypted
> >> +#define phys_to_dma_encrypted		phys_to_dma
> >> +#endif
> >>  #else
> >>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
> >>  {
> >> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
> >>  {
> >>  	return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
> >>  }
> >> +
> >> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
> >> +		phys_addr_t paddr)
> >> +{
> >> +	return dma_addr_encrypted(__phys_to_dma(dev, paddr));
> >> +}
> >>  /*
> >>   * If memory encryption is supported, phys_to_dma will set the memory encryption
> >>   * bit in the DMA address, and dma_to_phys will clear it.
> >> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> >> index 3dae0f592063..b3fa3c6e0169 100644
> >> --- a/include/linux/swiotlb.h
> >> +++ b/include/linux/swiotlb.h
> >> @@ -81,6 +81,7 @@ struct io_tlb_pool {
> >>  	struct list_head node;
> >>  	struct rcu_head rcu;
> >>  	bool transient;
> >> +	bool unencrypted;
> >>  #endif
> >>  };
> >>  
> >> @@ -111,6 +112,7 @@ struct io_tlb_mem {
> >>  	struct dentry *debugfs;
> >>  	bool force_bounce;
> >>  	bool for_alloc;
> >> +	bool unencrypted;
> >>  #ifdef CONFIG_SWIOTLB_DYNAMIC
> >>  	bool can_grow;
> >>  	u64 phys_limit;
> >> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
> >>  extern void swiotlb_print_info(void);
> >>  
> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
> >> -struct page *swiotlb_alloc(struct device *dev, size_t size);
> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
> >> +		unsigned long attrs);
> >>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
> >>  
> >>  static inline bool is_swiotlb_for_alloc(struct device *dev)
> >> @@ -290,7 +293,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
> >>  	return dev->dma_io_tlb_mem->for_alloc;
> >>  }
> >>  #else
> >> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
> >> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
> >> +		unsigned long attrs)
> >>  {
> >>  	return NULL;
> >>  }
> >> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> >> index dc2907439b3d..97ae4fa10521 100644
> >> --- a/kernel/dma/direct.c
> >> +++ b/kernel/dma/direct.c
> >> @@ -104,9 +104,10 @@ static void __dma_direct_free_pages(struct device *dev, struct page *page,
> >>  	dma_free_contiguous(dev, page, size);
> >>  }
> >>  
> >> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
> >> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
> >> +		unsigned long attrs)
> >>  {
> >> -	struct page *page = swiotlb_alloc(dev, size);
> >> +	struct page *page = swiotlb_alloc(dev, size, attrs);
> >>  
> >>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
> >>  		swiotlb_free(dev, page, size);
> >> @@ -266,8 +267,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> >>  						  gfp, attrs);
> >>  
> >>  	if (is_swiotlb_for_alloc(dev)) {
> >> -		page = dma_direct_alloc_swiotlb(dev, size);
> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
> >>  		if (page) {
> >> +			/*
> >> +			 * swiotlb allocations comes from pool already marked
> >> +			 * decrypted
> >> +			 */
> >>  			mark_mem_decrypt = false;
> >>  			goto setup_page;
> >>  		}
> >> @@ -374,6 +379,7 @@ void dma_direct_free(struct device *dev, size_t size,
> >>  		return;
> >>  
> >>  	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
> >> +		/* Swiotlb doesn't need a page attribute update on free */
> >>  		mark_mem_encrypted = false;
> >>  
> >>  	if (is_vmalloc_addr(cpu_addr)) {
> >> @@ -403,7 +409,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> >>  						  gfp, attrs);
> >>  
> >>  	if (is_swiotlb_for_alloc(dev)) {
> >> -		page = dma_direct_alloc_swiotlb(dev, size);
> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
> >>  		if (!page)
> >>  			return NULL;
> >>  
> >> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> >> index ab4eccbaa076..065663be282c 100644
> >> --- a/kernel/dma/swiotlb.c
> >> +++ b/kernel/dma/swiotlb.c
> >> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
> >>  	struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
> >>  	unsigned long bytes;
> >>  
> >> +	/*
> >> +	 * if platform support memory encryption, swiotlb buffers are
> >> +	 * decrypted by default.
> >> +	 */
> >> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> >> +		io_tlb_default_mem.unencrypted = true;
> >> +	else
> >> +		io_tlb_default_mem.unencrypted = false;
> >> +
> >>  	if (!mem->nslabs || mem->late_alloc)
> >>  		return;
> >>  	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
> >> -	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
> >> +
> >> +	if (io_tlb_default_mem.unencrypted)
> >> +		set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
> >>  }
> >>  
> >>  static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
> >> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> >>  	if (!mem->slots)
> >>  		goto error_slots;
> >>  
> >> -	set_memory_decrypted((unsigned long)vstart,
> >> -			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> >> +	if (io_tlb_default_mem.unencrypted)
> >> +		set_memory_decrypted((unsigned long)vstart,
> >> +				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> >> +
> >>  	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
> >>  				 nareas);
> >>  	add_mem_pool(&io_tlb_default_mem, mem);
> >> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
> >>  	tbl_size = PAGE_ALIGN(mem->end - mem->start);
> >>  	slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
> >>  
> >> -	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> >> +	if (io_tlb_default_mem.unencrypted)
> >> +		set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> >> +
> >>  	if (mem->late_alloc) {
> >>  		area_order = get_order(array_size(sizeof(*mem->areas),
> >>  			mem->nareas));
> >> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
> >>   * @gfp:	GFP flags for the allocation.
> >>   * @bytes:	Size of the buffer.
> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
> >> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
> >>   *
> >>   * Allocate pages from the buddy allocator. If successful, make the allocated
> >>   * pages decrypted that they can be used for DMA.
> >> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
> >>   * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
> >>   * if the allocated physical address was above @phys_limit.
> >>   */
> >> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> >> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
> >> +		u64 phys_limit, bool unencrypted)
> >>  {
> >>  	unsigned int order = get_order(bytes);
> >>  	struct page *page;
> >> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> >>  	}
> >>  
> >>  	vaddr = phys_to_virt(paddr);
> >> -	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
> >> +	if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
> >>  		goto error;
> >>  	return page;
> >>  
> >>  error:
> >>  	/* Intentional leak if pages cannot be encrypted again. */
> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> >> +	if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> >>  		__free_pages(page, order);
> >>  	return NULL;
> >>  }
> >> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> >>   * @dev:	Device for which a memory pool is allocated.
> >>   * @bytes:	Size of the buffer.
> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
> >> + * @attrs:	DMA attributes for the allocation.
> >>   * @gfp:	GFP flags for the allocation.
> >>   *
> >>   * Return: Allocated pages, or %NULL on allocation failure.
> >>   */
> >>  static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> >> -		u64 phys_limit, gfp_t gfp)
> >> +		u64 phys_limit, unsigned long attrs, gfp_t gfp)
> >>  {
> >>  	struct page *page;
> >> -	unsigned long attrs = 0;
> >>  
> >>  	/*
> >>  	 * Allocate from the atomic pools if memory is encrypted and
> >>  	 * the allocation is atomic, because decrypting may block.
> >>  	 */
> >> -	if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
> >> +	if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
> >>  		void *vaddr;
> >>  
> >>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
> >>  			return NULL;
> >>  
> >> -		/* swiotlb considered decrypted by default */
> >> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> >> -			attrs = DMA_ATTR_CC_SHARED;
> >> -
> >>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
> >>  					   attrs, dma_coherent_ok);
> >>  	}
> >> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> >>  	else if (phys_limit <= DMA_BIT_MASK(32))
> >>  		gfp |= __GFP_DMA32;
> >>  
> >> -	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
> >> +	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
> >> +					     !!(attrs & DMA_ATTR_CC_SHARED)))) {
> >>  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
> >>  		    phys_limit < DMA_BIT_MASK(64) &&
> >>  		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
> >> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> >>   * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
> >>   * @vaddr:	Virtual address of the buffer.
> >>   * @bytes:	Size of the buffer.
> >> + * @unencrypted: true if @vaddr was allocated decrypted and must be
> >> + *	re-encrypted before being freed
> >>   */
> >> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> >> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
> >>  {
> >>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
> >>  	    dma_free_from_pool(NULL, vaddr, bytes))
> >>  		return;
> >>  
> >>  	/* Intentional leak if pages cannot be encrypted again. */
> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> >> +	if (!unencrypted ||
> >> +	    !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> >>  		__free_pages(virt_to_page(vaddr), get_order(bytes));
> >>  }
> >>  
> >> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> >>   * @nslabs:	Desired (maximum) number of slabs.
> >>   * @nareas:	Number of areas.
> >>   * @phys_limit:	Maximum DMA buffer physical address.
> >> + * @attrs:	DMA attributes for the allocation.
> >>   * @gfp:	GFP flags for the allocations.
> >>   *
> >>   * Allocate and initialize a new IO TLB memory pool. The actual number of
> >> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> >>   */
> >>  static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> >>  		unsigned long minslabs, unsigned long nslabs,
> >> -		unsigned int nareas, u64 phys_limit, gfp_t gfp)
> >> +		unsigned int nareas, u64 phys_limit, unsigned long attrs,
> >> +		gfp_t gfp)
> >>  {
> >>  	struct io_tlb_pool *pool;
> >>  	unsigned int slot_order;
> >> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> >>  	if (!pool)
> >>  		goto error;
> >>  	pool->areas = (void *)pool + sizeof(*pool);
> >> +	pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
> >>  
> >>  	tlb_size = nslabs << IO_TLB_SHIFT;
> >> -	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
> >> +	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
> >>  		if (nslabs <= minslabs)
> >>  			goto error_tlb;
> >>  		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
> >> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> >>  	return pool;
> >>  
> >>  error_slots:
> >> -	swiotlb_free_tlb(page_address(tlb), tlb_size);
> >> +	swiotlb_free_tlb(page_address(tlb), tlb_size,
> >> +			 !!(attrs & DMA_ATTR_CC_SHARED));
> >>  error_tlb:
> >>  	kfree(pool);
> >>  error:
> >> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
> >>  	struct io_tlb_pool *pool;
> >>  
> >>  	pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
> >> -				  default_nareas, mem->phys_limit, GFP_KERNEL);
> >> +				  default_nareas, mem->phys_limit,
> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
> >> +				  GFP_KERNEL);
> >>  	if (!pool) {
> >>  		pr_warn_ratelimited("Failed to allocate new pool");
> >>  		return;
> >> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
> >>  	size_t tlb_size = pool->end - pool->start;
> >>  
> >>  	free_pages((unsigned long)pool->slots, get_order(slots_size));
> >> -	swiotlb_free_tlb(pool->vaddr, tlb_size);
> >> +	swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
> >>  	kfree(pool);
> >>  }
> >>  
> >> @@ -1232,6 +1255,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> >>  	nslabs = nr_slots(alloc_size);
> >>  	phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
> >>  	pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
> >>  				  GFP_NOWAIT);
> >>  	if (!pool)
> >>  		return -1;
> >> @@ -1394,6 +1418,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> >>  		enum dma_data_direction dir, unsigned long attrs)
> >>  {
> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >> +	bool require_decrypted = false;
> >>  	unsigned int offset;
> >>  	struct io_tlb_pool *pool;
> >>  	unsigned int i;
> >> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> >>  	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> >>  		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
> >>  
> >> +	/*
> >> +	 * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
> >> +	 * but the device is forcing decryption, use decrypted io_tlb_mem
> >> +	 */
> >> +	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
> >> +		require_decrypted = true;
> >> +
> >> +	if (require_decrypted != mem->unencrypted)
> >> +		return (phys_addr_t)DMA_MAPPING_ERROR;
> >> +
> >>  	/*
> >>  	 * The default swiotlb memory pool is allocated with PAGE_SIZE
> >>  	 * alignment. If a mapping is requested with larger alignment,
> >> @@ -1608,8 +1643,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
> >>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
> >>  		return DMA_MAPPING_ERROR;
> >>  
> >> -	/* Ensure that the address returned is DMA'ble */
> >> -	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
> >> +	/*
> >> +	 * Use the allocated io_tlb_mem encryption type to determine dma addr.
> >> +	 */
> >> +	if (dev->dma_io_tlb_mem->unencrypted)
> >> +		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
> >> +	else
> >> +		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
> >> +
> >>  	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
> >>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
> >>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
> >> @@ -1773,7 +1814,8 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
> >>  
> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
> >>  
> >> -struct page *swiotlb_alloc(struct device *dev, size_t size)
> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
> >> +		unsigned long attrs)
> >>  {
> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >>  	struct io_tlb_pool *pool;
> >> @@ -1784,6 +1826,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
> >>  	if (!mem)
> >>  		return NULL;
> >>  
> >> +	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> >> +		return NULL;
> >> +
> >>  	align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
> >>  	index = swiotlb_find_slots(dev, 0, size, align, &pool);
> >>  	if (index == -1)
> >> @@ -1853,9 +1898,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> >>  			kfree(mem);
> >>  			return -ENOMEM;
> >>  		}
> >> +		/*
> >> +		 * if platform supports memory encryption,
> >> +		 * restricted mem pool is decrypted by default
> >> +		 */
> >> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> >> +			mem->unencrypted = true;
> >> +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> >> +					     rmem->size >> PAGE_SHIFT);
> >> +		} else {
> >> +			mem->unencrypted = false;
> >> +		}
> >
> > This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
> > traffic now fails.
> >
> > Also, by design, some drivers are clueless about bouncing, so
> > I believe that the pool should have a way to control it’s property
> > (encrypted or decrypted) and that takes priority over whatever
> > attributes comes from allocation.
> > And that brings us to the same point whether it’s better to return
> > the memory along with it’s state or we pass the requested state.
> > I think for other cases it’s fine for the device/DMA-API to dictate
> > the attrs, but not in restricted-dma case, the firmware just knows better.
> >
> 
> Is it that the pKVM guest kernel does not have awareness of
> encrypted/decrypted DMA allocations? Instead, the firmware attaches
> hypervisor-shared pages to the device via restricted-dma-pool? The
> kernel then has swiotlb->for_alloc = true, and hence all DMA allocations
> go through the restricted-dma-pool?

Yes.

> 
> Given that pKVM supports pkvm_set_memory_encrypted() and
> pkvm_set_memory_decrypted(), can we consider adding CC_ATTR_MEM_ENCRYPT
> support to pKVM? It would also be good to investigate whether we can set
> force_dma_unencrypted(dev) to true where needed.

I was looking in to that, but it didn't work because
force_dma_unencrypted() is broken with restricted-dma due to the
double decryption issue, that's when I sent my first series [1]

May be we should land some basic fixes for that path so we can
convert pKVM, then we do the full rework.

I will revive my old work and see if I can send a RFC.

[1] https://lore.kernel.org/all/20260305170335.963568-1-smostafa@google.com/

> 
> I agree that this patch, as it stands, can break pKVM because we are now
> missing the set_memory_decrypted() call required for pKVM to work.
> 
> We now mark the swiotlb io_tlb_mem as unencrypted/encrypted in the guest
> using struct io_tlb_mem->unencrypted. I am not clear what we can use for
> pKVM to conditionalize this so that it works for both protected and
> unprotected guests.

There is no problem with non-protected guests as they don't use memory
encryption, my initial thought was that th encrpyted/decrypted is
per-pool property which is decided by FW (device-tree).

Thanks,
Mostafa

> 
> -aneesh
> 


^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-14 12:35 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agW2lzJI-20DyJVe@google.com>

> > How will pKVM signal what kind of memory the DMA needs then?
> > 
> > Does it use set_memory_decrypted()? How can it use
> > set_memory_decrypted() without offering CC_ATTR_MEM_ENCRYPT ?
> 
> pKVM (hypervisor) doesn’t signal anything.
> The VMM when running protected guests will use restricted dma-pools
> for emulated vritio devices in the guest, which gets decrypted by
> the guest kernel and hence shared with the host kernel, and then
> traffic is bounced via the pool.

That really does sound like CC and set_memory_decrypted() to me..

> It’s also worth noting that bouncing here isn't just about visibility.
> Because memory sharing operates at page granularity, bouncing sub-page
> allocations through the restricted pool prevents adjacent, sensitive
> guest data from being exposed to the untrusted host.

That's a somewhat different problem, we have the dev->trusted stuff
that is supposed to deal with this kind of security. We need it for
IOMMU based systems too, eg hot plug thunderbolt should have it.

Then CC issue is more that the DMA API can't decrypt random passed in
memory because doing so often requires changing the PTEs pointing at
the page so it would break everything if done transparently.

> > > I believe that the pool should have a way to control it’s property
> > > (encrypted or decrypted) and that takes priority over whatever
> > > attributes comes from allocation.
> > 
> > We should get here because dma_capable() fails, and then swiotlb needs
> > to return something that makes dma_capable() succeed. Yes, it should
> > return details about the thing it decided, but it shouldn't have been
> > pre-created with some idea how to make dma_capable() work.
> 
> That sounds neat, but at the end we have force_dma_unencrypted() in
> dma_capable() which is just hardcoded to true/false by the platform.

For now, the next step is it becomes per-device and dynamic during the
device lifecycle.

> How is that different from having the state static by the pool?

statically attached pools to the device are not so flexible when
devices have dynamically changing capabilities..

> > If dma_capable() can fail, then swiotlb should know exactly what to do
> > to fix it.
> 
> dma_capable() returns a bool, I don’t think it can know what exactly
> went wrong (based on address, size, attrs, dev...)

Yes, but I think the design is swiotlb is supposed to re-inspect what
is going on against the limits dma_capable checks and then select the
correct remedy..

> While we can debate the aesthetics of the setup , this is
> the exisitng behaviour for Linux, which existed for years
> and pKVM relies on and is used extensively.
> And, this patch alters that long-standing logic and introduces
> a functional regression.

Yeah, Aneesh needs to do something here, I'm pointing out it is
entirely seperate thing from the CC path we are working on which is
decoupling CC from reylying on force swiotlb.

> We can address this by either adjusting this patch or by changing
> pKVM guests to be more aligned with other CCA guests which is
> something I have been wondering about if it would help reduce
> bouncing.

Every time I look at pkvm I think it is just ARM CCA with a different
design and no access to the unique HW features..

> > If we can make that work then maybe the flows are designed correctly.
> 
> Mmm, I am not sure I understand this one, shouldn’t the device also be
> notified about the switch in memory state, if it expects to read/write
> decrypted memory, how would that work if the kernel changes it to an
> encrypted one?

Nothing on the device changes. In a CC world we put the device in a
T=0 or T=1 state before the driver loads and the expectation from the
DMA API is that the device will only use that T=x DMA type during
operation.

A T=1 state device can access all of memory, private or shared. Any
information the platform may need is encoded in the dma_addr_t or in
the S1 IOPTEs.

So we never need to tell the device driver what kind of memory the DMA
is targetting, and we NEVER expect a device in T=1 mode to have to
issue a T=0 DMA to use the DMA API.

In a pkvm world it should be the same, the S2 table for the SMMU will
control what the device can access, and if the SMMU points to a
"private" or "shared" page is not something the device needs to know
or care about.

Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-14 12:48 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agW5rhE9n2gDQ0w5@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Thu, May 14, 2026 at 11:24:42AM +0530, Aneesh Kumar K.V wrote:
>> Mostafa Saleh <smostafa@google.com> writes:
>> 
>> > On Tue, May 12, 2026 at 02:33:59PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> >> Teach swiotlb to distinguish between encrypted and decrypted bounce
>> >> buffer pools, and make allocation and mapping paths select a pool whose
>> >> state matches the requested DMA attributes.
>> >> 
>> >> Add a decrypted flag to io_tlb_mem, initialize it for the default and
>> >> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
>> >> allocation. Reject swiotlb alloc/map requests when the selected pool does
>> >> not match the required encrypted/decrypted state.
>> >> 
>> >> Also return DMA addresses with the matching phys_to_dma_{encrypted,
>> >> unencrypted} helper so the DMA address encoding stays consistent with the
>> >> chosen pool.
>> >> 
>> >> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> >> ---
>> >>  include/linux/dma-direct.h |  10 ++++
>> >>  include/linux/swiotlb.h    |   8 ++-
>> >>  kernel/dma/direct.c        |  14 +++--
>> >>  kernel/dma/swiotlb.c       | 108 +++++++++++++++++++++++++++----------
>> >>  4 files changed, 107 insertions(+), 33 deletions(-)
>> >> 
>> >> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
>> >> index c249912456f9..94fad4e7c11e 100644
>> >> --- a/include/linux/dma-direct.h
>> >> +++ b/include/linux/dma-direct.h
>> >> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
>> >>  #ifndef phys_to_dma_unencrypted
>> >>  #define phys_to_dma_unencrypted		phys_to_dma
>> >>  #endif
>> >> +
>> >> +#ifndef phys_to_dma_encrypted
>> >> +#define phys_to_dma_encrypted		phys_to_dma
>> >> +#endif
>> >>  #else
>> >>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>> >>  {
>> >> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
>> >>  {
>> >>  	return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
>> >>  }
>> >> +
>> >> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
>> >> +		phys_addr_t paddr)
>> >> +{
>> >> +	return dma_addr_encrypted(__phys_to_dma(dev, paddr));
>> >> +}
>> >>  /*
>> >>   * If memory encryption is supported, phys_to_dma will set the memory encryption
>> >>   * bit in the DMA address, and dma_to_phys will clear it.
>> >> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
>> >> index 3dae0f592063..b3fa3c6e0169 100644
>> >> --- a/include/linux/swiotlb.h
>> >> +++ b/include/linux/swiotlb.h
>> >> @@ -81,6 +81,7 @@ struct io_tlb_pool {
>> >>  	struct list_head node;
>> >>  	struct rcu_head rcu;
>> >>  	bool transient;
>> >> +	bool unencrypted;
>> >>  #endif
>> >>  };
>> >>  
>> >> @@ -111,6 +112,7 @@ struct io_tlb_mem {
>> >>  	struct dentry *debugfs;
>> >>  	bool force_bounce;
>> >>  	bool for_alloc;
>> >> +	bool unencrypted;
>> >>  #ifdef CONFIG_SWIOTLB_DYNAMIC
>> >>  	bool can_grow;
>> >>  	u64 phys_limit;
>> >> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
>> >>  extern void swiotlb_print_info(void);
>> >>  
>> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> >> -struct page *swiotlb_alloc(struct device *dev, size_t size);
>> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> +		unsigned long attrs);
>> >>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>> >>  
>> >>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>> >> @@ -290,7 +293,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>> >>  	return dev->dma_io_tlb_mem->for_alloc;
>> >>  }
>> >>  #else
>> >> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> +		unsigned long attrs)
>> >>  {
>> >>  	return NULL;
>> >>  }
>> >> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> >> index dc2907439b3d..97ae4fa10521 100644
>> >> --- a/kernel/dma/direct.c
>> >> +++ b/kernel/dma/direct.c
>> >> @@ -104,9 +104,10 @@ static void __dma_direct_free_pages(struct device *dev, struct page *page,
>> >>  	dma_free_contiguous(dev, page, size);
>> >>  }
>> >>  
>> >> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
>> >> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
>> >> +		unsigned long attrs)
>> >>  {
>> >> -	struct page *page = swiotlb_alloc(dev, size);
>> >> +	struct page *page = swiotlb_alloc(dev, size, attrs);
>> >>  
>> >>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
>> >>  		swiotlb_free(dev, page, size);
>> >> @@ -266,8 +267,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>> >>  						  gfp, attrs);
>> >>  
>> >>  	if (is_swiotlb_for_alloc(dev)) {
>> >> -		page = dma_direct_alloc_swiotlb(dev, size);
>> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> >>  		if (page) {
>> >> +			/*
>> >> +			 * swiotlb allocations comes from pool already marked
>> >> +			 * decrypted
>> >> +			 */
>> >>  			mark_mem_decrypt = false;
>> >>  			goto setup_page;
>> >>  		}
>> >> @@ -374,6 +379,7 @@ void dma_direct_free(struct device *dev, size_t size,
>> >>  		return;
>> >>  
>> >>  	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
>> >> +		/* Swiotlb doesn't need a page attribute update on free */
>> >>  		mark_mem_encrypted = false;
>> >>  
>> >>  	if (is_vmalloc_addr(cpu_addr)) {
>> >> @@ -403,7 +409,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>> >>  						  gfp, attrs);
>> >>  
>> >>  	if (is_swiotlb_for_alloc(dev)) {
>> >> -		page = dma_direct_alloc_swiotlb(dev, size);
>> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> >>  		if (!page)
>> >>  			return NULL;
>> >>  
>> >> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> >> index ab4eccbaa076..065663be282c 100644
>> >> --- a/kernel/dma/swiotlb.c
>> >> +++ b/kernel/dma/swiotlb.c
>> >> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
>> >>  	struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
>> >>  	unsigned long bytes;
>> >>  
>> >> +	/*
>> >> +	 * if platform support memory encryption, swiotlb buffers are
>> >> +	 * decrypted by default.
>> >> +	 */
>> >> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> +		io_tlb_default_mem.unencrypted = true;
>> >> +	else
>> >> +		io_tlb_default_mem.unencrypted = false;
>> >> +
>> >>  	if (!mem->nslabs || mem->late_alloc)
>> >>  		return;
>> >>  	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
>> >> -	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> >> +
>> >> +	if (io_tlb_default_mem.unencrypted)
>> >> +		set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> >>  }
>> >>  
>> >>  static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
>> >> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>> >>  	if (!mem->slots)
>> >>  		goto error_slots;
>> >>  
>> >> -	set_memory_decrypted((unsigned long)vstart,
>> >> -			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> >> +	if (io_tlb_default_mem.unencrypted)
>> >> +		set_memory_decrypted((unsigned long)vstart,
>> >> +				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> >> +
>> >>  	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
>> >>  				 nareas);
>> >>  	add_mem_pool(&io_tlb_default_mem, mem);
>> >> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
>> >>  	tbl_size = PAGE_ALIGN(mem->end - mem->start);
>> >>  	slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>> >>  
>> >> -	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> >> +	if (io_tlb_default_mem.unencrypted)
>> >> +		set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> >> +
>> >>  	if (mem->late_alloc) {
>> >>  		area_order = get_order(array_size(sizeof(*mem->areas),
>> >>  			mem->nareas));
>> >> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
>> >>   * @gfp:	GFP flags for the allocation.
>> >>   * @bytes:	Size of the buffer.
>> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> >> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
>> >>   *
>> >>   * Allocate pages from the buddy allocator. If successful, make the allocated
>> >>   * pages decrypted that they can be used for DMA.
>> >> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
>> >>   * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
>> >>   * if the allocated physical address was above @phys_limit.
>> >>   */
>> >> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
>> >> +		u64 phys_limit, bool unencrypted)
>> >>  {
>> >>  	unsigned int order = get_order(bytes);
>> >>  	struct page *page;
>> >> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >>  	}
>> >>  
>> >>  	vaddr = phys_to_virt(paddr);
>> >> -	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> +	if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >>  		goto error;
>> >>  	return page;
>> >>  
>> >>  error:
>> >>  	/* Intentional leak if pages cannot be encrypted again. */
>> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> +	if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >>  		__free_pages(page, order);
>> >>  	return NULL;
>> >>  }
>> >> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >>   * @dev:	Device for which a memory pool is allocated.
>> >>   * @bytes:	Size of the buffer.
>> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> >> + * @attrs:	DMA attributes for the allocation.
>> >>   * @gfp:	GFP flags for the allocation.
>> >>   *
>> >>   * Return: Allocated pages, or %NULL on allocation failure.
>> >>   */
>> >>  static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >> -		u64 phys_limit, gfp_t gfp)
>> >> +		u64 phys_limit, unsigned long attrs, gfp_t gfp)
>> >>  {
>> >>  	struct page *page;
>> >> -	unsigned long attrs = 0;
>> >>  
>> >>  	/*
>> >>  	 * Allocate from the atomic pools if memory is encrypted and
>> >>  	 * the allocation is atomic, because decrypting may block.
>> >>  	 */
>> >> -	if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
>> >> +	if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
>> >>  		void *vaddr;
>> >>  
>> >>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
>> >>  			return NULL;
>> >>  
>> >> -		/* swiotlb considered decrypted by default */
>> >> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> -			attrs = DMA_ATTR_CC_SHARED;
>> >> -
>> >>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
>> >>  					   attrs, dma_coherent_ok);
>> >>  	}
>> >> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >>  	else if (phys_limit <= DMA_BIT_MASK(32))
>> >>  		gfp |= __GFP_DMA32;
>> >>  
>> >> -	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
>> >> +	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
>> >> +					     !!(attrs & DMA_ATTR_CC_SHARED)))) {
>> >>  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
>> >>  		    phys_limit < DMA_BIT_MASK(64) &&
>> >>  		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
>> >> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >>   * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
>> >>   * @vaddr:	Virtual address of the buffer.
>> >>   * @bytes:	Size of the buffer.
>> >> + * @unencrypted: true if @vaddr was allocated decrypted and must be
>> >> + *	re-encrypted before being freed
>> >>   */
>> >> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
>> >>  {
>> >>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>> >>  	    dma_free_from_pool(NULL, vaddr, bytes))
>> >>  		return;
>> >>  
>> >>  	/* Intentional leak if pages cannot be encrypted again. */
>> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> +	if (!unencrypted ||
>> >> +	    !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >>  		__free_pages(virt_to_page(vaddr), get_order(bytes));
>> >>  }
>> >>  
>> >> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >>   * @nslabs:	Desired (maximum) number of slabs.
>> >>   * @nareas:	Number of areas.
>> >>   * @phys_limit:	Maximum DMA buffer physical address.
>> >> + * @attrs:	DMA attributes for the allocation.
>> >>   * @gfp:	GFP flags for the allocations.
>> >>   *
>> >>   * Allocate and initialize a new IO TLB memory pool. The actual number of
>> >> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >>   */
>> >>  static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >>  		unsigned long minslabs, unsigned long nslabs,
>> >> -		unsigned int nareas, u64 phys_limit, gfp_t gfp)
>> >> +		unsigned int nareas, u64 phys_limit, unsigned long attrs,
>> >> +		gfp_t gfp)
>> >>  {
>> >>  	struct io_tlb_pool *pool;
>> >>  	unsigned int slot_order;
>> >> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >>  	if (!pool)
>> >>  		goto error;
>> >>  	pool->areas = (void *)pool + sizeof(*pool);
>> >> +	pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>> >>  
>> >>  	tlb_size = nslabs << IO_TLB_SHIFT;
>> >> -	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
>> >> +	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
>> >>  		if (nslabs <= minslabs)
>> >>  			goto error_tlb;
>> >>  		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
>> >> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >>  	return pool;
>> >>  
>> >>  error_slots:
>> >> -	swiotlb_free_tlb(page_address(tlb), tlb_size);
>> >> +	swiotlb_free_tlb(page_address(tlb), tlb_size,
>> >> +			 !!(attrs & DMA_ATTR_CC_SHARED));
>> >>  error_tlb:
>> >>  	kfree(pool);
>> >>  error:
>> >> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
>> >>  	struct io_tlb_pool *pool;
>> >>  
>> >>  	pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
>> >> -				  default_nareas, mem->phys_limit, GFP_KERNEL);
>> >> +				  default_nareas, mem->phys_limit,
>> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> >> +				  GFP_KERNEL);
>> >>  	if (!pool) {
>> >>  		pr_warn_ratelimited("Failed to allocate new pool");
>> >>  		return;
>> >> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
>> >>  	size_t tlb_size = pool->end - pool->start;
>> >>  
>> >>  	free_pages((unsigned long)pool->slots, get_order(slots_size));
>> >> -	swiotlb_free_tlb(pool->vaddr, tlb_size);
>> >> +	swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
>> >>  	kfree(pool);
>> >>  }
>> >>  
>> >> @@ -1232,6 +1255,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>> >>  	nslabs = nr_slots(alloc_size);
>> >>  	phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
>> >>  	pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
>> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> >>  				  GFP_NOWAIT);
>> >>  	if (!pool)
>> >>  		return -1;
>> >> @@ -1394,6 +1418,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>> >>  		enum dma_data_direction dir, unsigned long attrs)
>> >>  {
>> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> >> +	bool require_decrypted = false;
>> >>  	unsigned int offset;
>> >>  	struct io_tlb_pool *pool;
>> >>  	unsigned int i;
>> >> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>> >>  	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >>  		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>> >>  
>> >> +	/*
>> >> +	 * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
>> >> +	 * but the device is forcing decryption, use decrypted io_tlb_mem
>> >> +	 */
>> >> +	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>> >> +		require_decrypted = true;
>> >> +
>> >> +	if (require_decrypted != mem->unencrypted)
>> >> +		return (phys_addr_t)DMA_MAPPING_ERROR;
>> >> +
>> >>  	/*
>> >>  	 * The default swiotlb memory pool is allocated with PAGE_SIZE
>> >>  	 * alignment. If a mapping is requested with larger alignment,
>> >> @@ -1608,8 +1643,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>> >>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
>> >>  		return DMA_MAPPING_ERROR;
>> >>  
>> >> -	/* Ensure that the address returned is DMA'ble */
>> >> -	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> >> +	/*
>> >> +	 * Use the allocated io_tlb_mem encryption type to determine dma addr.
>> >> +	 */
>> >> +	if (dev->dma_io_tlb_mem->unencrypted)
>> >> +		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> >> +	else
>> >> +		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
>> >> +
>> >>  	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
>> >>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
>> >>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
>> >> @@ -1773,7 +1814,8 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>> >>  
>> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> >>  
>> >> -struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> +		unsigned long attrs)
>> >>  {
>> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> >>  	struct io_tlb_pool *pool;
>> >> @@ -1784,6 +1826,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >>  	if (!mem)
>> >>  		return NULL;
>> >>  
>> >> +	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
>> >> +		return NULL;
>> >> +
>> >>  	align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
>> >>  	index = swiotlb_find_slots(dev, 0, size, align, &pool);
>> >>  	if (index == -1)
>> >> @@ -1853,9 +1898,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>> >>  			kfree(mem);
>> >>  			return -ENOMEM;
>> >>  		}
>> >> +		/*
>> >> +		 * if platform supports memory encryption,
>> >> +		 * restricted mem pool is decrypted by default
>> >> +		 */
>> >> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> >> +			mem->unencrypted = true;
>> >> +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
>> >> +					     rmem->size >> PAGE_SHIFT);
>> >> +		} else {
>> >> +			mem->unencrypted = false;
>> >> +		}
>> >
>> > This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
>> > traffic now fails.
>> >
>> > Also, by design, some drivers are clueless about bouncing, so
>> > I believe that the pool should have a way to control it’s property
>> > (encrypted or decrypted) and that takes priority over whatever
>> > attributes comes from allocation.
>> > And that brings us to the same point whether it’s better to return
>> > the memory along with it’s state or we pass the requested state.
>> > I think for other cases it’s fine for the device/DMA-API to dictate
>> > the attrs, but not in restricted-dma case, the firmware just knows better.
>> >
>> 
>> Is it that the pKVM guest kernel does not have awareness of
>> encrypted/decrypted DMA allocations? Instead, the firmware attaches
>> hypervisor-shared pages to the device via restricted-dma-pool? The
>> kernel then has swiotlb->for_alloc = true, and hence all DMA allocations
>> go through the restricted-dma-pool?
>
> Yes.
>
>> 
>> Given that pKVM supports pkvm_set_memory_encrypted() and
>> pkvm_set_memory_decrypted(), can we consider adding CC_ATTR_MEM_ENCRYPT
>> support to pKVM? It would also be good to investigate whether we can set
>> force_dma_unencrypted(dev) to true where needed.
>
> I was looking in to that, but it didn't work because
> force_dma_unencrypted() is broken with restricted-dma due to the
> double decryption issue, that's when I sent my first series [1]
>
> May be we should land some basic fixes for that path so we can
> convert pKVM, then we do the full rework.
>
> I will revive my old work and see if I can send a RFC.
>
> [1] https://lore.kernel.org/all/20260305170335.963568-1-smostafa@google.com/
>

With this series, can you check whether the only change needed is
something like the following?

modified   kernel/dma/swiotlb.c
@@ -1905,7 +1905,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
 		 * if platform supports memory encryption,
 		 * restricted mem pool is decrypted by default
 		 */
-		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		//if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		if (true) {
 			mem->unencrypted = true;
 			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
 					     rmem->size >> PAGE_SHIFT);

>
>> 
>> I agree that this patch, as it stands, can break pKVM because we are now
>> missing the set_memory_decrypted() call required for pKVM to work.
>> 
>> We now mark the swiotlb io_tlb_mem as unencrypted/encrypted in the guest
>> using struct io_tlb_mem->unencrypted. I am not clear what we can use for
>> pKVM to conditionalize this so that it works for both protected and
>> unprotected guests.
>
> There is no problem with non-protected guests as they don't use memory
> encryption, my initial thought was that th encrpyted/decrypted is
> per-pool property which is decided by FW (device-tree).
>

What I meant was that we need a generic way to identify a pKVM guest, so
that we can use it in the conditional above.

-aneesh


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox