LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH] powerpc: replace strlcat to simplify ppc_kallsyms_lookup_name
From: Christophe Leroy (CS GROUP) @ 2026-06-09  4:52 UTC (permalink / raw)
  To: Thorsten Blum, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20260608193203.163266-2-thorsten.blum@linux.dev>



On 08/06/2026 at 21:32, Thorsten Blum wrote:
> strlcat() should not be used anymore (see fortify-string.h), and since
> name is guaranteed to be NUL-terminated within KSYM_NAME_LEN bytes, use
> memcpy() instead.
> 
> Rename dot_appended to the semantically clearer prepend_dot while at it.
> 
> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>

Please take 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20260506021143.13797-1-xieyuanbin1@huawei.com/ 
instead

Christophe


> ---
>   arch/powerpc/include/asm/text-patching.h | 17 ++++++++---------
>   1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/text-patching.h b/arch/powerpc/include/asm/text-patching.h
> index e7f14720f630..c47b813bd067 100644
> --- a/arch/powerpc/include/asm/text-patching.h
> +++ b/arch/powerpc/include/asm/text-patching.h
> @@ -227,22 +227,21 @@ static inline unsigned long ppc_kallsyms_lookup_name(const char *name)
>   #ifdef CONFIG_PPC64_ELF_ABI_V1
>   	/* check for dot variant */
>   	char dot_name[1 + KSYM_NAME_LEN];
> -	bool dot_appended = false;
> +	bool prepend_dot = name[0] != '.';
> +	size_t len = strnlen(name, KSYM_NAME_LEN);
>   
> -	if (strnlen(name, KSYM_NAME_LEN) >= KSYM_NAME_LEN)
> +	if (len == KSYM_NAME_LEN)
>   		return 0;
>   
> -	if (name[0] != '.') {
> +	if (prepend_dot) {
>   		dot_name[0] = '.';
> -		dot_name[1] = '\0';
> -		strlcat(dot_name, name, sizeof(dot_name));
> -		dot_appended = true;
> +		memcpy(dot_name + 1, name, len + 1);
>   	} else {
> -		dot_name[0] = '\0';
> -		strlcat(dot_name, name, sizeof(dot_name));
> +		memcpy(dot_name, name, len + 1);
>   	}
> +
>   	addr = kallsyms_lookup_name(dot_name);
> -	if (!addr && dot_appended)
> +	if (!addr && prepend_dot)
>   		/* Let's try the original non-dot symbol lookup	*/
>   		addr = kallsyms_lookup_name(name);
>   #elif defined(CONFIG_PPC64_ELF_ABI_V2)
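
For readers following the thread, the memcpy()-based construction in the
quoted patch can be tried outside the kernel. The following is only a
userspace sketch: build_dot_name() is a hypothetical helper name, and the
KSYM_NAME_LEN value is an assumption made for illustration, not taken from
this patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define KSYM_NAME_LEN 512 /* assumed value, for illustration only */

/*
 * Hypothetical helper: write the dot variant of 'name' into 'dst'.
 * Returns false when 'name' is not NUL-terminated within KSYM_NAME_LEN
 * bytes, mirroring the early "return 0" in the quoted patch.
 */
static bool build_dot_name(char dst[1 + KSYM_NAME_LEN], const char *name)
{
	size_t len = strnlen(name, KSYM_NAME_LEN);
	bool prepend_dot = name[0] != '.';
	size_t off = prepend_dot ? 1 : 0;

	if (len == KSYM_NAME_LEN)
		return false;

	if (prepend_dot)
		dst[0] = '.';

	/* len + 1 also copies the terminating NUL, so no strlcat() is needed */
	memcpy(dst + off, name, len + 1);
	return true;
}
```

The longest accepted name is KSYM_NAME_LEN - 1 characters, so the dot, the
name and the NUL always fit in the 1 + KSYM_NAME_LEN byte buffer.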




* [PATCH v3] KVM: PPC: Book3S HV: Validate arch_compat against host compatibility mode
From: Amit Machhiwal @ 2026-06-09  5:33 UTC (permalink / raw)
  To: linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Vaibhav Jain, Harsh Prateek Bora, Ritesh Harjani,
	Anushree Mathur, Gautam Menghani, Mukesh Kumar Chaurasiya,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Thomas Huth, kvm, stable, linux-kernel

On IBM POWER systems, newer processor generations can operate in
compatibility modes corresponding to earlier generations. This becomes
relevant for nested virtualization, where nested KVM guests may need to
run with a specific processor compatibility level.

Currently, when running a nested KVM guest (L2) inside a Power11 pSeries
logical partition (L1) booted in Power10 compatibility mode, the guest
fails to boot while setting 'arch_compat'. This happens because the CPU
class is derived from the hardware PVR (via mfspr()), which reflects the
physical processor generation (Power11), rather than the effective
compatibility mode (Power10).

As a result, userspace may request a Power11 arch_compat for the L2
guest. However, the L1 partition, running in Power10 compatibility, has
only negotiated support up to Power10 with the Power Hypervisor (L0).
When H_GUEST_SET_STATE is invoked with a Power11 Logical PVR, the
hypervisor rejects the request, leading to a late guest boot failure:

  KVM-NESTEDv2: couldn't set guest wide elements
  [..KVM reg dump..]

This situation should be detected earlier and rejected by KVM. Without
proper validation, if userspace ignores the error, the guest may continue
to boot in Power11 raw mode on a Power10 compatibility host, which should
not be allowed.

Introduce a validation mechanism that detects unsupported arch_compat
values early in the guest initialization path. When an unsupported
arch_compat is requested (e.g., Power11 on a Power10 compatibility mode
host), kvmppc_set_arch_compat() uses cpu_has_feature(CPU_FTR_P11_PVR) to
detect the mismatch and sets arch_compat to PVR_ARCH_INVALID. This
triggers kvmppc_sanity_check() to mark the vCPU as invalid by setting
vcpu->arch.sane to false. On the next vCPU run, kvmppc_vcpu_run_hv()
checks this flag and returns -EINVAL, preventing the guest from running
with an invalid processor compatibility configuration.

With this, when a Power11 arch_compat is requested on a Power10
compatibility mode host, the guest fails early during boot with:

  error: kvm run failed Invalid argument

This provides a much clearer failure mode compared to the previous
behavior where the guest could boot in Power11 raw mode (if userspace
ignored the error) or fail late during H_GUEST_SET_STATE.

Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
Changes in v3:
* Fixed null pointer dereference in kvmppc_sanity_check(): added check for
  vcpu->arch.vcore before accessing arch_compat, as vcore is NULL for Book3S
  PR and BookE guests (only Book3S HV uses vcore) [Reported by Sashiko AI]
* Added Reviewed-by tag from Vaibhav

Changes in v2:
* Fixed issue where v1 allowed guest to boot in Power11 raw mode when
  userspace ignored the error, by adding validation in kvmppc_sanity_check()
  to ensure early failure during vCPU run [Found the issue after posting v1,
  also reported by Gautam.]
* Introduced PVR_ARCH_INVALID constant for marking invalid arch_compat
* Dropped all Reviewed-by and Tested-by tags due to code changes; requesting
  fresh reviews
* v1: https://lore.kernel.org/all/20260603141539.47620-1-amachhiw@linux.ibm.com/

Changes in v1:
* Moved this patch out of the v3 series [1] as discussed here [2]
* Addressed below review comments from Ritesh:
  - Based the PVR validation on cpu features
  - Fixed hcall name typo
  - Stable backport

[1] https://lore.kernel.org/all/20260522152744.55251-1-amachhiw@linux.ibm.com/
[2] https://lore.kernel.org/all/20260522152744.55251-2-amachhiw@linux.ibm.com/
---
 arch/powerpc/include/asm/reg.h |  1 +
 arch/powerpc/kvm/book3s_hv.c   | 15 ++++++++++++++-
 arch/powerpc/kvm/powerpc.c     |  4 ++++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 3449dd2b577d..7472b9522f71 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1356,6 +1356,7 @@
 #define PVR_ARCH_300	0x0f000005
 #define PVR_ARCH_31	0x0f000006
 #define PVR_ARCH_31_P11	0x0f000007
+#define PVR_ARCH_INVALID	0xffffffff
 
 /* Macros for setting and retrieving special purpose registers */
 #ifndef __ASSEMBLER__
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61dbeea317f3..f9380ef65750 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -446,7 +446,19 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 			guest_pcr_bit = PCR_ARCH_300;
 			break;
 		case PVR_ARCH_31:
+			guest_pcr_bit = PCR_ARCH_31;
+			break;
 		case PVR_ARCH_31_P11:
+			/*
+			 * Need to check this for ISA 3.1, as Power10 and
+			 * Power11 share the same PCR. For any subsequent ISA
+			 * versions, this will be taken care of by the guest vs
+			 * host PCR comparison below.
+			 */
+			if (!cpu_has_feature(CPU_FTR_P11_PVR)) {
+				arch_compat = PVR_ARCH_INVALID;
+				goto out;
+			}
 			guest_pcr_bit = PCR_ARCH_31;
 			break;
 		default:
@@ -469,6 +481,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 			return -EINVAL;
 	}
 
+out:
 	spin_lock(&vc->lock);
 	vc->arch_compat = arch_compat;
 	kvmhv_nestedv2_mark_dirty(vcpu, KVMPPC_GSID_LOGICAL_PVR);
@@ -479,7 +492,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 	vc->pcr = (host_pcr_bit - guest_pcr_bit) | PCR_MASK;
 	spin_unlock(&vc->lock);
 
-	return 0;
+	return kvmppc_sanity_check(vcpu);
 }
 
 static void kvmppc_dump_regs(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 00302399fc37..98de68379b18 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -258,6 +258,10 @@ int kvmppc_sanity_check(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.pvr)
 		goto out;
 
+	if (vcpu->arch.vcore &&
+		vcpu->arch.vcore->arch_compat == PVR_ARCH_INVALID)
+		goto out;
+
 	/* PAPR only works with book3s_64 */
 	if ((vcpu->arch.cpu_type != KVM_CPU_3S_64) && vcpu->arch.papr_enabled)
 		goto out;

base-commit: 2d3090a8aeb596a26935db0955d46c9a5db5c6ce
-- 
2.50.1 (Apple Git-155)
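
The control flow described in the commit message above can be modelled in
plain C. This is a simplified userspace sketch, not the kernel code:
struct model_vcpu and the model_*() functions are hypothetical stand-ins,
and the host feature test (cpu_has_feature(CPU_FTR_P11_PVR) in the patch)
is reduced to a boolean parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PVR_ARCH_31		0x0f000006u
#define PVR_ARCH_31_P11		0x0f000007u
#define PVR_ARCH_INVALID	0xffffffffu

/* Hypothetical stand-in for the vcpu/vcore state touched by the patch. */
struct model_vcpu {
	uint32_t arch_compat;
	bool sane;
};

/* Models the new check in kvmppc_sanity_check(): invalid compat => not sane. */
static int model_sanity_check(struct model_vcpu *v)
{
	v->sane = v->arch_compat != PVR_ARCH_INVALID;
	return v->sane ? 0 : -EINVAL;
}

/* Models kvmppc_set_arch_compat(): a Power11 request needs the P11 PVR feature. */
static int model_set_arch_compat(struct model_vcpu *v, uint32_t arch_compat,
				 bool host_has_p11_pvr)
{
	if (arch_compat == PVR_ARCH_31_P11 && !host_has_p11_pvr)
		arch_compat = PVR_ARCH_INVALID;

	v->arch_compat = arch_compat;
	return model_sanity_check(v);
}

/* Models the vCPU run path refusing to run a vCPU that is not sane. */
static int model_vcpu_run(const struct model_vcpu *v)
{
	return v->sane ? 0 : -EINVAL;
}
```

In this model, a Power11 request on a host without the P11 PVR feature
fails at set time, and it still fails at run time if userspace ignores
the first error.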




* Re: [PATCH v7 15/15] arm64: mm: Unmap kernel data/bss entirely from the linear map
From: Marek Szyprowski @ 2026-06-09  6:22 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-kernel, will, catalin.marinas, mark.rutland, Ard Biesheuvel,
	Ryan Roberts, Anshuman Khandual, Kevin Brodsky, Liz Prucka,
	Seth Jenkins, Kees Cook, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Jann Horn, linux-mm, linux-hardening, linuxppc-dev,
	linux-sh
In-Reply-To: <20260529150150.1670604-32-ardb+git@google.com>

Dear All,

On 29.05.2026 17:02, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> The linear aliases of the kernel text and rodata are also mapped
> read-only in the linear map. Given that the contents of these regions
> are mostly identical to the version in the loadable image, mapping them
> read-only and leaving their contents visible is a reasonable hardening
> measure.
>
> Data and bss, however, are now also mapped read-only but the contents of
> these regions are more likely to contain data that we'd rather not leak.
> So let's unmap these entirely in the linear map when the kernel is
> running normally.
>
> When going into hibernation or waking up from it, these regions need to
> be mapped, so map the region initially, and toggle the valid bit to
> map/unmap the region as needed.
>
> Doing so is required because pages covering the kernel image are marked
> as PageReserved, and therefore disregarded for snapshotting by the
> hibernate logic unless they are mapped.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
This commit landed in yesterday's linux-next as commit 63e0b6a5b693
("arm64: mm: Unmap kernel data/bss entirely from the linear map").
In my tests I found that it breaks booting of RaspberryPi3 and
RaspberryPi4 boards with the following kernel panic:

kvm [1]: nv: 570 coarse grained trap handlers
kvm [1]: nv: 710 fine grained trap handlers
kvm [1]: IPA Size Limit: 40 bits
Unable to handle kernel paging request at virtual address ffff000003a23000
Mem abort info:
  ESR = 0x0000000096000147
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x07: level 3 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000147, ISS2 = 0x00000000
  CM = 1, WnR = 1, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000002609000
[ffff000003a23000] pgd=0000000000000000, p4d=180000003b3ff403, pud=180000003b3fe403, pmd=180000003b3e6403, pte=00e8000003a23f06
Internal error: Oops: 0000000096000147 [#1]  SMP
Modules linked in:
CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.1.0-rc1+ #16768 PREEMPT
Hardware name: Raspberry Pi 3 Model B (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dcache_clean_inval_poc+0x24/0x48
lr : kvm_arm_init+0xa8c/0x165c
sp : ffff8000844bbd00
...
Call trace:
 dcache_clean_inval_poc+0x24/0x48 (P)
 do_one_initcall+0x68/0x4f4
 kernel_init_freeable+0x24c/0x360
 kernel_init+0x24/0x1dc
 ret_from_fork+0x10/0x20
Code: 9ac32042 d1000443 8a230000 d503201f (d50b7e20)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x00000000,03000008,00040000,0400421b
Memory Limit: none
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
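
The fields printed under "Mem abort info" and "Data abort info" can be
recovered by hand from the ESR value in the log. The sketch below decodes
only the fields needed here; struct esr_fields and decode_esr() are
hypothetical names, and the bit positions are those of the arm64 data
abort syndrome as printed above (EC, IL, CM, WnR, FSC).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the few ESR fields of interest. */
struct esr_fields {
	unsigned int ec;	/* exception class, bits [31:26] */
	unsigned int il;	/* instruction length, bit 25 (1 = 32 bits) */
	unsigned int iss;	/* syndrome, bits [24:0] */
	unsigned int cm;	/* cache maintenance, ISS bit 8 */
	unsigned int wnr;	/* write not read, ISS bit 6 */
	unsigned int fsc;	/* fault status code, ISS bits [5:0] */
};

static struct esr_fields decode_esr(uint64_t esr)
{
	struct esr_fields f;

	f.ec  = (esr >> 26) & 0x3f;
	f.il  = (esr >> 25) & 0x1;
	f.iss = esr & 0x1ffffff;
	f.cm  = (f.iss >> 8) & 0x1;
	f.wnr = (f.iss >> 6) & 0x1;
	f.fsc = f.iss & 0x3f;
	return f;
}

/* Bit 0 of an arm64 page table descriptor is the valid bit. */
static int pte_is_valid(uint64_t pte)
{
	return pte & 0x1;
}
```

For ESR 0x96000147 this gives EC = 0x25, CM = 1 and FSC = 0x07, matching
the log: a level 3 translation fault raised by a cache maintenance
operation, with pc in dcache_clean_inval_poc. The dumped pte value
00e8000003a23f06 has bit 0 clear, i.e. the entry is not valid.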



> ---
>  arch/arm64/mm/mmu.c | 45 ++++++++++++++++++--
>  1 file changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 7b18dc2f1721..07a6fa210171 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -24,6 +24,7 @@
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/set_memory.h>
> +#include <linux/suspend.h>
>  #include <linux/kfence.h>
>  #include <linux/pkeys.h>
>  #include <linux/mm_inline.h>
> @@ -1056,6 +1057,29 @@ static void __init __map_memblock(phys_addr_t start, phys_addr_t end,
>  				 end - start, prot, early_pgtable_alloc, flags);
>  }
>  
> +static void mark_linear_data_alias_valid(bool valid)
> +{
> +	set_memory_valid((unsigned long)lm_alias(__init_end),
> +			 (unsigned long)(__bss_stop - __init_end) / PAGE_SIZE,
> +			 valid);
> +}
> +
> +static int arm64_hibernate_pm_notify(struct notifier_block *nb,
> +				     unsigned long mode, void *unused)
> +{
> +	switch (mode) {
> +	default:
> +		break;
> +	case PM_POST_HIBERNATION:
> +		mark_linear_data_alias_valid(false);
> +		break;
> +	case PM_HIBERNATION_PREPARE:
> +		mark_linear_data_alias_valid(true);
> +		break;
> +	}
> +	return 0;
> +}
> +
>  void __init mark_linear_text_alias_ro(void)
>  {
>  	/*
> @@ -1064,6 +1088,21 @@ void __init mark_linear_text_alias_ro(void)
>  	update_mapping_prot(__pa_symbol(_text), (unsigned long)lm_alias(_text),
>  			    (unsigned long)__init_begin - (unsigned long)_text,
>  			    PAGE_KERNEL_RO);
> +
> +	/*
> +	 * Register a PM notifier to remap the linear alias of data/bss as
> +	 * valid read-only before hibernation. This is needed because the
> +	 * snapshot logic disregards PageReserved pages (such as the ones
> +	 * covering the kernel image) unless they are mapped in the linear
> +	 * map.
> +	 */
> +	if (IS_ENABLED(CONFIG_HIBERNATION)) {
> +		static struct notifier_block nb = {
> +			.notifier_call = arm64_hibernate_pm_notify
> +		};
> +
> +		register_pm_notifier(&nb);
> +	}
>  }
>  
>  #ifdef CONFIG_KFENCE
> @@ -1193,10 +1232,8 @@ static void __init map_mem(void)
>  			       flags);
>  	}
>  
> -	/* Map the kernel data/bss read-only in the linear map */
> -	__map_memblock(init_end, kernel_end, PAGE_KERNEL_RO, flags);
> -	flush_tlb_kernel_range((unsigned long)lm_alias(__init_end),
> -			       (unsigned long)lm_alias(__bss_stop));
> +	/* Map the kernel data/bss as invalid in the linear map */
> +	mark_linear_data_alias_valid(false);
>  }
>  
>  void mark_rodata_ro(void)

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland




* Re: [PATCH v7 15/15] arm64: mm: Unmap kernel data/bss entirely from the linear map
From: Marek Szyprowski @ 2026-06-09  6:28 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-kernel, will, catalin.marinas, mark.rutland, Ard Biesheuvel,
	Ryan Roberts, Anshuman Khandual, Kevin Brodsky, Liz Prucka,
	Seth Jenkins, Kees Cook, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Jann Horn, linux-mm, linux-hardening, linuxppc-dev,
	linux-sh
In-Reply-To: <a1b27e97-182c-485d-a448-56c19c5de2c2@samsung.com>

On 09.06.2026 08:22, Marek Szyprowski wrote:
> On 29.05.2026 17:02, Ard Biesheuvel wrote:
>> From: Ard Biesheuvel <ardb@kernel.org>
>>
>> The linear aliases of the kernel text and rodata are also mapped
>> read-only in the linear map. Given that the contents of these regions
>> are mostly identical to the version in the loadable image, mapping them
>> read-only and leaving their contents visible is a reasonable hardening
>> measure.
>>
>> Data and bss, however, are now also mapped read-only but the contents of
>> these regions are more likely to contain data that we'd rather not leak.
>> So let's unmap these entirely in the linear map when the kernel is
>> running normally.
>>
>> When going into hibernation or waking up from it, these regions need to
>> be mapped, so map the region initially, and toggle the valid bit to
>> map/unmap the region as needed.
>>
>> Doing so is required because pages covering the kernel image are marked
>> as PageReserved, and therefore disregarded for snapshotting by the
>> hibernate logic unless they are mapped.
>>
>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> This commit landed in yesterday's linux-next as commit 63e0b6a5b693
> ("arm64: mm: Unmap kernel data/bss entirely from the linear map").
> In my tests I found that it breaks booting of RaspberryPi3 and
> RaspberryPi4 boards with the following kernel panic:
One more comment - reverting 63e0b6a5b693 and 53205d56212c (dependent
change) on top of next-20260608 fixes this issue.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland




* Re: [PATCH v7 15/15] arm64: mm: Unmap kernel data/bss entirely from the linear map
From: Ard Biesheuvel @ 2026-06-09  6:31 UTC (permalink / raw)
  To: Marek Szyprowski, Ard Biesheuvel, linux-arm-kernel
  Cc: linux-kernel, Will Deacon, Catalin Marinas, Mark Rutland,
	Ryan Roberts, Anshuman Khandual, Kevin Brodsky, Liz Prucka,
	Seth Jenkins, Kees Cook, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Jann Horn, linux-mm, linux-hardening, linuxppc-dev,
	linux-sh
In-Reply-To: <6a9c0f55-fe98-4063-864b-8f7e1f4fefd7@samsung.com>



On Tue, 9 Jun 2026, at 08:28, Marek Szyprowski wrote:
> On 09.06.2026 08:22, Marek Szyprowski wrote:
>> On 29.05.2026 17:02, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> The linear aliases of the kernel text and rodata are also mapped
>>> read-only in the linear map. Given that the contents of these regions
>>> are mostly identical to the version in the loadable image, mapping them
>>> read-only and leaving their contents visible is a reasonable hardening
>>> measure.
>>>
>>> Data and bss, however, are now also mapped read-only but the contents of
>>> these regions are more likely to contain data that we'd rather not leak.
>>> So let's unmap these entirely in the linear map when the kernel is
>>> running normally.
>>>
>>> When going into hibernation or waking up from it, these regions need to
>>> be mapped, so map the region initially, and toggle the valid bit to
>>> map/unmap the region as needed.
>>>
>>> Doing so is required because pages covering the kernel image are marked
>>> as PageReserved, and therefore disregarded for snapshotting by the
>>> hibernate logic unless they are mapped.
>>>
>>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> This commit landed in yesterday's linux-next as commit 63e0b6a5b693
>> ("arm64: mm: Unmap kernel data/bss entirely from the linear map").
>> In my tests I found that it breaks booting of RaspberryPi3 and
>> RaspberryPi4 boards with the following kernel panic:
> One more comment - reverting 63e0b6a5b693 and 53205d56212c (dependent
> change) on top of next-20260608 fixes this issue.
>

Thanks for the report, and for the confirmation that those reverts fix
the issue - this was reported here as well:

https://lore.kernel.org/all/aicVyebkEMs6w6UV@sirena.co.uk/




* Re: [PATCH v3] powerpc/pseries/iommu: Add TCEs for 16GB pages when RAM is pre-mapped
From: Venkat Rao Bagalkote @ 2026-06-09  6:38 UTC (permalink / raw)
  To: Harsh Prateek Bora, Gaurav Batra, maddy, sbhat
  Cc: linuxppc-dev, ritesh.list, vaibhav, donettom
In-Reply-To: <f7e3e2ec-5cbc-4ccc-bbc3-ec3ddd4e8b62@linux.ibm.com>


On 31/05/26 11:18 pm, Harsh Prateek Bora wrote:
> + Venkat
>
> Hi Gaurav,
> Would just like to confirm if it is tested with multiple iterations of 
> hotplug of RAM (DLPAR) as well?
>
> Hi Venkat,
> Could you please help validate the patch for above-mentioned scenario 
> as well?


Hi Harsh,

Thanks for looping me in on this.

I have performed DLPAR testing of memory for around 100 iterations and 
also carried out DLPAR of an adapter a few times.

I did not observe any issues during these tests.

Please let me know if any additional scenarios need to be validated.

Thanks again for cc’ing me and reaching out for this validation.

Please add the tag below.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>


Regards,
Venkat.

>
> Hi Shivaprasad,
> Please share your review feedback or any additional testing scenarios 
> needed?
>
> Thanks
> Harsh
>
> On 15/05/26 9:21 pm, Gaurav Batra wrote:
>> On PowerPC, if the Dynamic DMA Window is big enough, RAM is pre-mapped. To
>> determine the size of RAM, the PAPR+ property "ibm,lrdr-capacity" is used.
>> This OF property dictates the maximum size of RAM an LPAR can have,
>> including DR added memory.
>>
>> In PowerPC, 16GB pages can be allocated at machine level and then
>> assigned to LPARs. These 16GB pages are added to LPAR memory at the time
>> of boot. The address range for these 16GB pages is above MAX RAM an LPAR
>> can have (ibm,lrdr-capacity). In the current implementation, these 16GB
>> pages are being excluded from pre-mapped TCEs. A driver can have DMA
>> buffers allocated from 16GB pages. This causes the platform to raise an
>> EEH when DMA is attempted on buffers in the 16GB memory range.
>>
>> commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly
>> adds TCEs for pmemory")
>>
>> Prior to the above patch, memblock_end_of_DRAM() was being used to
>> determine the MAX memory of an LPAR. This included 16GB pages as well.
>> The issue with using memblock_end_of_DRAM() is that when pmemory is
>> converted to RAM via daxctl command, the DDW engine will incorrectly try
>> to add TCEs for pmemory as well.
>>
>> Below is the address distribution of RAM, 16GB pages and pmemory for an
>> LPAR with max memory of 256GB, memory allocated 64GB, 2 16GB pages and
>> assigned pmemory of 8GB.
>>
>> RANGE                                 SIZE  STATE REMOVABLE BLOCK
>> 0x0000000000000000-0x0000000fffffffff  64G online       yes 0-255
>> 0x0000004000000000-0x00000047ffffffff  32G online       yes 1024-1151
>>
>> cat /sys/bus/nd/devices/region0/resource
>> 0x40100000000
>> cat /sys/bus/nd/devices/region0/size
>> 8589934592
>>
>> The approach to fix this problem is to revert the code changes
>> introduced by the above patch and to stash away the MAX memory of an
>> LPAR, including 16GB pages, at LPAR boot time. This value is then
>> used whenever TCEs need to be pre-mapped, i.e. in enable_ddw() or
>> iommu_mem_notifier().
>>
>> Fixes: 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier 
>> incorrectly adds TCEs for pmemory")
>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>> ---
>>
>> Change log:
>>
>> V2 -> V3
>>
>> 1. Harsh: Remove R-b tags from the change log
>>
>>     Response: Incorporated changes
>>
>> 2. Harsh: Change WARN_ON() to WARN_ONCE()
>>
>>     Response: Incorporated changes
>>
>> 3. Harsh: Fix indentation
>>
>>     Response: Incorporated changes
>>
>> 4. Harsh: Replace comment with a log if limit < arg->nr_pages ?
>>
>>     Response: Doesn't seem to be needed since the WARN_ONCE() will 
>> log this
>>     scenario. I removed the comment instead.
>>
>> V1 -> V2
>>
>> 1. Harsh: Not only start_pfn, but end_pfn also needs to be within 
>> allowed
>>     range, which may require clamping arg->nr_pages if crossing the 
>> limits.
>>
>>     Response: Incorporated changes.
>>
>>   arch/powerpc/platforms/pseries/iommu.c | 58 ++++++++++++++++++--------
>>   1 file changed, 41 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 3e1f915fe4f6..7bbe070006fa 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -69,6 +69,8 @@ static struct iommu_table 
>> *iommu_pseries_alloc_table(int node)
>>       return tbl;
>>   }
>>   +static phys_addr_t pseries_ddw_max_ram;
>> +
>>   #ifdef CONFIG_IOMMU_API
>>   static struct iommu_table_group_ops spapr_tce_table_group_ops;
>>   #endif
>> @@ -1285,13 +1287,17 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>     static phys_addr_t ddw_memory_hotplug_max(void)
>>   {
>> -    resource_size_t max_addr;
>> +    resource_size_t max_addr = memory_hotplug_max();
>> +    struct device_node *memory;
>>   -#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>> -    max_addr = hot_add_drconf_memory_max();
>> -#else
>> -    max_addr = memblock_end_of_DRAM();
>> -#endif
>> +    for_each_node_by_type(memory, "memory") {
>> +        struct resource res;
>> +
>> +        if (of_address_to_resource(memory, 0, &res))
>> +            continue;
>> +
>> +        max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>> +    }
>>         return max_addr;
>>   }
>> @@ -1446,7 +1452,7 @@ static struct property 
>> *ddw_property_create(const char *propname, u32 liobn, u64
>>   static bool enable_ddw(struct pci_dev *dev, struct device_node 
>> *pdn, u64 dma_mask)
>>   {
>>       int len = 0, ret;
>> -    int max_ram_len = order_base_2(ddw_memory_hotplug_max());
>> +    int max_ram_len = order_base_2(pseries_ddw_max_ram);
>>       struct ddw_query_response query;
>>       struct ddw_create_response create;
>>       int page_shift;
>> @@ -1668,7 +1674,7 @@ static bool enable_ddw(struct pci_dev *dev, 
>> struct device_node *pdn, u64 dma_mas
>>         if (direct_mapping) {
>>           /* DDW maps the whole partition, so enable direct DMA 
>> mapping */
>> -        ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> 
>> PAGE_SHIFT,
>> +        ret = walk_system_ram_range(0, pseries_ddw_max_ram >> 
>> PAGE_SHIFT,
>>                           win64->value, 
>> tce_setrange_multi_pSeriesLP_walk);
>>           if (ret) {
>>               dev_info(&dev->dev, "failed to map DMA window for %pOF: 
>> %d\n",
>> @@ -2419,23 +2425,35 @@ static int iommu_mem_notifier(struct 
>> notifier_block *nb, unsigned long action,
>>   {
>>       struct dma_win *window;
>>       struct memory_notify *arg = data;
>> +    unsigned long limit = arg->nr_pages;
>> +    unsigned long max_ram_pages = pseries_ddw_max_ram >> PAGE_SHIFT;
>>       int ret = 0;
>>         /* This notifier can get called when onlining persistent 
>> memory as well.
>>        * TCEs are not pre-mapped for persistent memory. Persistent 
>> memory will
>> -     * always be above ddw_memory_hotplug_max()
>> +     * always be above pseries_ddw_max_ram
>>        */
>> +    if (arg->start_pfn >= max_ram_pages)
>> +        return NOTIFY_OK;
>> +
>> +    /* RAM is being DLPAR'ed. The range should never exceed max ram.
>> +     * Just in case, clamp the range and throw a warning.
>> +     */
>> +    if (arg->start_pfn + limit > max_ram_pages) {
>> +        limit = max_ram_pages - arg->start_pfn;
>> +        WARN_ONCE(1, "Limiting Page Range %lx - %lx to Max Mem 
>> Pages: %lx\n",
>> +                    arg->start_pfn, arg->start_pfn + arg->nr_pages,
>> +                    max_ram_pages);
>> +    }
>>         switch (action) {
>>       case MEM_GOING_ONLINE:
>>           spin_lock(&dma_win_list_lock);
>>           list_for_each_entry(window, &dma_win_list, list) {
>> -            if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> -                ddw_memory_hotplug_max()) {
>> +            if (window->direct) {
>>                   ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> -                        arg->nr_pages, window->prop);
>> +                        limit, window->prop);
>>               }
>> -            /* XXX log error */
>>           }
>>           spin_unlock(&dma_win_list_lock);
>>           break;
>> @@ -2443,12 +2461,10 @@ static int iommu_mem_notifier(struct 
>> notifier_block *nb, unsigned long action,
>>       case MEM_OFFLINE:
>>           spin_lock(&dma_win_list_lock);
>>           list_for_each_entry(window, &dma_win_list, list) {
>> -            if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> -                ddw_memory_hotplug_max()) {
>> +            if (window->direct) {
>>                   ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> -                        arg->nr_pages, window->prop);
>> +                        limit, window->prop);
>>               }
>> -            /* XXX log error */
>>           }
>>           spin_unlock(&dma_win_list_lock);
>>           break;
>> @@ -2532,6 +2548,14 @@ void __init iommu_init_early_pSeries(void)
>>       register_memory_notifier(&iommu_mem_nb);
>>         set_pci_dma_ops(&dma_iommu_ops);
>> +
>> +    /* During init determine the max memory an LPAR can have and set 
>> it. This
>> +     * will be used for pre-mapping RAM in DDW. 
>> memblock_end_of_DRAM() can
>> +     * change during the running of LPAR - daxctl can add pmemory as
>> +     * "system-ram". This memory range should not be pre-mapped in 
>> DDW since
>> +     * the address of pmemory can be much higher than the DDW size.
>> +     */
>> +    pseries_ddw_max_ram = ddw_memory_hotplug_max();
>>   }
>>     static int __init disable_multitce(char *str)
>>
>> base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a
>
>
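
The range handling that the quoted patch adds to iommu_mem_notifier() can
be checked against the address layout given in the commit message. This is
a userspace sketch only: clamp_nr_pages() and pfn_of() are hypothetical
names, 64K pages are assumed, and the stashed maximum is assumed to be
0x4800000000 (the end of the 16GB-page range shown in the layout).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 16 /* assumption: 64K pages */

static uint64_t pfn_of(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

/*
 * Hypothetical helper: number of pages of [start_pfn, start_pfn + nr_pages)
 * that lie below max_ram_pages. A result of 0 corresponds to the early
 * NOTIFY_OK return in the patch (range entirely above the stashed maximum,
 * e.g. persistent memory); a partial overlap is clamped.
 */
static uint64_t clamp_nr_pages(uint64_t start_pfn, uint64_t nr_pages,
			       uint64_t max_ram_pages)
{
	if (start_pfn >= max_ram_pages)
		return 0;

	if (start_pfn + nr_pages > max_ram_pages)
		return max_ram_pages - start_pfn;

	return nr_pages;
}
```

With these assumptions, a block inside the 16GB-page range at 0x4000000000
is mapped in full, while the pmemory region at 0x40100000000 lies above the
maximum and is skipped.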



* Re: [PATCH] tools/perf/sched: Update process names of processes in zombie state for both -s and -S options
From: Athira Rajeev @ 2026-06-09  8:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Anubhav Shelat, Namhyung Kim, Ian Rogers, jolsa, adrian.hunter,
	mpetlan, tmricht, maddy, linux-perf-users, linuxppc-dev, hbathini,
	Tejas.Manhas1, Tanushree.Shah, Shivani.Nittor
In-Reply-To: <aiGZOn1P92CEABv3@x1>



> On 4 Jun 2026, at 8:56 PM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> On Thu, Jun 04, 2026 at 08:38:46PM +0530, Athira Rajeev wrote:
>>> On 4 Jun 2026, at 7:47 PM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>>> 
>>> On Thu, May 21, 2026 at 11:17:58AM -0300, Arnaldo Carvalho de Melo wrote:
>>>> On Thu, May 21, 2026 at 02:02:53PM +0530, Athira Rajeev wrote:
>>>>>> On 27 Apr 2026, at 11:26 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>>>>> On Sun, Apr 26, 2026 at 03:09:30PM +0530, Athira Rajeev wrote:
>>>>>>> In the Red Hat perftool testsuite, this test was observed to fail:
>>>>>>>  -- [ FAIL ] -- perf_sched :: test_timehist :: --with-summary (output regexp parsing)
>>>>>>> 
>>>>>>> This led to analysis of "perf sched timehist" summary options.
>>>>>>> 
>>>>>>> # perf sched record -a -o ./perf.data -- sleep 0.1
>>>>>>> This will record using perf sched record
>>>>>>> 
>>>>>>> perf sched timehist has two options "-s" and "-S"
>>>>>>> # perf sched -i ./perf.data timehist -S
>>>>>>> -S : Captures summary also at the end
>>>>>>> 
>>>>>>> # perf sched -i ./perf.data timehist -s
>>>>>>> -s : Captures only summary
>>>>>>> 
>>>>>>> The test saves -s result which has only summary and compares with
>>>>>>> summary which comes at the end from -S . Since there is a difference
>>>>>>> in these two, test fails.
>>>>>>> 
>>>>>>> Checking the behaviour change in -S and -s results, difference is:
>>>>>>> 
>>>>>>>                rcu_sched[16]       2          4        0.013      0.001       0.003       0.006   33.23       0
>>>>>>>             migration/11[73]       2          1        0.006      0.006       0.006       0.006    0.00       0
>>>>>>>              migration/3[33]       2          1        0.006      0.006       0.006       0.006    0.00       0
>>>>>>> -               :216753[216753]      -1          1        0.041      0.041       0.041       0.041    0.00       0
>>>>>>> +                 sleep[216753]      -1          1        0.041      0.041       0.041       0.041    0.00       0
>>>>>>>              migration/8[58]       2          1        0.005      0.005       0.005       0.005    0.00       0
>>>>>>>          NetworkManager[811]       1          2        0.089      0.028       0.044       0.060   36.06       0
>>>>>>>             migration/13[83]       2          1        0.005      0.005       0.005       0.005    0.00       0
>>>>>>> 
>>>>>>> Here 216753 is pid for sleep which is a zombie process. This is
>>>>>>> happening in latest kernel due to an update in "-S" result.
>>>>>>> In -S, the process name appears in the results "sleep[216753]",
>>>>>>> whereas in -s, only the pid is present in the summary result
>>>>>>> ":216753[216753]".
>>>>>>> 
>>>>>>> After commit 39f473f6d0b2 ("perf sched timehist: decode process names
>>>>>>> of processes in zombie state")
>>>>>>> for -S option, if process name is using pid, it uses different way to
>>>>>>> set it. So that we get the process name and not just Pid.
>>>>>>> 
>>>>>>> This change went in only for timehist_print_sample() function.
>>>>>>> Add this improvement in generic place so that even -s option (which
>>>>>>> captures summary) also will have meaningful information.
>>>>>>> 
>>>>>>> Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
>>>>>> 
>>>>>> Acked-by: Namhyung Kim <namhyung@kernel.org>
>>>>>> 
>>>>>> Thanks,
>>>>>> Namhyung
>>>>> Hi,
>>>>> 
>>>>> Can we please have this pulled in, if the patch looks fine ?
>>>> 
>>>> Can you please check applying it on top of current perf-tools-next?
>>> 
>>> So, this seems to be also addressed by:
>>> 
>>> commit 39f473f6d0b24cf375893f2110b1cc9d8a079a42
>>> Author: Anubhav Shelat <ashelat@redhat.com>
>>> Date:   Wed Jul 16 16:39:15 2025 -0400
>>> 
>>>   perf sched timehist: decode process names of processes in zombie state
>>> 
>>>   Previously when running perf trace timehist --state, when recording
>>>   processes in the zombie state the process name would not be decoded
>>>   properly and appears with just the PID:
>>> 
>>>   1140057.412177 [0006]  Mutter Input Th[3139/3104]          0.956      0.019      0.041      S
>>>   1140057.412222 [0012]  :1248612[1248612]                   0.000      0.000      0.332      Z
>>>   1140057.412275 [0004]  <idle>                              0.052      0.052      0.953      I
>>>   1140057.412284 [0008]  <idle>                              0.070      0.070      0.932      I
>>>   1140057.412333 [0004]  KMS thread[3126/3104]               0.953      0.112      0.058      S
>>> 
>>>   Now some extra processing has been added to decode the process name:
>>> 
>>>   1140057.412177 [0006]  Mutter Input Th[3139/3104]          0.956      0.019      0.041      S
>>>   1140057.412222 [0012]  sleep[1248612]                      0.000      0.000      0.332      Z
>>>   1140057.412275 [0004]  <idle>                              0.052      0.052      0.953      I
>>>   1140057.412284 [0008]  <idle>                              0.070      0.070      0.932      I
>>>   1140057.412333 [0004]  KMS thread[3126/3104]               0.953      0.112      0.058      S
>>> 
>>>   Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
>>>   Link: https://lore.kernel.org/r/20250716203914.45772-2-ashelat@redhat.com
>>>   Signed-off-by: Namhyung Kim <namhyung@kernel.org>
>>> 
>>> 
>>> No? It is not applying to perf-tools-next, a quick look found the patch
>>> above.
>> 
>> Hi Arnaldo
>> 
>> commit 39f473f6d0b2 ("perf sched timehist: decode process names
>> of processes in zombie state")
>> added the change for the -S option. The patch I submitted adds the same process name change for the "-s" option as well.
>> 
>> I will check applying this on top of current perf-tools-next
> 
> Thanks for looking into this!
> 
> - Arnaldo


Hi Arnaldo

I have posted a rebased patch on top of perf-tools-next here: https://lore.kernel.org/linux-perf-users/20260607140245.95706-1-atrajeev@linux.ibm.com/

Thanks
Athira



^ permalink raw reply

* Re: [PATCH v7 15/15] arm64: mm: Unmap kernel data/bss entirely from the linear map
From: Vladimir Murzin @ 2026-06-09  8:26 UTC (permalink / raw)
  To: Marek Szyprowski, Ard Biesheuvel, linux-arm-kernel
  Cc: linux-kernel, will, catalin.marinas, mark.rutland, Ard Biesheuvel,
	Ryan Roberts, Anshuman Khandual, Kevin Brodsky, Liz Prucka,
	Seth Jenkins, Kees Cook, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Jann Horn, linux-mm, linux-hardening, linuxppc-dev,
	linux-sh
In-Reply-To: <6a9c0f55-fe98-4063-864b-8f7e1f4fefd7@samsung.com>

Hi,

On 6/9/26 07:28, Marek Szyprowski wrote:
> On 09.06.2026 08:22, Marek Szyprowski wrote:
>> On 29.05.2026 17:02, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> The linear aliases of the kernel text and rodata are also mapped
>>> read-only in the linear map. Given that the contents of these regions
>>> are mostly identical to the version in the loadable image, mapping them
>>> read-only and leaving their contents visible is a reasonable hardening
>>> measure.
>>>
>>> Data and bss, however, are now also mapped read-only but the contents of
>>> these regions are more likely to contain data that we'd rather not leak.
>>> So let's unmap these entirely in the linear map when the kernel is
>>> running normally.
>>>
>>> When going into hibernation or waking up from it, these regions need to
>>> be mapped, so map the region initially, and toggle the valid bit so
>>> map/unmap the region as needed.
>>>
>>> Doing so is required because pages covering the kernel image are marked
>>> as PageReserved, and therefore disregarded for snapshotting by the
>>> hibernate logic unless they are mapped.
>>>
>>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> This commit landed in yesterday's linux-next as commit 63e0b6a5b693
>> ("arm64: mm: Unmap kernel data/bss entirely from the linear map").
>> In my tests I found that it breaks booting of RaspberryPi3 and
>> RaspberryPi4 boards with the following kernel panic:
> One more comment - reverting 63e0b6a5b693 and 53205d56212c (dependent
> change) on top of next-20260608 fixes this issue.
> 

Thanks for the report! It seems it has already been reported and discussed in
another thread [1].

[1] https://lore.kernel.org/linux-arm-kernel/aicVyebkEMs6w6UV@sirena.co.uk/

Cheers
Vladimir


> Best regards
> -- Marek Szyprowski, PhD Samsung R&D Institute Poland
> 



^ permalink raw reply

* [PATCH v3] PCI: pnv_php: Add null checks for OpenCAPI PHBs
From: Aditya Gupta @ 2026-06-09  8:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, linux-pci, Madhavan Srinivasan,
	Timothy Pearson, Bjorn Helgaas, Shawn Anastasio
  Cc: Bjorn Helgaas, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), stable

For OpenCAPI PHB direct slots, the .pdev for php_slots will be NULL.

Various sections of the code in pnv_php can do a null dereference and
crash the kernel.

Originally, the issue was hit during boot:

  PowerPC PowerNV PCI Hotplug Driver version: 0.1
  BUG: Kernel NULL pointer dereference at 0x00000074
  Faulting instruction address: 0xc000000000b75fd0
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA PowerNV
  ...
  NIP [c000000000b75fd0] pnv_php_get_adapter_state+0x60/0x154
  LR [c000000000b75fbc] pnv_php_get_adapter_state+0x4c/0x154
  Call Trace:
  [c000c0000688f990] [c000000000b75fbc] pnv_php_get_adapter_state+0x4c/0x154 (unreliable)
  [c000c0000688fa20] [c000000000b78bd0] pnv_php_enable+0x94/0x378
  [c000c0000688fac0] [c000000000b7912c] pnv_php_register_one.isra.0+0x11c/0x1e0

This occurs for hotplug slots on root buses where bus->self == NULL,
such as OpenCAPI PHB direct slots. An added debug print (not part of
this patch) confirmed it was OpenCAPI:

  pnv_php: slot 'OPENCAPI-0009' has NULL pdev (bus 0009:00, parent=NO (root bus))
  pnv_php: slot 'OPENCAPI-0009' dn->full_name='pciex@603a000000000', compatible='ibm,power10-pau-opencapi-pciex'

This only required a null check in 'pnv_php_get_adapter_state', which
allowed the kernel to boot.

Even with the 'pnv_php_get_adapter_state' null check, there are more
possible null dereferences pointed out by sashiko, including cases where
userspace can crash the kernel, such as:

  $ cat /sys/bus/pci/slots/*/attention
  ...
  Kernel attempted to read user page (6e) - exploit attempt? (uid: 0)
  BUG: Kernel NULL pointer dereference on read at 0x0000006e
  Faulting instruction address: 0xc000000000a83334
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA PowerNV
  ...
  [c000000046707a20] [c000000046707b90] 0xc000000046707b90 (unreliable)
  [c000000046707a70] [0000000000000001] 0x1
  [c000000046707ab0] [c000000000acb00c] attention_read_file+0x54/0xa8
  [c000000046707b30] [c000000000abfbfc] pci_slot_attr_show+0x3c/0x58
  [c000000046707b50] [c0000000008181ec] sysfs_kf_seq_show+0xd4/0x204
  [c000000046707be0] [c000000000815004] kernfs_seq_show+0x44/0x58

Add null checks to prevent the null dereferences.

Cc: stable@vger.kernel.org
Fixes: 80f9fc236279 ("PCI: pnv_php: Work around switches with broken presence detection")
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>

---
Changelog:
v3:
+ split the patch from v2 series, as it's independent
+ incorporate reviews from bjorn to improve the description

v2:
+ sashiko pointed out various pre-existing null pointer derefs, which
  can let userspace crash the kernel, fix them
---
---
 drivers/pci/hotplug/pnv_php.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index ff92a5c301b8..d0f5e8ad1f71 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -47,6 +47,9 @@ static void pnv_php_disable_irq(struct pnv_php_slot *php_slot,
 	struct pci_dev *pdev = php_slot->pdev;
 	u16 ctrl;
 
+	if (!pdev)
+		return;
+
 	if (php_slot->irq > 0) {
 		pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &ctrl);
 		ctrl &= ~(PCI_EXP_SLTCTL_HPIE |
@@ -414,7 +417,8 @@ static int pnv_php_get_adapter_state(struct hotplug_slot *slot, u8 *state)
 	 */
 	ret = pnv_pci_get_presence_state(php_slot->id, &presence);
 	if (ret >= 0) {
-		if (pci_pcie_type(php_slot->pdev) == PCI_EXP_TYPE_DOWNSTREAM &&
+		if (php_slot->pdev &&
+			pci_pcie_type(php_slot->pdev) == PCI_EXP_TYPE_DOWNSTREAM &&
 			presence == OPAL_PCI_SLOT_EMPTY) {
 			/*
 			 * Similar to pciehp_hpc, check whether the Link Active
@@ -442,6 +446,11 @@ static int pnv_php_get_raw_indicator_status(struct hotplug_slot *slot, u8 *state
 	struct pci_dev *bridge = php_slot->pdev;
 	u16 status;
 
+	if (!bridge) {
+		*state = 0;
+		return 0;
+	}
+
 	pcie_capability_read_word(bridge, PCI_EXP_SLTCTL, &status);
 	*state = (status & (PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC)) >> 6;
 	return 0;
@@ -514,11 +523,13 @@ static int pnv_php_activate_slot(struct pnv_php_slot *php_slot,
 			 * fence / freeze.
 			 */
 			SLOT_WARN(php_slot, "Try %d...\n", i + 1);
-			pci_set_pcie_reset_state(php_slot->pdev,
-						 pcie_warm_reset);
-			msleep(250);
-			pci_set_pcie_reset_state(php_slot->pdev,
-						 pcie_deassert_reset);
+			if (php_slot->pdev) {
+				pci_set_pcie_reset_state(php_slot->pdev,
+							 pcie_warm_reset);
+				msleep(250);
+				pci_set_pcie_reset_state(php_slot->pdev,
+							 pcie_deassert_reset);
+			}
 
 			ret = pnv_php_set_slot_power_state(
 				slot, OPAL_PCI_SLOT_POWER_ON);
@@ -911,6 +922,9 @@ pnv_php_detect_clear_suprise_removal_freeze(struct pnv_php_slot *php_slot)
 	struct eeh_pe *pe;
 	int i, rc;
 
+	if (!pdev)
+		return;
+
 	/*
 	 * When a device is surprise removed from a downstream bridge slot,
 	 * the upstream bridge port can still end up frozen due to related EEH
@@ -1093,6 +1107,9 @@ static void pnv_php_enable_irq(struct pnv_php_slot *php_slot)
 	struct pci_dev *pdev = php_slot->pdev;
 	int irq, ret;
 
+	if (!pdev)
+		return;
+
 	/*
 	 * The MSI/MSIx interrupt might have been occupied by other
 	 * drivers. Don't populate the surprise hotplug capability
-- 
2.54.0
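
The guard pattern used in the hunks above can be sketched outside the
kernel as follows. This is only an illustration under stated assumptions:
struct demo_bridge, struct demo_slot, DEMO_INDICATOR_MASK and
demo_get_raw_indicator_status() are hypothetical stand-ins, not the real
pnv_php types or PCI accessors, and the cached slot_ctl field replaces the
pcie_capability_read_word() call. It mirrors the idea that a slot without
a bridge device reports a neutral state instead of dereferencing NULL.

```c
#include <stddef.h>

/* Hypothetical stand-in for a PCI bridge; holds a cached Slot Control value. */
struct demo_bridge {
	unsigned short slot_ctl;
};

/* Hypothetical stand-in for a hotplug slot; pdev is NULL on root-bus slots. */
struct demo_slot {
	struct demo_bridge *pdev;
};

/* Assumed mask covering the attention and power indicator control bits. */
#define DEMO_INDICATOR_MASK 0x03c0

/*
 * Report the raw indicator bits for a slot. When the slot has no bridge
 * device, report 0 and succeed rather than touching a NULL pointer.
 */
static int demo_get_raw_indicator_status(const struct demo_slot *slot,
					 unsigned char *state)
{
	if (!slot->pdev) {
		*state = 0;
		return 0;
	}

	*state = (slot->pdev->slot_ctl & DEMO_INDICATOR_MASK) >> 6;
	return 0;
}
```

Returning success with a zeroed state keeps sysfs readers working for
slots that have no indicators to report, which appears to be the intent
of the pnv_php_get_raw_indicator_status() hunk.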



^ permalink raw reply related

* Re: [PATCH v2 1/3] ppc/pnv: Add null checks for OpenCapi PHBs
From: Aditya Gupta @ 2026-06-09  8:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linuxppc-dev, Madhavan Srinivasan, Timothy Pearson,
	Bjorn Helgaas, Shawn Anastasio, sashiko-bot, linux-pci,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	stable
In-Reply-To: <20260608153948.GA36499@bhelgaas>

On 08/06/26 21:09, Bjorn Helgaas wrote:

> On Wed, May 27, 2026 at 11:38:14PM +0530, Aditya Gupta wrote:
>> For opencapi phb direct slots, the .pdev for php_slots will be NULL
>>
>> Various sections of the code in pnv_php can do a null dereference and
>> crash the kernel.
>>
>> Originally, the issue was hit during boot:
>>
>>      [    1.568588] PowerPC PowerNV PCI Hotplug Driver version: 0.1
>>      [    1.569722] BUG: Kernel NULL pointer dereference at 0x00000074
>>      [    1.569811] Faulting instruction address: 0xc000000000b75fd0
>>      [    1.569890] Oops: Kernel access of bad area, sig: 11 [#1]
>>      [    1.569963] LE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA PowerNV
>>      ...
>>      [    1.571492] NIP [c000000000b75fd0] pnv_php_get_adapter_state+0x60/0x154
>>      [    1.571604] LR [c000000000b75fbc] pnv_php_get_adapter_state+0x4c/0x154
>>      [    1.571690] Call Trace:
>>      [    1.571725] [c000c0000688f990] [c000000000b75fbc] pnv_php_get_adapter_state+0x4c/0x154 (unreliable)
>>      [    1.571783] [c000c0000688fa20] [c000000000b78bd0] pnv_php_enable+0x94/0x378
>>      [    1.571951] [c000c0000688fac0] [c000000000b7912c] pnv_php_register_one.isra.0+0x11c/0x1e0
> Drop timestamps since they don't add useful information.
>
> Indent quoted material by two spaces to reduce wrapping.
>
> Run "git log --oneline drivers/pci/hotplug/pnv_php.c" and "git log
> --oneline drivers/pci/hotplug/" and match subject line style.
>
>> This occurs for hotplug slots on root buses where bus->self == NULL,
>> such as OpenCAPI PHB direct slots. An added debug print (not part of
>> this patch) confirmed it was opencapi:
> Style "OpenCAPI" and "PHB" consistently in commit log and subject.

Thanks for the review Bjorn, fixed the description and have sent the 
patch again as v3.

In v3, I have sent patch #1 independently for rc, and will send the 
rework patches (patches #2 and #3) separately, since I have to do extra 
fixes for pre-existing issues pointed out by sashiko.

Thanks,
- Aditya G




^ permalink raw reply

* Re: [PATCH v7 15/15] arm64: mm: Unmap kernel data/bss entirely from the linear map
From: Geert Uytterhoeven @ 2026-06-09  9:55 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Ard Biesheuvel, linux-arm-kernel, linux-kernel, will,
	catalin.marinas, mark.rutland, Ard Biesheuvel, Ryan Roberts,
	Anshuman Khandual, Kevin Brodsky, Liz Prucka, Seth Jenkins,
	Kees Cook, Mike Rapoport, David Hildenbrand, Andrew Morton,
	Jann Horn, linux-mm, linux-hardening, linuxppc-dev, linux-sh,
	Linux-Renesas
In-Reply-To: <6a9c0f55-fe98-4063-864b-8f7e1f4fefd7@samsung.com>

On Tue, 9 Jun 2026 at 08:28, Marek Szyprowski <m.szyprowski@samsung.com> wrote:
> On 09.06.2026 08:22, Marek Szyprowski wrote:
> > On 29.05.2026 17:02, Ard Biesheuvel wrote:
> >> From: Ard Biesheuvel <ardb@kernel.org>
> >>
> >> The linear aliases of the kernel text and rodata are also mapped
> >> read-only in the linear map. Given that the contents of these regions
> >> are mostly identical to the version in the loadable image, mapping them
> >> read-only and leaving their contents visible is a reasonable hardening
> >> measure.
> >>
> >> Data and bss, however, are now also mapped read-only but the contents of
> >> these regions are more likely to contain data that we'd rather not leak.
> >> So let's unmap these entirely in the linear map when the kernel is
> >> running normally.
> >>
> >> When going into hibernation or waking up from it, these regions need to
> >> be mapped, so map the region initially, and toggle the valid bit so
> >> map/unmap the region as needed.
> >>
> >> Doing so is required because pages covering the kernel image are marked
> >> as PageReserved, and therefore disregarded for snapshotting by the
> >> hibernate logic unless they are mapped.
> >>
> >> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > This commit landed in yesterday's linux-next as commit 63e0b6a5b693
> > ("arm64: mm: Unmap kernel data/bss entirely from the linear map").
> > In my tests I found that it breaks booting of RaspberryPi3 and
> > RaspberryPi4 boards with the following kernel panic:

Seeing the same panic on R-Car H3 ES2.0 (Cortex A57/A53), but not
on R-Car V4M (Cortex A76).

> One more comment - reverting 63e0b6a5b693 and 53205d56212c (dependent
> change) on top of next-20260608 fixes this issue.

Confirmed, too.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply

* Re: [PATCH] powerpc: Export set_memory_encrypted and set_memory_decrypted
From: Jiri Pirko @ 2026-06-09 11:43 UTC (permalink / raw)
  To: Sumit Semwal
  Cc: Jason Gunthorpe, Maxime Ripard, Christoph Hellwig, T.J. Mercier,
	maddy, mpe, npiggin, chleroy, linuxppc-dev, lkp, linux-kernel,
	iommu, linux-mm, agordeev, gerald.schaefer, linux-s390,
	Dan Williams, Tom Lendacky, x86, Arnd Bergmann
In-Reply-To: <CAO_48GH3NP09U6TdB5drbKY0TpwvtBXwrf=Jajsr5ttNbC_u9g@mail.gmail.com>

Mon, Jun 08, 2026 at 05:17:15PM +0200, sumit.semwal@linaro.org wrote:
>Hi Jason,
>
>On Thu, 4 Jun 2026 at 19:27, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Thu, Jun 04, 2026 at 12:51:49PM +0530, Sumit Semwal wrote:
>>
>> > Given that Christoph's objection is not really about the modules part,
>> > but that the set_memory_{encrypted,decrypted} should not be used here,
>> > one option is to revert 78b30c50a7ac until that issue is sorted out?
>>
>> Please no, we have stuff already using this so it would be a
>> functional regression. Revert making heaps into a module since that
>> doesn't have a functional regression.
>
>Thanks for your comments.
>
>To me, it looks like while system and system_cc_shared heaps share a
>lot of code, their user bases have different needs. It's apparent that
>system_cc_heap users don't care about it being a module while system
>heap users would very much like so.
>
>I also discussed this with Arnd, and he suggested we could rearrange
>the code so that system_heap_cc_shared_priv depends on a new Kconfig
>symbol like
>
>config DMABUF_HEAPS_CC_SYSTEM
>        bool "DMA-BUF System Heap for memory encryption"
>        depends on ARCH_HAS_MEM_ENCRYPT && DMABUF_HEAPS_SYSTEM=y
>
>This allows building both into the kernel, or leaving the encryption choice
>up to the consumers of the system heap.
>
>If this is agreeable to everyone, I can post Arnd's patch.

Sounds good to me. Thanks!


^ permalink raw reply

* Re: [PATCH V4 2/2] tools/perf: Use scnprintf in buffer offset calculations
From: Athira Rajeev @ 2026-06-09 12:14 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, jolsa, adrian.hunter, mpetlan, tmricht, maddy, irogers,
	linux-perf-users, linuxppc-dev, hbathini, Tejas.Manhas1,
	Tanushree.Shah, shivani
In-Reply-To: <aiH2BnKXmur79SSR@google.com>



> On 5 Jun 2026, at 3:32 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Mon, May 04, 2026 at 09:12:05PM +0530, Athira Rajeev wrote:
>> Replace snprintf with scnprintf in buffer offset calculations to
>> ensure the 'used' count will not exceed the "len".
>> 
>> The current logic in perf_pmu__for_each_event uses an unconditional
>> + 1 increment to buf_used to account for null terminators. This can
>> cause a stack buffer overflow in the subsequent scnprintf call.
>> When the local stack buffer buf (1024 bytes) is full, buf_used can
>> reach 1025. This causes the subsequent remaining space calculation
>> sizeof(buf) - buf_used to underflow.
>> 
>> Use sub_non_neg() to see if space actually existed, and only
>> increment the offset if remaining space is present.
>> 
>> Changes include:
>> - Use sub_non_neg to check if space exists
>> - Replacing snprintf with scnprintf to ensure the return value
>> reflects the actual bytes written into the buffer.
>> - Only increment buf_used by 1 if space exists
>> - If a parameterized event uses a built-in perf keyword for its
>> parameter name (eg, config=?), the lexer parses it as a predefined
>> term token, which sets term->config to NULL. Add check to use
>> parse_events__term_type_str() if term->config is NULL.
>> 
>> Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
>> ---
>> Changelog:
>> v2 -> v3:
>> - Split the scnprintf related changes in separate patch
>> - Handle the overflow issues and unconditional increment
>> wrapped around sub_non_neg addressing review comment from Sashiko
>> 
>> tools/perf/util/pmu.c | 46 ++++++++++++++++++++++++++++++++-----------
>> 1 file changed, 35 insertions(+), 11 deletions(-)
>> 
>> diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
>> index 0b8d58543f17..4b9ade1a4cf9 100644
>> --- a/tools/perf/util/pmu.c
>> +++ b/tools/perf/util/pmu.c
>> @@ -2129,15 +2129,19 @@ static char *format_alias(char *buf, int len, const struct perf_pmu *pmu,
>> pr_err("Failure to parse '%s' terms '%s': %d\n",
>> alias->name, alias->terms, ret);
>> parse_events_terms__exit(&terms);
>> - snprintf(buf, len, "%.*s/%s/", (int)pmu_name_len, pmu->name, alias->name);
>> + scnprintf(buf, len, "%.*s/%s/", (int)pmu_name_len, pmu->name, alias->name);
>> return buf;
>> }
>> - used = snprintf(buf, len, "%.*s/%s", (int)pmu_name_len, pmu->name, alias->name);
>> + used = scnprintf(buf, len, "%.*s/%s", (int)pmu_name_len, pmu->name, alias->name);
>> 
>> list_for_each_entry(term, &terms.terms, list) {
>> + const char *name = term->config;
>> +
>> + if (!name)
>> + name = parse_events__term_type_str(term->type_term);
>> if (term->type_val == PARSE_EVENTS__TERM_TYPE_STR)
>> - used += snprintf(buf + used, sub_non_neg(len, used),
>> - ",%s=%s", term->config,
>> + used += scnprintf(buf + used, sub_non_neg(len, used),
>> + ",%s=%s", name,
>> term->val.str);
>> }
>> parse_events_terms__exit(&terms);
>> @@ -2201,6 +2205,7 @@ int perf_pmu__for_each_event(struct perf_pmu *pmu, bool skip_duplicate_pmus,
>> int ret = 0;
>> struct hashmap_entry *entry;
>> size_t bkt;
>> + size_t size_rem, len;
>> 
>> if (perf_pmu__is_tracepoint(pmu))
>> return tp_pmu__for_each_event(pmu, state, cb);
>> @@ -2234,17 +2239,36 @@ int perf_pmu__for_each_event(struct perf_pmu *pmu, bool skip_duplicate_pmus,
>> }
>> buf_used = strlen(buf) + 1;
>> }
>> +
>> info.scale_unit = NULL;
>> if (strlen(event->unit) || event->scale != 1.0) {
>> - info.scale_unit = buf + buf_used;
>> - buf_used += snprintf(buf + buf_used, sizeof(buf) - buf_used,
>> - "%G%s", event->scale, event->unit) + 1;
>> + /* Check the remaining space */
>> + size_rem = sub_non_neg(sizeof(buf), buf_used);
>> +
>> + if (size_rem > 0) {
>> + info.scale_unit = buf + buf_used;
>> + len = scnprintf(buf + buf_used, size_rem, "%G%s",
>> + event->scale, event->unit);
>> + /*
>> +  * Increment buf_used by 1 only if
>> +  * it fits remaining space
>> +  */
>> + buf_used += min(len + 1, size_rem);
> 
> Hmm.. it seems scnprintf() cannot return a number greater than or equal
> to size_rem.  Can we just do like this?
> 
> buf_used += scnprintf(...) + 1;
> 
> Thanks,
> Namhyung

Sure

I will address this change in the next version

Thanks
Athira
> 
>> + }
>> }
>> info.desc = event->desc;
>> info.long_desc = event->long_desc;
>> - info.encoding_desc = buf + buf_used;
>> - buf_used += snprintf(buf + buf_used, sizeof(buf) - buf_used,
>> - "%.*s/%s/", (int)pmu_name_len, info.pmu_name, event->terms) + 1;
>> + info.encoding_desc = NULL;
>> +
>> + /* Check the remaining space */
>> + size_rem = sub_non_neg(sizeof(buf), buf_used);
>> + if (size_rem > 0) {
>> + info.encoding_desc = buf + buf_used;
>> + len = scnprintf(buf + buf_used, size_rem, "%.*s/%s/",
>> + (int)pmu_name_len, info.pmu_name, event->terms);
>> + buf_used += min(len + 1, size_rem);
>> + }
>> +
>> info.str = event->terms;
>> info.topic = event->topic;
>> info.deprecated = perf_pmu_alias__check_deprecated(pmu, event);
>> @@ -2254,7 +2278,7 @@ int perf_pmu__for_each_event(struct perf_pmu *pmu, bool skip_duplicate_pmus,
>> }
>> if (pmu->selectable) {
>> info.name = buf;
>> - snprintf(buf, sizeof(buf), "%s//", pmu->name);
>> + scnprintf(buf, sizeof(buf), "%s//", pmu->name);
>> info.alias = NULL;
>> info.scale_unit = NULL;
>> info.desc = NULL;
>> -- 
>> 2.47.3
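
The return-value difference discussed in this thread can be sketched in
userspace as follows. snprintf() returns the length the output would have
had without truncation, while scnprintf() returns the number of characters
actually stored, so it can never reach the size passed in. The names
demo_scnprintf() and demo_pack() are hypothetical and exist neither in
perf nor in the kernel; the clamping below is an assumption modeled on the
documented scnprintf() behaviour, not a copy of its implementation.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for scnprintf(): returns the number of
 * characters actually stored in buf, excluding the trailing NUL.
 */
static int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int ret;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	ret = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (ret < 0)
		return 0;
	/* On truncation, only size - 1 characters plus the NUL were stored. */
	return ret >= (int)size ? (int)(size - 1) : ret;
}

/*
 * Pack two NUL-terminated strings back to back into buf and return the
 * number of bytes used. Because demo_scnprintf() returns at most
 * size - 1, adding 1 for the terminator can never push used past size.
 */
static size_t demo_pack(char *buf, size_t size, const char *a, const char *b)
{
	size_t used = 0;

	if (used < size)
		used += demo_scnprintf(buf + used, size - used, "%s", a) + 1;
	if (used < size)
		used += demo_scnprintf(buf + used, size - used, "%s", b) + 1;

	return used;
}
```

With plain snprintf() in place of demo_scnprintf(), a truncated first
string would make used larger than size, and the unsigned subtraction
size - used would wrap around, which is the underflow described in the
commit message.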




^ permalink raw reply

* Re: [PATCH v6 02/20] dma-direct: swiotlb: handle swiotlb alloc/free outside __dma_direct_alloc_pages
From: Petr Tesarik @ 2026-06-09 12:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
	Michael Kelley
In-Reply-To: <20260604083959.1265923-3-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:41 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Move swiotlb allocation out of __dma_direct_alloc_pages() and handle it in
> dma_direct_alloc() / dma_direct_alloc_pages().
> 
> This is needed for follow-up changes that simplify the handling of
> memory encryption/decryption based on the DMA attribute flags.
> 
> swiotlb backing pages are already mapped decrypted by
> swiotlb_update_mem_attributes() and rmem_swiotlb_device_init(), so
> dma-direct should not call dma_set_decrypted() on allocation nor
> dma_set_encrypted() on free for swiotlb-backed memory.
> 
> Update alloc/free paths to detect swiotlb-backed pages and skip
> encrypt/decrypt transitions for those paths. Keep the existing highmem
> rejection in dma_direct_alloc_pages() for swiotlb allocations.
> 
> Only for "restricted-dma-pool", we currently set `for_alloc = true`, while
> rmem_swiotlb_device_init() decrypts the whole pool up front. This pool is
> typically used together with "shared-dma-pool", where the shared region is
> accessed after remap/ioremap and the returned address is suitable for
> decrypted memory access. So existing code paths remain valid.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  include/linux/swiotlb.h |  6 ++++
>  kernel/dma/direct.c     | 71 ++++++++++++++++++++++++++++++-----------
>  kernel/dma/swiotlb.c    |  6 ++++
>  3 files changed, 65 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 3dae0f592063..133bb8ca9032 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -284,6 +284,8 @@ extern void swiotlb_print_info(void);
>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>  struct page *swiotlb_alloc(struct device *dev, size_t size);
>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
> +void swiotlb_free_from_pool(struct device *dev, phys_addr_t tlb_addr,
> +		size_t size, struct io_tlb_pool *pool);
>  
>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>  {
> @@ -299,6 +301,10 @@ static inline bool swiotlb_free(struct device *dev, struct page *page,
>  {
>  	return false;
>  }
> +static inline void swiotlb_free_from_pool(struct device *dev, phys_addr_t tlb_addr,
> +		size_t size, struct io_tlb_pool *pool)
> +{
> +}
>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>  {
>  	return false;
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 583c5922bca2..a741c8a2ee66 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -96,14 +96,6 @@ static int dma_set_encrypted(struct device *dev, void *vaddr, size_t size)
>  	return ret;
>  }
>  
> -static void __dma_direct_free_pages(struct device *dev, struct page *page,
> -				    size_t size)
> -{
> -	if (swiotlb_free(dev, page, size))
> -		return;
> -	dma_free_contiguous(dev, page, size);
> -}
> -
>  static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
>  {
>  	struct page *page = swiotlb_alloc(dev, size);
> @@ -125,9 +117,6 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
>  
>  	WARN_ON_ONCE(!PAGE_ALIGNED(size));
>  
> -	if (is_swiotlb_for_alloc(dev))
> -		return dma_direct_alloc_swiotlb(dev, size);
> -
>  	gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
>  	page = dma_alloc_contiguous(dev, size, gfp);
>  	if (page) {
> @@ -204,6 +193,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>  {
>  	bool remap = false, set_uncached = false;
> +	bool mark_mem_decrypt = true;
>  	struct page *page;
>  	void *ret;
>  
> @@ -250,11 +240,21 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	    dma_direct_use_pool(dev, gfp))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>  
> +	if (is_swiotlb_for_alloc(dev)) {
> +		page = dma_direct_alloc_swiotlb(dev, size);
> +		if (page) {
> +			mark_mem_decrypt = false;
> +			goto setup_page;
> +		}
> +		return NULL;
> +	}
> +
>  	/* we always manually zero the memory once we are done */
>  	page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO, true);
>  	if (!page)
>  		return NULL;
>  
> +setup_page:
>  	/*
>  	 * dma_alloc_contiguous can return highmem pages depending on a
>  	 * combination the cma= arguments and per-arch setup.  These need to be
> @@ -281,7 +281,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  			goto out_free_pages;
>  	} else {
>  		ret = page_address(page);
> -		if (dma_set_decrypted(dev, ret, size))
> +		if (mark_mem_decrypt && dma_set_decrypted(dev, ret, size))
>  			goto out_leak_pages;
>  	}
>  
> @@ -298,10 +298,11 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	return ret;
>  
>  out_encrypt_pages:
> -	if (dma_set_encrypted(dev, page_address(page), size))
> +	if (mark_mem_decrypt && dma_set_encrypted(dev, page_address(page), size))
>  		return NULL;
>  out_free_pages:
> -	__dma_direct_free_pages(dev, page, size);
> +	if (!swiotlb_free(dev, page, size))
> +		dma_free_contiguous(dev, page, size);
>  	return NULL;
>  out_leak_pages:
>  	return NULL;
> @@ -310,6 +311,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  void dma_direct_free(struct device *dev, size_t size,
>  		void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
>  {
> +	phys_addr_t phys;
> +	bool mark_mem_encrypted = true;
> +	struct io_tlb_pool *swiotlb_pool;
>  	unsigned int page_order = get_order(size);
>  
>  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> @@ -338,16 +342,25 @@ void dma_direct_free(struct device *dev, size_t size,
>  	    dma_free_from_pool(dev, cpu_addr, PAGE_ALIGN(size)))
>  		return;
>  
> +	phys = dma_to_phys(dev, dma_addr);
> +	swiotlb_pool = swiotlb_find_pool(dev, phys);
> +	if (swiotlb_pool)
> +		/* Swiotlb doesn't need a page attribute update on free */
> +		mark_mem_encrypted = false;
> +
>  	if (is_vmalloc_addr(cpu_addr)) {
>  		vunmap(cpu_addr);
>  	} else {
>  		if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED))
>  			arch_dma_clear_uncached(cpu_addr, size);
> -		if (dma_set_encrypted(dev, cpu_addr, size))
> +		if (mark_mem_encrypted && dma_set_encrypted(dev, cpu_addr, size))
>  			return;
>  	}
>  
> -	__dma_direct_free_pages(dev, dma_direct_to_page(dev, dma_addr), size);
> +	if (swiotlb_pool)
> +		swiotlb_free_from_pool(dev, phys, size, swiotlb_pool);
> +	else
> +		dma_free_contiguous(dev, dma_direct_to_page(dev, dma_addr), size);
>  }
>  
>  struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> @@ -359,6 +372,15 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  	if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>  
> +	if (is_swiotlb_for_alloc(dev)) {
> +		page = dma_direct_alloc_swiotlb(dev, size);
> +		if (!page)
> +			return NULL;
> +
> +		ret = page_address(page);
> +		goto setup_page;
> +	}
> +
>  	page = __dma_direct_alloc_pages(dev, size, gfp, false);
>  	if (!page)
>  		return NULL;
> @@ -366,6 +388,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  	ret = page_address(page);
>  	if (dma_set_decrypted(dev, ret, size))
>  		goto out_leak_pages;
> +setup_page:
>  	memset(ret, 0, size);
>  	*dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
>  	return page;
> @@ -377,16 +400,28 @@ void dma_direct_free_pages(struct device *dev, size_t size,
>  		struct page *page, dma_addr_t dma_addr,
>  		enum dma_data_direction dir)
>  {
> +	phys_addr_t phys;
>  	void *vaddr = page_address(page);
> +	struct io_tlb_pool *swiotlb_pool;
> +	bool mark_mem_encrypted = true;
>  
>  	/* If cpu_addr is not from an atomic pool, dma_free_from_pool() fails */
>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>  	    dma_free_from_pool(dev, vaddr, size))
>  		return;
>  
> -	if (dma_set_encrypted(dev, vaddr, size))
> +	phys = page_to_phys(page);
> +	swiotlb_pool = swiotlb_find_pool(dev, phys);
> +	if (swiotlb_pool)
> +		mark_mem_encrypted = false;
> +
> +	if (mark_mem_encrypted && dma_set_encrypted(dev, vaddr, size))
>  		return;
> -	__dma_direct_free_pages(dev, page, size);
> +
> +	if (swiotlb_pool)
> +		swiotlb_free_from_pool(dev, phys, size, swiotlb_pool);
> +	else
> +		dma_free_contiguous(dev, page, size);
>  }
>  
>  #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 1abd3e6146f4..ac03a6856c2e 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1809,6 +1809,12 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
>  	return true;
>  }
>  
> +void swiotlb_free_from_pool(struct device *dev, phys_addr_t tlb_addr, size_t size,
> +		struct io_tlb_pool *pool)

What's the reason to pass the buffer size if it's not used?

Other than that, this patch looks good to me.

Petr T

> +{
> +	swiotlb_release_slots(dev, tlb_addr, pool);
> +}
> +
>  static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>  				    struct device *dev)
>  {



* Re: [PATCH v6 03/20] dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
From: Petr Tesarik @ 2026-06-09 12:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
	Michael Kelley
In-Reply-To: <20260604083959.1265923-4-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:42 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Propagate force_dma_unencrypted() into DMA_ATTR_CC_SHARED in the
> dma-direct allocation path and use the attribute to drive the related
> decisions.
> 
> This updates dma_direct_alloc(), dma_direct_free(), and
> dma_direct_alloc_pages() to fold the forced unencrypted case into attrs.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

Reviewed-by: Petr Tesarik <ptesarik@suse.com>

Petr T
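
For readers following the thread: the patch folds the old
"NO_KERNEL_MAPPING and not force_dma_unencrypted()" check into a single
masked comparison on attrs. A minimal standalone sketch of that idiom
follows. The EX_* names and bit values are hypothetical and exist only
for this illustration; they are not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define EX_ATTR_NO_KERNEL_MAPPING	(1UL << 4)
#define EX_ATTR_CC_SHARED		(1UL << 12)

/*
 * Returns true only when NO_KERNEL_MAPPING is set and CC_SHARED is
 * clear. Masking with both bits and comparing against one of them
 * tests "A set and B clear" in a single expression.
 */
static bool ex_wants_no_mapping(unsigned long attrs)
{
	unsigned long mask = EX_ATTR_NO_KERNEL_MAPPING | EX_ATTR_CC_SHARED;

	return (attrs & mask) == EX_ATTR_NO_KERNEL_MAPPING;
}
```

Unrelated bits in attrs do not affect the result, because they are
masked off before the comparison.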

> ---
>  kernel/dma/direct.c | 53 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index a741c8a2ee66..90dc5057a0c0 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -193,16 +193,31 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>  {
>  	bool remap = false, set_uncached = false;
> -	bool mark_mem_decrypt = true;
> +	bool mark_mem_decrypt = false;
>  	struct page *page;
>  	void *ret;
>  
> +	/*
> +	 * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
> +	 * attribute. The direct allocator uses it internally after it has
> +	 * decided that the backing pages must be shared/decrypted, so the
> +	 * rest of the allocation path can consistently select DMA addresses,
> +	 * choose compatible pools and restore encryption on free.
> +	 */
> +	if (attrs & DMA_ATTR_CC_SHARED)
> +		return NULL;
> +
> +	if (force_dma_unencrypted(dev)) {
> +		attrs |= DMA_ATTR_CC_SHARED;
> +		mark_mem_decrypt = true;
> +	}
> +
>  	size = PAGE_ALIGN(size);
>  	if (attrs & DMA_ATTR_NO_WARN)
>  		gfp |= __GFP_NOWARN;
>  
> -	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> -	    !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev))
> +	if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
> +	     DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev))
>  		return dma_direct_alloc_no_mapping(dev, size, dma_handle, gfp);
>  
>  	if (!dev_is_dma_coherent(dev)) {
> @@ -236,7 +251,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	 * Remapping or decrypting memory may block, allocate the memory from
>  	 * the atomic pools instead if we aren't allowed block.
>  	 */
> -	if ((remap || force_dma_unencrypted(dev)) &&
> +	if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
>  	    dma_direct_use_pool(dev, gfp))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>  
> @@ -312,12 +327,24 @@ void dma_direct_free(struct device *dev, size_t size,
>  		void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
>  {
>  	phys_addr_t phys;
> -	bool mark_mem_encrypted = true;
> +	bool mark_mem_encrypted = false;
>  	struct io_tlb_pool *swiotlb_pool;
>  	unsigned int page_order = get_order(size);
>  
> -	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> -	    !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev)) {
> +	/* see dma_direct_alloc() for details */
> +	WARN_ON(attrs & DMA_ATTR_CC_SHARED);
> +
> +	/*
> +	 * if the device had requested for an unencrypted buffer,
> +	 * convert it to encrypted on free
> +	 */
> +	if (force_dma_unencrypted(dev)) {
> +		attrs |= DMA_ATTR_CC_SHARED;
> +		mark_mem_encrypted = true;
> +	}
> +
> +	if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
> +	     DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev)) {
>  		/* cpu_addr is a struct page cookie, not a kernel address */
>  		dma_free_contiguous(dev, cpu_addr, size);
>  		return;
> @@ -366,10 +393,14 @@ void dma_direct_free(struct device *dev, size_t size,
>  struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
>  {
> +	unsigned long attrs = 0;
>  	struct page *page;
>  	void *ret;
>  
> -	if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
> +	if (force_dma_unencrypted(dev))
> +		attrs |= DMA_ATTR_CC_SHARED;
> +
> +	if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> @@ -403,7 +434,11 @@ void dma_direct_free_pages(struct device *dev, size_t size,
>  	phys_addr_t phys;
>  	void *vaddr = page_address(page);
>  	struct io_tlb_pool *swiotlb_pool;
> -	bool mark_mem_encrypted = true;
> +	/*
> +	 * if the device had requested for an unencrypted buffer,
> +	 * convert it to encrypted on free
> +	 */
> +	bool mark_mem_encrypted = force_dma_unencrypted(dev);
>  
>  	/* If cpu_addr is not from an atomic pool, dma_free_from_pool() fails */
>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&




* Re: [PATCH v6 05/20] dma: swiotlb: pass mapping attributes by reference
From: Petr Tesarik @ 2026-06-09 12:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Michael Kelley
In-Reply-To: <20260604083959.1265923-6-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:44 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Change swiotlb_tbl_map_single() to take the DMA mapping attributes by
> reference and update the direct callers accordingly.
> 
> This is a preparatory change for a follow-up patch which updates the
> attributes based on the selected swiotlb pool. Keeping the signature change
> separate makes the follow-up patch easier to review.
> 
> No functional change in this patch.
> 
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

Reviewed-by: Petr Tesarik <ptesarik@suse.com>

Thanks
Petr T
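
To illustrate why the by-reference signature matters: a callee that
receives a pointer can report the properties of the pool it picked back
to the caller. The sketch below assumes the follow-up sets or clears a
shared bit depending on the selected pool; that behaviour, and every
ex_* / EX_* name and value, is an assumption made for this example and
not taken from the series.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_ATTR_NO_WARN		(1UL << 8)	/* hypothetical value */
#define EX_ATTR_CC_SHARED	(1UL << 12)	/* hypothetical value */

/* Hypothetical stand-in for a bounce-buffer pool. */
struct ex_pool {
	bool unencrypted;
};

/*
 * Takes attrs by reference. On failure (no pool) attrs is left
 * untouched; on success the shared bit is updated to match the pool,
 * so the caller observes the change after the call returns.
 */
static int ex_map_single(const struct ex_pool *pool, unsigned long *attrs)
{
	if (!pool)
		return -1;

	if (pool->unencrypted)
		*attrs |= EX_ATTR_CC_SHARED;
	else
		*attrs &= ~EX_ATTR_CC_SHARED;

	return 0;
}
```

With a by-value parameter the same update would be lost when the
function returns, which is presumably why the signature change comes
first.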

> ---
>  drivers/iommu/dma-iommu.c | 2 +-
>  drivers/xen/swiotlb-xen.c | 2 +-
>  include/linux/swiotlb.h   | 2 +-
>  kernel/dma/swiotlb.c      | 6 +++---
>  4 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index c2595bee3d41..725c7adb0a8d 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1180,7 +1180,7 @@ static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
>  	trace_swiotlb_bounced(dev, phys, size);
>  
>  	phys = swiotlb_tbl_map_single(dev, phys, size, iova_mask(iovad), dir,
> -			attrs);
> +				      &attrs);
>  
>  	/*
>  	 * Untrusted devices should not see padding areas with random leftover
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 2cbf2b588f5b..8c4abe65cd49 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -243,7 +243,7 @@ static dma_addr_t xen_swiotlb_map_phys(struct device *dev, phys_addr_t phys,
>  	 */
>  	trace_swiotlb_bounced(dev, dev_addr, size);
>  
> -	map = swiotlb_tbl_map_single(dev, phys, size, 0, dir, attrs);
> +	map = swiotlb_tbl_map_single(dev, phys, size, 0, dir, &attrs);
>  	if (map == (phys_addr_t)DMA_MAPPING_ERROR)
>  		return DMA_MAPPING_ERROR;
>  
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 133bb8ca9032..29187cec90d8 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -238,7 +238,7 @@ static inline phys_addr_t default_swiotlb_limit(void)
>  
>  phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
>  		size_t mapping_size, unsigned int alloc_aligned_mask,
> -		enum dma_data_direction dir, unsigned long attrs);
> +		enum dma_data_direction dir, unsigned long *attrs);
>  dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
>  		size_t size, enum dma_data_direction dir, unsigned long attrs);
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index be4d418d92ac..78ce05857c00 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1391,7 +1391,7 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
>   */
>  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>  		size_t mapping_size, unsigned int alloc_align_mask,
> -		enum dma_data_direction dir, unsigned long attrs)
> +		enum dma_data_direction dir, unsigned long *attrs)
>  {
>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	unsigned int offset;
> @@ -1425,7 +1425,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>  	size = ALIGN(mapping_size + offset, alloc_align_mask + 1);
>  	index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool);
>  	if (index == -1) {
> -		if (!(attrs & DMA_ATTR_NO_WARN))
> +		if (!(*attrs & DMA_ATTR_NO_WARN))
>  			dev_warn_ratelimited(dev,
>  	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
>  				 size, mem->nslabs, mem_used(mem));
> @@ -1604,7 +1604,7 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>  
>  	trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
>  
> -	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
> +	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, &attrs);
>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
>  		return DMA_MAPPING_ERROR;
>  




* Re: [PATCH v6 04/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Petr Tesarik @ 2026-06-09 12:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
	Michael Kelley
In-Reply-To: <20260604083959.1265923-5-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:43 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Teach the atomic DMA pool code to distinguish between encrypted and
> unencrypted pools, and make pool allocation select the matching pool based
> on DMA attributes.
> 
> Introduce a dma_gen_pool wrapper that records whether a pool is
> unencrypted, initialize that state when the atomic pools are created, and
> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
> to take attrs and skip pools whose encrypted state does not match
> DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
> 
> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path so
> decrypted swiotlb allocations are taken from the correct atomic pool.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

FWIW this also looks good to me, but I don't think I'm the best person
to review changes to DMA generic pools.

Petr T
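
The selection rule in dma_alloc_from_pool() boils down to "skip any
pool whose encryption state does not match the request". A simplified
standalone sketch of that rule follows. The ex_* types and the
EX_ATTR_CC_SHARED value are hypothetical, and the real code walks the
pools through dma_guess_pool() rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_ATTR_CC_SHARED	(1UL << 12)	/* hypothetical value */

/* Hypothetical stand-in for struct dma_gen_pool. */
struct ex_dma_pool {
	bool present;		/* models a non-NULL ->pool */
	bool unencrypted;
};

/*
 * Returns the first existing pool whose encryption state matches the
 * request, or NULL if none matches.
 */
static struct ex_dma_pool *ex_pick_pool(struct ex_dma_pool *pools, size_t n,
					unsigned long attrs)
{
	bool want_shared = !!(attrs & EX_ATTR_CC_SHARED);
	size_t i;

	for (i = 0; i < n; i++) {
		if (!pools[i].present)
			continue;
		if (pools[i].unencrypted != want_shared)
			continue;
		return &pools[i];
	}

	return NULL;
}
```

The `!!` normalises the masked value to 0 or 1 so that it can be
compared directly against the bool field.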

> ---
>  drivers/iommu/dma-iommu.c   |   2 +-
>  include/linux/dma-map-ops.h |   2 +-
>  kernel/dma/direct.c         |  11 ++-
>  kernel/dma/pool.c           | 167 +++++++++++++++++++++++-------------
>  kernel/dma/swiotlb.c        |   7 +-
>  5 files changed, 123 insertions(+), 66 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 54d96e847f16..c2595bee3d41 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1673,7 +1673,7 @@ void *iommu_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>  	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
>  	    !gfpflags_allow_blocking(gfp) && !coherent)
>  		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> -					       gfp, NULL);
> +					   gfp, attrs, NULL);
>  	else
>  		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
>  	if (!cpu_addr)
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 6a1832a73cad..696b2c3a2305 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -212,7 +212,7 @@ void *dma_common_pages_remap(struct page **pages, size_t size, pgprot_t prot,
>  void dma_common_free_remap(void *cpu_addr, size_t size);
>  
>  struct page *dma_alloc_from_pool(struct device *dev, size_t size,
> -		void **cpu_addr, gfp_t flags,
> +		void **cpu_addr, gfp_t flags, unsigned long attrs,
>  		bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t));
>  bool dma_free_from_pool(struct device *dev, void *start, size_t size);
>  
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 90dc5057a0c0..681f16a984ab 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -154,7 +154,7 @@ static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
>  }
>  
>  static void *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> -		dma_addr_t *dma_handle, gfp_t gfp)
> +		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>  {
>  	struct page *page;
>  	u64 phys_limit;
> @@ -164,7 +164,8 @@ static void *dma_direct_alloc_from_pool(struct device *dev, size_t size,
>  		return NULL;
>  
>  	gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
> -	page = dma_alloc_from_pool(dev, size, &ret, gfp, dma_coherent_ok);
> +	page = dma_alloc_from_pool(dev, size, &ret, gfp, attrs,
> +				   dma_coherent_ok);
>  	if (!page)
>  		return NULL;
>  	*dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
> @@ -253,7 +254,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	 */
>  	if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
>  	    dma_direct_use_pool(dev, gfp))
> -		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
> +		return dma_direct_alloc_from_pool(dev, size, dma_handle,
> +						  gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
>  		page = dma_direct_alloc_swiotlb(dev, size);
> @@ -401,7 +403,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  		attrs |= DMA_ATTR_CC_SHARED;
>  
>  	if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
> -		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
> +		return dma_direct_alloc_from_pool(dev, size, dma_handle,
> +						  gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
>  		page = dma_direct_alloc_swiotlb(dev, size);
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index 2b2fbb709242..be78474a6c49 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -12,12 +12,18 @@
>  #include <linux/set_memory.h>
>  #include <linux/slab.h>
>  #include <linux/workqueue.h>
> +#include <linux/cc_platform.h>
>  
> -static struct gen_pool *atomic_pool_dma __ro_after_init;
> +struct dma_gen_pool {
> +	bool unencrypted;
> +	struct gen_pool *pool;
> +};
> +
> +static struct dma_gen_pool atomic_pool_dma __ro_after_init;
>  static unsigned long pool_size_dma;
> -static struct gen_pool *atomic_pool_dma32 __ro_after_init;
> +static struct dma_gen_pool atomic_pool_dma32 __ro_after_init;
>  static unsigned long pool_size_dma32;
> -static struct gen_pool *atomic_pool_kernel __ro_after_init;
> +static struct dma_gen_pool atomic_pool_kernel __ro_after_init;
>  static unsigned long pool_size_kernel;
>  
>  /* Size can be defined by the coherent_pool command line */
> @@ -76,11 +82,12 @@ static bool cma_in_zone(gfp_t gfp)
>  	return true;
>  }
>  
> -static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
> +static int atomic_pool_expand(struct dma_gen_pool *dma_pool, size_t pool_size,
>  			      gfp_t gfp)
>  {
>  	unsigned int order;
>  	struct page *page = NULL;
> +	bool leak_pages = false;
>  	void *addr;
>  	int ret = -ENOMEM;
>  
> @@ -113,12 +120,17 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
>  	 * Memory in the atomic DMA pools must be unencrypted, the pools do not
>  	 * shrink so no re-encryption occurs in dma_direct_free().
>  	 */
> -	ret = set_memory_decrypted((unsigned long)page_to_virt(page),
> -				   1 << order);
> -	if (ret)
> -		goto remove_mapping;
> -	ret = gen_pool_add_virt(pool, (unsigned long)addr, page_to_phys(page),
> -				pool_size, NUMA_NO_NODE);
> +	if (dma_pool->unencrypted) {
> +		ret = set_memory_decrypted((unsigned long)page_to_virt(page),
> +					   1 << order);
> +		if (ret) {
> +			leak_pages = true;
> +			goto remove_mapping;
> +		}
> +	}
> +
> +	ret = gen_pool_add_virt(dma_pool->pool, (unsigned long)addr,
> +				page_to_phys(page), pool_size, NUMA_NO_NODE);
>  	if (ret)
>  		goto encrypt_mapping;
>  
> @@ -126,62 +138,67 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
>  	return 0;
>  
>  encrypt_mapping:
> -	ret = set_memory_encrypted((unsigned long)page_to_virt(page),
> -				   1 << order);
> -	if (WARN_ON_ONCE(ret)) {
> -		/* Decrypt succeeded but encrypt failed, purposely leak */
> -		goto out;
> -	}
> +	if (dma_pool->unencrypted &&
> +	    set_memory_encrypted((unsigned long)page_to_virt(page), 1 << order))
> +		leak_pages = true;
> +
>  remove_mapping:
>  #ifdef CONFIG_DMA_DIRECT_REMAP
>  	dma_common_free_remap(addr, pool_size);
>  free_page:
> -	__free_pages(page, order);
> +	if (!leak_pages)
> +		__free_pages(page, order);
>  #endif
>  out:
>  	return ret;
>  }
>  
> -static void atomic_pool_resize(struct gen_pool *pool, gfp_t gfp)
> +static void atomic_pool_resize(struct dma_gen_pool *dma_pool, gfp_t gfp)
>  {
> -	if (pool && gen_pool_avail(pool) < atomic_pool_size)
> -		atomic_pool_expand(pool, gen_pool_size(pool), gfp);
> +	if (dma_pool->pool && gen_pool_avail(dma_pool->pool) < atomic_pool_size)
> +		atomic_pool_expand(dma_pool, gen_pool_size(dma_pool->pool), gfp);
>  }
>  
>  static void atomic_pool_work_fn(struct work_struct *work)
>  {
>  	if (IS_ENABLED(CONFIG_ZONE_DMA))
> -		atomic_pool_resize(atomic_pool_dma,
> +		atomic_pool_resize(&atomic_pool_dma,
>  				   GFP_KERNEL | GFP_DMA);
>  	if (IS_ENABLED(CONFIG_ZONE_DMA32))
> -		atomic_pool_resize(atomic_pool_dma32,
> +		atomic_pool_resize(&atomic_pool_dma32,
>  				   GFP_KERNEL | GFP_DMA32);
> -	atomic_pool_resize(atomic_pool_kernel, GFP_KERNEL);
> +	atomic_pool_resize(&atomic_pool_kernel, GFP_KERNEL);
>  }
>  
> -static __init struct gen_pool *__dma_atomic_pool_init(size_t pool_size,
> -						      gfp_t gfp)
> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
> +		size_t pool_size, gfp_t gfp)
>  {
> -	struct gen_pool *pool;
>  	int ret;
>  
> -	pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> -	if (!pool)
> +	dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> +	if (!dma_pool->pool)
>  		return NULL;
>  
> -	gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
> +	gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
> +
> +	/* if platform is using memory encryption atomic pools are by default decrypted. */
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> +		dma_pool->unencrypted = true;
> +	else
> +		dma_pool->unencrypted = false;
>  
> -	ret = atomic_pool_expand(pool, pool_size, gfp);
> +	ret = atomic_pool_expand(dma_pool, pool_size, gfp);
>  	if (ret) {
> -		gen_pool_destroy(pool);
> +		gen_pool_destroy(dma_pool->pool);
> +		dma_pool->pool = NULL;
>  		pr_err("DMA: failed to allocate %zu KiB %pGg pool for atomic allocation\n",
>  		       pool_size >> 10, &gfp);
>  		return NULL;
>  	}
>  
>  	pr_info("DMA: preallocated %zu KiB %pGg pool for atomic allocations\n",
> -		gen_pool_size(pool) >> 10, &gfp);
> -	return pool;
> +		gen_pool_size(dma_pool->pool) >> 10, &gfp);
> +	return dma_pool;
>  }
>  
>  #ifdef CONFIG_ZONE_DMA32
> @@ -207,21 +224,22 @@ static int __init dma_atomic_pool_init(void)
>  
>  	/* All memory might be in the DMA zone(s) to begin with */
>  	if (has_managed_zone(ZONE_NORMAL)) {
> -		atomic_pool_kernel = __dma_atomic_pool_init(atomic_pool_size,
> -						    GFP_KERNEL);
> -		if (!atomic_pool_kernel)
> +		__dma_atomic_pool_init(&atomic_pool_kernel, atomic_pool_size, GFP_KERNEL);
> +		if (!atomic_pool_kernel.pool)
>  			ret = -ENOMEM;
>  	}
> +
>  	if (has_managed_dma()) {
> -		atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
> -						GFP_KERNEL | GFP_DMA);
> -		if (!atomic_pool_dma)
> +		__dma_atomic_pool_init(&atomic_pool_dma, atomic_pool_size,
> +				       GFP_KERNEL | GFP_DMA);
> +		if (!atomic_pool_dma.pool)
>  			ret = -ENOMEM;
>  	}
> +
>  	if (has_managed_dma32) {
> -		atomic_pool_dma32 = __dma_atomic_pool_init(atomic_pool_size,
> -						GFP_KERNEL | GFP_DMA32);
> -		if (!atomic_pool_dma32)
> +		__dma_atomic_pool_init(&atomic_pool_dma32, atomic_pool_size,
> +				       GFP_KERNEL | GFP_DMA32);
> +		if (!atomic_pool_dma32.pool)
>  			ret = -ENOMEM;
>  	}
>  
> @@ -230,19 +248,44 @@ static int __init dma_atomic_pool_init(void)
>  }
>  postcore_initcall(dma_atomic_pool_init);
>  
> -static inline struct gen_pool *dma_guess_pool(struct gen_pool *prev, gfp_t gfp)
> +static inline struct dma_gen_pool *__dma_guess_pool(struct dma_gen_pool *first,
> +		struct dma_gen_pool *second, struct dma_gen_pool *third)
>  {
> -	if (prev == NULL) {
> +	if (first->pool)
> +		return first;
> +	if (second && second->pool)
> +		return second;
> +	if (third && third->pool)
> +		return third;
> +	return NULL;
> +}
> +
> +static inline struct dma_gen_pool *dma_guess_pool(struct dma_gen_pool *prev,
> +		gfp_t gfp)
> +{
> +	if (!prev) {
>  		if (gfp & GFP_DMA)
> -			return atomic_pool_dma ?: atomic_pool_dma32 ?: atomic_pool_kernel;
> +			return __dma_guess_pool(&atomic_pool_dma,
> +						&atomic_pool_dma32,
> +						&atomic_pool_kernel);
> +
>  		if (gfp & GFP_DMA32)
> -			return atomic_pool_dma32 ?: atomic_pool_dma ?: atomic_pool_kernel;
> -		return atomic_pool_kernel ?: atomic_pool_dma32 ?: atomic_pool_dma;
> +			return __dma_guess_pool(&atomic_pool_dma32,
> +						&atomic_pool_dma,
> +						&atomic_pool_kernel);
> +
> +		return __dma_guess_pool(&atomic_pool_kernel,
> +					&atomic_pool_dma32,
> +					&atomic_pool_dma);
>  	}
> -	if (prev == atomic_pool_kernel)
> -		return atomic_pool_dma32 ? atomic_pool_dma32 : atomic_pool_dma;
> -	if (prev == atomic_pool_dma32)
> -		return atomic_pool_dma;
> +
> +	if (prev == &atomic_pool_kernel)
> +		return __dma_guess_pool(&atomic_pool_dma32,
> +					&atomic_pool_dma, NULL);
> +
> +	if (prev == &atomic_pool_dma32)
> +		return __dma_guess_pool(&atomic_pool_dma, NULL, NULL);
> +
>  	return NULL;
>  }
>  
> @@ -272,16 +315,20 @@ static struct page *__dma_alloc_from_pool(struct device *dev, size_t size,
>  }
>  
>  struct page *dma_alloc_from_pool(struct device *dev, size_t size,
> -		void **cpu_addr, gfp_t gfp,
> +		void **cpu_addr, gfp_t gfp, unsigned long attrs,
>  		bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t))
>  {
> -	struct gen_pool *pool = NULL;
> +	struct dma_gen_pool *dma_pool = NULL;
>  	struct page *page;
>  	bool pool_found = false;
>  
> -	while ((pool = dma_guess_pool(pool, gfp))) {
> +	while ((dma_pool = dma_guess_pool(dma_pool, gfp))) {
> +
> +		if (dma_pool->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> +			continue;
> +
>  		pool_found = true;
> -		page = __dma_alloc_from_pool(dev, size, pool, cpu_addr,
> +		page = __dma_alloc_from_pool(dev, size, dma_pool->pool, cpu_addr,
>  					     phys_addr_ok);
>  		if (page)
>  			return page;
> @@ -296,12 +343,14 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,
>  
>  bool dma_free_from_pool(struct device *dev, void *start, size_t size)
>  {
> -	struct gen_pool *pool = NULL;
> +	struct dma_gen_pool *dma_pool = NULL;
> +
> +	while ((dma_pool = dma_guess_pool(dma_pool, 0))) {
>  
> -	while ((pool = dma_guess_pool(pool, 0))) {
> -		if (!gen_pool_has_addr(pool, (unsigned long)start, size))
> +		if (!gen_pool_has_addr(dma_pool->pool, (unsigned long)start, size))
>  			continue;
> -		gen_pool_free(pool, (unsigned long)start, size);
> +
> +		gen_pool_free(dma_pool->pool, (unsigned long)start, size);
>  		return true;
>  	}
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index ac03a6856c2e..be4d418d92ac 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -612,6 +612,7 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>  		u64 phys_limit, gfp_t gfp)
>  {
>  	struct page *page;
> +	unsigned long attrs = 0;
>  
>  	/*
>  	 * Allocate from the atomic pools if memory is encrypted and
> @@ -623,8 +624,12 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
>  			return NULL;
>  
> +		/* swiotlb considered decrypted by default */
> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> +			attrs = DMA_ATTR_CC_SHARED;
> +
>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
> -					   dma_coherent_ok);
> +					   attrs, dma_coherent_ok);
>  	}
>  
>  	gfp &= ~GFP_ZONEMASK;




* Re: [PATCH] powerpc: Export set_memory_encrypted and set_memory_decrypted
From: Maxime Ripard @ 2026-06-09 12:31 UTC (permalink / raw)
  To: Sumit Semwal
  Cc: Jason Gunthorpe, Jiri Pirko, Christoph Hellwig, T.J. Mercier,
	maddy, mpe, npiggin, chleroy, linuxppc-dev, lkp, linux-kernel,
	iommu, linux-mm, agordeev, gerald.schaefer, linux-s390,
	Dan Williams, Tom Lendacky, x86, Arnd Bergmann
In-Reply-To: <CAO_48GH3NP09U6TdB5drbKY0TpwvtBXwrf=Jajsr5ttNbC_u9g@mail.gmail.com>


On Mon, Jun 08, 2026 at 08:47:15PM +0530, Sumit Semwal wrote:
> Hi Jason,
> 
> On Thu, 4 Jun 2026 at 19:27, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Jun 04, 2026 at 12:51:49PM +0530, Sumit Semwal wrote:
> >
> > > Given that Christoph's objection is not really about the modules part,
> > > but that the set_memory_{encrypted,decrypted} should not be used here,
> > > one option is to revert 78b30c50a7ac until that issue is sorted out?
> >
> > Please no, we have stuff already using this so it would be a
> > functional regression. Revert making heaps into a module since that
> > doesn't have a functional regression.
> 
> Thanks for your comments.
> 
> To me, it looks like while system and system_cc_shared heaps share a
> lot of code, their user bases have different needs. It's apparent that
> system_cc_heap users don't care about it being a module while system
> heap users would very much like so.
> 
> I also discussed this with Arnd, and he suggested we could rearrange
> the code so that system_heap_cc_shared_priv depends on a new Kconfig
> symbol like
> 
> config DMABUF_HEAPS_CC_SYSTEM
>         bool "DMA-BUF System Heap for memory encryption"
>         depends on ARCH_HAS_MEM_ENCRYPT && DMABUF_HEAPS_SYSTEM=y
> 
> This allows building both into the kernel or leave encryption choice
> up to the consumers of the system heap.
> 
> If this is agreeable to everyone, I can post Arnd's patch.

It would be the perfect compromise, thanks!
Maxime



* Re: [PATCH 35/60] kvm: Add VCPU plane-scheduling state and helpers
From: Jörg Rödel @ 2026-06-09 12:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Tom Lendacky, ashish.kalra, michael.roth,
	nsaenz, anelkz, James.Bottomley, Melody Wang, kvm, linux-kernel,
	kvmarm, loongarch, linux-mips, linuxppc-dev, kvm-riscv, x86,
	coconut-svsm, joerg.roedel
In-Reply-To: <CABgObfbUsDeStnZF-7oyR-W-Bvd4qTMoeUwGizgn10UTdKtZ2A@mail.gmail.com>

On Mon, Jun 08, 2026 at 07:58:43PM +0200, Paolo Bonzini wrote:
> Related to this, let me know if you want me to pick up again the
> common part, especially with Sashiko being hard at work on it.

Yeah, that might be good, let me think a bit about it and discuss it in
tomorrow's PUCK call.

> The idea of the userspace scheduling was that you're not forced to use
> it - the kernel can always choose to override it if it's using an
> accelerated implementation of planes (and of plane switching). But it
> also leaves some leeway to different accelerated implementations, each
> of which can pick their own algorithm.
> 
> Conceptually I'd rather keep the possibility of userspace scheduling.
> But maybe it doesn't add much.

My preference is to keep plane scheduling in one place (in the kernel) to keep
it simple. But if you see a need for user-mode to interact there as well (which
only really works for VSM), then I can add it.

I read a bit more about VSM and it seems their prioritization of VTLs is a bit
more complicated. VTL0 has the least privileges but boots first, then sets up
VTL1. But VTL1 is only higher-privileged once it is locked by VTL0. Another way
to look at it is that VTL0 de-prioritizes itself.

The patches here are built around the assumption that plane0 is the highest
privileged one and is always runnable. Running any lower-privilege plane must
be triggered by the guest. This is clearly not sufficient for VSM, the question
is how to solve that.

The answer depends on how IRQ delivery affects VTL scheduling in VSM. If a
VM has VTL0 (currently running), VTL1, and VTL2 (highest privilege), and an IRQ
becomes pending for VTL1, does Hyper-V schedule VTL1 directly or does it switch
to VTL2 first to let it schedule VTL1?

If VSM switches to VTL2 first, then planes could just use a marker for the
highest-privilege plane (which can be non-zero). Otherwise the solution is
likely to make the direction in which the vcpu->common->vcpus[] array is
traversed configurable.
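
The two options can be sketched as a small userspace model. This is only an
illustration, not KVM code: struct plane_set, its fields, and both
pick_plane_*() helpers are hypothetical names, and the real series operates on
vcpu->common->vcpus[] instead.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PLANES 4

/* Hypothetical model of per-VM plane state; not a structure from the series. */
struct plane_set {
	bool runnable[MAX_PLANES];	/* runnable[i]: plane i has work to do */
	size_t nr_planes;		/* number of valid entries in runnable[] */
	size_t top;			/* option 1: index of the highest-privilege plane */
	bool reverse;			/* option 2: scan from the last index downwards */
};

/*
 * Option 1: always enter the marked highest-privilege plane and let it
 * decide which lower-privilege plane runs next. Returns -1 if the marker
 * is out of range.
 */
static int pick_plane_marker(const struct plane_set *ps)
{
	return ps->top < ps->nr_planes ? (int)ps->top : -1;
}

/*
 * Option 2: return the first runnable plane found when scanning in the
 * configured direction, or -1 if no plane is runnable.
 */
static int pick_plane_scan(const struct plane_set *ps)
{
	for (size_t n = 0; n < ps->nr_planes; n++) {
		size_t i = ps->reverse ? ps->nr_planes - 1 - n : n;

		if (ps->runnable[i])
			return (int)i;
	}
	return -1;
}
```

In this model the marker variant needs one extra field per VM, while the scan
variant keeps the existing loop and only flips its direction.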

-Joerg



^ permalink raw reply

* Re: [PATCH v6 06/20] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Petr Tesarik @ 2026-06-09 12:48 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
	Michael Kelley
In-Reply-To: <20260604083959.1265923-7-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:45 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Teach swiotlb to distinguish between encrypted and decrypted bounce
> buffer pools, and make allocation and mapping paths select a pool whose
> state matches the requested DMA attributes.
> 
> Add a unencrypted flag to io_tlb_mem, initialize it for the default and
> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
> allocation. Reject swiotlb alloc/map requests when the selected pool does
> not match the required encrypted/decrypted state.
> 
> Also return DMA addresses with the matching phys_to_dma_{encrypted,
> unencrypted} helper so the DMA address encoding stays consistent with the
> chosen pool.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  include/linux/dma-direct.h |  10 +++
>  include/linux/swiotlb.h    |   8 +-
>  kernel/dma/direct.c        |  13 +++-
>  kernel/dma/swiotlb.c       | 154 ++++++++++++++++++++++++++++---------
>  4 files changed, 142 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index c249912456f9..94fad4e7c11e 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
>  #ifndef phys_to_dma_unencrypted
>  #define phys_to_dma_unencrypted		phys_to_dma
>  #endif
> +
> +#ifndef phys_to_dma_encrypted
> +#define phys_to_dma_encrypted		phys_to_dma
> +#endif
>  #else
>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>  {
> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
>  {
>  	return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
>  }
> +
> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
> +		phys_addr_t paddr)
> +{
> +	return dma_addr_encrypted(__phys_to_dma(dev, paddr));
> +}
>  /*
>   * If memory encryption is supported, phys_to_dma will set the memory encryption
>   * bit in the DMA address, and dma_to_phys will clear it.
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 29187cec90d8..4dcbf3931be1 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -81,6 +81,7 @@ struct io_tlb_pool {
>  	struct list_head node;
>  	struct rcu_head rcu;
>  	bool transient;
> +	bool unencrypted;

IIUC this is a copy of the unencrypted member in the corresponding
struct io_tlb_mem. In other words, if pools are allocated dynamically,
all pools must have the same encryption state, correct?

>  #endif
>  };
>  
> @@ -111,6 +112,7 @@ struct io_tlb_mem {
>  	struct dentry *debugfs;
>  	bool force_bounce;
>  	bool for_alloc;
> +	bool unencrypted;
>  #ifdef CONFIG_SWIOTLB_DYNAMIC
>  	bool can_grow;
>  	u64 phys_limit;
> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
>  extern void swiotlb_print_info(void);
>  
>  #ifdef CONFIG_DMA_RESTRICTED_POOL
> -struct page *swiotlb_alloc(struct device *dev, size_t size);
> +struct page *swiotlb_alloc(struct device *dev, size_t size,
> +		unsigned long attrs);
>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>  void swiotlb_free_from_pool(struct device *dev, phys_addr_t tlb_addr,
>  		size_t size, struct io_tlb_pool *pool);
> @@ -292,7 +295,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>  	return dev->dma_io_tlb_mem->for_alloc;
>  }
>  #else
> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
> +		unsigned long attrs)
>  {
>  	return NULL;
>  }
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 681f16a984ab..0b4a26c6b6fd 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -96,9 +96,10 @@ static int dma_set_encrypted(struct device *dev, void *vaddr, size_t size)
>  	return ret;
>  }
>  
> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
> +		unsigned long attrs)
>  {
> -	struct page *page = swiotlb_alloc(dev, size);
> +	struct page *page = swiotlb_alloc(dev, size, attrs);
>  
>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
>  		swiotlb_free(dev, page, size);
> @@ -258,8 +259,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  						  gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> -		page = dma_direct_alloc_swiotlb(dev, size);
> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>  		if (page) {
> +			/*
> +			 * swiotlb allocations comes from pool already marked
> +			 * decrypted
> +			 */
>  			mark_mem_decrypt = false;
>  			goto setup_page;
>  		}
> @@ -407,7 +412,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  						  gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> -		page = dma_direct_alloc_swiotlb(dev, size);
> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>  		if (!page)
>  			return NULL;
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 78ce05857c00..2bf3981db35d 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
>  	struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
>  	unsigned long bytes;
>  
> +	/*
> +	 * if platform support memory encryption, swiotlb buffers are
> +	 * decrypted by default.
> +	 */
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> +		io_tlb_default_mem.unencrypted = true;
> +	else
> +		io_tlb_default_mem.unencrypted = false;
> +
>  	if (!mem->nslabs || mem->late_alloc)
>  		return;
>  	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
> -	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
> +
> +	if (io_tlb_default_mem.unencrypted)
> +		set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>  }
>  
>  static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>  	if (!mem->slots)
>  		goto error_slots;
>  
> -	set_memory_decrypted((unsigned long)vstart,
> -			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> +	if (io_tlb_default_mem.unencrypted)
> +		set_memory_decrypted((unsigned long)vstart,
> +				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> +
>  	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
>  				 nareas);
>  	add_mem_pool(&io_tlb_default_mem, mem);
> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
>  	tbl_size = PAGE_ALIGN(mem->end - mem->start);
>  	slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>  
> -	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> +	if (io_tlb_default_mem.unencrypted)
> +		set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> +
>  	if (mem->late_alloc) {
>  		area_order = get_order(array_size(sizeof(*mem->areas),
>  			mem->nareas));
> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
>   * @gfp:	GFP flags for the allocation.
>   * @bytes:	Size of the buffer.
>   * @phys_limit:	Maximum allowed physical address of the buffer.
> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
>   *
>   * Allocate pages from the buddy allocator. If successful, make the allocated
>   * pages decrypted that they can be used for DMA.
> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
>   * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
>   * if the allocated physical address was above @phys_limit.
>   */
> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
> +		u64 phys_limit, bool unencrypted)
>  {
>  	unsigned int order = get_order(bytes);
>  	struct page *page;
> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>  	}
>  
>  	vaddr = phys_to_virt(paddr);
> -	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
> +	if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>  		goto error;
>  	return page;
>  
>  error:
>  	/* Intentional leak if pages cannot be encrypted again. */
> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> +	if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>  		__free_pages(page, order);
>  	return NULL;
>  }
> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>   * @dev:	Device for which a memory pool is allocated.
>   * @bytes:	Size of the buffer.
>   * @phys_limit:	Maximum allowed physical address of the buffer.
> + * @attrs:	DMA attributes for the allocation.
>   * @gfp:	GFP flags for the allocation.
>   *
>   * Return: Allocated pages, or %NULL on allocation failure.
>   */
>  static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> -		u64 phys_limit, gfp_t gfp)
> +		u64 phys_limit, unsigned long attrs, gfp_t gfp)

If my assumption above is correct, then I prefer to add a struct
io_tlb_mem *mem parameter here and calculate the allocation attributes
inside this function, so you don't have to repeat it in the callers.
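
A rough userspace sketch of that suggestion, assuming all pools of one
io_tlb_mem share its encryption state. The reduced struct, the flag value, and
the helper name tlb_alloc_attrs() are stand-ins for illustration, not the
kernel definitions:

```c
#include <stdbool.h>

/* Stand-in value for illustration; the real flag is defined by the series. */
#define DMA_ATTR_CC_SHARED (1UL << 0)

/* Reduced stand-in for struct io_tlb_mem with only the field used here. */
struct io_tlb_mem {
	bool unencrypted;
};

/*
 * Hypothetical helper: derive the allocation attributes from the pool
 * owner in one place, instead of repeating the ternary in every caller
 * of swiotlb_alloc_pool().
 */
static unsigned long tlb_alloc_attrs(const struct io_tlb_mem *mem)
{
	return mem->unencrypted ? DMA_ATTR_CC_SHARED : 0;
}
```

With something like this inside swiotlb_alloc_tlb(), callers would pass mem
and would no longer need to build attrs themselves.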

>  {
>  	struct page *page;
> -	unsigned long attrs = 0;
>  
>  	/*
>  	 * Allocate from the atomic pools if memory is encrypted and
>  	 * the allocation is atomic, because decrypting may block.
>  	 */
> -	if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
> +	if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {

You're removing the check that dev is non-NULL. This is fine, because
the only call with dev == NULL is from swiotlb_dyn_alloc(), and that one
uses GFP_KERNEL (i.e. allows blocking). However, if this is an intended
optimization, I'd rather have it in a separate commit, with this
explanation of why it's OK to do it.
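
The argument can be checked with a small userspace model of the two
conditions. The reduced struct device, the can_block parameter (standing in
for gfpflags_allow_blocking(gfp)), the flag value, and both helper names are
assumptions made for this sketch only:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in value for illustration; the real flag is defined by the series. */
#define DMA_ATTR_CC_SHARED (1UL << 0)

/* Reduced stand-in for struct device: only models force_dma_unencrypted(). */
struct device {
	bool force_unencrypted;
};

/* Old condition: atomic context, non-NULL device, device needs unencrypted DMA. */
static bool use_atomic_pool_old(bool can_block, const struct device *dev)
{
	return !can_block && dev && dev->force_unencrypted;
}

/* New condition: atomic context and the shared attribute is requested. */
static bool use_atomic_pool_new(bool can_block, unsigned long attrs)
{
	return !can_block && (attrs & DMA_ATTR_CC_SHARED);
}
```

Whenever can_block is true, as in the GFP_KERNEL call from
swiotlb_dyn_alloc(), both helpers return false regardless of dev or attrs, so
the NULL check cannot change the outcome on that path.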

The rest of the patch looks good to me.

Petr T

>  		void *vaddr;
>  
>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
>  			return NULL;
>  
> -		/* swiotlb considered decrypted by default */
> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> -			attrs = DMA_ATTR_CC_SHARED;
> -
>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
>  					   attrs, dma_coherent_ok);
>  	}
> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>  	else if (phys_limit <= DMA_BIT_MASK(32))
>  		gfp |= __GFP_DMA32;
>  
> -	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
> +	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
> +					     !!(attrs & DMA_ATTR_CC_SHARED)))) {
>  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
>  		    phys_limit < DMA_BIT_MASK(64) &&
>  		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>   * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
>   * @vaddr:	Virtual address of the buffer.
>   * @bytes:	Size of the buffer.
> + * @unencrypted: true if @vaddr was allocated decrypted and must be
> + *	re-encrypted before being freed
>   */
> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
>  {
>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>  	    dma_free_from_pool(NULL, vaddr, bytes))
>  		return;
>  
>  	/* Intentional leak if pages cannot be encrypted again. */
> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> +	if (!unencrypted ||
> +	    !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>  		__free_pages(virt_to_page(vaddr), get_order(bytes));
>  }
>  
> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>   * @nslabs:	Desired (maximum) number of slabs.
>   * @nareas:	Number of areas.
>   * @phys_limit:	Maximum DMA buffer physical address.
> + * @attrs:	DMA attributes for the allocation.
>   * @gfp:	GFP flags for the allocations.
>   *
>   * Allocate and initialize a new IO TLB memory pool. The actual number of
> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>   */
>  static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>  		unsigned long minslabs, unsigned long nslabs,
> -		unsigned int nareas, u64 phys_limit, gfp_t gfp)
> +		unsigned int nareas, u64 phys_limit,
> +		unsigned long attrs, gfp_t gfp)
>  {
>  	struct io_tlb_pool *pool;
>  	unsigned int slot_order;
> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>  	if (!pool)
>  		goto error;
>  	pool->areas = (void *)pool + sizeof(*pool);
> +	pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>  
>  	tlb_size = nslabs << IO_TLB_SHIFT;
> -	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
> +	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
>  		if (nslabs <= minslabs)
>  			goto error_tlb;
>  		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>  	return pool;
>  
>  error_slots:
> -	swiotlb_free_tlb(page_address(tlb), tlb_size);
> +	swiotlb_free_tlb(page_address(tlb), tlb_size,
> +			 !!(attrs & DMA_ATTR_CC_SHARED));
>  error_tlb:
>  	kfree(pool);
>  error:
> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
>  	struct io_tlb_pool *pool;
>  
>  	pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
> -				  default_nareas, mem->phys_limit, GFP_KERNEL);
> +				  default_nareas, mem->phys_limit,
> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
> +				  GFP_KERNEL);
>  	if (!pool) {
>  		pr_warn_ratelimited("Failed to allocate new pool");
>  		return;
> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
>  	size_t tlb_size = pool->end - pool->start;
>  
>  	free_pages((unsigned long)pool->slots, get_order(slots_size));
> -	swiotlb_free_tlb(pool->vaddr, tlb_size);
> +	swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
>  	kfree(pool);
>  }
>  
> @@ -1037,13 +1060,11 @@ static void dec_transient_used(struct io_tlb_mem *mem, unsigned int nslots)
>   * Return: Index of the first allocated slot, or -1 on error.
>   */
>  static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool,
> -		int area_index, phys_addr_t orig_addr, size_t alloc_size,
> -		unsigned int alloc_align_mask)
> +		int area_index, phys_addr_t orig_addr, dma_addr_t tbl_dma_addr,
> +		size_t alloc_size, unsigned int alloc_align_mask)
>  {
>  	struct io_tlb_area *area = pool->areas + area_index;
>  	unsigned long boundary_mask = dma_get_seg_boundary(dev);
> -	dma_addr_t tbl_dma_addr =
> -		phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
>  	unsigned long max_slots = get_max_slots(boundary_mask);
>  	unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
>  	unsigned int nslots = nr_slots(alloc_size), stride;
> @@ -1056,6 +1077,8 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
>  	BUG_ON(!nslots);
>  	BUG_ON(area_index >= pool->nareas);
>  
> +	tbl_dma_addr &= boundary_mask;
> +
>  	/*
>  	 * Historically, swiotlb allocations >= PAGE_SIZE were guaranteed to be
>  	 * page-aligned in the absence of any other alignment requirements.
> @@ -1167,6 +1190,7 @@ static int swiotlb_search_area(struct device *dev, int start_cpu,
>  {
>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	struct io_tlb_pool *pool;
> +	dma_addr_t tbl_dma_addr;
>  	int area_index;
>  	int index = -1;
>  
> @@ -1175,9 +1199,15 @@ static int swiotlb_search_area(struct device *dev, int start_cpu,
>  		if (cpu_offset >= pool->nareas)
>  			continue;
>  		area_index = (start_cpu + cpu_offset) & (pool->nareas - 1);
> +
> +		if (mem->unencrypted)
> +			tbl_dma_addr = phys_to_dma_unencrypted(dev, pool->start);
> +		else
> +			tbl_dma_addr = phys_to_dma_encrypted(dev, pool->start);
> +
>  		index = swiotlb_search_pool_area(dev, pool, area_index,
> -						 orig_addr, alloc_size,
> -						 alloc_align_mask);
> +						 orig_addr, tbl_dma_addr,
> +						 alloc_size, alloc_align_mask);
>  		if (index >= 0) {
>  			*retpool = pool;
>  			break;
> @@ -1207,6 +1237,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  {
>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	struct io_tlb_pool *pool;
> +	dma_addr_t tbl_dma_addr;
>  	unsigned long nslabs;
>  	unsigned long flags;
>  	u64 phys_limit;
> @@ -1232,11 +1263,17 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  	nslabs = nr_slots(alloc_size);
>  	phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
>  	pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>  				  GFP_NOWAIT);
>  	if (!pool)
>  		return -1;
>  
> -	index = swiotlb_search_pool_area(dev, pool, 0, orig_addr,
> +	if (mem->unencrypted)
> +		tbl_dma_addr = phys_to_dma_unencrypted(dev, pool->start);
> +	else
> +		tbl_dma_addr = phys_to_dma_encrypted(dev, pool->start);
> +
> +	index = swiotlb_search_pool_area(dev, pool, 0, orig_addr, tbl_dma_addr,
>  					 alloc_size, alloc_align_mask);
>  	if (index < 0) {
>  		swiotlb_dyn_free(&pool->rcu);
> @@ -1281,15 +1318,23 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  		size_t alloc_size, unsigned int alloc_align_mask,
>  		struct io_tlb_pool **retpool)
>  {
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	struct io_tlb_pool *pool;
> +	dma_addr_t tbl_dma_addr;
>  	int start, i;
>  	int index;
>  
> -	*retpool = pool = &dev->dma_io_tlb_mem->defpool;
> +	*retpool = pool = &mem->defpool;
> +	if (mem->unencrypted)
> +		tbl_dma_addr = phys_to_dma_unencrypted(dev, pool->start);
> +	else
> +		tbl_dma_addr = phys_to_dma_encrypted(dev, pool->start);
> +
>  	i = start = raw_smp_processor_id() & (pool->nareas - 1);
>  	do {
>  		index = swiotlb_search_pool_area(dev, pool, i, orig_addr,
> -						 alloc_size, alloc_align_mask);
> +						 tbl_dma_addr, alloc_size,
> +						 alloc_align_mask);
>  		if (index >= 0)
>  			return index;
>  		if (++i >= pool->nareas)
> @@ -1372,9 +1417,19 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
>   *			any pre- or post-padding for alignment
>   * @alloc_align_mask:	Required start and end alignment of the allocated buffer
>   * @dir:		DMA direction
> - * @attrs:		Optional DMA attributes for the map operation
> + * @attrs:		Optional DMA attributes for the map operation, updated
> + *			to match the selected SWIOTLB pool
>   *
>   * Find and allocate a suitable sequence of IO TLB slots for the request.
> + * The device's SWIOTLB pool must match the device's current DMA encryption
> + * requirements. If the device requires decrypted DMA, bouncing is done through
> + * an unencrypted pool and the mapping is marked shared. If the device can DMA
> + * to encrypted memory, bouncing is done through an encrypted pool even when the
> + * original DMA address was unencrypted. Enabling encrypted DMA for a device is
> + * therefore expected to update its default io_tlb_mem to an encrypted pool, so
> + * later bounce mappings for both encrypted and decrypted original memory use
> + * that encrypted pool.
> + *
>   * The allocated space starts at an alignment specified by alloc_align_mask,
>   * and the size of the allocated space is rounded up so that the total amount
>   * of allocated space is a multiple of (alloc_align_mask + 1). If
> @@ -1411,6 +1466,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>  	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>  		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>  
> +	/* swiotlb pool is incorrect for this device */
> +	if (unlikely(mem->unencrypted != force_dma_unencrypted(dev)))
> +		return (phys_addr_t)DMA_MAPPING_ERROR;
> +
> +	/* Force attrs to match the kind of memory in the pool */
> +	if (mem->unencrypted)
> +		*attrs |= DMA_ATTR_CC_SHARED;
> +	else
> +		*attrs &= ~DMA_ATTR_CC_SHARED;
> +
>  	/*
>  	 * The default swiotlb memory pool is allocated with PAGE_SIZE
>  	 * alignment. If a mapping is requested with larger alignment,
> @@ -1608,8 +1673,11 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
>  		return DMA_MAPPING_ERROR;
>  
> -	/* Ensure that the address returned is DMA'ble */
> -	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
> +	if (attrs & DMA_ATTR_CC_SHARED)
> +		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
> +	else
> +		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
> +
>  	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
> @@ -1773,7 +1841,7 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>  
>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>  
> -struct page *swiotlb_alloc(struct device *dev, size_t size)
> +struct page *swiotlb_alloc(struct device *dev, size_t size, unsigned long attrs)
>  {
>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	struct io_tlb_pool *pool;
> @@ -1784,6 +1852,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>  	if (!mem)
>  		return NULL;
>  
> +	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> +		return NULL;
> +
>  	align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
>  	index = swiotlb_find_slots(dev, 0, size, align, &pool);
>  	if (index == -1)
> @@ -1859,9 +1930,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>  			kfree(mem);
>  			return -ENOMEM;
>  		}
> +		/*
> +		 * if platform supports memory encryption,
> +		 * restricted mem pool is decrypted by default
> +		 */
> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> +			mem->unencrypted = true;
> +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> +					     rmem->size >> PAGE_SHIFT);
> +		} else {
> +			mem->unencrypted = false;
> +		}
>  
> -		set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> -				     rmem->size >> PAGE_SHIFT);
>  		swiotlb_init_io_tlb_pool(pool, rmem->base, nslabs,
>  					 false, nareas);
>  		mem->force_bounce = true;



^ permalink raw reply

* Re: [PATCH v6 08/20] dma-direct: pass attrs to dma_capable() for DMA_ATTR_CC_SHARED checks
From: Petr Tesarik @ 2026-06-09 12:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
	Michael Kelley
In-Reply-To: <20260604083959.1265923-9-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:47 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Teach dma_capable() about DMA_ATTR_CC_SHARED so the capability
> check can reject encrypted DMA addresses for devices that require
> unencrypted/shared DMA.
> 
> Also propagate DMA_ATTR_CC_SHARED in swiotlb_map() when the selected
> SWIOTLB pool is decrypted so the capability check sees the correct DMA
> address attribute.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

Reviewed-by: Petr Tesarik <ptesarik@suse.com>

Petr T

> ---
>  arch/x86/kernel/amd_gart_64.c | 30 ++++++++++++++++--------------
>  drivers/xen/swiotlb-xen.c     |  6 +++---
>  include/linux/dma-direct.h    | 10 +++++++++-
>  kernel/dma/direct.h           |  6 +++---
>  kernel/dma/swiotlb.c          |  2 +-
>  5 files changed, 32 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> index e8000a56732e..b5f1f031d45b 100644
> --- a/arch/x86/kernel/amd_gart_64.c
> +++ b/arch/x86/kernel/amd_gart_64.c
> @@ -180,22 +180,23 @@ static void iommu_full(struct device *dev, size_t size, int dir)
>  }
>  
>  static inline int
> -need_iommu(struct device *dev, unsigned long addr, size_t size)
> +need_iommu(struct device *dev, unsigned long addr, size_t size, unsigned long attrs)
>  {
> -	return force_iommu || !dma_capable(dev, addr, size, true);
> +	return force_iommu || !dma_capable(dev, addr, size, true, attrs);
>  }
>  
>  static inline int
> -nonforced_iommu(struct device *dev, unsigned long addr, size_t size)
> +nonforced_iommu(struct device *dev, unsigned long addr, size_t size,
> +		unsigned long attrs)
>  {
> -	return !dma_capable(dev, addr, size, true);
> +	return !dma_capable(dev, addr, size, true, attrs);
>  }
>  
>  /* Map a single continuous physical area into the IOMMU.
>   * Caller needs to check if the iommu is needed and flush.
>   */
>  static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
> -				size_t size, int dir, unsigned long align_mask)
> +		size_t size, int dir, unsigned long align_mask, unsigned long attrs)
>  {
>  	unsigned long npages = iommu_num_pages(phys_mem, size, PAGE_SIZE);
>  	unsigned long iommu_page;
> @@ -206,7 +207,7 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
>  
>  	iommu_page = alloc_iommu(dev, npages, align_mask);
>  	if (iommu_page == -1) {
> -		if (!nonforced_iommu(dev, phys_mem, size))
> +		if (!nonforced_iommu(dev, phys_mem, size, attrs))
>  			return phys_mem;
>  		if (panic_on_overflow)
>  			panic("dma_map_area overflow %lu bytes\n", size);
> @@ -231,10 +232,10 @@ static dma_addr_t gart_map_phys(struct device *dev, phys_addr_t paddr,
>  	if (unlikely(attrs & DMA_ATTR_MMIO))
>  		return DMA_MAPPING_ERROR;
>  
> -	if (!need_iommu(dev, paddr, size))
> +	if (!need_iommu(dev, paddr, size, attrs))
>  		return paddr;
>  
> -	bus = dma_map_area(dev, paddr, size, dir, 0);
> +	bus = dma_map_area(dev, paddr, size, dir, 0, attrs);
>  	flush_gart();
>  
>  	return bus;
> @@ -289,7 +290,7 @@ static void gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  
>  /* Fallback for dma_map_sg in case of overflow */
>  static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg,
> -			       int nents, int dir)
> +		int nents, int dir, unsigned long attrs)
>  {
>  	struct scatterlist *s;
>  	int i;
> @@ -301,8 +302,8 @@ static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg,
>  	for_each_sg(sg, s, nents, i) {
>  		unsigned long addr = sg_phys(s);
>  
> -		if (nonforced_iommu(dev, addr, s->length)) {
> -			addr = dma_map_area(dev, addr, s->length, dir, 0);
> +		if (nonforced_iommu(dev, addr, s->length, attrs)) {
> +			addr = dma_map_area(dev, addr, s->length, dir, 0, attrs);
>  			if (addr == DMA_MAPPING_ERROR) {
>  				if (i > 0)
>  					gart_unmap_sg(dev, sg, i, dir, 0);
> @@ -401,7 +402,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  		s->dma_address = addr;
>  		BUG_ON(s->length == 0);
>  
> -		nextneed = need_iommu(dev, addr, s->length);
> +		nextneed = need_iommu(dev, addr, s->length, attrs);
>  
>  		/* Handle the previous not yet processed entries */
>  		if (i > start) {
> @@ -449,7 +450,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  
>  	/* When it was forced or merged try again in a dumb way */
>  	if (force_iommu || iommu_merge) {
> -		out = dma_map_sg_nonforce(dev, sg, nents, dir);
> +		out = dma_map_sg_nonforce(dev, sg, nents, dir, attrs);
>  		if (out > 0)
>  			return out;
>  	}
> @@ -473,7 +474,8 @@ gart_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_addr,
>  		return vaddr;
>  
>  	*dma_addr = dma_map_area(dev, virt_to_phys(vaddr), size,
> -			DMA_BIDIRECTIONAL, (1UL << get_order(size)) - 1);
> +				 DMA_BIDIRECTIONAL,
> +				 (1UL << get_order(size)) - 1, attrs);
>  	flush_gart();
>  	if (unlikely(*dma_addr == DMA_MAPPING_ERROR))
>  		goto out_free;
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 8c4abe65cd49..e2538824ef52 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -212,7 +212,7 @@ static dma_addr_t xen_swiotlb_map_phys(struct device *dev, phys_addr_t phys,
>  	BUG_ON(dir == DMA_NONE);
>  
>  	if (attrs & DMA_ATTR_MMIO) {
> -		if (unlikely(!dma_capable(dev, phys, size, false))) {
> +		if (unlikely(!dma_capable(dev, phys, size, false, attrs))) {
>  			dev_err_once(
>  				dev,
>  				"DMA addr %pa+%zu overflow (mask %llx, bus limit %llx).\n",
> @@ -231,7 +231,7 @@ static dma_addr_t xen_swiotlb_map_phys(struct device *dev, phys_addr_t phys,
>  	 * we can safely return the device addr and not worry about bounce
>  	 * buffering it.
>  	 */
> -	if (dma_capable(dev, dev_addr, size, true) &&
> +	if (dma_capable(dev, dev_addr, size, true, attrs) &&
>  	    !dma_kmalloc_needs_bounce(dev, size, dir) &&
>  	    !range_straddles_page_boundary(phys, size) &&
>  		!xen_arch_need_swiotlb(dev, phys, dev_addr) &&
> @@ -253,7 +253,7 @@ static dma_addr_t xen_swiotlb_map_phys(struct device *dev, phys_addr_t phys,
>  	/*
>  	 * Ensure that the address returned is DMA'ble
>  	 */
> -	if (unlikely(!dma_capable(dev, dev_addr, size, true))) {
> +	if (unlikely(!dma_capable(dev, dev_addr, size, true, attrs))) {
>  		__swiotlb_tbl_unmap_single(dev, map, size, dir,
>  				attrs | DMA_ATTR_SKIP_CPU_SYNC,
>  				swiotlb_find_pool(dev, map));
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index 94fad4e7c11e..daa31a1adf7b 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -135,12 +135,20 @@ static inline bool force_dma_unencrypted(struct device *dev)
>  #endif /* CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>  
>  static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size,
> -		bool is_ram)
> +		bool is_ram, unsigned long attrs)
>  {
>  	dma_addr_t end = addr + size - 1;
>  
>  	if (addr == DMA_MAPPING_ERROR)
>  		return false;
> +	/*
> +	 * The DMA address was derived from encrypted RAM, but this device
> +	 * requires unencrypted DMA addresses. Treat it as not DMA-capable
> +	 * so the caller can fall back to a suitable SWIOTLB pool.
> +	 */
> +	if (!(attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
> +		return false;
> +
>  	if (is_ram && !IS_ENABLED(CONFIG_ARCH_DMA_ADDR_T_64BIT) &&
>  	    min(addr, end) < phys_to_dma(dev, PFN_PHYS(min_low_pfn)))
>  		return false;
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index 7140c208c123..e05dc7649366 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -101,15 +101,15 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>  
>  	if (attrs & DMA_ATTR_MMIO) {
>  		dma_addr = phys;
> -		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
> +		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
>  			goto err_overflow;
>  	} else if (attrs & DMA_ATTR_CC_SHARED) {
>  		dma_addr = phys_to_dma_unencrypted(dev, phys);
> -		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
> +		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
>  			goto err_overflow;
>  	} else {
>  		dma_addr = phys_to_dma(dev, phys);
> -		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
> +		if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)) ||
>  		    dma_kmalloc_needs_bounce(dev, size, dir)) {
>  			if (is_swiotlb_active(dev) &&
>  			    !(attrs & DMA_ATTR_REQUIRE_COHERENT))
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 2bf3981db35d..f4e8b241a1c4 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1678,7 +1678,7 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>  	else
>  		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
>  
> -	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
> +	if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs))) {
>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
>  			swiotlb_find_pool(dev, swiotlb_addr));
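
The check added to dma_capable() in the hunk above can be modelled in
userspace. This is only a sketch: the names model_device,
model_dma_capable(), MODEL_ATTR_CC_SHARED and MODEL_DMA_MAPPING_ERROR are
hypothetical stand-ins, the flag value does not match the kernel's, and
the final range test is simplified to a single mask comparison (the
is_ram lower-bound check is left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the values do not match the kernel's. */
#define MODEL_DMA_MAPPING_ERROR	(~(uint64_t)0)
#define MODEL_ATTR_CC_SHARED	(1UL << 0)

struct model_device {
	bool force_unencrypted;	/* models force_dma_unencrypted(dev) */
	uint64_t dma_mask;	/* highest address the device can reach */
};

/* Userspace model of the patched check, not the kernel function itself. */
bool model_dma_capable(const struct model_device *dev, uint64_t addr,
		       size_t size, unsigned long attrs)
{
	uint64_t end;

	assert(size > 0);
	end = addr + size - 1;

	if (addr == MODEL_DMA_MAPPING_ERROR)
		return false;

	/*
	 * New in the quoted hunk: an address that was not set up as shared
	 * is rejected for a device that requires unencrypted DMA, so the
	 * caller can fall back to bounce buffering.
	 */
	if (!(attrs & MODEL_ATTR_CC_SHARED) && dev->force_unencrypted)
		return false;

	/* Simplified range test: no wrap-around, and within the mask. */
	return end >= addr && end <= dev->dma_mask;
}
```

With this model, a device that forces unencrypted DMA passes only when the
caller sets the shared attribute, while an ordinary device is limited by
its mask alone.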




* Re: [PATCH v6 13/20] dma-direct: rename ret to cpu_addr in alloc helpers
From: Petr Tesarik @ 2026-06-09 12:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Michael Kelley
In-Reply-To: <20260604083959.1265923-14-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:52 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> ret in dma_direct_alloc() and dma_direct_alloc_pages() holds the returned
> CPU mapping, not a generic return value. Rename it to cpu_addr and update
> the remaining uses to match.
> 
> This makes the allocation paths easier to follow and keeps the local naming
> consistent with what the variable actually represents.
> 
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

I wondered if cpu_addr is descriptive enough (a CPU address could
theoretically be virtual or physical), but I can see that a few other
places already use cpu_addr to hold virtual addresses, so yeah, let's
keep this name.

Reviewed-by: Petr Tesarik <ptesarik@suse.com>

Petr T

> ---
>  kernel/dma/direct.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index aa3489aa10a0..4e446aa4130e 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -204,7 +204,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	bool mark_mem_decrypt = false;
>  	bool allow_highmem = true;
>  	struct page *page;
> -	void *ret;
> +	void *cpu_addr;
>  
>  	/*
>  	 * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
> @@ -318,34 +318,33 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  		arch_dma_prep_coherent(page, size);
>  
>  		/* create a coherent mapping */
> -		ret = dma_common_contiguous_remap(page, size, prot,
> -				__builtin_return_address(0));
> -		if (!ret)
> +		cpu_addr = dma_common_contiguous_remap(page, size, prot,
> +					__builtin_return_address(0));
> +		if (!cpu_addr)
>  			goto out_encrypt_pages;
>  	} else {
> -		ret = page_address(page);
> +		cpu_addr = page_address(page);
>  	}
>  
> -	memset(ret, 0, size);
> +	memset(cpu_addr, 0, size);
>  
>  	if (set_uncached) {
>  		void *uncached_cpu_addr;
>  
>  		arch_dma_prep_coherent(page, size);
> -		uncached_cpu_addr = arch_dma_set_uncached(ret, size);
> +		uncached_cpu_addr = arch_dma_set_uncached(cpu_addr, size);
>  		if (IS_ERR(uncached_cpu_addr))
>  			goto out_free_remap_pages;
> -		ret = uncached_cpu_addr;
> +		cpu_addr = uncached_cpu_addr;
>  	}
>  
>  	*dma_handle = phys_to_dma_direct(dev, page_to_phys(page),
>  					 !!(attrs & DMA_ATTR_CC_SHARED));
> -	return ret;
> -
> +	return cpu_addr;
>  
>  out_free_remap_pages:
>  	if (remap)
> -		dma_common_free_remap(ret, size);
> +		dma_common_free_remap(cpu_addr, size);
>  
>  out_encrypt_pages:
>  	if (mark_mem_decrypt &&
> @@ -439,7 +438,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  {
>  	unsigned long attrs = 0;
>  	struct page *page;
> -	void *ret;
> +	void *cpu_addr;
>  
>  	if (force_dma_unencrypted(dev))
>  		attrs |= DMA_ATTR_CC_SHARED;
> @@ -453,7 +452,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  		if (!page)
>  			return NULL;
>  
> -		ret = page_address(page);
> +		cpu_addr = page_address(page);
>  		goto setup_page;
>  	}
>  
> @@ -461,11 +460,11 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  	if (!page)
>  		return NULL;
>  
> -	ret = page_address(page);
> -	if ((attrs & DMA_ATTR_CC_SHARED) && dma_set_decrypted(dev, ret, size))
> +	cpu_addr = page_address(page);
> +	if ((attrs & DMA_ATTR_CC_SHARED) && dma_set_decrypted(dev, cpu_addr, size))
>  		goto out_leak_pages;
>  setup_page:
> -	memset(ret, 0, size);
> +	memset(cpu_addr, 0, size);
>  	*dma_handle = phys_to_dma_direct(dev, page_to_phys(page),
>  					 !!(attrs & DMA_ATTR_CC_SHARED));
>  	return page;




* Re: [PATCH 35/60] kvm: Add VCPU plane-scheduling state and helpers
From: James Bottomley @ 2026-06-09 12:59 UTC (permalink / raw)
  To: Jörg Rödel, Paolo Bonzini
  Cc: Sean Christopherson, Tom Lendacky, ashish.kalra, michael.roth,
	nsaenz, anelkz, Melody Wang, kvm, linux-kernel, kvmarm, loongarch,
	linux-mips, linuxppc-dev, kvm-riscv, x86, coconut-svsm,
	joerg.roedel
In-Reply-To: <aigE2EvJyZlYDz0V@8bytes.org>

On Tue, 2026-06-09 at 14:37 +0200, Jörg Rödel wrote:
> On Mon, Jun 08, 2026 at 07:58:43PM +0200, Paolo Bonzini wrote:
> > Related to this, let me know if you want me to pick up again the
> > common part, especially with Sashiko being hard at work on it.
> 
> Yeah, that might be good, let me think a bit about it and discuss in
> tomorrow's PUCK call.

Are the details of this anywhere?  The last PUCK information I saw on
the kvm list was the cancellation of the March and April calls.

Regards,

James




* Re: [RFC PATCH 0/4] perf: Add perf.data tracepoint events to trace.dat conversion
From: Tanushree Shah @ 2026-06-09 13:09 UTC (permalink / raw)
  To: Ian Rogers
  Cc: acme, jolsa, adrian.hunter, vmolnaro, mpetlan, tmricht, maddy,
	namhyung, linux-kernel, linux-perf-users, linuxppc-dev, atrajeev,
	hbathini, Tejas.Manhas1, Tanushree.Shah, Shivani.Nittor
In-Reply-To: <CAP-5=fWbaEB=8niBhrUz6TLB2dgWv=kjrH8-Hn_PAmpUwqQThg@mail.gmail.com>

Hello Ian,

Thanks for the review. I will fix the typo in the next version.

I agree that shell test coverage similar to the existing converter tests 
would be useful, and I plan to include that in the next revision.

I will also go through the review comments from Sashiko, validate them, 
and address the necessary fixes in the next version.

Thanks,
Tanushree Shah

On 08/06/26 20:48, Ian Rogers wrote:
> On Mon, Jun 8, 2026 at 6:00 AM Tanushree Shah <tshah@linux.ibm.com> wrote:
>>
>> This RFC patch series introduces support for converting perf.data files
>> containing tracepoint events into trace.dat format, enabling seamless
>> visualization and analysis using KerneShark.
> 
> Thanks for doing this, this is a useful feature!
> 
> nit: typo KernelShark
> 
>>
>> ======================
>> Background and Motivation
>> ======================
>>
>> Currently, perf and trace-cmd operate as separate tracing ecosystems with
>> incompatible data formats. Users who collect tracepoint data with
>> 'perf record' cannot easily visualize it in KernelShark's graphical
>> timeline view or leverage trace-cmd's analysis capabilities.
>>
>> This creates workflow friction when users need to:
>>
>> - Visualize perf tracepoint data in KernelShark's interactive graphical
>>    timeline
>> - Share trace data between perf and trace-cmd workflows and toolchains
>> - Perform architecture-independent conversion and analysis of traces
>>
>> This conversion bridge eliminates these barriers by enabling seamless
>> data exchange between perf and trace-cmd ecosystems, allowing users to
>> choose the best tool for each analysis phase.
>>
>> ======================
>> Implementation Overview
>> ======================
>>
>> The series implements the trace.dat file format specification (version 7)
>> within perf's data conversion framework.
>>
>> **Patch 1/4: Core trace.dat Export Infrastructure**
>> Introduces util/trace-dat.c and util/trace-dat.h implementing:
>> - Per-CPU raw event buffer management (init, collect, free)
>> - Ftrace ring buffer page construction
>> - trace.dat section writers (strings, options, flyrecord sections)
>>
>> **Patch 2/4: Metadata Integration**
>> Extends util/trace-event-read.c to write trace.dat metadata during
>> perf.data parsing:
>> - Initial format header (magic, version, endian, page size, compression)
>> - Section 16: HEADER INFO (header_page + header_event)
>> - Section 17: FTRACE EVENT FORMATS
>> - Section 18: EVENT FORMATS (per system/event format files)
>> - Section 19: KALLSYMS
>> - Section 21: CMDLINES
>> - Section 15: STRINGS (written last after all sections)
>>
>> **Patch 3/4: Conversion Backend**
>> Implements util/data-convert-trace.c with the trace_convert__perf2dat()
>> function:
>> - Processes PERF_TYPE_TRACEPOINT samples via process_sample_event()
>> - Collects raw event data per-CPU using trace_dat__collect_cpu_event()
>> - Writes OPTIONS sections (CPUCOUNT, TRACECLOCK, metadata offsets)
>> - Writes FLYRECORD section with per-CPU ring buffer pages
>>
>> **Patch 4/4: User Interface**
>> Extends tools/perf/builtin-data.c with --to-trace-dat option:
>> - Adds command-line option for trace.dat output
>> - Mutually exclusive with --to-ctf and --to-json
>> - Calls trace_convert__perf2dat() to perform conversion
>>
>> ======================
>> Current Implementation Details
>> ======================
>>
>> **trace.dat Format Version:**
>> The implementation currently targets trace.dat format version 7, which
>> is the stable version supported by current trace-cmd releases (v3.x).
>> This version is hardcoded to ensure compatibility with existing
>> trace-cmd and KernelShark installations. Future enhancements could add
>> version negotiation or support for newer format versions as they become
>> standardized.
>>
>> **Compression Strategy:**
>> Compression is explicitly disabled (set to NONE) in the generated
>> trace.dat files. This design choice:
>> - Simplifies the initial implementation and testing
>> - Ensures maximum compatibility across trace-cmd versions
>> - Avoids external compression library dependencies
>>
>> Future work could add support for various compression algorithms (zlib,
>> zstd, lz4) with runtime selection via command-line options, significantly
>> reducing file sizes for large traces.
>>
>> ======================
>> Usage Example
>> ======================
>>
>> ```bash
>> # Record tracepoint events with perf
>> perf record -e sched:sched_switch -e sched:sched_wakeup -a sleep 10
>>
>> # Convert to trace.dat format
>> perf data convert --to-trace-dat=output.dat
>>
>> # Verify trace.dat structure
>> trace-cmd dump --summary output.dat
>>
>> # Analyze with trace-cmd
>> trace-cmd report output.dat
>>
>> # Visualize in KernelShark
>> kernelshark output.dat
>> ```
>>
>> **Conversion Output:**
>> ```
>> [ perf data convert: Converted 'perf.data' into trace.dat format
>> 'output.dat' ]
>> [ perf data convert: Converted 2684 events ]
>> ```
>> **trace-cmd dump --summary Output:**
>> ```
>>   Tracing meta data in file output.dat:
>>          [Initial format]
>>                  7       [Version]
>>                  0       [Little endian]
>>                  8       [Bytes in a long]
>>                  65536   [Page size, bytes]
>>                  none    [Compression algorithm]
>>                          [Compression version]
>>          [buffer "", "local" clock, 65536 page size, 16 cpus, 1048576 bytes
>>      flyrecord data]
>>          [10 options]
>>          [Saved command lines, 0 bytes]
>>          [Kallsyms, 0 bytes]
>>          [Ftrace format, 0 events]
>>          [Header page, 206 bytes]
>>          [Header event, 205 bytes]
>>          [Events format, 1 systems]
>>          [9 sections]
>> ```
>> ======================
>> Testing and Verification
>> ======================
>>
>> The series has been extensively tested with:
>> - Various tracepoint events (sched, irq, syscalls, block I/O)
>> - Mixed recordings containing both tracepoint and non-tracepoint events
>>    (only tracepoints converted)
>> - Verification with trace-cmd report and KernelShark visualization
>> - Memory leak testing with Valgrind (0 bytes leaked)
>> - Cross-architecture testing (x86_64, ppc64le)
> 
> It seems that some of this could be a test to give coverage of the
> feature. We have similar tests for other converters:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/test_perf_data_converter_ctf.sh?h=perf-tools-next
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/test_perf_data_converter_json.sh?h=perf-tools-next
> 
> I think Sashiko has caught some coding issues, so I'll hold off on a
> full review until the churn from Sashiko subsides.
> 
> Thanks!
> Ian
> 
>> All generated trace.dat files successfully open in:
>> - trace-cmd report (v3.1+)
>> - KernelShark (v2.0+)
>>
>> ======================
>> Next Steps
>> ======================
>>
>> We would highly appreciate reviews, comments, and feedback on:
>> - The overall architectural approach and integration points
>> - Compatibility considerations with trace-cmd ecosystem
>> - Performance characteristics for large-scale traces
>> - Additional use cases or workflow scenarios
>> - Future enhancement priorities
>>
>> Tanushree Shah (4):
>>    perf/trace-dat: Add trace.dat export infrastructure
>>    perf/trace-event: Write trace.dat metadata sections during parsing
>>    perf data-convert: Add perf.data to trace.dat conversion backend
>>    perf data: Add --to-trace-dat option for converting perf.data
>>      tracepoint events into trace.dat format
>>
>>   tools/perf/builtin-data.c            |  38 +-
>>   tools/perf/util/Build                |   2 +
>>   tools/perf/util/data-convert-trace.c | 152 ++++++
>>   tools/perf/util/data-convert.h       |   4 +
>>   tools/perf/util/trace-dat.c          | 705 +++++++++++++++++++++++++++
>>   tools/perf/util/trace-dat.h          |  79 +++
>>   tools/perf/util/trace-event-read.c   | 259 +++++++++-
>>   7 files changed, 1230 insertions(+), 9 deletions(-)
>>   create mode 100644 tools/perf/util/data-convert-trace.c
>>   create mode 100644 tools/perf/util/trace-dat.c
>>   create mode 100644 tools/perf/util/trace-dat.h
>>
>> --
>> 2.53.0
>>
> 




* Re: [PATCH v6 14/20] dma-direct: return struct page from dma_direct_alloc_from_pool()
From: Petr Tesarik @ 2026-06-09 13:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, stable, Michael Kelley
In-Reply-To: <20260604083959.1265923-15-aneesh.kumar@kernel.org>

On Thu,  4 Jun 2026 14:09:53 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:

> Commit 5b138c534fda ("dma-direct: factor out a dma_direct_alloc_from_pool
> helper") changed dma_direct_alloc_from_pool() to return the CPU address
> from dma_alloc_from_pool(). That fits dma_direct_alloc(), but
> dma_direct_alloc_pages() also uses the helper and expects a struct page *.
> 
> Fix this by making dma_direct_alloc_from_pool() return the struct page *
> again, and pass the CPU address back through an out-parameter for the
> dma_direct_alloc() caller.
> 
> Fixes: 5b138c534fda ("dma-direct: factor out a dma_direct_alloc_from_pool helper")
> Cc: stable@vger.kernel.org

While I totally agree with the reasoning and the fix, it's interesting
that this bug has apparently been present in the kernel for 5+ years
without anybody hitting nasty memory corruption bugs.

How can it be? Is the buggy code path never actually used in practice?
Does it hint at a missed opportunity to simplify the code?

Anyway, these thoughts are intended for a possible future
cleanup. For now, let's apply the fix as is, of course.

Petr T

> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  kernel/dma/direct.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 4e446aa4130e..e0ab9ff3f1d6 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -157,24 +157,24 @@ static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
>  	return !gfpflags_allow_blocking(gfp) && !is_swiotlb_for_alloc(dev);
>  }
>  
> -static void *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> -		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> +static struct page *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> +		dma_addr_t *dma_handle, void **cpu_addr, gfp_t gfp,
> +		unsigned long attrs)
>  {
>  	struct page *page;
>  	u64 phys_limit;
> -	void *ret;
>  
>  	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_DMA_COHERENT_POOL)))
>  		return NULL;
>  
>  	gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
> -	page = dma_alloc_from_pool(dev, size, &ret, gfp, attrs,
> +	page = dma_alloc_from_pool(dev, size, cpu_addr, gfp, attrs,
>  				   dma_coherent_ok);
>  	if (!page)
>  		return NULL;
>  	*dma_handle = phys_to_dma_direct(dev, page_to_phys(page),
>  					 !!(attrs & DMA_ATTR_CC_SHARED));
> -	return ret;
> +	return page;
>  }
>  
>  static void *dma_direct_alloc_no_mapping(struct device *dev, size_t size,
> @@ -270,9 +270,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	 * the atomic pools instead if we aren't allowed block.
>  	 */
>  	if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
> -	    dma_direct_use_pool(dev, gfp))
> -		return dma_direct_alloc_from_pool(dev, size, dma_handle,
> -						  gfp, attrs);
> +	    dma_direct_use_pool(dev, gfp)) {
> +		page = dma_direct_alloc_from_pool(dev, size,
> +					dma_handle, &cpu_addr,
> +					gfp, attrs);
> +		return page ? cpu_addr : NULL;
> +	}
>  
>  	if (is_swiotlb_for_alloc(dev)) {
>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
> @@ -445,7 +448,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  
>  	if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle,
> -						  gfp, attrs);
> +						  &cpu_addr, gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
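
The shape of the fix quoted above (return the page from the helper and
hand the CPU address back through an out-parameter) can be sketched in
userspace. This is a model only: model_page, pool_memory and the model_*
functions are hypothetical names, and a single static buffer stands in
for the atomic pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: a descriptor, not the memory. */
struct model_page {
	void *vaddr;
};

char pool_memory[64];
struct model_page pool_page = { .vaddr = pool_memory };

/*
 * Models the helper after the fix: the return value is the page, and the
 * CPU address is passed back through the cpu_addr out-parameter.
 */
struct model_page *model_alloc_from_pool(size_t size, void **cpu_addr)
{
	assert(cpu_addr != NULL);

	if (size == 0 || size > sizeof(pool_memory))
		return NULL;

	*cpu_addr = pool_page.vaddr;
	return &pool_page;
}

/* Models the dma_direct_alloc() caller, which wants the CPU address. */
void *model_alloc(size_t size)
{
	void *cpu_addr;
	struct model_page *page = model_alloc_from_pool(size, &cpu_addr);

	return page ? cpu_addr : NULL;
}

/* Models the dma_direct_alloc_pages() caller, which wants the page. */
struct model_page *model_alloc_pages(size_t size)
{
	void *cpu_addr;	/* required by the helper, unused by this caller */

	return model_alloc_from_pool(size, &cpu_addr);
}
```

In this model the two callers receive different pointers for the same
allocation, which is the distinction the old helper lost when it returned
the CPU address to both.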




