LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-14 14:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260514123529.GZ7702@ziepe.ca>

On Thu, May 14, 2026 at 09:35:29AM -0300, Jason Gunthorpe wrote:
> > > How will pKVM signal what kind of memory the DMA needs then?
> > > 
> > > Does it use set_memory_decrypted()? How can it use
> > > set_memory_decrypted() without offering CC_ATTR_MEM_ENCRYPT ?
> > 
> > pKVM (hypervisor) doesn’t signal anything.
> > The VMM when running protected guests will use restricted dma-pools
> > for emulated virtio devices in the guest, which gets decrypted by
> > the guest kernel and hence shared with the host kernel, and then
> > traffic is bounced via the pool.
> 
> That really does sound like CC and set_memory_decrypted() to me..
> 
> > It’s also worth noting that bouncing here isn't just about visibility.
> > Because memory sharing operates at page granularity, bouncing sub-page
> > allocations through the restricted pool prevents adjacent, sensitive
> > guest data from being exposed to the untrusted host.
> 
> That's a somewhat different problem, we have the dev->trusted stuff
> that is supposed to deal with this kind of security. We need it for
> IOMMU based systems too, eg hot plug thunderbolt should have it.

I see that it is used only for dma-iommu and for PCI devices.
However, I think the same problem applies to other CCA solutions
with emulated devices, as those devices are untrusted, and I'd expect
those solutions to have virtio devices too.

> 
> Then CC issue is more that the DMA API can't decrypt random passed in
> memory because doing so often requires changing the PTEs pointing at
> the page so it would break everything if done transparently.
> 
> > > > I believe that the pool should have a way to control its property
> > > > (encrypted or decrypted) and that takes priority over whatever
> > > > attributes come from allocation.
> > > 
> > > We should get here because dma_capable() fails, and then swiotlb needs
> > > to return something that makes dma_capable() succeed. Yes, it should
> > > return details about the thing it decided, but it shouldn't have been
> > > pre-created with some idea how to make dma_capable() work.
> > 
> > That sounds neat, but at the end we have force_dma_unencrypted() in
> > dma_capable() which is just hardcoded to true/false by the platform.
> 
> For now, the next step is it becomes per-device and dynamic during the
> device lifecycle.
> 
> > How is that different from having the state static by the pool?
> 
> statically attached pools to the device are not so flexible when
> devices have dynamically changing capabilities..

Pools can also be per-device. A device can have multiple pools with
different memory attrs, which the DMA code can then match at runtime.
It's not as flexible, but it removes some complexity from the guest
code.
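
To illustrate the runtime matching, here is a minimal user-space sketch.
It is not swiotlb code: struct mock_pool and mock_pick_pool() are
hypothetical names, and a real pool carries far more state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device bounce pool descriptor; not struct io_tlb_pool. */
struct mock_pool {
	bool shared;        /* true: decrypted/shared with the host */
	size_t free_bytes;  /* space still available for bouncing */
};

/*
 * Return the first pool attached to the device whose memory attribute
 * matches what the mapping needs and which still has room, or NULL if
 * no such pool exists.
 */
struct mock_pool *mock_pick_pool(struct mock_pool *pools, size_t nr_pools,
				 bool want_shared, size_t size)
{
	for (size_t i = 0; i < nr_pools; i++) {
		if (pools[i].shared == want_shared &&
		    pools[i].free_bytes >= size)
			return &pools[i];
	}
	return NULL;
}
```

The attribute lives on the pool, so the caller only states what the
mapping needs and does not decide how the memory gets into that state.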

> 
> > > If dma_capable() can fail, then swiotlb should know exactly what to do
> > > to fix it.
> > 
> > dma_capable() returns a bool, I don’t think it can know what exactly
> > went wrong (based on address, size, attrs, dev...)
> 
> Yes, but I think the design is swiotlb is supposed to re-inspect what
> is going on against the limits dma_capable checks and then select the
> correct remedy..

I see, but that’s not part of this series, and probably would require
some rework so dma_capable() can return an error code (ERANGE, EPERM...)
so that the caller can deal with that.
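
A rough user-space sketch of that idea, to make the discussion concrete.
Everything here is hypothetical (struct mock_dev, mock_dma_capable() and
the choice of which condition maps to which errno); it is not the
kernel's dma_capable() and not part of this series.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified device limits; not the kernel's struct device. */
struct mock_dev {
	uint64_t dma_mask;   /* highest bus address the device can reach */
	bool needs_shared;   /* device can only access shared (decrypted) memory */
};

/*
 * Hypothetical variant of dma_capable() that reports why a buffer is
 * not directly usable instead of returning a plain bool:
 *   0       - the buffer can be used as-is
 *   -ERANGE - the buffer is empty, wraps, or lies beyond the dma_mask
 *   -EPERM  - the buffer is private but the device needs shared memory
 */
int mock_dma_capable(const struct mock_dev *dev, uint64_t addr,
		     uint64_t size, bool buf_is_shared)
{
	uint64_t end = addr + size - 1;

	if (size == 0 || end < addr || end > dev->dma_mask)
		return -ERANGE;
	if (dev->needs_shared && !buf_is_shared)
		return -EPERM;
	return 0;
}
```

With something like this the bounce path could pick its remedy from the
return value rather than re-deriving which limit was hit.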

> 
> > While we can debate the aesthetics of the setup, this is
> > the existing behaviour for Linux, which existed for years
> > and pKVM relies on and is used extensively.
> > And, this patch alters that long-standing logic and introduces
> > a functional regression.
> 
> Yeah, Aneesh needs to do something here, I'm pointing out it is an
> entirely separate thing from the CC path we are working on, which is
> decoupling CC from relying on force swiotlb.

I am looking into converting pKVM to use the CC stuff, I replied with
a patch to Aneesh in this thread. However, I need to do more testing
and make sure there are not any unwanted consequences.

> 
> > We can address this by either adjusting this patch or by changing
> > pKVM guests to be more aligned with other CCA guests which is
> > something I have been wondering about if it would help reduce
> > bouncing.
> 
> Every time I look at pkvm I think it is just ARM CCA with a different
> design and no access to the unique HW features..
> 
> > > If we can make that work then maybe the flows are designed correctly.
> > 
> > Mmm, I am not sure I understand this one, shouldn’t the device also be
> > notified about the switch in memory state, if it expects to read/write
> > decrypted memory, how would that work if the kernel changes it to an
> > encrypted one?
> 
> Nothing on the device changes. In a CC world we put the device in a
> T=0 or T=1 state before the driver loads and the expectation from the
> DMA API is that the device will only use that T=x DMA type during
> operation.
> 
> A T=1 state device can access all of memory, private or shared. Any
> information the platform may need is encoded in the dma_addr_t or in
> the S1 IOPTEs.
> 
> So we never need to tell the device driver what kind of memory the DMA
> is targetting, and we NEVER expect a device in T=1 mode to have to
> issue a T=0 DMA to use the DMA API.
> 
> In a pkvm world it should be the same, the S2 table for the SMMU will
> control what the device can access, and if the SMMU points to a
> "private" or "shared" page is not something the device needs to know
> or care about.

I see that's because dma-iommu chooses the attrs for iommu_map().

In pKVM, dma_addr_t and IOPTE are the same for private and shared,
so nothing differs in that case.
We don’t expect pass-through devices to interact with shared
memory (T=0) at the moment.
However, I can see use cases for that, where the host and the guest
collaborate with device passthrough and require zero copy.

One other interesting case for device-passthrough is non-coherent
devices which then require private pools for bouncing.

Thanks,
Mostafa

> 
> Jason



* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-14 15:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260514143733.GB7702@ziepe.ca>

On Thu, May 14, 2026 at 11:37:33AM -0300, Jason Gunthorpe wrote:
> On Thu, May 14, 2026 at 06:18:05PM +0530, Aneesh Kumar K.V wrote:
> > > There is no problem with non-protected guests as they don't use memory
> > > encryption, my initial thought was that the encrypted/decrypted state is a
> > > per-pool property which is decided by FW (device-tree).
> > 
> > What I meant was that we need a generic way to identify a pKVM guest, so
> > that we can use it in the conditional above.
> 
> If I understood Mostafa's remarks I think different devices in the
> guest need shared/decrypted and some don't? Ie a virtio hypervisor
> device needs shared while a real PCI device doesn't? Is that right?

Upstream, device passthrough is not supported, but that case is
supported in Android and we plan to upstream it (it currently
depends on the SMMUv3 series landing first).

> 
> In CC terms that would be a mixture of T=0 and T=1 devices hardwired
> and signaled by firmware..
> 
> Ideally we'd have a flow where if the arch precreates a swiotlb pool
> with special parameters this overrides all other decision making. Then
> this series is about making CC NOT use that flow... ??

Yes, I believe that will be needed. We do this in Android with a
per-pool property added in the device tree.

Thanks,
Mostafa

> 
> Jason



* [PATCH 3/5] powerpc/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-15 14:22 UTC (permalink / raw)
  To: Richard Henderson, Matt Turner, Magnus Lindholm, Russell King,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Bjorn Helgaas, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-0-f8e393c57390@linux.ibm.com>

Replace iterating over pci_root_buses with the official
pci_find_next_bus() call provided by PCI core. This allows making
pci_root_buses private to PCI core.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
 arch/powerpc/kernel/pci-common.c | 7 ++++---
 arch/powerpc/kernel/pci_64.c     | 4 ++--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 8efe95a0c4ff..1e0be7bcaa56 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1417,10 +1417,10 @@ static void __init pcibios_reserve_legacy_regions(struct pci_bus *bus)
 
 void __init pcibios_resource_survey(void)
 {
-	struct pci_bus *b;
+	struct pci_bus *b = NULL;
 
 	/* Allocate and assign resources */
-	list_for_each_entry(b, &pci_root_buses, node)
+	while ((b = pci_find_next_bus(b)) != NULL)
 		pcibios_allocate_bus_resources(b);
 	if (!pci_has_flag(PCI_REASSIGN_ALL_RSRC)) {
 		pcibios_allocate_resources(0);
@@ -1432,7 +1432,8 @@ void __init pcibios_resource_survey(void)
 	 * bus available resources to avoid allocating things on top of them
 	 */
 	if (!pci_has_flag(PCI_PROBE_ONLY)) {
-		list_for_each_entry(b, &pci_root_buses, node)
+		b = NULL; /* Start all over */
+		while ((b = pci_find_next_bus(b)) != NULL)
 			pcibios_reserve_legacy_regions(b);
 	}
 
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index e27342ef128b..f816d063b984 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -227,7 +227,7 @@ SYSCALL_DEFINE3(pciconfig_iobase, long, which, unsigned long, in_bus,
 			  unsigned long, in_devfn)
 {
 	struct pci_controller* hose;
-	struct pci_bus *tmp_bus, *bus = NULL;
+	struct pci_bus *tmp_bus = NULL, *bus = NULL;
 	struct device_node *hose_node;
 
 	/* Argh ! Please forgive me for that hack, but that's the
@@ -248,7 +248,7 @@ SYSCALL_DEFINE3(pciconfig_iobase, long, which, unsigned long, in_bus,
 	 * used on pre-domains setup. We return the first match
 	 */
 
-	list_for_each_entry(tmp_bus, &pci_root_buses, node) {
+	while ((tmp_bus = pci_find_next_bus(tmp_bus)) != NULL) {
 		if (in_bus >= tmp_bus->number &&
 		    in_bus <= tmp_bus->busn_res.end) {
 			bus = tmp_bus;

-- 
2.54.0




* [PATCH 1/5] alpha/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-15 14:22 UTC (permalink / raw)
  To: Richard Henderson, Matt Turner, Magnus Lindholm, Russell King,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Bjorn Helgaas, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-0-f8e393c57390@linux.ibm.com>

Replace iterating over pci_root_buses with the official
pci_find_next_bus() call provided by PCI core. This allows making
pci_root_buses private to PCI core.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
 arch/alpha/kernel/pci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/pci.c b/arch/alpha/kernel/pci.c
index 11df411b1d18..02ec4dbb3ac6 100644
--- a/arch/alpha/kernel/pci.c
+++ b/arch/alpha/kernel/pci.c
@@ -312,9 +312,9 @@ pcibios_claim_one_bus(struct pci_bus *b)
 static void __init
 pcibios_claim_console_setup(void)
 {
-	struct pci_bus *b;
+	struct pci_bus *b = NULL;
 
-	list_for_each_entry(b, &pci_root_buses, node)
+	while ((b = pci_find_next_bus(b)) != NULL)
 		pcibios_claim_one_bus(b);
 }
 

-- 
2.54.0




* [PATCH 4/5] x86/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-15 14:22 UTC (permalink / raw)
  To: Richard Henderson, Matt Turner, Magnus Lindholm, Russell King,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Bjorn Helgaas, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-0-f8e393c57390@linux.ibm.com>

Replace iterating over pci_root_buses with the official
pci_find_next_bus() call provided by PCI core. This allows making
pci_root_buses private to PCI core.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
 arch/x86/pci/i386.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index e2de26b82940..194d0fa3cec8 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -357,10 +357,10 @@ static void pcibios_allocate_rom_resources(struct pci_bus *bus)
 
 static int __init pcibios_assign_resources(void)
 {
-	struct pci_bus *bus;
+	struct pci_bus *bus = NULL;
 
 	if (!(pci_probe & PCI_ASSIGN_ROMS))
-		list_for_each_entry(bus, &pci_root_buses, node)
+		while ((bus = pci_find_next_bus(bus)) != NULL)
 			pcibios_allocate_rom_resources(bus);
 
 	pci_assign_unassigned_resources();
@@ -390,16 +390,18 @@ void pcibios_resource_survey_bus(struct pci_bus *bus)
 
 void __init pcibios_resource_survey(void)
 {
-	struct pci_bus *bus;
+	struct pci_bus *bus = NULL;
 
 	DBG("PCI: Allocating resources\n");
 
-	list_for_each_entry(bus, &pci_root_buses, node)
+	while ((bus = pci_find_next_bus(bus)) != NULL)
 		pcibios_allocate_bus_resources(bus);
 
-	list_for_each_entry(bus, &pci_root_buses, node)
+	bus = NULL; /* start all over */
+	while ((bus = pci_find_next_bus(bus)) != NULL)
 		pcibios_allocate_resources(bus, 0);
-	list_for_each_entry(bus, &pci_root_buses, node)
+	bus = NULL; /* start all over */
+	while ((bus = pci_find_next_bus(bus)) != NULL)
 		pcibios_allocate_resources(bus, 1);
 
 	e820__reserve_resources_late();

-- 
2.54.0




* [PATCH 2/5] arm/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-15 14:22 UTC (permalink / raw)
  To: Richard Henderson, Matt Turner, Magnus Lindholm, Russell King,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Bjorn Helgaas, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-0-f8e393c57390@linux.ibm.com>

Replace iterating over pci_root_buses with the official
pci_find_next_bus() call provided by PCI core. This allows making
pci_root_buses private to PCI core.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
 arch/arm/kernel/bios32.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c
index ac0e890510da..35642c9ba054 100644
--- a/arch/arm/kernel/bios32.c
+++ b/arch/arm/kernel/bios32.c
@@ -59,9 +59,9 @@ static void pcibios_bus_report_status(struct pci_bus *bus, u_int status_mask, in
 
 void pcibios_report_status(u_int status_mask, int warn)
 {
-	struct pci_bus *bus;
+	struct pci_bus *bus = NULL;
 
-	list_for_each_entry(bus, &pci_root_buses, node)
+	while ((bus = pci_find_next_bus(bus)) != NULL)
 		pcibios_bus_report_status(bus, status_mask, warn);
 }
 

-- 
2.54.0




* [PATCH 5/5] PCI: Make pci_root_buses private to PCI core
From: Gerd Bayer @ 2026-05-15 14:22 UTC (permalink / raw)
  To: Richard Henderson, Matt Turner, Magnus Lindholm, Russell King,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Bjorn Helgaas, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-0-f8e393c57390@linux.ibm.com>

After all users of pci_root_buses external to PCI core have been
converted to using pci_find_next_bus(), move its declaration to the
PCI core code and stop exporting the symbol.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
 drivers/pci/pci.h   | 3 +++
 drivers/pci/probe.c | 2 --
 include/linux/pci.h | 4 ----
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e543a..1f36d400c9e0 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -366,6 +366,9 @@ static inline void pci_create_legacy_files(struct pci_bus *bus) { }
 static inline void pci_remove_legacy_files(struct pci_bus *bus) { }
 #endif
 
+/* List of all known PCI buses */
+extern struct list_head pci_root_buses;
+
 /* Lock for read/write access to pci device and bus lists */
 extern struct rw_semaphore pci_bus_sem;
 extern struct mutex pci_slot_mutex;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b63cd0c310bc..2e97ab125ead 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -33,9 +33,7 @@ static struct resource busn_resource = {
 	.flags	= IORESOURCE_BUS,
 };
 
-/* Ugh.  Need to stop exporting this to modules. */
 LIST_HEAD(pci_root_buses);
-EXPORT_SYMBOL(pci_root_buses);
 
 static LIST_HEAD(pci_domain_busn_res_list);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..1c4610848b5c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1192,10 +1192,6 @@ extern enum pcie_bus_config_types pcie_bus_config;
 
 extern const struct bus_type pci_bus_type;
 
-/* Do NOT directly access these two variables, unless you are arch-specific PCI
- * code, or PCI core code. */
-extern struct list_head pci_root_buses;	/* List of all known PCI buses */
-
 void pcibios_resource_survey_bus(struct pci_bus *bus);
 void pcibios_bus_add_device(struct pci_dev *pdev);
 void pcibios_add_bus(struct pci_bus *bus);

-- 
2.54.0




* [PATCH 0/5] PCI: Finally make pci_root_buses private
From: Gerd Bayer @ 2026-05-15 14:22 UTC (permalink / raw)
  To: Richard Henderson, Matt Turner, Magnus Lindholm, Russell King,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Bjorn Helgaas, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Gerd Bayer

Hi all!

The ominous warning about pci_root_buses in drivers/pci/probe.c caught
my attention. Looking closer, I found that there are uses in four
arch-specific files left before we can stop exposing that symbol outside
of drivers/pci.

Finish off the job that Yinghai Lu started in 2013 - see
https://msgid.link/1359265003-16166-23-git-send-email-yinghai@kernel.org/
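
For readers who have not used the iterator, patches 1-4 all follow one
conversion pattern. Below is a minimal user-space sketch of it. The
types are mocks, and mock_find_next_bus() is a hypothetical stand-in
that only mirrors the calling convention of pci_find_next_bus(); it
does no locking and is not the PCI core implementation.

```c
#include <stddef.h>

/* Mock with just the fields this sketch needs; not the kernel's struct pci_bus. */
struct mock_bus {
	int number;
	struct mock_bus *next;
};

/* Hypothetical stand-in for the PCI core's private list of root buses. */
struct mock_bus *mock_root_buses;

/*
 * Same calling convention as pci_find_next_bus(): pass NULL to get the
 * first root bus, pass the previous result to get the next one. NULL is
 * returned once the list is exhausted.
 */
struct mock_bus *mock_find_next_bus(const struct mock_bus *from)
{
	return from ? from->next : mock_root_buses;
}

/* Walk all root buses twice; the cursor must be reset to NULL in between. */
int mock_count_two_passes(void)
{
	struct mock_bus *b = NULL;
	int visits = 0;

	while ((b = mock_find_next_bus(b)) != NULL)
		visits++;

	b = NULL; /* start all over */
	while ((b = mock_find_next_bus(b)) != NULL)
		visits++;

	return visits;
}
```

The reset matters because the first loop leaves the cursor at NULL only
when it runs to completion; setting it explicitly keeps the second walk
correct even if the first loop later gains an early exit.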

The entire series has been compile-tested only - with defconfigs on
alpha, arm, powerpc, and x86.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
Gerd Bayer (5):
      alpha/pci: Use official API to iterate over PCI buses
      arm/pci: Use official API to iterate over PCI buses
      powerpc/pci: Use official API to iterate over PCI buses
      x86/pci: Use official API to iterate over PCI buses
      PCI: Make pci_root_buses private to PCI core

 arch/alpha/kernel/pci.c          |  4 ++--
 arch/arm/kernel/bios32.c         |  4 ++--
 arch/powerpc/kernel/pci-common.c |  7 ++++---
 arch/powerpc/kernel/pci_64.c     |  4 ++--
 arch/x86/pci/i386.c              | 14 ++++++++------
 drivers/pci/pci.h                |  3 +++
 drivers/pci/probe.c              |  2 --
 include/linux/pci.h              |  4 ----
 8 files changed, 21 insertions(+), 21 deletions(-)
---
base-commit: 5d6919055dec134de3c40167a490f33c74c12581
change-id: 20260508-priv_root_buses-0263ef2679ad

Best regards,
-- 
Gerd Bayer <gbayer@linux.ibm.com>




* Re: [PATCH v2] powerpc/pseries/iommu: Add TCEs for 16GB pages when RAM is pre-mapped
From: Gaurav Batra @ 2026-05-15 14:23 UTC (permalink / raw)
  To: Harsh Prateek Bora, maddy
  Cc: linuxppc-dev, ritesh.list, sbhat, vaibhav, donettom
In-Reply-To: <f6e0be64-fc54-4c36-b871-991771549b29@linux.ibm.com>


On 5/15/26 4:06 AM, Harsh Prateek Bora wrote:
>
>
> On 15/05/26 12:24 am, Gaurav Batra wrote:
>> On PowerPC, if the Dynamic DMA Window is big enough, RAM is pre-mapped. To
>> determine the size of RAM, a PAPR+ property "ibm,lrdr-capacity" is used.
>> This OF property dictates what is the max size of RAM an LPAR can have,
>> including DR added memory.
>>
>> In PowerPC, 16GB pages can be allocated at machine level and then
>> assigned to LPARs. These 16GB pages are added to LPAR memory at the time
>> of boot. The address range for these 16GB pages is above MAX RAM an LPAR
>> can have (ibm,lrdr-capacity). In the current implementation, these 16GB
>> pages are being excluded from pre-mapped TCEs. A driver can have DMA
>> buffers allocated from 16GB pages. This causes the platform to raise an
>> EEH when DMA is attempted on buffers in the 16GB memory range.
>>
>> commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly
>> adds TCEs for pmemory")
>>
>> Prior to the above patch, memblock_end_of_DRAM() was being used to
>> determine the MAX memory of an LPAR. This included 16GB pages as well.
>> The issue with using memblock_end_of_DRAM() is that when pmemory is
>> converted to RAM via daxctl command, the DDW engine will incorrectly try
>> to add TCEs for pmemory as well.
>>
>> Below is the address distribution of RAM, 16GB pages and pmemory for an
>> LPAR with max memory of 256GB, memory allocated 64GB, 2 16GB pages and
>> assigned pmemory of 8GB.
>>
>> RANGE                                 SIZE  STATE REMOVABLE BLOCK
>> 0x0000000000000000-0x0000000fffffffff  64G online       yes 0-255
>> 0x0000004000000000-0x00000047ffffffff  32G online       yes 1024-1151
>>
>> cat /sys/bus/nd/devices/region0/resource
>> 0x40100000000
>> cat /sys/bus/nd/devices/region0/size
>> 8589934592
>>
>> The approach to fix this problem is to revert back the code changes
>> introduced by the above patch and to stash away the MAX memory of an
>> LPAR, including 16GB pages, at the LPAR boot time. This value is then
>> used whenever TCEs are needed to be pre-mapped - enable_DDW() or,
>> iommu_mem_notifier()
>>
>> Fixes: 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory")
>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>> ---
>>
>> Change log:
>>
>> V1 -> V2
>>
>> 1. Harsh: Not only start_pfn, but end_pfn also needs to be within allowed
>>     range, which may require clamping arg->nr_pages if crossing the limits.
>>
>>     Response: Incorporated changes.
>>
>> Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
>
> I think I mentioned it before also. Please avoid using tags unless 
> explicitly provided by the reviewer.
My apologies, I thought you meant to move it to the "review comments
section". I will remove the tag in the next version of the patch.
>
>>
>>   arch/powerpc/platforms/pseries/iommu.c | 56 ++++++++++++++++++--------
>>   1 file changed, 40 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 3e1f915fe4f6..fdb160b72938 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -69,6 +69,8 @@ static struct iommu_table 
>> *iommu_pseries_alloc_table(int node)
>>       return tbl;
>>   }
>>   +static phys_addr_t pseries_ddw_max_ram;
>> +
>>   #ifdef CONFIG_IOMMU_API
>>   static struct iommu_table_group_ops spapr_tce_table_group_ops;
>>   #endif
>> @@ -1285,15 +1287,19 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>     static phys_addr_t ddw_memory_hotplug_max(void)
>>   {
>> -    resource_size_t max_addr;
>> +    resource_size_t max_addr = memory_hotplug_max();
>> +    struct device_node *memory;
>>   -#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>> -    max_addr = hot_add_drconf_memory_max();
>> -#else
>> -    max_addr = memblock_end_of_DRAM();
>> -#endif
>> +    for_each_node_by_type(memory, "memory") {
>> +        struct resource res;
>> +
>> +        if (of_address_to_resource(memory, 0, &res))
>> +            continue;
>> +
>> +        max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>> +        }
>
> Indentation needs to be corrected above and below.
>
>>   -    return max_addr;
>> +        return max_addr;
>>   }
>>     /*
>> @@ -1446,7 +1452,7 @@ static struct property 
>> *ddw_property_create(const char *propname, u32 liobn, u64
>>   static bool enable_ddw(struct pci_dev *dev, struct device_node 
>> *pdn, u64 dma_mask)
>>   {
>>       int len = 0, ret;
>> -    int max_ram_len = order_base_2(ddw_memory_hotplug_max());
>> +    int max_ram_len = order_base_2(pseries_ddw_max_ram);
>>       struct ddw_query_response query;
>>       struct ddw_create_response create;
>>       int page_shift;
>> @@ -1668,7 +1674,7 @@ static bool enable_ddw(struct pci_dev *dev, 
>> struct device_node *pdn, u64 dma_mas
>>         if (direct_mapping) {
>>           /* DDW maps the whole partition, so enable direct DMA 
>> mapping */
>> -        ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> 
>> PAGE_SHIFT,
>> +        ret = walk_system_ram_range(0, pseries_ddw_max_ram >> 
>> PAGE_SHIFT,
>>                           win64->value, 
>> tce_setrange_multi_pSeriesLP_walk);
>>           if (ret) {
>>               dev_info(&dev->dev, "failed to map DMA window for %pOF: 
>> %d\n",
>> @@ -2419,21 +2425,32 @@ static int iommu_mem_notifier(struct 
>> notifier_block *nb, unsigned long action,
>>   {
>>       struct dma_win *window;
>>       struct memory_notify *arg = data;
>> +    unsigned long limit = arg->nr_pages;
>> +    unsigned long max_ram_pages = pseries_ddw_max_ram >> PAGE_SHIFT;
>>       int ret = 0;
>>         /* This notifier can get called when onlining persistent 
>> memory as well.
>>        * TCEs are not pre-mapped for persistent memory. Persistent 
>> memory will
>> -     * always be above ddw_memory_hotplug_max()
>> +     * always be above pseries_ddw_max_ram
>>        */
>> +    if (arg->start_pfn >= max_ram_pages)
>> +        return NOTIFY_OK;
>> +
>> +    /* RAM is being DLPAR'ed. The range should never exceed max ram.
>> +     * Just in case, clamp the range and throw a warning.
>> +     */
>> +    if (arg->start_pfn + limit > max_ram_pages) {
>> +        limit = max_ram_pages - arg->start_pfn;
>> +        WARN_ON(1);
>
> WARN_ONCE with an appropriate warning message may be a better choice.
>
>> +    }
>>         switch (action) {
>>       case MEM_GOING_ONLINE:
>>           spin_lock(&dma_win_list_lock);
>>           list_for_each_entry(window, &dma_win_list, list) {
>> -            if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> -                ddw_memory_hotplug_max()) {
>> +            if (window->direct) {
>>                   ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> -                        arg->nr_pages, window->prop);
>> +                        limit, window->prop);
>>               }
>>               /* XXX log error */
>
> Replace comment with a log if limit < arg->nr_pages ?
> Similarly below as well.
>
>>           }
>> @@ -2443,10 +2460,9 @@ static int iommu_mem_notifier(struct 
>> notifier_block *nb, unsigned long action,
>>       case MEM_OFFLINE:
>>           spin_lock(&dma_win_list_lock);
>>           list_for_each_entry(window, &dma_win_list, list) {
>> -            if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> -                ddw_memory_hotplug_max()) {
>> +            if (window->direct) {
>>                   ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> -                        arg->nr_pages, window->prop);
>> +                        limit, window->prop);
>>               }
>>               /* XXX log error */
>
> ^^^ Ditto.
>
> Thanks
> Harsh
>
>>           }
>> @@ -2532,6 +2548,14 @@ void __init iommu_init_early_pSeries(void)
>>       register_memory_notifier(&iommu_mem_nb);
>>         set_pci_dma_ops(&dma_iommu_ops);
>> +
>> +    /* During init determine the max memory an LPAR can have and set 
>> it. This
>> +     * will be used for pre-mapping RAM in DDW. 
>> memblock_end_of_DRAM() can
>> +     * change during the running of LPAR - daxctl can add pmemory as
>> +     * "system-ram". This memory range should not be pre-mapped in 
>> DDW since
>> +     * the address of pmemory can be much higher than the DDW size.
>> +     */
>> +    pseries_ddw_max_ram = ddw_memory_hotplug_max();
>>   }
>>     static int __init disable_multitce(char *str)
>>
>> base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a
>
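
For reference, the range handling discussed above boils down to the
following. This is a stand-alone sketch, not the patch code:
mock_clamp_nr_pages() and its parameters are made-up names, and it
compares against the remaining space instead of computing
start_pfn + nr_pages so that the check itself cannot overflow.

```c
#include <stdbool.h>

/*
 * Clamp a hotplug range [start_pfn, start_pfn + nr_pages) to the
 * pre-mapped limit max_ram_pages.
 *
 * Returns the number of pages that should be (un)mapped: 0 when the
 * range starts at or above the limit (for example persistent memory).
 * *clamped is set when the range had to be shortened, which is the
 * case the caller would warn about.
 */
unsigned long mock_clamp_nr_pages(unsigned long start_pfn,
				  unsigned long nr_pages,
				  unsigned long max_ram_pages,
				  bool *clamped)
{
	*clamped = false;

	if (start_pfn >= max_ram_pages)
		return 0;

	if (nr_pages > max_ram_pages - start_pfn) {
		*clamped = true;
		return max_ram_pages - start_pfn;
	}

	return nr_pages;
}
```

For example, with a limit of 1000 pages a request for 50 pages starting
at pfn 990 is shortened to 10 pages and flagged as clamped.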



* Re: [PATCH 01/19] btrfs: require at least 4 devices for RAID 6
From: Goffredo Baroncelli @ 2026-05-14 19:51 UTC (permalink / raw)
  To: Christoph Hellwig, David Sterba
  Cc: Andrew Morton, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Herbert Xu, Dan Williams,
	Chris Mason, David Sterba, Arnd Bergmann, Song Liu, Yu Kuai,
	Li Nan, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
	linux-raid
In-Reply-To: <20260513054742.GA1018@lst.de>

On 13/05/2026 07.47, Christoph Hellwig wrote:
> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:

> 
>> The degenerate modes of
>> raid0, 5, or 6 are explicit as a possible middle step when converting
>> profiles.  We can use a fallback implementation for this case if the
>> accelerated implementations cannot do it.
> 
> This is not about a degenerated mode.  For a degenerated RAID 6, parity
> generation uses the RAID 5 XOR routines as the second parity will be
> missing.  This is about generating two parities for a single data disk,
> which must be explicitly selected.
> 

I think David's concern is: "what happens to an already existing
3-disk btrfs raid6 filesystem when the user upgrades the kernel?"
(I am thinking of the case where a new BG needs to be allocated)...

BR
GB

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



* Re: [PATCH] powerpc/64s: Fix the vector number in comments for h_facility_unavailable
From: Gautam Menghani @ 2026-05-15 13:08 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Gautam Menghani, maddy, mpe, npiggin, chleroy, linuxppc-dev,
	linux-kernel
In-Reply-To: <87fr3vvfvq.fsf@vajain21.in.ibm.com>

On Wed, May 13, 2026 at 02:35:29PM +0530, Vaibhav Jain wrote:
> Hey Gautam,
> 
> Thanks for the patch. Since this patch doesnt have any functional or
> code change can you please put a 'trivial' suffix to it patch title like
> [1] or some other suffix indicating its a non-functional change. That
> way maintainers can easily pull the patch without worrying much about a
> regression.
> 
> [1]
> https://git.kernel.org/powerpc/c/d2827e5e2e0f0941a651f4b1ca5e9b778c4b5293

Yeah I've mentioned "comments" in the title, so I guess that's fine?

Thanks,
Gautam



* Re: [PATCH v2 0/5] KVM: PPC: Handle CPU compatibility mode for nested guests
From: Anushree Mathur @ 2026-05-15 10:50 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Vaibhav Jain, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, anushree.mathur
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>



On 13/05/26 3:37 PM, Amit Machhiwal wrote:
> On POWER systems, newer processor generations can operate in compatibility
> modes corresponding to earlier generations (e.g., a Power11 system running
> in Power10 compatibility mode). In such cases, the effective CPU level
> exposed to guests differs from the physical processor generation.
>
> This creates a problem for nested virtualization. When booting a nested KVM
> guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
> userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
> and attempt to configure the nested guest accordingly. However, the L1
> partition is constrained by the compatibility level negotiated with the
> hypervisor (L0), and requests exceeding that level are rejected, leading to
> guest boot failures such as:
>
>    KVM-NESTEDv2: couldn't set guest wide elements
>
> This series addresses the issue in two steps:
>
> 1. Detect and reject invalid compatibility requests early in KVM to avoid
>     late failures.
>
> 2. Provide a mechanism for userspace to query the effective CPU
>     compatibility modes supported by the host, so it can select an
>     appropriate CPU model for nested guests.
>
> To achieve this, the series introduces a new KVM capability and ioctl
> (KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
> compatibility modes supported by the host.
>
> The implementation supports both:
>
>    - PowerVM (nested API v2), where compatibility information is obtained
>      via the H_GUEST_GET_CAPABILITIES hypercall.
>    - PowerNV (nested API v1), where compatibility is derived from the device
>      tree ("cpu-version") representing the effective processor compatibility
>      level.
>
> This allows userspace (e.g., QEMU) to select a CPU model consistent with
> the host compatibility mode, avoiding mismatches and enabling successful
> nested guest boot.
>
> Changes in v2:
>    - Squashed patches 2 and 3 from v1 (capability introduction and ioctl
>      wiring) into a single patch for better logical grouping
>    - Changed kvm_ppc_compat_caps.flags from __u32 to __u64 for consistency
>      and future extensibility
>    - Addressed other review comments
>    - Improved commit messages with clearer explanations of the changes
>
> Patch summary:
>    [1/5] Validate arch_compat against host compatibility mode
>    [2/5] Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
>    [3/5] Implement capability retrieval for PowerVM (API v2)
>    [4/5] Add PowerNV support (API v1)
>    [5/5] Document the new ioctl
>
> Tested on:
>    - Power11 pSeries LPAR in Power10 compatibility mode (nested API v2)
>    - Power10 PowerNV system (and QEMU TCG PowerNV 11) with nested
>      virtualization (API v1) with various combinations of KVM L1/L2 guests
>      in various supported compatibility modes.
>
> With this series, nested guests boot successfully in configurations where
> they previously failed due to compatibility mismatches.
>
> Related QEMU series:
>    A corresponding QEMU series adds support for querying and using these
>    compatibility capabilities when configuring nested KVM guests:
>    https://lore.kernel.org/all/20260502140021.69712-1-amachhiw@linux.ibm.com/
>
> v1: https://lore.kernel.org/linuxppc-dev/20260430054906.94431-1-amachhiw@linux.ibm.com/
>
> Amit Machhiwal (5):
>    KVM: PPC: Book3S HV: Validate arch_compat against host compatibility
>      mode
>    KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
>    KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM
>      on PowerVM
>    KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM
>      on PowerNV
>    KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
>
>   Documentation/virt/kvm/api.rst      | 35 ++++++++++++++++
>   arch/powerpc/include/asm/kvm_ppc.h  |  1 +
>   arch/powerpc/include/uapi/asm/kvm.h |  6 +++
>   arch/powerpc/kvm/book3s_hv.c        | 63 +++++++++++++++++++++++++++++
>   arch/powerpc/kvm/powerpc.c          | 21 ++++++++++
>   include/uapi/linux/kvm.h            |  4 ++
>   6 files changed, 130 insertions(+)
>
>
> base-commit: 1d5dcaa3bd65f2e8c9baa14a393d3a2dc5db7524

Hi Amit,
I booted a guest on a P11 LPAR running in P10 compat mode, with your
patch series and the QEMU patch series applied, and it has been
working perfectly fine.

Host lscpu:

lscpu
Architecture:                ppc64le
   Byte Order:                Little Endian
CPU(s):                      80
   On-line CPU(s) list:       0-79
Model name:                  POWER10 (architected), altivec supported


Guest lscpu:

lscpu
Architecture:                ppc64le
   Byte Order:                Little Endian
CPU(s):                      10
   On-line CPU(s) list:       0-9
Model name:                  POWER10 (architected), altivec supported

Feel free to add :

Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>

Thank you!
Anushree Mathur



* Re: [PATCH 01/19] btrfs: require at least 4 devices for RAID 6
From: Christoph Hellwig @ 2026-05-15  4:37 UTC (permalink / raw)
  To: kreijack
  Cc: Christoph Hellwig, David Sterba, Andrew Morton, Catalin Marinas,
	Will Deacon, Ard Biesheuvel, Huacai Chen, WANG Xuerui,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Herbert Xu, Dan Williams, Chris Mason,
	David Sterba, Arnd Bergmann, Song Liu, Yu Kuai, Li Nan,
	linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
	linux-raid
In-Reply-To: <0a8d1ff4-f5a2-49e9-aa45-d25dbe4ded40@libero.it>

On Thu, May 14, 2026 at 09:51:59PM +0200, Goffredo Baroncelli wrote:
> I think that the David concern is : "what happens for an already
> existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
> (I am thinking when a new BG needs to be allocated)...

Then it will cleanly fail to mount instead of constantly corrupting data
and memory with every write, yes.  Which clearly suggests that such
file systems don't exist in the wild.

But if btrfs wants to keep supporting this I'll just add an _unsafe
version without the check in the core library.



* Re: [PATCH 01/19] btrfs: require at least 4 devices for RAID 6
From: Christoph Hellwig @ 2026-05-15  4:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: kreijack, Goffredo Baroncelli, Christoph Hellwig, David Sterba,
	Andrew Morton, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Herbert Xu, Dan Williams, Chris Mason,
	David Sterba, Arnd Bergmann, Song Liu, Yu Kuai, Li Nan,
	linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
	linux-raid
In-Reply-To: <0507CCEF-0548-442F-8703-1D006B5E068B@zytor.com>

On Thu, May 14, 2026 at 12:57:53PM -0700, H. Peter Anvin wrote:
> That's what I'm saying – it should invoke the RAID-1 code under the
> cover (as with 3 disks, D = P = Q.)

Yes, if the btrfs maintainer cared for this setup that is what should
be done.
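
For illustration only, a minimal userspace sketch of why D = P = Q with
a single data disk. It assumes the usual RAID 6 field GF(2^8) with
reduction polynomial 0x11d and generator {02}; the function names are
hypothetical and this is not the library code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator {02} in GF(2^8), reduction polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/*
 * Compute P and Q for one byte position across ndata data disks:
 *   P = D_0 ^ D_1 ^ ... ^ D_{n-1}
 *   Q = g^0*D_0 ^ g^1*D_1 ^ ... ^ g^{n-1}*D_{n-1}
 * Q is evaluated with Horner's rule, from the highest disk down.
 */
static void raid6_pq_byte(const uint8_t *d, size_t ndata,
			  uint8_t *p, uint8_t *q)
{
	uint8_t wp = 0, wq = 0;
	size_t i;

	for (i = ndata; i-- > 0; ) {
		wp ^= d[i];
		wq = gf_mul2(wq) ^ d[i];
	}
	*p = wp;
	*q = wq;
}
```

With ndata == 1 the loop runs once, so P and Q are both just a copy of
the data byte, which is why the three-disk case degenerates to a
three-way mirror.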



* RE: [PATCH v7 net-next 10/15] net: dsa: netc: introduce NXP NETC switch driver for i.MX94
From: Wei Fang @ 2026-05-15  3:36 UTC (permalink / raw)
  To: Claudiu Manoil, Vladimir Oltean, Clark Wang,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, robh@kernel.org,
	krzk+dt@kernel.org, conor+dt@kernel.org, f.fainelli@gmail.com,
	Frank Li, chleroy@kernel.org, horms@kernel.org,
	linux@armlinux.org.uk, maxime.chevallier@bootlin.com,
	andrew@lunn.ch, olteanv@gmail.com
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org, imx@lists.linux.dev
In-Reply-To: <20260513030454.1666570-11-wei.fang@nxp.com>

> diff --git a/drivers/net/dsa/netc/Kconfig b/drivers/net/dsa/netc/Kconfig
> new file mode 100644
> index 000000000000..0f246ac9e018
> --- /dev/null
> +++ b/drivers/net/dsa/netc/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NET_DSA_NETC_SWITCH
> +	tristate "NXP NETC Ethernet switch support"
> +	depends on ARM64 || COMPILE_TEST
> +	depends on NET_DSA && PCI
> +	select NET_DSA_TAG_NETC
> +	select FSL_ENETC_MDIO
> +	select NXP_NTMP
> +	select NXP_NETC_LIB
> +	help
> +	  This driver supports the NXP NETC Ethernet switch, which is embedded
> +	  as a PCIe function of the NXP NETC IP. But note that this driver is
> +	  is only available for NETC v4.3 and later versions.

Sashiko reported that there is a duplicated "is" in the help text. I will fix it in v8.

--
pw-bot: cr



* Re: [PATCH v2] powerpc/pseries/iommu: Add TCEs for 16GB pages when RAM is pre-mapped
From: Harsh Prateek Bora @ 2026-05-15  9:06 UTC (permalink / raw)
  To: Gaurav Batra, maddy; +Cc: linuxppc-dev, ritesh.list, sbhat, vaibhav, donettom
In-Reply-To: <20260514185448.34434-1-gbatra@linux.ibm.com>



On 15/05/26 12:24 am, Gaurav Batra wrote:
> In powerPC, if Dynamic DMA Window is big enough, RAM is pre-mapped. To
> determine the size of RAM, a PAPR+ property "ibm,lrdr-capacity" is used.
> This OF property dictates what is the max size of RAM an LPAR can have,
> including DR added memory.
> 
> In PowerPC, 16GB pages can be allocated at machine level and then
> assigned to LPARs. These 16GB pages are added to LPAR memory at the time
> of boot. The address range for these 16GB pages is above MAX RAM an LPAR
> can have (ibm,lrdr-capacity). In the current implementation, these 16GB
> pages are being excluded from pre-mapped TCEs. A driver can have DMA
> buffers allocated from 16GB pages. This results in platform to raise an
> EEH when DMA is attempted on buffers in 16GB memory range.
> 
> commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly
> adds TCEs for pmemory")
> 
> Prior to the above patch, memblock_end_of_DRAM() was being used to
> determine the MAX memory of an LPAR. This included 16GB pages as well.
> The issue with using memblock_end_of_DRAM() is that when pmemory is
> converted to RAM via daxctl command, the DDW engine will incorrectly try
> to add TCEs for pmemory as well.
> 
> Below is the address distribution of RAM, 16GB pages and pmemory for an
> LPAR with max memory of 256GB, memory allocated 64GB, 2 16GB pages and
> assigned pmemory of 8GB.
> 
> RANGE                                 SIZE  STATE REMOVABLE     BLOCK
> 0x0000000000000000-0x0000000fffffffff  64G online       yes     0-255
> 0x0000004000000000-0x00000047ffffffff  32G online       yes 1024-1151
> 
> cat /sys/bus/nd/devices/region0/resource
> 0x40100000000
> cat /sys/bus/nd/devices/region0/size
> 8589934592
> 
> The approach to fix this problem is to revert back the code changes
> introduced by the above patch and to stash away the MAX memory of an
> LPAR, including 16GB pages, at the LPAR boot time. This value is then
> used whenever TCEs are needed to be pre-mapped - enable_DDW() or,
> iommu_mem_notifier()
> 
> Fixes: 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory")
> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> ---
> 
> Change log:
> 
> V1 -> V2
> 
> 1. Harsh: Not only start_pfn, but end_pfn also needs to be within allowed
>     range, which may require clamping arg->nr_pages if crossing the limits.
> 
>     Response: Incorporated changes.
> 
> Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>

I think I mentioned this before as well. Please do not add tags unless
they were explicitly provided by the reviewer.

> 
>   arch/powerpc/platforms/pseries/iommu.c | 56 ++++++++++++++++++--------
>   1 file changed, 40 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 3e1f915fe4f6..fdb160b72938 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -69,6 +69,8 @@ static struct iommu_table *iommu_pseries_alloc_table(int node)
>   	return tbl;
>   }
>   
> +static phys_addr_t pseries_ddw_max_ram;
> +
>   #ifdef CONFIG_IOMMU_API
>   static struct iommu_table_group_ops spapr_tce_table_group_ops;
>   #endif
> @@ -1285,15 +1287,19 @@ static LIST_HEAD(failed_ddw_pdn_list);
>   
>   static phys_addr_t ddw_memory_hotplug_max(void)
>   {
> -	resource_size_t max_addr;
> +	resource_size_t max_addr = memory_hotplug_max();
> +	struct device_node *memory;
>   
> -#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> -	max_addr = hot_add_drconf_memory_max();
> -#else
> -	max_addr = memblock_end_of_DRAM();
> -#endif
> +	for_each_node_by_type(memory, "memory") {
> +		struct resource res;
> +
> +		if (of_address_to_resource(memory, 0, &res))
> +			continue;
> +
> +		max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> +		}

Indentation needs to be corrected above and below.

>   
> -	return max_addr;
> +		return max_addr;
>   }
>   
>   /*
> @@ -1446,7 +1452,7 @@ static struct property *ddw_property_create(const char *propname, u32 liobn, u64
>   static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 dma_mask)
>   {
>   	int len = 0, ret;
> -	int max_ram_len = order_base_2(ddw_memory_hotplug_max());
> +	int max_ram_len = order_base_2(pseries_ddw_max_ram);
>   	struct ddw_query_response query;
>   	struct ddw_create_response create;
>   	int page_shift;
> @@ -1668,7 +1674,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 dma_mas
>   
>   	if (direct_mapping) {
>   		/* DDW maps the whole partition, so enable direct DMA mapping */
> -		ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> +		ret = walk_system_ram_range(0, pseries_ddw_max_ram >> PAGE_SHIFT,
>   					    win64->value, tce_setrange_multi_pSeriesLP_walk);
>   		if (ret) {
>   			dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> @@ -2419,21 +2425,32 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>   {
>   	struct dma_win *window;
>   	struct memory_notify *arg = data;
> +	unsigned long limit = arg->nr_pages;
> +	unsigned long max_ram_pages = pseries_ddw_max_ram >> PAGE_SHIFT;
>   	int ret = 0;
>   
>   	/* This notifier can get called when onlining persistent memory as well.
>   	 * TCEs are not pre-mapped for persistent memory. Persistent memory will
> -	 * always be above ddw_memory_hotplug_max()
> +	 * always be above pseries_ddw_max_ram
>   	 */
> +	if (arg->start_pfn >= max_ram_pages)
> +		return NOTIFY_OK;
> +
> +	/* RAM is being DLPAR'ed. The range should never exceed max ram.
> +	 * Just in case, clamp the range and throw a warning.
> +	 */
> +	if (arg->start_pfn + limit > max_ram_pages) {
> +		limit = max_ram_pages - arg->start_pfn;
> +		WARN_ON(1);

WARN_ONCE with an appropriate warning message may be a better choice.
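
As a rough illustration of the suggestion, here is a userspace model of
the clamp with a message that is printed only once. The helper name and
the message text are hypothetical; in the kernel the one-shot part
would presumably be a single WARN_ONCE(cond, fmt, ...) call rather than
the static flag used here.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace model of the range clamp in the memory notifier.
 * Returns the number of pages to (un)map, or 0 when the range starts
 * at or above max_ram_pages and must be skipped entirely.
 */
static unsigned long clamp_nr_pages(unsigned long start_pfn,
				    unsigned long nr_pages,
				    unsigned long max_ram_pages)
{
	static bool warned;	/* stands in for WARN_ONCE()'s one-shot state */

	if (start_pfn >= max_ram_pages)
		return 0;

	/* Comparing against the remainder avoids overflow in start + nr. */
	if (nr_pages > max_ram_pages - start_pfn) {
		if (!warned) {
			warned = true;
			fprintf(stderr,
				"range pfn %lu + %lu pages exceeds max RAM pfn %lu, clamping\n",
				start_pfn, nr_pages, max_ram_pages);
		}
		nr_pages = max_ram_pages - start_pfn;
	}

	return nr_pages;
}
```

Returning the clamped count also makes it easy for the caller to log
when the returned value is smaller than the requested nr_pages.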

> +	}
>   
>   	switch (action) {
>   	case MEM_GOING_ONLINE:
>   		spin_lock(&dma_win_list_lock);
>   		list_for_each_entry(window, &dma_win_list, list) {
> -			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> -				ddw_memory_hotplug_max()) {
> +			if (window->direct) {
>   				ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> -						arg->nr_pages, window->prop);
> +						limit, window->prop);
>   			}
>   			/* XXX log error */

Replace comment with a log if limit < arg->nr_pages ?
Similarly below as well.

>   		}
> @@ -2443,10 +2460,9 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>   	case MEM_OFFLINE:
>   		spin_lock(&dma_win_list_lock);
>   		list_for_each_entry(window, &dma_win_list, list) {
> -			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> -				ddw_memory_hotplug_max()) {
> +			if (window->direct) {
>   				ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> -						arg->nr_pages, window->prop);
> +						limit, window->prop);
>   			}
>   			/* XXX log error */

^^^ Ditto.

Thanks
Harsh

>   		}
> @@ -2532,6 +2548,14 @@ void __init iommu_init_early_pSeries(void)
>   	register_memory_notifier(&iommu_mem_nb);
>   
>   	set_pci_dma_ops(&dma_iommu_ops);
> +
> +	/* During init determine the max memory an LPAR can have and set it. This
> +	 * will be used for pre-mapping RAM in DDW. memblock_end_of_DRAM() can
> +	 * change during the running of LPAR - daxctl can add pmemory as
> +	 * "system-ram". This memory range should not be pre-mapped in DDW since
> +	 * the address of pmemory can be much higher than the DDW size.
> +	 */
> +	pseries_ddw_max_ram = ddw_memory_hotplug_max();
>   }
>   
>   static int __init disable_multitce(char *str)
> 
> base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a




* Re: [PATCH net] net: wan: fsl_ucc_hdlc: free tx_skbuff in uhdlc_memclean
From: kernel test robot @ 2026-05-15  3:05 UTC (permalink / raw)
  To: Holger Brunck, netdev
  Cc: llvm, oe-kbuild-all, linuxppc-dev, andrew+netdev, chleroy,
	qiang.zhao, horms, Holger Brunck
In-Reply-To: <20260504161145.2217950-1-holger.brunck@hitachienergy.com>

Hi Holger,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]
[also build test ERROR on v7.1-rc3 next-20260508]
[cannot apply to net/main linus/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Holger-Brunck/net-wan-fsl_ucc_hdlc-free-tx_skbuff-in-uhdlc_memclean/20260514-055007
base:   net-next/main
patch link:    https://lore.kernel.org/r/20260504161145.2217950-1-holger.brunck%40hitachienergy.com
patch subject: [PATCH net] net: wan: fsl_ucc_hdlc: free tx_skbuff in uhdlc_memclean
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20260515/202605151029.MIApM8zq-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605151029.MIApM8zq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605151029.MIApM8zq-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

>> drivers/net/wan/fsl_ucc_hdlc.c:775:32: error: expected ';' in 'for' statement specifier
     775 |         for (i = 0; i < TX_BD_RING_LEN) {
         |                                       ^
>> drivers/net/wan/fsl_ucc_hdlc.c:775:14: warning: variable 'i' used in loop condition not modified in loop body [-Wfor-loop-analysis]
     775 |         for (i = 0; i < TX_BD_RING_LEN) {
         |                     ^
   1 warning and 1 error generated.


vim +775 drivers/net/wan/fsl_ucc_hdlc.c

   740	
   741	static void uhdlc_memclean(struct ucc_hdlc_private *priv)
   742	{
   743		int i;
   744	
   745		qe_muram_free(ioread16be(&priv->ucc_pram->riptr));
   746		qe_muram_free(ioread16be(&priv->ucc_pram->tiptr));
   747	
   748		if (priv->rx_bd_base) {
   749			dma_free_coherent(priv->dev,
   750					  RX_BD_RING_LEN * sizeof(struct qe_bd),
   751					  priv->rx_bd_base, priv->dma_rx_bd);
   752	
   753			priv->rx_bd_base = NULL;
   754			priv->dma_rx_bd = 0;
   755		}
   756	
   757		if (priv->tx_bd_base) {
   758			dma_free_coherent(priv->dev,
   759					  TX_BD_RING_LEN * sizeof(struct qe_bd),
   760					  priv->tx_bd_base, priv->dma_tx_bd);
   761	
   762			priv->tx_bd_base = NULL;
   763			priv->dma_tx_bd = 0;
   764		}
   765	
   766		if (priv->ucc_pram) {
   767			qe_muram_free(priv->ucc_pram_offset);
   768			priv->ucc_pram = NULL;
   769			priv->ucc_pram_offset = 0;
   770		 }
   771	
   772		kfree(priv->rx_skbuff);
   773		priv->rx_skbuff = NULL;
   774	
 > 775		for (i = 0; i < TX_BD_RING_LEN) {
   776			kfree(priv->tx_skbuff[i]);
   777			priv->tx_skbuff[i] = NULL;
   778		}
   779	
   780		kfree(priv->tx_skbuff);
   781		priv->tx_skbuff = NULL;
   782	
   783		if (priv->uccf) {
   784			ucc_fast_free(priv->uccf);
   785			priv->uccf = NULL;
   786		}
   787	
   788		if (priv->rx_buffer) {
   789			dma_free_coherent(priv->dev,
   790					  (RX_BD_RING_LEN + TX_BD_RING_LEN) * MAX_RX_BUF_LENGTH,
   791					  priv->rx_buffer, priv->dma_rx_addr);
   792			priv->rx_buffer = NULL;
   793			priv->dma_rx_addr = 0;
   794	
   795			priv->tx_buffer = NULL;
   796			priv->dma_tx_addr = 0;
   797	
   798		}
   799	}
   800	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH] tools/perf/test: Check for perf stat return code in perf all PMU test
From: Mi, Dapeng @ 2026-05-15  6:25 UTC (permalink / raw)
  To: Ian Rogers, Falcon, Thomas, Kleen, Andi
  Cc: atrajeev@linux.ibm.com, venkat88@linux.ibm.com,
	Shivani.Nittor@ibm.com, tmricht@linux.ibm.com,
	hbathini@linux.vnet.ibm.com, mpetlan@redhat.com,
	Tanushree.Shah@ibm.com, Hunter, Adrian,
	linux-perf-users@vger.kernel.org, maddy@linux.ibm.com, Chen, Zide,
	vmolnaro@redhat.com, Tejas.Manhas1@ibm.com,
	linuxppc-dev@lists.ozlabs.org, acme@kernel.org, jolsa@kernel.org,
	Mi, Dapeng1, namhyung@kernel.org
In-Reply-To: <ec1e04d3-4235-4593-80bb-270a86e6b01f@linux.intel.com>


On 4/7/2026 8:48 AM, Mi, Dapeng wrote:
> On 4/3/2026 11:39 PM, Ian Rogers wrote:
>> On Fri, Apr 3, 2026 at 12:36 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>> On 4/3/2026 1:32 AM, Falcon, Thomas wrote:
>>>> On Wed, 2026-04-01 at 13:40 -0700, Ian Rogers wrote:
>>>>> On Mon, Mar 23, 2026 at 3:40 AM Venkat <venkat88@linux.ibm.com>
>>>>> wrote:
>>>>>>> On 15 Mar 2026, at 4:27 PM, Athira Rajeev
>>>>>>> <atrajeev@linux.ibm.com> wrote:
>>>>>>>
>>>>>>> Currently in "perf all PMU test", for "perf stat -e <event>
>>>>>>> true",
>>>>>>> below checks are done:
>>>>>>> - if return code is zero, look for "not supported" to decide pass
>>>>>>>  scenario
>>>>>>> - check for "not supported" to ignore the event
>>>>>>> - looks for "No permission to enable" to skip the event.
>>>>>>> - If output has "Bad event name", fail the test.
>>>>>>> - Use "Access to performance monitoring and observability
>>>>>>> operations is
>>>>>>>  limited." to ignore fail due to access limitations
>>>>>>>
>>>>>>> If we failed to see event and it is supported, retries with
>>>>>>> longer
>>>>>>> workload "perf bench internals synthesize".
>>>>>>> - Here if output has <event>, the test is a pass.
>>>>>>>
>>>>>>> Snippet of code check:
>>>>>>>  ```
>>>>>>>  output=$(perf stat -e "$p" perf bench internals synthesize 2>&1)
>>>>>>>  if echo "$output" | grep -q "$p"
>>>>>>>  ```
>>>>>>> - if output doesn't have event printed in logs, considers it
>>>>>>> fail.
>>>>>>>
>>>>>>> But this results in false pass for events in some cases.
>>>>>>> Example, if perf stat fails as below:
>>>>>>>
>>>>>>> # ./perf stat -e pmu/event/  true
>>>>>>> event syntax error: 'pmu/event/'
>>>>>>>                     \___ Bad event or PMU
>>>>>>>
>>>>>>> Unable to find PMU or event on a PMU of 'pmu'
>>>>>>> Run 'perf list' for a list of valid events
>>>>>>>
>>>>>>>  Usage: perf stat [<options>] [<command>]
>>>>>>>
>>>>>>>    -e, --event <event>   event selector. use 'perf list' to list
>>>>>>> available events
>>>>>>> # echo $?
>>>>>>> 129
>>>>>>>
>>>>>>> Since this has non-zero return code and doesn't have the
>>>>>>> fail strings being checked in the test, it will enter check using
>>>>>>> longer workload. and since the output fail log has event, it
>>>>>>> declares test as "supported".
>>>>>>>
>>>>>>> Since all the fail strings can't be added in the check, update
>>>>>>> the testcase to check return code before proceeding to longer
>>>>>>> workload run.
>>>>>>>
>>>>>>> Another missing scenario is when system wide monitoring is
>>>>>>> supported
>>>>>>> example:
>>>>>>> # ./perf stat -e pmu/event/ true
>>>>>>> Error:
>>>>>>> No supported events found.
>>>>>>>  Unsupported event (pmu/event/H) in per-thread mode, enable
>>>>>>> system wide with '-a'.
>>>>>>>
>>>>>>> Update testcase to check with "perf stat -a -e $p" as well
>>>>>>>
>>>>>>> Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
>>>>>>> ---
>>>>>> Tested this patch.
>>>>>>
>>>>>>
>>>>>> With this patch:
>>>>>>
>>>>>> Testing hv_24x7/CPM_ADJUNCT_INST/ -- perf stat failed with non-zero
>>>>>> return code
>>>>>> Testing hv_24x7/CPM_ADJUNCT_PCYC/ -- perf stat failed with non-zero
>>>>>> return code
>>>>>>
>>>>>>
>>>>>>
>>>>>> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
>>>>> Testing on an Intel Alderlake the test is now failing:
>>>>> ```
>>>>> ...
>>>>> Testing offcore_requests_outstanding.l3_miss_demand_data_rd --
>>>>> supported
>>>>> Testing ocr.full_streaming_wr.any_response -- perf stat failed with
>>>>> non-zero return code
>>>>> Testing ocr.partial_streaming_wr.any_response -- perf stat failed
>>>>> with
>>>>> non-zero return code
>>>>> Testing ocr.streaming_wr.any_response -- supported
>>>>> ...
>>>>> ```
>>>>>
>>>>> Running `perf stat` manually reveals an issue with the event:
>>>>> ```
>>>>> $ sudo perf stat -vv -e ocr.full_streaming_wr.any_response -a sleep
>>>>> 1
>>>>> Using CPUID GenuineIntel-6-B7-1
>>>>> Attempt to add: cpu_atom/ocr.full_streaming_wr.any_response/
>>>>> ..after resolving event:
>>>>> cpu_atom/event=0xb7,period=0x186a3,umask=0x1,offcore_rsp=0x8000000100
>>>>> 00/
>>>>> ocr.full_streaming_wr.any_response ->
>>>>> cpu_atom/ocr.full_streaming_wr.any_response/
>>>>> Control descriptor is not initialized
>>>>> ------------------------------------------------------------
>>>>> perf_event_attr:
>>>>>  type                             10 (cpu_atom)
>>>>>  size                             144
>>>>> ------------------------------------------------------------
>>>>> perf_event_attr:
>>>>>  type                             0 (PERF_TYPE_HARDWARE)
>>>>>  config                           0xa00000000
>>>>> (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
>>>>>  disabled                         1
>>>>> ------------------------------------------------------------
>>>>> sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
>>>>> ------------------------------------------------------------
>>>>> perf_event_attr:
>>>>>  type                             0 (PERF_TYPE_HARDWARE)
>>>>>  config                           0x400000000
>>>>> (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
>>>>>  disabled                         1
>>>>> ------------------------------------------------------------
>>>>> sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
>>>>>  config                           0x1b7
>>>>> (ocr.demand_data_rd.l3_hit.snoop_hit_no_fwd)
>>>>>  sample_type                      IDENTIFIER
>>>>>  read_format
>>>>> TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>>>  disabled                         1
>>>>>  inherit                          1
>>>>>  { bp_addr, config1 }             0x800000010000
>>>>> ------------------------------------------------------------
>>>>> sys_perf_event_open: pid -1  cpu 16  group_fd -1  flags 0x8
>>>>> sys_perf_event_open failed, error -22
>>>>> switching off deferred callchain support
>>>>> Warning:
>>>>> ocr.full_streaming_wr.any_response event is not supported by the
>>>>> kernel.
>>>>> The sys_perf_event_open() syscall failed for event
>>>>> (ocr.full_streaming_wr.any_response): Invalid argument
>>>>> "dmesg | grep -i perf" may provide additional information.
>>>>>
>>>>> Error:
>>>>> No supported events found.
>>>>> The sys_perf_event_open() syscall failed for event
>>>>> (ocr.full_streaming_wr.any_response): Invalid argument
>>>>> "dmesg | grep -i perf" may provide additional information.
>>>>> ```
>>>>>
>>>>> This looks like a latent Intel cpu_atom PMU bug. Thomas, wdyt?
>>> Hmm, it looks the error is caused by the invalid bitmask of OFFCORE_RSP_x
>>> MSRs. Currently the valid bitmask of OFFCORE_RSP_x MSR is set to
>>> 0x3fffffffff in intel_grt_extra_regs[], while the msr value is set
>>> 0x800000010000 for the ocr.full_streaming_wr.any_response event. The bit 47
>>> is recognized an invalid bit and then abort the event creation.
>>>
>>> Base on the description "Table 21-56. MSR_OFFCORE_RSPx Request Type
>>> Definition" in SDM, bit 47 should be a valid bit now. Suppose bit 47 should
>>> not be a valid bit when adding the ADL PMU support, but it's updated and
>>> becomes valid later.
>>>
>>> Along with the constant updates of perf event lists
>>> (https://github.com/intel/perfmon), we have noticed there are mismatches
>>> more or less between the driver hardcoded events and perfmon event list.
>>> Currently we are summarizing the mismatches. Once these mismatches are
>>> finalized. we would submit a patchset to fix these mismatches.
>> That's great, if it takes too long perhaps we could just remove the
>> events for now.
> Suppose it won't be too long. I plan to post the patchset in next release
> cycle. The code changes are simple but need much time to verify on all
> kinds of platforms. Thanks.

The patch
(https://lore.kernel.org/all/20260515061143.338553-5-dapeng1.mi@linux.intel.com/)
would fix this issue. Thanks.
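
For readers following the thread, a minimal userspace sketch of the
failure mode discussed below, not the driver code itself: an extra
register value is rejected when it has any bit set outside the valid
mask. The function name is hypothetical; the mask (0x3fffffffff) and
the value (0x800000010000) are the ones quoted in this thread.

```c
#include <errno.h>
#include <stdint.h>

/* Reject any config1 bit that falls outside the register's valid mask. */
static int offcore_rsp_check(uint64_t config1, uint64_t valid_mask)
{
	if (config1 & ~valid_mask)
		return -EINVAL;	/* -22, matching the error in the quoted log */
	return 0;
}
```

With the quoted mask, bit 47 of 0x800000010000 lies outside the valid
range, so the check fails; once the mask includes bit 47 the same value
is accepted.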


>
>
>> Thanks,
>> Ian
>>
>>> Thanks.
>>>
>>>> +Dapeng, Zide, Andi
>>>>
>>>> Thanks,
>>>> Tom
>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> Regards,
>>>>>> Venkat.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> tools/perf/tests/shell/stat_all_pmu.sh | 20 ++++++++++++++++++++
>>>>>>> 1 file changed, 20 insertions(+)
>>>>>>>
>>>>>>> diff --git a/tools/perf/tests/shell/stat_all_pmu.sh
>>>>>>> b/tools/perf/tests/shell/stat_all_pmu.sh
>>>>>>> index 9c466c0efa85..6c4d59cbfa5f 100755
>>>>>>> --- a/tools/perf/tests/shell/stat_all_pmu.sh
>>>>>>> +++ b/tools/perf/tests/shell/stat_all_pmu.sh
>>>>>>> @@ -53,6 +53,26 @@ do
>>>>>>>     continue
>>>>>>>   fi
>>>>>>>
>>>>>>> +  # check with system wide if it is supported.
>>>>>>> +  output=$(perf stat -a -e "$p" true 2>&1)
>>>>>>> +  stat_result=$?
>>>>>>> +  if echo "$output" | grep -q "not supported"
>>>>>>> +  then
>>>>>>> +    # Event not supported, so ignore.
>>>>>>> +    echo "not supported"
>>>>>>> +    continue
>>>>>>> +  fi
>>>>>>> +
>>>>>>> +  # checked through possible access limitations and permissions.
>>>>>>> +  # At this step, non-zero return code from "perf stat" needs to
>>>>>>> +  # reported as fail for the user to investigate
>>>>>>> +  if [ $stat_result -ne 0 ]
>>>>>>> +  then
>>>>>>> +    echo "perf stat failed with non-zero return code"
>>>>>>> +    err=1
>>>>>>> +    continue
>>>>>>> +  fi
>>>>>>> +
>>>>>>>   # We failed to see the event and it is supported. Possibly the workload was
>>>>>>>   # too small so retry with something longer.
>>>>>>>   output=$(perf stat -e "$p" perf bench internals synthesize 2>&1)
>>>>>>> --
>>>>>>> 2.47.3
>>>>>>>


^ permalink raw reply

* Re: [mainline] powerpc/TM: Unexpected TM Bad Thing during core dump on POWER9 (7.1‑rc1)
From: Venkat Rao Bagalkote @ 2026-05-15  5:29 UTC (permalink / raw)
  To: linuxppc-dev, Madhavan Srinivasan; +Cc: LKML, Christophe Leroy
In-Reply-To: <364996ce-aba2-4213-8d20-7dd481b43fe6@linux.ibm.com>


On 28/04/26 5:48 pm, Venkat Rao Bagalkote wrote:
> Greetings!!
>
> IBM CI has reported a kernel crash while running 
> selftests/powerpc/signal on a POWER9 pSeries system.
>
> I attempted to reproduce this issue manually, but was not successful 
> so far.

I’m now able to reproduce this issue reliably and consistently. The 
reproduction steps are simple:

Run selftests/powerpc/signal


Regards,

Venkat.

>
> Below are the details of the crash as reported by CI.
>
> Crash Details:
>
> [ 9798.880148] Unexpected TM Bad Thing exception at c00000000000dbac
>                (msr 0x8000000302a03031) tm_scratch=800000010280b033
> [ 9798.880160] Oops: Unrecoverable exception, sig: 6 [#1]
> [ 9798.880165] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
> [ 9798.880173] Modules linked in: nvram(E) rpadlpar_io(E) rpaphp(E) ...
> [ 9798.880233] CPU: 8 UID: 0 PID: 1039530 Comm: sigfuz
>                Tainted: G            E       7.1.0-rc1 #1 PREEMPT
> [ 9798.880245] Hardware name: IBM,8375-42A POWER9 (architected)
>                hv:phyp pSeries
> [ 9798.880251] NIP:  c00000000000dbac LR: 0000000010001e58
> [ 9798.880262] MSR:  8000000302a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[SE]>
>
>
> NIP [c00000000000dbac] interrupt_return_srr_kernel+0x15c/0x18c
> Call Trace:
>  tm_reclaim_thread
>  flush_tmregs_to_thread
>  vsr_get
>  regset_get_alloc
>  fill_thread_core_info.isra.0
>  fill_note_info
>  elf_core_dump
>  coredump_write
>  do_coredump
>  vfs_coredump
>  get_signal
>  do_signal
>  do_notify_resume
>  interrupt_exit_user_prepare_main
>  interrupt_exit_user_prepare
>  interrupt_return_srr_user
>
>
> Please let me know if further details are required.
>
>
> If you happen to fix this issue, please add the tag below.
>
> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
>
>
>
> Regards,
>
> Venkat.
>
>


^ permalink raw reply

* Re: [PATCH 01/19] btrfs: require at least 4 devices for RAID 6
From: H. Peter Anvin @ 2026-05-14 19:57 UTC (permalink / raw)
  To: kreijack, Goffredo Baroncelli, Christoph Hellwig, David Sterba
  Cc: Andrew Morton, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Herbert Xu, Dan Williams, Chris Mason,
	David Sterba, Arnd Bergmann, Song Liu, Yu Kuai, Li Nan,
	linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
	linux-raid
In-Reply-To: <0a8d1ff4-f5a2-49e9-aa45-d25dbe4ded40@libero.it>

On May 14, 2026 12:51:59 PM PDT, Goffredo Baroncelli <kreijack@libero.it> wrote:
>On 13/05/2026 07.47, Christoph Hellwig wrote:
>> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:
>
>> 
>>> The degenerate modes of
>>> raid0, 5, or 6 are explicit as a possible middle step when converting
>>> profiles.  We can use a fallback implementation for this case if the
>>> accelerated implementations cannot do it.
>> 
>> This is not about a degenerated mode.  For a degenerated RAID 6, parity
>> generation uses the RAID 5 XOR routines as the second parity will be
>> missing.  This is about generating two parities for a single data disk,
>> which must be explicitly selected.
>> 
>
>I think David's concern is: "what happens to an already existing
>3-disk btrfs raid6 filesystem when the user upgrades the kernel?"
>(I am thinking of the case where a new BG needs to be allocated)...
>
>BR
>GB
>

That's what I'm saying – it should invoke the RAID-1 code under the covers (with 3 disks, D = P = Q).
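
A minimal sketch of why the single-data-disk case collapses to mirroring, assuming the usual RAID-6 definition of Q over GF(2^8) with generator {02} and reduction polynomial 0x11d. The names gf_mul2() and raid6_syndrome_sketch() are made up for this illustration and are not the kernel's raid6 API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator {02} in GF(2^8), reduction polynomial 0x11d. */
static inline uint8_t gf_mul2(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/*
 * P = D_0 ^ D_1 ^ ... and Q = D_0 ^ {02}*D_1 ^ {02}^2*D_2 ^ ...,
 * evaluated per byte with Horner's rule from the highest data disk down.
 */
static inline void raid6_syndrome_sketch(size_t ndata, size_t len,
					 const uint8_t *const *data,
					 uint8_t *p, uint8_t *q)
{
	assert(ndata > 0);

	for (size_t i = 0; i < len; i++) {
		uint8_t wp = 0, wq = 0;

		for (size_t d = ndata; d-- > 0; ) {
			wp ^= data[d][i];
			wq = (uint8_t)(gf_mul2(wq) ^ data[d][i]);
		}
		p[i] = wp;
		q[i] = wq;
	}
}
```

With ndata == 1 the inner loop runs once, so P = D and Q = {02}^0 * D = D, and all three devices end up holding the same bytes.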


^ permalink raw reply

* [powerpc:fixes-test] BUILD SUCCESS 31467b23823ffec1f6fff407f8e3ca9af8b7491a
From: kernel test robot @ 2026-05-14 18:54 UTC (permalink / raw)
  To: Madhavan Srinivasan; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: 31467b23823ffec1f6fff407f8e3ca9af8b7491a  powerpc/time: Remove redundant preempt_disable|enable() calls from arch_irq_work_raise()

elapsed time: 729m

configs tested: 197
configs skipped: 270

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-15.2.0
alpha                               defconfig    gcc-15.2.0
arc                              allmodconfig    clang-16
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    clang-23
arc                                 defconfig    gcc-15.2.0
arc                   randconfig-001-20260514    clang-23
arc                   randconfig-002-20260514    clang-23
arm                               allnoconfig    gcc-15.2.0
arm                              allyesconfig    clang-16
arm                                 defconfig    gcc-15.2.0
arm                   randconfig-001-20260514    clang-23
arm                   randconfig-002-20260514    clang-23
arm                   randconfig-003-20260514    clang-23
arm                   randconfig-004-20260514    clang-23
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-15.2.0
arm64                 randconfig-001-20260515    gcc-11.5.0
arm64                 randconfig-002-20260515    gcc-11.5.0
arm64                 randconfig-003-20260515    gcc-11.5.0
arm64                 randconfig-004-20260515    gcc-11.5.0
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-15.2.0
csky                  randconfig-001-20260515    gcc-11.5.0
csky                  randconfig-002-20260515    gcc-11.5.0
hexagon                          allmodconfig    gcc-15.2.0
hexagon                           allnoconfig    gcc-15.2.0
hexagon                             defconfig    gcc-15.2.0
hexagon               randconfig-001-20260514    gcc-10.5.0
hexagon               randconfig-002-20260514    gcc-10.5.0
i386                             allmodconfig    clang-20
i386                              allnoconfig    gcc-15.2.0
i386                             allyesconfig    clang-20
i386                 buildonly-randconfig-001    gcc-14
i386        buildonly-randconfig-001-20260514    gcc-14
i386                 buildonly-randconfig-002    gcc-14
i386        buildonly-randconfig-002-20260514    gcc-14
i386                 buildonly-randconfig-003    gcc-14
i386        buildonly-randconfig-003-20260514    gcc-14
i386                 buildonly-randconfig-004    gcc-14
i386        buildonly-randconfig-004-20260514    gcc-14
i386                 buildonly-randconfig-005    gcc-14
i386        buildonly-randconfig-005-20260514    gcc-14
i386                 buildonly-randconfig-006    gcc-14
i386        buildonly-randconfig-006-20260514    gcc-14
i386                                defconfig    gcc-15.2.0
i386                  randconfig-001-20260514    clang-20
i386                  randconfig-002-20260514    clang-20
i386                  randconfig-003-20260514    clang-20
i386                  randconfig-004-20260514    clang-20
i386                  randconfig-005-20260514    clang-20
i386                  randconfig-006-20260514    clang-20
i386                  randconfig-007-20260514    clang-20
i386                  randconfig-011-20260514    clang-20
i386                  randconfig-012-20260514    clang-20
i386                  randconfig-013-20260514    clang-20
i386                  randconfig-014-20260514    clang-20
i386                  randconfig-015-20260514    clang-20
i386                  randconfig-016-20260514    clang-20
i386                  randconfig-017-20260514    clang-20
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    gcc-15.2.0
loongarch                           defconfig    clang-19
loongarch             randconfig-001-20260514    gcc-10.5.0
loongarch             randconfig-002-20260514    gcc-10.5.0
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    clang-16
m68k                                defconfig    clang-19
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    clang-19
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
nios2                            allmodconfig    clang-23
nios2                             allnoconfig    clang-23
nios2                               defconfig    clang-19
nios2                 randconfig-001-20260514    gcc-10.5.0
nios2                 randconfig-002-20260514    gcc-10.5.0
openrisc                         allmodconfig    clang-23
openrisc                          allnoconfig    clang-23
openrisc                            defconfig    gcc-15.2.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    clang-23
parisc                           allyesconfig    clang-19
parisc                              defconfig    gcc-15.2.0
parisc                         randconfig-001    gcc-13.4.0
parisc                randconfig-001-20260514    gcc-13.4.0
parisc                         randconfig-002    gcc-13.4.0
parisc                randconfig-002-20260514    gcc-13.4.0
parisc64                            defconfig    clang-19
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    clang-23
powerpc                  mpc885_ads_defconfig    clang-23
powerpc                        randconfig-001    gcc-13.4.0
powerpc               randconfig-001-20260514    gcc-13.4.0
powerpc                        randconfig-002    gcc-13.4.0
powerpc               randconfig-002-20260514    gcc-13.4.0
powerpc                     tqm8541_defconfig    clang-23
powerpc64                      randconfig-001    gcc-13.4.0
powerpc64             randconfig-001-20260514    gcc-13.4.0
powerpc64                      randconfig-002    gcc-13.4.0
powerpc64             randconfig-002-20260514    gcc-13.4.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                            allyesconfig    clang-16
riscv                               defconfig    gcc-15.2.0
riscv                 randconfig-001-20260514    gcc-14.3.0
riscv                 randconfig-002-20260514    gcc-14.3.0
s390                             allmodconfig    clang-19
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    clang-23
s390                                defconfig    gcc-15.2.0
s390                  randconfig-001-20260514    gcc-14.3.0
s390                  randconfig-002-20260514    gcc-14.3.0
sh                               allmodconfig    gcc-15.2.0
sh                                allnoconfig    clang-23
sh                               allyesconfig    clang-19
sh                                  defconfig    gcc-14
sh                    randconfig-001-20260514    gcc-14.3.0
sh                    randconfig-002-20260514    gcc-14.3.0
sh                        sh7757lcr_defconfig    gcc-15.2.0
sparc                             allnoconfig    clang-23
sparc                               defconfig    gcc-15.2.0
sparc                          randconfig-001    gcc-15.2.0
sparc                 randconfig-001-20260514    gcc-15.2.0
sparc                          randconfig-002    gcc-15.2.0
sparc                 randconfig-002-20260514    gcc-15.2.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    gcc-14
sparc64                        randconfig-001    gcc-15.2.0
sparc64               randconfig-001-20260514    gcc-15.2.0
sparc64                        randconfig-002    gcc-15.2.0
sparc64               randconfig-002-20260514    gcc-15.2.0
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-15.2.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                             randconfig-001    gcc-15.2.0
um                    randconfig-001-20260514    gcc-15.2.0
um                             randconfig-002    gcc-15.2.0
um                    randconfig-002-20260514    gcc-15.2.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-20
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-20
x86_64               buildonly-randconfig-001    clang-20
x86_64      buildonly-randconfig-001-20260514    clang-20
x86_64               buildonly-randconfig-002    clang-20
x86_64      buildonly-randconfig-002-20260514    clang-20
x86_64               buildonly-randconfig-003    clang-20
x86_64      buildonly-randconfig-003-20260514    clang-20
x86_64               buildonly-randconfig-004    clang-20
x86_64      buildonly-randconfig-004-20260514    clang-20
x86_64               buildonly-randconfig-005    clang-20
x86_64      buildonly-randconfig-005-20260514    clang-20
x86_64               buildonly-randconfig-006    clang-20
x86_64      buildonly-randconfig-006-20260514    clang-20
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-20
x86_64                         randconfig-001    gcc-14
x86_64                randconfig-001-20260514    gcc-14
x86_64                         randconfig-002    gcc-14
x86_64                randconfig-002-20260514    gcc-14
x86_64                         randconfig-003    gcc-14
x86_64                randconfig-003-20260514    gcc-14
x86_64                         randconfig-004    gcc-14
x86_64                randconfig-004-20260514    gcc-14
x86_64                         randconfig-005    gcc-14
x86_64                randconfig-005-20260514    gcc-14
x86_64                         randconfig-006    gcc-14
x86_64                randconfig-006-20260514    gcc-14
x86_64                randconfig-071-20260514    clang-20
x86_64                randconfig-072-20260514    clang-20
x86_64                randconfig-073-20260514    clang-20
x86_64                randconfig-074-20260514    clang-20
x86_64                randconfig-075-20260514    clang-20
x86_64                randconfig-076-20260514    clang-20
x86_64                               rhel-9.4    clang-20
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-20
x86_64                    rhel-9.4-kselftests    clang-20
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-20
xtensa                            allnoconfig    clang-23
xtensa                           allyesconfig    clang-23
xtensa                         randconfig-001    gcc-15.2.0
xtensa                randconfig-001-20260514    gcc-15.2.0
xtensa                         randconfig-002    gcc-15.2.0
xtensa                randconfig-002-20260514    gcc-15.2.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply

* [PATCH v2] powerpc/pseries/iommu: Add TCEs for 16GB pages when RAM is pre-mapped
From: Gaurav Batra @ 2026-05-14 18:54 UTC (permalink / raw)
  To: maddy
  Cc: linuxppc-dev, ritesh.list, sbhat, vaibhav, donettom, harshpb,
	Gaurav Batra

On PowerPC, if the Dynamic DMA Window (DDW) is big enough, RAM is
pre-mapped. The size of RAM is determined from the PAPR+ property
"ibm,lrdr-capacity". This OF property gives the max size of RAM an LPAR
can have, including DR-added memory.

On PowerPC, 16GB pages can be allocated at the machine level and then
assigned to LPARs. These 16GB pages are added to LPAR memory at boot
time. Their address range lies above the max RAM an LPAR can have
(ibm,lrdr-capacity), so the current implementation excludes them from
the pre-mapped TCEs. A driver can have DMA buffers allocated from 16GB
pages, and the platform then raises an EEH when DMA is attempted on
buffers in the 16GB memory range.

Prior to commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier
incorrectly adds TCEs for pmemory"), memblock_end_of_DRAM() was used to
determine the max memory of an LPAR, which included the 16GB pages. The
problem with memblock_end_of_DRAM() is that when pmemory is converted to
RAM via the daxctl command, the DDW engine incorrectly tries to add TCEs
for pmemory as well.

The address layout of RAM, 16GB pages and pmemory is shown below for an
LPAR with a max memory of 256GB, 64GB of allocated memory, two 16GB pages
and 8GB of assigned pmemory.

RANGE                                 SIZE  STATE REMOVABLE     BLOCK
0x0000000000000000-0x0000000fffffffff  64G online       yes     0-255
0x0000004000000000-0x00000047ffffffff  32G online       yes 1024-1151

cat /sys/bus/nd/devices/region0/resource
0x40100000000
cat /sys/bus/nd/devices/region0/size
8589934592
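
For illustration only, the layout above can be checked with a small userspace sketch. The struct and helper names are hypothetical; the addresses are copied from the lsmem and sysfs output above, and 256GB is read as 256 GiB. The boot-time maximum is modelled here as the larger of ibm,lrdr-capacity and the end of the memory present at boot, which is an assumption about the intent rather than the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

struct range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* Values taken from the example LPAR in this commit message. */
static const uint64_t lrdr_capacity = 256 * GiB;
static const struct range huge16g = { 0x4000000000ULL, 0x4800000000ULL };
static const struct range pmem = { 0x40100000000ULL,
				   0x40100000000ULL + 8589934592ULL };

/* True if a pre-mapping that stops at limit covers the whole range. */
static inline bool covered_by(struct range r, uint64_t limit)
{
	return r.end <= limit;
}

static inline void check_example_layout(void)
{
	/* Assumed boot-time maximum: lrdr-capacity or end of boot memory. */
	uint64_t boot_max = huge16g.end > lrdr_capacity ? huge16g.end
							: lrdr_capacity;

	/* 16GB pages begin exactly at lrdr-capacity, so that limit misses them. */
	assert(huge16g.start == lrdr_capacity);
	assert(!covered_by(huge16g, lrdr_capacity));

	/* The boot-time maximum covers the 16GB pages but stays below pmemory. */
	assert(covered_by(huge16g, boot_max));
	assert(pmem.start >= boot_max);
}
```

In this example the 16GB pages occupy 0x4000000000-0x47ffffffff, directly above the 256GB limit, while pmemory starts at 0x40100000000, far above both.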

The fix is to revert the code changes introduced by the above commit and
to stash away the max memory of an LPAR, including 16GB pages, at LPAR
boot time. This value is then used whenever TCEs need to be pre-mapped,
that is, in enable_ddw() and iommu_mem_notifier().

Fixes: 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory")
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
---

Change log:

V1 -> V2

1. Harsh: Not only start_pfn, but end_pfn also needs to be within allowed
   range, which may require clamping arg->nr_pages if crossing the limits.

   Response: Incorporated changes.
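
   The clamping requested in review can be sketched in isolation as
   follows. This is a userspace illustration, not the code in the patch:
   clamp_pfn_range() is a hypothetical helper, and the comparison is
   written as a subtraction so that start_pfn + nr_pages cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Clamp [start_pfn, start_pfn + *nr_pages) to [0, max_ram_pages).
 * Returns false when the range lies entirely at or above the limit,
 * in which case *nr_pages is left untouched.
 */
static inline bool clamp_pfn_range(unsigned long start_pfn,
				   unsigned long *nr_pages,
				   unsigned long max_ram_pages)
{
	assert(nr_pages);

	if (start_pfn >= max_ram_pages)
		return false;

	/* max_ram_pages - start_pfn is safe: start_pfn < max_ram_pages here. */
	if (*nr_pages > max_ram_pages - start_pfn)
		*nr_pages = max_ram_pages - start_pfn;

	return true;
}
```

   A range of 16 pages starting at PFN 90 with a limit of 100 pages is
   trimmed to 10 pages; a range starting at PFN 100 is skipped entirely.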

Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>

 arch/powerpc/platforms/pseries/iommu.c | 56 ++++++++++++++++++--------
 1 file changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 3e1f915fe4f6..fdb160b72938 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -69,6 +69,8 @@ static struct iommu_table *iommu_pseries_alloc_table(int node)
 	return tbl;
 }
 
+static phys_addr_t pseries_ddw_max_ram;
+
 #ifdef CONFIG_IOMMU_API
 static struct iommu_table_group_ops spapr_tce_table_group_ops;
 #endif
@@ -1285,15 +1287,19 @@ static LIST_HEAD(failed_ddw_pdn_list);
 
 static phys_addr_t ddw_memory_hotplug_max(void)
 {
-	resource_size_t max_addr;
+	resource_size_t max_addr = memory_hotplug_max();
+	struct device_node *memory;
 
-#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
-	max_addr = hot_add_drconf_memory_max();
-#else
-	max_addr = memblock_end_of_DRAM();
-#endif
+	for_each_node_by_type(memory, "memory") {
+		struct resource res;
+
+		if (of_address_to_resource(memory, 0, &res))
+			continue;
+
+		max_addr = max_t(resource_size_t, max_addr, res.end + 1);
+	}
 
-	return max_addr;
+	return max_addr;
 }
 
 /*
@@ -1446,7 +1452,7 @@ static struct property *ddw_property_create(const char *propname, u32 liobn, u64
 static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 dma_mask)
 {
 	int len = 0, ret;
-	int max_ram_len = order_base_2(ddw_memory_hotplug_max());
+	int max_ram_len = order_base_2(pseries_ddw_max_ram);
 	struct ddw_query_response query;
 	struct ddw_create_response create;
 	int page_shift;
@@ -1668,7 +1674,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 dma_mas
 
 	if (direct_mapping) {
 		/* DDW maps the whole partition, so enable direct DMA mapping */
-		ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
+		ret = walk_system_ram_range(0, pseries_ddw_max_ram >> PAGE_SHIFT,
 					    win64->value, tce_setrange_multi_pSeriesLP_walk);
 		if (ret) {
 			dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
@@ -2419,21 +2425,32 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
 {
 	struct dma_win *window;
 	struct memory_notify *arg = data;
+	unsigned long limit = arg->nr_pages;
+	unsigned long max_ram_pages = pseries_ddw_max_ram >> PAGE_SHIFT;
 	int ret = 0;
 
 	/* This notifier can get called when onlining persistent memory as well.
 	 * TCEs are not pre-mapped for persistent memory. Persistent memory will
-	 * always be above ddw_memory_hotplug_max()
+	 * always be above pseries_ddw_max_ram
 	 */
+	if (arg->start_pfn >= max_ram_pages)
+		return NOTIFY_OK;
+
+	/* RAM is being DLPAR'ed. The range should never exceed max ram.
+	 * Just in case, clamp the range and throw a warning.
+	 */
+	if (arg->start_pfn + limit > max_ram_pages) {
+		limit = max_ram_pages - arg->start_pfn;
+		WARN_ON(1);
+	}
 
 	switch (action) {
 	case MEM_GOING_ONLINE:
 		spin_lock(&dma_win_list_lock);
 		list_for_each_entry(window, &dma_win_list, list) {
-			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
-				ddw_memory_hotplug_max()) {
+			if (window->direct) {
 				ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
-						arg->nr_pages, window->prop);
+						limit, window->prop);
 			}
 			/* XXX log error */
 		}
@@ -2443,10 +2460,9 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
 	case MEM_OFFLINE:
 		spin_lock(&dma_win_list_lock);
 		list_for_each_entry(window, &dma_win_list, list) {
-			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
-				ddw_memory_hotplug_max()) {
+			if (window->direct) {
 				ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
-						arg->nr_pages, window->prop);
+						limit, window->prop);
 			}
 			/* XXX log error */
 		}
@@ -2532,6 +2548,14 @@ void __init iommu_init_early_pSeries(void)
 	register_memory_notifier(&iommu_mem_nb);
 
 	set_pci_dma_ops(&dma_iommu_ops);
+
+	/* During init determine the max memory an LPAR can have and set it. This
+	 * will be used for pre-mapping RAM in DDW. memblock_end_of_DRAM() can
+	 * change during the running of LPAR - daxctl can add pmemory as
+	 * "system-ram". This memory range should not be pre-mapped in DDW since
+	 * the address of pmemory can be much higher than the DDW size.
+	 */
+	pseries_ddw_max_ram = ddw_memory_hotplug_max();
 }
 
 static int __init disable_multitce(char *str)

base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a
-- 
2.39.3



^ permalink raw reply related

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-14 14:43 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agXaby-7L7yS3Vva@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Thu, May 14, 2026 at 06:18:05PM +0530, Aneesh Kumar K.V wrote:
>> Mostafa Saleh <smostafa@google.com> writes:
>> 
>> > On Thu, May 14, 2026 at 11:24:42AM +0530, Aneesh Kumar K.V wrote:
>> >> Mostafa Saleh <smostafa@google.com> writes:
>> >> 
>> >> > On Tue, May 12, 2026 at 02:33:59PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> >> >> Teach swiotlb to distinguish between encrypted and decrypted bounce
>> >> >> buffer pools, and make allocation and mapping paths select a pool whose
>> >> >> state matches the requested DMA attributes.
>> >> >> 
>> >> >> Add a decrypted flag to io_tlb_mem, initialize it for the default and
>> >> >> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
>> >> >> allocation. Reject swiotlb alloc/map requests when the selected pool does
>> >> >> not match the required encrypted/decrypted state.
>> >> >> 
>> >> >> Also return DMA addresses with the matching phys_to_dma_{encrypted,
>> >> >> unencrypted} helper so the DMA address encoding stays consistent with the
>> >> >> chosen pool.
>> >> >> 
>> >> >> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> >> >> ---
>> >> >>  include/linux/dma-direct.h |  10 ++++
>> >> >>  include/linux/swiotlb.h    |   8 ++-
>> >> >>  kernel/dma/direct.c        |  14 +++--
>> >> >>  kernel/dma/swiotlb.c       | 108 +++++++++++++++++++++++++++----------
>> >> >>  4 files changed, 107 insertions(+), 33 deletions(-)
>> >> >> 
>> >> >> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
>> >> >> index c249912456f9..94fad4e7c11e 100644
>> >> >> --- a/include/linux/dma-direct.h
>> >> >> +++ b/include/linux/dma-direct.h
>> >> >> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
>> >> >>  #ifndef phys_to_dma_unencrypted
>> >> >>  #define phys_to_dma_unencrypted		phys_to_dma
>> >> >>  #endif
>> >> >> +
>> >> >> +#ifndef phys_to_dma_encrypted
>> >> >> +#define phys_to_dma_encrypted		phys_to_dma
>> >> >> +#endif
>> >> >>  #else
>> >> >>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>> >> >>  {
>> >> >> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
>> >> >>  {
>> >> >>  	return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
>> >> >>  }
>> >> >> +
>> >> >> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
>> >> >> +		phys_addr_t paddr)
>> >> >> +{
>> >> >> +	return dma_addr_encrypted(__phys_to_dma(dev, paddr));
>> >> >> +}
>> >> >>  /*
>> >> >>   * If memory encryption is supported, phys_to_dma will set the memory encryption
>> >> >>   * bit in the DMA address, and dma_to_phys will clear it.
>> >> >> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
>> >> >> index 3dae0f592063..b3fa3c6e0169 100644
>> >> >> --- a/include/linux/swiotlb.h
>> >> >> +++ b/include/linux/swiotlb.h
>> >> >> @@ -81,6 +81,7 @@ struct io_tlb_pool {
>> >> >>  	struct list_head node;
>> >> >>  	struct rcu_head rcu;
>> >> >>  	bool transient;
>> >> >> +	bool unencrypted;
>> >> >>  #endif
>> >> >>  };
>> >> >>  
>> >> >> @@ -111,6 +112,7 @@ struct io_tlb_mem {
>> >> >>  	struct dentry *debugfs;
>> >> >>  	bool force_bounce;
>> >> >>  	bool for_alloc;
>> >> >> +	bool unencrypted;
>> >> >>  #ifdef CONFIG_SWIOTLB_DYNAMIC
>> >> >>  	bool can_grow;
>> >> >>  	u64 phys_limit;
>> >> >> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
>> >> >>  extern void swiotlb_print_info(void);
>> >> >>  
>> >> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> >> >> -struct page *swiotlb_alloc(struct device *dev, size_t size);
>> >> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> >> +		unsigned long attrs);
>> >> >>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>> >> >>  
>> >> >>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>> >> >> @@ -290,7 +293,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>> >> >>  	return dev->dma_io_tlb_mem->for_alloc;
>> >> >>  }
>> >> >>  #else
>> >> >> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> >> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> >> +		unsigned long attrs)
>> >> >>  {
>> >> >>  	return NULL;
>> >> >>  }
>> >> >> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> >> >> index dc2907439b3d..97ae4fa10521 100644
>> >> >> --- a/kernel/dma/direct.c
>> >> >> +++ b/kernel/dma/direct.c
>> >> >> @@ -104,9 +104,10 @@ static void __dma_direct_free_pages(struct device *dev, struct page *page,
>> >> >>  	dma_free_contiguous(dev, page, size);
>> >> >>  }
>> >> >>  
>> >> >> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
>> >> >> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
>> >> >> +		unsigned long attrs)
>> >> >>  {
>> >> >> -	struct page *page = swiotlb_alloc(dev, size);
>> >> >> +	struct page *page = swiotlb_alloc(dev, size, attrs);
>> >> >>  
>> >> >>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
>> >> >>  		swiotlb_free(dev, page, size);
>> >> >> @@ -266,8 +267,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>> >> >>  						  gfp, attrs);
>> >> >>  
>> >> >>  	if (is_swiotlb_for_alloc(dev)) {
>> >> >> -		page = dma_direct_alloc_swiotlb(dev, size);
>> >> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> >> >>  		if (page) {
>> >> >> +			/*
>> >> >> +			 * swiotlb allocations come from a pool already marked
>> >> >> +			 * decrypted
>> >> >> +			 */
>> >> >>  			mark_mem_decrypt = false;
>> >> >>  			goto setup_page;
>> >> >>  		}
>> >> >> @@ -374,6 +379,7 @@ void dma_direct_free(struct device *dev, size_t size,
>> >> >>  		return;
>> >> >>  
>> >> >>  	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
>> >> >> +		/* Swiotlb doesn't need a page attribute update on free */
>> >> >>  		mark_mem_encrypted = false;
>> >> >>  
>> >> >>  	if (is_vmalloc_addr(cpu_addr)) {
>> >> >> @@ -403,7 +409,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>> >> >>  						  gfp, attrs);
>> >> >>  
>> >> >>  	if (is_swiotlb_for_alloc(dev)) {
>> >> >> -		page = dma_direct_alloc_swiotlb(dev, size);
>> >> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> >> >>  		if (!page)
>> >> >>  			return NULL;
>> >> >>  
>> >> >> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> >> >> index ab4eccbaa076..065663be282c 100644
>> >> >> --- a/kernel/dma/swiotlb.c
>> >> >> +++ b/kernel/dma/swiotlb.c
>> >> >> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
>> >> >>  	struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
>> >> >>  	unsigned long bytes;
>> >> >>  
>> >> >> +	/*
>> >> >> +	 * If the platform supports memory encryption, swiotlb buffers are
>> >> >> +	 * decrypted by default.
>> >> >> +	 */
>> >> >> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> >> +		io_tlb_default_mem.unencrypted = true;
>> >> >> +	else
>> >> >> +		io_tlb_default_mem.unencrypted = false;
>> >> >> +
>> >> >>  	if (!mem->nslabs || mem->late_alloc)
>> >> >>  		return;
>> >> >>  	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
>> >> >> -	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> >> >> +
>> >> >> +	if (io_tlb_default_mem.unencrypted)
>> >> >> +		set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> >> >>  }
>> >> >>  
>> >> >>  static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
>> >> >> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>> >> >>  	if (!mem->slots)
>> >> >>  		goto error_slots;
>> >> >>  
>> >> >> -	set_memory_decrypted((unsigned long)vstart,
>> >> >> -			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> >> >> +	if (io_tlb_default_mem.unencrypted)
>> >> >> +		set_memory_decrypted((unsigned long)vstart,
>> >> >> +				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> >> >> +
>> >> >>  	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
>> >> >>  				 nareas);
>> >> >>  	add_mem_pool(&io_tlb_default_mem, mem);
>> >> >> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
>> >> >>  	tbl_size = PAGE_ALIGN(mem->end - mem->start);
>> >> >>  	slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>> >> >>  
>> >> >> -	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> >> >> +	if (io_tlb_default_mem.unencrypted)
>> >> >> +		set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> >> >> +
>> >> >>  	if (mem->late_alloc) {
>> >> >>  		area_order = get_order(array_size(sizeof(*mem->areas),
>> >> >>  			mem->nareas));
>> >> >> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
>> >> >>   * @gfp:	GFP flags for the allocation.
>> >> >>   * @bytes:	Size of the buffer.
>> >> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> >> >> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
>> >> >>   *
>> >> >>   * Allocate pages from the buddy allocator. If successful, make the allocated
>> >> >>   * pages decrypted that they can be used for DMA.
>> >> >> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
>> >> >>   * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
>> >> >>   * if the allocated physical address was above @phys_limit.
>> >> >>   */
>> >> >> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >> >> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
>> >> >> +		u64 phys_limit, bool unencrypted)
>> >> >>  {
>> >> >>  	unsigned int order = get_order(bytes);
>> >> >>  	struct page *page;
>> >> >> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >> >>  	}
>> >> >>  
>> >> >>  	vaddr = phys_to_virt(paddr);
>> >> >> -	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> >> +	if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> >>  		goto error;
>> >> >>  	return page;
>> >> >>  
>> >> >>  error:
>> >> >>  	/* Intentional leak if pages cannot be encrypted again. */
>> >> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> >> +	if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> >>  		__free_pages(page, order);
>> >> >>  	return NULL;
>> >> >>  }
>> >> >> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >> >>   * @dev:	Device for which a memory pool is allocated.
>> >> >>   * @bytes:	Size of the buffer.
>> >> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> >> >> + * @attrs:	DMA attributes for the allocation.
>> >> >>   * @gfp:	GFP flags for the allocation.
>> >> >>   *
>> >> >>   * Return: Allocated pages, or %NULL on allocation failure.
>> >> >>   */
>> >> >>  static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >> >> -		u64 phys_limit, gfp_t gfp)
>> >> >> +		u64 phys_limit, unsigned long attrs, gfp_t gfp)
>> >> >>  {
>> >> >>  	struct page *page;
>> >> >> -	unsigned long attrs = 0;
>> >> >>  
>> >> >>  	/*
>> >> >>  	 * Allocate from the atomic pools if memory is encrypted and
>> >> >>  	 * the allocation is atomic, because decrypting may block.
>> >> >>  	 */
>> >> >> -	if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
>> >> >> +	if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
>> >> >>  		void *vaddr;
>> >> >>  
>> >> >>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
>> >> >>  			return NULL;
>> >> >>  
>> >> >> -		/* swiotlb considered decrypted by default */
>> >> >> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> >> -			attrs = DMA_ATTR_CC_SHARED;
>> >> >> -
>> >> >>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
>> >> >>  					   attrs, dma_coherent_ok);
>> >> >>  	}
>> >> >> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >> >>  	else if (phys_limit <= DMA_BIT_MASK(32))
>> >> >>  		gfp |= __GFP_DMA32;
>> >> >>  
>> >> >> -	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
>> >> >> +	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
>> >> >> +					     !!(attrs & DMA_ATTR_CC_SHARED)))) {
>> >> >>  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
>> >> >>  		    phys_limit < DMA_BIT_MASK(64) &&
>> >> >>  		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
>> >> >> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >> >>   * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
>> >> >>   * @vaddr:	Virtual address of the buffer.
>> >> >>   * @bytes:	Size of the buffer.
>> >> >> + * @unencrypted: true if @vaddr was allocated decrypted and must be
>> >> >> + *	re-encrypted before being freed
>> >> >>   */
>> >> >> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >> >> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
>> >> >>  {
>> >> >>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>> >> >>  	    dma_free_from_pool(NULL, vaddr, bytes))
>> >> >>  		return;
>> >> >>  
>> >> >>  	/* Intentional leak if pages cannot be encrypted again. */
>> >> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> >> +	if (!unencrypted ||
>> >> >> +	    !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> >>  		__free_pages(virt_to_page(vaddr), get_order(bytes));
>> >> >>  }
>> >> >>  
>> >> >> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >> >>   * @nslabs:	Desired (maximum) number of slabs.
>> >> >>   * @nareas:	Number of areas.
>> >> >>   * @phys_limit:	Maximum DMA buffer physical address.
>> >> >> + * @attrs:	DMA attributes for the allocation.
>> >> >>   * @gfp:	GFP flags for the allocations.
>> >> >>   *
>> >> >>   * Allocate and initialize a new IO TLB memory pool. The actual number of
>> >> >> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >> >>   */
>> >> >>  static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >> >>  		unsigned long minslabs, unsigned long nslabs,
>> >> >> -		unsigned int nareas, u64 phys_limit, gfp_t gfp)
>> >> >> +		unsigned int nareas, u64 phys_limit, unsigned long attrs,
>> >> >> +		gfp_t gfp)
>> >> >>  {
>> >> >>  	struct io_tlb_pool *pool;
>> >> >>  	unsigned int slot_order;
>> >> >> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >> >>  	if (!pool)
>> >> >>  		goto error;
>> >> >>  	pool->areas = (void *)pool + sizeof(*pool);
>> >> >> +	pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>> >> >>  
>> >> >>  	tlb_size = nslabs << IO_TLB_SHIFT;
>> >> >> -	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
>> >> >> +	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
>> >> >>  		if (nslabs <= minslabs)
>> >> >>  			goto error_tlb;
>> >> >>  		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
>> >> >> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >> >>  	return pool;
>> >> >>  
>> >> >>  error_slots:
>> >> >> -	swiotlb_free_tlb(page_address(tlb), tlb_size);
>> >> >> +	swiotlb_free_tlb(page_address(tlb), tlb_size,
>> >> >> +			 !!(attrs & DMA_ATTR_CC_SHARED));
>> >> >>  error_tlb:
>> >> >>  	kfree(pool);
>> >> >>  error:
>> >> >> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
>> >> >>  	struct io_tlb_pool *pool;
>> >> >>  
>> >> >>  	pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
>> >> >> -				  default_nareas, mem->phys_limit, GFP_KERNEL);
>> >> >> +				  default_nareas, mem->phys_limit,
>> >> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> >> >> +				  GFP_KERNEL);
>> >> >>  	if (!pool) {
>> >> >>  		pr_warn_ratelimited("Failed to allocate new pool");
>> >> >>  		return;
>> >> >> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
>> >> >>  	size_t tlb_size = pool->end - pool->start;
>> >> >>  
>> >> >>  	free_pages((unsigned long)pool->slots, get_order(slots_size));
>> >> >> -	swiotlb_free_tlb(pool->vaddr, tlb_size);
>> >> >> +	swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
>> >> >>  	kfree(pool);
>> >> >>  }
>> >> >>  
>> >> >> @@ -1232,6 +1255,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>> >> >>  	nslabs = nr_slots(alloc_size);
>> >> >>  	phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
>> >> >>  	pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
>> >> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> >> >>  				  GFP_NOWAIT);
>> >> >>  	if (!pool)
>> >> >>  		return -1;
>> >> >> @@ -1394,6 +1418,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>> >> >>  		enum dma_data_direction dir, unsigned long attrs)
>> >> >>  {
>> >> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> >> >> +	bool require_decrypted = false;
>> >> >>  	unsigned int offset;
>> >> >>  	struct io_tlb_pool *pool;
>> >> >>  	unsigned int i;
>> >> >> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>> >> >>  	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> >>  		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>> >> >>  
>> >> >> +	/*
>> >> >> +	 * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
>> >> >> +	 * but the device is forcing decryption, use decrypted io_tlb_mem
>> >> >> +	 */
>> >> >> +	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>> >> >> +		require_decrypted = true;
>> >> >> +
>> >> >> +	if (require_decrypted != mem->unencrypted)
>> >> >> +		return (phys_addr_t)DMA_MAPPING_ERROR;
>> >> >> +
>> >> >>  	/*
>> >> >>  	 * The default swiotlb memory pool is allocated with PAGE_SIZE
>> >> >>  	 * alignment. If a mapping is requested with larger alignment,
>> >> >> @@ -1608,8 +1643,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>> >> >>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
>> >> >>  		return DMA_MAPPING_ERROR;
>> >> >>  
>> >> >> -	/* Ensure that the address returned is DMA'ble */
>> >> >> -	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> >> >> +	/*
>> >> >> +	 * Use the allocated io_tlb_mem encryption type to determine dma addr.
>> >> >> +	 */
>> >> >> +	if (dev->dma_io_tlb_mem->unencrypted)
>> >> >> +		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> >> >> +	else
>> >> >> +		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
>> >> >> +
>> >> >>  	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
>> >> >>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
>> >> >>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
>> >> >> @@ -1773,7 +1814,8 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>> >> >>  
>> >> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> >> >>  
>> >> >> -struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> >> +		unsigned long attrs)
>> >> >>  {
>> >> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> >> >>  	struct io_tlb_pool *pool;
>> >> >> @@ -1784,6 +1826,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> >>  	if (!mem)
>> >> >>  		return NULL;
>> >> >>  
>> >> >> +	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
>> >> >> +		return NULL;
>> >> >> +
>> >> >>  	align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
>> >> >>  	index = swiotlb_find_slots(dev, 0, size, align, &pool);
>> >> >>  	if (index == -1)
>> >> >> @@ -1853,9 +1898,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>> >> >>  			kfree(mem);
>> >> >>  			return -ENOMEM;
>> >> >>  		}
>> >> >> +		/*
>> >> >> +		 * if platform supports memory encryption,
>> >> >> +		 * restricted mem pool is decrypted by default
>> >> >> +		 */
>> >> >> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> >> >> +			mem->unencrypted = true;
>> >> >> +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
>> >> >> +					     rmem->size >> PAGE_SHIFT);
>> >> >> +		} else {
>> >> >> +			mem->unencrypted = false;
>> >> >> +		}
>> >> >
>> >> > This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
>> >> > traffic now fails.
>> >> >
>> >> > Also, by design, some drivers are clueless about bouncing, so
>> >> > I believe that the pool should have a way to control its property
>> >> > (encrypted or decrypted) and that takes priority over whatever
>> >> > attributes come from allocation.
>> >> > And that brings us to the same point: whether it’s better to return
>> >> > the memory along with its state or to pass the requested state.
>> >> > I think for other cases it’s fine for the device/DMA-API to dictate
>> >> > the attrs, but not in the restricted-dma case, where the firmware just knows better.
>> >> >
>> >> 
>> >> Is it that the pKVM guest kernel does not have awareness of
>> >> encrypted/decrypted DMA allocations? Instead, the firmware attaches
>> >> hypervisor-shared pages to the device via restricted-dma-pool? The
>> >> kernel then has swiotlb->for_alloc = true, and hence all DMA allocations
>> >> go through the restricted-dma-pool?
>> >
>> > Yes.
>> >
>> >> 
>> >> Given that pKVM supports pkvm_set_memory_encrypted() and
>> >> pkvm_set_memory_decrypted(), can we consider adding CC_ATTR_MEM_ENCRYPT
>> >> support to pKVM? It would also be good to investigate whether we can set
>> >> force_dma_unencrypted(dev) to true where needed.
>> >
>> > I was looking into that, but it didn't work because
>> > force_dma_unencrypted() is broken with restricted-dma due to the
>> > double decryption issue; that's when I sent my first series [1].
>> >
>> > Maybe we should land some basic fixes for that path so we can
>> > convert pKVM, then we do the full rework.
>> >
>> > I will revive my old work and see if I can send an RFC.
>> >
>> > [1] https://lore.kernel.org/all/20260305170335.963568-1-smostafa@google.com/
>> >
>> 
>> With this series, can you check whether the only change needed is
>> something like the following?
>> 
>> modified   kernel/dma/swiotlb.c
>> @@ -1905,7 +1905,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>>  		 * if platform supports memory encryption,
>>  		 * restricted mem pool is decrypted by default
>>  		 */
>> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> +		//if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> +		if (true) {
>>  			mem->unencrypted = true;
>>  			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
>>  					     rmem->size >> PAGE_SHIFT);
>
> Yes, that boots, but I will need to do more tests.
>
>> 
>> >
>> >> 
>> >> I agree that this patch, as it stands, can break pKVM because we are now
>> >> missing the set_memory_decrypted() call required for pKVM to work.
>> >> 
>> >> We now mark the swiotlb io_tlb_mem as unencrypted/encrypted in the guest
>> >> using struct io_tlb_mem->unencrypted. I am not clear what we can use for
>> >> pKVM to conditionalize this so that it works for both protected and
>> >> unprotected guests.
>> >
>> > There is no problem with non-protected guests as they don't use memory
>> > encryption; my initial thought was that the encrypted/decrypted state is
>> > a per-pool property which is decided by FW (device-tree).
>> >
>> 
>> What I meant was that we need a generic way to identify a pKVM guest, so
>> that we can use it in the conditional above.
>
> I have this patch; with it I can boot with your series unmodified,
> but I will need to do more testing.
>

Thanks, I can add this to the series once you complete the required testing.

>
> From d795b4c4ee2437587616b2b342e9996afe6d6680 Mon Sep 17 00:00:00 2001
> From: Mostafa Saleh <smostafa@google.com>
> Date: Thu, 14 May 2026 13:46:15 +0000
> Subject: [PATCH] arm64/coco: Add pKVM as a CC platform
>
> pKVM does support memory encryption, expose that to the rest of
> the kernel through cc_platform_has()
>
> At the moment, all devices inside the guest are emulated which
> requires its memory to be shared back to the host (decrypted), so
> set force_dma_unencrypted() to always return true.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
>  arch/arm64/include/asm/hypervisor.h           |  6 ++++++
>  arch/arm64/include/asm/mem_encrypt.h          |  3 ++-
>  arch/arm64/kernel/rsi.c                       | 12 ------------
>  arch/arm64/mm/init.c                          | 13 +++++++++++++
>  drivers/virt/coco/pkvm-guest/arm-pkvm-guest.c |  5 +++++
>  5 files changed, 26 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/include/asm/hypervisor.h b/arch/arm64/include/asm/hypervisor.h
> index a12fd897c877..1b0e15f290be 100644
> --- a/arch/arm64/include/asm/hypervisor.h
> +++ b/arch/arm64/include/asm/hypervisor.h
> @@ -10,8 +10,14 @@ void kvm_arm_target_impl_cpu_init(void);
>
>  #ifdef CONFIG_ARM_PKVM_GUEST
>  void pkvm_init_hyp_services(void);
> +bool is_protected_kvm_guest(void);
>  #else
>  static inline void pkvm_init_hyp_services(void) { };
> +
> +static inline bool is_protected_kvm_guest(void)
> +{
> +	return false;
> +}
>  #endif
>
>  static inline void kvm_arch_init_hyp_services(void)
> diff --git a/arch/arm64/include/asm/mem_encrypt.h b/arch/arm64/include/asm/mem_encrypt.h
> index 314b2b52025f..636f45b4d8af 100644
> --- a/arch/arm64/include/asm/mem_encrypt.h
> +++ b/arch/arm64/include/asm/mem_encrypt.h
> @@ -2,6 +2,7 @@
>  #ifndef __ASM_MEM_ENCRYPT_H
>  #define __ASM_MEM_ENCRYPT_H
>
> +#include <asm/hypervisor.h>
>  #include <asm/rsi.h>
>
>  struct device;
> @@ -20,7 +21,7 @@ int realm_register_memory_enc_ops(void);
>
>  static inline bool force_dma_unencrypted(struct device *dev)
>  {
> -	return is_realm_world();
> +	return is_realm_world() || is_protected_kvm_guest();
>  }
>
>  /*
> diff --git a/arch/arm64/kernel/rsi.c b/arch/arm64/kernel/rsi.c
> index 92160f2e57ff..25ca75ce1a4d 100644
> --- a/arch/arm64/kernel/rsi.c
> +++ b/arch/arm64/kernel/rsi.c
> @@ -7,7 +7,6 @@
>  #include <linux/memblock.h>
>  #include <linux/psci.h>
>  #include <linux/swiotlb.h>
> -#include <linux/cc_platform.h>
>  #include <linux/platform_device.h>
>
>  #include <asm/io.h>
> @@ -23,17 +22,6 @@ EXPORT_SYMBOL(prot_ns_shared);
>  DEFINE_STATIC_KEY_FALSE_RO(rsi_present);
>  EXPORT_SYMBOL(rsi_present);
>
> -bool cc_platform_has(enum cc_attr attr)
> -{
> -	switch (attr) {
> -	case CC_ATTR_MEM_ENCRYPT:
> -		return is_realm_world();
> -	default:
> -		return false;
> -	}
> -}
> -EXPORT_SYMBOL_GPL(cc_platform_has);
> -
>  static bool rsi_version_matches(void)
>  {
>  	unsigned long ver_lower, ver_higher;
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index acf67c7064db..a087ac5b15f7 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -12,6 +12,7 @@
>  #include <linux/swap.h>
>  #include <linux/init.h>
>  #include <linux/cache.h>
> +#include <linux/cc_platform.h>
>  #include <linux/mman.h>
>  #include <linux/nodemask.h>
>  #include <linux/initrd.h>
> @@ -36,6 +37,7 @@
>
>  #include <asm/boot.h>
>  #include <asm/fixmap.h>
> +#include <asm/hypervisor.h>
>  #include <asm/kasan.h>
>  #include <asm/kernel-pgtable.h>
>  #include <asm/kvm_host.h>
> @@ -414,6 +416,17 @@ void dump_mem_limit(void)
>  	}
>  }
>
> +bool cc_platform_has(enum cc_attr attr)
> +{
> +	switch (attr) {
> +	case CC_ATTR_MEM_ENCRYPT:
> +		return is_realm_world() || is_protected_kvm_guest();
> +	default:
> +		return false;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(cc_platform_has);
> +
>  #ifdef CONFIG_EXECMEM
>  static u64 module_direct_base __ro_after_init = 0;
>  static u64 module_plt_base __ro_after_init = 0;
> diff --git a/drivers/virt/coco/pkvm-guest/arm-pkvm-guest.c b/drivers/virt/coco/pkvm-guest/arm-pkvm-guest.c
> index 4230b817a80b..297e6d6019b8 100644
> --- a/drivers/virt/coco/pkvm-guest/arm-pkvm-guest.c
> +++ b/drivers/virt/coco/pkvm-guest/arm-pkvm-guest.c
> @@ -95,6 +95,11 @@ static int mmio_guard_ioremap_hook(phys_addr_t phys, size_t size,
>  	return 0;
>  }
>
> +bool is_protected_kvm_guest(void)
> +{
> +	return !!pkvm_granule;
> +}
> +
>  void pkvm_init_hyp_services(void)
>  {
>  	int i;


-aneesh
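
For readers following the exchange above, a minimal user-space sketch of why the restricted pool stops being shared in a pKVM guest, and how the quoted "arm64/coco: Add pKVM as a CC platform" patch changes that. This is not kernel code: enum model_guest, model_mem_encrypt() and model_restricted_pool_shared() are hypothetical names, and the pkvm_is_cc flag only stands in for "the quoted patch is applied". The logic is a simplified reading of the rmem_swiotlb_device_init() hunk and the cc_platform_has() change quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical guest kinds, used only for this illustration. */
enum model_guest { MODEL_PLAIN, MODEL_REALM, MODEL_PKVM };

/*
 * Models cc_platform_has(CC_ATTR_MEM_ENCRYPT) on arm64.
 * pkvm_is_cc == true stands for "the quoted pKVM patch is applied".
 */
bool model_mem_encrypt(enum model_guest g, bool pkvm_is_cc)
{
	if (g == MODEL_REALM)
		return true;
	return g == MODEL_PKVM && pkvm_is_cc;
}

/*
 * Models the restricted-pool branch of rmem_swiotlb_device_init() in the
 * series: the pool is decrypted (shared with the host) only when the
 * platform reports memory encryption.
 */
bool model_restricted_pool_shared(enum model_guest g, bool pkvm_is_cc)
{
	return model_mem_encrypt(g, pkvm_is_cc);
}

/* Usage: the three cases discussed in the thread. */
void model_pool_demo(void)
{
	/* Realm guest: pool is shared, as before the series. */
	assert(model_restricted_pool_shared(MODEL_REALM, false));
	/* pKVM guest without the patch: pool stays private, virtio breaks. */
	assert(!model_restricted_pool_shared(MODEL_PKVM, false));
	/* pKVM guest with the patch: pool is shared again. */
	assert(model_restricted_pool_shared(MODEL_PKVM, true));
}
```

The sketch deliberately leaves out force_dma_unencrypted(), which the quoted patch also changes; it only covers the pool-initialization decision.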



* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-14 14:37 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5apl2y5f96.fsf@kernel.org>

On Thu, May 14, 2026 at 06:18:05PM +0530, Aneesh Kumar K.V wrote:
> > There is no problem with non-protected guests as they don't use memory
> > encryption; my initial thought was that the encrypted/decrypted state is
> > a per-pool property which is decided by FW (device-tree).
> 
> What I meant was that we need a generic way to identify a pKVM guest, so
> that we can use it in the conditional above.

If I understood Mostafa's remarks, I think some devices in the
guest need shared/decrypted memory and some don't? I.e. a virtio hypervisor
device needs shared while a real PCI device doesn't? Is that right?

In CC terms that would be a mixture of T=0 and T=1 devices, hardwired
and signaled by firmware.

Ideally we'd have a flow where if the arch precreates a swiotlb pool
with special parameters this overrides all other decision making. Then
this series is about making CC NOT use that flow... ??
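
As a point of reference for that question, a minimal user-space sketch of the matching rule the series adds to swiotlb_tbl_map_single(): a mapping is rejected when the state the request requires differs from the state recorded on the device's pool. This is an illustrative model, not kernel code. struct model_dev, MODEL_ATTR_CC_SHARED and model_map_allowed() are hypothetical names, and the per-device force_unencrypted field is an assumption made to model a T=0/T=1 mix (on arm64 today force_dma_unencrypted() ignores its device argument).

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for DMA_ATTR_CC_SHARED; the bit value here is arbitrary. */
#define MODEL_ATTR_CC_SHARED (1UL << 0)

struct model_dev {
	bool force_unencrypted;	/* models force_dma_unencrypted(dev) */
	bool pool_unencrypted;	/* models dev->dma_io_tlb_mem->unencrypted */
};

/* Mirrors the check the patch adds to swiotlb_tbl_map_single(). */
bool model_map_allowed(const struct model_dev *dev, unsigned long attrs)
{
	bool require_decrypted = (attrs & MODEL_ATTR_CC_SHARED) ||
				 dev->force_unencrypted;

	return require_decrypted == dev->pool_unencrypted;
}

/* Usage: one shared (T=0 style) and one private (T=1 style) device. */
void model_map_demo(void)
{
	struct model_dev shared_dev = {
		.force_unencrypted = true, .pool_unencrypted = true,
	};
	struct model_dev private_dev = {
		.force_unencrypted = false, .pool_unencrypted = false,
	};

	assert(model_map_allowed(&shared_dev, 0));
	assert(model_map_allowed(&private_dev, 0));
	/* A shared request against a private pool is rejected. */
	assert(!model_map_allowed(&private_dev, MODEL_ATTR_CC_SHARED));
}
```

In this model the pool state never overrides the request; a mismatch is simply an error, which is the behaviour being contrasted with a firmware-precreated pool that would take priority.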

Jason



* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-14 12:48 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agW5rhE9n2gDQ0w5@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Thu, May 14, 2026 at 11:24:42AM +0530, Aneesh Kumar K.V wrote:
>> Mostafa Saleh <smostafa@google.com> writes:
>> 
>> > On Tue, May 12, 2026 at 02:33:59PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> >> Teach swiotlb to distinguish between encrypted and decrypted bounce
>> >> buffer pools, and make allocation and mapping paths select a pool whose
>> >> state matches the requested DMA attributes.
>> >> 
>> >> Add an unencrypted flag to io_tlb_mem, initialize it for the default and
>> >> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
>> >> allocation. Reject swiotlb alloc/map requests when the selected pool does
>> >> not match the required encrypted/decrypted state.
>> >> 
>> >> Also return DMA addresses with the matching phys_to_dma_{encrypted,
>> >> unencrypted} helper so the DMA address encoding stays consistent with the
>> >> chosen pool.
>> >> 
>> >> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> >> ---
>> >>  include/linux/dma-direct.h |  10 ++++
>> >>  include/linux/swiotlb.h    |   8 ++-
>> >>  kernel/dma/direct.c        |  14 +++--
>> >>  kernel/dma/swiotlb.c       | 108 +++++++++++++++++++++++++++----------
>> >>  4 files changed, 107 insertions(+), 33 deletions(-)
>> >> 
>> >> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
>> >> index c249912456f9..94fad4e7c11e 100644
>> >> --- a/include/linux/dma-direct.h
>> >> +++ b/include/linux/dma-direct.h
>> >> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
>> >>  #ifndef phys_to_dma_unencrypted
>> >>  #define phys_to_dma_unencrypted		phys_to_dma
>> >>  #endif
>> >> +
>> >> +#ifndef phys_to_dma_encrypted
>> >> +#define phys_to_dma_encrypted		phys_to_dma
>> >> +#endif
>> >>  #else
>> >>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>> >>  {
>> >> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
>> >>  {
>> >>  	return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
>> >>  }
>> >> +
>> >> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
>> >> +		phys_addr_t paddr)
>> >> +{
>> >> +	return dma_addr_encrypted(__phys_to_dma(dev, paddr));
>> >> +}
>> >>  /*
>> >>   * If memory encryption is supported, phys_to_dma will set the memory encryption
>> >>   * bit in the DMA address, and dma_to_phys will clear it.
>> >> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
>> >> index 3dae0f592063..b3fa3c6e0169 100644
>> >> --- a/include/linux/swiotlb.h
>> >> +++ b/include/linux/swiotlb.h
>> >> @@ -81,6 +81,7 @@ struct io_tlb_pool {
>> >>  	struct list_head node;
>> >>  	struct rcu_head rcu;
>> >>  	bool transient;
>> >> +	bool unencrypted;
>> >>  #endif
>> >>  };
>> >>  
>> >> @@ -111,6 +112,7 @@ struct io_tlb_mem {
>> >>  	struct dentry *debugfs;
>> >>  	bool force_bounce;
>> >>  	bool for_alloc;
>> >> +	bool unencrypted;
>> >>  #ifdef CONFIG_SWIOTLB_DYNAMIC
>> >>  	bool can_grow;
>> >>  	u64 phys_limit;
>> >> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
>> >>  extern void swiotlb_print_info(void);
>> >>  
>> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> >> -struct page *swiotlb_alloc(struct device *dev, size_t size);
>> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> +		unsigned long attrs);
>> >>  bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>> >>  
>> >>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>> >> @@ -290,7 +293,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>> >>  	return dev->dma_io_tlb_mem->for_alloc;
>> >>  }
>> >>  #else
>> >> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> +		unsigned long attrs)
>> >>  {
>> >>  	return NULL;
>> >>  }
>> >> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> >> index dc2907439b3d..97ae4fa10521 100644
>> >> --- a/kernel/dma/direct.c
>> >> +++ b/kernel/dma/direct.c
>> >> @@ -104,9 +104,10 @@ static void __dma_direct_free_pages(struct device *dev, struct page *page,
>> >>  	dma_free_contiguous(dev, page, size);
>> >>  }
>> >>  
>> >> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
>> >> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
>> >> +		unsigned long attrs)
>> >>  {
>> >> -	struct page *page = swiotlb_alloc(dev, size);
>> >> +	struct page *page = swiotlb_alloc(dev, size, attrs);
>> >>  
>> >>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
>> >>  		swiotlb_free(dev, page, size);
>> >> @@ -266,8 +267,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>> >>  						  gfp, attrs);
>> >>  
>> >>  	if (is_swiotlb_for_alloc(dev)) {
>> >> -		page = dma_direct_alloc_swiotlb(dev, size);
>> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> >>  		if (page) {
>> >> +			/*
>> >> +			 * swiotlb allocations comes from pool already marked
>> >> +			 * decrypted
>> >> +			 */
>> >>  			mark_mem_decrypt = false;
>> >>  			goto setup_page;
>> >>  		}
>> >> @@ -374,6 +379,7 @@ void dma_direct_free(struct device *dev, size_t size,
>> >>  		return;
>> >>  
>> >>  	if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
>> >> +		/* Swiotlb doesn't need a page attribute update on free */
>> >>  		mark_mem_encrypted = false;
>> >>  
>> >>  	if (is_vmalloc_addr(cpu_addr)) {
>> >> @@ -403,7 +409,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>> >>  						  gfp, attrs);
>> >>  
>> >>  	if (is_swiotlb_for_alloc(dev)) {
>> >> -		page = dma_direct_alloc_swiotlb(dev, size);
>> >> +		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> >>  		if (!page)
>> >>  			return NULL;
>> >>  
>> >> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> >> index ab4eccbaa076..065663be282c 100644
>> >> --- a/kernel/dma/swiotlb.c
>> >> +++ b/kernel/dma/swiotlb.c
>> >> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
>> >>  	struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
>> >>  	unsigned long bytes;
>> >>  
>> >> +	/*
>> >> +	 * if platform support memory encryption, swiotlb buffers are
>> >> +	 * decrypted by default.
>> >> +	 */
>> >> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> +		io_tlb_default_mem.unencrypted = true;
>> >> +	else
>> >> +		io_tlb_default_mem.unencrypted = false;
>> >> +
>> >>  	if (!mem->nslabs || mem->late_alloc)
>> >>  		return;
>> >>  	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
>> >> -	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> >> +
>> >> +	if (io_tlb_default_mem.unencrypted)
>> >> +		set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
>> >>  }
>> >>  
>> >>  static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
>> >> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>> >>  	if (!mem->slots)
>> >>  		goto error_slots;
>> >>  
>> >> -	set_memory_decrypted((unsigned long)vstart,
>> >> -			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> >> +	if (io_tlb_default_mem.unencrypted)
>> >> +		set_memory_decrypted((unsigned long)vstart,
>> >> +				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
>> >> +
>> >>  	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
>> >>  				 nareas);
>> >>  	add_mem_pool(&io_tlb_default_mem, mem);
>> >> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
>> >>  	tbl_size = PAGE_ALIGN(mem->end - mem->start);
>> >>  	slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>> >>  
>> >> -	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> >> +	if (io_tlb_default_mem.unencrypted)
>> >> +		set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>> >> +
>> >>  	if (mem->late_alloc) {
>> >>  		area_order = get_order(array_size(sizeof(*mem->areas),
>> >>  			mem->nareas));
>> >> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
>> >>   * @gfp:	GFP flags for the allocation.
>> >>   * @bytes:	Size of the buffer.
>> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> >> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
>> >>   *
>> >>   * Allocate pages from the buddy allocator. If successful, make the allocated
>> >>   * pages decrypted that they can be used for DMA.
>> >> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
>> >>   * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
>> >>   * if the allocated physical address was above @phys_limit.
>> >>   */
>> >> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
>> >> +		u64 phys_limit, bool unencrypted)
>> >>  {
>> >>  	unsigned int order = get_order(bytes);
>> >>  	struct page *page;
>> >> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >>  	}
>> >>  
>> >>  	vaddr = phys_to_virt(paddr);
>> >> -	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> +	if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >>  		goto error;
>> >>  	return page;
>> >>  
>> >>  error:
>> >>  	/* Intentional leak if pages cannot be encrypted again. */
>> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> +	if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >>  		__free_pages(page, order);
>> >>  	return NULL;
>> >>  }
>> >> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
>> >>   * @dev:	Device for which a memory pool is allocated.
>> >>   * @bytes:	Size of the buffer.
>> >>   * @phys_limit:	Maximum allowed physical address of the buffer.
>> >> + * @attrs:	DMA attributes for the allocation.
>> >>   * @gfp:	GFP flags for the allocation.
>> >>   *
>> >>   * Return: Allocated pages, or %NULL on allocation failure.
>> >>   */
>> >>  static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >> -		u64 phys_limit, gfp_t gfp)
>> >> +		u64 phys_limit, unsigned long attrs, gfp_t gfp)
>> >>  {
>> >>  	struct page *page;
>> >> -	unsigned long attrs = 0;
>> >>  
>> >>  	/*
>> >>  	 * Allocate from the atomic pools if memory is encrypted and
>> >>  	 * the allocation is atomic, because decrypting may block.
>> >>  	 */
>> >> -	if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
>> >> +	if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
>> >>  		void *vaddr;
>> >>  
>> >>  		if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
>> >>  			return NULL;
>> >>  
>> >> -		/* swiotlb considered decrypted by default */
>> >> -		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >> -			attrs = DMA_ATTR_CC_SHARED;
>> >> -
>> >>  		return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
>> >>  					   attrs, dma_coherent_ok);
>> >>  	}
>> >> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >>  	else if (phys_limit <= DMA_BIT_MASK(32))
>> >>  		gfp |= __GFP_DMA32;
>> >>  
>> >> -	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
>> >> +	while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
>> >> +					     !!(attrs & DMA_ATTR_CC_SHARED)))) {
>> >>  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
>> >>  		    phys_limit < DMA_BIT_MASK(64) &&
>> >>  		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
>> >> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
>> >>   * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
>> >>   * @vaddr:	Virtual address of the buffer.
>> >>   * @bytes:	Size of the buffer.
>> >> + * @unencrypted: true if @vaddr was allocated decrypted and must be
>> >> + *	re-encrypted before being freed
>> >>   */
>> >> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
>> >>  {
>> >>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>> >>  	    dma_free_from_pool(NULL, vaddr, bytes))
>> >>  		return;
>> >>  
>> >>  	/* Intentional leak if pages cannot be encrypted again. */
>> >> -	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >> +	if (!unencrypted ||
>> >> +	    !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
>> >>  		__free_pages(virt_to_page(vaddr), get_order(bytes));
>> >>  }
>> >>  
>> >> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >>   * @nslabs:	Desired (maximum) number of slabs.
>> >>   * @nareas:	Number of areas.
>> >>   * @phys_limit:	Maximum DMA buffer physical address.
>> >> + * @attrs:	DMA attributes for the allocation.
>> >>   * @gfp:	GFP flags for the allocations.
>> >>   *
>> >>   * Allocate and initialize a new IO TLB memory pool. The actual number of
>> >> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
>> >>   */
>> >>  static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >>  		unsigned long minslabs, unsigned long nslabs,
>> >> -		unsigned int nareas, u64 phys_limit, gfp_t gfp)
>> >> +		unsigned int nareas, u64 phys_limit, unsigned long attrs,
>> >> +		gfp_t gfp)
>> >>  {
>> >>  	struct io_tlb_pool *pool;
>> >>  	unsigned int slot_order;
>> >> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >>  	if (!pool)
>> >>  		goto error;
>> >>  	pool->areas = (void *)pool + sizeof(*pool);
>> >> +	pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>> >>  
>> >>  	tlb_size = nslabs << IO_TLB_SHIFT;
>> >> -	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
>> >> +	while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
>> >>  		if (nslabs <= minslabs)
>> >>  			goto error_tlb;
>> >>  		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
>> >> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
>> >>  	return pool;
>> >>  
>> >>  error_slots:
>> >> -	swiotlb_free_tlb(page_address(tlb), tlb_size);
>> >> +	swiotlb_free_tlb(page_address(tlb), tlb_size,
>> >> +			 !!(attrs & DMA_ATTR_CC_SHARED));
>> >>  error_tlb:
>> >>  	kfree(pool);
>> >>  error:
>> >> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
>> >>  	struct io_tlb_pool *pool;
>> >>  
>> >>  	pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
>> >> -				  default_nareas, mem->phys_limit, GFP_KERNEL);
>> >> +				  default_nareas, mem->phys_limit,
>> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> >> +				  GFP_KERNEL);
>> >>  	if (!pool) {
>> >>  		pr_warn_ratelimited("Failed to allocate new pool");
>> >>  		return;
>> >> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
>> >>  	size_t tlb_size = pool->end - pool->start;
>> >>  
>> >>  	free_pages((unsigned long)pool->slots, get_order(slots_size));
>> >> -	swiotlb_free_tlb(pool->vaddr, tlb_size);
>> >> +	swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
>> >>  	kfree(pool);
>> >>  }
>> >>  
>> >> @@ -1232,6 +1255,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>> >>  	nslabs = nr_slots(alloc_size);
>> >>  	phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
>> >>  	pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
>> >> +				  mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
>> >>  				  GFP_NOWAIT);
>> >>  	if (!pool)
>> >>  		return -1;
>> >> @@ -1394,6 +1418,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>> >>  		enum dma_data_direction dir, unsigned long attrs)
>> >>  {
>> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> >> +	bool require_decrypted = false;
>> >>  	unsigned int offset;
>> >>  	struct io_tlb_pool *pool;
>> >>  	unsigned int i;
>> >> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>> >>  	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> >>  		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>> >>  
>> >> +	/*
>> >> +	 * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
>> >> +	 * but the device is forcing decryption, use decrypted io_tlb_mem
>> >> +	 */
>> >> +	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>> >> +		require_decrypted = true;
>> >> +
>> >> +	if (require_decrypted != mem->unencrypted)
>> >> +		return (phys_addr_t)DMA_MAPPING_ERROR;
>> >> +
>> >>  	/*
>> >>  	 * The default swiotlb memory pool is allocated with PAGE_SIZE
>> >>  	 * alignment. If a mapping is requested with larger alignment,
>> >> @@ -1608,8 +1643,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>> >>  	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
>> >>  		return DMA_MAPPING_ERROR;
>> >>  
>> >> -	/* Ensure that the address returned is DMA'ble */
>> >> -	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> >> +	/*
>> >> +	 * Use the allocated io_tlb_mem encryption type to determine dma addr.
>> >> +	 */
>> >> +	if (dev->dma_io_tlb_mem->unencrypted)
>> >> +		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
>> >> +	else
>> >> +		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
>> >> +
>> >>  	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
>> >>  		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
>> >>  			attrs | DMA_ATTR_SKIP_CPU_SYNC,
>> >> @@ -1773,7 +1814,8 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>> >>  
>> >>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>> >>  
>> >> -struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >> +struct page *swiotlb_alloc(struct device *dev, size_t size,
>> >> +		unsigned long attrs)
>> >>  {
>> >>  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> >>  	struct io_tlb_pool *pool;
>> >> @@ -1784,6 +1826,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>> >>  	if (!mem)
>> >>  		return NULL;
>> >>  
>> >> +	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
>> >> +		return NULL;
>> >> +
>> >>  	align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
>> >>  	index = swiotlb_find_slots(dev, 0, size, align, &pool);
>> >>  	if (index == -1)
>> >> @@ -1853,9 +1898,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>> >>  			kfree(mem);
>> >>  			return -ENOMEM;
>> >>  		}
>> >> +		/*
>> >> +		 * if platform supports memory encryption,
>> >> +		 * restricted mem pool is decrypted by default
>> >> +		 */
>> >> +		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> >> +			mem->unencrypted = true;
>> >> +			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
>> >> +					     rmem->size >> PAGE_SHIFT);
>> >> +		} else {
>> >> +			mem->unencrypted = false;
>> >> +		}
>> >
>> > This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
>> > traffic now fails.
>> >
>> > Also, by design, some drivers are clueless about bouncing, so
>> > I believe that the pool should have a way to control its property
>> > (encrypted or decrypted), and that should take priority over whatever
>> > attributes come from the allocation.
>> > And that brings us to the same point: whether it’s better to return
>> > the memory along with its state, or to pass the requested state.
>> > I think for other cases it’s fine for the device/DMA-API to dictate
>> > the attrs, but not in the restricted-dma case, where the firmware just
>> > knows better.
>> >
>> 
>> Is it that the pKVM guest kernel does not have awareness of
>> encrypted/decrypted DMA allocations? Instead, the firmware attaches
>> hypervisor-shared pages to the device via restricted-dma-pool? The
>> kernel then has swiotlb->for_alloc = true, and hence all DMA allocations
>> go through the restricted-dma-pool?
>
> Yes.
>
>> 
>> Given that pKVM supports pkvm_set_memory_encrypted() and
>> pkvm_set_memory_decrypted(), can we consider adding CC_ATTR_MEM_ENCRYPT
>> support to pKVM? It would also be good to investigate whether we can set
>> force_dma_unencrypted(dev) to true where needed.
>
> I was looking into that, but it didn't work because
> force_dma_unencrypted() is broken with restricted-dma due to the
> double decryption issue; that's when I sent my first series [1].
>
> Maybe we should land some basic fixes for that path so we can
> convert pKVM, and then do the full rework.
>
> I will revive my old work and see if I can send an RFC.
>
> [1] https://lore.kernel.org/all/20260305170335.963568-1-smostafa@google.com/
>

With this series, can you check whether the only change needed is
something like the following?

modified   kernel/dma/swiotlb.c
@@ -1905,7 +1905,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
 		 * if platform supports memory encryption,
 		 * restricted mem pool is decrypted by default
 		 */
-		if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		//if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		if (true) {
 			mem->unencrypted = true;
 			set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
 					     rmem->size >> PAGE_SHIFT);

>
>> 
>> I agree that this patch, as it stands, can break pKVM because we are now
>> missing the set_memory_decrypted() call required for pKVM to work.
>> 
>> We now mark the swiotlb io_tlb_mem as unencrypted/encrypted in the guest
>> using struct io_tlb_mem->unencrypted. I am not clear what we can use for
>> pKVM to conditionalize this so that it works for both protected and
>> unprotected guests.
>
> There is no problem with non-protected guests as they don't use memory
> encryption; my initial thought was that encrypted/decrypted is a
> per-pool property which is decided by FW (device-tree).
>

What I meant was that we need a generic way to identify a pKVM guest, so
that we can use it in the conditional above.
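
To make the conditional concrete, here is a minimal userspace sketch (not
kernel code) of how the quoted hunks interact. All names prefixed with
model_/MODEL_ are hypothetical stand-ins: MODEL_ATTR_CC_SHARED models
DMA_ATTR_CC_SHARED with an arbitrary bit value, platform_mem_encrypt
models the result of cc_platform_has(CC_ATTR_MEM_ENCRYPT), and
shared_with_host models whether set_memory_decrypted() was called on the
pool. Which attrs a pKVM guest actually passes is not stated in this
thread, so the sketch makes no assumption about that.

```c
#include <stdbool.h>

/* Hypothetical stand-in for DMA_ATTR_CC_SHARED; the bit value is arbitrary. */
#define MODEL_ATTR_CC_SHARED (1UL << 0)

/* Only the pool state that matters for this discussion. */
struct model_pool {
	bool unencrypted;      /* models io_tlb_mem->unencrypted */
	bool shared_with_host; /* models "set_memory_decrypted() was called" */
};

/*
 * Models the hunk in rmem_swiotlb_device_init(): the pool is decrypted
 * only when the platform reports memory encryption.
 */
static void model_rmem_init(struct model_pool *mem, bool platform_mem_encrypt)
{
	mem->unencrypted = platform_mem_encrypt;
	mem->shared_with_host = platform_mem_encrypt;
}

/*
 * Models the check added to swiotlb_tbl_map_single(): the mapping is
 * refused unless the requested state matches the pool state.
 */
static bool model_map_allowed(const struct model_pool *mem,
			      unsigned long attrs, bool force_unencrypted)
{
	bool require_decrypted =
		(attrs & MODEL_ATTR_CC_SHARED) || force_unencrypted;

	return require_decrypted == mem->unencrypted;
}

/* Models the check added to swiotlb_alloc(). */
static bool model_alloc_allowed(const struct model_pool *mem,
				unsigned long attrs)
{
	return mem->unencrypted == !!(attrs & MODEL_ATTR_CC_SHARED);
}
```

With platform_mem_encrypt == false (the pKVM situation described above),
the model never shares the pool with the host, which matches the missing
set_memory_decrypted() call. Forcing the condition to true, as in the
test diff, marks the pool unencrypted and shares it.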

-aneesh

