* [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
@ 2025-01-30 18:38 Gaurav Batra
2025-02-05 12:43 ` Donet Tom
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Gaurav Batra @ 2025-01-30 18:38 UTC (permalink / raw)
To: linuxppc-dev; +Cc: donettom, Gaurav Batra
iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
(DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
dynamically mapped when the device driver instructs to do so.
The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
after LPAR boot. The command to do so is -
daxctl reconfigure-device --mode=system-ram dax0.0 --force
This will dynamically add pmemory range to LPAR RAM eventually invoking
iommu_mem_notifier(). The address range of pmemory is way beyond the Max
RAM that the LPAR can have. Which means, this range is beyond the DDW
created for the device, at device initialization time.
As a result when TCEs are pre-mapped for the pmemory range, by
iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
command, daxctl, to add pmemory as RAM.
The solution is to not pre-map TCEs for pmemory.
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
---
arch/powerpc/include/asm/mmzone.h | 1 +
arch/powerpc/mm/numa.c | 2 +-
arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
3 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index d99863cd6cde..049152f8d597 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
#ifdef CONFIG_MEMORY_HOTPLUG
extern unsigned long max_pfn;
u64 memory_hotplug_max(void);
+u64 hot_add_drconf_memory_max(void);
#else
#define memory_hotplug_max() memblock_end_of_DRAM()
#endif
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3c1da08304d0..603a0f652ba6 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
return nid;
}
-static u64 hot_add_drconf_memory_max(void)
+u64 hot_add_drconf_memory_max(void)
{
struct device_node *memory = NULL;
struct device_node *dn = NULL;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 29f1a0cc59cd..abd9529a8f41 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
static phys_addr_t ddw_memory_hotplug_max(void)
{
- resource_size_t max_addr = memory_hotplug_max();
- struct device_node *memory;
+ resource_size_t max_addr;
- for_each_node_by_type(memory, "memory") {
- struct resource res;
-
- if (of_address_to_resource(memory, 0, &res))
- continue;
-
- max_addr = max_t(resource_size_t, max_addr, res.end + 1);
- }
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+ max_addr = hot_add_drconf_memory_max();
+#else
+ max_addr = memblock_end_of_DRAM();
+#endif
return max_addr;
}
@@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (direct_mapping) {
/* DDW maps the whole partition, so enable direct DMA mapping */
- ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
+ ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
win64->value, tce_setrange_multi_pSeriesLP_walk);
if (ret) {
dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
@@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
struct memory_notify *arg = data;
int ret = 0;
+ /* This notifier can get called when onlining persistent memory as well.
+ * TCEs are not pre-mapped for persistent memory. Persistent memory will
+ * always be above ddw_memory_hotplug_max()
+ */
+
switch (action) {
case MEM_GOING_ONLINE:
spin_lock(&dma_win_list_lock);
list_for_each_entry(window, &dma_win_list, list) {
- if (window->direct) {
+ if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
+ ddw_memory_hotplug_max()) {
ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
}
@@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
case MEM_OFFLINE:
spin_lock(&dma_win_list_lock);
list_for_each_entry(window, &dma_win_list, list) {
- if (window->direct) {
+ if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
+ ddw_memory_hotplug_max()) {
ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
}
base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra @ 2025-02-05 12:43 ` Donet Tom 2025-02-05 13:58 ` Gaurav Batra 2025-02-17 7:28 ` Madhavan Srinivasan 2025-03-19 17:29 ` Michal Suchánek 2 siblings, 1 reply; 11+ messages in thread From: Donet Tom @ 2025-02-05 12:43 UTC (permalink / raw) To: Gaurav Batra, linuxppc-dev On 1/31/25 00:08, Gaurav Batra wrote: > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are > dynamically mapped when the device driver instructs to do so. > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM" > after LPAR boot. The command to do so is - > > daxctl reconfigure-device --mode=system-ram dax0.0 --force > > This will dynamically add pmemory range to LPAR RAM eventually invoking > iommu_mem_notifier(). The address range of pmemory is way beyond the Max > RAM that the LPAR can have. Which means, this range is beyond the DDW > created for the device, at device initialization time. > > As a result when TCEs are pre-mapped for the pmemory range, by > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the > command, daxctl, to add pmemory as RAM. > > The solution is to not pre-map TCEs for pmemory. > > Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> > --- > arch/powerpc/include/asm/mmzone.h | 1 + > arch/powerpc/mm/numa.c | 2 +- > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------ > 3 files changed, 18 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h > index d99863cd6cde..049152f8d597 100644 > --- a/arch/powerpc/include/asm/mmzone.h > +++ b/arch/powerpc/include/asm/mmzone.h > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; > #ifdef CONFIG_MEMORY_HOTPLUG > extern unsigned long max_pfn; > u64 memory_hotplug_max(void); > +u64 hot_add_drconf_memory_max(void); > #else > #define memory_hotplug_max() memblock_end_of_DRAM() > #endif > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 3c1da08304d0..603a0f652ba6 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) > return nid; > } > > -static u64 hot_add_drconf_memory_max(void) > +u64 hot_add_drconf_memory_max(void) > { > struct device_node *memory = NULL; > struct device_node *dn = NULL; > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c > index 29f1a0cc59cd..abd9529a8f41 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); > > static phys_addr_t ddw_memory_hotplug_max(void) > { > - resource_size_t max_addr = memory_hotplug_max(); > - struct device_node *memory; > + resource_size_t max_addr; > > - for_each_node_by_type(memory, "memory") { > - struct resource res; > - > - if (of_address_to_resource(memory, 0, &res)) > - continue; > - > - max_addr = max_t(resource_size_t, max_addr, res.end + 1); > - } > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) > + max_addr = hot_add_drconf_memory_max(); > +#else > + max_addr = memblock_end_of_DRAM(); > +#endif > > return max_addr; > } > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) > > if (direct_mapping) { > /* DDW maps the whole partition, so enable direct DMA mapping */ > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT, > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT, > win64->value, tce_setrange_multi_pSeriesLP_walk); > if (ret) { > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n", > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > struct memory_notify *arg = data; > int ret = 0; > > + /* This notifier can get called when onlining persistent memory as well. > + * TCEs are not pre-mapped for persistent memory. Persistent memory will > + * always be above ddw_memory_hotplug_max() > + */ > + > switch (action) { > case MEM_GOING_ONLINE: > spin_lock(&dma_win_list_lock); > list_for_each_entry(window, &dma_win_list, list) { > - if (window->direct) { > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > + ddw_memory_hotplug_max()) { Hi Gaurav, Since the pmem_start will be greater than ddw_memory_hotplug_max(), and we have not created DDW beyond ddw_memory_hotplug_max(), we are not adding TCE for this range, right? I have tested this patch on my system, and daxctl reconfigure-device is able to reconfigure PMEM to system RAM. ~# daxctl reconfigure-device --mode=system-ram dax1.0 --force [ { "chardev":"dax1.0", "size":5362417664, "target_node":4, "align":65536, "mode":"system-ram", "online_memblocks":4, "total_memblocks":4, "movable":true } ] reconfigured 1 device ~# ~# lsmem RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x000000697fffffff 422G online yes 0-421 0x0000040380000000-0x000004047fffffff 4G online yes 4110-4113 Memory block size: 1G Total online memory: 426G Total offline memory: 0B root@ltcden14-lp2:~# Thanks Donet > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, > arg->nr_pages, window->prop); > } > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > case MEM_OFFLINE: > spin_lock(&dma_win_list_lock); > list_for_each_entry(window, &dma_win_list, list) { > - if (window->direct) { > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > + ddw_memory_hotplug_max()) { > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, > arg->nr_pages, window->prop); > } > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-02-05 12:43 ` Donet Tom @ 2025-02-05 13:58 ` Gaurav Batra 2025-02-06 6:15 ` Donet Tom 0 siblings, 1 reply; 11+ messages in thread From: Gaurav Batra @ 2025-02-05 13:58 UTC (permalink / raw) To: Donet Tom, linuxppc-dev On 2/5/25 6:43 AM, Donet Tom wrote: > > On 1/31/25 00:08, Gaurav Batra wrote: >> iommu_mem_notifier() is invoked when RAM is dynamically >> added/removed. This >> notifier call is responsible to add/remove TCEs from the Dynamic DMA >> Window >> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not >> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are >> dynamically mapped when the device driver instructs to do so. >> >> The issue is 'daxctl' command is capable of adding pmemory as "System >> RAM" >> after LPAR boot. The command to do so is - >> >> daxctl reconfigure-device --mode=system-ram dax0.0 --force >> >> This will dynamically add pmemory range to LPAR RAM eventually invoking >> iommu_mem_notifier(). The address range of pmemory is way beyond the Max >> RAM that the LPAR can have. Which means, this range is beyond the DDW >> created for the device, at device initialization time. >> >> As a result when TCEs are pre-mapped for the pmemory range, by >> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the >> command, daxctl, to add pmemory as RAM. >> >> The solution is to not pre-map TCEs for pmemory. >> >> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> >> --- >> arch/powerpc/include/asm/mmzone.h | 1 + >> arch/powerpc/mm/numa.c | 2 +- >> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------ >> 3 files changed, 18 insertions(+), 14 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/mmzone.h >> b/arch/powerpc/include/asm/mmzone.h >> index d99863cd6cde..049152f8d597 100644 >> --- a/arch/powerpc/include/asm/mmzone.h >> +++ b/arch/powerpc/include/asm/mmzone.h >> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; >> #ifdef CONFIG_MEMORY_HOTPLUG >> extern unsigned long max_pfn; >> u64 memory_hotplug_max(void); >> +u64 hot_add_drconf_memory_max(void); >> #else >> #define memory_hotplug_max() memblock_end_of_DRAM() >> #endif >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >> index 3c1da08304d0..603a0f652ba6 100644 >> --- a/arch/powerpc/mm/numa.c >> +++ b/arch/powerpc/mm/numa.c >> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) >> return nid; >> } >> -static u64 hot_add_drconf_memory_max(void) >> +u64 hot_add_drconf_memory_max(void) >> { >> struct device_node *memory = NULL; >> struct device_node *dn = NULL; >> diff --git a/arch/powerpc/platforms/pseries/iommu.c >> b/arch/powerpc/platforms/pseries/iommu.c >> index 29f1a0cc59cd..abd9529a8f41 100644 >> --- a/arch/powerpc/platforms/pseries/iommu.c >> +++ b/arch/powerpc/platforms/pseries/iommu.c >> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); >> static phys_addr_t ddw_memory_hotplug_max(void) >> { >> - resource_size_t max_addr = memory_hotplug_max(); >> - struct device_node *memory; >> + resource_size_t max_addr; >> - for_each_node_by_type(memory, "memory") { >> - struct resource res; >> - >> - if (of_address_to_resource(memory, 0, &res)) >> - continue; >> - >> - max_addr = max_t(resource_size_t, max_addr, res.end + 1); >> - } >> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) >> + max_addr = hot_add_drconf_memory_max(); >> +#else >> + max_addr = memblock_end_of_DRAM(); >> +#endif >> return max_addr; >> } >> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, >> struct device_node *pdn) >> if (direct_mapping) { >> /* DDW maps the whole partition, so enable direct DMA >> mapping */ >> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> >> PAGE_SHIFT, >> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> >> PAGE_SHIFT, >> win64->value, >> tce_setrange_multi_pSeriesLP_walk); >> if (ret) { >> dev_info(&dev->dev, "failed to map DMA window for %pOF: >> %d\n", >> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct >> notifier_block *nb, unsigned long action, >> struct memory_notify *arg = data; >> int ret = 0; >> + /* This notifier can get called when onlining persistent >> memory as well. >> + * TCEs are not pre-mapped for persistent memory. Persistent >> memory will >> + * always be above ddw_memory_hotplug_max() >> + */ >> + >> switch (action) { >> case MEM_GOING_ONLINE: >> spin_lock(&dma_win_list_lock); >> list_for_each_entry(window, &dma_win_list, list) { >> - if (window->direct) { >> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < >> + ddw_memory_hotplug_max()) { > Hi Gaurav, > > Since the pmem_start will be greater than ddw_memory_hotplug_max(), > and we have not created DDW beyond ddw_memory_hotplug_max(), we are > not adding TCE for this range, right? > That is correct > I have tested this patch on my system, and daxctl reconfigure-device > is able to reconfigure PMEM to system RAM. > > ~# daxctl reconfigure-device --mode=system-ram dax1.0 --force > [ > { > "chardev":"dax1.0", > "size":5362417664, > "target_node":4, > "align":65536, > "mode":"system-ram", > "online_memblocks":4, > "total_memblocks":4, > "movable":true > } > ] > reconfigured 1 device > ~# > ~# lsmem > RANGE SIZE STATE REMOVABLE BLOCK > 0x0000000000000000-0x000000697fffffff 422G online yes 0-421 > 0x0000040380000000-0x000004047fffffff 4G online yes 4110-4113 > > Memory block size: 1G > Total online memory: 426G > Total offline memory: 0B > root@ltcden14-lp2:~# > > Thanks > Donet >> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, >> arg->nr_pages, window->prop); >> } >> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct >> notifier_block *nb, unsigned long action, >> case MEM_OFFLINE: >> spin_lock(&dma_win_list_lock); >> list_for_each_entry(window, &dma_win_list, list) { >> - if (window->direct) { >> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < >> + ddw_memory_hotplug_max()) { >> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, >> arg->nr_pages, window->prop); >> } >> >> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-02-05 13:58 ` Gaurav Batra @ 2025-02-06 6:15 ` Donet Tom 0 siblings, 0 replies; 11+ messages in thread From: Donet Tom @ 2025-02-06 6:15 UTC (permalink / raw) To: Gaurav Batra, linuxppc-dev On 2/5/25 19:28, Gaurav Batra wrote: > > On 2/5/25 6:43 AM, Donet Tom wrote: >> >> On 1/31/25 00:08, Gaurav Batra wrote: >>> iommu_mem_notifier() is invoked when RAM is dynamically >>> added/removed. This >>> notifier call is responsible to add/remove TCEs from the Dynamic DMA >>> Window >>> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and >>> not >>> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are >>> dynamically mapped when the device driver instructs to do so. >>> >>> The issue is 'daxctl' command is capable of adding pmemory as >>> "System RAM" >>> after LPAR boot. The command to do so is - >>> >>> daxctl reconfigure-device --mode=system-ram dax0.0 --force >>> >>> This will dynamically add pmemory range to LPAR RAM eventually invoking >>> iommu_mem_notifier(). The address range of pmemory is way beyond the >>> Max >>> RAM that the LPAR can have. Which means, this range is beyond the DDW >>> created for the device, at device initialization time. >>> >>> As a result when TCEs are pre-mapped for the pmemory range, by >>> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the >>> command, daxctl, to add pmemory as RAM. >>> >>> The solution is to not pre-map TCEs for pmemory. >>> >>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> >>> --- >>> arch/powerpc/include/asm/mmzone.h | 1 + >>> arch/powerpc/mm/numa.c | 2 +- >>> arch/powerpc/platforms/pseries/iommu.c | 29 >>> ++++++++++++++------------ >>> 3 files changed, 18 insertions(+), 14 deletions(-) >>> >>> diff --git a/arch/powerpc/include/asm/mmzone.h >>> b/arch/powerpc/include/asm/mmzone.h >>> index d99863cd6cde..049152f8d597 100644 >>> --- a/arch/powerpc/include/asm/mmzone.h >>> +++ b/arch/powerpc/include/asm/mmzone.h >>> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; >>> #ifdef CONFIG_MEMORY_HOTPLUG >>> extern unsigned long max_pfn; >>> u64 memory_hotplug_max(void); >>> +u64 hot_add_drconf_memory_max(void); >>> #else >>> #define memory_hotplug_max() memblock_end_of_DRAM() >>> #endif >>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >>> index 3c1da08304d0..603a0f652ba6 100644 >>> --- a/arch/powerpc/mm/numa.c >>> +++ b/arch/powerpc/mm/numa.c >>> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) >>> return nid; >>> } >>> -static u64 hot_add_drconf_memory_max(void) >>> +u64 hot_add_drconf_memory_max(void) >>> { >>> struct device_node *memory = NULL; >>> struct device_node *dn = NULL; >>> diff --git a/arch/powerpc/platforms/pseries/iommu.c >>> b/arch/powerpc/platforms/pseries/iommu.c >>> index 29f1a0cc59cd..abd9529a8f41 100644 >>> --- a/arch/powerpc/platforms/pseries/iommu.c >>> +++ b/arch/powerpc/platforms/pseries/iommu.c >>> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); >>> static phys_addr_t ddw_memory_hotplug_max(void) >>> { >>> - resource_size_t max_addr = memory_hotplug_max(); >>> - struct device_node *memory; >>> + resource_size_t max_addr; >>> - for_each_node_by_type(memory, "memory") { >>> - struct resource res; >>> - >>> - if (of_address_to_resource(memory, 0, &res)) >>> - continue; >>> - >>> - max_addr = max_t(resource_size_t, max_addr, res.end + 1); >>> - } >>> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) >>> + max_addr = hot_add_drconf_memory_max(); >>> +#else >>> + max_addr = memblock_end_of_DRAM(); >>> +#endif >>> return max_addr; >>> } >>> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, >>> struct device_node *pdn) >>> if (direct_mapping) { >>> /* DDW maps the whole partition, so enable direct DMA >>> mapping */ >>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> >>> PAGE_SHIFT, >>> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> >>> PAGE_SHIFT, >>> win64->value, >>> tce_setrange_multi_pSeriesLP_walk); >>> if (ret) { >>> dev_info(&dev->dev, "failed to map DMA window for >>> %pOF: %d\n", >>> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct >>> notifier_block *nb, unsigned long action, >>> struct memory_notify *arg = data; >>> int ret = 0; >>> + /* This notifier can get called when onlining persistent >>> memory as well. >>> + * TCEs are not pre-mapped for persistent memory. Persistent >>> memory will >>> + * always be above ddw_memory_hotplug_max() >>> + */ >>> + >>> switch (action) { >>> case MEM_GOING_ONLINE: >>> spin_lock(&dma_win_list_lock); >>> list_for_each_entry(window, &dma_win_list, list) { >>> - if (window->direct) { >>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < >>> + ddw_memory_hotplug_max()) { >> Hi Gaurav, >> >> Since the pmem_start will be greater than ddw_memory_hotplug_max(), >> and we have not created DDW beyond ddw_memory_hotplug_max(), we are >> not adding TCE for this range, right? >> > That is correct > Thank you. This looks good to me. feel free to add Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> > >> I have tested this patch on my system, and daxctl reconfigure-device >> is able to reconfigure PMEM to system RAM. >> >> ~# daxctl reconfigure-device --mode=system-ram dax1.0 --force >> [ >> { >> "chardev":"dax1.0", >> "size":5362417664, >> "target_node":4, >> "align":65536, >> "mode":"system-ram", >> "online_memblocks":4, >> "total_memblocks":4, >> "movable":true >> } >> ] >> reconfigured 1 device >> ~# >> ~# lsmem >> RANGE SIZE STATE REMOVABLE BLOCK >> 0x0000000000000000-0x000000697fffffff 422G online yes 0-421 >> 0x0000040380000000-0x000004047fffffff 4G online yes 4110-4113 >> >> Memory block size: 1G >> Total online memory: 426G >> Total offline memory: 0B >> root@ltcden14-lp2:~# >> >> Thanks >> Donet >>> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, >>> arg->nr_pages, window->prop); >>> } >>> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct >>> notifier_block *nb, unsigned long action, >>> case MEM_OFFLINE: >>> spin_lock(&dma_win_list_lock); >>> list_for_each_entry(window, &dma_win_list, list) { >>> - if (window->direct) { >>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < >>> + ddw_memory_hotplug_max()) { >>> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, >>> arg->nr_pages, window->prop); >>> } >>> >>> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra 2025-02-05 12:43 ` Donet Tom @ 2025-02-17 7:28 ` Madhavan Srinivasan 2025-03-19 17:29 ` Michal Suchánek 2 siblings, 0 replies; 11+ messages in thread From: Madhavan Srinivasan @ 2025-02-17 7:28 UTC (permalink / raw) To: linuxppc-dev, Gaurav Batra; +Cc: donettom On Thu, 30 Jan 2025 12:38:54 -0600, Gaurav Batra wrote: > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are > dynamically mapped when the device driver instructs to do so. > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM" > after LPAR boot. The command to do so is - > > [...] Applied to powerpc/next. [1/1] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory https://git.kernel.org/powerpc/c/6aa989ab2bd0d37540c812b4270006ff794662e7 Thanks ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra 2025-02-05 12:43 ` Donet Tom 2025-02-17 7:28 ` Madhavan Srinivasan @ 2025-03-19 17:29 ` Michal Suchánek 2025-03-26 14:46 ` Gaurav Batra 2 siblings, 1 reply; 11+ messages in thread From: Michal Suchánek @ 2025-03-19 17:29 UTC (permalink / raw) To: Gaurav Batra; +Cc: linuxppc-dev, donettom Hello, looks like this upsets some assumption qemu has about these windows. https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html When Linux kernel that has this patch applied is running inside a qemu VM with a PCI device and the VM is rebooted qemu crashes shortly after the next Linux kernel starts. This is quite curious since qemu does AFAIK not support pmemory at all. Any idea what went wrong there? Thanks Michal On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote: > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are > dynamically mapped when the device driver instructs to do so. > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM" > after LPAR boot. The command to do so is - > > daxctl reconfigure-device --mode=system-ram dax0.0 --force > > This will dynamically add pmemory range to LPAR RAM eventually invoking > iommu_mem_notifier(). The address range of pmemory is way beyond the Max > RAM that the LPAR can have. Which means, this range is beyond the DDW > created for the device, at device initialization time. > > As a result when TCEs are pre-mapped for the pmemory range, by > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the > command, daxctl, to add pmemory as RAM. > > The solution is to not pre-map TCEs for pmemory. > > Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> > --- > arch/powerpc/include/asm/mmzone.h | 1 + > arch/powerpc/mm/numa.c | 2 +- > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------ > 3 files changed, 18 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h > index d99863cd6cde..049152f8d597 100644 > --- a/arch/powerpc/include/asm/mmzone.h > +++ b/arch/powerpc/include/asm/mmzone.h > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; > #ifdef CONFIG_MEMORY_HOTPLUG > extern unsigned long max_pfn; > u64 memory_hotplug_max(void); > +u64 hot_add_drconf_memory_max(void); > #else > #define memory_hotplug_max() memblock_end_of_DRAM() > #endif > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 3c1da08304d0..603a0f652ba6 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) > return nid; > } > > -static u64 hot_add_drconf_memory_max(void) > +u64 hot_add_drconf_memory_max(void) > { > struct device_node *memory = NULL; > struct device_node *dn = NULL; > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c > index 29f1a0cc59cd..abd9529a8f41 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); > > static phys_addr_t ddw_memory_hotplug_max(void) > { > - resource_size_t max_addr = memory_hotplug_max(); > - struct device_node *memory; > + resource_size_t max_addr; > > - for_each_node_by_type(memory, "memory") { > - struct resource res; > - > - if (of_address_to_resource(memory, 0, &res)) > - continue; > - > - max_addr = max_t(resource_size_t, max_addr, res.end + 1); > - } > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) > + max_addr = hot_add_drconf_memory_max(); > +#else > + max_addr = memblock_end_of_DRAM(); > +#endif > > return max_addr; > } > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) > > if (direct_mapping) { > /* DDW maps the whole partition, so enable direct DMA mapping */ > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT, > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT, > win64->value, tce_setrange_multi_pSeriesLP_walk); > if (ret) { > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n", > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > struct memory_notify *arg = data; > int ret = 0; > > + /* This notifier can get called when onlining persistent memory as well. > + * TCEs are not pre-mapped for persistent memory. Persistent memory will > + * always be above ddw_memory_hotplug_max() > + */ > + > switch (action) { > case MEM_GOING_ONLINE: > spin_lock(&dma_win_list_lock); > list_for_each_entry(window, &dma_win_list, list) { > - if (window->direct) { > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > + ddw_memory_hotplug_max()) { > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, > arg->nr_pages, window->prop); > } > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > case MEM_OFFLINE: > spin_lock(&dma_win_list_lock); > list_for_each_entry(window, &dma_win_list, list) { > - if (window->direct) { > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > + ddw_memory_hotplug_max()) { > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, > arg->nr_pages, window->prop); > } > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 > -- > 2.39.3 (Apple Git-146) > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-03-19 17:29 ` Michal Suchánek @ 2025-03-26 14:46 ` Gaurav Batra 2025-03-26 14:53 ` Michal Suchánek 0 siblings, 1 reply; 11+ messages in thread From: Gaurav Batra @ 2025-03-26 14:46 UTC (permalink / raw) To: Michal Suchánek; +Cc: linuxppc-dev, donettom Hello Michal, In the patch to fix the pmemory bug, I made some changes to the code that determines Max memory an LPAR can have (excluding pmemory). This information is needed while creating Dynamic DMA Window (DDW). These changes are in the main line code path of DDW creation. This might have irritated QEMU somehow, no idea yet on how. Thanks, Gaurav On 3/19/25 12:29 PM, Michal Suchánek wrote: > Hello, > > looks like this upsets some assumption qemu has about these windows. > > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html > > When Linux kernel that has this patch applied is running inside a qemu > VM with a PCI device and the VM is rebooted qemu crashes shortly after > the next Linux kernel starts. > > This is quite curious since qemu does AFAIK not support pmemory at all. > > Any idea what went wrong there? > > Thanks > > Michal > > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote: >> iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This >> notifier call is responsible to add/remove TCEs from the Dynamic DMA Window >> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not >> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are >> dynamically mapped when the device driver instructs to do so. >> >> The issue is 'daxctl' command is capable of adding pmemory as "System RAM" >> after LPAR boot. The command to do so is - >> >> daxctl reconfigure-device --mode=system-ram dax0.0 --force >> >> This will dynamically add pmemory range to LPAR RAM eventually invoking >> iommu_mem_notifier(). The address range of pmemory is way beyond the Max >> RAM that the LPAR can have. Which means, this range is beyond the DDW >> created for the device, at device initialization time. >> >> As a result when TCEs are pre-mapped for the pmemory range, by >> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the >> command, daxctl, to add pmemory as RAM. >> >> The solution is to not pre-map TCEs for pmemory. >> >> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> >> --- >> arch/powerpc/include/asm/mmzone.h | 1 + >> arch/powerpc/mm/numa.c | 2 +- >> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------ >> 3 files changed, 18 insertions(+), 14 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h >> index d99863cd6cde..049152f8d597 100644 >> --- a/arch/powerpc/include/asm/mmzone.h >> +++ b/arch/powerpc/include/asm/mmzone.h >> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; >> #ifdef CONFIG_MEMORY_HOTPLUG >> extern unsigned long max_pfn; >> u64 memory_hotplug_max(void); >> +u64 hot_add_drconf_memory_max(void); >> #else >> #define memory_hotplug_max() memblock_end_of_DRAM() >> #endif >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >> index 3c1da08304d0..603a0f652ba6 100644 >> --- a/arch/powerpc/mm/numa.c >> +++ b/arch/powerpc/mm/numa.c >> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) >> return nid; >> } >> >> -static u64 hot_add_drconf_memory_max(void) >> +u64 hot_add_drconf_memory_max(void) >> { >> struct device_node *memory = NULL; >> struct device_node *dn = NULL; >> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c >> index 29f1a0cc59cd..abd9529a8f41 100644 >> --- a/arch/powerpc/platforms/pseries/iommu.c >> +++ b/arch/powerpc/platforms/pseries/iommu.c >> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); >> >> static phys_addr_t ddw_memory_hotplug_max(void) >> { >> - resource_size_t max_addr = memory_hotplug_max(); >> - struct device_node *memory; >> + resource_size_t max_addr; >> >> - for_each_node_by_type(memory, "memory") { >> - struct resource res; >> - >> - if (of_address_to_resource(memory, 0, &res)) >> - continue; >> - >> - max_addr = max_t(resource_size_t, max_addr, res.end + 1); >> - } >> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) >> + max_addr = hot_add_drconf_memory_max(); >> +#else >> + max_addr = memblock_end_of_DRAM(); >> +#endif >> >> return max_addr; >> } >> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) >> >> if (direct_mapping) { >> /* DDW maps the whole partition, so enable direct DMA mapping */ >> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT, >> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT, >> win64->value, tce_setrange_multi_pSeriesLP_walk); >> if (ret) { >> dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n", >> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, >> struct memory_notify *arg = data; >> int ret = 0; >> >> + /* This notifier can get called when onlining persistent memory as well. >> + * TCEs are not pre-mapped for persistent memory. Persistent memory will >> + * always be above ddw_memory_hotplug_max() >> + */ >> + >> switch (action) { >> case MEM_GOING_ONLINE: >> spin_lock(&dma_win_list_lock); >> list_for_each_entry(window, &dma_win_list, list) { >> - if (window->direct) { >> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < >> + ddw_memory_hotplug_max()) { >> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, >> arg->nr_pages, window->prop); >> } >> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, >> case MEM_OFFLINE: >> spin_lock(&dma_win_list_lock); >> list_for_each_entry(window, &dma_win_list, list) { >> - if (window->direct) { >> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < >> + ddw_memory_hotplug_max()) { >> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, >> arg->nr_pages, window->prop); >> } >> >> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 >> -- >> 2.39.3 (Apple Git-146) >> >> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-03-26 14:46 ` Gaurav Batra @ 2025-03-26 14:53 ` Michal Suchánek 2025-05-07 9:06 ` Amit Machhiwal 0 siblings, 1 reply; 11+ messages in thread From: Michal Suchánek @ 2025-03-26 14:53 UTC (permalink / raw) To: Gaurav Batra; +Cc: linuxppc-dev, donettom Hello, On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote: > Hello Michal, > > In the patch to fix the pmemory bug, I made some changes to the code that > determines Max memory an LPAR can have (excluding pmemory). This information > is needed while creating Dynamic DMA Window (DDW). These changes are in the > main line code path of DDW creation. This might have irritated QEMU somehow, > no idea yet on how. Yes, it's defeinitely something with the DDW code. Using the disable_ddw=1 kernel parameter avoids the qemu crash. The kernels in https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/ have the patch applied. Booting the kernel inside qemu VM with a PCI device (such as the USB hub) and then rebooting the VM crashes qemu. Thanks Michal > > Thanks, > > Gaurav > > On 3/19/25 12:29 PM, Michal Suchánek wrote: > > Hello, > > > > looks like this upsets some assumption qemu has about these windows. > > > > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html > > > > When Linux kernel that has this patch applied is running inside a qemu > > VM with a PCI device and the VM is rebooted qemu crashes shortly after > > the next Linux kernel starts. > > > > This is quite curious since qemu does AFAIK not support pmemory at all. > > > > Any idea what went wrong there? > > > > Thanks > > > > Michal > > > > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote: > > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This > > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window > > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not > > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are > > > dynamically mapped when the device driver instructs to do so. > > > > > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM" > > > after LPAR boot. The command to do so is - > > > > > > daxctl reconfigure-device --mode=system-ram dax0.0 --force > > > > > > This will dynamically add pmemory range to LPAR RAM eventually invoking > > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max > > > RAM that the LPAR can have. Which means, this range is beyond the DDW > > > created for the device, at device initialization time. > > > > > > As a result when TCEs are pre-mapped for the pmemory range, by > > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the > > > command, daxctl, to add pmemory as RAM. > > > > > > The solution is to not pre-map TCEs for pmemory. > > > > > > Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> > > > --- > > > arch/powerpc/include/asm/mmzone.h | 1 + > > > arch/powerpc/mm/numa.c | 2 +- > > > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------ > > > 3 files changed, 18 insertions(+), 14 deletions(-) > > > > > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h > > > index d99863cd6cde..049152f8d597 100644 > > > --- a/arch/powerpc/include/asm/mmzone.h > > > +++ b/arch/powerpc/include/asm/mmzone.h > > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; > > > #ifdef CONFIG_MEMORY_HOTPLUG > > > extern unsigned long max_pfn; > > > u64 memory_hotplug_max(void); > > > +u64 hot_add_drconf_memory_max(void); > > > #else > > > #define memory_hotplug_max() memblock_end_of_DRAM() > > > #endif > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > > > index 3c1da08304d0..603a0f652ba6 100644 > > > --- a/arch/powerpc/mm/numa.c > > > +++ b/arch/powerpc/mm/numa.c > > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) > > > return nid; > > > } > > > -static u64 hot_add_drconf_memory_max(void) > > > +u64 hot_add_drconf_memory_max(void) > > > { > > > struct device_node *memory = NULL; > > > struct device_node *dn = NULL; > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c > > > index 29f1a0cc59cd..abd9529a8f41 100644 > > > --- a/arch/powerpc/platforms/pseries/iommu.c > > > +++ b/arch/powerpc/platforms/pseries/iommu.c > > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); > > > static phys_addr_t ddw_memory_hotplug_max(void) > > > { > > > - resource_size_t max_addr = memory_hotplug_max(); > > > - struct device_node *memory; > > > + resource_size_t max_addr; > > > - for_each_node_by_type(memory, "memory") { > > > - struct resource res; > > > - > > > - if (of_address_to_resource(memory, 0, &res)) > > > - continue; > > > - > > > - max_addr = max_t(resource_size_t, max_addr, res.end + 1); > > > - } > > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) > > > + max_addr = hot_add_drconf_memory_max(); > > > +#else > > > + max_addr = memblock_end_of_DRAM(); > > > +#endif > > > return max_addr; > > > } > > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) > > > if (direct_mapping) { > > > /* DDW maps the whole partition, so enable direct DMA mapping */ > > > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT, > > > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT, > > > win64->value, tce_setrange_multi_pSeriesLP_walk); > > > if (ret) { > > > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n", > > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > > > struct memory_notify *arg = data; > > > int ret = 0; > > > + /* This notifier can get called when onlining persistent memory as well. > > > + * TCEs are not pre-mapped for persistent memory. Persistent memory will > > > + * always be above ddw_memory_hotplug_max() > > > + */ > > > + > > > switch (action) { > > > case MEM_GOING_ONLINE: > > > spin_lock(&dma_win_list_lock); > > > list_for_each_entry(window, &dma_win_list, list) { > > > - if (window->direct) { > > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > > > + ddw_memory_hotplug_max()) { > > > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, > > > arg->nr_pages, window->prop); > > > } > > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > > > case MEM_OFFLINE: > > > spin_lock(&dma_win_list_lock); > > > list_for_each_entry(window, &dma_win_list, list) { > > > - if (window->direct) { > > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > > > + ddw_memory_hotplug_max()) { > > > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, > > > arg->nr_pages, window->prop); > > > } > > > > > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 > > > -- > > > 2.39.3 (Apple Git-146) > > > > > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-03-26 14:53 ` Michal Suchánek @ 2025-05-07 9:06 ` Amit Machhiwal 2025-05-07 12:23 ` Michal Suchánek 0 siblings, 1 reply; 11+ messages in thread From: Amit Machhiwal @ 2025-05-07 9:06 UTC (permalink / raw) To: Michal Suchánek; +Cc: Gaurav Batra, linuxppc-dev, donettom Hi Michal, I can recreate this issue on sles16 distro kernel but I don't observe this issue with upstream Linux 6.15-rc5 on the **same** sles16 guest. Note: the commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory") was included since Linux 6.15-rc1. I think there's something more to this crash. Thanks, Amit On 2025/03/26 03:53 PM, Michal Suchánek wrote: > Hello, > > On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote: > > Hello Michal, > > > > In the patch to fix the pmemory bug, I made some changes to the code that > > determines Max memory an LPAR can have (excluding pmemory). This information > > is needed while creating Dynamic DMA Window (DDW). These changes are in the > > main line code path of DDW creation. This might have irritated QEMU somehow, > > no idea yet on how. > > Yes, it's defeinitely something with the DDW code. Using the > disable_ddw=1 kernel parameter avoids the qemu crash. > > The kernels in > https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/ > > have the patch applied. > > Booting the kernel inside qemu VM with a PCI device (such as the USB > hub) and then rebooting the VM crashes qemu. > > Thanks > > Michal > > > > > Thanks, > > > > Gaurav > > > > On 3/19/25 12:29 PM, Michal Suchánek wrote: > > > Hello, > > > > > > looks like this upsets some assumption qemu has about these windows. > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html > > > > > > When Linux kernel that has this patch applied is running inside a qemu > > > VM with a PCI device and the VM is rebooted qemu crashes shortly after > > > the next Linux kernel starts. > > > > > > This is quite curious since qemu does AFAIK not support pmemory at all. > > > > > > Any idea what went wrong there? > > > > > > Thanks > > > > > > Michal > > > > > > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote: > > > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This > > > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window > > > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not > > > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are > > > > dynamically mapped when the device driver instructs to do so. > > > > > > > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM" > > > > after LPAR boot. The command to do so is - > > > > > > > > daxctl reconfigure-device --mode=system-ram dax0.0 --force > > > > > > > > This will dynamically add pmemory range to LPAR RAM eventually invoking > > > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max > > > > RAM that the LPAR can have. Which means, this range is beyond the DDW > > > > created for the device, at device initialization time. > > > > > > > > As a result when TCEs are pre-mapped for the pmemory range, by > > > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the > > > > command, daxctl, to add pmemory as RAM. > > > > > > > > The solution is to not pre-map TCEs for pmemory. > > > > > > > > Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> > > > > --- > > > > arch/powerpc/include/asm/mmzone.h | 1 + > > > > arch/powerpc/mm/numa.c | 2 +- > > > > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------ > > > > 3 files changed, 18 insertions(+), 14 deletions(-) > > > > > > > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h > > > > index d99863cd6cde..049152f8d597 100644 > > > > --- a/arch/powerpc/include/asm/mmzone.h > > > > +++ b/arch/powerpc/include/asm/mmzone.h > > > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[]; > > > > #ifdef CONFIG_MEMORY_HOTPLUG > > > > extern unsigned long max_pfn; > > > > u64 memory_hotplug_max(void); > > > > +u64 hot_add_drconf_memory_max(void); > > > > #else > > > > #define memory_hotplug_max() memblock_end_of_DRAM() > > > > #endif > > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > > > > index 3c1da08304d0..603a0f652ba6 100644 > > > > --- a/arch/powerpc/mm/numa.c > > > > +++ b/arch/powerpc/mm/numa.c > > > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr) > > > > return nid; > > > > } > > > > -static u64 hot_add_drconf_memory_max(void) > > > > +u64 hot_add_drconf_memory_max(void) > > > > { > > > > struct device_node *memory = NULL; > > > > struct device_node *dn = NULL; > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c > > > > index 29f1a0cc59cd..abd9529a8f41 100644 > > > > --- a/arch/powerpc/platforms/pseries/iommu.c > > > > +++ b/arch/powerpc/platforms/pseries/iommu.c > > > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list); > > > > static phys_addr_t ddw_memory_hotplug_max(void) > > > > { > > > > - resource_size_t max_addr = memory_hotplug_max(); > > > > - struct device_node *memory; > > > > + resource_size_t max_addr; > > > > - for_each_node_by_type(memory, "memory") { > > > > - struct resource res; > > > > - > > > > - if (of_address_to_resource(memory, 0, &res)) > > > > - continue; > > > > - > > > > - max_addr = max_t(resource_size_t, max_addr, res.end + 1); > > > > - } > > > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) > > > > + max_addr = hot_add_drconf_memory_max(); > > > > +#else > > > > + max_addr = memblock_end_of_DRAM(); > > > > +#endif > > > > return max_addr; > > > > } > > > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) > > > > if (direct_mapping) { > > > > /* DDW maps the whole partition, so enable direct DMA mapping */ > > > > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT, > > > > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT, > > > > win64->value, tce_setrange_multi_pSeriesLP_walk); > > > > if (ret) { > > > > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n", > > > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > > > > struct memory_notify *arg = data; > > > > int ret = 0; > > > > + /* This notifier can get called when onlining persistent memory as well. > > > > + * TCEs are not pre-mapped for persistent memory. Persistent memory will > > > > + * always be above ddw_memory_hotplug_max() > > > > + */ > > > > + > > > > switch (action) { > > > > case MEM_GOING_ONLINE: > > > > spin_lock(&dma_win_list_lock); > > > > list_for_each_entry(window, &dma_win_list, list) { > > > > - if (window->direct) { > > > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > > > > + ddw_memory_hotplug_max()) { > > > > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn, > > > > arg->nr_pages, window->prop); > > > > } > > > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, > > > > case MEM_OFFLINE: > > > > spin_lock(&dma_win_list_lock); > > > > list_for_each_entry(window, &dma_win_list, list) { > > > > - if (window->direct) { > > > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) < > > > > + ddw_memory_hotplug_max()) { > > > > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn, > > > > arg->nr_pages, window->prop); > > > > } > > > > > > > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3 > > > > -- > > > > 2.39.3 (Apple Git-146) > > > > > > > > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-05-07 9:06 ` Amit Machhiwal @ 2025-05-07 12:23 ` Michal Suchánek 2025-05-09 18:32 ` Michal Suchánek 0 siblings, 1 reply; 11+ messages in thread From: Michal Suchánek @ 2025-05-07 12:23 UTC (permalink / raw) To: Gaurav Batra, linuxppc-dev, donettom Hello, On Wed, May 07, 2025 at 02:36:57PM +0530, Amit Machhiwal wrote: > Hi Michal, > > I can recreate this issue on sles16 distro kernel but I don't observe this issue > with upstream Linux 6.15-rc5 on the **same** sles16 guest. > > Note: the commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier > incorrectly adds TCEs for pmemory") was included since Linux 6.15-rc1. Indeed, it's not reproducible with 6.15-rc5. It's not clear what changed in TCE handling between 6.12 and 6.15 that avoids this problem. Thanks Michal ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory 2025-05-07 12:23 ` Michal Suchánek @ 2025-05-09 18:32 ` Michal Suchánek 0 siblings, 0 replies; 11+ messages in thread From: Michal Suchánek @ 2025-05-09 18:32 UTC (permalink / raw) To: Gaurav Batra, linuxppc-dev, donettom, Amit Machhiwal On Wed, May 07, 2025 at 02:23:02PM +0200, Michal Suchánek wrote: > Hello, > > On Wed, May 07, 2025 at 02:36:57PM +0530, Amit Machhiwal wrote: > > Hi Michal, > > > > I can recreate this issue on sles16 distro kernel but I don't observe this issue > > with upstream Linux 6.15-rc5 on the **same** sles16 guest. > > > > Note: the commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier > > incorrectly adds TCEs for pmemory") was included since Linux 6.15-rc1. > > Indeed, it's not reproducible with 6.15-rc5. > > It's not clear what changed in TCE handling between 6.12 and 6.15 that > avoids this problem. Seems backporting commit 67dfc11982f7 ("powerpc/pseries/iommu: create DDW for devices with DMA mask less than 64-bits") avoids the problem but it's not clear why. Thanks Michal ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-05-09 18:32 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra 2025-02-05 12:43 ` Donet Tom 2025-02-05 13:58 ` Gaurav Batra 2025-02-06 6:15 ` Donet Tom 2025-02-17 7:28 ` Madhavan Srinivasan 2025-03-19 17:29 ` Michal Suchánek 2025-03-26 14:46 ` Gaurav Batra 2025-03-26 14:53 ` Michal Suchánek 2025-05-07 9:06 ` Amit Machhiwal 2025-05-07 12:23 ` Michal Suchánek 2025-05-09 18:32 ` Michal Suchánek
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).