* [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
@ 2025-01-30 18:38 Gaurav Batra
2025-02-05 12:43 ` Donet Tom
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Gaurav Batra @ 2025-01-30 18:38 UTC (permalink / raw)
To: linuxppc-dev; +Cc: donettom, Gaurav Batra
iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
(DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
dynamically mapped when the device driver instructs to do so.
The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
after LPAR boot. The command to do so is -
daxctl reconfigure-device --mode=system-ram dax0.0 --force
This will dynamically add pmemory range to LPAR RAM eventually invoking
iommu_mem_notifier(). The address range of pmemory is way beyond the Max
RAM that the LPAR can have. Which means, this range is beyond the DDW
created for the device, at device initialization time.
As a result when TCEs are pre-mapped for the pmemory range, by
iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
command, daxctl, to add pmemory as RAM.
The solution is to not pre-map TCEs for pmemory.
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
---
arch/powerpc/include/asm/mmzone.h | 1 +
arch/powerpc/mm/numa.c | 2 +-
arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
3 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index d99863cd6cde..049152f8d597 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
#ifdef CONFIG_MEMORY_HOTPLUG
extern unsigned long max_pfn;
u64 memory_hotplug_max(void);
+u64 hot_add_drconf_memory_max(void);
#else
#define memory_hotplug_max() memblock_end_of_DRAM()
#endif
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3c1da08304d0..603a0f652ba6 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
return nid;
}
-static u64 hot_add_drconf_memory_max(void)
+u64 hot_add_drconf_memory_max(void)
{
struct device_node *memory = NULL;
struct device_node *dn = NULL;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 29f1a0cc59cd..abd9529a8f41 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
static phys_addr_t ddw_memory_hotplug_max(void)
{
- resource_size_t max_addr = memory_hotplug_max();
- struct device_node *memory;
+ resource_size_t max_addr;
- for_each_node_by_type(memory, "memory") {
- struct resource res;
-
- if (of_address_to_resource(memory, 0, &res))
- continue;
-
- max_addr = max_t(resource_size_t, max_addr, res.end + 1);
- }
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+ max_addr = hot_add_drconf_memory_max();
+#else
+ max_addr = memblock_end_of_DRAM();
+#endif
return max_addr;
}
@@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (direct_mapping) {
/* DDW maps the whole partition, so enable direct DMA mapping */
- ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
+ ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
win64->value, tce_setrange_multi_pSeriesLP_walk);
if (ret) {
dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
@@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
struct memory_notify *arg = data;
int ret = 0;
+ /* This notifier can get called when onlining persistent memory as well.
+ * TCEs are not pre-mapped for persistent memory. Persistent memory will
+ * always be above ddw_memory_hotplug_max()
+ */
+
switch (action) {
case MEM_GOING_ONLINE:
spin_lock(&dma_win_list_lock);
list_for_each_entry(window, &dma_win_list, list) {
- if (window->direct) {
+ if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
+ ddw_memory_hotplug_max()) {
ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
}
@@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
case MEM_OFFLINE:
spin_lock(&dma_win_list_lock);
list_for_each_entry(window, &dma_win_list, list) {
- if (window->direct) {
+ if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
+ ddw_memory_hotplug_max()) {
ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
}
base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra
@ 2025-02-05 12:43 ` Donet Tom
2025-02-05 13:58 ` Gaurav Batra
2025-02-17 7:28 ` Madhavan Srinivasan
2025-03-19 17:29 ` Michal Suchánek
2 siblings, 1 reply; 11+ messages in thread
From: Donet Tom @ 2025-02-05 12:43 UTC (permalink / raw)
To: Gaurav Batra, linuxppc-dev
On 1/31/25 00:08, Gaurav Batra wrote:
> iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> dynamically mapped when the device driver instructs to do so.
>
> The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> after LPAR boot. The command to do so is -
>
> daxctl reconfigure-device --mode=system-ram dax0.0 --force
>
> This will dynamically add pmemory range to LPAR RAM eventually invoking
> iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> RAM that the LPAR can have. Which means, this range is beyond the DDW
> created for the device, at device initialization time.
>
> As a result when TCEs are pre-mapped for the pmemory range, by
> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> command, daxctl, to add pmemory as RAM.
>
> The solution is to not pre-map TCEs for pmemory.
>
> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> ---
> arch/powerpc/include/asm/mmzone.h | 1 +
> arch/powerpc/mm/numa.c | 2 +-
> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> 3 files changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> index d99863cd6cde..049152f8d597 100644
> --- a/arch/powerpc/include/asm/mmzone.h
> +++ b/arch/powerpc/include/asm/mmzone.h
> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> #ifdef CONFIG_MEMORY_HOTPLUG
> extern unsigned long max_pfn;
> u64 memory_hotplug_max(void);
> +u64 hot_add_drconf_memory_max(void);
> #else
> #define memory_hotplug_max() memblock_end_of_DRAM()
> #endif
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c1da08304d0..603a0f652ba6 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> return nid;
> }
>
> -static u64 hot_add_drconf_memory_max(void)
> +u64 hot_add_drconf_memory_max(void)
> {
> struct device_node *memory = NULL;
> struct device_node *dn = NULL;
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 29f1a0cc59cd..abd9529a8f41 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
>
> static phys_addr_t ddw_memory_hotplug_max(void)
> {
> - resource_size_t max_addr = memory_hotplug_max();
> - struct device_node *memory;
> + resource_size_t max_addr;
>
> - for_each_node_by_type(memory, "memory") {
> - struct resource res;
> -
> - if (of_address_to_resource(memory, 0, &res))
> - continue;
> -
> - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> - }
> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> + max_addr = hot_add_drconf_memory_max();
> +#else
> + max_addr = memblock_end_of_DRAM();
> +#endif
>
> return max_addr;
> }
> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>
> if (direct_mapping) {
> /* DDW maps the whole partition, so enable direct DMA mapping */
> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> win64->value, tce_setrange_multi_pSeriesLP_walk);
> if (ret) {
> dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> struct memory_notify *arg = data;
> int ret = 0;
>
> + /* This notifier can get called when onlining persistent memory as well.
> + * TCEs are not pre-mapped for persistent memory. Persistent memory will
> + * always be above ddw_memory_hotplug_max()
> + */
> +
> switch (action) {
> case MEM_GOING_ONLINE:
> spin_lock(&dma_win_list_lock);
> list_for_each_entry(window, &dma_win_list, list) {
> - if (window->direct) {
> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> + ddw_memory_hotplug_max()) {
Hi Gaurav,
Since the pmem_start will be greater than ddw_memory_hotplug_max(), and
we have not created DDW beyond ddw_memory_hotplug_max(), we are not
adding TCE for this range, right?
I have tested this patch on my system, and daxctl reconfigure-device is
able to reconfigure PMEM to system RAM.
~# daxctl reconfigure-device --mode=system-ram dax1.0 --force
[
{
"chardev":"dax1.0",
"size":5362417664,
"target_node":4,
"align":65536,
"mode":"system-ram",
"online_memblocks":4,
"total_memblocks":4,
"movable":true
}
]
reconfigured 1 device
~#
~# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000697fffffff 422G online yes 0-421
0x0000040380000000-0x000004047fffffff 4G online yes 4110-4113
Memory block size: 1G
Total online memory: 426G
Total offline memory: 0B
root@ltcden14-lp2:~#
Thanks
Donet
> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> arg->nr_pages, window->prop);
> }
> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> case MEM_OFFLINE:
> spin_lock(&dma_win_list_lock);
> list_for_each_entry(window, &dma_win_list, list) {
> - if (window->direct) {
> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> + ddw_memory_hotplug_max()) {
> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> arg->nr_pages, window->prop);
> }
>
> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-02-05 12:43 ` Donet Tom
@ 2025-02-05 13:58 ` Gaurav Batra
2025-02-06 6:15 ` Donet Tom
0 siblings, 1 reply; 11+ messages in thread
From: Gaurav Batra @ 2025-02-05 13:58 UTC (permalink / raw)
To: Donet Tom, linuxppc-dev
On 2/5/25 6:43 AM, Donet Tom wrote:
>
> On 1/31/25 00:08, Gaurav Batra wrote:
>> iommu_mem_notifier() is invoked when RAM is dynamically
>> added/removed. This
>> notifier call is responsible to add/remove TCEs from the Dynamic DMA
>> Window
>> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
>> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
>> dynamically mapped when the device driver instructs to do so.
>>
>> The issue is 'daxctl' command is capable of adding pmemory as "System
>> RAM"
>> after LPAR boot. The command to do so is -
>>
>> daxctl reconfigure-device --mode=system-ram dax0.0 --force
>>
>> This will dynamically add pmemory range to LPAR RAM eventually invoking
>> iommu_mem_notifier(). The address range of pmemory is way beyond the Max
>> RAM that the LPAR can have. Which means, this range is beyond the DDW
>> created for the device, at device initialization time.
>>
>> As a result when TCEs are pre-mapped for the pmemory range, by
>> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
>> command, daxctl, to add pmemory as RAM.
>>
>> The solution is to not pre-map TCEs for pmemory.
>>
>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/mmzone.h | 1 +
>> arch/powerpc/mm/numa.c | 2 +-
>> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
>> 3 files changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmzone.h
>> b/arch/powerpc/include/asm/mmzone.h
>> index d99863cd6cde..049152f8d597 100644
>> --- a/arch/powerpc/include/asm/mmzone.h
>> +++ b/arch/powerpc/include/asm/mmzone.h
>> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
>> #ifdef CONFIG_MEMORY_HOTPLUG
>> extern unsigned long max_pfn;
>> u64 memory_hotplug_max(void);
>> +u64 hot_add_drconf_memory_max(void);
>> #else
>> #define memory_hotplug_max() memblock_end_of_DRAM()
>> #endif
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 3c1da08304d0..603a0f652ba6 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
>> return nid;
>> }
>> -static u64 hot_add_drconf_memory_max(void)
>> +u64 hot_add_drconf_memory_max(void)
>> {
>> struct device_node *memory = NULL;
>> struct device_node *dn = NULL;
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 29f1a0cc59cd..abd9529a8f41 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
>> static phys_addr_t ddw_memory_hotplug_max(void)
>> {
>> - resource_size_t max_addr = memory_hotplug_max();
>> - struct device_node *memory;
>> + resource_size_t max_addr;
>> - for_each_node_by_type(memory, "memory") {
>> - struct resource res;
>> -
>> - if (of_address_to_resource(memory, 0, &res))
>> - continue;
>> -
>> - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>> - }
>> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>> + max_addr = hot_add_drconf_memory_max();
>> +#else
>> + max_addr = memblock_end_of_DRAM();
>> +#endif
>> return max_addr;
>> }
>> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev,
>> struct device_node *pdn)
>> if (direct_mapping) {
>> /* DDW maps the whole partition, so enable direct DMA
>> mapping */
>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >>
>> PAGE_SHIFT,
>> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >>
>> PAGE_SHIFT,
>> win64->value,
>> tce_setrange_multi_pSeriesLP_walk);
>> if (ret) {
>> dev_info(&dev->dev, "failed to map DMA window for %pOF:
>> %d\n",
>> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct
>> notifier_block *nb, unsigned long action,
>> struct memory_notify *arg = data;
>> int ret = 0;
>> + /* This notifier can get called when onlining persistent
>> memory as well.
>> + * TCEs are not pre-mapped for persistent memory. Persistent
>> memory will
>> + * always be above ddw_memory_hotplug_max()
>> + */
>> +
>> switch (action) {
>> case MEM_GOING_ONLINE:
>> spin_lock(&dma_win_list_lock);
>> list_for_each_entry(window, &dma_win_list, list) {
>> - if (window->direct) {
>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> + ddw_memory_hotplug_max()) {
> Hi Gaurav,
>
> Since the pmem_start will be greater than ddw_memory_hotplug_max(),
> and we have not created DDW beyond ddw_memory_hotplug_max(), we are
> not adding TCE for this range, right?
>
That is correct
> I have tested this patch on my system, and daxctl reconfigure-device
> is able to reconfigure PMEM to system RAM.
>
> ~# daxctl reconfigure-device --mode=system-ram dax1.0 --force
> [
> {
> "chardev":"dax1.0",
> "size":5362417664,
> "target_node":4,
> "align":65536,
> "mode":"system-ram",
> "online_memblocks":4,
> "total_memblocks":4,
> "movable":true
> }
> ]
> reconfigured 1 device
> ~#
> ~# lsmem
> RANGE SIZE STATE REMOVABLE BLOCK
> 0x0000000000000000-0x000000697fffffff 422G online yes 0-421
> 0x0000040380000000-0x000004047fffffff 4G online yes 4110-4113
>
> Memory block size: 1G
> Total online memory: 426G
> Total offline memory: 0B
> root@ltcden14-lp2:~#
>
> Thanks
> Donet
>> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> arg->nr_pages, window->prop);
>> }
>> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct
>> notifier_block *nb, unsigned long action,
>> case MEM_OFFLINE:
>> spin_lock(&dma_win_list_lock);
>> list_for_each_entry(window, &dma_win_list, list) {
>> - if (window->direct) {
>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> + ddw_memory_hotplug_max()) {
>> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> arg->nr_pages, window->prop);
>> }
>>
>> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-02-05 13:58 ` Gaurav Batra
@ 2025-02-06 6:15 ` Donet Tom
0 siblings, 0 replies; 11+ messages in thread
From: Donet Tom @ 2025-02-06 6:15 UTC (permalink / raw)
To: Gaurav Batra, linuxppc-dev
On 2/5/25 19:28, Gaurav Batra wrote:
>
> On 2/5/25 6:43 AM, Donet Tom wrote:
>>
>> On 1/31/25 00:08, Gaurav Batra wrote:
>>> iommu_mem_notifier() is invoked when RAM is dynamically
>>> added/removed. This
>>> notifier call is responsible to add/remove TCEs from the Dynamic DMA
>>> Window
>>> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and
>>> not
>>> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
>>> dynamically mapped when the device driver instructs to do so.
>>>
>>> The issue is 'daxctl' command is capable of adding pmemory as
>>> "System RAM"
>>> after LPAR boot. The command to do so is -
>>>
>>> daxctl reconfigure-device --mode=system-ram dax0.0 --force
>>>
>>> This will dynamically add pmemory range to LPAR RAM eventually invoking
>>> iommu_mem_notifier(). The address range of pmemory is way beyond the
>>> Max
>>> RAM that the LPAR can have. Which means, this range is beyond the DDW
>>> created for the device, at device initialization time.
>>>
>>> As a result when TCEs are pre-mapped for the pmemory range, by
>>> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
>>> command, daxctl, to add pmemory as RAM.
>>>
>>> The solution is to not pre-map TCEs for pmemory.
>>>
>>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>>> ---
>>> arch/powerpc/include/asm/mmzone.h | 1 +
>>> arch/powerpc/mm/numa.c | 2 +-
>>> arch/powerpc/platforms/pseries/iommu.c | 29
>>> ++++++++++++++------------
>>> 3 files changed, 18 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/mmzone.h
>>> b/arch/powerpc/include/asm/mmzone.h
>>> index d99863cd6cde..049152f8d597 100644
>>> --- a/arch/powerpc/include/asm/mmzone.h
>>> +++ b/arch/powerpc/include/asm/mmzone.h
>>> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
>>> #ifdef CONFIG_MEMORY_HOTPLUG
>>> extern unsigned long max_pfn;
>>> u64 memory_hotplug_max(void);
>>> +u64 hot_add_drconf_memory_max(void);
>>> #else
>>> #define memory_hotplug_max() memblock_end_of_DRAM()
>>> #endif
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 3c1da08304d0..603a0f652ba6 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
>>> return nid;
>>> }
>>> -static u64 hot_add_drconf_memory_max(void)
>>> +u64 hot_add_drconf_memory_max(void)
>>> {
>>> struct device_node *memory = NULL;
>>> struct device_node *dn = NULL;
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index 29f1a0cc59cd..abd9529a8f41 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>> static phys_addr_t ddw_memory_hotplug_max(void)
>>> {
>>> - resource_size_t max_addr = memory_hotplug_max();
>>> - struct device_node *memory;
>>> + resource_size_t max_addr;
>>> - for_each_node_by_type(memory, "memory") {
>>> - struct resource res;
>>> -
>>> - if (of_address_to_resource(memory, 0, &res))
>>> - continue;
>>> -
>>> - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>>> - }
>>> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>>> + max_addr = hot_add_drconf_memory_max();
>>> +#else
>>> + max_addr = memblock_end_of_DRAM();
>>> +#endif
>>> return max_addr;
>>> }
>>> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev,
>>> struct device_node *pdn)
>>> if (direct_mapping) {
>>> /* DDW maps the whole partition, so enable direct DMA
>>> mapping */
>>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >>
>>> PAGE_SHIFT,
>>> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >>
>>> PAGE_SHIFT,
>>> win64->value,
>>> tce_setrange_multi_pSeriesLP_walk);
>>> if (ret) {
>>> dev_info(&dev->dev, "failed to map DMA window for
>>> %pOF: %d\n",
>>> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct
>>> notifier_block *nb, unsigned long action,
>>> struct memory_notify *arg = data;
>>> int ret = 0;
>>> + /* This notifier can get called when onlining persistent
>>> memory as well.
>>> + * TCEs are not pre-mapped for persistent memory. Persistent
>>> memory will
>>> + * always be above ddw_memory_hotplug_max()
>>> + */
>>> +
>>> switch (action) {
>>> case MEM_GOING_ONLINE:
>>> spin_lock(&dma_win_list_lock);
>>> list_for_each_entry(window, &dma_win_list, list) {
>>> - if (window->direct) {
>>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>>> + ddw_memory_hotplug_max()) {
>> Hi Gaurav,
>>
>> Since the pmem_start will be greater than ddw_memory_hotplug_max(),
>> and we have not created DDW beyond ddw_memory_hotplug_max(), we are
>> not adding TCE for this range, right?
>>
> That is correct
>
Thank you.
This looks good to me. feel free to add
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
>
>> I have tested this patch on my system, and daxctl reconfigure-device
>> is able to reconfigure PMEM to system RAM.
>>
>> ~# daxctl reconfigure-device --mode=system-ram dax1.0 --force
>> [
>> {
>> "chardev":"dax1.0",
>> "size":5362417664,
>> "target_node":4,
>> "align":65536,
>> "mode":"system-ram",
>> "online_memblocks":4,
>> "total_memblocks":4,
>> "movable":true
>> }
>> ]
>> reconfigured 1 device
>> ~#
>> ~# lsmem
>> RANGE SIZE STATE REMOVABLE BLOCK
>> 0x0000000000000000-0x000000697fffffff 422G online yes 0-421
>> 0x0000040380000000-0x000004047fffffff 4G online yes 4110-4113
>>
>> Memory block size: 1G
>> Total online memory: 426G
>> Total offline memory: 0B
>> root@ltcden14-lp2:~#
>>
>> Thanks
>> Donet
>>> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> arg->nr_pages, window->prop);
>>> }
>>> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct
>>> notifier_block *nb, unsigned long action,
>>> case MEM_OFFLINE:
>>> spin_lock(&dma_win_list_lock);
>>> list_for_each_entry(window, &dma_win_list, list) {
>>> - if (window->direct) {
>>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>>> + ddw_memory_hotplug_max()) {
>>> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> arg->nr_pages, window->prop);
>>> }
>>>
>>> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra
2025-02-05 12:43 ` Donet Tom
@ 2025-02-17 7:28 ` Madhavan Srinivasan
2025-03-19 17:29 ` Michal Suchánek
2 siblings, 0 replies; 11+ messages in thread
From: Madhavan Srinivasan @ 2025-02-17 7:28 UTC (permalink / raw)
To: linuxppc-dev, Gaurav Batra; +Cc: donettom
On Thu, 30 Jan 2025 12:38:54 -0600, Gaurav Batra wrote:
> iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> dynamically mapped when the device driver instructs to do so.
>
> The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> after LPAR boot. The command to do so is -
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
https://git.kernel.org/powerpc/c/6aa989ab2bd0d37540c812b4270006ff794662e7
Thanks
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra
2025-02-05 12:43 ` Donet Tom
2025-02-17 7:28 ` Madhavan Srinivasan
@ 2025-03-19 17:29 ` Michal Suchánek
2025-03-26 14:46 ` Gaurav Batra
2 siblings, 1 reply; 11+ messages in thread
From: Michal Suchánek @ 2025-03-19 17:29 UTC (permalink / raw)
To: Gaurav Batra; +Cc: linuxppc-dev, donettom
Hello,
looks like this upsets some assumption qemu has about these windows.
https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
When Linux kernel that has this patch applied is running inside a qemu
VM with a PCI device and the VM is rebooted qemu crashes shortly after
the next Linux kernel starts.
This is quite curious since qemu does AFAIK not support pmemory at all.
Any idea what went wrong there?
Thanks
Michal
On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> dynamically mapped when the device driver instructs to do so.
>
> The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> after LPAR boot. The command to do so is -
>
> daxctl reconfigure-device --mode=system-ram dax0.0 --force
>
> This will dynamically add pmemory range to LPAR RAM eventually invoking
> iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> RAM that the LPAR can have. Which means, this range is beyond the DDW
> created for the device, at device initialization time.
>
> As a result when TCEs are pre-mapped for the pmemory range, by
> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> command, daxctl, to add pmemory as RAM.
>
> The solution is to not pre-map TCEs for pmemory.
>
> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> ---
> arch/powerpc/include/asm/mmzone.h | 1 +
> arch/powerpc/mm/numa.c | 2 +-
> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> 3 files changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> index d99863cd6cde..049152f8d597 100644
> --- a/arch/powerpc/include/asm/mmzone.h
> +++ b/arch/powerpc/include/asm/mmzone.h
> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> #ifdef CONFIG_MEMORY_HOTPLUG
> extern unsigned long max_pfn;
> u64 memory_hotplug_max(void);
> +u64 hot_add_drconf_memory_max(void);
> #else
> #define memory_hotplug_max() memblock_end_of_DRAM()
> #endif
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c1da08304d0..603a0f652ba6 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> return nid;
> }
>
> -static u64 hot_add_drconf_memory_max(void)
> +u64 hot_add_drconf_memory_max(void)
> {
> struct device_node *memory = NULL;
> struct device_node *dn = NULL;
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 29f1a0cc59cd..abd9529a8f41 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
>
> static phys_addr_t ddw_memory_hotplug_max(void)
> {
> - resource_size_t max_addr = memory_hotplug_max();
> - struct device_node *memory;
> + resource_size_t max_addr;
>
> - for_each_node_by_type(memory, "memory") {
> - struct resource res;
> -
> - if (of_address_to_resource(memory, 0, &res))
> - continue;
> -
> - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> - }
> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> + max_addr = hot_add_drconf_memory_max();
> +#else
> + max_addr = memblock_end_of_DRAM();
> +#endif
>
> return max_addr;
> }
> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>
> if (direct_mapping) {
> /* DDW maps the whole partition, so enable direct DMA mapping */
> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> win64->value, tce_setrange_multi_pSeriesLP_walk);
> if (ret) {
> dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> struct memory_notify *arg = data;
> int ret = 0;
>
> + /* This notifier can get called when onlining persistent memory as well.
> + * TCEs are not pre-mapped for persistent memory. Persistent memory will
> + * always be above ddw_memory_hotplug_max()
> + */
> +
> switch (action) {
> case MEM_GOING_ONLINE:
> spin_lock(&dma_win_list_lock);
> list_for_each_entry(window, &dma_win_list, list) {
> - if (window->direct) {
> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> + ddw_memory_hotplug_max()) {
> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> arg->nr_pages, window->prop);
> }
> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> case MEM_OFFLINE:
> spin_lock(&dma_win_list_lock);
> list_for_each_entry(window, &dma_win_list, list) {
> - if (window->direct) {
> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> + ddw_memory_hotplug_max()) {
> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> arg->nr_pages, window->prop);
> }
>
> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> --
> 2.39.3 (Apple Git-146)
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-03-19 17:29 ` Michal Suchánek
@ 2025-03-26 14:46 ` Gaurav Batra
2025-03-26 14:53 ` Michal Suchánek
0 siblings, 1 reply; 11+ messages in thread
From: Gaurav Batra @ 2025-03-26 14:46 UTC (permalink / raw)
To: Michal Suchánek; +Cc: linuxppc-dev, donettom
Hello Michal,
In the patch to fix the pmemory bug, I made some changes to the code
that determines Max memory an LPAR can have (excluding pmemory). This
information is needed while creating Dynamic DMA Window (DDW). These
changes are in the main line code path of DDW creation. This might have
irritated QEMU somehow, no idea yet on how.
Thanks,
Gaurav
On 3/19/25 12:29 PM, Michal Suchánek wrote:
> Hello,
>
> looks like this upsets some assumption qemu has about these windows.
>
> https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
>
> When Linux kernel that has this patch applied is running inside a qemu
> VM with a PCI device and the VM is rebooted qemu crashes shortly after
> the next Linux kernel starts.
>
> This is quite curious since qemu does AFAIK not support pmemory at all.
>
> Any idea what went wrong there?
>
> Thanks
>
> Michal
>
> On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
>> iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
>> notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
>> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
>> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
>> dynamically mapped when the device driver instructs to do so.
>>
>> The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
>> after LPAR boot. The command to do so is -
>>
>> daxctl reconfigure-device --mode=system-ram dax0.0 --force
>>
>> This will dynamically add pmemory range to LPAR RAM eventually invoking
>> iommu_mem_notifier(). The address range of pmemory is way beyond the Max
>> RAM that the LPAR can have. Which means, this range is beyond the DDW
>> created for the device, at device initialization time.
>>
>> As a result when TCEs are pre-mapped for the pmemory range, by
>> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
>> command, daxctl, to add pmemory as RAM.
>>
>> The solution is to not pre-map TCEs for pmemory.
>>
>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/mmzone.h | 1 +
>> arch/powerpc/mm/numa.c | 2 +-
>> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
>> 3 files changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
>> index d99863cd6cde..049152f8d597 100644
>> --- a/arch/powerpc/include/asm/mmzone.h
>> +++ b/arch/powerpc/include/asm/mmzone.h
>> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
>> #ifdef CONFIG_MEMORY_HOTPLUG
>> extern unsigned long max_pfn;
>> u64 memory_hotplug_max(void);
>> +u64 hot_add_drconf_memory_max(void);
>> #else
>> #define memory_hotplug_max() memblock_end_of_DRAM()
>> #endif
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 3c1da08304d0..603a0f652ba6 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
>> return nid;
>> }
>>
>> -static u64 hot_add_drconf_memory_max(void)
>> +u64 hot_add_drconf_memory_max(void)
>> {
>> struct device_node *memory = NULL;
>> struct device_node *dn = NULL;
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 29f1a0cc59cd..abd9529a8f41 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>
>> static phys_addr_t ddw_memory_hotplug_max(void)
>> {
>> - resource_size_t max_addr = memory_hotplug_max();
>> - struct device_node *memory;
>> + resource_size_t max_addr;
>>
>> - for_each_node_by_type(memory, "memory") {
>> - struct resource res;
>> -
>> - if (of_address_to_resource(memory, 0, &res))
>> - continue;
>> -
>> - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>> - }
>> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>> + max_addr = hot_add_drconf_memory_max();
>> +#else
>> + max_addr = memblock_end_of_DRAM();
>> +#endif
>>
>> return max_addr;
>> }
>> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>
>> if (direct_mapping) {
>> /* DDW maps the whole partition, so enable direct DMA mapping */
>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
>> win64->value, tce_setrange_multi_pSeriesLP_walk);
>> if (ret) {
>> dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
>> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>> struct memory_notify *arg = data;
>> int ret = 0;
>>
>> + /* This notifier can get called when onlining persistent memory as well.
>> + * TCEs are not pre-mapped for persistent memory. Persistent memory will
>> + * always be above ddw_memory_hotplug_max()
>> + */
>> +
>> switch (action) {
>> case MEM_GOING_ONLINE:
>> spin_lock(&dma_win_list_lock);
>> list_for_each_entry(window, &dma_win_list, list) {
>> - if (window->direct) {
>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> + ddw_memory_hotplug_max()) {
>> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> arg->nr_pages, window->prop);
>> }
>> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>> case MEM_OFFLINE:
>> spin_lock(&dma_win_list_lock);
>> list_for_each_entry(window, &dma_win_list, list) {
>> - if (window->direct) {
>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> + ddw_memory_hotplug_max()) {
>> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> arg->nr_pages, window->prop);
>> }
>>
>> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
>> --
>> 2.39.3 (Apple Git-146)
>>
>>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-03-26 14:46 ` Gaurav Batra
@ 2025-03-26 14:53 ` Michal Suchánek
2025-05-07 9:06 ` Amit Machhiwal
0 siblings, 1 reply; 11+ messages in thread
From: Michal Suchánek @ 2025-03-26 14:53 UTC (permalink / raw)
To: Gaurav Batra; +Cc: linuxppc-dev, donettom
Hello,
On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote:
> Hello Michal,
>
> In the patch to fix the pmemory bug, I made some changes to the code that
> determines Max memory an LPAR can have (excluding pmemory). This information
> is needed while creating Dynamic DMA Window (DDW). These changes are in the
> main line code path of DDW creation. This might have irritated QEMU somehow,
> no idea yet on how.
Yes, it's defeinitely something with the DDW code. Using the
disable_ddw=1 kernel parameter avoids the qemu crash.
The kernels in
https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/
have the patch applied.
Booting the kernel inside qemu VM with a PCI device (such as the USB
hub) and then rebooting the VM crashes qemu.
Thanks
Michal
>
> Thanks,
>
> Gaurav
>
> On 3/19/25 12:29 PM, Michal Suchánek wrote:
> > Hello,
> >
> > looks like this upsets some assumption qemu has about these windows.
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
> >
> > When Linux kernel that has this patch applied is running inside a qemu
> > VM with a PCI device and the VM is rebooted qemu crashes shortly after
> > the next Linux kernel starts.
> >
> > This is quite curious since qemu does AFAIK not support pmemory at all.
> >
> > Any idea what went wrong there?
> >
> > Thanks
> >
> > Michal
> >
> > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> > > dynamically mapped when the device driver instructs to do so.
> > >
> > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> > > after LPAR boot. The command to do so is -
> > >
> > > daxctl reconfigure-device --mode=system-ram dax0.0 --force
> > >
> > > This will dynamically add pmemory range to LPAR RAM eventually invoking
> > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> > > RAM that the LPAR can have. Which means, this range is beyond the DDW
> > > created for the device, at device initialization time.
> > >
> > > As a result when TCEs are pre-mapped for the pmemory range, by
> > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> > > command, daxctl, to add pmemory as RAM.
> > >
> > > The solution is to not pre-map TCEs for pmemory.
> > >
> > > Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> > > ---
> > > arch/powerpc/include/asm/mmzone.h | 1 +
> > > arch/powerpc/mm/numa.c | 2 +-
> > > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> > > 3 files changed, 18 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> > > index d99863cd6cde..049152f8d597 100644
> > > --- a/arch/powerpc/include/asm/mmzone.h
> > > +++ b/arch/powerpc/include/asm/mmzone.h
> > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> > > #ifdef CONFIG_MEMORY_HOTPLUG
> > > extern unsigned long max_pfn;
> > > u64 memory_hotplug_max(void);
> > > +u64 hot_add_drconf_memory_max(void);
> > > #else
> > > #define memory_hotplug_max() memblock_end_of_DRAM()
> > > #endif
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 3c1da08304d0..603a0f652ba6 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> > > return nid;
> > > }
> > > -static u64 hot_add_drconf_memory_max(void)
> > > +u64 hot_add_drconf_memory_max(void)
> > > {
> > > struct device_node *memory = NULL;
> > > struct device_node *dn = NULL;
> > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > index 29f1a0cc59cd..abd9529a8f41 100644
> > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
> > > static phys_addr_t ddw_memory_hotplug_max(void)
> > > {
> > > - resource_size_t max_addr = memory_hotplug_max();
> > > - struct device_node *memory;
> > > + resource_size_t max_addr;
> > > - for_each_node_by_type(memory, "memory") {
> > > - struct resource res;
> > > -
> > > - if (of_address_to_resource(memory, 0, &res))
> > > - continue;
> > > -
> > > - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> > > - }
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > + max_addr = hot_add_drconf_memory_max();
> > > +#else
> > > + max_addr = memblock_end_of_DRAM();
> > > +#endif
> > > return max_addr;
> > > }
> > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > if (direct_mapping) {
> > > /* DDW maps the whole partition, so enable direct DMA mapping */
> > > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> > > win64->value, tce_setrange_multi_pSeriesLP_walk);
> > > if (ret) {
> > > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > struct memory_notify *arg = data;
> > > int ret = 0;
> > > + /* This notifier can get called when onlining persistent memory as well.
> > > + * TCEs are not pre-mapped for persistent memory. Persistent memory will
> > > + * always be above ddw_memory_hotplug_max()
> > > + */
> > > +
> > > switch (action) {
> > > case MEM_GOING_ONLINE:
> > > spin_lock(&dma_win_list_lock);
> > > list_for_each_entry(window, &dma_win_list, list) {
> > > - if (window->direct) {
> > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > + ddw_memory_hotplug_max()) {
> > > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> > > arg->nr_pages, window->prop);
> > > }
> > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > case MEM_OFFLINE:
> > > spin_lock(&dma_win_list_lock);
> > > list_for_each_entry(window, &dma_win_list, list) {
> > > - if (window->direct) {
> > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > + ddw_memory_hotplug_max()) {
> > > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> > > arg->nr_pages, window->prop);
> > > }
> > >
> > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> > > --
> > > 2.39.3 (Apple Git-146)
> > >
> > >
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-03-26 14:53 ` Michal Suchánek
@ 2025-05-07 9:06 ` Amit Machhiwal
2025-05-07 12:23 ` Michal Suchánek
0 siblings, 1 reply; 11+ messages in thread
From: Amit Machhiwal @ 2025-05-07 9:06 UTC (permalink / raw)
To: Michal Suchánek; +Cc: Gaurav Batra, linuxppc-dev, donettom
Hi Michal,
I can recreate this issue on sles16 distro kernel but I don't observe this issue
with upstream Linux 6.15-rc5 on the **same** sles16 guest.
Note: the commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier
incorrectly adds TCEs for pmemory") was included since Linux 6.15-rc1.
I think there's something more to this crash.
Thanks,
Amit
On 2025/03/26 03:53 PM, Michal Suchánek wrote:
> Hello,
>
> On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote:
> > Hello Michal,
> >
> > In the patch to fix the pmemory bug, I made some changes to the code that
> > determines Max memory an LPAR can have (excluding pmemory). This information
> > is needed while creating Dynamic DMA Window (DDW). These changes are in the
> > main line code path of DDW creation. This might have irritated QEMU somehow,
> > no idea yet on how.
>
> Yes, it's defeinitely something with the DDW code. Using the
> disable_ddw=1 kernel parameter avoids the qemu crash.
>
> The kernels in
> https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/
>
> have the patch applied.
>
> Booting the kernel inside qemu VM with a PCI device (such as the USB
> hub) and then rebooting the VM crashes qemu.
>
> Thanks
>
> Michal
>
> >
> > Thanks,
> >
> > Gaurav
> >
> > On 3/19/25 12:29 PM, Michal Suchánek wrote:
> > > Hello,
> > >
> > > looks like this upsets some assumption qemu has about these windows.
> > >
> > > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
> > >
> > > When Linux kernel that has this patch applied is running inside a qemu
> > > VM with a PCI device and the VM is rebooted qemu crashes shortly after
> > > the next Linux kernel starts.
> > >
> > > This is quite curious since qemu does AFAIK not support pmemory at all.
> > >
> > > Any idea what went wrong there?
> > >
> > > Thanks
> > >
> > > Michal
> > >
> > > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> > > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> > > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> > > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> > > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> > > > dynamically mapped when the device driver instructs to do so.
> > > >
> > > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> > > > after LPAR boot. The command to do so is -
> > > >
> > > > daxctl reconfigure-device --mode=system-ram dax0.0 --force
> > > >
> > > > This will dynamically add pmemory range to LPAR RAM eventually invoking
> > > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> > > > RAM that the LPAR can have. Which means, this range is beyond the DDW
> > > > created for the device, at device initialization time.
> > > >
> > > > As a result when TCEs are pre-mapped for the pmemory range, by
> > > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> > > > command, daxctl, to add pmemory as RAM.
> > > >
> > > > The solution is to not pre-map TCEs for pmemory.
> > > >
> > > > Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> > > > ---
> > > > arch/powerpc/include/asm/mmzone.h | 1 +
> > > > arch/powerpc/mm/numa.c | 2 +-
> > > > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> > > > 3 files changed, 18 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> > > > index d99863cd6cde..049152f8d597 100644
> > > > --- a/arch/powerpc/include/asm/mmzone.h
> > > > +++ b/arch/powerpc/include/asm/mmzone.h
> > > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> > > > #ifdef CONFIG_MEMORY_HOTPLUG
> > > > extern unsigned long max_pfn;
> > > > u64 memory_hotplug_max(void);
> > > > +u64 hot_add_drconf_memory_max(void);
> > > > #else
> > > > #define memory_hotplug_max() memblock_end_of_DRAM()
> > > > #endif
> > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > > index 3c1da08304d0..603a0f652ba6 100644
> > > > --- a/arch/powerpc/mm/numa.c
> > > > +++ b/arch/powerpc/mm/numa.c
> > > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> > > > return nid;
> > > > }
> > > > -static u64 hot_add_drconf_memory_max(void)
> > > > +u64 hot_add_drconf_memory_max(void)
> > > > {
> > > > struct device_node *memory = NULL;
> > > > struct device_node *dn = NULL;
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > index 29f1a0cc59cd..abd9529a8f41 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
> > > > static phys_addr_t ddw_memory_hotplug_max(void)
> > > > {
> > > > - resource_size_t max_addr = memory_hotplug_max();
> > > > - struct device_node *memory;
> > > > + resource_size_t max_addr;
> > > > - for_each_node_by_type(memory, "memory") {
> > > > - struct resource res;
> > > > -
> > > > - if (of_address_to_resource(memory, 0, &res))
> > > > - continue;
> > > > -
> > > > - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> > > > - }
> > > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > > + max_addr = hot_add_drconf_memory_max();
> > > > +#else
> > > > + max_addr = memblock_end_of_DRAM();
> > > > +#endif
> > > > return max_addr;
> > > > }
> > > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > if (direct_mapping) {
> > > > /* DDW maps the whole partition, so enable direct DMA mapping */
> > > > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> > > > win64->value, tce_setrange_multi_pSeriesLP_walk);
> > > > if (ret) {
> > > > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> > > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > > struct memory_notify *arg = data;
> > > > int ret = 0;
> > > > + /* This notifier can get called when onlining persistent memory as well.
> > > > + * TCEs are not pre-mapped for persistent memory. Persistent memory will
> > > > + * always be above ddw_memory_hotplug_max()
> > > > + */
> > > > +
> > > > switch (action) {
> > > > case MEM_GOING_ONLINE:
> > > > spin_lock(&dma_win_list_lock);
> > > > list_for_each_entry(window, &dma_win_list, list) {
> > > > - if (window->direct) {
> > > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > > + ddw_memory_hotplug_max()) {
> > > > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> > > > arg->nr_pages, window->prop);
> > > > }
> > > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > > case MEM_OFFLINE:
> > > > spin_lock(&dma_win_list_lock);
> > > > list_for_each_entry(window, &dma_win_list, list) {
> > > > - if (window->direct) {
> > > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > > + ddw_memory_hotplug_max()) {
> > > > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> > > > arg->nr_pages, window->prop);
> > > > }
> > > >
> > > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> > > > --
> > > > 2.39.3 (Apple Git-146)
> > > >
> > > >
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-05-07 9:06 ` Amit Machhiwal
@ 2025-05-07 12:23 ` Michal Suchánek
2025-05-09 18:32 ` Michal Suchánek
0 siblings, 1 reply; 11+ messages in thread
From: Michal Suchánek @ 2025-05-07 12:23 UTC (permalink / raw)
To: Gaurav Batra, linuxppc-dev, donettom
Hello,
On Wed, May 07, 2025 at 02:36:57PM +0530, Amit Machhiwal wrote:
> Hi Michal,
>
> I can recreate this issue on sles16 distro kernel but I don't observe this issue
> with upstream Linux 6.15-rc5 on the **same** sles16 guest.
>
> Note: the commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier
> incorrectly adds TCEs for pmemory") was included since Linux 6.15-rc1.
Indeed, it's not reproducible with 6.15-rc5.
It's not clear what changed in TCE handling between 6.12 and 6.15 that
avoids this problem.
Thanks
Michal
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
2025-05-07 12:23 ` Michal Suchánek
@ 2025-05-09 18:32 ` Michal Suchánek
0 siblings, 0 replies; 11+ messages in thread
From: Michal Suchánek @ 2025-05-09 18:32 UTC (permalink / raw)
To: Gaurav Batra, linuxppc-dev, donettom, Amit Machhiwal
On Wed, May 07, 2025 at 02:23:02PM +0200, Michal Suchánek wrote:
> Hello,
>
> On Wed, May 07, 2025 at 02:36:57PM +0530, Amit Machhiwal wrote:
> > Hi Michal,
> >
> > I can recreate this issue on sles16 distro kernel but I don't observe this issue
> > with upstream Linux 6.15-rc5 on the **same** sles16 guest.
> >
> > Note: the commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier
> > incorrectly adds TCEs for pmemory") was included since Linux 6.15-rc1.
>
> Indeed, it's not reproducible with 6.15-rc5.
>
> It's not clear what changed in TCE handling between 6.12 and 6.15 that
> avoids this problem.
Seems backporting commit 67dfc11982f7 ("powerpc/pseries/iommu: create
DDW for devices with DMA mask less than 64-bits") avoids the problem but
it's not clear why.
Thanks
Michal
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-05-09 18:32 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra
2025-02-05 12:43 ` Donet Tom
2025-02-05 13:58 ` Gaurav Batra
2025-02-06 6:15 ` Donet Tom
2025-02-17 7:28 ` Madhavan Srinivasan
2025-03-19 17:29 ` Michal Suchánek
2025-03-26 14:46 ` Gaurav Batra
2025-03-26 14:53 ` Michal Suchánek
2025-05-07 9:06 ` Amit Machhiwal
2025-05-07 12:23 ` Michal Suchánek
2025-05-09 18:32 ` Michal Suchánek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).