From: Gaurav Batra <gbatra@linux.ibm.com>
To: "Michal Suchánek" <msuchanek@suse.de>
Cc: linuxppc-dev@lists.ozlabs.org, donettom@linux.ibm.com
Subject: Re: [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
Date: Wed, 26 Mar 2025 09:46:11 -0500
Message-ID: <aaab4789-390c-4b8d-9b83-bdb5fd6b0767@linux.ibm.com>
In-Reply-To: <Z9r--U_INHB4RjXI@kitsune.suse.cz>

Hello Michal,

In the patch that fixes the pmemory bug, I changed the code that
determines the maximum memory an LPAR can have (excluding pmemory). This
value is needed when creating a Dynamic DMA Window (DDW). The changes
are in the main code path of DDW creation, so they might have upset
QEMU somehow. I have no idea yet how.

Thanks,
Gaurav

On 3/19/25 12:29 PM, Michal Suchánek wrote:
> Hello,
>
> looks like this upsets some assumption qemu has about these windows.
>
> https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
>
> When Linux kernel that has this patch applied is running inside a qemu
> VM with a PCI device and the VM is rebooted qemu crashes shortly after
> the next Linux kernel starts.
>
> This is quite curious since qemu does AFAIK not support pmemory at all.
>
> Any idea what went wrong there?
>
> Thanks
>
> Michal
>
> On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
>> iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
>> notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
>> (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
>> for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
>> dynamically mapped when the device driver instructs to do so.
>>
>> The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
>> after LPAR boot. The command to do so is -
>>
>> daxctl reconfigure-device --mode=system-ram dax0.0 --force
>>
>> This will dynamically add pmemory range to LPAR RAM eventually invoking
>> iommu_mem_notifier(). The address range of pmemory is way beyond the Max
>> RAM that the LPAR can have. Which means, this range is beyond the DDW
>> created for the device, at device initialization time.
>>
>> As a result when TCEs are pre-mapped for the pmemory range, by
>> iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
>> command, daxctl, to add pmemory as RAM.
>>
>> The solution is to not pre-map TCEs for pmemory.
>>
>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/mmzone.h | 1 +
>> arch/powerpc/mm/numa.c | 2 +-
>> arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
>> 3 files changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
>> index d99863cd6cde..049152f8d597 100644
>> --- a/arch/powerpc/include/asm/mmzone.h
>> +++ b/arch/powerpc/include/asm/mmzone.h
>> @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
>> #ifdef CONFIG_MEMORY_HOTPLUG
>> extern unsigned long max_pfn;
>> u64 memory_hotplug_max(void);
>> +u64 hot_add_drconf_memory_max(void);
>> #else
>> #define memory_hotplug_max() memblock_end_of_DRAM()
>> #endif
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 3c1da08304d0..603a0f652ba6 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
>> return nid;
>> }
>>
>> -static u64 hot_add_drconf_memory_max(void)
>> +u64 hot_add_drconf_memory_max(void)
>> {
>> struct device_node *memory = NULL;
>> struct device_node *dn = NULL;
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 29f1a0cc59cd..abd9529a8f41 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>
>> static phys_addr_t ddw_memory_hotplug_max(void)
>> {
>> - resource_size_t max_addr = memory_hotplug_max();
>> - struct device_node *memory;
>> + resource_size_t max_addr;
>>
>> - for_each_node_by_type(memory, "memory") {
>> - struct resource res;
>> -
>> - if (of_address_to_resource(memory, 0, &res))
>> - continue;
>> -
>> - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>> - }
>> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>> + max_addr = hot_add_drconf_memory_max();
>> +#else
>> + max_addr = memblock_end_of_DRAM();
>> +#endif
>>
>> return max_addr;
>> }
>> @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>
>> if (direct_mapping) {
>> /* DDW maps the whole partition, so enable direct DMA mapping */
>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>> + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
>> win64->value, tce_setrange_multi_pSeriesLP_walk);
>> if (ret) {
>> dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
>> @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>> struct memory_notify *arg = data;
>> int ret = 0;
>>
>> + /* This notifier can get called when onlining persistent memory as well.
>> + * TCEs are not pre-mapped for persistent memory. Persistent memory will
>> + * always be above ddw_memory_hotplug_max()
>> + */
>> +
>> switch (action) {
>> case MEM_GOING_ONLINE:
>> spin_lock(&dma_win_list_lock);
>> list_for_each_entry(window, &dma_win_list, list) {
>> - if (window->direct) {
>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> + ddw_memory_hotplug_max()) {
>> ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> arg->nr_pages, window->prop);
>> }
>> @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>> case MEM_OFFLINE:
>> spin_lock(&dma_win_list_lock);
>> list_for_each_entry(window, &dma_win_list, list) {
>> - if (window->direct) {
>> + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>> + ddw_memory_hotplug_max()) {
>> ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> arg->nr_pages, window->prop);
>> }
>>
>> base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
>> --
>> 2.39.3 (Apple Git-146)
>>
>>
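
For readers following the thread, the range check that the quoted patch adds
to iommu_mem_notifier() can be modelled in userspace roughly as below. This is
only a sketch under stated assumptions: the helper name and the 64K page shift
are hypothetical and not part of the patch, and the limit is passed in as a
plain number instead of being read through ddw_memory_hotplug_max().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64K pages; the real PAGE_SHIFT depends on the kernel config. */
#define SKETCH_PAGE_SHIFT 16

/*
 * Hypothetical helper (not in the patch): returns true when the notifier
 * would pre-map TCEs for a hot-added range, i.e. the window is a direct
 * mapping and the range starts below the DDW limit.
 */
static bool sketch_should_premap(bool window_direct, uint64_t start_pfn,
				 uint64_t ddw_max_addr)
{
	return window_direct &&
	       (start_pfn << SKETCH_PAGE_SHIFT) < ddw_max_addr;
}
```

With a limit of 1 GiB, a RAM range starting at pfn 0x10 passes the check and
would be pre-mapped, while a range starting at or above the limit (as a
pmemory range is expected to) is skipped.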
Thread overview: 11+ messages
2025-01-30 18:38 [PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory Gaurav Batra
2025-02-05 12:43 ` Donet Tom
2025-02-05 13:58 ` Gaurav Batra
2025-02-06 6:15 ` Donet Tom
2025-02-17 7:28 ` Madhavan Srinivasan
2025-03-19 17:29 ` Michal Suchánek
2025-03-26 14:46 ` Gaurav Batra [this message]
2025-03-26 14:53 ` Michal Suchánek
2025-05-07 9:06 ` Amit Machhiwal
2025-05-07 12:23 ` Michal Suchánek
2025-05-09 18:32 ` Michal Suchánek