LinuxPPC-Dev Archive on lore.kernel.org
From: Harsh Prateek Bora <harshpb@linux.ibm.com>
To: Gaurav Batra <gbatra@linux.ibm.com>,
	maddy@linux.ibm.com,
	Venkat Rao Bagalkote <venkat88@linux.ibm.com>,
	sbhat@linux.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org, ritesh.list@gmail.com,
	vaibhav@linux.ibm.com, donettom@linux.ibm.com
Subject: Re: [PATCH v3] powerpc/pseries/iommu: Add TCEs for 16GB pages when RAM is pre-mapped
Date: Tue, 2 Jun 2026 09:27:56 +0530	[thread overview]
Message-ID: <b2232da2-25e0-4436-86ad-4cb6d57dcab1@linux.ibm.com> (raw)
In-Reply-To: <2a1694c2-d3ed-475b-a4a6-2383d6229c3b@linux.ibm.com>



On 01/06/26 11:33 pm, Gaurav Batra wrote:
> Hello Harsh,
> 
> 
> response is inline
> 
> 
> On 5/31/26 12:48 PM, Harsh Prateek Bora wrote:
>> + Venkat
>>
>> Hi Gaurav,
>> Would just like to confirm if it is tested with multiple iterations of 
>> hotplug of RAM (DLPAR) as well?
> I tested the patch with both DLPAR of RAM and adapter, for 100 
> iterations each.

Thanks for confirming! Feel free to add:

Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>

>>
>> Hi Venkat,
>> Could you please help validate the patch for above-mentioned scenario 
>> as well?
>>
>> Hi Shivaprasad,
>> Please share your review feedback or any additional testing scenarios 
>> needed?
>>
>> Thanks
>> Harsh
>>
>> On 15/05/26 9:21 pm, Gaurav Batra wrote:
>>> On PowerPC, if the Dynamic DMA Window (DDW) is big enough, RAM is
>>> pre-mapped. To determine the size of RAM, the PAPR+ property
>>> "ibm,lrdr-capacity" is used. This OF property gives the maximum size of
>>> RAM an LPAR can have, including DR-added memory.
>>>
>>> On PowerPC, 16GB pages can be allocated at machine level and then
>>> assigned to LPARs. These 16GB pages are added to LPAR memory at boot
>>> time. The address range for these 16GB pages is above the MAX RAM an
>>> LPAR can have (ibm,lrdr-capacity). In the current implementation, these
>>> 16GB pages are excluded from pre-mapped TCEs. A driver can have DMA
>>> buffers allocated from 16GB pages. This causes the platform to raise an
>>> EEH when DMA is attempted on buffers in the 16GB memory range.
>>>
>>> commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly
>>> adds TCEs for pmemory")
>>>
>>> Prior to the above patch, memblock_end_of_DRAM() was being used to
>>> determine the MAX memory of an LPAR. This included 16GB pages as well.
>>> The issue with using memblock_end_of_DRAM() is that when pmemory is
>>> converted to RAM via daxctl command, the DDW engine will incorrectly try
>>> to add TCEs for pmemory as well.
>>>
>>> Below is the address distribution of RAM, 16GB pages and pmemory for an
>>> LPAR with max memory of 256GB, memory allocated 64GB, 2 16GB pages and
>>> assigned pmemory of 8GB.
>>>
>>> RANGE                                 SIZE  STATE REMOVABLE BLOCK
>>> 0x0000000000000000-0x0000000fffffffff  64G online       yes 0-255
>>> 0x0000004000000000-0x00000047ffffffff  32G online       yes 1024-1151
>>>
>>> cat /sys/bus/nd/devices/region0/resource
>>> 0x40100000000
>>> cat /sys/bus/nd/devices/region0/size
>>> 8589934592
>>>
>>> The fix is to revert the code changes introduced by the above patch
>>> and to stash away the MAX memory of an LPAR, including 16GB pages, at
>>> LPAR boot time. This value is then used whenever TCEs need to be
>>> pre-mapped, in enable_ddw() or iommu_mem_notifier().
>>>
>>> Fixes: 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory")
>>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>>> ---
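
As a worked check of the address layout quoted above (not part of the patch), here is a small userspace C sketch of how the maximum pre-mapped address comes out for the example LPAR. The names mem_range, ddw_max_ram_sketch and order_base_2_sketch are hypothetical stand-ins: the patch itself walks the device-tree "memory" nodes with of_address_to_resource() and starts from memory_hotplug_max(), which is assumed here to equal the 256GB LPAR maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one "memory" node: [start, end], end inclusive. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Largest exclusive end address over the hotplug limit and all nodes.
 * Mirrors the max_t(..., res.end + 1) loop in the patch, in userspace.
 */
static uint64_t ddw_max_ram_sketch(uint64_t hotplug_max,
                                   const struct mem_range *nodes, size_t n)
{
    uint64_t max_addr = hotplug_max;

    for (size_t i = 0; i < n; i++) {
        assert(nodes[i].end >= nodes[i].start);
        if (nodes[i].end + 1 > max_addr)
            max_addr = nodes[i].end + 1;
    }
    return max_addr;
}

/* Userspace equivalent of order_base_2(): ceil(log2(n)); 0 and 1 give 0. */
static unsigned int order_base_2_sketch(uint64_t n)
{
    unsigned int order = 0;

    while (order < 64 && ((uint64_t)1 << order) < n)
        order++;
    return order;
}
```

With the example ranges (64G at 0x0, 32G of 16GB pages at 0x4000000000) and a 256GB hotplug maximum, this gives 0x4800000000, i.e. an order of 39, and the pmemory region at 0x40100000000 stays above that limit.
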
>>>
>>> Change log:
>>>
>>> V2 -> V3
>>>
>>> 1. Harsh: Remove R-b tags from the change log
>>>
>>>     Response: Incorporated changes
>>>
>>> 2. Harsh: Change WARN_ON() to WARN_ONCE()
>>>
>>>     Response: Incorporated changes
>>>
>>> 3. Harsh: Fix indentation
>>>
>>>     Response: Incorporated changes
>>>
>>> 4. Harsh: Replace comment with a log if limit < arg->nr_pages ?
>>>
>>>     Response: Doesn't seem to be needed since the WARN_ONCE() will
>>>     log this scenario. I removed the comment instead.
>>>
>>> V1 -> V2
>>>
>>> 1. Harsh: Not only start_pfn, but end_pfn also needs to be within the
>>>     allowed range, which may require clamping arg->nr_pages if crossing
>>>     the limits.
>>>
>>>     Response: Incorporated changes.
>>>
>>>   arch/powerpc/platforms/pseries/iommu.c | 58 ++++++++++++++++++--------
>>>   1 file changed, 41 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index 3e1f915fe4f6..7bbe070006fa 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -69,6 +69,8 @@ static struct iommu_table *iommu_pseries_alloc_table(int node)
>>>       return tbl;
>>>   }
>>>   +static phys_addr_t pseries_ddw_max_ram;
>>> +
>>>   #ifdef CONFIG_IOMMU_API
>>>   static struct iommu_table_group_ops spapr_tce_table_group_ops;
>>>   #endif
>>> @@ -1285,13 +1287,17 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>>     static phys_addr_t ddw_memory_hotplug_max(void)
>>>   {
>>> -    resource_size_t max_addr;
>>> +    resource_size_t max_addr = memory_hotplug_max();
>>> +    struct device_node *memory;
>>>   -#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>>> -    max_addr = hot_add_drconf_memory_max();
>>> -#else
>>> -    max_addr = memblock_end_of_DRAM();
>>> -#endif
>>> +    for_each_node_by_type(memory, "memory") {
>>> +        struct resource res;
>>> +
>>> +        if (of_address_to_resource(memory, 0, &res))
>>> +            continue;
>>> +
>>> +        max_addr = max_t(resource_size_t, max_addr, res.end + 1);
>>> +    }
>>>         return max_addr;
>>>   }
>>> @@ -1446,7 +1452,7 @@ static struct property *ddw_property_create(const char *propname, u32 liobn, u64
>>>   static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 dma_mask)
>>>   {
>>>       int len = 0, ret;
>>> -    int max_ram_len = order_base_2(ddw_memory_hotplug_max());
>>> +    int max_ram_len = order_base_2(pseries_ddw_max_ram);
>>>       struct ddw_query_response query;
>>>       struct ddw_create_response create;
>>>       int page_shift;
>>> @@ -1668,7 +1674,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 dma_mas
>>>         if (direct_mapping) {
>>>           /* DDW maps the whole partition, so enable direct DMA mapping */
>>> -        ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
>>> +        ret = walk_system_ram_range(0, pseries_ddw_max_ram >> PAGE_SHIFT,
>>>                           win64->value, tce_setrange_multi_pSeriesLP_walk);
>>>           if (ret) {
>>>               dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
>>> @@ -2419,23 +2425,35 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>>   {
>>>       struct dma_win *window;
>>>       struct memory_notify *arg = data;
>>> +    unsigned long limit = arg->nr_pages;
>>> +    unsigned long max_ram_pages = pseries_ddw_max_ram >> PAGE_SHIFT;
>>>       int ret = 0;
>>>         /* This notifier can get called when onlining persistent memory as well.
>>>        * TCEs are not pre-mapped for persistent memory. Persistent memory will
>>> -     * always be above ddw_memory_hotplug_max()
>>> +     * always be above pseries_ddw_max_ram
>>>        */
>>> +    if (arg->start_pfn >= max_ram_pages)
>>> +        return NOTIFY_OK;
>>> +
>>> +    /* RAM is being DLPAR'ed. The range should never exceed max ram.
>>> +     * Just in case, clamp the range and throw a warning.
>>> +     */
>>> +    if (arg->start_pfn + limit > max_ram_pages) {
>>> +        limit = max_ram_pages - arg->start_pfn;
>>> +        WARN_ONCE(1, "Limiting Page Range %lx - %lx to Max Mem Pages: %lx\n",
>>> +                    arg->start_pfn, arg->start_pfn + arg->nr_pages,
>>> +                    max_ram_pages);
>>> +    }
>>>         switch (action) {
>>>       case MEM_GOING_ONLINE:
>>>           spin_lock(&dma_win_list_lock);
>>>           list_for_each_entry(window, &dma_win_list, list) {
>>> -            if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>>> -                ddw_memory_hotplug_max()) {
>>> +            if (window->direct) {
>>>                   ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> -                        arg->nr_pages, window->prop);
>>> +                        limit, window->prop);
>>>               }
>>> -            /* XXX log error */
>>>           }
>>>           spin_unlock(&dma_win_list_lock);
>>>           break;
>>> @@ -2443,12 +2461,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>>       case MEM_OFFLINE:
>>>           spin_lock(&dma_win_list_lock);
>>>           list_for_each_entry(window, &dma_win_list, list) {
>>> -            if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
>>> -                ddw_memory_hotplug_max()) {
>>> +            if (window->direct) {
>>>                   ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> -                        arg->nr_pages, window->prop);
>>> +                        limit, window->prop);
>>>               }
>>> -            /* XXX log error */
>>>           }
>>>           spin_unlock(&dma_win_list_lock);
>>>           break;
>>> @@ -2532,6 +2548,14 @@ void __init iommu_init_early_pSeries(void)
>>>       register_memory_notifier(&iommu_mem_nb);
>>>         set_pci_dma_ops(&dma_iommu_ops);
>>> +
>>> +    /* During init determine the max memory an LPAR can have and set it. This
>>> +     * will be used for pre-mapping RAM in DDW. memblock_end_of_DRAM() can
>>> +     * change during the running of LPAR - daxctl can add pmemory as
>>> +     * "system-ram". This memory range should not be pre-mapped in DDW since
>>> +     * the address of pmemory can be much higher than the DDW size.
>>> +     */
>>> +    pseries_ddw_max_ram = ddw_memory_hotplug_max();
>>>   }
>>>     static int __init disable_multitce(char *str)
>>>
>>> base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a
>>
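
For reference, the range handling that the notifier hunk introduces can be sketched in plain userspace C as below. This is only an illustration under assumptions, not the kernel code: clamp_hotplug_range is a hypothetical helper name, the WARN_ONCE() is left out, and the comparison is written as nr_pages > max_ram_pages - start_pfn so the sketch cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide how many pages of [start_pfn, start_pfn + nr_pages) may be
 * pre-mapped, given the stashed limit max_ram_pages.
 *
 * Returns false when the range starts at or above the limit (the
 * persistent-memory case, where the notifier returns NOTIFY_OK early).
 * Otherwise returns true and stores the possibly clamped page count.
 */
static bool clamp_hotplug_range(unsigned long start_pfn,
                                unsigned long nr_pages,
                                unsigned long max_ram_pages,
                                unsigned long *limit)
{
    if (start_pfn >= max_ram_pages)
        return false;

    *limit = nr_pages;

    /* start_pfn < max_ram_pages here, so the subtraction cannot wrap. */
    if (nr_pages > max_ram_pages - start_pfn)
        *limit = max_ram_pages - start_pfn;

    /* Invariant: the clamped range never crosses the limit. */
    assert(*limit <= max_ram_pages - start_pfn);
    return true;
}
```

Assuming 64K pages (PAGE_SHIFT of 16) and the example limit of 0x4800000000, max_ram_pages would be 0x480000; a range starting at pfn 0x470000 with 0x20000 pages would be clamped to 0x10000 pages, and the pmemory region at pfn 0x4010000 would be skipped.
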




Thread overview: 5+ messages
2026-05-15 15:51 [PATCH v3] powerpc/pseries/iommu: Add TCEs for 16GB pages when RAM is pre-mapped Gaurav Batra
2026-05-31 17:48 ` Harsh Prateek Bora
2026-06-01 18:03   ` Gaurav Batra
2026-06-02  3:57     ` Harsh Prateek Bora [this message]
2026-06-09  6:38   ` Venkat Rao Bagalkote
