Re: [PATCH 2/2] iommu/iova: Improve restart logic

From: John Garry <john.garry@huawei.com>
To: Robin Murphy <robin.murphy@arm.com>, <joro@8bytes.org>
Cc: Linuxarm <linuxarm@huawei.com>, <linux-kernel@vger.kernel.org>,
	<iommu@lists.linux-foundation.org>, <vjitta@codeaurora.org>
Subject: Re: [PATCH 2/2] iommu/iova: Improve restart logic
Date: Thu, 18 Mar 2021 11:38:49 +0000	[thread overview]
Message-ID: <6cea11f9-e98d-98cb-6789-93abd8833fa0@huawei.com> (raw)
In-Reply-To: <d8e80756-a628-3a0d-77ac-1e9df734f1c5@huawei.com>

On 10/03/2021 17:47, John Garry wrote:
> On 09/03/2021 15:55, John Garry wrote:
>> On 05/03/2021 16:35, Robin Murphy wrote:
>>
>> Hi Robin,
>>
>>> When restarting after searching below the cached node fails, resetting
>>> the start point to the anchor node is often overly pessimistic. If
>>> allocations are made with mixed limits - particularly in the case of the
>>> opportunistic 32-bit allocation for PCI devices - this could mean
>>> significant time wasted walking through the whole populated upper range
>>> just to reach the initial limit. 
>>
>> Right
>>
>>> We can improve on that by implementing
>>> a proper tree traversal to find the first node above the relevant limit,
>>> and set the exact start point.
>>
>> Thanks for this. However, as mentioned in the other thread, this does 
>> not help our performance regression Re: commit 4e89dce72521.
>>
>> And mentioning this "retry" approach again, in case it was missed, 
>> from my experiment on the affected HW, it also has generally a 
>> dreadfully low success rate - less than 1% for the 32b range retry. 
>> Retry rate is about 20%. So I am saying from this 20%, less than 1% of 
>> those succeed.
>>
> 
> So since the retry means that we search through the complete pfn range 
> most of the time (due to poor success rate), we should be able to do a 
> better job at maintaining an accurate max alloc size, by calculating it 
> from the range search, and not relying on max alloc failed or resetting 
> it frequently. Hopefully that would mean that we're smarter about not 
> trying the allocation.

So I tried that out and we seem to be able to scrap back an appreciable 
amount of performance. Maybe 80% of original, with with another change, 
below.

Thanks,
John

> 
> 
>>
>>>
>>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>>> ---
>>>   drivers/iommu/iova.c | 39 ++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 38 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index c28003e1d2ee..471c48dd71e7 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain 
>>> *iovad, struct iova *free)
>>>           iovad->cached_node = rb_next(&free->node);
>>>   }
>>> +static struct rb_node *iova_find_limit(struct iova_domain *iovad, 
>>> unsigned long limit_pfn)
>>> +{
>>> +    struct rb_node *node, *next;
>>> +    /*
>>> +     * Ideally what we'd like to judge here is whether limit_pfn is 
>>> close
>>> +     * enough to the highest-allocated IOVA that starting the 
>>> allocation
>>> +     * walk from the anchor node will be quicker than this initial 
>>> work to
>>> +     * find an exact starting point (especially if that ends up 
>>> being the
>>> +     * anchor node anyway). This is an incredibly crude 
>>> approximation which
>>> +     * only really helps the most likely case, but is at least 
>>> trivially easy.
>>> +     */
>>> +    if (limit_pfn > iovad->dma_32bit_pfn)
>>> +        return &iovad->anchor.node;
>>> +
>>> +    node = iovad->rbroot.rb_node;
>>> +    while (to_iova(node)->pfn_hi < limit_pfn)
>>> +        node = node->rb_right;
>>> +
>>> +search_left:
>>> +    while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= 
>>> limit_pfn)
>>> +        node = node->rb_left;
>>> +
>>> +    if (!node->rb_left)
>>> +        return node;
>>> +
>>> +    next = node->rb_left;
>>> +    while (next->rb_right) {
>>> +        next = next->rb_right;
>>> +        if (to_iova(next)->pfn_lo >= limit_pfn) {
>>> +            node = next;
>>> +            goto search_left;
>>> +        }
>>> +    }
>>> +
>>> +    return node;
>>> +}
>>> +
>>>   /* Insert the iova into domain rbtree by holding writer lock */
>>>   static void
>>>   iova_insert_rbtree(struct rb_root *root, struct iova *iova,
>>> @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
>>> iova_domain *iovad,
>>>           if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
>>>               high_pfn = limit_pfn;
>>>               low_pfn = retry_pfn;
>>> -            curr = &iovad->anchor.node;
>>> +            curr = iova_find_limit(iovad, limit_pfn);

I see that it is now applied. However, alternatively could we just add a 
zero-length 32b boundary marker node for the 32b pfn restart point?

>>>               curr_iova = to_iova(curr);
>>>               goto retry;
>>>           }
>>>
>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>> .
>