From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932310AbcAMRoY (ORCPT <rfc822;w@1wt.eu>);
	Wed, 13 Jan 2016 12:44:24 -0500
Received: from foss.arm.com ([217.140.101.70]:42040 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753747AbcAMRoX (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 13 Jan 2016 12:44:23 -0500
Subject: Re: [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation
To: Tomasz Figa <tfiga@chromium.org>
References: <1452294332-23415-1-git-send-email-dianders@chromium.org>
 <1452294332-23415-2-git-send-email-dianders@chromium.org>
 <56964070.1020303@arm.com>
 <CAAFQd5C8asfo8wSa=jKvp4Vmg6A83R-vG7vQXtsHyOTADLo+9g@mail.gmail.com>
Cc: Douglas Anderson <dianders@chromium.org>,
        Russell King <linux@arm.linux.org.uk>,
        Mauro Carvalho Chehab <mchehab@osg.samsung.com>,
        Marek Szyprowski <m.szyprowski@samsung.com>,
        Pawel Osciak <pawel@osciak.com>,
        Dmitry Torokhov <dmitry.torokhov@gmail.com>,
        Will Deacon <will.deacon@arm.com>,
        Andrew Morton <akpm@linux-foundation.org>, dan.j.williams@intel.com,
        Carlo Caione <carlo@caione.org>,
        Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>,
        mike.looijmans@topic.nl, Lorenzo Nava <lorenx4@gmail.com>,
        "linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
From: Robin Murphy <robin.murphy@arm.com>
Message-ID: <56968CF2.1020409@arm.com>
Date: Wed, 13 Jan 2016 17:44:18 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <CAAFQd5C8asfo8wSa=jKvp4Vmg6A83R-vG7vQXtsHyOTADLo+9g@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 13/01/16 17:33, Tomasz Figa wrote:
> On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi Doug,
>>
>>
>> On 08/01/16 23:05, Douglas Anderson wrote:
>>>
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to allocate large chunks
>>> in order to optimize IOMMU TLB usage.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>     - Try to allocate order 10 (no retry)
>>>     - Try to allocate order 9 (no retry)
>>>     - ...
>>>     - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>     - Try to allocate order 9 (no retry)
>>>     - Try to allocate order 8 (no retry)
>>>     - ...
>>>     - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() calls for this case is:
>>>     sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>     => 9228
>>>
>>> The above is obviously worse case, but given how slow alloc can be we
>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>> code with a device under memory pressure and it wasn't hard to see it
>>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
>>> was done on kernel 3.14, so possibly mainline would behave
>>> differently).
>>>
>>> A second problem is that allocating big chunks under memory pressure
>>> when we don't need them is just not a great idea anyway unless we really
>>> need them.  We can make due pretty well with smaller chunks so it's
>>> probably wise to leave bigger chunks for other users once memory
>>> pressure is on.
>>>
>>> Let's adjust the allocation like this:
>>>
>>> 1. If a big chunk fails, stop trying to hard and bump down to lower
>>>      order allocations.
>>> 2. Don't try useless orders.  The whole point of big chunks is to
>>>      optimize the TLB and it can really only make use of 2M, 1M, 64K and
>>>      4K sizes.
>>>
>>> We'll still tend to eat up a bunch of big chunks, but that might be the
>>> right answer for some users.  A future patch could possibly add a new
>>> DMA_ATTR that would let the caller decide that TLB optimization isn't
>>> important and that we should use smaller chunks.  Presumably this would
>>> be a sane strategy for some callers.
>>
>>
>> Now that I've had time to think about it properly:
>>
>> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>>
>> I just had an absolutely disgusting idea of how to get the same progression
>> with just a single variable and no static array, but I'll keep that firmly
>> to myself as it's almost IOCCC-grade WTF :D
>
> Just out of curiosity, a bitmap and loop with fls() and clearing bit
> on failure or something more freaky? :)

Got a Python interpreter handy?

order = 9
for i in range(4):
     print order
     order = (order - 1) & 0xc

Like I said, disgusting :D

Robin.

>
> Anyway:
>
> Reviewed-by: Tomasz Figa <tfiga@chromium.org>
>
> Best regards,
> Tomasz
>