From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932310AbcAMRoY (ORCPT ); Wed, 13 Jan 2016 12:44:24 -0500 Received: from foss.arm.com ([217.140.101.70]:42040 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753747AbcAMRoX (ORCPT ); Wed, 13 Jan 2016 12:44:23 -0500 Subject: Re: [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation To: Tomasz Figa References: <1452294332-23415-1-git-send-email-dianders@chromium.org> <1452294332-23415-2-git-send-email-dianders@chromium.org> <56964070.1020303@arm.com> Cc: Douglas Anderson , Russell King , Mauro Carvalho Chehab , Marek Szyprowski , Pawel Osciak , Dmitry Torokhov , Will Deacon , Andrew Morton , dan.j.williams@intel.com, Carlo Caione , Laurent Pinchart , mike.looijmans@topic.nl, Lorenzo Nava , "linux-arm-kernel@lists.infradead.org" , "linux-kernel@vger.kernel.org" From: Robin Murphy Message-ID: <56968CF2.1020409@arm.com> Date: Wed, 13 Jan 2016 17:44:18 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 13/01/16 17:33, Tomasz Figa wrote: > On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy wrote: >> Hi Doug, >> >> >> On 08/01/16 23:05, Douglas Anderson wrote: >>> >>> The __iommu_alloc_buffer() is expected to be called to allocate pretty >>> sizeable buffers. Upon simple tests of video I saw it trying to >>> allocate 4,194,304 bytes. The function tries to allocate large chunks >>> in order to optimize IOMMU TLB usage. >>> >>> The current function is very, very slow. >>> >>> One problem is the way it keeps trying and trying to allocate big >>> chunks. Imagine a very fragmented memory that has 4M free but no >>> contiguous pages at all. Further imagine allocating 4M (1024 pages). >>> We'll do the following memory allocations: >>> - For page 1: >>> - Try to allocate order 10 (no retry) >>> - Try to allocate order 9 (no retry) >>> - ... >>> - Try to allocate order 0 (with retry, but not needed) >>> - For page 2: >>> - Try to allocate order 9 (no retry) >>> - Try to allocate order 8 (no retry) >>> - ... >>> - Try to allocate order 0 (with retry, but not needed) >>> - ... >>> - ... >>> >>> Total number of calls to alloc() calls for this case is: >>> sum(int(math.log(i, 2)) + 1 for i in range(1, 1025)) >>> => 9228 >>> >>> The above is obviously worse case, but given how slow alloc can be we >>> really want to try to avoid even somewhat bad cases. I timed the old >>> code with a device under memory pressure and it wasn't hard to see it >>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing >>> was done on kernel 3.14, so possibly mainline would behave >>> differently). >>> >>> A second problem is that allocating big chunks under memory pressure >>> when we don't need them is just not a great idea anyway unless we really >>> need them. We can make due pretty well with smaller chunks so it's >>> probably wise to leave bigger chunks for other users once memory >>> pressure is on. >>> >>> Let's adjust the allocation like this: >>> >>> 1. If a big chunk fails, stop trying to hard and bump down to lower >>> order allocations. >>> 2. Don't try useless orders. The whole point of big chunks is to >>> optimize the TLB and it can really only make use of 2M, 1M, 64K and >>> 4K sizes. >>> >>> We'll still tend to eat up a bunch of big chunks, but that might be the >>> right answer for some users. A future patch could possibly add a new >>> DMA_ATTR that would let the caller decide that TLB optimization isn't >>> important and that we should use smaller chunks. Presumably this would >>> be a sane strategy for some callers. >> >> >> Now that I've had time to think about it properly: >> >> Reviewed-by: Robin Murphy >> >> I just had an absolutely disgusting idea of how to get the same progression >> with just a single variable and no static array, but I'll keep that firmly >> to myself as it's almost IOCCC-grade WTF :D > > Just out of curiosity, a bitmap and loop with fls() and clearing bit > on failure or something more freaky? :) Got a Python interpreter handy? order = 9 for i in range(4): print order order = (order - 1) & 0xc Like I said, disgusting :D Robin. > > Anyway: > > Reviewed-by: Tomasz Figa > > Best regards, > Tomasz >