From mboxrd@z Thu Jan  1 00:00:00 1970
From: robin.murphy@arm.com (Robin Murphy)
Date: Wed, 13 Jan 2016 12:23:52 +0000
Subject: [PATCH v6 1/5] ARM: dma-mapping: Optimize allocation
In-Reply-To: <1452533428-12762-2-git-send-email-dianders@chromium.org>
References: <1452533428-12762-1-git-send-email-dianders@chromium.org>
 <1452533428-12762-2-git-send-email-dianders@chromium.org>
Message-ID: <569641D8.1070408@arm.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On 11/01/16 17:30, Douglas Anderson wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to allocate large chunks
> in order to optimize IOMMU TLB usage.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>    - Try to allocate order 10 (no retry)
>    - Try to allocate order 9 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>    - Try to allocate order 9 (no retry)
>    - Try to allocate order 8 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>    => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
> was done on kernel 3.14, so possibly mainline would behave
> differently).
>
> A second problem is that allocating big chunks under memory pressure
> when we don't need them is just not a great idea anyway unless we really
> need them.  We can make due pretty well with smaller chunks so it's
> probably wise to leave bigger chunks for other users once memory
> pressure is on.
>
> Let's adjust the allocation like this:
>
> 1. If a big chunk fails, stop trying to hard and bump down to lower
>     order allocations.
> 2. Don't try useless orders.  The whole point of big chunks is to
>     optimize the TLB and it can really only make use of 2M, 1M, 64K and
>     4K sizes.
>
> We'll still tend to eat up a bunch of big chunks, but that might be the
> right answer for some users.  A future patch could possibly add a new
> DMA_ATTR that would let the caller decide that TLB optimization isn't
> important and that we should use smaller chunks.  Presumably this would
> be a sane strategy for some callers.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
> Changes in v6: None

Oops, good thing the the reply I just sent to v5 applies equally well to 
v6 too!

Robin.

> Changes in v5: None
> Changes in v4:
> - Added Marek's ack
>
> Changes in v3: None
> Changes in v2:
> - No longer just 1 page at a time, but gives up higher order quickly.
> - Only tries important higher order allocations that might help us.
>
>   arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
>   1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 0eca3812527e..bc9cebfa0891 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
>   	spin_unlock_irqrestore(&mapping->lock, flags);
>   }
>
> +/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
> +static const int iommu_order_array[] = { 9, 8, 4, 0 };
> +
>   static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   					  gfp_t gfp, struct dma_attrs *attrs)
>   {
> @@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	int count = size >> PAGE_SHIFT;
>   	int array_size = count * sizeof(struct page *);
>   	int i = 0;
> +	int order_idx = 0;
>
>   	if (array_size <= PAGE_SIZE)
>   		pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	while (count) {
>   		int j, order;
>
> -		for (order = __fls(count); order > 0; --order) {
> -			/*
> -			 * We do not want OOM killer to be invoked as long
> -			 * as we can fall back to single pages, so we force
> -			 * __GFP_NORETRY for orders higher than zero.
> -			 */
> -			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> -			if (pages[i])
> -				break;
> +		order = iommu_order_array[order_idx];
> +
> +		/* Drop down when we get small */
> +		if (__fls(count) < order) {
> +			order_idx++;
> +			continue;
>   		}
>
> -		if (!pages[i]) {
> -			/*
> -			 * Fall back to single page allocation.
> -			 * Might invoke OOM killer as last resort.
> -			 */
> +		if (order) {
> +			/* See if it's easy to allocate a high-order chunk */
> +			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> +
> +			/* Go down a notch at first sign of pressure */
> +			if (!pages[i]) {
> +				order_idx++;
> +				continue;
> +			}
> +		} else {
>   			pages[i] = alloc_pages(gfp, 0);
>   			if (!pages[i])
>   				goto error;
>

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1759303AbcAMMX6 (ORCPT <rfc822;w@1wt.eu>);
	Wed, 13 Jan 2016 07:23:58 -0500
Received: from foss.arm.com ([217.140.101.70]:40299 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752257AbcAMMX4 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 13 Jan 2016 07:23:56 -0500
Subject: Re: [PATCH v6 1/5] ARM: dma-mapping: Optimize allocation
To: Douglas Anderson <dianders@chromium.org>, linux@arm.linux.org.uk,
        mchehab@osg.samsung.com, tfiga@chromium.org, m.szyprowski@samsung.com
References: <1452533428-12762-1-git-send-email-dianders@chromium.org>
 <1452533428-12762-2-git-send-email-dianders@chromium.org>
Cc: pawel@osciak.com, Dmitry Torokhov <dmitry.torokhov@gmail.com>,
        hch@infradead.org, will.deacon@arm.com, akpm@linux-foundation.org,
        dan.j.williams@intel.com, carlo@caione.org,
        laurent.pinchart+renesas@ideasonboard.com, mike.looijmans@topic.nl,
        penguin-kernel@i-love.sakura.ne.jp,
        linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
From: Robin Murphy <robin.murphy@arm.com>
Message-ID: <569641D8.1070408@arm.com>
Date: Wed, 13 Jan 2016 12:23:52 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <1452533428-12762-2-git-send-email-dianders@chromium.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/01/16 17:30, Douglas Anderson wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to allocate large chunks
> in order to optimize IOMMU TLB usage.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>    - Try to allocate order 10 (no retry)
>    - Try to allocate order 9 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>    - Try to allocate order 9 (no retry)
>    - Try to allocate order 8 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>    => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
> was done on kernel 3.14, so possibly mainline would behave
> differently).
>
> A second problem is that allocating big chunks under memory pressure
> when we don't need them is just not a great idea anyway unless we really
> need them.  We can make due pretty well with smaller chunks so it's
> probably wise to leave bigger chunks for other users once memory
> pressure is on.
>
> Let's adjust the allocation like this:
>
> 1. If a big chunk fails, stop trying to hard and bump down to lower
>     order allocations.
> 2. Don't try useless orders.  The whole point of big chunks is to
>     optimize the TLB and it can really only make use of 2M, 1M, 64K and
>     4K sizes.
>
> We'll still tend to eat up a bunch of big chunks, but that might be the
> right answer for some users.  A future patch could possibly add a new
> DMA_ATTR that would let the caller decide that TLB optimization isn't
> important and that we should use smaller chunks.  Presumably this would
> be a sane strategy for some callers.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
> Changes in v6: None

Oops, good thing the the reply I just sent to v5 applies equally well to 
v6 too!

Robin.

> Changes in v5: None
> Changes in v4:
> - Added Marek's ack
>
> Changes in v3: None
> Changes in v2:
> - No longer just 1 page at a time, but gives up higher order quickly.
> - Only tries important higher order allocations that might help us.
>
>   arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
>   1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 0eca3812527e..bc9cebfa0891 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
>   	spin_unlock_irqrestore(&mapping->lock, flags);
>   }
>
> +/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
> +static const int iommu_order_array[] = { 9, 8, 4, 0 };
> +
>   static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   					  gfp_t gfp, struct dma_attrs *attrs)
>   {
> @@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	int count = size >> PAGE_SHIFT;
>   	int array_size = count * sizeof(struct page *);
>   	int i = 0;
> +	int order_idx = 0;
>
>   	if (array_size <= PAGE_SIZE)
>   		pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	while (count) {
>   		int j, order;
>
> -		for (order = __fls(count); order > 0; --order) {
> -			/*
> -			 * We do not want OOM killer to be invoked as long
> -			 * as we can fall back to single pages, so we force
> -			 * __GFP_NORETRY for orders higher than zero.
> -			 */
> -			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> -			if (pages[i])
> -				break;
> +		order = iommu_order_array[order_idx];
> +
> +		/* Drop down when we get small */
> +		if (__fls(count) < order) {
> +			order_idx++;
> +			continue;
>   		}
>
> -		if (!pages[i]) {
> -			/*
> -			 * Fall back to single page allocation.
> -			 * Might invoke OOM killer as last resort.
> -			 */
> +		if (order) {
> +			/* See if it's easy to allocate a high-order chunk */
> +			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> +
> +			/* Go down a notch at first sign of pressure */
> +			if (!pages[i]) {
> +				order_idx++;
> +				continue;
> +			}
> +		} else {
>   			pages[i] = alloc_pages(gfp, 0);
>   			if (!pages[i])
>   				goto error;
>