[PATCH v6 1/5] ARM: dma-mapping: Optimize allocation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: robin.murphy@arm.com (Robin Murphy)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 1/5] ARM: dma-mapping: Optimize allocation
Date: Wed, 13 Jan 2016 12:23:52 +0000	[thread overview]
Message-ID: <569641D8.1070408@arm.com> (raw)
In-Reply-To: <1452533428-12762-2-git-send-email-dianders@chromium.org>

On 11/01/16 17:30, Douglas Anderson wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to allocate large chunks
> in order to optimize IOMMU TLB usage.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>    - Try to allocate order 10 (no retry)
>    - Try to allocate order 9 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>    - Try to allocate order 9 (no retry)
>    - Try to allocate order 8 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>    => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
> was done on kernel 3.14, so possibly mainline would behave
> differently).
>
> A second problem is that allocating big chunks under memory pressure
> when we don't need them is just not a great idea anyway unless we really
> need them.  We can make due pretty well with smaller chunks so it's
> probably wise to leave bigger chunks for other users once memory
> pressure is on.
>
> Let's adjust the allocation like this:
>
> 1. If a big chunk fails, stop trying to hard and bump down to lower
>     order allocations.
> 2. Don't try useless orders.  The whole point of big chunks is to
>     optimize the TLB and it can really only make use of 2M, 1M, 64K and
>     4K sizes.
>
> We'll still tend to eat up a bunch of big chunks, but that might be the
> right answer for some users.  A future patch could possibly add a new
> DMA_ATTR that would let the caller decide that TLB optimization isn't
> important and that we should use smaller chunks.  Presumably this would
> be a sane strategy for some callers.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
> Changes in v6: None

Oops, good thing the the reply I just sent to v5 applies equally well to 
v6 too!

Robin.

> Changes in v5: None
> Changes in v4:
> - Added Marek's ack
>
> Changes in v3: None
> Changes in v2:
> - No longer just 1 page at a time, but gives up higher order quickly.
> - Only tries important higher order allocations that might help us.
>
>   arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
>   1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 0eca3812527e..bc9cebfa0891 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
>   	spin_unlock_irqrestore(&mapping->lock, flags);
>   }
>
> +/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
> +static const int iommu_order_array[] = { 9, 8, 4, 0 };
> +
>   static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   					  gfp_t gfp, struct dma_attrs *attrs)
>   {
> @@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	int count = size >> PAGE_SHIFT;
>   	int array_size = count * sizeof(struct page *);
>   	int i = 0;
> +	int order_idx = 0;
>
>   	if (array_size <= PAGE_SIZE)
>   		pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	while (count) {
>   		int j, order;
>
> -		for (order = __fls(count); order > 0; --order) {
> -			/*
> -			 * We do not want OOM killer to be invoked as long
> -			 * as we can fall back to single pages, so we force
> -			 * __GFP_NORETRY for orders higher than zero.
> -			 */
> -			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> -			if (pages[i])
> -				break;
> +		order = iommu_order_array[order_idx];
> +
> +		/* Drop down when we get small */
> +		if (__fls(count) < order) {
> +			order_idx++;
> +			continue;
>   		}
>
> -		if (!pages[i]) {
> -			/*
> -			 * Fall back to single page allocation.
> -			 * Might invoke OOM killer as last resort.
> -			 */
> +		if (order) {
> +			/* See if it's easy to allocate a high-order chunk */
> +			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> +
> +			/* Go down a notch at first sign of pressure */
> +			if (!pages[i]) {
> +				order_idx++;
> +				continue;
> +			}
> +		} else {
>   			pages[i] = alloc_pages(gfp, 0);
>   			if (!pages[i])
>   				goto error;
>

WARNING: multiple messages have this Message-ID (diff)

From: Robin Murphy <robin.murphy@arm.com>
To: Douglas Anderson <dianders@chromium.org>,
	linux@arm.linux.org.uk, mchehab@osg.samsung.com,
	tfiga@chromium.org, m.szyprowski@samsung.com
Cc: pawel@osciak.com, Dmitry Torokhov <dmitry.torokhov@gmail.com>,
	hch@infradead.org, will.deacon@arm.com,
	akpm@linux-foundation.org, dan.j.williams@intel.com,
	carlo@caione.org, laurent.pinchart+renesas@ideasonboard.com,
	mike.looijmans@topic.nl, penguin-kernel@i-love.sakura.ne.jp,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v6 1/5] ARM: dma-mapping: Optimize allocation
Date: Wed, 13 Jan 2016 12:23:52 +0000	[thread overview]
Message-ID: <569641D8.1070408@arm.com> (raw)
In-Reply-To: <1452533428-12762-2-git-send-email-dianders@chromium.org>

On 11/01/16 17:30, Douglas Anderson wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to allocate large chunks
> in order to optimize IOMMU TLB usage.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>    - Try to allocate order 10 (no retry)
>    - Try to allocate order 9 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>    - Try to allocate order 9 (no retry)
>    - Try to allocate order 8 (no retry)
>    - ...
>    - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>    => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
> was done on kernel 3.14, so possibly mainline would behave
> differently).
>
> A second problem is that allocating big chunks under memory pressure
> when we don't need them is just not a great idea anyway unless we really
> need them.  We can make due pretty well with smaller chunks so it's
> probably wise to leave bigger chunks for other users once memory
> pressure is on.
>
> Let's adjust the allocation like this:
>
> 1. If a big chunk fails, stop trying to hard and bump down to lower
>     order allocations.
> 2. Don't try useless orders.  The whole point of big chunks is to
>     optimize the TLB and it can really only make use of 2M, 1M, 64K and
>     4K sizes.
>
> We'll still tend to eat up a bunch of big chunks, but that might be the
> right answer for some users.  A future patch could possibly add a new
> DMA_ATTR that would let the caller decide that TLB optimization isn't
> important and that we should use smaller chunks.  Presumably this would
> be a sane strategy for some callers.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
> Changes in v6: None

Oops, good thing the the reply I just sent to v5 applies equally well to 
v6 too!

Robin.

> Changes in v5: None
> Changes in v4:
> - Added Marek's ack
>
> Changes in v3: None
> Changes in v2:
> - No longer just 1 page at a time, but gives up higher order quickly.
> - Only tries important higher order allocations that might help us.
>
>   arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
>   1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 0eca3812527e..bc9cebfa0891 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
>   	spin_unlock_irqrestore(&mapping->lock, flags);
>   }
>
> +/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
> +static const int iommu_order_array[] = { 9, 8, 4, 0 };
> +
>   static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   					  gfp_t gfp, struct dma_attrs *attrs)
>   {
> @@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	int count = size >> PAGE_SHIFT;
>   	int array_size = count * sizeof(struct page *);
>   	int i = 0;
> +	int order_idx = 0;
>
>   	if (array_size <= PAGE_SIZE)
>   		pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>   	while (count) {
>   		int j, order;
>
> -		for (order = __fls(count); order > 0; --order) {
> -			/*
> -			 * We do not want OOM killer to be invoked as long
> -			 * as we can fall back to single pages, so we force
> -			 * __GFP_NORETRY for orders higher than zero.
> -			 */
> -			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> -			if (pages[i])
> -				break;
> +		order = iommu_order_array[order_idx];
> +
> +		/* Drop down when we get small */
> +		if (__fls(count) < order) {
> +			order_idx++;
> +			continue;
>   		}
>
> -		if (!pages[i]) {
> -			/*
> -			 * Fall back to single page allocation.
> -			 * Might invoke OOM killer as last resort.
> -			 */
> +		if (order) {
> +			/* See if it's easy to allocate a high-order chunk */
> +			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> +
> +			/* Go down a notch at first sign of pressure */
> +			if (!pages[i]) {
> +				order_idx++;
> +				continue;
> +			}
> +		} else {
>   			pages[i] = alloc_pages(gfp, 0);
>   			if (!pages[i])
>   				goto error;
>

next prev parent reply	other threads:[~2016-01-13 12:23 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-11 17:30 [PATCH v6 0/5] dma-mapping: Patches for speeding up allocation Douglas Anderson
2016-01-11 17:30 ` Douglas Anderson
2016-01-11 17:30 ` [PATCH v6 1/5] ARM: dma-mapping: Optimize allocation Douglas Anderson
2016-01-11 17:30   ` Douglas Anderson
2016-01-13 12:23   ` Robin Murphy [this message]
2016-01-13 12:23     ` Robin Murphy
2016-01-11 17:30 ` [PATCH v6 2/5] common: DMA-mapping: add DMA_ATTR_ALLOC_SINGLE_PAGES attribute Douglas Anderson
2016-01-13 12:36   ` Robin Murphy
2016-01-11 17:30 ` [PATCH v6 3/5] ARM: dma-mapping: Use DMA_ATTR_ALLOC_SINGLE_PAGES hint to optimize alloc Douglas Anderson
2016-01-11 17:30   ` Douglas Anderson
2016-01-11 17:30 ` [PATCH v6 4/5] videobuf2-dc: Let drivers specify DMA attrs Douglas Anderson
2016-02-01 11:40   ` Mauro Carvalho Chehab
2016-02-01 13:29   ` Marek Szyprowski
2016-02-01 21:37     ` Doug Anderson
2016-02-17 17:00   ` Hans Verkuil
2016-02-17 17:26     ` Doug Anderson
2016-02-18  5:58     ` Tomasz Figa
2016-01-11 17:30 ` [PATCH v6 5/5] s5p-mfc: Set DMA_ATTR_ALLOC_SINGLE_PAGES Douglas Anderson
2016-01-11 17:30   ` Douglas Anderson
2016-01-26 23:31 ` [PATCH v6 0/5] dma-mapping: Patches for speeding up allocation Javier Martinez Canillas
2016-01-26 23:31   ` Javier Martinez Canillas
2016-01-26 23:59 ` Andrew Morton
2016-01-26 23:59   ` Andrew Morton
2016-01-27  0:39   ` Doug Anderson
2016-01-27  0:39     ` Doug Anderson
2016-01-29 21:52     ` Olof Johansson
2016-01-29 21:52       ` Olof Johansson
2016-01-29 21:58       ` Doug Anderson
2016-01-29 21:58         ` Doug Anderson
2016-01-29 22:14         ` Doug Anderson
2016-01-29 22:14           ` Doug Anderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=569641D8.1070408@arm.com \
    --to=robin.murphy@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.