[PATCH v3 0/3] dma-mapping: Patches for speeding up allocation

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation
@ 2016-01-06 19:36 Douglas Anderson
  2016-01-06 19:36 ` [PATCH v3 1/3] ARM: dma-mapping: Optimize allocation Douglas Anderson
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Douglas Anderson @ 2016-01-06 19:36 UTC (permalink / raw)
  To: linux-arm-kernel

This series of 3 patches will speed up memory allocation in dma-mapping
quite a bit.

The first patch ("ARM: dma-mapping: Optimize allocation") is hopefully
not terribly controversial: it merely doesn't try as hard to allocate
big chunks once it gets the first failure.  Since it's unlikely that
further big chunks will help (they're not likely to be virtually aligned
anyway), this should give a big speedup with no real regression to speak
of.  Yes, things could be made better, but this seems like a sane start.

The second patch ("common: DMA-mapping: add DMA_ATTR_SEQUENTIAL
attribute") models MADV_SEQUENTIAL as I understand it.  Hopefully folks
are happy with following that lead.  It does nothing by itself.

The third patch ("ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to
optimize allocation") simply applies the 2nd patch.  Again it's pretty
simple.  ...and again it does nothing by itself.

Notably missing from this series is the fourth patch that adds teeth to
the second and third.  You can find that out of tree at
<https://chromium-review.googlesource.com/#/c/320498/>.  Unfortunately
the rk3288_vpu, which is what I'm working on, is out of tree.

All testing was done on the chromeos kernel-3.14.  Sanity (compile /
boot) testing was done on a v4.4-rc6-based kernel.

Also note that v2 of this series had an extra patch
<https://patchwork.kernel.org/patch/7888861/> that would attempt to sort
the allocation results to opportunistically get some extra alignment.  I
dropped that, but it could be re-introduced if there was interest.  I
found that it did give a little extra alignment sometimes, but maybe not
enough to justify the extra complexity.  It also was a bit half-baked
since it really should have tried harder to ensure alignment.

Changes in v3:
- add DMA_ATTR_SEQUENTIAL attribute new for v3
- Use DMA_ATTR_SEQUENTIAL hint patch new for v3.

Changes in v2:
- No longer just 1 page at a time, but gives up higher order quickly.
- Only tries important higher order allocations that might help us.

Douglas Anderson (3):
  ARM: dma-mapping: Optimize allocation
  common: DMA-mapping: add DMA_ATTR_SEQUENTIAL attribute
  ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation

 Documentation/DMA-attributes.txt | 11 +++++++++++
 arch/arm/mm/dma-mapping.c        | 41 ++++++++++++++++++++++++++--------------
 include/linux/dma-attrs.h        |  1 +
 3 files changed, 39 insertions(+), 14 deletions(-)

-- 
2.6.0.rc2.230.g3dd15c0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v3 1/3] ARM: dma-mapping: Optimize allocation
  2016-01-06 19:36 [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Douglas Anderson
@ 2016-01-06 19:36 ` Douglas Anderson
  2016-01-06 19:36 ` [PATCH v3 3/3] ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation Douglas Anderson
  2016-01-07 13:52 ` [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Marek Szyprowski
  2 siblings, 0 replies; 5+ messages in thread
From: Douglas Anderson @ 2016-01-06 19:36 UTC (permalink / raw)
  To: linux-arm-kernel

The __iommu_alloc_buffer() is expected to be called to allocate pretty
sizeable buffers.  Upon simple tests of video I saw it trying to
allocate 4,194,304 bytes.  The function tries to allocate large chunks
in order to optimize IOMMU TLB usage.

The current function is very, very slow.

One problem is the way it keeps trying and trying to allocate big
chunks.  Imagine a very fragmented memory that has 4M free but no
contiguous pages at all.  Further imagine allocating 4M (1024 pages).
We'll do the following memory allocations:
- For page 1:
  - Try to allocate order 10 (no retry)
  - Try to allocate order 9 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- For page 2:
  - Try to allocate order 9 (no retry)
  - Try to allocate order 8 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- ...
- ...

Total number of calls to alloc() calls for this case is:
  sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
  => 9228

The above is obviously worse case, but given how slow alloc can be we
really want to try to avoid even somewhat bad cases.  I timed the old
code with a device under memory pressure and it wasn't hard to see it
take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
was done on kernel 3.14, so possibly mainline would behave
differently).

A second problem is that allocating big chunks under memory pressure
when we don't need them is just not a great idea anyway unless we really
need them.  We can make due pretty well with smaller chunks so it's
probably wise to leave bigger chunks for other users once memory
pressure is on.

Let's adjust the allocation like this:

1. If a big chunk fails, stop trying to hard and bump down to lower
   order allocations.
2. Don't try useless orders.  The whole point of big chunks is to
   optimize the TLB and it can really only make use of 2M, 1M, 64K and
   4K sizes.

We'll still tend to eat up a bunch of big chunks, but that might be the
right answer for some users.  A future patch could possibly add a new
DMA_ATTR that would let the caller decide that TLB optimization isn't
important and that we should use smaller chunks.  Presumably this would
be a sane strategy for some callers.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Changes in v3: None
Changes in v2:
- No longer just 1 page at a time, but gives up higher order quickly.
- Only tries important higher order allocations that might help us.

 arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 0eca3812527e..bc9cebfa0891 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
 	spin_unlock_irqrestore(&mapping->lock, flags);
 }
 
+/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
+static const int iommu_order_array[] = { 9, 8, 4, 0 };
+
 static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 					  gfp_t gfp, struct dma_attrs *attrs)
 {
@@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	int count = size >> PAGE_SHIFT;
 	int array_size = count * sizeof(struct page *);
 	int i = 0;
+	int order_idx = 0;
 
 	if (array_size <= PAGE_SIZE)
 		pages = kzalloc(array_size, GFP_KERNEL);
@@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	while (count) {
 		int j, order;
 
-		for (order = __fls(count); order > 0; --order) {
-			/*
-			 * We do not want OOM killer to be invoked as long
-			 * as we can fall back to single pages, so we force
-			 * __GFP_NORETRY for orders higher than zero.
-			 */
-			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
-			if (pages[i])
-				break;
+		order = iommu_order_array[order_idx];
+
+		/* Drop down when we get small */
+		if (__fls(count) < order) {
+			order_idx++;
+			continue;
 		}
 
-		if (!pages[i]) {
-			/*
-			 * Fall back to single page allocation.
-			 * Might invoke OOM killer as last resort.
-			 */
+		if (order) {
+			/* See if it's easy to allocate a high-order chunk */
+			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
+
+			/* Go down a notch@first sign of pressure */
+			if (!pages[i]) {
+				order_idx++;
+				continue;
+			}
+		} else {
 			pages[i] = alloc_pages(gfp, 0);
 			if (!pages[i])
 				goto error;
-- 
2.6.0.rc2.230.g3dd15c0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v3 3/3] ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation
  2016-01-06 19:36 [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Douglas Anderson
  2016-01-06 19:36 ` [PATCH v3 1/3] ARM: dma-mapping: Optimize allocation Douglas Anderson
@ 2016-01-06 19:36 ` Douglas Anderson
  2016-01-07 13:52 ` [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Marek Szyprowski
  2 siblings, 0 replies; 5+ messages in thread
From: Douglas Anderson @ 2016-01-06 19:36 UTC (permalink / raw)
  To: linux-arm-kernel

If we know that memory will be accessed sequentially then it's not
terribly important to allocate big chunks of memory.  The whole point of
allocating the big chunks was that it would make TLB usage efficient.

As Marek Szyprowski indicated:
    Please note that mapping memory with larger pages significantly
    improves performance, especially when IOMMU has a little TLB
    cache. This can be easily observed when multimedia devices do
    processing of RGB data with 90/270 degree rotation
Image rotation is distinctly not a sequential operation, so it makes
sense that TLB efficiency is important there.

Video decoding, on the other hand, is a fairly sequential operation.
During video decoding it's not expected that we'll be jumping all over
memory.  Thus if we know we're setting up DMA for a video decode
operation we can indicate DMA_ATTR_SEQUENTIAL.

Allocating big chunks of memory is quite expensive, especially if we're
doing it repeadly and memory is full.  In one (out of tree) usage model
it is common that arm_iommu_alloc_attrs() is called 16 times in a row,
each one trying to allocate 4 MB of memory.  This is called whenever the
system encounters a new video, which could easily happen while the
memory system is stressed out.  In fact, on certain social media
websites that auto-play video and have infinite scrolling, it's quite
common to see not just one of these 16x4MB allocations but 2 or 3 right
after another.  Asking the system even to do a small amount of extra
work to give us big chunks in this case is just not a good use of time.

Allocating big chunks of memory is also expensive indirectly.  Even if
we ask the system not to do ANY extra work to allocate _our_ memory,
we're still potentially eating up all big chunks in the system.
Presumably there are other users in the system that aren't quite as
flexible and that actually need these big chunks.  By eating all the big
chunks we're causing extra work for the rest of the system.  We also may
start making other memory allocations fail.  While the system may be
robust to such failures (as is the case with dwc2 USB trying to allocate
buffers for Ethernet data and with WiFi trying to allocate buffers for
WiFi data), it is yet another big performance hit.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Changes in v3:
- Use DMA_ATTR_SEQUENTIAL hint patch new for v3.

Changes in v2: None

 arch/arm/mm/dma-mapping.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index bc9cebfa0891..58298221ce3e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1159,6 +1159,13 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	}

 	/*
+	 * Go straight to 4K chunks if it's sequential to ease the burden on
+	 * the memory manager and to leave bug chunks available for others.
+	 */
+	if (dma_get_attr(DMA_ATTR_SEQUENTIAL, attrs))
+		order_idx = ARRAY_SIZE(iommu_order_array) - 1;
+
+	/*
 	 * IOMMU can map any pages, so himem can also be used here
 	 */
 	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
-- 
2.6.0.rc2.230.g3dd15c0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation
  2016-01-06 19:36 [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Douglas Anderson
  2016-01-06 19:36 ` [PATCH v3 1/3] ARM: dma-mapping: Optimize allocation Douglas Anderson
  2016-01-06 19:36 ` [PATCH v3 3/3] ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation Douglas Anderson
@ 2016-01-07 13:52 ` Marek Szyprowski
  2016-01-07 17:23   ` Doug Anderson
  2 siblings, 1 reply; 5+ messages in thread
From: Marek Szyprowski @ 2016-01-07 13:52 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

On 2016-01-06 20:36, Douglas Anderson wrote:
> This series of 3 patches will speed up memory allocation in dma-mapping
> quite a bit.
>
> The first patch ("ARM: dma-mapping: Optimize allocation") is hopefully
> not terribly controversial: it merely doesn't try as hard to allocate
> big chunks once it gets the first failure.  Since it's unlikely that
> further big chunks will help (they're not likely to be virtually aligned
> anyway), this should give a big speedup with no real regression to speak
> of.  Yes, things could be made better, but this seems like a sane start.
>
> The second patch ("common: DMA-mapping: add DMA_ATTR_SEQUENTIAL
> attribute") models MADV_SEQUENTIAL as I understand it.  Hopefully folks
> are happy with following that lead.  It does nothing by itself.
>
> The third patch ("ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to
> optimize allocation") simply applies the 2nd patch.  Again it's pretty
> simple.  ...and again it does nothing by itself.
>
> Notably missing from this series is the fourth patch that adds teeth to
> the second and third.  You can find that out of tree at
> <https://chromium-review.googlesource.com/#/c/320498/>.  Unfortunately
> the rk3288_vpu, which is what I'm working on, is out of tree.
>
> All testing was done on the chromeos kernel-3.14.  Sanity (compile /
> boot) testing was done on a v4.4-rc6-based kernel.
>
> Also note that v2 of this series had an extra patch
> <https://patchwork.kernel.org/patch/7888861/> that would attempt to sort
> the allocation results to opportunistically get some extra alignment.  I
> dropped that, but it could be re-introduced if there was interest.  I
> found that it did give a little extra alignment sometimes, but maybe not
> enough to justify the extra complexity.  It also was a bit half-baked
> since it really should have tried harder to ensure alignment.
>
> Changes in v3:
> - add DMA_ATTR_SEQUENTIAL attribute new for v3
> - Use DMA_ATTR_SEQUENTIAL hint patch new for v3.
>
> Changes in v2:
> - No longer just 1 page at a time, but gives up higher order quickly.
> - Only tries important higher order allocations that might help us.
>
> Douglas Anderson (3):
>    ARM: dma-mapping: Optimize allocation
>    common: DMA-mapping: add DMA_ATTR_SEQUENTIAL attribute
>    ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation
>
>   Documentation/DMA-attributes.txt | 11 +++++++++++
>   arch/arm/mm/dma-mapping.c        | 41 ++++++++++++++++++++++++++--------------
>   include/linux/dma-attrs.h        |  1 +
>   3 files changed, 39 insertions(+), 14 deletions(-)

Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>


Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation
  2016-01-07 13:52 ` [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Marek Szyprowski
@ 2016-01-07 17:23   ` Doug Anderson
  0 siblings, 0 replies; 5+ messages in thread
From: Doug Anderson @ 2016-01-07 17:23 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Jan 7, 2016 at 5:52 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hello,
>
>
> On 2016-01-06 20:36, Douglas Anderson wrote:
>>
>> This series of 3 patches will speed up memory allocation in dma-mapping
>> quite a bit.
>>
>> The first patch ("ARM: dma-mapping: Optimize allocation") is hopefully
>> not terribly controversial: it merely doesn't try as hard to allocate
>> big chunks once it gets the first failure.  Since it's unlikely that
>> further big chunks will help (they're not likely to be virtually aligned
>> anyway), this should give a big speedup with no real regression to speak
>> of.  Yes, things could be made better, but this seems like a sane start.
>>
>> The second patch ("common: DMA-mapping: add DMA_ATTR_SEQUENTIAL
>> attribute") models MADV_SEQUENTIAL as I understand it.  Hopefully folks
>> are happy with following that lead.  It does nothing by itself.
>>
>> The third patch ("ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to
>> optimize allocation") simply applies the 2nd patch.  Again it's pretty
>> simple.  ...and again it does nothing by itself.
>>
>> Notably missing from this series is the fourth patch that adds teeth to
>> the second and third.  You can find that out of tree at
>> <https://chromium-review.googlesource.com/#/c/320498/>.  Unfortunately
>> the rk3288_vpu, which is what I'm working on, is out of tree.
>>
>> All testing was done on the chromeos kernel-3.14.  Sanity (compile /
>> boot) testing was done on a v4.4-rc6-based kernel.
>>
>> Also note that v2 of this series had an extra patch
>> <https://patchwork.kernel.org/patch/7888861/> that would attempt to sort
>> the allocation results to opportunistically get some extra alignment.  I
>> dropped that, but it could be re-introduced if there was interest.  I
>> found that it did give a little extra alignment sometimes, but maybe not
>> enough to justify the extra complexity.  It also was a bit half-baked
>> since it really should have tried harder to ensure alignment.
>>
>> Changes in v3:
>> - add DMA_ATTR_SEQUENTIAL attribute new for v3
>> - Use DMA_ATTR_SEQUENTIAL hint patch new for v3.
>>
>> Changes in v2:
>> - No longer just 1 page at a time, but gives up higher order quickly.
>> - Only tries important higher order allocations that might help us.
>>
>> Douglas Anderson (3):
>>    ARM: dma-mapping: Optimize allocation
>>    common: DMA-mapping: add DMA_ATTR_SEQUENTIAL attribute
>>    ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation
>>
>>   Documentation/DMA-attributes.txt | 11 +++++++++++
>>   arch/arm/mm/dma-mapping.c        | 41
>> ++++++++++++++++++++++++++--------------
>>   include/linux/dma-attrs.h        |  1 +
>>   3 files changed, 39 insertions(+), 14 deletions(-)
>
>
> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>

In the ChromeOS review Tomasz suggested "DMA_ATTR_NOHUGEPAGE", so I'll
plan to resend the series changing the name.  I'll also fix a typo he
found in a comment in the last patch in the series.  Neither of these
is substantive, so I'll plan to keep your Ack unless you say not to.
Thanks!

At some point I'll also need to figure out how to get things landed.
I'd tend to submit the ARM bits to Russell's patch tracker, and I
guess I'll also submit the "common: DMA-mapping" patch there too...

-Doug

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-01-07 17:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-06 19:36 [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Douglas Anderson
2016-01-06 19:36 ` [PATCH v3 1/3] ARM: dma-mapping: Optimize allocation Douglas Anderson
2016-01-06 19:36 ` [PATCH v3 3/3] ARM: dma-mapping: Use DMA_ATTR_SEQUENTIAL hint to optimize allocation Douglas Anderson
2016-01-07 13:52 ` [PATCH v3 0/3] dma-mapping: Patches for speeding up allocation Marek Szyprowski
2016-01-07 17:23   ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).