* [PATCH v5 0/5] dma-mapping: Patches for speeding up allocation
From: Douglas Anderson @ 2016-01-08 23:05 UTC
To: linux-arm-kernel

This series of patches will speed up memory allocation in dma-mapping
quite a bit.

The first patch ("ARM: dma-mapping: Optimize allocation") is hopefully
not terribly controversial: it merely doesn't try as hard to allocate
big chunks once it gets the first failure. Since it's unlikely that
further big chunks will help (they're not likely to be virtually aligned
anyway), this should give a big speedup with no real regression to speak
of. Yes, things could be made better, but this seems like a sane start.

The second patch ("common: DMA-mapping: add DMA_ATTR_NO_HUGE_PAGE
attribute") models MADV_NOHUGEPAGE as I understand it. Hopefully folks
are happy with following that lead. It does nothing by itself.

The third patch ("ARM: dma-mapping: Use DMA_ATTR_NO_HUGE_PAGE hint to
optimize allocation") simply makes use of the attribute added by the 2nd
patch. Again it's pretty simple. ...and again it does nothing by itself.

The fourth patch ("[media] videobuf2-dc: Let drivers specify DMA attrs")
comes from the ChromeOS tree (authored by Tomasz Figa) and makes the
fifth patch possible.

The fifth patch ("[media] s5p-mfc: Set DMA_ATTR_NO_HUGE_PAGE") uses the
new attribute. For a second user, you can see the out of tree patch for
rk3288 at <https://chromium-review.googlesource.com/#/c/320498/>.

All testing was done on the chromeos kernel-3.8 and kernel-3.14. Sanity
(compile / boot) testing was done on a v4.4-rc6-based kernel on rk3288,
though the video codec isn't there. I don't have graphics / MFC working
well on exynos, so the MFC change was only compile-tested upstream.
Hopefully someone upstream who is set up for MFC can give a Tested-by
for these?

Also note that v2 of this series had an extra patch
<https://patchwork.kernel.org/patch/7888861/> that would attempt to sort
the allocation results to opportunistically get some extra alignment. I
dropped that, but it could be re-introduced if there was interest. I
found that it did give a little extra alignment sometimes, but maybe not
enough to justify the extra complexity. It also was a bit half-baked
since it really should have tried harder to ensure alignment.

Changes in v5:
- renamed DMA_ATTR_NOHUGEPAGE to DMA_ATTR_NO_HUGE_PAGE
- s/ping ping/ping pong/
- Let drivers specify DMA attrs new for v5
- s5p-mfc patch new for v5

Changes in v4:
- renamed DMA_ATTR_SEQUENTIAL to DMA_ATTR_NOHUGEPAGE
- added Marek's ack

Changes in v3:
- add DMA_ATTR_SEQUENTIAL attribute new for v3
- Use DMA_ATTR_SEQUENTIAL hint patch new for v3.

Changes in v2:
- No longer just 1 page at a time, but gives up higher order quickly.
- Only tries important higher order allocations that might help us.

Douglas Anderson (4):
  ARM: dma-mapping: Optimize allocation
  common: DMA-mapping: add DMA_ATTR_NO_HUGE_PAGE attribute
  ARM: dma-mapping: Use DMA_ATTR_NO_HUGE_PAGE hint to optimize allocation
  [media] s5p-mfc: Set DMA_ATTR_NO_HUGE_PAGE

Tomasz Figa (1):
  [media] videobuf2-dc: Let drivers specify DMA attrs

 Documentation/DMA-attributes.txt               | 23 ++++++++++++++++
 arch/arm/mm/dma-mapping.c                      | 38 ++++++++++++++++----------
 drivers/media/platform/s5p-mfc/s5p_mfc.c       | 13 +++++++--
 drivers/media/v4l2-core/videobuf2-dma-contig.c | 33 ++++++++++++++--------
 include/linux/dma-attrs.h                      |  1 +
 include/media/videobuf2-dma-contig.h           | 11 +++++++-
 6 files changed, 91 insertions(+), 28 deletions(-)

-- 
2.6.0.rc2.230.g3dd15c0

* [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation
From: Douglas Anderson @ 2016-01-08 23:05 UTC
To: linux-arm-kernel

__iommu_alloc_buffer() is expected to be called to allocate pretty
sizeable buffers. Upon simple tests of video I saw it trying to
allocate 4,194,304 bytes. The function tries to allocate large chunks
in order to optimize IOMMU TLB usage.

The current function is very, very slow.

One problem is the way it keeps trying and trying to allocate big
chunks. Imagine a very fragmented memory that has 4M free but no
contiguous pages at all. Further imagine allocating 4M (1024 pages).
We'll do the following memory allocations:
- For page 1:
  - Try to allocate order 10 (no retry)
  - Try to allocate order 9 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- For page 2:
  - Try to allocate order 9 (no retry)
  - Try to allocate order 8 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- ...
- ...

Total number of calls to alloc() for this case is:
  sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
  => 9228

The above is obviously the worst case, but given how slow alloc can be
we really want to try to avoid even somewhat bad cases. I timed the old
code with a device under memory pressure and it wasn't hard to see it
take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
was done on kernel 3.14, so possibly mainline would behave
differently).

A second problem is that allocating big chunks under memory pressure is
just not a great idea anyway unless we really need them. We can make do
pretty well with smaller chunks so it's probably wise to leave bigger
chunks for other users once memory pressure is on.

Let's adjust the allocation like this:

1. If a big chunk fails, stop trying so hard and bump down to lower
   order allocations.
2. Don't try useless orders. The whole point of big chunks is to
   optimize the TLB and it can really only make use of 2M, 1M, 64K and
   4K sizes.

We'll still tend to eat up a bunch of big chunks, but that might be the
right answer for some users. A future patch could possibly add a new
DMA_ATTR that would let the caller decide that TLB optimization isn't
important and that we should use smaller chunks. Presumably this would
be a sane strategy for some callers.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
Changes in v5: None
Changes in v4:
- Added Marek's ack

Changes in v3: None
Changes in v2:
- No longer just 1 page at a time, but gives up higher order quickly.
- Only tries important higher order allocations that might help us.

 arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 0eca3812527e..bc9cebfa0891 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
 	spin_unlock_irqrestore(&mapping->lock, flags);
 }
 
+/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
+static const int iommu_order_array[] = { 9, 8, 4, 0 };
+
 static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 					  gfp_t gfp, struct dma_attrs *attrs)
 {
@@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	int count = size >> PAGE_SHIFT;
 	int array_size = count * sizeof(struct page *);
 	int i = 0;
+	int order_idx = 0;
 
 	if (array_size <= PAGE_SIZE)
 		pages = kzalloc(array_size, GFP_KERNEL);
@@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	while (count) {
 		int j, order;
 
-		for (order = __fls(count); order > 0; --order) {
-			/*
-			 * We do not want OOM killer to be invoked as long
-			 * as we can fall back to single pages, so we force
-			 * __GFP_NORETRY for orders higher than zero.
-			 */
-			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
-			if (pages[i])
-				break;
+		order = iommu_order_array[order_idx];
+
+		/* Drop down when we get small */
+		if (__fls(count) < order) {
+			order_idx++;
+			continue;
 		}
 
-		if (!pages[i]) {
-			/*
-			 * Fall back to single page allocation.
-			 * Might invoke OOM killer as last resort.
-			 */
+		if (order) {
+			/* See if it's easy to allocate a high-order chunk */
+			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
+
+			/* Go down a notch at first sign of pressure */
+			if (!pages[i]) {
+				order_idx++;
+				continue;
+			}
+		} else {
 			pages[i] = alloc_pages(gfp, 0);
 			if (!pages[i])
 				goto error;
-- 
2.6.0.rc2.230.g3dd15c0

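As a rough cross-check of the numbers in the patch description above, the old and new loops can be modelled in a few lines of Python. This is only a sketch and not kernel code: fls(), old_strategy_calls(), new_strategy_calls() and the can_alloc callback are names made up for illustration, and a 4 KiB page size is assumed. The model counts attempted allocations for a 1024-page request when only order-0 requests can succeed.

```python
PAGE_SIZE = 4096                    # assumption: 4 KiB pages
IOMMU_ORDER_ARRAY = (9, 8, 4, 0)    # mirrors iommu_order_array in the patch


def fls(n):
    """Index of the most significant set bit, like the kernel's __fls()."""
    return n.bit_length() - 1


def old_strategy_calls(count, can_alloc):
    """Count allocation attempts made by the pre-patch loop."""
    calls = 0
    while count:
        got = None
        # Try every order from __fls(count) down to 1 first.
        for order in range(fls(count), 0, -1):
            calls += 1
            if can_alloc(order):
                got = order
                break
        if got is None:
            # Fall back to a single page.
            calls += 1
            if not can_alloc(0):
                raise MemoryError("out of single pages")
            got = 0
        count -= 1 << got
    return calls


def new_strategy_calls(count, can_alloc):
    """Count allocation attempts made by the patched loop."""
    calls = 0
    order_idx = 0
    while count:
        order = IOMMU_ORDER_ARRAY[order_idx]
        if fls(count) < order:      # drop down when we get small
            order_idx += 1
            continue
        calls += 1
        if not can_alloc(order):
            if order == 0:
                raise MemoryError("out of single pages")
            order_idx += 1          # go down a notch and never go back up
            continue
        count -= 1 << order
    return calls


def only_single_pages(order):
    """Worst case from the description: nothing contiguous is free."""
    return order == 0


if __name__ == "__main__":
    print(old_strategy_calls(1024, only_single_pages))    # 9228
    print(new_strategy_calls(1024, only_single_pages))    # 1027
    print([PAGE_SIZE << o for o in IOMMU_ORDER_ARRAY])    # 2M, 1M, 64K, 4K in bytes
```

Under these assumptions the old loop makes 9228 attempts, matching the figure in the description, while the patched loop makes 1027: three failed high-order attempts followed by 1024 single-page allocations.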
* [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation 2016-01-08 23:05 ` [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation Douglas Anderson @ 2016-01-13 12:17 ` Robin Murphy 2016-01-13 17:33 ` Tomasz Figa 0 siblings, 1 reply; 7+ messages in thread From: Robin Murphy @ 2016-01-13 12:17 UTC (permalink / raw) To: linux-arm-kernel Hi Doug, On 08/01/16 23:05, Douglas Anderson wrote: > The __iommu_alloc_buffer() is expected to be called to allocate pretty > sizeable buffers. Upon simple tests of video I saw it trying to > allocate 4,194,304 bytes. The function tries to allocate large chunks > in order to optimize IOMMU TLB usage. > > The current function is very, very slow. > > One problem is the way it keeps trying and trying to allocate big > chunks. Imagine a very fragmented memory that has 4M free but no > contiguous pages at all. Further imagine allocating 4M (1024 pages). > We'll do the following memory allocations: > - For page 1: > - Try to allocate order 10 (no retry) > - Try to allocate order 9 (no retry) > - ... > - Try to allocate order 0 (with retry, but not needed) > - For page 2: > - Try to allocate order 9 (no retry) > - Try to allocate order 8 (no retry) > - ... > - Try to allocate order 0 (with retry, but not needed) > - ... > - ... > > Total number of calls to alloc() calls for this case is: > sum(int(math.log(i, 2)) + 1 for i in range(1, 1025)) > => 9228 > > The above is obviously worse case, but given how slow alloc can be we > really want to try to avoid even somewhat bad cases. I timed the old > code with a device under memory pressure and it wasn't hard to see it > take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing > was done on kernel 3.14, so possibly mainline would behave > differently). > > A second problem is that allocating big chunks under memory pressure > when we don't need them is just not a great idea anyway unless we really > need them. 
We can make due pretty well with smaller chunks so it's > probably wise to leave bigger chunks for other users once memory > pressure is on. > > Let's adjust the allocation like this: > > 1. If a big chunk fails, stop trying to hard and bump down to lower > order allocations. > 2. Don't try useless orders. The whole point of big chunks is to > optimize the TLB and it can really only make use of 2M, 1M, 64K and > 4K sizes. > > We'll still tend to eat up a bunch of big chunks, but that might be the > right answer for some users. A future patch could possibly add a new > DMA_ATTR that would let the caller decide that TLB optimization isn't > important and that we should use smaller chunks. Presumably this would > be a sane strategy for some callers. Now that I've had time to think about it properly: Reviewed-by: Robin Murphy <robin.murphy@arm.com> I just had an absolutely disgusting idea of how to get the same progression with just a single variable and no static array, but I'll keep that firmly to myself as it's almost IOCCC-grade WTF :D Thanks, Robin. > Signed-off-by: Douglas Anderson <dianders@chromium.org> > Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> > --- > Changes in v5: None > Changes in v4: > - Added Marek's ack > > Changes in v3: None > Changes in v2: > - No longer just 1 page at a time, but gives up higher order quickly. > - Only tries important higher order allocations that might help us. > > arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++-------------- > 1 file changed, 20 insertions(+), 14 deletions(-) > > diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c > index 0eca3812527e..bc9cebfa0891 100644 > --- a/arch/arm/mm/dma-mapping.c > +++ b/arch/arm/mm/dma-mapping.c > @@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping, > spin_unlock_irqrestore(&mapping->lock, flags); > } > > +/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! 
*/ > +static const int iommu_order_array[] = { 9, 8, 4, 0 }; > + > static struct page **__iommu_alloc_buffer(struct device *dev, size_t size, > gfp_t gfp, struct dma_attrs *attrs) > { > @@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size, > int count = size >> PAGE_SHIFT; > int array_size = count * sizeof(struct page *); > int i = 0; > + int order_idx = 0; > > if (array_size <= PAGE_SIZE) > pages = kzalloc(array_size, GFP_KERNEL); > @@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size, > while (count) { > int j, order; > > - for (order = __fls(count); order > 0; --order) { > - /* > - * We do not want OOM killer to be invoked as long > - * as we can fall back to single pages, so we force > - * __GFP_NORETRY for orders higher than zero. > - */ > - pages[i] = alloc_pages(gfp | __GFP_NORETRY, order); > - if (pages[i]) > - break; > + order = iommu_order_array[order_idx]; > + > + /* Drop down when we get small */ > + if (__fls(count) < order) { > + order_idx++; > + continue; > } > > - if (!pages[i]) { > - /* > - * Fall back to single page allocation. > - * Might invoke OOM killer as last resort. > - */ > + if (order) { > + /* See if it's easy to allocate a high-order chunk */ > + pages[i] = alloc_pages(gfp | __GFP_NORETRY, order); > + > + /* Go down a notch at first sign of pressure */ > + if (!pages[i]) { > + order_idx++; > + continue; > + } > + } else { > pages[i] = alloc_pages(gfp, 0); > if (!pages[i]) > goto error; > ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation 2016-01-13 12:17 ` Robin Murphy @ 2016-01-13 17:33 ` Tomasz Figa 2016-01-13 17:44 ` Robin Murphy 0 siblings, 1 reply; 7+ messages in thread From: Tomasz Figa @ 2016-01-13 17:33 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@arm.com> wrote: > Hi Doug, > > > On 08/01/16 23:05, Douglas Anderson wrote: >> >> The __iommu_alloc_buffer() is expected to be called to allocate pretty >> sizeable buffers. Upon simple tests of video I saw it trying to >> allocate 4,194,304 bytes. The function tries to allocate large chunks >> in order to optimize IOMMU TLB usage. >> >> The current function is very, very slow. >> >> One problem is the way it keeps trying and trying to allocate big >> chunks. Imagine a very fragmented memory that has 4M free but no >> contiguous pages at all. Further imagine allocating 4M (1024 pages). >> We'll do the following memory allocations: >> - For page 1: >> - Try to allocate order 10 (no retry) >> - Try to allocate order 9 (no retry) >> - ... >> - Try to allocate order 0 (with retry, but not needed) >> - For page 2: >> - Try to allocate order 9 (no retry) >> - Try to allocate order 8 (no retry) >> - ... >> - Try to allocate order 0 (with retry, but not needed) >> - ... >> - ... >> >> Total number of calls to alloc() calls for this case is: >> sum(int(math.log(i, 2)) + 1 for i in range(1, 1025)) >> => 9228 >> >> The above is obviously worse case, but given how slow alloc can be we >> really want to try to avoid even somewhat bad cases. I timed the old >> code with a device under memory pressure and it wasn't hard to see it >> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing >> was done on kernel 3.14, so possibly mainline would behave >> differently). >> >> A second problem is that allocating big chunks under memory pressure >> when we don't need them is just not a great idea anyway unless we really >> need them. 
We can make due pretty well with smaller chunks so it's >> probably wise to leave bigger chunks for other users once memory >> pressure is on. >> >> Let's adjust the allocation like this: >> >> 1. If a big chunk fails, stop trying to hard and bump down to lower >> order allocations. >> 2. Don't try useless orders. The whole point of big chunks is to >> optimize the TLB and it can really only make use of 2M, 1M, 64K and >> 4K sizes. >> >> We'll still tend to eat up a bunch of big chunks, but that might be the >> right answer for some users. A future patch could possibly add a new >> DMA_ATTR that would let the caller decide that TLB optimization isn't >> important and that we should use smaller chunks. Presumably this would >> be a sane strategy for some callers. > > > Now that I've had time to think about it properly: > > Reviewed-by: Robin Murphy <robin.murphy@arm.com> > > I just had an absolutely disgusting idea of how to get the same progression > with just a single variable and no static array, but I'll keep that firmly > to myself as it's almost IOCCC-grade WTF :D Just out of curiosity, a bitmap and loop with fls() and clearing bit on failure or something more freaky? :) Anyway: Reviewed-by: Tomasz Figa <tfiga@chromium.org> Best regards, Tomasz ^ permalink raw reply [flat|nested] 7+ messages in thread
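For readers wondering what the bitmap variant guessed at in this reply might look like, here is a minimal sketch in Python. It is purely illustrative and not something posted in the thread: ORDER_BITMAP, fls() and orders_from_bitmap() are hypothetical names, and the sketch models the simple case where every candidate order is visited once, as if each attempt failed and its bit was cleared.

```python
def fls(n):
    """Index of the most significant set bit, like the kernel's __fls()."""
    return n.bit_length() - 1


# One bit per order worth trying: 9 (2M), 8 (1M), 4 (64K), 0 (4K).
ORDER_BITMAP = (1 << 9) | (1 << 8) | (1 << 4) | (1 << 0)


def orders_from_bitmap(bitmap):
    """Yield candidate orders from highest to lowest, clearing each bit in turn."""
    while bitmap:
        order = fls(bitmap)
        yield order
        bitmap &= ~(1 << order)     # "clearing bit on failure"


if __name__ == "__main__":
    print(list(orders_from_bitmap(ORDER_BITMAP)))    # [9, 8, 4, 0]
```

The progression is the same as the static array in the patch; the trade-off is one integer of state instead of an array plus an index.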
* [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation 2016-01-13 17:33 ` Tomasz Figa @ 2016-01-13 17:44 ` Robin Murphy 0 siblings, 0 replies; 7+ messages in thread From: Robin Murphy @ 2016-01-13 17:44 UTC (permalink / raw) To: linux-arm-kernel On 13/01/16 17:33, Tomasz Figa wrote: > On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@arm.com> wrote: >> Hi Doug, >> >> >> On 08/01/16 23:05, Douglas Anderson wrote: >>> >>> The __iommu_alloc_buffer() is expected to be called to allocate pretty >>> sizeable buffers. Upon simple tests of video I saw it trying to >>> allocate 4,194,304 bytes. The function tries to allocate large chunks >>> in order to optimize IOMMU TLB usage. >>> >>> The current function is very, very slow. >>> >>> One problem is the way it keeps trying and trying to allocate big >>> chunks. Imagine a very fragmented memory that has 4M free but no >>> contiguous pages at all. Further imagine allocating 4M (1024 pages). >>> We'll do the following memory allocations: >>> - For page 1: >>> - Try to allocate order 10 (no retry) >>> - Try to allocate order 9 (no retry) >>> - ... >>> - Try to allocate order 0 (with retry, but not needed) >>> - For page 2: >>> - Try to allocate order 9 (no retry) >>> - Try to allocate order 8 (no retry) >>> - ... >>> - Try to allocate order 0 (with retry, but not needed) >>> - ... >>> - ... >>> >>> Total number of calls to alloc() calls for this case is: >>> sum(int(math.log(i, 2)) + 1 for i in range(1, 1025)) >>> => 9228 >>> >>> The above is obviously worse case, but given how slow alloc can be we >>> really want to try to avoid even somewhat bad cases. I timed the old >>> code with a device under memory pressure and it wasn't hard to see it >>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing >>> was done on kernel 3.14, so possibly mainline would behave >>> differently). 
>>>
>>> A second problem is that allocating big chunks under memory pressure
>>> when we don't need them is just not a great idea anyway unless we really
>>> need them. We can make due pretty well with smaller chunks so it's
>>> probably wise to leave bigger chunks for other users once memory
>>> pressure is on.
>>>
>>> Let's adjust the allocation like this:
>>>
>>> 1. If a big chunk fails, stop trying to hard and bump down to lower
>>> order allocations.
>>> 2. Don't try useless orders. The whole point of big chunks is to
>>> optimize the TLB and it can really only make use of 2M, 1M, 64K and
>>> 4K sizes.
>>>
>>> We'll still tend to eat up a bunch of big chunks, but that might be the
>>> right answer for some users. A future patch could possibly add a new
>>> DMA_ATTR that would let the caller decide that TLB optimization isn't
>>> important and that we should use smaller chunks. Presumably this would
>>> be a sane strategy for some callers.
>>
>>
>> Now that I've had time to think about it properly:
>>
>> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>>
>> I just had an absolutely disgusting idea of how to get the same progression
>> with just a single variable and no static array, but I'll keep that firmly
>> to myself as it's almost IOCCC-grade WTF :D
>
> Just out of curiosity, a bitmap and loop with fls() and clearing bit
> on failure or something more freaky? :)

Got a Python interpreter handy?

    order = 9
    for i in range(4):
        print(order)
        order = (order - 1) & 0xc

Like I said, disgusting :D

Robin.

>
> Anyway:
>
> Reviewed-by: Tomasz Figa <tfiga@chromium.org>
>
> Best regards,
> Tomasz
>

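To see why the single-variable trick in the reply above reproduces the order array, it may help to wrap it in a function and compare it against { 9, 8, 4, 0 } directly. This is a small sketch only; order_progression() and its parameters are names invented here for illustration.

```python
IOMMU_ORDER_ARRAY = [9, 8, 4, 0]    # the static array from the patch


def order_progression(start=9, steps=4, mask=0xC):
    """Return the orders visited by repeatedly applying (order - 1) & mask."""
    orders = []
    order = start
    for _ in range(steps):
        orders.append(order)
        # 0xC is 0b1100: after subtracting one, only bits 2 and 3 survive.
        #   9 - 1 = 0b1000 -> 8
        #   8 - 1 = 0b0111 -> 4
        #   4 - 1 = 0b0011 -> 0
        order = (order - 1) & mask
    return orders


if __name__ == "__main__":
    print(order_progression())                         # [9, 8, 4, 0]
    print(order_progression() == IOMMU_ORDER_ARRAY)    # True
```

The trick depends on the specific values 9, 8, 4 and 0, so a different set of useful IOMMU page sizes would need a different mask or would not fit this scheme at all, which is one reason the explicit array is easier to maintain.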
* [PATCH v5 3/5] ARM: dma-mapping: Use DMA_ATTR_NO_HUGE_PAGE hint to optimize allocation
From: Douglas Anderson @ 2016-01-08 23:05 UTC
To: linux-arm-kernel

If we know that TLB efficiency will not be an issue when memory is
accessed then it's not terribly important to allocate big chunks of
memory. The whole point of allocating the big chunks was that it would
make TLB usage efficient.

As Marek Szyprowski indicated:
    Please note that mapping memory with larger pages significantly
    improves performance, especially when IOMMU has a little TLB
    cache. This can be easily observed when multimedia devices do
    processing of RGB data with 90/270 degree rotation

Image rotation is distinctly an operation that needs to bounce around
through memory, so it makes sense that TLB efficiency is important
there.

Video decoding, on the other hand, is a fairly sequential operation.
During video decoding it's not expected that we'll be jumping all over
memory. Decoding video is also pretty heavy and the TLB misses aren't a
huge deal. Presumably most HW video acceleration users of dma-mapping
will not care about huge pages and will set DMA_ATTR_NO_HUGE_PAGE.

Allocating big chunks of memory is quite expensive, especially if we're
doing it repeatedly and memory is full. In one (out of tree) usage model
it is common that arm_iommu_alloc_attrs() is called 16 times in a row,
each one trying to allocate 4 MB of memory. This is called whenever the
system encounters a new video, which could easily happen while the
memory system is stressed out. In fact, on certain social media
websites that auto-play video and have infinite scrolling, it's quite
common to see not just one of these 16x4MB allocations but 2 or 3 one
right after another. Asking the system even to do a small amount of
extra work to give us big chunks in this case is just not a good use of
time.

Allocating big chunks of memory is also expensive indirectly. Even if
we ask the system not to do ANY extra work to allocate _our_ memory,
we're still potentially eating up all big chunks in the system.
Presumably there are other users in the system that aren't quite as
flexible and that actually need these big chunks. By eating all the big
chunks we're causing extra work for the rest of the system. We also may
start making other memory allocations fail. While the system may be
robust to such failures (as is the case with dwc2 USB trying to allocate
buffers for Ethernet data and with WiFi trying to allocate buffers for
WiFi data), it is yet another big performance hit.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
Changes in v5:
- renamed DMA_ATTR_NOHUGEPAGE to DMA_ATTR_NO_HUGE_PAGE

Changes in v4:
- renamed DMA_ATTR_SEQUENTIAL to DMA_ATTR_NOHUGEPAGE
- added Marek's ack

Changes in v3:
- Use DMA_ATTR_SEQUENTIAL hint patch new for v3.

Changes in v2: None

 arch/arm/mm/dma-mapping.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index bc9cebfa0891..e9fb2929cb7b 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1158,6 +1158,10 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 		return pages;
 	}
 
+	/* Go straight to 4K chunks if caller says it's OK. */
+	if (dma_get_attr(DMA_ATTR_NO_HUGE_PAGE, attrs))
+		order_idx = ARRAY_SIZE(iommu_order_array) - 1;
+
 	/*
 	 * IOMMU can map any pages, so himem can also be used here
 	 */
-- 
2.6.0.rc2.230.g3dd15c0

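The effect of starting at the last entry of the order array can be illustrated with a small Python model of the chunk sizes the loop would end up with. This is a sketch under stated assumptions, not the kernel implementation: chunk_orders(), its no_huge_page flag and the can_alloc callback are hypothetical names, and by default every allocation is assumed to succeed.

```python
IOMMU_ORDER_ARRAY = (9, 8, 4, 0)    # mirrors iommu_order_array in patch 1/5


def chunk_orders(count, no_huge_page=False, can_alloc=lambda order: True):
    """Return the list of orders allocated to cover `count` pages."""
    # With the hint set, skip straight to the final (order-0) entry.
    order_idx = len(IOMMU_ORDER_ARRAY) - 1 if no_huge_page else 0
    chunks = []
    while count:
        order = IOMMU_ORDER_ARRAY[order_idx]
        if count.bit_length() - 1 < order:    # __fls(count) < order
            order_idx += 1
            continue
        if not can_alloc(order):
            if order == 0:
                raise MemoryError("out of single pages")
            order_idx += 1
            continue
        chunks.append(order)
        count -= 1 << order
    return chunks


if __name__ == "__main__":
    print(chunk_orders(1024))                            # [9, 9]
    print(len(chunk_orders(1024, no_huge_page=True)))    # 1024
    print(chunk_orders(1041))                            # [9, 9, 4, 0]
```

In this model a 4 MB request without the hint takes two 2M chunks when memory is plentiful, while the same request with the hint is built entirely from 1024 single pages and so never competes for high-order blocks.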
* [PATCH v5 5/5] [media] s5p-mfc: Set DMA_ATTR_NO_HUGE_PAGE
From: Douglas Anderson @ 2016-01-08 23:05 UTC
To: linux-arm-kernel

We do video allocation all the time and we need it to be fast. Plus TLB
efficiency isn't terribly important for video. That means we want to
set DMA_ATTR_NO_HUGE_PAGE.

See also the previous change ("ARM: dma-mapping: Use
DMA_ATTR_NO_HUGE_PAGE hint to optimize allocation").

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Changes in v5:
- s5p-mfc patch new for v5

Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/media/platform/s5p-mfc/s5p_mfc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c b/drivers/media/platform/s5p-mfc/s5p_mfc.c
index 927ab4928779..7ea5d0d262bb 100644
--- a/drivers/media/platform/s5p-mfc/s5p_mfc.c
+++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c
@@ -1095,6 +1095,7 @@ static int s5p_mfc_alloc_memdevs(struct s5p_mfc_dev *dev)
 /* MFC probe function */
 static int s5p_mfc_probe(struct platform_device *pdev)
 {
+	DEFINE_DMA_ATTRS(attrs);
 	struct s5p_mfc_dev *dev;
 	struct video_device *vfd;
 	struct resource *res;
@@ -1164,12 +1165,20 @@ static int s5p_mfc_probe(struct platform_device *pdev)
 		}
 	}
 
-	dev->alloc_ctx[0] = vb2_dma_contig_init_ctx(dev->mem_dev_l);
+	/*
+	 * We'll do mostly sequential access, so sacrifice TLB efficiency for
+	 * faster allocation.
+	 */
+	dma_set_attr(DMA_ATTR_NO_HUGE_PAGE, &attrs);
+
+	dev->alloc_ctx[0] = vb2_dma_contig_init_ctx_attrs(dev->mem_dev_l,
+							  &attrs);
 	if (IS_ERR(dev->alloc_ctx[0])) {
 		ret = PTR_ERR(dev->alloc_ctx[0]);
 		goto err_res;
 	}
-	dev->alloc_ctx[1] = vb2_dma_contig_init_ctx(dev->mem_dev_r);
+	dev->alloc_ctx[1] = vb2_dma_contig_init_ctx_attrs(dev->mem_dev_r,
+							  &attrs);
 	if (IS_ERR(dev->alloc_ctx[1])) {
 		ret = PTR_ERR(dev->alloc_ctx[1]);
 		goto err_mem_init_ctx_1;
-- 
2.6.0.rc2.230.g3dd15c0
