Re: [Linaro-mm-sig] [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign

linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jonathan Morton <jonathan.morton@movial.com>
To: M.K.Edwards@gmail.com
Cc: Subash Patel <subashrp@gmail.com>,
	Jordan Crouse <jcrouse@codeaurora.org>,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	linux-arch@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org
Subject: Re: [Linaro-mm-sig] [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign
Date: Sat, 25 Jun 2011 08:23:00 +0300	[thread overview]
Message-ID: <BANLkTinUCZrd-JuMc3TkaF4f1VBmOu9nxQ@mail.gmail.com> (raw)
In-Reply-To: <BANLkTimi2FAmcb7ZWnjRqb-Cb8acXWsCTw@mail.gmail.com>

On 24 June 2011 01:09, Michael K. Edwards <m.k.edwards@gmail.com> wrote:
> Jonathan -
>
> I'm inviting you to this conversation (and to linaro-mm-sig, if you'd
> care to participate!), because I'd really like your commentary on what
> it takes to make write-combining fully effective on various ARMv7
> implementations.

Thanks for the invite.  I'm not fully conversant with the kernel-level
intricacy, but I do know what application code sees, and I have a
fairly good idea of what happens at the SDRAM pinout.

> Getting full write-combining performance on Intel architectures
> involves a somewhat delicate dance:
>  http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/

At a high level, that looks like a pretty effective technique (and
well explained) that works cross-platform with detail changes.
However, that describes *read* combining on an uncached area, rather
than write combining.

Write combining is easy - in my experience it Just Works on ARMv7 SoCs
in general.  In practice, I've found that you can write pretty much
anything to uncached memory and the write-combiner will deal with it
fairly intelligently.  Small writes sufficiently close together in
time and contiguous in area will be combined reliably.  This assumes
that the region is marked write-combinable, which should always be the
case for plain SDRAM.  So memcpy() *to* an uncached zone works okay,
even wtih an alignment mismatch.

Read combining is much harder as soon as you turn off the cache, which
defeats all of the nice auto-prefetching mechanisms that tend to be
built into modern caches.  Even the newest ARM SoCs disable any and
all speculative behaviour for uncached reads - it is then not possible
to set up a "streaming read" even explicitly (even though you could
reasonably express such using PLD).

There will typically be a full cache-miss latency per instruction (20+
cycles on A8), even if the addresses are exactly sequential.  If they
are not sequential, or if the memory controller does a read or write
to somewhere else in the meantime, you also get the CAS or RAS
latencies of about 25ns each, which hurt badly (CAS and RAS have not
sped up appreciably, in real terms, since PC133 a decade ago - sense
amps are not so good at fulfilling Moore's Law).  So on a 1GHz
Cortex-A8, you can spend 80 clock cycles waiting for a memory load to
complete - that's about 20 for the memory system to figure out it's
uncacheable and cause the CPU core to replay the instruction twice, 50
waiting for the SDRAM chip to spin up, and another 10 as a fudge
factor and to allow the data to percolate up.

This situation is sufficiently common that I assume (and I tell my
colleagues to assume) that this is the case.  If a vendor were to turn
off write-combining for a memory area, I would complain very loudly to
them once I discovered it.  So far, though, I can only wish that they
would sort out the memory hierarchy to make framebuffer & video reads
better.

I *have* found one vendor who appears to put GPU command buffers in
cached memory, but this necessitates a manual cache cleaning exercise
every time the command buffer is flushed.  This is a substantial
overhead too, but is perhaps easier to optimise.

IMO this whole problem is a hardware design fault.  It's SDRAM
directly wired to the chip; there's nothing going on that the memory
controller doesn't know about.  So why isn't the last cache level part
of / attached to the memory controller, so that it can be used
transparently by all relevant bus masters?  It is, BTW, not only ARM
that gets this wrong, but in the Wintel world there is so much
horsepower to spare that few people notice.

> And I expect something similar to be necessary in order to avoid the
> read-modify-write penalty for write-combining buffers on ARMv7.  (NEON
> store-multiple operations can fill an entire 64-byte entry in the
> victim buffer in one opcode; I don't know whether this is enough to
> stop the L3 memory system from reading the data before clobbering it.)

Ah, now you are talking about store misses to cached memory.

Why 64 bytes?  VLD1 does 32 bytes (4x64b) and VLDM can do 128 bytes
(16x64b).  The latter is, I think, bigger than Intel's fill buffers.
Each of these have exactly equivalent store variants.

The manual for the Cortex-A8 states that store misses in the L1 cache
are sent to the L2 cache; the L2 cache then has validity bits for
every quadword (ie. four validity domains per line), so a 16-byte
store (if aligned) is sufficient to avoid read traffic.  I assume that
the A9 and A5 are at least as sophisticated, not sure about
Snapdragon.

 - Jonathan Morton

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Jonathan Morton <jonathan.morton@movial.com>
To: M.K.Edwards@gmail.com
Cc: Subash Patel <subashrp@gmail.com>,
	Jordan Crouse <jcrouse@codeaurora.org>,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	linux-arch@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org
Subject: Re: [Linaro-mm-sig] [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign
Date: Sat, 25 Jun 2011 08:23:00 +0300	[thread overview]
Message-ID: <BANLkTinUCZrd-JuMc3TkaF4f1VBmOu9nxQ@mail.gmail.com> (raw)
Message-ID: <20110625052300.D7EDYsjBeDw-_t8Zc3b5-pE0fJvZJQeAVo7zgzwz8Tk@z> (raw)
In-Reply-To: <BANLkTimi2FAmcb7ZWnjRqb-Cb8acXWsCTw@mail.gmail.com>

On 24 June 2011 01:09, Michael K. Edwards <m.k.edwards@gmail.com> wrote:
> Jonathan -
>
> I'm inviting you to this conversation (and to linaro-mm-sig, if you'd
> care to participate!), because I'd really like your commentary on what
> it takes to make write-combining fully effective on various ARMv7
> implementations.

Thanks for the invite.  I'm not fully conversant with the kernel-level
intricacy, but I do know what application code sees, and I have a
fairly good idea of what happens at the SDRAM pinout.

> Getting full write-combining performance on Intel architectures
> involves a somewhat delicate dance:
>  http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/

At a high level, that looks like a pretty effective technique (and
well explained) that works cross-platform with detail changes.
However, that describes *read* combining on an uncached area, rather
than write combining.

Write combining is easy - in my experience it Just Works on ARMv7 SoCs
in general.  In practice, I've found that you can write pretty much
anything to uncached memory and the write-combiner will deal with it
fairly intelligently.  Small writes sufficiently close together in
time and contiguous in area will be combined reliably.  This assumes
that the region is marked write-combinable, which should always be the
case for plain SDRAM.  So memcpy() *to* an uncached zone works okay,
even wtih an alignment mismatch.

Read combining is much harder as soon as you turn off the cache, which
defeats all of the nice auto-prefetching mechanisms that tend to be
built into modern caches.  Even the newest ARM SoCs disable any and
all speculative behaviour for uncached reads - it is then not possible
to set up a "streaming read" even explicitly (even though you could
reasonably express such using PLD).

There will typically be a full cache-miss latency per instruction (20+
cycles on A8), even if the addresses are exactly sequential.  If they
are not sequential, or if the memory controller does a read or write
to somewhere else in the meantime, you also get the CAS or RAS
latencies of about 25ns each, which hurt badly (CAS and RAS have not
sped up appreciably, in real terms, since PC133 a decade ago - sense
amps are not so good at fulfilling Moore's Law).  So on a 1GHz
Cortex-A8, you can spend 80 clock cycles waiting for a memory load to
complete - that's about 20 for the memory system to figure out it's
uncacheable and cause the CPU core to replay the instruction twice, 50
waiting for the SDRAM chip to spin up, and another 10 as a fudge
factor and to allow the data to percolate up.

This situation is sufficiently common that I assume (and I tell my
colleagues to assume) that this is the case.  If a vendor were to turn
off write-combining for a memory area, I would complain very loudly to
them once I discovered it.  So far, though, I can only wish that they
would sort out the memory hierarchy to make framebuffer & video reads
better.

I *have* found one vendor who appears to put GPU command buffers in
cached memory, but this necessitates a manual cache cleaning exercise
every time the command buffer is flushed.  This is a substantial
overhead too, but is perhaps easier to optimise.

IMO this whole problem is a hardware design fault.  It's SDRAM
directly wired to the chip; there's nothing going on that the memory
controller doesn't know about.  So why isn't the last cache level part
of / attached to the memory controller, so that it can be used
transparently by all relevant bus masters?  It is, BTW, not only ARM
that gets this wrong, but in the Wintel world there is so much
horsepower to spare that few people notice.

> And I expect something similar to be necessary in order to avoid the
> read-modify-write penalty for write-combining buffers on ARMv7.  (NEON
> store-multiple operations can fill an entire 64-byte entry in the
> victim buffer in one opcode; I don't know whether this is enough to
> stop the L3 memory system from reading the data before clobbering it.)

Ah, now you are talking about store misses to cached memory.

Why 64 bytes?  VLD1 does 32 bytes (4x64b) and VLDM can do 128 bytes
(16x64b).  The latter is, I think, bigger than Intel's fill buffers.
Each of these have exactly equivalent store variants.

The manual for the Cortex-A8 states that store misses in the L1 cache
are sent to the L2 cache; the L2 cache then has validity bits for
every quadword (ie. four validity domains per line), so a 16-byte
store (if aligned) is sufficient to avoid read traffic.  I assume that
the A9 and A5 are at least as sophisticated, not sure about
Snapdragon.

 - Jonathan Morton

next prev parent reply	other threads:[~2011-06-25  5:23 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-20  7:50 [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign Marek Szyprowski
2011-06-20  7:50 ` [PATCH 1/8] ARM: dma-mapping: remove offset parameter to prepare for generic dma_ops Marek Szyprowski
2011-06-20  8:35   ` Michal Nazarewicz
2011-06-20 10:46     ` Marek Szyprowski
2011-06-20 10:46       ` Marek Szyprowski
2011-07-03 15:28   ` Russell King - ARM Linux
2011-07-03 15:28     ` Russell King - ARM Linux
2011-07-26 12:56     ` Marek Szyprowski
2011-06-20  7:50 ` [PATCH 2/8] ARM: dma-mapping: implement dma_map_single on top of dma_map_page Marek Szyprowski
2011-06-20 14:39   ` Russell King - ARM Linux
2011-06-20 14:39     ` Russell King - ARM Linux
2011-06-20 15:15     ` Marek Szyprowski
2011-06-24 15:24       ` Arnd Bergmann
2011-06-24 15:24         ` Arnd Bergmann
2011-06-27 14:29         ` Marek Szyprowski
2011-06-27 14:53           ` Arnd Bergmann
2011-06-27 14:53             ` Arnd Bergmann
2011-06-27 15:06             ` Marek Szyprowski
2011-06-20  7:50 ` [PATCH 3/8] ARM: dma-mapping: use asm-generic/dma-mapping-common.h Marek Szyprowski
2011-06-20 14:33   ` [Linaro-mm-sig] " KyongHo Cho
2011-06-21 11:47     ` Marek Szyprowski
2011-06-21 11:47       ` Marek Szyprowski
2011-06-24  8:39       ` 'Joerg Roedel'
2011-06-24  8:39         ` 'Joerg Roedel'
2011-06-24 15:36   ` Arnd Bergmann
2011-06-24 15:36     ` Arnd Bergmann
2011-06-27 12:18     ` Marek Szyprowski
2011-06-27 12:18       ` Marek Szyprowski
2011-06-27 13:19       ` Arnd Bergmann
2011-06-27 13:19         ` Arnd Bergmann
2011-07-07 12:09         ` Lennert Buytenhek
2011-07-07 12:09           ` Lennert Buytenhek
2011-07-07 12:38           ` Russell King - ARM Linux
2011-07-07 12:38             ` Russell King - ARM Linux
2011-07-15  0:10             ` Lennert Buytenhek
2011-07-15  9:27               ` Russell King - ARM Linux
2011-07-15  9:27                 ` Russell King - ARM Linux
2011-07-15 21:53                 ` Lennert Buytenhek
2011-06-20  7:50 ` [PATCH 4/8] ARM: dma-mapping: implement dma sg methods on top of generic dma ops Marek Szyprowski
2011-06-20  7:50   ` Marek Szyprowski
2011-06-20 14:37   ` KyongHo Cho
2011-06-20 14:40   ` Russell King - ARM Linux
2011-06-20 14:40     ` Russell King - ARM Linux
2011-06-20 15:23     ` Marek Szyprowski
2011-06-20  7:50 ` [PATCH 5/8] ARM: dma-mapping: move all dma bounce code to separate dma ops structure Marek Szyprowski
2011-06-20 14:42   ` Russell King - ARM Linux
2011-06-20 15:31     ` Marek Szyprowski
2011-06-20 15:31       ` Marek Szyprowski
2011-06-24 15:47       ` Arnd Bergmann
2011-06-24 15:47         ` Arnd Bergmann
2011-06-27 14:20         ` Marek Szyprowski
2011-06-27 14:20           ` Marek Szyprowski
2011-06-20  7:50 ` [PATCH 6/8] ARM: dma-mapping: remove redundant code and cleanup Marek Szyprowski
2011-06-20  7:50 ` [PATCH 7/8] common: dma-mapping: change alloc/free_coherent method to more generic alloc/free_attrs Marek Szyprowski
2011-06-20 14:45   ` KyongHo Cho
2011-06-20 15:06     ` Russell King - ARM Linux
2011-06-20 15:06       ` Russell King - ARM Linux
2011-06-20 15:14       ` [Linaro-mm-sig] " KyongHo Cho
2011-06-21 11:23     ` Marek Szyprowski
2011-06-22  0:00       ` [Linaro-mm-sig] " KyongHo Cho
2011-06-24  7:20         ` Marek Szyprowski
2011-06-24 15:51   ` Arnd Bergmann
2011-06-24 15:51     ` Arnd Bergmann
2011-06-24 16:15     ` James Bottomley
2011-06-24 16:23       ` Arnd Bergmann
2011-06-27 12:23     ` Marek Szyprowski
2011-06-27 12:23       ` Marek Szyprowski
2011-06-27 13:22       ` Arnd Bergmann
2011-06-27 13:22         ` Arnd Bergmann
2011-06-27 13:30         ` Marek Szyprowski
2011-06-27 13:30           ` Marek Szyprowski
2011-06-24 15:53   ` Arnd Bergmann
2011-06-24 15:53     ` Arnd Bergmann
2011-06-27 14:41     ` Marek Szyprowski
2011-06-20  7:50 ` [PATCH 8/8] ARM: dma-mapping: use alloc, mmap, free from dma_ops Marek Szyprowski
2011-06-22  6:53   ` [Linaro-mm-sig] " KyongHo Cho
2011-06-22  4:53 ` [Linaro-mm-sig] [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign Subash Patel
2011-06-22  6:59   ` Marek Szyprowski
2011-06-22  6:59     ` Marek Szyprowski
2011-06-22  8:53     ` Subash Patel
2011-06-22  9:27       ` Marek Szyprowski
2011-06-22 16:00         ` Jordan Crouse
2011-06-23 13:09           ` Subash Patel
2011-06-23 13:09             ` Subash Patel
2011-06-23 16:24             ` Michael K. Edwards
2011-06-23 22:09               ` Michael K. Edwards
2011-06-25  5:23                 ` Jonathan Morton [this message]
2011-06-25  5:23                   ` Jonathan Morton
2011-06-25  9:55                   ` Michael K. Edwards
2011-06-26  0:06                     ` Jonathan Morton
2011-06-24 15:20           ` Arnd Bergmann
2011-06-24 15:20             ` Arnd Bergmann
2011-06-24  9:18 ` Joerg Roedel
2011-06-24 14:26   ` Marek Szyprowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BANLkTinUCZrd-JuMc3TkaF4f1VBmOu9nxQ@mail.gmail.com \
    --to=jonathan.morton@movial.com \
    --cc=M.K.Edwards@gmail.com \
    --cc=jcrouse@codeaurora.org \
    --cc=linaro-mm-sig@lists.linaro.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=m.szyprowski@samsung.com \
    --cc=subashrp@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).