* [PATCH 0/3] RFC: addition to DMA API
@ 2011-08-31 21:30 Mark Salter
2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter
` (4 more replies)
0 siblings, 5 replies; 32+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
To: linux-arm-kernel
This patch set arose out of a discussion on linux-arm concerning a
performance problem with USB on some ARMv7 based platforms. The
problem was tracked down by ming.lei at canonical.com and found to be
the result of CPU writes to DMA-coherent memory being delayed in a
write buffer between the CPU and memory. One proposed patch fixed
only the immediate problem with the USB EHCI driver, but several
folks thought a more general approach was needed, so I put this series
of patches together as a starting point for wider discussion outside
the ARM specific list.
The original problem seen was that USB storage performance was unusually
poor on some ARMv7 based platforms. With my particular setup, I was
seeing hdparm -t report ~5.6MB/s on an SMP Cortex-A9 based platform
where the same disk driver would get ~21MB/s on a Cortex-A8 based system.
My understanding from subsequent discussion is that the A9 cores have
a write buffer between the CPU and memory which could buffer data for a
prolonged period even in the case of DMA coherent mappings. The ARM
architecture code largely mitigates this by doing a write buffer flush
as part of the MMIO functions used to write to device registers. This
avoids problems in almost all drivers because most need to write to a
device register to tell the device when something is written in the
shared DMA coherent memory. In the case of USB, an EHCI host controller
will poll certain DMA coherent memory locations for information coming
from the CPU. In that case, the write buffering negatively affects
performance.
This series of patches adds a new function to the DMA API to deal with
ARMv7 and any future architectures which have write buffering even for
DMA coherent memory. The proposed dma_coherent_write_sync() function
will allow those few drivers which need it to force out write buffer
data in a timely way to avoid performace issues.
Mark Salter (3):
add dma_coherent_write_sync to DMA API
define ARM-specific dma_coherent_write_sync
add dma_coherent_write_sync calls to USB EHCI driver
Documentation/DMA-API-HOWTO.txt | 15 +++++++++++++++
Documentation/DMA-API.txt | 12 ++++++++++++
arch/arm/include/asm/dma-mapping.h | 10 ++++++++++
drivers/usb/host/ehci-q.c | 7 ++++++-
include/linux/dma-mapping.h | 6 ++++++
5 files changed, 49 insertions(+), 1 deletions(-)
--
1.7.6
^ permalink raw reply [flat|nested] 32+ messages in thread* [PATCH 1/3] add dma_coherent_write_sync to DMA API 2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter @ 2011-08-31 21:30 ` Mark Salter 2011-09-01 2:59 ` Josh Cartwright 2011-09-01 9:57 ` Michał Mirosław 2011-08-31 21:30 ` [PATCH 2/3] define ARM-specific dma_coherent_write_sync Mark Salter ` (3 subsequent siblings) 4 siblings, 2 replies; 32+ messages in thread From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw) To: linux-arm-kernel On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to coherent memory may still be held in a write buffer for a significant amount of time. This is largely mitigated by having the MMIO write functions force a write buffer flush before doing the actual write to the MMIO register. This forces out previous CPU writes to coherent memory for drivers which write to a register to inform the device that something was written to memory. However, this does not mitigate the problem for devices which poll the DMA memory for changes written by the CPU. One such case was found by ming.lei at canonical.com in the USB EHCI driver. The EHCI host controller relies at least partly on polling DMA coherent memory for information from the driver. This patch adds a dma_coherent_write_sync() function to the DMA API which drivers can use to explicitly force out data which may otherwise be held up in a write buffer. It is a no-op unless and architecture provides its own version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC. Signed-off-by: Mark Salter <msalter@redhat.com> --- Documentation/DMA-API-HOWTO.txt | 15 +++++++++++++++ Documentation/DMA-API.txt | 12 ++++++++++++ include/linux/dma-mapping.h | 6 ++++++ 3 files changed, 33 insertions(+), 0 deletions(-) diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt index a0b6250..8c22b8b 100644 --- a/Documentation/DMA-API-HOWTO.txt +++ b/Documentation/DMA-API-HOWTO.txt @@ -400,6 +400,21 @@ Make sure you've called dma_pool_free for all memory allocated from a pool before you destroy the pool. This function may not be called in interrupt context. +Some architectures which supporting DMA coherent memory may still have write +buffering between the CPU and DMA memory. This buffering may delay CPU writes +from reaching coherent memory in a timely manner. These delays in turn can +lead lead to dramatic performance issues in certain cases. An architecture +may mitigate this problem to a large degree by having a write buffer flush +implicit in the MMIO functions used to write to device registers. This works +for the most common cases where a driver needs to write to a register to tell +a device that something was written to the shared coherent memory. There are +other cases where the device polls the dma-coherent memory for data written +by the driver. In such cases, the driver needs to explicity force write buffer +data to memory by calling: + + dma_coherent_write_sync(); + + DMA Direction The interfaces described in subsequent portions of this document diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index fe23269..44d6923 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -418,6 +418,18 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr, .... +Part Ie - Write buffering to dma-coherent memory +------------------------------------------------ + +Some architectures supporting DMA coherent memory may have write +buffering between the CPU and DMA memory. This buffering may delay +CPU writes from reaching coherent memory in a timely manner. + + void + dma_coherent_write_sync() + +Force any outstanding coherent writes to memory. + Part II - Advanced dma_ usage ----------------------------- diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 347fdc3..b29d65c 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -154,6 +154,12 @@ dma_mark_declared_memory_occupied(struct device *dev, } #endif +#ifndef ARCH_HAS_DMA_COHERENT_WRITE_SYNC +static inline void dma_coherent_write_sync(void) +{ +} +#endif + /* * Managed DMA API */ -- 1.7.6 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 1/3] add dma_coherent_write_sync to DMA API 2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter @ 2011-09-01 2:59 ` Josh Cartwright 2011-09-01 9:57 ` Michał Mirosław 1 sibling, 0 replies; 32+ messages in thread From: Josh Cartwright @ 2011-09-01 2:59 UTC (permalink / raw) To: linux-arm-kernel On Wed, Aug 31, 2011 at 05:30:12PM -0400, Mark Salter wrote: > On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to > coherent memory may still be held in a write buffer for a significant amount > of time. This is largely mitigated by having the MMIO write functions force > a write buffer flush before doing the actual write to the MMIO register. This > forces out previous CPU writes to coherent memory for drivers which write to > a register to inform the device that something was written to memory. However, > this does not mitigate the problem for devices which poll the DMA memory for > changes written by the CPU. One such case was found by ming.lei at canonical.com > in the USB EHCI driver. The EHCI host controller relies at least partly on > polling DMA coherent memory for information from the driver. > > This patch adds a dma_coherent_write_sync() function to the DMA API which > drivers can use to explicitly force out data which may otherwise be held up > in a write buffer. It is a no-op unless and architecture provides its own > version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC. > > Signed-off-by: Mark Salter <msalter@redhat.com> > --- > Documentation/DMA-API-HOWTO.txt | 15 +++++++++++++++ > Documentation/DMA-API.txt | 12 ++++++++++++ > include/linux/dma-mapping.h | 6 ++++++ > 3 files changed, 33 insertions(+), 0 deletions(-) > > diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt > index a0b6250..8c22b8b 100644 > --- a/Documentation/DMA-API-HOWTO.txt > +++ b/Documentation/DMA-API-HOWTO.txt > @@ -400,6 +400,21 @@ Make sure you've called dma_pool_free for all memory allocated > from a pool before you destroy the pool. This function may not > be called in interrupt context. > > +Some architectures which supporting DMA coherent memory may still have write > +buffering between the CPU and DMA memory. This buffering may delay CPU writes > +from reaching coherent memory in a timely manner. These delays in turn can > +lead lead to dramatic performance issues in certain cases. An architecture 'lead lead' -> 'lead' -- joshc ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 1/3] add dma_coherent_write_sync to DMA API 2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter 2011-09-01 2:59 ` Josh Cartwright @ 2011-09-01 9:57 ` Michał Mirosław 2011-09-01 12:36 ` Mark Salter 1 sibling, 1 reply; 32+ messages in thread From: Michał Mirosław @ 2011-09-01 9:57 UTC (permalink / raw) To: linux-arm-kernel 2011/8/31 Mark Salter <msalter@redhat.com>: > On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to > coherent memory may still be held in a write buffer for a significant amount > of time. This is largely mitigated by having the MMIO write functions force > a write buffer flush before doing the actual write to the MMIO register. This > forces out previous CPU writes to coherent memory for drivers which write to > a register to inform the device that something was written to memory. However, > this does not mitigate the problem for devices which poll the DMA memory for > changes written by the CPU. One such case was found by ming.lei at canonical.com > in the USB EHCI driver. The EHCI host controller relies at least partly on > polling DMA coherent memory for information from the driver. > > This patch adds a dma_coherent_write_sync() function to the DMA API which > drivers can use to explicitly force out data which may otherwise be held up > in a write buffer. It is a no-op unless and architecture provides its own > version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC. [...] > --- a/Documentation/DMA-API.txt > +++ b/Documentation/DMA-API.txt > @@ -418,6 +418,18 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr, > ? ? ? ?.... > > > +Part Ie - Write buffering to dma-coherent memory > +------------------------------------------------ > + > +Some architectures supporting DMA coherent memory may have write > +buffering between the CPU and DMA memory. This buffering may delay > +CPU writes from reaching coherent memory in a timely manner. > + > + ? ?void > + ? ?dma_coherent_write_sync() > + > +Force any outstanding coherent writes to memory. This should be merged or referenced in part Ia. BTW, if there's no time limit on write buffers flushing, or if write buffers can cause reordering of the writes, then the memory accesses need to be managed just like non-DMA-coherent memory. So what differs then in DMA-coherent vs non-DMA-coherent mappings then? Best Regards, Micha? Miros?aw ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 1/3] add dma_coherent_write_sync to DMA API 2011-09-01 9:57 ` Michał Mirosław @ 2011-09-01 12:36 ` Mark Salter 2011-09-06 14:30 ` Catalin Marinas 0 siblings, 1 reply; 32+ messages in thread From: Mark Salter @ 2011-09-01 12:36 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-09-01 at 11:57 +0200, Micha? Miros?aw wrote: > BTW, if there's no time limit on write buffers flushing, or if write > buffers can cause reordering of the writes, then the memory accesses > need to be managed just like non-DMA-coherent memory. So what differs > then in DMA-coherent vs non-DMA-coherent mappings then? My understanding is that ordering is preserved, but an ARM guy should probably verify that. IIUC, the write buffers could hold data indefinitely. As a practical matter other writes needing to go out to memory will force buffered data out eventually. Again, this is my understanding which may be faulty. My feeling is that this extended write buffering makes it hard to call the dma memory fully coherent, but other limitations on ARMv7 make the buffering hard to avoid. --Mark ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 1/3] add dma_coherent_write_sync to DMA API 2011-09-01 12:36 ` Mark Salter @ 2011-09-06 14:30 ` Catalin Marinas 0 siblings, 0 replies; 32+ messages in thread From: Catalin Marinas @ 2011-09-06 14:30 UTC (permalink / raw) To: linux-arm-kernel (coming late to this thread due to holidays) 2011/9/1 Mark Salter <msalter@redhat.com>: > On Thu, 2011-09-01 at 11:57 +0200, Micha? Miros?aw wrote: >> BTW, if there's no time limit on write buffers flushing, or if write >> buffers can cause reordering of the writes, then the memory accesses >> need to be managed just like non-DMA-coherent memory. So what differs >> then in DMA-coherent vs non-DMA-coherent mappings then? > > My understanding is that ordering is preserved, but an ARM guy should > probably verify that. On ARMv6 onwards the coherent DMA is Normal Non-cacheable memory and this is buffered and can be reordered. -- Catalin ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter 2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter @ 2011-08-31 21:30 ` Mark Salter 2011-09-06 14:32 ` Catalin Marinas 2011-08-31 21:30 ` [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver Mark Salter ` (2 subsequent siblings) 4 siblings, 1 reply; 32+ messages in thread From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw) To: linux-arm-kernel For ARM kernels using CONFIG_ARM_DMA_MEM_BUFFERABLE, this patch adds an ARM specific dma_coherent_write_sync() to override the default version. This routine forces out any data sitting in a write buffer between the CPU and memory. Signed-off-by: Mark Salter <msalter@redhat.com> --- arch/arm/include/asm/dma-mapping.h | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 7a21d0b..e99562b 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -206,6 +206,16 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *, void *, dma_addr_t, size_t); +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE +#define ARCH_HAS_DMA_COHERENT_WRITE_SYNC + +static inline void dma_coherent_write_sync(void) +{ + dsb(); + outer_sync(); +} +#endif + #ifdef CONFIG_DMABOUNCE /* * For SA-1111, IXP425, and ADI systems the dma-mapping functions are "magic" -- 1.7.6 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-08-31 21:30 ` [PATCH 2/3] define ARM-specific dma_coherent_write_sync Mark Salter @ 2011-09-06 14:32 ` Catalin Marinas 2011-09-06 14:37 ` Mark Salter 0 siblings, 1 reply; 32+ messages in thread From: Catalin Marinas @ 2011-09-06 14:32 UTC (permalink / raw) To: linux-arm-kernel On 31 August 2011 22:30, Mark Salter <msalter@redhat.com> wrote: > For ARM kernels using CONFIG_ARM_DMA_MEM_BUFFERABLE, this patch adds an ARM > specific dma_coherent_write_sync() to override the default version. This > routine forces out any data sitting in a write buffer between the CPU and > memory. > > Signed-off-by: Mark Salter <msalter@redhat.com> > --- > ?arch/arm/include/asm/dma-mapping.h | ? 10 ++++++++++ > ?1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h > index 7a21d0b..e99562b 100644 > --- a/arch/arm/include/asm/dma-mapping.h > +++ b/arch/arm/include/asm/dma-mapping.h > @@ -206,6 +206,16 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *, > ? ? ? ? ? ? ? ?void *, dma_addr_t, size_t); > > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > +#define ARCH_HAS_DMA_COHERENT_WRITE_SYNC > + > +static inline void dma_coherent_write_sync(void) > +{ > + ? ? ? dsb(); > + ? ? ? outer_sync(); > +} That's what mb() and wmb() do already, at least on ARM. Why do we need another API? IIRC from past discussions on linux-arch around barriers, the mb() should be sufficient in the case of DMA coherent buffers. That's why macros like writel() on ARM have the mb() added by default (for cases where you start the DMA transfer by writing to a device register). -- Catalin ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-09-06 14:32 ` Catalin Marinas @ 2011-09-06 14:37 ` Mark Salter 2011-09-06 14:48 ` Catalin Marinas 0 siblings, 1 reply; 32+ messages in thread From: Mark Salter @ 2011-09-06 14:37 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote: > That's what mb() and wmb() do already, at least on ARM. Why do we need > another API? IIRC from past discussions on linux-arch around barriers, > the mb() should be sufficient in the case of DMA coherent buffers. > That's why macros like writel() on ARM have the mb() added by default > (for cases where you start the DMA transfer by writing to a device > register). For USB EHCI, the driver does not necessarily write to a register after writing to DMA coherent memory. In some cases, the controller polls for information written by the driver. --Mark ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-09-06 14:37 ` Mark Salter @ 2011-09-06 14:48 ` Catalin Marinas 2011-09-06 15:02 ` Mark Salter 0 siblings, 1 reply; 32+ messages in thread From: Catalin Marinas @ 2011-09-06 14:48 UTC (permalink / raw) To: linux-arm-kernel On 6 September 2011 15:37, Mark Salter <msalter@redhat.com> wrote: > On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote: >> That's what mb() and wmb() do already, at least on ARM. Why do we need >> another API? IIRC from past discussions on linux-arch around barriers, >> the mb() should be sufficient in the case of DMA coherent buffers. >> That's why macros like writel() on ARM have the mb() added by default >> (for cases where you start the DMA transfer by writing to a device >> register). > > For USB EHCI, the driver does not necessarily write to a register after > writing to DMA coherent memory. In some cases, the controller polls for > information written by the driver. So as I understand, you would like to force the eviction from the write buffer rather than waiting for it to be drained. On ARM, the write buffer is eventually flushed, so there is no strict timing guarantee. It could take longer if the processor immediately starts polling some memory location for example, but in this case a simple barrier would do. -- Catalin ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-09-06 14:48 ` Catalin Marinas @ 2011-09-06 15:02 ` Mark Salter 2011-10-03 1:40 ` Jon Masters 0 siblings, 1 reply; 32+ messages in thread From: Mark Salter @ 2011-09-06 15:02 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2011-09-06 at 15:48 +0100, Catalin Marinas wrote: > On 6 September 2011 15:37, Mark Salter <msalter@redhat.com> wrote: > > On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote: > >> That's what mb() and wmb() do already, at least on ARM. Why do we need > >> another API? IIRC from past discussions on linux-arch around barriers, > >> the mb() should be sufficient in the case of DMA coherent buffers. > >> That's why macros like writel() on ARM have the mb() added by default > >> (for cases where you start the DMA transfer by writing to a device > >> register). > > > > For USB EHCI, the driver does not necessarily write to a register after > > writing to DMA coherent memory. In some cases, the controller polls for > > information written by the driver. > > So as I understand, you would like to force the eviction from the > write buffer rather than waiting for it to be drained. On ARM, the > write buffer is eventually flushed, so there is no strict timing > guarantee. It could take longer if the processor immediately starts > polling some memory location for example, but in this case a simple > barrier would do. Yes, a memory barrier would have the same effect on ARM, but the purpose of a barrier is to guarantee ordering. What the patch does is add an interface to force a write buffer flush for performance, not ordering. If a memory barrier is used, it could have a negative impact on other arches. In any case, the current thinking is that the original problem with the USB performance seen on cortex A9 multicore is probably something more than just write buffer delays. Once the original problem is better understood, we can take another look at this patch if it is still needed. --Mark ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-09-06 15:02 ` Mark Salter @ 2011-10-03 1:40 ` Jon Masters 2011-10-03 8:44 ` Catalin Marinas 0 siblings, 1 reply; 32+ messages in thread From: Jon Masters @ 2011-10-03 1:40 UTC (permalink / raw) To: linux-arm-kernel On Sep 6, 2011, at 11:02 AM, Mark Salter wrote: > > In any case, the current thinking is that the original problem with > the USB performance seen on cortex A9 multicore is probably something > more than just write buffer delays. Once the original problem is better > understood, we can take another look at this patch if it is still > needed. Thanks again for looking into this Mark. My understanding is that this is still being investigated. I'll followup with ARM to see how that's going since I've heard nothing recently :) Meanwhile, we're continuing to carry a hack based on these patches in Fedora ARM kernels. Jon. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-10-03 1:40 ` Jon Masters @ 2011-10-03 8:44 ` Catalin Marinas 2011-10-03 9:24 ` Jon Masters 0 siblings, 1 reply; 32+ messages in thread From: Catalin Marinas @ 2011-10-03 8:44 UTC (permalink / raw) To: linux-arm-kernel On Mon, Oct 03, 2011 at 02:40:19AM +0100, Jon Masters wrote: > On Sep 6, 2011, at 11:02 AM, Mark Salter wrote: > > In any case, the current thinking is that the original problem with > > the USB performance seen on cortex A9 multicore is probably something > > more than just write buffer delays. Once the original problem is better > > understood, we can take another look at this patch if it is still > > needed. > > Thanks again for looking into this Mark. My understanding is that this > is still being investigated. I'll followup with ARM to see how that's > going since I've heard nothing recently :) Meanwhile, we're continuing > to carry a hack based on these patches in Fedora ARM kernels. Not talking about hardware specifics here, the architecture (though ARMv7 onwards) mandates that the write buffer is eventually drained. But doesn't state any upper limit, so it could even be half a second. In this case, some form of buffer draining for devices that poll the memory may be useful. If we don't want to add a new API, something like below would work as well: write to DMA buffer; mb(); read from DMA buffer; So an mb() between the last write and a subsequent read would force the visibility of the write. We have similar scenarios for Device accesses. If we go for a DMA API extension, the dma_coherent_write_sync() should probably take an address as well, just in case implementations would force the draining via some read back from the same area. -- Catalin ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/3] define ARM-specific dma_coherent_write_sync 2011-10-03 8:44 ` Catalin Marinas @ 2011-10-03 9:24 ` Jon Masters 0 siblings, 0 replies; 32+ messages in thread From: Jon Masters @ 2011-10-03 9:24 UTC (permalink / raw) To: linux-arm-kernel On Oct 3, 2011, at 4:44 AM, Catalin Marinas wrote: > On Mon, Oct 03, 2011 at 02:40:19AM +0100, Jon Masters wrote: >> On Sep 6, 2011, at 11:02 AM, Mark Salter wrote: >>> In any case, the current thinking is that the original problem with >>> the USB performance seen on cortex A9 multicore is probably something >>> more than just write buffer delays. Once the original problem is better >>> understood, we can take another look at this patch if it is still >>> needed. >> >> Thanks again for looking into this Mark. My understanding is that this >> is still being investigated. I'll followup with ARM to see how that's >> going since I've heard nothing recently :) Meanwhile, we're continuing >> to carry a hack based on these patches in Fedora ARM kernels. > > Not talking about hardware specifics here, the architecture (though > ARMv7 onwards) mandates that the write buffer is eventually drained. But > doesn't state any upper limit, so it could even be half a second. <snip mechanics of dma_coherent_write_sync, etc.> I guess my main question is, do you think this is just a write buffer delay? If you do, then we should definitely get back to this question of defining DMA extensions. But are we sure that's what this is? Mark: did you have any more insight into this recently? Jon. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver 2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter 2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter 2011-08-31 21:30 ` [PATCH 2/3] define ARM-specific dma_coherent_write_sync Mark Salter @ 2011-08-31 21:30 ` Mark Salter 2011-09-01 2:33 ` Ming Lei 2011-09-01 2:09 ` [PATCH 0/3] RFC: addition to DMA API Ming Lei 2011-09-01 9:11 ` Will Deacon 4 siblings, 1 reply; 32+ messages in thread From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw) To: linux-arm-kernel The EHCI driver polls DMA coherent memory for control data written by the driver. On some architectures, such as ARMv7, the writes from the driver may get delayed in a write buffer even though it is written to DMA coherent memory. This delay led to serious performance issues on an ARMv7 based platform using a USB disk drive. Before using this patch, 'hdparm -t' showed a read speed of 5.7MB/s. After applying this patch, hdparm showed 23.5MB/s. Signed-off-by: Mark Salter <msalter@redhat.com> --- drivers/usb/host/ehci-q.c | 7 ++++++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c index 0917e3a..75d9838 100644 --- a/drivers/usb/host/ehci-q.c +++ b/drivers/usb/host/ehci-q.c @@ -114,6 +114,7 @@ qh_update (struct ehci_hcd *ehci, struct ehci_qh *qh, struct ehci_qtd *qtd) /* HC must see latest qtd and qh data before we clear ACTIVE+HALT */ wmb (); hw->hw_token &= cpu_to_hc32(ehci, QTD_TOGGLE | QTD_STS_PING); + dma_coherent_write_sync(); } /* if it weren't for a common silicon quirk (writing the dummy into the qh @@ -404,6 +405,7 @@ qh_completions (struct ehci_hcd *ehci, struct ehci_qh *qh) wmb(); hw->hw_token = cpu_to_hc32(ehci, token); + dma_coherent_write_sync(); goto retry_xacterr; } stopped = 1; @@ -753,8 +755,10 @@ qh_urb_transaction ( } /* by default, enable interrupt on urb completion */ - if (likely (!(urb->transfer_flags & URB_NO_INTERRUPT))) + if (likely(!(urb->transfer_flags & URB_NO_INTERRUPT))) { qtd->hw_token |= cpu_to_hc32(ehci, QTD_IOC); + dma_coherent_write_sync(); + } return head; cleanup: @@ -1081,6 +1085,7 @@ static struct ehci_qh *qh_append_tds ( /* let the hc process these next qtds */ wmb (); dummy->hw_token = token; + dma_coherent_write_sync(); urb->hcpriv = qh_get (qh); } -- 1.7.6 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver 2011-08-31 21:30 ` [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver Mark Salter @ 2011-09-01 2:33 ` Ming Lei 0 siblings, 0 replies; 32+ messages in thread From: Ming Lei @ 2011-09-01 2:33 UTC (permalink / raw) To: linux-arm-kernel Hi, On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote: > The EHCI driver polls DMA coherent memory for control data written by the > driver. On some architectures, such as ARMv7, the writes from the driver > may get delayed in a write buffer even though it is written to DMA coherent > memory. This delay led to serious performance issues on an ARMv7 based > platform using a USB disk drive. Before using this patch, 'hdparm -t' showed > a read speed of 5.7MB/s. After applying this patch, hdparm showed 23.5MB/s. > > Signed-off-by: Mark Salter <msalter@redhat.com> > --- > ?drivers/usb/host/ehci-q.c | ? ?7 ++++++- > ?1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c > index 0917e3a..75d9838 100644 > --- a/drivers/usb/host/ehci-q.c > +++ b/drivers/usb/host/ehci-q.c > @@ -114,6 +114,7 @@ qh_update (struct ehci_hcd *ehci, struct ehci_qh *qh, struct ehci_qtd *qtd) > ? ? ? ?/* HC must see latest qtd and qh data before we clear ACTIVE+HALT */ > ? ? ? ?wmb (); > ? ? ? ?hw->hw_token &= cpu_to_hc32(ehci, QTD_TOGGLE | QTD_STS_PING); > + ? ? ? dma_coherent_write_sync(); It is not needed at all, just before the qh is linked into hw queue, there is one wmb to handle sync of qh correctly. Even the wmb can be removed as the patch I have posted out in usb mail list. > ?} > > ?/* if it weren't for a common silicon quirk (writing the dummy into the qh > @@ -404,6 +405,7 @@ qh_completions (struct ehci_hcd *ehci, struct ehci_qh *qh) > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?wmb(); > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?hw->hw_token = cpu_to_hc32(ehci, > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?token); > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dma_coherent_write_sync(); It is in a cold path, and if adding the helper or not does not matter. > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?goto retry_xacterr; > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped = 1; > @@ -753,8 +755,10 @@ qh_urb_transaction ( > ? ? ? ?} > > ? ? ? ?/* by default, enable interrupt on urb completion */ > - ? ? ? if (likely (!(urb->transfer_flags & URB_NO_INTERRUPT))) > + ? ? ? if (likely(!(urb->transfer_flags & URB_NO_INTERRUPT))) { > ? ? ? ? ? ? ? ?qtd->hw_token |= cpu_to_hc32(ehci, QTD_IOC); > + ? ? ? ? ? ? ? dma_coherent_write_sync(); It is not needed at all, the wmb in qh_append_tds will handle sync of qtd correctly. > + ? ? ? } > ? ? ? ?return head; > > ?cleanup: > @@ -1081,6 +1085,7 @@ static struct ehci_qh *qh_append_tds ( > ? ? ? ? ? ? ? ? ? ? ? ?/* let the hc process these next qtds */ > ? ? ? ? ? ? ? ? ? ? ? ?wmb (); > ? ? ? ? ? ? ? ? ? ? ? ?dummy->hw_token = token; > + ? ? ? ? ? ? ? ? ? ? ? dma_coherent_write_sync(); It is the only one which does make sense up to now, see discussion in http://marc.info/?t=131472029700001&r=1&w=2 http://marc.info/?t=131445642100002&r=1&w=2 thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter ` (2 preceding siblings ...) 2011-08-31 21:30 ` [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver Mark Salter @ 2011-09-01 2:09 ` Ming Lei 2011-09-01 3:09 ` Alan Stern 2011-09-01 9:11 ` Will Deacon 4 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2011-09-01 2:09 UTC (permalink / raw) To: linux-arm-kernel Hi, On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote: > This patch set arose out of a discussion on linux-arm concerning a > performance problem with USB on some ARMv7 based platforms. The > problem was tracked down by ming.lei at canonical.com and found to be > the result of CPU writes to DMA-coherent memory being delayed in a > write buffer between the CPU and memory. One proposed patch fixed > only the immediate problem with the USB EHCI driver, but several > folks thought a more general approach was needed, so I put this series > of patches together as a starting point for wider discussion outside > the ARM specific list. After some further thoughts, I think it is not a good idea to introduce a general DMA API to handle this case, see below: 1. The side-effect of new API is that it will make descriptors of dma in a partial update, such as qtd in the ehci case, even ehci can handle this successful, but it is really not good to make DMA bus master see a partial update of descriptor, and I am not sure that other kind of bus masters can handle this correctly, which may introduce other problems. A proper memory barrier will always make dma master see a atomic update of dma descriptors, which should be the preferred way to take by device driver. 2, most of such cases can be handled correctly by mb/wmb/rmb barriers. The ehci case I reported is in the one of the most tricky code path in ehci driver, and it should be a special case, and up to now, we only have found this case can't be handled by memory barriers. Is there other cases which can't be handled correctly by mb/wmb/rmb? If so, please point it out. 3, The new DMA API for the purpose to be introduced is much easier to understand, and much easier to use than memory barrier, so it is very possible to make device driver guys misuse or abuse it instead of using memory barrier first to handle the case. > > The original problem seen was that USB storage performance was unusually > poor on some ARMv7 based platforms. With my particular setup, I was > seeing hdparm -t report ~5.6MB/s on an SMP Cortex-A9 based platform > where the same disk driver would get ~21MB/s on a Cortex-A8 based system. > My understanding from subsequent discussion is that the A9 cores have > a write buffer between the CPU and memory which could buffer data for a > prolonged period even in the case of DMA coherent mappings. The ARM > architecture code largely mitigates this by doing a write buffer flush > as part of the MMIO functions used to write to device registers. This > avoids problems in almost all drivers because most need to write to a > device register to tell the device when something is written in the > shared DMA coherent memory. In the case of USB, an EHCI host controller > will poll certain DMA coherent memory locations for information coming > from the CPU. In that case, the write buffering negatively affects > performance. > > This series of patches adds a new function to the DMA API to deal with > ARMv7 and any future architectures which have write buffering even for > DMA coherent memory. The proposed dma_coherent_write_sync() function > will allow those few drivers which need it to force out write buffer > data in a timely way to avoid performace issues. > > Mark Salter (3): > ?add dma_coherent_write_sync to DMA API > ?define ARM-specific dma_coherent_write_sync > ?add dma_coherent_write_sync calls to USB EHCI driver > > ?Documentation/DMA-API-HOWTO.txt ? ?| ? 15 +++++++++++++++ > ?Documentation/DMA-API.txt ? ? ? ? ?| ? 12 ++++++++++++ > ?arch/arm/include/asm/dma-mapping.h | ? 10 ++++++++++ > ?drivers/usb/host/ehci-q.c ? ? ? ? ?| ? ?7 ++++++- > ?include/linux/dma-mapping.h ? ? ? ?| ? ?6 ++++++ > ?5 files changed, 49 insertions(+), 1 deletions(-) > > -- > 1.7.6 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at ?http://vger.kernel.org/majordomo-info.html > Please read the FAQ at ?http://www.tux.org/lkml/ > thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 2:09 ` [PATCH 0/3] RFC: addition to DMA API Ming Lei @ 2011-09-01 3:09 ` Alan Stern 2011-09-01 3:41 ` Ming Lei 0 siblings, 1 reply; 32+ messages in thread From: Alan Stern @ 2011-09-01 3:09 UTC (permalink / raw) To: linux-arm-kernel On Thu, 1 Sep 2011, Ming Lei wrote: > Hi, > > On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote: > > This patch set arose out of a discussion on linux-arm concerning a > > performance problem with USB on some ARMv7 based platforms. The > > problem was tracked down by ming.lei at canonical.com and found to be > > the result of CPU writes to DMA-coherent memory being delayed in a > > write buffer between the CPU and memory. One proposed patch fixed > > only the immediate problem with the USB EHCI driver, but several > > folks thought a more general approach was needed, so I put this series > > of patches together as a starting point for wider discussion outside > > the ARM specific list. > > After some further thoughts, I think it is not a good idea to introduce a > general DMA API to handle this case, see below: > > 1. The side-effect of new API is that it will make descriptors of dma in a > partial update, such as qtd in the ehci case, even ehci can handle this > successful, but it is really not good to make DMA bus master see a > partial update of descriptor, and I am not sure that other kind of bus masters > can handle this correctly, which may introduce other problems. A proper memory > barrier will always make dma master see a atomic update of dma descriptors, > which should be the preferred way to take by device driver. No, this is completely wrong. Firstly, you are forgetting about other architectures, ones in which writes to coherent memory aren't buffered. On those architectures there's no way to prevent the DMA bus master from seeing an intermediate state of the data structures. Therefore the driver has to be written so that even when this happens, everything will work correctly. Secondly, even when write flushes are used, you can't guarantee that the DMA bus master will see an atomic update. It might turn out that the hardware occasionally flushes some writes very quickly, before the data-structure updates are complete. Thirdly, you are mixing up memory barriers with write flushes. The barriers are used to make sure that writes are done in the correct order, whereas the flushes are used to make sure that writes are done reasonably quickly. One has nothing to do with the other, even if by coincidence on ARM a memory barrier causes a write flush. On other architectures this might not be true. > 2, most of such cases can be handled correctly by mb/wmb/rmb barriers. No, they can't. See the third point above. > The ehci case I reported is in the one of the most tricky code path in > ehci driver, > and it should be a special case, and up to now, we only have found this case > can't be handled by memory barriers. Is there other cases which can't be handled > correctly by mb/wmb/rmb? If so, please point it out. > > 3, The new DMA API for the purpose to be introduced is much easier to > understand, and much easier to use than memory barrier, so it is very > possible to make device driver guys misuse or abuse it instead of using > memory barrier first to handle the case. That criticism could apply to almost any new feature. We shouldn't be afraid to adopt something new merely because it's so easy to use that it might be misused. Alan Stern ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 3:09 ` Alan Stern @ 2011-09-01 3:41 ` Ming Lei 2011-09-01 8:45 ` Will Deacon 2011-09-01 15:22 ` Alan Stern 0 siblings, 2 replies; 32+ messages in thread From: Ming Lei @ 2011-09-01 3:41 UTC (permalink / raw) To: linux-arm-kernel Hi, On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Thu, 1 Sep 2011, Ming Lei wrote: > >> Hi, >> >> On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote: >> > This patch set arose out of a discussion on linux-arm concerning a >> > performance problem with USB on some ARMv7 based platforms. The >> > problem was tracked down by ming.lei at canonical.com and found to be >> > the result of CPU writes to DMA-coherent memory being delayed in a >> > write buffer between the CPU and memory. One proposed patch fixed >> > only the immediate problem with the USB EHCI driver, but several >> > folks thought a more general approach was needed, so I put this series >> > of patches together as a starting point for wider discussion outside >> > the ARM specific list. >> >> After some further thoughts, I think it is not a good idea to introduce a >> general DMA API to handle this case, see below: >> >> 1. The side-effect of new API is that it will make descriptors of dma in a >> partial update, such as qtd in the ehci case, even ehci can handle this >> successful, but it is really not good to make DMA bus master see a >> partial update of descriptor, and I am not ?sure that other kind of bus masters >> can handle this correctly, which may introduce other problems. A proper memory >> barrier will always make dma master see a atomic update of dma descriptors, >> which should be the preferred way to take by device driver. > > No, this is completely wrong. > > Firstly, you are forgetting about other architectures, ones in which > writes to coherent memory aren't buffered. ?On those architectures > there's no way to prevent the DMA bus master from seeing an > intermediate state of the data structures. ?Therefore the driver has to > be written so that even when this happens, everything will work > correctly. > > Secondly, even when write flushes are used, you can't guarantee that > the DMA bus master will see an atomic update. ?It might turn out that > the hardware occasionally flushes some writes very quickly, before the > data-structure updates are complete. > > Thirdly, you are mixing up memory barriers with write flushes. ?The > barriers are used to make sure that writes are done in the correct > order, whereas the flushes are used to make sure that writes are done > reasonably quickly. ?One has nothing to do with the other, even if by > coincidence on ARM a memory barrier causes a write flush. ?On other > architectures this might not be true. I agree all about above, but what I described is from another view. I post out the example before explaining my idea further: CPU device A=1; wmb B=2; read B read A one wmb is used to order 'A=1' and 'B=2', which will make the two write operations reach to physical memory as the order: 'A=1' first, 'B=2' second. Then the device can observe the two write events as the order above, so if device has seen 'B==2', then device will surely see 'A==1'. Suppose writing to A is operation to update dma descriptor, the above example can make device always see a atomic update of descriptor, can't it? My idea is that the memory access patterns are to be considered for writer of device driver. For example, many memory access patterns on EHCI hardware are described in detail. Of course, device driver should make full use of the background info, below is a example from ehci driver: qh_link_async(): /*prepare qh descriptor*/ qh->qh_next = head->qh_next; qh->hw->hw_next = head->hw->hw_next; wmb (); /*link the qh descriptor into hardware queue*/ head->qh_next.qh = qh; head->hw->hw_next = dma; so once EHCI fetches a qh with the address of 'dma', it will always see consistent content of qh descriptor, which could not be updated partially. > >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers. > > No, they can't. ?See the third point above. The example above has demoed that barriers can do it, hasn't it? > >> The ehci case I reported is in the one of the most tricky code path in >> ehci driver, >> and it should be a special case, and up to now, we only have found this case >> can't be handled by memory barriers. Is there other cases which can't be handled >> correctly by mb/wmb/rmb? If so, please point it out. >> >> 3, The new DMA API for the purpose to be introduced is much easier to >> understand, and much easier to use than memory barrier, so it is very >> possible to make device driver guys misuse or abuse it instead of using >> memory barrier first to handle the case. > > That criticism could apply to almost any new feature. ?We shouldn't be > afraid to adopt something new merely because it's so easy to use that > it might be misused. This point depends on the #1 and #2. thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 3:41 ` Ming Lei @ 2011-09-01 8:45 ` Will Deacon 2011-09-01 9:14 ` Ming Lei 2011-09-01 15:22 ` Alan Stern 1 sibling, 1 reply; 32+ messages in thread From: Will Deacon @ 2011-09-01 8:45 UTC (permalink / raw) To: linux-arm-kernel On Thu, Sep 01, 2011 at 04:41:46AM +0100, Ming Lei wrote: > Hi, > > On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > > > > No, this is completely wrong. > > > > Firstly, you are forgetting about other architectures, ones in which > > writes to coherent memory aren't buffered. ?On those architectures > > there's no way to prevent the DMA bus master from seeing an > > intermediate state of the data structures. ?Therefore the driver has to > > be written so that even when this happens, everything will work > > correctly. > > > > Secondly, even when write flushes are used, you can't guarantee that > > the DMA bus master will see an atomic update. ?It might turn out that > > the hardware occasionally flushes some writes very quickly, before the > > data-structure updates are complete. > > > > Thirdly, you are mixing up memory barriers with write flushes. ?The > > barriers are used to make sure that writes are done in the correct > > order, whereas the flushes are used to make sure that writes are done > > reasonably quickly. ?One has nothing to do with the other, even if by > > coincidence on ARM a memory barrier causes a write flush. ?On other > > architectures this might not be true. > > I agree all about above, but what I described is from another view. > I post out the example before explaining my idea further: > > > CPU device > A=1; > wmb > B=2; > read B > read A > > one wmb is used to order 'A=1' and 'B=2', which will make the two write > operations reach to physical memory as the order: 'A=1' first, 'B=2' second. > Then the device can observe the two write events as the order above, > so if device has seen 'B==2', then device will surely see 'A==1'. > > Suppose writing to A is operation to update dma descriptor, the above example > can make device always see a atomic update of descriptor, can't it? > > My idea is that the memory access patterns are to be considered for > writer of device driver. For example, many memory access patterns on > EHCI hardware are described in detail. Of course, device driver should > make full use of the background info, below is a example from ehci driver: > > qh_link_async(): > > /*prepare qh descriptor*/ > qh->qh_next = head->qh_next; > qh->hw->hw_next = head->hw->hw_next; > wmb (); > > /*link the qh descriptor into hardware queue*/ > head->qh_next.qh = qh; > head->hw->hw_next = dma; > > so once EHCI fetches a qh with the address of 'dma', it will always see > consistent content of qh descriptor, which could not be updated partially. I'm struggling to see what you're getting at here. The proposal has *absolutely nothing* to do with memory barriers. All of the existing barriers will remain - they are needed for correctness. What changes is the addition of an /optional/ flush operation in order to guarantee some sort of immediacy for writes to the coherent buffer. > >> 3, The new DMA API for the purpose to be introduced is much easier to > >> understand, and much easier to use than memory barrier, so it is very > >> possible to make device driver guys misuse or abuse it instead of using > >> memory barrier first to handle the case. > > > > That criticism could apply to almost any new feature. ?We shouldn't be > > afraid to adopt something new merely because it's so easy to use that > > it might be misused. > > This point depends on the #1 and #2. Huh? I don't see the connection. If your worry is that people will start littering their code with flush calls, I don't think that's especially likely. The usual problem (from what I've seen) is that barriers tend to be missing rather than overused so I don't see why this would be different for a what has been proposed. Will ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 8:45 ` Will Deacon @ 2011-09-01 9:14 ` Ming Lei 2011-09-01 15:42 ` Alan Stern 0 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2011-09-01 9:14 UTC (permalink / raw) To: linux-arm-kernel On Thu, Sep 1, 2011 at 4:45 PM, Will Deacon <will.deacon@arm.com> wrote: > On Thu, Sep 01, 2011 at 04:41:46AM +0100, Ming Lei wrote: >> Hi, >> >> On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote: >> > >> > No, this is completely wrong. >> > >> > Firstly, you are forgetting about other architectures, ones in which >> > writes to coherent memory aren't buffered. ?On those architectures >> > there's no way to prevent the DMA bus master from seeing an >> > intermediate state of the data structures. ?Therefore the driver has to >> > be written so that even when this happens, everything will work >> > correctly. >> > >> > Secondly, even when write flushes are used, you can't guarantee that >> > the DMA bus master will see an atomic update. ?It might turn out that >> > the hardware occasionally flushes some writes very quickly, before the >> > data-structure updates are complete. >> > >> > Thirdly, you are mixing up memory barriers with write flushes. ?The >> > barriers are used to make sure that writes are done in the correct >> > order, whereas the flushes are used to make sure that writes are done >> > reasonably quickly. ?One has nothing to do with the other, even if by >> > coincidence on ARM a memory barrier causes a write flush. ?On other >> > architectures this might not be true. >> >> I agree all about above, but what I described is from another view. >> I post out the example before explaining my idea further: >> >> >> ? ? ? CPU ? ? ? ? ? ? ? ? ? ? device >> ? ? ? A=1; >> ? ? ? wmb >> ? ? ? B=2; >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read B >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read A >> >> one wmb is used to order 'A=1' and 'B=2', which will make the two write >> operations reach to physical memory as the order: 'A=1' first, 'B=2' second. >> Then the device can observe the two write events as the order above, >> so if device has seen 'B==2', then device will surely see 'A==1'. >> >> Suppose writing to A is operation to update dma descriptor, the above example >> can make device always see a atomic update of descriptor, can't it? >> >> My idea is that the memory access patterns are to be considered for >> writer of device driver. For example, many memory access patterns on >> EHCI hardware are described in detail. ?Of course, device driver should >> make full use of the background info, below is a example from ehci driver: >> >> qh_link_async(): >> >> ? ? ? /*prepare qh descriptor*/ >> ? ? ? qh->qh_next = head->qh_next; >> ? ? ? qh->hw->hw_next = head->hw->hw_next; >> ? ? ? wmb (); >> >> ? ? ? /*link the qh descriptor into hardware queue*/ >> ? ? ? head->qh_next.qh = qh; >> ? ? ? head->hw->hw_next = dma; >> >> so once EHCI fetches a qh with the address of 'dma', it will always see >> consistent content of qh descriptor, which could not be updated partially. > > I'm struggling to see what you're getting at here. The proposal has > *absolutely nothing* to do with memory barriers. All of the existing > barriers will remain - they are needed for correctness. What changes is the > addition of an /optional/ flush operation in order to guarantee some sort of > immediacy for writes to the coherent buffer. First, most of barriers have made this kind of flush not necessary, as explained int the example above. Even for ehci driver, the flush to be added is only __one__ special case, so could you list other cases in which the flush operation is a must and memory barrier can't do it? If one can understand dma master access order on shared memory in the context, it is not difficult to use memory barrier to avoid the flush operation. Second, as I said before, the flush operation is very easy to cause dma descriptor updated in partial, as qtd descriptor updated in the ehci case. It is one of the side effects of the flush to be introduced. Fortunately, EHCI can handle this correctly, but can other hardwares always handle this correctly? > >> >> 3, The new DMA API for the purpose to be introduced is much easier to >> >> understand, and much easier to use than memory barrier, so it is very >> >> possible to make device driver guys misuse or abuse it instead of using >> >> memory barrier first to handle the case. >> > >> > That criticism could apply to almost any new feature. ?We shouldn't be >> > afraid to adopt something new merely because it's so easy to use that >> > it might be misused. >> >> This point depends on the #1 and #2. > > Huh? I don't see the connection. If your worry is that people will start > littering their code with flush calls, I don't think that's especially > likely. The usual problem (from what I've seen) is that barriers tend to be > missing rather than overused so I don't see why this would be different for > a what has been proposed. Barrier is often missed because it is very difficult to use and one have to understand the dma master access order on shared memory by . The flush to be introduced is very easy to understand, so it is very likely to be abused, isn't it? thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 9:14 ` Ming Lei @ 2011-09-01 15:42 ` Alan Stern 2011-09-01 16:04 ` Russell King - ARM Linux 0 siblings, 1 reply; 32+ messages in thread From: Alan Stern @ 2011-09-01 15:42 UTC (permalink / raw) To: linux-arm-kernel On Thu, 1 Sep 2011, Ming Lei wrote: > First, most of barriers have made this kind of flush not necessary, as > explained int the example above. You don't seem to understand the difference between memory barriers and write flushes. Neither makes the other unnecessary -- they do different things. Now, there is good reason to question why write flushes should be needed at all. According to the definition in Documentation/DMA-API.txt (the document doesn't distinguish between coherent memory and consistent memory): Consistent memory is memory for which a write by either the device or the processor can immediately be read by the processor or device without having to worry about caching effects. (You may however need to make sure to flush the processor's write buffers before telling devices to read that memory.) As far as I can tell, we are talking about a cache flush rather than a processor write buffer flush. If that's so, it would appear that the memory in question is not truly coherent. > Even for ehci driver, the flush to be > added is only __one__ special case, so could you list other cases in > which the flush operation is a must and memory barrier can't do it? There are other places in ehci-hcd where write flushes would help improve performance or correctness, although the one in qh_append_tds() is probably the most critical. One such place is in qh_link_async(); others are in qh_link_periodic(), qh_unlink_periodic(), itd_link_urb(), and sitd_link_urb(). And there are similar places in the other USB host controller drivers. > If one can understand dma master access order on shared memory > in the context, it is not difficult to use memory barrier to avoid the > flush operation. That simply is not true. If it were, you would not have needed to submit your original patch. > Second, as I said before, the flush operation is very easy to cause dma > descriptor updated in partial, as qtd descriptor updated in the ehci case. No, not if memory barriers are used correctly. > It is one of the side effects of the flush to be introduced. Fortunately, EHCI > can handle this correctly, but can other hardwares always handle this correctly? If they can't, they are already broken. Adding write flushes won't make them any worse. Alan Stern ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 15:42 ` Alan Stern @ 2011-09-01 16:04 ` Russell King - ARM Linux 2011-09-01 17:31 ` Will Deacon 0 siblings, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2011-09-01 16:04 UTC (permalink / raw) To: linux-arm-kernel On Thu, Sep 01, 2011 at 11:42:23AM -0400, Alan Stern wrote: > On Thu, 1 Sep 2011, Ming Lei wrote: > > > First, most of barriers have made this kind of flush not necessary, as > > explained int the example above. > > You don't seem to understand the difference between memory barriers and > write flushes. Neither makes the other unnecessary -- they do > different things. > > Now, there is good reason to question why write flushes should be > needed at all. According to the definition in > Documentation/DMA-API.txt (the document doesn't distinguish between > coherent memory and consistent memory): > > Consistent memory is memory for which a write by either the device or > the processor can immediately be read by the processor or device > without having to worry about caching effects. (You may however need > to make sure to flush the processor's write buffers before telling > devices to read that memory.) > > As far as I can tell, we are talking about a cache flush rather than a > processor write buffer flush. If that's so, it would appear that the > memory in question is not truly coherent. DMA coherent memory on ARM is implemented on ARMv5 and below by using 'noncacheable nonbufferable' memory. There is no weak memory model to worry about, and this memory type is seen as 'strongly ordered' - the CPU stalls until the read or write has completed. So no problem there. On ARMv6 and above, the attributes change: 1. Memory type: [Normal, Device, Strongly ordered] All mappings of a physical address space are absolutely required to be of the same memory type, otherwise the result is unpredictable. There is no mitigation against this. 2. For "normal memory", a variety of options are available to adjust the hints to the cache and memory subsystem - the options here are [Non-cacheable, write-back write alloc, write-through non-write alloc, write-back, non-write alloc.] Strictly to the ARM ARM, all mappings must, again, have the same attributes to avoid unpredictable behaviour. There is a _temporary_ architectural relaxation of this requirement provided certain conditions are met - which may become permanent. We remap system memory (which has its standard direct-mapped kernel mapping as 'normal memory, write-back' for DMA coherent memory into a separate region marking it 'normal memory, non-cacheable'. Strictly this violates the architecture - but we have no other way at present to obtain DMA coherent memory as we can't unmap the standard direct-mapped kernel mapping (we'd have to touch _every_ page table in the system, and then issue TLB flushes which may have to be smp_call_function'd, which you can't do from IRQ context - one of the contexts which dma_alloc_coherent must work from.) So far, no one has reported any ill effects - and there's been much pressure from lots of people to ignore the architecture reference manual over this, including from the CMA guys. It _is_ possible that "unpredictable" means that we may hit cache lines in the [VP]IPT cache via the non-cacheable mapping which have been created by speculative loads via the cacheable mapping - and this is something that has been worrying me for a long time. I've tried several ways to fix this but the result has been regressions. So far, I have no fix for this which will not cause a regression, which will satisfy the ARM ARM, which will satisfy peoples expectations, etc. There is a plan with CMA to try and do something about this, but that's a long way off yet. So, in summary what I'm saying is that _in theory_ our DMA coherent memory on ARMv6+ should have nothing more than write buffering to contend with, but that doesn't stop this being the first real concrete report proving that what I've been going on about regarding the architectural requirements over the last few years is actually very real and valid. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 16:04 ` Russell King - ARM Linux @ 2011-09-01 17:31 ` Will Deacon 2011-09-01 18:07 ` Russell King - ARM Linux 2011-09-01 19:14 ` Mark Salter 0 siblings, 2 replies; 32+ messages in thread From: Will Deacon @ 2011-09-01 17:31 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, On Thu, Sep 01, 2011 at 05:04:29PM +0100, Russell King - ARM Linux wrote: > DMA coherent memory on ARM is implemented on ARMv5 and below by using > 'noncacheable nonbufferable' memory. There is no weak memory model to > worry about, and this memory type is seen as 'strongly ordered' - the > CPU stalls until the read or write has completed. So no problem there. > > On ARMv6 and above, the attributes change: > > 1. Memory type: [Normal, Device, Strongly ordered] > All mappings of a physical address space are absolutely required to be > of the same memory type, otherwise the result is unpredictable. There > is no mitigation against this. > > 2. For "normal memory", a variety of options are available to adjust the > hints to the cache and memory subsystem - the options here are > [Non-cacheable, write-back write alloc, write-through non-write alloc, > write-back, non-write alloc.] > > Strictly to the ARM ARM, all mappings must, again, have the same > attributes to avoid unpredictable behaviour. There is a _temporary_ > architectural relaxation of this requirement provided certain conditions > are met - which may become permanent. This looks set to appear in revision C of the ARM ARM. > It _is_ possible that "unpredictable" means that we may hit cache lines in > the [VP]IPT cache via the non-cacheable mapping which have been created > by speculative loads via the cacheable mapping - and this is something > that has been worrying me for a long time. Whilst this can happen, this will only cause problems for reads performed by the CPU (as these may hit a line speculatively loaded via the cacheable alias). Setting bit 22 in the auxillary control register gets arounds this: http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 Given that I believe our coherent DMA memory is `cacheable, bufferable, do not allocate' in terms of AXI attributes, then writes will go straight to the write buffer on the PL310. > So, in summary what I'm saying is that _in theory_ our DMA coherent memory > on ARMv6+ should have nothing more than write buffering to contend with, > but that doesn't stop this being the first real concrete report proving > that what I've been going on about regarding the architectural requirements > over the last few years is actually very real and valid. I don't think what we're seeing in this case is caused by mismatched memory attributes, especially as passing `nosmp' on the command-line makes the performance issue disappear. Will ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 17:31 ` Will Deacon @ 2011-09-01 18:07 ` Russell King - ARM Linux 2011-09-01 19:14 ` Mark Salter 1 sibling, 0 replies; 32+ messages in thread From: Russell King - ARM Linux @ 2011-09-01 18:07 UTC (permalink / raw) To: linux-arm-kernel On Thu, Sep 01, 2011 at 06:31:49PM +0100, Will Deacon wrote: > Given that I believe our coherent DMA memory is `cacheable, bufferable, do > not allocate' in terms of AXI attributes, then writes will go straight to > the write buffer on the PL310. It's TEXCB=001, which due to the PRRR and NMRR registers is memory type 10, inner policy 00 outer policy 00, which translated means (as I said) "Normal memory", "non-cacheable" at both the inner and outer levels. How that gets translated to AXI attributes is outside the scope of the ARM ARM, it's probably in some mega secret AXI document somewhere which I don't have access to. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 17:31 ` Will Deacon 2011-09-01 18:07 ` Russell King - ARM Linux @ 2011-09-01 19:14 ` Mark Salter 1 sibling, 0 replies; 32+ messages in thread From: Mark Salter @ 2011-09-01 19:14 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-09-01 at 18:31 +0100, Will Deacon wrote: > I don't think what we're seeing in this case is caused by mismatched memory > attributes, especially as passing `nosmp' on the command-line makes the > performance issue disappear. I'm coming to think we are dealing with two different problems. We have the original problem where adding the write buffer flush to EHCI gives a 4x performance boost to USB. Also adding nosmp to the cmdline gives pretty much the same boost. This is looking like something other than just data getting held up in a write buffer. On the other hand, on a nosmp kernel, I get about 3-4% performance boost for hdparm -t using the write buffer flush patch vs. without it. So, regardless of what turns out to be the actual cause of the 4x problem, it may still be worthwhile to have the explicit write buffer sync API if we can't avoid using buffered mappings for DMA. --Mark ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 3:41 ` Ming Lei 2011-09-01 8:45 ` Will Deacon @ 2011-09-01 15:22 ` Alan Stern 2011-09-01 15:56 ` Ming Lei 1 sibling, 1 reply; 32+ messages in thread From: Alan Stern @ 2011-09-01 15:22 UTC (permalink / raw) To: linux-arm-kernel On Thu, 1 Sep 2011, Ming Lei wrote: > I agree all about above, but what I described is from another view. > I post out the example before explaining my idea further: > > > CPU device > A=1; > wmb > B=2; > read B > read A > > one wmb is used to order 'A=1' and 'B=2', which will make the two write > operations reach to physical memory as the order: 'A=1' first, 'B=2' second. > Then the device can observe the two write events as the order above, > so if device has seen 'B==2', then device will surely see 'A==1'. > > Suppose writing to A is operation to update dma descriptor, the above example > can make device always see a atomic update of descriptor, can't it? Suppose A and B are _both_ part of the dma descriptor. The device might see A==1 and B==0, if the memory accesses occur like this: CPU device --- ------ A = 1; wmb(); read B read A B = 2; When this happens, the device will observe a non-atomic update of the descriptor. There's no way to prevent this. > My idea is that the memory access patterns are to be considered for > writer of device driver. For example, many memory access patterns on > EHCI hardware are described in detail. Of course, device driver should > make full use of the background info, below is a example from ehci driver: > > qh_link_async(): > > /*prepare qh descriptor*/ > qh->qh_next = head->qh_next; > qh->hw->hw_next = head->hw->hw_next; > wmb (); > > /*link the qh descriptor into hardware queue*/ > head->qh_next.qh = qh; > head->hw->hw_next = dma; > > so once EHCI fetches a qh with the address of 'dma', it will always see > consistent content of qh descriptor, which could not be updated partially. Yes, of course. That's what memory barriers are intended for, to make sure that writes occur in the correct order. Without the wmb(), the CPU might decide to write out the value of head->hw->hw_next before writing out the value of qh->hw->hw_next. Then the device might see an inconsistent set of values. None of this has anything to do with the write flushes you want to add. > >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers. > > > > No, they can't. ?See the third point above. > > The example above has demoed that barriers can do it, hasn't it? The memory barrier in your qh_link_async() example can make sure that the device always sees consistent data. It doesn't guarantee that the write to head->hw->hw_next will be flushed to memory in a reasonably short time, which is the problem you are trying to solve. Alan Stern ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 15:22 ` Alan Stern @ 2011-09-01 15:56 ` Ming Lei 2011-09-01 16:48 ` Alan Stern 0 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2011-09-01 15:56 UTC (permalink / raw) To: linux-arm-kernel On Thu, Sep 1, 2011 at 11:22 PM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Thu, 1 Sep 2011, Ming Lei wrote: > >> I agree all about above, but what I described is from another view. >> I post out the example before explaining my idea further: >> >> >> ? ? ? CPU ? ? ? ? ? ? ? ? ? ? device >> ? ? ? A=1; >> ? ? ? wmb >> ? ? ? B=2; >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read B >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read A >> >> one wmb is used to order 'A=1' and 'B=2', which will make the two write >> operations reach to physical memory as the order: 'A=1' first, 'B=2' second. >> Then the device can observe the two write events as the order above, >> so if device has seen 'B==2', then device will surely see 'A==1'. >> >> Suppose writing to A is operation to update dma descriptor, the above example >> can make device always see a atomic update of descriptor, can't it? > > Suppose A and B are _both_ part of the dma descriptor. ?The device > might see A==1 and B==0, if the memory accesses occur like this: > > ? ? ? ?CPU ? ? ? ? ? ? device > ? ? ? ?--- ? ? ? ? ? ? ------ > ? ? ? ?A = 1; > ? ? ? ?wmb(); > ? ? ? ? ? ? ? ? ? ? ? ?read B > ? ? ? ? ? ? ? ? ? ? ? ?read A > ? ? ? ?B = 2; > > When this happens, the device will observe a non-atomic update of the > descriptor. ?There's no way to prevent this. If device doesn't find that B is 2, it will not fetch descriptor of A, and will observe a atomic update, which is just EHCI does for many cases(such as 4.10.2). > >> My idea is that the memory access patterns are to be considered for >> writer of device driver. For example, many memory access patterns on >> EHCI hardware are described in detail. ?Of course, device driver should >> make full use of the background info, below is a example from ehci driver: >> >> qh_link_async(): >> >> ? ? ? /*prepare qh descriptor*/ >> ? ? ? qh->qh_next = head->qh_next; >> ? ? ? qh->hw->hw_next = head->hw->hw_next; >> ? ? ? wmb (); >> >> ? ? ? /*link the qh descriptor into hardware queue*/ >> ? ? ? head->qh_next.qh = qh; >> ? ? ? head->hw->hw_next = dma; >> >> so once EHCI fetches a qh with the address of 'dma', it will always see >> consistent content of qh descriptor, which could not be updated partially. > > Yes, of course. ?That's what memory barriers are intended for, to make > sure that writes occur in the correct order. ?Without the wmb(), the > CPU might decide to write out the value of head->hw->hw_next before > writing out the value of qh->hw->hw_next. ?Then the device might see an > inconsistent set of values. > > None of this has anything to do with the write flushes you want to add. > >> >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers. >> > >> > No, they can't. ?See the third point above. >> >> The example above has demoed that barriers can do it, hasn't it? > > The memory barrier in your qh_link_async() example can make sure that > the device always sees consistent data. ?It doesn't guarantee that the > write to head->hw->hw_next will be flushed to memory in a reasonably > short time, which is the problem you are trying to solve. Yes, up to now, it is the only case in which the flush can address to, and in which kind of cases device will poll DMA coherent memory contiguously, I am not sure if there are other devices except for EHCI(maybe have uhci/ohci). If there are many such kind of devices, the flush operation introduced will make sense. As far as I know, for most of devices, dma operation of bus master is triggered by writing into mmio register instead of writing into coherent memory, and the flush is not required in this case surely. thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 15:56 ` Ming Lei @ 2011-09-01 16:48 ` Alan Stern 2011-09-02 0:59 ` Ming Lei 0 siblings, 1 reply; 32+ messages in thread From: Alan Stern @ 2011-09-01 16:48 UTC (permalink / raw) To: linux-arm-kernel On Thu, 1 Sep 2011, Ming Lei wrote: > > Suppose A and B are _both_ part of the dma descriptor. ?The device > > might see A==1 and B==0, if the memory accesses occur like this: > > > > ? ? ? ?CPU ? ? ? ? ? ? device > > ? ? ? ?--- ? ? ? ? ? ? ------ > > ? ? ? ?A = 1; > > ? ? ? ?wmb(); > > ? ? ? ? ? ? ? ? ? ? ? ?read B > > ? ? ? ? ? ? ? ? ? ? ? ?read A > > ? ? ? ?B = 2; > > > > When this happens, the device will observe a non-atomic update of the > > descriptor. ?There's no way to prevent this. > > If device doesn't find that B is 2, it will not fetch descriptor of A, > and will observe > a atomic update, which is just EHCI does for many cases(such as 4.10.2). You didn't read what I wrote above. Suppose A and B are _both_ part of the same descriptor, like hw_token and hw_qtd_next. > > The memory barrier in your qh_link_async() example can make sure that > > the device always sees consistent data. ?It doesn't guarantee that the > > write to head->hw->hw_next will be flushed to memory in a reasonably > > short time, which is the problem you are trying to solve. > > Yes, up to now, it is the only case in which the flush can address to, > and in which kind of cases device will poll DMA coherent memory contiguously, > I am not sure if there are other devices except for EHCI(maybe have uhci/ohci). Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly. > If there are many such kind of devices, the flush operation introduced will > make sense. > > As far as I know, for most of devices, dma operation of bus master is triggered > by writing into mmio register instead of writing into coherent memory, and the > flush is not required in this case surely. That's right. It is needed only when the device polls automatically. Alan Stern ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-01 16:48 ` Alan Stern @ 2011-09-02 0:59 ` Ming Lei 2011-09-02 13:53 ` Alan Stern 0 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2011-09-02 0:59 UTC (permalink / raw) To: linux-arm-kernel Hi, On Fri, Sep 2, 2011 at 12:48 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Thu, 1 Sep 2011, Ming Lei wrote: > >> > Suppose A and B are _both_ part of the dma descriptor. ?The device >> > might see A==1 and B==0, if the memory accesses occur like this: >> > >> > ? ? ? ?CPU ? ? ? ? ? ? device >> > ? ? ? ?--- ? ? ? ? ? ? ------ >> > ? ? ? ?A = 1; >> > ? ? ? ?wmb(); >> > ? ? ? ? ? ? ? ? ? ? ? ?read B >> > ? ? ? ? ? ? ? ? ? ? ? ?read A >> > ? ? ? ?B = 2; >> > >> > When this happens, the device will observe a non-atomic update of the >> > descriptor. ?There's no way to prevent this. >> >> If device doesn't find that B is 2, it will not fetch descriptor of A, >> and will observe >> a atomic update, which is just EHCI does for many cases(such as 4.10.2). > > You didn't read what I wrote above. ?Suppose A and B are _both_ part of > the same descriptor, like hw_token and hw_qtd_next. I think the case(keep writing order between parts in a same dma descriptor) is only in constant dma-poll master case, just like ehci/uhci. General case is that memory barrier is required before linking one dma descriptor into hardware queue but after the dma descriptor is prepared. > >> > The memory barrier in your qh_link_async() example can make sure that >> > the device always sees consistent data. ?It doesn't guarantee that the >> > write to head->hw->hw_next will be flushed to memory in a reasonably >> > short time, which is the problem you are trying to solve. >> >> Yes, up to now, it is the only case in which the flush can address to, >> and in which kind of cases device will poll DMA coherent memory contiguously, >> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci). > > Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly. In fact, the flush may be not required for ohci and xhci case, since there is already one mmio register writing at the end of .enqueue path in ohci/xhci driver.(just a glance at the code of ohci/xhci, please correct if I am wrong) For UHCI, looks like it has not been used on ARM, so maybe can ignore it. UHCI is to support a slow usb 1.1 transfer, so I am wondering if the flush can produce a obvious performance boost. So looks like the flush only makes sense on EHCI. If the above is not wrong, is it really needed to introduce a general DMA API only for EHCI? thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-09-02 0:59 ` Ming Lei @ 2011-09-02 13:53 ` Alan Stern 0 siblings, 0 replies; 32+ messages in thread From: Alan Stern @ 2011-09-02 13:53 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2 Sep 2011, Ming Lei wrote: > >> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci). > > > > Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly. > > In fact, the flush may be not required for ohci and xhci case, since there is > already one mmio register writing at the end of .enqueue path in ohci/xhci > driver.(just a glance at the code of ohci/xhci, please correct if I am wrong) I don't know about xhci, but you're right about ohci. However, there's no guarantee that the mmio write will always remain there; somebody might change the code so that the write takes place only when it is needed instead of every time. > For UHCI, looks like it has not been used on ARM, so maybe can ignore it. > UHCI is to support a slow usb 1.1 transfer, so I am wondering if the flush > can produce a obvious performance boost. Believe me, even with USB-1.1 a 20-ms delay will be noticeable. > So looks like the flush only makes sense on EHCI. Assuming ARM is the only architecture that needs it. > If the above is not wrong, is it really needed to introduce a general DMA API > only for EHCI? Maybe not. I'm hoping that people will identify the underlying cause for these delayed write-backs and fix it. Then no changes at all would be needed in the USB stack. Alan Stern ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 0/3] RFC: addition to DMA API 2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter ` (3 preceding siblings ...) 2011-09-01 2:09 ` [PATCH 0/3] RFC: addition to DMA API Ming Lei @ 2011-09-01 9:11 ` Will Deacon 4 siblings, 0 replies; 32+ messages in thread From: Will Deacon @ 2011-09-01 9:11 UTC (permalink / raw) To: linux-arm-kernel Hi Mark, On Wed, Aug 31, 2011 at 10:30:11PM +0100, Mark Salter wrote: > This patch set arose out of a discussion on linux-arm concerning a > performance problem with USB on some ARMv7 based platforms. The > problem was tracked down by ming.lei at canonical.com and found to be > the result of CPU writes to DMA-coherent memory being delayed in a > write buffer between the CPU and memory. One proposed patch fixed > only the immediate problem with the USB EHCI driver, but several > folks thought a more general approach was needed, so I put this series > of patches together as a starting point for wider discussion outside > the ARM specific list. [...] Whilst I'm in favour of this because it seems to be solving a real problem (especially on OMAP4), I think we should get to the bottom of the issue before augmenting the coherent DMA API. We currently have the following unanswered questions: (1) Why does a nosmp kernel not suffer from this problem? (2) Why do *some* Tegra platforms seem to suffer whilst others do not? (3) Can this problem be solved by configuring the hardware appropriately? If (3) is true, then we might be better off solving it that way, although I'd be interested to see if flushing CPU write buffers makes a difference to I/O performance on other architectures using non-cacheable memory for coherent DMA. Will ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2011-10-03 9:24 UTC | newest] Thread overview: 32+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter 2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter 2011-09-01 2:59 ` Josh Cartwright 2011-09-01 9:57 ` Michał Mirosław 2011-09-01 12:36 ` Mark Salter 2011-09-06 14:30 ` Catalin Marinas 2011-08-31 21:30 ` [PATCH 2/3] define ARM-specific dma_coherent_write_sync Mark Salter 2011-09-06 14:32 ` Catalin Marinas 2011-09-06 14:37 ` Mark Salter 2011-09-06 14:48 ` Catalin Marinas 2011-09-06 15:02 ` Mark Salter 2011-10-03 1:40 ` Jon Masters 2011-10-03 8:44 ` Catalin Marinas 2011-10-03 9:24 ` Jon Masters 2011-08-31 21:30 ` [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver Mark Salter 2011-09-01 2:33 ` Ming Lei 2011-09-01 2:09 ` [PATCH 0/3] RFC: addition to DMA API Ming Lei 2011-09-01 3:09 ` Alan Stern 2011-09-01 3:41 ` Ming Lei 2011-09-01 8:45 ` Will Deacon 2011-09-01 9:14 ` Ming Lei 2011-09-01 15:42 ` Alan Stern 2011-09-01 16:04 ` Russell King - ARM Linux 2011-09-01 17:31 ` Will Deacon 2011-09-01 18:07 ` Russell King - ARM Linux 2011-09-01 19:14 ` Mark Salter 2011-09-01 15:22 ` Alan Stern 2011-09-01 15:56 ` Ming Lei 2011-09-01 16:48 ` Alan Stern 2011-09-02 0:59 ` Ming Lei 2011-09-02 13:53 ` Alan Stern 2011-09-01 9:11 ` Will Deacon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).