* [PATCH] [POWERPC] Improve (in|out)_beXX() asm code @ 2008-05-20 20:40 Trent Piepho 2008-05-20 21:16 ` Benjamin Herrenschmidt 2008-05-20 22:00 ` Andreas Schwab 0 siblings, 2 replies; 160+ messages in thread From: Trent Piepho @ 2008-05-20 20:40 UTC (permalink / raw) To: linuxppc-dev; +Cc: Scott Wood, linux-kernel, Trent Piepho Since commit 4cb3cee03d558fd457cb58f56c80a2a09a66110c the code generated for the in_beXX() and out_beXX() mmio functions has been sub-optimal. The out_leXX() family of functions are created with the macro DEF_MMIO_OUT_LE() while the out_beXX() family are created with DEF_MMIO_OUT_BE(). In what was perhaps a bit too much macro use, both of these macros are in turn created via the macro DEF_MMIO_OUT(). For the LE versions, eventually they boil down to an asm that will look something like this: asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr)); While not perfect, this appears to be the best one can do. The issue is that the "stwbrx" instruction only comes in an indexed, or 'x', version, in which the address is represented by the sum of two registers (the "0,%2"). Unfortunately, gcc doesn't have a constraint for an indexed memory reference. The "m" constraint allows both indexed and offset, i.e. register plus constant, memory references and there is no "stwbr" version for offset references. The unused first operand to the asm is just to tell gcc that *addr is an output of the asm. The address used is passed in a single register via the third asm operand, and the index register is just hard coded as 0. This means gcc is forced to put the address in a single register and can't use index addressing, e.g. if one has the data in register 9, a base address in register 3 and an index in register 4, gcc must emit code like "add 11,4,3; stwbrx 9,0,11" instead of just "stwbrx 9,4,3". This costs an extra add instruction and another register. This brings us the to problem with the BE version. In this case, the "stw" instruction does have both indexed and non-indexed versions. The final asm ends up looking like this: asm("sync; stw%U0%X0 %1,%0" : "=m" (*addr) : "r" (val), "r" (addr)); The undocumented codes "%U0" and "%0X" will generate a 'u' if the memory reference should be a auto-updating one, and an 'x' if the memory reference is indexed, respectively. The third operand is unused, it's just there because asm the code is reused from the LE version. However, gcc does not know this, and generates unnecessary code to stick addr in a register! To use the example from the LE version, gcc will generate "add 11,4,3; stwx 9,4,3". It is able to use the indexed address "4,3" for the "stwx", but still thinks it needs to put 4+3 into register 11, which will never be used. This also ends up happening a lot for the offset addressing mode, where common code like this: out_be32(&device_registers->some_register, data); uses an instruction like "stw 9, 42(3)", where register 3 has the pointer device_registers and 42 is the offset of some_register in that structure. gcc will be forced to generate the unnecessary instruction "addi 11, 3, 42" to put the address into a single (unused) register. The in_* versions end up having these exact same problems as well. Signed-off-by: Trent Piepho <tpiepho@freescale.com> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Scott Wood <scottwood@freescale.com> --- There was some discussion on a Freescale list if the powerpc I/O accessors should be strictly ordered w.r.t. normal memory. Currently they are not. It does not appear as if any other architecture's I/O accessors are strictly ordered in this manner. memory-barriers.txt explicitly states that the I/O space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory accesses and it's implied the other I/O accessors (e.g., writel) are the same. However, it is somewhat harder to program for this model, and there are almost certainly a number of drivers using coherent DMA which have subtle bugs because the do not include the necessary barriers. But clearly and change to this would be a subject for a different patch. include/asm-powerpc/io.h | 44 +++++++++++++++++++++++++------------------- 1 files changed, 25 insertions(+), 19 deletions(-) diff --git a/include/asm-powerpc/io.h b/include/asm-powerpc/io.h index e0062d7..795b5d4 100644 --- a/include/asm-powerpc/io.h +++ b/include/asm-powerpc/io.h @@ -95,33 +95,39 @@ extern resource_size_t isa_mem_base; #define IO_SET_SYNC_FLAG() #endif -#define DEF_MMIO_IN(name, type, insn) \ -static inline type name(const volatile type __iomem *addr) \ +#define DEF_MMIO_IN_LE(name, size, insn) \ +static inline u##size name(const volatile u##size __iomem *addr) \ { \ - type ret; \ - __asm__ __volatile__("sync;" insn ";twi 0,%0,0;isync" \ - : "=r" (ret) : "r" (addr), "m" (*addr)); \ + u##size ret; \ + __asm__ __volatile__("sync;"#insn" %0,0,%1;twi 0,%0,0;isync" \ + : "=r" (ret) : "r" (addr), "m" (*addr)); \ return ret; \ } -#define DEF_MMIO_OUT(name, type, insn) \ -static inline void name(volatile type __iomem *addr, type val) \ +#define DEF_MMIO_IN_BE(name, size, insn) \ +static inline u##size name(const volatile u##size __iomem *addr) \ { \ - __asm__ __volatile__("sync;" insn \ - : "=m" (*addr) : "r" (val), "r" (addr)); \ - IO_SET_SYNC_FLAG(); \ + u##size ret; \ + __asm__ __volatile__("sync;"#insn"%U1%X1 %0,%1;twi 0,%0,0;isync"\ + : "=r" (ret) : "m" (*addr)); \ + return ret; \ } +#define DEF_MMIO_OUT_BE(name, size, insn) \ +static inline void name(volatile u##size __iomem *addr, u##size val) \ +{ \ + __asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0" \ + : "=m" (*addr) : "r" (val)); \ + IO_SET_SYNC_FLAG(); \ +} -#define DEF_MMIO_IN_BE(name, size, insn) \ - DEF_MMIO_IN(name, u##size, __stringify(insn)"%U2%X2 %0,%2") -#define DEF_MMIO_IN_LE(name, size, insn) \ - DEF_MMIO_IN(name, u##size, __stringify(insn)" %0,0,%1") - -#define DEF_MMIO_OUT_BE(name, size, insn) \ - DEF_MMIO_OUT(name, u##size, __stringify(insn)"%U0%X0 %1,%0") -#define DEF_MMIO_OUT_LE(name, size, insn) \ - DEF_MMIO_OUT(name, u##size, __stringify(insn)" %1,0,%2") +#define DEF_MMIO_OUT_LE(name, size, insn) \ +static inline void name(volatile u##size __iomem *addr, u##size val) \ +{ \ + __asm__ __volatile__("sync;"#insn" %1,0,%2" \ + : "=m" (*addr) : "r" (val), "r" (addr)); \ + IO_SET_SYNC_FLAG(); \ +} DEF_MMIO_IN_BE(in_8, 8, lbz); DEF_MMIO_IN_BE(in_be16, 16, lhz); -- 1.5.4.1 ^ permalink raw reply related [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 20:40 [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho @ 2008-05-20 21:16 ` Benjamin Herrenschmidt 2008-05-20 21:38 ` Scott Wood 2008-05-20 22:00 ` Trent Piepho 2008-05-20 22:00 ` Andreas Schwab 1 sibling, 2 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-20 21:16 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Tue, 2008-05-20 at 13:40 -0700, Trent Piepho wrote: > There was some discussion on a Freescale list if the powerpc I/O accessors > should be strictly ordered w.r.t. normal memory. Currently they are not. It > does not appear as if any other architecture's I/O accessors are strictly > ordered in this manner. memory-barriers.txt explicitly states that the I/O > space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory > accesses and it's implied the other I/O accessors (e.g., writel) are the same. > > However, it is somewhat harder to program for this model, and there are almost > certainly a number of drivers using coherent DMA which have subtle bugs because > the do not include the necessary barriers. > > But clearly and change to this would be a subject for a different patch. The current accessors should provide all the necessary ordering guarantees... Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 21:16 ` Benjamin Herrenschmidt @ 2008-05-20 21:38 ` Scott Wood 2008-05-20 22:02 ` Benjamin Herrenschmidt 2008-05-20 22:15 ` Alan Cox 2008-05-20 22:00 ` Trent Piepho 1 sibling, 2 replies; 160+ messages in thread From: Scott Wood @ 2008-05-20 21:38 UTC (permalink / raw) To: benh; +Cc: linuxppc-dev, Trent Piepho, linux-kernel Benjamin Herrenschmidt wrote: > On Tue, 2008-05-20 at 13:40 -0700, Trent Piepho wrote: >> There was some discussion on a Freescale list if the powerpc I/O accessors >> should be strictly ordered w.r.t. normal memory. Currently they are not. It >> does not appear as if any other architecture's I/O accessors are strictly >> ordered in this manner. memory-barriers.txt explicitly states that the I/O >> space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory >> accesses and it's implied the other I/O accessors (e.g., writel) are the same. >> >> However, it is somewhat harder to program for this model, and there are almost >> certainly a number of drivers using coherent DMA which have subtle bugs because >> the do not include the necessary barriers. >> >> But clearly and change to this would be a subject for a different patch. > > The current accessors should provide all the necessary ordering > guarantees... It looks like we rely on -fno-strict-aliasing to prevent reordering ordinary memory accesses (such as to DMA descriptors) past the I/O access. It won't prevent reordering of memory reads around an I/O read, though, which could be a problem if the I/O read result determines the validity of the DMA buffer. IMHO, a memory clobber would be better. -Scott ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 21:38 ` Scott Wood @ 2008-05-20 22:02 ` Benjamin Herrenschmidt 2008-05-20 22:21 ` Trent Piepho 2008-05-20 22:15 ` Alan Cox 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-20 22:02 UTC (permalink / raw) To: Scott Wood; +Cc: linuxppc-dev, Trent Piepho, linux-kernel On Tue, 2008-05-20 at 16:38 -0500, Scott Wood wrote: > It looks like we rely on -fno-strict-aliasing to prevent reordering > ordinary memory accesses (such as to DMA descriptors) past the I/O > access. It won't prevent reordering of memory reads around an I/O > read, > though, which could be a problem if the I/O read result determines > the > validity of the DMA buffer. IMHO, a memory clobber would be better. We probably want a full "memory" clobber then... Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:02 ` Benjamin Herrenschmidt @ 2008-05-20 22:21 ` Trent Piepho 0 siblings, 0 replies; 160+ messages in thread From: Trent Piepho @ 2008-05-20 22:21 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Tue, 20 May 2008, Benjamin Herrenschmidt wrote: > On Tue, 2008-05-20 at 16:38 -0500, Scott Wood wrote: >> It looks like we rely on -fno-strict-aliasing to prevent reordering >> ordinary memory accesses (such as to DMA descriptors) past the I/O >> access. It won't prevent reordering of memory reads around an I/O >> read, >> though, which could be a problem if the I/O read result determines >> the >> validity of the DMA buffer. IMHO, a memory clobber would be better. > > We probably want a full "memory" clobber then... As far as I could tell, no other arch has a full memory clobber. I can see the argument for changing the Linux model to be stricter and less efficient, but easier to program for. Not that I entirely agree with it. But I don't see a good reason for why powerpc should be different than everything else. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 21:38 ` Scott Wood 2008-05-20 22:02 ` Benjamin Herrenschmidt @ 2008-05-20 22:15 ` Alan Cox 2008-05-20 22:35 ` Scott Wood 1 sibling, 1 reply; 160+ messages in thread From: Alan Cox @ 2008-05-20 22:15 UTC (permalink / raw) To: Scott Wood; +Cc: Trent Piepho, linux-kernel, linuxppc-dev > It looks like we rely on -fno-strict-aliasing to prevent reordering > ordinary memory accesses (such as to DMA descriptors) past the I/O DMA descriptors in main memory are dependant on cache behaviour anyway and the dma_* operators should be the ones enforcing the needed behaviour. Alan ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:15 ` Alan Cox @ 2008-05-20 22:35 ` Scott Wood 2008-05-20 22:39 ` David Miller 2008-05-20 22:55 ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho 0 siblings, 2 replies; 160+ messages in thread From: Scott Wood @ 2008-05-20 22:35 UTC (permalink / raw) To: Alan Cox; +Cc: Trent Piepho, linux-kernel, linuxppc-dev Alan Cox wrote: >> It looks like we rely on -fno-strict-aliasing to prevent reordering >> ordinary memory accesses (such as to DMA descriptors) past the I/O > > DMA descriptors in main memory are dependant on cache behaviour anyway > and the dma_* operators should be the ones enforcing the needed behaviour. What about memory obtained from dma_alloc_coherent()? We still need a sync and a compiler barrier. The current I/O accessors have the former, but not the latter. -Scott ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:35 ` Scott Wood @ 2008-05-20 22:39 ` David Miller 2008-05-20 22:43 ` Scott Wood 2008-05-20 22:55 ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho 1 sibling, 1 reply; 160+ messages in thread From: David Miller @ 2008-05-20 22:39 UTC (permalink / raw) To: scottwood; +Cc: linuxppc-dev, tpiepho, alan, linux-kernel From: Scott Wood <scottwood@freescale.com> Date: Tue, 20 May 2008 17:35:56 -0500 > Alan Cox wrote: > >> It looks like we rely on -fno-strict-aliasing to prevent reordering > >> ordinary memory accesses (such as to DMA descriptors) past the I/O > > > > DMA descriptors in main memory are dependant on cache behaviour anyway > > and the dma_* operators should be the ones enforcing the needed behaviour. > > What about memory obtained from dma_alloc_coherent()? We still need a > sync and a compiler barrier. The current I/O accessors have the former, > but not the latter. The __volatile__ in the asm construct disallows movement of the inline asm relative to statements surrounding it. The only reason barrier() in kernel.h needs a memory clobber is because of a bug in ancient versions of gcc. In fact, I think that memory clobber might even be removable. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:39 ` David Miller @ 2008-05-20 22:43 ` Scott Wood 2008-05-20 22:53 ` David Miller 0 siblings, 1 reply; 160+ messages in thread From: Scott Wood @ 2008-05-20 22:43 UTC (permalink / raw) To: David Miller; +Cc: linuxppc-dev, tpiepho, alan, linux-kernel David Miller wrote: > From: Scott Wood <scottwood@freescale.com> > Date: Tue, 20 May 2008 17:35:56 -0500 > >> Alan Cox wrote: >>>> It looks like we rely on -fno-strict-aliasing to prevent reordering >>>> ordinary memory accesses (such as to DMA descriptors) past the I/O >>> DMA descriptors in main memory are dependant on cache behaviour anyway >>> and the dma_* operators should be the ones enforcing the needed behaviour. >> What about memory obtained from dma_alloc_coherent()? We still need a >> sync and a compiler barrier. The current I/O accessors have the former, >> but not the latter. > > The __volatile__ in the asm construct disallows movement of the > inline asm relative to statements surrounding it. > > The only reason barrier() in kernel.h needs a memory clobber is > because of a bug in ancient versions of gcc. In fact, I think > that memory clobber might even be removable. Current versions of GCC seem quite happy to move non-asm memory accesses around a volatile asm without a memory clobber; see the test Trent posted. -Scott ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:43 ` Scott Wood @ 2008-05-20 22:53 ` David Miller 2008-05-23 4:24 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: David Miller @ 2008-05-20 22:53 UTC (permalink / raw) To: scottwood; +Cc: linuxppc-dev, tpiepho, alan, linux-kernel From: Scott Wood <scottwood@freescale.com> Date: Tue, 20 May 2008 17:43:58 -0500 > David Miller wrote: > > The __volatile__ in the asm construct disallows movement of the > > inline asm relative to statements surrounding it. > > > > The only reason barrier() in kernel.h needs a memory clobber is > > because of a bug in ancient versions of gcc. In fact, I think > > that memory clobber might even be removable. > > Current versions of GCC seem quite happy to move non-asm memory accesses > around a volatile asm without a memory clobber; see the test Trent posted. Indeed, and even the GCC manual is clear about this. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:53 ` David Miller @ 2008-05-23 4:24 ` Benjamin Herrenschmidt 2008-05-22 22:56 ` Trent Piepho 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-23 4:24 UTC (permalink / raw) To: David Miller; +Cc: scottwood, linuxppc-dev, tpiepho, alan, linux-kernel On Tue, 2008-05-20 at 15:53 -0700, David Miller wrote: > From: Scott Wood <scottwood@freescale.com> > Date: Tue, 20 May 2008 17:43:58 -0500 > > > David Miller wrote: > > > The __volatile__ in the asm construct disallows movement of the > > > inline asm relative to statements surrounding it. > > > > > > The only reason barrier() in kernel.h needs a memory clobber is > > > because of a bug in ancient versions of gcc. In fact, I think > > > that memory clobber might even be removable. > > > > Current versions of GCC seem quite happy to move non-asm memory accesses > > around a volatile asm without a memory clobber; see the test Trent posted. > > Indeed, and even the GCC manual is clear about this. So what is the scope of that problem ? IE. Take an x86 version of that test, writing to memory, doing a writel to some MMIO, then another memory write, can those be re-ordered with the current x86 version of writel ? static inline void writel(unsigned int b, volatile void __iomem *addr) { *(volatile unsigned int __force *)addr = b; } This is becoming a serious issue... Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-23 4:24 ` Benjamin Herrenschmidt @ 2008-05-22 22:56 ` Trent Piepho 2008-05-23 12:36 ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt 2008-05-27 1:33 ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt 0 siblings, 2 replies; 160+ messages in thread From: Trent Piepho @ 2008-05-22 22:56 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: scottwood, linuxppc-dev, David Miller, alan, linux-kernel On Fri, 23 May 2008, Benjamin Herrenschmidt wrote: > On Tue, 2008-05-20 at 15:53 -0700, David Miller wrote: >> From: Scott Wood <scottwood@freescale.com> >> Date: Tue, 20 May 2008 17:43:58 -0500 >>> David Miller wrote: >>>> The __volatile__ in the asm construct disallows movement of the >>>> inline asm relative to statements surrounding it. >>>> >>>> The only reason barrier() in kernel.h needs a memory clobber is >>>> because of a bug in ancient versions of gcc. In fact, I think >>>> that memory clobber might even be removable. >>> >>> Current versions of GCC seem quite happy to move non-asm memory accesses >>> around a volatile asm without a memory clobber; see the test Trent posted. >> >> Indeed, and even the GCC manual is clear about this. > > So what is the scope of that problem ? > > IE. Take an x86 version of that test, writing to memory, doing a writel > to some MMIO, then another memory write, can those be re-ordered with > the current x86 version of writel ? Yes, the same thing can happen on x86. As far as I could tell, this is something that all other arches can have happen. Usually aliasing prevents it, but it's not hard to constuct a test case where it doesn't. ^ permalink raw reply [flat|nested] 160+ messages in thread
* MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) 2008-05-22 22:56 ` Trent Piepho @ 2008-05-23 12:36 ` Benjamin Herrenschmidt 2008-05-23 12:50 ` Benjamin Herrenschmidt 2008-05-27 1:33 ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-23 12:36 UTC (permalink / raw) To: Trent Piepho Cc: linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan > > IE. Take an x86 version of that test, writing to memory, doing a writel > > to some MMIO, then another memory write, can those be re-ordered with > > the current x86 version of writel ? > > Yes, the same thing can happen on x86. As far as I could tell, this is > something that all other arches can have happen. Usually aliasing prevents > it, but it's not hard to constuct a test case where it doesn't. That brings us back to the old debate... For consistent memory, should we mandate a wmb between write to some dma data structure in consistent memory and the following mmio write that trigger it, and the same goes with rmb and read ? David, you remember we had those discussions a while back when I was trying to relax a bit the barriers we have in our MMIO accessors on powerpc, and the overwhelming answer was that x86 being in order, I have to expect 90% of the drivers to not get any barrier right on any platform, and thus I should make my MMIO accessors "look like" x86 and thus ordered. We did that, adding some barriers in the assembly of our readl/writel. However, I didn't change the clobber, it's still *addr, not a full "memory" clobber, just like x86. Now if it appears that gcc can indeed re-order things, then we have a problem on both x86, ppc, and pretty much everybody else. (I'm not sure about sparc but I don't see any explicit clobber in your accessors there). So that brings the whole subject back imho. What should be the approach here ? I see several options: - mandate some kind of dma_sync_for_device/cpu on consistent memory. Almost no driver do that currently tho. They only do that for non consistent memory mapped with dma_map_*. - mandate the use of wmb,rmb,mb barriers for use between memory accesses and MMIOs for ordering them. (ie. fix drivers that don't do it). Advantage for powerpc is that I can remove (after some auditing of course) the added heavy barriers in the MMIO accessors themselves. - stick a full memory clobber in all MMIO (and PIO) accessors on all archs. Any other idea ? preference ? Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) 2008-05-23 12:36 ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt @ 2008-05-23 12:50 ` Benjamin Herrenschmidt 2008-05-23 21:14 ` Scott Wood 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-23 12:50 UTC (permalink / raw) To: Trent Piepho Cc: linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Fri, 2008-05-23 at 08:36 -0400, Benjamin Herrenschmidt wrote: > > - mandate some kind of dma_sync_for_device/cpu on consistent memory. > Almost no driver do that currently tho. They only do that for non > consistent memory mapped with dma_map_*. > > - mandate the use of wmb,rmb,mb barriers for use between memory > accesses and MMIOs for ordering them. (ie. fix drivers that don't do > it). Advantage for powerpc is that I can remove (after some auditing of > course) the added heavy barriers in the MMIO accessors themselves. Note that the above is my preferred approach, and a lot of drivers happen to already do this. > - stick a full memory clobber in all MMIO (and PIO) accessors on all > archs. > > Any other idea ? preference ? ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) 2008-05-23 12:50 ` Benjamin Herrenschmidt @ 2008-05-23 21:14 ` Scott Wood 2008-05-23 22:47 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: Scott Wood @ 2008-05-23 21:14 UTC (permalink / raw) To: benh Cc: linux-kernel, David Miller, linuxppc-dev, Linus Torvalds, Trent Piepho, alan Benjamin Herrenschmidt wrote: > On Fri, 2008-05-23 at 08:36 -0400, Benjamin Herrenschmidt wrote: >> - mandate some kind of dma_sync_for_device/cpu on consistent memory. >> Almost no driver do that currently tho. They only do that for non >> consistent memory mapped with dma_map_*. >> >> - mandate the use of wmb,rmb,mb barriers for use between memory >> accesses and MMIOs for ordering them. (ie. fix drivers that don't do >> it). Advantage for powerpc is that I can remove (after some auditing of >> course) the added heavy barriers in the MMIO accessors themselves. > > Note that the above is my preferred approach, and a lot of drivers > happen to already do this. As Trent pointed out, if you change to eieio in the accessors, that'd require drivers to also use mmiowb() before spin_unlock(), which fewer drivers currently do. >> - stick a full memory clobber in all MMIO (and PIO) accessors on all >> archs. I like this, combined with introducing raw variants of the non-PCI accessors (in_be32 and such). It's slower and safe by default (i.e. no auditing drivers), but performance-critical paths can be optimized to use raw accessors combined with explicit barriers. -Scott ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) 2008-05-23 21:14 ` Scott Wood @ 2008-05-23 22:47 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-23 22:47 UTC (permalink / raw) To: Scott Wood Cc: linux-kernel, David Miller, linuxppc-dev, Linus Torvalds, Trent Piepho, alan On Fri, 2008-05-23 at 16:14 -0500, Scott Wood wrote: > As Trent pointed out, if you change to eieio in the accessors, that'd > require drivers to also use mmiowb() before spin_unlock(), which fewer > drivers currently do. No, this is a totally different issue. And we keep track of the need for a sync in spin_unlock already. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* MMIO and gcc re-ordering issue 2008-05-22 22:56 ` Trent Piepho 2008-05-23 12:36 ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt @ 2008-05-27 1:33 ` Benjamin Herrenschmidt 2008-05-27 1:40 ` David Miller 2008-05-27 15:28 ` Jonathan Corbet 1 sibling, 2 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 1:33 UTC (permalink / raw) To: Linux Arch list Cc: linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Re-post to linux-arch of a discussion on lkml. Mostly due to lack of reactions :-) Quick summary: gcc is happily re-ordering readl/writel vs. surrounding memory accesses (and thus accesses to DMA coherent memory) which is obviously a _BAD_THING_. This is on all archs. Quick fix is to stick a "memory" clobber in all arch implementations of readl/writel/... (ie, making them a barrier()). However, I'm using that as an excuse to bring back my pet subject, which is basically, should we instead just finally mandate the use of explicit rmb/wmb/mb's (which boils down to barrier() on x86) to drivers who want to order memory consistent accesses vs. MMIO ? The reason is that on archs that are out of order, this would allow us to get rid in the long run of some of the heavy barriers we have put in our readl/writel implementations to make them look like x86. Note that there are drivers that already do that (ie. explicit rmb/wmb/mb to order coherent memory accesses vs. MMIO), such as OHCI/EHCI which may explain why the new problem with gcc isn't more obviously hitting people, as those turn into compiler barriers on x86. If that approach is accepted, then I'll start auditing drivers and send patches adding wmb/rmb/mb's to them, and in the long run, after mucho testing, relax powerpc writel/readl implementations. Other archs can then do the same too If that approach is generally considered wrong, then we should probably remove the spurrious readl/writel in drivers that do them. So what are the opinions here ? Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 1:33 ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt @ 2008-05-27 1:40 ` David Miller 2008-05-27 2:15 ` Benjamin Herrenschmidt 2008-05-27 15:28 ` Jonathan Corbet 1 sibling, 1 reply; 160+ messages in thread From: David Miller @ 2008-05-27 1:40 UTC (permalink / raw) To: benh Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds, tpiepho, alan From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Tue, 27 May 2008 11:33:46 +1000 > Quick summary: gcc is happily re-ordering readl/writel vs. surrounding > memory accesses (and thus accesses to DMA coherent memory) which is > obviously a _BAD_THING_. > > This is on all archs. Quick fix is to stick a "memory" clobber in all arch > implementations of readl/writel/... (ie, making them a barrier()). > > However, I'm using that as an excuse to bring back my pet subject, which > is basically, should we instead just finally mandate the use of explicit > rmb/wmb/mb's (which boils down to barrier() on x86) to drivers who want > to order memory consistent accesses vs. MMIO ? This is basically what drivers are effectively doing. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 1:40 ` David Miller @ 2008-05-27 2:15 ` Benjamin Herrenschmidt 2008-05-27 2:28 ` David Miller 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 2:15 UTC (permalink / raw) To: David Miller Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds, tpiepho, alan On Mon, 2008-05-26 at 18:40 -0700, David Miller wrote: > > > Quick summary: gcc is happily re-ordering readl/writel vs. surrounding > > memory accesses (and thus accesses to DMA coherent memory) which is > > obviously a _BAD_THING_. > > > > This is on all archs. Quick fix is to stick a "memory" clobber in all arch > > implementations of readl/writel/... (ie, making them a barrier()). > > > > However, I'm using that as an excuse to bring back my pet subject, which > > is basically, should we instead just finally mandate the use of explicit > > rmb/wmb/mb's (which boils down to barrier() on x86) to drivers who want > > to order memory consistent accesses vs. MMIO ? > > This is basically what drivers are effectively doing. Some of them. USB comes to mind. I'd be happy to make it "the rule" and document that MMIO vs. coherent access aren't implicitely ordered. I would still keep them ordered on powerpc for a little while tho until I'm happy enough with driver auditing. But heh, it's you who was telling me that it would be a bad engineering decision and we had to make everybody look like x86 & fully ordered :-) I decided to agree back then and stuck all those nasty heavy barriers in the powerpc variants of readl/writel/... Now, however, that x86 -is- also affected by the problem to some extent (ie. compiler re-ordering, not CPU but similar), this is why I'm asking what's people opinion is. I'm happy to do patches updating memory-barriers.txt and others, and do some driver auditing (though I won't do all of them), if the general opinion is that it's the right direction to go. If not, then shouldn't we remove the existing spurrious wmb/rmb/mb from drivers and slap a "memory" clobber on all archs readl/writel/... ? Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 2:15 ` Benjamin Herrenschmidt @ 2008-05-27 2:28 ` David Miller 2008-05-27 3:39 ` Benjamin Herrenschmidt ` (2 more replies) 0 siblings, 3 replies; 160+ messages in thread From: David Miller @ 2008-05-27 2:28 UTC (permalink / raw) To: benh Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds, tpiepho, alan From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Tue, 27 May 2008 12:15:40 +1000 > Some of them. USB comes to mind. I'd be happy to make it "the rule" and > document that MMIO vs. coherent access aren't implicitely ordered. I > would still keep them ordered on powerpc for a little while tho until > I'm happy enough with driver auditing. > > But heh, it's you who was telling me that it would be a bad engineering > decision and we had to make everybody look like x86 & fully ordered :-) > I decided to agree back then and stuck all those nasty heavy barriers > in the powerpc variants of readl/writel/... I still believe this. It's just another complicated thing for driver authors to get wrong. The other side of the coin is of course the cost. The only thing I am absolutely sure of is that we should make a decision fast, document it, and just stick to it. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 2:28 ` David Miller @ 2008-05-27 3:39 ` Benjamin Herrenschmidt 2008-05-27 15:35 ` Linus Torvalds 2008-05-27 3:42 ` Arjan van de Ven 2008-05-27 8:24 ` Alan Cox 2 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 3:39 UTC (permalink / raw) To: David Miller Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds, tpiepho, alan On Mon, 2008-05-26 at 19:28 -0700, David Miller wrote: > > But heh, it's you who was telling me that it would be a bad > engineering > > decision and we had to make everybody look like x86 & fully > ordered :-) > > I decided to agree back then and stuck all those nasty heavy > barriers > > in the powerpc variants of readl/writel/... > > I still believe this. > > It's just another complicated thing for driver authors to get wrong. > The other side of the coin is of course the cost. > > The only thing I am absolutely sure of is that we should make a > decision fast, document it, and just stick to it. Yes. As it is today, tg3 for example is potentially broken on all archs with newer gcc unless we either add "memory" clobber to readl/writel or stick some wmb's in there (just a random driver I picked). So Linus, what is your take on that matter ? Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 3:39 ` Benjamin Herrenschmidt @ 2008-05-27 15:35 ` Linus Torvalds 2008-05-27 16:47 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 15:35 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 27 May 2008, Benjamin Herrenschmidt wrote: > > Yes. As it is today, tg3 for example is potentially broken on all archs > with newer gcc unless we either add "memory" clobber to readl/writel or > stick some wmb's in there (just a random driver I picked). > > So Linus, what is your take on that matter ? Let's just serialize the damn things, and add a memory clobber to them. Expecting people to fix up all drivers is simply not going to happen. And serializing things shouldn't be *that* expensive. People who cannot take the expense can continue to use the magic __raw_writel() etc stuff. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 15:35 ` Linus Torvalds @ 2008-05-27 16:47 ` Linus Torvalds 2008-05-27 17:31 ` Linus Torvalds 2008-05-27 21:12 ` Benjamin Herrenschmidt 2008-05-27 18:23 ` Trent Piepho 2008-05-27 21:10 ` Benjamin Herrenschmidt 2 siblings, 2 replies; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 16:47 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 27 May 2008, Linus Torvalds wrote: > > Expecting people to fix up all drivers is simply not going to happen. And > serializing things shouldn't be *that* expensive. People who cannot take > the expense can continue to use the magic __raw_writel() etc stuff. Of course, for non-x86, you kind of have to expect drivers to be well-behaved, so non-x86 can probably avoid this simply because there are less relevant drivers involved. Here's a UNTESTED patch for x86 that may or may not compile and work, and which serializes (on a compiler level) the IO accesses against regular memory accesses. __read[bwlq]()/__write[bwlq]() are not serialized with a :"memory" barrier, although since they still use "asm volatile" I suspect that i practice they are probably serial too. Did not look very closely at any generated code (only did a trivial test to see that the code looks *roughly* correct). Linus --- include/asm-x86/io.h | 56 +++++++++++++++++++++++++++++++++++++ include/asm-x86/io_32.h | 49 -------------------------------- include/asm-x86/io_64.h | 71 ----------------------------------------------- 3 files changed, 56 insertions(+), 120 deletions(-) diff --git a/include/asm-x86/io.h b/include/asm-x86/io.h index d5b11f6..8e9eca9 100644 --- a/include/asm-x86/io.h +++ b/include/asm-x86/io.h @@ -3,6 +3,62 @@ #define ARCH_HAS_IOREMAP_WC +#include <linux/compiler.h> + +#define build_mmio_read(name, size, type, reg, barrier) \ +static inline type name(const volatile void __iomem *addr) \ +{ type ret; asm volatile("mov" size " %1,%0":"=" reg (ret) \ +:"m" (*(volatile type __force *)addr) barrier); return ret; } + +#define build_mmio_write(name, size, type, reg, barrier) \ +static inline void name(type val, volatile void __iomem *addr) \ +{ asm volatile("mov" size " %0,%1": :reg (val), \ +"m" (*(volatile type __force *)addr) barrier); } + +build_mmio_read(readb, "b", unsigned char, "q", :"memory") +build_mmio_read(readw, "w", unsigned short, "r", :"memory") +build_mmio_read(readl, "l", unsigned int, "r", :"memory") + +build_mmio_read(__readb, "b", unsigned char, "q", ) +build_mmio_read(__readw, "w", unsigned short, "r", ) +build_mmio_read(__readl, "l", unsigned int, "r", ) + +build_mmio_write(writeb, "b", unsigned char, "q", :"memory") +build_mmio_write(writew, "w", unsigned short, "r", :"memory") +build_mmio_write(writel, "l", unsigned int, "r", :"memory") + +build_mmio_write(__writeb, "b", unsigned char, "q", ) +build_mmio_write(__writew, "w", unsigned short, "r", ) +build_mmio_write(__writel, "l", unsigned int, "r", ) + +#define readb_relaxed(a) __readb(a) +#define readw_relaxed(a) __readw(a) +#define readl_relaxed(a) __readl(a) +#define __raw_readb __readb +#define __raw_readw __readw +#define __raw_readl __readl + +#define __raw_writeb __writeb +#define __raw_writew __writew +#define __raw_writel __writel + +#define mmiowb() barrier() + +#ifdef CONFIG_X86_64 +build_mmio_read(readq, "q", unsigned long, "r", :"memory") +build_mmio_read(__readq, "q", unsigned long, "r", ) +build_mmio_write(writeq, "q", unsigned long, "r", :"memory") +build_mmio_write(__writeq, "q", unsigned long, "r", ) + +#define readq_relaxed(a) __readq(a) +#define __raw_readq __readq +#define __raw_writeq writeq + +/* Let people know we have them */ +#define readq readq +#define writeq writeq +#endif + #ifdef CONFIG_X86_32 # include "io_32.h" #else diff --git a/include/asm-x86/io_32.h b/include/asm-x86/io_32.h index 049e81e..d71be8d 100644 --- a/include/asm-x86/io_32.h +++ b/include/asm-x86/io_32.h @@ -149,55 +149,6 @@ extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys); #define virt_to_bus virt_to_phys #define bus_to_virt phys_to_virt -/* - * readX/writeX() are used to access memory mapped devices. On some - * architectures the memory mapped IO stuff needs to be accessed - * differently. On the x86 architecture, we just read/write the - * memory location directly. - */ - -static inline unsigned char readb(const volatile void __iomem *addr) -{ - return *(volatile unsigned char __force *)addr; -} - -static inline unsigned short readw(const volatile void __iomem *addr) -{ - return *(volatile unsigned short __force *)addr; -} - -static inline unsigned int readl(const volatile void __iomem *addr) -{ - return *(volatile unsigned int __force *) addr; -} - -#define readb_relaxed(addr) readb(addr) -#define readw_relaxed(addr) readw(addr) -#define readl_relaxed(addr) readl(addr) -#define __raw_readb readb -#define __raw_readw readw -#define __raw_readl readl - -static inline void writeb(unsigned char b, volatile void __iomem *addr) -{ - *(volatile unsigned char __force *)addr = b; -} - -static inline void writew(unsigned short b, volatile void __iomem *addr) -{ - *(volatile unsigned short __force *)addr = b; -} - -static inline void writel(unsigned int b, volatile void __iomem *addr) -{ - *(volatile unsigned int __force *)addr = b; -} -#define __raw_writeb writeb -#define __raw_writew writew -#define __raw_writel writel - -#define mmiowb() - static inline void memset_io(volatile void __iomem *addr, unsigned char val, int count) { diff --git a/include/asm-x86/io_64.h b/include/asm-x86/io_64.h index 0930bed..ddd8058 100644 --- a/include/asm-x86/io_64.h +++ b/include/asm-x86/io_64.h @@ -204,77 +204,6 @@ extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys); #define virt_to_bus virt_to_phys #define bus_to_virt phys_to_virt -/* - * readX/writeX() are used to access memory mapped devices. On some - * architectures the memory mapped IO stuff needs to be accessed - * differently. On the x86 architecture, we just read/write the - * memory location directly. - */ - -static inline __u8 __readb(const volatile void __iomem *addr) -{ - return *(__force volatile __u8 *)addr; -} - -static inline __u16 __readw(const volatile void __iomem *addr) -{ - return *(__force volatile __u16 *)addr; -} - -static __always_inline __u32 __readl(const volatile void __iomem *addr) -{ - return *(__force volatile __u32 *)addr; -} - -static inline __u64 __readq(const volatile void __iomem *addr) -{ - return *(__force volatile __u64 *)addr; -} - -#define readb(x) __readb(x) -#define readw(x) __readw(x) -#define readl(x) __readl(x) -#define readq(x) __readq(x) -#define readb_relaxed(a) readb(a) -#define readw_relaxed(a) readw(a) -#define readl_relaxed(a) readl(a) -#define readq_relaxed(a) readq(a) -#define __raw_readb readb -#define __raw_readw readw -#define __raw_readl readl -#define __raw_readq readq - -#define mmiowb() - -static inline void __writel(__u32 b, volatile void __iomem *addr) -{ - *(__force volatile __u32 *)addr = b; -} - -static inline void __writeq(__u64 b, volatile void __iomem *addr) -{ - *(__force volatile __u64 *)addr = b; -} - -static inline void __writeb(__u8 b, volatile void __iomem *addr) -{ - *(__force volatile __u8 *)addr = b; -} - -static inline void __writew(__u16 b, volatile void __iomem *addr) -{ - *(__force volatile __u16 *)addr = b; -} - -#define writeq(val, addr) __writeq((val), (addr)) -#define writel(val, addr) __writel((val), (addr)) -#define writew(val, addr) __writew((val), (addr)) -#define writeb(val, addr) __writeb((val), (addr)) -#define __raw_writeb writeb -#define __raw_writew writew -#define __raw_writel writel -#define __raw_writeq writeq - void __memcpy_fromio(void *, unsigned long, unsigned); void __memcpy_toio(unsigned long, const void *, unsigned); ^ permalink raw reply related [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 16:47 ` Linus Torvalds @ 2008-05-27 17:31 ` Linus Torvalds 2008-06-02 10:36 ` Ingo Molnar 2008-05-27 21:12 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 17:31 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 27 May 2008, Linus Torvalds wrote: > > Here's a UNTESTED patch for x86 that may or may not compile and work, and > which serializes (on a compiler level) the IO accesses against regular > memory accesses. Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since the code is now shared). I didn't look at whether it generates much bigger code due to the potential extra serialization, but some of the code generation I looked at looked fine. IOW, it doesn't at least create any _obviously_ worse code, and it should be arguably safer than assuming the compiler does volatile accesses the way we want it to. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 17:31 ` Linus Torvalds @ 2008-06-02 10:36 ` Ingo Molnar 2008-06-02 21:53 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: Ingo Molnar @ 2008-06-02 10:36 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Here's a UNTESTED patch for x86 that may or may not compile and > > work, and which serializes (on a compiler level) the IO accesses > > against regular memory accesses. > > Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since > the code is now shared). I didn't look at whether it generates much > bigger code due to the potential extra serialization, but some of the > code generation I looked at looked fine. > > IOW, it doesn't at least create any _obviously_ worse code, and it > should be arguably safer than assuming the compiler does volatile > accesses the way we want it to. ok, to pursue this topic of making readl*/writel*() more robust i picked up your patch into -tip and created a new topic branch for it: tip/x86/mmio. The patch passed initial light testing in -tip (~30 successful random self-builds and bootups on various mixed 32-bit/64-bit boxes) but it's still v2.6.27 material IMO. Failures in this area are subtle so there's no good way to tell whether it works as intended - we need wider testing. I've also added the tip/x86/mmio topic to tip/auto-x86-next rules as well so these changes will be picked up by tomorrow's linux-next tree as well, and by the next -mm iteration. Ingo ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 10:36 ` Ingo Molnar @ 2008-06-02 21:53 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-02 21:53 UTC (permalink / raw) To: Ingo Molnar Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Mon, 2008-06-02 at 12:36 +0200, Ingo Molnar wrote: > The patch passed initial light testing in -tip (~30 successful random > self-builds and bootups on various mixed 32-bit/64-bit boxes) but > it's > still v2.6.27 material IMO. I think adding the "memory" clobber should be .26 and even -stable. We know for sure newer gcc will re-order things, we know it for sure some drivers will break in subtle ways because of that, ad we know as well that sticking "memory" clobber in there will fix that breakage we know about ... So I don't see the point of waiting. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 16:47 ` Linus Torvalds 2008-05-27 17:31 ` Linus Torvalds @ 2008-05-27 21:12 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 21:12 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 2008-05-27 at 09:47 -0700, Linus Torvalds wrote: > > __read[bwlq]()/__write[bwlq]() are not serialized with a :"memory" > barrier, although since they still use "asm volatile" I suspect that > i > practice they are probably serial too. Did not look very closely at > any > generated code (only did a trivial test to see that the code looks > *roughly* correct). Nah, asm volatile doesn't help, it does need the "memory" clobber too. Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 15:35 ` Linus Torvalds 2008-05-27 16:47 ` Linus Torvalds @ 2008-05-27 18:23 ` Trent Piepho 2008-05-27 18:33 ` Scott Wood 2008-05-27 21:10 ` Benjamin Herrenschmidt 2 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-27 18:23 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, David Miller, alan On Tue, 27 May 2008, Linus Torvalds wrote: > On Tue, 27 May 2008, Benjamin Herrenschmidt wrote: >> >> Yes. As it is today, tg3 for example is potentially broken on all archs >> with newer gcc unless we either add "memory" clobber to readl/writel or >> stick some wmb's in there (just a random driver I picked). >> >> So Linus, what is your take on that matter ? > > Let's just serialize the damn things, and add a memory clobber to them. > > Expecting people to fix up all drivers is simply not going to happen. And > serializing things shouldn't be *that* expensive. People who cannot take > the expense can continue to use the magic __raw_writel() etc stuff. Is there an issue with anything _besides_ coherent DMA? Could one have a special version of the accessors for drivers that want to assume they are strictly ordered vs coherent DMA memory? That would be much easier to get right, without slowing _everything_ down. The problem with the raw versions is that on some arches they are much more raw than just not being ordered w.r.t normal memory. __raw_writel() No ordering beyond what the arch provides natively. The only thing you can assume is that reads and writes to the same location on the same CPU are ordered. writel() Strictly ordered with respect to any other IO, but not to normal memory. (this is what Documentation/memory-barriers.txt claims). Should probably be strictly ordered vs locks as well. Would be strictly ordered w.r.t. the streaming DMA sync calls of course. strict_writel() Strictly ordered with respect to normal (incl. coherent DMA) memory as well. One could even go as far as to allow a driver to "#define WANT_STRICT_IO" and then it would get the strict versions. Add that to any driver that uses DMA and then worry about vetting those drivers. It's also worth nothing that adding barriers to IO accessors doesn't mean you never have to worry about barriers with coherent DMA. Inherent with coherent DMA, and driving hardware in general, there are various sequence points where one must be sure one operation has finished before starting the next. Making sure you're doing with DMA memory before telling the hardware to do something with it a common example. Often the "telling the hardware to do something" is done via an IO accessor function, but not always. You can have buffer descriptors in DMA memory that describe the DMA buffers to the hardware. It would be critical to be finished reading a DMA buffer before updating the descriptor to indicate that it's empty. You could trigger a DMA operation not by a call to an IO accessor, but with an atomic operation to a variable shared with an ISR. When the ISR runs it sees the change and does the necessary IO. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 18:23 ` Trent Piepho @ 2008-05-27 18:33 ` Scott Wood 0 siblings, 0 replies; 160+ messages in thread From: Scott Wood @ 2008-05-27 18:33 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, linux-kernel, linuxppc-dev, Linus Torvalds, David Miller, alan Trent Piepho wrote: > Is there an issue with anything _besides_ coherent DMA? > > Could one have a special version of the accessors for drivers that > want to assume they are strictly ordered vs coherent DMA memory? > That would be much easier to get right, without slowing _everything_ > down. It's better to be safe by default and then optimize the fast paths than to be relaxed by default and hang the machine in some piece of code that runs once a month. "Premature optimization is the root of all evil", and what not. > One could even go as far as to allow a driver to "#define > WANT_STRICT_IO" and then it would get the strict versions. Add that > to any driver that uses DMA and then worry about vetting those > drivers. See above -- if you must have a #define, then it should be WANT_RELAXED_IO. -Scott ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 15:35 ` Linus Torvalds 2008-05-27 16:47 ` Linus Torvalds 2008-05-27 18:23 ` Trent Piepho @ 2008-05-27 21:10 ` Benjamin Herrenschmidt 2008-05-27 21:30 ` Linus Torvalds 2 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 21:10 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote: > > On Tue, 27 May 2008, Benjamin Herrenschmidt wrote: > > > > Yes. As it is today, tg3 for example is potentially broken on all archs > > with newer gcc unless we either add "memory" clobber to readl/writel or > > stick some wmb's in there (just a random driver I picked). > > > > So Linus, what is your take on that matter ? > > Let's just serialize the damn things, and add a memory clobber to them. > > Expecting people to fix up all drivers is simply not going to happen. And > serializing things shouldn't be *that* expensive. People who cannot take > the expense can continue to use the magic __raw_writel() etc stuff. Ok. Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would need to be fixed to make their writel serializing... Regarding __raw_* their semantics are dodgy ... we might want to provide something better but it's a different subject. Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:10 ` Benjamin Herrenschmidt @ 2008-05-27 21:30 ` Linus Torvalds 2008-05-27 21:38 ` Alan Cox 2008-05-27 21:38 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 21:30 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: > > On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote: > > > > Expecting people to fix up all drivers is simply not going to happen. And > > serializing things shouldn't be *that* expensive. People who cannot take > > the expense can continue to use the magic __raw_writel() etc stuff. > > Ok. > > Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would > need to be fixed to make their writel serializing... Well.. There's really two different issues: (a) x86 and the fact that we have thousands of drivers which in turn conflicts with (b) non-x86 and the fact that other architectures tend to be absolute pieces of cr*p when it comes to ordering, _especially_ across IO. and the thing about (b) is that the number of drivers involved is a hell of a lot smaller. For example, ia64 and the big SGI machines probably really only care about roughly five drivers (number taken out of my nether regions). So practically speaking, I suspect that the right approach is to just say "ok, x86 will continue to be pretty darn ordered, and the barriers won't really matter (*)" but at the same time also saying "we wish reality was different, and well-maintained drivers should probably try to work in the presense of re-ordering". In *practice*, that probably means that most architectures will be better off if they emulate x86 closely, just because that means that they won't rely on drivers always getting things right, but I think we can leave the door open for the odd machines. We should just realize that they will never get a lot of testing, but on the other hand, their usage scenarios will generally also be very limited (very specific loads, and _very_ specific hardware). And the patch I sent out actually made "__[raw_]readl()" different from "readl()" on x86 too, in that the assembly _allows_ a bit more re-ordering, even though I doubt it will be visible in practice. So if you use the "__" versions, you'd better have barriers even on x86! Linus (*) With the possible but unlikely exception of some big machines with separate IO networks, but if they happen they will fall into the 'ia64' case of just having a few relevant drivers. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:30 ` Linus Torvalds @ 2008-05-27 21:38 ` Alan Cox 2008-05-27 21:53 ` Matthew Wilcox 2008-05-27 21:59 ` Linus Torvalds 2008-05-27 21:38 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 160+ messages in thread From: Alan Cox @ 2008-05-27 21:38 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller > re-ordering, even though I doubt it will be visible in practice. So if you > use the "__" versions, you'd better have barriers even on x86! Are we also going to have __ioread*/__iowrite* ? Also is the sematics of __readl/__writel defined for all architectures - I used it ages ago in the i2o drivers for speed and it got removed because it didn't build on some platforms. Alan ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:38 ` Alan Cox @ 2008-05-27 21:53 ` Matthew Wilcox 2008-05-27 21:46 ` Alan Cox 2008-05-27 22:02 ` Linus Torvalds 2008-05-27 21:59 ` Linus Torvalds 1 sibling, 2 replies; 160+ messages in thread From: Matthew Wilcox @ 2008-05-27 21:53 UTC (permalink / raw) To: Alan Cox Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote: > > re-ordering, even though I doubt it will be visible in practice. So if you > > use the "__" versions, you'd better have barriers even on x86! > > Are we also going to have __ioread*/__iowrite* ? Didn't we already define ioread*() to have loose semantics? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:53 ` Matthew Wilcox @ 2008-05-27 21:46 ` Alan Cox 2008-05-27 22:02 ` Linus Torvalds 1 sibling, 0 replies; 160+ messages in thread From: Alan Cox @ 2008-05-27 21:46 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller On Tue, 27 May 2008 15:53:28 -0600 Matthew Wilcox <matthew@wil.cx> wrote: > On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote: > > > re-ordering, even though I doubt it will be visible in practice. So if you > > > use the "__" versions, you'd better have barriers even on x86! > > > > Are we also going to have __ioread*/__iowrite* ? > > Didn't we already define ioread*() to have loose semantics? The ATA layer doesn't think so. Alan ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:53 ` Matthew Wilcox 2008-05-27 21:46 ` Alan Cox @ 2008-05-27 22:02 ` Linus Torvalds 1 sibling, 0 replies; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 22:02 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, Alan Cox On Tue, 27 May 2008, Matthew Wilcox wrote: > On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote: > > > re-ordering, even though I doubt it will be visible in practice. So if you > > > use the "__" versions, you'd better have barriers even on x86! > > > > Are we also going to have __ioread*/__iowrite* ? > > Didn't we already define ioread*() to have loose semantics? They are supposed to have the same semantics as readl/writel. And yes, it's "loose", but only when compared to inb/outb (which are really very strict, if you want to emulate x86 - an "outb" basically is not only ordered, it doesn't even post the write, and waits until it has hit the bus!) Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:38 ` Alan Cox 2008-05-27 21:53 ` Matthew Wilcox @ 2008-05-27 21:59 ` Linus Torvalds 1 sibling, 0 replies; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 21:59 UTC (permalink / raw) To: Alan Cox Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller On Tue, 27 May 2008, Alan Cox wrote: > > > re-ordering, even though I doubt it will be visible in practice. So if you > > use the "__" versions, you'd better have barriers even on x86! > > Are we also going to have __ioread*/__iowrite* ? I doubt there is any reason to. Let's just keep them very strictly ordered. > Also is the sematics of __readl/__writel defined for all architectures - > I used it ages ago in the i2o drivers for speed and it got removed > because it didn't build on some platforms. Agreed - I'm not sure the __ versions are really worth it. We have them, but the semantics are subtle enough that most drivers will never care enough to really use them. I would suggest using them mainly for architecture-specific drivers, on architectures where it actually matters (which is not the case on x86). Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:30 ` Linus Torvalds 2008-05-27 21:38 ` Alan Cox @ 2008-05-27 21:38 ` Benjamin Herrenschmidt 2008-05-27 21:42 ` Matthew Wilcox 2008-05-27 21:55 ` Linus Torvalds 1 sibling, 2 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 21:38 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan > So practically speaking, I suspect that the right approach is to just say > "ok, x86 will continue to be pretty darn ordered, and the barriers won't > really matter (*)" but at the same time also saying "we wish reality was > different, and well-maintained drivers should probably try to work in the > presense of re-ordering". Ok, well, I'll slap "memory" clobbers onto powerpc accessors, we made them fully ordered a while ago anyway. The extra barriers in drivers like USB etc.. won't hurt us much, we can always fine tune drivers that really want high performances. A problem with __raw_ though is that they -also- don't do byteswap, which is a pain in the neck as people use them for either one reason (relaxed ordering) or the other (no byteswap) without always knowing the consequences of doing so... I'm happy to say that __raw is purely about ordering and make them byteswap on powerpc tho (ie, make them little endian like the non-raw counterpart). Some archs started providing writel_be etc... I added those to powerpc a little while ago, and I tend to prefer that approach for the byteswap issue. What do you think ? Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:38 ` Benjamin Herrenschmidt @ 2008-05-27 21:42 ` Matthew Wilcox 2008-05-27 22:17 ` Benjamin Herrenschmidt 2008-05-27 21:55 ` Linus Torvalds 1 sibling, 1 reply; 160+ messages in thread From: Matthew Wilcox @ 2008-05-27 21:42 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wed, May 28, 2008 at 07:38:55AM +1000, Benjamin Herrenschmidt wrote: > A problem with __raw_ though is that they -also- don't do byteswap, > which is a pain in the neck as people use them for either one reason > (relaxed ordering) or the other (no byteswap) without always knowing the > consequences of doing so... That's why there's __readl() which does byteswap, but doesn't do ordering ... > I'm happy to say that __raw is purely about ordering and make them > byteswap on powerpc tho (ie, make them little endian like the non-raw > counterpart). That would break a lot of drivers. > Some archs started providing writel_be etc... I added those to powerpc a > little while ago, and I tend to prefer that approach for the byteswap > issue. Those are for people who use big endian chips on little endian architectures. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:42 ` Matthew Wilcox @ 2008-05-27 22:17 ` Benjamin Herrenschmidt 2008-05-28 8:36 ` Haavard Skinnemoen 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 22:17 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 2008-05-27 at 15:42 -0600, Matthew Wilcox wrote: > On Wed, May 28, 2008 at 07:38:55AM +1000, Benjamin Herrenschmidt wrote: > > A problem with __raw_ though is that they -also- don't do byteswap, > > which is a pain in the neck as people use them for either one reason > > (relaxed ordering) or the other (no byteswap) without always knowing the > > consequences of doing so... > > That's why there's __readl() which does byteswap, but doesn't do > ordering ... Ah, that one is news to me. I don't think we ever had it on powerpc :-) > > I'm happy to say that __raw is purely about ordering and make them > > byteswap on powerpc tho (ie, make them little endian like the non-raw > > counterpart). > > That would break a lot of drivers. How many actually use __raw_ * ? > > Some archs started providing writel_be etc... I added those to powerpc a > > little while ago, and I tend to prefer that approach for the byteswap > > issue. > Those are for people who use big endian chips on little endian > architectures. Why limit them to LE architecture ? There is nothing fundamentally speicifc to LE architectures here, and it's wrong to provide accessors on some archs and not others. The endianness is a property of the device registers. Current writel/readl are basically writel_le/readl_le. It thus makes sense to have the opposite, ie, readl_be/writel_be, which thus byteswaps on LE platforms and not on BE platforms, which is what I provided on powerpc a while ago. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 22:17 ` Benjamin Herrenschmidt @ 2008-05-28 8:36 ` Haavard Skinnemoen 2008-05-29 11:05 ` Pantelis Antoniou 2008-05-30 1:13 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 160+ messages in thread From: Haavard Skinnemoen @ 2008-05-28 8:36 UTC (permalink / raw) To: benh Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > I'm happy to say that __raw is purely about ordering and make them > > > byteswap on powerpc tho (ie, make them little endian like the non-raw > > > counterpart). > > > > That would break a lot of drivers. > > How many actually use __raw_ * ? I do -- in all the drivers for on-chip peripherals that are shared between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside the chip, we must use LE accesses on ARM and BE accesses on AVR32. Currently, this is the only interface I know that can do native-endian accesses, so if you take it away, I'm gonna need an alternative interface that doesn't do byteswapping. Haavard ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-28 8:36 ` Haavard Skinnemoen @ 2008-05-29 11:05 ` Pantelis Antoniou 2008-05-30 1:13 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 160+ messages in thread From: Pantelis Antoniou @ 2008-05-29 11:05 UTC (permalink / raw) To: Haavard Skinnemoen Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On 28 =CE=9C=CE=B1=CF=8A 2008, at 11:36 =CE=A0=CE=9C, Haavard Skinnemoen = wrote: > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: >>>> I'm happy to say that __raw is purely about ordering and make them >>>> byteswap on powerpc tho (ie, make them little endian like the non-=20= >>>> raw >>>> counterpart). >>> >>> That would break a lot of drivers. >> >> How many actually use __raw_ * ? > > I do -- in all the drivers for on-chip peripherals that are shared > between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside > the chip, we must use LE accesses on ARM and BE accesses on AVR32. > > Currently, this is the only interface I know that can do native-endian > accesses, so if you take it away, I'm gonna need an alternative > interface that doesn't do byteswapping. > > Haavard I certainly do too. Quite a lot of SoC specific drivers use __raw for accessing their on-chip peripherals. Please don't change the __raw semantics, you'll end up breaking almost everything that's BE. -- Pantelis ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-28 8:36 ` Haavard Skinnemoen 2008-05-29 11:05 ` Pantelis Antoniou @ 2008-05-30 1:13 ` Benjamin Herrenschmidt 2008-05-30 6:07 ` Haavard Skinnemoen 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-30 1:13 UTC (permalink / raw) To: Haavard Skinnemoen Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan > I do -- in all the drivers for on-chip peripherals that are shared > between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside > the chip, we must use LE accesses on ARM and BE accesses on AVR32. > > Currently, this is the only interface I know that can do native-endian > accesses, so if you take it away, I'm gonna need an alternative > interface that doesn't do byteswapping. Are you aware that these also don't provide any ordering guarantee ? Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 1:13 ` Benjamin Herrenschmidt @ 2008-05-30 6:07 ` Haavard Skinnemoen 2008-05-30 7:24 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: Haavard Skinnemoen @ 2008-05-30 6:07 UTC (permalink / raw) To: benh Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Fri, 30 May 2008 11:13:23 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > Currently, this is the only interface I know that can do native-endian > > accesses, so if you take it away, I'm gonna need an alternative > > interface that doesn't do byteswapping. > > Are you aware that these also don't provide any ordering guarantee ? Yes, but I am not aware of any alternative. I think the drivers I've written have the necessary barriers (or dma ops with implicit barriers) that they don't actually depend on any DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is guaranteed though? Haavard ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 6:07 ` Haavard Skinnemoen @ 2008-05-30 7:24 ` Benjamin Herrenschmidt 2008-05-30 8:27 ` Haavard Skinnemoen 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-30 7:24 UTC (permalink / raw) To: Haavard Skinnemoen Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote: > On Fri, 30 May 2008 11:13:23 +1000 > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > > Currently, this is the only interface I know that can do native-endian > > > accesses, so if you take it away, I'm gonna need an alternative > > > interface that doesn't do byteswapping. > > > > Are you aware that these also don't provide any ordering guarantee ? > > Yes, but I am not aware of any alternative. > > I think the drivers I've written have the necessary barriers (or dma > ops with implicit barriers) that they don't actually depend on any > DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is > guaranteed though? Only to the same address I'd say. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 7:24 ` Benjamin Herrenschmidt @ 2008-05-30 8:27 ` Haavard Skinnemoen 2008-05-30 9:22 ` Geert Uytterhoeven 0 siblings, 1 reply; 160+ messages in thread From: Haavard Skinnemoen @ 2008-05-30 8:27 UTC (permalink / raw) To: benh Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Fri, 30 May 2008 17:24:27 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote: > > I think the drivers I've written have the necessary barriers (or dma > > ops with implicit barriers) that they don't actually depend on any > > DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is > > guaranteed though? > > Only to the same address I'd say. Right, I sort of suspected that. Now, I'm pretty sure the architectures that can actually run those drivers (ARM9 and AVR32 AP7) provide stronger guarantees than that, and I suspect the same is true on most other embedded architectures that use __raw_* in their drivers. So I don't think adding barriers is the right thing to do because they won't do anything useful in practice, so it's hard to tell whether they are used "correctly". And it will hurt performance at least on AVR32 since wmb() evaluates to a "sync" instruction, which flushes the write buffer to RAM. Since MMIO writes are unbuffered, that's pure overhead. Maybe we need another interface that does not do byteswapping but provides stronger ordering guarantees? Haavard ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 8:27 ` Haavard Skinnemoen @ 2008-05-30 9:22 ` Geert Uytterhoeven 2008-06-02 8:11 ` Haavard Skinnemoen 0 siblings, 1 reply; 160+ messages in thread From: Geert Uytterhoeven @ 2008-05-30 9:22 UTC (permalink / raw) To: Haavard Skinnemoen Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Fri, 30 May 2008, Haavard Skinnemoen wrote: > Maybe we need another interface that does not do byteswapping but > provides stronger ordering guarantees? The byte swapping depends on the device/bus. So what happened to the old idea of putting the accessor function pointers in the device/bus structure? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 9:22 ` Geert Uytterhoeven @ 2008-06-02 8:11 ` Haavard Skinnemoen 2008-06-02 15:48 ` Scott Wood 2008-06-04 15:31 ` Linus Torvalds 0 siblings, 2 replies; 160+ messages in thread From: Haavard Skinnemoen @ 2008-06-02 8:11 UTC (permalink / raw) To: Geert Uytterhoeven Cc: linux-arch, Matthew Wilcox, linux-kernel, David Miller, linuxppc-dev, tpiepho, scottwood, Torvalds, Linus, alan Geert Uytterhoeven <geert@linux-m68k.org> wrote: > On Fri, 30 May 2008, Haavard Skinnemoen wrote: > > Maybe we need another interface that does not do byteswapping but > > provides stronger ordering guarantees? > > The byte swapping depends on the device/bus. Of course. But isn't it reasonable to assume that a device integrated on the same silicon as the CPU is connected to a somewhat sane bus which doesn't require any byte swapping? > So what happened to the old idea of putting the accessor function pointers > in the device/bus structure? Don't know. I think it sounds like overkill to replace a simple load or store with an indirect function call. Haavard ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 8:11 ` Haavard Skinnemoen @ 2008-06-02 15:48 ` Scott Wood 2008-06-03 7:46 ` Haavard Skinnemoen 2008-06-04 15:31 ` Linus Torvalds 1 sibling, 1 reply; 160+ messages in thread From: Scott Wood @ 2008-06-02 15:48 UTC (permalink / raw) To: Haavard Skinnemoen Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, Geert Uytterhoeven, Linus Torvalds, David Miller, alan On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote: > Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > On Fri, 30 May 2008, Haavard Skinnemoen wrote: > > > Maybe we need another interface that does not do byteswapping but > > > provides stronger ordering guarantees? > > > > The byte swapping depends on the device/bus. > > Of course. But isn't it reasonable to assume that a device integrated > on the same silicon as the CPU is connected to a somewhat sane bus > which doesn't require any byte swapping? No, unfortunately. :-( See the end of drivers/dma/fsldma.h. Likewise with Freescale's PCI host bridges; for some reason the bus itself being little endian led to the host bridge control registers also being little endian. -Scott ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 15:48 ` Scott Wood @ 2008-06-03 7:46 ` Haavard Skinnemoen 0 siblings, 0 replies; 160+ messages in thread From: Haavard Skinnemoen @ 2008-06-03 7:46 UTC (permalink / raw) To: Scott Wood Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, Geert Uytterhoeven, Linus Torvalds, David Miller, alan Scott Wood <scottwood@freescale.com> wrote: > On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote: > > Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > > On Fri, 30 May 2008, Haavard Skinnemoen wrote: > > > > Maybe we need another interface that does not do byteswapping but > > > > provides stronger ordering guarantees? > > > > > > The byte swapping depends on the device/bus. > > > > Of course. But isn't it reasonable to assume that a device integrated > > on the same silicon as the CPU is connected to a somewhat sane bus > > which doesn't require any byte swapping? > > No, unfortunately. :-( Ok, I guess I was being naive. > See the end of drivers/dma/fsldma.h. Likewise with Freescale's PCI host > bridges; for some reason the bus itself being little endian led to the host > bridge control registers also being little endian. Right. But still, isn't it better to handle it on a case-by-case basis in the drivers? In some cases, it's best to explicitly use a certain byte order, in others it's best to use whatever is native to the CPU. Haavard ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 8:11 ` Haavard Skinnemoen 2008-06-02 15:48 ` Scott Wood @ 2008-06-04 15:31 ` Linus Torvalds 1 sibling, 0 replies; 160+ messages in thread From: Linus Torvalds @ 2008-06-04 15:31 UTC (permalink / raw) To: Haavard Skinnemoen Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev, Geert Uytterhoeven, scottwood, David Miller, alan On Mon, 2 Jun 2008, Haavard Skinnemoen wrote: > > > So what happened to the old idea of putting the accessor function pointers > > in the device/bus structure? > > Don't know. I think it sounds like overkill to replace a simple load or > store with an indirect function call. Indeed. *Especially* as the driver in practice tends to always know which one it is statically. Yes, sometimes the same RTL may end up being used behind two differnt buses, and with some broken byte-conversion hardware in the middle, but that's pretty damn rare in the end. It happens, but it happens mostly with the odd broken kind of embedded setups where somebody took a standard part and then did something _strange_ to it - and in the process they may well have introduced other issues as well. I suspect you find this kind of thing mostly with things like stupid integrated IDE controllers, and things like byte order tends to be the _least_ of the issues (specialized register accesses with odd offsets etc will happen). So most of the time the byte order is simply decided by the hardware itself (and LE is the common one, due to ISA/PCI). Sometimes you can set it dynamically (at which point it's irrelevant - just pick the one you want). So I definitely argue against complex IO accessor functions with things like dynamic support for byte order depending on bus. 99.9% of all hardware simply do not need them, and the small percentage that might need it will need special drivers anyway, so they might as well do the complexity on their own rather than make everybody else care. See for example the input driver for the i8042 chip: not only do those things sometimes need native order (probably exactly because it's integrated on a chip over some special "native bus", or just wired up to be big-endian on a BE platform by some moron), but you'll also find that they just need odd accesses anyway exactly because it's oddly wired up (eg it's some special serial protocol that emulates ISA accesses). So having some "byte-order correcting function" still wouldn't make any sense, because it's not just about byte order. Will that mean that you might occasionally have a "normal" driver and an "odd" driver that actually drives what is basically the same HW block, just wired up differently? Sure. But that's still a smaller price to pay than it would be to try to make everything be "generic" when there is no point to it. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:38 ` Benjamin Herrenschmidt 2008-05-27 21:42 ` Matthew Wilcox @ 2008-05-27 21:55 ` Linus Torvalds 2008-05-27 22:19 ` Benjamin Herrenschmidt 2008-06-02 7:24 ` Russell King 1 sibling, 2 replies; 160+ messages in thread From: Linus Torvalds @ 2008-05-27 21:55 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: > > A problem with __raw_ though is that they -also- don't do byteswap, Well, that's why there is __readl() and __raw_readl(), no? Neither does ordering, and __raw_readl() doesn't do byte-swap. Of course, I'm not going to guarantee every architecture even has all those versions, nor am I going to guarantee they all work as advertised :) For x86, they have historially all been 100% identical. With the inline asm patch I posted, the "__" version (whether "raw" or not) lack the "memory" barrier, so they allow a *little* bit more re-ordering. (They won't be re-ordered wrt spinlocks etc, unless gcc starts reordering volatile asm's against each other, which would be a bug). In practice, I doubt it matters. Whatever small compiler re-ordering it might affect won't have any real performance impack one way or the other, I think. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:55 ` Linus Torvalds @ 2008-05-27 22:19 ` Benjamin Herrenschmidt 2008-05-29 7:10 ` Arnd Bergmann 2008-06-02 7:24 ` Russell King 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 22:19 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 2008-05-27 at 14:55 -0700, Linus Torvalds wrote: > > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: > > > > A problem with __raw_ though is that they -also- don't do byteswap, > > Well, that's why there is __readl() and __raw_readl(), no? As I replied to somebody else, __readl() is news to me :-) we dont' have those on powerpc. > Neither does ordering, and __raw_readl() doesn't do byte-swap. But I can add them :-) > Of course, I'm not going to guarantee every architecture even has all > those versions, nor am I going to guarantee they all work as advertised :) > > For x86, they have historially all been 100% identical. With the inline > asm patch I posted, the "__" version (whether "raw" or not) lack the > "memory" barrier, so they allow a *little* bit more re-ordering. > > (They won't be re-ordered wrt spinlocks etc, unless gcc starts reordering > volatile asm's against each other, which would be a bug). > > In practice, I doubt it matters. Whatever small compiler re-ordering it > might affect won't have any real performance impack one way or the other, > I think. I prefer explicit endian. Always. Thus I prefer introducing _be variants (we already have those on powerpc and iomap has it's own _be versions too) so we should probably generalize _be. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 22:19 ` Benjamin Herrenschmidt @ 2008-05-29 7:10 ` Arnd Bergmann 2008-05-29 10:46 ` Alan Cox 0 siblings, 1 reply; 160+ messages in thread From: Arnd Bergmann @ 2008-05-29 7:10 UTC (permalink / raw) To: benh Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wednesday 28 May 2008, Benjamin Herrenschmidt wrote: > On Tue, 2008-05-27 at 14:55 -0700, Linus Torvalds wrote: > > > > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: > > > > > > A problem with __raw_ though is that they -also- don't do byteswap, > > > > Well, that's why there is __readl() and __raw_readl(), no? > > As I replied to somebody else, __readl() is news to me :-) we dont' have > those on powerpc. > It's not exactly a well-established interface. Only five architectures define these functions, and there is not a single user in the kernel source outside of these architecture's io.h files. Arnd <>< ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 7:10 ` Arnd Bergmann @ 2008-05-29 10:46 ` Alan Cox 0 siblings, 0 replies; 160+ messages in thread From: Alan Cox @ 2008-05-29 10:46 UTC (permalink / raw) To: Arnd Bergmann Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller > It's not exactly a well-established interface. Only five architectures > define these functions, and there is not a single user in the kernel > source outside of these architecture's io.h files. That is because the drivers using them had them removed (eg I=C2=B2O) - mos= tly because "it didn't compile on power pc" 8( Alan ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:55 ` Linus Torvalds 2008-05-27 22:19 ` Benjamin Herrenschmidt @ 2008-06-02 7:24 ` Russell King 2008-06-03 4:16 ` Nick Piggin 1 sibling, 1 reply; 160+ messages in thread From: Russell King @ 2008-06-02 7:24 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote: > > > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: > > > > A problem with __raw_ though is that they -also- don't do byteswap, > > Well, that's why there is __readl() and __raw_readl(), no? > > Neither does ordering, and __raw_readl() doesn't do byte-swap. This is where the lack of documentation causes arch maintainers a big problem. None of the semantics of __raw_readl vs __readl vs readl are documented _anywhere_. If you look at x86 as a template, there's no comments there about what the different variants are supposed to do or not do. So it's left up to arch maintainers to literally guess what should be done. That's precisely what I did when I implemented ARMs __raw_readl and friends. I guessed. And it was only after I read a few mails on lkml which suggested that readl and friends should always be LE that ARMs readl implementation started to use le32_to_cpu()... before that it had always been native endian. Again, lack of documentation... So, can the semantics of what's expected from these IO accessor functions be documented somewhere. Please? Before this thread gets lost in the depths of time? -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 7:24 ` Russell King @ 2008-06-03 4:16 ` Nick Piggin 2008-06-03 4:32 ` Benjamin Herrenschmidt ` (2 more replies) 0 siblings, 3 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-03 4:16 UTC (permalink / raw) To: Russell King Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Monday 02 June 2008 17:24, Russell King wrote: > On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote: > > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: > > > A problem with __raw_ though is that they -also- don't do byteswap, > > > > Well, that's why there is __readl() and __raw_readl(), no? > > > > Neither does ordering, and __raw_readl() doesn't do byte-swap. > > This is where the lack of documentation causes arch maintainers a big > problem. None of the semantics of __raw_readl vs __readl vs readl are > documented _anywhere_. If you look at x86 as a template, there's no > comments there about what the different variants are supposed to do > or not do. > > So it's left up to arch maintainers to literally guess what should be > done. That's precisely what I did when I implemented ARMs __raw_readl > and friends. I guessed. > > And it was only after I read a few mails on lkml which suggested that > readl and friends should always be LE that ARMs readl implementation > started to use le32_to_cpu()... before that it had always been native > endian. Again, lack of documentation... > > So, can the semantics of what's expected from these IO accessor > functions be documented somewhere. Please? Before this thread gets > lost in the depths of time? This whole thread also ties in with my posts about mmiowb (which IMO should go away). readl/writel: strongly ordered wrt one another and other stores to cacheable RAM, byteswapping __readl/__writel: not ordered (needs mb/rmb/wmb to order with other readl/writel and cacheable operations, or io_*mb to order with one another) raw_readl/raw_writel: strongly ordered, no byteswapping __raw_readl/__raw_writel: not ordered, no byteswapping then get rid of *relaxed* variants. Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So, what *exact* definition are you going to mandate for readl/writel? Anything less than strict ordering then we also need to ensure drivers use the correct barriers (to implement strict ordering, we could either put mfence instructions in, or explicitly disallow readl/writel to be used on wc/wc+ memory). The other way we can go is just say that they have x86 semantics, although that would be a bit sad IMO: we should have strong ops, in which case driver writers never need to use a single barrier provided they have locking right, and weak ops, in which case they should match up with the weak Linux memory ordering model for system RAM. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 4:16 ` Nick Piggin @ 2008-06-03 4:32 ` Benjamin Herrenschmidt 2008-06-03 6:11 ` Nick Piggin 2008-06-03 14:47 ` Linus Torvalds 2008-06-03 19:43 ` Trent Piepho 2 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-03 4:32 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan > This whole thread also ties in with my posts about mmiowb (which IMO > should go away). > > readl/writel: strongly ordered wrt one another and other stores > to cacheable RAM, byteswapping > __readl/__writel: not ordered (needs mb/rmb/wmb to order with > other readl/writel and cacheable operations, or > io_*mb to order with one another) > raw_readl/raw_writel: strongly ordered, no byteswapping > __raw_readl/__raw_writel: not ordered, no byteswapping > > then get rid of *relaxed* variants. In addition, some archs like powerpc also provide readl_be/writel_be as being defined as big endian (ie. byteswap on LE archs, no byteswap on BE archs). As of today, powerpc lacks the raw_readl/raw_writel and __readl/__writel variants (ie, we only provide fully ordered + byteswap and no ordering + no byteswap variants). If we agree on the above semantics, I'll do a patch providing the missing ones. > Linus: on x86, memory operations to wc and wc+ memory are not ordered > with one another, or operations to other memory types (ie. load/load > and store/store reordering is allowed). Also, as you know, store/load > reordering is explicitly allowed as well, which covers all memory > types. So perhaps it is not quite true to say readl/writel is strongly > ordered by default even on x86. You would have to put in some > mfence instructions in them to make it so. > > So, what *exact* definition are you going to mandate for readl/writel? > Anything less than strict ordering then we also need to ensure drivers > use the correct barriers (to implement strict ordering, we could either > put mfence instructions in, or explicitly disallow readl/writel to be > used on wc/wc+ memory). The ordering guarantees that I provide on powerpc for "ordered" variants are: - cacheable store + writel stays ordered (ie, write to some DMA stuff and then a register to trigger the DMA). - readl + cacheable read stays ordered (ie. read some status register, for example, after an interrupt, and then read the resulting data in memory). - any of these ordered vs. spin_lock and spin_unlock (with the exception that stores done before the spin_lock could potentially leak into the lock). - readl is synchronous (ie, makes the CPU think the data was actually used before executing subsequent instructions, thus waits for the data to come back, for example to ensure that a read used to push out post buffers followed by a delay will indeed happen with the right delay). We don't provide meaningless ones like writel + cacheable store for example. (PCI posting would defeat it anyway). > The other way we can go is just say that they have x86 semantics, > although that would be a bit sad IMO: we should have strong ops, in > which case driver writers never need to use a single barrier provided > they have locking right, and weak ops, in which case they should match > up with the weak Linux memory ordering model for system RAM. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 4:32 ` Benjamin Herrenschmidt @ 2008-06-03 6:11 ` Nick Piggin 2008-06-03 6:48 ` Benjamin Herrenschmidt 2008-06-03 6:53 ` Paul Mackerras 0 siblings, 2 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-03 6:11 UTC (permalink / raw) To: benh Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tuesday 03 June 2008 14:32, Benjamin Herrenschmidt wrote: > > This whole thread also ties in with my posts about mmiowb (which IMO > > should go away). > > > > readl/writel: strongly ordered wrt one another and other stores > > to cacheable RAM, byteswapping > > __readl/__writel: not ordered (needs mb/rmb/wmb to order with > > other readl/writel and cacheable operations, or > > io_*mb to order with one another) > > raw_readl/raw_writel: strongly ordered, no byteswapping > > __raw_readl/__raw_writel: not ordered, no byteswapping > > > > then get rid of *relaxed* variants. > > In addition, some archs like powerpc also provide readl_be/writel_be as > being defined as big endian (ie. byteswap on LE archs, no byteswap on BE > archs). Sure. > As of today, powerpc lacks the raw_readl/raw_writel and __readl/__writel > variants (ie, we only provide fully ordered + byteswap and no ordering + > no byteswap variants). > > If we agree on the above semantics, I'll do a patch providing the > missing ones. Let's see what Linus thinks... > > Linus: on x86, memory operations to wc and wc+ memory are not ordered > > with one another, or operations to other memory types (ie. load/load > > and store/store reordering is allowed). Also, as you know, store/load > > reordering is explicitly allowed as well, which covers all memory > > types. So perhaps it is not quite true to say readl/writel is strongly > > ordered by default even on x86. You would have to put in some > > mfence instructions in them to make it so. > > > > So, what *exact* definition are you going to mandate for readl/writel? > > Anything less than strict ordering then we also need to ensure drivers > > use the correct barriers (to implement strict ordering, we could either > > put mfence instructions in, or explicitly disallow readl/writel to be > > used on wc/wc+ memory). > > The ordering guarantees that I provide on powerpc for "ordered" variants > are: > > - cacheable store + writel stays ordered (ie, write to some > DMA stuff and then a register to trigger the DMA). > > - readl + cacheable read stays ordered (ie. read some status > register, for example, after an interrupt, and then read the > resulting data in memory). > > - any of these ordered vs. spin_lock and spin_unlock (with the > exception that stores done before the spin_lock > could potentially leak into the lock). > > - readl is synchronous (ie, makes the CPU think the > data was actually used before executing subsequent > instructions, thus waits for the data to come back, > for example to ensure that a read used to push out > post buffers followed by a delay will indeed happen > with the right delay). So your readl can pass an earlier cacheable store or earlier writel? > We don't provide meaningless ones like writel + cacheable store for > example. (PCI posting would defeat it anyway). What do you mean by meaningless? Ordering of writel followed by a cacheable store is meaningful eg. for retaining io operations within locks. OK, you explicitly have some extra code for spin_unlock, but not for bit locks, mutexes, etc. It would make sense to have the default operations _very_ strongly ordered IMO, and then move drivers to be more relaxed when they are verified. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 6:11 ` Nick Piggin @ 2008-06-03 6:48 ` Benjamin Herrenschmidt 2008-06-03 6:53 ` Paul Mackerras 1 sibling, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-03 6:48 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 2008-06-03 at 16:11 +1000, Nick Piggin wrote: > > - readl is synchronous (ie, makes the CPU think the > > data was actually used before executing subsequent > > instructions, thus waits for the data to come back, > > for example to ensure that a read used to push out > > post buffers followed by a delay will indeed happen > > with the right delay). > > So your readl can pass an earlier cacheable store or earlier writel? I forgot to mention that all MMIO are ordered vs. each other and I do prevent readl from passing earlier cacheable stores too in my current implementation but I'n not 100% we want to "guarantee" that, unless we have stupid devices that trigger DMA's on reads with side effects.. anyway, it is guaranteed in the current case. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 6:11 ` Nick Piggin 2008-06-03 6:48 ` Benjamin Herrenschmidt @ 2008-06-03 6:53 ` Paul Mackerras 2008-06-03 7:18 ` Nick Piggin 1 sibling, 1 reply; 160+ messages in thread From: Paul Mackerras @ 2008-06-03 6:53 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Nick Piggin writes: > So your readl can pass an earlier cacheable store or earlier writel? No. It's quite gross at the moment, it has a sync before the access (i.e. a full mb()) and a twi; isync sequence after the access that stalls execution until the data comes back. > > We don't provide meaningless ones like writel + cacheable store for > > example. (PCI posting would defeat it anyway). > > What do you mean by meaningless? Ordering of writel followed by a > cacheable store is meaningful eg. for retaining io operations within > locks. OK, you explicitly have some extra code for spin_unlock, but > not for bit locks, mutexes, etc. It would make sense to have the > default operations _very_ strongly ordered IMO, and then move drivers > to be more relaxed when they are verified. It's meaningless in the sense that nothing guarantees that the writel has actually hit the device, even if we put a full mb() barrier in between the writel and the cacheable write. That would guarantee that the writel had got to the PCI host bridge, and couldn't be reordered with other accesses to the device, but not that the device had actually seen it. I don't mind adding code to the mutex unlock to do the same as spin_unlock, but I really don't want to have *two* sync instructions for every MMIO write. One is bad enough. Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 6:53 ` Paul Mackerras @ 2008-06-03 7:18 ` Nick Piggin 0 siblings, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-03 7:18 UTC (permalink / raw) To: Paul Mackerras Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tuesday 03 June 2008 16:53, Paul Mackerras wrote: > Nick Piggin writes: > > So your readl can pass an earlier cacheable store or earlier writel? > > No. It's quite gross at the moment, it has a sync before the access > (i.e. a full mb()) and a twi; isync sequence after the access that > stalls execution until the data comes back. OK. > > > We don't provide meaningless ones like writel + cacheable store for > > > example. (PCI posting would defeat it anyway). > > > > What do you mean by meaningless? Ordering of writel followed by a > > cacheable store is meaningful eg. for retaining io operations within > > locks. OK, you explicitly have some extra code for spin_unlock, but > > not for bit locks, mutexes, etc. It would make sense to have the > > default operations _very_ strongly ordered IMO, and then move drivers > > to be more relaxed when they are verified. > > It's meaningless in the sense that nothing guarantees that the writel > has actually hit the device, even if we put a full mb() barrier in > between the writel and the cacheable write. That would guarantee that > the writel had got to the PCI host bridge, and couldn't be reordered > with other accesses to the device, but not that the device had > actually seen it. OK, but I think fits OK with our SMP ordering model for cacheable stores: no amount of barriers on CPU0 will guarantee that CPU1 has seen the store, you actually have to observe a causual effect of the store before you can say that. > I don't mind adding code to the mutex unlock to do the same as > spin_unlock, but I really don't want to have *two* sync instructions > for every MMIO write. One is bad enough. So you can't provide iostore/store ordering without another sync after the writel store? I guess the problem with providing exceptions is that it makes it hard for people who absolutely don't know or care about ordering. I don't like having to think about it "hmm, we can allow this type of reordering... oh, unless some silly device does X...". If we come up with a sane set of weakly ordered accessors (including io_*mb()), it will make it easier to go through the important drivers and improve them. We don't have to enforce the the new semantics strictly until then if they'll slow you down too much in the meantime. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 4:16 ` Nick Piggin 2008-06-03 4:32 ` Benjamin Herrenschmidt @ 2008-06-03 14:47 ` Linus Torvalds 2008-06-03 18:47 ` Trent Piepho 2008-06-04 2:19 ` Nick Piggin 2008-06-03 19:43 ` Trent Piepho 2 siblings, 2 replies; 160+ messages in thread From: Linus Torvalds @ 2008-06-03 14:47 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Tue, 3 Jun 2008, Nick Piggin wrote: > > Linus: on x86, memory operations to wc and wc+ memory are not ordered > with one another, or operations to other memory types (ie. load/load > and store/store reordering is allowed). Also, as you know, store/load > reordering is explicitly allowed as well, which covers all memory > types. So perhaps it is not quite true to say readl/writel is strongly > ordered by default even on x86. You would have to put in some > mfence instructions in them to make it so. Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that does that needs to be aware of it. IOW, it's a non-issue, imnsho. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 14:47 ` Linus Torvalds @ 2008-06-03 18:47 ` Trent Piepho 2008-06-03 18:55 ` Matthew Wilcox 2008-06-03 19:07 ` Linus Torvalds 2008-06-04 2:19 ` Nick Piggin 1 sibling, 2 replies; 160+ messages in thread From: Trent Piepho @ 2008-06-03 18:47 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, David Miller, alan On Tue, 3 Jun 2008, Linus Torvalds wrote: > On Tue, 3 Jun 2008, Nick Piggin wrote: >> >> Linus: on x86, memory operations to wc and wc+ memory are not ordered >> with one another, or operations to other memory types (ie. load/load >> and store/store reordering is allowed). Also, as you know, store/load >> reordering is explicitly allowed as well, which covers all memory >> types. So perhaps it is not quite true to say readl/writel is strongly >> ordered by default even on x86. You would have to put in some >> mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); > Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that > does that needs to be aware of it. IOW, it's a non-issue, imnsho. You need to ask for coherent DMA memory too. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 18:47 ` Trent Piepho @ 2008-06-03 18:55 ` Matthew Wilcox 2008-06-03 19:57 ` Trent Piepho 2008-06-03 19:07 ` Linus Torvalds 1 sibling, 1 reply; 160+ messages in thread From: Matthew Wilcox @ 2008-06-03 18:55 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: > On Tue, 3 Jun 2008, Linus Torvalds wrote: > >On Tue, 3 Jun 2008, Nick Piggin wrote: > >> > >>Linus: on x86, memory operations to wc and wc+ memory are not ordered > >>with one another, or operations to other memory types (ie. load/load > >>and store/store reordering is allowed). Also, as you know, store/load > >>reordering is explicitly allowed as well, which covers all memory > >>types. So perhaps it is not quite true to say readl/writel is strongly > >>ordered by default even on x86. You would have to put in some > >>mfence instructions in them to make it so. > > So on x86, these could be re-ordered? > > writel(START_OPERATION, CONTROL_REGISTER); > status = readl(STATUS_REGISTER); You wouldn't ask for write-combining memory mapping for control or status registers. > >Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that > >does that needs to be aware of it. IOW, it's a non-issue, imnsho. > > You need to ask for coherent DMA memory too. Different case. Coherent DMA memory is *host* memory that the *device* accesses. We're talking about *device* memory that the *cpu* accesses. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 18:55 ` Matthew Wilcox @ 2008-06-03 19:57 ` Trent Piepho 2008-06-03 21:35 ` Matthew Wilcox 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-06-03 19:57 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 3 Jun 2008, Matthew Wilcox wrote: > On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: >> On Tue, 3 Jun 2008, Linus Torvalds wrote: >>> On Tue, 3 Jun 2008, Nick Piggin wrote: >>>> >>>> Linus: on x86, memory operations to wc and wc+ memory are not ordered >>>> with one another, or operations to other memory types (ie. load/load >>>> and store/store reordering is allowed). Also, as you know, store/load >>>> reordering is explicitly allowed as well, which covers all memory >>>> types. So perhaps it is not quite true to say readl/writel is strongly >>>> ordered by default even on x86. You would have to put in some >>>> mfence instructions in them to make it so. >> >> So on x86, these could be re-ordered? >> >> writel(START_OPERATION, CONTROL_REGISTER); >> status = readl(STATUS_REGISTER); > > You wouldn't ask for write-combining memory mapping for control or > status registers. But Nick said, "store/load reordering is explicitly allowed as well, which covers *all* memory types." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 19:57 ` Trent Piepho @ 2008-06-03 21:35 ` Matthew Wilcox 2008-06-03 21:58 ` Trent Piepho 0 siblings, 1 reply; 160+ messages in thread From: Matthew Wilcox @ 2008-06-03 21:35 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote: > On Tue, 3 Jun 2008, Matthew Wilcox wrote: > >On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: > >>On Tue, 3 Jun 2008, Linus Torvalds wrote: > >>>On Tue, 3 Jun 2008, Nick Piggin wrote: > >>>> > >>>>Linus: on x86, memory operations to wc and wc+ memory are not ordered > >>>>with one another, or operations to other memory types (ie. load/load > >>>>and store/store reordering is allowed). Also, as you know, store/load > >>>>reordering is explicitly allowed as well, which covers all memory > >>>>types. So perhaps it is not quite true to say readl/writel is strongly > >>>>ordered by default even on x86. You would have to put in some > >>>>mfence instructions in them to make it so. > >> > >>So on x86, these could be re-ordered? > >> > >>writel(START_OPERATION, CONTROL_REGISTER); > >>status = readl(STATUS_REGISTER); > > > >You wouldn't ask for write-combining memory mapping for control or > >status registers. > > But Nick said, "store/load reordering is explicitly allowed as well, which > covers *all* memory types." Then Nick is confused. PCI only defines one way to flush posted writes to a device -- doing a read from it. There's no way that reads can be allowed to pass writes (unless you've asked for it, like with write combining). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 21:35 ` Matthew Wilcox @ 2008-06-03 21:58 ` Trent Piepho 2008-06-04 2:00 ` Nick Piggin 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-06-03 21:58 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 3 Jun 2008, Matthew Wilcox wrote: > On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote: >> On Tue, 3 Jun 2008, Matthew Wilcox wrote: >>> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: >>>> On Tue, 3 Jun 2008, Linus Torvalds wrote: >>>>> On Tue, 3 Jun 2008, Nick Piggin wrote: >>>>>> >>>>>> Linus: on x86, memory operations to wc and wc+ memory are not ordered >>>>>> with one another, or operations to other memory types (ie. load/load >>>>>> and store/store reordering is allowed). Also, as you know, store/load >>>>>> reordering is explicitly allowed as well, which covers all memory >>>>>> types. So perhaps it is not quite true to say readl/writel is strongly >>>>>> ordered by default even on x86. You would have to put in some >>>>>> mfence instructions in them to make it so. >>>> >>>> So on x86, these could be re-ordered? >>>> >>>> writel(START_OPERATION, CONTROL_REGISTER); >>>> status = readl(STATUS_REGISTER); >>> >>> You wouldn't ask for write-combining memory mapping for control or >>> status registers. >> >> But Nick said, "store/load reordering is explicitly allowed as well, which >> covers *all* memory types." > > Then Nick is confused. PCI only defines one way to flush posted writes > to a device -- doing a read from it. There's no way that reads can > be allowed to pass writes (unless you've asked for it, like with write > combining). But that requirement is for the PCI bridge, isn't it? It doesn't matter if the bridge will flush all posted writes before allowing a read if the CPU decides to give the bridge the read before the write. A powerpc CPU will certainly do this if you don't take any steps like telling it the memory is uncachable and guarded. I didn't think it was allowed on x86 (except with WC), but Nick seemed to say it was. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 21:58 ` Trent Piepho @ 2008-06-04 2:00 ` Nick Piggin 0 siblings, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-04 2:00 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wednesday 04 June 2008 07:58, Trent Piepho wrote: > On Tue, 3 Jun 2008, Matthew Wilcox wrote: > > On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote: > >> On Tue, 3 Jun 2008, Matthew Wilcox wrote: > >>> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: > >>>> On Tue, 3 Jun 2008, Linus Torvalds wrote: > >>>>> On Tue, 3 Jun 2008, Nick Piggin wrote: > >>>>>> Linus: on x86, memory operations to wc and wc+ memory are not > >>>>>> ordered with one another, or operations to other memory types (ie. > >>>>>> load/load and store/store reordering is allowed). Also, as you know, > >>>>>> store/load reordering is explicitly allowed as well, which covers > >>>>>> all memory types. So perhaps it is not quite true to say > >>>>>> readl/writel is strongly ordered by default even on x86. You would > >>>>>> have to put in some mfence instructions in them to make it so. > >>>> > >>>> So on x86, these could be re-ordered? > >>>> > >>>> writel(START_OPERATION, CONTROL_REGISTER); > >>>> status = readl(STATUS_REGISTER); > >>> > >>> You wouldn't ask for write-combining memory mapping for control or > >>> status registers. > >> > >> But Nick said, "store/load reordering is explicitly allowed as well, > >> which covers *all* memory types." > > > > Then Nick is confused. PCI only defines one way to flush posted writes > > to a device -- doing a read from it. There's no way that reads can > > be allowed to pass writes (unless you've asked for it, like with write > > combining). > > But that requirement is for the PCI bridge, isn't it? It doesn't matter if > the bridge will flush all posted writes before allowing a read if the CPU > decides to give the bridge the read before the write. A powerpc CPU will > certainly do this if you don't take any steps like telling it the memory is > uncachable and guarded. I didn't think it was allowed on x86 (except with > WC), but Nick seemed to say it was. Ah sorry, not UC, I was confused. UC I think actually is strongly ordered WRT other UC and also cacheable operations. WC is weakly ordered, anything goes. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 18:47 ` Trent Piepho 2008-06-03 18:55 ` Matthew Wilcox @ 2008-06-03 19:07 ` Linus Torvalds 2008-06-04 2:05 ` Nick Piggin 2008-06-10 6:56 ` Nick Piggin 1 sibling, 2 replies; 160+ messages in thread From: Linus Torvalds @ 2008-06-03 19:07 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, David Miller, alan On Tue, 3 Jun 2008, Trent Piepho wrote: > On Tue, 3 Jun 2008, Linus Torvalds wrote: > > On Tue, 3 Jun 2008, Nick Piggin wrote: > > > > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered > > > with one another, or operations to other memory types (ie. load/load > > > and store/store reordering is allowed). Also, as you know, store/load > > > reordering is explicitly allowed as well, which covers all memory > > > types. So perhaps it is not quite true to say readl/writel is strongly > > > ordered by default even on x86. You would have to put in some > > > mfence instructions in them to make it so. > > So on x86, these could be re-ordered? > > writel(START_OPERATION, CONTROL_REGISTER); > status = readl(STATUS_REGISTER); With both registers in a WC+ area, yes. The write may be in the WC buffers until the WC buffers are flushed (short list: a fence, a serializing instruction, a read-write to uncached memory, or an interrupt. There are others, but those are the main ones). But if the status register is in uncached memory (which is the only *sane* thing to do), then it doesn't matter if the control register is in WC memory. Because the status register read is itself serializing with the WC buffer, it's actually fine. So this is used for putting things like ring queues in WC memory, and fill them up with writes, and get nice bursty write traffic with the CPU automatically buffering it up (think "stdio.h on a really low level"). And if you then have the command registers in UC memory or using IO port accesses, reading and writing to them will automatically serialize. > > Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that > > does that needs to be aware of it. IOW, it's a non-issue, imnsho. > > You need to ask for coherent DMA memory too. Not on x86. And I don't care what anybody else says - x86 is *so* totally dominant, that other architectures have to live with the fact that 99.9% of all drivers are written for and tested on x86. As a result, anything else is "theory". Are some drivers good and are careful? Yes. Are most? No, and even if they were, people making changes would mostly test them on x86. Architectures should strive to basically act as closely as possible to x86 semantics in order to not get nasty surprises. And architectures that can't do that well enough *will* often find that they need to fix drivers, and only a small small subset of all kernel drivers will generally work out-of-the-box. If you're ready to do that, your architecture can do anything it damn well pleases ;) Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 19:07 ` Linus Torvalds @ 2008-06-04 2:05 ` Nick Piggin 2008-06-04 2:46 ` Linus Torvalds 2008-06-10 6:56 ` Nick Piggin 1 sibling, 1 reply; 160+ messages in thread From: Nick Piggin @ 2008-06-04 2:05 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, Russell King, linux-kernel, David Miller, linuxppc-dev, scottwood, Trent Piepho, alan On Wednesday 04 June 2008 05:07, Linus Torvalds wrote: > On Tue, 3 Jun 2008, Trent Piepho wrote: > > On Tue, 3 Jun 2008, Linus Torvalds wrote: > > > On Tue, 3 Jun 2008, Nick Piggin wrote: > > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered > > > > with one another, or operations to other memory types (ie. load/load > > > > and store/store reordering is allowed). Also, as you know, store/load > > > > reordering is explicitly allowed as well, which covers all memory > > > > types. So perhaps it is not quite true to say readl/writel is > > > > strongly ordered by default even on x86. You would have to put in > > > > some mfence instructions in them to make it so. > > > > So on x86, these could be re-ordered? > > > > writel(START_OPERATION, CONTROL_REGISTER); > > status = readl(STATUS_REGISTER); > > With both registers in a WC+ area, yes. The write may be in the WC buffers > until the WC buffers are flushed (short list: a fence, a serializing > instruction, a read-write to uncached memory, or an interrupt. There are > others, but those are the main ones). > > But if the status register is in uncached memory (which is the only *sane* > thing to do), then it doesn't matter if the control register is in WC > memory. Because the status register read is itself serializing with the WC > buffer, it's actually fine. Actually, according to the document I am looking at (the AMD one), a UC store may pass a previous WC store. So you could have some code that writes to some WC memory on the card, and then stores to an UC control register to start up the operation on that memory, couldn't you? Those can go out of order. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-04 2:05 ` Nick Piggin @ 2008-06-04 2:46 ` Linus Torvalds 2008-06-04 11:47 ` Alan Cox 0 siblings, 1 reply; 160+ messages in thread From: Linus Torvalds @ 2008-06-04 2:46 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, linux-kernel, David Miller, linuxppc-dev, scottwood, Trent Piepho, alan On Wed, 4 Jun 2008, Nick Piggin wrote: > > Actually, according to the document I am looking at (the AMD one), a UC > store may pass a previous WC store. Hmm. Intel arch manyal, Vol 3, 10.3 (page 10-7 in my version): "If the WC bufer is partially filled, the writes may be delayed until the next ocurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, ..." Any typos mine. Anyway, Intel certainly seems to document that WC memory is serialized by any access to UC memory. But yes, I can well imagine that AMD is different, and I also heartily would recommend rather being safe than sorry. Putting an explicit memory barrier in between those accesses when you know it might make a difference is just a good idea. But basically, as far as I know the thing was designed to be invisible to old software: that is the whole idea behind WC memory. So the design was certainly intended to be that you can generally mark a framebuffer-like structure WC without any software _ever_ caring, as long as you keep all control ports in UC memory. Of course, because burst writes from the WC buffer are <i>so</i> much more efficient on the PCI bus than dribbling them out one write at a time, it didn't take long before all the graphics cards etc wanted to <i>also</i> mark their command queues as WC memory, so that you could burst out the commands to the ring buffers as fast as possible. So now you have both your frame buffer *and* your command buffers mapped WC, and now ordering really has to be ensured in software if you access both. [ And then there are the crazy people who mark *main memory* as WC, because they don't want to pollute the cache with all the data, and then you have the issue of cache coherency etc crap. Which only gets worse with SMP, especially if one processor thinks it has part of memory exclusively cached, and another one - or even the same one, through another aliasign address - ignores the cache protocol. And you now get unhappy CPU's that think that there is a bug in the cache protocol and they get machine check faults. So what started out as a "we can do accesses to the frame buffer more efficiently without anybody ever even having to know or care" has turned into a whole nightmare of people using it for other things, and then you very much _do_ have to care! ] And it doesn't surprise me if AMD then didn't get exactly the same rules. Oh, well. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-04 2:46 ` Linus Torvalds @ 2008-06-04 11:47 ` Alan Cox 0 siblings, 0 replies; 160+ messages in thread From: Alan Cox @ 2008-06-04 11:47 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin, Nick Piggin, Russell King, linux-kernel, Trent Piepho, linuxppc-dev, linux-arch, scottwood, David Miller > Anyway, Intel certainly seems to document that WC memory is serialized by= =20 > any access to UC memory. I don't believe that is actually true on Pentium Pro at least. > So what started out as a "we can do accesses to the frame buffer more=20 > efficiently without anybody ever even having to know or care" has=20 > turned into a whole nightmare of people using it for other things, and= =20 > then you very much _do_ have to care! ] Notably graphics where 3Dfx lined the registers up specifically to make this work. We were also using it with I=C2=B2O where it gave a small performance gain. Alan ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 19:07 ` Linus Torvalds 2008-06-04 2:05 ` Nick Piggin @ 2008-06-10 6:56 ` Nick Piggin 2008-06-10 17:41 ` Jesse Barnes ` (2 more replies) 1 sibling, 3 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-10 6:56 UTC (permalink / raw) To: Linus Torvalds, Matthew Wilcox Cc: linux-arch, Russell King, linux-kernel, David Miller, linuxppc-dev, scottwood, Trent Piepho, alan On Wednesday 04 June 2008 05:07, Linus Torvalds wrote: > On Tue, 3 Jun 2008, Trent Piepho wrote: > > On Tue, 3 Jun 2008, Linus Torvalds wrote: > > > On Tue, 3 Jun 2008, Nick Piggin wrote: > > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered > > > > with one another, or operations to other memory types (ie. load/load > > > > and store/store reordering is allowed). Also, as you know, store/load > > > > reordering is explicitly allowed as well, which covers all memory > > > > types. So perhaps it is not quite true to say readl/writel is > > > > strongly ordered by default even on x86. You would have to put in > > > > some mfence instructions in them to make it so. > > > > So on x86, these could be re-ordered? > > > > writel(START_OPERATION, CONTROL_REGISTER); > > status = readl(STATUS_REGISTER); > > With both registers in a WC+ area, yes. The write may be in the WC buffers > until the WC buffers are flushed (short list: a fence, a serializing > instruction, a read-write to uncached memory, or an interrupt. There are > others, but those are the main ones). > > But if the status register is in uncached memory (which is the only *sane* > thing to do), then it doesn't matter if the control register is in WC > memory. Because the status register read is itself serializing with the WC > buffer, it's actually fine. > > So this is used for putting things like ring queues in WC memory, and fill > them up with writes, and get nice bursty write traffic with the CPU > automatically buffering it up (think "stdio.h on a really low level"). And > if you then have the command registers in UC memory or using IO port > accesses, reading and writing to them will automatically serialize. OK, I'm sitll not quite sure where this has ended up. I guess you are happy with x86 semantics as they are now. That is, all IO accesses are strongly ordered WRT one another and WRT cacheable memory (which includes keeping them within spinlocks), *unless* one asks for WC memory, in which case that memory is quite weakly ordered (and is not even ordered by a regular IO readl, at least according to AMD spec). So for WC memory, one still needs to use mb/rmb/wmb. So that still doesn't tell us what *minimum* level of ordering we should provide in the cross platform readl/writel API. Some relatively sane suggestions would be: - as strong as x86. guaranteed not to break drivers that work on x86, but slower on some archs. To me, this is most pleasing. It is much much easier to notice something is going a little slower and to work out how to use weaker ordering there, than it is to debug some once-in-a-bluemoon breakage caused by just the right architecture, driver, etc. It totally frees up the driver writer from thinking about barriers, provided they get the locking right. - ordered WRT other IO accessors, constrained within spinlocks, but not cacheable memory. This is what powerpc does now. It's a little faster for them, and probably covers the vast majority of drivers, but there are real possibilities to get it wrong (trivial example: using bit locks or mutexes or any kind of open coded locking or lockless synchronisation can break). - (less sane) same as above, but not ordered WRT spinlocks. This is what ia64 (sn2) does. From a purist POV, it is a little less arbitrary than powerpc, but in practice, it will break a lot more drivers than powerpc. I was kind of joking about taking control of this issue :) But seriously, it needs a decision to be made. I vote for #1. My rationale: I'm still finding relatively major (well, found maybe 4 or 5 in the last couple of years) bugs in the mm subsystem due to memory ordering problems. This is apparently one of the most well reviewed and tested bit of code in the kernel by people who know all about memory ordering. Not to mention that mm/ does not have to worry about IO ordering at all. Then apparently driver are the least reviewed and tested. Connect dots. Now that doesn't leave waker ordering architectures lumped with "slow old x86 semantics". Think of it as giving them the benefit of sharing x86 development and testing :) We can then formalise the relaxed __ accessors to be more complete (ie. +/- byteswapping). I'd also propose to add io_rmb/io_wmb/io_mb that order io/io access, to help architectures like sn2 where the io/cacheable barrier is pretty expensive. Any comments? ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 6:56 ` Nick Piggin @ 2008-06-10 17:41 ` Jesse Barnes 2008-06-10 18:10 ` James Bottomley 2008-06-11 4:18 ` Paul Mackerras 2008-06-11 5:20 ` Paul Mackerras 2 siblings, 1 reply; 160+ messages in thread From: Jesse Barnes @ 2008-06-10 17:41 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Monday, June 09, 2008 11:56 pm Nick Piggin wrote: > So that still doesn't tell us what *minimum* level of ordering we should > provide in the cross platform readl/writel API. Some relatively sane > suggestions would be: > > - as strong as x86. guaranteed not to break drivers that work on x86, > but slower on some archs. To me, this is most pleasing. It is much > much easier to notice something is going a little slower and to work > out how to use weaker ordering there, than it is to debug some > once-in-a-bluemoon breakage caused by just the right architecture, > driver, etc. It totally frees up the driver writer from thinking > about barriers, provided they get the locking right. > > - ordered WRT other IO accessors, constrained within spinlocks, but not > cacheable memory. This is what powerpc does now. It's a little faster > for them, and probably covers the vast majority of drivers, but there > are real possibilities to get it wrong (trivial example: using bit > locks or mutexes or any kind of open coded locking or lockless > synchronisation can break). > > - (less sane) same as above, but not ordered WRT spinlocks. This is what > ia64 (sn2) does. From a purist POV, it is a little less arbitrary than > powerpc, but in practice, it will break a lot more drivers than powerpc. > > I was kind of joking about taking control of this issue :) But seriously, > it needs a decision to be made. I vote for #1. My rationale: I'm still > finding relatively major (well, found maybe 4 or 5 in the last couple of > years) bugs in the mm subsystem due to memory ordering problems. This is > apparently one of the most well reviewed and tested bit of code in the > kernel by people who know all about memory ordering. Not to mention that > mm/ does not have to worry about IO ordering at all. Then apparently > driver are the least reviewed and tested. Connect dots. > > Now that doesn't leave waker ordering architectures lumped with "slow old > x86 semantics". Think of it as giving them the benefit of sharing x86 > development and testing :) We can then formalise the relaxed __ accessors > to be more complete (ie. +/- byteswapping). I'd also propose to add > io_rmb/io_wmb/io_mb that order io/io access, to help architectures like > sn2 where the io/cacheable barrier is pretty expensive. > > Any comments? FWIW that approach sounds pretty good to me. Arches that suffer from performance penalties can still add lower level primitives and port selected drivers over, so really they won't be losing much. AFAICT though drivers will still have to worry about regular memory ordering issues; they'll just be safe from I/O related ones. :) Still, the simplification is probably worth it. Jesse ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 17:41 ` Jesse Barnes @ 2008-06-10 18:10 ` James Bottomley 2008-06-10 19:05 ` Roland Dreier 0 siblings, 1 reply; 160+ messages in thread From: James Bottomley @ 2008-06-10 18:10 UTC (permalink / raw) To: Jesse Barnes Cc: linux-arch, Nick Piggin, Russell King, Matthew Wilcox, linux-kernel, David Miller, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Tue, 2008-06-10 at 10:41 -0700, Jesse Barnes wrote: > On Monday, June 09, 2008 11:56 pm Nick Piggin wrote: > > So that still doesn't tell us what *minimum* level of ordering we should > > provide in the cross platform readl/writel API. Some relatively sane > > suggestions would be: > > > > - as strong as x86. guaranteed not to break drivers that work on x86, > > but slower on some archs. To me, this is most pleasing. It is much > > much easier to notice something is going a little slower and to work > > out how to use weaker ordering there, than it is to debug some > > once-in-a-bluemoon breakage caused by just the right architecture, > > driver, etc. It totally frees up the driver writer from thinking > > about barriers, provided they get the locking right. > > > > - ordered WRT other IO accessors, constrained within spinlocks, but not > > cacheable memory. This is what powerpc does now. It's a little faster > > for them, and probably covers the vast majority of drivers, but there > > are real possibilities to get it wrong (trivial example: using bit > > locks or mutexes or any kind of open coded locking or lockless > > synchronisation can break). > > > > - (less sane) same as above, but not ordered WRT spinlocks. This is what > > ia64 (sn2) does. From a purist POV, it is a little less arbitrary than > > powerpc, but in practice, it will break a lot more drivers than powerpc. > > > > I was kind of joking about taking control of this issue :) But seriously, > > it needs a decision to be made. I vote for #1. My rationale: I'm still > > finding relatively major (well, found maybe 4 or 5 in the last couple of > > years) bugs in the mm subsystem due to memory ordering problems. This is > > apparently one of the most well reviewed and tested bit of code in the > > kernel by people who know all about memory ordering. Not to mention that > > mm/ does not have to worry about IO ordering at all. Then apparently > > driver are the least reviewed and tested. Connect dots. > > > > Now that doesn't leave waker ordering architectures lumped with "slow old > > x86 semantics". Think of it as giving them the benefit of sharing x86 > > development and testing :) We can then formalise the relaxed __ accessors > > to be more complete (ie. +/- byteswapping). I'd also propose to add > > io_rmb/io_wmb/io_mb that order io/io access, to help architectures like > > sn2 where the io/cacheable barrier is pretty expensive. > > > > Any comments? > > FWIW that approach sounds pretty good to me. Arches that suffer from > performance penalties can still add lower level primitives and port selected > drivers over, so really they won't be losing much. AFAICT though drivers > will still have to worry about regular memory ordering issues; they'll just > be safe from I/O related ones. :) Still, the simplification is probably > worth it. me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. About the only problem we've had is that architectures aren't very good at co-ordinating for their additional accessors so we tend to get a forest of strange ones growing up, which appear only in a few drivers (i.e. the ones that need the speed ups) and which have no well documented meaning. James ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 18:10 ` James Bottomley @ 2008-06-10 19:05 ` Roland Dreier 2008-06-10 19:19 ` Jesse Barnes 0 siblings, 1 reply; 160+ messages in thread From: Roland Dreier @ 2008-06-10 19:05 UTC (permalink / raw) To: James Bottomley Cc: linux-arch, Nick Piggin, Russell King, Matthew Wilcox, linux-kernel, Jesse Barnes, David Miller, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan > me too. That's the whole basis for readX_relaxed() and its cohorts: we > make our weirdest machines (like altix) conform to the x86 norm. Then > where it really kills us we introduce additional semantics to selected > drivers that enable us to recover I/O speed on the abnormal platforms. Except as I pointed out before, Altix doesn't conform to the norm and many (most?) drivers are missing mmiowb()s that are needed for Altix. Just no one has plugged most devices into an Altix (or haven't stressed the driver in a way that exposes problems of IO ordering between CPUs). It would be a great thing to use the powerpc trick of setting a flag that is tested by spin_unlock()/mutex_unlock() and automatically doing the mmiowb() if needed, and then killing off mmiowb() entirely. - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 19:05 ` Roland Dreier @ 2008-06-10 19:19 ` Jesse Barnes 2008-06-11 3:29 ` Nick Piggin 0 siblings, 1 reply; 160+ messages in thread From: Jesse Barnes @ 2008-06-10 19:19 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, Nick Piggin, Russell King, Matthew Wilcox, linux-kernel, David Miller, James Bottomley, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote: > > me too. That's the whole basis for readX_relaxed() and its cohorts: we > > make our weirdest machines (like altix) conform to the x86 norm. Then > > where it really kills us we introduce additional semantics to selected > > drivers that enable us to recover I/O speed on the abnormal platforms. > > Except as I pointed out before, Altix doesn't conform to the norm and > many (most?) drivers are missing mmiowb()s that are needed for Altix. > Just no one has plugged most devices into an Altix (or haven't stressed > the driver in a way that exposes problems of IO ordering between CPUs). > > It would be a great thing to use the powerpc trick of setting a flag > that is tested by spin_unlock()/mutex_unlock() and automatically doing > the mmiowb() if needed, and then killing off mmiowb() entirely. Yeah I think that's what Nick's guidelines would guarantee. And Jes is already working on the spin_unlock change you mentioned, so mmiowb() should be history soon (in name only, assuming Nick also introduces the I/O barriers he talked about for ordering the looser accessors it would still be there but would be called io_wmb or something). Jesse ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 19:19 ` Jesse Barnes @ 2008-06-11 3:29 ` Nick Piggin 2008-06-11 3:40 ` Benjamin Herrenschmidt 2008-06-11 16:07 ` Jesse Barnes 0 siblings, 2 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-11 3:29 UTC (permalink / raw) To: Jesse Barnes, linux-arch Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier, linux-kernel, David Miller, James Bottomley, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Wednesday 11 June 2008 05:19, Jesse Barnes wrote: > On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote: > > > me too. That's the whole basis for readX_relaxed() and its cohorts: > > > we make our weirdest machines (like altix) conform to the x86 norm. > > > Then where it really kills us we introduce additional semantics to > > > selected drivers that enable us to recover I/O speed on the abnormal > > > platforms. > > > > Except as I pointed out before, Altix doesn't conform to the norm and > > many (most?) drivers are missing mmiowb()s that are needed for Altix. > > Just no one has plugged most devices into an Altix (or haven't stressed > > the driver in a way that exposes problems of IO ordering between CPUs). > > > > It would be a great thing to use the powerpc trick of setting a flag > > that is tested by spin_unlock()/mutex_unlock() and automatically doing > > the mmiowb() if needed, and then killing off mmiowb() entirely. > > Yeah I think that's what Nick's guidelines would guarantee. And Jes is > already working on the spin_unlock change you mentioned, so mmiowb() should > be history soon (in name only, assuming Nick also introduces the I/O > barriers he talked about for ordering the looser accessors it would still > be there but would be called io_wmb or something). Exactly, yes. I guess everybody has had good intentions here, but as noticed, what is lacking is coordination and documentation. You mention strong ordering WRT spin_unlock, which suggests that you would prefer to take option #2 (the current powerpc one): io/io is ordered and io is contained inside spinlocks, but io/cacheable in general is not ordered. I *would* prefer that io/cacheable actually is strongly ordered with the default accessors. Because if you have that, then the driver writer never has to care about memory ordering, provided they use correct locking for SMP issues. Same as x86. With option 2, there are still windows where you could possibly have issues. For any high performance drivers that are well maintained (ie. the ones where slowdown might be noticed), everyone should have a pretty good handle on memory ordering requirements, so it shouldn't take long to go through and convert them to relaxed accessors. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 3:29 ` Nick Piggin @ 2008-06-11 3:40 ` Benjamin Herrenschmidt 2008-06-11 4:06 ` Nick Piggin 2008-06-11 16:07 ` Jesse Barnes 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-11 3:40 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier, linux-kernel, Jesse Barnes, David Miller, James Bottomley, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote: > > Exactly, yes. I guess everybody has had good intentions here, but > as noticed, what is lacking is coordination and documentation. > > You mention strong ordering WRT spin_unlock, which suggests that > you would prefer to take option #2 (the current powerpc one): io/io > is ordered and io is contained inside spinlocks, but io/cacheable > in general is not ordered. IO/cacheable -is- ordered on powepc in what we believe is the direction that matter: IO reads are fully ordered vs. anything and IO writes are ordered vs. previous cacheable stores. The only "relaxed" situation is IO writes followed by cacheable stores, which I believe shouldn't be a problem. (except for spinlocks for which we use the flag trick) > I *would* prefer that io/cacheable actually is strongly ordered with > the default accessors. Because if you have that, then the driver > writer never has to care about memory ordering, provided they use > correct locking for SMP issues. Same as x86. With option 2, there > are still windows where you could possibly have issues. > > For any high performance drivers that are well maintained (ie. the > ones where slowdown might be noticed), everyone should have a pretty > good handle on memory ordering requirements, so it shouldn't take > long to go through and convert them to relaxed accessors. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 3:40 ` Benjamin Herrenschmidt @ 2008-06-11 4:06 ` Nick Piggin 0 siblings, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-11 4:06 UTC (permalink / raw) To: benh Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier, linux-kernel, Jesse Barnes, David Miller, James Bottomley, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Wednesday 11 June 2008 13:40, Benjamin Herrenschmidt wrote: > On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote: > > Exactly, yes. I guess everybody has had good intentions here, but > > as noticed, what is lacking is coordination and documentation. > > > > You mention strong ordering WRT spin_unlock, which suggests that > > you would prefer to take option #2 (the current powerpc one): io/io > > is ordered and io is contained inside spinlocks, but io/cacheable > > in general is not ordered. > > IO/cacheable -is- ordered on powepc in what we believe is the direction > that matter: IO reads are fully ordered vs. anything and IO writes are > ordered vs. previous cacheable stores. The only "relaxed" situation is > IO writes followed by cacheable stores, which I believe shouldn't be > a problem. (except for spinlocks for which we use the flag trick) Spinlocks... mutexes, semaphores, rwsems, rwlocks, bit spinlocks, bit mutexes, open coded bit locks (of which there are a number floating around in drivers/). But even assuming you get all of that fixed up. I wonder what is such a big benefit to powerpc that you'll rather add the exception "cacheable stores are not ordered with previous io stores" than to say "any driver which works on x86 will work on powerpc as far as memory ordering goes"? (don't you also need something to order io reads with cacheable reads? as per my observation that your rmb is broken according to IBM docs) Obviously you already have a sync instruction in your writel, so 1) adding a second one doesn't slow it down by an order of mangnitude or anything, just some small factor; and 2) you obviously want to be converting high performance drivers to a more relaxed model anyway regardless of whether there is one sync or two in there. Has this ever been measured or thought carefully about? Beyond the extent of "one sync good, two sync bad" ;) ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 3:29 ` Nick Piggin 2008-06-11 3:40 ` Benjamin Herrenschmidt @ 2008-06-11 16:07 ` Jesse Barnes 2008-06-12 11:27 ` Nick Piggin 1 sibling, 1 reply; 160+ messages in thread From: Jesse Barnes @ 2008-06-11 16:07 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier, linux-kernel, David Miller, James Bottomley, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote: > On Wednesday 11 June 2008 05:19, Jesse Barnes wrote: > > On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote: > > > > me too. That's the whole basis for readX_relaxed() and its cohorts: > > > > we make our weirdest machines (like altix) conform to the x86 norm. > > > > Then where it really kills us we introduce additional semantics to > > > > selected drivers that enable us to recover I/O speed on the abnormal > > > > platforms. > > > > > > Except as I pointed out before, Altix doesn't conform to the norm and > > > many (most?) drivers are missing mmiowb()s that are needed for Altix. > > > Just no one has plugged most devices into an Altix (or haven't stressed > > > the driver in a way that exposes problems of IO ordering between CPUs). > > > > > > It would be a great thing to use the powerpc trick of setting a flag > > > that is tested by spin_unlock()/mutex_unlock() and automatically doing > > > the mmiowb() if needed, and then killing off mmiowb() entirely. > > > > Yeah I think that's what Nick's guidelines would guarantee. And Jes is > > already working on the spin_unlock change you mentioned, so mmiowb() > > should be history soon (in name only, assuming Nick also introduces the > > I/O barriers he talked about for ordering the looser accessors it would > > still be there but would be called io_wmb or something). > > Exactly, yes. I guess everybody has had good intentions here, but > as noticed, what is lacking is coordination and documentation. > > You mention strong ordering WRT spin_unlock, which suggests that > you would prefer to take option #2 (the current powerpc one): io/io > is ordered and io is contained inside spinlocks, but io/cacheable > in general is not ordered. I was thinking it would be good for the weaker accessors, but now that I think about it you could just use the new io_* barrier functions. I didn't mean to imply that I wasn't in favor of the io/cacheable ordering as well. > For any high performance drivers that are well maintained (ie. the > ones where slowdown might be noticed), everyone should have a pretty > good handle on memory ordering requirements, so it shouldn't take > long to go through and convert them to relaxed accessors. Yep. Thanks for working on this, Nick, it's definitely a good thing that you're taking control of it. :) Thanks, Jesse ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 16:07 ` Jesse Barnes @ 2008-06-12 11:27 ` Nick Piggin 0 siblings, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-12 11:27 UTC (permalink / raw) To: Jesse Barnes Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier, linux-kernel, David Miller, James Bottomley, linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan On Thursday 12 June 2008 02:07, Jesse Barnes wrote: > On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote: > > You mention strong ordering WRT spin_unlock, which suggests that > > you would prefer to take option #2 (the current powerpc one): io/io > > is ordered and io is contained inside spinlocks, but io/cacheable > > in general is not ordered. > > I was thinking it would be good for the weaker accessors, but now that I > think about it you could just use the new io_* barrier functions. > > I didn't mean to imply that I wasn't in favor of the io/cacheable ordering > as well. > > > For any high performance drivers that are well maintained (ie. the > > ones where slowdown might be noticed), everyone should have a pretty > > good handle on memory ordering requirements, so it shouldn't take > > long to go through and convert them to relaxed accessors. > > Yep. Thanks for working on this, Nick, it's definitely a good thing that > you're taking control of it. :) Well, I really am just trying to help the kernel for everyone (and every architecture). Performance for all architectures really is my #2 priority, so if any arch becomes irrepearably slower under a proposal I would go back to the drawing board. I'll come up with a proposal in the form of an initial code+documentation patch when I get some more time on it. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 6:56 ` Nick Piggin 2008-06-10 17:41 ` Jesse Barnes @ 2008-06-11 4:18 ` Paul Mackerras 2008-06-11 5:00 ` Nick Piggin 2008-06-11 5:20 ` Paul Mackerras 2 siblings, 1 reply; 160+ messages in thread From: Paul Mackerras @ 2008-06-11 4:18 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Nick Piggin writes: > OK, I'm sitll not quite sure where this has ended up. I guess you are happy > with x86 semantics as they are now. That is, all IO accesses are strongly > ordered WRT one another and WRT cacheable memory (which includes keeping > them within spinlocks), My understanding was that on x86, loads could pass stores in general, i.e. a later load could be performed before an earlier store. I guess that can't be true for uncached loads, but could a cacheable load be performed before an earlier uncached store? > - as strong as x86. guaranteed not to break drivers that work on x86, > but slower on some archs. To me, this is most pleasing. It is much > much easier to notice something is going a little slower and to work > out how to use weaker ordering there, than it is to debug some > once-in-a-bluemoon breakage caused by just the right architecture, > driver, etc. It totally frees up the driver writer from thinking > about barriers, provided they get the locking right. I just wish we had even one actual example of things going wrong with the current rules we have on powerpc to motivate changing to this model. > Now that doesn't leave waker ordering architectures lumped with "slow old > x86 semantics". Think of it as giving them the benefit of sharing x86 > development and testing :) We can then formalise the relaxed __ accessors > to be more complete (ie. +/- byteswapping). That leaves a gulf between the extremely strongly ordered writel etc. and the extremely weakly ordered __writel etc. The current powerpc scheme is fine for a lot of drivers but your proposal would leave us no way to deliver it to them. Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 4:18 ` Paul Mackerras @ 2008-06-11 5:00 ` Nick Piggin 2008-06-11 5:13 ` Paul Mackerras 2008-06-11 14:46 ` Linus Torvalds 0 siblings, 2 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-11 5:00 UTC (permalink / raw) To: Paul Mackerras Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wednesday 11 June 2008 14:18, Paul Mackerras wrote: > Nick Piggin writes: > > OK, I'm sitll not quite sure where this has ended up. I guess you are > > happy with x86 semantics as they are now. That is, all IO accesses are > > strongly ordered WRT one another and WRT cacheable memory (which includ= es > > keeping them within spinlocks), > > My understanding was that on x86, loads could pass stores in general, > i.e. a later load could be performed before an earlier store. Yes, this is the one reordering allowed by the ISA on cacheable memory. WC memory is weaker, which Linus wants to allow exception for because one must explicitly ask for it. UC memory (which presumably is what we're talking about as "IO access") I think is stronger in that it does not allow any reordering between one another or any cacheable access: AMD says this: c =E2=80=94 A store (wp,wt,wb,uc,wc,wc+) may not pass a previous load=20 (wp,wt,wb,uc,wc,wc+). f =E2=80=94 A load (uc) does not pass a previous store (wp,wt,wb,uc,wc,wc+). g =E2=80=94 A store (wp,wt,wb,uc) does not pass a previous store (wp,wt,wb,= uc). i =E2=80=94 A load (wp,wt,wb,wc,wc+) does not pass a previous store (uc). AMD does allow WC/WC+ to be weakly ordered WRT WC as well as UC, which Intel seemingly does not. But we're alrady treating WC as special. I can't actually find the definitive statement in the Intel manuals saying UC is strongly ordered also WRT WB. Linus? > I guess=20 > that can't be true for uncached loads, but could a cacheable load be > performed before an earlier uncached store? > > - as strong as x86. guaranteed not to break drivers that work on x86, > > but slower on some archs. To me, this is most pleasing. It is much > > much easier to notice something is going a little slower and to work > > out how to use weaker ordering there, than it is to debug some > > once-in-a-bluemoon breakage caused by just the right architecture, > > driver, etc. It totally frees up the driver writer from thinking > > about barriers, provided they get the locking right. > > I just wish we had even one actual example of things going wrong with > the current rules we have on powerpc to motivate changing to this > model. ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l 506 How sure are you that none of those forms part of a cobbled-together locking scheme that hopes to constrain some IO access? ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | grep while | wc -l 29 How about those? ~/usr/src/linux-2.6> git grep mutex_lock drivers/ | wc -l 3138 How sure are you that none of them is hoping to constrain IO operations within the lock? Also grep for down, down_write, write_lock, and maybe some others I've forgotten. And then forget completely about locking and imagine some of the open coded things you see around the place (or parts where drivers try to get creative and open code their own locking or try lockless things). > > Now that doesn't leave waker ordering architectures lumped with "slow o= ld > > x86 semantics". Think of it as giving them the benefit of sharing x86 > > development and testing :) We can then formalise the relaxed __ accesso= rs > > to be more complete (ie. +/- byteswapping). > > That leaves a gulf between the extremely strongly ordered writel > etc. and the extremely weakly ordered __writel etc. The current > powerpc scheme is fine for a lot of drivers but your proposal would > leave us no way to deliver it to them. But surely you have to audit the drivers anyway to ensure they are OK with the current powerpc scheme. In which case, once you have audited them and know they are safe, you can easily convert them to the even _faster_ __readl/__writel, and just add the appropriate barriers. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 5:00 ` Nick Piggin @ 2008-06-11 5:13 ` Paul Mackerras 2008-06-11 5:35 ` Nick Piggin 2008-06-11 14:46 ` Linus Torvalds 1 sibling, 1 reply; 160+ messages in thread From: Paul Mackerras @ 2008-06-11 5:13 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Nick Piggin writes: > > I just wish we had even one actual example of things going wrong with > > the current rules we have on powerpc to motivate changing to this > > model. > > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l > 506 > How sure are you that none of those forms part of a cobbled-together > locking scheme that hopes to constrain some IO access? My comment was precisely about the fact that this sort of argument is actually FUD. I want one example that is demonstrably wrong, not just a "how sure are you". > But surely you have to audit the drivers anyway to ensure they are OK > with the current powerpc scheme. In which case, once you have audited > them and know they are safe, you can easily convert them to the even > _faster_ __readl/__writel, and just add the appropriate barriers. The trouble is that as code gets maintained, using __writel + explicit barriers is more fragile because some people will change the code, or add new code, without understanding the barriers. So whenever a driver gets converted to using __writel + barriers, we will end up having to watch every change that goes into it forever. Whereas with the current scheme there's a much smaller set of gotchas to watch out for, and the gotchas are things that already raise red flags, such as open-coded locking and any sort of "clever" lockless scheme. Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 5:13 ` Paul Mackerras @ 2008-06-11 5:35 ` Nick Piggin 2008-06-11 6:02 ` Nick Piggin 2008-06-12 12:14 ` Paul Mackerras 0 siblings, 2 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-11 5:35 UTC (permalink / raw) To: Paul Mackerras Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wednesday 11 June 2008 15:13, Paul Mackerras wrote: > Nick Piggin writes: > > > I just wish we had even one actual example of things going wrong with > > > the current rules we have on powerpc to motivate changing to this > > > model. > > > > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l > > 506 > > How sure are you that none of those forms part of a cobbled-together > > locking scheme that hopes to constrain some IO access? > > My comment was precisely about the fact that this sort of argument is > actually FUD. I want one example that is demonstrably wrong, not just > a "how sure are you". But this is a case where you really should be scared, so FUD is completely appropriate. At least, I don't see how it is at all worse than FUD about how much slower everything will be with stronger ordering, and how nobody will be able to do anything about it. In reality, for most devices it will not be measurable, and for some like network or disk might use a bit of a tweak in one or two fastpaths. If the driver author honestly does not know the code well enough to say whether a particular ordering is allowed or not, then it should just not be allowed. Anyway, what do I win if I give you a concrete example? I found one in the first file I picked out of my grep: drivers/net/s2io.c if (test_and_set_bit(__S2IO_STATE_LINK_TASK, &(nic->state))) { /* The card is being reset, no point doing anything */ goto out_unlock; } ... val64 = readq(&bar0->adapter_control); val64 |= ADAPTER_LED_ON; writeq(val64, &bar0->adapter_control); s2io_link(nic, LINK_UP); } else { if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic->device_type, subid)) { val64 = readq(&bar0->gpio_control); val64 &= ~GPIO_CTRL_GPIO_0; writeq(val64, &bar0->gpio_control); val64 = readq(&bar0->gpio_control); } /* turn off LED */ val64 = readq(&bar0->adapter_control); val64 = val64 &(~ADAPTER_LED_ON); writeq(val64, &bar0->adapter_control); s2io_link(nic, LINK_DOWN); } clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state)); Now I can't say definitively that this is going to be wrong on powerpc, because I don't know the code well enough. But I'd be 90% sure that the unlock is not supposed to be visible to other CPUs before the writeqs are queued to the card. On x86 it wouldn't be. > > But surely you have to audit the drivers anyway to ensure they are OK > > with the current powerpc scheme. In which case, once you have audited > > them and know they are safe, you can easily convert them to the even > > _faster_ __readl/__writel, and just add the appropriate barriers. > > The trouble is that as code gets maintained, using __writel + explicit > barriers is more fragile because some people will change the code, or > add new code, without understanding the barriers. If the relaxed writel and explicit barriers are not well documented and well understood, then they should not be used. And if the memory ordering issues are not well understood, then it should be done with locking and strongly ordered IO accessors. > So whenever a > driver gets converted to using __writel + barriers, we will end up > having to watch every change that goes into it forever. Whereas with *You* don't have to watch it, the maintainer does. And if they can't understand the barriers, then it should not be converted to use them. Unless you want to take over maintainership. And if you do, then you should be watching all changes that go into it. > the current scheme there's a much smaller set of gotchas to watch out > for, and the gotchas are things that already raise red flags, such as > open-coded locking and any sort of "clever" lockless scheme. And with the strongly ordered scheme, there is much smaller set of gotchas again, and it behaves the same as the x86 it was developed on, and you don't get confused by the list of rules: IO access is either strongly ordered or weakly ordered. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 5:35 ` Nick Piggin @ 2008-06-11 6:02 ` Nick Piggin 2008-06-12 12:14 ` Paul Mackerras 1 sibling, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-11 6:02 UTC (permalink / raw) To: Paul Mackerras Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wednesday 11 June 2008 15:35, Nick Piggin wrote: > On Wednesday 11 June 2008 15:13, Paul Mackerras wrote: > > Nick Piggin writes: > > > > I just wish we had even one actual example of things going wrong with > > > > the current rules we have on powerpc to motivate changing to this > > > > model. > > > > > > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l > > > 506 > > > How sure are you that none of those forms part of a cobbled-together > > > locking scheme that hopes to constrain some IO access? > > > > My comment was precisely about the fact that this sort of argument is > > actually FUD. I want one example that is demonstrably wrong, not just > > a "how sure are you". > > But this is a case where you really should be scared, so FUD is > completely appropriate. > > At least, I don't see how it is at all worse than FUD about how much > slower everything will be with stronger ordering, and how nobody will > be able to do anything about it. In reality, for most devices it will > not be measurable, and for some like network or disk might use a bit > of a tweak in one or two fastpaths. > > If the driver author honestly does not know the code well enough to > say whether a particular ordering is allowed or not, then it should > just not be allowed. > > > Anyway, what do I win if I give you a concrete example? > > I found one in the first file I picked out of my grep: drivers/net/s2io.c > > if (test_and_set_bit(__S2IO_STATE_LINK_TASK, &(nic->state))) { > /* The card is being reset, no point doing anything */ > goto out_unlock; > } > > ... > val64 = readq(&bar0->adapter_control); > val64 |= ADAPTER_LED_ON; > writeq(val64, &bar0->adapter_control); > s2io_link(nic, LINK_UP); > } else { > if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic->device_type, > subid)) { > val64 = readq(&bar0->gpio_control); > val64 &= ~GPIO_CTRL_GPIO_0; > writeq(val64, &bar0->gpio_control); > val64 = readq(&bar0->gpio_control); > } > /* turn off LED */ > val64 = readq(&bar0->adapter_control); > val64 = val64 &(~ADAPTER_LED_ON); > writeq(val64, &bar0->adapter_control); > s2io_link(nic, LINK_DOWN); > } > clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state)); > > Now I can't say definitively that this is going to be wrong on > powerpc, because I don't know the code well enough. But I'd be > 90% sure that the unlock is not supposed to be visible to > other CPUs before the writeqs are queued to the card. On x86 it > wouldn't be. And I hope you're reading this, sn2 guys :) The mmiowb() audit for spin_unlock was a nice idea, but this example shows that unless you *really* know the code and have actually audited everything, then you should not be messing with relaxing of ordering etc. Now obviously the example I quote is a slowpath, in which case it can very happily continue to use strongly ordered io accessors, and you will never ever notice the extra sync or PCI read or whatever after the writeq. Then you can easily relax the couple of fastpaths that matter, put barriers in there, comment them, and you're away: you now have a *working* driver (that doesn't randomly lose adapter_control bits twice a year), and that is also faster than it was when it was broken because you've specifically used the very weakly ordered IO accessors in the fast paths that matter. And when somebody plugs in some obscure hardware into your box that you haven't audited and put mmiowb etc etc into, then guess what? It is also not going to blow up in an undebuggable way 6 months later in the middle of <insert something important here>. The worst case is that the user notices it is running a bit slowly and sends a full profile output the next day. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 5:35 ` Nick Piggin 2008-06-11 6:02 ` Nick Piggin @ 2008-06-12 12:14 ` Paul Mackerras 2008-06-12 13:08 ` Nick Piggin 1 sibling, 1 reply; 160+ messages in thread From: Paul Mackerras @ 2008-06-12 12:14 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Nick Piggin writes: > /* turn off LED */ > val64 = readq(&bar0->adapter_control); > val64 = val64 &(~ADAPTER_LED_ON); > writeq(val64, &bar0->adapter_control); > s2io_link(nic, LINK_DOWN); > } > clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state)); > > Now I can't say definitively that this is going to be wrong on > powerpc, because I don't know the code well enough. But I'd be > 90% sure that the unlock is not supposed to be visible to > other CPUs before the writeqs are queued to the card. On x86 it > wouldn't be. Interestingly, there is also a store to cacheable memory (nic->device_enabled_once), but no smp_wmb or equivalent before the clear_bit. So there are other potential problems here besides the I/O related ones. Anyway, I have done some tests on a dual G5 here with putting a sync on both sides of the store in writel etc. (i.e. making readl/writel strongly ordered w.r.t. everything else), and as you predicted, there wasn't a noticeable performance degradation, at least not on the couple of things I tried. So I am now inclined to accept your suggestion that we should do that. I should probably do some similar checks on POWER6 and a few other machines first, though. Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-12 12:14 ` Paul Mackerras @ 2008-06-12 13:08 ` Nick Piggin 0 siblings, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-12 13:08 UTC (permalink / raw) To: Paul Mackerras Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Thursday 12 June 2008 22:14, Paul Mackerras wrote: > Nick Piggin writes: > > /* turn off LED */ > > val64 = readq(&bar0->adapter_control); > > val64 = val64 &(~ADAPTER_LED_ON); > > writeq(val64, &bar0->adapter_control); > > s2io_link(nic, LINK_DOWN); > > } > > clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state)); > > > > Now I can't say definitively that this is going to be wrong on > > powerpc, because I don't know the code well enough. But I'd be > > 90% sure that the unlock is not supposed to be visible to > > other CPUs before the writeqs are queued to the card. On x86 it > > wouldn't be. > > Interestingly, there is also a store to cacheable memory > (nic->device_enabled_once), but no smp_wmb or equivalent before the > clear_bit. So there are other potential problems here besides the I/O > related ones. Yeah there sure is. That sucks too, but we go one step at a time ;) I think proposing a strong ordering between set_bit/clear_bit would actually be quite noticable slowdown in core kernel code at this point. Which reminds me, I have been meaning to do another pass of test and set bit / clear bit conversions to the _lock primitives... > Anyway, I have done some tests on a dual G5 here with putting a sync > on both sides of the store in writel etc. (i.e. making readl/writel > strongly ordered w.r.t. everything else), and as you predicted, there > wasn't a noticeable performance degradation, at least not on the > couple of things I tried. So I am now inclined to accept your > suggestion that we should do that. I should probably do some similar > checks on POWER6 and a few other machines first, though. Oh good, thanks for looking into it. I guess it might be a little more noticable on bigger POWER systems. And I think we might even need to do a PCI read after every writel on sn2 systems in order to get the semantics I want. I can't say it won't be noticable. But if we consider the number of drivers (maybe one or two dozen well maintained ones), and number of sites in each driver (maybe one or two submission and completion fastpaths which should have a minimum of IO operations in each one) that will have to be converted in order to get performance as good or better than it is currently with relaxed accessors.... and weigh that against all the places in those and every other crappy obscure driver that we *won't* have to audit, I really think we end up with a net win even with some short term pain. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-11 5:00 ` Nick Piggin 2008-06-11 5:13 ` Paul Mackerras @ 2008-06-11 14:46 ` Linus Torvalds 1 sibling, 0 replies; 160+ messages in thread From: Linus Torvalds @ 2008-06-11 14:46 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, Paul Mackerras, scottwood, David Miller, alan On Wed, 11 Jun 2008, Nick Piggin wrote: > > I can't actually find the definitive statement in the Intel manuals > saying UC is strongly ordered also WRT WB. Linus? Definitive? Dunno. But look in the Architecture manual, volume 3A, 10.3 "Methods of Caching Available", and then under the bullet about Write Combining (WC), it says the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE of MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution. However, it's worth noting that - documentation can be wrong, or even if right, can be Intel-specific. - the above is expressly _only_ about the WC buffer, not about regular memory writes. Cached memory accesses are different from WC accesses. so in the end, the thing that matters is how things actually work. Linus ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-10 6:56 ` Nick Piggin 2008-06-10 17:41 ` Jesse Barnes 2008-06-11 4:18 ` Paul Mackerras @ 2008-06-11 5:20 ` Paul Mackerras 2 siblings, 0 replies; 160+ messages in thread From: Paul Mackerras @ 2008-06-11 5:20 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan Nick Piggin writes: > Now that doesn't leave waker ordering architectures lumped with "slow old > x86 semantics". Think of it as giving them the benefit of sharing x86 > development and testing :) Worth something, but not perhaps as much as you think, given that many x86 driver writers still don't pay much attention to making their code endian-safe and 64-bit-clean. And there are still plenty of drivers that use virt_to_bus... Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 14:47 ` Linus Torvalds 2008-06-03 18:47 ` Trent Piepho @ 2008-06-04 2:19 ` Nick Piggin 1 sibling, 0 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-04 2:19 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev, scottwood, David Miller, alan On Wednesday 04 June 2008 00:47, Linus Torvalds wrote: > On Tue, 3 Jun 2008, Nick Piggin wrote: > > Linus: on x86, memory operations to wc and wc+ memory are not ordered > > with one another, or operations to other memory types (ie. load/load > > and store/store reordering is allowed). Also, as you know, store/load > > reordering is explicitly allowed as well, which covers all memory > > types. So perhaps it is not quite true to say readl/writel is strongly > > ordered by default even on x86. You would have to put in some > > mfence instructions in them to make it so. > > Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that > does that needs to be aware of it. IOW, it's a non-issue, imnsho. Ah, yes UC is strongly ordered WRT all others *except* WC/WC+. But WC memory is not an x86 specific thing right, so do we need some accessors for WC memory? Or can we just throw that in the weakly ordered pile, and ensure mb/rmb/wmb does the right thing for them. And you want readl/writel to be strongly ordered like x86 on all architectures, no exceptions? This will slow some things down, but if we then also provide explicitly weakly ordered instructions (and add io_mb/io_rmb/io_wmb) then at least it gives the framework for drivers to be written to run on those architectures. The other thing we could do is mandate only that readl/writel will be ordered WRT one another, *and* with spinlocks, but otherwise not with cacheable RAM... ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 4:16 ` Nick Piggin 2008-06-03 4:32 ` Benjamin Herrenschmidt 2008-06-03 14:47 ` Linus Torvalds @ 2008-06-03 19:43 ` Trent Piepho 2008-06-03 21:33 ` Matthew Wilcox 2008-06-03 22:26 ` Benjamin Herrenschmidt 2 siblings, 2 replies; 160+ messages in thread From: Trent Piepho @ 2008-06-03 19:43 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 3 Jun 2008, Nick Piggin wrote: > On Monday 02 June 2008 17:24, Russell King wrote: >> So, can the semantics of what's expected from these IO accessor >> functions be documented somewhere. Please? Before this thread gets >> lost in the depths of time? > > This whole thread also ties in with my posts about mmiowb (which IMO > should go away). > > readl/writel: strongly ordered wrt one another and other stores > to cacheable RAM, byteswapping > __readl/__writel: not ordered (needs mb/rmb/wmb to order with > other readl/writel and cacheable operations, or > io_*mb to order with one another) > raw_readl/raw_writel: strongly ordered, no byteswapping > __raw_readl/__raw_writel: not ordered, no byteswapping Byte-swapping vs not byte-swapping is not usually what the programmer wants. Usually your device's registers are defined as being big-endian or little-endian and you want whatever is needed to give you that. I believe that on some archs that can be either byte order, some built-in devices will change their registers to match, and so you want "native endian" or no swapping for these. Though that's definitely in the minority. An accessors that always byte-swaps regardless of the endianness of the host is never something I've seen a driver want. IOW, there are four ways one can defined endianness/swapping: 1) Little-endian 2) Big-endian 3) Native-endian aka non-byte-swapping 4) Foreign-endian aka byte-swapping 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet our API is providing 3 & 4, the two which are the least useful. Is it enough to provide only "all or none" for ordering strictness? For instance on powerpc, one can get a speedup by dropping strict ordering for IO vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. This is much easier to program for than no ordering at all. In fact, if one doesn't use coherent DMA, it's basically the same as fully strict ordering. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 19:43 ` Trent Piepho @ 2008-06-03 21:33 ` Matthew Wilcox 2008-06-03 21:44 ` Trent Piepho 2008-06-03 22:26 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 160+ messages in thread From: Matthew Wilcox @ 2008-06-03 21:33 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote: > IOW, there are four ways one can defined endianness/swapping: > 1) Little-endian > 2) Big-endian > 3) Native-endian aka non-byte-swapping > 4) Foreign-endian aka byte-swapping > > 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet > our API is providing 3 & 4, the two which are the least useful. You've fundamentally misunderstood. readX/writeX and __readX/__writeX provide little-endian access. __raw_readX provide native-endian. If you want 2 or 4, define your own accessors. Some architectures define other accessors (eg gsc_readX on parisc is native (big) endian, and works on physical addresses that haven't been ioremapped. sbus_readX on sparc64 also seems to be native (big) endian). > Is it enough to provide only "all or none" for ordering strictness? For > instance on powerpc, one can get a speedup by dropping strict ordering for > IO > vs cacheable memory, but still keeping ordering for IO vs IO and IO vs > locks. This is much easier to program for than no ordering at all. In > fact, if one > doesn't use coherent DMA, it's basically the same as fully strict ordering. I don't understand why you keep talking about DMA. Are you talking about ordering between readX() and DMA? PCI proides those guarantees. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 21:33 ` Matthew Wilcox @ 2008-06-03 21:44 ` Trent Piepho 2008-06-04 2:25 ` Nick Piggin 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-06-03 21:44 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 3 Jun 2008, Matthew Wilcox wrote: > On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote: >> IOW, there are four ways one can defined endianness/swapping: >> 1) Little-endian >> 2) Big-endian >> 3) Native-endian aka non-byte-swapping >> 4) Foreign-endian aka byte-swapping >> >> 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet >> our API is providing 3 & 4, the two which are the least useful. > > You've fundamentally misunderstood. > > readX/writeX and __readX/__writeX provide little-endian access. > __raw_readX provide native-endian. > > If you want 2 or 4, define your own accessors. Some architectures define > other accessors (eg gsc_readX on parisc is native (big) endian, and How about providing 1 and 2, and if you want 3 or 4 define your own accessors? >> Is it enough to provide only "all or none" for ordering strictness? For >> instance on powerpc, one can get a speedup by dropping strict ordering for >> IO >> vs cacheable memory, but still keeping ordering for IO vs IO and IO vs >> locks. This is much easier to program for than no ordering at all. In >> fact, if one >> doesn't use coherent DMA, it's basically the same as fully strict ordering. > > I don't understand why you keep talking about DMA. Are you talking > about ordering between readX() and DMA? PCI proides those guarantees. I guess you haven't been reading the whole thread. The reason it started was because gcc can re-order powerpc (and everyone else's too) IO accesses vs accesses to cachable memory (but not spin-locks), which ends up only being a problem with coherent DMA. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 21:44 ` Trent Piepho @ 2008-06-04 2:25 ` Nick Piggin 2008-06-04 6:39 ` Trent Piepho 0 siblings, 1 reply; 160+ messages in thread From: Nick Piggin @ 2008-06-04 2:25 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wednesday 04 June 2008 07:44, Trent Piepho wrote: > On Tue, 3 Jun 2008, Matthew Wilcox wrote: > > I don't understand why you keep talking about DMA. Are you talking > > about ordering between readX() and DMA? PCI proides those guarantees. > > I guess you haven't been reading the whole thread. The reason it started > was because gcc can re-order powerpc (and everyone else's too) IO accesses > vs accesses to cachable memory (but not spin-locks), which ends up only > being a problem with coherent DMA. I don't think it is only a problem with coherent DMA. CPU0 CPU1 mutex_lock(mutex); writel(something, DATA_REG); writel(GO, CTRL_REG); started = 1; mutex_unlock(mutex); mutex_lock(mutex); if (started) /* oops, this can reach device before GO */ writel(STOP, CTRL_REG); ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-04 2:25 ` Nick Piggin @ 2008-06-04 6:39 ` Trent Piepho 0 siblings, 0 replies; 160+ messages in thread From: Trent Piepho @ 2008-06-04 6:39 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Wed, 4 Jun 2008, Nick Piggin wrote: > On Wednesday 04 June 2008 07:44, Trent Piepho wrote: >> On Tue, 3 Jun 2008, Matthew Wilcox wrote: > >>> I don't understand why you keep talking about DMA. Are you talking >>> about ordering between readX() and DMA? PCI proides those guarantees. >> >> I guess you haven't been reading the whole thread. The reason it started >> was because gcc can re-order powerpc (and everyone else's too) IO accesses >> vs accesses to cachable memory (but not spin-locks), which ends up only >> being a problem with coherent DMA. > > I don't think it is only a problem with coherent DMA. > > CPU0 CPU1 > mutex_lock(mutex); > writel(something, DATA_REG); > writel(GO, CTRL_REG); > started = 1; (A) > mutex_unlock(mutex); > mutex_lock(mutex); (B) > if (started) > /* oops, this can reach device before GO */ > writel(STOP, CTRL_REG); The locks themselves should have (and do have) ordering operations to insure gcc and/or the cpu can't move a store or load outside the locked region. Generally you need that to keep stores/loads to cacheable memory inside the critical area, much less I/O operations. Otherwise all you have to do is replace writel(something, ...) with shared_data->something = ... and there's an obvious problem. In your example, gcc currently can and will move the GO operation to point A (if it can figure out that CTRL_REG and started aren't aliased), but that's not a problem. If it could move it to B that would be a problem, but it can't. Other than coherent DMA, I don't think there is any reason to care if I/O accessors are strongly ordered wrt load/stores to cacheable memory. locking and streaming DMA sync operations already need to have ordering, so they don't require all I/O to be ordered wrt all cacheable memory. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 19:43 ` Trent Piepho 2008-06-03 21:33 ` Matthew Wilcox @ 2008-06-03 22:26 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-03 22:26 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 2008-06-03 at 12:43 -0700, Trent Piepho wrote: > > Byte-swapping vs not byte-swapping is not usually what the programmer wants. > Usually your device's registers are defined as being big-endian or > little-endian and you want whatever is needed to give you that. Yes, which is why I (and some other archs) have writel_be/readl_be. The standard writel/readl being LE. However, the "raw" variants are defined to be native endian, which is of some use to -some- archs apparently where they have SoC device whose endianness follow the core. > I believe that on some archs that can be either byte order, some built-in > devices will change their registers to match, and so you want "native endian" > or no swapping for these. Though that's definitely in the minority. > > An accessors that always byte-swaps regardless of the endianness of the host > is never something I've seen a driver want. > > IOW, there are four ways one can defined endianness/swapping: > 1) Little-endian > 2) Big-endian > 3) Native-endian aka non-byte-swapping > 4) Foreign-endian aka byte-swapping > > 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet > our API is providing 3 & 4, the two which are the least useful. No, we don't provide 4, it was something unclear with nick. We provide 1. (writel/readl and __variants), some archs provide 2 (writel_be/readl_be, tho I don't have __variants, I suppose I could), and everybody provides 3. though in some cases (like us) only in the form of __variants (ie, non ordered, like __raw_readl/__raw_writel). Nick's proposal is to plug those gaps, though it's, I believe, missing the _be variants. > Is it enough to provide only "all or none" for ordering strictness? For > instance on powerpc, one can get a speedup by dropping strict ordering for IO > vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. > This is much easier to program for than no ordering at all. In fact, if one > doesn't use coherent DMA, it's basically the same as fully strict ordering. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 2:28 ` David Miller 2008-05-27 3:39 ` Benjamin Herrenschmidt @ 2008-05-27 3:42 ` Arjan van de Ven 2008-05-27 4:08 ` Roland Dreier 2008-05-27 7:08 ` Benjamin Herrenschmidt 2008-05-27 8:24 ` Alan Cox 2 siblings, 2 replies; 160+ messages in thread From: Arjan van de Ven @ 2008-05-27 3:42 UTC (permalink / raw) To: David Miller Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds, tpiepho, alan On Mon, 26 May 2008 19:28:12 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Tue, 27 May 2008 12:15:40 +1000 > > > Some of them. USB comes to mind. I'd be happy to make it "the rule" > > and document that MMIO vs. coherent access aren't implicitely > > ordered. I would still keep them ordered on powerpc for a little > > while tho until I'm happy enough with driver auditing. > > > > But heh, it's you who was telling me that it would be a bad > > engineering decision and we had to make everybody look like x86 & > > fully ordered :-) I decided to agree back then and stuck all those > > nasty heavy barriers in the powerpc variants of readl/writel/... > > I still believe this. > > It's just another complicated thing for driver authors to get wrong. > The other side of the coin is of course the cost. > > The only thing I am absolutely sure of is that we should make a > decision fast, document it, and just stick to it. ... and try to find a way to test for it at runtime or compile time. either via sparse or some fancy lockdep like "device store" thing? If we can't test for it and it doesn't show up on x86 ... it'll just be an eterrnal chase. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 3:42 ` Arjan van de Ven @ 2008-05-27 4:08 ` Roland Dreier 2008-05-27 4:20 ` Arjan van de Ven 2008-05-27 7:08 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 160+ messages in thread From: Roland Dreier @ 2008-05-27 4:08 UTC (permalink / raw) To: Arjan van de Ven Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, torvalds, David Miller, alan > either via sparse or some fancy lockdep like "device store" thing? > If we can't test for it and it doesn't show up on x86 ... it'll just be > an eterrnal chase. Ben's point is that it will start showing up on x86 because newer compilers are reordering things... - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 4:08 ` Roland Dreier @ 2008-05-27 4:20 ` Arjan van de Ven 0 siblings, 0 replies; 160+ messages in thread From: Arjan van de Ven @ 2008-05-27 4:20 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, torvalds, David Miller, alan On Mon, 26 May 2008 21:08:10 -0700 Roland Dreier <rdreier@cisco.com> wrote: > > either via sparse or some fancy lockdep like "device store" thing? > > If we can't test for it and it doesn't show up on x86 ... it'll > > just be an eterrnal chase. > > Ben's point is that it will start showing up on x86 because newer > compilers are reordering things... > > - R. I know a basic reorder will show up there. A simple "barrier()" in readl/writel solves it for x86. Understandably, Ben doesn't really like that answer ;-) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 3:42 ` Arjan van de Ven 2008-05-27 4:08 ` Roland Dreier @ 2008-05-27 7:08 ` Benjamin Herrenschmidt 2008-05-27 15:50 ` Roland Dreier 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 7:08 UTC (permalink / raw) To: Arjan van de Ven Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, torvalds, David Miller, alan > ... and try to find a way to test for it at runtime or compile time. > > either via sparse or some fancy lockdep like "device store" thing? > If we can't test for it and it doesn't show up on x86 ... it'll just be > an eterrnal chase. It's hard. I haven't managed to come up with a good idea on how to test for it either at runtime or from sparse. There -might- be way to test up to a certain point with sparse by defining a __coherent attribute for coherent memory and trying to figure out patterns like write to __coherent followed by MMIO with no barrier in between but that's fishy and won't catch many cases. Sticking barriers in the accessors is thus indeed the "easy" and somewhat safe fix and keeping everything as ordered as possible Though it's my understanding that at least ia64 does require the explicit barriers anyway, so we are still in a dodgy situation here where it's not clear what drivers should do and we end up with possibly excessive barriers on powerpc where I end up with both the wmb/rmb/mb that were added for ia64 -and- the ones I have in readl/writel to make them look synchronous... Not nice. I'm not sure there is a good answer... Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 7:08 ` Benjamin Herrenschmidt @ 2008-05-27 15:50 ` Roland Dreier 2008-05-27 16:37 ` James Bottomley ` (2 more replies) 0 siblings, 3 replies; 160+ messages in thread From: Roland Dreier @ 2008-05-27 15:50 UTC (permalink / raw) To: benh Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven > Though it's my understanding that at least ia64 does require the > explicit barriers anyway, so we are still in a dodgy situation here > where it's not clear what drivers should do and we end up with > possibly excessive barriers on powerpc where I end up with both > the wmb/rmb/mb that were added for ia64 -and- the ones I have in > readl/writel to make them look synchronous... Not nice. ia64 is a disaster with a slightly different ordering problem -- the mmiowb() issue. I know Ben knows far too much about this, but for big SGI boxes, you sometimes need mmiowb() to avoid problems with driver code that does totally sane stuff like spin_lock(&mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(&mmio_lock); If that snippet is called on two CPUs at the same time, then the device might see a sequence like CPU1 -- write reg1 CPU2 -- write reg1 CPU1 -- write reg2 CPU2 -- write reg2 in spite of the fact that everything is totally ordered on the CPUs by the spin lock. The reason this is such a disaster is because the code looks right, makes sense, and works fine on 99.99% of all systems out there. So I would bet that 99% of our drivers have missing mmiowb() "bugs" -- no one has plugged the hardware into an Altix box and cared enough to stress test it. However for the issue at hand, my expectation as a driver writer is that readl()/writel() are ordered with respect to MMIO operations, but not necessarily with respect to normal writes to coherent CPU memory. And I've included explicit wmb()s in code I've written like drivers/infiniband/hw/mthca. - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 15:50 ` Roland Dreier @ 2008-05-27 16:37 ` James Bottomley 2008-05-27 17:38 ` Roland Dreier 2008-05-27 21:11 ` Benjamin Herrenschmidt 2008-05-31 8:04 ` Pavel Machek 2 siblings, 1 reply; 160+ messages in thread From: James Bottomley @ 2008-05-27 16:37 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote: > > Though it's my understanding that at least ia64 does require the > > explicit barriers anyway, so we are still in a dodgy situation here > > where it's not clear what drivers should do and we end up with > > possibly excessive barriers on powerpc where I end up with both > > the wmb/rmb/mb that were added for ia64 -and- the ones I have in > > readl/writel to make them look synchronous... Not nice. > > ia64 is a disaster with a slightly different ordering problem -- the > mmiowb() issue. I know Ben knows far too much about this, but for big > SGI boxes, you sometimes need mmiowb() to avoid problems with driver > code that does totally sane stuff like > > spin_lock(&mmio_lock); > writel(val1, reg1); > writel(val2, reg2); > spin_unlock(&mmio_lock); > > If that snippet is called on two CPUs at the same time, then the device > might see a sequence like > > CPU1 -- write reg1 > CPU2 -- write reg1 > CPU1 -- write reg2 > CPU2 -- write reg2 > > in spite of the fact that everything is totally ordered on the CPUs by > the spin lock. > > The reason this is such a disaster is because the code looks right, > makes sense, and works fine on 99.99% of all systems out there. So I > would bet that 99% of our drivers have missing mmiowb() "bugs" -- no one > has plugged the hardware into an Altix box and cared enough to stress > test it. > > However for the issue at hand, my expectation as a driver writer is that > readl()/writel() are ordered with respect to MMIO operations, but not > necessarily with respect to normal writes to coherent CPU memory. And > I've included explicit wmb()s in code I've written like > drivers/infiniband/hw/mthca. Actually, this specifically should not be. The need for mmiowb on altix is because it explicitly violates some of the PCI rules that would otherwise impede performance. The compromise is that readX on altix contains the needed dma flush but there's a variant operator, readX_relaxed that doesn't (for drivers that know what they're doing). The altix critical drivers have all been converted to use the relaxed form for performance, and the unconverted ones should all operate just fine (albeit potentially more slowly). You can see all of this in include/asm-ia64/sn/io.h It is confusing to me that sn_dma_flush() and sn_mmiowb() have different implementations, but I think both fix the spinlock problem you allude to by ensuring the DMA operation is completed before the CPU instruction is executed. James ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 16:37 ` James Bottomley @ 2008-05-27 17:38 ` Roland Dreier 2008-05-27 17:53 ` James Bottomley 0 siblings, 1 reply; 160+ messages in thread From: Roland Dreier @ 2008-05-27 17:38 UTC (permalink / raw) To: James Bottomley Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven > Actually, this specifically should not be. The need for mmiowb on altix > is because it explicitly violates some of the PCI rules that would > otherwise impede performance. The compromise is that readX on altix > contains the needed dma flush but there's a variant operator, > readX_relaxed that doesn't (for drivers that know what they're doing). > The altix critical drivers have all been converted to use the relaxed > form for performance, and the unconverted ones should all operate just > fine (albeit potentially more slowly). Is this a recent change? Because as of October 2007, 76d7cc03 ("IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up") was needed. But this was involving writel() (__raw_writel() actually, looking at the code), not readl(). But writel_relaxed() doesn't exist (and doesn't make sense). - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 17:38 ` Roland Dreier @ 2008-05-27 17:53 ` James Bottomley 2008-05-27 18:07 ` Roland Dreier 0 siblings, 1 reply; 160+ messages in thread From: James Bottomley @ 2008-05-27 17:53 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tue, 2008-05-27 at 10:38 -0700, Roland Dreier wrote: > > Actually, this specifically should not be. The need for mmiowb on altix > > is because it explicitly violates some of the PCI rules that would > > otherwise impede performance. The compromise is that readX on altix > > contains the needed dma flush but there's a variant operator, > > readX_relaxed that doesn't (for drivers that know what they're doing). > > The altix critical drivers have all been converted to use the relaxed > > form for performance, and the unconverted ones should all operate just > > fine (albeit potentially more slowly). > > Is this a recent change? Because as of October 2007, 76d7cc03 > ("IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up") > was needed. But this was involving writel() (__raw_writel() actually, > looking at the code), not readl(). But writel_relaxed() doesn't exist > (and doesn't make sense). Um, OK, you've said write twice now ... I was assuming you meant read. Even on an x86, writes are posted, so there's no way a spin lock could serialise a write without an intervening read to flush the posting (that's why only reads have a relaxed version on altix). Or is there something else I'm missing? James ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 17:53 ` James Bottomley @ 2008-05-27 18:07 ` Roland Dreier 2008-05-27 18:17 ` Roland Dreier 2008-05-27 21:23 ` Chris Friesen 0 siblings, 2 replies; 160+ messages in thread From: Roland Dreier @ 2008-05-27 18:07 UTC (permalink / raw) To: James Bottomley Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven > Um, OK, you've said write twice now ... I was assuming you meant read. > Even on an x86, writes are posted, so there's no way a spin lock could > serialise a write without an intervening read to flush the posting > (that's why only reads have a relaxed version on altix). Or is there > something else I'm missing? Writes are posted yes, but not reordered arbitrarily. If I have code like: spin_lock(&mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(&mmio_lock); then I have a reasonable expectation that if two CPUs run this at the same time, their writes to reg1/reg2 won't be interleaved with each other (because the whole section is inside a spinlock). And Altix violates that expectation. - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 18:07 ` Roland Dreier @ 2008-05-27 18:17 ` Roland Dreier 2008-05-27 21:23 ` Chris Friesen 1 sibling, 0 replies; 160+ messages in thread From: Roland Dreier @ 2008-05-27 18:17 UTC (permalink / raw) To: James Bottomley Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven > Writes are posted yes, but not reordered arbitrarily. on standard x86 I mean here... ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 18:07 ` Roland Dreier 2008-05-27 18:17 ` Roland Dreier @ 2008-05-27 21:23 ` Chris Friesen 2008-05-27 21:29 ` Roland Dreier 2008-05-27 23:04 ` Paul Mackerras 1 sibling, 2 replies; 160+ messages in thread From: Chris Friesen @ 2008-05-27 21:23 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, tpiepho, James Bottomley, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven Roland Dreier wrote: > Writes are posted yes, but not reordered arbitrarily. If I have code like: > > spin_lock(&mmio_lock); > writel(val1, reg1); > writel(val2, reg2); > spin_unlock(&mmio_lock); > > then I have a reasonable expectation that if two CPUs run this at the > same time, their writes to reg1/reg2 won't be interleaved with each > other (because the whole section is inside a spinlock). And Altix > violates that expectation. Does that necessarily follow? If you've got a large system with multiple pci bridges, could you end up with posted writes coming from different cpus taking a different amount of time to propagate to a device and thus colliding? Chris ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:23 ` Chris Friesen @ 2008-05-27 21:29 ` Roland Dreier 2008-05-27 23:04 ` Paul Mackerras 1 sibling, 0 replies; 160+ messages in thread From: Roland Dreier @ 2008-05-27 21:29 UTC (permalink / raw) To: Chris Friesen Cc: linux-arch, linux-kernel, tpiepho, James Bottomley, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven > > Writes are posted yes, but not reordered arbitrarily. If I have code like: > > > > spin_lock(&mmio_lock); > > writel(val1, reg1); > > writel(val2, reg2); > > spin_unlock(&mmio_lock); > > > > then I have a reasonable expectation that if two CPUs run this at the > > same time, their writes to reg1/reg2 won't be interleaved with each > > other (because the whole section is inside a spinlock). And Altix > > violates that expectation. > > Does that necessarily follow? > > If you've got a large system with multiple pci bridges, could you end > up with posted writes coming from different cpus taking a different > amount of time to propagate to a device and thus colliding? Not on x86. And a given PCI device can only be reached from a single host bridge, so I don't see how it can happen. But on SGI altix systems, there is a routed fabric between the CPU and the PCI bus, so the reordering can happen there. Hence mmiowb() and the endless supply of driver bugs that it causes. - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:23 ` Chris Friesen 2008-05-27 21:29 ` Roland Dreier @ 2008-05-27 23:04 ` Paul Mackerras 1 sibling, 0 replies; 160+ messages in thread From: Paul Mackerras @ 2008-05-27 23:04 UTC (permalink / raw) To: Chris Friesen Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, James Bottomley, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Chris Friesen writes: > Roland Dreier wrote: > > > Writes are posted yes, but not reordered arbitrarily. If I have code like: > > > > spin_lock(&mmio_lock); > > writel(val1, reg1); > > writel(val2, reg2); > > spin_unlock(&mmio_lock); > > > > then I have a reasonable expectation that if two CPUs run this at the > > same time, their writes to reg1/reg2 won't be interleaved with each > > other (because the whole section is inside a spinlock). And Altix > > violates that expectation. > > Does that necessarily follow? > > If you've got a large system with multiple pci bridges, could you end up > with posted writes coming from different cpus taking a different amount > of time to propagate to a device and thus colliding? On powerpc we explicitly make sure that can't happen. That's the "do a sync in spin_unlock if there were any writels since the last spin_lock" magic. The sync instruction makes sure the writes have got to the host bridge before the spinlock is unlocked. Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 15:50 ` Roland Dreier 2008-05-27 16:37 ` James Bottomley @ 2008-05-27 21:11 ` Benjamin Herrenschmidt 2008-05-27 21:33 ` Roland Dreier 2008-05-31 8:04 ` Pavel Machek 2 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 21:11 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote: > > Though it's my understanding that at least ia64 does require the > > explicit barriers anyway, so we are still in a dodgy situation here > > where it's not clear what drivers should do and we end up with > > possibly excessive barriers on powerpc where I end up with both > > the wmb/rmb/mb that were added for ia64 -and- the ones I have in > > readl/writel to make them look synchronous... Not nice. > > ia64 is a disaster with a slightly different ordering problem -- the > mmiowb() issue. I know Ben knows far too much about this, but for big > SGI boxes, you sometimes need mmiowb() to avoid problems with driver > code that does totally sane stuff like This is a different issue. We deal with it on powerpc by having writel set a per-cpu flag and spin_unlock() test it, and do the barrier if needed there. However, drivers such as e1000 -also- have a wmb() between filling the ring buffer and kicking the DMA with MMIO, with a comment about this being needed for ia64 relaxed ordering. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:11 ` Benjamin Herrenschmidt @ 2008-05-27 21:33 ` Roland Dreier 2008-05-27 22:13 ` Benjamin Herrenschmidt 2008-05-29 14:47 ` Jes Sorensen 0 siblings, 2 replies; 160+ messages in thread From: Roland Dreier @ 2008-05-27 21:33 UTC (permalink / raw) To: benh Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven > This is a different issue. We deal with it on powerpc by having writel > set a per-cpu flag and spin_unlock() test it, and do the barrier if > needed there. Cool... I assume you do this for mutex_unlock() etc? Is there any reason why ia64 can't do this too so we can kill mmiowb and save everyone a lot of hassle? (mips, sh and frv have non-empty mmiowb() definitions too but I'd guess that these are all bugs based on misunderstandings of the mmiowb() semantics...) > However, drivers such as e1000 -also- have a wmb() between filling the > ring buffer and kicking the DMA with MMIO, with a comment about this > being needed for ia64 relaxed ordering. I put these barriers into mthca, mlx4 etc, although it came from my possible misunderstanding of the memory ordering rules in the kernel more than any experience of problems (as opposed the the mmiowb()s, which all came from real world bugs). - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:33 ` Roland Dreier @ 2008-05-27 22:13 ` Benjamin Herrenschmidt 2008-05-27 22:39 ` Roland Dreier 2008-05-29 14:47 ` Jes Sorensen 1 sibling, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-27 22:13 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tue, 2008-05-27 at 14:33 -0700, Roland Dreier wrote: > > This is a different issue. We deal with it on powerpc by having writel > > set a per-cpu flag and spin_unlock() test it, and do the barrier if > > needed there. > > Cool... I assume you do this for mutex_unlock() etc? That's a good point... I don't think we do. Maybe we should. > Is there any reason why ia64 can't do this too so we can kill mmiowb and > save everyone a lot of hassle? (mips, sh and frv have non-empty > mmiowb() definitions too but I'd guess that these are all bugs based on > misunderstandings of the mmiowb() semantics...) Well, basically our approach was that mmiowb() is a pain in the neck, nobody (ie. driver writers) really understands what it's for, and so it's either not there or misused. So we didn't want to introduce it for powerpc, but instead did the trick above in order to -slightly- improve our writel (ie avoid a sync -after- the write) . > > However, drivers such as e1000 -also- have a wmb() between filling the > > ring buffer and kicking the DMA with MMIO, with a comment about this > > being needed for ia64 relaxed ordering. > > I put these barriers into mthca, mlx4 etc, although it came from my > possible misunderstanding of the memory ordering rules in the kernel > more than any experience of problems (as opposed the the mmiowb()s, > which all came from real world bugs). Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 22:13 ` Benjamin Herrenschmidt @ 2008-05-27 22:39 ` Roland Dreier 0 siblings, 0 replies; 160+ messages in thread From: Roland Dreier @ 2008-05-27 22:39 UTC (permalink / raw) To: benh Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven > > Cool... I assume you do this for mutex_unlock() etc? > That's a good point... I don't think we do. Maybe we should. I think it's needed -- take a look at 76d7cc03, which came from a real bug seen on Altix boxes. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 21:33 ` Roland Dreier 2008-05-27 22:13 ` Benjamin Herrenschmidt @ 2008-05-29 14:47 ` Jes Sorensen 2008-05-29 15:01 ` James Bottomley ` (3 more replies) 1 sibling, 4 replies; 160+ messages in thread From: Jes Sorensen @ 2008-05-29 14:47 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven >>>>> "Roland" == Roland Dreier <rdreier@cisco.com> writes: >> This is a different issue. We deal with it on powerpc by having >> writel set a per-cpu flag and spin_unlock() test it, and do the >> barrier if needed there. Roland> Cool... I assume you do this for mutex_unlock() etc? Roland> Is there any reason why ia64 can't do this too so we can kill Roland> mmiowb and save everyone a lot of hassle? (mips, sh and frv Roland> have non-empty mmiowb() definitions too but I'd guess that Roland> these are all bugs based on misunderstandings of the mmiowb() Roland> semantics...) Hi Roland, Thats not going to solve the problem on Altix. On Altix the issue is that there can be multiple paths through the NUMA fabric from cpuX to PCI bridge Y. Consider this uber-cool<tm> ascii art - NR is my abbrevation for NUMA router: ------- ------- |cpu X| |cpu Y| ------- ------- | \____ ____/ | | \/ | | ____/\____ | | / \ | ----- ------ |NR 1| |NR 2| ------ ------ \ / \ / ------- | PCI | ------- The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. Of course, the bigger the system, the worse the problem.... The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Cheers, Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 14:47 ` Jes Sorensen @ 2008-05-29 15:01 ` James Bottomley 2008-05-30 9:36 ` Jes Sorensen 2008-05-29 21:40 ` Benjamin Herrenschmidt ` (2 subsequent siblings) 3 siblings, 1 reply; 160+ messages in thread From: James Bottomley @ 2008-05-29 15:01 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: > >>>>> "Roland" == Roland Dreier <rdreier@cisco.com> writes: > > >> This is a different issue. We deal with it on powerpc by having > >> writel set a per-cpu flag and spin_unlock() test it, and do the > >> barrier if needed there. > > Roland> Cool... I assume you do this for mutex_unlock() etc? > > Roland> Is there any reason why ia64 can't do this too so we can kill > Roland> mmiowb and save everyone a lot of hassle? (mips, sh and frv > Roland> have non-empty mmiowb() definitions too but I'd guess that > Roland> these are all bugs based on misunderstandings of the mmiowb() > Roland> semantics...) > > Hi Roland, > > Thats not going to solve the problem on Altix. On Altix the issue is > that there can be multiple paths through the NUMA fabric from cpuX to > PCI bridge Y. > > Consider this uber-cool<tm> ascii art - NR is my abbrevation for NUMA > router: > > ------- ------- > |cpu X| |cpu Y| > ------- ------- > | \____ ____/ | > | \/ | > | ____/\____ | > | / \ | > ----- ------ > |NR 1| |NR 2| > ------ ------ > \ / > \ / > ------- > | PCI | > ------- > > The problem is that your two writel's, despite being both issued on > cpu X, due to the spin lock, in your example, can end up with the > first one going through NR 1 and the second one going through NR 2. If > there's contention on NR 1, the write going via NR 2 may hit the PCI > bridge prior to the one going via NR 1. > > Of course, the bigger the system, the worse the problem.... > > The only way to guarantee ordering in the above setup, is to either > make writel() fully ordered or adding the mmiowb()'s inbetween the two > writel's. On Altix you have to go and read from the PCI brige to > ensure all writes to it have been flushed, which is also what mmiowb() > is doing. If writel() was to guarantee this ordering, it would make > every writel() call extremely expensive :-( So if a read from the bridge achieves the same effect, can't we just put one after the writes within the spinlock (an unrelaxed one). That way this whole sequence will look like a well understood PCI posting flush rather than have to muck around with little understood (at least by most driver writers) io barriers? James ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 15:01 ` James Bottomley @ 2008-05-30 9:36 ` Jes Sorensen 2008-05-30 17:21 ` Jesse Barnes 0 siblings, 1 reply; 160+ messages in thread From: Jes Sorensen @ 2008-05-30 9:36 UTC (permalink / raw) To: James Bottomley Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven James Bottomley wrote: >> The only way to guarantee ordering in the above setup, is to either >> make writel() fully ordered or adding the mmiowb()'s inbetween the two >> writel's. On Altix you have to go and read from the PCI brige to >> ensure all writes to it have been flushed, which is also what mmiowb() >> is doing. If writel() was to guarantee this ordering, it would make >> every writel() call extremely expensive :-( > > So if a read from the bridge achieves the same effect, can't we just put > one after the writes within the spinlock (an unrelaxed one). That way > this whole sequence will look like a well understood PCI posting flush > rather than have to muck around with little understood (at least by most > driver writers) io barriers? Hmmm, I think mmiowb() does some sort of status read from the bridge, I am not sure if it's enough to just do a regular readl(). I'm adding Jeremy to the list, he should know for sure. Cheers, Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 9:36 ` Jes Sorensen @ 2008-05-30 17:21 ` Jesse Barnes 2008-05-31 7:57 ` Jeremy Higdon 0 siblings, 1 reply; 160+ messages in thread From: Jesse Barnes @ 2008-05-30 17:21 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, James Bottomley, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Friday, May 30, 2008 2:36 am Jes Sorensen wrote: > James Bottomley wrote: > >> The only way to guarantee ordering in the above setup, is to either > >> make writel() fully ordered or adding the mmiowb()'s inbetween the two > >> writel's. On Altix you have to go and read from the PCI brige to > >> ensure all writes to it have been flushed, which is also what mmiowb() > >> is doing. If writel() was to guarantee this ordering, it would make > >> every writel() call extremely expensive :-( > > > > So if a read from the bridge achieves the same effect, can't we just put > > one after the writes within the spinlock (an unrelaxed one). That way > > this whole sequence will look like a well understood PCI posting flush > > rather than have to muck around with little understood (at least by most > > driver writers) io barriers? > > Hmmm, > > I think mmiowb() does some sort of status read from the bridge, I am not > sure if it's enough to just do a regular readl(). > > I'm adding Jeremy to the list, he should know for sure. I think a read from the target host bridge is enough. What mmiowb() does though is to read a *local* host bridge register, which contains a count of the number of PIO ops still "in flight" on their way to their target bridge. When it reaches 0, all PIOs have arrived at the target host bridge (they still may be bufferd), so ordering is guaranteed. Jesse ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 17:21 ` Jesse Barnes @ 2008-05-31 7:57 ` Jeremy Higdon 0 siblings, 0 replies; 160+ messages in thread From: Jeremy Higdon @ 2008-05-31 7:57 UTC (permalink / raw) To: Jesse Barnes Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, David Miller, James Bottomley, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Fri, May 30, 2008 at 10:21:00AM -0700, Jesse Barnes wrote: > On Friday, May 30, 2008 2:36 am Jes Sorensen wrote: > > James Bottomley wrote: > > >> The only way to guarantee ordering in the above setup, is to either > > >> make writel() fully ordered or adding the mmiowb()'s inbetween the two > > >> writel's. On Altix you have to go and read from the PCI brige to > > >> ensure all writes to it have been flushed, which is also what mmiowb() > > >> is doing. If writel() was to guarantee this ordering, it would make > > >> every writel() call extremely expensive :-( > > > > > > So if a read from the bridge achieves the same effect, can't we just put > > > one after the writes within the spinlock (an unrelaxed one). That way A relaxed readX would be sufficient. It's the next lowest cost way (after mmiowb) of ensuring write ordering between CPUs. Regular readX is the most expensive method (well, we could probably come up with something worse, but we'd have to work at it :). > > > this whole sequence will look like a well understood PCI posting flush > > > rather than have to muck around with little understood (at least by most > > > driver writers) io barriers? > > > > Hmmm, > > > > I think mmiowb() does some sort of status read from the bridge, I am not > > sure if it's enough to just do a regular readl(). > > > > I'm adding Jeremy to the list, he should know for sure. > > I think a read from the target host bridge is enough. What mmiowb() does > though is to read a *local* host bridge register, which contains a count of > the number of PIO ops still "in flight" on their way to their target bridge. > When it reaches 0, all PIOs have arrived at the target host bridge (they > still may be bufferd), so ordering is guaranteed. Note that is the main advantage over a read. There is no round trip across the NUMA fabric. jeremy ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 14:47 ` Jes Sorensen 2008-05-29 15:01 ` James Bottomley @ 2008-05-29 21:40 ` Benjamin Herrenschmidt 2008-05-29 21:48 ` Trent Piepho ` (2 more replies) 2008-05-29 22:06 ` Roland Dreier 2008-05-31 7:52 ` Jeremy Higdon 3 siblings, 3 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-29 21:40 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: > > The only way to guarantee ordering in the above setup, is to either > make writel() fully ordered or adding the mmiowb()'s inbetween the two > writel's. On Altix you have to go and read from the PCI brige to > ensure all writes to it have been flushed, which is also what mmiowb() > is doing. If writel() was to guarantee this ordering, it would make > every writel() call extremely expensive :-( Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). I think in the above case, you really should make writel() ordered. Anything else is asking for trouble, for the exact same reasons that I made it fully ordered on powerpc at least vs. previous stores. I only kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for which I use the trick mentioned before. Yes, this has some cost (can be fairly significant on powerpc too) but I think it's a very basic assumption from drivers that consecutive writel's, especially issued by the same CPU, will get to the device in order. If this is a performance problem, then provide relaxed variants and use them in selected drivers. Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 21:40 ` Benjamin Herrenschmidt @ 2008-05-29 21:48 ` Trent Piepho 2008-05-29 22:05 ` Benjamin Herrenschmidt 2008-05-29 21:53 ` Jesse Barnes 2008-05-30 9:48 ` Jes Sorensen 2 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-29 21:48 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven On Fri, 30 May 2008, Benjamin Herrenschmidt wrote: > On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: > Interesting. I've always been taught by ia64 people that mmiowb() was > intended to be used solely between writel() and spin_unlock(). That's what I gathered too, based on what's written in memory-barriers.txt, which is the only kernel docs I could find that addressed this. > Yes, this has some cost (can be fairly significant on powerpc too) but > I think it's a very basic assumption from drivers that consecutive > writel's, especially issued by the same CPU, will get to the device > in order. It's also what memory-barriers.txt says they should do. > If this is a performance problem, then provide relaxed variants and > use them in selected drivers. I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform. With the non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz. Using raw versions (which I had to write since powerpc doesn't have any), the clock speed increases to about 28 MHz. So it can make a very significant different. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 21:48 ` Trent Piepho @ 2008-05-29 22:05 ` Benjamin Herrenschmidt 2008-05-30 1:53 ` Trent Piepho 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-29 22:05 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote: > I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform. With the > non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz. Using raw > versions (which I had to write since powerpc doesn't have any), the clock > speed increases to about 28 MHz. So it can make a very significant different. Yes, sync's can hurt a lot. This is why I initially tried to get more relaxed semantics. We could implement something like __ variants and do something that would still have eieio's but not sync's for example (ie. MMIOs are still ordered vs. each other but not vs. coherent memory). Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 22:05 ` Benjamin Herrenschmidt @ 2008-05-30 1:53 ` Trent Piepho 0 siblings, 0 replies; 160+ messages in thread From: Trent Piepho @ 2008-05-30 1:53 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven On Fri, 30 May 2008, Benjamin Herrenschmidt wrote: > On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote: > >> I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform. With the >> non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz. Using raw >> versions (which I had to write since powerpc doesn't have any), the clock >> speed increases to about 28 MHz. So it can make a very significant different. > > Yes, sync's can hurt a lot. This is why I initially tried to get more > relaxed semantics. > > We could implement something like __ variants and do something that > would still have eieio's but not sync's for example (ie. MMIOs are still > ordered vs. each other but not vs. coherent memory). The problem current with the raw variants is that not all archs have them. And for those that do, there is no defined semantics. Each arch is different as to what ordering they have (and endianness too). If you want to write a driver that is (or might be one day) multi-platform, there aren't any less ordered accessors one can use. A lot of drivers don't even use coherent DMA, and could use less strictly ordered semantics quite trivially. Except there aren't any. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 21:40 ` Benjamin Herrenschmidt 2008-05-29 21:48 ` Trent Piepho @ 2008-05-29 21:53 ` Jesse Barnes 2008-05-30 9:39 ` Jes Sorensen 2008-05-30 9:48 ` Jes Sorensen 2 siblings, 1 reply; 160+ messages in thread From: Jesse Barnes @ 2008-05-29 21:53 UTC (permalink / raw) To: benh Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote: > On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: > > The only way to guarantee ordering in the above setup, is to either > > make writel() fully ordered or adding the mmiowb()'s inbetween the two > > writel's. On Altix you have to go and read from the PCI brige to > > ensure all writes to it have been flushed, which is also what mmiowb() > > is doing. If writel() was to guarantee this ordering, it would make > > every writel() call extremely expensive :-( > > Interesting. I've always been taught by ia64 people that mmiowb() was > intended to be used solely between writel() and spin_unlock(). Well, that *was* true, afaik, but maybe these days multipath isn't just for fail-over. If that's true, then yeah making every single writeX ordered would be the only way to go... > If this is a performance problem, then provide relaxed variants and > use them in selected drivers. Sounds reasonable. That way drivers "just work" and important drivers can be optimized. Jesse ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 21:53 ` Jesse Barnes @ 2008-05-30 9:39 ` Jes Sorensen 0 siblings, 0 replies; 160+ messages in thread From: Jes Sorensen @ 2008-05-30 9:39 UTC (permalink / raw) To: Jesse Barnes Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Jesse Barnes wrote: > On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote: >> On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: >>> The only way to guarantee ordering in the above setup, is to either >>> make writel() fully ordered or adding the mmiowb()'s inbetween the two >>> writel's. On Altix you have to go and read from the PCI brige to >>> ensure all writes to it have been flushed, which is also what mmiowb() >>> is doing. If writel() was to guarantee this ordering, it would make >>> every writel() call extremely expensive :-( >> Interesting. I've always been taught by ia64 people that mmiowb() was >> intended to be used solely between writel() and spin_unlock(). > > Well, that *was* true, afaik, but maybe these days multipath isn't just for > fail-over. If that's true, then yeah making every single writeX ordered > would be the only way to go... I could be getting bits wrong, but multi-path here is in the NUMA routing, not at the device level. >> If this is a performance problem, then provide relaxed variants and >> use them in selected drivers. > > Sounds reasonable. That way drivers "just work" and important drivers can be > optimized. That would kill all levels of performance in all drivers, resulting in attempts to try and modify a fair bit of drivers to get the performance back. In reality this problem really only exists for devices where ordering of consecutive writel's is a big issue. In my experience it really isn't the case very frequently - and the number of mmiowb's that have put in shows that too :-) Cheers, Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 21:40 ` Benjamin Herrenschmidt 2008-05-29 21:48 ` Trent Piepho 2008-05-29 21:53 ` Jesse Barnes @ 2008-05-30 9:48 ` Jes Sorensen 2008-05-31 8:14 ` Pavel Machek 2 siblings, 1 reply; 160+ messages in thread From: Jes Sorensen @ 2008-05-30 9:48 UTC (permalink / raw) To: benh Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Benjamin Herrenschmidt wrote: > On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: >> The only way to guarantee ordering in the above setup, is to either >> make writel() fully ordered or adding the mmiowb()'s inbetween the two >> writel's. On Altix you have to go and read from the PCI brige to >> ensure all writes to it have been flushed, which is also what mmiowb() >> is doing. If writel() was to guarantee this ordering, it would make >> every writel() call extremely expensive :-( > > Interesting. I've always been taught by ia64 people that mmiowb() was > intended to be used solely between writel() and spin_unlock(). > > I think in the above case, you really should make writel() ordered. > Anything else is asking for trouble, for the exact same reasons that I > made it fully ordered on powerpc at least vs. previous stores. I only > kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for > which I use the trick mentioned before. Hmmm I hope I didn't mess up the description of this and added to the confusion. The net result of that would be to kill performance completely, I really don't like that idea.... Having each writel() go out and poll the PCI bridge is going to make every driver used on Altix slow as a dog. In addition it's still not going to solve the problem for userland mapped stuff such as infinibug. > Yes, this has some cost (can be fairly significant on powerpc too) but > I think it's a very basic assumption from drivers that consecutive > writel's, especially issued by the same CPU, will get to the device > in order. In this case the cost is more than just significant, it's pretty crippling. > If this is a performance problem, then provide relaxed variants and > use them in selected drivers. We'd have to make major changes to drivers like e1000, tg3, mptsas, the qla2/3/4xxx and a bunch of others to get performance back. I really think the code maintenance issue there will get far worse than what we have today :( Cheers, Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-30 9:48 ` Jes Sorensen @ 2008-05-31 8:14 ` Pavel Machek 2008-06-02 9:48 ` Jes Sorensen 0 siblings, 1 reply; 160+ messages in thread From: Pavel Machek @ 2008-05-31 8:14 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Hi! > >>The only way to guarantee ordering in the above setup, > >>is to either > >>make writel() fully ordered or adding the mmiowb()'s > >>inbetween the two > >>writel's. On Altix you have to go and read from the > >>PCI brige to > >>ensure all writes to it have been flushed, which is > >>also what mmiowb() > >>is doing. If writel() was to guarantee this ordering, > >>it would make > >>every writel() call extremely expensive :-( > > > >Interesting. I've always been taught by ia64 people > >that mmiowb() was > >intended to be used solely between writel() and > >spin_unlock(). > > > >I think in the above case, you really should make > >writel() ordered. > >Anything else is asking for trouble, for the exact same > >reasons that I > >made it fully ordered on powerpc at least vs. previous > >stores. I only > >kept it relaxed vs. subsequent cacheable stores (ie, > >spin_unlock), for > >which I use the trick mentioned before. > > Hmmm I hope I didn't mess up the description of this and > added to the > confusion. > > The net result of that would be to kill performance > completely, I really > don't like that idea.... Having each writel() go out and > poll the PCI > bridge is going to make every driver used on Altix slow > as a dog. > > In addition it's still not going to solve the problem > for userland > mapped stuff such as infinibug. > > >Yes, this has some cost (can be fairly significant on > >powerpc too) but > >I think it's a very basic assumption from drivers that > >consecutive > >writel's, especially issued by the same CPU, will get > >to the device > >in order. > > In this case the cost is more than just significant, > it's pretty > crippling. > > >If this is a performance problem, then provide relaxed > >variants and > >use them in selected drivers. > > We'd have to make major changes to drivers like e1000, > tg3, mptsas, the > qla2/3/4xxx and a bunch of others to get performance > back. I really > think the code maintenance issue there will get far > worse than what we > have today :( Still better than changing semantics of writel for _all_ drivers. If you are really sure driver does not depend on writel order, it is just a sed/// command, so I don't see any big code maintenance issues... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-31 8:14 ` Pavel Machek @ 2008-06-02 9:48 ` Jes Sorensen 0 siblings, 0 replies; 160+ messages in thread From: Jes Sorensen @ 2008-06-02 9:48 UTC (permalink / raw) To: Pavel Machek Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Pavel Machek wrote: > Still better than changing semantics of writel for _all_ drivers. > > If you are really sure driver does not depend on writel order, it is > just a sed/// command, so I don't see any big code maintenance > issues... This isn't changing the semantics for all drivers, it means it leaves it up to us to fix the drivers that hit the problem. Whereas the other way round we'd have to make some fairly big modifications to a large number of drivers, which the driver maintainers are likely to be reluctant to accept and maintain. It will be a constant chase to keep that in order. Cheers Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 14:47 ` Jes Sorensen 2008-05-29 15:01 ` James Bottomley 2008-05-29 21:40 ` Benjamin Herrenschmidt @ 2008-05-29 22:06 ` Roland Dreier 2008-05-29 22:25 ` Trent Piepho 2008-05-31 7:52 ` Jeremy Higdon 3 siblings, 1 reply; 160+ messages in thread From: Roland Dreier @ 2008-05-29 22:06 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven > The problem is that your two writel's, despite being both issued on > cpu X, due to the spin lock, in your example, can end up with the > first one going through NR 1 and the second one going through NR 2. If > there's contention on NR 1, the write going via NR 2 may hit the PCI > bridge prior to the one going via NR 1. Really?? I can't see how you can expect any drivers to work reliably if simple code like writel(reg1); writel(reg2); might end up with the write to reg2 hitting the device before the write to reg1. Almost all MMIO stuff has ordering requirements and will totally break if you allow, say, the "start IO" write to be reordered before the "set IO address" write. - R. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 22:06 ` Roland Dreier @ 2008-05-29 22:25 ` Trent Piepho 2008-05-30 3:56 ` Paul Mackerras 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-29 22:25 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, Jes Sorensen, linux-kernel, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven On Thu, 29 May 2008, Roland Dreier wrote: > > The problem is that your two writel's, despite being both issued on > > cpu X, due to the spin lock, in your example, can end up with the > > first one going through NR 1 and the second one going through NR 2. If > > there's contention on NR 1, the write going via NR 2 may hit the PCI > > bridge prior to the one going via NR 1. > > Really?? I can't see how you can expect any drivers to work reliably if > simple code like > > writel(reg1); > writel(reg2); > > might end up with the write to reg2 hitting the device before the write > to reg1. Almost all MMIO stuff has ordering requirements and will This is how powerpc is natively (the linux accessors have extra ordering to not allow this of course), and there are non-Linux drivers that are written for this ordering model. I think what makes altix so hard is that writes to the _same_ register may be be re-ordered. Re-ordering writes to the same address is much less common than allowing writes to different addresses to be re-ordered. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 22:25 ` Trent Piepho @ 2008-05-30 3:56 ` Paul Mackerras 0 siblings, 0 replies; 160+ messages in thread From: Paul Mackerras @ 2008-05-30 3:56 UTC (permalink / raw) To: Trent Piepho Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, linuxppc-dev, scottwood, torvalds, David Miller, alan, Arjan van de Ven Trent Piepho writes: > On Thu, 29 May 2008, Roland Dreier wrote: > > > The problem is that your two writel's, despite being both issued on > > > cpu X, due to the spin lock, in your example, can end up with the > > > first one going through NR 1 and the second one going through NR 2. If > > > there's contention on NR 1, the write going via NR 2 may hit the PCI > > > bridge prior to the one going via NR 1. > > > > Really?? I can't see how you can expect any drivers to work reliably if > > simple code like > > > > writel(reg1); > > writel(reg2); > > > > might end up with the write to reg2 hitting the device before the write > > to reg1. Almost all MMIO stuff has ordering requirements and will > > This is how powerpc is natively (the linux accessors have extra ordering to > not allow this of course), and there are non-Linux drivers that are written > for this ordering model. In fact stores to non-cacheable locations have to be kept in order, according to the Power architecture. But you're correct in that non-cacheable loads can in principle be reordered w.r.t. each other and to stores. > I think what makes altix so hard is that writes to the _same_ register may be > be re-ordered. Re-ordering writes to the same address is much less common > than allowing writes to different addresses to be re-ordered. Indeed. Paul. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-29 14:47 ` Jes Sorensen ` (2 preceding siblings ...) 2008-05-29 22:06 ` Roland Dreier @ 2008-05-31 7:52 ` Jeremy Higdon 2008-06-02 9:56 ` Jes Sorensen 3 siblings, 1 reply; 160+ messages in thread From: Jeremy Higdon @ 2008-05-31 7:52 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thu, May 29, 2008 at 10:47:18AM -0400, Jes Sorensen wrote: > Thats not going to solve the problem on Altix. On Altix the issue is > that there can be multiple paths through the NUMA fabric from cpuX to > PCI bridge Y. > > Consider this uber-cool<tm> ascii art - NR is my abbrevation for NUMA > router: > > ------- ------- > |cpu X| |cpu Y| > ------- ------- > | \____ ____/ | > | \/ | > | ____/\____ | > | / \ | > ----- ------ > |NR 1| |NR 2| > ------ ------ > \ / > \ / > ------- > | PCI | > ------- > > The problem is that your two writel's, despite being both issued on > cpu X, due to the spin lock, in your example, can end up with the > first one going through NR 1 and the second one going through NR 2. If > there's contention on NR 1, the write going via NR 2 may hit the PCI > bridge prior to the one going via NR 1. We don't actually have that problem on the Altix. All writes issued by CPU X will be ordered with respect to each other. But writes by CPU X and CPU Y will not be, unless an mmiowb() is done by the original CPU before the second CPU writes. I.e. CPU X writel CPU X writel CPU X mmiowb CPU Y writel ... Note that this implies some sort of locking. Also note that if in the above, CPU Y did the mmiowb, that would not work. jeremy ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-31 7:52 ` Jeremy Higdon @ 2008-06-02 9:56 ` Jes Sorensen 2008-06-02 21:02 ` Jeremy Higdon 2008-06-03 4:33 ` Nick Piggin 0 siblings, 2 replies; 160+ messages in thread From: Jes Sorensen @ 2008-06-02 9:56 UTC (permalink / raw) To: Jeremy Higdon Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Jeremy Higdon wrote: > We don't actually have that problem on the Altix. All writes issued > by CPU X will be ordered with respect to each other. But writes by > CPU X and CPU Y will not be, unless an mmiowb() is done by the > original CPU before the second CPU writes. I.e. > > CPU X writel > CPU X writel > CPU X mmiowb > > CPU Y writel > ... > > Note that this implies some sort of locking. Also note that if in > the above, CPU Y did the mmiowb, that would not work. Hmmm, Then it's less bad than I thought - my apologies for the confusion. Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Cheers, Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 9:56 ` Jes Sorensen @ 2008-06-02 21:02 ` Jeremy Higdon 2008-06-03 4:33 ` Nick Piggin 1 sibling, 0 replies; 160+ messages in thread From: Jeremy Higdon @ 2008-06-02 21:02 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Mon, Jun 02, 2008 at 11:56:39AM +0200, Jes Sorensen wrote: > Jeremy Higdon wrote: > >We don't actually have that problem on the Altix. All writes issued > >by CPU X will be ordered with respect to each other. But writes by > >CPU X and CPU Y will not be, unless an mmiowb() is done by the > >original CPU before the second CPU writes. I.e. > > > > CPU X writel > > CPU X writel > > CPU X mmiowb > > > > CPU Y writel > > ... > > > >Note that this implies some sort of locking. Also note that if in > >the above, CPU Y did the mmiowb, that would not work. > > Hmmm, > > Then it's less bad than I thought - my apologies for the confusion. > > Would we be able to use Ben's trick of setting a per cpu flag in > writel() then and checking that in spin unlock issuing the mmiowb() > there if needed? Yes, that should work fine. You will get more mmiowb's than you need, since some drivers, such as Fusion, don't need them. On the Origins (older SGI MIPS-based Numa), the 'sync' instruction had the same effect as mmiowb() with respect to mmio write ordering, and it was issued unconditionally in the spin unlock. It was cheaper than mmiowb, however. If it matters, we could invent and use writel_relaxed() to get performance back in drivers we care about.... jeremy ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-02 9:56 ` Jes Sorensen 2008-06-02 21:02 ` Jeremy Higdon @ 2008-06-03 4:33 ` Nick Piggin 2008-06-03 8:15 ` Jeremy Higdon 2008-06-03 16:52 ` Jesse Barnes 1 sibling, 2 replies; 160+ messages in thread From: Nick Piggin @ 2008-06-03 4:33 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Monday 02 June 2008 19:56, Jes Sorensen wrote: > Jeremy Higdon wrote: > > We don't actually have that problem on the Altix. All writes issued > > by CPU X will be ordered with respect to each other. But writes by > > CPU X and CPU Y will not be, unless an mmiowb() is done by the > > original CPU before the second CPU writes. I.e. > > > > CPU X writel > > CPU X writel > > CPU X mmiowb > > > > CPU Y writel > > ... > > > > Note that this implies some sort of locking. Also note that if in > > the above, CPU Y did the mmiowb, that would not work. > > Hmmm, > > Then it's less bad than I thought - my apologies for the confusion. > > Would we be able to use Ben's trick of setting a per cpu flag in > writel() then and checking that in spin unlock issuing the mmiowb() > there if needed? Yes you could, but your writels would still not be strongly ordered within (or outside) spinlock regions, which is what Linus wants (and I kind of agree with). This comes back to my posting about mmiowb and io_*mb barriers etc. Despite what you say, what you've done really _does_ change the semantics of wmb() for all drivers. It is a really sad situation we've got ourselves into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit of work upfront :( (this is not just the sgi folk with mmiowb I'm talking about, but the whole random undefinedness of ordering and io barriers). The right way to make any change is never to weaken the postcondition of an existing interface *unless* you are willing to audit the entire tree and fix it. Impossible for drivers, so the correct thing to do is introduce a new interface, and move things over at an easier pace. Not rocket science. The argument that "Altix only uses a few drivers so this way we can just fix these up rather than make big modifications to large numbers of drivers" is bogus. It is far worse even for Altix if you make incompatible changes, because you first *break* every driver on your platform, then you have to audit and fix them. If you make compatible changes, then you have to do exactly the same audits to move them over to the new API, but you go from slower->faster rather than broken->non broken. As a bonus, you haven't got random broken stuff all over the tree that you forgot to audit. I don't know how there is still so much debate about this :( I have a proposal: I am a neutral party here, not being an arch maintainer, so I'll take input and write up a backward compatible API specification and force everybody to conform to it ;) ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 4:33 ` Nick Piggin @ 2008-06-03 8:15 ` Jeremy Higdon 2008-06-03 8:19 ` Nick Piggin 2008-06-03 16:52 ` Jesse Barnes 1 sibling, 1 reply; 160+ messages in thread From: Jeremy Higdon @ 2008-06-03 8:15 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote: > On Monday 02 June 2008 19:56, Jes Sorensen wrote: > > Jeremy Higdon wrote: > > > We don't actually have that problem on the Altix. All writes issued > > > by CPU X will be ordered with respect to each other. But writes by > > > CPU X and CPU Y will not be, unless an mmiowb() is done by the > > > original CPU before the second CPU writes. I.e. > > > > > > CPU X writel > > > CPU X writel > > > CPU X mmiowb > > > > > > CPU Y writel > > > ... > > > > > > Note that this implies some sort of locking. Also note that if in > > > the above, CPU Y did the mmiowb, that would not work. > > > > Hmmm, > > > > Then it's less bad than I thought - my apologies for the confusion. > > > > Would we be able to use Ben's trick of setting a per cpu flag in > > writel() then and checking that in spin unlock issuing the mmiowb() > > there if needed? > > Yes you could, but your writels would still not be strongly ordered > within (or outside) spinlock regions, which is what Linus wants (and > I kind of agree with). Yes they would be. Writes from the same CPU are always ordered. Writes from different CPUs are not, but that's only a concern if you protect writing via some sort of lock. If the lock release forces a barrier, that should take care of the problem. jeremy ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 8:15 ` Jeremy Higdon @ 2008-06-03 8:19 ` Nick Piggin 2008-06-03 8:45 ` Jeremy Higdon 0 siblings, 1 reply; 160+ messages in thread From: Nick Piggin @ 2008-06-03 8:19 UTC (permalink / raw) To: Jeremy Higdon Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote: > On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote: > > On Monday 02 June 2008 19:56, Jes Sorensen wrote: > > > Would we be able to use Ben's trick of setting a per cpu flag in > > > writel() then and checking that in spin unlock issuing the mmiowb() > > > there if needed? > > > > Yes you could, but your writels would still not be strongly ordered > > within (or outside) spinlock regions, which is what Linus wants (and > > I kind of agree with). > > Yes they would be. Writes from the same CPU are always ordered. Writes > from different CPUs are not, but that's only a concern if you protect They are not strongly ordered WRT writes to cacheable memory. If they were, then they would not leak out of spinlocks. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 8:19 ` Nick Piggin @ 2008-06-03 8:45 ` Jeremy Higdon 0 siblings, 0 replies; 160+ messages in thread From: Jeremy Higdon @ 2008-06-03 8:45 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Tue, Jun 03, 2008 at 06:19:05PM +1000, Nick Piggin wrote: > On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote: > > On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote: > > > On Monday 02 June 2008 19:56, Jes Sorensen wrote: > > > > Would we be able to use Ben's trick of setting a per cpu flag in > > > > writel() then and checking that in spin unlock issuing the mmiowb() > > > > there if needed? > > > > > > Yes you could, but your writels would still not be strongly ordered > > > within (or outside) spinlock regions, which is what Linus wants (and > > > I kind of agree with). > > > > Yes they would be. Writes from the same CPU are always ordered. Writes > > from different CPUs are not, but that's only a concern if you protect > > They are not strongly ordered WRT writes to cacheable memory. If they > were, then they would not leak out of spinlocks. No posted writes are. As I recall, the outX functions are not supposed to be posted, so the sn2 versions issue a mmiowb in the outX. But writeX has always been posted, or at least postable. I thought that was generally accepted. Normally, the only way for a device to see cacheable memory is via a DMA read, and you are guaranteed on sn2 that in the following: store of of A to location X mmio write to device device DMA read from location X that the device will see A. In the past, on some other archs, you'd have to flush the Dcache for that to work. Granted, if the compiler reorders the store and the mmio write, then you have a problem. jeremy ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 4:33 ` Nick Piggin 2008-06-03 8:15 ` Jeremy Higdon @ 2008-06-03 16:52 ` Jesse Barnes 2008-06-05 8:40 ` Jes Sorensen 1 sibling, 1 reply; 160+ messages in thread From: Jesse Barnes @ 2008-06-03 16:52 UTC (permalink / raw) To: Nick Piggin Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Monday, June 02, 2008 9:33 pm Nick Piggin wrote: > On Monday 02 June 2008 19:56, Jes Sorensen wrote: > > Jeremy Higdon wrote: > > > We don't actually have that problem on the Altix. All writes issued > > > by CPU X will be ordered with respect to each other. But writes by > > > CPU X and CPU Y will not be, unless an mmiowb() is done by the > > > original CPU before the second CPU writes. I.e. > > > > > > CPU X writel > > > CPU X writel > > > CPU X mmiowb > > > > > > CPU Y writel > > > ... > > > > > > Note that this implies some sort of locking. Also note that if in > > > the above, CPU Y did the mmiowb, that would not work. > > > > Hmmm, > > > > Then it's less bad than I thought - my apologies for the confusion. > > > > Would we be able to use Ben's trick of setting a per cpu flag in > > writel() then and checking that in spin unlock issuing the mmiowb() > > there if needed? > > Yes you could, but your writels would still not be strongly ordered > within (or outside) spinlock regions, which is what Linus wants (and > I kind of agree with). I think you mean wrt cacheable memory accesses here (though iirc on ia64 spin_unlock has release semantics, so at least it'll barrier other stores). > This comes back to my posting about mmiowb and io_*mb barriers etc. > > Despite what you say, what you've done really _does_ change the semantics > of wmb() for all drivers. It is a really sad situation we've got ourselves > into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit > of work upfront :( (this is not just the sgi folk with mmiowb I'm talking > about, but the whole random undefinedness of ordering and io barriers). > > The right way to make any change is never to weaken the postcondition of > an existing interface *unless* you are willing to audit the entire tree > and fix it. Impossible for drivers, so the correct thing to do is introduce > a new interface, and move things over at an easier pace. Not rocket > science. Well, given how undefined things have been in the past, each arch has had to figure out what things mean (based on looking at drivers & core code) then come up with appropriate primitives. On Altix, we went both directions: we made regular PIO reads (readX etc.) *very* expensive to preserve compatibility with what existing drivers expect, and added a readX_relaxed to give a big performance boost to tuned drivers. OTOH, given that posted PCI writes were nothing new to Linux, but the Altix network topology was, we introduced mmiowb() (with lots of discussion I might add), which has clear and relatively simple usage guidelines. Now, in hindsight, using a PIO write set & test flag approach in writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc that never came up in the discussion, probably because we were focused on PCI posting and not uncached vs. cached ordering. > The argument that "Altix only uses a few drivers so this way we can just > fix these up rather than make big modifications to large numbers of > drivers" is bogus. It is far worse even for Altix if you make incompatible > changes, because you first *break* every driver on your platform, then you > have to audit and fix them. If you make compatible changes, then you have > to do exactly the same audits to move them over to the new API, but you go > from slower->faster rather than broken->non broken. As a bonus, you haven't > got random broken stuff all over the tree that you forgot to audit. I agree, but afaik the only change Altix ended up forcing on people was mmiowb(), but that turned out to be necessary on mips64 (and maybe some other platforms?) anyway. > I don't know how there is still so much debate about this :( > > I have a proposal: I am a neutral party here, not being an arch maintainer, > so I'll take input and write up a backward compatible API specification > and force everybody to conform to it ;) Aside from the obvious performance impact of making all the readX/writeX routines strongly ordered, both in terms of PCI posting and cacheable vs. uncacheable accesses, it also makes things inconsistent. Both core code & drivers will still have to worry about regular, cacheable memory barriers for correctness, but it looks like you're proposing that they not have to think about I/O ordering. At any rate, I don't think anyone would argue against defining the ordering semantics of all of these routines (btw you should also include ordering wrt DMA & PCI posting); the question is what's the best balance between keeping the driver requirements simple and the performance cost on complex arches. Jesse ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-03 16:52 ` Jesse Barnes @ 2008-06-05 8:40 ` Jes Sorensen 2008-06-05 8:43 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: Jes Sorensen @ 2008-06-05 8:40 UTC (permalink / raw) To: Jesse Barnes Cc: linux-arch, Nick Piggin, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Jesse Barnes wrote: > Now, in hindsight, using a PIO write set & test flag approach in > writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc > that never came up in the discussion, probably because we were focused on PCI > posting and not uncached vs. cached ordering. Hi Jesse, I am going to take a stab at implementing this so we can see how much of an impact it will have. Cheers, Jes ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-05 8:40 ` Jes Sorensen @ 2008-06-05 8:43 ` Benjamin Herrenschmidt 2008-06-12 15:07 ` Matthew Wilcox 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-05 8:43 UTC (permalink / raw) To: Jes Sorensen Cc: linux-arch, Nick Piggin, Roland Dreier, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, Jesse Barnes, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thu, 2008-06-05 at 10:40 +0200, Jes Sorensen wrote: > Jesse Barnes wrote: > > Now, in hindsight, using a PIO write set & test flag approach in > > writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc > > that never came up in the discussion, probably because we were focused on PCI > > posting and not uncached vs. cached ordering. > > Hi Jesse, > > I am going to take a stab at implementing this so we can see how much > of an impact it will have. Note that the powerpc implementation currently clears the flag on spin_lock and tests it on unlock. We are considering changing that to not touch the flag on spin_lock and just clear it whenever we do a sync (ie, on unlock, on explicit mmiowb, and possibly even on readl's where we happen to do sync's). Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-05 8:43 ` Benjamin Herrenschmidt @ 2008-06-12 15:07 ` Matthew Wilcox 2008-06-13 0:07 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: Matthew Wilcox @ 2008-06-12 15:07 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-arch, Nick Piggin, Roland Dreier, Jes Sorensen, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, Jesse Barnes, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote: > Note that the powerpc implementation currently clears the flag > on spin_lock and tests it on unlock. We are considering changing > that to not touch the flag on spin_lock and just clear it whenever > we do a sync (ie, on unlock, on explicit mmiowb, and possibly even > on readl's where we happen to do sync's). Your current scheme sounds like it's broken for spin_lock(a) writel(); spin_lock(b); spin_unlock(b); spin_unlock(a); -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-06-12 15:07 ` Matthew Wilcox @ 2008-06-13 0:07 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-06-13 0:07 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-arch, Nick Piggin, Roland Dreier, Jes Sorensen, linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev, Jesse Barnes, scottwood, torvalds, tpiepho, alan, Arjan van de Ven On Thu, 2008-06-12 at 09:07 -0600, Matthew Wilcox wrote: > On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote: > > Note that the powerpc implementation currently clears the flag > > on spin_lock and tests it on unlock. We are considering changing > > that to not touch the flag on spin_lock and just clear it whenever > > we do a sync (ie, on unlock, on explicit mmiowb, and possibly even > > on readl's where we happen to do sync's). > > Your current scheme sounds like it's broken for > > spin_lock(a) > writel(); > spin_lock(b); > spin_unlock(b); > spin_unlock(a); Which is why we are considering changing it :-) But as Paulus said before, he did some measurement and we came to the conclusion that (pending more measurements on a wider range of HW) we may as well drop the whole scheme and make writel fully synchronous instead. Then, we can get some nice weakly ordered accessors and start adding them with appropriate explicit barriers to the hot path of perf. critical drivers we care about. Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 15:50 ` Roland Dreier 2008-05-27 16:37 ` James Bottomley 2008-05-27 21:11 ` Benjamin Herrenschmidt @ 2008-05-31 8:04 ` Pavel Machek 2 siblings, 0 replies; 160+ messages in thread From: Pavel Machek @ 2008-05-31 8:04 UTC (permalink / raw) To: Roland Dreier Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan, Arjan van de Ven Hi! > > Though it's my understanding that at least ia64 does require the > > explicit barriers anyway, so we are still in a dodgy situation here > > where it's not clear what drivers should do and we end up with > > possibly excessive barriers on powerpc where I end up with both > > the wmb/rmb/mb that were added for ia64 -and- the ones I have in > > readl/writel to make them look synchronous... Not nice. > > ia64 is a disaster with a slightly different ordering problem -- the > mmiowb() issue. I know Ben knows far too much about this, but for big > SGI boxes, you sometimes need mmiowb() to avoid problems with driver > code that does totally sane stuff like > > spin_lock(&mmio_lock); > writel(val1, reg1); > writel(val2, reg2); > spin_unlock(&mmio_lock); > > If that snippet is called on two CPUs at the same time, then the device > might see a sequence like > > CPU1 -- write reg1 > CPU2 -- write reg1 > CPU1 -- write reg2 > CPU2 -- write reg2 > > in spite of the fact that everything is totally ordered on the CPUs by > the spin lock. > > The reason this is such a disaster is because the code looks right, > makes sense, and works fine on 99.99% of all systems out there. So I > would bet that 99% of our drivers have missing mmiowb() "bugs" -- no one > has plugged the hardware into an Altix box and cared enough to stress > test it. Yes, ia64 is a disaster, and needs its spinlock implementation fixed :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 2:28 ` David Miller 2008-05-27 3:39 ` Benjamin Herrenschmidt 2008-05-27 3:42 ` Arjan van de Ven @ 2008-05-27 8:24 ` Alan Cox 2 siblings, 0 replies; 160+ messages in thread From: Alan Cox @ 2008-05-27 8:24 UTC (permalink / raw) To: David Miller Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds, tpiepho > It's just another complicated thing for driver authors to get wrong. > The other side of the coin is of course the cost. > > The only thing I am absolutely sure of is that we should make a > decision fast, document it, and just stick to it. It's unfortunate that the __read/___write versions have other different semantics. The default should definitely be ordered but having un-ordered ones as an option would be good for zapping the hot paths that matter. Most I/O paths aren't even worth the thinking time for non-ordered I/O. Alan ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: MMIO and gcc re-ordering issue 2008-05-27 1:33 ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt 2008-05-27 1:40 ` David Miller @ 2008-05-27 15:28 ` Jonathan Corbet 1 sibling, 0 replies; 160+ messages in thread From: Jonathan Corbet @ 2008-05-27 15:28 UTC (permalink / raw) To: benh Cc: Linux Arch list, linux-kernel, Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan On Tue, 27 May 2008 11:33:46 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > However, I'm using that as an excuse to bring back my pet subject, > which is basically, should we instead just finally mandate the use of > explicit rmb/wmb/mb's (which boils down to barrier() on x86) to > drivers who want to order memory consistent accesses vs. MMIO ? To me this seems like asking for a long series of weird bugs in drivers. Is the cost of a barrier in an MMIO operation so high that we can't afford to just put them in? Alan's suggestion of no-barrier versions for people who know what they're doing (and who think there's a reason to care) seems like a better way to go to me. jon ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:35 ` Scott Wood 2008-05-20 22:39 ` David Miller @ 2008-05-20 22:55 ` Trent Piepho 2008-05-21 14:01 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-20 22:55 UTC (permalink / raw) To: Scott Wood; +Cc: linux-kernel, Alan Cox, linuxppc-dev On Tue, 20 May 2008, Scott Wood wrote: > Alan Cox wrote: >> > It looks like we rely on -fno-strict-aliasing to prevent reordering >> > ordinary memory accesses (such as to DMA descriptors) past the I/O >> >> DMA descriptors in main memory are dependant on cache behaviour anyway >> and the dma_* operators should be the ones enforcing the needed behaviour. > > What about memory obtained from dma_alloc_coherent()? We still need a sync > and a compiler barrier. The current I/O accessors have the former, but not > the latter. There doesn't appear to be any barriers to use for coherent dma other than mb() and wmb(). Correct me if I'm wrong, but I think the sync isn't actually _required_ (by memory-barriers.txt's definitions), and it would be enough to use eieio, except there is code that doesn't use mmiowb() between I/O access and unlocking. So, as I understand it, the minimum needed is eieio. To provide strict ordering w.r.t. spin locks without using mmiowb(), you need sync. To provide strict ordering w.r.t. normal memory, you need sync and a compiler barrier. Right now no archs provide the last option. powerpc is currently the middle option. I don't know if anything uses the first option, maybe alpha? I'm almost certain x86 is the middle option (the first isn't possible, the arch already has more ordering than that), which is probably why powerpc used that option and not the first. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:55 ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho @ 2008-05-21 14:01 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-21 14:01 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel, Alan Cox On Tue, 2008-05-20 at 15:55 -0700, Trent Piepho wrote: > here doesn't appear to be any barriers to use for coherent dma other than > mb() and wmb(). > > Correct me if I'm wrong, but I think the sync isn't actually _required_ (by > memory-barriers.txt's definitions), and it would be enough to use eieio, > except there is code that doesn't use mmiowb() between I/O access and > unlocking. > > So, as I understand it, the minimum needed is eieio. To provide strict > ordering w.r.t. spin locks without using mmiowb(), you need sync. To provide > strict ordering w.r.t. normal memory, you need sync and a compiler barrier. > > Right now no archs provide the last option. powerpc is currently the middle > option. I don't know if anything uses the first option, maybe alpha? I'm > almost certain x86 is the middle option (the first isn't possible, the arch > already has more ordering than that), which is probably why powerpc used that > option and not the first. I don't have time for that now. Can you dig into the archives ? The whole thing has been discussed at lenght already. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 21:16 ` Benjamin Herrenschmidt 2008-05-20 21:38 ` Scott Wood @ 2008-05-20 22:00 ` Trent Piepho 2008-05-21 14:00 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-20 22:00 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Tue, 20 May 2008, Benjamin Herrenschmidt wrote: > On Tue, 2008-05-20 at 13:40 -0700, Trent Piepho wrote: >> There was some discussion on a Freescale list if the powerpc I/O accessors >> should be strictly ordered w.r.t. normal memory. Currently they are not. It >> does not appear as if any other architecture's I/O accessors are strictly >> ordered in this manner. memory-barriers.txt explicitly states that the I/O >> space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory >> accesses and it's implied the other I/O accessors (e.g., writel) are the same. >> >> However, it is somewhat harder to program for this model, and there are almost >> certainly a number of drivers using coherent DMA which have subtle bugs because >> the do not include the necessary barriers. >> >> But clearly and change to this would be a subject for a different patch. > > The current accessors should provide all the necessary ordering > guarantees... Depends on what you define as "necessary". It's seem clear that I/O accessors _no not_ need to be strictly ordered with respect to normal memory accesses, by what's defined in memory-barriers.txt. So if by "necessary" you mean what the Linux standard for I/O accessors requires (and what other archs provide), then yes, they have the necessary ordering guarantees. But, if you want them to be strictly ordered w.r.t to normal memory, that's not the case. For example, in something like: u32 *dmabuf = kmalloc(...); ... dmabuf[0] = 1; out_be32(®s->dmactl, DMA_SEND_BUFFER); dmabuf[0] = 2; out_be32(®s->dmactl, DMA_SEND_BUFFER); gcc might decide to optimize this code to: out_be32(®s->dmactl, DMA_SEND_BUFFER); out_be32(®s->dmactl, DMA_SEND_BUFFER); dmabuf[0] = 2; gcc will often not do this optimization, because there might be aliasing between "®s->dmact" and "dmabuf", but it _can_ do it. gcc can't optimize the two identical out_be32's into one, or re-order them if they were to different registers, but it can move the normal memory accesses around them. Here's a quick hack I stuck in a driver to test. compile with -save-temps and check the resulting asm. gcc will do the optimization I described above. static void __iomem *baz = (void*)0x1234; static struct bar { u32 bar[256]; } bar; void foo(void) { bar.bar[0] = 44; out_be32(baz+100, 200); bar.bar[0] = 45; out_be32(baz+101, 201); } ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:00 ` Trent Piepho @ 2008-05-21 14:00 ` Benjamin Herrenschmidt 2008-05-21 19:44 ` Trent Piepho 0 siblings, 1 reply; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-21 14:00 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel > Depends on what you define as "necessary". It's seem clear that I/O accessors > _no not_ need to be strictly ordered with respect to normal memory accesses, > by what's defined in memory-barriers.txt. So if by "necessary" you mean what > the Linux standard for I/O accessors requires (and what other archs provide), > then yes, they have the necessary ordering guarantees. > > But, if you want them to be strictly ordered w.r.t to normal memory, that's > not the case. They should be. > For example, in something like: > > u32 *dmabuf = kmalloc(...); > ... > dmabuf[0] = 1; > out_be32(®s->dmactl, DMA_SEND_BUFFER); > dmabuf[0] = 2; > out_be32(®s->dmactl, DMA_SEND_BUFFER); > > gcc might decide to optimize this code to: > > out_be32(®s->dmactl, DMA_SEND_BUFFER); > out_be32(®s->dmactl, DMA_SEND_BUFFER); > dmabuf[0] = 2; If that's the case, there is a bug. Ignoring gcc possible optimisations, the accessors contain the necessary memory barriers for things to work the way you describe above. If the use of volatile and clobber in our macros isn't enough to also prevent optimisations, then we have a bug and you are welcome to provide a patch to fix it. > gcc will often not do this optimization, because there might be aliasing > between "®s->dmact" and "dmabuf", but it _can_ do it. gcc can't optimize > the two identical out_be32's into one, or re-order them if they were to > different registers, but it can move the normal memory accesses around them. The linus kernel -cannot- be compiled with strict aliasing rules. This is one of the many areas where those are violated. Frankly, this strict aliasing stuff is just a total nightmare turning a pefectly nice and useable language into something it's not meant to be. > Here's a quick hack I stuck in a driver to test. compile with -save-temps and > check the resulting asm. gcc will do the optimization I described above. > > static void __iomem *baz = (void*)0x1234; > static struct bar { > u32 bar[256]; > } bar; > > void foo(void) { > bar.bar[0] = 44; > out_be32(baz+100, 200); > bar.bar[0] = 45; > out_be32(baz+101, 201); > } Have you removed -fno-strict-aliasing ? Just don't do that. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-21 14:00 ` Benjamin Herrenschmidt @ 2008-05-21 19:44 ` Trent Piepho 2008-05-21 20:41 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-21 19:44 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Wed, 21 May 2008, Benjamin Herrenschmidt wrote: >> Depends on what you define as "necessary". It's seem clear that I/O accessors >> _no not_ need to be strictly ordered with respect to normal memory accesses, >> by what's defined in memory-barriers.txt. So if by "necessary" you mean what >> the Linux standard for I/O accessors requires (and what other archs provide), >> then yes, they have the necessary ordering guarantees. >> >> But, if you want them to be strictly ordered w.r.t to normal memory, that's >> not the case. > > They should be. Someone should update memory-barriers.txt, because it doesn't say that, and all I/O accessors for all the arches, because none of them are. >> Here's a quick hack I stuck in a driver to test. compile with -save-temps and >> check the resulting asm. gcc will do the optimization I described above. >> >> static void __iomem *baz = (void*)0x1234; >> static struct bar { >> u32 bar[256]; >> } bar; >> >> void foo(void) { >> bar.bar[0] = 44; >> out_be32(baz+100, 200); >> bar.bar[0] = 45; >> out_be32(baz+101, 201); >> } > > Have you removed -fno-strict-aliasing ? Just don't do that. No, it's compiled with a normal kernel build, which includes -fno-strict-aliasing. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-21 19:44 ` Trent Piepho @ 2008-05-21 20:41 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2008-05-21 20:41 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Wed, 2008-05-21 at 12:44 -0700, Trent Piepho wrote: > > Someone should update memory-barriers.txt, because it doesn't say > that, and > all I/O accessors for all the arches, because none of them are. There have been long discussions about that. The end result was that being too weakly ordered is just asking for trouble because the majority of drivers are written & tested on x86 which is in order. If you look at our accessors, minus that gcc problem you found, the barriers in there should pretty much guarantee ordering in the cases that matter, which are basically MMIO read followed by memory accesses and memory writes followed by MMIO. In fact, MMIO read are fully sychronous. > No, it's compiled with a normal kernel build, which includes > -fno-strict-aliasing Ok, so there is a very bad bug indeed, we need to fix that. Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 20:40 [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho 2008-05-20 21:16 ` Benjamin Herrenschmidt @ 2008-05-20 22:00 ` Andreas Schwab 2008-05-20 22:11 ` Trent Piepho 1 sibling, 1 reply; 160+ messages in thread From: Andreas Schwab @ 2008-05-20 22:00 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel Trent Piepho <tpiepho@freescale.com> writes: > For the LE versions, eventually they boil down to an asm that will look > something like this: > asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr)); > > While not perfect, this appears to be the best one can do. The issue is > that the "stwbrx" instruction only comes in an indexed, or 'x', version, in > which the address is represented by the sum of two registers (the "0,%2"). > Unfortunately, gcc doesn't have a constraint for an indexed memory > reference. There is the "Z" constraint, which matches either an indirect or an indexed memory address. That should fit here. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:00 ` Andreas Schwab @ 2008-05-20 22:11 ` Trent Piepho 2008-05-20 22:47 ` Andreas Schwab 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-20 22:11 UTC (permalink / raw) To: Andreas Schwab; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Wed, 21 May 2008, Andreas Schwab wrote: > Trent Piepho <tpiepho@freescale.com> writes: > >> For the LE versions, eventually they boil down to an asm that will look >> something like this: >> asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr)); >> >> While not perfect, this appears to be the best one can do. The issue is >> that the "stwbrx" instruction only comes in an indexed, or 'x', version, in >> which the address is represented by the sum of two registers (the "0,%2"). >> Unfortunately, gcc doesn't have a constraint for an indexed memory >> reference. > > There is the "Z" constraint, which matches either an indirect or an > indexed memory address. That should fit here. This came up on the Freescale list. I should have put what I wrote there into my patch descrition: It's the _le versions that have a problem, since we can't get gcc to just use the register indexed mode. It seems like an obvious thing to have a constraint for, but I guess there weren't enough instructions that only come in 'x' versions to bother with it. There is a 'Z' constraint, "Memory operand that is an indexed or indirect from a register", but I tried it and it can use both "rb,ri" and "disp(rb)" forms. Actually, I'm not sure how 'Z' is any different than "m"? ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:11 ` Trent Piepho @ 2008-05-20 22:47 ` Andreas Schwab 2008-05-20 23:14 ` Trent Piepho 0 siblings, 1 reply; 160+ messages in thread From: Andreas Schwab @ 2008-05-20 22:47 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel Trent Piepho <tpiepho@freescale.com> writes: > It's the _le versions that have a problem, since we can't get gcc to just use > the register indexed mode. It seems like an obvious thing to have a > constraint for, but I guess there weren't enough instructions that only come > in 'x' versions to bother with it. There is a 'Z' constraint, "Memory operand > that is an indexed or indirect from a register", but I tried it and it can use > both "rb,ri" and "disp(rb)" forms. Actually, I'm not sure how 'Z' is any > different than "m"? 'Z' will never emit a non-zero constant displacement. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 22:47 ` Andreas Schwab @ 2008-05-20 23:14 ` Trent Piepho 2008-05-21 8:03 ` Andreas Schwab 0 siblings, 1 reply; 160+ messages in thread From: Trent Piepho @ 2008-05-20 23:14 UTC (permalink / raw) To: Andreas Schwab; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Wed, 21 May 2008, Andreas Schwab wrote: > Trent Piepho <tpiepho@freescale.com> writes: > >> It's the _le versions that have a problem, since we can't get gcc to just use >> the register indexed mode. It seems like an obvious thing to have a >> constraint for, but I guess there weren't enough instructions that only come >> in 'x' versions to bother with it. There is a 'Z' constraint, "Memory operand >> that is an indexed or indirect from a register", but I tried it and it can use >> both "rb,ri" and "disp(rb)" forms. Actually, I'm not sure how 'Z' is any >> different than "m"? > > 'Z' will never emit a non-zero constant displacement. It's too bad gas doesn't appear to be smart enough to turn: stwbrx 0, 0(3) -or- stwbr 0, 0(3) into the desired: stwbrx 0, 0, 3 ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-20 23:14 ` Trent Piepho @ 2008-05-21 8:03 ` Andreas Schwab 2008-05-21 20:25 ` Trent Piepho 2008-05-27 23:48 ` [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() " Trent Piepho 0 siblings, 2 replies; 160+ messages in thread From: Andreas Schwab @ 2008-05-21 8:03 UTC (permalink / raw) To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel Trent Piepho <tpiepho@freescale.com> writes: > On Wed, 21 May 2008, Andreas Schwab wrote: >> Trent Piepho <tpiepho@freescale.com> writes: >> >>> It's the _le versions that have a problem, since we can't get gcc to just use >>> the register indexed mode. It seems like an obvious thing to have a >>> constraint for, but I guess there weren't enough instructions that only come >>> in 'x' versions to bother with it. There is a 'Z' constraint, "Memory operand >>> that is an indexed or indirect from a register", but I tried it and it can use >>> both "rb,ri" and "disp(rb)" forms. Actually, I'm not sure how 'Z' is any >>> different than "m"? >> >> 'Z' will never emit a non-zero constant displacement. > > It's too bad gas doesn't appear to be smart enough to turn: > stwbrx 0, 0(3) -or- stwbr 0, 0(3) Use the %y modifier when substituting the operand. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code 2008-05-21 8:03 ` Andreas Schwab @ 2008-05-21 20:25 ` Trent Piepho 2008-05-27 23:48 ` [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() " Trent Piepho 1 sibling, 0 replies; 160+ messages in thread From: Trent Piepho @ 2008-05-21 20:25 UTC (permalink / raw) To: Andreas Schwab; +Cc: Scott Wood, linuxppc-dev, linux-kernel On Wed, 21 May 2008, Andreas Schwab wrote: > Trent Piepho <tpiepho@freescale.com> writes: >> On Wed, 21 May 2008, Andreas Schwab wrote: >>> Trent Piepho <tpiepho@freescale.com> writes: >>>> It's the _le versions that have a problem, since we can't get gcc to just use >>>> the register indexed mode. It seems like an obvious thing to have a >>>> constraint for, but I guess there weren't enough instructions that only come >>>> in 'x' versions to bother with it. There is a 'Z' constraint, "Memory operand >>>> that is an indexed or indirect from a register", but I tried it and it can use >>>> both "rb,ri" and "disp(rb)" forms. Actually, I'm not sure how 'Z' is any >>>> different than "m"? >>> >>> 'Z' will never emit a non-zero constant displacement. >> >> It's too bad gas doesn't appear to be smart enough to turn: >> stwbrx 0, 0(3) -or- stwbr 0, 0(3) > > Use the %y modifier when substituting the operand. Of course, the undocumented y modifier! "Print AltiVec or SPE memory operand" Why didn't I think of that? That appears to work. I can get gcc to to emit 0,reg and reg,reg but not disp(reg). It won't try to use an "update" address either, though it will with an "m" constraint. But, gcc 4.0.2 can't handle the 'Z' constraint. It looks like it's not supported. Other than this, I can build a kernel with 4.0.2 that appears to work. Is it ok to break compatibility with 4.0.2, or should I put in a gcc version check? ^ permalink raw reply [flat|nested] 160+ messages in thread
* [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() asm code 2008-05-21 8:03 ` Andreas Schwab 2008-05-21 20:25 ` Trent Piepho @ 2008-05-27 23:48 ` Trent Piepho 1 sibling, 0 replies; 160+ messages in thread From: Trent Piepho @ 2008-05-27 23:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Trent Piepho Since commit 4cb3cee03d558fd457cb58f56c80a2a09a66110c the code generated for the in_beXX() and out_beXX() mmio functions has been sub-optimal. The out_leXX() family of functions are created with the macro DEF_MMIO_OUT_LE() while the out_beXX() family are created with DEF_MMIO_OUT_BE(). In what was perhaps a bit too much macro use, both of these macros are in turn created via the macro DEF_MMIO_OUT(). For the LE versions, eventually they boil down to an asm that will look something like this: asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr)); The issue is that the "stwbrx" instruction only comes in an indexed, or 'x', version, in which the address is represented by the sum of two registers (the "0,%2"). Unfortunately, gcc doesn't have a constraint for an indexed memory reference. The "m" constraint allows both indexed and offset, i.e. register plus constant, memory references and there is no "stwbr" version for offset references. "m" also allows updating addresses and there is no 'u' version of "stwbrx" like there is with "stwux". The unused first operand to the asm is just to tell gcc that *addr is an output of the asm. The address used is passed in a single register via the third asm operand, and the index register is just hard coded as 0. This means gcc is forced to put the address in a single register and can't use index addressing, e.g. if one has the data in register 9, a base address in register 3 and an index in register 4, gcc must emit code like "add 11,4,3; stwbrx 9,0,11" instead of just "stwbrx 9,4,3". This costs an extra add instruction and another register. For gcc 4.0 and older, there doesn't appear to be anything that can be done. But for 4.1 and newer, there is a 'Z' constraint. It does not allow "updating" addresses, but does allow both indexed and offset addresses. However, the only allowed constant offset is 0. We can then use the undocumented 'y' operand modifier, which causes gcc to convert "0(reg)" into the equivilient "0,reg" format that can be used with stwbrx. This brings us the to problem with the BE version. In this case, the "stw" instruction does have both indexed and non-indexed versions. The final asm ends up looking like this: asm("sync; stw%U0%X0 %1,%0" : "=m" (*addr) : "r" (val), "r" (addr)); The undocumented codes "%U0" and "%0X" will generate a 'u' if the memory reference should be an auto-updating one, and an 'x' if the memory reference is indexed, respectively. The third operand is unused, it's just there because asm the code is reused from the LE version. However, gcc does not know this, and generates unnecessary code to stick addr in a register! To use the example from the LE version, gcc will generate "add 11,4,3; stwx 9,4,3". It is able to use the indexed address "4,3" for the "stwx", but still thinks it needs to put 4+3 into register 11, which will never be used. This also ends up happening a lot for the offset addressing mode, where common code like this: out_be32(&device_registers->some_register, data); uses an instruction like "stw 9, 42(3)", where register 3 has the pointer device_registers and 42 is the offset of some_register in that structure. gcc will be forced to generate the unnecessary instruction "addi 11, 3, 42" to put the address into a single (unused) register. The in_* versions end up having these exact same problems as well. Signed-off-by: Trent Piepho <tpiepho@freescale.com> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Andreas Schwab <schwab@suse.de> --- This version uses the Andreas's suggestions about how to use the "Z" constraint and "y" modifier to improve the LE version as well. include/asm-powerpc/io.h | 63 ++++++++++++++++++++++++++++++++------------- 1 files changed, 45 insertions(+), 18 deletions(-) diff --git a/include/asm-powerpc/io.h b/include/asm-powerpc/io.h index e0062d7..a2547ac 100644 --- a/include/asm-powerpc/io.h +++ b/include/asm-powerpc/io.h @@ -95,33 +95,60 @@ extern resource_size_t isa_mem_base; #define IO_SET_SYNC_FLAG() #endif -#define DEF_MMIO_IN(name, type, insn) \ -static inline type name(const volatile type __iomem *addr) \ +/* gcc 4.0 and older doesn't have 'Z' constraint */ +#if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ == 0) +#define DEF_MMIO_IN_LE(name, size, insn) \ +static inline u##size name(const volatile u##size __iomem *addr) \ { \ - type ret; \ - __asm__ __volatile__("sync;" insn ";twi 0,%0,0;isync" \ - : "=r" (ret) : "r" (addr), "m" (*addr)); \ + u##size ret; \ + __asm__ __volatile__("sync;"#insn" %0,0,%1;twi 0,%0,0;isync" \ + : "=r" (ret) : "r" (addr), "m" (*addr)); \ return ret; \ } -#define DEF_MMIO_OUT(name, type, insn) \ -static inline void name(volatile type __iomem *addr, type val) \ +#define DEF_MMIO_OUT_LE(name, size, insn) \ +static inline void name(volatile u##size __iomem *addr, u##size val) \ { \ - __asm__ __volatile__("sync;" insn \ - : "=m" (*addr) : "r" (val), "r" (addr)); \ - IO_SET_SYNC_FLAG(); \ + __asm__ __volatile__("sync;"#insn" %1,0,%2" \ + : "=m" (*addr) : "r" (val), "r" (addr)); \ + IO_SET_SYNC_FLAG(); \ } +#else /* newer gcc */ +#define DEF_MMIO_IN_LE(name, size, insn) \ +static inline u##size name(const volatile u##size __iomem *addr) \ +{ \ + u##size ret; \ + __asm__ __volatile__("sync;"#insn" %0,%y1;twi 0,%0,0;isync" \ + : "=r" (ret) : "Z" (*addr)); \ + return ret; \ +} + +#define DEF_MMIO_OUT_LE(name, size, insn) \ +static inline void name(volatile u##size __iomem *addr, u##size val) \ +{ \ + __asm__ __volatile__("sync;"#insn" %1,%y0" \ + : "=Z" (*addr) : "r" (val)); \ + IO_SET_SYNC_FLAG(); \ +} +#endif +#define DEF_MMIO_IN_BE(name, size, insn) \ +static inline u##size name(const volatile u##size __iomem *addr) \ +{ \ + u##size ret; \ + __asm__ __volatile__("sync;"#insn"%U1%X1 %0,%1;twi 0,%0,0;isync"\ + : "=r" (ret) : "m" (*addr)); \ + return ret; \ +} -#define DEF_MMIO_IN_BE(name, size, insn) \ - DEF_MMIO_IN(name, u##size, __stringify(insn)"%U2%X2 %0,%2") -#define DEF_MMIO_IN_LE(name, size, insn) \ - DEF_MMIO_IN(name, u##size, __stringify(insn)" %0,0,%1") +#define DEF_MMIO_OUT_BE(name, size, insn) \ +static inline void name(volatile u##size __iomem *addr, u##size val) \ +{ \ + __asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0" \ + : "=m" (*addr) : "r" (val)); \ + IO_SET_SYNC_FLAG(); \ +} -#define DEF_MMIO_OUT_BE(name, size, insn) \ - DEF_MMIO_OUT(name, u##size, __stringify(insn)"%U0%X0 %1,%0") -#define DEF_MMIO_OUT_LE(name, size, insn) \ - DEF_MMIO_OUT(name, u##size, __stringify(insn)" %1,0,%2") DEF_MMIO_IN_BE(in_8, 8, lbz); DEF_MMIO_IN_BE(in_be16, 16, lhz); -- 1.5.4.1 ^ permalink raw reply related [flat|nested] 160+ messages in thread
end of thread, other threads:[~2008-06-13 0:08 UTC | newest] Thread overview: 160+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-05-20 20:40 [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho 2008-05-20 21:16 ` Benjamin Herrenschmidt 2008-05-20 21:38 ` Scott Wood 2008-05-20 22:02 ` Benjamin Herrenschmidt 2008-05-20 22:21 ` Trent Piepho 2008-05-20 22:15 ` Alan Cox 2008-05-20 22:35 ` Scott Wood 2008-05-20 22:39 ` David Miller 2008-05-20 22:43 ` Scott Wood 2008-05-20 22:53 ` David Miller 2008-05-23 4:24 ` Benjamin Herrenschmidt 2008-05-22 22:56 ` Trent Piepho 2008-05-23 12:36 ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt 2008-05-23 12:50 ` Benjamin Herrenschmidt 2008-05-23 21:14 ` Scott Wood 2008-05-23 22:47 ` Benjamin Herrenschmidt 2008-05-27 1:33 ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt 2008-05-27 1:40 ` David Miller 2008-05-27 2:15 ` Benjamin Herrenschmidt 2008-05-27 2:28 ` David Miller 2008-05-27 3:39 ` Benjamin Herrenschmidt 2008-05-27 15:35 ` Linus Torvalds 2008-05-27 16:47 ` Linus Torvalds 2008-05-27 17:31 ` Linus Torvalds 2008-06-02 10:36 ` Ingo Molnar 2008-06-02 21:53 ` Benjamin Herrenschmidt 2008-05-27 21:12 ` Benjamin Herrenschmidt 2008-05-27 18:23 ` Trent Piepho 2008-05-27 18:33 ` Scott Wood 2008-05-27 21:10 ` Benjamin Herrenschmidt 2008-05-27 21:30 ` Linus Torvalds 2008-05-27 21:38 ` Alan Cox 2008-05-27 21:53 ` Matthew Wilcox 2008-05-27 21:46 ` Alan Cox 2008-05-27 22:02 ` Linus Torvalds 2008-05-27 21:59 ` Linus Torvalds 2008-05-27 21:38 ` Benjamin Herrenschmidt 2008-05-27 21:42 ` Matthew Wilcox 2008-05-27 22:17 ` Benjamin Herrenschmidt 2008-05-28 8:36 ` Haavard Skinnemoen 2008-05-29 11:05 ` Pantelis Antoniou 2008-05-30 1:13 ` Benjamin Herrenschmidt 2008-05-30 6:07 ` Haavard Skinnemoen 2008-05-30 7:24 ` Benjamin Herrenschmidt 2008-05-30 8:27 ` Haavard Skinnemoen 2008-05-30 9:22 ` Geert Uytterhoeven 2008-06-02 8:11 ` Haavard Skinnemoen 2008-06-02 15:48 ` Scott Wood 2008-06-03 7:46 ` Haavard Skinnemoen 2008-06-04 15:31 ` Linus Torvalds 2008-05-27 21:55 ` Linus Torvalds 2008-05-27 22:19 ` Benjamin Herrenschmidt 2008-05-29 7:10 ` Arnd Bergmann 2008-05-29 10:46 ` Alan Cox 2008-06-02 7:24 ` Russell King 2008-06-03 4:16 ` Nick Piggin 2008-06-03 4:32 ` Benjamin Herrenschmidt 2008-06-03 6:11 ` Nick Piggin 2008-06-03 6:48 ` Benjamin Herrenschmidt 2008-06-03 6:53 ` Paul Mackerras 2008-06-03 7:18 ` Nick Piggin 2008-06-03 14:47 ` Linus Torvalds 2008-06-03 18:47 ` Trent Piepho 2008-06-03 18:55 ` Matthew Wilcox 2008-06-03 19:57 ` Trent Piepho 2008-06-03 21:35 ` Matthew Wilcox 2008-06-03 21:58 ` Trent Piepho 2008-06-04 2:00 ` Nick Piggin 2008-06-03 19:07 ` Linus Torvalds 2008-06-04 2:05 ` Nick Piggin 2008-06-04 2:46 ` Linus Torvalds 2008-06-04 11:47 ` Alan Cox 2008-06-10 6:56 ` Nick Piggin 2008-06-10 17:41 ` Jesse Barnes 2008-06-10 18:10 ` James Bottomley 2008-06-10 19:05 ` Roland Dreier 2008-06-10 19:19 ` Jesse Barnes 2008-06-11 3:29 ` Nick Piggin 2008-06-11 3:40 ` Benjamin Herrenschmidt 2008-06-11 4:06 ` Nick Piggin 2008-06-11 16:07 ` Jesse Barnes 2008-06-12 11:27 ` Nick Piggin 2008-06-11 4:18 ` Paul Mackerras 2008-06-11 5:00 ` Nick Piggin 2008-06-11 5:13 ` Paul Mackerras 2008-06-11 5:35 ` Nick Piggin 2008-06-11 6:02 ` Nick Piggin 2008-06-12 12:14 ` Paul Mackerras 2008-06-12 13:08 ` Nick Piggin 2008-06-11 14:46 ` Linus Torvalds 2008-06-11 5:20 ` Paul Mackerras 2008-06-04 2:19 ` Nick Piggin 2008-06-03 19:43 ` Trent Piepho 2008-06-03 21:33 ` Matthew Wilcox 2008-06-03 21:44 ` Trent Piepho 2008-06-04 2:25 ` Nick Piggin 2008-06-04 6:39 ` Trent Piepho 2008-06-03 22:26 ` Benjamin Herrenschmidt 2008-05-27 3:42 ` Arjan van de Ven 2008-05-27 4:08 ` Roland Dreier 2008-05-27 4:20 ` Arjan van de Ven 2008-05-27 7:08 ` Benjamin Herrenschmidt 2008-05-27 15:50 ` Roland Dreier 2008-05-27 16:37 ` James Bottomley 2008-05-27 17:38 ` Roland Dreier 2008-05-27 17:53 ` James Bottomley 2008-05-27 18:07 ` Roland Dreier 2008-05-27 18:17 ` Roland Dreier 2008-05-27 21:23 ` Chris Friesen 2008-05-27 21:29 ` Roland Dreier 2008-05-27 23:04 ` Paul Mackerras 2008-05-27 21:11 ` Benjamin Herrenschmidt 2008-05-27 21:33 ` Roland Dreier 2008-05-27 22:13 ` Benjamin Herrenschmidt 2008-05-27 22:39 ` Roland Dreier 2008-05-29 14:47 ` Jes Sorensen 2008-05-29 15:01 ` James Bottomley 2008-05-30 9:36 ` Jes Sorensen 2008-05-30 17:21 ` Jesse Barnes 2008-05-31 7:57 ` Jeremy Higdon 2008-05-29 21:40 ` Benjamin Herrenschmidt 2008-05-29 21:48 ` Trent Piepho 2008-05-29 22:05 ` Benjamin Herrenschmidt 2008-05-30 1:53 ` Trent Piepho 2008-05-29 21:53 ` Jesse Barnes 2008-05-30 9:39 ` Jes Sorensen 2008-05-30 9:48 ` Jes Sorensen 2008-05-31 8:14 ` Pavel Machek 2008-06-02 9:48 ` Jes Sorensen 2008-05-29 22:06 ` Roland Dreier 2008-05-29 22:25 ` Trent Piepho 2008-05-30 3:56 ` Paul Mackerras 2008-05-31 7:52 ` Jeremy Higdon 2008-06-02 9:56 ` Jes Sorensen 2008-06-02 21:02 ` Jeremy Higdon 2008-06-03 4:33 ` Nick Piggin 2008-06-03 8:15 ` Jeremy Higdon 2008-06-03 8:19 ` Nick Piggin 2008-06-03 8:45 ` Jeremy Higdon 2008-06-03 16:52 ` Jesse Barnes 2008-06-05 8:40 ` Jes Sorensen 2008-06-05 8:43 ` Benjamin Herrenschmidt 2008-06-12 15:07 ` Matthew Wilcox 2008-06-13 0:07 ` Benjamin Herrenschmidt 2008-05-31 8:04 ` Pavel Machek 2008-05-27 8:24 ` Alan Cox 2008-05-27 15:28 ` Jonathan Corbet 2008-05-20 22:55 ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho 2008-05-21 14:01 ` Benjamin Herrenschmidt 2008-05-20 22:00 ` Trent Piepho 2008-05-21 14:00 ` Benjamin Herrenschmidt 2008-05-21 19:44 ` Trent Piepho 2008-05-21 20:41 ` Benjamin Herrenschmidt 2008-05-20 22:00 ` Andreas Schwab 2008-05-20 22:11 ` Trent Piepho 2008-05-20 22:47 ` Andreas Schwab 2008-05-20 23:14 ` Trent Piepho 2008-05-21 8:03 ` Andreas Schwab 2008-05-21 20:25 ` Trent Piepho 2008-05-27 23:48 ` [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() " Trent Piepho
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).