[PATCH] [POWERPC] Improve (in|out)

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
@ 2008-05-20 20:40 Trent Piepho
  2008-05-20 21:16 ` Benjamin Herrenschmidt
  2008-05-20 22:00 ` Andreas Schwab
  0 siblings, 2 replies; 160+ messages in thread
From: Trent Piepho @ 2008-05-20 20:40 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Scott Wood, linux-kernel, Trent Piepho

Since commit 4cb3cee03d558fd457cb58f56c80a2a09a66110c the code generated
for the in_beXX() and out_beXX() mmio functions has been sub-optimal.

The out_leXX() family of functions are created with the macro
DEF_MMIO_OUT_LE() while the out_beXX() family are created with
DEF_MMIO_OUT_BE().  In what was perhaps a bit too much macro use, both of
these macros are in turn created via the macro DEF_MMIO_OUT().

For the LE versions, eventually they boil down to an asm that will look
something like this:
asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr));

While not perfect, this appears to be the best one can do.  The issue is
that the "stwbrx" instruction only comes in an indexed, or 'x', version, in
which the address is represented by the sum of two registers (the "0,%2").
Unfortunately, gcc doesn't have a constraint for an indexed memory
reference.  The "m" constraint allows both indexed and offset, i.e.
register plus constant, memory references and there is no "stwbr" version
for offset references.

The unused first operand to the asm is just to tell gcc that *addr is an
output of the asm.  The address used is passed in a single register via the
third asm operand, and the index register is just hard coded as 0.  This
means gcc is forced to put the address in a single register and can't use
index addressing, e.g. if one has the data in register 9, a base address in
register 3 and an index in register 4, gcc must emit code like "add 11,4,3;
stwbrx 9,0,11" instead of just "stwbrx 9,4,3".  This costs an extra add
instruction and another register.

This brings us the to problem with the BE version.  In this case, the "stw"
instruction does have both indexed and non-indexed versions.  The final asm
ends up looking like this:
asm("sync; stw%U0%X0 %1,%0" : "=m" (*addr) : "r" (val), "r" (addr));

The undocumented codes "%U0" and "%0X" will generate a 'u' if the memory
reference should be a auto-updating one, and an 'x' if the memory reference
is indexed, respectively.  The third operand is unused, it's just there
because asm the code is reused from the LE version.  However, gcc does not
know this, and generates unnecessary code to stick addr in a register!  To
use the example from the LE version, gcc will generate "add 11,4,3; stwx
9,4,3".  It is able to use the indexed address "4,3" for the "stwx", but
still thinks it needs to put 4+3 into register 11, which will never be
used.

This also ends up happening a lot for the offset addressing mode, where
common code like this:  out_be32(&device_registers->some_register, data);
uses an instruction like "stw 9, 42(3)", where register 3 has the pointer
device_registers and 42 is the offset of some_register in that structure.
gcc will be forced to generate the unnecessary instruction "addi 11, 3, 42"
to put the address into a single (unused) register.

The in_* versions end up having these exact same problems as well.

Signed-off-by: Trent Piepho <tpiepho@freescale.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Scott Wood <scottwood@freescale.com>
---
There was some discussion on a Freescale list if the powerpc I/O accessors
should be strictly ordered w.r.t.  normal memory.  Currently they are not.  It
does not appear as if any other architecture's I/O accessors are strictly
ordered in this manner.  memory-barriers.txt explicitly states that the I/O
space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory
accesses and it's implied the other I/O accessors (e.g., writel) are the same.

However, it is somewhat harder to program for this model, and there are almost
certainly a number of drivers using coherent DMA which have subtle bugs because
the do not include the necessary barriers.

But clearly and change to this would be a subject for a different patch.

 include/asm-powerpc/io.h |   44 +++++++++++++++++++++++++-------------------
 1 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/include/asm-powerpc/io.h b/include/asm-powerpc/io.h
index e0062d7..795b5d4 100644
--- a/include/asm-powerpc/io.h
+++ b/include/asm-powerpc/io.h
@@ -95,33 +95,39 @@ extern resource_size_t isa_mem_base;
 #define IO_SET_SYNC_FLAG()
 #endif

-#define DEF_MMIO_IN(name, type, insn)					\
-static inline type name(const volatile type __iomem *addr)		\
+#define DEF_MMIO_IN_LE(name, size, insn)				\
+static inline u##size name(const volatile u##size __iomem *addr)	\
 {									\
-	type ret;							\
-	__asm__ __volatile__("sync;" insn ";twi 0,%0,0;isync"		\
- 		: "=r" (ret) : "r" (addr), "m" (*addr));		\
+	u##size ret;							\
+	__asm__ __volatile__("sync;"#insn" %0,0,%1;twi 0,%0,0;isync"	\
+		: "=r" (ret) : "r" (addr), "m" (*addr));		\
 	return ret;							\
 }

-#define DEF_MMIO_OUT(name, type, insn)					\
-static inline void name(volatile type __iomem *addr, type val)		\
+#define DEF_MMIO_IN_BE(name, size, insn)				\
+static inline u##size name(const volatile u##size __iomem *addr)	\
 {									\
-	__asm__ __volatile__("sync;" insn				\
- 		: "=m" (*addr) : "r" (val), "r" (addr));		\
-	IO_SET_SYNC_FLAG();					\
+	u##size ret;							\
+	__asm__ __volatile__("sync;"#insn"%U1%X1 %0,%1;twi 0,%0,0;isync"\
+		: "=r" (ret) : "m" (*addr));				\
+	return ret;							\
 }

+#define DEF_MMIO_OUT_BE(name, size, insn)				\
+static inline void name(volatile u##size __iomem *addr, u##size val)	\
+{									\
+	__asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0"			\
+		: "=m" (*addr) : "r" (val));				\
+	IO_SET_SYNC_FLAG();						\
+}

-#define DEF_MMIO_IN_BE(name, size, insn) \
-	DEF_MMIO_IN(name, u##size, __stringify(insn)"%U2%X2 %0,%2")
-#define DEF_MMIO_IN_LE(name, size, insn) \
-	DEF_MMIO_IN(name, u##size, __stringify(insn)" %0,0,%1")
-
-#define DEF_MMIO_OUT_BE(name, size, insn) \
-	DEF_MMIO_OUT(name, u##size, __stringify(insn)"%U0%X0 %1,%0")
-#define DEF_MMIO_OUT_LE(name, size, insn) \
-	DEF_MMIO_OUT(name, u##size, __stringify(insn)" %1,0,%2")
+#define DEF_MMIO_OUT_LE(name, size, insn) 				\
+static inline void name(volatile u##size __iomem *addr, u##size val)	\
+{									\
+	__asm__ __volatile__("sync;"#insn" %1,0,%2"			\
+		: "=m" (*addr) : "r" (val), "r" (addr));		\
+	IO_SET_SYNC_FLAG();						\
+}

 DEF_MMIO_IN_BE(in_8,     8, lbz);
 DEF_MMIO_IN_BE(in_be16, 16, lhz);
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 20:40 [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
@ 2008-05-20 21:16 ` Benjamin Herrenschmidt
  2008-05-20 21:38   ` Scott Wood
  2008-05-20 22:00   ` Trent Piepho
  2008-05-20 22:00 ` Andreas Schwab
  1 sibling, 2 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-20 21:16 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel


On Tue, 2008-05-20 at 13:40 -0700, Trent Piepho wrote:
> There was some discussion on a Freescale list if the powerpc I/O accessors
> should be strictly ordered w.r.t.  normal memory.  Currently they are not.  It
> does not appear as if any other architecture's I/O accessors are strictly
> ordered in this manner.  memory-barriers.txt explicitly states that the I/O
> space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory
> accesses and it's implied the other I/O accessors (e.g., writel) are the same.
> 
> However, it is somewhat harder to program for this model, and there are almost
> certainly a number of drivers using coherent DMA which have subtle bugs because
> the do not include the necessary barriers.
> 
> But clearly and change to this would be a subject for a different patch.

The current accessors should provide all the necessary ordering
guarantees...

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 21:16 ` Benjamin Herrenschmidt
@ 2008-05-20 21:38   ` Scott Wood
  2008-05-20 22:02     ` Benjamin Herrenschmidt
  2008-05-20 22:15     ` Alan Cox
  2008-05-20 22:00   ` Trent Piepho
  1 sibling, 2 replies; 160+ messages in thread
From: Scott Wood @ 2008-05-20 21:38 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Trent Piepho, linux-kernel

Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-20 at 13:40 -0700, Trent Piepho wrote:
>> There was some discussion on a Freescale list if the powerpc I/O accessors
>> should be strictly ordered w.r.t.  normal memory.  Currently they are not.  It
>> does not appear as if any other architecture's I/O accessors are strictly
>> ordered in this manner.  memory-barriers.txt explicitly states that the I/O
>> space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory
>> accesses and it's implied the other I/O accessors (e.g., writel) are the same.
>>
>> However, it is somewhat harder to program for this model, and there are almost
>> certainly a number of drivers using coherent DMA which have subtle bugs because
>> the do not include the necessary barriers.
>>
>> But clearly and change to this would be a subject for a different patch.
> 
> The current accessors should provide all the necessary ordering
> guarantees...

It looks like we rely on -fno-strict-aliasing to prevent reordering 
ordinary memory accesses (such as to DMA descriptors) past the I/O 
access.  It won't prevent reordering of memory reads around an I/O read, 
though, which could be a problem if the I/O read result determines the 
validity of the DMA buffer.  IMHO, a memory clobber would be better.

-Scott

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 21:38   ` Scott Wood
@ 2008-05-20 22:02     ` Benjamin Herrenschmidt
  2008-05-20 22:21       ` Trent Piepho
  2008-05-20 22:15     ` Alan Cox
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-20 22:02 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, Trent Piepho, linux-kernel


On Tue, 2008-05-20 at 16:38 -0500, Scott Wood wrote:
> It looks like we rely on -fno-strict-aliasing to prevent reordering 
> ordinary memory accesses (such as to DMA descriptors) past the I/O 
> access.  It won't prevent reordering of memory reads around an I/O
> read, 
> though, which could be a problem if the I/O read result determines
> the 
> validity of the DMA buffer.  IMHO, a memory clobber would be better.

We probably want a full "memory" clobber then...

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:02     ` Benjamin Herrenschmidt
@ 2008-05-20 22:21       ` Trent Piepho
  0 siblings, 0 replies; 160+ messages in thread
From: Trent Piepho @ 2008-05-20 22:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Tue, 20 May 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-20 at 16:38 -0500, Scott Wood wrote:
>> It looks like we rely on -fno-strict-aliasing to prevent reordering
>> ordinary memory accesses (such as to DMA descriptors) past the I/O
>> access.  It won't prevent reordering of memory reads around an I/O
>> read,
>> though, which could be a problem if the I/O read result determines
>> the
>> validity of the DMA buffer.  IMHO, a memory clobber would be better.
>
> We probably want a full "memory" clobber then...

As far as I could tell, no other arch has a full memory clobber.  I can see
the argument for changing the Linux model to be stricter and less efficient,
but easier to program for.  Not that I entirely agree with it.

But I don't see a good reason for why powerpc should be different than
everything else.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 21:38   ` Scott Wood
  2008-05-20 22:02     ` Benjamin Herrenschmidt
@ 2008-05-20 22:15     ` Alan Cox
  2008-05-20 22:35       ` Scott Wood
  1 sibling, 1 reply; 160+ messages in thread
From: Alan Cox @ 2008-05-20 22:15 UTC (permalink / raw)
  To: Scott Wood; +Cc: Trent Piepho, linux-kernel, linuxppc-dev

> It looks like we rely on -fno-strict-aliasing to prevent reordering 
> ordinary memory accesses (such as to DMA descriptors) past the I/O 

DMA descriptors in main memory are dependant on cache behaviour anyway
and the dma_* operators should be the ones enforcing the needed behaviour.

Alan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:15     ` Alan Cox
@ 2008-05-20 22:35       ` Scott Wood
  2008-05-20 22:39         ` David Miller
  2008-05-20 22:55         ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
  0 siblings, 2 replies; 160+ messages in thread
From: Scott Wood @ 2008-05-20 22:35 UTC (permalink / raw)
  To: Alan Cox; +Cc: Trent Piepho, linux-kernel, linuxppc-dev

Alan Cox wrote:
>> It looks like we rely on -fno-strict-aliasing to prevent reordering 
>> ordinary memory accesses (such as to DMA descriptors) past the I/O 
> 
> DMA descriptors in main memory are dependant on cache behaviour anyway
> and the dma_* operators should be the ones enforcing the needed behaviour.

What about memory obtained from dma_alloc_coherent()?  We still need a 
sync and a compiler barrier.  The current I/O accessors have the former, 
but not the latter.

-Scott

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:35       ` Scott Wood
@ 2008-05-20 22:39         ` David Miller
  2008-05-20 22:43           ` Scott Wood
  2008-05-20 22:55         ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
  1 sibling, 1 reply; 160+ messages in thread
From: David Miller @ 2008-05-20 22:39 UTC (permalink / raw)
  To: scottwood; +Cc: linuxppc-dev, tpiepho, alan, linux-kernel

From: Scott Wood <scottwood@freescale.com>
Date: Tue, 20 May 2008 17:35:56 -0500

> Alan Cox wrote:
> >> It looks like we rely on -fno-strict-aliasing to prevent reordering 
> >> ordinary memory accesses (such as to DMA descriptors) past the I/O 
> > 
> > DMA descriptors in main memory are dependant on cache behaviour anyway
> > and the dma_* operators should be the ones enforcing the needed behaviour.
> 
> What about memory obtained from dma_alloc_coherent()?  We still need a 
> sync and a compiler barrier.  The current I/O accessors have the former, 
> but not the latter.

The __volatile__ in the asm construct disallows movement of the
inline asm relative to statements surrounding it.

The only reason barrier() in kernel.h needs a memory clobber is
because of a bug in ancient versions of gcc.  In fact, I think
that memory clobber might even be removable.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:39         ` David Miller
@ 2008-05-20 22:43           ` Scott Wood
  2008-05-20 22:53             ` David Miller
  0 siblings, 1 reply; 160+ messages in thread
From: Scott Wood @ 2008-05-20 22:43 UTC (permalink / raw)
  To: David Miller; +Cc: linuxppc-dev, tpiepho, alan, linux-kernel

David Miller wrote:
> From: Scott Wood <scottwood@freescale.com>
> Date: Tue, 20 May 2008 17:35:56 -0500
> 
>> Alan Cox wrote:
>>>> It looks like we rely on -fno-strict-aliasing to prevent reordering 
>>>> ordinary memory accesses (such as to DMA descriptors) past the I/O 
>>> DMA descriptors in main memory are dependant on cache behaviour anyway
>>> and the dma_* operators should be the ones enforcing the needed behaviour.
>> What about memory obtained from dma_alloc_coherent()?  We still need a 
>> sync and a compiler barrier.  The current I/O accessors have the former, 
>> but not the latter.
> 
> The __volatile__ in the asm construct disallows movement of the
> inline asm relative to statements surrounding it.
> 
> The only reason barrier() in kernel.h needs a memory clobber is
> because of a bug in ancient versions of gcc.  In fact, I think
> that memory clobber might even be removable.

Current versions of GCC seem quite happy to move non-asm memory accesses 
around a volatile asm without a memory clobber; see the test Trent posted.

-Scott

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:43           ` Scott Wood
@ 2008-05-20 22:53             ` David Miller
  2008-05-23  4:24               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: David Miller @ 2008-05-20 22:53 UTC (permalink / raw)
  To: scottwood; +Cc: linuxppc-dev, tpiepho, alan, linux-kernel

From: Scott Wood <scottwood@freescale.com>
Date: Tue, 20 May 2008 17:43:58 -0500

> David Miller wrote:
> > The __volatile__ in the asm construct disallows movement of the
> > inline asm relative to statements surrounding it.
> > 
> > The only reason barrier() in kernel.h needs a memory clobber is
> > because of a bug in ancient versions of gcc.  In fact, I think
> > that memory clobber might even be removable.
> 
> Current versions of GCC seem quite happy to move non-asm memory accesses 
> around a volatile asm without a memory clobber; see the test Trent posted.

Indeed, and even the GCC manual is clear about this.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:53             ` David Miller
@ 2008-05-23  4:24               ` Benjamin Herrenschmidt
  2008-05-22 22:56                 ` Trent Piepho
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-23  4:24 UTC (permalink / raw)
  To: David Miller; +Cc: scottwood, linuxppc-dev, tpiepho, alan, linux-kernel


On Tue, 2008-05-20 at 15:53 -0700, David Miller wrote:
> From: Scott Wood <scottwood@freescale.com>
> Date: Tue, 20 May 2008 17:43:58 -0500
> 
> > David Miller wrote:
> > > The __volatile__ in the asm construct disallows movement of the
> > > inline asm relative to statements surrounding it.
> > > 
> > > The only reason barrier() in kernel.h needs a memory clobber is
> > > because of a bug in ancient versions of gcc.  In fact, I think
> > > that memory clobber might even be removable.
> > 
> > Current versions of GCC seem quite happy to move non-asm memory accesses 
> > around a volatile asm without a memory clobber; see the test Trent posted.
> 
> Indeed, and even the GCC manual is clear about this.

So what is the scope of that problem ?

IE. Take an x86 version of that test, writing to memory, doing a writel
to some MMIO, then another memory write, can those be re-ordered with
the current x86 version of writel ?

static inline void writel(unsigned int b, volatile void __iomem *addr)
{
	*(volatile unsigned int __force *)addr = b;
}

This is becoming a serious issue...

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-23  4:24               ` Benjamin Herrenschmidt
@ 2008-05-22 22:56                 ` Trent Piepho
  2008-05-23 12:36                   ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt
  2008-05-27  1:33                   ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt
  0 siblings, 2 replies; 160+ messages in thread
From: Trent Piepho @ 2008-05-22 22:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: scottwood, linuxppc-dev, David Miller, alan, linux-kernel

On Fri, 23 May 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-20 at 15:53 -0700, David Miller wrote:
>> From: Scott Wood <scottwood@freescale.com>
>> Date: Tue, 20 May 2008 17:43:58 -0500
>>> David Miller wrote:
>>>> The __volatile__ in the asm construct disallows movement of the
>>>> inline asm relative to statements surrounding it.
>>>>
>>>> The only reason barrier() in kernel.h needs a memory clobber is
>>>> because of a bug in ancient versions of gcc.  In fact, I think
>>>> that memory clobber might even be removable.
>>>
>>> Current versions of GCC seem quite happy to move non-asm memory accesses
>>> around a volatile asm without a memory clobber; see the test Trent posted.
>>
>> Indeed, and even the GCC manual is clear about this.
>
> So what is the scope of that problem ?
>
> IE. Take an x86 version of that test, writing to memory, doing a writel
> to some MMIO, then another memory write, can those be re-ordered with
> the current x86 version of writel ?

Yes, the same thing can happen on x86.  As far as I could tell, this is
something that all other arches can have happen.  Usually aliasing prevents
it, but it's not hard to constuct a test case where it doesn't.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code)
  2008-05-22 22:56                 ` Trent Piepho
@ 2008-05-23 12:36                   ` Benjamin Herrenschmidt
  2008-05-23 12:50                     ` Benjamin Herrenschmidt
  2008-05-27  1:33                   ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-23 12:36 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-kernel, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

> > IE. Take an x86 version of that test, writing to memory, doing a writel
> > to some MMIO, then another memory write, can those be re-ordered with
> > the current x86 version of writel ?
> 
> Yes, the same thing can happen on x86.  As far as I could tell, this is
> something that all other arches can have happen.  Usually aliasing prevents
> it, but it's not hard to constuct a test case where it doesn't.

That brings us back to the old debate...

For consistent memory, should we mandate a wmb between write to some dma
data structure in consistent memory and the following mmio write that
trigger it, and the same goes with rmb and read ?

David, you remember we had those discussions a while back when I was
trying to relax a bit the barriers we have in our MMIO accessors on
powerpc, and the overwhelming answer was that x86 being in order, I have
to expect 90% of the drivers to not get any barrier right on any
platform, and thus I should make my MMIO accessors "look like" x86 and
thus ordered.

We did that, adding some barriers in the assembly of our readl/writel. 

However, I didn't change the clobber, it's still *addr, not a full
"memory" clobber, just like x86.

Now if it appears that gcc can indeed re-order things, then we have a
problem on both x86, ppc, and pretty much everybody else. (I'm not sure
about sparc but I don't see any explicit clobber in your accessors
there).

So that brings the whole subject back imho. What should be the approach
here ? I see several options:

	- mandate some kind of dma_sync_for_device/cpu on consistent memory.
Almost no driver do that currently tho. They only do that for non
consistent memory mapped with dma_map_*.

	- mandate the use of wmb,rmb,mb barriers for use between memory
accesses and MMIOs for ordering them. (ie. fix drivers that don't do
it). Advantage for powerpc is that I can remove (after some auditing of
course) the added heavy barriers in the MMIO accessors themselves.

	- stick a full memory clobber in all MMIO (and PIO) accessors on all
archs.

Any other idea ? preference ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code)
  2008-05-23 12:36                   ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt
@ 2008-05-23 12:50                     ` Benjamin Herrenschmidt
  2008-05-23 21:14                       ` Scott Wood
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-23 12:50 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-kernel, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan


On Fri, 2008-05-23 at 08:36 -0400, Benjamin Herrenschmidt wrote:
> 
>         - mandate some kind of dma_sync_for_device/cpu on consistent memory.
> Almost no driver do that currently tho. They only do that for non
> consistent memory mapped with dma_map_*.
> 
>         - mandate the use of wmb,rmb,mb barriers for use between memory
> accesses and MMIOs for ordering them. (ie. fix drivers that don't do
> it). Advantage for powerpc is that I can remove (after some auditing of
> course) the added heavy barriers in the MMIO accessors themselves.

Note that the above is my preferred approach, and a lot of drivers
happen to already do this.

>         - stick a full memory clobber in all MMIO (and PIO) accessors on all
> archs.
> 
> Any other idea ? preference ?

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code)
  2008-05-23 12:50                     ` Benjamin Herrenschmidt
@ 2008-05-23 21:14                       ` Scott Wood
  2008-05-23 22:47                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: Scott Wood @ 2008-05-23 21:14 UTC (permalink / raw)
  To: benh
  Cc: linux-kernel, David Miller, linuxppc-dev, Linus Torvalds,
	Trent Piepho, alan

Benjamin Herrenschmidt wrote:
> On Fri, 2008-05-23 at 08:36 -0400, Benjamin Herrenschmidt wrote:
>>         - mandate some kind of dma_sync_for_device/cpu on consistent memory.
>> Almost no driver do that currently tho. They only do that for non
>> consistent memory mapped with dma_map_*.
>>
>>         - mandate the use of wmb,rmb,mb barriers for use between memory
>> accesses and MMIOs for ordering them. (ie. fix drivers that don't do
>> it). Advantage for powerpc is that I can remove (after some auditing of
>> course) the added heavy barriers in the MMIO accessors themselves.
> 
> Note that the above is my preferred approach, and a lot of drivers
> happen to already do this.

As Trent pointed out, if you change to eieio in the accessors, that'd 
require drivers to also use mmiowb() before spin_unlock(), which fewer 
drivers currently do.

>>         - stick a full memory clobber in all MMIO (and PIO) accessors on all
>> archs.

I like this, combined with introducing raw variants of the non-PCI 
accessors (in_be32 and such).  It's slower and safe by default (i.e. no 
auditing drivers), but performance-critical paths can be optimized to 
use raw accessors combined with explicit barriers.

-Scott

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code)
  2008-05-23 21:14                       ` Scott Wood
@ 2008-05-23 22:47                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-23 22:47 UTC (permalink / raw)
  To: Scott Wood
  Cc: linux-kernel, David Miller, linuxppc-dev, Linus Torvalds,
	Trent Piepho, alan


On Fri, 2008-05-23 at 16:14 -0500, Scott Wood wrote:

> As Trent pointed out, if you change to eieio in the accessors, that'd 
> require drivers to also use mmiowb() before spin_unlock(), which fewer 
> drivers currently do.

No, this is a totally different issue. And we keep track of the need
for a sync in spin_unlock already.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* MMIO and gcc re-ordering issue
  2008-05-22 22:56                 ` Trent Piepho
  2008-05-23 12:36                   ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt
@ 2008-05-27  1:33                   ` Benjamin Herrenschmidt
  2008-05-27  1:40                     ` David Miller
  2008-05-27 15:28                     ` Jonathan Corbet
  1 sibling, 2 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27  1:33 UTC (permalink / raw)
  To: Linux Arch list
  Cc: linux-kernel, Trent Piepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan

Re-post to linux-arch of a discussion on lkml.

Mostly due to lack of reactions :-)

Quick summary: gcc is happily re-ordering readl/writel vs. surrounding
memory accesses (and thus accesses to DMA coherent memory) which is
obviously a _BAD_THING_.

This is on all archs. Quick fix is to stick a "memory" clobber in all arch
implementations of readl/writel/... (ie, making them a barrier()).

However, I'm using that as an excuse to bring back my pet subject, which
is basically, should we instead just finally mandate the use of explicit
rmb/wmb/mb's (which boils down to barrier() on x86) to drivers who want
to order memory consistent accesses vs. MMIO ?

The reason is that on archs that are out of order, this would allow us
to get rid in the long run of some of the heavy barriers we have put in
our readl/writel implementations to make them look like x86.

Note that there are drivers that already do that (ie. explicit rmb/wmb/mb
to order coherent memory accesses vs. MMIO), such as OHCI/EHCI which
may explain why the new problem with gcc isn't more obviously hitting
people, as those turn into compiler barriers on x86.

If that approach is accepted, then I'll start auditing drivers and send
patches adding wmb/rmb/mb's to them, and in the long run, after mucho
testing, relax powerpc writel/readl implementations. Other archs can then
do the same too

If that approach is generally considered wrong, then we should probably
remove the spurrious readl/writel in drivers that do them.

So what are the opinions here ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  1:33                   ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt
@ 2008-05-27  1:40                     ` David Miller
  2008-05-27  2:15                       ` Benjamin Herrenschmidt
  2008-05-27 15:28                     ` Jonathan Corbet
  1 sibling, 1 reply; 160+ messages in thread
From: David Miller @ 2008-05-27  1:40 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 27 May 2008 11:33:46 +1000

> Quick summary: gcc is happily re-ordering readl/writel vs. surrounding
> memory accesses (and thus accesses to DMA coherent memory) which is
> obviously a _BAD_THING_.
> 
> This is on all archs. Quick fix is to stick a "memory" clobber in all arch
> implementations of readl/writel/... (ie, making them a barrier()).
> 
> However, I'm using that as an excuse to bring back my pet subject, which
> is basically, should we instead just finally mandate the use of explicit
> rmb/wmb/mb's (which boils down to barrier() on x86) to drivers who want
> to order memory consistent accesses vs. MMIO ?

This is basically what drivers are effectively doing.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  1:40                     ` David Miller
@ 2008-05-27  2:15                       ` Benjamin Herrenschmidt
  2008-05-27  2:28                         ` David Miller
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27  2:15 UTC (permalink / raw)
  To: David Miller
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan

On Mon, 2008-05-26 at 18:40 -0700, David Miller wrote:
> 
> > Quick summary: gcc is happily re-ordering readl/writel vs. surrounding
> > memory accesses (and thus accesses to DMA coherent memory) which is
> > obviously a _BAD_THING_.
> > 
> > This is on all archs. Quick fix is to stick a "memory" clobber in all arch
> > implementations of readl/writel/... (ie, making them a barrier()).
> > 
> > However, I'm using that as an excuse to bring back my pet subject, which
> > is basically, should we instead just finally mandate the use of explicit
> > rmb/wmb/mb's (which boils down to barrier() on x86) to drivers who want
> > to order memory consistent accesses vs. MMIO ?
> 
> This is basically what drivers are effectively doing.

Some of them. USB comes to mind. I'd be happy to make it "the rule" and
document that MMIO vs. coherent access aren't implicitely ordered. I
would still keep them ordered on powerpc for a little while tho until
I'm happy enough with driver auditing.

But heh, it's you who was telling me that it would be a bad engineering
decision and we had to make everybody look like x86 & fully ordered :-)
I decided to agree back then and stuck all those nasty heavy barriers
in the powerpc variants of readl/writel/...

Now, however, that x86 -is- also affected by the problem to some extent
(ie. compiler re-ordering, not CPU but similar), this  is why I'm asking
what's people opinion is.

I'm happy to do patches updating memory-barriers.txt and others, and
do some driver auditing (though I won't do all of them), if the general
opinion is that it's the right direction to go.

If not, then shouldn't we remove the existing spurrious wmb/rmb/mb from
drivers and slap a "memory" clobber on all archs readl/writel/... ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  2:15                       ` Benjamin Herrenschmidt
@ 2008-05-27  2:28                         ` David Miller
  2008-05-27  3:39                           ` Benjamin Herrenschmidt
                                             ` (2 more replies)
  0 siblings, 3 replies; 160+ messages in thread
From: David Miller @ 2008-05-27  2:28 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 27 May 2008 12:15:40 +1000

> Some of them. USB comes to mind. I'd be happy to make it "the rule" and
> document that MMIO vs. coherent access aren't implicitely ordered. I
> would still keep them ordered on powerpc for a little while tho until
> I'm happy enough with driver auditing.
> 
> But heh, it's you who was telling me that it would be a bad engineering
> decision and we had to make everybody look like x86 & fully ordered :-)
> I decided to agree back then and stuck all those nasty heavy barriers
> in the powerpc variants of readl/writel/...

I still believe this.

It's just another complicated thing for driver authors to get wrong.
The other side of the coin is of course the cost.

The only thing I am absolutely sure of is that we should make a
decision fast, document it, and just stick to it.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  2:28                         ` David Miller
@ 2008-05-27  3:39                           ` Benjamin Herrenschmidt
  2008-05-27 15:35                             ` Linus Torvalds
  2008-05-27  3:42                           ` Arjan van de Ven
  2008-05-27  8:24                           ` Alan Cox
  2 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27  3:39 UTC (permalink / raw)
  To: David Miller
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan


On Mon, 2008-05-26 at 19:28 -0700, David Miller wrote:
> > But heh, it's you who was telling me that it would be a bad
> engineering
> > decision and we had to make everybody look like x86 & fully
> ordered :-)
> > I decided to agree back then and stuck all those nasty heavy
> barriers
> > in the powerpc variants of readl/writel/...
> 
> I still believe this.
> 
> It's just another complicated thing for driver authors to get wrong.
> The other side of the coin is of course the cost.
> 
> The only thing I am absolutely sure of is that we should make a
> decision fast, document it, and just stick to it.

Yes. As it is today, tg3 for example is potentially broken on all archs
with newer gcc unless we either add "memory" clobber to readl/writel or
stick some wmb's in there (just a random driver I picked).

So Linus, what is your take on that matter ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  3:39                           ` Benjamin Herrenschmidt
@ 2008-05-27 15:35                             ` Linus Torvalds
  2008-05-27 16:47                               ` Linus Torvalds
                                                 ` (2 more replies)
  0 siblings, 3 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 15:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan



On Tue, 27 May 2008, Benjamin Herrenschmidt wrote:
> 
> Yes. As it is today, tg3 for example is potentially broken on all archs
> with newer gcc unless we either add "memory" clobber to readl/writel or
> stick some wmb's in there (just a random driver I picked).
> 
> So Linus, what is your take on that matter ?

Let's just serialize the damn things, and add a memory clobber to them.

Expecting people to fix up all drivers is simply not going to happen. And 
serializing things shouldn't be *that* expensive. People who cannot take 
the expense can continue to use the magic __raw_writel() etc stuff.

			Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 15:35                             ` Linus Torvalds
@ 2008-05-27 16:47                               ` Linus Torvalds
  2008-05-27 17:31                                 ` Linus Torvalds
  2008-05-27 21:12                                 ` Benjamin Herrenschmidt
  2008-05-27 18:23                               ` Trent Piepho
  2008-05-27 21:10                               ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 16:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan



On Tue, 27 May 2008, Linus Torvalds wrote:
> 
> Expecting people to fix up all drivers is simply not going to happen. And 
> serializing things shouldn't be *that* expensive. People who cannot take 
> the expense can continue to use the magic __raw_writel() etc stuff.

Of course, for non-x86, you kind of have to expect drivers to be 
well-behaved, so non-x86 can probably avoid this simply because there are 
less relevant drivers involved.

Here's a UNTESTED patch for x86 that may or may not compile and work, and 
which serializes (on a compiler level) the IO accesses against regular 
memory accesses.

__read[bwlq]()/__write[bwlq]() are not serialized with a :"memory" 
barrier, although since they still use "asm volatile" I suspect that i 
practice they are probably serial too. Did not look very closely at any 
generated code (only did a trivial test to see that the code looks 
*roughly* correct).

		Linus

---
 include/asm-x86/io.h    |   56 +++++++++++++++++++++++++++++++++++++
 include/asm-x86/io_32.h |   49 --------------------------------
 include/asm-x86/io_64.h |   71 -----------------------------------------------
 3 files changed, 56 insertions(+), 120 deletions(-)

diff --git a/include/asm-x86/io.h b/include/asm-x86/io.h
index d5b11f6..8e9eca9 100644
--- a/include/asm-x86/io.h
+++ b/include/asm-x86/io.h
@@ -3,6 +3,62 @@
 
 #define ARCH_HAS_IOREMAP_WC
 
+#include <linux/compiler.h>
+
+#define build_mmio_read(name, size, type, reg, barrier) \
+static inline type name(const volatile void __iomem *addr) \
+{ type ret; asm volatile("mov" size " %1,%0":"=" reg (ret) \
+:"m" (*(volatile type __force *)addr) barrier); return ret; }
+
+#define build_mmio_write(name, size, type, reg, barrier) \
+static inline void name(type val, volatile void __iomem *addr) \
+{ asm volatile("mov" size " %0,%1": :reg (val), \
+"m" (*(volatile type __force *)addr) barrier); }
+
+build_mmio_read(readb, "b", unsigned char, "q", :"memory")
+build_mmio_read(readw, "w", unsigned short, "r", :"memory")
+build_mmio_read(readl, "l", unsigned int, "r", :"memory")
+
+build_mmio_read(__readb, "b", unsigned char, "q", )
+build_mmio_read(__readw, "w", unsigned short, "r", )
+build_mmio_read(__readl, "l", unsigned int, "r", )
+
+build_mmio_write(writeb, "b", unsigned char, "q", :"memory")
+build_mmio_write(writew, "w", unsigned short, "r", :"memory")
+build_mmio_write(writel, "l", unsigned int, "r", :"memory")
+
+build_mmio_write(__writeb, "b", unsigned char, "q", )
+build_mmio_write(__writew, "w", unsigned short, "r", )
+build_mmio_write(__writel, "l", unsigned int, "r", )
+
+#define readb_relaxed(a) __readb(a)
+#define readw_relaxed(a) __readw(a)
+#define readl_relaxed(a) __readl(a)
+#define __raw_readb __readb
+#define __raw_readw __readw
+#define __raw_readl __readl
+
+#define __raw_writeb __writeb
+#define __raw_writew __writew
+#define __raw_writel __writel
+
+#define mmiowb() barrier()
+
+#ifdef CONFIG_X86_64
+build_mmio_read(readq, "q", unsigned long, "r", :"memory")
+build_mmio_read(__readq, "q", unsigned long, "r", )
+build_mmio_write(writeq, "q", unsigned long, "r", :"memory")
+build_mmio_write(__writeq, "q", unsigned long, "r", )
+
+#define readq_relaxed(a) __readq(a)
+#define __raw_readq __readq
+#define __raw_writeq writeq
+
+/* Let people know we have them */
+#define readq readq
+#define writeq writeq
+#endif
+
 #ifdef CONFIG_X86_32
 # include "io_32.h"
 #else
diff --git a/include/asm-x86/io_32.h b/include/asm-x86/io_32.h
index 049e81e..d71be8d 100644
--- a/include/asm-x86/io_32.h
+++ b/include/asm-x86/io_32.h
@@ -149,55 +149,6 @@ extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys);
 #define virt_to_bus virt_to_phys
 #define bus_to_virt phys_to_virt
 
-/*
- * readX/writeX() are used to access memory mapped devices. On some
- * architectures the memory mapped IO stuff needs to be accessed
- * differently. On the x86 architecture, we just read/write the
- * memory location directly.
- */
-
-static inline unsigned char readb(const volatile void __iomem *addr)
-{
-	return *(volatile unsigned char __force *)addr;
-}
-
-static inline unsigned short readw(const volatile void __iomem *addr)
-{
-	return *(volatile unsigned short __force *)addr;
-}
-
-static inline unsigned int readl(const volatile void __iomem *addr)
-{
-	return *(volatile unsigned int __force *) addr;
-}
-
-#define readb_relaxed(addr) readb(addr)
-#define readw_relaxed(addr) readw(addr)
-#define readl_relaxed(addr) readl(addr)
-#define __raw_readb readb
-#define __raw_readw readw
-#define __raw_readl readl
-
-static inline void writeb(unsigned char b, volatile void __iomem *addr)
-{
-	*(volatile unsigned char __force *)addr = b;
-}
-
-static inline void writew(unsigned short b, volatile void __iomem *addr)
-{
-	*(volatile unsigned short __force *)addr = b;
-}
-
-static inline void writel(unsigned int b, volatile void __iomem *addr)
-{
-	*(volatile unsigned int __force *)addr = b;
-}
-#define __raw_writeb writeb
-#define __raw_writew writew
-#define __raw_writel writel
-
-#define mmiowb()
-
 static inline void
 memset_io(volatile void __iomem *addr, unsigned char val, int count)
 {
diff --git a/include/asm-x86/io_64.h b/include/asm-x86/io_64.h
index 0930bed..ddd8058 100644
--- a/include/asm-x86/io_64.h
+++ b/include/asm-x86/io_64.h
@@ -204,77 +204,6 @@ extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys);
 #define virt_to_bus virt_to_phys
 #define bus_to_virt phys_to_virt
 
-/*
- * readX/writeX() are used to access memory mapped devices. On some
- * architectures the memory mapped IO stuff needs to be accessed
- * differently. On the x86 architecture, we just read/write the
- * memory location directly.
- */
-
-static inline __u8 __readb(const volatile void __iomem *addr)
-{
-	return *(__force volatile __u8 *)addr;
-}
-
-static inline __u16 __readw(const volatile void __iomem *addr)
-{
-	return *(__force volatile __u16 *)addr;
-}
-
-static __always_inline __u32 __readl(const volatile void __iomem *addr)
-{
-	return *(__force volatile __u32 *)addr;
-}
-
-static inline __u64 __readq(const volatile void __iomem *addr)
-{
-	return *(__force volatile __u64 *)addr;
-}
-
-#define readb(x) __readb(x)
-#define readw(x) __readw(x)
-#define readl(x) __readl(x)
-#define readq(x) __readq(x)
-#define readb_relaxed(a) readb(a)
-#define readw_relaxed(a) readw(a)
-#define readl_relaxed(a) readl(a)
-#define readq_relaxed(a) readq(a)
-#define __raw_readb readb
-#define __raw_readw readw
-#define __raw_readl readl
-#define __raw_readq readq
-
-#define mmiowb()
-
-static inline void __writel(__u32 b, volatile void __iomem *addr)
-{
-	*(__force volatile __u32 *)addr = b;
-}
-
-static inline void __writeq(__u64 b, volatile void __iomem *addr)
-{
-	*(__force volatile __u64 *)addr = b;
-}
-
-static inline void __writeb(__u8 b, volatile void __iomem *addr)
-{
-	*(__force volatile __u8 *)addr = b;
-}
-
-static inline void __writew(__u16 b, volatile void __iomem *addr)
-{
-	*(__force volatile __u16 *)addr = b;
-}
-
-#define writeq(val, addr) __writeq((val), (addr))
-#define writel(val, addr) __writel((val), (addr))
-#define writew(val, addr) __writew((val), (addr))
-#define writeb(val, addr) __writeb((val), (addr))
-#define __raw_writeb writeb
-#define __raw_writew writew
-#define __raw_writel writel
-#define __raw_writeq writeq
-
 void __memcpy_fromio(void *, unsigned long, unsigned);
 void __memcpy_toio(unsigned long, const void *, unsigned);
 

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 16:47                               ` Linus Torvalds
@ 2008-05-27 17:31                                 ` Linus Torvalds
  2008-06-02 10:36                                   ` Ingo Molnar
  2008-05-27 21:12                                 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 17:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan

On Tue, 27 May 2008, Linus Torvalds wrote:
> 
> Here's a UNTESTED patch for x86 that may or may not compile and work, and 
> which serializes (on a compiler level) the IO accesses against regular 
> memory accesses.

Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since the 
code is now shared). I didn't look at whether it generates much bigger 
code due to the potential extra serialization, but some of the code 
generation I looked at looked fine.

IOW, it doesn't at least create any _obviously_ worse code, and it should 
be arguably safer than assuming the compiler does volatile accesses the 
way we want it to.

		Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 17:31                                 ` Linus Torvalds
@ 2008-06-02 10:36                                   ` Ingo Molnar
  2008-06-02 21:53                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: Ingo Molnar @ 2008-06-02 10:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Here's a UNTESTED patch for x86 that may or may not compile and 
> > work, and which serializes (on a compiler level) the IO accesses 
> > against regular memory accesses.
> 
> Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since 
> the code is now shared). I didn't look at whether it generates much 
> bigger code due to the potential extra serialization, but some of the 
> code generation I looked at looked fine.
> 
> IOW, it doesn't at least create any _obviously_ worse code, and it 
> should be arguably safer than assuming the compiler does volatile 
> accesses the way we want it to.

ok, to pursue this topic of making readl*/writel*() more robust i picked 
up your patch into -tip and created a new topic branch for it: 
tip/x86/mmio.

The patch passed initial light testing in -tip (~30 successful random 
self-builds and bootups on various mixed 32-bit/64-bit boxes) but it's 
still v2.6.27 material IMO.

Failures in this area are subtle so there's no good way to tell whether 
it works as intended - we need wider testing. I've also added the 
tip/x86/mmio topic to tip/auto-x86-next rules as well so these changes 
will be picked up by tomorrow's linux-next tree as well, and by the next 
-mm iteration.

	Ingo

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02 10:36                                   ` Ingo Molnar
@ 2008-06-02 21:53                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-02 21:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan

On Mon, 2008-06-02 at 12:36 +0200, Ingo Molnar wrote:
> The patch passed initial light testing in -tip (~30 successful random 
> self-builds and bootups on various mixed 32-bit/64-bit boxes) but
> it's 
> still v2.6.27 material IMO.

I think adding the "memory" clobber should be .26 and even -stable. We
know for sure newer gcc will re-order things, we know it for sure some
drivers will break in subtle ways because of that, ad we know as well
that sticking "memory" clobber in there will fix that breakage we know
about ... So I don't see the point of waiting.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 16:47                               ` Linus Torvalds
  2008-05-27 17:31                                 ` Linus Torvalds
@ 2008-05-27 21:12                                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 21:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan


On Tue, 2008-05-27 at 09:47 -0700, Linus Torvalds wrote:
> 
> __read[bwlq]()/__write[bwlq]() are not serialized with a :"memory" 
> barrier, although since they still use "asm volatile" I suspect that
> i 
> practice they are probably serial too. Did not look very closely at
> any 
> generated code (only did a trivial test to see that the code looks 
> *roughly* correct).

Nah, asm volatile doesn't help, it does need the "memory" clobber too.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 15:35                             ` Linus Torvalds
  2008-05-27 16:47                               ` Linus Torvalds
@ 2008-05-27 18:23                               ` Trent Piepho
  2008-05-27 18:33                                 ` Scott Wood
  2008-05-27 21:10                               ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-27 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, David Miller,
	alan

On Tue, 27 May 2008, Linus Torvalds wrote:
> On Tue, 27 May 2008, Benjamin Herrenschmidt wrote:
>>
>> Yes. As it is today, tg3 for example is potentially broken on all archs
>> with newer gcc unless we either add "memory" clobber to readl/writel or
>> stick some wmb's in there (just a random driver I picked).
>>
>> So Linus, what is your take on that matter ?
>
> Let's just serialize the damn things, and add a memory clobber to them.
>
> Expecting people to fix up all drivers is simply not going to happen. And
> serializing things shouldn't be *that* expensive. People who cannot take
> the expense can continue to use the magic __raw_writel() etc stuff.

Is there an issue with anything _besides_ coherent DMA?

Could one have a special version of the accessors for drivers that want to
assume they are strictly ordered vs coherent DMA memory?  That would be much
easier to get right, without slowing _everything_ down.  The problem with the
raw versions is that on some arches they are much more raw than just not being
ordered w.r.t normal memory.

__raw_writel()
No ordering beyond what the arch provides natively.  The only thing you can
assume is that reads and writes to the same location on the same CPU are
ordered.

writel()
Strictly ordered with respect to any other IO, but not to normal memory. 
(this is what Documentation/memory-barriers.txt claims).  Should probably be
strictly ordered vs locks as well.  Would be strictly ordered w.r.t. the
streaming DMA sync calls of course.

strict_writel()
Strictly ordered with respect to normal (incl. coherent DMA) memory as well.

One could even go as far as to allow a driver to "#define WANT_STRICT_IO" and
then it would get the strict versions.  Add that to any driver that uses DMA
and then worry about vetting those drivers.

It's also worth nothing that adding barriers to IO accessors doesn't mean you
never have to worry about barriers with coherent DMA.  Inherent with coherent
DMA, and driving hardware in general, there are various sequence points where
one must be sure one operation has finished before starting the next.  Making
sure you're doing with DMA memory before telling the hardware to do something
with it a common example.  Often the "telling the hardware to do something" is
done via an IO accessor function, but not always.

You can have buffer descriptors in DMA memory that describe the DMA buffers to
the hardware.  It would be critical to be finished reading a DMA buffer before
updating the descriptor to indicate that it's empty.  You could trigger a DMA
operation not by a call to an IO accessor, but with an atomic operation to a
variable shared with an ISR.  When the ISR runs it sees the change and does
the necessary IO.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 18:23                               ` Trent Piepho
@ 2008-05-27 18:33                                 ` Scott Wood
  0 siblings, 0 replies; 160+ messages in thread
From: Scott Wood @ 2008-05-27 18:33 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, linux-kernel, linuxppc-dev, Linus Torvalds,
	David Miller, alan

Trent Piepho wrote:
> Is there an issue with anything _besides_ coherent DMA?
> 
> Could one have a special version of the accessors for drivers that
> want to assume they are strictly ordered vs coherent DMA memory?
> That would be much easier to get right, without slowing _everything_
> down.

It's better to be safe by default and then optimize the fast paths than 
to be relaxed by default and hang the machine in some piece of code that 
runs once a month.  "Premature optimization is the root of all evil", 
and what not.

> One could even go as far as to allow a driver to "#define 
> WANT_STRICT_IO" and then it would get the strict versions.  Add that
> to any driver that uses DMA and then worry about vetting those
> drivers.

See above -- if you must have a #define, then it should be WANT_RELAXED_IO.

-Scott

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 15:35                             ` Linus Torvalds
  2008-05-27 16:47                               ` Linus Torvalds
  2008-05-27 18:23                               ` Trent Piepho
@ 2008-05-27 21:10                               ` Benjamin Herrenschmidt
  2008-05-27 21:30                                 ` Linus Torvalds
  2 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 21:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan


On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote:
> 
> On Tue, 27 May 2008, Benjamin Herrenschmidt wrote:
> > 
> > Yes. As it is today, tg3 for example is potentially broken on all archs
> > with newer gcc unless we either add "memory" clobber to readl/writel or
> > stick some wmb's in there (just a random driver I picked).
> > 
> > So Linus, what is your take on that matter ?
> 
> Let's just serialize the damn things, and add a memory clobber to them.
> 
> Expecting people to fix up all drivers is simply not going to happen. And 
> serializing things shouldn't be *that* expensive. People who cannot take 
> the expense can continue to use the magic __raw_writel() etc stuff.

Ok.

Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would
need to be fixed to make their writel serializing...

Regarding __raw_* their semantics are dodgy ... we might want to provide
something better but it's a different subject.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:10                               ` Benjamin Herrenschmidt
@ 2008-05-27 21:30                                 ` Linus Torvalds
  2008-05-27 21:38                                   ` Alan Cox
  2008-05-27 21:38                                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 21:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan

On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> 
> On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote:
> > 
> > Expecting people to fix up all drivers is simply not going to happen. And 
> > serializing things shouldn't be *that* expensive. People who cannot take 
> > the expense can continue to use the magic __raw_writel() etc stuff.
> 
> Ok.
> 
> Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would
> need to be fixed to make their writel serializing...

Well..

There's really two different issues:

 (a) x86 and the fact that we have thousands of drivers

which in turn conflicts with

 (b) non-x86 and the fact that other architectures tend to be absolute 
     pieces of cr*p when it comes to ordering, _especially_ across IO.

and the thing about (b) is that the number of drivers involved is a hell 
of a lot smaller. For example, ia64 and the big SGI machines probably 
really only care about roughly five drivers (number taken out of my nether 
regions). 

So practically speaking, I suspect that the right approach is to just say 
"ok, x86 will continue to be pretty darn ordered, and the barriers won't 
really matter (*)" but at the same time also saying "we wish reality was 
different, and well-maintained drivers should probably try to work in the 
presense of re-ordering".

In *practice*, that probably means that most architectures will be better 
off if they emulate x86 closely, just because that means that they won't 
rely on drivers always getting things right, but I think we can leave the 
door open for the odd machines. We should just realize that they will 
never get a lot of testing, but on the other hand, their usage scenarios 
will generally also be very limited (very specific loads, and _very_ 
specific hardware).

And the patch I sent out actually made "__[raw_]readl()" different from 
"readl()" on x86 too, in that the assembly _allows_ a bit more 
re-ordering, even though I doubt it will be visible in practice. So if you 
use the "__" versions, you'd better have barriers even on x86!

			Linus

(*) With the possible but unlikely exception of some big machines with 
separate IO networks, but if they happen they will fall into the 'ia64' 
case of just having a few relevant drivers.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:30                                 ` Linus Torvalds
@ 2008-05-27 21:38                                   ` Alan Cox
  2008-05-27 21:53                                     ` Matthew Wilcox
  2008-05-27 21:59                                     ` Linus Torvalds
  2008-05-27 21:38                                   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 160+ messages in thread
From: Alan Cox @ 2008-05-27 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller

> re-ordering, even though I doubt it will be visible in practice. So if you 
> use the "__" versions, you'd better have barriers even on x86!

Are we also going to have __ioread*/__iowrite* ?

Also is the sematics of __readl/__writel defined for all architectures -
I used it ages ago in the i2o drivers for speed and it got removed
because it didn't build on some platforms.

Alan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:38                                   ` Alan Cox
@ 2008-05-27 21:53                                     ` Matthew Wilcox
  2008-05-27 21:46                                       ` Alan Cox
  2008-05-27 22:02                                       ` Linus Torvalds
  2008-05-27 21:59                                     ` Linus Torvalds
  1 sibling, 2 replies; 160+ messages in thread
From: Matthew Wilcox @ 2008-05-27 21:53 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller

On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote:
> > re-ordering, even though I doubt it will be visible in practice. So if you 
> > use the "__" versions, you'd better have barriers even on x86!
> 
> Are we also going to have __ioread*/__iowrite* ?

Didn't we already define ioread*() to have loose semantics?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:53                                     ` Matthew Wilcox
@ 2008-05-27 21:46                                       ` Alan Cox
  2008-05-27 22:02                                       ` Linus Torvalds
  1 sibling, 0 replies; 160+ messages in thread
From: Alan Cox @ 2008-05-27 21:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller

On Tue, 27 May 2008 15:53:28 -0600
Matthew Wilcox <matthew@wil.cx> wrote:

> On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote:
> > > re-ordering, even though I doubt it will be visible in practice. So if you 
> > > use the "__" versions, you'd better have barriers even on x86!
> > 
> > Are we also going to have __ioread*/__iowrite* ?
> 
> Didn't we already define ioread*() to have loose semantics?

The ATA layer doesn't think so.

Alan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:53                                     ` Matthew Wilcox
  2008-05-27 21:46                                       ` Alan Cox
@ 2008-05-27 22:02                                       ` Linus Torvalds
  1 sibling, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 22:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, Alan Cox



On Tue, 27 May 2008, Matthew Wilcox wrote:

> On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote:
> > > re-ordering, even though I doubt it will be visible in practice. So if you 
> > > use the "__" versions, you'd better have barriers even on x86!
> > 
> > Are we also going to have __ioread*/__iowrite* ?
> 
> Didn't we already define ioread*() to have loose semantics?

They are supposed to have the same semantics as readl/writel.

And yes, it's "loose", but only when compared to inb/outb (which are 
really very strict, if you want to emulate x86 - an "outb" basically is 
not only ordered, it doesn't even post the write, and waits until it has 
hit the bus!)

		Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:38                                   ` Alan Cox
  2008-05-27 21:53                                     ` Matthew Wilcox
@ 2008-05-27 21:59                                     ` Linus Torvalds
  1 sibling, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 21:59 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller



On Tue, 27 May 2008, Alan Cox wrote:
>
> > re-ordering, even though I doubt it will be visible in practice. So if you 
> > use the "__" versions, you'd better have barriers even on x86!
> 
> Are we also going to have __ioread*/__iowrite* ?

I doubt there is any reason to. Let's just keep them very strictly 
ordered. 

> Also is the sematics of __readl/__writel defined for all architectures -
> I used it ages ago in the i2o drivers for speed and it got removed
> because it didn't build on some platforms.

Agreed - I'm not sure the __ versions are really worth it. We have them, 
but the semantics are subtle enough that most drivers will never care 
enough to really use them.

I would suggest using them mainly for architecture-specific drivers, on 
architectures where it actually matters (which is not the case on x86).

		Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:30                                 ` Linus Torvalds
  2008-05-27 21:38                                   ` Alan Cox
@ 2008-05-27 21:38                                   ` Benjamin Herrenschmidt
  2008-05-27 21:42                                     ` Matthew Wilcox
  2008-05-27 21:55                                     ` Linus Torvalds
  1 sibling, 2 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan

> So practically speaking, I suspect that the right approach is to just say 
> "ok, x86 will continue to be pretty darn ordered, and the barriers won't 
> really matter (*)" but at the same time also saying "we wish reality was 
> different, and well-maintained drivers should probably try to work in the 
> presense of re-ordering".

Ok, well, I'll slap "memory" clobbers onto powerpc accessors, we made
them fully ordered a while ago anyway. The extra barriers in drivers
like USB etc.. won't hurt us much, we can always fine tune drivers that
really want high performances.

A problem with __raw_ though is that they -also- don't do byteswap,
which is a pain in the neck as people use them for either one reason
(relaxed ordering) or the other (no byteswap) without always knowing the
consequences of doing so...

I'm happy to say that __raw is purely about ordering and make them
byteswap on powerpc tho (ie, make them little endian like the non-raw
counterpart).

Some archs started providing writel_be etc... I added those to powerpc a
little while ago, and I tend to prefer that approach for the byteswap
issue.

What do you think ?

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:38                                   ` Benjamin Herrenschmidt
@ 2008-05-27 21:42                                     ` Matthew Wilcox
  2008-05-27 22:17                                       ` Benjamin Herrenschmidt
  2008-05-27 21:55                                     ` Linus Torvalds
  1 sibling, 1 reply; 160+ messages in thread
From: Matthew Wilcox @ 2008-05-27 21:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan

On Wed, May 28, 2008 at 07:38:55AM +1000, Benjamin Herrenschmidt wrote:
> A problem with __raw_ though is that they -also- don't do byteswap,
> which is a pain in the neck as people use them for either one reason
> (relaxed ordering) or the other (no byteswap) without always knowing the
> consequences of doing so...

That's why there's __readl() which does byteswap, but doesn't do
ordering ...

> I'm happy to say that __raw is purely about ordering and make them
> byteswap on powerpc tho (ie, make them little endian like the non-raw
> counterpart).

That would break a lot of drivers.

> Some archs started providing writel_be etc... I added those to powerpc a
> little while ago, and I tend to prefer that approach for the byteswap
> issue.

Those are for people who use big endian chips on little endian
architectures.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:42                                     ` Matthew Wilcox
@ 2008-05-27 22:17                                       ` Benjamin Herrenschmidt
  2008-05-28  8:36                                         ` Haavard Skinnemoen
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 22:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan


On Tue, 2008-05-27 at 15:42 -0600, Matthew Wilcox wrote:
> On Wed, May 28, 2008 at 07:38:55AM +1000, Benjamin Herrenschmidt wrote:
> > A problem with __raw_ though is that they -also- don't do byteswap,
> > which is a pain in the neck as people use them for either one reason
> > (relaxed ordering) or the other (no byteswap) without always knowing the
> > consequences of doing so...
> 
> That's why there's __readl() which does byteswap, but doesn't do
> ordering ...

Ah, that one is news to me. I don't think we ever had it on powerpc :-)

> > I'm happy to say that __raw is purely about ordering and make them
> > byteswap on powerpc tho (ie, make them little endian like the non-raw
> > counterpart).
> 
> That would break a lot of drivers.

How many actually use __raw_ * ?

> > Some archs started providing writel_be etc... I added those to powerpc a
> > little while ago, and I tend to prefer that approach for the byteswap
> > issue.

> Those are for people who use big endian chips on little endian
> architectures.

Why limit them to LE architecture ? There is nothing fundamentally
speicifc to LE architectures here, and it's wrong to provide accessors
on some archs and not others. The endianness is a property of the device
registers. Current writel/readl are basically writel_le/readl_le. It
thus makes sense to have the opposite, ie, readl_be/writel_be, which
thus byteswaps on LE platforms and not on BE platforms, which is what I
provided on powerpc a while ago.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 22:17                                       ` Benjamin Herrenschmidt
@ 2008-05-28  8:36                                         ` Haavard Skinnemoen
  2008-05-29 11:05                                           ` Pantelis Antoniou
  2008-05-30  1:13                                           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 160+ messages in thread
From: Haavard Skinnemoen @ 2008-05-28  8:36 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > I'm happy to say that __raw is purely about ordering and make them
> > > byteswap on powerpc tho (ie, make them little endian like the non-raw
> > > counterpart).  
> > 
> > That would break a lot of drivers.  
> 
> How many actually use __raw_ * ?

I do -- in all the drivers for on-chip peripherals that are shared
between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside
the chip, we must use LE accesses on ARM and BE accesses on AVR32.

Currently, this is the only interface I know that can do native-endian
accesses, so if you take it away, I'm gonna need an alternative
interface that doesn't do byteswapping.

Haavard

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-28  8:36                                         ` Haavard Skinnemoen
@ 2008-05-29 11:05                                           ` Pantelis Antoniou
  2008-05-30  1:13                                           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 160+ messages in thread
From: Pantelis Antoniou @ 2008-05-29 11:05 UTC (permalink / raw)
  To: Haavard Skinnemoen
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan


On 28 =CE=9C=CE=B1=CF=8A 2008, at 11:36 =CE=A0=CE=9C, Haavard Skinnemoen =
wrote:

> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>> I'm happy to say that __raw is purely about ordering and make them
>>>> byteswap on powerpc tho (ie, make them little endian like the non-=20=

>>>> raw
>>>> counterpart).
>>>
>>> That would break a lot of drivers.
>>
>> How many actually use __raw_ * ?
>
> I do -- in all the drivers for on-chip peripherals that are shared
> between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside
> the chip, we must use LE accesses on ARM and BE accesses on AVR32.
>
> Currently, this is the only interface I know that can do native-endian
> accesses, so if you take it away, I'm gonna need an alternative
> interface that doesn't do byteswapping.
>
> Haavard

I certainly do too.

Quite a lot of SoC specific drivers use __raw for accessing their
on-chip peripherals.

Please don't change the __raw semantics, you'll end up breaking almost
everything that's BE.

-- Pantelis

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-28  8:36                                         ` Haavard Skinnemoen
  2008-05-29 11:05                                           ` Pantelis Antoniou
@ 2008-05-30  1:13                                           ` Benjamin Herrenschmidt
  2008-05-30  6:07                                             ` Haavard Skinnemoen
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-30  1:13 UTC (permalink / raw)
  To: Haavard Skinnemoen
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan


> I do -- in all the drivers for on-chip peripherals that are shared
> between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside
> the chip, we must use LE accesses on ARM and BE accesses on AVR32.
> 
> Currently, this is the only interface I know that can do native-endian
> accesses, so if you take it away, I'm gonna need an alternative
> interface that doesn't do byteswapping.

Are you aware that these also don't provide any ordering guarantee ?

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  1:13                                           ` Benjamin Herrenschmidt
@ 2008-05-30  6:07                                             ` Haavard Skinnemoen
  2008-05-30  7:24                                               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: Haavard Skinnemoen @ 2008-05-30  6:07 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Fri, 30 May 2008 11:13:23 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> > Currently, this is the only interface I know that can do native-endian
> > accesses, so if you take it away, I'm gonna need an alternative
> > interface that doesn't do byteswapping.  
> 
> Are you aware that these also don't provide any ordering guarantee ?

Yes, but I am not aware of any alternative.

I think the drivers I've written have the necessary barriers (or dma
ops with implicit barriers) that they don't actually depend on any
DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is
guaranteed though?

Haavard

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  6:07                                             ` Haavard Skinnemoen
@ 2008-05-30  7:24                                               ` Benjamin Herrenschmidt
  2008-05-30  8:27                                                 ` Haavard Skinnemoen
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-30  7:24 UTC (permalink / raw)
  To: Haavard Skinnemoen
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote:
> On Fri, 30 May 2008 11:13:23 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > > Currently, this is the only interface I know that can do native-endian
> > > accesses, so if you take it away, I'm gonna need an alternative
> > > interface that doesn't do byteswapping.  
> > 
> > Are you aware that these also don't provide any ordering guarantee ?
> 
> Yes, but I am not aware of any alternative.
> 
> I think the drivers I've written have the necessary barriers (or dma
> ops with implicit barriers) that they don't actually depend on any
> DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is
> guaranteed though?

Only to the same address I'd say.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  7:24                                               ` Benjamin Herrenschmidt
@ 2008-05-30  8:27                                                 ` Haavard Skinnemoen
  2008-05-30  9:22                                                   ` Geert Uytterhoeven
  0 siblings, 1 reply; 160+ messages in thread
From: Haavard Skinnemoen @ 2008-05-30  8:27 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Fri, 30 May 2008 17:24:27 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote:
> > I think the drivers I've written have the necessary barriers (or dma
> > ops with implicit barriers) that they don't actually depend on any
> > DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is
> > guaranteed though?
> 
> Only to the same address I'd say.

Right, I sort of suspected that.

Now, I'm pretty sure the architectures that can actually run those
drivers (ARM9 and AVR32 AP7) provide stronger guarantees than that, and
I suspect the same is true on most other embedded architectures that
use __raw_* in their drivers. So I don't think adding barriers is the
right thing to do because they won't do anything useful in practice, so
it's hard to tell whether they are used "correctly". And it will hurt
performance at least on AVR32 since wmb() evaluates to a "sync"
instruction, which flushes the write buffer to RAM. Since MMIO writes
are unbuffered, that's pure overhead.

Maybe we need another interface that does not do byteswapping but
provides stronger ordering guarantees?

Haavard

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  8:27                                                 ` Haavard Skinnemoen
@ 2008-05-30  9:22                                                   ` Geert Uytterhoeven
  2008-06-02  8:11                                                     ` Haavard Skinnemoen
  0 siblings, 1 reply; 160+ messages in thread
From: Geert Uytterhoeven @ 2008-05-30  9:22 UTC (permalink / raw)
  To: Haavard Skinnemoen
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Fri, 30 May 2008, Haavard Skinnemoen wrote:
> Maybe we need another interface that does not do byteswapping but
> provides stronger ordering guarantees?

The byte swapping depends on the device/bus.

So what happened to the old idea of putting the accessor function pointers
in the device/bus structure?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  9:22                                                   ` Geert Uytterhoeven
@ 2008-06-02  8:11                                                     ` Haavard Skinnemoen
  2008-06-02 15:48                                                       ` Scott Wood
  2008-06-04 15:31                                                       ` Linus Torvalds
  0 siblings, 2 replies; 160+ messages in thread
From: Haavard Skinnemoen @ 2008-06-02  8:11 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: linux-arch, Matthew Wilcox, linux-kernel, David Miller,
	linuxppc-dev, tpiepho, scottwood, Torvalds, Linus, alan

Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Fri, 30 May 2008, Haavard Skinnemoen wrote:
> > Maybe we need another interface that does not do byteswapping but
> > provides stronger ordering guarantees?
> 
> The byte swapping depends on the device/bus.

Of course. But isn't it reasonable to assume that a device integrated
on the same silicon as the CPU is connected to a somewhat sane bus
which doesn't require any byte swapping?

> So what happened to the old idea of putting the accessor function pointers
> in the device/bus structure?

Don't know. I think it sounds like overkill to replace a simple load or
store with an indirect function call.

Haavard

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02  8:11                                                     ` Haavard Skinnemoen
@ 2008-06-02 15:48                                                       ` Scott Wood
  2008-06-03  7:46                                                         ` Haavard Skinnemoen
  2008-06-04 15:31                                                       ` Linus Torvalds
  1 sibling, 1 reply; 160+ messages in thread
From: Scott Wood @ 2008-06-02 15:48 UTC (permalink / raw)
  To: Haavard Skinnemoen
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	Geert Uytterhoeven, Linus Torvalds, David Miller, alan

On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote:
> Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > On Fri, 30 May 2008, Haavard Skinnemoen wrote:
> > > Maybe we need another interface that does not do byteswapping but
> > > provides stronger ordering guarantees?
> > 
> > The byte swapping depends on the device/bus.
> 
> Of course. But isn't it reasonable to assume that a device integrated
> on the same silicon as the CPU is connected to a somewhat sane bus
> which doesn't require any byte swapping?

No, unfortunately. :-(

See the end of drivers/dma/fsldma.h.  Likewise with Freescale's PCI host
bridges; for some reason the bus itself being little endian led to the host
bridge control registers also being little endian.

-Scott

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02 15:48                                                       ` Scott Wood
@ 2008-06-03  7:46                                                         ` Haavard Skinnemoen
  0 siblings, 0 replies; 160+ messages in thread
From: Haavard Skinnemoen @ 2008-06-03  7:46 UTC (permalink / raw)
  To: Scott Wood
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	Geert Uytterhoeven, Linus Torvalds, David Miller, alan

Scott Wood <scottwood@freescale.com> wrote:
> On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote:
> > Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > > On Fri, 30 May 2008, Haavard Skinnemoen wrote:
> > > > Maybe we need another interface that does not do byteswapping but
> > > > provides stronger ordering guarantees?
> > > 
> > > The byte swapping depends on the device/bus.
> > 
> > Of course. But isn't it reasonable to assume that a device integrated
> > on the same silicon as the CPU is connected to a somewhat sane bus
> > which doesn't require any byte swapping?
> 
> No, unfortunately. :-(

Ok, I guess I was being naive.

> See the end of drivers/dma/fsldma.h.  Likewise with Freescale's PCI host
> bridges; for some reason the bus itself being little endian led to the host
> bridge control registers also being little endian.

Right. But still, isn't it better to handle it on a case-by-case basis
in the drivers? In some cases, it's best to explicitly use a certain
byte order, in others it's best to use whatever is native to the CPU.

Haavard

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02  8:11                                                     ` Haavard Skinnemoen
  2008-06-02 15:48                                                       ` Scott Wood
@ 2008-06-04 15:31                                                       ` Linus Torvalds
  1 sibling, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-06-04 15:31 UTC (permalink / raw)
  To: Haavard Skinnemoen
  Cc: linux-arch, Matthew Wilcox, linux-kernel, tpiepho, linuxppc-dev,
	Geert Uytterhoeven, scottwood, David Miller, alan

On Mon, 2 Jun 2008, Haavard Skinnemoen wrote:
> 
> > So what happened to the old idea of putting the accessor function pointers
> > in the device/bus structure?
> 
> Don't know. I think it sounds like overkill to replace a simple load or
> store with an indirect function call.

Indeed. *Especially* as the driver in practice tends to always know which 
one it is statically.

Yes, sometimes the same RTL may end up being used behind two differnt 
buses, and with some broken byte-conversion hardware in the middle, but 
that's pretty damn rare in the end. It happens, but it happens mostly with 
the odd broken kind of embedded setups where somebody took a standard part 
and then did something _strange_ to it - and in the process they may well 
have introduced other issues as well.

I suspect you find this kind of thing mostly with things like stupid 
integrated IDE controllers, and things like byte order tends to be the 
_least_ of the issues (specialized register accesses with odd offsets etc 
will happen).

So most of the time the byte order is simply decided by the hardware 
itself (and LE is the common one, due to ISA/PCI). Sometimes you can set 
it dynamically (at which point it's irrelevant - just pick the one you 
want).

So I definitely argue against complex IO accessor functions with things 
like dynamic support for byte order depending on bus. 99.9% of all 
hardware simply do not need them, and the small percentage that might need 
it will need special drivers anyway, so they might as well do the 
complexity on their own rather than make everybody else care.

See for example the input driver for the i8042 chip: not only do those 
things sometimes need native order (probably exactly because it's 
integrated on a chip over some special "native bus", or just wired up to 
be big-endian on a BE platform by some moron), but you'll also find that 
they just need odd accesses anyway exactly because it's oddly wired up 
(eg it's some special serial protocol that emulates ISA accesses).

So having some "byte-order correcting function" still wouldn't make any 
sense, because it's not just about byte order.

Will that mean that you might occasionally have a "normal" driver and an 
"odd" driver that actually drives what is basically the same HW block, 
just wired up differently? Sure. But that's still a smaller price to pay 
than it would be to try to make everything be "generic" when there is no 
point to it.

			Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:38                                   ` Benjamin Herrenschmidt
  2008-05-27 21:42                                     ` Matthew Wilcox
@ 2008-05-27 21:55                                     ` Linus Torvalds
  2008-05-27 22:19                                       ` Benjamin Herrenschmidt
  2008-06-02  7:24                                       ` Russell King
  1 sibling, 2 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-05-27 21:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan

On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> 
> A problem with __raw_ though is that they -also- don't do byteswap,

Well, that's why there is __readl() and __raw_readl(), no?

Neither does ordering, and __raw_readl() doesn't do byte-swap.

Of course, I'm not going to guarantee every architecture even has all 
those versions, nor am I going to guarantee they all work as advertised :)

For x86, they have historially all been 100% identical. With the inline 
asm patch I posted, the "__" version (whether "raw" or not) lack the 
"memory" barrier, so they allow a *little* bit more re-ordering.

(They won't be re-ordered wrt spinlocks etc, unless gcc starts reordering 
volatile asm's against each other, which would be a bug).

In practice, I doubt it matters. Whatever small compiler re-ordering it 
might affect won't have any real performance impack one way or the other, 
I think.

		Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:55                                     ` Linus Torvalds
@ 2008-05-27 22:19                                       ` Benjamin Herrenschmidt
  2008-05-29  7:10                                         ` Arnd Bergmann
  2008-06-02  7:24                                       ` Russell King
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan


On Tue, 2008-05-27 at 14:55 -0700, Linus Torvalds wrote:
> 
> On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> > 
> > A problem with __raw_ though is that they -also- don't do byteswap,
> 
> Well, that's why there is __readl() and __raw_readl(), no?

As I replied to somebody else, __readl() is news to me :-) we dont' have
those on powerpc.

> Neither does ordering, and __raw_readl() doesn't do byte-swap.

But I can add them :-)

> Of course, I'm not going to guarantee every architecture even has all 
> those versions, nor am I going to guarantee they all work as advertised :)
> 
> For x86, they have historially all been 100% identical. With the inline 
> asm patch I posted, the "__" version (whether "raw" or not) lack the 
> "memory" barrier, so they allow a *little* bit more re-ordering.
> 
> (They won't be re-ordered wrt spinlocks etc, unless gcc starts reordering 
> volatile asm's against each other, which would be a bug).
> 
> In practice, I doubt it matters. Whatever small compiler re-ordering it 
> might affect won't have any real performance impack one way or the other, 
> I think.

I prefer explicit endian. Always. Thus I prefer introducing _be variants
(we already have those on powerpc and iomap has it's own _be versions
too) so we should probably generalize _be.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 22:19                                       ` Benjamin Herrenschmidt
@ 2008-05-29  7:10                                         ` Arnd Bergmann
  2008-05-29 10:46                                           ` Alan Cox
  0 siblings, 1 reply; 160+ messages in thread
From: Arnd Bergmann @ 2008-05-29  7:10 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan

On Wednesday 28 May 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-27 at 14:55 -0700, Linus Torvalds wrote:
> > 
> > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> > > 
> > > A problem with __raw_ though is that they -also- don't do byteswap,
> > 
> > Well, that's why there is __readl() and __raw_readl(), no?
> 
> As I replied to somebody else, __readl() is news to me :-) we dont' have
> those on powerpc.
> 

It's not exactly a well-established interface. Only five architectures
define these functions, and there is not a single user in the kernel
source outside of these architecture's io.h files.

	Arnd <><

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29  7:10                                         ` Arnd Bergmann
@ 2008-05-29 10:46                                           ` Alan Cox
  0 siblings, 0 replies; 160+ messages in thread
From: Alan Cox @ 2008-05-29 10:46 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller

> It's not exactly a well-established interface. Only five architectures
> define these functions, and there is not a single user in the kernel
> source outside of these architecture's io.h files.

That is because the drivers using them had them removed (eg I=C2=B2O) - mos=
tly
because "it didn't compile on power pc" 8(


Alan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:55                                     ` Linus Torvalds
  2008-05-27 22:19                                       ` Benjamin Herrenschmidt
@ 2008-06-02  7:24                                       ` Russell King
  2008-06-03  4:16                                         ` Nick Piggin
  1 sibling, 1 reply; 160+ messages in thread
From: Russell King @ 2008-06-02  7:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	David Miller, alan

On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> > 
> > A problem with __raw_ though is that they -also- don't do byteswap,
> 
> Well, that's why there is __readl() and __raw_readl(), no?
> 
> Neither does ordering, and __raw_readl() doesn't do byte-swap.

This is where the lack of documentation causes arch maintainers a big
problem.  None of the semantics of __raw_readl vs __readl vs readl are
documented _anywhere_.  If you look at x86 as a template, there's no
comments there about what the different variants are supposed to do
or not do.

So it's left up to arch maintainers to literally guess what should be
done.  That's precisely what I did when I implemented ARMs __raw_readl
and friends.  I guessed.

And it was only after I read a few mails on lkml which suggested that
readl and friends should always be LE that ARMs readl implementation
started to use le32_to_cpu()... before that it had always been native
endian.  Again, lack of documentation...

So, can the semantics of what's expected from these IO accessor
functions be documented somewhere.  Please?  Before this thread gets
lost in the depths of time?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02  7:24                                       ` Russell King
@ 2008-06-03  4:16                                         ` Nick Piggin
  2008-06-03  4:32                                           ` Benjamin Herrenschmidt
                                                             ` (2 more replies)
  0 siblings, 3 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-03  4:16 UTC (permalink / raw)
  To: Russell King
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan

On Monday 02 June 2008 17:24, Russell King wrote:
> On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote:
> > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> > > A problem with __raw_ though is that they -also- don't do byteswap,
> >
> > Well, that's why there is __readl() and __raw_readl(), no?
> >
> > Neither does ordering, and __raw_readl() doesn't do byte-swap.
>
> This is where the lack of documentation causes arch maintainers a big
> problem.  None of the semantics of __raw_readl vs __readl vs readl are
> documented _anywhere_.  If you look at x86 as a template, there's no
> comments there about what the different variants are supposed to do
> or not do.
>
> So it's left up to arch maintainers to literally guess what should be
> done.  That's precisely what I did when I implemented ARMs __raw_readl
> and friends.  I guessed.
>
> And it was only after I read a few mails on lkml which suggested that
> readl and friends should always be LE that ARMs readl implementation
> started to use le32_to_cpu()... before that it had always been native
> endian.  Again, lack of documentation...
>
> So, can the semantics of what's expected from these IO accessor
> functions be documented somewhere.  Please?  Before this thread gets
> lost in the depths of time?

This whole thread also ties in with my posts about mmiowb (which IMO
should go away).

readl/writel:  strongly ordered wrt one another and other stores
               to cacheable RAM, byteswapping
__readl/__writel:  not ordered (needs mb/rmb/wmb to order with
                   other readl/writel and cacheable operations, or
                   io_*mb to order with one another)
raw_readl/raw_writel:  strongly ordered, no byteswapping
__raw_readl/__raw_writel:  not ordered, no byteswapping

then get rid of *relaxed* variants.

Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is strongly
ordered by default even on x86. You would have to put in some
mfence instructions in them to make it so.

So, what *exact* definition are you going to mandate for readl/writel?
Anything less than strict ordering then we also need to ensure drivers
use the correct barriers (to implement strict ordering, we could either
put mfence instructions in, or explicitly disallow readl/writel to be
used on wc/wc+ memory).

The other way we can go is just say that they have x86 semantics,
although that would be a bit sad IMO: we should have strong ops, in
which case driver writers never need to use a single barrier provided
they have locking right, and weak ops, in which case they should match
up with the weak Linux memory ordering model for system RAM.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  4:16                                         ` Nick Piggin
@ 2008-06-03  4:32                                           ` Benjamin Herrenschmidt
  2008-06-03  6:11                                             ` Nick Piggin
  2008-06-03 14:47                                           ` Linus Torvalds
  2008-06-03 19:43                                           ` Trent Piepho
  2 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-03  4:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan


> This whole thread also ties in with my posts about mmiowb (which IMO
> should go away).
> 
> readl/writel:  strongly ordered wrt one another and other stores
>                to cacheable RAM, byteswapping
> __readl/__writel:  not ordered (needs mb/rmb/wmb to order with
>                    other readl/writel and cacheable operations, or
>                    io_*mb to order with one another)
> raw_readl/raw_writel:  strongly ordered, no byteswapping
> __raw_readl/__raw_writel:  not ordered, no byteswapping
> 
> then get rid of *relaxed* variants.

In addition, some archs like powerpc also provide readl_be/writel_be as
being defined as big endian (ie. byteswap on LE archs, no byteswap on BE
archs).

As of today, powerpc lacks the raw_readl/raw_writel and __readl/__writel
variants (ie, we only provide fully ordered + byteswap and no ordering +
no byteswap variants).

If we agree on the above semantics, I'll do a patch providing the
missing ones.

> Linus: on x86, memory operations to wc and wc+ memory are not ordered
> with one another, or operations to other memory types (ie. load/load
> and store/store reordering is allowed). Also, as you know, store/load
> reordering is explicitly allowed as well, which covers all memory
> types. So perhaps it is not quite true to say readl/writel is strongly
> ordered by default even on x86. You would have to put in some
> mfence instructions in them to make it so.
> 
> So, what *exact* definition are you going to mandate for readl/writel?
> Anything less than strict ordering then we also need to ensure drivers
> use the correct barriers (to implement strict ordering, we could either
> put mfence instructions in, or explicitly disallow readl/writel to be
> used on wc/wc+ memory).

The ordering guarantees that I provide on powerpc for "ordered" variants
are:

	- cacheable store + writel stays ordered (ie, write to some
          DMA stuff and then a register to trigger the DMA).

	- readl + cacheable read stays ordered (ie. read some status
          register, for example, after an interrupt, and then read the
          resulting data in memory).

	- any of these ordered vs. spin_lock and spin_unlock (with the
          exception that stores done before the spin_lock 
          could potentially leak into the lock).

	- readl is synchronous (ie, makes the CPU think the
          data was actually used before executing subsequent
          instructions, thus waits for the data to come back,
          for example to ensure that a read used to push out
          post buffers followed by a delay will indeed happen
          with the right delay).

We don't provide meaningless ones like writel + cacheable store for
example. (PCI posting would defeat it anyway).

> The other way we can go is just say that they have x86 semantics,
> although that would be a bit sad IMO: we should have strong ops, in
> which case driver writers never need to use a single barrier provided
> they have locking right, and weak ops, in which case they should match
> up with the weak Linux memory ordering model for system RAM.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  4:32                                           ` Benjamin Herrenschmidt
@ 2008-06-03  6:11                                             ` Nick Piggin
  2008-06-03  6:48                                               ` Benjamin Herrenschmidt
  2008-06-03  6:53                                               ` Paul Mackerras
  0 siblings, 2 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-03  6:11 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tuesday 03 June 2008 14:32, Benjamin Herrenschmidt wrote:
> > This whole thread also ties in with my posts about mmiowb (which IMO
> > should go away).
> >
> > readl/writel:  strongly ordered wrt one another and other stores
> >                to cacheable RAM, byteswapping
> > __readl/__writel:  not ordered (needs mb/rmb/wmb to order with
> >                    other readl/writel and cacheable operations, or
> >                    io_*mb to order with one another)
> > raw_readl/raw_writel:  strongly ordered, no byteswapping
> > __raw_readl/__raw_writel:  not ordered, no byteswapping
> >
> > then get rid of *relaxed* variants.
>
> In addition, some archs like powerpc also provide readl_be/writel_be as
> being defined as big endian (ie. byteswap on LE archs, no byteswap on BE
> archs).

Sure.


> As of today, powerpc lacks the raw_readl/raw_writel and __readl/__writel
> variants (ie, we only provide fully ordered + byteswap and no ordering +
> no byteswap variants).
>
> If we agree on the above semantics, I'll do a patch providing the
> missing ones.

Let's see what Linus thinks...


> > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > with one another, or operations to other memory types (ie. load/load
> > and store/store reordering is allowed). Also, as you know, store/load
> > reordering is explicitly allowed as well, which covers all memory
> > types. So perhaps it is not quite true to say readl/writel is strongly
> > ordered by default even on x86. You would have to put in some
> > mfence instructions in them to make it so.
> >
> > So, what *exact* definition are you going to mandate for readl/writel?
> > Anything less than strict ordering then we also need to ensure drivers
> > use the correct barriers (to implement strict ordering, we could either
> > put mfence instructions in, or explicitly disallow readl/writel to be
> > used on wc/wc+ memory).
>
> The ordering guarantees that I provide on powerpc for "ordered" variants
> are:
>
> 	- cacheable store + writel stays ordered (ie, write to some
>           DMA stuff and then a register to trigger the DMA).
>
> 	- readl + cacheable read stays ordered (ie. read some status
>           register, for example, after an interrupt, and then read the
>           resulting data in memory).
>
> 	- any of these ordered vs. spin_lock and spin_unlock (with the
>           exception that stores done before the spin_lock
>           could potentially leak into the lock).
>
> 	- readl is synchronous (ie, makes the CPU think the
>           data was actually used before executing subsequent
>           instructions, thus waits for the data to come back,
>           for example to ensure that a read used to push out
>           post buffers followed by a delay will indeed happen
>           with the right delay).

So your readl can pass an earlier cacheable store or earlier writel?


> We don't provide meaningless ones like writel + cacheable store for
> example. (PCI posting would defeat it anyway).

What do you mean by meaningless? Ordering of writel followed by a
cacheable store  is meaningful eg. for retaining io operations within
locks. OK, you explicitly have some extra code for spin_unlock, but
not for bit locks, mutexes, etc. It would make sense to have the
default operations _very_ strongly ordered IMO, and then move drivers
to be more relaxed when they are verified.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  6:11                                             ` Nick Piggin
@ 2008-06-03  6:48                                               ` Benjamin Herrenschmidt
  2008-06-03  6:53                                               ` Paul Mackerras
  1 sibling, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-03  6:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, 2008-06-03 at 16:11 +1000, Nick Piggin wrote:
> >       - readl is synchronous (ie, makes the CPU think the
> >           data was actually used before executing subsequent
> >           instructions, thus waits for the data to come back,
> >           for example to ensure that a read used to push out
> >           post buffers followed by a delay will indeed happen
> >           with the right delay).
> 
> So your readl can pass an earlier cacheable store or earlier writel?

I forgot to mention that all MMIO are ordered vs. each other and I
do prevent readl from passing earlier cacheable stores too in my
current implementation but I'n not 100% we want to "guarantee" that,
unless we have stupid devices that trigger DMA's on reads with side
effects.. anyway, it is guaranteed in the current case.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  6:11                                             ` Nick Piggin
  2008-06-03  6:48                                               ` Benjamin Herrenschmidt
@ 2008-06-03  6:53                                               ` Paul Mackerras
  2008-06-03  7:18                                                 ` Nick Piggin
  1 sibling, 1 reply; 160+ messages in thread
From: Paul Mackerras @ 2008-06-03  6:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

Nick Piggin writes:

> So your readl can pass an earlier cacheable store or earlier writel?

No.  It's quite gross at the moment, it has a sync before the access
(i.e. a full mb()) and a twi; isync sequence after the access that
stalls execution until the data comes back.

> > We don't provide meaningless ones like writel + cacheable store for
> > example. (PCI posting would defeat it anyway).
> 
> What do you mean by meaningless? Ordering of writel followed by a
> cacheable store  is meaningful eg. for retaining io operations within
> locks. OK, you explicitly have some extra code for spin_unlock, but
> not for bit locks, mutexes, etc. It would make sense to have the
> default operations _very_ strongly ordered IMO, and then move drivers
> to be more relaxed when they are verified.

It's meaningless in the sense that nothing guarantees that the writel
has actually hit the device, even if we put a full mb() barrier in
between the writel and the cacheable write.  That would guarantee that
the writel had got to the PCI host bridge, and couldn't be reordered
with other accesses to the device, but not that the device had
actually seen it.

I don't mind adding code to the mutex unlock to do the same as
spin_unlock, but I really don't want to have *two* sync instructions
for every MMIO write.  One is bad enough.

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  6:53                                               ` Paul Mackerras
@ 2008-06-03  7:18                                                 ` Nick Piggin
  0 siblings, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-03  7:18 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tuesday 03 June 2008 16:53, Paul Mackerras wrote:
> Nick Piggin writes:
> > So your readl can pass an earlier cacheable store or earlier writel?
>
> No.  It's quite gross at the moment, it has a sync before the access
> (i.e. a full mb()) and a twi; isync sequence after the access that
> stalls execution until the data comes back.

OK.

> > > We don't provide meaningless ones like writel + cacheable store for
> > > example. (PCI posting would defeat it anyway).
> >
> > What do you mean by meaningless? Ordering of writel followed by a
> > cacheable store  is meaningful eg. for retaining io operations within
> > locks. OK, you explicitly have some extra code for spin_unlock, but
> > not for bit locks, mutexes, etc. It would make sense to have the
> > default operations _very_ strongly ordered IMO, and then move drivers
> > to be more relaxed when they are verified.
>
> It's meaningless in the sense that nothing guarantees that the writel
> has actually hit the device, even if we put a full mb() barrier in
> between the writel and the cacheable write.  That would guarantee that
> the writel had got to the PCI host bridge, and couldn't be reordered
> with other accesses to the device, but not that the device had
> actually seen it.

OK, but I think fits OK with our SMP ordering model for cacheable
stores: no amount of barriers on CPU0 will guarantee that CPU1 has
seen the store, you actually have to observe a causual effect
of the store before you can say that.

> I don't mind adding code to the mutex unlock to do the same as
> spin_unlock, but I really don't want to have *two* sync instructions
> for every MMIO write.  One is bad enough.

So you can't provide iostore/store ordering without another sync
after the writel store?

I guess the problem with providing exceptions is that it makes it
hard for people who absolutely don't know or care about ordering.
I don't like having to think about it "hmm, we can allow this type
of reordering... oh, unless some silly device does X...".

If we come up with a sane set of weakly ordered accessors
(including io_*mb()), it will make it easier to go through the
important drivers and improve them. We don't have to enforce the
the new semantics strictly until then if they'll slow you down
too much in the meantime.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  4:16                                         ` Nick Piggin
  2008-06-03  4:32                                           ` Benjamin Herrenschmidt
@ 2008-06-03 14:47                                           ` Linus Torvalds
  2008-06-03 18:47                                             ` Trent Piepho
  2008-06-04  2:19                                             ` Nick Piggin
  2008-06-03 19:43                                           ` Trent Piepho
  2 siblings, 2 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-06-03 14:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, David Miller, alan



On Tue, 3 Jun 2008, Nick Piggin wrote:
> 
> Linus: on x86, memory operations to wc and wc+ memory are not ordered
> with one another, or operations to other memory types (ie. load/load
> and store/store reordering is allowed). Also, as you know, store/load
> reordering is explicitly allowed as well, which covers all memory
> types. So perhaps it is not quite true to say readl/writel is strongly
> ordered by default even on x86. You would have to put in some
> mfence instructions in them to make it so.

Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that 
does that needs to be aware of it. IOW, it's a non-issue, imnsho.

			Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 14:47                                           ` Linus Torvalds
@ 2008-06-03 18:47                                             ` Trent Piepho
  2008-06-03 18:55                                               ` Matthew Wilcox
  2008-06-03 19:07                                               ` Linus Torvalds
  2008-06-04  2:19                                             ` Nick Piggin
  1 sibling, 2 replies; 160+ messages in thread
From: Trent Piepho @ 2008-06-03 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, David Miller, alan

On Tue, 3 Jun 2008, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Nick Piggin wrote:
>>
>> Linus: on x86, memory operations to wc and wc+ memory are not ordered
>> with one another, or operations to other memory types (ie. load/load
>> and store/store reordering is allowed). Also, as you know, store/load
>> reordering is explicitly allowed as well, which covers all memory
>> types. So perhaps it is not quite true to say readl/writel is strongly
>> ordered by default even on x86. You would have to put in some
>> mfence instructions in them to make it so.

So on x86, these could be re-ordered?

writel(START_OPERATION, CONTROL_REGISTER);
status = readl(STATUS_REGISTER);

> Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> does that needs to be aware of it. IOW, it's a non-issue, imnsho.

You need to ask for coherent DMA memory too.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 18:47                                             ` Trent Piepho
@ 2008-06-03 18:55                                               ` Matthew Wilcox
  2008-06-03 19:57                                                 ` Trent Piepho
  2008-06-03 19:07                                               ` Linus Torvalds
  1 sibling, 1 reply; 160+ messages in thread
From: Matthew Wilcox @ 2008-06-03 18:55 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Linus Torvalds wrote:
> >On Tue, 3 Jun 2008, Nick Piggin wrote:
> >>
> >>Linus: on x86, memory operations to wc and wc+ memory are not ordered
> >>with one another, or operations to other memory types (ie. load/load
> >>and store/store reordering is allowed). Also, as you know, store/load
> >>reordering is explicitly allowed as well, which covers all memory
> >>types. So perhaps it is not quite true to say readl/writel is strongly
> >>ordered by default even on x86. You would have to put in some
> >>mfence instructions in them to make it so.
> 
> So on x86, these could be re-ordered?
> 
> writel(START_OPERATION, CONTROL_REGISTER);
> status = readl(STATUS_REGISTER);

You wouldn't ask for write-combining memory mapping for control or
status registers.

> >Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> >does that needs to be aware of it. IOW, it's a non-issue, imnsho.
> 
> You need to ask for coherent DMA memory too.

Different case.  Coherent DMA memory is *host* memory that the *device*
accesses.  We're talking about *device* memory that the *cpu* accesses.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 18:55                                               ` Matthew Wilcox
@ 2008-06-03 19:57                                                 ` Trent Piepho
  2008-06-03 21:35                                                   ` Matthew Wilcox
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-06-03 19:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
>> On Tue, 3 Jun 2008, Linus Torvalds wrote:
>>> On Tue, 3 Jun 2008, Nick Piggin wrote:
>>>>
>>>> Linus: on x86, memory operations to wc and wc+ memory are not ordered
>>>> with one another, or operations to other memory types (ie. load/load
>>>> and store/store reordering is allowed). Also, as you know, store/load
>>>> reordering is explicitly allowed as well, which covers all memory
>>>> types. So perhaps it is not quite true to say readl/writel is strongly
>>>> ordered by default even on x86. You would have to put in some
>>>> mfence instructions in them to make it so.
>>
>> So on x86, these could be re-ordered?
>>
>> writel(START_OPERATION, CONTROL_REGISTER);
>> status = readl(STATUS_REGISTER);
>
> You wouldn't ask for write-combining memory mapping for control or
> status registers.

But Nick said, "store/load reordering is explicitly allowed as well, which
covers *all* memory types."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 19:57                                                 ` Trent Piepho
@ 2008-06-03 21:35                                                   ` Matthew Wilcox
  2008-06-03 21:58                                                     ` Trent Piepho
  0 siblings, 1 reply; 160+ messages in thread
From: Matthew Wilcox @ 2008-06-03 21:35 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> >On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
> >>On Tue, 3 Jun 2008, Linus Torvalds wrote:
> >>>On Tue, 3 Jun 2008, Nick Piggin wrote:
> >>>>
> >>>>Linus: on x86, memory operations to wc and wc+ memory are not ordered
> >>>>with one another, or operations to other memory types (ie. load/load
> >>>>and store/store reordering is allowed). Also, as you know, store/load
> >>>>reordering is explicitly allowed as well, which covers all memory
> >>>>types. So perhaps it is not quite true to say readl/writel is strongly
> >>>>ordered by default even on x86. You would have to put in some
> >>>>mfence instructions in them to make it so.
> >>
> >>So on x86, these could be re-ordered?
> >>
> >>writel(START_OPERATION, CONTROL_REGISTER);
> >>status = readl(STATUS_REGISTER);
> >
> >You wouldn't ask for write-combining memory mapping for control or
> >status registers.
> 
> But Nick said, "store/load reordering is explicitly allowed as well, which
> covers *all* memory types."

Then Nick is confused.  PCI only defines one way to flush posted writes
to a device -- doing a read from it.  There's no way that reads can
be allowed to pass writes (unless you've asked for it, like with write
combining).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 21:35                                                   ` Matthew Wilcox
@ 2008-06-03 21:58                                                     ` Trent Piepho
  2008-06-04  2:00                                                       ` Nick Piggin
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-06-03 21:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
>> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
>>> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
>>>> On Tue, 3 Jun 2008, Linus Torvalds wrote:
>>>>> On Tue, 3 Jun 2008, Nick Piggin wrote:
>>>>>>
>>>>>> Linus: on x86, memory operations to wc and wc+ memory are not ordered
>>>>>> with one another, or operations to other memory types (ie. load/load
>>>>>> and store/store reordering is allowed). Also, as you know, store/load
>>>>>> reordering is explicitly allowed as well, which covers all memory
>>>>>> types. So perhaps it is not quite true to say readl/writel is strongly
>>>>>> ordered by default even on x86. You would have to put in some
>>>>>> mfence instructions in them to make it so.
>>>>
>>>> So on x86, these could be re-ordered?
>>>>
>>>> writel(START_OPERATION, CONTROL_REGISTER);
>>>> status = readl(STATUS_REGISTER);
>>>
>>> You wouldn't ask for write-combining memory mapping for control or
>>> status registers.
>>
>> But Nick said, "store/load reordering is explicitly allowed as well, which
>> covers *all* memory types."
>
> Then Nick is confused.  PCI only defines one way to flush posted writes
> to a device -- doing a read from it.  There's no way that reads can
> be allowed to pass writes (unless you've asked for it, like with write
> combining).

But that requirement is for the PCI bridge, isn't it?  It doesn't matter if
the bridge will flush all posted writes before allowing a read if the CPU
decides to give the bridge the read before the write.  A powerpc CPU will
certainly do this if you don't take any steps like telling it the memory is
uncachable and guarded.  I didn't think it was allowed on x86 (except with
WC), but Nick seemed to say it was.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 21:58                                                     ` Trent Piepho
@ 2008-06-04  2:00                                                       ` Nick Piggin
  0 siblings, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-04  2:00 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan

On Wednesday 04 June 2008 07:58, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> > On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
> >> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> >>> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
> >>>> On Tue, 3 Jun 2008, Linus Torvalds wrote:
> >>>>> On Tue, 3 Jun 2008, Nick Piggin wrote:
> >>>>>> Linus: on x86, memory operations to wc and wc+ memory are not
> >>>>>> ordered with one another, or operations to other memory types (ie.
> >>>>>> load/load and store/store reordering is allowed). Also, as you know,
> >>>>>> store/load reordering is explicitly allowed as well, which covers
> >>>>>> all memory types. So perhaps it is not quite true to say
> >>>>>> readl/writel is strongly ordered by default even on x86. You would
> >>>>>> have to put in some mfence instructions in them to make it so.
> >>>>
> >>>> So on x86, these could be re-ordered?
> >>>>
> >>>> writel(START_OPERATION, CONTROL_REGISTER);
> >>>> status = readl(STATUS_REGISTER);
> >>>
> >>> You wouldn't ask for write-combining memory mapping for control or
> >>> status registers.
> >>
> >> But Nick said, "store/load reordering is explicitly allowed as well,
> >> which covers *all* memory types."
> >
> > Then Nick is confused.  PCI only defines one way to flush posted writes
> > to a device -- doing a read from it.  There's no way that reads can
> > be allowed to pass writes (unless you've asked for it, like with write
> > combining).
>
> But that requirement is for the PCI bridge, isn't it?  It doesn't matter if
> the bridge will flush all posted writes before allowing a read if the CPU
> decides to give the bridge the read before the write.  A powerpc CPU will
> certainly do this if you don't take any steps like telling it the memory is
> uncachable and guarded.  I didn't think it was allowed on x86 (except with
> WC), but Nick seemed to say it was.

Ah sorry, not UC, I was confused. UC I think actually is strongly ordered
WRT other UC and also cacheable operations.

WC is weakly ordered, anything goes.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 18:47                                             ` Trent Piepho
  2008-06-03 18:55                                               ` Matthew Wilcox
@ 2008-06-03 19:07                                               ` Linus Torvalds
  2008-06-04  2:05                                                 ` Nick Piggin
  2008-06-10  6:56                                                 ` Nick Piggin
  1 sibling, 2 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-06-03 19:07 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, David Miller, alan

On Tue, 3 Jun 2008, Trent Piepho wrote:

> On Tue, 3 Jun 2008, Linus Torvalds wrote:
> > On Tue, 3 Jun 2008, Nick Piggin wrote:
> > > 
> > > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > > with one another, or operations to other memory types (ie. load/load
> > > and store/store reordering is allowed). Also, as you know, store/load
> > > reordering is explicitly allowed as well, which covers all memory
> > > types. So perhaps it is not quite true to say readl/writel is strongly
> > > ordered by default even on x86. You would have to put in some
> > > mfence instructions in them to make it so.
> 
> So on x86, these could be re-ordered?
> 
> writel(START_OPERATION, CONTROL_REGISTER);
> status = readl(STATUS_REGISTER);

With both registers in a WC+ area, yes. The write may be in the WC buffers 
until the WC buffers are flushed (short list: a fence, a serializing 
instruction, a read-write to uncached memory, or an interrupt. There are 
others, but those are the main ones).

But if the status register is in uncached memory (which is the only *sane* 
thing to do), then it doesn't matter if the control register is in WC 
memory. Because the status register read is itself serializing with the WC 
buffer, it's actually fine.

So this is used for putting things like ring queues in WC memory, and fill 
them up with writes, and get nice bursty write traffic with the CPU 
automatically buffering it up (think "stdio.h on a really low level"). And 
if you then have the command registers in UC memory or using IO port 
accesses, reading and writing to them will automatically serialize.

> > Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> > does that needs to be aware of it. IOW, it's a non-issue, imnsho.
> 
> You need to ask for coherent DMA memory too.

Not on x86.

And I don't care what anybody else says - x86 is *so* totally dominant, 
that other architectures have to live with the fact that 99.9% of all 
drivers are written for and tested on x86. As a result, anything else is 
"theory".

Are some drivers good and are careful? Yes. Are most? No, and even if they 
were, people making changes would mostly test them on x86. Architectures 
should strive to basically act as closely as possible to x86 semantics in 
order to not get nasty surprises.

And architectures that can't do that well enough *will* often find that 
they need to fix drivers, and only a small small subset of all kernel 
drivers will generally work out-of-the-box.

If you're ready to do that, your architecture can do anything it damn well 
pleases ;)

			Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 19:07                                               ` Linus Torvalds
@ 2008-06-04  2:05                                                 ` Nick Piggin
  2008-06-04  2:46                                                   ` Linus Torvalds
  2008-06-10  6:56                                                 ` Nick Piggin
  1 sibling, 1 reply; 160+ messages in thread
From: Nick Piggin @ 2008-06-04  2:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Russell King, linux-kernel, David Miller,
	linuxppc-dev, scottwood, Trent Piepho, alan

On Wednesday 04 June 2008 05:07, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Trent Piepho wrote:
> > On Tue, 3 Jun 2008, Linus Torvalds wrote:
> > > On Tue, 3 Jun 2008, Nick Piggin wrote:
> > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > > > with one another, or operations to other memory types (ie. load/load
> > > > and store/store reordering is allowed). Also, as you know, store/load
> > > > reordering is explicitly allowed as well, which covers all memory
> > > > types. So perhaps it is not quite true to say readl/writel is
> > > > strongly ordered by default even on x86. You would have to put in
> > > > some mfence instructions in them to make it so.
> >
> > So on x86, these could be re-ordered?
> >
> > writel(START_OPERATION, CONTROL_REGISTER);
> > status = readl(STATUS_REGISTER);
>
> With both registers in a WC+ area, yes. The write may be in the WC buffers
> until the WC buffers are flushed (short list: a fence, a serializing
> instruction, a read-write to uncached memory, or an interrupt. There are
> others, but those are the main ones).
>
> But if the status register is in uncached memory (which is the only *sane*
> thing to do), then it doesn't matter if the control register is in WC
> memory. Because the status register read is itself serializing with the WC
> buffer, it's actually fine.

Actually, according to the document I am looking at (the AMD one), a UC
store may pass a previous WC store.

So you could have some code that writes to some WC memory on the card,
and then stores to an UC control register to start up the operation on
that memory, couldn't you? Those can go out of order.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-04  2:05                                                 ` Nick Piggin
@ 2008-06-04  2:46                                                   ` Linus Torvalds
  2008-06-04 11:47                                                     ` Alan Cox
  0 siblings, 1 reply; 160+ messages in thread
From: Linus Torvalds @ 2008-06-04  2:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, linux-kernel, David Miller,
	linuxppc-dev, scottwood, Trent Piepho, alan

On Wed, 4 Jun 2008, Nick Piggin wrote:
> 
> Actually, according to the document I am looking at (the AMD one), a UC
> store may pass a previous WC store.

Hmm. Intel arch manyal, Vol 3, 10.3 (page 10-7 in my version):

  "If the WC bufer is partially filled, the writes may be delayed until 
   the next ocurrence of a serializing event; such as, an SFENCE or MFENCE 
   instruction, CPUID execution, a read or write to uncached memory, ..."

Any typos mine.

Anyway, Intel certainly seems to document that WC memory is serialized by 
any access to UC memory.

But yes, I can well imagine that AMD is different, and I also heartily 
would recommend rather being safe than sorry. Putting an explicit memory 
barrier in between those accesses when you know it might make a difference 
is just a good idea. 

But basically, as far as I know the thing was designed to be invisible to 
old software: that is the whole idea behind WC memory. So the design was 
certainly intended to be that you can generally mark a framebuffer-like 
structure WC without any software _ever_ caring, as long as you keep all 
control ports in UC memory.

Of course, because burst writes from the WC buffer are <i>so</i> much more 
efficient on the PCI bus than dribbling them out one write at a time, it 
didn't take long before all the graphics cards etc wanted to <i>also</i> 
mark their command queues as WC memory, so that you could burst out the 
commands to the ring buffers as fast as possible. So now you have both 
your frame buffer *and* your command buffers mapped WC, and now ordering 
really has to be ensured in software if you access both.

[ And then there are the crazy people who mark *main memory* as WC, 
  because they don't want to pollute the cache with all the data, and then 
  you have the issue of cache coherency etc crap. Which only gets worse 
  with SMP, especially if one processor thinks it has part of memory 
  exclusively cached, and another one - or even the same one, 
  through another aliasign address - ignores the cache protocol.

  And you now get unhappy CPU's that think that there is a bug in the 
  cache protocol and they get machine check faults.

  So what started out as a "we can do accesses to the frame buffer more 
  efficiently without anybody ever even having to know or care" has 
  turned into a whole nightmare of people using it for other things, and 
  then you very much _do_ have to care! ]

And it doesn't surprise me if AMD then didn't get exactly the same 
rules. 

Oh, well.

		Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-04  2:46                                                   ` Linus Torvalds
@ 2008-06-04 11:47                                                     ` Alan Cox
  0 siblings, 0 replies; 160+ messages in thread
From: Alan Cox @ 2008-06-04 11:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin, Nick Piggin, Russell King, linux-kernel, Trent Piepho,
	linuxppc-dev, linux-arch, scottwood, David Miller

> Anyway, Intel certainly seems to document that WC memory is serialized by=
=20
> any access to UC memory.

I don't believe that is actually true on Pentium Pro at least.

>   So what started out as a "we can do accesses to the frame buffer more=20
>   efficiently without anybody ever even having to know or care" has=20
>   turned into a whole nightmare of people using it for other things, and=
=20
>   then you very much _do_ have to care! ]

Notably graphics where 3Dfx lined the registers up specifically to make
this work. We were also using it with I=C2=B2O where it gave a small
performance gain.

Alan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 19:07                                               ` Linus Torvalds
  2008-06-04  2:05                                                 ` Nick Piggin
@ 2008-06-10  6:56                                                 ` Nick Piggin
  2008-06-10 17:41                                                   ` Jesse Barnes
                                                                     ` (2 more replies)
  1 sibling, 3 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-10  6:56 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Wilcox
  Cc: linux-arch, Russell King, linux-kernel, David Miller,
	linuxppc-dev, scottwood, Trent Piepho, alan

On Wednesday 04 June 2008 05:07, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Trent Piepho wrote:
> > On Tue, 3 Jun 2008, Linus Torvalds wrote:
> > > On Tue, 3 Jun 2008, Nick Piggin wrote:
> > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > > > with one another, or operations to other memory types (ie. load/load
> > > > and store/store reordering is allowed). Also, as you know, store/load
> > > > reordering is explicitly allowed as well, which covers all memory
> > > > types. So perhaps it is not quite true to say readl/writel is
> > > > strongly ordered by default even on x86. You would have to put in
> > > > some mfence instructions in them to make it so.
> >
> > So on x86, these could be re-ordered?
> >
> > writel(START_OPERATION, CONTROL_REGISTER);
> > status = readl(STATUS_REGISTER);
>
> With both registers in a WC+ area, yes. The write may be in the WC buffers
> until the WC buffers are flushed (short list: a fence, a serializing
> instruction, a read-write to uncached memory, or an interrupt. There are
> others, but those are the main ones).
>
> But if the status register is in uncached memory (which is the only *sane*
> thing to do), then it doesn't matter if the control register is in WC
> memory. Because the status register read is itself serializing with the WC
> buffer, it's actually fine.
>
> So this is used for putting things like ring queues in WC memory, and fill
> them up with writes, and get nice bursty write traffic with the CPU
> automatically buffering it up (think "stdio.h on a really low level"). And
> if you then have the command registers in UC memory or using IO port
> accesses, reading and writing to them will automatically serialize.

OK, I'm sitll not quite sure where this has ended up. I guess you are happy
with x86 semantics as they are now. That is, all IO accesses are strongly
ordered WRT one another and WRT cacheable memory (which includes keeping
them within spinlocks), *unless* one asks for WC memory, in which case that
memory is quite weakly ordered (and is not even ordered by a regular IO
readl, at least according to AMD spec). So for WC memory, one still needs
to use mb/rmb/wmb.

So that still doesn't tell us what *minimum* level of ordering we should
provide in the cross platform readl/writel API. Some relatively sane
suggestions would be:

- as strong as x86. guaranteed not to break drivers that work on x86,
  but slower on some archs. To me, this is most pleasing. It is much
  much easier to notice something is going a little slower and to work
  out how to use weaker ordering there, than it is to debug some
  once-in-a-bluemoon breakage caused by just the right architecture,
  driver, etc. It totally frees up the driver writer from thinking
  about barriers, provided they get the locking right.

- ordered WRT other IO accessors, constrained within spinlocks, but not
  cacheable memory. This is what powerpc does now. It's a little faster
  for them, and probably covers the vast majority of drivers, but there
  are real possibilities to get it wrong (trivial example: using bit
  locks or mutexes or any kind of open coded locking or lockless
  synchronisation can break).

- (less sane) same as above, but not ordered WRT spinlocks. This is what
  ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
  powerpc, but in practice, it will break a lot more drivers than powerpc.

I was kind of joking about taking control of this issue :) But seriously,
it needs a decision to be made. I vote for #1. My rationale: I'm still
finding relatively major (well, found maybe 4 or 5 in the last couple of
years) bugs in the mm subsystem due to memory ordering problems. This is
apparently one of the most well reviewed and tested bit of code in the
kernel by people who know all about memory ordering. Not to mention that
mm/ does not have to worry about IO ordering at all. Then apparently
driver are the least reviewed and tested. Connect dots.

Now that doesn't leave waker ordering architectures lumped with "slow old
x86 semantics". Think of it as giving them the benefit of sharing x86
development and testing :) We can then formalise the relaxed __ accessors
to be more complete (ie. +/- byteswapping). I'd also propose to add
io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
sn2 where the io/cacheable barrier is pretty expensive.

Any comments?

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10  6:56                                                 ` Nick Piggin
@ 2008-06-10 17:41                                                   ` Jesse Barnes
  2008-06-10 18:10                                                     ` James Bottomley
  2008-06-11  4:18                                                   ` Paul Mackerras
  2008-06-11  5:20                                                   ` Paul Mackerras
  2 siblings, 1 reply; 160+ messages in thread
From: Jesse Barnes @ 2008-06-10 17:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

On Monday, June 09, 2008 11:56 pm Nick Piggin wrote:
> So that still doesn't tell us what *minimum* level of ordering we should
> provide in the cross platform readl/writel API. Some relatively sane
> suggestions would be:
>
> - as strong as x86. guaranteed not to break drivers that work on x86,
>   but slower on some archs. To me, this is most pleasing. It is much
>   much easier to notice something is going a little slower and to work
>   out how to use weaker ordering there, than it is to debug some
>   once-in-a-bluemoon breakage caused by just the right architecture,
>   driver, etc. It totally frees up the driver writer from thinking
>   about barriers, provided they get the locking right.
>
> - ordered WRT other IO accessors, constrained within spinlocks, but not
>   cacheable memory. This is what powerpc does now. It's a little faster
>   for them, and probably covers the vast majority of drivers, but there
>   are real possibilities to get it wrong (trivial example: using bit
>   locks or mutexes or any kind of open coded locking or lockless
>   synchronisation can break).
>
> - (less sane) same as above, but not ordered WRT spinlocks. This is what
>   ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
>   powerpc, but in practice, it will break a lot more drivers than powerpc.
>
> I was kind of joking about taking control of this issue :) But seriously,
> it needs a decision to be made. I vote for #1. My rationale: I'm still
> finding relatively major (well, found maybe 4 or 5 in the last couple of
> years) bugs in the mm subsystem due to memory ordering problems. This is
> apparently one of the most well reviewed and tested bit of code in the
> kernel by people who know all about memory ordering. Not to mention that
> mm/ does not have to worry about IO ordering at all. Then apparently
> driver are the least reviewed and tested. Connect dots.
>
> Now that doesn't leave waker ordering architectures lumped with "slow old
> x86 semantics". Think of it as giving them the benefit of sharing x86
> development and testing :) We can then formalise the relaxed __ accessors
> to be more complete (ie. +/- byteswapping). I'd also propose to add
> io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
> sn2 where the io/cacheable barrier is pretty expensive.
>
> Any comments?

FWIW that approach sounds pretty good to me.  Arches that suffer from 
performance penalties can still add lower level primitives and port selected 
drivers over, so really they won't be losing much.  AFAICT though drivers 
will still have to worry about regular memory ordering issues; they'll just 
be safe from I/O related ones. :)  Still, the simplification is probably 
worth it.

Jesse

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10 17:41                                                   ` Jesse Barnes
@ 2008-06-10 18:10                                                     ` James Bottomley
  2008-06-10 19:05                                                       ` Roland Dreier
  0 siblings, 1 reply; 160+ messages in thread
From: James Bottomley @ 2008-06-10 18:10 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: linux-arch, Nick Piggin, Russell King, Matthew Wilcox,
	linux-kernel, David Miller, linuxppc-dev, scottwood,
	Linus Torvalds, Trent Piepho, alan

On Tue, 2008-06-10 at 10:41 -0700, Jesse Barnes wrote:
> On Monday, June 09, 2008 11:56 pm Nick Piggin wrote:
> > So that still doesn't tell us what *minimum* level of ordering we should
> > provide in the cross platform readl/writel API. Some relatively sane
> > suggestions would be:
> >
> > - as strong as x86. guaranteed not to break drivers that work on x86,
> >   but slower on some archs. To me, this is most pleasing. It is much
> >   much easier to notice something is going a little slower and to work
> >   out how to use weaker ordering there, than it is to debug some
> >   once-in-a-bluemoon breakage caused by just the right architecture,
> >   driver, etc. It totally frees up the driver writer from thinking
> >   about barriers, provided they get the locking right.
> >
> > - ordered WRT other IO accessors, constrained within spinlocks, but not
> >   cacheable memory. This is what powerpc does now. It's a little faster
> >   for them, and probably covers the vast majority of drivers, but there
> >   are real possibilities to get it wrong (trivial example: using bit
> >   locks or mutexes or any kind of open coded locking or lockless
> >   synchronisation can break).
> >
> > - (less sane) same as above, but not ordered WRT spinlocks. This is what
> >   ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
> >   powerpc, but in practice, it will break a lot more drivers than powerpc.
> >
> > I was kind of joking about taking control of this issue :) But seriously,
> > it needs a decision to be made. I vote for #1. My rationale: I'm still
> > finding relatively major (well, found maybe 4 or 5 in the last couple of
> > years) bugs in the mm subsystem due to memory ordering problems. This is
> > apparently one of the most well reviewed and tested bit of code in the
> > kernel by people who know all about memory ordering. Not to mention that
> > mm/ does not have to worry about IO ordering at all. Then apparently
> > driver are the least reviewed and tested. Connect dots.
> >
> > Now that doesn't leave waker ordering architectures lumped with "slow old
> > x86 semantics". Think of it as giving them the benefit of sharing x86
> > development and testing :) We can then formalise the relaxed __ accessors
> > to be more complete (ie. +/- byteswapping). I'd also propose to add
> > io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
> > sn2 where the io/cacheable barrier is pretty expensive.
> >
> > Any comments?
> 
> FWIW that approach sounds pretty good to me.  Arches that suffer from 
> performance penalties can still add lower level primitives and port selected 
> drivers over, so really they won't be losing much.  AFAICT though drivers 
> will still have to worry about regular memory ordering issues; they'll just 
> be safe from I/O related ones. :)  Still, the simplification is probably 
> worth it.

me too.  That's the whole basis for readX_relaxed() and its cohorts: we
make our weirdest machines (like altix) conform to the x86 norm.  Then
where it really kills us we introduce additional semantics to selected
drivers that enable us to recover I/O speed on the abnormal platforms.

About the only problem we've had is that architectures aren't very good
at co-ordinating for their additional accessors so we tend to get a
forest of strange ones growing up, which appear only in a few drivers
(i.e. the ones that need the speed ups) and which have no well
documented meaning.

James

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10 18:10                                                     ` James Bottomley
@ 2008-06-10 19:05                                                       ` Roland Dreier
  2008-06-10 19:19                                                         ` Jesse Barnes
  0 siblings, 1 reply; 160+ messages in thread
From: Roland Dreier @ 2008-06-10 19:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-arch, Nick Piggin, Russell King, Matthew Wilcox,
	linux-kernel, Jesse Barnes, David Miller, linuxppc-dev, scottwood,
	Linus Torvalds, Trent Piepho, alan

 > me too.  That's the whole basis for readX_relaxed() and its cohorts: we
 > make our weirdest machines (like altix) conform to the x86 norm.  Then
 > where it really kills us we introduce additional semantics to selected
 > drivers that enable us to recover I/O speed on the abnormal platforms.

Except as I pointed out before, Altix doesn't conform to the norm and
many (most?) drivers are missing mmiowb()s that are needed for Altix.
Just no one has plugged most devices into an Altix (or haven't stressed
the driver in a way that exposes problems of IO ordering between CPUs).

It would be a great thing to use the powerpc trick of setting a flag
that is tested by spin_unlock()/mutex_unlock() and automatically doing
the mmiowb() if needed, and then killing off mmiowb() entirely.

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10 19:05                                                       ` Roland Dreier
@ 2008-06-10 19:19                                                         ` Jesse Barnes
  2008-06-11  3:29                                                           ` Nick Piggin
  0 siblings, 1 reply; 160+ messages in thread
From: Jesse Barnes @ 2008-06-10 19:19 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, Nick Piggin, Russell King, Matthew Wilcox,
	linux-kernel, David Miller, James Bottomley, linuxppc-dev,
	scottwood, Linus Torvalds, Trent Piepho, alan

On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
>  > me too.  That's the whole basis for readX_relaxed() and its cohorts: we
>  > make our weirdest machines (like altix) conform to the x86 norm.  Then
>  > where it really kills us we introduce additional semantics to selected
>  > drivers that enable us to recover I/O speed on the abnormal platforms.
>
> Except as I pointed out before, Altix doesn't conform to the norm and
> many (most?) drivers are missing mmiowb()s that are needed for Altix.
> Just no one has plugged most devices into an Altix (or haven't stressed
> the driver in a way that exposes problems of IO ordering between CPUs).
>
> It would be a great thing to use the powerpc trick of setting a flag
> that is tested by spin_unlock()/mutex_unlock() and automatically doing
> the mmiowb() if needed, and then killing off mmiowb() entirely.

Yeah I think that's what Nick's guidelines would guarantee.  And Jes is 
already working on the spin_unlock change you mentioned, so mmiowb() should 
be history soon (in name only, assuming Nick also introduces the I/O barriers 
he talked about for ordering the looser accessors it would still be there but 
would be called io_wmb or something).

Jesse

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10 19:19                                                         ` Jesse Barnes
@ 2008-06-11  3:29                                                           ` Nick Piggin
  2008-06-11  3:40                                                             ` Benjamin Herrenschmidt
  2008-06-11 16:07                                                             ` Jesse Barnes
  0 siblings, 2 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-11  3:29 UTC (permalink / raw)
  To: Jesse Barnes, linux-arch
  Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier,
	linux-kernel, David Miller, James Bottomley, linuxppc-dev,
	scottwood, Linus Torvalds, Trent Piepho, alan

On Wednesday 11 June 2008 05:19, Jesse Barnes wrote:
> On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
> >  > me too.  That's the whole basis for readX_relaxed() and its cohorts:
> >  > we make our weirdest machines (like altix) conform to the x86 norm. 
> >  > Then where it really kills us we introduce additional semantics to
> >  > selected drivers that enable us to recover I/O speed on the abnormal
> >  > platforms.
> >
> > Except as I pointed out before, Altix doesn't conform to the norm and
> > many (most?) drivers are missing mmiowb()s that are needed for Altix.
> > Just no one has plugged most devices into an Altix (or haven't stressed
> > the driver in a way that exposes problems of IO ordering between CPUs).
> >
> > It would be a great thing to use the powerpc trick of setting a flag
> > that is tested by spin_unlock()/mutex_unlock() and automatically doing
> > the mmiowb() if needed, and then killing off mmiowb() entirely.
>
> Yeah I think that's what Nick's guidelines would guarantee.  And Jes is
> already working on the spin_unlock change you mentioned, so mmiowb() should
> be history soon (in name only, assuming Nick also introduces the I/O
> barriers he talked about for ordering the looser accessors it would still
> be there but would be called io_wmb or something).

Exactly, yes. I guess everybody has had good intentions here, but
as noticed, what is lacking is coordination and documentation.

You mention strong ordering WRT spin_unlock, which suggests that
you would prefer to take option #2 (the current powerpc one): io/io
is ordered and io is contained inside spinlocks, but io/cacheable
in general is not ordered.

I *would* prefer that io/cacheable actually is strongly ordered with
the default accessors. Because if you have that, then the driver
writer never has to care about memory ordering, provided they use
correct locking for SMP issues. Same as x86. With option 2, there
are still windows where you could possibly have issues.

For any high performance drivers that are well maintained (ie. the
ones where slowdown might be noticed), everyone should have a pretty
good handle on memory ordering requirements, so it shouldn't take
long to go through and convert them to relaxed accessors.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  3:29                                                           ` Nick Piggin
@ 2008-06-11  3:40                                                             ` Benjamin Herrenschmidt
  2008-06-11  4:06                                                               ` Nick Piggin
  2008-06-11 16:07                                                             ` Jesse Barnes
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-11  3:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier,
	linux-kernel, Jesse Barnes, David Miller, James Bottomley,
	linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan

On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote:
> 
> Exactly, yes. I guess everybody has had good intentions here, but
> as noticed, what is lacking is coordination and documentation.
> 
> You mention strong ordering WRT spin_unlock, which suggests that
> you would prefer to take option #2 (the current powerpc one): io/io
> is ordered and io is contained inside spinlocks, but io/cacheable
> in general is not ordered.

IO/cacheable -is- ordered on powepc in what we believe is the direction
that matter: IO reads are fully ordered vs. anything and IO writes are
ordered vs. previous cacheable stores. The only "relaxed" situation is
IO writes followed by cacheable stores, which I believe shouldn't be
a problem. (except for spinlocks for which we use the flag trick)

> I *would* prefer that io/cacheable actually is strongly ordered with
> the default accessors. Because if you have that, then the driver
> writer never has to care about memory ordering, provided they use
> correct locking for SMP issues. Same as x86. With option 2, there
> are still windows where you could possibly have issues.
> 
> For any high performance drivers that are well maintained (ie. the
> ones where slowdown might be noticed), everyone should have a pretty
> good handle on memory ordering requirements, so it shouldn't take
> long to go through and convert them to relaxed accessors.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  3:40                                                             ` Benjamin Herrenschmidt
@ 2008-06-11  4:06                                                               ` Nick Piggin
  0 siblings, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-11  4:06 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier,
	linux-kernel, Jesse Barnes, David Miller, James Bottomley,
	linuxppc-dev, scottwood, Linus Torvalds, Trent Piepho, alan

On Wednesday 11 June 2008 13:40, Benjamin Herrenschmidt wrote:
> On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote:
> > Exactly, yes. I guess everybody has had good intentions here, but
> > as noticed, what is lacking is coordination and documentation.
> >
> > You mention strong ordering WRT spin_unlock, which suggests that
> > you would prefer to take option #2 (the current powerpc one): io/io
> > is ordered and io is contained inside spinlocks, but io/cacheable
> > in general is not ordered.
>
> IO/cacheable -is- ordered on powepc in what we believe is the direction
> that matter: IO reads are fully ordered vs. anything and IO writes are
> ordered vs. previous cacheable stores. The only "relaxed" situation is
> IO writes followed by cacheable stores, which I believe shouldn't be
> a problem. (except for spinlocks for which we use the flag trick)

Spinlocks... mutexes, semaphores, rwsems, rwlocks, bit spinlocks, bit
mutexes, open coded bit locks (of which there are a number floating
around in drivers/).

But even assuming you get all of that fixed up. I wonder what is such
a big benefit to powerpc that you'll rather add the exception "cacheable
stores are not ordered with previous io stores" than to say "any driver
which works on x86 will work on powerpc as far as memory ordering goes"?
(don't you also need something to order io reads with cacheable reads?
as per my observation that your rmb is broken according to IBM docs)

Obviously you already have a sync instruction in your writel, so 1)
adding a second one doesn't slow it down by an order of mangnitude or
anything, just some small factor; and 2) you obviously want to be
converting high performance drivers to a more relaxed model anyway
regardless of whether there is one sync or two in there.

Has this ever been measured or thought carefully about? Beyond the
extent of "one sync good, two sync bad" ;)

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  3:29                                                           ` Nick Piggin
  2008-06-11  3:40                                                             ` Benjamin Herrenschmidt
@ 2008-06-11 16:07                                                             ` Jesse Barnes
  2008-06-12 11:27                                                               ` Nick Piggin
  1 sibling, 1 reply; 160+ messages in thread
From: Jesse Barnes @ 2008-06-11 16:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier,
	linux-kernel, David Miller, James Bottomley, linuxppc-dev,
	scottwood, Linus Torvalds, Trent Piepho, alan

On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote:
> On Wednesday 11 June 2008 05:19, Jesse Barnes wrote:
> > On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
> > >  > me too.  That's the whole basis for readX_relaxed() and its cohorts:
> > >  > we make our weirdest machines (like altix) conform to the x86 norm.
> > >  > Then where it really kills us we introduce additional semantics to
> > >  > selected drivers that enable us to recover I/O speed on the abnormal
> > >  > platforms.
> > >
> > > Except as I pointed out before, Altix doesn't conform to the norm and
> > > many (most?) drivers are missing mmiowb()s that are needed for Altix.
> > > Just no one has plugged most devices into an Altix (or haven't stressed
> > > the driver in a way that exposes problems of IO ordering between CPUs).
> > >
> > > It would be a great thing to use the powerpc trick of setting a flag
> > > that is tested by spin_unlock()/mutex_unlock() and automatically doing
> > > the mmiowb() if needed, and then killing off mmiowb() entirely.
> >
> > Yeah I think that's what Nick's guidelines would guarantee.  And Jes is
> > already working on the spin_unlock change you mentioned, so mmiowb()
> > should be history soon (in name only, assuming Nick also introduces the
> > I/O barriers he talked about for ordering the looser accessors it would
> > still be there but would be called io_wmb or something).
>
> Exactly, yes. I guess everybody has had good intentions here, but
> as noticed, what is lacking is coordination and documentation.
>
> You mention strong ordering WRT spin_unlock, which suggests that
> you would prefer to take option #2 (the current powerpc one): io/io
> is ordered and io is contained inside spinlocks, but io/cacheable
> in general is not ordered.

I was thinking it would be good for the weaker accessors, but now that I think 
about it you could just use the new io_* barrier functions.

I didn't mean to imply that I wasn't in favor of the io/cacheable ordering as 
well.

> For any high performance drivers that are well maintained (ie. the
> ones where slowdown might be noticed), everyone should have a pretty
> good handle on memory ordering requirements, so it shouldn't take
> long to go through and convert them to relaxed accessors.

Yep.  Thanks for working on this, Nick, it's definitely a good thing that 
you're taking control of it. :)

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11 16:07                                                             ` Jesse Barnes
@ 2008-06-12 11:27                                                               ` Nick Piggin
  0 siblings, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-12 11:27 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: linux-arch, Russell King, Matthew Wilcox, Roland Dreier,
	linux-kernel, David Miller, James Bottomley, linuxppc-dev,
	scottwood, Linus Torvalds, Trent Piepho, alan

On Thursday 12 June 2008 02:07, Jesse Barnes wrote:
> On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote:

> > You mention strong ordering WRT spin_unlock, which suggests that
> > you would prefer to take option #2 (the current powerpc one): io/io
> > is ordered and io is contained inside spinlocks, but io/cacheable
> > in general is not ordered.
>
> I was thinking it would be good for the weaker accessors, but now that I
> think about it you could just use the new io_* barrier functions.
>
> I didn't mean to imply that I wasn't in favor of the io/cacheable ordering
> as well.
>
> > For any high performance drivers that are well maintained (ie. the
> > ones where slowdown might be noticed), everyone should have a pretty
> > good handle on memory ordering requirements, so it shouldn't take
> > long to go through and convert them to relaxed accessors.
>
> Yep.  Thanks for working on this, Nick, it's definitely a good thing that
> you're taking control of it. :)

Well, I really am just trying to help the kernel for everyone (and every
architecture). Performance for all architectures really is my #2 priority,
so if any arch becomes irrepearably slower under a proposal I would
go back to the drawing board.

I'll come up with a proposal in the form of an initial code+documentation
patch when I get some more time on it.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10  6:56                                                 ` Nick Piggin
  2008-06-10 17:41                                                   ` Jesse Barnes
@ 2008-06-11  4:18                                                   ` Paul Mackerras
  2008-06-11  5:00                                                     ` Nick Piggin
  2008-06-11  5:20                                                   ` Paul Mackerras
  2 siblings, 1 reply; 160+ messages in thread
From: Paul Mackerras @ 2008-06-11  4:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

Nick Piggin writes:

> OK, I'm sitll not quite sure where this has ended up. I guess you are happy
> with x86 semantics as they are now. That is, all IO accesses are strongly
> ordered WRT one another and WRT cacheable memory (which includes keeping
> them within spinlocks),

My understanding was that on x86, loads could pass stores in general,
i.e. a later load could be performed before an earlier store.  I guess
that can't be true for uncached loads, but could a cacheable load be
performed before an earlier uncached store?

> - as strong as x86. guaranteed not to break drivers that work on x86,
>   but slower on some archs. To me, this is most pleasing. It is much
>   much easier to notice something is going a little slower and to work
>   out how to use weaker ordering there, than it is to debug some
>   once-in-a-bluemoon breakage caused by just the right architecture,
>   driver, etc. It totally frees up the driver writer from thinking
>   about barriers, provided they get the locking right.

I just wish we had even one actual example of things going wrong with
the current rules we have on powerpc to motivate changing to this
model.

> Now that doesn't leave waker ordering architectures lumped with "slow old
> x86 semantics". Think of it as giving them the benefit of sharing x86
> development and testing :) We can then formalise the relaxed __ accessors
> to be more complete (ie. +/- byteswapping).

That leaves a gulf between the extremely strongly ordered writel
etc. and the extremely weakly ordered __writel etc.  The current
powerpc scheme is fine for a lot of drivers but your proposal would
leave us no way to deliver it to them.

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  4:18                                                   ` Paul Mackerras
@ 2008-06-11  5:00                                                     ` Nick Piggin
  2008-06-11  5:13                                                       ` Paul Mackerras
  2008-06-11 14:46                                                       ` Linus Torvalds
  0 siblings, 2 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-11  5:00 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

On Wednesday 11 June 2008 14:18, Paul Mackerras wrote:
> Nick Piggin writes:
> > OK, I'm sitll not quite sure where this has ended up. I guess you are
> > happy with x86 semantics as they are now. That is, all IO accesses are
> > strongly ordered WRT one another and WRT cacheable memory (which includ=
es
> > keeping them within spinlocks),
>
> My understanding was that on x86, loads could pass stores in general,
> i.e. a later load could be performed before an earlier store.

Yes, this is the one reordering allowed by the ISA on cacheable memory.
WC memory is weaker, which Linus wants to allow exception for because
one must explicitly ask for it. UC memory (which presumably is what
we're talking about as "IO access") I think is stronger in that it does
not allow any reordering between one another or any cacheable access:

AMD says this:

c =E2=80=94 A store (wp,wt,wb,uc,wc,wc+) may not pass a previous load=20
(wp,wt,wb,uc,wc,wc+).
f =E2=80=94 A load (uc) does not pass a previous store (wp,wt,wb,uc,wc,wc+).
g =E2=80=94 A store (wp,wt,wb,uc) does not pass a previous store (wp,wt,wb,=
uc).
i =E2=80=94 A load (wp,wt,wb,wc,wc+) does not pass a previous store (uc).

AMD does allow WC/WC+ to be weakly ordered WRT WC as well as UC, which
Intel seemingly does not. But we're alrady treating WC as special.

I can't actually find the definitive statement in the Intel manuals
saying UC is strongly ordered also WRT WB. Linus?

> I guess=20
> that can't be true for uncached loads, but could a cacheable load be
> performed before an earlier uncached store?
> > - as strong as x86. guaranteed not to break drivers that work on x86,
> >   but slower on some archs. To me, this is most pleasing. It is much
> >   much easier to notice something is going a little slower and to work
> >   out how to use weaker ordering there, than it is to debug some
> >   once-in-a-bluemoon breakage caused by just the right architecture,
> >   driver, etc. It totally frees up the driver writer from thinking
> >   about barriers, provided they get the locking right.
>
> I just wish we had even one actual example of things going wrong with
> the current rules we have on powerpc to motivate changing to this
> model.

~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
506
How sure are you that none of those forms part of a cobbled-together
locking scheme that hopes to constrain some IO access?

~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | grep while | wc -l
29
How about those?

~/usr/src/linux-2.6> git grep mutex_lock drivers/ | wc -l
3138
How sure are you that none of them is hoping to constrain IO operations
within the lock?

Also grep for down, down_write, write_lock, and maybe some others I've
forgotten. And then forget completely about locking and imagine some
of the open coded things you see around the place (or parts where drivers
try to get creative and open code their own locking or try lockless
things).

> > Now that doesn't leave waker ordering architectures lumped with "slow o=
ld
> > x86 semantics". Think of it as giving them the benefit of sharing x86
> > development and testing :) We can then formalise the relaxed __ accesso=
rs
> > to be more complete (ie. +/- byteswapping).
>
> That leaves a gulf between the extremely strongly ordered writel
> etc. and the extremely weakly ordered __writel etc.  The current
> powerpc scheme is fine for a lot of drivers but your proposal would
> leave us no way to deliver it to them.

But surely you have to audit the drivers anyway to ensure they are OK
with the current powerpc scheme. In which case, once you have audited
them and know they are safe, you can easily convert them to the even
_faster_ __readl/__writel, and just add the appropriate barriers.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  5:00                                                     ` Nick Piggin
@ 2008-06-11  5:13                                                       ` Paul Mackerras
  2008-06-11  5:35                                                         ` Nick Piggin
  2008-06-11 14:46                                                       ` Linus Torvalds
  1 sibling, 1 reply; 160+ messages in thread
From: Paul Mackerras @ 2008-06-11  5:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

Nick Piggin writes:

> > I just wish we had even one actual example of things going wrong with
> > the current rules we have on powerpc to motivate changing to this
> > model.
> 
> ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
> 506
> How sure are you that none of those forms part of a cobbled-together
> locking scheme that hopes to constrain some IO access?

My comment was precisely about the fact that this sort of argument is
actually FUD.  I want one example that is demonstrably wrong, not just
a "how sure are you".

> But surely you have to audit the drivers anyway to ensure they are OK
> with the current powerpc scheme. In which case, once you have audited
> them and know they are safe, you can easily convert them to the even
> _faster_ __readl/__writel, and just add the appropriate barriers.

The trouble is that as code gets maintained, using __writel + explicit
barriers is more fragile because some people will change the code, or
add new code, without understanding the barriers.  So whenever a
driver gets converted to using __writel + barriers, we will end up
having to watch every change that goes into it forever.  Whereas with
the current scheme there's a much smaller set of gotchas to watch out
for, and the gotchas are things that already raise red flags, such as
open-coded locking and any sort of "clever" lockless scheme.

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  5:13                                                       ` Paul Mackerras
@ 2008-06-11  5:35                                                         ` Nick Piggin
  2008-06-11  6:02                                                           ` Nick Piggin
  2008-06-12 12:14                                                           ` Paul Mackerras
  0 siblings, 2 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-11  5:35 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

On Wednesday 11 June 2008 15:13, Paul Mackerras wrote:
> Nick Piggin writes:
> > > I just wish we had even one actual example of things going wrong with
> > > the current rules we have on powerpc to motivate changing to this
> > > model.
> >
> > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
> > 506
> > How sure are you that none of those forms part of a cobbled-together
> > locking scheme that hopes to constrain some IO access?
>
> My comment was precisely about the fact that this sort of argument is
> actually FUD.  I want one example that is demonstrably wrong, not just
> a "how sure are you".

But this is a case where you really should be scared, so FUD is
completely appropriate.

At least, I don't see how it is at all worse than FUD about how much
slower everything will be with stronger ordering, and how nobody will
be able to do anything about it. In reality, for most devices it will
not be measurable, and for some like network or disk might use a bit
of a tweak in one or two fastpaths.

If the driver author honestly does not know the code well enough to
say whether a particular ordering is allowed or not, then it should
just not be allowed.

Anyway, what do I win if I give you a concrete example?

I found one in the first file I picked out of my grep: drivers/net/s2io.c

        if (test_and_set_bit(__S2IO_STATE_LINK_TASK, &(nic->state))) {
                /* The card is being reset, no point doing anything */
                goto out_unlock;
        }

...
                val64 = readq(&bar0->adapter_control);
                val64 |= ADAPTER_LED_ON;
                writeq(val64, &bar0->adapter_control);
                s2io_link(nic, LINK_UP);
        } else {
                if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic->device_type,
                                                      subid)) {
                        val64 = readq(&bar0->gpio_control);
                        val64 &= ~GPIO_CTRL_GPIO_0;
                        writeq(val64, &bar0->gpio_control);
                        val64 = readq(&bar0->gpio_control);
                }
                /* turn off LED */
                val64 = readq(&bar0->adapter_control);
                val64 = val64 &(~ADAPTER_LED_ON);
                writeq(val64, &bar0->adapter_control);
                s2io_link(nic, LINK_DOWN);
        }
        clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));

Now I can't say definitively that this is going to be wrong on
powerpc, because I don't know the code well enough. But I'd be
90% sure that the unlock is not supposed to be visible to
other CPUs before the writeqs are queued to the card. On x86 it
wouldn't be.

> > But surely you have to audit the drivers anyway to ensure they are OK
> > with the current powerpc scheme. In which case, once you have audited
> > them and know they are safe, you can easily convert them to the even
> > _faster_ __readl/__writel, and just add the appropriate barriers.
>
> The trouble is that as code gets maintained, using __writel + explicit
> barriers is more fragile because some people will change the code, or
> add new code, without understanding the barriers.

If the relaxed writel and explicit barriers are not well documented and
well understood, then they should not be used.

And if the memory ordering issues are not well understood, then it
should be done with locking and strongly ordered IO accessors.

> So whenever a 
> driver gets converted to using __writel + barriers, we will end up
> having to watch every change that goes into it forever.  Whereas with

*You* don't have to watch it, the maintainer does. And if they can't
understand the barriers, then it should not be converted to use them.
Unless you want to take over maintainership. And if you do, then you
should be watching all changes that go into it.

> the current scheme there's a much smaller set of gotchas to watch out
> for, and the gotchas are things that already raise red flags, such as
> open-coded locking and any sort of "clever" lockless scheme.

And with the strongly ordered scheme, there is much smaller set of
gotchas again, and it behaves the same as the x86 it was developed on,
and you don't get confused by the list of rules: IO access is either
strongly ordered or weakly ordered.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  5:35                                                         ` Nick Piggin
@ 2008-06-11  6:02                                                           ` Nick Piggin
  2008-06-12 12:14                                                           ` Paul Mackerras
  1 sibling, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-11  6:02 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

On Wednesday 11 June 2008 15:35, Nick Piggin wrote:
> On Wednesday 11 June 2008 15:13, Paul Mackerras wrote:
> > Nick Piggin writes:
> > > > I just wish we had even one actual example of things going wrong with
> > > > the current rules we have on powerpc to motivate changing to this
> > > > model.
> > >
> > > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
> > > 506
> > > How sure are you that none of those forms part of a cobbled-together
> > > locking scheme that hopes to constrain some IO access?
> >
> > My comment was precisely about the fact that this sort of argument is
> > actually FUD.  I want one example that is demonstrably wrong, not just
> > a "how sure are you".
>
> But this is a case where you really should be scared, so FUD is
> completely appropriate.
>
> At least, I don't see how it is at all worse than FUD about how much
> slower everything will be with stronger ordering, and how nobody will
> be able to do anything about it. In reality, for most devices it will
> not be measurable, and for some like network or disk might use a bit
> of a tweak in one or two fastpaths.
>
> If the driver author honestly does not know the code well enough to
> say whether a particular ordering is allowed or not, then it should
> just not be allowed.
>
>
> Anyway, what do I win if I give you a concrete example?
>
> I found one in the first file I picked out of my grep: drivers/net/s2io.c
>
>         if (test_and_set_bit(__S2IO_STATE_LINK_TASK, &(nic->state))) {
>                 /* The card is being reset, no point doing anything */
>                 goto out_unlock;
>         }
>
> ...
>                 val64 = readq(&bar0->adapter_control);
>                 val64 |= ADAPTER_LED_ON;
>                 writeq(val64, &bar0->adapter_control);
>                 s2io_link(nic, LINK_UP);
>         } else {
>                 if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic->device_type,
>                                                       subid)) {
>                         val64 = readq(&bar0->gpio_control);
>                         val64 &= ~GPIO_CTRL_GPIO_0;
>                         writeq(val64, &bar0->gpio_control);
>                         val64 = readq(&bar0->gpio_control);
>                 }
>                 /* turn off LED */
>                 val64 = readq(&bar0->adapter_control);
>                 val64 = val64 &(~ADAPTER_LED_ON);
>                 writeq(val64, &bar0->adapter_control);
>                 s2io_link(nic, LINK_DOWN);
>         }
>         clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
>
> Now I can't say definitively that this is going to be wrong on
> powerpc, because I don't know the code well enough. But I'd be
> 90% sure that the unlock is not supposed to be visible to
> other CPUs before the writeqs are queued to the card. On x86 it
> wouldn't be.

And I hope you're reading this, sn2 guys :)

The mmiowb() audit for spin_unlock was a nice idea, but this example
shows that unless you *really* know the code and have actually audited
everything, then you should not be messing with relaxing of ordering
etc.

Now obviously the example I quote is a slowpath, in which case it can
very happily continue to use strongly ordered io accessors, and you
will never ever notice the extra sync or PCI read or whatever after
the writeq. Then you can easily relax the couple of fastpaths that
matter, put barriers in there, comment them, and you're away: you now
have a *working* driver (that doesn't randomly lose adapter_control
bits twice a year), and that is also faster than it was when it was
broken because you've specifically used the very weakly ordered IO
accessors in the fast paths that matter.

And when somebody plugs in some obscure hardware into your box that
you haven't audited and put mmiowb etc etc into, then guess what? It
is also not going to blow up in an undebuggable way 6 months later in
the middle of <insert something important here>. The worst case is
that the user notices it is running a bit slowly and sends a full
profile output the next day.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  5:35                                                         ` Nick Piggin
  2008-06-11  6:02                                                           ` Nick Piggin
@ 2008-06-12 12:14                                                           ` Paul Mackerras
  2008-06-12 13:08                                                             ` Nick Piggin
  1 sibling, 1 reply; 160+ messages in thread
From: Paul Mackerras @ 2008-06-12 12:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

Nick Piggin writes:

>                 /* turn off LED */
>                 val64 = readq(&bar0->adapter_control);
>                 val64 = val64 &(~ADAPTER_LED_ON);
>                 writeq(val64, &bar0->adapter_control);
>                 s2io_link(nic, LINK_DOWN);
>         }
>         clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
> 
> Now I can't say definitively that this is going to be wrong on
> powerpc, because I don't know the code well enough. But I'd be
> 90% sure that the unlock is not supposed to be visible to
> other CPUs before the writeqs are queued to the card. On x86 it
> wouldn't be.

Interestingly, there is also a store to cacheable memory
(nic->device_enabled_once), but no smp_wmb or equivalent before the
clear_bit.  So there are other potential problems here besides the I/O
related ones.

Anyway, I have done some tests on a dual G5 here with putting a sync
on both sides of the store in writel etc. (i.e. making readl/writel
strongly ordered w.r.t. everything else), and as you predicted, there
wasn't a noticeable performance degradation, at least not on the
couple of things I tried.  So I am now inclined to accept your
suggestion that we should do that.  I should probably do some similar
checks on POWER6 and a few other machines first, though.

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-12 12:14                                                           ` Paul Mackerras
@ 2008-06-12 13:08                                                             ` Nick Piggin
  0 siblings, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-12 13:08 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

On Thursday 12 June 2008 22:14, Paul Mackerras wrote:
> Nick Piggin writes:
> >                 /* turn off LED */
> >                 val64 = readq(&bar0->adapter_control);
> >                 val64 = val64 &(~ADAPTER_LED_ON);
> >                 writeq(val64, &bar0->adapter_control);
> >                 s2io_link(nic, LINK_DOWN);
> >         }
> >         clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
> >
> > Now I can't say definitively that this is going to be wrong on
> > powerpc, because I don't know the code well enough. But I'd be
> > 90% sure that the unlock is not supposed to be visible to
> > other CPUs before the writeqs are queued to the card. On x86 it
> > wouldn't be.
>
> Interestingly, there is also a store to cacheable memory
> (nic->device_enabled_once), but no smp_wmb or equivalent before the
> clear_bit.  So there are other potential problems here besides the I/O
> related ones.

Yeah there sure is. That sucks too, but we go one step at a time ;)
I think proposing a strong ordering between set_bit/clear_bit would
actually be quite noticable slowdown in core kernel code at this
point.

Which reminds me, I have been meaning to do another pass of test and
set bit / clear bit conversions to the _lock primitives...

> Anyway, I have done some tests on a dual G5 here with putting a sync
> on both sides of the store in writel etc. (i.e. making readl/writel
> strongly ordered w.r.t. everything else), and as you predicted, there
> wasn't a noticeable performance degradation, at least not on the
> couple of things I tried.  So I am now inclined to accept your
> suggestion that we should do that.  I should probably do some similar
> checks on POWER6 and a few other machines first, though.

Oh good, thanks for looking into it. I guess it might be a little more
noticable on bigger POWER systems. And I think we might even need to do
a PCI read after every writel on sn2 systems in order to get the
semantics I want.

I can't say it won't be noticable. But if we consider the number of
drivers (maybe one or two dozen well maintained ones), and number of
sites in each driver (maybe one or two submission and completion
fastpaths which should have a minimum of IO operations in each one)
that will have to be converted in order to get performance as good or
better than it is currently with relaxed accessors.... and weigh that
against all the places in those and every other crappy obscure
driver that we *won't* have to audit, I really think we end up with
a net win even with some short term pain.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-11  5:00                                                     ` Nick Piggin
  2008-06-11  5:13                                                       ` Paul Mackerras
@ 2008-06-11 14:46                                                       ` Linus Torvalds
  1 sibling, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2008-06-11 14:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, Paul Mackerras, scottwood,
	David Miller, alan

On Wed, 11 Jun 2008, Nick Piggin wrote:
> 
> I can't actually find the definitive statement in the Intel manuals
> saying UC is strongly ordered also WRT WB. Linus?

Definitive? Dunno. But look in the Architecture manual, volume 3A, 10.3 
"Methods of Caching Available", and then under the bullet about Write 
Combining (WC), it says

  the writes may be delayed until the next occurrence of a serializing 
  event; such as, an SFENCE of MFENCE instruction, CPUID execution, a read 
  or write to uncached memory, an interrupt occurrence, or a LOCK 
  instruction execution.

However, it's worth noting that

 - documentation can be wrong, or even if right, can be Intel-specific.

 - the above is expressly _only_ about the WC buffer, not about regular 
   memory writes. Cached memory accesses are different from WC accesses.

so in the end, the thing that matters is how things actually work.

		Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-10  6:56                                                 ` Nick Piggin
  2008-06-10 17:41                                                   ` Jesse Barnes
  2008-06-11  4:18                                                   ` Paul Mackerras
@ 2008-06-11  5:20                                                   ` Paul Mackerras
  2 siblings, 0 replies; 160+ messages in thread
From: Paul Mackerras @ 2008-06-11  5:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	Trent Piepho, linuxppc-dev, scottwood, Linus Torvalds,
	David Miller, alan

Nick Piggin writes:

> Now that doesn't leave waker ordering architectures lumped with "slow old
> x86 semantics". Think of it as giving them the benefit of sharing x86
> development and testing :)

Worth something, but not perhaps as much as you think, given that many
x86 driver writers still don't pay much attention to making their code
endian-safe and 64-bit-clean.  And there are still plenty of drivers
that use virt_to_bus...

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 14:47                                           ` Linus Torvalds
  2008-06-03 18:47                                             ` Trent Piepho
@ 2008-06-04  2:19                                             ` Nick Piggin
  1 sibling, 0 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-04  2:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Russell King, linux-kernel, tpiepho, linuxppc-dev,
	scottwood, David Miller, alan

On Wednesday 04 June 2008 00:47, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Nick Piggin wrote:
> > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > with one another, or operations to other memory types (ie. load/load
> > and store/store reordering is allowed). Also, as you know, store/load
> > reordering is explicitly allowed as well, which covers all memory
> > types. So perhaps it is not quite true to say readl/writel is strongly
> > ordered by default even on x86. You would have to put in some
> > mfence instructions in them to make it so.
>
> Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> does that needs to be aware of it. IOW, it's a non-issue, imnsho.

Ah, yes UC is strongly ordered WRT all others *except* WC/WC+.

But WC memory is not an x86 specific thing right, so do we need
some accessors for WC memory? Or can we just throw that in the
weakly ordered pile, and ensure mb/rmb/wmb does the right thing
for them.

And you want readl/writel to be strongly ordered like x86 on all
architectures, no exceptions? This will slow some things down,
but if we then also provide explicitly weakly ordered instructions
(and add io_mb/io_rmb/io_wmb) then at least it gives the framework
for drivers to be written to run on those architectures.

The other thing we could do is mandate only that readl/writel will
be ordered WRT one another, *and* with spinlocks, but otherwise not
with cacheable RAM...

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  4:16                                         ` Nick Piggin
  2008-06-03  4:32                                           ` Benjamin Herrenschmidt
  2008-06-03 14:47                                           ` Linus Torvalds
@ 2008-06-03 19:43                                           ` Trent Piepho
  2008-06-03 21:33                                             ` Matthew Wilcox
  2008-06-03 22:26                                             ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 160+ messages in thread
From: Trent Piepho @ 2008-06-03 19:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, linux-kernel, linuxppc-dev, scottwood,
	Linus Torvalds, David Miller, alan

On Tue, 3 Jun 2008, Nick Piggin wrote:
> On Monday 02 June 2008 17:24, Russell King wrote:
>> So, can the semantics of what's expected from these IO accessor
>> functions be documented somewhere.  Please?  Before this thread gets
>> lost in the depths of time?
>
> This whole thread also ties in with my posts about mmiowb (which IMO
> should go away).
>
> readl/writel:  strongly ordered wrt one another and other stores
>               to cacheable RAM, byteswapping
> __readl/__writel:  not ordered (needs mb/rmb/wmb to order with
>                   other readl/writel and cacheable operations, or
>                   io_*mb to order with one another)
> raw_readl/raw_writel:  strongly ordered, no byteswapping
> __raw_readl/__raw_writel:  not ordered, no byteswapping

Byte-swapping vs not byte-swapping is not usually what the programmer wants. 
Usually your device's registers are defined as being big-endian or
little-endian and you want whatever is needed to give you that.

I believe that on some archs that can be either byte order, some built-in
devices will change their registers to match, and so you want "native endian"
or no swapping for these.  Though that's definitely in the minority.

An accessors that always byte-swaps regardless of the endianness of the host
is never something I've seen a driver want.

IOW, there are four ways one can defined endianness/swapping:
1) Little-endian
2) Big-endian
3) Native-endian aka non-byte-swapping
4) Foreign-endian aka byte-swapping

1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
our API is providing 3 & 4, the two which are the least useful.

Is it enough to provide only "all or none" for ordering strictness?  For
instance on powerpc, one can get a speedup by dropping strict ordering for IO
vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. 
This is much easier to program for than no ordering at all.  In fact, if one
doesn't use coherent DMA, it's basically the same as fully strict ordering.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 19:43                                           ` Trent Piepho
@ 2008-06-03 21:33                                             ` Matthew Wilcox
  2008-06-03 21:44                                               ` Trent Piepho
  2008-06-03 22:26                                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 160+ messages in thread
From: Matthew Wilcox @ 2008-06-03 21:33 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote:
> IOW, there are four ways one can defined endianness/swapping:
> 1) Little-endian
> 2) Big-endian
> 3) Native-endian aka non-byte-swapping
> 4) Foreign-endian aka byte-swapping
> 
> 1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
> our API is providing 3 & 4, the two which are the least useful.

You've fundamentally misunderstood.

readX/writeX and __readX/__writeX provide little-endian access.
__raw_readX provide native-endian.

If you want 2 or 4, define your own accessors.  Some architectures define
other accessors (eg gsc_readX on parisc is native (big) endian, and
works on physical addresses that haven't been ioremapped.  sbus_readX on
sparc64 also seems to be native (big) endian).

> Is it enough to provide only "all or none" for ordering strictness?  For
> instance on powerpc, one can get a speedup by dropping strict ordering for 
> IO
> vs cacheable memory, but still keeping ordering for IO vs IO and IO vs 
> locks. This is much easier to program for than no ordering at all.  In 
> fact, if one
> doesn't use coherent DMA, it's basically the same as fully strict ordering.

I don't understand why you keep talking about DMA.  Are you talking
about ordering between readX() and DMA?  PCI proides those guarantees.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 21:33                                             ` Matthew Wilcox
@ 2008-06-03 21:44                                               ` Trent Piepho
  2008-06-04  2:25                                                 ` Nick Piggin
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-06-03 21:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote:
>> IOW, there are four ways one can defined endianness/swapping:
>> 1) Little-endian
>> 2) Big-endian
>> 3) Native-endian aka non-byte-swapping
>> 4) Foreign-endian aka byte-swapping
>>
>> 1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
>> our API is providing 3 & 4, the two which are the least useful.
>
> You've fundamentally misunderstood.
>
> readX/writeX and __readX/__writeX provide little-endian access.
> __raw_readX provide native-endian.
>
> If you want 2 or 4, define your own accessors.  Some architectures define
> other accessors (eg gsc_readX on parisc is native (big) endian, and

How about providing 1 and 2, and if you want 3 or 4 define your own accessors?

>> Is it enough to provide only "all or none" for ordering strictness?  For
>> instance on powerpc, one can get a speedup by dropping strict ordering for
>> IO
>> vs cacheable memory, but still keeping ordering for IO vs IO and IO vs
>> locks. This is much easier to program for than no ordering at all.  In
>> fact, if one
>> doesn't use coherent DMA, it's basically the same as fully strict ordering.
>
> I don't understand why you keep talking about DMA.  Are you talking
> about ordering between readX() and DMA?  PCI proides those guarantees.

I guess you haven't been reading the whole thread.  The reason it started was
because gcc can re-order powerpc (and everyone else's too) IO accesses vs
accesses to cachable memory (but not spin-locks), which ends up only being a
problem with coherent DMA.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 21:44                                               ` Trent Piepho
@ 2008-06-04  2:25                                                 ` Nick Piggin
  2008-06-04  6:39                                                   ` Trent Piepho
  0 siblings, 1 reply; 160+ messages in thread
From: Nick Piggin @ 2008-06-04  2:25 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan

On Wednesday 04 June 2008 07:44, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Matthew Wilcox wrote:

> > I don't understand why you keep talking about DMA.  Are you talking
> > about ordering between readX() and DMA?  PCI proides those guarantees.
>
> I guess you haven't been reading the whole thread.  The reason it started
> was because gcc can re-order powerpc (and everyone else's too) IO accesses
> vs accesses to cachable memory (but not spin-locks), which ends up only
> being a problem with coherent DMA.

I don't think it is only a problem with coherent DMA.

CPU0                         CPU1
mutex_lock(mutex);
writel(something, DATA_REG);
writel(GO, CTRL_REG);
started = 1;
mutex_unlock(mutex);
                             mutex_lock(mutex);
                             if (started)
                               /* oops, this can reach device before GO */
                               writel(STOP, CTRL_REG);
                               

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-04  2:25                                                 ` Nick Piggin
@ 2008-06-04  6:39                                                   ` Trent Piepho
  0 siblings, 0 replies; 160+ messages in thread
From: Trent Piepho @ 2008-06-04  6:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Russell King, Matthew Wilcox, linux-kernel,
	linuxppc-dev, scottwood, Linus Torvalds, David Miller, alan

On Wed, 4 Jun 2008, Nick Piggin wrote:
> On Wednesday 04 June 2008 07:44, Trent Piepho wrote:
>> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
>
>>> I don't understand why you keep talking about DMA.  Are you talking
>>> about ordering between readX() and DMA?  PCI proides those guarantees.
>>
>> I guess you haven't been reading the whole thread.  The reason it started
>> was because gcc can re-order powerpc (and everyone else's too) IO accesses
>> vs accesses to cachable memory (but not spin-locks), which ends up only
>> being a problem with coherent DMA.
>
> I don't think it is only a problem with coherent DMA.
>
> CPU0                         CPU1
> mutex_lock(mutex);
> writel(something, DATA_REG);
> writel(GO, CTRL_REG);
> started = 1;
 	(A)
> mutex_unlock(mutex);
>                             mutex_lock(mutex);
 	(B)
>                             if (started)
>                               /* oops, this can reach device before GO */
>                               writel(STOP, CTRL_REG);

The locks themselves should have (and do have) ordering operations to insure
gcc and/or the cpu can't move a store or load outside the locked region. 
Generally you need that to keep stores/loads to cacheable memory inside the
critical area, much less I/O operations.  Otherwise all you have to do is
replace writel(something, ...) with shared_data->something = ...  and there's
an obvious problem.  In your example, gcc currently can and will move the GO
operation to point A (if it can figure out that CTRL_REG and started aren't
aliased), but that's not a problem.  If it could move it to B that would be a
problem, but it can't.

Other than coherent DMA, I don't think there is any reason to care if I/O
accessors are strongly ordered wrt load/stores to cacheable memory.  locking
and streaming DMA sync operations already need to have ordering, so they don't
require all I/O to be ordered wrt all cacheable memory.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 19:43                                           ` Trent Piepho
  2008-06-03 21:33                                             ` Matthew Wilcox
@ 2008-06-03 22:26                                             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-03 22:26 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Nick Piggin, Russell King, linux-kernel, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, 2008-06-03 at 12:43 -0700, Trent Piepho wrote:
> 
> Byte-swapping vs not byte-swapping is not usually what the programmer wants. 
> Usually your device's registers are defined as being big-endian or
> little-endian and you want whatever is needed to give you that.

Yes, which is why I (and some other archs) have writel_be/readl_be.

The standard writel/readl being LE.

However, the "raw" variants are defined to be native endian, which is of
some use to -some- archs apparently where they have SoC device whose
endianness follow the core.

> I believe that on some archs that can be either byte order, some built-in
> devices will change their registers to match, and so you want "native endian"
> or no swapping for these.  Though that's definitely in the minority.
> 
> An accessors that always byte-swaps regardless of the endianness of the host
> is never something I've seen a driver want.
> 
> IOW, there are four ways one can defined endianness/swapping:
> 1) Little-endian
> 2) Big-endian
> 3) Native-endian aka non-byte-swapping
> 4) Foreign-endian aka byte-swapping
> 
> 1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
> our API is providing 3 & 4, the two which are the least useful.

No, we don't provide 4, it was something unclear with nick.

We provide 1. (writel/readl and __variants), some archs provide 2
(writel_be/readl_be, tho I don't have __variants, I suppose I could),
and everybody provides 3. though in some cases (like us) only in the
form of __variants (ie, non ordered, like __raw_readl/__raw_writel).

Nick's proposal is to plug those gaps, though it's, I believe, missing
the _be variants.

> Is it enough to provide only "all or none" for ordering strictness?  For
> instance on powerpc, one can get a speedup by dropping strict ordering for IO
> vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. 
> This is much easier to program for than no ordering at all.  In fact, if one
> doesn't use coherent DMA, it's basically the same as fully strict ordering.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  2:28                         ` David Miller
  2008-05-27  3:39                           ` Benjamin Herrenschmidt
@ 2008-05-27  3:42                           ` Arjan van de Ven
  2008-05-27  4:08                             ` Roland Dreier
  2008-05-27  7:08                             ` Benjamin Herrenschmidt
  2008-05-27  8:24                           ` Alan Cox
  2 siblings, 2 replies; 160+ messages in thread
From: Arjan van de Ven @ 2008-05-27  3:42 UTC (permalink / raw)
  To: David Miller
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan

On Mon, 26 May 2008 19:28:12 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 27 May 2008 12:15:40 +1000
> 
> > Some of them. USB comes to mind. I'd be happy to make it "the rule"
> > and document that MMIO vs. coherent access aren't implicitely
> > ordered. I would still keep them ordered on powerpc for a little
> > while tho until I'm happy enough with driver auditing.
> > 
> > But heh, it's you who was telling me that it would be a bad
> > engineering decision and we had to make everybody look like x86 &
> > fully ordered :-) I decided to agree back then and stuck all those
> > nasty heavy barriers in the powerpc variants of readl/writel/...
> 
> I still believe this.
> 
> It's just another complicated thing for driver authors to get wrong.
> The other side of the coin is of course the cost.
> 
> The only thing I am absolutely sure of is that we should make a
> decision fast, document it, and just stick to it.

... and try to find a way to test for it at runtime or compile time.

either via sparse or some fancy lockdep like "device store" thing?
If we can't test for it and it doesn't show up on x86 ... it'll just be
an eterrnal chase.

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  3:42                           ` Arjan van de Ven
@ 2008-05-27  4:08                             ` Roland Dreier
  2008-05-27  4:20                               ` Arjan van de Ven
  2008-05-27  7:08                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 160+ messages in thread
From: Roland Dreier @ 2008-05-27  4:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	torvalds, David Miller, alan

 > either via sparse or some fancy lockdep like "device store" thing?
 > If we can't test for it and it doesn't show up on x86 ... it'll just be
 > an eterrnal chase.

Ben's point is that it will start showing up on x86 because newer
compilers are reordering things...

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  4:08                             ` Roland Dreier
@ 2008-05-27  4:20                               ` Arjan van de Ven
  0 siblings, 0 replies; 160+ messages in thread
From: Arjan van de Ven @ 2008-05-27  4:20 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	torvalds, David Miller, alan

On Mon, 26 May 2008 21:08:10 -0700
Roland Dreier <rdreier@cisco.com> wrote:

>  > either via sparse or some fancy lockdep like "device store" thing?
>  > If we can't test for it and it doesn't show up on x86 ... it'll
>  > just be an eterrnal chase.
> 
> Ben's point is that it will start showing up on x86 because newer
> compilers are reordering things...
> 
>  - R.


I know a basic reorder will show up there. A simple "barrier()" in
readl/writel solves it for x86. Understandably, Ben doesn't really like
that answer ;-)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  3:42                           ` Arjan van de Ven
  2008-05-27  4:08                             ` Roland Dreier
@ 2008-05-27  7:08                             ` Benjamin Herrenschmidt
  2008-05-27 15:50                               ` Roland Dreier
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27  7:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	torvalds, David Miller, alan

> ... and try to find a way to test for it at runtime or compile time.
> 
> either via sparse or some fancy lockdep like "device store" thing?
> If we can't test for it and it doesn't show up on x86 ... it'll just be
> an eterrnal chase.

It's hard. I haven't managed to come up with a good idea on how
to test for it either at runtime or from sparse.

There -might- be way to test up to a certain point with sparse
by defining a __coherent attribute for coherent memory and trying
to figure out patterns like write to __coherent followed by MMIO
with no barrier in between but that's fishy and won't catch many
cases.

Sticking barriers in the accessors is thus indeed the "easy" and
somewhat safe fix and keeping everything as ordered as possible

Though it's my understanding that at least ia64 does require the
explicit barriers anyway, so we are still in a dodgy situation here
where it's not clear what drivers should do and we end up with
possibly excessive barriers on powerpc where I end up with both
the wmb/rmb/mb that were added for ia64 -and- the ones I have in
readl/writel to make them look synchronous... Not nice.

I'm not sure there is a good answer...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  7:08                             ` Benjamin Herrenschmidt
@ 2008-05-27 15:50                               ` Roland Dreier
  2008-05-27 16:37                                 ` James Bottomley
                                                   ` (2 more replies)
  0 siblings, 3 replies; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 15:50 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

 > Though it's my understanding that at least ia64 does require the
 > explicit barriers anyway, so we are still in a dodgy situation here
 > where it's not clear what drivers should do and we end up with
 > possibly excessive barriers on powerpc where I end up with both
 > the wmb/rmb/mb that were added for ia64 -and- the ones I have in
 > readl/writel to make them look synchronous... Not nice.

ia64 is a disaster with a slightly different ordering problem -- the
mmiowb() issue.  I know Ben knows far too much about this, but for big
SGI boxes, you sometimes need mmiowb() to avoid problems with driver
code that does totally sane stuff like

	spin_lock(&mmio_lock);
	writel(val1, reg1);
	writel(val2, reg2);
	spin_unlock(&mmio_lock);

If that snippet is called on two CPUs at the same time, then the device
might see a sequence like

	CPU1 -- write reg1
	CPU2 -- write reg1
	CPU1 -- write reg2
	CPU2 -- write reg2

in spite of the fact that everything is totally ordered on the CPUs by
the spin lock.

The reason this is such a disaster is because the code looks right,
makes sense, and works fine on 99.99% of all systems out there.  So I
would bet that 99% of our drivers have missing mmiowb() "bugs" -- no one
has plugged the hardware into an Altix box and cared enough to stress
test it.

However for the issue at hand, my expectation as a driver writer is that
readl()/writel() are ordered with respect to MMIO operations, but not
necessarily with respect to normal writes to coherent CPU memory.  And
I've included explicit wmb()s in code I've written like
drivers/infiniband/hw/mthca.

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 15:50                               ` Roland Dreier
@ 2008-05-27 16:37                                 ` James Bottomley
  2008-05-27 17:38                                   ` Roland Dreier
  2008-05-27 21:11                                 ` Benjamin Herrenschmidt
  2008-05-31  8:04                                 ` Pavel Machek
  2 siblings, 1 reply; 160+ messages in thread
From: James Bottomley @ 2008-05-27 16:37 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote:
> > Though it's my understanding that at least ia64 does require the
>  > explicit barriers anyway, so we are still in a dodgy situation here
>  > where it's not clear what drivers should do and we end up with
>  > possibly excessive barriers on powerpc where I end up with both
>  > the wmb/rmb/mb that were added for ia64 -and- the ones I have in
>  > readl/writel to make them look synchronous... Not nice.
> 
> ia64 is a disaster with a slightly different ordering problem -- the
> mmiowb() issue.  I know Ben knows far too much about this, but for big
> SGI boxes, you sometimes need mmiowb() to avoid problems with driver
> code that does totally sane stuff like
> 
> 	spin_lock(&mmio_lock);
> 	writel(val1, reg1);
> 	writel(val2, reg2);
> 	spin_unlock(&mmio_lock);
> 
> If that snippet is called on two CPUs at the same time, then the device
> might see a sequence like
> 
> 	CPU1 -- write reg1
> 	CPU2 -- write reg1
> 	CPU1 -- write reg2
> 	CPU2 -- write reg2
> 
> in spite of the fact that everything is totally ordered on the CPUs by
> the spin lock.
> 
> The reason this is such a disaster is because the code looks right,
> makes sense, and works fine on 99.99% of all systems out there.  So I
> would bet that 99% of our drivers have missing mmiowb() "bugs" -- no one
> has plugged the hardware into an Altix box and cared enough to stress
> test it.
> 
> However for the issue at hand, my expectation as a driver writer is that
> readl()/writel() are ordered with respect to MMIO operations, but not
> necessarily with respect to normal writes to coherent CPU memory.  And
> I've included explicit wmb()s in code I've written like
> drivers/infiniband/hw/mthca.

Actually, this specifically should not be.  The need for mmiowb on altix
is because it explicitly violates some of the PCI rules that would
otherwise impede performance.   The compromise is that readX on altix
contains the needed dma flush but there's a variant operator,
readX_relaxed that doesn't (for drivers that know what they're doing).
The altix critical drivers have all been converted to use the relaxed
form for performance, and the unconverted ones should all operate just
fine (albeit potentially more slowly).

You can see all of this in include/asm-ia64/sn/io.h

It is confusing to me that sn_dma_flush() and sn_mmiowb() have different
implementations, but I think both fix the spinlock problem you allude to
by ensuring the DMA operation is completed before the CPU instruction is
executed.

James

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 16:37                                 ` James Bottomley
@ 2008-05-27 17:38                                   ` Roland Dreier
  2008-05-27 17:53                                     ` James Bottomley
  0 siblings, 1 reply; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 17:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

 > Actually, this specifically should not be.  The need for mmiowb on altix
 > is because it explicitly violates some of the PCI rules that would
 > otherwise impede performance.   The compromise is that readX on altix
 > contains the needed dma flush but there's a variant operator,
 > readX_relaxed that doesn't (for drivers that know what they're doing).
 > The altix critical drivers have all been converted to use the relaxed
 > form for performance, and the unconverted ones should all operate just
 > fine (albeit potentially more slowly).

Is this a recent change?  Because as of October 2007, 76d7cc03
("IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up")
was needed.  But this was involving writel() (__raw_writel() actually,
looking at the code), not readl().  But writel_relaxed() doesn't exist
(and doesn't make sense).

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 17:38                                   ` Roland Dreier
@ 2008-05-27 17:53                                     ` James Bottomley
  2008-05-27 18:07                                       ` Roland Dreier
  0 siblings, 1 reply; 160+ messages in thread
From: James Bottomley @ 2008-05-27 17:53 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

On Tue, 2008-05-27 at 10:38 -0700, Roland Dreier wrote:
> > Actually, this specifically should not be.  The need for mmiowb on altix
>  > is because it explicitly violates some of the PCI rules that would
>  > otherwise impede performance.   The compromise is that readX on altix
>  > contains the needed dma flush but there's a variant operator,
>  > readX_relaxed that doesn't (for drivers that know what they're doing).
>  > The altix critical drivers have all been converted to use the relaxed
>  > form for performance, and the unconverted ones should all operate just
>  > fine (albeit potentially more slowly).
> 
> Is this a recent change?  Because as of October 2007, 76d7cc03
> ("IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up")
> was needed.  But this was involving writel() (__raw_writel() actually,
> looking at the code), not readl().  But writel_relaxed() doesn't exist
> (and doesn't make sense).

Um, OK, you've said write twice now ... I was assuming you meant read.
Even on an x86, writes are posted, so there's no way a spin lock could
serialise a write without an intervening read to flush the posting
(that's why only reads have a relaxed version on altix).  Or is there
something else I'm missing?

James

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 17:53                                     ` James Bottomley
@ 2008-05-27 18:07                                       ` Roland Dreier
  2008-05-27 18:17                                         ` Roland Dreier
  2008-05-27 21:23                                         ` Chris Friesen
  0 siblings, 2 replies; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 18:07 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	torvalds, David Miller, alan, Arjan van de Ven

 > Um, OK, you've said write twice now ... I was assuming you meant read.
 > Even on an x86, writes are posted, so there's no way a spin lock could
 > serialise a write without an intervening read to flush the posting
 > (that's why only reads have a relaxed version on altix).  Or is there
 > something else I'm missing?

Writes are posted yes, but not reordered arbitrarily.  If I have code like:

	spin_lock(&mmio_lock);
	writel(val1, reg1);
	writel(val2, reg2);
	spin_unlock(&mmio_lock);

then I have a reasonable expectation that if two CPUs run this at the
same time, their writes to reg1/reg2 won't be interleaved with each
other (because the whole section is inside a spinlock).  And Altix
violates that expectation.

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 18:07                                       ` Roland Dreier
@ 2008-05-27 18:17                                         ` Roland Dreier
  2008-05-27 21:23                                         ` Chris Friesen
  1 sibling, 0 replies; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 18:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-arch, linux-kernel, tpiepho, linuxppc-dev, scottwood,
	torvalds, David Miller, alan, Arjan van de Ven

 > Writes are posted yes, but not reordered arbitrarily.

on standard x86 I mean here...

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 18:07                                       ` Roland Dreier
  2008-05-27 18:17                                         ` Roland Dreier
@ 2008-05-27 21:23                                         ` Chris Friesen
  2008-05-27 21:29                                           ` Roland Dreier
  2008-05-27 23:04                                           ` Paul Mackerras
  1 sibling, 2 replies; 160+ messages in thread
From: Chris Friesen @ 2008-05-27 21:23 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, tpiepho, James Bottomley, linuxppc-dev,
	scottwood, torvalds, David Miller, alan, Arjan van de Ven

Roland Dreier wrote:

> Writes are posted yes, but not reordered arbitrarily.  If I have code like:
> 
> 	spin_lock(&mmio_lock);
> 	writel(val1, reg1);
> 	writel(val2, reg2);
> 	spin_unlock(&mmio_lock);
> 
> then I have a reasonable expectation that if two CPUs run this at the
> same time, their writes to reg1/reg2 won't be interleaved with each
> other (because the whole section is inside a spinlock).  And Altix
> violates that expectation.

Does that necessarily follow?

If you've got a large system with multiple pci bridges, could you end up 
with posted writes coming from different cpus taking a different amount 
of time to propagate to a device and thus colliding?

Chris

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:23                                         ` Chris Friesen
@ 2008-05-27 21:29                                           ` Roland Dreier
  2008-05-27 23:04                                           ` Paul Mackerras
  1 sibling, 0 replies; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 21:29 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-arch, linux-kernel, tpiepho, James Bottomley, linuxppc-dev,
	scottwood, torvalds, David Miller, alan, Arjan van de Ven

 > > Writes are posted yes, but not reordered arbitrarily.  If I have code like:
 > >
 > > 	spin_lock(&mmio_lock);
 > > 	writel(val1, reg1);
 > > 	writel(val2, reg2);
 > > 	spin_unlock(&mmio_lock);
 > >
 > > then I have a reasonable expectation that if two CPUs run this at the
 > > same time, their writes to reg1/reg2 won't be interleaved with each
 > > other (because the whole section is inside a spinlock).  And Altix
 > > violates that expectation.
 > 
 > Does that necessarily follow?
 > 
 > If you've got a large system with multiple pci bridges, could you end
 > up with posted writes coming from different cpus taking a different
 > amount of time to propagate to a device and thus colliding?

Not on x86.  And a given PCI device can only be reached from a single
host bridge, so I don't see how it can happen.  But on SGI altix
systems, there is a routed fabric between the CPU and the PCI bus, so
the reordering can happen there.  Hence mmiowb() and the endless supply
of driver bugs that it causes.

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:23                                         ` Chris Friesen
  2008-05-27 21:29                                           ` Roland Dreier
@ 2008-05-27 23:04                                           ` Paul Mackerras
  1 sibling, 0 replies; 160+ messages in thread
From: Paul Mackerras @ 2008-05-27 23:04 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	James Bottomley, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

Chris Friesen writes:

> Roland Dreier wrote:
> 
> > Writes are posted yes, but not reordered arbitrarily.  If I have code like:
> > 
> > 	spin_lock(&mmio_lock);
> > 	writel(val1, reg1);
> > 	writel(val2, reg2);
> > 	spin_unlock(&mmio_lock);
> > 
> > then I have a reasonable expectation that if two CPUs run this at the
> > same time, their writes to reg1/reg2 won't be interleaved with each
> > other (because the whole section is inside a spinlock).  And Altix
> > violates that expectation.
> 
> Does that necessarily follow?
> 
> If you've got a large system with multiple pci bridges, could you end up 
> with posted writes coming from different cpus taking a different amount 
> of time to propagate to a device and thus colliding?

On powerpc we explicitly make sure that can't happen.  That's the "do
a sync in spin_unlock if there were any writels since the last
spin_lock" magic.  The sync instruction makes sure the writes have got
to the host bridge before the spinlock is unlocked.

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 15:50                               ` Roland Dreier
  2008-05-27 16:37                                 ` James Bottomley
@ 2008-05-27 21:11                                 ` Benjamin Herrenschmidt
  2008-05-27 21:33                                   ` Roland Dreier
  2008-05-31  8:04                                 ` Pavel Machek
  2 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 21:11 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven


On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote:
> > Though it's my understanding that at least ia64 does require the
>  > explicit barriers anyway, so we are still in a dodgy situation here
>  > where it's not clear what drivers should do and we end up with
>  > possibly excessive barriers on powerpc where I end up with both
>  > the wmb/rmb/mb that were added for ia64 -and- the ones I have in
>  > readl/writel to make them look synchronous... Not nice.
> 
> ia64 is a disaster with a slightly different ordering problem -- the
> mmiowb() issue.  I know Ben knows far too much about this, but for big
> SGI boxes, you sometimes need mmiowb() to avoid problems with driver
> code that does totally sane stuff like

This is a different issue. We deal with it on powerpc by having writel
set a per-cpu flag and spin_unlock() test it, and do the barrier if
needed there.

However, drivers such as e1000 -also- have a wmb() between filling the
ring buffer and kicking the DMA with MMIO, with a comment about this
being needed for ia64 relaxed ordering.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:11                                 ` Benjamin Herrenschmidt
@ 2008-05-27 21:33                                   ` Roland Dreier
  2008-05-27 22:13                                     ` Benjamin Herrenschmidt
  2008-05-29 14:47                                     ` Jes Sorensen
  0 siblings, 2 replies; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 21:33 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

 > This is a different issue. We deal with it on powerpc by having writel
 > set a per-cpu flag and spin_unlock() test it, and do the barrier if
 > needed there.

Cool... I assume you do this for mutex_unlock() etc?

Is there any reason why ia64 can't do this too so we can kill mmiowb and
save everyone a lot of hassle?  (mips, sh and frv have non-empty
mmiowb() definitions too but I'd guess that these are all bugs based on
misunderstandings of the mmiowb() semantics...)

 > However, drivers such as e1000 -also- have a wmb() between filling the
 > ring buffer and kicking the DMA with MMIO, with a comment about this
 > being needed for ia64 relaxed ordering.

I put these barriers into mthca, mlx4 etc, although it came from my
possible misunderstanding of the memory ordering rules in the kernel
more than any experience of problems (as opposed the the mmiowb()s,
which all came from real world bugs).

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:33                                   ` Roland Dreier
@ 2008-05-27 22:13                                     ` Benjamin Herrenschmidt
  2008-05-27 22:39                                       ` Roland Dreier
  2008-05-29 14:47                                     ` Jes Sorensen
  1 sibling, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-27 22:13 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven


On Tue, 2008-05-27 at 14:33 -0700, Roland Dreier wrote:
> > This is a different issue. We deal with it on powerpc by having writel
>  > set a per-cpu flag and spin_unlock() test it, and do the barrier if
>  > needed there.
> 
> Cool... I assume you do this for mutex_unlock() etc?

That's a good point... I don't think we do. Maybe we should.

> Is there any reason why ia64 can't do this too so we can kill mmiowb and
> save everyone a lot of hassle?  (mips, sh and frv have non-empty
> mmiowb() definitions too but I'd guess that these are all bugs based on
> misunderstandings of the mmiowb() semantics...)

Well, basically our approach was that mmiowb() is a pain in the neck,
nobody (ie. driver writers) really understands what it's for, and so
it's either not there or misused. So we didn't want to introduce it for
powerpc, but instead did the trick above in order to -slightly- improve
our writel (ie avoid a sync -after- the write) .

>  > However, drivers such as e1000 -also- have a wmb() between filling the
>  > ring buffer and kicking the DMA with MMIO, with a comment about this
>  > being needed for ia64 relaxed ordering.
> 
> I put these barriers into mthca, mlx4 etc, although it came from my
> possible misunderstanding of the memory ordering rules in the kernel
> more than any experience of problems (as opposed the the mmiowb()s,
> which all came from real world bugs).

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 22:13                                     ` Benjamin Herrenschmidt
@ 2008-05-27 22:39                                       ` Roland Dreier
  0 siblings, 0 replies; 160+ messages in thread
From: Roland Dreier @ 2008-05-27 22:39 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

 > > Cool... I assume you do this for mutex_unlock() etc?

 > That's a good point... I don't think we do. Maybe we should.

I think it's needed -- take a look at 76d7cc03, which came from a real
bug seen on Altix boxes.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 21:33                                   ` Roland Dreier
  2008-05-27 22:13                                     ` Benjamin Herrenschmidt
@ 2008-05-29 14:47                                     ` Jes Sorensen
  2008-05-29 15:01                                       ` James Bottomley
                                                         ` (3 more replies)
  1 sibling, 4 replies; 160+ messages in thread
From: Jes Sorensen @ 2008-05-29 14:47 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

>>>>> "Roland" == Roland Dreier <rdreier@cisco.com> writes:

>> This is a different issue. We deal with it on powerpc by having
>> writel set a per-cpu flag and spin_unlock() test it, and do the
>> barrier if needed there.

Roland> Cool... I assume you do this for mutex_unlock() etc?

Roland> Is there any reason why ia64 can't do this too so we can kill
Roland> mmiowb and save everyone a lot of hassle?  (mips, sh and frv
Roland> have non-empty mmiowb() definitions too but I'd guess that
Roland> these are all bugs based on misunderstandings of the mmiowb()
Roland> semantics...)

Hi Roland,

Thats not going to solve the problem on Altix. On Altix the issue is
that there can be multiple paths through the NUMA fabric from cpuX to
PCI bridge Y. 

Consider this uber-cool<tm> ascii art - NR is my abbrevation for NUMA
router:

        -------         -------
        |cpu X|         |cpu Y|
        -------         -------
         |   \____  ____/    |
         |        \/         |
         |    ____/\____     |
         |   /          \    |
         -----          ------
         |NR 1|         |NR 2|
         ------         ------
              \         /
               \       /
                -------
                | PCI |
                -------

The problem is that your two writel's, despite being both issued on
cpu X, due to the spin lock, in your example, can end up with the
first one going through NR 1 and the second one going through NR 2. If
there's contention on NR 1, the write going via NR 2 may hit the PCI
bridge prior to the one going via NR 1.

Of course, the bigger the system, the worse the problem....

The only way to guarantee ordering in the above setup, is to either
make writel() fully ordered or adding the mmiowb()'s inbetween the two
writel's. On Altix you have to go and read from the PCI brige to
ensure all writes to it have been flushed, which is also what mmiowb()
is doing. If writel() was to guarantee this ordering, it would make
every writel() call extremely expensive :-(

Cheers,
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 14:47                                     ` Jes Sorensen
@ 2008-05-29 15:01                                       ` James Bottomley
  2008-05-30  9:36                                         ` Jes Sorensen
  2008-05-29 21:40                                       ` Benjamin Herrenschmidt
                                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 160+ messages in thread
From: James Bottomley @ 2008-05-29 15:01 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
> >>>>> "Roland" == Roland Dreier <rdreier@cisco.com> writes:
> 
> >> This is a different issue. We deal with it on powerpc by having
> >> writel set a per-cpu flag and spin_unlock() test it, and do the
> >> barrier if needed there.
> 
> Roland> Cool... I assume you do this for mutex_unlock() etc?
> 
> Roland> Is there any reason why ia64 can't do this too so we can kill
> Roland> mmiowb and save everyone a lot of hassle?  (mips, sh and frv
> Roland> have non-empty mmiowb() definitions too but I'd guess that
> Roland> these are all bugs based on misunderstandings of the mmiowb()
> Roland> semantics...)
> 
> Hi Roland,
> 
> Thats not going to solve the problem on Altix. On Altix the issue is
> that there can be multiple paths through the NUMA fabric from cpuX to
> PCI bridge Y. 
> 
> Consider this uber-cool<tm> ascii art - NR is my abbrevation for NUMA
> router:
> 
>         -------         -------
>         |cpu X|         |cpu Y|
>         -------         -------
>          |   \____  ____/    |
>          |        \/         |
>          |    ____/\____     |
>          |   /          \    |
>          -----          ------
>          |NR 1|         |NR 2|
>          ------         ------
>               \         /
>                \       /
>                 -------
>                 | PCI |
>                 -------
> 
> The problem is that your two writel's, despite being both issued on
> cpu X, due to the spin lock, in your example, can end up with the
> first one going through NR 1 and the second one going through NR 2. If
> there's contention on NR 1, the write going via NR 2 may hit the PCI
> bridge prior to the one going via NR 1.
> 
> Of course, the bigger the system, the worse the problem....
> 
> The only way to guarantee ordering in the above setup, is to either
> make writel() fully ordered or adding the mmiowb()'s inbetween the two
> writel's. On Altix you have to go and read from the PCI brige to
> ensure all writes to it have been flushed, which is also what mmiowb()
> is doing. If writel() was to guarantee this ordering, it would make
> every writel() call extremely expensive :-(

So if a read from the bridge achieves the same effect, can't we just put
one after the writes within the spinlock (an unrelaxed one).  That way
this whole sequence will look like a well understood PCI posting flush
rather than have to muck around with little understood (at least by most
driver writers) io barriers?

James

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 15:01                                       ` James Bottomley
@ 2008-05-30  9:36                                         ` Jes Sorensen
  2008-05-30 17:21                                           ` Jesse Barnes
  0 siblings, 1 reply; 160+ messages in thread
From: Jes Sorensen @ 2008-05-30  9:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

James Bottomley wrote:
>> The only way to guarantee ordering in the above setup, is to either
>> make writel() fully ordered or adding the mmiowb()'s inbetween the two
>> writel's. On Altix you have to go and read from the PCI brige to
>> ensure all writes to it have been flushed, which is also what mmiowb()
>> is doing. If writel() was to guarantee this ordering, it would make
>> every writel() call extremely expensive :-(
> 
> So if a read from the bridge achieves the same effect, can't we just put
> one after the writes within the spinlock (an unrelaxed one).  That way
> this whole sequence will look like a well understood PCI posting flush
> rather than have to muck around with little understood (at least by most
> driver writers) io barriers?

Hmmm,

I think mmiowb() does some sort of status read from the bridge, I am not
sure if it's enough to just do a regular readl().

I'm adding Jeremy to the list, he should know for sure.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  9:36                                         ` Jes Sorensen
@ 2008-05-30 17:21                                           ` Jesse Barnes
  2008-05-31  7:57                                             ` Jeremy Higdon
  0 siblings, 1 reply; 160+ messages in thread
From: Jesse Barnes @ 2008-05-30 17:21 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon,
	David Miller, James Bottomley, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan, Arjan van de Ven

On Friday, May 30, 2008 2:36 am Jes Sorensen wrote:
> James Bottomley wrote:
> >> The only way to guarantee ordering in the above setup, is to either
> >> make writel() fully ordered or adding the mmiowb()'s inbetween the two
> >> writel's. On Altix you have to go and read from the PCI brige to
> >> ensure all writes to it have been flushed, which is also what mmiowb()
> >> is doing. If writel() was to guarantee this ordering, it would make
> >> every writel() call extremely expensive :-(
> >
> > So if a read from the bridge achieves the same effect, can't we just put
> > one after the writes within the spinlock (an unrelaxed one).  That way
> > this whole sequence will look like a well understood PCI posting flush
> > rather than have to muck around with little understood (at least by most
> > driver writers) io barriers?
>
> Hmmm,
>
> I think mmiowb() does some sort of status read from the bridge, I am not
> sure if it's enough to just do a regular readl().
>
> I'm adding Jeremy to the list, he should know for sure.

I think a read from the target host bridge is enough.  What mmiowb() does 
though is to read a *local* host bridge register, which contains a count of 
the number of PIO ops still "in flight" on their way to their target bridge.  
When it reaches 0, all PIOs have arrived at the target host bridge (they 
still may be bufferd), so ordering is guaranteed.

Jesse

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30 17:21                                           ` Jesse Barnes
@ 2008-05-31  7:57                                             ` Jeremy Higdon
  0 siblings, 0 replies; 160+ messages in thread
From: Jeremy Higdon @ 2008-05-31  7:57 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	David Miller, James Bottomley, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan, Arjan van de Ven

On Fri, May 30, 2008 at 10:21:00AM -0700, Jesse Barnes wrote:
> On Friday, May 30, 2008 2:36 am Jes Sorensen wrote:
> > James Bottomley wrote:
> > >> The only way to guarantee ordering in the above setup, is to either
> > >> make writel() fully ordered or adding the mmiowb()'s inbetween the two
> > >> writel's. On Altix you have to go and read from the PCI brige to
> > >> ensure all writes to it have been flushed, which is also what mmiowb()
> > >> is doing. If writel() was to guarantee this ordering, it would make
> > >> every writel() call extremely expensive :-(
> > >
> > > So if a read from the bridge achieves the same effect, can't we just put
> > > one after the writes within the spinlock (an unrelaxed one).  That way

A relaxed readX would be sufficient.  It's the next lowest cost way (after
mmiowb) of ensuring write ordering between CPUs.  Regular readX is the
most expensive method (well, we could probably come up with something worse,
but we'd have to work at it  :).

> > > this whole sequence will look like a well understood PCI posting flush
> > > rather than have to muck around with little understood (at least by most
> > > driver writers) io barriers?
> >
> > Hmmm,
> >
> > I think mmiowb() does some sort of status read from the bridge, I am not
> > sure if it's enough to just do a regular readl().
> >
> > I'm adding Jeremy to the list, he should know for sure.
> 
> I think a read from the target host bridge is enough.  What mmiowb() does 
> though is to read a *local* host bridge register, which contains a count of 
> the number of PIO ops still "in flight" on their way to their target bridge.  
> When it reaches 0, all PIOs have arrived at the target host bridge (they 
> still may be bufferd), so ordering is guaranteed.


Note that is the main advantage over a read.  There is no round trip across
the NUMA fabric.

jeremy

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 14:47                                     ` Jes Sorensen
  2008-05-29 15:01                                       ` James Bottomley
@ 2008-05-29 21:40                                       ` Benjamin Herrenschmidt
  2008-05-29 21:48                                         ` Trent Piepho
                                                           ` (2 more replies)
  2008-05-29 22:06                                       ` Roland Dreier
  2008-05-31  7:52                                       ` Jeremy Higdon
  3 siblings, 3 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-29 21:40 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
> 
> The only way to guarantee ordering in the above setup, is to either
> make writel() fully ordered or adding the mmiowb()'s inbetween the two
> writel's. On Altix you have to go and read from the PCI brige to
> ensure all writes to it have been flushed, which is also what mmiowb()
> is doing. If writel() was to guarantee this ordering, it would make
> every writel() call extremely expensive :-(

Interesting. I've always been taught by ia64 people that mmiowb() was
intended to be used solely between writel() and spin_unlock().

I think in the above case, you really should make writel() ordered.
Anything else is asking for trouble, for the exact same reasons that I
made it fully ordered on powerpc at least vs. previous stores. I only
kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for
which I use the trick mentioned before.

Yes, this has some cost (can be fairly significant on powerpc too) but
I think it's a very basic assumption from drivers that consecutive
writel's, especially issued by the same CPU, will get to the device
in order.

If this is a performance problem, then provide relaxed variants and
use them in selected drivers.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 21:40                                       ` Benjamin Herrenschmidt
@ 2008-05-29 21:48                                         ` Trent Piepho
  2008-05-29 22:05                                           ` Benjamin Herrenschmidt
  2008-05-29 21:53                                         ` Jesse Barnes
  2008-05-30  9:48                                         ` Jes Sorensen
  2 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-29 21:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	linuxppc-dev, scottwood, torvalds, David Miller, alan,
	Arjan van de Ven

On Fri, 30 May 2008, Benjamin Herrenschmidt wrote:
> On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
> Interesting. I've always been taught by ia64 people that mmiowb() was
> intended to be used solely between writel() and spin_unlock().

That's what I gathered too, based on what's written in memory-barriers.txt,
which is the only kernel docs I could find that addressed this.

> Yes, this has some cost (can be fairly significant on powerpc too) but
> I think it's a very basic assumption from drivers that consecutive
> writel's, especially issued by the same CPU, will get to the device
> in order.

It's also what memory-barriers.txt says they should do.

> If this is a performance problem, then provide relaxed variants and
> use them in selected drivers.

I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform.  With the
non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz.  Using raw
versions (which I had to write since powerpc doesn't have any), the clock
speed increases to about 28 MHz.  So it can make a very significant different.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 21:48                                         ` Trent Piepho
@ 2008-05-29 22:05                                           ` Benjamin Herrenschmidt
  2008-05-30  1:53                                             ` Trent Piepho
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-29 22:05 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	linuxppc-dev, scottwood, torvalds, David Miller, alan,
	Arjan van de Ven

On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote:

> I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform.  With the
> non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz.  Using raw
> versions (which I had to write since powerpc doesn't have any), the clock
> speed increases to about 28 MHz.  So it can make a very significant different.

Yes, sync's can hurt a lot. This is why I initially tried to get more
relaxed semantics.

We could implement something like __ variants and do something that
would still have eieio's but not sync's for example (ie. MMIOs are still
ordered vs. each other but not vs. coherent memory).

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 22:05                                           ` Benjamin Herrenschmidt
@ 2008-05-30  1:53                                             ` Trent Piepho
  0 siblings, 0 replies; 160+ messages in thread
From: Trent Piepho @ 2008-05-30  1:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	linuxppc-dev, scottwood, torvalds, David Miller, alan,
	Arjan van de Ven

On Fri, 30 May 2008, Benjamin Herrenschmidt wrote:
> On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote:
>
>> I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform.  With the
>> non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz.  Using raw
>> versions (which I had to write since powerpc doesn't have any), the clock
>> speed increases to about 28 MHz.  So it can make a very significant different.
>
> Yes, sync's can hurt a lot. This is why I initially tried to get more
> relaxed semantics.
>
> We could implement something like __ variants and do something that
> would still have eieio's but not sync's for example (ie. MMIOs are still
> ordered vs. each other but not vs. coherent memory).

The problem current with the raw variants is that not all archs have them. 
And for those that do, there is no defined semantics.  Each arch is different
as to what ordering they have (and endianness too).

If you want to write a driver that is (or might be one day) multi-platform,
there aren't any less ordered accessors one can use.  A lot of drivers don't
even use coherent DMA, and could use less strictly ordered semantics quite
trivially.  Except there aren't any.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 21:40                                       ` Benjamin Herrenschmidt
  2008-05-29 21:48                                         ` Trent Piepho
@ 2008-05-29 21:53                                         ` Jesse Barnes
  2008-05-30  9:39                                           ` Jes Sorensen
  2008-05-30  9:48                                         ` Jes Sorensen
  2 siblings, 1 reply; 160+ messages in thread
From: Jesse Barnes @ 2008-05-29 21:53 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote:
> On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
> > The only way to guarantee ordering in the above setup, is to either
> > make writel() fully ordered or adding the mmiowb()'s inbetween the two
> > writel's. On Altix you have to go and read from the PCI brige to
> > ensure all writes to it have been flushed, which is also what mmiowb()
> > is doing. If writel() was to guarantee this ordering, it would make
> > every writel() call extremely expensive :-(
>
> Interesting. I've always been taught by ia64 people that mmiowb() was
> intended to be used solely between writel() and spin_unlock().

Well, that *was* true, afaik, but maybe these days multipath isn't just for 
fail-over.  If that's true, then yeah making every single writeX ordered 
would be the only way to go...

> If this is a performance problem, then provide relaxed variants and
> use them in selected drivers.

Sounds reasonable.  That way drivers "just work" and important drivers can be 
optimized.

Jesse

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 21:53                                         ` Jesse Barnes
@ 2008-05-30  9:39                                           ` Jes Sorensen
  0 siblings, 0 replies; 160+ messages in thread
From: Jes Sorensen @ 2008-05-30  9:39 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

Jesse Barnes wrote:
> On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote:
>> On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
>>> The only way to guarantee ordering in the above setup, is to either
>>> make writel() fully ordered or adding the mmiowb()'s inbetween the two
>>> writel's. On Altix you have to go and read from the PCI brige to
>>> ensure all writes to it have been flushed, which is also what mmiowb()
>>> is doing. If writel() was to guarantee this ordering, it would make
>>> every writel() call extremely expensive :-(
>> Interesting. I've always been taught by ia64 people that mmiowb() was
>> intended to be used solely between writel() and spin_unlock().
> 
> Well, that *was* true, afaik, but maybe these days multipath isn't just for 
> fail-over.  If that's true, then yeah making every single writeX ordered 
> would be the only way to go...

I could be getting bits wrong, but multi-path here is in the NUMA
routing, not at the device level.

>> If this is a performance problem, then provide relaxed variants and
>> use them in selected drivers.
> 
> Sounds reasonable.  That way drivers "just work" and important drivers can be 
> optimized.

That would kill all levels of performance in all drivers, resulting in
attempts to try and modify a fair bit of drivers to get the performance
back.

In reality this problem really only exists for devices where ordering of
consecutive writel's is a big issue. In my experience it really isn't
the case very frequently - and the number of mmiowb's that have put in
shows that too :-)

Cheers,
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 21:40                                       ` Benjamin Herrenschmidt
  2008-05-29 21:48                                         ` Trent Piepho
  2008-05-29 21:53                                         ` Jesse Barnes
@ 2008-05-30  9:48                                         ` Jes Sorensen
  2008-05-31  8:14                                           ` Pavel Machek
  2 siblings, 1 reply; 160+ messages in thread
From: Jes Sorensen @ 2008-05-30  9:48 UTC (permalink / raw)
  To: benh
  Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

Benjamin Herrenschmidt wrote:
> On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
>> The only way to guarantee ordering in the above setup, is to either
>> make writel() fully ordered or adding the mmiowb()'s inbetween the two
>> writel's. On Altix you have to go and read from the PCI brige to
>> ensure all writes to it have been flushed, which is also what mmiowb()
>> is doing. If writel() was to guarantee this ordering, it would make
>> every writel() call extremely expensive :-(
> 
> Interesting. I've always been taught by ia64 people that mmiowb() was
> intended to be used solely between writel() and spin_unlock().
> 
> I think in the above case, you really should make writel() ordered.
> Anything else is asking for trouble, for the exact same reasons that I
> made it fully ordered on powerpc at least vs. previous stores. I only
> kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for
> which I use the trick mentioned before.

Hmmm I hope I didn't mess up the description of this and added to the
confusion.

The net result of that would be to kill performance completely, I really
don't like that idea.... Having each writel() go out and poll the PCI
bridge is going to make every driver used on Altix slow as a dog.

In addition it's still not going to solve the problem for userland
mapped stuff such as infinibug.

> Yes, this has some cost (can be fairly significant on powerpc too) but
> I think it's a very basic assumption from drivers that consecutive
> writel's, especially issued by the same CPU, will get to the device
> in order.

In this case the cost is more than just significant, it's pretty
crippling.

> If this is a performance problem, then provide relaxed variants and
> use them in selected drivers.

We'd have to make major changes to drivers like e1000, tg3, mptsas, the
qla2/3/4xxx and a bunch of others to get performance back. I really
think the code maintenance issue there will get far worse than what we
have today :(

Cheers,
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-30  9:48                                         ` Jes Sorensen
@ 2008-05-31  8:14                                           ` Pavel Machek
  2008-06-02  9:48                                             ` Jes Sorensen
  0 siblings, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2008-05-31  8:14 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

Hi!

> >>The only way to guarantee ordering in the above setup, 
> >>is to either
> >>make writel() fully ordered or adding the mmiowb()'s 
> >>inbetween the two
> >>writel's. On Altix you have to go and read from the 
> >>PCI brige to
> >>ensure all writes to it have been flushed, which is 
> >>also what mmiowb()
> >>is doing. If writel() was to guarantee this ordering, 
> >>it would make
> >>every writel() call extremely expensive :-(
> >
> >Interesting. I've always been taught by ia64 people 
> >that mmiowb() was
> >intended to be used solely between writel() and 
> >spin_unlock().
> >
> >I think in the above case, you really should make 
> >writel() ordered.
> >Anything else is asking for trouble, for the exact same 
> >reasons that I
> >made it fully ordered on powerpc at least vs. previous 
> >stores. I only
> >kept it relaxed vs. subsequent cacheable stores (ie, 
> >spin_unlock), for
> >which I use the trick mentioned before.
> 
> Hmmm I hope I didn't mess up the description of this and 
> added to the
> confusion.
> 
> The net result of that would be to kill performance 
> completely, I really
> don't like that idea.... Having each writel() go out and 
> poll the PCI
> bridge is going to make every driver used on Altix slow 
> as a dog.
> 
> In addition it's still not going to solve the problem 
> for userland
> mapped stuff such as infinibug.
> 
> >Yes, this has some cost (can be fairly significant on 
> >powerpc too) but
> >I think it's a very basic assumption from drivers that 
> >consecutive
> >writel's, especially issued by the same CPU, will get 
> >to the device
> >in order.
> 
> In this case the cost is more than just significant, 
> it's pretty
> crippling.
> 
> >If this is a performance problem, then provide relaxed 
> >variants and
> >use them in selected drivers.
> 
> We'd have to make major changes to drivers like e1000, 
> tg3, mptsas, the
> qla2/3/4xxx and a bunch of others to get performance 
> back. I really
> think the code maintenance issue there will get far 
> worse than what we
> have today :(

Still better than changing semantics of writel for _all_ drivers.

If you are really sure driver does not depend on writel order, it is
just a sed/// command, so I don't see any big code maintenance
issues...
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-31  8:14                                           ` Pavel Machek
@ 2008-06-02  9:48                                             ` Jes Sorensen
  0 siblings, 0 replies; 160+ messages in thread
From: Jes Sorensen @ 2008-06-02  9:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

Pavel Machek wrote:
> Still better than changing semantics of writel for _all_ drivers.
> 
> If you are really sure driver does not depend on writel order, it is
> just a sed/// command, so I don't see any big code maintenance
> issues...

This isn't changing the semantics for all drivers, it means it leaves
it up to us to fix the drivers that hit the problem. Whereas the other
way round we'd have to make some fairly big modifications to a large
number of drivers, which the driver maintainers are likely to be
reluctant to accept and maintain. It will be a constant chase to keep
that in order.

Cheers
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 14:47                                     ` Jes Sorensen
  2008-05-29 15:01                                       ` James Bottomley
  2008-05-29 21:40                                       ` Benjamin Herrenschmidt
@ 2008-05-29 22:06                                       ` Roland Dreier
  2008-05-29 22:25                                         ` Trent Piepho
  2008-05-31  7:52                                       ` Jeremy Higdon
  3 siblings, 1 reply; 160+ messages in thread
From: Roland Dreier @ 2008-05-29 22:06 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

 > The problem is that your two writel's, despite being both issued on
 > cpu X, due to the spin lock, in your example, can end up with the
 > first one going through NR 1 and the second one going through NR 2. If
 > there's contention on NR 1, the write going via NR 2 may hit the PCI
 > bridge prior to the one going via NR 1.

Really??  I can't see how you can expect any drivers to work reliably if
simple code like

	writel(reg1);
	writel(reg2);

might end up with the write to reg2 hitting the device before the write
to reg1.  Almost all MMIO stuff has ordering requirements and will
totally break if you allow, say, the "start IO" write to be reordered
before the "set IO address" write.

 - R.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 22:06                                       ` Roland Dreier
@ 2008-05-29 22:25                                         ` Trent Piepho
  2008-05-30  3:56                                           ` Paul Mackerras
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-29 22:25 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, Jes Sorensen, linux-kernel, linuxppc-dev, scottwood,
	torvalds, David Miller, alan, Arjan van de Ven

On Thu, 29 May 2008, Roland Dreier wrote:
> > The problem is that your two writel's, despite being both issued on
> > cpu X, due to the spin lock, in your example, can end up with the
> > first one going through NR 1 and the second one going through NR 2. If
> > there's contention on NR 1, the write going via NR 2 may hit the PCI
> > bridge prior to the one going via NR 1.
>
> Really??  I can't see how you can expect any drivers to work reliably if
> simple code like
>
> 	writel(reg1);
> 	writel(reg2);
>
> might end up with the write to reg2 hitting the device before the write
> to reg1.  Almost all MMIO stuff has ordering requirements and will

This is how powerpc is natively (the linux accessors have extra ordering to
not allow this of course), and there are non-Linux drivers that are written
for this ordering model.

I think what makes altix so hard is that writes to the _same_ register may be
be re-ordered.  Re-ordering writes to the same address is much less common
than allowing writes to different addresses to be re-ordered.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 22:25                                         ` Trent Piepho
@ 2008-05-30  3:56                                           ` Paul Mackerras
  0 siblings, 0 replies; 160+ messages in thread
From: Paul Mackerras @ 2008-05-30  3:56 UTC (permalink / raw)
  To: Trent Piepho
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	linuxppc-dev, scottwood, torvalds, David Miller, alan,
	Arjan van de Ven

Trent Piepho writes:

> On Thu, 29 May 2008, Roland Dreier wrote:
> > > The problem is that your two writel's, despite being both issued on
> > > cpu X, due to the spin lock, in your example, can end up with the
> > > first one going through NR 1 and the second one going through NR 2. If
> > > there's contention on NR 1, the write going via NR 2 may hit the PCI
> > > bridge prior to the one going via NR 1.
> >
> > Really??  I can't see how you can expect any drivers to work reliably if
> > simple code like
> >
> > 	writel(reg1);
> > 	writel(reg2);
> >
> > might end up with the write to reg2 hitting the device before the write
> > to reg1.  Almost all MMIO stuff has ordering requirements and will
> 
> This is how powerpc is natively (the linux accessors have extra ordering to
> not allow this of course), and there are non-Linux drivers that are written
> for this ordering model.

In fact stores to non-cacheable locations have to be kept in order,
according to the Power architecture.  But you're correct in that
non-cacheable loads can in principle be reordered w.r.t. each other
and to stores.

> I think what makes altix so hard is that writes to the _same_ register may be
> be re-ordered.  Re-ordering writes to the same address is much less common
> than allowing writes to different addresses to be re-ordered.

Indeed.

Paul.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-29 14:47                                     ` Jes Sorensen
                                                         ` (2 preceding siblings ...)
  2008-05-29 22:06                                       ` Roland Dreier
@ 2008-05-31  7:52                                       ` Jeremy Higdon
  2008-06-02  9:56                                         ` Jes Sorensen
  3 siblings, 1 reply; 160+ messages in thread
From: Jeremy Higdon @ 2008-05-31  7:52 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Thu, May 29, 2008 at 10:47:18AM -0400, Jes Sorensen wrote:
> Thats not going to solve the problem on Altix. On Altix the issue is
> that there can be multiple paths through the NUMA fabric from cpuX to
> PCI bridge Y. 
> 
> Consider this uber-cool<tm> ascii art - NR is my abbrevation for NUMA
> router:
> 
>         -------         -------
>         |cpu X|         |cpu Y|
>         -------         -------
>          |   \____  ____/    |
>          |        \/         |
>          |    ____/\____     |
>          |   /          \    |
>          -----          ------
>          |NR 1|         |NR 2|
>          ------         ------
>               \         /
>                \       /
>                 -------
>                 | PCI |
>                 -------
> 
> The problem is that your two writel's, despite being both issued on
> cpu X, due to the spin lock, in your example, can end up with the
> first one going through NR 1 and the second one going through NR 2. If
> there's contention on NR 1, the write going via NR 2 may hit the PCI
> bridge prior to the one going via NR 1.

We don't actually have that problem on the Altix.  All writes issued
by CPU X will be ordered with respect to each other.  But writes by
CPU X and CPU Y will not be, unless an mmiowb() is done by the
original CPU before the second CPU writes.  I.e.

	CPU X	writel
	CPU X	writel
	CPU X	mmiowb

	CPU Y	writel
	...

Note that this implies some sort of locking.  Also note that if in
the above, CPU Y did the mmiowb, that would not work.

jeremy

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-31  7:52                                       ` Jeremy Higdon
@ 2008-06-02  9:56                                         ` Jes Sorensen
  2008-06-02 21:02                                           ` Jeremy Higdon
  2008-06-03  4:33                                           ` Nick Piggin
  0 siblings, 2 replies; 160+ messages in thread
From: Jes Sorensen @ 2008-06-02  9:56 UTC (permalink / raw)
  To: Jeremy Higdon
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

Jeremy Higdon wrote:
> We don't actually have that problem on the Altix.  All writes issued
> by CPU X will be ordered with respect to each other.  But writes by
> CPU X and CPU Y will not be, unless an mmiowb() is done by the
> original CPU before the second CPU writes.  I.e.
> 
> 	CPU X	writel
> 	CPU X	writel
> 	CPU X	mmiowb
> 
> 	CPU Y	writel
> 	...
> 
> Note that this implies some sort of locking.  Also note that if in
> the above, CPU Y did the mmiowb, that would not work.

Hmmm,

Then it's less bad than I thought - my apologies for the confusion.

Would we be able to use Ben's trick of setting a per cpu flag in
writel() then and checking that in spin unlock issuing the mmiowb()
there if needed?


Cheers,
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02  9:56                                         ` Jes Sorensen
@ 2008-06-02 21:02                                           ` Jeremy Higdon
  2008-06-03  4:33                                           ` Nick Piggin
  1 sibling, 0 replies; 160+ messages in thread
From: Jeremy Higdon @ 2008-06-02 21:02 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, David Miller,
	linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Mon, Jun 02, 2008 at 11:56:39AM +0200, Jes Sorensen wrote:
> Jeremy Higdon wrote:
> >We don't actually have that problem on the Altix.  All writes issued
> >by CPU X will be ordered with respect to each other.  But writes by
> >CPU X and CPU Y will not be, unless an mmiowb() is done by the
> >original CPU before the second CPU writes.  I.e.
> >
> >	CPU X	writel
> >	CPU X	writel
> >	CPU X	mmiowb
> >
> >	CPU Y	writel
> >	...
> >
> >Note that this implies some sort of locking.  Also note that if in
> >the above, CPU Y did the mmiowb, that would not work.
> 
> Hmmm,
> 
> Then it's less bad than I thought - my apologies for the confusion.
> 
> Would we be able to use Ben's trick of setting a per cpu flag in
> writel() then and checking that in spin unlock issuing the mmiowb()
> there if needed?


Yes, that should work fine.  You will get more mmiowb's than you
need, since some drivers, such as Fusion, don't need them.  On the
Origins (older SGI MIPS-based Numa), the 'sync' instruction had
the same effect as mmiowb() with respect to mmio write ordering,
and it was issued unconditionally in the spin unlock.  It was
cheaper than mmiowb, however.

If it matters, we could invent and use writel_relaxed() to get
performance back in drivers we care about....

jeremy

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-02  9:56                                         ` Jes Sorensen
  2008-06-02 21:02                                           ` Jeremy Higdon
@ 2008-06-03  4:33                                           ` Nick Piggin
  2008-06-03  8:15                                             ` Jeremy Higdon
  2008-06-03 16:52                                             ` Jesse Barnes
  1 sibling, 2 replies; 160+ messages in thread
From: Nick Piggin @ 2008-06-03  4:33 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Roland Dreier, linux-kernel, Jeremy Higdon,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> Jeremy Higdon wrote:
> > We don't actually have that problem on the Altix.  All writes issued
> > by CPU X will be ordered with respect to each other.  But writes by
> > CPU X and CPU Y will not be, unless an mmiowb() is done by the
> > original CPU before the second CPU writes.  I.e.
> >
> > 	CPU X	writel
> > 	CPU X	writel
> > 	CPU X	mmiowb
> >
> > 	CPU Y	writel
> > 	...
> >
> > Note that this implies some sort of locking.  Also note that if in
> > the above, CPU Y did the mmiowb, that would not work.
>
> Hmmm,
>
> Then it's less bad than I thought - my apologies for the confusion.
>
> Would we be able to use Ben's trick of setting a per cpu flag in
> writel() then and checking that in spin unlock issuing the mmiowb()
> there if needed?

Yes you could, but your writels would still not be strongly ordered
within (or outside) spinlock regions, which is what Linus wants (and
I kind of agree with).

This comes back to my posting about mmiowb and io_*mb barriers etc.

Despite what you say, what you've done really _does_ change the semantics
of wmb() for all drivers. It is a really sad situation we've got ourselves
into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit
of work upfront :( (this is not just the sgi folk with mmiowb I'm talking
about, but the whole random undefinedness of ordering and io barriers).

The right way to make any change is never to weaken the postcondition of
an existing interface *unless* you are willing to audit the entire tree
and fix it. Impossible for drivers, so the correct thing to do is introduce
a new interface, and move things over at an easier pace. Not rocket science.

The argument that "Altix only uses a few drivers so this way we can just
fix these up rather than make big modifications to large numbers of drivers"
is bogus. It is far worse even for Altix if you make incompatible changes,
because you first *break* every driver on your platform, then you have to
audit and fix them. If you make compatible changes, then you have to do
exactly the same audits to move them over to the new API, but you go from
slower->faster rather than broken->non broken. As a bonus, you haven't
got random broken stuff all over the tree that you forgot to audit.

I don't know how there is still so much debate about this :(

I have a proposal: I am a neutral party here, not being an arch maintainer,
so I'll take input and write up a backward compatible API specification
and force everybody to conform to it ;)

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  4:33                                           ` Nick Piggin
@ 2008-06-03  8:15                                             ` Jeremy Higdon
  2008-06-03  8:19                                               ` Nick Piggin
  2008-06-03 16:52                                             ` Jesse Barnes
  1 sibling, 1 reply; 160+ messages in thread
From: Jeremy Higdon @ 2008-06-03  8:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
> On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> > Jeremy Higdon wrote:
> > > We don't actually have that problem on the Altix.  All writes issued
> > > by CPU X will be ordered with respect to each other.  But writes by
> > > CPU X and CPU Y will not be, unless an mmiowb() is done by the
> > > original CPU before the second CPU writes.  I.e.
> > >
> > > 	CPU X	writel
> > > 	CPU X	writel
> > > 	CPU X	mmiowb
> > >
> > > 	CPU Y	writel
> > > 	...
> > >
> > > Note that this implies some sort of locking.  Also note that if in
> > > the above, CPU Y did the mmiowb, that would not work.
> >
> > Hmmm,
> >
> > Then it's less bad than I thought - my apologies for the confusion.
> >
> > Would we be able to use Ben's trick of setting a per cpu flag in
> > writel() then and checking that in spin unlock issuing the mmiowb()
> > there if needed?
> 
> Yes you could, but your writels would still not be strongly ordered
> within (or outside) spinlock regions, which is what Linus wants (and
> I kind of agree with).

Yes they would be.  Writes from the same CPU are always ordered.  Writes
from different CPUs are not, but that's only a concern if you protect
writing via some sort of lock.  If the lock release forces a barrier,
that should take care of the problem.

jeremy

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  8:15                                             ` Jeremy Higdon
@ 2008-06-03  8:19                                               ` Nick Piggin
  2008-06-03  8:45                                                 ` Jeremy Higdon
  0 siblings, 1 reply; 160+ messages in thread
From: Nick Piggin @ 2008-06-03  8:19 UTC (permalink / raw)
  To: Jeremy Higdon
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote:
> On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
> > On Monday 02 June 2008 19:56, Jes Sorensen wrote:

> > > Would we be able to use Ben's trick of setting a per cpu flag in
> > > writel() then and checking that in spin unlock issuing the mmiowb()
> > > there if needed?
> >
> > Yes you could, but your writels would still not be strongly ordered
> > within (or outside) spinlock regions, which is what Linus wants (and
> > I kind of agree with).
>
> Yes they would be.  Writes from the same CPU are always ordered.  Writes
> from different CPUs are not, but that's only a concern if you protect

They are not strongly ordered WRT writes to cacheable memory. If they
were, then they would not leak out of spinlocks.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  8:19                                               ` Nick Piggin
@ 2008-06-03  8:45                                                 ` Jeremy Higdon
  0 siblings, 0 replies; 160+ messages in thread
From: Jeremy Higdon @ 2008-06-03  8:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	David Miller, linuxppc-dev, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Tue, Jun 03, 2008 at 06:19:05PM +1000, Nick Piggin wrote:
> On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote:
> > On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
> > > On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> > > > Would we be able to use Ben's trick of setting a per cpu flag in
> > > > writel() then and checking that in spin unlock issuing the mmiowb()
> > > > there if needed?
> > >
> > > Yes you could, but your writels would still not be strongly ordered
> > > within (or outside) spinlock regions, which is what Linus wants (and
> > > I kind of agree with).
> >
> > Yes they would be.  Writes from the same CPU are always ordered.  Writes
> > from different CPUs are not, but that's only a concern if you protect
> 
> They are not strongly ordered WRT writes to cacheable memory. If they
> were, then they would not leak out of spinlocks.

No posted writes are.
As I recall, the outX functions are not supposed to be posted, so the sn2
versions issue a mmiowb in the outX.  But writeX has always been posted,
or at least postable.  I thought that was generally accepted.

Normally, the only way for a device to see cacheable memory is via a DMA
read, and you are guaranteed on sn2 that in the following:

	store of of A to location X
	mmio write to device
	device DMA read from location X

that the device will see A.  In the past, on some other archs, you'd have
to flush the Dcache for that to work.

Granted, if the compiler reorders the store and the mmio write, then you
have a problem.

jeremy

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03  4:33                                           ` Nick Piggin
  2008-06-03  8:15                                             ` Jeremy Higdon
@ 2008-06-03 16:52                                             ` Jesse Barnes
  2008-06-05  8:40                                               ` Jes Sorensen
  1 sibling, 1 reply; 160+ messages in thread
From: Jesse Barnes @ 2008-06-03 16:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, Roland Dreier, Jes Sorensen, linux-kernel,
	Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan, Arjan van de Ven

On Monday, June 02, 2008 9:33 pm Nick Piggin wrote:
> On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> > Jeremy Higdon wrote:
> > > We don't actually have that problem on the Altix.  All writes issued
> > > by CPU X will be ordered with respect to each other.  But writes by
> > > CPU X and CPU Y will not be, unless an mmiowb() is done by the
> > > original CPU before the second CPU writes.  I.e.
> > >
> > > 	CPU X	writel
> > > 	CPU X	writel
> > > 	CPU X	mmiowb
> > >
> > > 	CPU Y	writel
> > > 	...
> > >
> > > Note that this implies some sort of locking.  Also note that if in
> > > the above, CPU Y did the mmiowb, that would not work.
> >
> > Hmmm,
> >
> > Then it's less bad than I thought - my apologies for the confusion.
> >
> > Would we be able to use Ben's trick of setting a per cpu flag in
> > writel() then and checking that in spin unlock issuing the mmiowb()
> > there if needed?
>
> Yes you could, but your writels would still not be strongly ordered
> within (or outside) spinlock regions, which is what Linus wants (and
> I kind of agree with).

I think you mean wrt cacheable memory accesses here (though iirc on ia64 
spin_unlock has release semantics, so at least it'll barrier other stores).

> This comes back to my posting about mmiowb and io_*mb barriers etc.
>
> Despite what you say, what you've done really _does_ change the semantics
> of wmb() for all drivers. It is a really sad situation we've got ourselves
> into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit
> of work upfront :( (this is not just the sgi folk with mmiowb I'm talking
> about, but the whole random undefinedness of ordering and io barriers).
>
> The right way to make any change is never to weaken the postcondition of
> an existing interface *unless* you are willing to audit the entire tree
> and fix it. Impossible for drivers, so the correct thing to do is introduce
> a new interface, and move things over at an easier pace. Not rocket
> science.

Well, given how undefined things have been in the past, each arch has had to 
figure out what things mean (based on looking at drivers & core code) then 
come up with appropriate primitives.  On Altix, we went both directions:  we 
made regular PIO reads (readX etc.) *very* expensive to preserve 
compatibility with what existing drivers expect, and added a readX_relaxed to 
give a big performance boost to tuned drivers.

OTOH, given that posted PCI writes were nothing new to Linux, but the Altix 
network topology was, we introduced mmiowb() (with lots of discussion I might 
add), which has clear and relatively simple usage guidelines.

Now, in hindsight, using a PIO write set & test flag approach in 
writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc 
that never came up in the discussion, probably because we were focused on PCI 
posting and not uncached vs. cached ordering.

> The argument that "Altix only uses a few drivers so this way we can just
> fix these up rather than make big modifications to large numbers of
> drivers" is bogus. It is far worse even for Altix if you make incompatible
> changes, because you first *break* every driver on your platform, then you
> have to audit and fix them. If you make compatible changes, then you have
> to do exactly the same audits to move them over to the new API, but you go
> from slower->faster rather than broken->non broken. As a bonus, you haven't
> got random broken stuff all over the tree that you forgot to audit.

I agree, but afaik the only change Altix ended up forcing on people was 
mmiowb(), but that turned out to be necessary on mips64 (and maybe some other 
platforms?) anyway.

> I don't know how there is still so much debate about this :(
>
> I have a proposal: I am a neutral party here, not being an arch maintainer,
> so I'll take input and write up a backward compatible API specification
> and force everybody to conform to it ;)

Aside from the obvious performance impact of making all the readX/writeX 
routines strongly ordered, both in terms of PCI posting and cacheable vs. 
uncacheable accesses, it also makes things inconsistent.  Both core code & 
drivers will still have to worry about regular, cacheable memory barriers for 
correctness, but it looks like you're proposing that they not have to think 
about I/O ordering.

At any rate, I don't think anyone would argue against defining the ordering 
semantics of all of these routines (btw you should also include ordering wrt 
DMA & PCI posting); the question is what's the best balance between keeping 
the driver requirements simple and the performance cost on complex arches.

Jesse

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-03 16:52                                             ` Jesse Barnes
@ 2008-06-05  8:40                                               ` Jes Sorensen
  2008-06-05  8:43                                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: Jes Sorensen @ 2008-06-05  8:40 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: linux-arch, Nick Piggin, Roland Dreier, linux-kernel,
	Jeremy Higdon, David Miller, linuxppc-dev, scottwood, torvalds,
	tpiepho, alan, Arjan van de Ven

Jesse Barnes wrote:
> Now, in hindsight, using a PIO write set & test flag approach in 
> writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc 
> that never came up in the discussion, probably because we were focused on PCI 
> posting and not uncached vs. cached ordering.

Hi Jesse,

I am going to take a stab at implementing this so we can see how much
of an impact it will have.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-05  8:40                                               ` Jes Sorensen
@ 2008-06-05  8:43                                                 ` Benjamin Herrenschmidt
  2008-06-12 15:07                                                   ` Matthew Wilcox
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-05  8:43 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-arch, Nick Piggin, Roland Dreier, linux-kernel,
	Jeremy Higdon, David Miller, linuxppc-dev, Jesse Barnes,
	scottwood, torvalds, tpiepho, alan, Arjan van de Ven

On Thu, 2008-06-05 at 10:40 +0200, Jes Sorensen wrote:
> Jesse Barnes wrote:
> > Now, in hindsight, using a PIO write set & test flag approach in 
> > writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc 
> > that never came up in the discussion, probably because we were focused on PCI 
> > posting and not uncached vs. cached ordering.
> 
> Hi Jesse,
> 
> I am going to take a stab at implementing this so we can see how much
> of an impact it will have.

Note that the powerpc implementation currently clears the flag
on spin_lock and tests it on unlock. We are considering changing
that to not touch the flag on spin_lock and just clear it whenever
we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
on readl's where we happen to do sync's).

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-05  8:43                                                 ` Benjamin Herrenschmidt
@ 2008-06-12 15:07                                                   ` Matthew Wilcox
  2008-06-13  0:07                                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: Matthew Wilcox @ 2008-06-12 15:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, Nick Piggin, Roland Dreier, Jes Sorensen,
	linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev,
	Jesse Barnes, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote:
> Note that the powerpc implementation currently clears the flag
> on spin_lock and tests it on unlock. We are considering changing
> that to not touch the flag on spin_lock and just clear it whenever
> we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
> on readl's where we happen to do sync's).

Your current scheme sounds like it's broken for

spin_lock(a)
writel();
spin_lock(b);
spin_unlock(b);
spin_unlock(a);

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-06-12 15:07                                                   ` Matthew Wilcox
@ 2008-06-13  0:07                                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-06-13  0:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, Nick Piggin, Roland Dreier, Jes Sorensen,
	linux-kernel, Jeremy Higdon, David Miller, linuxppc-dev,
	Jesse Barnes, scottwood, torvalds, tpiepho, alan,
	Arjan van de Ven

On Thu, 2008-06-12 at 09:07 -0600, Matthew Wilcox wrote:
> On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote:
> > Note that the powerpc implementation currently clears the flag
> > on spin_lock and tests it on unlock. We are considering changing
> > that to not touch the flag on spin_lock and just clear it whenever
> > we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
> > on readl's where we happen to do sync's).
> 
> Your current scheme sounds like it's broken for
> 
> spin_lock(a)
> writel();
> spin_lock(b);
> spin_unlock(b);
> spin_unlock(a);

Which is why we are considering changing it :-)

But as Paulus said before, he did some measurement and we came to the
conclusion that (pending more measurements on a wider range of HW) we
may as well drop the whole scheme and make writel fully synchronous
instead.

Then, we can get some nice weakly ordered accessors and start adding
them with appropriate explicit barriers to the hot path of perf.
critical drivers we care about.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27 15:50                               ` Roland Dreier
  2008-05-27 16:37                                 ` James Bottomley
  2008-05-27 21:11                                 ` Benjamin Herrenschmidt
@ 2008-05-31  8:04                                 ` Pavel Machek
  2 siblings, 0 replies; 160+ messages in thread
From: Pavel Machek @ 2008-05-31  8:04 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-arch, linux-kernel, David Miller, linuxppc-dev, scottwood,
	torvalds, tpiepho, alan, Arjan van de Ven

Hi!

>  > Though it's my understanding that at least ia64 does require the
>  > explicit barriers anyway, so we are still in a dodgy situation here
>  > where it's not clear what drivers should do and we end up with
>  > possibly excessive barriers on powerpc where I end up with both
>  > the wmb/rmb/mb that were added for ia64 -and- the ones I have in
>  > readl/writel to make them look synchronous... Not nice.
> 
> ia64 is a disaster with a slightly different ordering problem -- the
> mmiowb() issue.  I know Ben knows far too much about this, but for big
> SGI boxes, you sometimes need mmiowb() to avoid problems with driver
> code that does totally sane stuff like
> 
> 	spin_lock(&mmio_lock);
> 	writel(val1, reg1);
> 	writel(val2, reg2);
> 	spin_unlock(&mmio_lock);
> 
> If that snippet is called on two CPUs at the same time, then the device
> might see a sequence like
> 
> 	CPU1 -- write reg1
> 	CPU2 -- write reg1
> 	CPU1 -- write reg2
> 	CPU2 -- write reg2
> 
> in spite of the fact that everything is totally ordered on the CPUs by
> the spin lock.
> 
> The reason this is such a disaster is because the code looks right,
> makes sense, and works fine on 99.99% of all systems out there.  So I
> would bet that 99% of our drivers have missing mmiowb() "bugs" -- no one
> has plugged the hardware into an Altix box and cared enough to stress
> test it.

Yes, ia64 is a disaster, and needs its spinlock implementation fixed
:-).
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  2:28                         ` David Miller
  2008-05-27  3:39                           ` Benjamin Herrenschmidt
  2008-05-27  3:42                           ` Arjan van de Ven
@ 2008-05-27  8:24                           ` Alan Cox
  2 siblings, 0 replies; 160+ messages in thread
From: Alan Cox @ 2008-05-27  8:24 UTC (permalink / raw)
  To: David Miller
  Cc: linux-arch, linux-kernel, linuxppc-dev, scottwood, torvalds,
	tpiepho

> It's just another complicated thing for driver authors to get wrong.
> The other side of the coin is of course the cost.
> 
> The only thing I am absolutely sure of is that we should make a
> decision fast, document it, and just stick to it.

It's unfortunate that the __read/___write versions have other different
semantics. The default should definitely be ordered but having un-ordered
ones as an option would be good for zapping the hot paths that matter.

Most I/O paths aren't even worth the thinking time for non-ordered I/O.

Alan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: MMIO and gcc re-ordering issue
  2008-05-27  1:33                   ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt
  2008-05-27  1:40                     ` David Miller
@ 2008-05-27 15:28                     ` Jonathan Corbet
  1 sibling, 0 replies; 160+ messages in thread
From: Jonathan Corbet @ 2008-05-27 15:28 UTC (permalink / raw)
  To: benh
  Cc: Linux Arch list, linux-kernel, Trent Piepho, linuxppc-dev,
	scottwood, Linus Torvalds, David Miller, alan

On Tue, 27 May 2008 11:33:46 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> However, I'm using that as an excuse to bring back my pet subject,
> which is basically, should we instead just finally mandate the use of
> explicit rmb/wmb/mb's (which boils down to barrier() on x86) to
> drivers who want to order memory consistent accesses vs. MMIO ?

To me this seems like asking for a long series of weird bugs in
drivers.  Is the cost of a barrier in an MMIO operation so high that we
can't afford to just put them in?  Alan's suggestion of no-barrier
versions for people who know what they're doing (and who think there's a
reason to care) seems like a better way to go to me.

jon

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:35       ` Scott Wood
  2008-05-20 22:39         ` David Miller
@ 2008-05-20 22:55         ` Trent Piepho
  2008-05-21 14:01           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-20 22:55 UTC (permalink / raw)
  To: Scott Wood; +Cc: linux-kernel, Alan Cox, linuxppc-dev

On Tue, 20 May 2008, Scott Wood wrote:
> Alan Cox wrote:
>> >  It looks like we rely on -fno-strict-aliasing to prevent reordering 
>> >  ordinary memory accesses (such as to DMA descriptors) past the I/O
>>
>>  DMA descriptors in main memory are dependant on cache behaviour anyway
>>  and the dma_* operators should be the ones enforcing the needed behaviour.
>
> What about memory obtained from dma_alloc_coherent()?  We still need a sync 
> and a compiler barrier.  The current I/O accessors have the former, but not 
> the latter.

There doesn't appear to be any barriers to use for coherent dma other than
mb() and wmb().

Correct me if I'm wrong, but I think the sync isn't actually _required_ (by
memory-barriers.txt's definitions), and it would be enough to use eieio,
except there is code that doesn't use mmiowb() between I/O access and
unlocking.

So, as I understand it, the minimum needed is eieio.  To provide strict
ordering w.r.t. spin locks without using mmiowb(), you need sync.  To provide
strict ordering w.r.t. normal memory, you need sync and a compiler barrier.

Right now no archs provide the last option.  powerpc is currently the middle
option.  I don't know if anything uses the first option, maybe alpha?  I'm
almost certain x86 is the middle option (the first isn't possible, the arch
already has more ordering than that), which is probably why powerpc used that
option and not the first.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:55         ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
@ 2008-05-21 14:01           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-21 14:01 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel, Alan Cox


On Tue, 2008-05-20 at 15:55 -0700, Trent Piepho wrote:
> here doesn't appear to be any barriers to use for coherent dma other than
> mb() and wmb().
> 
> Correct me if I'm wrong, but I think the sync isn't actually _required_ (by
> memory-barriers.txt's definitions), and it would be enough to use eieio,
> except there is code that doesn't use mmiowb() between I/O access and
> unlocking.
> 
> So, as I understand it, the minimum needed is eieio.  To provide strict
> ordering w.r.t. spin locks without using mmiowb(), you need sync.  To provide
> strict ordering w.r.t. normal memory, you need sync and a compiler barrier.
> 
> Right now no archs provide the last option.  powerpc is currently the middle
> option.  I don't know if anything uses the first option, maybe alpha?  I'm
> almost certain x86 is the middle option (the first isn't possible, the arch
> already has more ordering than that), which is probably why powerpc used that
> option and not the first.

I don't have time for that now. Can you dig into the archives ? The
whole thing has been discussed at lenght already.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 21:16 ` Benjamin Herrenschmidt
  2008-05-20 21:38   ` Scott Wood
@ 2008-05-20 22:00   ` Trent Piepho
  2008-05-21 14:00     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-20 22:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Tue, 20 May 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-20 at 13:40 -0700, Trent Piepho wrote:
>> There was some discussion on a Freescale list if the powerpc I/O accessors
>> should be strictly ordered w.r.t.  normal memory.  Currently they are not.  It
>> does not appear as if any other architecture's I/O accessors are strictly
>> ordered in this manner.  memory-barriers.txt explicitly states that the I/O
>> space (inb, outw, etc.) are NOT strictly ordered w.r.t. normal memory
>> accesses and it's implied the other I/O accessors (e.g., writel) are the same.
>>
>> However, it is somewhat harder to program for this model, and there are almost
>> certainly a number of drivers using coherent DMA which have subtle bugs because
>> the do not include the necessary barriers.
>>
>> But clearly and change to this would be a subject for a different patch.
>
> The current accessors should provide all the necessary ordering
> guarantees...

Depends on what you define as "necessary".  It's seem clear that I/O accessors
_no not_ need to be strictly ordered with respect to normal memory accesses,
by what's defined in memory-barriers.txt.  So if by "necessary" you mean what
the Linux standard for I/O accessors requires (and what other archs provide),
then yes, they have the necessary ordering guarantees.

But, if you want them to be strictly ordered w.r.t to normal memory, that's
not the case.

For example, in something like:

u32 *dmabuf = kmalloc(...);
...
dmabuf[0] = 1;
out_be32(&regs->dmactl, DMA_SEND_BUFFER);
dmabuf[0] = 2;
out_be32(&regs->dmactl, DMA_SEND_BUFFER);

gcc might decide to optimize this code to:

out_be32(&regs->dmactl, DMA_SEND_BUFFER);
out_be32(&regs->dmactl, DMA_SEND_BUFFER);
dmabuf[0] = 2;

gcc will often not do this optimization, because there might be aliasing
between "&regs->dmact" and "dmabuf", but it _can_ do it.  gcc can't optimize
the two identical out_be32's into one, or re-order them if they were to
different registers, but it can move the normal memory accesses around them.

Here's a quick hack I stuck in a driver to test.  compile with -save-temps and
check the resulting asm.  gcc will do the optimization I described above.

static void __iomem *baz = (void*)0x1234;
static struct bar {
     u32 bar[256];
} bar;

void foo(void) {
     bar.bar[0] = 44;
     out_be32(baz+100, 200);
     bar.bar[0] = 45;
     out_be32(baz+101, 201);
}

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:00   ` Trent Piepho
@ 2008-05-21 14:00     ` Benjamin Herrenschmidt
  2008-05-21 19:44       ` Trent Piepho
  0 siblings, 1 reply; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-21 14:00 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel


> Depends on what you define as "necessary".  It's seem clear that I/O accessors
> _no not_ need to be strictly ordered with respect to normal memory accesses,
> by what's defined in memory-barriers.txt.  So if by "necessary" you mean what
> the Linux standard for I/O accessors requires (and what other archs provide),
> then yes, they have the necessary ordering guarantees.
> 
> But, if you want them to be strictly ordered w.r.t to normal memory, that's
> not the case.

They should be.

> For example, in something like:
> 
> u32 *dmabuf = kmalloc(...);
> ...
> dmabuf[0] = 1;
> out_be32(&regs->dmactl, DMA_SEND_BUFFER);
> dmabuf[0] = 2;
> out_be32(&regs->dmactl, DMA_SEND_BUFFER);
> 
> gcc might decide to optimize this code to:
> 
> out_be32(&regs->dmactl, DMA_SEND_BUFFER);
> out_be32(&regs->dmactl, DMA_SEND_BUFFER);
> dmabuf[0] = 2;

If that's the case, there is a bug. Ignoring gcc possible optimisations,
the accessors contain the necessary memory barriers for things to work
the way you describe above. If the use of volatile and clobber in our
macros isn't enough to also prevent optimisations, then we have a bug
and you are welcome to provide a patch to fix it.

> gcc will often not do this optimization, because there might be aliasing
> between "&regs->dmact" and "dmabuf", but it _can_ do it.  gcc can't optimize
> the two identical out_be32's into one, or re-order them if they were to
> different registers, but it can move the normal memory accesses around them.

The linus kernel -cannot- be compiled with strict aliasing rules. This
is one of the many areas where those are violated. Frankly, this strict
aliasing stuff is just a total nightmare turning a pefectly nice and
useable language into something it's not meant to be.

> Here's a quick hack I stuck in a driver to test.  compile with -save-temps and
> check the resulting asm.  gcc will do the optimization I described above.
> 
> static void __iomem *baz = (void*)0x1234;
> static struct bar {
>      u32 bar[256];
> } bar;
> 
> void foo(void) {
>      bar.bar[0] = 44;
>      out_be32(baz+100, 200);
>      bar.bar[0] = 45;
>      out_be32(baz+101, 201);
> }

Have you removed -fno-strict-aliasing ? Just don't do that.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-21 14:00     ` Benjamin Herrenschmidt
@ 2008-05-21 19:44       ` Trent Piepho
  2008-05-21 20:41         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-21 19:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Wed, 21 May 2008, Benjamin Herrenschmidt wrote:
>> Depends on what you define as "necessary".  It's seem clear that I/O accessors
>> _no not_ need to be strictly ordered with respect to normal memory accesses,
>> by what's defined in memory-barriers.txt.  So if by "necessary" you mean what
>> the Linux standard for I/O accessors requires (and what other archs provide),
>> then yes, they have the necessary ordering guarantees.
>>
>> But, if you want them to be strictly ordered w.r.t to normal memory, that's
>> not the case.
>
> They should be.

Someone should update memory-barriers.txt, because it doesn't say that, and
all I/O accessors for all the arches, because none of them are.

>> Here's a quick hack I stuck in a driver to test.  compile with -save-temps and
>> check the resulting asm.  gcc will do the optimization I described above.
>>
>> static void __iomem *baz = (void*)0x1234;
>> static struct bar {
>>      u32 bar[256];
>> } bar;
>>
>> void foo(void) {
>>      bar.bar[0] = 44;
>>      out_be32(baz+100, 200);
>>      bar.bar[0] = 45;
>>      out_be32(baz+101, 201);
>> }
>
> Have you removed -fno-strict-aliasing ? Just don't do that.

No, it's compiled with a normal kernel build, which includes
-fno-strict-aliasing.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-21 19:44       ` Trent Piepho
@ 2008-05-21 20:41         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2008-05-21 20:41 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Wed, 2008-05-21 at 12:44 -0700, Trent Piepho wrote:
> 
> Someone should update memory-barriers.txt, because it doesn't say
> that, and
> all I/O accessors for all the arches, because none of them are.

There have been long discussions about that. The end result was that
being too weakly ordered is just asking for trouble because the majority
of drivers are written & tested on x86 which is in order.

If you look at our accessors, minus that gcc problem you found, the
barriers in there should pretty much guarantee ordering in the cases
that matter, which are basically MMIO read followed by memory accesses
and memory writes followed by MMIO. In fact, MMIO read are fully
sychronous.

> No, it's compiled with a normal kernel build, which includes
> -fno-strict-aliasing

Ok, so there is a very bad bug indeed, we need to fix that.

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 20:40 [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
  2008-05-20 21:16 ` Benjamin Herrenschmidt
@ 2008-05-20 22:00 ` Andreas Schwab
  2008-05-20 22:11   ` Trent Piepho
  1 sibling, 1 reply; 160+ messages in thread
From: Andreas Schwab @ 2008-05-20 22:00 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel

Trent Piepho <tpiepho@freescale.com> writes:

> For the LE versions, eventually they boil down to an asm that will look
> something like this:
> asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr));
>
> While not perfect, this appears to be the best one can do.  The issue is
> that the "stwbrx" instruction only comes in an indexed, or 'x', version, in
> which the address is represented by the sum of two registers (the "0,%2").
> Unfortunately, gcc doesn't have a constraint for an indexed memory
> reference.

There is the "Z" constraint, which matches either an indirect or an
indexed memory address.  That should fit here.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:00 ` Andreas Schwab
@ 2008-05-20 22:11   ` Trent Piepho
  2008-05-20 22:47     ` Andreas Schwab
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-20 22:11 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Wed, 21 May 2008, Andreas Schwab wrote:
> Trent Piepho <tpiepho@freescale.com> writes:
>
>> For the LE versions, eventually they boil down to an asm that will look
>> something like this:
>> asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr));
>>
>> While not perfect, this appears to be the best one can do.  The issue is
>> that the "stwbrx" instruction only comes in an indexed, or 'x', version, in
>> which the address is represented by the sum of two registers (the "0,%2").
>> Unfortunately, gcc doesn't have a constraint for an indexed memory
>> reference.
>
> There is the "Z" constraint, which matches either an indirect or an
> indexed memory address.  That should fit here.

This came up on the Freescale list.  I should have put what I wrote there into
my patch descrition:

It's the _le versions that have a problem, since we can't get gcc to just use
the register indexed mode.  It seems like an obvious thing to have a
constraint for, but I guess there weren't enough instructions that only come
in 'x' versions to bother with it.  There is a 'Z' constraint, "Memory operand
that is an indexed or indirect from a register", but I tried it and it can use
both "rb,ri" and "disp(rb)" forms.  Actually, I'm not sure how 'Z' is any
different than "m"?

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:11   ` Trent Piepho
@ 2008-05-20 22:47     ` Andreas Schwab
  2008-05-20 23:14       ` Trent Piepho
  0 siblings, 1 reply; 160+ messages in thread
From: Andreas Schwab @ 2008-05-20 22:47 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel

Trent Piepho <tpiepho@freescale.com> writes:

> It's the _le versions that have a problem, since we can't get gcc to just use
> the register indexed mode.  It seems like an obvious thing to have a
> constraint for, but I guess there weren't enough instructions that only come
> in 'x' versions to bother with it.  There is a 'Z' constraint, "Memory operand
> that is an indexed or indirect from a register", but I tried it and it can use
> both "rb,ri" and "disp(rb)" forms.  Actually, I'm not sure how 'Z' is any
> different than "m"?

'Z' will never emit a non-zero constant displacement.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 22:47     ` Andreas Schwab
@ 2008-05-20 23:14       ` Trent Piepho
  2008-05-21  8:03         ` Andreas Schwab
  0 siblings, 1 reply; 160+ messages in thread
From: Trent Piepho @ 2008-05-20 23:14 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Wed, 21 May 2008, Andreas Schwab wrote:
> Trent Piepho <tpiepho@freescale.com> writes:
>
>> It's the _le versions that have a problem, since we can't get gcc to just use
>> the register indexed mode.  It seems like an obvious thing to have a
>> constraint for, but I guess there weren't enough instructions that only come
>> in 'x' versions to bother with it.  There is a 'Z' constraint, "Memory operand
>> that is an indexed or indirect from a register", but I tried it and it can use
>> both "rb,ri" and "disp(rb)" forms.  Actually, I'm not sure how 'Z' is any
>> different than "m"?
>
> 'Z' will never emit a non-zero constant displacement.

It's too bad gas doesn't appear to be smart enough to turn:
         stwbrx 0, 0(3)   -or-   stwbr 0, 0(3)

into the desired:
         stwbrx 0, 0, 3

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-20 23:14       ` Trent Piepho
@ 2008-05-21  8:03         ` Andreas Schwab
  2008-05-21 20:25           ` Trent Piepho
  2008-05-27 23:48           ` [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() " Trent Piepho
  0 siblings, 2 replies; 160+ messages in thread
From: Andreas Schwab @ 2008-05-21  8:03 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Scott Wood, linuxppc-dev, linux-kernel

Trent Piepho <tpiepho@freescale.com> writes:

> On Wed, 21 May 2008, Andreas Schwab wrote:
>> Trent Piepho <tpiepho@freescale.com> writes:
>>
>>> It's the _le versions that have a problem, since we can't get gcc to just use
>>> the register indexed mode.  It seems like an obvious thing to have a
>>> constraint for, but I guess there weren't enough instructions that only come
>>> in 'x' versions to bother with it.  There is a 'Z' constraint, "Memory operand
>>> that is an indexed or indirect from a register", but I tried it and it can use
>>> both "rb,ri" and "disp(rb)" forms.  Actually, I'm not sure how 'Z' is any
>>> different than "m"?
>>
>> 'Z' will never emit a non-zero constant displacement.
>
> It's too bad gas doesn't appear to be smart enough to turn:
>         stwbrx 0, 0(3)   -or-   stwbr 0, 0(3)

Use the %y modifier when substituting the operand.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code
  2008-05-21  8:03         ` Andreas Schwab
@ 2008-05-21 20:25           ` Trent Piepho
  2008-05-27 23:48           ` [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() " Trent Piepho
  1 sibling, 0 replies; 160+ messages in thread
From: Trent Piepho @ 2008-05-21 20:25 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Scott Wood, linuxppc-dev, linux-kernel

On Wed, 21 May 2008, Andreas Schwab wrote:
> Trent Piepho <tpiepho@freescale.com> writes:
>> On Wed, 21 May 2008, Andreas Schwab wrote:
>>> Trent Piepho <tpiepho@freescale.com> writes:
>>>> It's the _le versions that have a problem, since we can't get gcc to just use
>>>> the register indexed mode.  It seems like an obvious thing to have a
>>>> constraint for, but I guess there weren't enough instructions that only come
>>>> in 'x' versions to bother with it.  There is a 'Z' constraint, "Memory operand
>>>> that is an indexed or indirect from a register", but I tried it and it can use
>>>> both "rb,ri" and "disp(rb)" forms.  Actually, I'm not sure how 'Z' is any
>>>> different than "m"?
>>>
>>> 'Z' will never emit a non-zero constant displacement.
>>
>> It's too bad gas doesn't appear to be smart enough to turn:
>>         stwbrx 0, 0(3)   -or-   stwbr 0, 0(3)
>
> Use the %y modifier when substituting the operand.

Of course, the undocumented y modifier!  "Print AltiVec or SPE memory operand"
Why didn't I think of that?

That appears to work.  I can get gcc to to emit 0,reg and reg,reg but not
disp(reg).  It won't try to use an "update" address either, though it will
with an "m" constraint.

But, gcc 4.0.2 can't handle the 'Z' constraint.  It looks like it's not
supported.  Other than this, I can build a kernel with 4.0.2 that appears to
work.  Is it ok to break compatibility with 4.0.2, or should I put in a gcc
version check?

^ permalink raw reply	[flat|nested] 160+ messages in thread

* [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() asm code
  2008-05-21  8:03         ` Andreas Schwab
  2008-05-21 20:25           ` Trent Piepho
@ 2008-05-27 23:48           ` Trent Piepho
  1 sibling, 0 replies; 160+ messages in thread
From: Trent Piepho @ 2008-05-27 23:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Trent Piepho

Since commit 4cb3cee03d558fd457cb58f56c80a2a09a66110c the code generated
for the in_beXX() and out_beXX() mmio functions has been sub-optimal.

The out_leXX() family of functions are created with the macro
DEF_MMIO_OUT_LE() while the out_beXX() family are created with
DEF_MMIO_OUT_BE().  In what was perhaps a bit too much macro use, both of
these macros are in turn created via the macro DEF_MMIO_OUT().

For the LE versions, eventually they boil down to an asm that will look
something like this:
asm("sync; stwbrx %1,0,%2" : "=m" (*addr) : "r" (val), "r" (addr));

The issue is that the "stwbrx" instruction only comes in an indexed, or
'x', version, in which the address is represented by the sum of two
registers (the "0,%2").  Unfortunately, gcc doesn't have a constraint for
an indexed memory reference.  The "m" constraint allows both indexed and
offset, i.e. register plus constant, memory references and there is no
"stwbr" version for offset references.  "m" also allows updating addresses
and there is no 'u' version of "stwbrx" like there is with "stwux".

The unused first operand to the asm is just to tell gcc that *addr is an
output of the asm.  The address used is passed in a single register via the
third asm operand, and the index register is just hard coded as 0.  This
means gcc is forced to put the address in a single register and can't use
index addressing, e.g. if one has the data in register 9, a base address in
register 3 and an index in register 4, gcc must emit code like "add 11,4,3;
stwbrx 9,0,11" instead of just "stwbrx 9,4,3".  This costs an extra add
instruction and another register.

For gcc 4.0 and older, there doesn't appear to be anything that can be
done.  But for 4.1 and newer, there is a 'Z' constraint.  It does not allow
"updating" addresses, but does allow both indexed and offset addresses.
However, the only allowed constant offset is 0.  We can then use the
undocumented 'y' operand modifier, which causes gcc to convert "0(reg)"
into the equivilient "0,reg" format that can be used with stwbrx.

This brings us the to problem with the BE version.  In this case, the "stw"
instruction does have both indexed and non-indexed versions.  The final asm
ends up looking like this:
asm("sync; stw%U0%X0 %1,%0" : "=m" (*addr) : "r" (val), "r" (addr));

The undocumented codes "%U0" and "%0X" will generate a 'u' if the memory
reference should be an auto-updating one, and an 'x' if the memory
reference is indexed, respectively.  The third operand is unused, it's just
there because asm the code is reused from the LE version.  However, gcc
does not know this, and generates unnecessary code to stick addr in a
register!  To use the example from the LE version, gcc will generate "add
11,4,3; stwx 9,4,3".  It is able to use the indexed address "4,3" for the
"stwx", but still thinks it needs to put 4+3 into register 11, which will
never be used.

This also ends up happening a lot for the offset addressing mode, where
common code like this:  out_be32(&device_registers->some_register, data);
uses an instruction like "stw 9, 42(3)", where register 3 has the pointer
device_registers and 42 is the offset of some_register in that structure.
gcc will be forced to generate the unnecessary instruction "addi 11, 3, 42"
to put the address into a single (unused) register.

The in_* versions end up having these exact same problems as well.

Signed-off-by: Trent Piepho <tpiepho@freescale.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Andreas Schwab <schwab@suse.de>
---

This version uses the Andreas's suggestions about how to use the "Z"
constraint and "y" modifier to improve the LE version as well.

 include/asm-powerpc/io.h |   63 ++++++++++++++++++++++++++++++++-------------
 1 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/include/asm-powerpc/io.h b/include/asm-powerpc/io.h
index e0062d7..a2547ac 100644
--- a/include/asm-powerpc/io.h
+++ b/include/asm-powerpc/io.h
@@ -95,33 +95,60 @@ extern resource_size_t isa_mem_base;
 #define IO_SET_SYNC_FLAG()
 #endif

-#define DEF_MMIO_IN(name, type, insn)					\
-static inline type name(const volatile type __iomem *addr)		\
+/* gcc 4.0 and older doesn't have 'Z' constraint */
+#if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ == 0)
+#define DEF_MMIO_IN_LE(name, size, insn)				\
+static inline u##size name(const volatile u##size __iomem *addr)	\
 {									\
-	type ret;							\
-	__asm__ __volatile__("sync;" insn ";twi 0,%0,0;isync"		\
- 		: "=r" (ret) : "r" (addr), "m" (*addr));		\
+	u##size ret;							\
+	__asm__ __volatile__("sync;"#insn" %0,0,%1;twi 0,%0,0;isync"	\
+		: "=r" (ret) : "r" (addr), "m" (*addr));		\
 	return ret;							\
 }

-#define DEF_MMIO_OUT(name, type, insn)					\
-static inline void name(volatile type __iomem *addr, type val)		\
+#define DEF_MMIO_OUT_LE(name, size, insn) 				\
+static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
-	__asm__ __volatile__("sync;" insn				\
- 		: "=m" (*addr) : "r" (val), "r" (addr));		\
-	IO_SET_SYNC_FLAG();					\
+	__asm__ __volatile__("sync;"#insn" %1,0,%2"			\
+		: "=m" (*addr) : "r" (val), "r" (addr));		\
+	IO_SET_SYNC_FLAG();						\
 }
+#else /* newer gcc */
+#define DEF_MMIO_IN_LE(name, size, insn)				\
+static inline u##size name(const volatile u##size __iomem *addr)	\
+{									\
+	u##size ret;							\
+	__asm__ __volatile__("sync;"#insn" %0,%y1;twi 0,%0,0;isync"	\
+		: "=r" (ret) : "Z" (*addr));				\
+	return ret;							\
+}
+
+#define DEF_MMIO_OUT_LE(name, size, insn) 				\
+static inline void name(volatile u##size __iomem *addr, u##size val)	\
+{									\
+	__asm__ __volatile__("sync;"#insn" %1,%y0"			\
+		: "=Z" (*addr) : "r" (val));				\
+	IO_SET_SYNC_FLAG();						\
+}
+#endif

+#define DEF_MMIO_IN_BE(name, size, insn)				\
+static inline u##size name(const volatile u##size __iomem *addr)	\
+{									\
+	u##size ret;							\
+	__asm__ __volatile__("sync;"#insn"%U1%X1 %0,%1;twi 0,%0,0;isync"\
+		: "=r" (ret) : "m" (*addr));				\
+	return ret;							\
+}

-#define DEF_MMIO_IN_BE(name, size, insn) \
-	DEF_MMIO_IN(name, u##size, __stringify(insn)"%U2%X2 %0,%2")
-#define DEF_MMIO_IN_LE(name, size, insn) \
-	DEF_MMIO_IN(name, u##size, __stringify(insn)" %0,0,%1")
+#define DEF_MMIO_OUT_BE(name, size, insn)				\
+static inline void name(volatile u##size __iomem *addr, u##size val)	\
+{									\
+	__asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0"			\
+		: "=m" (*addr) : "r" (val));				\
+	IO_SET_SYNC_FLAG();						\
+}

-#define DEF_MMIO_OUT_BE(name, size, insn) \
-	DEF_MMIO_OUT(name, u##size, __stringify(insn)"%U0%X0 %1,%0")
-#define DEF_MMIO_OUT_LE(name, size, insn) \
-	DEF_MMIO_OUT(name, u##size, __stringify(insn)" %1,0,%2")

 DEF_MMIO_IN_BE(in_8,     8, lbz);
 DEF_MMIO_IN_BE(in_be16, 16, lhz);
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 160+ messages in thread

end of thread, other threads:[~2008-06-13  0:08 UTC | newest]

Thread overview: 160+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-20 20:40 [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
2008-05-20 21:16 ` Benjamin Herrenschmidt
2008-05-20 21:38   ` Scott Wood
2008-05-20 22:02     ` Benjamin Herrenschmidt
2008-05-20 22:21       ` Trent Piepho
2008-05-20 22:15     ` Alan Cox
2008-05-20 22:35       ` Scott Wood
2008-05-20 22:39         ` David Miller
2008-05-20 22:43           ` Scott Wood
2008-05-20 22:53             ` David Miller
2008-05-23  4:24               ` Benjamin Herrenschmidt
2008-05-22 22:56                 ` Trent Piepho
2008-05-23 12:36                   ` MMIO and gcc re-ordering (Was: [PATCH] [POWERPC] Improve (in|out)_beXX() asm code) Benjamin Herrenschmidt
2008-05-23 12:50                     ` Benjamin Herrenschmidt
2008-05-23 21:14                       ` Scott Wood
2008-05-23 22:47                         ` Benjamin Herrenschmidt
2008-05-27  1:33                   ` MMIO and gcc re-ordering issue Benjamin Herrenschmidt
2008-05-27  1:40                     ` David Miller
2008-05-27  2:15                       ` Benjamin Herrenschmidt
2008-05-27  2:28                         ` David Miller
2008-05-27  3:39                           ` Benjamin Herrenschmidt
2008-05-27 15:35                             ` Linus Torvalds
2008-05-27 16:47                               ` Linus Torvalds
2008-05-27 17:31                                 ` Linus Torvalds
2008-06-02 10:36                                   ` Ingo Molnar
2008-06-02 21:53                                     ` Benjamin Herrenschmidt
2008-05-27 21:12                                 ` Benjamin Herrenschmidt
2008-05-27 18:23                               ` Trent Piepho
2008-05-27 18:33                                 ` Scott Wood
2008-05-27 21:10                               ` Benjamin Herrenschmidt
2008-05-27 21:30                                 ` Linus Torvalds
2008-05-27 21:38                                   ` Alan Cox
2008-05-27 21:53                                     ` Matthew Wilcox
2008-05-27 21:46                                       ` Alan Cox
2008-05-27 22:02                                       ` Linus Torvalds
2008-05-27 21:59                                     ` Linus Torvalds
2008-05-27 21:38                                   ` Benjamin Herrenschmidt
2008-05-27 21:42                                     ` Matthew Wilcox
2008-05-27 22:17                                       ` Benjamin Herrenschmidt
2008-05-28  8:36                                         ` Haavard Skinnemoen
2008-05-29 11:05                                           ` Pantelis Antoniou
2008-05-30  1:13                                           ` Benjamin Herrenschmidt
2008-05-30  6:07                                             ` Haavard Skinnemoen
2008-05-30  7:24                                               ` Benjamin Herrenschmidt
2008-05-30  8:27                                                 ` Haavard Skinnemoen
2008-05-30  9:22                                                   ` Geert Uytterhoeven
2008-06-02  8:11                                                     ` Haavard Skinnemoen
2008-06-02 15:48                                                       ` Scott Wood
2008-06-03  7:46                                                         ` Haavard Skinnemoen
2008-06-04 15:31                                                       ` Linus Torvalds
2008-05-27 21:55                                     ` Linus Torvalds
2008-05-27 22:19                                       ` Benjamin Herrenschmidt
2008-05-29  7:10                                         ` Arnd Bergmann
2008-05-29 10:46                                           ` Alan Cox
2008-06-02  7:24                                       ` Russell King
2008-06-03  4:16                                         ` Nick Piggin
2008-06-03  4:32                                           ` Benjamin Herrenschmidt
2008-06-03  6:11                                             ` Nick Piggin
2008-06-03  6:48                                               ` Benjamin Herrenschmidt
2008-06-03  6:53                                               ` Paul Mackerras
2008-06-03  7:18                                                 ` Nick Piggin
2008-06-03 14:47                                           ` Linus Torvalds
2008-06-03 18:47                                             ` Trent Piepho
2008-06-03 18:55                                               ` Matthew Wilcox
2008-06-03 19:57                                                 ` Trent Piepho
2008-06-03 21:35                                                   ` Matthew Wilcox
2008-06-03 21:58                                                     ` Trent Piepho
2008-06-04  2:00                                                       ` Nick Piggin
2008-06-03 19:07                                               ` Linus Torvalds
2008-06-04  2:05                                                 ` Nick Piggin
2008-06-04  2:46                                                   ` Linus Torvalds
2008-06-04 11:47                                                     ` Alan Cox
2008-06-10  6:56                                                 ` Nick Piggin
2008-06-10 17:41                                                   ` Jesse Barnes
2008-06-10 18:10                                                     ` James Bottomley
2008-06-10 19:05                                                       ` Roland Dreier
2008-06-10 19:19                                                         ` Jesse Barnes
2008-06-11  3:29                                                           ` Nick Piggin
2008-06-11  3:40                                                             ` Benjamin Herrenschmidt
2008-06-11  4:06                                                               ` Nick Piggin
2008-06-11 16:07                                                             ` Jesse Barnes
2008-06-12 11:27                                                               ` Nick Piggin
2008-06-11  4:18                                                   ` Paul Mackerras
2008-06-11  5:00                                                     ` Nick Piggin
2008-06-11  5:13                                                       ` Paul Mackerras
2008-06-11  5:35                                                         ` Nick Piggin
2008-06-11  6:02                                                           ` Nick Piggin
2008-06-12 12:14                                                           ` Paul Mackerras
2008-06-12 13:08                                                             ` Nick Piggin
2008-06-11 14:46                                                       ` Linus Torvalds
2008-06-11  5:20                                                   ` Paul Mackerras
2008-06-04  2:19                                             ` Nick Piggin
2008-06-03 19:43                                           ` Trent Piepho
2008-06-03 21:33                                             ` Matthew Wilcox
2008-06-03 21:44                                               ` Trent Piepho
2008-06-04  2:25                                                 ` Nick Piggin
2008-06-04  6:39                                                   ` Trent Piepho
2008-06-03 22:26                                             ` Benjamin Herrenschmidt
2008-05-27  3:42                           ` Arjan van de Ven
2008-05-27  4:08                             ` Roland Dreier
2008-05-27  4:20                               ` Arjan van de Ven
2008-05-27  7:08                             ` Benjamin Herrenschmidt
2008-05-27 15:50                               ` Roland Dreier
2008-05-27 16:37                                 ` James Bottomley
2008-05-27 17:38                                   ` Roland Dreier
2008-05-27 17:53                                     ` James Bottomley
2008-05-27 18:07                                       ` Roland Dreier
2008-05-27 18:17                                         ` Roland Dreier
2008-05-27 21:23                                         ` Chris Friesen
2008-05-27 21:29                                           ` Roland Dreier
2008-05-27 23:04                                           ` Paul Mackerras
2008-05-27 21:11                                 ` Benjamin Herrenschmidt
2008-05-27 21:33                                   ` Roland Dreier
2008-05-27 22:13                                     ` Benjamin Herrenschmidt
2008-05-27 22:39                                       ` Roland Dreier
2008-05-29 14:47                                     ` Jes Sorensen
2008-05-29 15:01                                       ` James Bottomley
2008-05-30  9:36                                         ` Jes Sorensen
2008-05-30 17:21                                           ` Jesse Barnes
2008-05-31  7:57                                             ` Jeremy Higdon
2008-05-29 21:40                                       ` Benjamin Herrenschmidt
2008-05-29 21:48                                         ` Trent Piepho
2008-05-29 22:05                                           ` Benjamin Herrenschmidt
2008-05-30  1:53                                             ` Trent Piepho
2008-05-29 21:53                                         ` Jesse Barnes
2008-05-30  9:39                                           ` Jes Sorensen
2008-05-30  9:48                                         ` Jes Sorensen
2008-05-31  8:14                                           ` Pavel Machek
2008-06-02  9:48                                             ` Jes Sorensen
2008-05-29 22:06                                       ` Roland Dreier
2008-05-29 22:25                                         ` Trent Piepho
2008-05-30  3:56                                           ` Paul Mackerras
2008-05-31  7:52                                       ` Jeremy Higdon
2008-06-02  9:56                                         ` Jes Sorensen
2008-06-02 21:02                                           ` Jeremy Higdon
2008-06-03  4:33                                           ` Nick Piggin
2008-06-03  8:15                                             ` Jeremy Higdon
2008-06-03  8:19                                               ` Nick Piggin
2008-06-03  8:45                                                 ` Jeremy Higdon
2008-06-03 16:52                                             ` Jesse Barnes
2008-06-05  8:40                                               ` Jes Sorensen
2008-06-05  8:43                                                 ` Benjamin Herrenschmidt
2008-06-12 15:07                                                   ` Matthew Wilcox
2008-06-13  0:07                                                     ` Benjamin Herrenschmidt
2008-05-31  8:04                                 ` Pavel Machek
2008-05-27  8:24                           ` Alan Cox
2008-05-27 15:28                     ` Jonathan Corbet
2008-05-20 22:55         ` [PATCH] [POWERPC] Improve (in|out)_beXX() asm code Trent Piepho
2008-05-21 14:01           ` Benjamin Herrenschmidt
2008-05-20 22:00   ` Trent Piepho
2008-05-21 14:00     ` Benjamin Herrenschmidt
2008-05-21 19:44       ` Trent Piepho
2008-05-21 20:41         ` Benjamin Herrenschmidt
2008-05-20 22:00 ` Andreas Schwab
2008-05-20 22:11   ` Trent Piepho
2008-05-20 22:47     ` Andreas Schwab
2008-05-20 23:14       ` Trent Piepho
2008-05-21  8:03         ` Andreas Schwab
2008-05-21 20:25           ` Trent Piepho
2008-05-27 23:48           ` [PATCH V2] [POWERPC] Improve (in|out)_[bl]eXX() " Trent Piepho

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).