* Memory barriers and spin_unlock safety
@ 2006-03-03 16:03 David Howells
2006-03-03 16:45 ` David Howells
2006-03-03 16:55 ` Linus Torvalds
0 siblings, 2 replies; 32+ messages in thread
From: David Howells @ 2006-03-03 16:03 UTC (permalink / raw)
To: torvalds, akpm, mingo, jblunck, bcrl, matthew
Cc: linux-kernel, linux-arch, linuxppc64-dev
Hi,
We've just had an interesting discussion on IRC, and it has left two
unanswered questions:
(1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where ?FENCE
instructions are available?
Consider the following case, where you want to do two reads effectively
atomically, and so wrap them in a spinlock:
spin_lock(&mtx);
a = *A;
b = *B;
spin_unlock(&mtx);
On x86 Pentium3+ and x86_64, what's to stop you from getting the reads
done after the unlock since there's no LFENCE instruction there to stop
you?
What you'd expect is:
LOCK WRITE mtx
--> implies MFENCE
READ *A } which may be reordered
READ *B }
WRITE mtx
But what you might get instead is this:
LOCK WRITE mtx
--> implies MFENCE
WRITE mtx
--> implies SFENCE
READ *A } which may be reordered
READ *B }
There doesn't seem to be anything that says that the reads can't leak
outside of the locked section; at least, there doesn't seem to be in AMD's
system programming manual for AMD64 (book 2, section 7.1).
Writes on the other hand may not happen out of order, so changing things
inside a critical section would seem to be okay.
On PowerPC, on the other hand, the barriers have to be made explicit
because they're not implied by LWARX/STWCX or by ordinary stores:
LWARX mtx
STWCX mtx
ISYNC
READ *A } which may be reordered
READ *B }
LWSYNC
WRITE mtx
So, should the spin_unlock() on i386 and x86_64 be doing an LFENCE
instruction before unlocking?
(2) What is the minimum functionality that can be expected of a memory
barrier? I was of the opinion that all we could expect is for the CPU
executing one of them to force the instructions it is executing to be
complete up to a point - depending on the type of barrier - before
continuing past it.
On Pentiums, x86_64 and FRV this seems to be exactly what you get for a
barrier; there doesn't seem to be any external evidence of it that appears
on the bus, other than that the CPU does a load of memory transactions.
However, on ppc/ppc64, it seems to be more thorough than that, and there
seems to be some special interaction between the CPU processing the
instruction and the other CPUs in the system. It's not entirely obvious
from the manual just what this does.
As I understand it, Andrew Morton is of the opinion that issuing a read
barrier on one CPU will cause the other CPUs in the system to sync up, but
that doesn't look likely on all archs.
David
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:03 Memory barriers and spin_unlock safety David Howells
@ 2006-03-03 16:45 ` David Howells
  2006-03-03 17:03   ` Linus Torvalds
  2006-03-03 20:02   ` Arjan van de Ven
  2006-03-03 16:55 ` Linus Torvalds
  1 sibling, 2 replies; 32+ messages in thread

From: David Howells @ 2006-03-03 16:45 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, jblunck, bcrl, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> 	WRITE mtx
> 	--> implies SFENCE

Actually, I'm not sure this is true. The AMD64 Instruction Manual's
writeup of SFENCE implies that writes can be reordered, which sort of
contradicts what the AMD64 System Programming Manual says.

If this isn't true, then x86_64 at least should do MFENCE before the
store in spin_unlock() or change the store to be LOCK'ed. The same may
also apply for Pentium3+ class CPUs with the i386 arch.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:45 ` David Howells
@ 2006-03-03 17:03   ` Linus Torvalds
  2006-03-03 20:17     ` David Howells
  2006-03-03 20:02   ` Arjan van de Ven
  1 sibling, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 17:03 UTC (permalink / raw)
To: David Howells
Cc: akpm, mingo, jblunck, bcrl, matthew, linux-arch, linuxppc64-dev,
    linux-kernel

On Fri, 3 Mar 2006, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > 	WRITE mtx
> > 	--> implies SFENCE
>
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's
> writeup of SFENCE implies that writes can be reordered, which sort of
> contradicts what the AMD64 System Programming Manual says.

Note that _normal_ writes never need an SFENCE, because they are ordered
by the core.

The reason to use SFENCE is because of _special_ writes.

For example, if you use a non-temporal store, then the write buffer
ordering goes away, because there is no write buffer involved (the store
goes directly to the L2 or outside the bus).

Or when you talk to weakly ordered memory (ie a frame buffer that isn't
cached, and where the MTRR memory ordering bits say that writes be done
speculatively), you may want to say "I'm going to do the store that
starts the graphics pipeline, all my previous stores need to be done
now".

THAT is when you need to use SFENCE.

So SFENCE really isn't about the "smp_wmb()" kind of fencing at all.
It's about the much weaker ordering that is allowed by the special IO
memory types and nontemporal instructions.

(Actually, I think one special case of non-temporal instruction is the
"repeat movs/stos" thing: I think you should _not_ use a "repeat stos"
to unlock a spinlock, exactly because those stores are not ordered wrt
each other, and they can bypass the write queue. Of course, doing that
would be insane anyway, so no harm done ;^).

> If this isn't true, then x86_64 at least should do MFENCE before the
> store in spin_unlock() or change the store to be LOCK'ed. The same may
> also apply for Pentium3+ class CPUs with the i386 arch.

No. But if you want to make sure, you can always check with Intel
engineers. I'm pretty sure I have this right, though, because Intel
engineers have certainly looked at Linux sources and locking, and nobody
has ever said that we'd need an SFENCE.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 17:03 ` Linus Torvalds
@ 2006-03-03 20:17   ` David Howells
  2006-03-03 21:34     ` Linus Torvalds
  2006-03-07 17:36     ` David Howells
  0 siblings, 2 replies; 32+ messages in thread

From: David Howells @ 2006-03-03 20:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:

> Note that _normal_ writes never need an SFENCE, because they are
> ordered by the core.
>
> The reason to use SFENCE is because of _special_ writes.

I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(),
and that only io_wmb() should have that.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 20:17 ` David Howells
@ 2006-03-03 21:34   ` Linus Torvalds
  2006-03-03 21:51     ` Benjamin LaHaise
  0 siblings, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 21:34 UTC (permalink / raw)
To: David Howells
Cc: akpm, ak, mingo, jblunck, bcrl, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

On Fri, 3 Mar 2006, David Howells wrote:
>
> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(),
> and that only io_wmb() should have that.

Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie
just compile to a "barrier()" (and not even that on UP, of course).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 21:34 ` Linus Torvalds
@ 2006-03-03 21:51   ` Benjamin LaHaise
  2006-03-03 22:21     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread

From: Benjamin LaHaise @ 2006-03-03 21:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, ak, mingo, jblunck, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

On Fri, Mar 03, 2006 at 01:34:17PM -0800, Linus Torvalds wrote:
> Indeed. I think smp_wmb() should be a compiler fence only on x86(-64),
> ie just compile to a "barrier()" (and not even that on UP, of course).

Actually, no. At least in testing an implementation of Dekker's and
Peterson's algorithms as a replacement for the locked operation in our
spinlocks, it is absolutely necessary to have an sfence in the lock to
ensure the lock is visible to the other CPU before proceeding. I'd use
smp_wmb() as the fence is completely unnecessary on UP and is even
irq-safe. Here's a copy of the Peterson's implementation to illustrate
(it works, it's just slower than the existing spinlocks).

		-ben

diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h
index fe484a6..45bd386 100644
--- a/include/asm-x86_64/spinlock.h
+++ b/include/asm-x86_64/spinlock.h
@@ -4,6 +4,8 @@
 #include <asm/atomic.h>
 #include <asm/rwlock.h>
 #include <asm/page.h>
+#include <asm/pda.h>
+#include <asm/processor.h>
 #include <linux/config.h>
 
 /*
@@ -18,50 +20,53 @@
  */
 
 #define __raw_spin_is_locked(x) \
-		(*(volatile signed int *)(&(x)->slock) <= 0)
-
-#define __raw_spin_lock_string \
-	"\n1:\t" \
-	"lock ; decl %0\n\t" \
-	"js 2f\n" \
-	LOCK_SECTION_START("") \
-	"2:\t" \
-	"rep;nop\n\t" \
-	"cmpl $0,%0\n\t" \
-	"jle 2b\n\t" \
-	"jmp 1b\n" \
-	LOCK_SECTION_END
-
-#define __raw_spin_unlock_string \
-	"movl $1,%0" \
-		:"=m" (lock->slock) : : "memory"
+		((*(volatile signed int *)(x) & ~0xff) != 0)
 
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+	int cpu = read_pda(cpunumber);
+
+	barrier();
+	lock->flags[cpu] = 1;
+	lock->turn = cpu ^ 1;
+	barrier();
+
+	asm volatile("sfence":::"memory");
+
+	while (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+		cpu_relax();
+		barrier();
+	}
 }
 
 #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 {
-	int oldval;
-
-	__asm__ __volatile__(
-		"xchgl %0,%1"
-		:"=q" (oldval), "=m" (lock->slock)
-		:"0" (0) : "memory");
-
-	return oldval > 0;
+	int cpu = read_pda(cpunumber);
+	barrier();
+	if (__raw_spin_is_locked(lock))
+		return 0;
+
+	lock->flags[cpu] = 1;
+	lock->turn = cpu ^ 1;
+	asm volatile("sfence":::"memory");
+
+	if (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+		lock->flags[cpu] = 0;
+		barrier();
+		return 0;
+	}
+	return 1;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_unlock_string
-	);
+	int cpu;
+	//asm volatile("lfence":::"memory");
+	cpu = read_pda(cpunumber);
+	lock->flags[cpu] = 0;
+	barrier();
 }
 
 #define __raw_spin_unlock_wait(lock) \
diff --git a/include/asm-x86_64/spinlock_types.h b/include/asm-x86_64/spinlock_types.h
index 59efe84..a409cbf 100644
--- a/include/asm-x86_64/spinlock_types.h
+++ b/include/asm-x86_64/spinlock_types.h
@@ -6,10 +6,11 @@
 #endif
 
 typedef struct {
-	volatile unsigned int slock;
+	volatile unsigned char turn;
+	volatile unsigned char flags[3];
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 0, { 0, } }
 
 typedef struct {
 	volatile unsigned int lock;

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 21:51 ` Benjamin LaHaise
@ 2006-03-03 22:21   ` Linus Torvalds
  2006-03-03 22:36     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 22:21 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: David Howells, Andrew Morton, ak, mingo, jblunck, matthew,
    linux-arch, linuxppc64-dev, Linux Kernel Mailing List

On Fri, 3 Mar 2006, Benjamin LaHaise wrote:
>
> Actually, no. At least in testing an implementation of Dekker's and
> Peterson's algorithms as a replacement for the locked operation in
> our spinlocks, it is absolutely necessary to have an sfence in the
> lock to ensure the lock is visible to the other CPU before proceeding.

I suspect you have some bug in your implementation. I think Dekker's
algorithm depends on the reads and writes being ordered, and you don't
seem to do that.

The thing is, you pretty much _have_ to be wrong, because the x86-64
memory ordering rules are _exactly_ the same as for x86, and we've had
that simple store as an unlock for a long long time.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 22:21 ` Linus Torvalds
@ 2006-03-03 22:36   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 22:36 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: David Howells, Andrew Morton, ak, mingo, jblunck, matthew,
    linux-arch, linuxppc64-dev, Linux Kernel Mailing List

On Fri, 3 Mar 2006, Linus Torvalds wrote:
>
> I suspect you have some bug in your implementation. I think Dekker's
> algorithm depends on the reads and writes being ordered, and you don't
> seem to do that.

IOW, I think you need a full memory barrier after the "lock->turn = cpu
^ 1;" and you should have a "smp_rmb()" in between your reads of
"lock->flags[cpu ^ 1]" and "lock->turn" to give the ordering that Dekker
(or Peterson) expects.

IOW, the code should be something like

	lock->flags[other] = 1;
	smp_wmb();
	lock->turn = other;
	smp_mb();
	while (lock->turn == cpu) {
		smp_rmb();
		if (!lock->flags[other])
			break;
	}

where the wmb's are no-ops on x86, but the rmb's certainly are not.

I _suspect_ that the fact that it starts working with an 'sfence' in
there somewhere is just because the sfence ends up being "serializing
enough" that it just happens to work, but that it has nothing to do with
the current kernel wmb() being wrong.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 20:17 ` David Howells
  2006-03-03 21:34   ` Linus Torvalds
@ 2006-03-07 17:36   ` David Howells
  2006-03-07 17:40     ` Matthew Wilcox
  2006-03-07 18:18     ` Alan Cox
  1 sibling, 2 replies; 32+ messages in thread

From: David Howells @ 2006-03-07 17:36 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(),
> and that only io_wmb() should have that.

Hmmm... We don't actually have io_wmb()... Should the following be added
to all archs?

	io_mb()
	io_rmb()
	io_wmb()

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 17:36 ` David Howells
@ 2006-03-07 17:40   ` Matthew Wilcox
  2006-03-07 17:56     ` Jesse Barnes
  0 siblings, 1 reply; 32+ messages in thread

From: Matthew Wilcox @ 2006-03-07 17:40 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, akpm, ak, mingo, jblunck, bcrl, linux-arch,
    linuxppc64-dev, linux-kernel

On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > I suspect, then, that x86_64 should not have an SFENCE for
> > smp_wmb(), and that only io_wmb() should have that.
>
> Hmmm... We don't actually have io_wmb()... Should the following be
> added to all archs?
>
> 	io_mb()
> 	io_rmb()
> 	io_wmb()

it's spelled mmiowb(), and reads from IO space are synchronous, so don't
need barriers.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 17:40 ` Matthew Wilcox
@ 2006-03-07 17:56   ` Jesse Barnes
  0 siblings, 0 replies; 32+ messages in thread

From: Jesse Barnes @ 2006-03-07 17:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Howells, Linus Torvalds, akpm, ak, mingo, jblunck, bcrl,
    linux-arch, linuxppc64-dev, linux-kernel

On Tuesday, March 7, 2006 9:40 am, Matthew Wilcox wrote:
> On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> > David Howells <dhowells@redhat.com> wrote:
> > > I suspect, then, that x86_64 should not have an SFENCE for
> > > smp_wmb(), and that only io_wmb() should have that.
> >
> > Hmmm... We don't actually have io_wmb()... Should the following be
> > added to all archs?
> >
> > 	io_mb()
> > 	io_rmb()
> > 	io_wmb()
>
> it's spelled mmiowb(), and reads from IO space are synchronous, so
> don't need barriers.

To expand on willy's note, the reason it's called mmiowb as opposed to
iowb is because I/O port accesses (inX/outX) are inherently synchronous
and don't need barriers. mmio writes, however (writeX) need barrier
operations to ensure ordering on some platforms.

This raises the question of what semantics the unified I/O mapping
routines have... are ioreadX/iowriteX synchronous or should we define
the barriers you mention above for them? (IIRC ppc64 can use an io read
ordering op).

Jesse

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 17:36 ` David Howells
  2006-03-07 17:40   ` Matthew Wilcox
@ 2006-03-07 18:18   ` Alan Cox
  2006-03-07 18:28     ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread

From: Alan Cox @ 2006-03-07 18:18 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 17:36 +0000, David Howells wrote:
> Hmmm... We don't actually have io_wmb()... Should the following be
> added to all archs?
>
> 	io_mb()
> 	io_rmb()
> 	io_wmb()

What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we actually
need one that can work out what to do for the general io API ?

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 18:18 ` Alan Cox
@ 2006-03-07 18:28   ` Linus Torvalds
  2006-03-07 18:55     ` Alan Cox
  0 siblings, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-07 18:28 UTC (permalink / raw)
To: Alan Cox
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Tue, 7 Mar 2006, Alan Cox wrote:
>
> What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we
> actually need one that can work out what to do for the general io
> API ?

The ioread/iowrite things only guarantee the laxer MMIO rules, since it
_might_ be mmio. So you'd use the mmio barriers.

In fact, I would suggest that architectures that can do PIO in a more
relaxed manner (x86 cannot, since all the serialization is in hardware)
would do even a PIO in the more relaxed ordering (ie writes can at least
be posted, but obviously not merged, since that would be against PCI
specs).

x86 tends to serialize PIO too much (I think at least Intel CPU's will
actually wait for the PIO write to be acknowledged by _something_ on the
bus, although it obviously can't wait for the device to have acted on
it).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 18:28 ` Linus Torvalds
@ 2006-03-07 18:55   ` Alan Cox
  2006-03-07 20:21     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread

From: Alan Cox @ 2006-03-07 18:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 10:28 -0800, Linus Torvalds wrote:
> x86 tends to serialize PIO too much (I think at least Intel CPU's will
> actually wait for the PIO write to be acknowledged by _something_ on
> the bus, although it obviously can't wait for the device to have acted
> on it).

Don't bet on that 8(

In the PCI case the I/O write appears to be acked by the bridges used on
x86 when the write completes on the PCI bus and then back to the CPU.
MMIO is thankfully posted. At least that's how the timings on some
devices look.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 18:55 ` Alan Cox
@ 2006-03-07 20:21   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-07 20:21 UTC (permalink / raw)
To: Alan Cox
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Tue, 7 Mar 2006, Alan Cox wrote:
>
> In the PCI case the I/O write appears to be acked by the bridges used
> on x86 when the write completes on the PCI bus and then back to the
> CPU. MMIO is thankfully posted. At least that's how the timings on
> some devices look.

Oh, absolutely. I'm saying that you shouldn't wait for even that, since
it's totally pointless (it's not synchronized _anyway_) and adds
absolutely zero gain. To really synchronize, you need to read from the
device anyway.

So the "wait for bus activity" is just making PIO slower for no good
reason, and keeps the core waiting when it could do something more
useful.

On an x86, there are legacy reasons to do it (people expect certain
timings). But that was what I was saying - on non-x86 architectures,
there's no reason for the ioread/iowrite interfaces to be as
serializing as the old-fashioned PIO ones are. Might as well do the
MMIO rules for a non-cacheable region: no re-ordering, but no waiting
for the bus either.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:45 ` David Howells
  2006-03-03 17:03   ` Linus Torvalds
@ 2006-03-03 20:02   ` Arjan van de Ven
  1 sibling, 0 replies; 32+ messages in thread

From: Arjan van de Ven @ 2006-03-03 20:02 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, jblunck, bcrl, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

On Fri, 2006-03-03 at 16:45 +0000, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > 	WRITE mtx
> > 	--> implies SFENCE
>
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's
> writeup of SFENCE implies that writes can be reordered, which sort of
> contradicts what the AMD64 System Programming Manual says.

there are 2 or 3 special instructions which do "non temporal stores"
(movntq and movnti and maybe one more). sfence is designed for those.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:03 Memory barriers and spin_unlock safety David Howells
  2006-03-03 16:45 ` David Howells
@ 2006-03-03 16:55 ` Linus Torvalds
  2006-03-03 20:15   ` David Howells
                      ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 16:55 UTC (permalink / raw)
To: David Howells
Cc: akpm, mingo, jblunck, bcrl, matthew, linux-kernel, linux-arch,
    linuxppc64-dev

On Fri, 3 Mar 2006, David Howells wrote:
>
> We've just had an interesting discussion on IRC, and it has left two
> unanswered questions:
>
> (1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where
>     ?FENCE instructions are available?
>
>     Consider the following case, where you want to do two reads
>     effectively atomically, and so wrap them in a spinlock:
>
>	spin_lock(&mtx);
>	a = *A;
>	b = *B;
>	spin_unlock(&mtx);
>
>     On x86 Pentium3+ and x86_64, what's to stop you from getting the
>     reads done after the unlock since there's no LFENCE instruction
>     there to stop you?

The rules are, afaik, that reads can pass buffered writes, BUT WRITES
CANNOT PASS READS (aka "writes to memory are always carried out in
program order").

IOW, reads can bubble up, but writes cannot. So the way I read the Intel
rules is that "passing" is always about being done earlier than
otherwise allowed, not about being done later. (You only "pass" somebody
in traffic when you go ahead of them. If you fall behind them, you don't
"pass" them, _they_ pass you).

Now, this is not so much meant to be a semantic argument (the meaning of
the word "pass") as to an explanation of what I believe Intel meant,
since we know from Intel designers that the simple non-atomic write is
supposedly a perfectly fine unlock instruction.

So when Intel says "reads can be carried out speculatively and in any
order", that just says that reads are not ordered wrt other _reads_.
They _are_ ordered wrt other writes, but only one way: they can pass an
earlier write, but they can't fall back behind a later one.

This is consistent with

 (a) optimization (you want to do reads _early_, not late)

 (b) behaviour (we've been told that a single write is sufficient, with
     the exception of an early P6 core revision)

 (c) at least one way of reading the documentation.

And I claim that (a) and (b) are the important parts, and that (c) is
just the rationale.

> (2) What is the minimum functionality that can be expected of a memory
>     barrier? I was of the opinion that all we could expect is for the
>     CPU executing one of them to force the instructions it is
>     executing to be complete up to a point - depending on the type of
>     barrier - before continuing past it.

Well, no. You should expect even _less_.

The core can continue doing things past a barrier. For example, a write
barrier may not actually serialize anything at all: the sane way of
doing write barriers is to just put a note in the write-queue, and that
note just disallows write queue entries from being moved around it. So
you might have a write barrier with two writes on either side, and the
writes might _both_ be outstanding wrt the core despite the barrier.

So there's not necessarily any synchronization at all on an execution
core level, just a partial ordering between the resulting actions of the
core.

> However, on ppc/ppc64, it seems to be more thorough than that, and
> there seems to be some special interaction between the CPU processing
> the instruction and the other CPUs in the system. It's not entirely
> obvious from the manual just what this does.

PPC has an absolutely _horrible_ memory ordering implementation, as far
as I can tell. The thing is broken. I think it's just implementation
breakage, not anything really fundamental, but the fact that their write
barriers are expensive is a big sign that they are doing something bad.

For example, their write buffers may not have a way to serialize in the
buffers, and at that point from an _implementation_ standpoint, you just
have to serialize the whole core to make sure that writes don't pass
each other.

> As I understand it, Andrew Morton is of the opinion that issuing a
> read barrier on one CPU will cause the other CPUs in the system to
> sync up, but that doesn't look likely on all archs.

No. Issuing a read barrier on one CPU will do absolutely _nothing_ on
the other CPU. All barriers are purely local to one CPU, and do not
generate any bus traffic what-so-ever. They only potentially affect the
order of bus traffic due to the instructions around them (obviously).

So a read barrier on one CPU _has_ to be paired with a write barrier on
the other side in order to make sense (although the write barrier can
obviously be of the implied kind, ie a lock/unlock event, or just
architecture-specific knowledge of write behaviour, ie for example
knowing that writes are always seen in-order on x86).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:55 ` Linus Torvalds
@ 2006-03-03 20:15   ` David Howells
  2006-03-03 21:31     ` Linus Torvalds
  2006-03-03 21:06   ` Benjamin Herrenschmidt
  2006-03-04 10:58   ` Paul Mackerras
  2 siblings, 1 reply; 32+ messages in thread

From: David Howells @ 2006-03-03 20:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, mingo, jblunck, bcrl, matthew, linux-kernel,
    linux-arch, linuxppc64-dev

Linus Torvalds <torvalds@osdl.org> wrote:

> The rules are, afaik, that reads can pass buffered writes, BUT WRITES
> CANNOT PASS READS (aka "writes to memory are always carried out in
> program order").

So in the example I gave, a read after the spin_unlock() may actually
get executed before the store in the spin_unlock(), but a read before
the unlock will not get executed after.

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on
> the other CPU.

Well, I think you mean will guarantee absolutely _nothing_ on the other
CPU for the Linux kernel. According to the IBM powerpc book I have, it
does actually do something on the other CPUs, though it doesn't say
exactly what.

Anyway, thanks. I'll write up some documentation on barriers for
inclusion in the kernel.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 20:15 ` David Howells
@ 2006-03-03 21:31   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 21:31 UTC (permalink / raw)
To: David Howells
Cc: akpm, mingo, jblunck, bcrl, matthew, linux-kernel, linux-arch,
    linuxppc64-dev

On Fri, 3 Mar 2006, David Howells wrote:
>
> So in the example I gave, a read after the spin_unlock() may actually
> get executed before the store in the spin_unlock(), but a read before
> the unlock will not get executed after.

Yes.

> > No. Issuing a read barrier on one CPU will do absolutely _nothing_
> > on the other CPU.
>
> Well, I think you mean will guarantee absolutely _nothing_ on the
> other CPU for the Linux kernel. According to the IBM powerpc book I
> have, it does actually do something on the other CPUs, though it
> doesn't say exactly what.

Yeah, Power really does have some funky stuff in their memory ordering.
I'm not quite sure why, though. And it definitely isn't implied by any
of the Linux kernel barriers.

(They also do TLB coherency in hw etc strange things).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:55 ` Linus Torvalds
  2006-03-03 20:15   ` David Howells
@ 2006-03-03 21:06   ` Benjamin Herrenschmidt
  2006-03-03 21:18     ` Hollis Blanchard
                        ` (2 more replies)
  2006-03-04 10:58   ` Paul Mackerras
  2 siblings, 3 replies; 32+ messages in thread

From: Benjamin Herrenschmidt @ 2006-03-03 21:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, mingo, jblunck, bcrl, matthew, linux-kernel,
    linux-arch, linuxppc64-dev

> PPC has an absolutely _horrible_ memory ordering implementation, as
> far as I can tell. The thing is broken. I think it's just
> implementation breakage, not anything really fundamental, but the fact
> that their write barriers are expensive is a big sign that they are
> doing something bad.

Are they ? read barriers and full barriers are, write barriers should be
fairly cheap (but then, I haven't measured).

> For example, their write buffers may not have a way to serialize in
> the buffers, and at that point from an _implementation_ standpoint,
> you just have to serialize the whole core to make sure that writes
> don't pass each other.

The main problem I've had in the past with the ppc barriers is more a
subtle thing in the spec that unfortunately was taken to the word by
implementors, and is that the simple write barrier (eieio) will only
order within the same storage space, that is will not order between
cacheable and non-cacheable storage. That means IOs could leak out of
locks etc... Which is why we use expensive barriers in MMIO wrappers for
now (though we might investigate the use of mmioXb instead in the
future).

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on
> the other CPU. All barriers are purely local to one CPU, and do not
> generate any bus traffic what-so-ever. They only potentially affect
> the order of bus traffic due to the instructions around them
> (obviously).

Actually, the ppc's full barrier (sync) will generate bus traffic, and I
think in some cases eieio barriers can propagate to the chipset to
enforce ordering there too depending on some voodoo settings and whether
the storage space is cacheable or not.

> So a read barrier on one CPU _has_ to be paired with a write barrier
> on the other side in order to make sense (although the write barrier
> can obviously be of the implied kind, ie a lock/unlock event, or just
> architecture-specific knowledge of write behaviour, ie for example
> knowing that writes are always seen in-order on x86).
>
> 		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
From: Hollis Blanchard @ 2006-03-03 21:18 UTC (permalink / raw)
To: linuxppc64-dev
Cc: Benjamin Herrenschmidt, Linus Torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, jblunck

On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote:
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the word by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is will not order between
> cacheable and non-cacheable storage.

I've heard Sparc has the same issue... in which case it may not be a
"chip designer was too literal" thing, but rather it really simplifies
chip implementation to do it that way.

-- 
Hollis Blanchard
IBM Linux Technology Center
* Re: Memory barriers and spin_unlock safety
From: David S. Miller @ 2006-03-03 21:52 UTC (permalink / raw)
To: hollisb
Cc: linuxppc64-dev, benh, torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, jblunck

From: Hollis Blanchard <hollisb@us.ibm.com>
Date: Fri, 3 Mar 2006 15:18:13 -0600

> I've heard Sparc has the same issue... in which case it may not be a "chip
> designer was too literal" thing, but rather it really simplifies chip
> implementation to do it that way.

There is a "membar #MemIssue" that is meant to deal with this should it
ever matter, but for most sparc64 chips it doesn't, which is why we
don't use that memory barrier type at all in the Linux kernel.

For UltraSPARC-I and II it technically could matter in Relaxed Memory
Ordering (RMO) mode, which is what we run the kernel and 64-bit
userspace in, but I've never seen an issue resulting from it.

For UltraSPARC-III and later, the chip only implements the Total Store
Ordering (TSO) memory model, and the manual explicitly states that
cacheable and non-cacheable memory operations are ordered, even using
language such as "there is an implicit 'membar #MemIssue' between
them". It further goes on to say:

	The UltraSPARC III Cu processor maintains ordering between
	cacheable and non-cacheable accesses. The UltraSPARC III Cu
	processor maintains TSO ordering between memory references
	regardless of their cacheability.

Niagara behaves almost identically to UltraSPARC-III in this area.
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-03 22:04 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: David Howells, akpm, mingo, jblunck, bcrl, matthew, linux-kernel, linux-arch, linuxppc64-dev

On Sat, 4 Mar 2006, Benjamin Herrenschmidt wrote:
>
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the word by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is will not order between
> cacheable and non-cacheable storage.

If so, a simple write barrier should be sufficient. That's exactly what
the x86 write barriers do too, ie stores to magic IO space are _not_
ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
spin_unlock()) at all.

On x86, we actually have this "CONFIG_X86_OOSTORE" configuration option
that gets enabled if you select a WINCHIP device, because that allows a
weaker memory ordering for normal memory too, and that will end up
using an "sfence" instruction for store buffers. But it's not normally
enabled.

So the eieio should be sufficient, then.

Of course, the x86 store buffers do tend to flush out stuff after a
certain cycle-delay too, so there may be drivers that technically are
buggy on x86, but where the store buffer in practice is small and
flushes out quickly enough that you'll never _see_ the bug.

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some case eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and wether
> the storage space is cacheable or not.

Well, the regular kernel ops definitely won't depend on that, since
that's not the case anywhere else.
		Linus
* Re: Memory barriers and spin_unlock safety
From: Paul Mackerras @ 2006-03-04 10:58 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

Benjamin Herrenschmidt writes:

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some case eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and wether
> the storage space is cacheable or not.

Eieio has to go to the PCI host bridge because it is supposed to
prevent write-combining, both in the host bridge and in the CPU.

Paul.
* Re: Memory barriers and spin_unlock safety
From: Benjamin Herrenschmidt @ 2006-03-04 22:49 UTC (permalink / raw)
To: Paul Mackerras
Cc: Linus Torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Sat, 2006-03-04 at 21:58 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> > think in some case eieio barriers can propagate to the chipset to
> > enforce ordering there too depending on some voodoo settings and wether
> > the storage space is cacheable or not.
>
> Eieio has to go to the PCI host bridge because it is supposed to
> prevent write-combining, both in the host bridge and in the CPU.

That can be disabled with HID bits tho ;)

Ben.
* Re: Memory barriers and spin_unlock safety
From: Paul Mackerras @ 2006-03-04 10:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

Linus Torvalds writes:

> PPC has an absolutely _horrible_ memory ordering implementation, as far as
> I can tell. The thing is broken. I think it's just implementation
> breakage, not anything really fundamental, but the fact that their write
> barriers are expensive is a big sign that they are doing something bad.

An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
wmb() be a sync though, because it seemed that there were drivers that
expected wmb() to provide an ordering between a write to memory and a
write to an MMIO register. If that is a bogus assumption then we could
make wmb() lighter-weight (after auditing all the drivers we're
interested in, of course, ...).

And in a subsequent message:

> If so, a simple write barrier should be sufficient. That's exactly what
> the x86 write barriers do too, ie stores to magic IO space are _not_
> ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> spin_unlock()) at all.

By magic IO space, do you mean just any old memory-mapped device
register in a PCI device, or do you mean something else?

Paul.
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-04 17:28 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Sat, 4 Mar 2006, Paul Mackerras wrote:
>
> > If so, a simple write barrier should be sufficient. That's exactly what
> > the x86 write barriers do too, ie stores to magic IO space are _not_
> > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> > spin_unlock()) at all.
>
> By magic IO space, do you mean just any old memory-mapped device
> register in a PCI device, or do you mean something else?

Any old memory-mapped device that has been marked as write-combining in
the MTRR's or page tables.

So the rules from the PC side (and like it or not, they end up being
what all the drivers are tested with) are:

 - regular stores are ordered by write barriers
 - PIO stores are always synchronous
 - MMIO stores are ordered by IO semantics
 - PCI ordering must be honored:
   * write combining is only allowed on PCI memory resources
     that are marked prefetchable. If your host bridge does write
     combining in general, it's not a "host bridge", it's a "host
     disaster".
   * for others, writes can always be posted, but they cannot
     be re-ordered wrt either reads or writes to that device
     (ie a read will always be fully synchronizing)
 - io_wmb must be honored

In addition, it will help a hell of a lot if you follow the PC notion
of "per-region extra rules", ie you'd default to the non-prefetchable
behaviour even for areas that are prefetchable from a PCI standpoint,
but allow some way to relax the ordering rules in various ways.
PC's use MTRR's or page table hints for this, but it's actually
perfectly possible to do it by virtual address (ie decide at
"ioremap()" time by looking at some bits that you've saved away to
remap it to a certain virtual address range, and then use the virtual
address as a hint for readl/writel whether you need to serialize or
not).

On x86, we already use the "virtual address" trick to distinguish
between PIO and MMIO for the newer ioread/iowrite interface (the older
inb/outb/readb/writeb interfaces obviously don't need that, since the
IO space is statically encoded in the function call itself).

The reason I mention the MTRR emulation is again just purely
compatibility with drivers that get 99.9% of all the testing on a PC
platform.

		Linus
* Re: Memory barriers and spin_unlock safety
From: Paul Mackerras @ 2006-03-08 3:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

Linus Torvalds writes:

> So the rules from the PC side (and like it or not, they end up being
> what all the drivers are tested with) are:
>
>  - regular stores are ordered by write barriers

I thought regular stores were always ordered anyway?

>  - PIO stores are always synchronous

By synchronous, do you mean ordered with respect to all other accesses
(regular memory, MMIO, prefetchable MMIO, PIO)?

In other words, if I store a value in regular memory, then do an outb()
to a device, and the device does a DMA read to the location I just
stored to, is the device guaranteed to see the value I just stored
(assuming no other later store to the location)?

>  - MMIO stores are ordered by IO semantics
>  - PCI ordering must be honored:
>    * write combining is only allowed on PCI memory resources
>      that are marked prefetchable. If your host bridge does write
>      combining in general, it's not a "host bridge", it's a "host
>      disaster".

Presumably the host bridge doesn't know what sort of PCI resource is
mapped at a given address, so that information (whether the resource is
prefetchable) must come from the CPU, which would get it from the TLB
entry or an MTRR entry - is that right? Or is there some gentleman's
agreement between the host bridge and the BIOS that certain address
ranges are only used for certain types of PCI memory resources?

>    * for others, writes can always be posted, but they cannot
>      be re-ordered wrt either reads or writes to that device
>      (ie a read will always be fully synchronizing)
>  - io_wmb must be honored

What ordering is there between stores to regular memory and stores to
non-prefetchable MMIO?
If a store to regular memory can be performed before a store to MMIO,
does a wmb() suffice to enforce an ordering, or do you have to use
mmiowb()?

Do PCs ever use write-through caching on prefetchable MMIO resources?

Thanks,
Paul.
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-08 3:54 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Wed, 8 Mar 2006, Paul Mackerras wrote:
>
> Linus Torvalds writes:
>
> > So the rules from the PC side (and like it or not, they end up being
> > what all the drivers are tested with) are:
> >
> >  - regular stores are ordered by write barriers
>
> I thought regular stores were always ordered anyway?

For the hw, yes. For the compiler, no.

So you actually do end up needing write barriers even on x86. It won't
compile to any actual _instruction_, but it will be a compiler barrier
(ie it just ends up being an empty inline asm that "modifies" memory).

So forgetting the wmb() is a bug even on x86, unless you happen to
program in assembly. Of course, the x86 hw semantics _do_ mean that
forgetting it is less likely to cause problems, just because the
compiler re-ordering is fairly unlikely most of the time.

> >  - PIO stores are always synchronous
>
> By synchronous, do you mean ordered with respect to all other accesses
> (regular memory, MMIO, prefetchable MMIO, PIO)?

Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus.
I don't think PCI bridges are supposed to post PIO writes, but a x86
CPU basically won't stall for them forever. I _think_ they'll wait for
it to hit that external bus, though.

So it's totally serializing in the sense that all preceding reads have
completed and all preceding writes have hit the cache-coherency point,
but you don't necessarily know when the write itself will hit the
device (the write will return before that necessarily happens).
> In other words, if I store a value in regular memory, then do an
> outb() to a device, and the device does a DMA read to the location I
> just stored to, is the device guaranteed to see the value I just
> stored (assuming no other later store to the location)?

Yes, assuming that the DMA read is in response to (ie causally related
to) the write.

> >  - MMIO stores are ordered by IO semantics
> >  - PCI ordering must be honored:
> >    * write combining is only allowed on PCI memory resources
> >      that are marked prefetchable. If your host bridge does write
> >      combining in general, it's not a "host bridge", it's a "host
> >      disaster".
>
> Presumably the host bridge doesn't know what sort of PCI resource is
> mapped at a given address, so that information (whether the resource
> is prefetchable) must come from the CPU, which would get it from the
> TLB entry or an MTRR entry - is that right?

Correct. Although it could of course be a map in the host bridge
itself, not on the CPU. If the host bridge doesn't know, then the host
bridge had better not combine, or the CPU had better tell it not to
combine, using something like a "sync" instruction that causes bus
traffic.

Either of those approaches is likely a performance disaster, so you do
want to have the CPU and/or host bridge do this all automatically for
you. Which is what the PC world does.

> Or is there some gentleman's agreement between the host bridge and the
> BIOS that certain address ranges are only used for certain types of
> PCI memory resources?

Not that I know. I _think_ all of the PC world just depends on the CPU
doing the write combining, and the CPU knows thanks to MTRR's and page
tables. But I could well imagine that there is some situation where the
logic is further out.

> What ordering is there between stores to regular memory and stores to
> non-prefetchable MMIO?

Non-prefetchable MMIO will be in-order on x86 wrt regular memory
(unless you use one of the non-temporal stores).
To get out-of-order stores you have to use a special MTRR setting (mtrr
type "WC" for "write combining"). Or possibly non-temporal writes to an
uncached area. I don't think we do.

> If a store to regular memory can be performed before a store to MMIO,
> does a wmb() suffice to enforce an ordering, or do you have to use
> mmiowb()?

On x86, MMIO normally doesn't need memory barriers either for the
normal case (see above). We don't even need the compiler barrier,
because we use a "volatile" pointer for that, telling the compiler to
keep its hands off.

> Do PCs ever use write-through caching on prefetchable MMIO resources?

Basically only for frame buffers, with MTRR rules (and while
write-through is an option, normally you'd use "write-combining",
which doesn't cache at all, but write combines in the write buffers
and writes the combined results out to the bus - there's usually
something like four or eight write buffers of up to a cacheline in
size for combining).

Yeah, I realize this can be awkward. PC's actually get good performance
(ie they normally can easily fill the bus bandwidth) _and_ the sw
doesn't even need to do anything. That's what you get from several
decades of hw tweaking with a fixed - or almost-fixed - software base.

I _like_ PC's. Almost every other architecture decided to be lazy in
hw, and put the onus on the software to tell it what was right. The PC
platform hardware competition didn't allow for the "let's recompile the
software" approach, so the hardware does it all for you. Very well too.

It does make it somewhat hard for other platforms.

		Linus
* Re: Memory barriers and spin_unlock safety
From: Alan Cox @ 2006-03-08 13:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> don't think PCI bridges are supposed to post PIO writes, but a x86 CPU
> basically won't stall for them forever.

The bridges I have will stall forever. You can observe this directly if
an IDE device decides to hang the IORDY line on the IDE cable or you
crash the GPU on an S3 card.

Alan
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-08 15:30 UTC (permalink / raw)
To: Alan Cox
Cc: Paul Mackerras, David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Wed, 8 Mar 2006, Alan Cox wrote:
>
> On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> > Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> > don't think PCI bridges are supposed to post PIO writes, but a x86 CPU
> > basically won't stall for them forever.
>
> The bridges I have will stall forever. You can observe this directly if
> an IDE device decides to hang the IORDY line on the IDE cable or you
> crash the GPU on an S3 card.

Ok. The only thing I have tested is the timing of "outb()" on its own,
which is definitely long enough that it clearly waits for _some_ bus
activity (ie the CPU doesn't just post the write internally), but I
don't know exactly what the rules are as far as the core itself is
concerned: I suspect the core just waits until it has hit the
northbridge or something.

In contrast, a MMIO write to a WC region at least will not necessarily
pause the core at all: it just hits the write queue in the core, and
the core continues on (and may generate other writes that will be
combined in the write buffers before the first one even hits the bus).

		Linus
* Re: Memory barriers and spin_unlock safety
From: Michael Buesch @ 2006-03-05 2:04 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck, Linus Torvalds

On Saturday 04 March 2006 11:58, you wrote:
> Linus Torvalds writes:
>
> > PPC has an absolutely _horrible_ memory ordering implementation, as far as
> > I can tell. The thing is broken. I think it's just implementation
> > breakage, not anything really fundamental, but the fact that their write
> > barriers are expensive is a big sign that they are doing something bad.
>
> An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
> wmb() be a sync though, because it seemed that there were drivers that
> expected wmb() to provide an ordering between a write to memory and a
> write to an MMIO register. If that is a bogus assumption then we
> could make wmb() lighter-weight (after auditing all the drivers we're
> interested in, of course, ...).

In the bcm43xx driver there is code which looks like the following:

	/* Write some coherent DMA memory */
	wmb();
	/* Write MMIO, which depends on the DMA memory
	 * write to be finished. */

Are the assumptions in this code correct? Is wmb() the correct thing to
do here? I heavily tested this code on PPC UP and did not see any
anomaly, yet.

-- 
Greetings Michael.