* Memory barriers and spin_unlock safety
@ 2006-03-03 16:03 David Howells
2006-03-03 16:45 ` David Howells
2006-03-03 16:55 ` Linus Torvalds
0 siblings, 2 replies; 32+ messages in thread
From: David Howells @ 2006-03-03 16:03 UTC (permalink / raw)
To: torvalds, akpm, mingo, jblunck, bcrl, matthew
Cc: linux-kernel, linux-arch, linuxppc64-dev
Hi,
We've just had an interesting discussion on IRC, and it has left two
unanswered questions:
(1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where ?FENCE
instructions are available?
Consider the following case, where you want to do two reads effectively
atomically, and so wrap them in a spinlock:
spin_lock(&mtx);
a = *A;
b = *B;
spin_unlock(&mtx);
On x86 Pentium3+ and x86_64, what's to stop you from getting the reads
done after the unlock since there's no LFENCE instruction there to stop
you?
What you'd expect is:
LOCK WRITE mtx
--> implies MFENCE
READ *A } which may be reordered
READ *B }
WRITE mtx
But what you might get instead is this:
LOCK WRITE mtx
--> implies MFENCE
WRITE mtx
--> implies SFENCE
READ *A } which may be reordered
READ *B }
There doesn't seem to be anything that says that the reads can't leak
outside of the locked section; at least, there doesn't seem to be in AMD's
system programming manual for AMD64 (book 2, section 7.1).
Writes on the other hand may not happen out of order, so changing things
inside a critical section would seem to be okay.
On PowerPC, on the other hand, the barriers have to be made explicit
because they're not implied by LWARX/STWCX or by ordinary stores:
LWARX mtx
STWCX mtx
ISYNC
READ *A } which may be reordered
READ *B }
LWSYNC
WRITE mtx
So, should the spin_unlock() on i386 and x86_64 be doing an LFENCE
instruction before unlocking?
(2) What is the minimum functionality that can be expected of a memory
barrier? I was of the opinion that all we could expect is for the CPU
executing one of them to force the instructions it is executing to be
complete up to a point - depending on the type of barrier - before
continuing past it.
On Pentiums, x86_64 and FRV this seems to be exactly what you get for a
barrier; there doesn't seem to be any external evidence of it that appears
on the bus, other than that the CPU does a load of memory transactions.
However, on ppc/ppc64, it seems to be more thorough than that, and there
seems to be some special interaction between the CPU processing the
instruction and the other CPUs in the system. It's not entirely obvious
from the manual just what this does.
As I understand it, Andrew Morton is of the opinion that issuing a read
barrier on one CPU will cause the other CPUs in the system to sync up, but
that doesn't look likely on all archs.
David
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:03 Memory barriers and spin_unlock safety David Howells
@ 2006-03-03 16:45 ` David Howells
  2006-03-03 17:03   ` Linus Torvalds
  2006-03-03 20:02   ` Arjan van de Ven
  2006-03-03 16:55 ` Linus Torvalds
  1 sibling, 2 replies; 32+ messages in thread

From: David Howells @ 2006-03-03 16:45 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, jblunck, bcrl, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> 	WRITE mtx
> 	--> implies SFENCE

Actually, I'm not sure this is true. The AMD64 Instruction Manual's
writeup of SFENCE implies that writes can be reordered, which sort of
contradicts what the AMD64 System Programming Manual says.

If this isn't true, then x86_64 at least should do MFENCE before the
store in spin_unlock() or change the store to be LOCK'ed. The same may
also apply for Pentium3+ class CPUs with the i386 arch.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:45 ` David Howells
@ 2006-03-03 17:03   ` Linus Torvalds
  2006-03-03 20:17     ` David Howells
  2006-03-03 20:02   ` Arjan van de Ven
  1 sibling, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 17:03 UTC (permalink / raw)
To: David Howells
Cc: akpm, mingo, jblunck, bcrl, matthew, linux-arch, linuxppc64-dev,
    linux-kernel

On Fri, 3 Mar 2006, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > 	WRITE mtx
> > 	--> implies SFENCE
>
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's
> writeup of SFENCE implies that writes can be reordered, which sort of
> contradicts what the AMD64 System Programming Manual says.

Note that _normal_ writes never need an SFENCE, because they are ordered
by the core.

The reason to use SFENCE is because of _special_ writes.

For example, if you use a non-temporal store, then the write buffer
ordering goes away, because there is no write buffer involved (the store
goes directly to the L2 or outside the bus).

Or when you talk to weakly ordered memory (ie a frame buffer that isn't
cached, and where the MTRR memory ordering bits say that writes be done
speculatively), you may want to say "I'm going to do the store that
starts the graphics pipeline, all my previous stores need to be done
now".

THAT is when you need to use SFENCE.

So SFENCE really isn't about the "smp_wmb()" kind of fencing at all.
It's about the much weaker ordering that is allowed by the special IO
memory types and nontemporal instructions.

(Actually, I think one special case of non-temporal instruction is the
"repeat movs/stos" thing: I think you should _not_ use a "repeat stos"
to unlock a spinlock, exactly because those stores are not ordered wrt
each other, and they can bypass the write queue. Of course, doing that
would be insane anyway, so no harm done ;^).

> If this isn't true, then x86_64 at least should do MFENCE before the
> store in spin_unlock() or change the store to be LOCK'ed. The same may
> also apply for Pentium3+ class CPUs with the i386 arch.

No. But if you want to make sure, you can always check with Intel
engineers. I'm pretty sure I have this right, though, because Intel
engineers have certainly looked at Linux sources and locking, and nobody
has ever said that we'd need an SFENCE.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 17:03 ` Linus Torvalds
@ 2006-03-03 20:17   ` David Howells
  2006-03-03 21:34     ` Linus Torvalds
  2006-03-07 17:36     ` David Howells
  0 siblings, 2 replies; 32+ messages in thread

From: David Howells @ 2006-03-03 20:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:

> Note that _normal_ writes never need an SFENCE, because they are
> ordered by the core.
>
> The reason to use SFENCE is because of _special_ writes.

I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(),
and that only io_wmb() should have that.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 20:17 ` David Howells
@ 2006-03-03 21:34   ` Linus Torvalds
  2006-03-03 21:51     ` Benjamin LaHaise
  0 siblings, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 21:34 UTC (permalink / raw)
To: David Howells
Cc: akpm, ak, mingo, jblunck, bcrl, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

On Fri, 3 Mar 2006, David Howells wrote:
>
> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(),
> and that only io_wmb() should have that.

Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie
just compile to a "barrier()" (and not even that on UP, of course).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 21:34 ` Linus Torvalds
@ 2006-03-03 21:51   ` Benjamin LaHaise
  2006-03-03 22:21     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread

From: Benjamin LaHaise @ 2006-03-03 21:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, ak, mingo, jblunck, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

On Fri, Mar 03, 2006 at 01:34:17PM -0800, Linus Torvalds wrote:
> Indeed. I think smp_wmb() should be a compiler fence only on x86(-64),
> ie just compile to a "barrier()" (and not even that on UP, of course).

Actually, no. At least in testing an implementation of Dekker's and
Peterson's algorithms as a replacement for the locked operation in our
spinlocks, it is absolutely necessary to have an sfence in the lock to
ensure the lock is visible to the other CPU before proceeding. I'd use
smp_wmb() as the fence is completely unnecessary on UP and is even
irq-safe. Here's a copy of the Peterson's implementation to illustrate
(it works, it's just slower than the existing spinlocks).

		-ben

diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h
index fe484a6..45bd386 100644
--- a/include/asm-x86_64/spinlock.h
+++ b/include/asm-x86_64/spinlock.h
@@ -4,6 +4,8 @@
 #include <asm/atomic.h>
 #include <asm/rwlock.h>
 #include <asm/page.h>
+#include <asm/pda.h>
+#include <asm/processor.h>
 #include <linux/config.h>
 
 /*
@@ -18,50 +20,53 @@
  */
 
 #define __raw_spin_is_locked(x) \
-		(*(volatile signed int *)(&(x)->slock) <= 0)
-
-#define __raw_spin_lock_string \
-	"\n1:\t" \
-	"lock ; decl %0\n\t" \
-	"js 2f\n" \
-	LOCK_SECTION_START("") \
-	"2:\t" \
-	"rep;nop\n\t" \
-	"cmpl $0,%0\n\t" \
-	"jle 2b\n\t" \
-	"jmp 1b\n" \
-	LOCK_SECTION_END
-
-#define __raw_spin_unlock_string \
-	"movl $1,%0" \
-		:"=m" (lock->slock) : : "memory"
+		((*(volatile signed int *)(x) & ~0xff) != 0)
 
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+	int cpu = read_pda(cpunumber);
+
+	barrier();
+	lock->flags[cpu] = 1;
+	lock->turn = cpu ^ 1;
+	barrier();
+
+	asm volatile("sfence":::"memory");
+
+	while (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+		cpu_relax();
+		barrier();
+	}
 }
 
 #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 {
-	int oldval;
-
-	__asm__ __volatile__(
-		"xchgl %0,%1"
-		:"=q" (oldval), "=m" (lock->slock)
-		:"0" (0) : "memory");
-
-	return oldval > 0;
+	int cpu = read_pda(cpunumber);
+	barrier();
+	if (__raw_spin_is_locked(lock))
+		return 0;
+
+	lock->flags[cpu] = 1;
+	lock->turn = cpu ^ 1;
+	asm volatile("sfence":::"memory");
+
+	if (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+		lock->flags[cpu] = 0;
+		barrier();
+		return 0;
+	}
+	return 1;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_unlock_string
-	);
+	int cpu;
+	//asm volatile("lfence":::"memory");
+	cpu = read_pda(cpunumber);
+	lock->flags[cpu] = 0;
+	barrier();
 }
 
 #define __raw_spin_unlock_wait(lock) \
diff --git a/include/asm-x86_64/spinlock_types.h b/include/asm-x86_64/spinlock_types.h
index 59efe84..a409cbf 100644
--- a/include/asm-x86_64/spinlock_types.h
+++ b/include/asm-x86_64/spinlock_types.h
@@ -6,10 +6,11 @@
 #endif
 
 typedef struct {
-	volatile unsigned int slock;
+	volatile unsigned char turn;
+	volatile unsigned char flags[3];
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 0, { 0, } }
 
 typedef struct {
 	volatile unsigned int lock;

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 21:51 ` Benjamin LaHaise
@ 2006-03-03 22:21   ` Linus Torvalds
  2006-03-03 22:36     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 22:21 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: David Howells, Andrew Morton, ak, mingo, jblunck, matthew,
    linux-arch, linuxppc64-dev, Linux Kernel Mailing List

On Fri, 3 Mar 2006, Benjamin LaHaise wrote:
>
> Actually, no. At least in testing an implementation of Dekker's and
> Peterson's algorithms as a replacement for the locked operation in
> our spinlocks, it is absolutely necessary to have an sfence in the
> lock to ensure the lock is visible to the other CPU before proceeding.

I suspect you have some bug in your implementation. I think Dekker's
algorithm depends on the reads and writes being ordered, and you don't
seem to do that.

The thing is, you pretty much _have_ to be wrong, because the x86-64
memory ordering rules are _exactly_ the same as for x86, and we've had
that simple store as an unlock for a long long time.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 22:21 ` Linus Torvalds
@ 2006-03-03 22:36   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 22:36 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: David Howells, Andrew Morton, ak, mingo, jblunck, matthew,
    linux-arch, linuxppc64-dev, Linux Kernel Mailing List

On Fri, 3 Mar 2006, Linus Torvalds wrote:
>
> I suspect you have some bug in your implementation. I think Dekker's
> algorithm depends on the reads and writes being ordered, and you don't
> seem to do that.

IOW, I think you need a full memory barrier after the "lock->turn = cpu
^ 1;" and you should have a "smp_rmb()" in between your reads of
"lock->flags[cpu ^ 1]" and "lock->turn" to give the ordering that Dekker
(or Peterson) expects.

IOW, the code should be something like

	lock->flags[other] = 1;
	smp_wmb();
	lock->turn = other;
	smp_mb();
	while (lock->turn == cpu) {
		smp_rmb();
		if (!lock->flags[other])
			break;
	}

where the wmb's are no-ops on x86, but the rmb's certainly are not.

I _suspect_ that the fact that it starts working with an 'sfence' in
there somewhere is just because the sfence ends up being "serializing
enough" that it just happens to work, but that it has nothing to do with
the current kernel wmb() being wrong.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 20:17 ` David Howells
  2006-03-03 21:34   ` Linus Torvalds
@ 2006-03-07 17:36   ` David Howells
  2006-03-07 17:40     ` Matthew Wilcox
  2006-03-07 18:18     ` Alan Cox
  1 sibling, 2 replies; 32+ messages in thread

From: David Howells @ 2006-03-07 17:36 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(),
> and that only io_wmb() should have that.

Hmmm... We don't actually have io_wmb()... Should the following be added
to all archs?

	io_mb()
	io_rmb()
	io_wmb()

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 17:36 ` David Howells
@ 2006-03-07 17:40   ` Matthew Wilcox
  2006-03-07 17:56     ` Jesse Barnes
  0 siblings, 1 reply; 32+ messages in thread

From: Matthew Wilcox @ 2006-03-07 17:40 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, akpm, ak, mingo, jblunck, bcrl, linux-arch,
    linuxppc64-dev, linux-kernel

On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > I suspect, then, that x86_64 should not have an SFENCE for
> > smp_wmb(), and that only io_wmb() should have that.
>
> Hmmm... We don't actually have io_wmb()... Should the following be
> added to all archs?
>
> 	io_mb()
> 	io_rmb()
> 	io_wmb()

it's spelled mmiowb(), and reads from IO space are synchronous, so don't
need barriers.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 17:40 ` Matthew Wilcox
@ 2006-03-07 17:56   ` Jesse Barnes
  0 siblings, 0 replies; 32+ messages in thread

From: Jesse Barnes @ 2006-03-07 17:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Howells, Linus Torvalds, akpm, ak, mingo, jblunck, bcrl,
    linux-arch, linuxppc64-dev, linux-kernel

On Tuesday, March 7, 2006 9:40 am, Matthew Wilcox wrote:
> On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> > David Howells <dhowells@redhat.com> wrote:
> > > I suspect, then, that x86_64 should not have an SFENCE for
> > > smp_wmb(), and that only io_wmb() should have that.
> >
> > Hmmm... We don't actually have io_wmb()... Should the following be
> > added to all archs?
> >
> > 	io_mb()
> > 	io_rmb()
> > 	io_wmb()
>
> it's spelled mmiowb(), and reads from IO space are synchronous, so
> don't need barriers.

To expand on willy's note, the reason it's called mmiowb as opposed to
iowb is because I/O port accesses (inX/outX) are inherently synchronous
and don't need barriers. mmio writes, however (writeX) need barrier
operations to ensure ordering on some platforms.

This raises the question of what semantics the unified I/O mapping
routines have... are ioreadX/iowriteX synchronous or should we define
the barriers you mention above for them? (IIRC ppc64 can use an io read
ordering op).

Jesse

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 17:36 ` David Howells
  2006-03-07 17:40   ` Matthew Wilcox
@ 2006-03-07 18:18   ` Alan Cox
  2006-03-07 18:28     ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread

From: Alan Cox @ 2006-03-07 18:18 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 17:36 +0000, David Howells wrote:
> Hmmm... We don't actually have io_wmb()... Should the following be
> added to all archs?
>
> 	io_mb()
> 	io_rmb()
> 	io_wmb()

What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we actually
need one that can work out what to do for the general io API ?

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 18:18 ` Alan Cox
@ 2006-03-07 18:28   ` Linus Torvalds
  2006-03-07 18:55     ` Alan Cox
  0 siblings, 1 reply; 32+ messages in thread

From: Linus Torvalds @ 2006-03-07 18:28 UTC (permalink / raw)
To: Alan Cox
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Tue, 7 Mar 2006, Alan Cox wrote:
>
> What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we
> actually need one that can work out what to do for the general io
> API ?

The ioread/iowrite things only guarantee the laxer MMIO rules, since it
_might_ be mmio. So you'd use the mmio barriers.

In fact, I would suggest that architectures that can do PIO in a more
relaxed manner (x86 cannot, since all the serialization is in hardware)
would do even a PIO in the more relaxed ordering (ie writes can at least
be posted, but obviously not merged, since that would be against PCI
specs).

x86 tends to serialize PIO too much (I think at least Intel CPU's will
actually wait for the PIO write to be acknowledged by _something_ on the
bus, although it obviously can't wait for the device to have acted on
it).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 18:28 ` Linus Torvalds
@ 2006-03-07 18:55   ` Alan Cox
  2006-03-07 20:21     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread

From: Alan Cox @ 2006-03-07 18:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 10:28 -0800, Linus Torvalds wrote:
> x86 tends to serialize PIO too much (I think at least Intel CPU's will
> actually wait for the PIO write to be acknowledged by _something_ on
> the bus, although it obviously can't wait for the device to have acted
> on it).

Don't bet on that 8(

In the PCI case the I/O write appears to be acked by the bridges used on
x86 when the write completes on the PCI bus and then back to the CPU.
MMIO is thankfully posted. At least that's how the timings on some
devices look.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-07 18:55 ` Alan Cox
@ 2006-03-07 20:21   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-07 20:21 UTC (permalink / raw)
To: Alan Cox
Cc: David Howells, akpm, ak, mingo, jblunck, bcrl, matthew,
    linux-arch, linuxppc64-dev, linux-kernel

On Tue, 7 Mar 2006, Alan Cox wrote:
>
> In the PCI case the I/O write appears to be acked by the bridges used
> on x86 when the write completes on the PCI bus and then back to the
> CPU. MMIO is thankfully posted. At least that's how the timings on
> some devices look.

Oh, absolutely. I'm saying that you shouldn't wait for even that, since
it's totally pointless (it's not synchronized _anyway_) and adds
absolutely zero gain. To really synchronize, you need to read from the
device anyway.

So the "wait for bus activity" is just making PIO slower for no good
reason, and keeps the core waiting when it could do something more
useful.

On an x86, there are legacy reasons to do it (people expect certain
timings). But that was what I was saying - on non-x86 architectures,
there's no reason for the ioread/iowrite interfaces to be as
serializing as the old-fashioned PIO ones are. Might as well do the
MMIO rules for a non-cacheable region: no re-ordering, but no waiting
for the bus either.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:45 ` David Howells
  2006-03-03 17:03   ` Linus Torvalds
@ 2006-03-03 20:02   ` Arjan van de Ven
  1 sibling, 0 replies; 32+ messages in thread

From: Arjan van de Ven @ 2006-03-03 20:02 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, jblunck, bcrl, matthew, linux-arch,
    linuxppc64-dev, linux-kernel

On Fri, 2006-03-03 at 16:45 +0000, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > 	WRITE mtx
> > 	--> implies SFENCE
>
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's
> writeup of SFENCE implies that writes can be reordered, which sort of
> contradicts what the AMD64 System Programming Manual says.

there are 2 or 3 special instructions which do "non temporal stores"
(movntq and movnti and maybe one more). sfence is designed for those.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:03 Memory barriers and spin_unlock safety David Howells
  2006-03-03 16:45 ` David Howells
@ 2006-03-03 16:55 ` Linus Torvalds
  2006-03-03 20:15   ` David Howells
                      ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 16:55 UTC (permalink / raw)
To: David Howells
Cc: akpm, mingo, jblunck, bcrl, matthew, linux-kernel, linux-arch,
    linuxppc64-dev

On Fri, 3 Mar 2006, David Howells wrote:
>
> We've just had an interesting discussion on IRC, and it has left two
> unanswered questions:
>
> (1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where
>     ?FENCE instructions are available?
>
>     Consider the following case, where you want to do two reads
>     effectively atomically, and so wrap them in a spinlock:
>
>	spin_lock(&mtx);
>	a = *A;
>	b = *B;
>	spin_unlock(&mtx);
>
>     On x86 Pentium3+ and x86_64, what's to stop you from getting the
>     reads done after the unlock since there's no LFENCE instruction
>     there to stop you?

The rules are, afaik, that reads can pass buffered writes, BUT WRITES
CANNOT PASS READS (aka "writes to memory are always carried out in
program order").

IOW, reads can bubble up, but writes cannot. So the way I read the Intel
rules is that "passing" is always about being done earlier than
otherwise allowed, not about being done later. (You only "pass" somebody
in traffic when you go ahead of them. If you fall behind them, you don't
"pass" them, _they_ pass you).

Now, this is not so much meant to be a semantic argument (the meaning of
the word "pass") as to an explanation of what I believe Intel meant,
since we know from Intel designers that the simple non-atomic write is
supposedly a perfectly fine unlock instruction.

So when Intel says "reads can be carried out speculatively and in any
order", that just says that reads are not ordered wrt other _reads_.
They _are_ ordered wrt other writes, but only one way: they can pass an
earlier write, but they can't fall back behind a later one.

This is consistent with

 (a) optimization (you want to do reads _early_, not late)

 (b) behaviour (we've been told that a single write is sufficient, with
     the exception of an early P6 core revision)

 (c) at least one way of reading the documentation.

And I claim that (a) and (b) are the important parts, and that (c) is
just the rationale.

> (2) What is the minimum functionality that can be expected of a memory
>     barrier? I was of the opinion that all we could expect is for the
>     CPU executing one of them to force the instructions it is
>     executing to be complete up to a point - depending on the type of
>     barrier - before continuing past it.

Well, no. You should expect even _less_.

The core can continue doing things past a barrier. For example, a write
barrier may not actually serialize anything at all: the sane way of
doing write barriers is to just put a note in the write-queue, and that
note just disallows write queue entries from being moved around it. So
you might have a write barrier with two writes on either side, and the
writes might _both_ be outstanding wrt the core despite the barrier.

So there's not necessarily any synchronization at all on an execution
core level, just a partial ordering between the resulting actions of the
core.

> However, on ppc/ppc64, it seems to be more thorough than that, and
> there seems to be some special interaction between the CPU processing
> the instruction and the other CPUs in the system. It's not entirely
> obvious from the manual just what this does.

PPC has an absolutely _horrible_ memory ordering implementation, as far
as I can tell. The thing is broken. I think it's just implementation
breakage, not anything really fundamental, but the fact that their write
barriers are expensive is a big sign that they are doing something bad.

For example, their write buffers may not have a way to serialize in the
buffers, and at that point from an _implementation_ standpoint, you just
have to serialize the whole core to make sure that writes don't pass
each other.

> As I understand it, Andrew Morton is of the opinion that issuing a
> read barrier on one CPU will cause the other CPUs in the system to
> sync up, but that doesn't look likely on all archs.

No. Issuing a read barrier on one CPU will do absolutely _nothing_ on
the other CPU. All barriers are purely local to one CPU, and do not
generate any bus traffic what-so-ever. They only potentially affect the
order of bus traffic due to the instructions around them (obviously).

So a read barrier on one CPU _has_ to be paired with a write barrier on
the other side in order to make sense (although the write barrier can
obviously be of the implied kind, ie a lock/unlock event, or just
architecture-specific knowledge of write behaviour, ie for example
knowing that writes are always seen in-order on x86).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:55 ` Linus Torvalds
@ 2006-03-03 20:15   ` David Howells
  2006-03-03 21:31     ` Linus Torvalds
  2006-03-03 21:06   ` Benjamin Herrenschmidt
  2006-03-04 10:58   ` Paul Mackerras
  2 siblings, 1 reply; 32+ messages in thread

From: David Howells @ 2006-03-03 20:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, mingo, jblunck, bcrl, matthew, linux-kernel,
    linux-arch, linuxppc64-dev

Linus Torvalds <torvalds@osdl.org> wrote:

> The rules are, afaik, that reads can pass buffered writes, BUT WRITES
> CANNOT PASS READS (aka "writes to memory are always carried out in
> program order").

So in the example I gave, a read after the spin_unlock() may actually
get executed before the store in the spin_unlock(), but a read before
the unlock will not get executed after.

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on
> the other CPU.

Well, I think you mean will guarantee absolutely _nothing_ on the other
CPU for the Linux kernel. According to the IBM powerpc book I have, it
does actually do something on the other CPUs, though it doesn't say
exactly what.

Anyway, thanks. I'll write up some documentation on barriers for
inclusion in the kernel.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 20:15 ` David Howells
@ 2006-03-03 21:31   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread

From: Linus Torvalds @ 2006-03-03 21:31 UTC (permalink / raw)
To: David Howells
Cc: akpm, mingo, jblunck, bcrl, matthew, linux-kernel, linux-arch,
    linuxppc64-dev

On Fri, 3 Mar 2006, David Howells wrote:
>
> So in the example I gave, a read after the spin_unlock() may actually
> get executed before the store in the spin_unlock(), but a read before
> the unlock will not get executed after.

Yes.

> > No. Issuing a read barrier on one CPU will do absolutely _nothing_
> > on the other CPU.
>
> Well, I think you mean will guarantee absolutely _nothing_ on the
> other CPU for the Linux kernel. According to the IBM powerpc book I
> have, it does actually do something on the other CPUs, though it
> doesn't say exactly what.

Yeah, Power really does have some funky stuff in their memory ordering.
I'm not quite sure why, though. And it definitely isn't implied by any
of the Linux kernel barriers.

(They also do TLB coherency in hw etc strange things).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
  2006-03-03 16:55 ` Linus Torvalds
  2006-03-03 20:15   ` David Howells
@ 2006-03-03 21:06   ` Benjamin Herrenschmidt
  2006-03-03 21:18     ` Hollis Blanchard
                        ` (2 more replies)
  2006-03-04 10:58   ` Paul Mackerras
  2 siblings, 3 replies; 32+ messages in thread

From: Benjamin Herrenschmidt @ 2006-03-03 21:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, mingo, jblunck, bcrl, matthew, linux-kernel,
    linux-arch, linuxppc64-dev

> PPC has an absolutely _horrible_ memory ordering implementation, as
> far as I can tell. The thing is broken. I think it's just
> implementation breakage, not anything really fundamental, but the fact
> that their write barriers are expensive is a big sign that they are
> doing something bad.

Are they ? read barriers and full barriers are, write barriers should be
fairly cheap (but then, I haven't measured).

> For example, their write buffers may not have a way to serialize in
> the buffers, and at that point from an _implementation_ standpoint,
> you just have to serialize the whole core to make sure that writes
> don't pass each other.

The main problem I've had in the past with the ppc barriers is more a
subtle thing in the spec that unfortunately was taken to the word by
implementors, and is that the simple write barrier (eieio) will only
order within the same storage space, that is will not order between
cacheable and non-cacheable storage. That means IOs could leak out of
locks etc... Which is why we use expensive barriers in MMIO wrappers for
now (though we might investigate the use of mmioXb instead in the
future).

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on
> the other CPU. All barriers are purely local to one CPU, and do not
> generate any bus traffic what-so-ever. They only potentially affect
> the order of bus traffic due to the instructions around them
> (obviously).

Actually, the ppc's full barrier (sync) will generate bus traffic, and I
think in some cases eieio barriers can propagate to the chipset to
enforce ordering there too depending on some voodoo settings and whether
the storage space is cacheable or not.

> So a read barrier on one CPU _has_ to be paired with a write barrier
> on the other side in order to make sense (although the write barrier
> can obviously be of the implied kind, ie a lock/unlock event, or just
> architecture-specific knowledge of write behaviour, ie for example
> knowing that writes are always seen in-order on x86).
>
> 		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Memory barriers and spin_unlock safety
From: Hollis Blanchard @ 2006-03-03 21:18 UTC (permalink / raw)
To: linuxppc64-dev
Cc: Benjamin Herrenschmidt, Linus Torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, jblunck

On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote:
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the word by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is will not order between
> cacheable and non-cacheable storage.

I've heard Sparc has the same issue... in which case it may not be a
"chip designer was too literal" thing, but rather it really simplifies
chip implementation to do it that way.

-- 
Hollis Blanchard
IBM Linux Technology Center
* Re: Memory barriers and spin_unlock safety
From: David S. Miller @ 2006-03-03 21:52 UTC (permalink / raw)
To: hollisb
Cc: linuxppc64-dev, benh, torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, jblunck

From: Hollis Blanchard <hollisb@us.ibm.com>
Date: Fri, 3 Mar 2006 15:18:13 -0600

> I've heard Sparc has the same issue... in which case it may not be a "chip
> designer was too literal" thing, but rather it really simplifies chip
> implementation to do it that way.

There is a "membar #MemIssue" that is meant to deal with this should it
ever matter, but for most sparc64 chips it doesn't, which is why we
don't use that memory barrier type at all in the Linux kernel.

For UltraSPARC-I and II it technically could matter in Relaxed Memory
Ordering (RMO) mode, which is what we run the kernel and 64-bit
userspace in, but I've never seen an issue resulting from it.

For UltraSPARC-III and later, the chip only implements the Total Store
Ordering (TSO) memory model, and the manual explicitly states that
cacheable and non-cacheable memory operations are ordered, even using
language such as "there is an implicit 'membar #MemIssue' between
them". It further goes on to say:

	The UltraSPARC III Cu processor maintains ordering between
	cacheable and non-cacheable accesses. The UltraSPARC III Cu
	processor maintains TSO ordering between memory references
	regardless of their cacheability.

Niagara behaves almost identically to UltraSPARC-III in this area.
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-03 22:04 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: David Howells, akpm, mingo, jblunck, bcrl, matthew, linux-kernel, linux-arch, linuxppc64-dev

On Sat, 4 Mar 2006, Benjamin Herrenschmidt wrote:
>
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the word by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is will not order between
> cacheable and non-cacheable storage.

If so, a simple write barrier should be sufficient. That's exactly what
the x86 write barriers do too, ie stores to magic IO space are _not_
ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
spin_unlock()) at all.

On x86, we actually have this "CONFIG_X86_OOSTORE" configuration option
that gets enabled if you select a WINCHIP device, because that allows a
weaker memory ordering for normal memory too, and that will end up
using an "sfence" instruction for store buffers. But it's not normally
enabled.

So the eieio should be sufficient, then.

Of course, the x86 store buffers do tend to flush out stuff after a
certain cycle-delay too, so there may be drivers that technically are
buggy on x86, but where the store buffer in practice is small and
flushes out quickly enough that you'll never _see_ the bug.

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some case eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and wether
> the storage space is cacheable or not.

Well, the regular kernel ops definitely won't depend on that, since
that's not the case anywhere else.
		Linus
* Re: Memory barriers and spin_unlock safety
From: Paul Mackerras @ 2006-03-04 10:58 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

Benjamin Herrenschmidt writes:

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some case eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and wether
> the storage space is cacheable or not.

Eieio has to go to the PCI host bridge because it is supposed to
prevent write-combining, both in the host bridge and in the CPU.

Paul.
* Re: Memory barriers and spin_unlock safety
From: Benjamin Herrenschmidt @ 2006-03-04 22:49 UTC (permalink / raw)
To: Paul Mackerras
Cc: Linus Torvalds, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Sat, 2006-03-04 at 21:58 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> > think in some case eieio barriers can propagate to the chipset to
> > enforce ordering there too depending on some voodoo settings and wether
> > the storage space is cacheable or not.
>
> Eieio has to go to the PCI host bridge because it is supposed to
> prevent write-combining, both in the host bridge and in the CPU.

That can be disabled with HID bits tho ;)

Ben.
* Re: Memory barriers and spin_unlock safety
From: Paul Mackerras @ 2006-03-04 10:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

Linus Torvalds writes:

> PPC has an absolutely _horrible_ memory ordering implementation, as far as
> I can tell. The thing is broken. I think it's just implementation
> breakage, not anything really fundamental, but the fact that their write
> barriers are expensive is a big sign that they are doing something bad.

An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
wmb() be a sync though, because it seemed that there were drivers that
expected wmb() to provide an ordering between a write to memory and a
write to an MMIO register. If that is a bogus assumption then we could
make wmb() lighter-weight (after auditing all the drivers we're
interested in, of course, ...).

And in a subsequent message:

> If so, a simple write barrier should be sufficient. That's exactly what
> the x86 write barriers do too, ie stores to magic IO space are _not_
> ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> spin_unlock()) at all.

By magic IO space, do you mean just any old memory-mapped device
register in a PCI device, or do you mean something else?

Paul.
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-04 17:28 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Sat, 4 Mar 2006, Paul Mackerras wrote:
>
> > If so, a simple write barrier should be sufficient. That's exactly what
> > the x86 write barriers do too, ie stores to magic IO space are _not_
> > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> > spin_unlock()) at all.
>
> By magic IO space, do you mean just any old memory-mapped device
> register in a PCI device, or do you mean something else?

Any old memory-mapped device that has been marked as write-combining in
the MTRR's or page tables.

So the rules from the PC side (and like it or not, they end up being
what all the drivers are tested with) are:

 - regular stores are ordered by write barriers
 - PIO stores are always synchronous
 - MMIO stores are ordered by IO semantics
 - PCI ordering must be honored:
   * write combining is only allowed on PCI memory resources
     that are marked prefetchable. If your host bridge does write
     combining in general, it's not a "host bridge", it's a "host
     disaster".
   * for others, writes can always be posted, but they cannot
     be re-ordered wrt either reads or writes to that device
     (ie a read will always be fully synchronizing)
 - io_wmb must be honored

In addition, it will help a hell of a lot if you follow the PC notion
of "per-region extra rules", ie you'd default to the non-prefetchable
behaviour even for areas that are prefetchable from a PCI standpoint,
but allow some way to relax the ordering rules in various ways.
PC's use MTRR's or page table hints for this, but it's actually
perfectly possible to do it by virtual address (ie decide at
"ioremap()" time by looking at some bits that you've saved away to
remap it to a certain virtual address range, and then use the virtual
address as a hint for readl/writel whether you need to serialize or
not).

On x86, we already use the "virtual address" trick to distinguish
between PIO and MMIO for the newer ioread/iowrite interface (the older
inb/outb/readb/writeb interfaces obviously don't need that, since the
IO space is statically encoded in the function call itself).

The reason I mention the MTRR emulation is again just purely
compatibility with drivers that get 99.9% of all the testing on a PC
platform.

		Linus
* Re: Memory barriers and spin_unlock safety
From: Paul Mackerras @ 2006-03-08 3:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

Linus Torvalds writes:

> So the rules from the PC side (and like it or not, they end up being
> what all the drivers are tested with) are:
>
>  - regular stores are ordered by write barriers

I thought regular stores were always ordered anyway?

>  - PIO stores are always synchronous

By synchronous, do you mean ordered with respect to all other accesses
(regular memory, MMIO, prefetchable MMIO, PIO)?

In other words, if I store a value in regular memory, then do an outb()
to a device, and the device does a DMA read to the location I just
stored to, is the device guaranteed to see the value I just stored
(assuming no other later store to the location)?

>  - MMIO stores are ordered by IO semantics
>  - PCI ordering must be honored:
>    * write combining is only allowed on PCI memory resources
>      that are marked prefetchable. If your host bridge does write
>      combining in general, it's not a "host bridge", it's a "host
>      disaster".

Presumably the host bridge doesn't know what sort of PCI resource is
mapped at a given address, so that information (whether the resource is
prefetchable) must come from the CPU, which would get it from the TLB
entry or an MTRR entry - is that right? Or is there some gentleman's
agreement between the host bridge and the BIOS that certain address
ranges are only used for certain types of PCI memory resources?

>    * for others, writes can always be posted, but they cannot
>      be re-ordered wrt either reads or writes to that device
>      (ie a read will always be fully synchronizing)
>  - io_wmb must be honored

What ordering is there between stores to regular memory and stores to
non-prefetchable MMIO?
If a store to regular memory can be performed before a store to MMIO,
does a wmb() suffice to enforce an ordering, or do you have to use
mmiowb()?

Do PCs ever use write-through caching on prefetchable MMIO resources?

Thanks,
Paul.
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-08 3:54 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Wed, 8 Mar 2006, Paul Mackerras wrote:
>
> Linus Torvalds writes:
>
> > So the rules from the PC side (and like it or not, they end up being
> > what all the drivers are tested with) are:
> >
> >  - regular stores are ordered by write barriers
>
> I thought regular stores were always ordered anyway?

For the hw, yes. For the compiler, no.

So you actually do end up needing write barriers even on x86. It won't
compile to any actual _instruction_, but it will be a compiler barrier
(ie it just ends up being an empty inline asm that "modifies" memory).

So forgetting the wmb() is a bug even on x86, unless you happen to
program in assembly. Of course, the x86 hw semantics _do_ mean that
forgetting it is less likely to cause problems, just because the
compiler re-ordering is fairly unlikely most of the time.

> >  - PIO stores are always synchronous
>
> By synchronous, do you mean ordered with respect to all other accesses
> (regular memory, MMIO, prefetchable MMIO, PIO)?

Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus.
I don't think PCI bridges are supposed to post PIO writes, but a x86
CPU basically won't stall for them forever. I _think_ they'll wait for
it to hit that external bus, though.

So it's totally serializing in the sense that all preceding reads have
completed and all preceding writes have hit the cache-coherency point,
but you don't necessarily know when the write itself will hit the
device (the write will return before that necessarily happens).
> In other words, if I store a value in regular memory, then do an
> outb() to a device, and the device does a DMA read to the location I
> just stored to, is the device guaranteed to see the value I just
> stored (assuming no other later store to the location)?

Yes, assuming that the DMA read is in response to (ie causally related
to) the write.

> >  - MMIO stores are ordered by IO semantics
> >  - PCI ordering must be honored:
> >    * write combining is only allowed on PCI memory resources
> >      that are marked prefetchable. If your host bridge does write
> >      combining in general, it's not a "host bridge", it's a "host
> >      disaster".
>
> Presumably the host bridge doesn't know what sort of PCI resource is
> mapped at a given address, so that information (whether the resource
> is prefetchable) must come from the CPU, which would get it from the
> TLB entry or an MTRR entry - is that right?

Correct. Although it could of course be a map in the host bridge
itself, not on the CPU. If the host bridge doesn't know, then the host
bridge had better not combine, or the CPU had better tell it not to
combine, using something like a "sync" instruction that causes bus
traffic.

Either of those approaches is likely a performance disaster, so you do
want to have the CPU and/or host bridge do this all automatically for
you. Which is what the PC world does.

> Or is there some gentleman's agreement between the host bridge and the
> BIOS that certain address ranges are only used for certain types of
> PCI memory resources?

Not that I know. I _think_ all of the PC world just depends on the CPU
doing the write combining, and the CPU knows thanks to MTRR's and page
tables. But I could well imagine that there is some situation where the
logic is further out.

> What ordering is there between stores to regular memory and stores to
> non-prefetchable MMIO?

Non-prefetchable MMIO will be in-order on x86 wrt regular memory
(unless you use one of the non-temporal stores).
To get out-of-order stores you have to use a special MTRR setting (mtrr
type "WC" for "write combining"). Or possibly non-temporal writes to an
uncached area. I don't think we do.

> If a store to regular memory can be performed before a store to MMIO,
> does a wmb() suffice to enforce an ordering, or do you have to use
> mmiowb()?

On x86, MMIO normally doesn't need memory barriers either for the
normal case (see above). We don't even need the compiler barrier,
because we use a "volatile" pointer for that, telling the compiler to
keep its hands off.

> Do PCs ever use write-through caching on prefetchable MMIO resources?

Basically only for frame buffers, with MTRR rules (and while
write-through is an option, normally you'd use "write-combining",
which doesn't cache at all, but write combines in the write buffers
and writes the combined results out to the bus - there's usually
something like four or eight write buffers of up to a cacheline in
size for combining).

Yeah, I realize this can be awkward. PC's actually get good performance
(ie they normally can easily fill the bus bandwidth) _and_ the sw
doesn't even need to do anything. That's what you get from several
decades of hw tweaking with a fixed - or almost-fixed - software base.

I _like_ PC's. Almost every other architecture decided to be lazy in
hw, and put the onus on the software to tell it what was right. The PC
platform hardware competition didn't allow for the "let's recompile the
software" approach, so the hardware does it all for you. Very well too.

It does make it somewhat hard for other platforms.

		Linus
* Re: Memory barriers and spin_unlock safety
From: Alan Cox @ 2006-03-08 13:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> don't think PCI bridges are supposed to post PIO writes, but a x86 CPU
> basically won't stall for them forever.

The bridges I have will stall forever. You can observe this directly if
an IDE device decides to hang the IORDY line on the IDE cable or you
crash the GPU on an S3 card.

Alan
* Re: Memory barriers and spin_unlock safety
From: Linus Torvalds @ 2006-03-08 15:30 UTC (permalink / raw)
To: Alan Cox
Cc: Paul Mackerras, David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck

On Wed, 8 Mar 2006, Alan Cox wrote:
>
> On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> > Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> > don't think PCI bridges are supposed to post PIO writes, but a x86 CPU
> > basically won't stall for them forever.
>
> The bridges I have will stall forever. You can observe this directly if
> an IDE device decides to hang the IORDY line on the IDE cable or you
> crash the GPU on an S3 card.

Ok. The only thing I have tested is the timing of "outb()" on its own,
which is definitely long enough that it clearly waits for _some_ bus
activity (ie the CPU doesn't just post the write internally), but I
don't know exactly what the rules are as far as the core itself is
concerned: I suspect the core just waits until it has hit the
northbridge or something.

In contrast, a MMIO write to a WC region at least will not necessarily
pause the core at all: it just hits the write queue in the core, and
the core continues on (and may generate other writes that will be
combined in the write buffers before the first one even hits the bus).

		Linus
* Re: Memory barriers and spin_unlock safety
From: Michael Buesch @ 2006-03-05 2:04 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Howells, akpm, linux-arch, bcrl, matthew, linux-kernel, mingo, linuxppc64-dev, jblunck, Linus Torvalds

On Saturday 04 March 2006 11:58, you wrote:
> Linus Torvalds writes:
>
> > PPC has an absolutely _horrible_ memory ordering implementation, as far as
> > I can tell. The thing is broken. I think it's just implementation
> > breakage, not anything really fundamental, but the fact that their write
> > barriers are expensive is a big sign that they are doing something bad.
>
> An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
> wmb() be a sync though, because it seemed that there were drivers that
> expected wmb() to provide an ordering between a write to memory and a
> write to an MMIO register. If that is a bogus assumption then we
> could make wmb() lighter-weight (after auditing all the drivers we're
> interested in, of course, ...).

In the bcm43xx driver there is code which looks like the following:

	/* Write some coherent DMA memory */
	wmb();
	/* Write MMIO, which depends on the DMA memory
	 * write to be finished. */

Are the assumptions in this code correct? Is wmb() the correct thing to
do here? I heavily tested this code on PPC UP and did not see any
anomaly, yet.

-- 
Greetings Michael.