* [PATCH] Document Linux's memory barriers
@ 2006-03-07 17:40 David Howells
From: David Howells @ 2006-03-07 17:40 UTC (permalink / raw)
To: torvalds, akpm, mingo; +Cc: linux-arch, linuxppc64-dev, linux-kernel
The attached patch documents the Linux kernel's memory barriers.
Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 mb.diff
Documentation/memory-barriers.txt | 359 ++++++++++++++++++++++++++++++++++++++
1 files changed, 359 insertions(+)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..c2fc51b
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,359 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implied kernel memory barriers.
+
+ (*) i386 and x86_64 arch specific notes.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose a
+partial ordering between the memory access operations specified either side of
+the barrier.
+
+Older and less complex CPUs will perform memory accesses in exactly the order
+specified, so if one is given the following piece of code:
+
+ a = *A;
+ *B = b;
+ c = *C;
+ d = *D;
+ *E = e;
+
+It can be guaranteed that it will complete the memory access for each
+instruction before moving on to the next line, leading to a definite sequence
+of operations on the bus:
+
+ read *A, write *B, read *C, read *D, write *E.
+
+However, with newer and more complex CPUs, this isn't always true because:
+
+ (*) they can rearrange the order of the memory accesses to promote better use
+ of the CPU buses and caches;
+
+ (*) reads are synchronous and may need to be done immediately to permit
+ progress, whereas writes can often be deferred without a problem;
+
+ (*) and they are able to combine reads and writes to improve performance when
+ talking to the SDRAM (modern SDRAM chips can do batched accesses of
+ adjacent locations, cutting down on transaction setup costs).
+
+So what you might actually get from the above piece of code is:
+
+ read *A, read *C+*D, write *E, write *B
+
+Under normal operation, this is probably not going to be a problem; however,
+there are two circumstances where it definitely _can_ be a problem:
+
+ (1) I/O
+
+ Many I/O devices can be memory mapped, and so appear to the CPU as if
+ they're just memory locations. However, to control the device, the driver
+ has to make the right accesses in exactly the right order.
+
+ Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+ presents to the CPU an "address register" and a bunch of "data registers".
+ The way it's accessed is to write the index of the internal register you
+ want to access to the address register, and then read or write the
+ appropriate data register to access the chip's internal register:
+
+ *ADR = ctl_reg_3;
+ reg = *DATA;
+
+ The problem with a clever CPU or a clever compiler is that the write to
+ the address register isn't guaranteed to happen before the access to the
+ data register. If the CPU or the compiler decides it is more efficient to
+ defer the address write, so that the accesses actually happen in the
+ order:
+
+ read *DATA, write *ADR
+
+ then things will break.
+
+ The way to deal with this is to insert an I/O memory barrier between the
+ two accesses:
+
+ *ADR = ctl_reg_3;
+ mb();
+ reg = *DATA;
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete.
+
+ (2) Multiprocessor interaction
+
+ When there's a system with more than one processor, these may be working
+ on the same set of data, but attempting not to use locks as locks are
+ quite expensive. This means that accesses that affect both CPUs may have
+ to be carefully ordered to prevent error.
+
+ Consider the R/W semaphore slow path. There, a waiting process is
+ queued on the semaphore: it has a record on its stack that is linked
+ into the semaphore's list of waiters:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+ To wake up the waiter, the up_read() or up_write() functions have to read
+ the pointer from this record to know where the next waiter record is,
+ clear the task pointer, call wake_up_process() on the task, and release
+ the task struct reference held:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ If any of these steps occur out of order, then the whole thing may fail.
+
+ Note that the waiter does not take the semaphore lock again - it just
+ waits for its task pointer to be cleared. Since the record is on the
+ waiter's stack, this means that if the task pointer is cleared _before_
+ the next pointer in the list is read, another CPU might start processing
+ the waiter and might clobber its stack before the up*() function has a
+ chance to read the next pointer.
+
+ CPU 0 CPU 1
+ =============================== ===============================
+ down_xxx()
+ Queue waiter
+ Sleep
+ up_yyy()
+ READ waiter->task;
+ WRITE waiter->task;
+ <preempt>
+ Resume processing
+ down_xxx() returns
+ call foo()
+ foo() clobbers *waiter
+ </preempt>
+ READ waiter->list.next;
+ --- OOPS ---
+
+ This could be dealt with using a spinlock, but then the down_xxx()
+ function has to get the spinlock again after it's been woken up, which is
+ a waste of resources.
+
+ The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ smp_mb();
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete.
+
+ SMP memory barriers are normally no-ops on a UP system because the CPU
+ orders overlapping accesses with respect to itself.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic memory barriers:
+
+ MANDATORY (I/O) SMP
+ =============== ================
+ GENERAL mb() smp_mb()
+ READ rmb() smp_rmb()
+ WRITE wmb() smp_wmb()
+
+General memory barriers make a guarantee that all memory accesses specified
+before the barrier will happen before all memory accesses specified after the
+barrier.
+
+Read memory barriers make a guarantee that all memory reads specified before
+the barrier will happen before all memory reads specified after the barrier.
+
+Write memory barriers make a guarantee that all memory writes specified before
+the barrier will happen before all memory writes specified after the barrier.
+
+SMP memory barriers are no-ops on uniprocessor compiled systems because it is
+assumed that a CPU will be self-consistent, and will order overlapping accesses
+with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in the access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order in which the first CPU commits its accesses to the
+bus.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ Both assign the value to the variable and then insert a barrier after
+ the assignment: a full memory barrier in the case of set_mb(), and a
+ write barrier in the case of set_wmb().
+
+
+==============================
+IMPLIED KERNEL MEMORY BARRIERS
+==============================
+
+Some of the other functions in the Linux kernel imply memory barriers. For
+instance, all of the following (pseudo-)locking functions imply barriers:
+
+ (*) interrupt disablement and/or interrupts
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants of a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+ Memory accesses issued after the LOCK will be completed after the LOCK
+ accesses have completed.
+
+ Memory accesses issued before the LOCK may be completed after the LOCK
+ accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+ Memory accesses issued before the UNLOCK will be completed before the
+ UNLOCK accesses have completed.
+
+ Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+ accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+ The LOCK accesses will be completed before the UNLOCK accesses.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do
+anything at all, especially with respect to I/O memory barriering.
+
+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
+memory and I/O accesses individually, or interrupt handling will barrier
+memory and I/O accesses on entry and on exit. This prevents an interrupt
+routine interfering with accesses made in a disabled-interrupt section of code
+and vice versa.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events on the bus is acceptable:
+
+ LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
+
+But none of the following are:
+
+ *F+*A, *B, LOCK, *C, *D, UNLOCK, *E
+ *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+ *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+ *B, LOCK, *C, *D, UNLOCK, *F+*A, *E
+
+
+Consider also the following (going back to the AMD PCnet example):
+
+ DISABLE IRQ
+ *ADR = ctl_reg_3;
+ mb();
+ x = *DATA;
+ *ADR = ctl_reg_4;
+ mb();
+ *DATA = y;
+ *ADR = ctl_reg_5;
+ mb();
+ z = *DATA;
+ ENABLE IRQ
+ <interrupt>
+ *ADR = ctl_reg_7;
+ mb();
+ q = *DATA
+ </interrupt>
+
+What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
+wrong register? (There's no guarantee that the process of handling an
+interrupt will barrier memory accesses in any way).
+
+
+==============================
+I386 AND X86_64 SPECIFIC NOTES
+==============================
+
+Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
+bus appear in program order - and so there's no requirement for any sort of
+explicit memory barriers.
+
+The Pentium III and later CPUs added three new memory barrier instructions -
+SFENCE with the Pentium III, and LFENCE and MFENCE with the Pentium 4 - which
+correspond to the kernel memory barrier functions wmb(), rmb() and mb()
+respectively. However, there are additional implicit memory barriers in the
+CPU implementation:
+
+ (*) Interrupt processing implies mb().
+
+ (*) The LOCK prefix adds implication of mb() on whatever instruction it is
+ attached to.
+
+ (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
+ required].
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+ after that write, but reads after a write may complete before the write
+ (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+ of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered: in general, its read and write accesses may be
+completed in any order. Its memory barriers are also to some extent more
+substantial than the minimum requirement, and may directly affect hardware
+outside of the CPU.
* Re: [PATCH] Document Linux's memory barriers
From: Andi Kleen @ 2006-03-07 10:34 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tuesday 07 March 2006 18:40, David Howells wrote:

> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:
> +
> + a = *A;
> + *B = b;
> + c = *C;
> + d = *D;
> + *E = e;
> +
> +It can be guaranteed that it will complete the memory access for each
> +instruction before moving on to the next line, leading to a definite sequence
> +of operations on the bus:

Actually gcc is free to reorder it (often it will not when it cannot prove
that they don't alias, but sometimes it can).

> +
> + Consider, for example, an ethernet chipset such as the AMD PCnet32. It
> + presents to the CPU an "address register" and a bunch of "data registers".
> + The way it's accessed is to write the index of the internal register you
> + want to access to the address register, and then read or write the
> + appropriate data register to access the chip's internal register:
> +
> + *ADR = ctl_reg_3;
> + reg = *DATA;

You're not supposed to do it this way anyways. The official way to access
MMIO space is using read/write[bwlq].

Haven't read all of it sorry, but thanks for the work of documenting it.

-Andi
* Re: [PATCH] Document Linux's memory barriers
From: David Howells @ 2006-03-07 18:30 UTC (permalink / raw)
To: Andi Kleen
Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

Andi Kleen <ak@suse.de> wrote:

> Actually gcc is free to reorder it
> (often it will not when it cannot prove that they don't alias, but sometimes
> it can)

Yeah... I have mentioned the fact that compilers can reorder too, but
obviously not enough.

> You're not supposed to do it this way anyways. The official way to access
> MMIO space is using read/write[bwlq]

True, I suppose. I should make it clear that these accessor functions imply
memory barriers, if indeed they do, and that you should use them rather than
accessing I/O registers directly (at least, outside the arch you should).

David
* Re: [PATCH] Document Linux's memory barriers
From: Andi Kleen @ 2006-03-07 11:13 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tuesday 07 March 2006 19:30, David Howells wrote:

> > You're not supposed to do it this way anyways. The official way to access
> > MMIO space is using read/write[bwlq]
>
> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,

I don't think they do.

> and that you should use them rather than
> accessing I/O registers directly (at least, outside the arch you should).

Even inside the architecture it's a good idea.

-Andi
* Re: [PATCH] Document Linux's memory barriers
From: David Howells @ 2006-03-07 19:24 UTC (permalink / raw)
To: Andi Kleen, Stephen Hemminger, Jesse Barnes
Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

Andi Kleen <ak@suse.de> wrote:

> > > You're not supposed to do it this way anyways. The official way to access
> > > MMIO space is using read/write[bwlq]
> >
> > True, I suppose. I should make it clear that these accessor functions imply
> > memory barriers, if indeed they do,
>
> I don't think they do.

Hmmm... Seems Stephen Hemminger disagrees:

| > > 1) Access to i/o mapped memory does not need memory barriers.
| >
| > There's no guarantee of that. On FRV you have to insert barriers as
| > appropriate when you're accessing I/O mapped memory if ordering is required
| > (accessing an ethernet card vs accessing a frame buffer), but support for
| > inserting the appropriate barriers is built into gcc - which knows the rules
| > for when to insert them.
| >
| > Or are you referring to the fact that this should be implicit in inX(),
| > outX(), readX(), writeX() and similar?
|
| yes

David
* Re: [PATCH] Document Linux's memory barriers
From: Jesse Barnes @ 2006-03-07 18:46 UTC (permalink / raw)
To: David Howells
Cc: Andi Kleen, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tuesday, March 7, 2006 10:30 am, David Howells wrote:

> True, I suppose. I should make it clear that these accessor functions
> imply memory barriers, if indeed they do, and that you should use them
> rather than accessing I/O registers directly (at least, outside the
> arch you should).

But they don't, that's why we have mmiowb(). There are lots of cases to
handle:

 1) memory vs. memory
 2) memory vs. I/O
 3) I/O vs. I/O

(reads and writes for every case). AFAIK, we have (1) fairly well handled
with a plethora of barrier ops. (2) is a bit fuzzy with the current
operations I think, and for (3) all we have is mmiowb() afaik.

Maybe one of the ppc64 guys can elaborate on the barriers their hw needs
for the above cases (I think they're the pathological case, so covering
them should be good enough for everybody).

Btw, thanks for putting together this documentation, it's desperately
needed.

Jesse
* Re: [PATCH] Document Linux's memory barriers
From: Bryan O'Sullivan @ 2006-03-07 19:23 UTC (permalink / raw)
To: David Howells
Cc: Andi Kleen, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:

> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,

They don't, but according to Documentation/DocBook/deviceiobook.tmpl they
are performed by the compiler in the order specified.

They also convert between PCI byte order and CPU byte order. If you want
to avoid that, you need the __raw_* versions, which are not guaranteed to
be provided by all arches.

<b
* Re: [PATCH] Document Linux's memory barriers
From: Andi Kleen @ 2006-03-07 11:57 UTC (permalink / raw)
To: Bryan O'Sullivan
Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote:
> On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
> > True, I suppose. I should make it clear that these accessor functions imply
> > memory barriers, if indeed they do,
>
> They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> they are performed by the compiler in the order specified.

I don't think that's correct. Probably the documentation should be fixed.

-Andi
* Re: [PATCH] Document Linux's memory barriers
From: Jesse Barnes @ 2006-03-07 20:01 UTC (permalink / raw)
To: Andi Kleen
Cc: Bryan O'Sullivan, David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tuesday, March 7, 2006 3:57 am, Andi Kleen wrote:
> On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote:
> > On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
> > > True, I suppose. I should make it clear that these accessor
> > > functions imply memory barriers, if indeed they do,
> >
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.

On ia64 I'm pretty sure it's true, and it seems like it should be in the
general case too. The compiler shouldn't reorder uncached memory accesses
with volatile semantics...

Jesse
* Re: [PATCH] Document Linux's memory barriers
From: Bryan O'Sullivan @ 2006-03-07 21:14 UTC (permalink / raw)
To: Andi Kleen
Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > > True, I suppose. I should make it clear that these accessor functions
> > > imply memory barriers, if indeed they do,
> >
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.

That's why I hedged my words with "according to ..." :-)

But on most arches those accesses do indeed seem to happen in-order. On
i386 and x86_64, it's a natural consequence of program store ordering. On
at least some other arches, there are explicit memory barriers in the
implementation of the access macros to force this ordering to occur.

<b
* Re: [PATCH] Document Linux's memory barriers
From: Andi Kleen @ 2006-03-07 21:24 UTC (permalink / raw)
To: Bryan O'Sullivan
Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Tuesday 07 March 2006 22:14, Bryan O'Sullivan wrote:
> On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > > > True, I suppose. I should make it clear that these accessor functions
> > > > imply memory barriers, if indeed they do,
> > >
> > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > > they are performed by the compiler in the order specified.
> >
> > I don't think that's correct. Probably the documentation should
> > be fixed.
>
> That's why I hedged my words with "according to ..." :-)
>
> But on most arches those accesses do indeed seem to happen in-order. On
> i386 and x86_64, it's a natural consequence of program store ordering.

Not true for reads on x86.

-Andi
* Re: [PATCH] Document Linux's memory barriers
From: Alan Cox @ 2006-03-08 0:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Bryan O'Sullivan, David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
> > But on most arches those accesses do indeed seem to happen in-order. On
> > i386 and x86_64, it's a natural consequence of program store ordering.
>
> Not true for reads on x86.

You must have a strange kernel Andi. Mine marks them as volatile unsigned
char * references.

Alan
* Re: [PATCH] Document Linux's memory barriers
From: Alan Cox @ 2006-03-08 0:35 UTC (permalink / raw)
To: Andi Kleen
Cc: Bryan O'Sullivan, David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.

It would be wiser to ensure they are performed in the order specified. As
far as I can see this is currently true due to the volatile cast, and most
drivers rely on this property, so the brown and sticky will impact the
rotating air impeller pretty fast if it isn't.
* Re: [PATCH] Document Linux's memory barriers
From: Alan Cox @ 2006-03-07 18:40 UTC (permalink / raw)
To: David Howells
Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 17:40 +0000, David Howells wrote:

> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:

Not really true. Some of the fairly old dumb processors don't do this to
the bus, and just about anything with a cache won't (as it'll burst cache
lines to main memory).

> + want to access to the address register, and then read or write the
> + appropriate data register to access the chip's internal register:
> +
> + *ADR = ctl_reg_3;
> + reg = *DATA;

Not allowed anyway.

> + In this case, the barrier makes a guarantee that all memory accesses
> + before the barrier will happen before all the memory accesses after the
> + barrier. It does _not_ guarantee that all memory accesses before the
> + barrier will be complete by the time the barrier is complete.

A better meaningful example would be barriers versus an IRQ handler. Which
leads nicely onto section 2.

> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.

No. They guarantee that to an observer also running on that set of
processors the accesses to main memory will appear to be ordered in that
manner. They don't guarantee I/O related ordering for non main memory due
to things like PCI posting rules and NUMA goings on.

As an example of the difference here, a Geode will reorder stores as it
feels but snoop the bus such that it can ensure an external bus master
cannot observe this, by holding it off the bus to fix up ordering
violations first.

> +Read memory barriers make a guarantee that all memory reads specified before
> +the barrier will happen before all memory reads specified after the barrier.
> +
> +Write memory barriers make a guarantee that all memory writes specified before
> +the barrier will happen before all memory writes specified after the barrier.

Both with the caveat above.

> +There is no guarantee that any of the memory accesses specified before a memory
> +barrier will be complete by the completion of a memory barrier; the barrier can
> +be considered to draw a line in the access queue that accesses of the
> +appropriate type may not cross.

CPU generated accesses to main memory.

> + (*) interrupt disablement and/or interrupts
> + (*) spin locks
> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores

Should probably cover schedule() here.

> +Locks and semaphores may not provide any guarantee of ordering on UP compiled
> +systems, and so can't be counted on in such a situation to actually do
> +anything at all, especially with respect to I/O memory barriering.

_irqsave/_irqrestore ...

> +==============================
> +I386 AND X86_64 SPECIFIC NOTES
> +==============================
> +
> +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
> +bus appear in program order - and so there's no requirement for any sort of
> +explicit memory barriers.

Actually they are not. Processors prior to the Pentium Pro ensure that the
perceived ordering between processors of writes to main memory is
preserved. The Pentium Pro is supposed to, but does not in SMP cases. Our
spin_unlock code knows about this. It also has some problems with this
situation when handling write combining memory.

The IDT Winchip series processors are run in out-of-order store mode, and
our lock functions and DMA mappers should know enough about this. On x86,
memory barriers for read serialize order using lock instructions; on
write, the Winchip at least generates serializing instructions. barrier()
is pure CPU level of course.

> + (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
> + required].

Only at an on-processor level, and not for all clones; also there are
errata here for the PPro.

> + (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].

Not always. MMIO ordering is outside of the CPU ordering rules and into
PCI and other bus ordering rules. Consider:

	writel(STOP_DMA, &foodev->ctrl);
	free_dma_buffers(foodev);

This leads to horrible disasters.

> +======================
> +POWERPC SPECIFIC NOTES

Can't comment on PPC.
* Re: [PATCH] Document Linux's memory barriers
From: linux-os (Dick Johnson) @ 2006-03-07 18:54 UTC (permalink / raw)
To: Alan Cox
Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, Linux kernel

On Tue, 7 Mar 2006, Alan Cox wrote:

[SNIPPED...]

> Not always. MMIO ordering is outside of the CPU ordering rules and into
> PCI and other bus ordering rules. Consider
>
> writel(STOP_DMA, &foodev->ctrl);
> free_dma_buffers(foodev);
>
> This leads to horrible disasters.

This might be a good place to document that:

	dummy = readl(&foodev->ctrl);

will flush all pending writes to the PCI bus and that:

	(void) readl(&foodev->ctrl);

... won't, because `gcc` may optimize it away. In fact, variable "dummy"
should be global or `gcc` may make it go away as well.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.50 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.

****************************************************************
The information transmitted in this message is confidential and may be
privileged. Any review, retransmission, dissemination, or other use of
this information by persons or entities other than the intended recipient
is prohibited. If you are not the intended recipient, please notify
Analogic Corporation immediately - by replying to this message or by
sending an email to DeliveryErrors@analogic.com - and destroy all copies
of this information, including any attachments, without reading or
disclosing them. Thank you.
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 18:54 ` linux-os (Dick Johnson) @ 2006-03-07 19:06 ` Matthew Wilcox 2006-03-07 19:15 ` linux-os (Dick Johnson) 2006-03-07 19:33 ` Alan Cox 1 sibling, 1 reply; 89+ messages in thread From: Matthew Wilcox @ 2006-03-07 19:06 UTC (permalink / raw) To: linux-os (Dick Johnson) Cc: Alan Cox, David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, Linux kernel On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote: > This might be a good place to document: > dummy = readl(&foodev->ctrl); > > Will flush all pending writes to the PCI bus and that: > (void) readl(&foodev->ctrl); > ... won't because `gcc` may optimize it away. In fact, variable > "dummy" should be global or `gcc` may make it go away as well. static inline unsigned int readl(const volatile void __iomem *addr) { return *(volatile unsigned int __force *) addr; } The cast is volatile, so gcc knows not to optimise it away. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers
  2006-03-07 19:06               ` Matthew Wilcox
@ 2006-03-07 19:15                 ` linux-os (Dick Johnson)
  0 siblings, 0 replies; 89+ messages in thread
From: linux-os (Dick Johnson) @ 2006-03-07 19:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alan Cox, David Howells, torvalds, akpm, mingo, linux-arch,
	linuxppc64-dev, Linux kernel

On Tue, 7 Mar 2006, Matthew Wilcox wrote:
> On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
>> This might be a good place to document:
>> 	dummy = readl(&foodev->ctrl);
>>
>> Will flush all pending writes to the PCI bus and that:
>> 	(void) readl(&foodev->ctrl);
>> ... won't because `gcc` may optimize it away. In fact, variable
>> "dummy" should be global or `gcc` may make it go away as well.
>
> static inline unsigned int readl(const volatile void __iomem *addr)
> {
>	return *(volatile unsigned int __force *) addr;
> }
>
> The cast is volatile, so gcc knows not to optimise it away.
>

When the assignment is not made (i.e. the result is cast to void), or when
the assignment is made to an otherwise unused variable, `gcc` does indeed
make it go away. These problems caused weeks of chagrin after it was found
that a PCI DMA operation took 20 or more times longer than it should. The
writel(START_DMA, &control), followed by a dummy = readl(&control), ended
up with the readl() missing. That meant that the DMA didn't start until
some timer code read a status register, wondering why the operation hadn't
completed yet.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.50 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.

^ permalink raw reply	[flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 18:54 ` linux-os (Dick Johnson) 2006-03-07 19:06 ` Matthew Wilcox @ 2006-03-07 19:33 ` Alan Cox 1 sibling, 0 replies; 89+ messages in thread From: Alan Cox @ 2006-03-07 19:33 UTC (permalink / raw) To: linux-os (Dick Johnson) Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, Linux kernel On Maw, 2006-03-07 at 13:54 -0500, linux-os (Dick Johnson) wrote: > On Tue, 7 Mar 2006, Alan Cox wrote: > > writel(STOP_DMA, &foodev->ctrl); > > free_dma_buffers(foodev); > > > > This leads to horrible disasters. > > This might be a good place to document: > dummy = readl(&foodev->ctrl); Absolutely. And this falls outside of the memory barrier functions. > > Will flush all pending writes to the PCI bus and that: > (void) readl(&foodev->ctrl); > ... won't because `gcc` may optimize it away. In fact, variable > "dummy" should be global or `gcc` may make it go away as well. If they were ordinary functions then maybe, but they are not so a simple readl(&foodev->ctrl) will be sufficient and isn't optimised away. Alan ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 18:40 ` Alan Cox 2006-03-07 18:54 ` linux-os (Dick Johnson) @ 2006-03-07 20:09 ` David Howells 2006-03-08 0:32 ` Alan Cox 2006-03-08 8:25 ` Duncan Sands 1 sibling, 2 replies; 89+ messages in thread From: David Howells @ 2006-03-07 20:09 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > Better meaningful example would be barriers versus an IRQ handler. Which > leads nicely onto section 2 Yes, except that I can't think of one that's feasible that doesn't have to do with I/O - which isn't a problem if you are using the proper accessor functions. Such an example has to involve more than one CPU, because you don't tend to get memory/memory ordering problems on UP. The obvious one might be circular buffers, except there's no problem there provided you have a memory barrier between accessing the buffer and updating your pointer into it. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers
  2006-03-07 20:09             ` David Howells
@ 2006-03-08  0:32               ` Alan Cox
  2006-03-08  8:25               ` Duncan Sands
  1 sibling, 0 replies; 89+ messages in thread
From: Alan Cox @ 2006-03-08 0:32 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Maw, 2006-03-07 at 20:09 +0000, David Howells wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> > Better meaningful example would be barriers versus an IRQ handler. Which
> > leads nicely onto section 2
>
> Yes, except that I can't think of one that's feasible that doesn't have to do
> with I/O - which isn't a problem if you are using the proper accessor
> functions.

We get them off bus masters for one, and you can construct silly versions
of the other. There are several kernel instances of

	while (*ptr != HAVE_RESPONDED && time_before(jiffies, timeout))
		rmb();

where we wait for the hardware to respond by bus mastering, when it is fast
and doesn't raise an IRQ.

^ permalink raw reply	[flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 20:09 ` David Howells 2006-03-08 0:32 ` Alan Cox @ 2006-03-08 8:25 ` Duncan Sands 2006-03-08 22:06 ` Paul Mackerras 1 sibling, 1 reply; 89+ messages in thread From: Duncan Sands @ 2006-03-08 8:25 UTC (permalink / raw) To: David Howells Cc: Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Tuesday 7 March 2006 21:09, David Howells wrote: > Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > > > Better meaningful example would be barriers versus an IRQ handler. Which > > leads nicely onto section 2 > > Yes, except that I can't think of one that's feasible that doesn't have to do > with I/O - which isn't a problem if you are using the proper accessor > functions. > > Such an example has to involve more than one CPU, because you don't tend to > get memory/memory ordering problems on UP. On UP you at least need compiler barriers, right? You're in trouble if you think you are writing in a certain order, and expect to see the same order from an interrupt handler, but the compiler decided to rearrange the order of the writes... > The obvious one might be circular buffers, except there's no problem there > provided you have a memory barrier between accessing the buffer and updating > your pointer into it. > > David Ciao, Duncan. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 8:25 ` Duncan Sands @ 2006-03-08 22:06 ` Paul Mackerras 2006-03-08 22:24 ` David S. Miller 2006-03-08 22:42 ` Alan Cox 0 siblings, 2 replies; 89+ messages in thread From: Paul Mackerras @ 2006-03-08 22:06 UTC (permalink / raw) To: Duncan Sands Cc: David Howells, akpm, linux-arch, linux-kernel, torvalds, mingo, linuxppc64-dev, Alan Cox Duncan Sands writes: > On UP you at least need compiler barriers, right? You're in trouble if you think > you are writing in a certain order, and expect to see the same order from an > interrupt handler, but the compiler decided to rearrange the order of the writes... I'd be interested to know what the C standard says about whether the compiler can reorder writes that may be visible to a signal handler. An interrupt handler in the kernel is logically equivalent to a signal handler in normal C code. Surely there are some C language lawyers on one of the lists that this thread is going to? Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 22:06 ` Paul Mackerras @ 2006-03-08 22:24 ` David S. Miller 2006-03-08 22:31 ` Linus Torvalds 2006-03-08 22:42 ` Alan Cox 1 sibling, 1 reply; 89+ messages in thread From: David S. Miller @ 2006-03-08 22:24 UTC (permalink / raw) To: paulus Cc: duncan.sands, dhowells, akpm, linux-arch, linux-kernel, torvalds, mingo, linuxppc64-dev, alan From: Paul Mackerras <paulus@samba.org> Date: Thu, 9 Mar 2006 09:06:05 +1100 > I'd be interested to know what the C standard says about whether the > compiler can reorder writes that may be visible to a signal handler. > An interrupt handler in the kernel is logically equivalent to a signal > handler in normal C code. > > Surely there are some C language lawyers on one of the lists that this > thread is going to? Just like for setjmp() I think you have to mark such things as volatile. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers
  2006-03-08 22:24                   ` David S. Miller
@ 2006-03-08 22:31                     ` Linus Torvalds
  0 siblings, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2006-03-08 22:31 UTC (permalink / raw)
  To: David S. Miller
  Cc: paulus, duncan.sands, dhowells, akpm, linux-arch, linux-kernel,
	mingo, linuxppc64-dev, alan

On Wed, 8 Mar 2006, David S. Miller wrote:
>
> Just like for setjmp() I think you have to mark such things
> as volatile.

.. and sig_atomic_t.

		Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 22:06 ` Paul Mackerras 2006-03-08 22:24 ` David S. Miller @ 2006-03-08 22:42 ` Alan Cox 1 sibling, 0 replies; 89+ messages in thread From: Alan Cox @ 2006-03-08 22:42 UTC (permalink / raw) To: Paul Mackerras Cc: Duncan Sands, David Howells, akpm, linux-arch, linux-kernel, torvalds, mingo, linuxppc64-dev On Iau, 2006-03-09 at 09:06 +1100, Paul Mackerras wrote: > I'd be interested to know what the C standard says about whether the > compiler can reorder writes that may be visible to a signal handler. > An interrupt handler in the kernel is logically equivalent to a signal > handler in normal C code. The C standard doesn't have much to say. POSIX has a lot to say and yes it can do this. You do need volatile or store barriers in signal touched code quite often, or for that matter locks POSIX/SuS also has stuff to say about what functions are signal safe and what is not allowed. Alan ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 17:40 [PATCH] Document Linux's memory barriers David Howells 2006-03-07 10:34 ` Andi Kleen 2006-03-07 18:40 ` Alan Cox @ 2006-03-08 2:07 ` Nick Piggin 2006-03-08 3:10 ` Paul Mackerras ` (2 subsequent siblings) 5 siblings, 0 replies; 89+ messages in thread From: Nick Piggin @ 2006-03-08 2:07 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel David Howells wrote: >The attached patch documents the Linux kernel's memory barriers. > >Signed-Off-By: David Howells <dhowells@redhat.com> >--- > > Good :) >+============================== >+IMPLIED KERNEL MEMORY BARRIERS >+============================== >+ >+Some of the other functions in the linux kernel imply memory barriers. For >+instance all the following (pseudo-)locking functions imply barriers. >+ >+ (*) interrupt disablement and/or interrupts > Is this really the case? I mean interrupt disablement only synchronises with the local CPU, so it probably should not _have_ to imply barriers (eg. some architectures are playing around with "virtual" interrupt disablement). [...] >+ >+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier >+memory and I/O accesses individually, or interrupt handling will barrier >+memory and I/O accesses on entry and on exit. This prevents an interrupt >+routine interfering with accesses made in a disabled-interrupt section of code >+and vice versa. >+ > But CPUs should always be consistent WRT themselves, so I'm not sure that it is needed? Thanks, Nick -- Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 17:40 [PATCH] Document Linux's memory barriers David Howells ` (2 preceding siblings ...) 2006-03-08 2:07 ` Nick Piggin @ 2006-03-08 3:10 ` Paul Mackerras 2006-03-08 3:30 ` Linus Torvalds ` (2 more replies) 2006-03-08 14:37 ` [PATCH] Document Linux's memory barriers [try #2] David Howells 2006-03-08 16:18 ` [PATCH] Document Linux's memory barriers Pavel Machek 5 siblings, 3 replies; 89+ messages in thread From: Paul Mackerras @ 2006-03-08 3:10 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel David Howells writes: > The attached patch documents the Linux kernel's memory barriers. Thanks for venturing into this particular lion's den. :) > +Memory barriers are instructions to both the compiler and the CPU to impose a > +partial ordering between the memory access operations specified either side of > +the barrier. ... as observed from another agent in the system - another CPU or a bus-mastering I/O device. A given CPU will always see its own memory accesses in order. > + (*) reads are synchronous and may need to be done immediately to permit Leave out the "are synchronous and". It's not true. I also think you need to avoid talking about "the bus". Some systems don't have a bus, but rather have an interconnection fabric between the CPUs and the memories. Talking about a bus implies that all memory accesses in fact get serialized (by having to be sent one after the other over the bus) and that you can therefore talk about the order in which they get to memory. In some systems, no such order exists. It's possible to talk sensibly about the order in which memory accesses get done without talking about a bus or requiring a total ordering on the memory access. The PowerPC architecture spec does this by specifying that in certain circumstances one load or store has to be "performed with respect to other processors and mechanisms" before another. 
A load is said to be performed with respect to another agent when a store by
that agent can no longer change the value returned by the load. Similarly, a
store is performed w.r.t. an agent when any load done by the agent will
return the value stored (or a later value).

> +	The way to deal with this is to insert an I/O memory barrier between the
> +	two accesses:
> +
> +		*ADR = ctl_reg_3;
> +		mb();
> +		reg = *DATA;

Ummm, this implies mb() is "an I/O memory barrier". I can see people getting
confused if they read this and then see mb() being used when no I/O is being
done.

> +The Linux kernel has six basic memory barriers:
> +
> +			MANDATORY (I/O)		SMP
> +			===============		================
> +	GENERAL		mb()			smp_mb()
> +	READ		rmb()			smp_rmb()
> +	WRITE		wmb()			smp_wmb()
> +
> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.

By "memory accesses" do you mean accesses to system memory, or do you mean
loads and stores - which may be to system memory, memory on an I/O device
(e.g. a framebuffer) or to memory-mapped I/O registers?

Linus explained recently that wmb() on x86 does not order stores to system
memory w.r.t. stores to prefetchable I/O memory (at least that's what I
think he said ;).

> +Some of the other functions in the linux kernel imply memory barriers. For
> +instance all the following (pseudo-)locking functions imply barriers.
> +
> + (*) interrupt disablement and/or interrupts

Enabling/disabling interrupts doesn't imply a barrier on powerpc, and nor
does taking an interrupt or returning from one.

> + (*) spin locks

I think it's still an open question as to whether spin locks do any ordering
between accesses to system memory and accesses to I/O registers.

> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores
> +
> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> + > + (*) LOCK operation implication: > + > + Memory accesses issued after the LOCK will be completed after the LOCK > + accesses have completed. > + > + Memory accesses issued before the LOCK may be completed after the LOCK > + accesses have completed. > + > + (*) UNLOCK operation implication: > + > + Memory accesses issued before the UNLOCK will be completed before the > + UNLOCK accesses have completed. > + > + Memory accesses issued after the UNLOCK may be completed before the UNLOCK > + accesses have completed. And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but a LOCK followed by an UNLOCK isn't. > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier > +memory and I/O accesses individually, or interrupt handling will barrier > +memory and I/O accesses on entry and on exit. This prevents an interrupt > +routine interfering with accesses made in a disabled-interrupt section of code > +and vice versa. I don't think this is right, and I don't think it is necessary to achieve the end you state, since a CPU will always see its own memory accesses in program order. > +The following sequence of events on the bus is acceptable: > + > + LOCK, *F+*A, *E, *C+*D, *B, UNLOCK What does *F+*A mean? > +Consider also the following (going back to the AMD PCnet example): > + > + DISABLE IRQ > + *ADR = ctl_reg_3; > + mb(); > + x = *DATA; > + *ADR = ctl_reg_4; > + mb(); > + *DATA = y; > + *ADR = ctl_reg_5; > + mb(); > + z = *DATA; > + ENABLE IRQ > + <interrupt> > + *ADR = ctl_reg_7; > + mb(); > + q = *DATA > + </interrupt> > + > +What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the > +wrong register? (There's no guarantee that the process of handling an > +interrupt will barrier memory accesses in any way). Well, the driver should *not* be doing *ADR at all, it should be using read[bwl]/write[bwl]. 
The architecture code has to implement read*/write* in such a way that the accesses generated can't be reordered. I _think_ it also has to make sure the write accesses can't be write-combined, but it would be good to have that clarified. > +====================== > +POWERPC SPECIFIC NOTES > +====================== > + > +The powerpc is weakly ordered, and its read and write accesses may be > +completed generally in any order. It's memory barriers are also to some extent > +more substantial than the mimimum requirement, and may directly effect > +hardware outside of the CPU. Unfortunately mb()/smp_mb() are quite expensive on PowerPC, since the only instruction we have that implies a strong enough barrier is sync, which also performs several other kinds of synchronization, such as waiting until all previous instructions have completed executing to the point where they can no longer cause an exception. Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 3:10 ` Paul Mackerras @ 2006-03-08 3:30 ` Linus Torvalds 2006-03-08 12:34 ` David Howells 2006-03-08 7:41 ` Nick Piggin 2006-03-08 13:19 ` David Howells 2 siblings, 1 reply; 89+ messages in thread From: Linus Torvalds @ 2006-03-08 3:30 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, 8 Mar 2006, Paul Mackerras wrote: > > Linus explained recently that wmb() on x86 does not order stores to > system memory w.r.t. stores to stores to prefetchable I/O memory (at > least that's what I think he said ;). In fact, it won't order stores to normal memory even wrt any _non-prefetchable_ IO memory. PCI (and any other sane IO fabric, for that matter) will do IO posting, so the fact that the CPU _core_ may order them due to a wmb() doesn't actually mean anything. The only way to _really_ synchronize with a store to an IO device is literally to read from that device (*). No amount of memory barriers will do it. So you can really only order stores to regular memory wrt each other, and stores to IO memory wrt each other. For the former, "smp_wmb()" does it. For IO memory, normal IO memory is _always_ supposed to be in program order (at least for PCI. It's part of how the bus is supposed to work), unless the IO range allows prefetching (and you've set some MTRR). And if you do, that, currently you're kind of screwed. mmiowb() should do it, but nobody really uses it, and I think it's broken on x86 (it's a no-op, it really should be an "sfence"). A full "mb()" is probably most likely to work in practice. And yes, we should clean this up. Linus (*) The "read" can of course be any event that tells you that the store has happened - it doesn't necessarily have to be an actual "read[bwl]()" operation. Eg the store might start a command, and when you get the completion interrupt, you obviously know that the store is done, just from a causal reason. 
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 3:30 ` Linus Torvalds @ 2006-03-08 12:34 ` David Howells 2006-03-08 16:40 ` Bryan O'Sullivan 0 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 12:34 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, David Howells, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Linus Torvalds <torvalds@osdl.org> wrote: > > Linus explained recently that wmb() on x86 does not order stores to > > system memory w.r.t. stores to stores to prefetchable I/O memory (at > > least that's what I think he said ;). On i386 and x86_64, do IN and OUT instructions imply MFENCE? It's not obvious from the x86_64 docs. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 12:34 ` David Howells @ 2006-03-08 16:40 ` Bryan O'Sullivan 0 siblings, 0 replies; 89+ messages in thread From: Bryan O'Sullivan @ 2006-03-08 16:40 UTC (permalink / raw) To: David Howells Cc: Linus Torvalds, Paul Mackerras, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, 2006-03-08 at 12:34 +0000, David Howells wrote: > On i386 and x86_64, do IN and OUT instructions imply MFENCE? No. <b ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 3:10 ` Paul Mackerras 2006-03-08 3:30 ` Linus Torvalds @ 2006-03-08 7:41 ` Nick Piggin 2006-03-08 13:19 ` David Howells 2 siblings, 0 replies; 89+ messages in thread From: Nick Piggin @ 2006-03-08 7:41 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Paul Mackerras wrote: > David Howells writes: >>+ The way to deal with this is to insert an I/O memory barrier between the >>+ two accesses: >>+ >>+ *ADR = ctl_reg_3; >>+ mb(); >>+ reg = *DATA; > > > Ummm, this implies mb() is "an I/O memory barrier". I can see people > getting confused if they read this and then see mb() being used when > no I/O is being done. > Isn't it? Why wouldn't you just use smp_mb() if no IO is being done? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 3:10 ` Paul Mackerras 2006-03-08 3:30 ` Linus Torvalds 2006-03-08 7:41 ` Nick Piggin @ 2006-03-08 13:19 ` David Howells 2006-03-08 21:49 ` Paul Mackerras 2 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 13:19 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Paul Mackerras <paulus@samba.org> wrote: > By "memory accesses" do you mean accesses to system memory, or do you > mean loads and stores - which may be to system memory, memory on an I/O > device (e.g. a framebuffer) or to memory-mapped I/O registers? Well, I meant all loads and stores, irrespective of their destination. However, on i386, for example, you've actually got at least two different I/O access domains, and I don't know how they impinge upon each other (IN/OUT vs MOV). > Enabling/disabling interrupts doesn't imply a barrier on powerpc, and > nor does taking an interrupt or returning from one. Surely it ought to, otherwise what's to stop accesses done with interrupts disabled crossing with accesses done inside an interrupt handler? > > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier > ... > I don't think this is right, and I don't think it is necessary to > achieve the end you state, since a CPU will always see its own memory > accesses in program order. But what about a driver accessing some memory that its device is going to observe under irq disablement, and then getting an interrupt immediately after from that same device, the handler for which communicates with the device, possibly then being broken because the CPU hasn't completed all the memory accesses that the driver made while interrupts are disabled? 
Alternatively, might it be possible for communications between two CPUs to be
stuffed because one took an interrupt that also modified common data before
it had committed the memory accesses done under interrupt disablement? This
would suggest using a lock though.

I'm not sure that I can come up with a feasible example for this, but Alan
Cox seems to think that it's a valid problem too.

The only likely way I can see this being a problem is with unordered I/O
writes, which would suggest you have to place an mmiowb() before unlocking
the spinlock in such a case, assuming it is possible to get unordered I/O
writes (which I think it is).

> What does *F+*A mean?

Combined accesses.

> Well, the driver should *not* be doing *ADR at all, it should be using
> read[bwl]/write[bwl].  The architecture code has to implement
> read*/write* in such a way that the accesses generated can't be
> reordered.  I _think_ it also has to make sure the write accesses
> can't be write-combined, but it would be good to have that clarified.

Then what use is mmiowb()?

Surely write combining and out-of-order reads are reasonable for cacheable
devices like framebuffers.

David

^ permalink raw reply	[flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 13:19 ` David Howells @ 2006-03-08 21:49 ` Paul Mackerras 2006-03-08 22:05 ` Alan Cox 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-08 21:49 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel David Howells writes: > > Enabling/disabling interrupts doesn't imply a barrier on powerpc, and > > nor does taking an interrupt or returning from one. > > Surely it ought to, otherwise what's to stop accesses done with interrupts > disabled crossing with accesses done inside an interrupt handler? The rule that the CPU always sees its own loads and stores in program order. If a CPU takes an interrupt after doing some stores, and the interrupt handler does loads from the same location(s), it has to see the new values, even if they haven't got to memory yet. The interrupt isn't special in this situation; if the instruction stream has a store to a location followed by a load from it, the load *has* to see the value stored by the store (assuming no other store to the same location in the meantime, of course). That's true whether or not the CPU takes an exception or interrupt between the store and the load. Anything else would make programming really ... um ... interesting. :) > > > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier > > ... > > I don't think this is right, and I don't think it is necessary to > > achieve the end you state, since a CPU will always see its own memory > > accesses in program order. > > But what about a driver accessing some memory that its device is going to > observe under irq disablement, and then getting an interrupt immediately after > from that same device, the handler for which communicates with the device, > possibly then being broken because the CPU hasn't completed all the memory > accesses that the driver made while interrupts are disabled? 
Well, we have to be clear about what causes what here. Is the device accessing this memory just at a random time, or is the access caused by (in response to) an MMIO store? And what causes the interrupt? Does it just happen to come along at this time or is it in response to one of the stores? If the device accesses to memory are in response to an MMIO store, then the code needs an explicit wmb() between the memory stores and the MMIO store. Disabling interrupts isn't going to help here because the device doesn't see the CPU interrupt enable state. In general it is possible for the CPU to see a different state of memory than the device sees. If the driver needs to be sure that they both see the same view then it needs to use some sort of synchronization. A memory barrier followed by a store to the device, with no further stores to memory until we have an indication from the device that it has received the MMIO store, would be a suitable way to synchronize. Enabling or disabling interrupts does nothing useful here because the device doesn't see that. That applies whether we are in an interrupt routine or not. Do you have a specific scenario in mind, with a particular device and driver? One thing that driver writers do need to be careful about is that if a device writes some data to memory and then causes an interrupt, the fact that the interrupt has reached the CPU and the CPU has invoked the driver's interrupt routine does *not* mean that the data has got to memory from the CPU's point of view. The data could still be queued up in the PCI host bridge or elsewhere. Doing an MMIO read from the device is sufficient to ensure that the CPU will then see the correct data in memory. > Alternatively, might it be possible for communications between two CPUs to be > stuffed because one took an interrupt that also modified common data before > the it had committed the memory accesses done under interrupt disablement? > This would suggest using a lock though. 
Disabling interrupts doesn't do *anything* to help with communication between CPUs. You have to use locks or explicit barriers for that. It is possible for one CPU to see memory accesses done by another CPU in a different order from the program order on the CPU that did the accesses. That applies whether or not some of the accesses were done inside an interrupt routine. > > What does *F+*A mean? > > Combined accesses. Still opaque, sorry: you mean they both happen in some unspecified order? > > Well, the driver should *not* be doing *ADR at all, it should be using > > read[bwl]/write[bwl]. The architecture code has to implement > > read*/write* in such a way that the accesses generated can't be > > reordered. I _think_ it also has to make sure the write accesses > > can't be write-combined, but it would be good to have that clarified. > > Than what use mmiowb()? That was introduced to help some platforms that have difficulty ensuring that MMIO accesses hit the device in the right order, IIRC. I'm still not entirely clear on exactly where it's needed or what guarantees you can rely on if you do or don't use it. > Surely write combining and out-of-order reads are reasonable for cacheable > devices like framebuffers. They are. read*/write* to non-cacheable non-prefetchable MMIO shouldn't be reordered or write-combined, but for prefetchable MMIO I'm not sure whether read*/write* should allow reordering, or whether drivers should use __raw_read/write* if they want that. (Of course, with the __raw_ functions they don't get the endian conversion either...) Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
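Paul's point above about needing an explicit wmb() between the memory stores and the MMIO store that triggers the device can be sketched as follows (the descriptor layout, dev, dma_handle, and the DMA_GO/CTRL_REG names are hypothetical; this is a kernel-style fragment, not a runnable program):

```c
/* Fill in a DMA descriptor in system memory, then ring the doorbell.
 * The wmb() orders the descriptor stores before the MMIO store, so the
 * device cannot observe the doorbell while the descriptor is still
 * incomplete.  Disabling interrupts would do nothing here, because the
 * device never sees the CPU's interrupt-enable state. */
desc->addr = cpu_to_le32(dma_handle);
desc->len  = cpu_to_le32(len);
wmb();					/* descriptor before doorbell */
writel(DMA_GO, dev->regs + CTRL_REG);
```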
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 21:49 ` Paul Mackerras @ 2006-03-08 22:05 ` Alan Cox 0 siblings, 0 replies; 89+ messages in thread From: Alan Cox @ 2006-03-08 22:05 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Iau, 2006-03-09 at 08:49 +1100, Paul Mackerras wrote: > If the device accesses to memory are in response to an MMIO store, > then the code needs an explicit wmb() between the memory stores and > the MMIO store. Disabling interrupts isn't going to help here because > the device doesn't see the CPU interrupt enable state. Interrupts are themselves entirely asynchronous anyway. The following can occur on SMP Pentium-PIII. Device Raise IRQ CPU writel(MASK_IRQ, &dev->ctrl); readl(&dev->ctrl); IRQ arrives CPU specific IRQ masking is synchronous, but IRQ delivery is not, including IPI delivery (which is asynchronous and not guaranteed to occur only once per IPI but can be replayed in obscure cases on x86). ^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH] Document Linux's memory barriers [try #2] 2006-03-07 17:40 [PATCH] Document Linux's memory barriers David Howells ` (3 preceding siblings ...) 2006-03-08 3:10 ` Paul Mackerras @ 2006-03-08 14:37 ` David Howells 2006-03-08 14:55 ` Alan Cox 2006-03-08 19:37 ` [PATCH] Document Linux's memory barriers [try #3] David Howells 2006-03-08 16:18 ` [PATCH] Document Linux's memory barriers Pavel Machek 5 siblings, 2 replies; 89+ messages in thread From: David Howells @ 2006-03-08 14:37 UTC (permalink / raw) To: torvalds, akpm, mingo, alan; +Cc: linux-arch, linuxppc64-dev, linux-kernel The attached patch documents the Linux kernel's memory barriers. I've updated it from the comments I've been given. Note that the per-arch notes sections are gone because it's clear that there are so many exceptions that it's not worth having them. I've added a list of references to other documents. I've tried to get rid of the concept of memory accesses appearing on the bus; what matters is apparent behaviour with respect to other observers in the system. I'm not sure that any mention of interrupts vs interrupt disablement should be retained... it's unclear that there is actually anything that guarantees that stuff won't leak out of an interrupt-disabled section and into an interrupt handler. Paul Mackerras says this isn't valid on powerpc, and looking at the code seems to confirm that, barring implicit enforcement by the CPU. Signed-Off-By: David Howells <dhowells@redhat.com> --- warthog>diffstat -p1 /tmp/mb.diff Documentation/memory-barriers.txt | 589 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 589 insertions(+) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt new file mode 100644 index 0000000..1340c8d --- /dev/null +++ b/Documentation/memory-barriers.txt @@ -0,0 +1,589 @@ + ============================ + LINUX KERNEL MEMORY BARRIERS + ============================ + +Contents: + + (*) What are memory barriers?
+ + (*) Where are memory barriers needed? + + - Accessing devices. + - Multiprocessor interaction. + - Interrupts. + + (*) Linux kernel compiler barrier functions. + + (*) Linux kernel memory barrier functions. + + (*) Implicit kernel memory barriers. + + - Locking functions. + - Interrupt disablement functions. + - Miscellaneous functions. + + (*) Linux kernel I/O barriering. + + (*) References. + + +========================= +WHAT ARE MEMORY BARRIERS? +========================= + +Memory barriers are instructions to both the compiler and the CPU to impose an +apparent partial ordering between the memory access operations specified either +side of the barrier. They request that the sequence of memory events generated +appears to other components of the system as if the barrier is effective on +that CPU. + +Note that: + + (*) there's no guarantee that the sequence of memory events is _actually_ so + ordered. It's possible for the CPU to do out-of-order accesses _as long + as no-one is looking_, and then fix up the memory if someone else tries to + see what's going on (for instance a bus master device); what matters is + the _apparent_ order as far as other processors and devices are concerned; + and + + (*) memory barriers are only guaranteed to act within the CPU processing them, + and are not, for the most part, guaranteed to percolate down to other CPUs + in the system or to any I/O hardware that that CPU may communicate with. + + +For example, a programmer might take it for granted that the CPU will perform +memory accesses in exactly the order specified, so that if a CPU is, for +example, given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +They would then expect that the CPU will complete the memory access for each +instruction before moving on to the next one, leading to a definite sequence of +operations as seen by external observers in the system: + + read *A, write *B, read *C, read *D, write *E. 
+ + +Reality is, of course, much messier. With many CPUs and compilers, this isn't +always true because: + + (*) reads are more likely to need to be completed immediately to permit + execution progress, whereas writes can often be deferred without a + problem; + + (*) reads can be done speculatively, and then the result discarded should it + prove not to be required; + + (*) the order of the memory accesses may be rearranged to promote better use + of the CPU buses and caches; + + (*) reads and writes may be combined to improve performance when talking to + the memory or I/O hardware that can do batched accesses of adjacent + locations, thus cutting down on transaction setup costs (memory and PCI + devices may be able to do this); and + + (*) the CPU's data cache may affect the ordering, though cache-coherency + mechanisms should alleviate this - once the write has actually hit the + cache. + +So what another CPU, say, might actually observe from the above piece of code +is: + + read *A, read {*C,*D}, write *E, write *B + + (By "read {*C,*D}" I mean a combined single read). + + +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ +accesses appear to be correctly ordered, without the need for a memory +barrier. For instance with the following code: + + X = *A; + *A = Y; + Z = *A; + +assuming no intervention by an external influence, it can be taken that: + + (*) X will hold the old value of *A, and will never happen after the write and + thus end up being given the value that was assigned to *A from Y instead; + and + + (*) Z will always be given the value in *A that was assigned there from Y, and + will never happen before the write, and thus end up with the same value + that was in *A initially. + +(This is ignoring the fact that the value initially in *A may appear to be the +same as the value assigned to *A from Y). + + +================================= +WHERE ARE MEMORY BARRIERS NEEDED? 
+================================= + +Under normal operation, access reordering is probably not going to be a problem +as a linear program will still appear to operate correctly. There are, +however, three circumstances where reordering definitely _could_ be a problem: + + +ACCESSING DEVICES +----------------- + +Many devices can be memory mapped, and so appear to the CPU as if they're just +memory locations. However, to control the device, the driver has to make the +right accesses in exactly the right order. + +Consider, for example, an ethernet chipset such as the AMD PCnet32. It +presents to the CPU an "address register" and a bunch of "data registers". The +way it's accessed is to write the index of the internal register to be accessed +to the address register, and then read or write the appropriate data register +to access the chip's internal register, which could - theoretically - be done +by: + + *ADR = ctl_reg_3; + reg = *DATA; + +The problem with a clever CPU or a clever compiler is that the write to the +address register isn't guaranteed to happen before the access to the data +register, if the CPU or the compiler thinks it is more efficient to defer the +address write: + + read *DATA, write *ADR + +then things will break. + + +In the Linux kernel, however, I/O should be done through the appropriate +accessor routines - such as inb() or writel() - which know how to make such +accesses appropriately sequential. + +On some systems, I/O writes are not strongly ordered across all CPUs, and so +locking should be used, and mmiowb() should be issued prior to unlocking the +critical section. + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +MULTIPROCESSOR INTERACTION +-------------------------- + +When there's a system with more than one processor, these may be working on the +same set of data, but attempting not to use locks as locks are quite expensive. 
+This means that accesses that affect both CPUs may have to be carefully ordered +to prevent error. + +Consider the R/W semaphore slow path. In that, a waiting process is queued on +the semaphore, as noted by it having a record on its stack linked to the +semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + +To wake up the waiter, the up_read() or up_write() functions have to read the +pointer from this record to know where the next waiter record is, clear +the task pointer, call wake_up_process() on the task, and release the reference +held on the waiter's task struct: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + +If any of these steps occur out of order, then the whole thing may fail. + +Note that the waiter does not get the semaphore lock again - it just waits for +its task pointer to be cleared. Since the record is on its stack, this means +that if the task pointer is cleared _before_ the next pointer in the list is +read, another CPU might start processing the waiter and it might clobber its +stack before up*() functions have a chance to read the next pointer. + + CPU 0 CPU 1 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + <preempt> + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + </preempt> + READ waiter->list.next; + --- OOPS --- + +This could be dealt with using a spinlock, but then the down_xxx() function has +to get the spinlock again after it's been woken up, which is a waste of +resources.
+ +The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + +In this case, the barrier makes a guarantee that all memory accesses before the +barrier will appear to happen before all the memory accesses after the barrier +with respect to the other CPUs on the system. It does _not_ guarantee that all +the memory accesses before the barrier will be complete by the time the barrier +itself is complete. + +SMP memory barriers are normally mere compiler barriers on a UP system because +the CPU orders overlapping accesses with respect to itself. + + +INTERRUPTS +---------- + +A driver may be interrupted by its own interrupt service routine, and thus they +may interfere with each other's attempts to control or access the device. + +This may be alleviated - at least in part - by disabling interrupts (a form of +locking), such that the critical operations are all contained within the +disabled-interrupt section in the driver. Whilst the driver's interrupt +routine is executing, the driver's core may not run on the same CPU, and its +interrupt is not permitted to happen again until the current interrupt has been +handled, thus the interrupt handler does not need to lock against that. + + +However, consider the following example: + + CPU 1 CPU 2 + =============================== =============================== + [A is 0 and B is 0] + DISABLE IRQ + *A = 1; + smp_wmb(); + *B = 2; + ENABLE IRQ + <interrupt> + *A = 3 + a = *A; + b = *B; + smp_wmb(); + *B = 4; + </interrupt> + +CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B +== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3. + +This might happen because the write "*B = 2" might occur after the write "*A = +3" - in which case the former write has leaked from the interrupt-disabled +section into the interrupt handler. 
In this case, a lock of some +description should very probably be used. + + +This sort of problem might also occur with relaxed I/O ordering rules, if it's +permitted for I/O writes to cross. For instance, if a driver was talking to an +ethernet card that sports an address register and a data register: + + DISABLE IRQ + writew(ADR, ctl_reg_3); + writew(DATA, y); + ENABLE IRQ + <interrupt> + writew(ADR, ctl_reg_4); + q = readw(DATA); + </interrupt> + +In such a case, an mmiowb() is needed, firstly to prevent the first write to +the address register from occurring after the write to the data register, and +secondly to prevent the write to the data register from happening after the +second write to the address register. + + +======================================= +LINUX KERNEL COMPILER BARRIER FUNCTIONS +======================================= + +The Linux kernel has an explicit compiler barrier function that prevents the +compiler from moving the memory accesses either side of it to the other side: + + barrier(); + +This has no direct effect on the CPU, which may then reorder things however it +wishes. + +In addition, accesses to "volatile" memory locations and volatile asm +statements act as implicit compiler barriers. + + +===================================== +LINUX KERNEL MEMORY BARRIER FUNCTIONS +===================================== + +The Linux kernel has six basic CPU memory barriers: + + MANDATORY SMP CONDITIONAL + =============== =============== + GENERAL mb() smp_mb() + READ rmb() smp_rmb() + WRITE wmb() smp_wmb() + +General memory barriers give a guarantee that all memory accesses specified +before the barrier will appear to happen before all memory accesses specified +after the barrier with respect to the other components of the system. + +Read and write memory barriers give similar guarantees, but only for memory +reads versus memory reads and memory writes versus memory writes respectively. + +All memory barriers imply compiler barriers.
+ +SMP memory barriers are only compiler barriers on uniprocessor compiled systems +because it is assumed that a CPU will be apparently self-consistent, and will +order overlapping accesses correctly with respect to itself. + +There is no guarantee that any of the memory accesses specified before a memory +barrier will be complete by the completion of a memory barrier; the barrier can +be considered to draw a line in that CPU's access queue that accesses of the +appropriate type may not cross. + +There is no guarantee that issuing a memory barrier on one CPU will have any +direct effect on another CPU or any other hardware in the system. The indirect +effect will be the order in which the second CPU sees the first CPU's accesses +occur. + +There is no guarantee that some intervening piece of off-the-CPU hardware will +not reorder the memory accesses. CPU cache coherency mechanisms should +propegate the indirect effects of a memory barrier between CPUs. + +Note that these are the _minimum_ guarantees. Different architectures may give +more substantial guarantees, but they may not be relied upon outside of arch +specific code. + + +There are some more advanced barriering functions: + + (*) set_mb(var, value) + (*) set_wmb(var, value) + + These assign the value to the variable and then insert at least a write + barrier after it, depending on the function. + + +=============================== +IMPLICIT KERNEL MEMORY BARRIERS +=============================== + +Some of the other functions in the linux kernel imply memory barriers, amongst +them are locking and scheduling functions and interrupt management functions. + +This specification is a _minimum_ guarantee; any particular architecture may +provide more substantial guarantees, but these may not be relied upon outside +of arch specific code. 
+ + +LOCKING FUNCTIONS +----------------- + +For instance all the following locking functions imply barriers: + + (*) spin locks + (*) R/W spin locks + (*) mutexes + (*) semaphores + (*) R/W semaphores + +In all cases there are variants on a LOCK operation and an UNLOCK operation. + + (*) LOCK operation implication: + + Memory accesses issued after the LOCK will be completed after the LOCK + accesses have completed. + + Memory accesses issued before the LOCK may be completed after the LOCK + accesses have completed. + + (*) UNLOCK operation implication: + + Memory accesses issued before the UNLOCK will be completed before the + UNLOCK accesses have completed. + + Memory accesses issued after the UNLOCK may be completed before the UNLOCK + accesses have completed. + + (*) LOCK vs UNLOCK implication: + + The LOCK accesses will be completed before the UNLOCK accesses. + +And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but +a LOCK followed by an UNLOCK isn't. + +Locks and semaphores may not provide any guarantee of ordering on UP compiled +systems, and so can't be counted on in such a situation to actually do anything +at all, especially with respect to I/O barriering, unless combined with +interrupt disablement operations. + + +As an example, consider the following: + + *A = a; + *B = b; + LOCK + *C = c; + *D = d; + UNLOCK + *E = e; + *F = f; + +The following sequence of events is acceptable: + + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK + +But none of the following are: + + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E + + +INTERRUPT DISABLEMENT FUNCTIONS +------------------------------- + +Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will +barrier memory and I/O accesses versus memory and I/O accesses done in the +interrupt handler. 
This prevents an interrupt routine interfering with +accesses made in a disabled-interrupt section of code and vice versa. + +Note that whilst interrupt disablement barriers all act as compiler barriers, +they only act as memory barriers with respect to interrupts, not with respect +to nested sections. + +Consider the following: + + <interrupt> + *X = x; + </interrupt> + *A = a; + SAVE IRQ AND DISABLE + *B = b; + SAVE IRQ AND DISABLE + *C = c; + RESTORE IRQ + *D = d; + RESTORE IRQ + *E = e; + <interrupt> + *Y = y; + </interrupt> + +It is acceptable to observe the following sequences of events: + + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y } + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E } + { INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y } + { INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y } + { INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y } + +But not the following: + + { INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y } + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E } + + +MISCELLANEOUS FUNCTIONS +----------------------- + +Other functions that imply barriers: + + (*) schedule() and similar imply full memory barriers. + + +=========================== +LINUX KERNEL I/O BARRIERING +=========================== + +When accessing I/O memory, drivers should use the appropriate accessor +functions: + + (*) inX(), outX(): + + These are intended to talk to legacy i386 hardware using an alternate bus + addressing mode. They are synchronous as far as the x86 CPUs are + concerned, but other CPUs and intermediary bridges may not honour that. + + They are guaranteed to be fully ordered with respect to each other. + + (*) readX(), writeX(): + + These are guaranteed to be fully ordered and uncombined with respect to + each other on the issuing CPU, provided they're not accessing a + prefetchable device. 
However, intermediary hardware (such as a PCI + bridge) may indulge in deferral if it so wishes; to flush a write, a read + from the same location must be performed. + + Used with prefetchable I/O memory, an mmiowb() barrier may be required to + force writes to be ordered. + + (*) readX_relaxed() + + These are not guaranteed to be ordered in any way. There is no I/O read + barrier available. + + (*) ioreadX(), iowriteX() + + These will perform as appropriate for the type of access they're actually + doing, be it in/out or read/write. + + +========== +REFERENCES +========== + +AMD64 Architecture Programmer's Manual Volume 2: System Programming + Chapter 7.1: Memory-Access Ordering + Chapter 7.4: Buffering and Combining Memory Writes + +IA-32 Intel Architecture Software Developer's Manual, Volume 3: +System Programming Guide + Chapter 7.1: Locked Atomic Operations + Chapter 7.2: Memory Ordering + Chapter 7.4: Serializing Instructions + +The SPARC Architecture Manual, Version 9 + Chapter 8: Memory Models + Appendix D: Formal Specification of the Memory Models + Appendix J: Programming with the Memory Models + +UltraSPARC Programmer Reference Manual + Chapter 5: Memory Accesses and Cacheability + Chapter 15: Sparc-V9 Memory Models + +UltraSPARC III Cu User's Manual + Chapter 9: Memory Models + +UltraSPARC IIIi Processor User's Manual + Chapter 8: Memory Models + +UltraSPARC Architecture 2005 + Chapter 9: Memory + Appendix D: Formal Specifications of the Memory Models + +UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 + Chapter 8: Memory Models + Appendix F: Caches and Cache Coherency + +Solaris Internals, Core Kernel Architecture, p63-68: + Chapter 3.3: Hardware Considerations for Locks and + Synchronization + +Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching +for Kernel Programmers: + Chapter 13: Other Memory Models ^ permalink raw reply related [flat|nested] 89+ messages in thread
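Tying the accessor list above back to the document's earlier advice that "locking should be used, and mmiowb() should be issued prior to unlocking the critical section", the idiom looks roughly like the following. This is a shape-of-the-code sketch only: `dev`, `DATA`, `dev->regs` and `dev->lock` are hypothetical names, and the fragment is not taken from a real driver:

```c
/* Sketch only: ordering MMIO writes from two CPUs sharing one device. */
spin_lock(&dev->lock);
writel(val, dev->regs + DATA);  /* MMIO write made under the lock */
mmiowb();                       /* push it toward the device before ... */
spin_unlock(&dev->lock);        /* ... another CPU's locked writes can pass it */
```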
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 14:37 ` [PATCH] Document Linux's memory barriers [try #2] David Howells @ 2006-03-08 14:55 ` Alan Cox 2006-03-08 15:41 ` Matthew Wilcox ` (2 more replies) 2006-03-08 19:37 ` [PATCH] Document Linux's memory barriers [try #3] David Howells 1 sibling, 3 replies; 89+ messages in thread From: Alan Cox @ 2006-03-08 14:55 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote: > + (*) reads can be done speculatively, and then the result discarded should it > + prove not to be required; That might be worth an example with an if() because PPC will do this and if its a read with a side effect (eg I/O space) you get singed.. > +same set of data, but attempting not to use locks as locks are quite expensive. s/are quite/is quite and is quite confusing to read > +SMP memory barriers are normally mere compiler barriers on a UP system because s/mere// Makes it easier to read if you are not 1st language English. > +In addition, accesses to "volatile" memory locations and volatile asm > +statements act as implicit compiler barriers. Add The use of volatile generates poorer code and hides the serialization in type declarations that may be far from the code. The Linux coding style therefore strongly favours the use of explicit barriers except in small and specific cases. > +SMP memory barriers are only compiler barriers on uniprocessor compiled systems > +because it is assumed that a CPU will be apparently self-consistent, and will > +order overlapping accesses correctly with respect to itself. Is this true of IA-64 ?? > +There is no guarantee that some intervening piece of off-the-CPU hardware will > +not reorder the memory accesses. CPU cache coherency mechanisms should > +propegate the indirect effects of a memory barrier between CPUs. 
[For information on bus mastering DMA and coherency please read ....] since we have a doc on this > +There are some more advanced barriering functions: "barriering" ... ick, barrier. > +LOCKING FUNCTIONS > +----------------- > + > +For instance all the following locking functions imply barriers: s/For instance// > + (*) spin locks > + (*) R/W spin locks > + (*) mutexes > + (*) semaphores > + (*) R/W semaphores > + > +In all cases there are variants on a LOCK operation and an UNLOCK operation. > + > + (*) LOCK operation implication: > + > + Memory accesses issued after the LOCK will be completed after the LOCK > + accesses have completed. > + > + Memory accesses issued before the LOCK may be completed after the LOCK > + accesses have completed. > + > + (*) UNLOCK operation implication: > + > + Memory accesses issued before the UNLOCK will be completed before the > + UNLOCK accesses have completed. > + > + Memory accesses issued after the UNLOCK may be completed before the UNLOCK > + accesses have completed. > + > + (*) LOCK vs UNLOCK implication: > + > + The LOCK accesses will be completed before the UNLOCK accesses. > + > +And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but > +a LOCK followed by an UNLOCK isn't. > + > +Locks and semaphores may not provide any guarantee of ordering on UP compiled > +systems, and so can't be counted on in such a situation to actually do anything > +at all, especially with respect to I/O barriering, unless combined with > +interrupt disablement operations. s/disablement/disabling/ Should clarify local ordering v SMP ordering for locks implied here.
> +INTERRUPT DISABLEMENT FUNCTIONS > +------------------------------- s/Disablement/Disabling/ > +Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will disable > +=========================== > +LINUX KERNEL I/O BARRIERING /barriering/barriers > + (*) inX(), outX(): > + > + These are intended to talk to legacy i386 hardware using an alternate bus > + addressing mode. They are synchronous as far as the x86 CPUs are Not really true. Lots of PCI devices use them. Need to talk about "I/O space" > + concerned, but other CPUs and intermediary bridges may not honour that. > + > + They are guaranteed to be fully ordered with respect to each other. And make clear I/O space is a CPU property and that inX()/outX() may well map to read/write variant functions on many processors > + (*) readX(), writeX(): > + > + These are guaranteed to be fully ordered and uncombined with respect to > + each other on the issuing CPU, provided they're not accessing a MTRRs > + prefetchable device. However, intermediary hardware (such as a PCI > + bridge) may indulge in deferral if it so wishes; to flush a write, a read > + from the same location must be performed. False. It's not so tightly restricted, and for many devices the location you write is not safe to read, so you must use another. I'd have to dig the PCI spec out but I believe it says the same devfn. It also says stuff about rules for visibility of bus mastering relative to these accesses and PCI config space accesses relative to the lot (the latter several chipsets get wrong). We should probably point people at the PCI 2.2 spec. Looks much much better than the first version and just goes to prove how complex this all is. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 14:55 ` Alan Cox @ 2006-03-08 15:41 ` Matthew Wilcox 2006-03-08 17:19 ` David Howells 2006-03-08 17:04 ` David Howells 2006-03-08 22:01 ` Paul Mackerras 2 siblings, 1 reply; 89+ messages in thread From: Matthew Wilcox @ 2006-03-08 15:41 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 08, 2006 at 09:55:06AM -0500, Alan Cox wrote: > On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote: > > + (*) reads can be done speculatively, and then the result discarded should it > > + prove not to be required; > > That might be worth an example with an if() because PPC will do this and if > its a read with a side effect (eg I/O space) you get singed.. PPC does speculative memory accesses to IO? Are you *sure*? > > +same set of data, but attempting not to use locks as locks are quite expensive. > > s/are quite/is quite > > and is quite confusing to read His grammar's right ... but I'd just leave out the 'as' part. As you're right that it's confusing ;-) > > +SMP memory barriers are normally mere compiler barriers on a UP system because > > s/mere// > > Makes it easier to read if you are not 1st language English. Maybe s/mere/only/? > > +SMP memory barriers are only compiler barriers on uniprocessor compiled systems > > +because it is assumed that a CPU will be apparently self-consistent, and will > > +order overlapping accesses correctly with respect to itself. > > Is this true of IA-64 ?? Yes: #else # define smp_mb() barrier() # define smp_rmb() barrier() # define smp_wmb() barrier() # define smp_read_barrier_depends() do { } while(0) #endif > > + (*) inX(), outX(): > > + > > + These are intended to talk to legacy i386 hardware using an alternate bus > > + addressing mode. They are synchronous as far as the x86 CPUs are > > Not really true. Lots of PCI devices use them. 
Need to talk about "I/O space" Port space is deprecated though. PCI 2.3 says: "Devices are recommended always to map control functions into Memory Space." > > + > > + These are guaranteed to be fully ordered and uncombined with respect to > > + each other on the issuing CPU, provided they're not accessing a > > MTRRs > > > + prefetchable device. However, intermediary hardware (such as a PCI > > + bridge) may indulge in deferral if it so wishes; to flush a write, a read > > + from the same location must be performed. > > False. Its not so tightly restricted and many devices the location you write > is not safe to read so you must use another. I'd have to dig the PCI spec > out but I believe it says the same devfn. It also says stuff about rules for > visibility of bus mastering relative to these accesses and PCI config space > accesses relative to the lot (the latter serveral chipsets get wrong). We > should probably point people at the PCI 2.2 spec . 3.2.5 of PCI 2.3 seems most relevant: Since memory write transactions may be posted in bridges anywhere in the system, and I/O writes may be posted in the host bus bridge, a master cannot automatically tell when its write transaction completes at the final destination. For a device driver to guarantee that a write has completed at the actual target (and not at an intermediate bridge), it must complete a read to the same device that the write targeted. The read (memory or I/O) forces all bridges between the originating master and the actual target to flush all posted data before allowing the read to complete. For additional details on device drivers, refer to Section 6.5. Refer to Section 3.10., item 6, for other cases where a read is necessary. Appendix E is also of interest: 2. Memory writes can be posted in both directions in a bridge. I/O and Configuration writes are not posted. (I/O writes can be posted in the Host Bridge, but some restrictions apply.) 
Read transactions (Memory, I/O, or Configuration) are not posted. 5. A read transaction must push ahead of it through the bridge any posted writes originating on the same side of the bridge and posted before the read. Before the read transaction can complete on its originating bus, it must pull out of the bridge any posted writes that originated on the opposite side and were posted before the read command completes on the read-destination bus. I like the way they contradict each other slightly wrt config reads and whether you have to read from the same device, or merely the same bus. One thing that is clear is that a read of a status register on the bridge isn't enough, it needs to be *through* the bridge, not *to* the bridge. I wonder if a config read of a non-existent device on the other side of the bridge would force the write to complete ... ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 15:41 ` Matthew Wilcox @ 2006-03-08 17:19 ` David Howells 2006-03-08 22:10 ` Paul Mackerras 0 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 17:19 UTC (permalink / raw) To: Matthew Wilcox Cc: Alan Cox, David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Matthew Wilcox <matthew@wil.cx> wrote: > > That might be worth an example with an if() because PPC will do this and > > if it's a read with a side effect (eg I/O space) you get singed.. > > PPC does speculative memory accesses to IO? Are you *sure*? Can you do speculative reads from frame buffers? > # define smp_read_barrier_depends() do { } while(0) What's this one meant to do? > Port space is deprecated though. PCI 2.3 says: That's sort of irrelevant here. I still need to document the interaction. > Since memory write transactions may be posted in bridges anywhere > in the system, and I/O writes may be posted in the host bus bridge, I'm not sure whether this is beyond the scope of this document. Maybe the document's scope needs to be expanded. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 17:19 ` David Howells @ 2006-03-08 22:10 ` Paul Mackerras 2006-03-08 23:08 ` Ivan Kokshaysky 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-08 22:10 UTC (permalink / raw) To: David Howells Cc: Matthew Wilcox, Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel David Howells writes: > > # define smp_read_barrier_depends() do { } while(0) > > What's this one meant to do? On most CPUs, if you load one value and use the value you get to compute the address for a second load, there is an implicit read barrier between the two loads because of the dependency. That's not true on alpha, apparently, because of the way their caches are structured. The smp_read_barrier_depends is a read barrier that you use between two loads when there is already a dependency between the loads, and it is a no-op on everything except alpha (IIRC). Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
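The pattern Paul describes looks like the following; this is a minimal sketch with invented variable names, with the barrier macros stubbed out as the no-ops they are on every architecture except Alpha.

```c
#include <assert.h>

/* No-op stubs standing in for the real kernel barriers -- which is
 * exactly what these compile to on non-Alpha architectures. */
#define wmb()                      do { } while (0)
#define smp_read_barrier_depends() do { } while (0)

static int foo;
static int *p;

static void writer(void)        /* runs on CPU 0 */
{
    foo = 42;
    wmb();                      /* order the store to foo ... */
    p = &foo;                   /* ... before publishing it   */
}

static int reader(void)         /* runs on CPU 1 */
{
    int *q = p;
    /* The second load's address depends on the first load.  On most
     * CPUs that dependency alone orders the loads; on Alpha it does
     * not, so this (otherwise free) barrier is required. */
    smp_read_barrier_depends();
    return q ? *q : -1;
}
```

Without the barrier, Alpha can deliver the new p but a stale foo, which is the failure mode the partitioned-cache explanation accounts for.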
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 22:10 ` Paul Mackerras @ 2006-03-08 23:08 ` Ivan Kokshaysky 2006-03-09 1:01 ` Paul Mackerras 0 siblings, 1 reply; 89+ messages in thread From: Ivan Kokshaysky @ 2006-03-08 23:08 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, Matthew Wilcox, Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Thu, Mar 09, 2006 at 09:10:49AM +1100, Paul Mackerras wrote: > David Howells writes: > > > > # define smp_read_barrier_depends() do { } while(0) > > > > What's this one meant to do? > > On most CPUs, if you load one value and use the value you get to > compute the address for a second load, there is an implicit read > barrier between the two loads because of the dependency. That's not > true on alpha, apparently, because of the way their caches are > structured. Who said?! ;-) > The smp_read_barrier_depends is a read barrier that you > use between two loads when there is already a dependency between the > loads, and it is a no-op on everything except alpha (IIRC). My "Compiler Writer's Guide for the Alpha 21264" says that if the result of the first load contributes to the address calculation of the second load, then the second load cannot issue until the data from the first load is available. Obviously, we don't care about earlier alphas as they are executing strictly in program order. Ivan. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 23:08 ` Ivan Kokshaysky @ 2006-03-09 1:01 ` Paul Mackerras 2006-03-09 16:02 ` Ivan Kokshaysky 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 1:01 UTC (permalink / raw) To: Ivan Kokshaysky Cc: David Howells, Matthew Wilcox, Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel, Paul, E.McKenney, " <paulmck Ivan Kokshaysky writes: > On Thu, Mar 09, 2006 at 09:10:49AM +1100, Paul Mackerras wrote: > > David Howells writes: > > > > > > # define smp_read_barrier_depends() do { } while(0) > > > > > > What's this one meant to do? > > > > On most CPUs, if you load one value and use the value you get to > > compute the address for a second load, there is an implicit read > > barrier between the two loads because of the dependency. That's not > > true on alpha, apparently, because of the way their caches are > > structured. > > Who said?! ;-) Paul McKenney, after much discussion with Alpha chip designers IIRC. > > The smp_read_barrier_depends is a read barrier that you > > use between two loads when there is already a dependency between the > > loads, and it is a no-op on everything except alpha (IIRC). > > My "Compiler Writer's Guide for the Alpha 21264" says that if the > result of the first load contributes to the address calculation > of the second load, then the second load cannot issue until the data > from the first load is available. Sure, but because of the partitioned caches on some systems, the second load can get older data than the first load, even though it issues later. If you do: CPU 0 CPU 1 foo = val; wmb(); p = &foo; reg = p; bar = *reg; it is apparently possible for CPU 1 to see the new value of p (i.e. &foo) but an old value of foo (i.e. not val). This can happen if p and foo are in different halves of the cache on CPU 1, and there are a lot of updates coming in for the half containing foo but the half containing p is quiet. 
I added Paul McKenney to the cc list so he can correct anything I have wrong here. Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:01 ` Paul Mackerras @ 2006-03-09 16:02 ` Ivan Kokshaysky 0 siblings, 0 replies; 89+ messages in thread From: Ivan Kokshaysky @ 2006-03-09 16:02 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, Matthew Wilcox, Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel, Paul E. McKenney <paulmck On Thu, Mar 09, 2006 at 12:01:45PM +1100, Paul Mackerras wrote: > If you do: > > CPU 0 CPU 1 > > foo = val; > wmb(); > p = &foo; > reg = p; > bar = *reg; > > it is apparently possible for CPU 1 to see the new value of p > (i.e. &foo) but an old value of foo (i.e. not val). This can happen > if p and foo are in different halves of the cache on CPU 1, and there > are a lot of updates coming in for the half containing foo but the > half containing p is quiet. Indeed, this can happen according to the architecture reference manual, so CPU 1 needs mb() as well. Thanks for the clarification. Ivan. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 14:55 ` Alan Cox 2006-03-08 15:41 ` Matthew Wilcox @ 2006-03-08 17:04 ` David Howells 2006-03-08 17:36 ` Alan Cox 2006-03-08 22:01 ` Paul Mackerras 2 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 17:04 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox <alan@redhat.com> wrote: > [For information on bus mastering DMA and coherency please read ....] > > since we have a doc on this Documentation/pci.txt? > The use of volatile generates poorer code and hides the serialization in > type declarations that may be far from the code. I'm not sure what you mean by that. > Is this true of IA-64 ?? Are you referring to non-temporal loads and stores? > > +There are some more advanced barriering functions: > > "barriering" ... ick, barrier. Picky :-) > Should clarify local ordering v SMP ordering for locks implied here. Do you mean explain what each sort of lock does? > > + (*) inX(), outX(): > > + > > + These are intended to talk to legacy i386 hardware using an alternate bus > > + addressing mode. They are synchronous as far as the x86 CPUs are > > Not really true. Lots of PCI devices use them. Need to talk about "I/O space" Which bit is not really true? David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 17:04 ` David Howells @ 2006-03-08 17:36 ` Alan Cox 2006-03-08 18:35 ` David Howells 0 siblings, 1 reply; 89+ messages in thread From: Alan Cox @ 2006-03-08 17:36 UTC (permalink / raw) To: David Howells Cc: Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 08, 2006 at 05:04:51PM +0000, David Howells wrote: > > [For information on bus mastering DMA and coherency please read ....] > > sincee have a doc on this > > Documentation/pci.txt? and: Documentation/DMA-mapping.txt Documentation/DMA-API.txt > > > The use of volatile generates poorer code and hides the serialization in > > type declarations that may be far from the code. > > I'm not sure what you mean by that. in foo.h struct blah { volatile int x; /* need serialization } 2 million miles away blah.x = 1; blah.y = 4; And you've no idea that its magically serialized due to a type declaration in a header you've never read. Hence the "dont use volatile" rule > > Is this true of IA-64 ?? > > Are you referring to non-temporal loads and stores? Yep. But Matthew answered that > > Should clarify local ordering v SMP ordering for locks implied here. > > Do you mean explain what each sort of lock does? spin_unlock ensures that local CPU writes before the lock are visible to all processors before the lock is dropped but it has no effect on I/O ordering. Just a need for clarity. > > > + (*) inX(), outX(): > > > + > > > + These are intended to talk to legacy i386 hardware using an alternate bus > > > + addressing mode. They are synchronous as far as the x86 CPUs are > > > > Not really true. Lots of PCI devices use them. Need to talk about "I/O space" > > Which bit is not really true? The "legacy i386 hardware" bit. Many processors have an I/O space. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 17:36 ` Alan Cox @ 2006-03-08 18:35 ` David Howells 2006-03-08 18:45 ` Alan Cox 0 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 18:35 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox <alan@redhat.com> wrote: > spin_unlock ensures that local CPU writes before the lock are visible > to all processors before the lock is dropped but it has no effect on > I/O ordering. Just a need for clarity. So I can't use spinlocks in my driver to make sure two different CPUs don't interfere with each other when trying to communicate with a device because the spinlocks don't guarantee that I/O operations will stay in effect within the locking section? David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 18:35 ` David Howells @ 2006-03-08 18:45 ` Alan Cox 2006-03-08 18:59 ` David Howells 2006-03-08 19:08 ` David Howells 0 siblings, 2 replies; 89+ messages in thread From: Alan Cox @ 2006-03-08 18:45 UTC (permalink / raw) To: David Howells Cc: Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 08, 2006 at 06:35:07PM +0000, David Howells wrote: > Alan Cox <alan@redhat.com> wrote: > > > spin_unlock ensures that local CPU writes before the lock are visible > > to all processors before the lock is dropped but it has no effect on > > I/O ordering. Just a need for clarity. > > So I can't use spinlocks in my driver to make sure two different CPUs don't > interfere with each other when trying to communicate with a device because the > spinlocks don't guarantee that I/O operations will stay in effect within the > locking section? If you have CPU #0 spin_lock(&foo->lock); writel(0, &foo->regnum); writel(1, &foo->data); spin_unlock(&foo->lock); CPU #1 spin_lock(&foo->lock); writel(4, &foo->regnum); writel(5, &foo->data); spin_unlock(&foo->lock); then on some NUMA infrastructures the order may not be as you expect. The CPU will execute writel 0, writel 1 and the second CPU later will execute writel 4, writel 5, but the order they hit the PCI bridge may not be the same order. Usually such things don't matter, but in a register windowed case getting 0/4/1/5 might be rather unfortunate. See Documentation/DocBook/deviceiobook.tmpl (or its output) The following case is safe spin_lock(&foo->lock); writel(0, &foo->regnum); reg = readl(&foo->data); spin_unlock(&foo->lock); as the read must complete and it forces the writes to complete.
The pure write case used above should be implemented as spin_lock(&foo->lock); writel(0, &foo->regnum); writel(1, &foo->data); mmiowb(); spin_unlock(&foo->lock); The mmiowb() ensures that the writels will occur before any writel issued by another CPU that subsequently takes the lock. Welcome to the wonderful world of NUMA. Alan ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 18:45 ` Alan Cox @ 2006-03-08 18:59 ` David Howells 2006-03-08 11:38 ` Andi Kleen 2006-03-08 19:08 ` David Howells 1 sibling, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 18:59 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox <alan@redhat.com> wrote: > then on some NUMA infrastructures the order may not be as you expect. Oh, yuck! Okay... does NUMA guarantee the same for ordinary memory accesses inside the critical section? David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 18:59 ` David Howells @ 2006-03-08 11:38 ` Andi Kleen 0 siblings, 0 replies; 89+ messages in thread From: Andi Kleen @ 2006-03-08 11:38 UTC (permalink / raw) To: David Howells Cc: Alan Cox, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wednesday 08 March 2006 19:59, David Howells wrote: > Alan Cox <alan@redhat.com> wrote: > > > then on some NUMA infrastructures the order may not be as you expect. > > Oh, yuck! > > Okay... does NUMA guarantee the same for ordinary memory accesses inside the > critical section? If you use barriers, the ordering should be the same on cc/NUMA vs SMP. Otherwise it wouldn't be "cc". But it might be quite unfair. -Andi ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 18:45 ` Alan Cox 2006-03-08 18:59 ` David Howells @ 2006-03-08 19:08 ` David Howells 2006-03-08 19:26 ` Linus Torvalds 1 sibling, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 19:08 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox <alan@redhat.com> wrote: > spin_lock(&foo->lock); > writel(0, &foo->regnum); I presume there only needs to be an mmiowb() here if you've got the appropriate CPU's I/O memory window set up to be weakly ordered. > writel(1, &foo->data); > mmiowb(); > spin_unlock(&foo->lock); David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:08 ` David Howells @ 2006-03-08 19:26 ` Linus Torvalds 2006-03-08 19:31 ` David Howells ` (3 more replies) 0 siblings, 4 replies; 89+ messages in thread From: Linus Torvalds @ 2006-03-08 19:26 UTC (permalink / raw) To: David Howells Cc: Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, 8 Mar 2006, David Howells wrote: > Alan Cox <alan@redhat.com> wrote: > > > spin_lock(&foo->lock); > > writel(0, &foo->regnum); > > I presume there only needs to be an mmiowb() here if you've got the > appropriate CPU's I/O memory window set up to be weakly ordered. Actually, since the different NUMA things may have different paths to the PCI thing, I don't think even the mmiowb() will really help. It has nothing to serialize _with_. It only orders mmio from within _one_ CPU and "path" to the destination. The IO might be posted somewhere on a PCI bridge, and, depending on the posting rules, the mmiowb() just isn't relevant for IO coming through another path. Of course, to get into that deep doo-doo, your IO fabric must be separate from the memory fabric, and the hardware must be pretty special, I think. So for example, if you are using an Opteron with its NUMA memory setup between CPU's over HT links, from an _IO_ standpoint it's not really anything strange, since it uses the same fabric for memory coherency and IO coherency, and from an IO ordering standpoint it's just normal SMP. But if you have a separate IO fabric and basically two different CPU's can get to one device through two different paths, no amount of write barriers of any kind will ever help you. So in the really general case, it's still basically true that the _only_ thing that serializes a MMIO write to a device is a _read_ from that device, since then the _device_ ends up being the serialization point.
So in the exteme case, you literally have to do a read from the device before you release the spinlock, if ordering to the device from two different CPU's matters to you. The IO paths simply may not be serializable with the normal memory paths, so spinlocks have absolutely _zero_ ordering capability, and a write barrier on either the normal memory side or the IO side doesn't affect anything. Now, I'm by no means claiming that we necessarily get this right in general, or even very commonly. The undeniable fact is that "big NUMA" machines need to validate the drivers they use separately. The fact that it works on a normal PC - and that it's been tested to death there - does not guarantee much anything. The good news, of course, is that you don't use that kind of "big NUMA" system the same way you'd use a regular desktop SMP. You don't plug in random devices into it and just expect them to work. I'd hope ;) Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:26 ` Linus Torvalds @ 2006-03-08 19:31 ` David Howells 2006-03-09 0:35 ` Paul Mackerras 2006-03-08 19:40 ` Matthew Wilcox ` (2 subsequent siblings) 3 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 19:31 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Linus Torvalds <torvalds@osdl.org> wrote: > Actually, since the different NUMA things may have different paths to the > PCI thing, I don't think even the mmiowb() will really help. It has > nothing to serialize _with_. On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those do inter-component synchronisation. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:31 ` David Howells @ 2006-03-09 0:35 ` Paul Mackerras 2006-03-09 0:54 ` Linus Torvalds 2006-03-09 0:55 ` Jesse Barnes 0 siblings, 2 replies; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 0:35 UTC (permalink / raw) To: David Howells Cc: Linus Torvalds, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel David Howells writes: > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those > do inter-component synchronisation. We actually have quite heavy synchronization in read*/write* on PPC, and mmiowb can safely be a no-op. It would be nice to be able to have lighter-weight synchronization, but I'm sure we would see lots of subtle driver bugs cropping up if we did. write* do a full memory barrier (sync) after the store, and read* explicitly wait for the data to come back before. If you ask me, the need for mmiowb on some platforms merely shows that those platforms' implementations of spinlocks and read*/write* are buggy... Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 0:35 ` Paul Mackerras @ 2006-03-09 0:54 ` Linus Torvalds 2006-03-09 1:08 ` Paul Mackerras 2006-03-09 0:55 ` Jesse Barnes 1 sibling, 1 reply; 89+ messages in thread From: Linus Torvalds @ 2006-03-09 0:54 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Thu, 9 Mar 2006, Paul Mackerras wrote: > > If you ask me, the need for mmiowb on some platforms merely shows that > those platforms' implementations of spinlocks and read*/write* are > buggy... You could also state that same as "If you ask me, the need for mmiowb on some platforms merely shows that those platforms perform like a bat out of hell, and I think they should be slower" because the fact is, x86 memory barrier rules are just about optimal for performance. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 0:54 ` Linus Torvalds @ 2006-03-09 1:08 ` Paul Mackerras 2006-03-09 1:27 ` Linus Torvalds 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 1:08 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev Linus Torvalds writes: > > If you ask me, the need for mmiowb on some platforms merely shows that > > those platforms' implementations of spinlocks and read*/write* are > > buggy... > > You could also state that same as > > "If you ask me, the need for mmiowb on some platforms merely shows > that those platforms perform like a bat out of hell, and I think > they should be slower" > > because the fact is, x86 memory barrier rules are just about optimal for > performance. ... and x86 mmiowb is a no-op. It's not x86 that I think is buggy. Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:08 ` Paul Mackerras @ 2006-03-09 1:27 ` Linus Torvalds 2006-03-09 2:38 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 89+ messages in thread From: Linus Torvalds @ 2006-03-09 1:27 UTC (permalink / raw) To: Paul Mackerras Cc: akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev On Thu, 9 Mar 2006, Paul Mackerras wrote: > > ... and x86 mmiowb is a no-op. It's not x86 that I think is buggy. x86 mmiowb would have to be a real op too if there were any multi-pathed PCI buses out there for x86, methinks. Basically, the issue boils down to one thing: no "normal" barrier will _ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any situation where there are multiple paths to one physical device means that mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the serialization you look for. Put another way: the only way to avoid mmiowb() being special is either one of: (a) have the bus fabric itself be synchronizing (b) pay a huge expense on the much more critical _regular_ barriers Now, I claim that (b) is just broken. I'd rather take the hit when I need to, than every time. Now, (a) is trivial for small cases, but scales badly unless you do some fancy footwork. I suspect you could do some scalable multi-pathable version with using similar approaches to resolving device conflicts as the cache coherency protocol does (or by having a token-passing thing), but it seems SGI's solution was fairly well thought out. That said, when I heard of the NUMA IO issues on the SGI platform, I was initially pretty horrified. It seems to have worked out ok, and as long as we're talking about machines where you can concentrate on validating just a few drivers, it seems to be a good tradeoff. Would I want the hard-to-think-about IO ordering on a regular desktop platform? No. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:27 ` Linus Torvalds @ 2006-03-09 2:38 ` Nick Piggin 2006-03-09 3:45 ` Paul Mackerras 2006-03-09 4:34 ` Jesse Barnes 2 siblings, 0 replies; 89+ messages in thread From: Nick Piggin @ 2006-03-09 2:38 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev Linus Torvalds wrote: > >On Thu, 9 Mar 2006, Paul Mackerras wrote: > >>... and x86 mmiowb is a no-op. It's not x86 that I think is buggy. >> > >x86 mmiowb would have to be a real op too if there were any multi-pathed >PCI buses out there for x86, methinks. > >Basically, the issue boils down to one thing: no "normal" barrier will >_ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any >situation where there are multiple paths to one physical device means that >mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the >serialization you look for. > >Put another way: the only way to avoid mmiowb() being special is either >one of: > (a) have the bus fabric itself be synchronizing > (b) pay a huge expense on the much more critical _regular_ barriers > >Now, I claim that (b) is just broken. I'd rather take the hit when I need >to, than every time. > I'm not very driver-minded; would it make sense to have io versions of locks, which can provide critical sections for IO operations? The number of (uncommented) memory barriers sprinkled around drivers looks pretty scary... ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:27 ` Linus Torvalds 2006-03-09 2:38 ` Nick Piggin @ 2006-03-09 3:45 ` Paul Mackerras 2006-03-09 4:36 ` Jesse Barnes ` (2 more replies) 2006-03-09 4:34 ` Jesse Barnes 2 siblings, 3 replies; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 3:45 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev Linus Torvalds writes: > x86 mmiowb would have to be a real op too if there were any multi-pathed > PCI buses out there for x86, methinks. Not if the manufacturers wanted to be able to run existing standard x86 operating systems on it, surely. I presume that on x86 the PCI host bridges and caches are all part of the coherence domain, and that the rule about stores being observed in order applies to what the PCI host bridge can see as much as it does to any other agent in the coherence domain. And if I have understood you correctly, the store ordering rule applies both to stores to regular cacheable memory and stores to noncacheable nonprefetchable MMIO registers without distinction. If that is so, then I don't see how the writel's can get out of order. Put another way, we expect spinlock regions to order stores to regular memory, and AFAICS the x86 ordering rules mean that the same guarantee should apply to stores to MMIO registers. (It's entirely possible that I don't fully understand the x86 memory ordering rules, of course. :) > Basically, the issue boils down to one thing: no "normal" barrier will > _ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any A spin_lock does show up on the bus, doesn't it? > Would I want the hard-to-think-about IO ordering on a regular desktop > platform? No. In fact I think that mmiowb can actually be useful on PPC, if we can be sure that all the drivers we care about will use it correctly. 
If we can have the following rules: * If you have stores to regular memory, followed by an MMIO store, and you want the device to see the stores to regular memory at the point where it receives the MMIO store, then you need a wmb() between the stores to regular memory and the MMIO store. * If you have PIO or MMIO accesses, and you need to ensure the PIO/MMIO accesses don't get reordered with respect to PIO/MMIO accesses on another CPU, put the accesses inside a spin-locked region, and put a mmiowb() between the last access and the spin_unlock. * smp_wmb() doesn't necessarily do any ordering of MMIO accesses vs. other accesses, and in that sense it is weaker than wmb(). ... then I can remove the sync from write*, which would be nice, and make mmiowb() be a sync. I wonder how long we're going to spend chasing driver bugs after that, though. :) Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
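A sketch of what driver code following these three rules would look like. The kernel primitives are replaced here by logging stubs purely so the ordering can be illustrated in a self-contained way; the device, descriptor, and doorbell names are all invented.

```c
#include <assert.h>
#include <string.h>

/* Logging stand-ins for the kernel primitives -- the real ones emit
 * barriers and bus cycles, of course, not log entries. */
static char trace[128];
static void op(const char *s) { strcat(trace, s); strcat(trace, ";"); }

#define wmb()          op("wmb")
#define mmiowb()       op("mmiowb")
#define writel(v, a)   op("writel")
#define spin_lock(l)   op("lock")
#define spin_unlock(l) op("unlock")

/* Rule 1: stores to regular memory, then wmb(), then the MMIO store. */
static void kick_device(int *desc, void *doorbell)
{
    *desc = 1;           /* descriptor in regular memory           */
    wmb();               /* device must see the descriptor first   */
    writel(1, doorbell); /* MMIO store that starts the transfer    */
}

/* Rule 2: MMIO accesses under a lock, with mmiowb() before unlock. */
static void locked_io(void *lock, void *regs)
{
    spin_lock(lock);
    writel(0, regs);
    writel(1, regs);
    mmiowb();            /* order the writels vs. the next holder  */
    spin_unlock(lock);
}
```

Rule 3 is the caveat: an smp_wmb() in place of the wmb() in kick_device would not be guaranteed to order the descriptor store against the MMIO store.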
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 3:45 ` Paul Mackerras @ 2006-03-09 4:36 ` Jesse Barnes 2006-03-09 7:41 ` Paul Mackerras 2006-03-09 5:38 ` Linus Torvalds 2006-03-09 11:44 ` Michael Buesch 2 siblings, 1 reply; 89+ messages in thread From: Jesse Barnes @ 2006-03-09 4:36 UTC (permalink / raw) To: Paul Mackerras Cc: Linus Torvalds, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev On Wednesday, March 08, 2006 7:45 pm, Paul Mackerras wrote: > If we can have the following rules: > > * If you have stores to regular memory, followed by an MMIO store, > and you want the device to see the stores to regular memory at the > point where it receives the MMIO store, then you need a wmb() between > the stores to regular memory and the MMIO store. > > * If you have PIO or MMIO accesses, and you need to ensure the > PIO/MMIO accesses don't get reordered with respect to PIO/MMIO > accesses on another CPU, put the accesses inside a spin-locked > region, and put a mmiowb() between the last access and the > spin_unlock. > > * smp_wmb() doesn't necessarily do any ordering of MMIO accesses > vs. other accesses, and in that sense it is weaker than wmb(). This is a good set of rules. Hopefully David can add something like this to his doc. > ... then I can remove the sync from write*, which would be nice, and > make mmiowb() be a sync. I wonder how long we're going to spend > chasing driver bugs after that, though. :) Hm, a static checker should be able to find this stuff, shouldn't it? Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 4:36 ` Jesse Barnes @ 2006-03-09 7:41 ` Paul Mackerras 0 siblings, 0 replies; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 7:41 UTC (permalink / raw) To: Jesse Barnes Cc: Linus Torvalds, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev Jesse Barnes writes: > Hm, a static checker should be able to find this stuff, shouldn't it? Good idea. I wonder if sparse could be extended to do it. Alternatively, it wouldn't be hard to check dynamically. Just have a per-cpu count of outstanding MMIO stores. Zero it in spin_lock and mmiowb, increment it in write*, and grizzle if spin_unlock finds it non-zero. Should be very little overhead. Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
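Paul's dynamic check is easy to prototype. Here is a userspace sketch with invented names; in the kernel the counter would be per-cpu and the hooks would live inside the real writel/spin_lock/mmiowb/spin_unlock.

```c
#include <assert.h>

/* Count MMIO stores issued since the last spin_lock or mmiowb; if a
 * spin_unlock finds the count non-zero, an mmiowb() was missed. */
static int pending_mmio;    /* per-cpu in a real implementation  */
static int missed_mmiowb;   /* number of complaints ("grizzles") */

static void checked_writel(void)    { pending_mmio++; }
static void checked_mmiowb(void)    { pending_mmio = 0; }
static void checked_spin_lock(void) { pending_mmio = 0; }

static void checked_spin_unlock(void)
{
    if (pending_mmio)   /* writel since the lock, but no mmiowb */
        missed_mmiowb++;
    pending_mmio = 0;
}
```

The overhead is one increment per MMIO store and one test per unlock, so a check like this could plausibly stay enabled in debug builds.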
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 3:45 ` Paul Mackerras 2006-03-09 4:36 ` Jesse Barnes @ 2006-03-09 5:38 ` Linus Torvalds 2006-03-09 12:27 ` David Howells 2006-03-09 11:44 ` Michael Buesch 2 siblings, 1 reply; 89+ messages in thread From: Linus Torvalds @ 2006-03-09 5:38 UTC (permalink / raw) To: Paul Mackerras Cc: akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev On Thu, 9 Mar 2006, Paul Mackerras wrote: > > A spin_lock does show up on the bus, doesn't it? Nope. If the lock entity is in a exclusive cache-line, a spinlock does not show up on the bus at _all_. It's all purely in the core. In fact, I think AMD does a spinlock in ~15 CPU cycles (that's the serialization overhead in the core). I think a P-M core is ~25, while the NetBurst (P4) core is much more because they have horrible serialization issues (I think it's on the order of 100 cycles there). Anyway, try doing a spinlock in 15 CPU cycles and going out on the bus for it.. (Couple that with spin_unlock basically being free). Now, if the spinlocks end up _bouncing_ between CPU's, they'll obviously be a lot more expensive. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 5:38 ` Linus Torvalds @ 2006-03-09 12:27 ` David Howells 0 siblings, 0 replies; 89+ messages in thread From: David Howells @ 2006-03-09 12:27 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev Linus Torvalds <torvalds@osdl.org> wrote: > > A spin_lock does show up on the bus, doesn't it? > > Nope. Yes, sort of, under some circumstances. If the CPU doing the spin_lock() doesn't own the cacheline with the lock, it'll have to resort to the bus to grab the cacheline from the current owner (so another CPU would at least see a read). The effect of the spin_lock() might not be seen outside of the CPU before the spin_unlock() occurs, but it *will* be committed to the CPU's cache, and given cache coherency mechanisms, that's effectively the same as main memory. So it's in effect visible on the bus, given that it will be transferred to another CPU when requested; and as long as the other CPUs expect to see the effects and the ordering imposed, it's immaterial whether the content of the spinlock is actually ever committed to SDRAM or whether it remains perpetually in one CPU's cache or another's. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 3:45 ` Paul Mackerras 2006-03-09 4:36 ` Jesse Barnes 2006-03-09 5:38 ` Linus Torvalds @ 2006-03-09 11:44 ` Michael Buesch 2 siblings, 0 replies; 89+ messages in thread From: Michael Buesch @ 2006-03-09 11:44 UTC (permalink / raw) To: Paul Mackerras Cc: Linus Torvalds, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev [-- Attachment #1: Type: text/plain, Size: 348 bytes --] On Thursday 09 March 2006 04:45, you wrote: > ... then I can remove the sync from write*, which would be nice, and > make mmiowb() be a sync. I wonder how long we're going to spend > chasing driver bugs after that, though. :) Can you do a patch, which does the change, so people can actually test their drivers? -- Greetings Michael. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:27 ` Linus Torvalds 2006-03-09 2:38 ` Nick Piggin 2006-03-09 3:45 ` Paul Mackerras @ 2006-03-09 4:34 ` Jesse Barnes 2006-03-09 4:43 ` Paul Mackerras 2 siblings, 1 reply; 89+ messages in thread From: Jesse Barnes @ 2006-03-09 4:34 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev On Wednesday, March 08, 2006 5:27 pm, Linus Torvalds wrote: > That said, when I heard of the NUMA IO issues on the SGI platform, I > was initially pretty horrified. It seems to have worked out ok, and > as long as we're talking about machines where you can concentrate on > validating just a few drivers, it seems to be a good tradeoff. It's actually not too bad. We tried hard to make the arch code support the semantics that Linux drivers expect. mmiowb() was an optimization we added (though it's much less of an optimization than read_relaxed() was) to make things a little faster. Like you say, the alternative was to embed the same functionality into spin_unlock or something (IRIX actually had an io_spin_unlock that did that iirc), but that would mean an MMIO access on every unlock, which would be bad. So ultimately mmiowb() is the only thing drivers really have to care about on Altix (assuming they do DMA mapping correctly), and the rules for that are fairly simple. Then they can additionally use read_relaxed() to optimize performance a bit (quite a bit on big systems). > Would I want the hard-to-think-about IO ordering on a regular desktop > platform? No. I guess you don't want anyone to send you an O2 then? :) Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 4:34 ` Jesse Barnes @ 2006-03-09 4:43 ` Paul Mackerras 2006-03-09 10:05 ` Jes Sorensen 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 4:43 UTC (permalink / raw) To: Jesse Barnes Cc: Linus Torvalds, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev Jesse Barnes writes: > So ultimately mmiowb() is the only thing drivers really have to care > about on Altix (assuming they do DMA mapping correctly), and the rules > for that are fairly simple. Then they can additionally use > read_relaxed() to optimize performance a bit (quite a bit on big > systems). If I can be sure that all the drivers we care about on PPC use mmiowb correctly, I can reduce or eliminate the barrier in write*, which would be nice. Which drivers have been audited to make sure they use mmiowb correctly? In particular, has the USB driver been audited? Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 4:43 ` Paul Mackerras @ 2006-03-09 10:05 ` Jes Sorensen 0 siblings, 0 replies; 89+ messages in thread From: Jes Sorensen @ 2006-03-09 10:05 UTC (permalink / raw) To: Paul Mackerras Cc: Jesse Barnes, Linus Torvalds, akpm, linux-arch, linux-kernel, mingo, Alan Cox, linuxppc64-dev >>>>> "Paul" == Paul Mackerras <paulus@samba.org> writes: Paul> Jesse Barnes writes: >> So ultimately mmiowb() is the only thing drivers really have to >> care about on Altix (assuming they do DMA mapping correctly), and >> the rules for that are fairly simple. Then they can additionally >> use read_relaxed() to optimize performance a bit (quite a bit on >> big systems). Paul> If I can be sure that all the drivers we care about on PPC use Paul> mmiowb correctly, I can reduce or eliminate the barrier in Paul> write*, which would be nice. Paul> Which drivers have been audited to make sure they use mmiowb Paul> correctly? In particular, has the USB driver been audited? I think the primary drivers we've looked at are drivers/net/tg3.c, drivers/net/s2io.c, drivers/scsi/qla1280.c, and possibly the qla[234]xxx series - that's probably it! While we have USB on the systems, I don't think anyone has spent a lot of time verifying it in this context. At least the keyboard and mouse I have on this box seem to behave. Cheers, Jes ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 0:35 ` Paul Mackerras 2006-03-09 0:54 ` Linus Torvalds @ 2006-03-09 0:55 ` Jesse Barnes 2006-03-09 1:57 ` Paul Mackerras 1 sibling, 1 reply; 89+ messages in thread From: Jesse Barnes @ 2006-03-09 0:55 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, Linus Torvalds, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wednesday, March 8, 2006 4:35 pm, Paul Mackerras wrote: > David Howells writes: > > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction > > then? Those do inter-component synchronisation. > > We actually have quite heavy synchronization in read*/write* on PPC, > and mmiowb can safely be a no-op. It would be nice to be able to have > lighter-weight synchronization, but I'm sure we would see lots of > subtle driver bugs cropping up if we did. write* do a full memory > barrier (sync) after the store, and read* explicitly wait for the data > to come back before. > > If you ask me, the need for mmiowb on some platforms merely shows that > those platforms' implementations of spinlocks and read*/write* are > buggy... Or maybe they just wanted to keep them fast. I don't know why you compromised so much in your implementation of read/write and lock/unlock, but given how expensive synchronization is, I'd think it would be better in the long run to make the barrier types explicit (or at least a subset of them) to maximize performance. The rules for using the barriers really aren't that bad... for mmiowb() you basically want to do it before an unlock in any critical section where you've done PIO writes. Of course, that doesn't mean there isn't confusion about existing barriers. 
There was a long thread a few years ago (Jes worked it all out, iirc) regarding some subtle memory ordering bugs in the tty layer that ended up being due to ia64's very weak spin_unlock ordering guarantees (one way memory barrier only), but I think that's mainly an artifact of how ill defined the semantics of the various arch specific routines are in some cases. That's why I suggested in an earlier thread that you enumerate all the memory ordering combinations on ppc and see if we can't define them all. Then David can roll the implicit ones up into his document, or we can add the appropriate new operations to the kernel. Really getting barriers right shouldn't be much harder than getting DMA mapping right, from a driver writers POV (though people often get that wrong I guess). Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 0:55 ` Jesse Barnes @ 2006-03-09 1:57 ` Paul Mackerras 2006-03-09 4:26 ` Jesse Barnes 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 1:57 UTC (permalink / raw) To: Jesse Barnes Cc: akpm, linux-arch, linux-kernel, Linus Torvalds, mingo, Alan Cox, linuxppc64-dev Jesse Barnes writes: > On Wednesday, March 8, 2006 4:35 pm, Paul Mackerras wrote: > > David Howells writes: > > > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction > > > then? Those do inter-component synchronisation. > > > > We actually have quite heavy synchronization in read*/write* on PPC, > > and mmiowb can safely be a no-op. It would be nice to be able to have > > lighter-weight synchronization, but I'm sure we would see lots of > > subtle driver bugs cropping up if we did. write* do a full memory > > barrier (sync) after the store, and read* explicitly wait for the data > > to come back before. > > > > If you ask me, the need for mmiowb on some platforms merely shows that > > those platforms' implementations of spinlocks and read*/write* are > > buggy... > > Or maybe they just wanted to keep them fast. I don't know why you > compromised so much in your implementation of read/write and > lock/unlock, but given how expensive synchronization is, I'd think it > would be better in the long run to make the barrier types explicit (or > at least a subset of them) to maximize performance. The PPC read*/write* and in*/out* aim to implement x86 semantics, in order to minimize the number of subtle driver bugs that only show up under heavy load. I agree that in the long run making the barriers more explicit is a good thing. > The rules for using > the barriers really aren't that bad... for mmiowb() you basically want > to do it before an unlock in any critical section where you've done PIO > writes. Do you mean just PIO, or do you mean PIO or MMIO writes? 
> Of course, that doesn't mean there isn't confusion about existing > barriers. There was a long thread a few years ago (Jes worked it all > out, iirc) regarding some subtle memory ordering bugs in the tty layer > that ended up being due to ia64's very weak spin_unlock ordering > guarantees (one way memory barrier only), but I think that's mainly an > artifact of how ill defined the semantics of the various arch specific > routines are in some cases. Yes, there is a lot of confusion, unfortunately. There is also some difficulty in defining things to be any different from what x86 does. > That's why I suggested in an earlier thread that you enumerate all the > memory ordering combinations on ppc and see if we can't define them all. The main difficulty we strike on PPC is that cacheable accesses tend to get ordered independently of noncacheable accesses. The only instruction we have that orders cacheable accesses with respect to noncacheable accesses is the sync instruction, which is a heavyweight "synchronize everything" operation. It acts as a full memory barrier for both cacheable and noncacheable loads and stores. The other barriers we have are the lwsync instruction and the eieio instruction. The lwsync instruction (light-weight sync) acts as a memory barrier for cacheable loads and stores except that it allows a following load to go before a preceding store. The eieio instruction has two separate and independent effects. It acts as a full barrier for accesses to noncacheable nonprefetchable memory (i.e. MMIO or PIO registers), and it acts as a write barrier for accesses to cacheable memory. It doesn't do any ordering between cacheable and noncacheable accesses though. There is also the isync (instruction synchronize) instruction, which isn't explicitly a memory barrier. 
It prevents any following instructions from executing until the outcomes of any previous conditional branches are known, and until it is known that no previous instruction can generate an exception. Thus it can be used to create a one-way barrier in spin_lock and read*. > Then David can roll the implicit ones up into his document, or we can > add the appropriate new operations to the kernel. Really getting > barriers right shouldn't be much harder than getting DMA mapping right, > from a driver writers POV (though people often get that wrong I guess). Unfortunately, if you get the barriers wrong your driver will still work most of the time on pretty much any machine, whereas if you get the DMA mapping wrong your driver won't work at all on some machines. Nevertheless, we should get these things defined properly and then try to make sure drivers do the right things. Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:57 ` Paul Mackerras @ 2006-03-09 4:26 ` Jesse Barnes 0 siblings, 0 replies; 89+ messages in thread From: Jesse Barnes @ 2006-03-09 4:26 UTC (permalink / raw) To: Paul Mackerras Cc: akpm, linux-arch, linux-kernel, Linus Torvalds, mingo, Alan Cox, linuxppc64-dev On Wednesday, March 08, 2006 5:57 pm, Paul Mackerras wrote: > > The rules for using > > the barriers really aren't that bad... for mmiowb() you basically > > want to do it before an unlock in any critical section where you've > > done PIO writes. > > Do you mean just PIO, or do you mean PIO or MMIO writes? I'd have to check, but iirc it was just MMIO. We assumed PIO (inX/outX) was defined to be very strongly ordered (and thus slow) in Linux. But Linus is apparently flexible on that point for the new ioreadX/iowriteX stuff. > Yes, there is a lot of confusion, unfortunately. There is also some > difficulty in defining things to be any different from what x86 does. Well, Alpha has smp_barrier_depends or whatever, that's *really* funky. > > That's why I suggested in an earlier thread that you enumerate all > > the memory ordering combinations on ppc and see if we can't define > > them all. > > The main difficulty we strike on PPC is that cacheable accesses tend > to get ordered independently of noncacheable accesses. The only > instruction we have that orders cacheable accesses with respect to > noncacheable accesses is the sync instruction, which is a heavyweight > "synchronize everything" operation. It acts as a full memory barrier > for both cacheable and noncacheable loads and stores. Ah, ok, sounds like your chip needs an ISA extension or two then. :) > The other barriers we have are the lwsync instruction and the eieio > instruction. The lwsync instruction (light-weight sync) acts as a > memory barrier for cacheable loads and stores except that it allows a > following load to go before a preceding store. 
This sounds like ia64 acquire semantics, a fence, but only in the downward direction. > The eieio instruction has two separate and independent effects. It > acts as a full barrier for accesses to noncacheable nonprefetchable > memory (i.e. MMIO or PIO registers), and it acts as a write barrier > for accesses to cacheable memory. It doesn't do any ordering between > cacheable and noncacheable accesses though. Weird, ok, so for cacheable stuff it's equivalent to ia64's release semantics, but has additional effects for noncacheable accesses. Too bad it doesn't tie the two together somehow. > There is also the isync (instruction synchronize) instruction, which > isn't explicitly a memory barrier. It prevents any following > instructions from executing until the outcome of any previous > conditional branches are known, and until it is known that no > previous instruction can generate an exception. Thus it can be used > to create a one-way barrier in spin_lock and read*. Hm, interesting. > Unfortunately, if you get the barriers wrong your driver will still > work most of the time on pretty much any machine, whereas if you get > the DMA mapping wrong your driver won't work at all on some machines. > Nevertheless, we should get these things defined properly and then > try to make sure drivers do the right things. Agreed. Having a set of rules that driver writers can use would help too. Given that PPC doesn't appear to have a lightweight way of synchronizing between I/O and memory accesses, it sounds like full syncs will be needed in a lot of cases. Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:26 ` Linus Torvalds 2006-03-08 19:31 ` David Howells @ 2006-03-08 19:40 ` Matthew Wilcox 2006-03-09 0:37 ` Paul Mackerras 2006-03-08 19:54 ` Jesse Barnes 2006-03-08 20:02 ` Alan Cox 3 siblings, 1 reply; 89+ messages in thread From: Matthew Wilcox @ 2006-03-08 19:40 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 08, 2006 at 11:26:41AM -0800, Linus Torvalds wrote: > > I presume there only needs to be an mmiowb() here if you've got the > > appropriate CPU's I/O memory window set up to be weakly ordered. > > Actually, since the different NUMA things may have different paths to the > PCI thing, I don't think even the mmiowb() will really help. It has > nothing to serialize _with_. > > It only orders mmio from within _one_ CPU and "path" to the destination. > The IO might be posted somewhere on a PCI bridge, and and depending on the > posting rules, the mmiowb() just isn't relevant for IO coming through > another path. Looking at the SGI implementation, it's smarter than you think. Looks like there's a register in the local I/O hub that lets you determine when this write has been queued in the appropriate host->pci bridge. So by the time __sn_mmiowb() returns, you're guaranteed no other CPU can bypass the write because the write's got far enough. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:40 ` Matthew Wilcox @ 2006-03-09 0:37 ` Paul Mackerras 2006-03-09 0:59 ` Jesse Barnes 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 0:37 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Matthew Wilcox writes: > Looking at the SGI implementation, it's smarter than you think. Looks > like there's a register in the local I/O hub that lets you determine > when this write has been queued in the appropriate host->pci bridge. Given that mmiowb takes no arguments, how does it know which is the appropriate PCI host bridge? Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 0:37 ` Paul Mackerras @ 2006-03-09 0:59 ` Jesse Barnes 2006-03-09 1:36 ` Paul Mackerras 0 siblings, 1 reply; 89+ messages in thread From: Jesse Barnes @ 2006-03-09 0:59 UTC (permalink / raw) To: Paul Mackerras Cc: Matthew Wilcox, Linus Torvalds, David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wednesday, March 8, 2006 4:37 pm, Paul Mackerras wrote: > Matthew Wilcox writes: > > Looking at the SGI implementation, it's smarter than you think. > > Looks like there's a register in the local I/O hub that lets you > > determine when this write has been queued in the appropriate > > host->pci bridge. > > Given that mmiowb takes no arguments, how does it know which is the > appropriate PCI host bridge? It uses a per-node address space to reference the local bridge. The local bridge then waits until the remote bridge has acked the write, then sets the outstanding write register to the appropriate value. Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 0:59 ` Jesse Barnes @ 2006-03-09 1:36 ` Paul Mackerras 2006-03-09 4:18 ` Jesse Barnes 0 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-09 1:36 UTC (permalink / raw) To: Jesse Barnes Cc: Matthew Wilcox, Linus Torvalds, David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Jesse Barnes writes: > It uses a per-node address space to reference the local bridge. The > local bridge then waits until the remote bridge has acked the write > before, then sets the outstanding write register to the appropriate > value. That sounds like mmiowb can only be used when preemption is disabled, such as inside a spin-locked region - is that right? Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-09 1:36 ` Paul Mackerras @ 2006-03-09 4:18 ` Jesse Barnes 0 siblings, 0 replies; 89+ messages in thread From: Jesse Barnes @ 2006-03-09 4:18 UTC (permalink / raw) To: Paul Mackerras Cc: Matthew Wilcox, Linus Torvalds, David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wednesday, March 08, 2006 5:36 pm, Paul Mackerras wrote: > Jesse Barnes writes: > > It uses a per-node address space to reference the local bridge. > > The local bridge then waits until the remote bridge has acked the > > write before, then sets the outstanding write register to the > > appropriate value. > > That sounds like mmiowb can only be used when preemption is disabled, > such as inside a spin-locked region - is that right? There's a scheduler hook to flush things if a process moves. I think Brent Casavant submitted that patch recently. Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:26 ` Linus Torvalds 2006-03-08 19:31 ` David Howells 2006-03-08 19:40 ` Matthew Wilcox @ 2006-03-08 19:54 ` Jesse Barnes 2006-03-08 20:02 ` Alan Cox 3 siblings, 0 replies; 89+ messages in thread From: Jesse Barnes @ 2006-03-08 19:54 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wednesday, March 8, 2006 11:26 am, Linus Torvalds wrote: > But if you have a separate IO fabric and basically two different CPU's > can get to one device through two different paths, no amount of write > barriers of any kind will ever help you. No, that's exactly the case that mmiowb() was designed to protect against. It ensures that your writes have arrived at the destination bridge, which means after that point any other CPUs writing to the same device will have their data actually hit the device afterwards. Hopefully deviceiobook.tmpl makes that clear... Jesse ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 19:26 ` Linus Torvalds ` (2 preceding siblings ...) 2006-03-08 19:54 ` Jesse Barnes @ 2006-03-08 20:02 ` Alan Cox 3 siblings, 0 replies; 89+ messages in thread From: Alan Cox @ 2006-03-08 20:02 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Alan Cox, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 08, 2006 at 11:26:41AM -0800, Linus Torvalds wrote: > Actually, since the different NUMA things may have different paths to the > PCI thing, I don't think even the mmiowb() will really help. It has > nothing to serialize _with_. It serializes to the bridge. On the Altix for example this is done by reading a local status register with the pending write count in it and waiting until the chip reports the write has propagated across the fabric. At that point it has hit the bridge and the usual PCI posting applies, but the PCI ordering rule will also apply so the write won't be passed by another write issued after the spinlock is then dropped. > The IO might be posted somewhere on a PCI bridge, and and depending on the > posting rules, the mmiowb() just isn't relevant for IO coming through > another path. Yes. mmiowb only serializes to the bridge. That's how it is defined in the documentation. That's enough to sort out things like the example with locks, in cases where a read from the device would be overkill. > general, or even very commonly. The undeniable fact is that "big NUMA" > machines need to validate the drivers they use separately. The fact that > it works on a normal PC - and that it's been tested to death there - does > not guarantee much anything. mmiowb comes about from the Altix folks strangely enough. > The good news, of course, is that you don't use that kind of "big NUMA" > system the same way you'd use a regular desktop SMP. You don't plug in > random devices into it and just expect them to work. 
I'd hope ;) Various core drivers like tg3 use mmiowb() Alan ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 14:55 ` Alan Cox 2006-03-08 15:41 ` Matthew Wilcox 2006-03-08 17:04 ` David Howells @ 2006-03-08 22:01 ` Paul Mackerras 2006-03-08 22:23 ` David S. Miller 2 siblings, 1 reply; 89+ messages in thread From: Paul Mackerras @ 2006-03-08 22:01 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox writes: > On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote: > > + (*) reads can be done speculatively, and then the result discarded should it > > + prove not to be required; > > That might be worth an example with an if() because PPC will do this and if > its a read with a side effect (eg I/O space) you get singed.. On PPC machines, the PTE has a bit called G (for Guarded) which indicates that the memory mapped by it has side effects. It prevents the CPU from doing speculative accesses (i.e. the CPU can't send out a load from the page until it knows for sure that the program will get to that instruction) and from prefetching from the page. The kernel sets G=1 on MMIO and PIO pages in general, as you would expect, although you can get G=0 mappings for framebuffers etc. if you ask specifically for that. Paul. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #2] 2006-03-08 22:01 ` Paul Mackerras @ 2006-03-08 22:23 ` David S. Miller 0 siblings, 0 replies; 89+ messages in thread From: David S. Miller @ 2006-03-08 22:23 UTC (permalink / raw) To: paulus Cc: alan, dhowells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel From: Paul Mackerras <paulus@samba.org> Date: Thu, 9 Mar 2006 09:01:57 +1100 > On PPC machines, the PTE has a bit called G (for Guarded) which > indicates that the memory mapped by it has side effects. It prevents > the CPU from doing speculative accesses (i.e. the CPU can't send out a > load from the page until it knows for sure that the program will get > to that instruction) and from prefetching from the page. > > The kernel sets G=1 on MMIO and PIO pages in general, as you would > expect, although you can get G=0 mappings for framebuffers etc. if you > ask specifically for that. Sparc64 has a similar PTE bit called "E" for "side-Effect". And we also do the same thing as powerpc for framebuffers. Note that on sparc64 in our asm/io.h PIO/MMIO accessor macros we use physical addresses, so we don't have to map anything in ioremap(), and use a special address space identifier on the loads and stores that indicates "E" behavior is desired. ^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH] Document Linux's memory barriers [try #3] 2006-03-08 14:37 ` [PATCH] Document Linux's memory barriers [try #2] David Howells 2006-03-08 14:55 ` Alan Cox @ 2006-03-08 19:37 ` David Howells 2006-03-09 14:01 ` David Howells 1 sibling, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 19:37 UTC (permalink / raw) To: torvalds, akpm, mingo, alan; +Cc: linux-arch, linuxppc64-dev, linux-kernel The attached patch documents the Linux kernel's memory barriers. I've updated it from the comments I've been given. Note that the per-arch notes sections are gone because it's clear that there are so many exceptions that it's not worth having them. I've added a list of references to other documents. I've tried to get rid of the concept of memory accesses appearing on the bus; what matters is apparent behaviour with respect to other observers in the system. I'm not sure that any mention of interrupts vs interrupt disablement should be retained... it's unclear that there is actually anything that guarantees that stuff won't leak out of an interrupt-disabled section and into an interrupt handler. Paul Mackerras says this isn't valid on powerpc, and looking at the code seems to confirm that, barring implicit enforcement by the CPU. There's also some uncertainty with respect to spinlocks vs I/O accesses on NUMA. Signed-Off-By: David Howells <dhowells@redhat.com> --- warthog>diffstat -p1 /tmp/mb.diff Documentation/memory-barriers.txt | 781 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 781 insertions(+) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt new file mode 100644 index 0000000..6eeb7e4 --- /dev/null +++ b/Documentation/memory-barriers.txt @@ -0,0 +1,781 @@ + ============================ + LINUX KERNEL MEMORY BARRIERS + ============================ + +Contents: + + (*) What are memory barriers? + + (*) Where are memory barriers needed? + + - Accessing devices. + - Multiprocessor interaction. + - Interrupts. 
+ + (*) Explicit kernel compiler barriers. + + (*) Explicit kernel memory barriers. + + (*) Implicit kernel memory barriers. + + - Locking functions. + - Interrupt disabling functions. + - Miscellaneous functions. + + (*) Inter-CPU locking barrier effects. + + - Locks vs memory accesses. + - Locks vs I/O accesses. + + (*) Kernel I/O barrier effects. + + (*) References. + + +========================= +WHAT ARE MEMORY BARRIERS? +========================= + +Memory barriers are instructions to both the compiler and the CPU to impose an +apparent partial ordering between the memory access operations specified either +side of the barrier. They request that the sequence of memory events generated +appears to other components of the system as if the barrier is effective on +that CPU. + +Note that: + + (*) there's no guarantee that the sequence of memory events is _actually_ so + ordered. It's possible for the CPU to do out-of-order accesses _as long + as no-one is looking_, and then fix up the memory if someone else tries to + see what's going on (for instance a bus master device); what matters is + the _apparent_ order as far as other processors and devices are concerned; + and + + (*) memory barriers are only guaranteed to act within the CPU processing them, + and are not, for the most part, guaranteed to percolate down to other CPUs + in the system or to any I/O hardware that that CPU may communicate with. + + +For example, a programmer might take it for granted that the CPU will perform +memory accesses in exactly the order specified, so that if a CPU is, for +example, given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +They would then expect that the CPU will complete the memory access for each +instruction before moving on to the next one, leading to a definite sequence of +operations as seen by external observers in the system: + + read *A, write *B, read *C, read *D, write *E. + + +Reality is, of course, much messier. 
With many CPUs and compilers, this isn't +always true because: + + (*) reads are more likely to need to be completed immediately to permit + execution progress, whereas writes can often be deferred without a + problem; + + (*) reads can be done speculatively, and then the result discarded should it + prove not to be required; + + (*) the order of the memory accesses may be rearranged to promote better use + of the CPU buses and caches; + + (*) reads and writes may be combined to improve performance when talking to + the memory or I/O hardware that can do batched accesses of adjacent + locations, thus cutting down on transaction setup costs (memory and PCI + devices may be able to do this); and + + (*) the CPU's data cache may affect the ordering, though cache-coherency + mechanisms should alleviate this - once the write has actually hit the + cache. + +So what another CPU, say, might actually observe from the above piece of code +is: + + read *A, read {*C,*D}, write *E, write *B + + (By "read {*C,*D}" I mean a combined single read). + + +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ +accesses appear to be correctly ordered, without the need for a memory +barrier. For instance with the following code: + + X = *A; + *A = Y; + Z = *A; + +assuming no intervention by an external influence, it can be taken that: + + (*) X will hold the old value of *A, and will never happen after the write and + thus end up being given the value that was assigned to *A from Y instead; + and + + (*) Z will always be given the value in *A that was assigned there from Y, and + will never happen before the write, and thus end up with the same value + that was in *A initially. + +(This is ignoring the fact that the value initially in *A may appear to be the +same as the value assigned to *A from Y). + + +================================= +WHERE ARE MEMORY BARRIERS NEEDED? 
+================================= + +Under normal operation, access reordering is probably not going to be a problem +as a linear program will still appear to operate correctly. There are, +however, three circumstances where reordering definitely _could_ be a problem: + + +ACCESSING DEVICES +----------------- + +Many devices can be memory mapped, and so appear to the CPU as if they're just +memory locations. However, to control the device, the driver has to make the +right accesses in exactly the right order. + +Consider, for example, an ethernet chipset such as the AMD PCnet32. It +presents to the CPU an "address register" and a bunch of "data registers". The +way it's accessed is to write the index of the internal register to be accessed +to the address register, and then read or write the appropriate data register +to access the chip's internal register, which could - theoretically - be done +by: + + *ADR = ctl_reg_3; + reg = *DATA; + +The problem with a clever CPU or a clever compiler is that the write to the +address register isn't guaranteed to happen before the access to the data +register, if the CPU or the compiler thinks it is more efficient to defer the +address write: + + read *DATA, write *ADR + +then things will break. + + +In the Linux kernel, however, I/O should be done through the appropriate +accessor routines - such as inb() or writel() - which know how to make such +accesses appropriately sequential. + +On some systems, I/O writes are not strongly ordered across all CPUs, and so +locking should be used, and mmiowb() should be issued prior to unlocking the +critical section. + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +MULTIPROCESSOR INTERACTION +-------------------------- + +When there's a system with more than one processor, the CPUs in the system may +be working on the same set of data at the same time. 
This can cause +synchronisation problems, and the usual way of dealing with them is to use +locks - but locks are quite expensive, and so it may be preferable to operate +without the use of a lock if at all possible. In such a case, accesses that +affect both CPUs may have to be carefully ordered to prevent error. + +Consider the R/W semaphore slow path. In that, a waiting process is queued on +the semaphore, as noted by it having a record on its stack linked to the +semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + +To wake up the waiter, the up_read() or up_write() functions have to read the +pointer from this record to know where the next waiter record is, clear +the task pointer, call wake_up_process() on the task, and release the reference +held on the waiter's task struct: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + +If any of these steps occur out of order, then the whole thing may fail. + +Note that the waiter does not get the semaphore lock again - it just waits for +its task pointer to be cleared. Since the record is on its stack, this means +that if the task pointer is cleared _before_ the next pointer in the list is +read, another CPU might start processing the waiter and it might clobber its +stack before the up*() functions have a chance to read the next pointer. + + CPU 0 CPU 1 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + <preempt> + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + </preempt> + READ waiter->list.next; + --- OOPS --- + +This could be dealt with using a spinlock, but then the down_xxx() function has +to get the spinlock again after it's been woken up, which is a waste of +resources.
+ +The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + +In this case, the barrier makes a guarantee that all memory accesses before the +barrier will appear to happen before all the memory accesses after the barrier +with respect to the other CPUs on the system. It does _not_ guarantee that all +the memory accesses before the barrier will be complete by the time the barrier +itself is complete. + +SMP memory barriers are normally nothing more than compiler barriers on a +kernel compiled for a UP system because the CPU orders overlapping accesses +with respect to itself, and so CPU barriers aren't needed. + + +INTERRUPTS +---------- + +A driver may be interrupted by its own interrupt service routine, and thus they +may interfere with each other's attempts to control or access the device. + +This may be alleviated - at least in part - by disabling interrupts (a form of +locking), such that the critical operations are all contained within the +interrupt-disabled section in the driver. Whilst the driver's interrupt +routine is executing, the driver's core may not run on the same CPU, and its +interrupt is not permitted to happen again until the current interrupt has been +handled, thus the interrupt handler does not need to lock against that. + + +However, consider the following example: + + CPU 1 CPU 2 + =============================== =============================== + [A is 0 and B is 0] + DISABLE IRQ + *A = 1; + smp_wmb(); + *B = 2; + ENABLE IRQ + <interrupt> + *A = 3 + a = *A; + b = *B; + smp_wmb(); + *B = 4; + </interrupt> + +CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B +== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3. 
+ +This might happen because the write "*B = 2" might occur after the write "*A = +3" - in which case the former write has leaked from the interrupt-disabled +section into the interrupt handler. In this case, a lock of some +description should very probably be used. + + +This sort of problem might also occur with relaxed I/O ordering rules, if it's +permitted for I/O writes to cross. For instance, if a driver was talking to an +ethernet card that sports an address register and a data register: + + DISABLE IRQ + writew(ADR, ctl_reg_3); + writew(DATA, y); + ENABLE IRQ + <interrupt> + writew(ADR, ctl_reg_4); + q = readw(DATA); + </interrupt> + +In such a case, an mmiowb() is needed, firstly to prevent the first write to +the address register from occurring after the write to the data register, and +secondly to prevent the write to the data register from happening after the +second write to the address register. + + +================================= +EXPLICIT KERNEL COMPILER BARRIERS +================================= + +The Linux kernel has an explicit compiler barrier function that prevents the +compiler from moving the memory accesses either side of it to the other side: + + barrier(); + +This has no direct effect on the CPU, which may then reorder things however it +wishes. + + +In addition, accesses to "volatile" memory locations and volatile asm +statements act as implicit compiler barriers. Note, however, that the use of +volatile has two negative consequences: + + (1) it causes the generation of poorer code, and + + (2) it can affect serialisation of events in code distant from the declaration + (consider a structure defined in a header file that has a volatile member + being accessed by the code in a source file). + +The Linux coding style therefore strongly favours the use of explicit barriers +except in small and specific cases. In general, volatile should be avoided.
+ + +=============================== +EXPLICIT KERNEL MEMORY BARRIERS +=============================== + +The Linux kernel has six basic CPU memory barriers: + + MANDATORY SMP CONDITIONAL + =============== =============== + GENERAL mb() smp_mb() + READ rmb() smp_rmb() + WRITE wmb() smp_wmb() + +General memory barriers give a guarantee that all memory accesses specified +before the barrier will appear to happen before all memory accesses specified +after the barrier with respect to the other components of the system. + +Read and write memory barriers give similar guarantees, but only for memory +reads versus memory reads and memory writes versus memory writes respectively. + +All memory barriers imply compiler barriers. + +SMP memory barriers are only compiler barriers on uniprocessor compiled systems +because it is assumed that a CPU will be apparently self-consistent, and will +order overlapping accesses correctly with respect to itself. + +There is no guarantee that any of the memory accesses specified before a memory +barrier will be complete by the completion of a memory barrier; the barrier can +be considered to draw a line in that CPU's access queue that accesses of the +appropriate type may not cross. + +There is no guarantee that issuing a memory barrier on one CPU will have any +direct effect on another CPU or any other hardware in the system. The indirect +effect will be the order in which the second CPU sees the first CPU's accesses +occur. + +There is no guarantee that some intervening piece of off-the-CPU hardware[*] +will not reorder the memory accesses. CPU cache coherency mechanisms should +propagate the indirect effects of a memory barrier between CPUs. + + [*] For information on bus mastering DMA and coherency please read: + + Documentation/pci.txt + Documentation/DMA-mapping.txt + Documentation/DMA-API.txt + +Note that these are the _minimum_ guarantees.
Different architectures may give +more substantial guarantees, but they may not be relied upon outside of arch +specific code. + + +There are some more advanced barrier functions: + + (*) set_mb(var, value) + (*) set_wmb(var, value) + + These assign the value to the variable and then insert at least a write + barrier after it, depending on the function. They aren't guaranteed to + insert anything more than a compiler barrier in a UP compilation. + + +=============================== +IMPLICIT KERNEL MEMORY BARRIERS +=============================== + +Some of the other functions in the Linux kernel imply memory barriers; amongst +them are locking and scheduling functions and interrupt management functions. + +This specification is a _minimum_ guarantee; any particular architecture may +provide more substantial guarantees, but these may not be relied upon outside +of arch specific code. + + +LOCKING FUNCTIONS +----------------- + +All the following locking functions imply barriers: + + (*) spin locks + (*) R/W spin locks + (*) mutexes + (*) semaphores + (*) R/W semaphores + +In all cases there are variants on a LOCK operation and an UNLOCK operation. + + (*) LOCK operation implication: + + Memory accesses issued after the LOCK will be completed after the LOCK + accesses have completed. + + Memory accesses issued before the LOCK may be completed after the LOCK + accesses have completed. + + (*) UNLOCK operation implication: + + Memory accesses issued before the UNLOCK will be completed before the + UNLOCK accesses have completed. + + Memory accesses issued after the UNLOCK may be completed before the UNLOCK + accesses have completed. + + (*) LOCK vs UNLOCK implication: + + The LOCK accesses will be completed before the UNLOCK accesses. + +And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but +a LOCK followed by an UNLOCK isn't.
+ +Locks and semaphores may not provide any guarantee of ordering on UP compiled +systems, and so can't be counted on in such a situation to actually do anything +at all, especially with respect to I/O accesses, unless combined with interrupt +disabling operations. + +See also the section on "Inter-CPU locking barrier effects". + + +As an example, consider the following: + + *A = a; + *B = b; + LOCK + *C = c; + *D = d; + UNLOCK + *E = e; + *F = f; + +The following sequence of events is acceptable: + + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK + +But none of the following are: + + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E + + +INTERRUPT DISABLING FUNCTIONS +----------------------------- + +Functions that disable interrupts (LOCK equivalent) and enable interrupts +(UNLOCK equivalent) will barrier memory and I/O accesses versus memory and I/O +accesses done in the interrupt handler. This prevents an interrupt routine +interfering with accesses made in an interrupt-disabled section of code and vice +versa. + +Note that whilst disabling or enabling interrupts act as compiler barriers +under all circumstances, they only act as memory barriers with respect to +interrupts, not with respect to nested sections.
+ +Consider the following: + + <interrupt> + *X = x; + </interrupt> + *A = a; + SAVE IRQ AND DISABLE + *B = b; + SAVE IRQ AND DISABLE + *C = c; + RESTORE IRQ + *D = d; + RESTORE IRQ + *E = e; + <interrupt> + *Y = y; + </interrupt> + +It is acceptable to observe the following sequences of events: + + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y } + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E } + { INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y } + { INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y } + { INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y } + +But not the following: + + { INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y } + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E } + + +MISCELLANEOUS FUNCTIONS +----------------------- + +Other functions that imply barriers: + + (*) schedule() and similar imply full memory barriers. + + +================================= +INTER-CPU LOCKING BARRIER EFFECTS +================================= + +On SMP systems, locking primitives give a more substantial form of barrier: one +that does affect memory access ordering on other CPUs, within the context of +conflict on any particular lock. + + +LOCKS VS MEMORY ACCESSES +------------------------ + +Consider the following: the system has a pair of spinlocks (M) and (Q), and +three CPUs; then should the following sequence of events occur: + + CPU 1 CPU 2 + =============================== =============================== + *A = a; *E = e; + LOCK M LOCK Q + *B = b; *F = f; + *C = c; *G = g; + UNLOCK M UNLOCK Q + *D = d; *H = h; + +Then there is no guarantee as to what order CPU #3 will see the accesses to *A +through *H occur in, other than the constraints imposed by the separate locks +on the separate CPUs.
It might, for example, see: + + *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M + +But it won't see any of: + + *B, *C or *D preceding LOCK M + *A, *B or *C following UNLOCK M + *F, *G or *H preceding LOCK Q + *E, *F or *G following UNLOCK Q + + +However, if the following occurs: + + CPU 1 CPU 2 + =============================== =============================== + *A = a; + LOCK M [1] + *B = b; + *C = c; + UNLOCK M [1] + *D = d; *E = e; + LOCK M [2] + *F = f; + *G = g; + UNLOCK M [2] + *H = h; + +CPU #3 might see: + + *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], + LOCK M [2], *H, *F, *G, UNLOCK M [2], *D + +But assuming CPU #1 gets the lock first, it won't see any of: + + *B, *C, *D, *F, *G or *H preceding LOCK M [1] + *A, *B or *C following UNLOCK M [1] + *F, *G or *H preceding LOCK M [2] + *A, *B, *C, *E, *F or *G following UNLOCK M [2] + + +LOCKS VS I/O ACCESSES +--------------------- + +Under certain circumstances (such as NUMA), I/O accesses within two spinlocked +sections on two different CPUs may be seen as interleaved by the PCI bridge. + +For example: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + writel(1, DATA); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + writel(5, DATA); + spin_unlock(Q); + +may be seen by the PCI bridge as follows: + + WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5 + +which would probably break. + +What is necessary here is to insert an mmiowb() before dropping the spinlock, +for example: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + writel(1, DATA); + mmiowb(); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + writel(5, DATA); + mmiowb(); + spin_unlock(Q); + +this will ensure that the two writes issued on CPU #1 appear at the PCI bridge +before either of the writes issued on CPU #2. 
+ + +Furthermore, following a write by a read to the same device is okay, because +the read forces the write to complete before the read is performed: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + a = readl(DATA); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + b = readl(DATA); + spin_unlock(Q); + + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +========================== +KERNEL I/O BARRIER EFFECTS +========================== + +When accessing I/O memory, drivers should use the appropriate accessor +functions: + + (*) inX(), outX(): + + These are intended to talk to I/O space rather than memory space, but + that's primarily a CPU-specific concept. The i386 and x86_64 processors do + indeed have special I/O space access cycles and instructions, but many + CPUs don't have such a concept. + + The PCI bus, amongst others, defines an I/O space concept - which on CPUs + such as i386 and x86_64 readily maps to the CPU's concept of I/O + space. However, it may also be mapped as a virtual I/O space in the CPU's + memory map, particularly on those CPUs that don't support alternate + I/O spaces. + + Accesses to this space may be fully synchronous (as on i386), but + intermediary bridges (such as the PCI host bridge) may not fully honour + that. + + They are guaranteed to be fully ordered with respect to each other. + + They are not guaranteed to be fully ordered with respect to other types of + memory and I/O operation. + + (*) readX(), writeX(): + + Whether these are guaranteed to be fully ordered and uncombined with + respect to each other on the issuing CPU depends on the characteristics + defined for the memory window through which they're accessing. On later + i386 architecture machines, for example, this is controlled by way of the + MTRR registers.
+ + Ordinarily, these will be guaranteed to be fully ordered and uncombined, + provided they're not accessing a prefetchable device. + + However, intermediary hardware (such as a PCI bridge) may indulge in + deferral if it so wishes; to flush a write, a read from the same location + is preferred[*], but a read from the same device or from configuration + space should suffice for PCI. + + [*] NOTE! attempting to read from the same location as was written to may + cause a malfunction - consider the 16550 Rx/Tx serial registers for + example. + + Used with prefetchable I/O memory, an mmiowb() barrier may be required to + force writes to be ordered. + + Please refer to the PCI specification for more information on interactions + between PCI transactions. + + (*) readX_relaxed() + + These are similar to readX(), but are not guaranteed to be ordered in any + way. Be aware that there is no I/O read barrier available. + + (*) ioreadX(), iowriteX() + + These will perform as appropriate for the type of access they're actually + doing, be it inX()/outX() or readX()/writeX().
+ + +========== +REFERENCES +========== + +AMD64 Architecture Programmer's Manual Volume 2: System Programming + Chapter 7.1: Memory-Access Ordering + Chapter 7.4: Buffering and Combining Memory Writes + +IA-32 Intel Architecture Software Developer's Manual, Volume 3: +System Programming Guide + Chapter 7.1: Locked Atomic Operations + Chapter 7.2: Memory Ordering + Chapter 7.4: Serializing Instructions + +The SPARC Architecture Manual, Version 9 + Chapter 8: Memory Models + Appendix D: Formal Specification of the Memory Models + Appendix J: Programming with the Memory Models + +UltraSPARC Programmer Reference Manual + Chapter 5: Memory Accesses and Cacheability + Chapter 15: Sparc-V9 Memory Models + +UltraSPARC III Cu User's Manual + Chapter 9: Memory Models + +UltraSPARC IIIi Processor User's Manual + Chapter 8: Memory Models + +UltraSPARC Architecture 2005 + Chapter 9: Memory + Appendix D: Formal Specifications of the Memory Models + +UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 + Chapter 8: Memory Models + Appendix F: Caches and Cache Coherency + +Solaris Internals, Core Kernel Architecture, p63-68: + Chapter 3.3: Hardware Considerations for Locks and + Synchronization + +Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching +for Kernel Programmers: + Chapter 13: Other Memory Models + +Intel Itanium Architecture Software Developer's Manual: Volume 1: + Section 2.6: Speculation + Section 4.4: Memory Access ^ permalink raw reply related [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #3] 2006-03-08 19:37 ` [PATCH] Document Linux's memory barriers [try #3] David Howells @ 2006-03-09 14:01 ` David Howells 0 siblings, 0 replies; 89+ messages in thread From: David Howells @ 2006-03-09 14:01 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel I'm thinking of adding the attached to the document. Any comments or objections? David diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 6eeb7e4..f9a9192 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -4,6 +4,8 @@ Contents: + (*) What do we consider memory? + (*) What are memory barriers? (*) Where are memory barriers needed? @@ -32,6 +34,82 @@ Contents: (*) References. +=========================== +WHAT DO WE CONSIDER MEMORY? +=========================== + +For the purpose of this specification, "memory", at least as far as cached CPU +vs CPU interactions go, has to include the CPU caches in the system. Although +any particular read or write may not actually appear outside of the CPU that +issued it because the CPU was able to satisfy it from its own cache, it's still +as if the memory access had taken place as far as the other CPUs are concerned +since the cache coherency and ejection mechanisms will propagate the effects +upon conflict.
+ +Consider the system logically as: + + <--- CPU ---> : <----------- Memory -----------> + : + +--------+ +--------+ : +--------+ +-----------+ + | | | | : | | | | +---------+ + | CPU | | Memory | : | CPU | | | | | + | Core |--->| Access |----->| Cache |<-->| | | | + | | | Queue | : | | | |--->| Memory | + | | | | : | | | | | | + +--------+ +--------+ : +--------+ | | | | + : | Cache | +---------+ + : | Coherency | + : | Mechanism | +---------+ + +--------+ +--------+ : +--------+ | | | | + | | | | : | | | | | | + | CPU | | Memory | : | CPU | | |--->| Device | + | Core |--->| Access |----->| Cache |<-->| | | | + | | | Queue | : | | | | | | + | | | | : | | | | +---------+ + +--------+ +--------+ : +--------+ +-----------+ + : + : + +The CPU core may execute instructions in any order it deems fit, provided the +expected program causality appears to be maintained. Some of the instructions +generate load and store operations which then go into the memory access queue +to be performed. The core may place these in the queue in any order it wishes, +and continue execution until it is forced to wait for an instruction to +complete. + +What memory barriers are concerned with is controlling the order in which +accesses cross from the CPU side of things to the memory side of things, and +the order in which the effects are perceived to happen by the other observers +in the system. + + +Note that the above model does not show uncached memory or I/O accesses. 
These +proceed directly from the queue to the memory or the devices, bypassing any +cache coherency: + + <--- CPU ---> : + : +-----+ + +--------+ +--------+ : | | + | | | | : | | +---------+ + | CPU | | Memory | : | | | | + | Core |--->| Access |--------------->| | | | + | | | Queue | : | |------------->| Memory | + | | | | : | | | | + +--------+ +--------+ : | | | | + : | | +---------+ + : | Bus | + : | | +---------+ + +--------+ +--------+ : | | | | + | | | | : | | | | + | CPU | | Memory | : | |<------------>| Device | + | Core |--->| Access |--------------->| | | | + | | | Queue | : | | | | + | | | | : | | +---------+ + +--------+ +--------+ : | | + : +-----+ + : + + ========================= WHAT ARE MEMORY BARRIERS? ========================= @@ -448,8 +526,8 @@ In all cases there are variants on a LOC The LOCK accesses will be completed before the UNLOCK accesses. -And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but -a LOCK followed by an UNLOCK isn't. + Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, + but a LOCK followed by an UNLOCK is not. Locks and semaphores may not provide any guarantee of ordering on UP compiled systems, and so can't be counted on in such a situation to actually do anything ^ permalink raw reply related [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-07 17:40 [PATCH] Document Linux's memory barriers David Howells ` (4 preceding siblings ...) 2006-03-08 14:37 ` [PATCH] Document Linux's memory barriers [try #2] David Howells @ 2006-03-08 16:18 ` Pavel Machek 2006-03-08 20:16 ` David Howells 5 siblings, 1 reply; 89+ messages in thread From: Pavel Machek @ 2006-03-08 16:18 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Hi! > +There are some more advanced barriering functions: > + > + (*) set_mb(var, value) > + (*) set_wmb(var, value) > + > + These assign the value to the variable and then insert at least a write > + barrier after it, depending on the function. > + I... don't understand what these do. Better explanation would help.. .what is function? Does it try to say that set_mb(var, value) is equivalent to var = value; mb(); but here mb() affects that one variable, only? > +In all cases there are variants on a LOCK operation and an UNLOCK operation. > + > + (*) LOCK operation implication: > + > + Memory accesses issued after the LOCK will be completed after the LOCK > + accesses have completed. "LOCK access"? Does it try to say that ...will be completed after any access inside lock region is completed? ("LOCK" looks very much like well-known i386 prefix. Calling it *_lock() or something would avoid that confusion. Fortunately there's no UNLOCK instruction :-) > + (*) UNLOCK operation implication: > + > + Memory accesses issued before the UNLOCK will be completed before the > + UNLOCK accesses have completed. > + > + Memory accesses issued after the UNLOCK may be completed before the UNLOCK > + accesses have completed. > + > + (*) LOCK vs UNLOCK implication: > + > + The LOCK accesses will be completed before the unlock accesses. ~~~~~~ capital? Or lower it everywhere? 
> +============================== > +I386 AND X86_64 SPECIFIC NOTES > +============================== > + > +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the > +bus appear in program order - and so there's no requirement for any sort of > +explicit memory barriers. > + > +From the Pentium-III onwards were three new memory barrier instructions: > +LFENCE, SFENCE and MFENCE which correspond to the kernel memory barrier > +functions rmb(), wmb() and mb(). However, there are additional implicit memory > +barriers in the CPU implementation: > + > + (*) Normal writes imply a semi-rmb(): reads before a write may not complete > + after that write, but reads after a write may complete before the write > + (ie: reads may go _ahead_ of writes). This makes it sound like pentium-III+ is incompatible with previous CPUs. Is it really the case? Pavel -- Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted... ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 16:18 ` [PATCH] Document Linux's memory barriers Pavel Machek @ 2006-03-08 20:16 ` David Howells 2006-03-08 22:01 ` Alan Cox 0 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 20:16 UTC (permalink / raw) To: Pavel Machek Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Pavel Machek <pavel@ucw.cz> wrote: > > + (*) set_mb(var, value) > > + (*) set_wmb(var, value) > > + > > + These assign the value to the variable and then insert at least a write > > + barrier after it, depending on the function. > > + > > I... don't understand what these do. Better explanation would > help.. .what is function? I can only guess, and hope someone corrects me if I'm wrong. > Does it try to say that set_mb(var, value) is equivalent to var = > value; mb(); Yes. > but here mb() affects that one variable, only? No. set_*mb() is simply a canned sequence of assignment, memory barrier. The type of barrier inserted depends on which function you choose. set_mb() inserts an mb() and set_wmb() inserts a wmb(). > "LOCK access"? The LOCK and UNLOCK functions presumably make at least one memory write apiece to manipulate the target lock (on SMP at least). > Does it try to say that ...will be completed after any access inside lock > region is completed? No. What you get in effect is something like: LOCK { *lock = q; } *A = a; *B = b; UNLOCK { *lock = u; } Except that the accesses to the lock memory are made using special procedures (LOCK prefixed instructions, XCHG, CAS/CMPXCHG, LL/SC, etc). > This makes it sound like pentium-III+ is incompatible with previous > CPUs. Is it really the case? Yes - hence the alternative instruction stuff. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers 2006-03-08 16:18 ` [PATCH] Document Linux's memory barriers Pavel Machek @ 2006-03-08 20:16 ` David Howells 2006-03-08 22:01 ` Alan Cox 0 siblings, 1 reply; 89+ messages in thread From: David Howells @ 2006-03-08 20:16 UTC (permalink / raw) To: Pavel Machek Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Pavel Machek <pavel@ucw.cz> wrote: > > + (*) set_mb(var, value) > > + (*) set_wmb(var, value) > > + > > + These assign the value to the variable and then insert at least a write > > + barrier after it, depending on the function. > > + > > I... don't understand what these do. Better explanation would > help.. .what is function? I can only guess, and hope someone corrects me if I'm wrong. > Does it try to say that set_mb(var, value) is equivalent to var = > value; mb(); Yes. > but here mb() affects that one variable, only? No. set_*mb() is simply a canned sequence of assignment, memory barrier. The type of barrier inserted depends on which function you choose. set_mb() inserts an mb() and set_wmb() inserts a wmb(). > "LOCK access"? The LOCK and UNLOCK functions presumably make at least one memory write apiece to manipulate the target lock (on SMP at least). > Does it try to say that ...will be completed after any access inside lock > region is completed? No. What you get in effect is something like: LOCK { *lock = q; } *A = a; *B = b; UNLOCK { *lock = u; } Except that the accesses to the lock memory are made using special procedures (LOCK prefixed instructions, XCHG, CAS/CMPXCHG, LL/SC, etc). > This makes it sound like pentium-III+ is incompatible with previous > CPUs. Is it really the case? Yes - hence the alternative instruction stuff. David ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] Document Linux's memory barriers

From: David Howells @ 2006-03-09 11:41 UTC (permalink / raw)
To: Alan Cox
Cc: Pavel Machek, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > The LOCK and UNLOCK functions presumably make at least one memory write
> > apiece to manipulate the target lock (on SMP at least).
>
> No, they merely perform the bus transactions necessary to perform an
> update atomically. They are, however, "serializing" instructions, which
> means they do cause a certain amount of serialization (see the Intel
> architecture manual on serializing instructions for detail).
>
> Athlon and later know how to turn it from locked memory accesses into
> merely an exclusive cache line grab.

So, you're saying that the LOCK and UNLOCK primitives don't actually modify
memory, but rather simply pin the cacheline into the CPU's cache and refuse
to let anyone else touch it?

No... it can't work like that. It *must* make a memory modification - after
all, the CPU doesn't know that what it's doing is a spin_unlock(), say,
rather than an atomic_set().

David
* Re: [PATCH] Document Linux's memory barriers

From: Alan Cox @ 2006-03-09 12:28 UTC (permalink / raw)
To: David Howells
Cc: Pavel Machek, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Iau, 2006-03-09 at 11:41 +0000, David Howells wrote:
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse
> to let anyone else touch it?

Basically, yes.

> No... it can't work like that. It *must* make a memory modification

Then you'll have to argue with the chip designers, because it doesn't.

It's all built around the cache coherency: to make a write to a cache line,
I must be the sole owner of the line. Look up "MESI cache" in a good book on
the subject.

If we own the affected line, then we can update just the cache, and since we
own the cache line we will write it back if anyone else asks for it (or,
nowadays on some systems, transfer it direct to the other CPU), so we get
locked semantics.
* Re: [PATCH] Document Linux's memory barriers

From: David Howells @ 2006-03-09 13:02 UTC (permalink / raw)
To: Alan Cox
Cc: Pavel Machek, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > So, you're saying that the LOCK and UNLOCK primitives don't actually
> > modify memory, but rather simply pin the cacheline into the CPU's cache
> > and refuse to let anyone else touch it?
>
> Basically, yes.

What you said is incomplete: the cacheline is wangled into the Exclusive
state, and there it sits until modified (at which point it shifts to the
Modified state) or stolen (when it shifts to the Shared state). Whilst the
x86 CPU might pin it there for the duration of the execution of the locked
instruction, it can't leave it there until it detects a spin_unlock() or
equivalent.

I guess LL/SC and LWARX/STWCX work by the reserved load wangling the
cacheline into the Exclusive state, and then the conditional store only doing
the store if the cacheline is still in that state. I don't know whether the
conditional store may modify a cacheline that's in the Modified state, but
I'd guess you'd need more state than that, because you have to pair it with a
load reserved.

With inter-CPU memory barriers I think you have to consider the cache part of
the memory, not part of the CPU. The CPU _does_ make a memory modification;
it's just that it doesn't proceed any further than the cache, until the cache
coherency mechanisms transfer the change to another CPU, or until the cache
becomes full and the lock's line gets ejected.

> > No... it can't work like that. It *must* make a memory modification
>
> Then you'll have to argue with the chip designers, because it doesn't.
>
> It's all built around the cache coherency: to make a write to a cache line,
> I must be the sole owner of the line. Look up "MESI cache" in a good book
> on the subject.

http://en.wikipedia.org/wiki/MESI_protocol

And a picture of the state machine may be found here:

https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm

David
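The Exclusive/Modified/Shared transitions traded back and forth above can be sketched as a toy state machine. This is purely illustrative: a real coherency protocol has more events and snoop cases, and variants such as MOESI have more states.

```c
#include <assert.h>

/* Toy model of the MESI transitions described in the discussion, for a
 * cache line holding a lock.  Not a complete protocol: just the three
 * events that matter for the locked-update scenario.
 */
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

enum mesi_event {
	LOCAL_RFO,	/* read-for-ownership, e.g. starting a LOCK'd op */
	LOCAL_WRITE,	/* the locked update lands in the cache */
	REMOTE_READ	/* another CPU asks for the line */
};

static enum mesi mesi_next(enum mesi s, enum mesi_event ev)
{
	switch (ev) {
	case LOCAL_RFO:
		return s == MODIFIED ? MODIFIED : EXCLUSIVE;
	case LOCAL_WRITE:
		return MODIFIED;	/* must own the line to write it */
	case REMOTE_READ:
		return SHARED;		/* supply the data and demote */
	}
	return s;
}
```

Walking a locked update through the model: RFO takes the line Exclusive, the write takes it Modified, and a remote reader (e.g. a contending CPU spinning on the lock) demotes it to Shared, which matches David's description above.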
* Re: [PATCH] Document Linux's memory barriers

From: Linus Torvalds @ 2006-03-09 16:32 UTC (permalink / raw)
To: David Howells
Cc: Alan Cox, Pavel Machek, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Thu, 9 Mar 2006, David Howells wrote:
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse
> to let anyone else touch it?
>
> No... it can't work like that. It *must* make a memory modification - after
> all, the CPU doesn't know that what it's doing is a spin_unlock(), say,
> rather than an atomic_set().

Basically, as long as nobody else is reading the lock, the lock will stay in
the caches. Only old and stupid architectures go out to the bus for locking.

For example, I remember the original alpha "load-locked"/"store-conditional",
and it was totally _horrible_ for anything that wanted performance, because
it would do the "pending lock" bit on the bus, so it took hundreds of cycles
even on UP. Gods, how I hated that. It made it almost totally useless for
anything that just wanted to be irq-safe - it was cheaper to just disable
interrupts, iirc. STUPID.

All modern CPUs do atomic operations entirely within the cache coherency
logic. I think x86 still supports the notion of a "locked cycle" on the bus,
but I think that's entirely relegated to horrible people doing locked
operations across PCI, and quite frankly, I suspect that it doesn't actually
mean a thing (ie I'd expect no external hardware to actually react to the
lock signal). However, nobody really cares, since nobody would be crazy
enough to do locked cycles over PCI even if they were to work.

So in practice, as far as I know, the way _all_ modern CPUs do locked cycles
is by getting exclusive ownership of the cacheline on the read, and either
having logic in place to refuse to release the cacheline until the write is
complete (ie "locked cycles to the cache"), or re-trying the instruction if
the cacheline has been released by the time the write is ready (ie
"load-locked" + "store-conditional" + "potentially loop" to the cache).
NOBODY goes out to the bus for locking any more. That would be insane and
stupid.

Yes, many spinlocks see contention, and end up going out to the bus. But
similarly, many spinlocks do _not_ see any contention at all (or other CPUs
even looking at them), and may end up staying exclusive in a CPU cache for a
long time.

The "no contention" case is actually pretty important. Many real loads on
SMP end up being largely single-threaded, and together with some basic CPU
affinity, you really _really_ want to make that single-threaded case go as
fast as possible. And a pretty big part of that is locking: the difference
between a lock that goes to the bus and one that does not is _huge_.

And lots of trivial code is almost dominated by locking costs. In some
system calls on an SMP kernel, the locking cost can be (depending on how
good or bad the CPU is at them) quite noticeable. Just a simple small read()
will take several locks and/or do atomic ops, even if it was cached and it
looks "trivial".

Linus
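The "load-locked + store-conditional + potentially loop" shape Linus describes can be sketched with a C11 compare-exchange loop. This is a sketch of the pattern, not kernel code: on LL/SC machines (Alpha, PowerPC, ARM) the compiler emits lwarx/stwcx.-style pairs for the loop, while on x86 the compare-exchange becomes a LOCK CMPXCHG; in the uncontended case every retryable step hits only the locally-owned cache line, never the bus.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomic add-and-return built as an optimistic retry loop, mirroring
 * the LL/SC pattern: load the value, compute, then store only if the
 * line (value) is still what we loaded; otherwise retry.
 */
static int atomic_add_return_sketch(_Atomic int *v, int n)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	/* If another CPU stole the cacheline (changed *v) between the
	 * load and the store, the compare-exchange fails and refreshes
	 * 'old' with the current value, and we loop. */
	while (!atomic_compare_exchange_weak(v, &old, old + n))
		;
	return old + n;
}
```

With no contention, the loop body runs exactly once against the cached line, which is why the uncontended case Linus highlights is so much cheaper than a bus transaction.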
* Re: [PATCH] Document Linux's memory barriers

From: David Howells @ 2006-03-09 17:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Pavel Machek, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:

> Basically, as long as nobody else is reading the lock, the lock will stay
> in the caches.

I think for the purposes of talking about memory barriers, we consider the
cache to be part of the memory, since the cache coherency mechanisms will
give the same effect. I suppose the way the cache can be viewed as working is
that bits of memory are shuttled around between the CPUs, RAM and any other
devices that partake of the coherency mechanism.

> All modern CPUs do atomic operations entirely within the cache coherency
> logic.

I know that, and I think it's irrelevant to specifying memory barriers.

> I think x86 still supports the notion of a "locked cycle" on the bus,

I wonder if that's what XCHG and XADD do... There's no particular reason they
should be that much slower than LOCK INCL/DECL. Of course, I've only measured
this on my Dual-PPro test box, so other i386 arch CPUs may exhibit other
behaviour.

David
* Re: [PATCH] Document Linux's memory barriers

From: Linus Torvalds @ 2006-03-09 17:54 UTC (permalink / raw)
To: David Howells
Cc: Alan Cox, Pavel Machek, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Thu, 9 Mar 2006, David Howells wrote:
> I think for the purposes of talking about memory barriers, we consider the
> cache to be part of the memory, since the cache coherency mechanisms will
> give the same effect.

Yes and no.

The yes comes from the normal "smp_xxx()" barriers. As far as they are
concerned, the cache coherency means that caches are invisible.

The "no" comes from the IO side. Basically, since IO bypasses caches and
sometimes write buffers, it's simply not ordered wrt normal accesses. And
that's where "bus cycles" actually matter wrt barriers. If you have a barrier
that creates a bus cycle, it suddenly can be ordered wrt IO.

So the fact that x86 SMP ops basically never guarantee any bus cycles
basically means that they are fundamentally no-ops when it comes to IO
serialization. That was really my only point.

> > I think x86 still supports the notion of a "locked cycle" on the bus,
>
> I wonder if that's what XCHG and XADD do... There's no particular reason
> they should be that much slower than LOCK INCL/DECL. Of course, I've only
> measured this on my Dual-PPro test box, so other i386 arch CPUs may exhibit
> other behaviour.

I think it's an internal core implementation detail. I don't think they do
anything on the bus, but I suspect that they could easily generate less
optimized uops, simply because they didn't matter as much and didn't fit the
"normal" core uop sequence.

Linus
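The SMP-vs-IO ordering distinction above is what drivers run into in practice: a descriptor written to memory must be visible before the MMIO write that tells the device to fetch it, and an smp_* barrier is not enough for that. The following userspace model illustrates only the required ordering; wmb() here is a logging stub standing in for the kernel's mandatory write barrier, and the descriptor/doorbell names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Record the order of operations so the required sequence is visible. */
enum op { OP_DESC, OP_WMB, OP_DOORBELL };

static enum op trace[3];
static size_t ntrace;

static void write_descriptor(void) { trace[ntrace++] = OP_DESC; }
static void wmb(void)              { trace[ntrace++] = OP_WMB; }	/* stub */
static void ring_doorbell(void)    { trace[ntrace++] = OP_DOORBELL; }

static void submit_to_device(void)
{
	write_descriptor();	/* 1: fill in the DMA descriptor in RAM */
	wmb();			/* 2: force the descriptor out before... */
	ring_doorbell();	/* 3: ...the device is told to look at it */
}
```

In a real driver the doorbell write would be an MMIO access (e.g. via writel()), and the barrier between steps 1 and 3 must be one that actually orders against IO, which is Linus's point above.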
* Re: [PATCH] Document Linux's memory barriers

From: Linus Torvalds @ 2006-03-09 17:56 UTC (permalink / raw)
To: David Howells
Cc: Alan Cox, Pavel Machek, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel

On Thu, 9 Mar 2006, Linus Torvalds wrote:
> So the fact that x86 SMP ops basically never guarantee any bus cycles
> basically means that they are fundamentally no-ops when it comes to IO
> serialization. That was really my only point.

Side note: of course, locked cycles _do_ "serialize" the core. So they'll
stop at least the core write merging, and speculative reads. So they do have
some impact on IO, but they have no way of impacting things like write
posting etc that is outside the CPU.

Linus