* [PATCH] Document Linux's memory barriers [try #4] @ 2006-03-09 20:29 ` David Howells 2006-03-09 23:34 ` Paul Mackerras ` (3 more replies) 0 siblings, 4 replies; 33+ messages in thread From: David Howells @ 2006-03-09 20:29 UTC (permalink / raw) To: torvalds, akpm, mingo, alan; +Cc: linux-arch, linuxppc64-dev, linux-kernel The attached patch documents the Linux kernel's memory barriers. I've updated it from the comments I've been given. The per-arch notes sections are gone because it's clear that there are so many exceptions, that it's not worth having them. I've added a list of references to other documents. I've tried to get rid of the concept of memory accesses appearing on the bus; what matters is apparent behaviour with respect to other observers in the system. Interrupts barrier effects are now considered to be non-existent. They may be there, but you may not rely on them. I've added a couple of definition sections at the top of the document: one to specify the minimum execution model that may be assumed, the other to specify what this document refers to by the term "memory". Signed-Off-By: David Howells <dhowells@redhat.com> --- warthog>diffstat -p1 /tmp/mb.diff Documentation/memory-barriers.txt | 855 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 855 insertions(+) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt new file mode 100644 index 0000000..04c5c88 --- /dev/null +++ b/Documentation/memory-barriers.txt @@ -0,0 +1,855 @@ + ============================ + LINUX KERNEL MEMORY BARRIERS + ============================ + +Contents: + + (*) Assumed minimum execution ordering model. + + (*) What is considered memory? + + - Cached interactions. + - Uncached interactions. + + (*) What are memory barriers? + + (*) Where are memory barriers needed? + + - Accessing devices. + - Multiprocessor interaction. + - Interrupts. + + (*) Explicit kernel compiler barriers. + + (*) Explicit kernel memory barriers. + + (*) Implicit kernel memory barriers. + + - Locking functions. + - Interrupt disabling functions. + - Miscellaneous functions. + + (*) Inter-CPU locking barrier effects. + + - Locks vs memory accesses. + - Locks vs I/O accesses. + + (*) Kernel I/O barrier effects. + + (*) References. + + +======================================== +ASSUMED MINIMUM EXECUTION ORDERING MODEL +======================================== + +It has to be assumed that the conceptual CPU is weakly-ordered in all respects +but that it will maintain the appearance of program causality with respect to +itself. Some CPUs (such as i386 or x86_64) are more constrained than others +(such as powerpc or frv), and so the worst case must be assumed. + +This means that it must be considered that the CPU will execute its instruction +stream in any order it feels like - or even in parallel - provided that if an +instruction in the stream depends on the an earlier instruction, then that +earlier instruction must be sufficiently complete[*] before the later +instruction may proceed. + + [*] Some instructions have more than one effect[**] and different instructions + may depend on different effects. + + [**] Eg: changes to condition codes and registers; memory changes; barriers. + +A CPU may also discard any instruction sequence that ultimately winds up having +no effect. For example if two adjacent instructions both load an immediate +value into the same register, the first may be discarded. + + +Similarly, it has to be assumed that compiler might reorder the instruction +stream in any way it sees fit, again provided the appearance of causality is +maintained. + + +========================== +WHAT IS CONSIDERED MEMORY? +========================== + +For the purpose of this specification what's meant by "memory" needs to be +defined, and the division between CPU and memory needs to be marked out. + + +CACHED INTERACTIONS +------------------- + +As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU +caches in the system. Although any particular read or write may not actually +appear outside of the CPU that issued it (the CPU may may have been able to +satisfy it from its own cache), it's still as if the memory access had taken +place as far as the other CPUs are concerned since the cache coherency and +ejection mechanisms will propegate the effects upon conflict. + + [*] Also applies to CPU vs device when accessed through a cache. + +The system can be considered logically as: + + <--- CPU ---> : <----------- Memory -----------> + : + +--------+ +--------+ : +--------+ +-----------+ + | | | | : | | | | +---------+ + | CPU | | Memory | : | CPU | | | | | + | Core |--->| Access |----->| Cache |<-->| | | | + | | | Queue | : | | | |--->| Memory | + | | | | : | | | | | | + +--------+ +--------+ : +--------+ | | | | + : | Cache | +---------+ + : | Coherency | + : | Mechanism | +---------+ + +--------+ +--------+ : +--------+ | | | | + | | | | : | | | | | | + | CPU | | Memory | : | CPU | | |--->| Device | + | Core |--->| Access |----->| Cache |<-->| | | | + | | | Queue | : | | | | | | + | | | | : | | | | +---------+ + +--------+ +--------+ : +--------+ +-----------+ + : + : + +The CPU core may execute instructions in any order it deems fit, provided the +expected program causality appears to be maintained. Some of the instructions +generate load and store operations which then go into the memory access queue +to be performed. The core may place these in the queue in any order it wishes, +and continue execution until it is forced to wait for an instruction to +complete. + +What memory barriers are concerned with is controlling the order in which +accesses cross from the CPU side of things to the memory side of things, and +the order in which the effects are perceived to happen by the other observers +in the system. + + +UNCACHED INTERACTIONS +--------------------- + +Note that the above model does not show uncached memory or I/O accesses. These +procede directly from the queue to the memory or the devices, bypassing any +cache coherency: + + <--- CPU ---> : + : +-----+ + +--------+ +--------+ : | | + | | | | : | | +---------+ + | CPU | | Memory | : | | | | + | Core |--->| Access |--------------->| | | | + | | | Queue | : | |------------->| Memory | + | | | | : | | | | + +--------+ +--------+ : | | | | + : | | +---------+ + : | Bus | + : | | +---------+ + +--------+ +--------+ : | | | | + | | | | : | | | | + | CPU | | Memory | : | |<------------>| Device | + | Core |--->| Access |--------------->| | | | + | | | Queue | : | | | | + | | | | : | | +---------+ + +--------+ +--------+ : | | + : +-----+ + : + + +========================= +WHAT ARE MEMORY BARRIERS? +========================= + +Memory barriers are instructions to both the compiler and the CPU to impose an +apparent partial ordering between the memory access operations specified either +side of the barrier. They request that the sequence of memory events generated +appears to other components of the system as if the barrier is effective on +that CPU. + +Note that: + + (*) there's no guarantee that the sequence of memory events is _actually_ so + ordered. It's possible for the CPU to do out-of-order accesses _as long + as no-one is looking_, and then fix up the memory if someone else tries to + see what's going on (for instance a bus master device); what matters is + the _apparent_ order as far as other processors and devices are concerned; + and + + (*) memory barriers are only guaranteed to act within the CPU processing them, + and are not, for the most part, guaranteed to percolate down to other CPUs + in the system or to any I/O hardware that that CPU may communicate with. + + +For example, a programmer might take it for granted that the CPU will perform +memory accesses in exactly the order specified, so that if a CPU is, for +example, given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +They would then expect that the CPU will complete the memory access for each +instruction before moving on to the next one, leading to a definite sequence of +operations as seen by external observers in the system: + + read *A, write *B, read *C, read *D, write *E. + + +Reality is, of course, much messier. With many CPUs and compilers, this isn't +always true because: + + (*) reads are more likely to need to be completed immediately to permit + execution progress, whereas writes can often be deferred without a + problem; + + (*) reads can be done speculatively, and then the result discarded should it + prove not to be required; + + (*) the order of the memory accesses may be rearranged to promote better use + of the CPU buses and caches; + + (*) reads and writes may be combined to improve performance when talking to + the memory or I/O hardware that can do batched accesses of adjacent + locations, thus cutting down on transaction setup costs (memory and PCI + devices may be able to do this); and + + (*) the CPU's data cache may affect the ordering, though cache-coherency + mechanisms should alleviate this - once the write has actually hit the + cache. + +So what another CPU, say, might actually observe from the above piece of code +is: + + read *A, read {*C,*D}, write *E, write *B + + (By "read {*C,*D}" I mean a combined single read). + + +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ +accesses appear to be correctly ordered, without the need for a memory barrier. +For instance with the following code: + + X = *A; + *A = Y; + Z = *A; + +assuming no intervention by an external influence, it can be taken that: + + (*) X will hold the old value of *A, and will never happen after the write and + thus end up being given the value that was assigned to *A from Y instead; + and + + (*) Z will always be given the value in *A that was assigned there from Y, and + will never happen before the write, and thus end up with the same value + that was in *A initially. + +(This is ignoring the fact that the value initially in *A may appear to be the +same as the value assigned to *A from Y). + + +================================= +WHERE ARE MEMORY BARRIERS NEEDED? +================================= + +Under normal operation, access reordering is probably not going to be a problem +as a linear program will still appear to operate correctly. There are, +however, three circumstances where reordering definitely _could_ be a problem: + + +ACCESSING DEVICES +----------------- + +Many devices can be memory mapped, and so appear to the CPU as if they're just +memory locations. However, to control the device, the driver has to make the +right accesses in exactly the right order. + +Consider, for example, an ethernet chipset such as the AMD PCnet32. It +presents to the CPU an "address register" and a bunch of "data registers". The +way it's accessed is to write the index of the internal register to be accessed +to the address register, and then read or write the appropriate data register +to access the chip's internal register, which could - theoretically - be done +by: + + *ADR = ctl_reg_3; + reg = *DATA; + +The problem with a clever CPU or a clever compiler is that the write to the +address register isn't guaranteed to happen before the access to the data +register, if the CPU or the compiler thinks it is more efficient to defer the +address write: + + read *DATA, write *ADR + +then things will break. + + +In the Linux kernel, however, I/O should be done through the appropriate +accessor routines - such as inb() or writel() - which know how to make such +accesses appropriately sequential. + +On some systems, I/O writes are not strongly ordered across all CPUs, and so +locking should be used, and mmiowb() should be issued prior to unlocking the +critical section. + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +MULTIPROCESSOR INTERACTION +-------------------------- + +When there's a system with more than one processor, the CPUs in the system may +be working on the same set of data at the same time. This can cause +synchronisation problems, and the usual way of dealing with them is to use +locks - but locks are quite expensive, and so it may be preferable to operate +without the use of a lock if at all possible. In such a case accesses that +affect both CPUs may have to be carefully ordered to prevent error. + +Consider the R/W semaphore slow path. In that, a waiting process is queued on +the semaphore, as noted by it having a record on its stack linked to the +semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + +To wake up the waiter, the up_read() or up_write() functions have to read the +pointer from this record to know as to where the next waiter record is, clear +the task pointer, call wake_up_process() on the task, and release the reference +held on the waiter's task struct: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + +If any of these steps occur out of order, then the whole thing may fail. + +Note that the waiter does not get the semaphore lock again - it just waits for +its task pointer to be cleared. Since the record is on its stack, this means +that if the task pointer is cleared _before_ the next pointer in the list is +read, another CPU might start processing the waiter and it might clobber its +stack before up*() functions have a chance to read the next pointer. + + CPU 0 CPU 1 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + <preempt> + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + </preempt> + READ waiter->list.next; + --- OOPS --- + +This could be dealt with using a spinlock, but then the down_xxx() function has +to get the spinlock again after it's been woken up, which is a waste of +resources. + +The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + +In this case, the barrier makes a guarantee that all memory accesses before the +barrier will appear to happen before all the memory accesses after the barrier +with respect to the other CPUs on the system. It does _not_ guarantee that all +the memory accesses before the barrier will be complete by the time the barrier +itself is complete. + +SMP memory barriers are normally nothing more than compiler barriers on a +kernel compiled for a UP system because the CPU orders overlapping accesses +with respect to itself, and so CPU barriers aren't needed. + + +INTERRUPTS +---------- + +A driver may be interrupted by its own interrupt service routine, and thus they +may interfere with each other's attempts to control or access the device. + +This may be alleviated - at least in part - by disabling interrupts (a form of +locking), such that the critical operations are all contained within the +interrupt-disabled section in the driver. Whilst the driver's interrupt +routine is executing, the driver's core may not run on the same CPU, and its +interrupt is not permitted to happen again until the current interrupt has been +handled, thus the interrupt handler does not need to lock against that. + +However, consider a driver was talking to an ethernet card that sports an +address register and a data register. If that driver's core is talks to the +card under interrupt-disablement and then the driver's interrupt handler is +invoked: + + DISABLE IRQ + writew(ADDR, ctl_reg_3); + writew(DATA, y); + ENABLE IRQ + <interrupt> + writew(ADDR, ctl_reg_4); + q = readw(DATA); + </interrupt> + +If ordering rules are sufficiently relaxed, the write to the data register +might happen after the second write to the address register. + + +It must be assumed that accesses done inside an interrupt disabled section may +leak outside of it and may interleave with accesses performed in an interrupt +and vice versa unless implicit or explicit barriers are used. + +Normally this won't be a problem because the I/O accesses done inside such +sections will include synchronous read operations on strictly ordered I/O +registers that form implicit I/O barriers. If this isn't sufficient then an +mmiowb() may need to be used explicitly. + + +A similar situation may occur between an interrupt routine and two routines +running on separate CPUs that communicate with each other. If such a case is +likely, then interrupt-disabling locks should be used to guarantee ordering. + + +================================= +EXPLICIT KERNEL COMPILER BARRIERS +================================= + +The Linux kernel has an explicit compiler barrier function that prevents the +compiler from moving the memory accesses either side of it to the other side: + + barrier(); + +This has no direct effect on the CPU, which may then reorder things however it +wishes. + + +In addition, accesses to "volatile" memory locations and volatile asm +statements act as implicit compiler barriers. Note, however, that the use of +volatile has two negative consequences: + + (1) it causes the generation of poorer code, and + + (2) it can affect serialisation of events in code distant from the declaration + (consider a structure defined in a header file that has a volatile member + being accessed by the code in a source file). + +The Linux coding style therefore strongly favours the use of explicit barriers +except in small and specific cases. In general, volatile should be avoided. + + +=============================== +EXPLICIT KERNEL MEMORY BARRIERS +=============================== + +The Linux kernel has six basic CPU memory barriers: + + MANDATORY SMP CONDITIONAL + =============== =============== + GENERAL mb() smp_mb() + READ rmb() smp_rmb() + WRITE wmb() smp_wmb() + +General memory barriers give a guarantee that all memory accesses specified +before the barrier will appear to happen before all memory accesses specified +after the barrier with respect to the other components of the system. + +Read and write memory barriers give similar guarantees, but only for memory +reads versus memory reads and memory writes versus memory writes respectively. + +All memory barriers imply compiler barriers. + +SMP memory barriers are only compiler barriers on uniprocessor compiled systems +because it is assumed that a CPU will be apparently self-consistent, and will +order overlapping accesses correctly with respect to itself. + +There is no guarantee that any of the memory accesses specified before a memory +barrier will be complete by the completion of a memory barrier; the barrier can +be considered to draw a line in that CPU's access queue that accesses of the +appropriate type may not cross. + +There is no guarantee that issuing a memory barrier on one CPU will have any +direct effect on another CPU or any other hardware in the system. The indirect +effect will be the order in which the second CPU sees the first CPU's accesses +occur. + +There is no guarantee that some intervening piece of off-the-CPU hardware[*] +will not reorder the memory accesses. CPU cache coherency mechanisms should +propegate the indirect effects of a memory barrier between CPUs. + + [*] For information on bus mastering DMA and coherency please read: + + Documentation/pci.txt + Documentation/DMA-mapping.txt + Documentation/DMA-API.txt + +Note that these are the _minimum_ guarantees. Different architectures may give +more substantial guarantees, but they may not be relied upon outside of arch +specific code. + + +There are some more advanced barrier functions: + + (*) set_mb(var, value) + (*) set_wmb(var, value) + + These assign the value to the variable and then insert at least a write + barrier after it, depending on the function. They aren't guaranteed to + insert anything more than a compiler barrier in a UP compilation. + + +=============================== +IMPLICIT KERNEL MEMORY BARRIERS +=============================== + +Some of the other functions in the linux kernel imply memory barriers, amongst +which are locking and scheduling functions. + +This specification is a _minimum_ guarantee; any particular architecture may +provide more substantial guarantees, but these may not be relied upon outside +of arch specific code. + + +LOCKING FUNCTIONS +----------------- + +All the following locking functions imply barriers: + + (*) spin locks + (*) R/W spin locks + (*) mutexes + (*) semaphores + (*) R/W semaphores + +In all cases there are variants on a LOCK operation and an UNLOCK operation. + + (*) LOCK operation implication: + + Memory accesses issued after the LOCK will be completed after the LOCK + accesses have completed. + + Memory accesses issued before the LOCK may be completed after the LOCK + accesses have completed. + + (*) UNLOCK operation implication: + + Memory accesses issued before the UNLOCK will be completed before the + UNLOCK accesses have completed. + + Memory accesses issued after the UNLOCK may be completed before the UNLOCK + accesses have completed. + + (*) LOCK vs UNLOCK implication: + + The LOCK accesses will be completed before the UNLOCK accesses. + + Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, + but a LOCK followed by an UNLOCK is not. + +Locks and semaphores may not provide any guarantee of ordering on UP compiled +systems, and so can't be counted on in such a situation to actually do anything +at all, especially with respect to I/O accesses, unless combined with interrupt +disabling operations. + +See also the section on "Inter-CPU locking barrier effects". + + +As an example, consider the following: + + *A = a; + *B = b; + LOCK + *C = c; + *D = d; + UNLOCK + *E = e; + *F = f; + +The following sequence of events is acceptable: + + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK + +But none of the following are: + + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E + + +INTERRUPT DISABLING FUNCTIONS +----------------------------- + +Functions that disable interrupts (LOCK equivalent) and enable interrupts +(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O +barriers are required in such a situation, they must be provided from some +other means. + + +MISCELLANEOUS FUNCTIONS +----------------------- + +Other functions that imply barriers: + + (*) schedule() and similar imply full memory barriers. + + +================================= +INTER-CPU LOCKING BARRIER EFFECTS +================================= + +On SMP systems locking primitives give a more substantial form of barrier: one +that does affect memory access ordering on other CPUs, within the context of +conflict on any particular lock. + + +LOCKS VS MEMORY ACCESSES +------------------------ + +Consider the following: the system has a pair of spinlocks (N) and (Q), and +three CPUs; then should the following sequence of events occur: + + CPU 1 CPU 2 + =============================== =============================== + *A = a; *E = e; + LOCK M LOCK Q + *B = b; *F = f; + *C = c; *G = g; + UNLOCK M UNLOCK Q + *D = d; *H = h; + +Then there is no guarantee as to what order CPU #3 will see the accesses to *A +through *H occur in, other than the constraints imposed by the separate locks +on the separate CPUs. It might, for example, see: + + *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M + +But it won't see any of: + + *B, *C or *D preceding LOCK M + *A, *B or *C following UNLOCK M + *F, *G or *H preceding LOCK Q + *E, *F or *G following UNLOCK Q + + +However, if the following occurs: + + CPU 1 CPU 2 + =============================== =============================== + *A = a; + LOCK M [1] + *B = b; + *C = c; + UNLOCK M [1] + *D = d; *E = e; + LOCK M [2] + *F = f; + *G = g; + UNLOCK M [2] + *H = h; + +CPU #3 might see: + + *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], + LOCK M [2], *H, *F, *G, UNLOCK M [2], *D + +But assuming CPU #1 gets the lock first, it won't see any of: + + *B, *C, *D, *F, *G or *H preceding LOCK M [1] + *A, *B or *C following UNLOCK M [1] + *F, *G or *H preceding LOCK M [2] + *A, *B, *C, *E, *F or *G following UNLOCK M [2] + + +LOCKS VS I/O ACCESSES +--------------------- + +Under certain circumstances (such as NUMA), I/O accesses within two spinlocked +sections on two different CPUs may be seen as interleaved by the PCI bridge. + +For example: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + writel(1, DATA); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + writel(5, DATA); + spin_unlock(Q); + +may be seen by the PCI bridge as follows: + + WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5 + +which would probably break. + +What is necessary here is to insert an mmiowb() before dropping the spinlock, +for example: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + writel(1, DATA); + mmiowb(); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + writel(5, DATA); + mmiowb(); + spin_unlock(Q); + +this will ensure that the two writes issued on CPU #1 appear at the PCI bridge +before either of the writes issued on CPU #2. + + +Furthermore, following a write by a read to the same device is okay, because +the read forces the write to complete before the read is performed: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + a = readl(DATA); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + b = readl(DATA); + spin_unlock(Q); + + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +========================== +KERNEL I/O BARRIER EFFECTS +========================== + +When accessing I/O memory, drivers should use the appropriate accessor +functions: + + (*) inX(), outX(): + + These are intended to talk to I/O space rather than memory space, but + that's primarily a CPU-specific concept. The i386 and x86_64 processors do + indeed have special I/O space access cycles and instructions, but many + CPUs don't have such a concept. + + The PCI bus, amongst others, defines an I/O space concept - which on such + CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O + space. However, it may also mapped as a virtual I/O space in the CPU's + memory map, particularly on those CPUs that don't support alternate + I/O spaces. + + Accesses to this space may be fully synchronous (as on i386), but + intermediary bridges (such as the PCI host bridge) may not fully honour + that. + + They are guaranteed to be fully ordered with respect to each other. + + They are not guaranteed to be fully ordered with respect to other types of + memory and I/O operation. + + (*) readX(), writeX(): + + Whether these are guaranteed to be fully ordered and uncombined with + respect to each other on the issuing CPU depends on the characteristics + defined for the memory window through which they're accessing. On later + i386 architecture machines, for example, this is controlled by way of the + MTRR registers. + + Ordinarily, these will be guaranteed to be fully ordered and uncombined,, + provided they're not accessing a prefetchable device. + + However, intermediary hardware (such as a PCI bridge) may indulge in + deferral if it so wishes; to flush a write, a read from the same location + is preferred[*], but a read from the same device or from configuration + space should suffice for PCI. + + [*] NOTE! attempting to read from the same location as was written to may + cause a malfunction - consider the 16550 Rx/Tx serial registers for + example. + + Used with prefetchable I/O memory, an mmiowb() barrier may be required to + force writes to be ordered. + + Please refer to the PCI specification for more information on interactions + between PCI transactions. + + (*) readX_relaxed() + + These are similar to readX(), but are not guaranteed to be ordered in any + way. Be aware that there is no I/O read barrier available. + + (*) ioreadX(), iowriteX() + + These will perform as appropriate for the type of access they're actually + doing, be it inX()/outX() or readX()/writeX(). + + +========== +REFERENCES +========== + +AMD64 Architecture Programmer's Manual Volume 2: System Programming + Chapter 7.1: Memory-Access Ordering + Chapter 7.4: Buffering and Combining Memory Writes + +IA-32 Intel Architecture Software Developer's Manual, Volume 3: +System Programming Guide + Chapter 7.1: Locked Atomic Operations + Chapter 7.2: Memory Ordering + Chapter 7.4: Serializing Instructions + +The SPARC Architecture Manual, Version 9 + Chapter 8: Memory Models + Appendix D: Formal Specification of the Memory Models + Appendix J: Programming with the Memory Models + +UltraSPARC Programmer Reference Manual + Chapter 5: Memory Accesses and Cacheability + Chapter 15: Sparc-V9 Memory Models + +UltraSPARC III Cu User's Manual + Chapter 9: Memory Models + +UltraSPARC IIIi Processor User's Manual + Chapter 8: Memory Models + +UltraSPARC Architecture 2005 + Chapter 9: Memory + Appendix D: Formal Specifications of the Memory Models + +UltraSPARC T1 Supplment to the UltraSPARC Architecture 2005 + Chapter 8: Memory Models + Appendix F: Caches and Cache Coherency + +Solaris Internals, Core Kernel Architecture, p63-68: + Chapter 3.3: Hardware Considerations for Locks and + Synchronization + +Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching +for Kernel Programmers: + Chapter 13: Other Memory Models + +Intel Itanium Architecture Software Developer's Manual: Volume 1: + Section 2.6: Speculation + Section 4.4: Memory Access ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 20:29 ` [PATCH] Document Linux's memory barriers [try #4] David Howells @ 2006-03-09 23:34 ` Paul Mackerras 2006-03-09 23:45 ` Michael Buesch ` (2 more replies) 2006-03-10 5:28 ` Nick Piggin ` (2 subsequent siblings) 3 siblings, 3 replies; 33+ messages in thread From: Paul Mackerras @ 2006-03-09 23:34 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel David Howells writes: > +On some systems, I/O writes are not strongly ordered across all CPUs, and so > +locking should be used, and mmiowb() should be issued prior to unlocking the > +critical section. I think we should say more strongly that mmiowb() is required where MMIO accesses are done under a spinlock, and that if your driver is missing them then that is a bug. I don't think it makes sense to say that mmiowb is required "on some systems". At least, we should either make that statement, or we should not require any driver on any platform to use mmiowb explicitly. (In that case, the platforms that need it could do something like keep a per-cpu count of MMIO accesses, which is zeroed in spin_lock and incremented by read*/write*, and if spin_unlock finds it non-zero, it does the mmiowb().) Also, this section doesn't sound right to me: > + DISABLE IRQ > + writew(ADDR, ctl_reg_3); > + writew(DATA, y); > + ENABLE IRQ > + <interrupt> > + writew(ADDR, ctl_reg_4); > + q = readw(DATA); > + </interrupt> > + > +If ordering rules are sufficiently relaxed, the write to the data register > +might happen after the second write to the address register. > + > + > +It must be assumed that accesses done inside an interrupt disabled section may > +leak outside of it and may interleave with accesses performed in an interrupt > +and vice versa unless implicit or explicit barriers are used. > + > +Normally this won't be a problem because the I/O accesses done inside such > +sections will include synchronous read operations on strictly ordered I/O > +registers that form implicit I/O barriers. If this isn't sufficient then an > +mmiowb() may need to be used explicitly. There shouldn't be any problem here, because readw/writew _must_ ensure that the device accesses are serialized. Just saying "if this isn't sufficient" leaves the reader wondering when it might not be sufficient or how they would know when it wasn't sufficient, and introduces doubt where there needn't be any. Of course, on an SMP system it would be quite possible for the interrupt to be taken on another CPU, and in that case disabling interrupts (I assume that by "DISABLE IRQ" you mean local_irq_disable() or some such) gets you absolutely nothing; you need to use a spinlock, and then the mmiowb is required. You may like to include these words describing some of the rules: * If you have stores to regular memory, followed by an MMIO store, and you want the device to see the stores to regular memory at the point where it receives the MMIO store, then you need a wmb() between the stores to regular memory and the MMIO store. * If you have PIO or MMIO accesses, and you need to ensure the PIO/MMIO accesses don't get reordered with respect to PIO/MMIO accesses on another CPU, put the accesses inside a spin-locked region, and put a mmiowb() between the last access and the spin_unlock. * smp_wmb() doesn't necessarily do any ordering of MMIO accesses vs. other accesses, and in that sense it is weaker than wmb(). Paul. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 23:34 ` Paul Mackerras @ 2006-03-09 23:45 ` Michael Buesch 2006-03-09 23:56 ` Linus Torvalds 2006-03-10 0:48 ` Alan Cox 2006-03-10 15:19 ` David Howells 2 siblings, 1 reply; 33+ messages in thread From: Michael Buesch @ 2006-03-09 23:45 UTC (permalink / raw) To: Paul Mackerras Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel, David Howells [-- Attachment #1: Type: text/plain, Size: 888 bytes --] On Friday 10 March 2006 00:34, you wrote: > David Howells writes: > > > +On some systems, I/O writes are not strongly ordered across all CPUs, and so > > +locking should be used, and mmiowb() should be issued prior to unlocking the > > +critical section. > > I think we should say more strongly that mmiowb() is required where > MMIO accesses are done under a spinlock, and that if your driver is > missing them then that is a bug. I don't think it makes sense to say > that mmiowb is required "on some systems". So what about: #define spin_lock_mmio(lock) spin_lock(lock) #define spin_unlock_mmio(lock) do { spin_unlock(lock); mmiowb(); } while (0) #define spin_lock_mmio_irqsave(lock, flags) spin_lock_irqsave(lock, flags) #define spin_unlock_mmio_irqrestore(lock, flags) do { spin_unlock_irqrestore(lock, flags); mmiowb(); } while (0) -- Greetings Michael. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 23:45 ` Michael Buesch @ 2006-03-09 23:56 ` Linus Torvalds 2006-03-10 0:07 ` Michael Buesch 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2006-03-09 23:56 UTC (permalink / raw) To: Michael Buesch Cc: Paul Mackerras, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel, David Howells On Fri, 10 Mar 2006, Michael Buesch wrote: > > So what about: > > #define spin_lock_mmio(lock) spin_lock(lock) > #define spin_unlock_mmio(lock) do { spin_unlock(lock); mmiowb(); } while (0) You need to put the mmiowb() inside the spinlock. Yes, that is painful. But the point being that if it's outside, then when somebody else gets the lock, the previous lock-owners MMIO stores may still be in flight, which is what you didn't want in the first place. Anyway, no need to make a new name for it, since you might as well just use the mmiowb() explicitly. At least until this has been shown to be a really common pattern (it clearly isn't, right now ;) Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 23:56 ` Linus Torvalds @ 2006-03-10 0:07 ` Michael Buesch 0 siblings, 0 replies; 33+ messages in thread From: Michael Buesch @ 2006-03-10 0:07 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel, David Howells [-- Attachment #1: Type: text/plain, Size: 1215 bytes --] On Friday 10 March 2006 00:56, you wrote: > > On Fri, 10 Mar 2006, Michael Buesch wrote: > > > > So what about: > > > > #define spin_lock_mmio(lock) spin_lock(lock) > > #define spin_unlock_mmio(lock) do { spin_unlock(lock); mmiowb(); } while (0) > > You need to put the mmiowb() inside the spinlock. Ok, sorry. That was a typo. I should not do more than 3 things at the same time. :) > Yes, that is painful. But the point being that if it's outside, then when > somebody else gets the lock, the previous lock-owners MMIO stores may > still be in flight, which is what you didn't want in the first place. > > Anyway, no need to make a new name for it, since you might as well just > use the mmiowb() explicitly. At least until this has been shown to be a > really common pattern (it clearly isn't, right now ;) Ok, so maybe it is best if every device creates its own macros for convenience (if needed => if it is a common pattern in the scope of the driver). Example: #define bcm43xx_lock(bcm, flags) spin_lock_irqsave(&(bcm)->lock, flags) #define bcm43xx_unlock(bcm, flags) do { mmiowb(); spin_unlock_irqrestore(&(bcm)->lock, flags); } while (0) -- Greetings Michael. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 23:34 ` Paul Mackerras 2006-03-09 23:45 ` Michael Buesch @ 2006-03-10 0:48 ` Alan Cox 2006-03-10 0:54 ` Paul Mackerras 2006-03-10 15:19 ` David Howells 2 siblings, 1 reply; 33+ messages in thread From: Alan Cox @ 2006-03-10 0:48 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel On Fri, Mar 10, 2006 at 10:34:53AM +1100, Paul Mackerras wrote: > MMIO accesses are done under a spinlock, and that if your driver is > missing them then that is a bug. I don't think it makes sense to say > that mmiowb is required "on some systems". Agreed. But if it is missing it may not be a bug. It depends what the lock actually protects. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-10 0:48 ` Alan Cox @ 2006-03-10 0:54 ` Paul Mackerras 0 siblings, 0 replies; 33+ messages in thread From: Paul Mackerras @ 2006-03-10 0:54 UTC (permalink / raw) To: Alan Cox Cc: David Howells, torvalds, akpm, mingo, linux-arch, linuxppc64-dev, linux-kernel Alan Cox writes: > On Fri, Mar 10, 2006 at 10:34:53AM +1100, Paul Mackerras wrote: > > MMIO accesses are done under a spinlock, and that if your driver is > > missing them then that is a bug. I don't think it makes sense to say > > that mmiowb is required "on some systems". > > Agreed. But if it is missing it may not be a bug. It depends what the lock > actually protects. True. What I want is a statement that if one of the purposes of the spinlock is to provide ordering of the MMIO accesses, then leaving out the mmiowb is a bug. I want it to be like the PCI DMA API in that drivers are required to use it even on platforms where it's a no-op. Paul. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 23:34 ` Paul Mackerras 2006-03-09 23:45 ` Michael Buesch 2006-03-10 0:48 ` Alan Cox @ 2006-03-10 15:19 ` David Howells 2006-03-11 0:01 ` Paul Mackerras 2 siblings, 1 reply; 33+ messages in thread From: David Howells @ 2006-03-10 15:19 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel Paul Mackerras <paulus@samba.org> wrote: > > +On some systems, I/O writes are not strongly ordered across all CPUs, and > > so +locking should be used, and mmiowb() should be issued prior to > > unlocking the +critical section. > > I think we should say more strongly that mmiowb() is required where > MMIO accesses are done under a spinlock, and that if your driver is > missing them then that is a bug. I don't think it makes sense to say > that mmiowb is required "on some systems". The point I was trying to make was that on some systems writes are not strongly ordered, so we need mmiowb() on _all_ systems. I'll fix the text to make that point. > There shouldn't be any problem here, because readw/writew _must_ > ensure that the device accesses are serialized. No. That depends on the properties of the memory window readw/writew write through, the properties of the CPU wrt memory accesses, and what explicit barriers at interpolated inside readw/writew themselves. If we're accessing a frame buffer, for instance, we might want it to be able to reorder and combine reads and writes. > Of course, on an SMP system it would be quite possible for the > interrupt to be taken on another CPU, and in that case disabling > interrupts (I assume that by "DISABLE IRQ" you mean > local_irq_disable() or some such) Yes. There are quite a few different ways to disable interrupts. > gets you absolutely nothing; you need to use a spinlock, and then the mmiowb > is required. I believe I've said that, though perhaps not sufficiently clearly. > You may like to include these words describing some of the rules: Thanks, I probably will. David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-10 15:19 ` David Howells @ 2006-03-11 0:01 ` Paul Mackerras 0 siblings, 0 replies; 33+ messages in thread From: Paul Mackerras @ 2006-03-11 0:01 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel David Howells writes: > Paul Mackerras <paulus@samba.org> wrote: > > There shouldn't be any problem here, because readw/writew _must_ > > ensure that the device accesses are serialized. > > No. That depends on the properties of the memory window readw/writew write > through, the properties of the CPU wrt memory accesses, and what explicit > barriers at interpolated inside readw/writew themselves. The properties of the memory window are certainly relevant. For a non-prefetchable PCI MMIO region, the readw/writew must ensure that the accesses are serialized w.r.t. each other, although not necessarily serialized with accesses to normal memory. That is a requirement that the driver writer can rely on, and the implementor of readw/writew must ensure is met, taking into account the properties of the CPU (presumably by putting explicit barriers inside readw/write). For prefetchable regions, or if the cookie used with readw/writew has been obtained by something other than the normal ioremap{,_nocache}, then it's more of an open question. > > Of course, on an SMP system it would be quite possible for the > > interrupt to be taken on another CPU, and in that case disabling > > interrupts (I assume that by "DISABLE IRQ" you mean > > local_irq_disable() or some such) > > Yes. There are quite a few different ways to disable interrupts. I think it wasn't clear to me which of the following you meant: (a) Telling this CPU not to take any interrupts (e.g. local_irq_disable()) (b) Telling the interrupt controller not to allow interrupts from that device (e.g. disable_irq(irq_num)) (c) Telling the device not to generate interrupts in some device-specific fashion They all have different characteristics w.r.t. timing and synchronization, so I think it's important to be clear which one you mean. For example, if it's (c), then after doing a writel (or whatever) to the device, you then need at least to do a readl to make sure the write has got to the device, and even then there might be an interrupt signal still wending its way through the interrupt controller etc., which might arrive after the readl has finished. I think you meant (a), but DISABLE_IRQ actually sounds more like (b). Paul. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 20:29 ` [PATCH] Document Linux's memory barriers [try #4] David Howells 2006-03-09 23:34 ` Paul Mackerras @ 2006-03-10 5:28 ` Nick Piggin 2006-03-15 11:10 ` David Howells 2006-03-12 17:15 ` Eric W. Biederman 2006-03-15 14:23 ` [PATCH] Document Linux's memory barriers [try #5] David Howells 3 siblings, 1 reply; 33+ messages in thread From: Nick Piggin @ 2006-03-10 5:28 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel David Howells wrote: > +========================== > +WHAT IS CONSIDERED MEMORY? > +========================== > + > +For the purpose of this specification what's meant by "memory" needs to be > +defined, and the division between CPU and memory needs to be marked out. > + > + > +CACHED INTERACTIONS > +------------------- > + > +As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU > +caches in the system. Although any particular read or write may not actually > +appear outside of the CPU that issued it (the CPU may may have been able to > +satisfy it from its own cache), it's still as if the memory access had taken > +place as far as the other CPUs are concerned since the cache coherency and > +ejection mechanisms will propegate the effects upon conflict. > + Isn't the Alpha's split caches a counter-example of your model, because the coherency itself is out of order? Why do you need to include caches and queues in your model? Do programmers care? Isn't the following sufficient... : | m | CPU -----> | e | : | m | : | o | CPU -----> | r | : | y | ... and bugger the implementation details? > + [*] Also applies to CPU vs device when accessed through a cache. > + > +The system can be considered logically as: > + > + <--- CPU ---> : <----------- Memory -----------> > + : > + +--------+ +--------+ : +--------+ +-----------+ > + | | | | : | | | | +---------+ > + | CPU | | Memory | : | CPU | | | | | > + | Core |--->| Access |----->| Cache |<-->| | | | > + | | | Queue | : | | | |--->| Memory | > + | | | | : | | | | | | > + +--------+ +--------+ : +--------+ | | | | > + : | Cache | +---------+ > + : | Coherency | > + : | Mechanism | +---------+ > + +--------+ +--------+ : +--------+ | | | | > + | | | | : | | | | | | > + | CPU | | Memory | : | CPU | | |--->| Device | > + | Core |--->| Access |----->| Cache |<-->| | | | > + | | | Queue | : | | | | | | > + | | | | : | | | | +---------+ > + +--------+ +--------+ : +--------+ +-----------+ > + : > + : > + -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-10 5:28 ` Nick Piggin @ 2006-03-15 11:10 ` David Howells 2006-03-15 11:51 ` Nick Piggin 0 siblings, 1 reply; 33+ messages in thread From: David Howells @ 2006-03-15 11:10 UTC (permalink / raw) To: Nick Piggin Cc: David Howells, torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Isn't the Alpha's split caches a counter-example of your model, > because the coherency itself is out of order? I'd forgotten I need to adjust my documentation to deal with this. It seems this is the reason for read_barrier_depends(), and that read_barrier_depends() is also a partial cache sync. Do you know of any docs on Alpha's split caches? The Alpha Arch Handbook doesn't say very much about cache operation on the Alpha. I've looked around for what exactly is meant by "split cache" in conjunction with Alpha CPUs, and I've found three different opinions of what it means: (1) Separate Data and Instruction caches. (2) Serial data caches (CPU -> L1 Cache -> L2 Cache -> L3 Cache -> Memory). (3) Parallel linked data caches, where a CPU's request can be satisfied by either data cache, in which whilst one data cache is being interrogated by the CPU, the other one can use the memory bus (at least, that's what I understand). > Why do you need to include caches and queues in your model? Do > programmers care? Isn't the following sufficient... I don't think it is sufficient, given the number of times the way the cache interacts with everything has come up in this discussion. > : | m | > CPU -----> | e | > : | m | > : | o | > CPU -----> | r | > : | y | > > ... and bugger the implementation details? Ah, but if the cache is on the CPU side of the dotted line, does that then mean that a write memory barrier guarantees the CPU's cache to have updated memory? David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-15 11:10 ` David Howells @ 2006-03-15 11:51 ` Nick Piggin 2006-03-15 13:47 ` David Howells 0 siblings, 1 reply; 33+ messages in thread From: Nick Piggin @ 2006-03-15 11:51 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel, Paul E. McKenney David Howells wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>Isn't the Alpha's split caches a counter-example of your model, >>because the coherency itself is out of order? > > > I'd forgotten I need to adjust my documentation to deal with this. It seems > this is the reason for read_barrier_depends(), and that read_barrier_depends() > is also a partial cache sync. > > Do you know of any docs on Alpha's split caches? The Alpha Arch Handbook > doesn't say very much about cache operation on the Alpha. > > I've looked around for what exactly is meant by "split cache" in conjunction > with Alpha CPUs, and I've found three different opinions of what it means: > > (1) Separate Data and Instruction caches. > > (2) Serial data caches (CPU -> L1 Cache -> L2 Cache -> L3 Cache -> Memory). > > (3) Parallel linked data caches, where a CPU's request can be satisfied by > either data cache, in which whilst one data cache is being interrogated by > the CPU, the other one can use the memory bus (at least, that's what I > understand). > I don't have any docs myself, Paul might be the one to talk to as he's done the most recent research on this (though some of it directly with Alpha engineers, if I remember correctly). IIRC some alpha models have split cache close to what you describe in #3, however I'm not sure of the fine details (eg. I don't think the split caches are redundant, but just treated as two entities by the cache coherence protocol). Again, I don't have anything definitive that you can put in your docco, sorry. > >>Why do you need to include caches and queues in your model? Do >>programmers care? Isn't the following sufficient... > > > I don't think it is sufficient, given the number of times the way the cache > interacts with everything has come up in this discussion. > > >> : | m | >> CPU -----> | e | >> : | m | >> : | o | >> CPU -----> | r | >> : | y | >> >>... and bugger the implementation details? > > > Ah, but if the cache is on the CPU side of the dotted line, does that then mean > that a write memory barrier guarantees the CPU's cache to have updated memory? > I don't think it has to[*]. It would guarantee the _order_ in which "global memory" of this model ie. visibility for other "CPUs" see the writes, whether that visibility ultimately be implemented by cache coherency protocol or something else, I don't think matters (for a discussion of memory ordering). If anything it confused the matter for the case of Alpha. All the programmer needs to know is that there is some horizon (memory) beyond which stores are visible to other CPUs, and stores can travel there at different speeds so later ones can overtake earlier ones. And likewise loads can come from memory to the CPU at different speeds too, so later loads can contain earlier results. [*] Nor would your model require a smp_wmb() to update CPU caches either, I think: it wouldn't have to flush the store buffer, just order it. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-15 11:51 ` Nick Piggin @ 2006-03-15 13:47 ` David Howells 2006-03-15 23:21 ` Nick Piggin 0 siblings, 1 reply; 33+ messages in thread From: David Howells @ 2006-03-15 13:47 UTC (permalink / raw) To: Nick Piggin Cc: David Howells, torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel, Paul E. McKenney Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Ah, but if the cache is on the CPU side of the dotted line, does that then > > mean that a write memory barrier guarantees the CPU's cache to have > > updated memory? > > I don't think it has to[*]. It would guarantee the _order_ in which "global > memory" of this model ie. visibility for other "CPUs" see the writes, > whether that visibility ultimately be implemented by cache coherency > protocol or something else, I don't think matters (for a discussion of > memory ordering). It does matter, because I have to make it clear that the effect of the memory barrier usually stops at the cache, and in fact memory barriers may have no visibility at all on another CPU because it's all done inside a CPU's cache, until that other CPU tries to observe the results. > If anything it confused the matter for the case of Alpha. Nah... Alpha is self-confusing:-) > All the programmer needs to know is that there is some horizon (memory) > beyond which stores are visible to other CPUs, and stores can travel there > at different speeds so later ones can overtake earlier ones. And likewise > loads can come from memory to the CPU at different speeds too, so later > loads can contain earlier results. They also need to know that memory barriers don't imply an ordering on the cache. > [*] Nor would your model require a smp_wmb() to update CPU caches either, I > think: it wouldn't have to flush the store buffer, just order it. Exactly. But in your diagram, given that it doesn't show the cache, you don't know that the memory barrier doesn't extend through the cache and all the way to memory. David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-15 13:47 ` David Howells @ 2006-03-15 23:21 ` Nick Piggin 0 siblings, 0 replies; 33+ messages in thread From: Nick Piggin @ 2006-03-15 23:21 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel, Paul E. McKenney David Howells wrote: >Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>>Ah, but if the cache is on the CPU side of the dotted line, does that then >>>mean that a write memory barrier guarantees the CPU's cache to have >>>updated memory? >>> >>I don't think it has to[*]. It would guarantee the _order_ in which "global >>memory" of this model ie. visibility for other "CPUs" see the writes, >>whether that visibility ultimately be implemented by cache coherency >>protocol or something else, I don't think matters (for a discussion of >>memory ordering). >> > >It does matter, because I have to make it clear that the effect of the memory >barrier usually stops at the cache, and in fact memory barriers may have no >visibility at all on another CPU because it's all done inside a CPU's cache, >until that other CPU tries to observe the results. > > But that's a cache coherency issue that is really orthogonal to the memory consistency one. WHY, when explaining memory consistency, do they need to know that a barrier "usually stops at cache" (except for alpha)? They already _know_ that barriers may have no visibility on any other CPU because you should tell them that barriers only imply an ordering over the horizon, nothing more (ie. they need not imply a "push"). >>If anything it confused the matter for the case of Alpha. >> > >Nah... Alpha is self-confusing:-) > > Well maybe ;) But for better or worse, it is what kernel programmers now have to deal with. >>All the programmer needs to know is that there is some horizon (memory) >>beyond which stores are visible to other CPUs, and stores can travel there >>at different speeds so later ones can overtake earlier ones. And likewise >>loads can come from memory to the CPU at different speeds too, so later >>loads can contain earlier results. >> > >They also need to know that memory barriers don't imply an ordering on the >cache. > > Why? I'm contending that this is exactly what they don't need to know. >>[*] Nor would your model require a smp_wmb() to update CPU caches either, I >>think: it wouldn't have to flush the store buffer, just order it. >> > >Exactly. > >But in your diagram, given that it doesn't show the cache, you don't know that >the memory barrier doesn't extend through the cache and all the way to memory. > > What do you mean "extend"? I don't think that is good terminology. What it does is provide an ordering of traffic going over the vertical line dividing CPU and memory. It does not matter whether "memory" is actually "cache + coherency" or not, just that the vertical line is the horizon between "visible to other CPUs" and "not". Nick -- Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-09 20:29 ` [PATCH] Document Linux's memory barriers [try #4] David Howells 2006-03-09 23:34 ` Paul Mackerras 2006-03-10 5:28 ` Nick Piggin @ 2006-03-12 17:15 ` Eric W. Biederman 2006-03-14 21:26 ` David Howells 2006-03-15 14:23 ` [PATCH] Document Linux's memory barriers [try #5] David Howells 3 siblings, 1 reply; 33+ messages in thread From: Eric W. Biederman @ 2006-03-12 17:15 UTC (permalink / raw) To: David Howells Cc: torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel A small nit. You are not documenting the most subtle memory barrier: smp_read_barrier_depends(); Which is a deep requirement of the RCU code. As I understand it. On some architectures (alpha) without at least this a load from a pointer can load from the an old pointer value. At one point it was suggested this be called: read_memory_barrier_data_dependent(). Simply calling: rcu_dereference is what all users should call but the semantics should be documented at least so that people porting Linux can have a chance of getting it right. Eric ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-12 17:15 ` Eric W. Biederman @ 2006-03-14 21:26 ` David Howells 2006-03-14 21:48 ` Paul Mackerras 0 siblings, 1 reply; 33+ messages in thread From: David Howells @ 2006-03-14 21:26 UTC (permalink / raw) To: Eric W. Biederman Cc: David Howells, torvalds, akpm, mingo, alan, linux-arch, linuxppc64-dev, linux-kernel Eric W. Biederman <ebiederm@xmission.com> wrote: > A small nit. You are not documenting the most subtle memory barrier: > smp_read_barrier_depends(); Which is a deep requirement of the RCU > code. How about this the attached adjustment? David diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 3ec9ff4..0c38bea 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -457,13 +457,14 @@ except in small and specific cases. In EXPLICIT KERNEL MEMORY BARRIERS =============================== -The Linux kernel has six basic CPU memory barriers: +The Linux kernel has eight basic CPU memory barriers: - MANDATORY SMP CONDITIONAL - =============== =============== - GENERAL mb() smp_mb() - READ rmb() smp_rmb() - WRITE wmb() smp_wmb() + TYPE MANDATORY SMP CONDITIONAL + =============== ======================= =========================== + GENERAL mb() smp_mb() + WRITE wmb() smp_wmb() + READ rmb() smp_rmb() + DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() General memory barriers give a guarantee that all memory accesses specified before the barrier will appear to happen before all memory accesses specified @@ -472,6 +473,36 @@ after the barrier with respect to the ot Read and write memory barriers give similar guarantees, but only for memory reads versus memory reads and memory writes versus memory writes respectively. +Data dependency memory barriers ensure that if two reads are issued that +depend on each other, that the first read is completed _before_ the dependency +comes into effect. For instance, consider a case where the address used in +the second read is calculated from the result of the first read: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + p = &b; q = p; + d = *q; + +then old data values may be used in the address calculation for the second +value, potentially resulting in q == &b and d == 0 being seen, which is never +correct. What is required is a data dependency memory barrier: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + p = &b; q = p; + smp_read_barrier_depends(); + Make sure q is changed before d is read + d = *q; + +This forces the result to be either q == &a and d == 0 or q == &b and d == 2. +The result of q == &b and d == 0 will never be seen. + All memory barriers imply compiler barriers. SMP memory barriers are only compiler barriers on uniprocessor compiled systems ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-14 21:26 ` David Howells @ 2006-03-14 21:48 ` Paul Mackerras 2006-03-14 23:59 ` David Howells 0 siblings, 1 reply; 33+ messages in thread From: Paul Mackerras @ 2006-03-14 21:48 UTC (permalink / raw) To: David Howells Cc: Eric W. Biederman, akpm, linux-arch, linux-kernel, torvalds, mingo, alan, linuxppc64-dev David Howells writes: > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p > + p = &b; q = p; > + d = *q; > + > +then old data values may be used in the address calculation for the second > +value, potentially resulting in q == &b and d == 0 being seen, which is never > +correct. What is required is a data dependency memory barrier: No, that's not the problem. The problem is that you can get q == &b and d == 1, believe it or not. That is, you can see the new value of the pointer but the old value of the thing pointed to. Paul. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-14 21:48 ` Paul Mackerras @ 2006-03-14 23:59 ` David Howells 2006-03-15 0:20 ` Linus Torvalds 2006-03-15 0:54 ` Paul Mackerras 0 siblings, 2 replies; 33+ messages in thread From: David Howells @ 2006-03-14 23:59 UTC (permalink / raw) To: Paul Mackerras Cc: David Howells, Eric W. Biederman, akpm, linux-arch, linux-kernel, torvalds, mingo, alan, linuxppc64-dev Paul Mackerras <paulus@samba.org> wrote: > No, that's not the problem. The problem is that you can get q == &b > and d == 1, believe it or not. That is, you can see the new value of > the pointer but the old value of the thing pointed to. But that doesn't make any sense! That would mean we that we'd've read b into d before having read the new value of p into q, and thus before having calculated the address from which to read d (ie: &b) - so how could we know we were supposed to read d from b and not from a without first having read p? Unless, of course, the smp_wmb() isn't effective, and the write to b happens after the write to p; or the Alpha's cache isn't fully coherent. David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-14 23:59 ` David Howells @ 2006-03-15 0:20 ` Linus Torvalds 2006-03-15 1:19 ` David Howells 2006-03-15 1:25 ` Nick Piggin 2006-03-15 0:54 ` Paul Mackerras 1 sibling, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2006-03-15 0:20 UTC (permalink / raw) To: David Howells Cc: Paul Mackerras, Eric W. Biederman, akpm, linux-arch, linux-kernel, mingo, alan, linuxppc64-dev On Tue, 14 Mar 2006, David Howells wrote: > > But that doesn't make any sense! > > That would mean we that we'd've read b into d before having read the new value > of p into q, and thus before having calculated the address from which to read d > (ie: &b) - so how could we know we were supposed to read d from b and not from > a without first having read p? > > Unless, of course, the smp_wmb() isn't effective, and the write to b happens > after the write to p; or the Alpha's cache isn't fully coherent. The cache is fully coherent, but the coherency isn't _ordered_. Remember: the smp_wmb() only orders on the _writer_ side. Not on the reader side. The writer may send out the stuff in a particular order, but the reader might see them in a different order because _it_ might queue the bus events internally for its caches (in particular, it could end up delaying updating a particular way in the cache because it's busy). [ The issue of read_barrier_depends() can also come up if you do data speculation. Currently I don't think anybody does speculation for anything but control speculation, but it's at least possible that a read that "depends" on a previous read actually could take place before the read it depends on if the previous read had its result speculated. For example, you already have to handle the case of if (read a) read b; where we can read b _before_ we read a, because the CPU speculated the branch as being not taken, and then re-ordered the reads, even though they are "dependent" on each other. That's not that different from doing ptr = read a data = read [ptr] and speculating the result of the first read. Such a CPU would also need a non-empty read-barrier-depends ] So memory ordering is interesting. Some "clearly impossible" orderings actually suddenly become possible just because the CPU can do things speculatively and thus things aren't necessarily causally ordered any more. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-15 0:20 ` Linus Torvalds @ 2006-03-15 1:19 ` David Howells 2006-03-15 1:47 ` Linus Torvalds 2006-03-15 1:25 ` Nick Piggin 1 sibling, 1 reply; 33+ messages in thread From: David Howells @ 2006-03-15 1:19 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Paul Mackerras, Eric W. Biederman, akpm, linux-arch, linux-kernel, mingo, alan, linuxppc64-dev Linus Torvalds <torvalds@osdl.org> wrote: > That's not that different from doing > > ptr = read a > data = read [ptr] > > and speculating the result of the first read. But that would lead to the situation I suggested (q == &b and d == a), not the one Paul suggested (q == &b and d == old b) because we'd speculate on the old value of the pointer, and so see it before it's updated, and thus still pointing to a. > The cache is fully coherent, but the coherency isn't _ordered_. > > Remember: the smp_wmb() only orders on the _writer_ side. Not on the > reader side. The writer may send out the stuff in a particular order, but > the reader might see them in a different order because _it_ might queue > the bus events internally for its caches (in particular, it could end up > delaying updating a particular way in the cache because it's busy). Ummm... So whilst smp_wmb() commits writes to the mercy of the cache coherency system in a particular order, the updates can be passed over from one cache to another and committed to the reader's cache in any order, and can even be delayed: CPU 1 CPU 2 COMMENT =============== =============== ======================================= a == 0, b == 1 and p == &a, q == &a b = 2; smp_wmb(); Make sure b is changed before p <post b=2> <queue b=2> p = &b; q = p; <post p=&b> <queue p=&b> d = *q; <commit p=&b> <post q=p> <read *q> Reads from b before b updated in cache <post d=*q> <commit b=2> I presume the Alpha MB instruction forces cache queue completion in addition to a partial ordering on memory accesses: CPU 1 CPU 2 COMMENT =============== =============== ======================================= a == 0, b == 1 and p == &a, q == &a b = 2; smp_wmb(); Make sure b is changed before p <post b=2> <queue b=2> p = &b; q = p; <post p=&b> <queue p=&b> smp_read_barrier_depends(); <commit b=2> <commit p=&b> d = *q; <post q=p> <read *q> Reads new value of b <post d=*q> David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-15 1:19 ` David Howells @ 2006-03-15 1:47 ` Linus Torvalds 0 siblings, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2006-03-15 1:47 UTC (permalink / raw) To: David Howells Cc: Paul Mackerras, Eric W. Biederman, akpm, linux-arch, linux-kernel, mingo, alan, linuxppc64-dev On Wed, 15 Mar 2006, David Howells wrote: > Linus Torvalds <torvalds@osdl.org> wrote: > > > That's not that different from doing > > > > ptr = read a > > data = read [ptr] > > > > and speculating the result of the first read. > > But that would lead to the situation I suggested (q == &b and d == a), not the > one Paul suggested (q == &b and d == old b) because we'd speculate on the old > value of the pointer, and so see it before it's updated, and thus still > pointing to a. No. If it _speculates_ the old value, and the value has actually changed when it checks the speculation, it would generally result in a uarch trap, and re-do of the instruction without speculation. So for data speculation to make a difference in this case, it would speculate the _new_ value (hey, doesn't matter _why_ - it could be that a previous load at a previous time had gotten that value), and then load the old value off the new pointer, and when the speculation ends up being checked, it all pans out (the speculated value matched the value when "a" was actually later read), and you get a "non-causal" result. Now, nobody actually does this kind of data speculation as far as I know, and there are perfectly valid arguments for why outside of control speculation nobody likely will (at least partly due to the fact that it would screw up existing expectations for memory ordering). It's also pretty damn complicated to do. But data speculation has certainly been a research subject, and there are papers on it. > > Remember: the smp_wmb() only orders on the _writer_ side. Not on the > > reader side. The writer may send out the stuff in a particular order, but > > the reader might see them in a different order because _it_ might queue > > the bus events internally for its caches (in particular, it could end up > > delaying updating a particular way in the cache because it's busy). > > Ummm... So whilst smp_wmb() commits writes to the mercy of the cache coherency > system in a particular order, the updates can be passed over from one cache to > another and committed to the reader's cache in any order, and can even be > delayed: Right. You should _always_ have as a rule of thinking that a "smp_wmb()" on one side absolutely _has_ to be paired with a "smp_rmb()" on the other side. If they aren't paired, something is _wrong_. Now, the data-dependent reads is actually a very specific optimization where we say that on certain architectures you don't need it, so we relax the rule to be "the reader has to have a smp_rmb() _or_ a smp_read_barrier_depends(), where the latter is only valid if the address of the dependent read depends directly on the first one". But the read barrier always has to be there, even though it can be of the "weaker" type. And note that the address really has to have a _data_ dependency, not a control dependency. If the address is dependent on the first read, but the dependency is through a conditional rather than actually reading the address itself, then it's a control dependency, and existing CPU's already short-circuit those through branch prediction. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-15 0:20 ` Linus Torvalds 2006-03-15 1:19 ` David Howells @ 2006-03-15 1:25 ` Nick Piggin 1 sibling, 0 replies; 33+ messages in thread From: Nick Piggin @ 2006-03-15 1:25 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Paul Mackerras, Eric W. Biederman, akpm, linux-arch, linux-kernel, mingo, alan, linuxppc64-dev Linus Torvalds wrote: > >On Tue, 14 Mar 2006, David Howells wrote: > >>But that doesn't make any sense! >> >>That would mean we that we'd've read b into d before having read the new value >>of p into q, and thus before having calculated the address from which to read d >>(ie: &b) - so how could we know we were supposed to read d from b and not from >>a without first having read p? >> >>Unless, of course, the smp_wmb() isn't effective, and the write to b happens >>after the write to p; or the Alpha's cache isn't fully coherent. >> > >The cache is fully coherent, but the coherency isn't _ordered_. > > This is what I was referring to when I said your (David's) idea of "memory" WRT memory consistency isn't correct -- cache coherency can be out of order. -- Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #4] 2006-03-14 23:59 ` David Howells 2006-03-15 0:20 ` Linus Torvalds @ 2006-03-15 0:54 ` Paul Mackerras 1 sibling, 0 replies; 33+ messages in thread From: Paul Mackerras @ 2006-03-15 0:54 UTC (permalink / raw) To: David Howells Cc: Eric W. Biederman, akpm, linux-arch, linux-kernel, torvalds, mingo, alan, linuxppc64-dev David Howells writes: > Paul Mackerras <paulus@samba.org> wrote: > > > No, that's not the problem. The problem is that you can get q == &b > > and d == 1, believe it or not. That is, you can see the new value of > > the pointer but the old value of the thing pointed to. > > But that doesn't make any sense! It certainly violates the principle of least surprise. :) Apparently this can occur on some Alpha machines that have a partitioned cache. Although CPU 1 sends out the updates to b and p in the right order because of the smp_wmb(), it's possible that b and p are present in CPU 2's cache, one in each half of the cache. If there are a lot of updates coming in for the half containing b, but the half containing p is quiet, it is possible for CPU 2 to see a new value of p but an old value of b, unless you put an rmb instruction between the two loads from memory. I haven't heard of this being an issue on any other architecture. On PowerPC it can't happen because the architecture specifies that a data dependency creates an implicit read barrier. Paul. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] Document Linux's memory barriers [try #5] 2006-03-09 20:29 ` [PATCH] Document Linux's memory barriers [try #4] David Howells ` (2 preceding siblings ...) 2006-03-12 17:15 ` Eric W. Biederman @ 2006-03-15 14:23 ` David Howells 2006-03-16 23:17 ` Paul E. McKenney [not found] ` <21253.1142509812@warthog.cambridge.redhat.com> 3 siblings, 2 replies; 33+ messages in thread From: David Howells @ 2006-03-15 14:23 UTC (permalink / raw) To: torvalds, akpm; +Cc: linux-arch, linuxppc64-dev, linux-kernel The attached patch documents the Linux kernel's memory barriers. I've updated it from the comments I've been given. The per-arch notes sections are gone because it's clear that there are so many exceptions, that it's not worth having them. I've added a list of references to other documents. I've tried to get rid of the concept of memory accesses appearing on the bus; what matters is apparent behaviour with respect to other observers in the system. Interrupts barrier effects are now considered to be non-existent. They may be there, but you may not rely on them. I've added a couple of definition sections at the top of the document: one to specify the minimum execution model that may be assumed, the other to specify what this document refers to by the term "memory". I've made greater mention of the use of mmiowb(). I've adjusted the way in which caches are described, and described the fun that can be had with cache coherence maintenance being unordered and data dependency not being necessarily implicit. I've described (smp_)read_barrier_depends(). Signed-Off-By: David Howells <dhowells@redhat.com> --- warthog>diffstat -p1 /tmp/mb.diff Documentation/memory-barriers.txt | 1039 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 1039 insertions(+) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt new file mode 100644 index 0000000..fd7a6f1 --- /dev/null +++ b/Documentation/memory-barriers.txt @@ -0,0 +1,1039 @@ + ============================ + LINUX KERNEL MEMORY BARRIERS + ============================ + +Contents: + + (*) Assumed minimum execution ordering model. + + (*) What is considered memory? + + - Cached interactions. + - Cache coherency. + - Uncached interactions. + + (*) What are memory barriers? + + (*) Where are memory barriers needed? + + - Accessing devices. + - Multiprocessor interaction. + - Interrupts. + + (*) Explicit kernel compiler barriers. + + (*) Explicit kernel memory barriers. + + - Barrier pairing. + + (*) Implicit kernel memory barriers. + + - Locking functions. + - Interrupt disabling functions. + - Miscellaneous functions. + + (*) Inter-CPU locking barrier effects. + + - Locks vs memory accesses. + - Locks vs I/O accesses. + + (*) Kernel I/O barrier effects. + + (*) References. + + +======================================== +ASSUMED MINIMUM EXECUTION ORDERING MODEL +======================================== + +It has to be assumed that the conceptual CPU is weakly-ordered in all respects +but that it will maintain the appearance of program causality with respect to +itself. Some CPUs (such as i386 or x86_64) are more constrained than others +(such as powerpc or frv), and so the most relaxed case must be assumed outside +of arch-specific code. + +This means that it must be considered that the CPU will execute its instruction +stream in any order it feels like - or even in parallel - provided that if an +instruction in the stream depends on the an earlier instruction, then that +earlier instruction must be sufficiently complete[*] before the later +instruction may proceed. + + [*] Some instructions have more than one effect[**] and different instructions + may depend on different effects. + + [**] Eg: changes to condition codes and registers; memory events. + +A CPU may also discard any instruction sequence that ultimately winds up having +no effect. For example if two adjacent instructions both load an immediate +value into the same register, the first may be discarded. + + +Similarly, it has to be assumed that compiler might reorder the instruction +stream in any way it sees fit, again provided the appearance of causality is +maintained. + + +========================== +WHAT IS CONSIDERED MEMORY? +========================== + +For the purpose of this specification what's meant by "memory" needs to be +defined, and the division between CPU and memory needs to be marked out. + + +CACHED INTERACTIONS +------------------- + +As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU +caches in the system. Although any particular read or write may not actually +appear outside of the CPU that issued it (the CPU may may have been able to +satisfy it from its own cache), it's still as if the memory access had taken +place as far as the other CPUs are concerned since the cache coherency and +ejection mechanisms will propegate the effects upon conflict. + + [*] Also applies to CPU vs device when accessed through a cache. + +The system can be considered logically as: + + <--- CPU ---> : <----------- Memory -----------> + : + +--------+ +--------+ : +--------+ +-----------+ + | | | | : | | | | +--------+ + | CPU | | Memory | : | CPU | | | | | + | Core |--->| Access |----->| Cache |<-->| | | | + | | | Queue | : | | | |--->| Memory | + | | | | : | | | | | | + +--------+ +--------+ : +--------+ | | | | + : | Cache | +--------+ + : | Coherency | + : | Mechanism | +--------+ + +--------+ +--------+ : +--------+ | | | | + | | | | : | | | | | | + | CPU | | Memory | : | CPU | | |--->| Device | + | Core |--->| Access |----->| Cache |<-->| | | | + | | | Queue | : | | | | | | + | | | | : | | | | +--------+ + +--------+ +--------+ : +--------+ +-----------+ + : + : + +The CPU core may execute instructions in any order it deems fit, provided the +expected program causality appears to be maintained. Some of the instructions +generate load and store operations which then go into the memory access queue +to be performed. The core may place these in the queue in any order it wishes, +and continue execution until it is forced to wait for an instruction to +complete. + +What memory barriers are concerned with is controlling the order in which +accesses cross from the CPU side of things to the memory side of things, and +the order in which the effects are perceived to happen by the other observers +in the system. + + +CACHE COHERENCY +--------------- + +Life isn't quite as simple as it may appear above, however: for while the +caches are expected to be coherent, there's no guarantee that that coherency +will be ordered. This means that whilst changes made on one CPU will +eventually become visible on all CPUs, there's no guarantee that they will +become apparent in the same order on those other CPUs. + + +Consider dealing with a pair of CPUs (1 & 2), each of which has a pair of +parallel data caches (A,B and C,D) with the following properties: + + (*) a cacheline may be in one of the caches; + + (*) whilst the CPU core is interrogating one cache, the other cache may be + making use of the bus to access the rest of the system - perhaps to + displace a dirty cacheline or to do a speculative read; + + (*) each cache has a queue of operations that need to be applied to that cache + to maintain coherency with the rest of the system; + + (*) the coherency queue is not flushed by normal reads to lines already + present in the cache, even though the contents of the queue may + potentially effect those reads. + +Imagine, then, two writes made on the first CPU, with a barrier between them to +guarantee that they will reach that CPU's caches in the requisite order: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + <A:modify b=2> The cacheline of b is now ours + p = &b; + <A:modify p=&b> The cacheline of p is now ours + +The write memory barrier forces the local CPU caches to be updated in the +correct order. But now imagine that the second CPU that wants to read those +values: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + ... + q = p; + d = *q; + +The above pair of reads may then fail to apparently happen in expected order, +as the cacheline holding p may get updated in one of the second CPU's caches +whilst the update to the cacheline holding b is delayed in the other of the +second CPU's caches by some other cache event: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + <A:modify b=2> <C:busy> + <C:queue b=2> + p = &b; q = p; + <D:request p> + <A:modify p=&b> <D:commit p=&b> + <D:read p> + d = *q; + <C:read *q> Reads from b before b updated in cache + <C:unbusy> + <C:commit b=2> + +Basically, whilst both cachelines will be updated on CPU 2 eventually, there's +no guarantee that, without intervention, the order of update will be the same +as that committed on CPU 1. + + +To intervene, we need to emplace a data dependency barrier or a read barrier. +This will force the cache to commit its coherency queue before processing any +further requests: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + <A:modify b=2> <C:busy> + <C:queue b=2> + p = &b; q = p; + <D:request p> + <A:modify p=&b> <D:commit p=&b> + <D:read p> + smp_read_barrier_depends() + <C:unbusy> + <C:commit b=2> + d = *q; + <C:read *q> Reads from b after b updated in cache + + +This sort of problem can be encountered on Alpha processors as they have a +split cache that improves performance by making better use of the data bus. +Whilst most CPUs do imply a data dependency barrier on the read when a memory +access depends on a read, not all do, so it may not be relied on. + + +UNCACHED INTERACTIONS +--------------------- + +Note that the above model does not show uncached memory or I/O accesses. These +procede directly from the queue to the memory or the devices, bypassing any +caches or cache coherency: + + <--- CPU ---> : + : +-----+ + +--------+ +--------+ : | | + | | | | : | | +---------+ + | CPU | | Memory | : | | | | + | Core |--->| Access |--------------->| | | | + | | | Queue | : | |------------->| Memory | + | | | | : | | | | + +--------+ +--------+ : | | | | + : | | +---------+ + : | Bus | + : | | +---------+ + +--------+ +--------+ : | | | | + | | | | : | | | | + | CPU | | Memory | : | |<------------>| Device | + | Core |--->| Access |--------------->| | | | + | | | Queue | : | | | | + | | | | : | | +---------+ + +--------+ +--------+ : | | + : +-----+ + : + + +========================= +WHAT ARE MEMORY BARRIERS? +========================= + +Memory barriers are instructions to both the compiler and the CPU to impose an +apparent partial ordering between the memory access operations specified either +side of the barrier. They request that the sequence of memory events generated +appears to other components of the system as if the barrier is effective on +that CPU. + +Note that: + + (*) there's no guarantee that the sequence of memory events is _actually_ so + ordered. It's possible for the CPU to do out-of-order accesses _as long + as no-one is looking_, and then fix up the memory if someone else tries to + see what's going on (for instance a bus master device); what matters is + the _apparent_ order as far as other processors and devices are concerned; + and + + (*) memory barriers are only guaranteed to act within the CPU processing them, + and are not, for the most part, guaranteed to percolate down to other CPUs + in the system or to any I/O hardware that that CPU may communicate with. + + +For example, a programmer might take it for granted that the CPU will perform +memory accesses in exactly the order specified, so that if a CPU is, for +example, given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +They would then expect that the CPU will complete the memory access for each +instruction before moving on to the next one, leading to a definite sequence of +operations as seen by external observers in the system: + + read *A, write *B, read *C, read *D, write *E. + + +Reality is, of course, much messier. With many CPUs and compilers, this isn't +always true because: + + (*) reads are more likely to need to be completed immediately to permit + execution progress, whereas writes can often be deferred without a + problem; + + (*) reads can be done speculatively, and then the result discarded should it + prove not to be required; + + (*) the order of the memory accesses may be rearranged to promote better use + of the CPU buses and caches; + + (*) reads and writes may be combined to improve performance when talking to + the memory or I/O hardware that can do batched accesses of adjacent + locations, thus cutting down on transaction setup costs (memory and PCI + devices may be able to do this); and + + (*) the CPU's data cache may affect the ordering, and whilst cache-coherency + mechanisms may alleviate this - once the write has actually hit the cache + - there's no guarantee that the coherency management will be propegated in + order to other CPUs. + +So what another CPU, say, might actually observe from the above piece of code +is: + + read *A, read {*C,*D}, write *E, write *B + + (By "read {*C,*D}" I mean a combined single read). + + +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ +accesses appear to be correctly ordered, without the need for a memory barrier. +For instance with the following code: + + X = *A; + *A = Y; + Z = *A; + +assuming no intervention by an external influence, it can be taken that: + + (*) X will hold the old value of *A, and will never happen after the write and + thus end up being given the value that was assigned to *A from Y instead; + and + + (*) Z will always be given the value in *A that was assigned there from Y, and + will never happen before the write, and thus end up with the same value + that was in *A initially. + +(This is ignoring the fact that the value initially in *A may appear to be the +same as the value assigned to *A from Y). + + +================================= +WHERE ARE MEMORY BARRIERS NEEDED? +================================= + +Under normal operation, access reordering is probably not going to be a problem +as a linear program will still appear to operate correctly. There are, +however, three circumstances where reordering definitely _could_ be a problem: + + +ACCESSING DEVICES +----------------- + +Many devices can be memory mapped, and so appear to the CPU as if they're just +memory locations. However, to control the device, the driver has to make the +right accesses in exactly the right order. + +Consider, for example, an ethernet chipset such as the AMD PCnet32. It +presents to the CPU an "address register" and a bunch of "data registers". The +way it's accessed is to write the index of the internal register to be accessed +to the address register, and then read or write the appropriate data register +to access the chip's internal register, which could - theoretically - be done +by: + + *ADR = ctl_reg_3; + reg = *DATA; + +The problem with a clever CPU or a clever compiler is that the write to the +address register isn't guaranteed to happen before the access to the data +register, if the CPU or the compiler thinks it is more efficient to defer the +address write: + + read *DATA, write *ADR + +then things will break. + + +In the Linux kernel, however, I/O should be done through the appropriate +accessor routines - such as inb() or writel() - which know how to make such +accesses appropriately sequential. + +On some systems, I/O writes are not strongly ordered across all CPUs, and so +locking should be used, and mmiowb() must be issued prior to unlocking the +critical section. + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +MULTIPROCESSOR INTERACTION +-------------------------- + +When there's a system with more than one processor, the CPUs in the system may +be working on the same set of data at the same time. This can cause +synchronisation problems, and the usual way of dealing with them is to use +locks - but locks are quite expensive, and so it may be preferable to operate +without the use of a lock if at all possible. In such a case accesses that +affect both CPUs may have to be carefully ordered to prevent error. + +Consider the R/W semaphore slow path. In that, a waiting process is queued on +the semaphore, as noted by it having a record on its stack linked to the +semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + +To wake up the waiter, the up_read() or up_write() functions have to read the +pointer from this record to know as to where the next waiter record is, clear +the task pointer, call wake_up_process() on the task, and release the reference +held on the waiter's task struct: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + +If any of these steps occur out of order, then the whole thing may fail. + +Note that the waiter does not get the semaphore lock again - it just waits for +its task pointer to be cleared. Since the record is on its stack, this means +that if the task pointer is cleared _before_ the next pointer in the list is +read, another CPU might start processing the waiter and it might clobber its +stack before up*() functions have a chance to read the next pointer. + + CPU 1 CPU 2 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + <preempt> + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + </preempt> + READ waiter->list.next; + --- OOPS --- + +This could be dealt with using a spinlock, but then the down_xxx() function has +to get the spinlock again after it's been woken up, which is a waste of +resources. + +The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + +In this case, the barrier makes a guarantee that all memory accesses before the +barrier will appear to happen before all the memory accesses after the barrier +with respect to the other CPUs on the system. It does _not_ guarantee that all +the memory accesses before the barrier will be complete by the time the barrier +itself is complete. + +SMP memory barriers are normally nothing more than compiler barriers on a +kernel compiled for a UP system because the CPU orders overlapping accesses +with respect to itself, and so CPU barriers aren't needed. + + +INTERRUPTS +---------- + +A driver may be interrupted by its own interrupt service routine, and thus they +may interfere with each other's attempts to control or access the device. + +This may be alleviated - at least in part - by disabling interrupts (a form of +locking), such that the critical operations are all contained within the +interrupt-disabled section in the driver. Whilst the driver's interrupt +routine is executing, the driver's core may not run on the same CPU, and its +interrupt is not permitted to happen again until the current interrupt has been +handled, thus the interrupt handler does not need to lock against that. + +However, consider a driver was talking to an ethernet card that sports an +address register and a data register. If that driver's core is talks to the +card under interrupt-disablement and then the driver's interrupt handler is +invoked: + + LOCAL IRQ DISABLE + writew(ADDR, ctl_reg_3); + writew(DATA, y); + LOCAL IRQ ENABLE + <interrupt> + writew(ADDR, ctl_reg_4); + q = readw(DATA); + </interrupt> + +If ordering rules are sufficiently relaxed, the write to the data register +might happen after the second write to the address register. + +If ordering rules are relaxed, it must be assumed that accesses done inside an +interrupt disabled section may leak outside of it and may interleave with +accesses performed in an interrupt and vice versa unless implicit or explicit +barriers are used. + +Normally this won't be a problem because the I/O accesses done inside such +sections will include synchronous read operations on strictly ordered I/O +registers that form implicit I/O barriers. If this isn't sufficient then an +mmiowb() may need to be used explicitly. + + +A similar situation may occur between an interrupt routine and two routines +running on separate CPUs that communicate with each other. If such a case is +likely, then interrupt-disabling locks should be used to guarantee ordering. + + +================================= +EXPLICIT KERNEL COMPILER BARRIERS +================================= + +The Linux kernel has an explicit compiler barrier function that prevents the +compiler from moving the memory accesses either side of it to the other side: + + barrier(); + +This has no direct effect on the CPU, which may then reorder things however it +wishes. + + +=============================== +EXPLICIT KERNEL MEMORY BARRIERS +=============================== + +The Linux kernel has eight basic CPU memory barriers: + + TYPE MANDATORY SMP CONDITIONAL + =============== ======================= =========================== + GENERAL mb() smp_mb() + WRITE wmb() smp_wmb() + READ rmb() smp_rmb() + DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() + +General memory barriers give a guarantee that all memory accesses specified +before the barrier will appear to happen before all memory accesses specified +after the barrier with respect to the other components of the system. + +Read and write memory barriers give similar guarantees, but only for memory +reads versus memory reads and memory writes versus memory writes respectively. + +Data dependency memory barriers ensure that if two reads are issued such that +the second depends on the result of the first, then the first read is completed +and the cache coherence is up to date before the second read is performed. The +primary case is where the address used in a subsequent read is calculated from +the result of the first read: + + CPU 1 CPU 2 + =============== =============== + { a == 0, b == 1 and p == &a, q == &a } + ... + b = 2; + smp_wmb(); + p = &b; q = p; + smp_read_barrier_depends(); + d = *q; + +Without the data dependency barrier, any of the following results could be +seen: + + POSSIBLE RESULT PERMISSIBLE ORIGIN + =============== =============== ======================================= + q == &a, d == 0 Yes + q == &b, d == 2 Yes + q == &b, d == 1 No Cache coherency maintenance delay + q == &b, d == 0 No q read after a + +See the "Cache Coherency" section above. + + +All memory barriers imply compiler barriers. + +Read memory barriers imply data dependency barriers. + +SMP memory barriers are reduced to compiler barriers on uniprocessor compiled +systems because it is assumed that a CPU will be apparently self-consistent, +and will order overlapping accesses correctly with respect to itself. + +There is no guarantee that any of the memory accesses specified before a memory +barrier will be complete by the completion of a memory barrier; the barrier can +be considered to draw a line in that CPU's access queue that accesses of the +appropriate type may not cross. + +There is no guarantee that issuing a memory barrier on one CPU will have any +direct effect on another CPU or any other hardware in the system. The indirect +effect will be the order in which the second CPU sees the effects of the first +CPU's accesses occur. + +There is no guarantee that some intervening piece of off-the-CPU hardware[*] +will not reorder the memory accesses. CPU cache coherency mechanisms should +propegate the indirect effects of a memory barrier between CPUs, but may not do +so in order (see above). + + [*] For information on bus mastering DMA and coherency please read: + + Documentation/pci.txt + Documentation/DMA-mapping.txt + Documentation/DMA-API.txt + +Note that these are the _minimum_ guarantees. Different architectures may give +more substantial guarantees, but they may not be relied upon outside of arch +specific code. + + +There are some more advanced barrier functions: + + (*) set_mb(var, value) + (*) set_wmb(var, value) + + These assign the value to the variable and then insert at least a write + barrier after it, depending on the function. They aren't guaranteed to + insert anything more than a compiler barrier in a UP compilation. + + +BARRIER PAIRING +--------------- + +Certain types of memory barrier should always be paired. A lack of an +appropriate pairing is almost certainly an error. + +An smp_wmb() should always be paired with an smp_read_barrier_depends() or an +smp_rmb(), though an smp_mb() would also be acceptable. Similarly an smp_rmb() +or an smp_read_barrier_depends() should always be paired with at least an +smp_wmb(), though, again, an smp_mb() is acceptable: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================= + a = 1; + smp_wmb(); Could be smp_mb() + b = 2; + x = a; + smp_rmb(); Could be smp_mb() + y = b; + +The data dependency barrier - smp_read_barrier_depends() - is a very specific +optimisation. It's a weaker form of the read memory barrier, for use in the +case where there's a dependency between two reads - since on some architectures +we can't rely on the CPU to imply a data dependency barrier. The typical case +is where a read is used to generate an address for a subsequent memory access: + + CPU 1 CPU 2 COMMENT + =============== =============================== ======================= + a = 1; + smp_wmb(); Could be smp_mb() + b = &a; + x = b; + smp_read_barrier_depends(); Or smp_rmb()/smp_mb() + y = *x; + +This is used in the RCU system. + +If the two reads are independent, an smp_read_barrier_depends() is not +sufficient, and an smp_rmb() or better must be employed. + +Basically, the read barrier always has to be there, even though it can be of +the "weaker" type. + +Note also that the address really has to have a _data_ dependency, not a +control dependency. If the address is dependent on the first read, but the +dependency is through a conditional rather than actually reading the address +itself, then it's a control dependency, and existing CPU's already deal with +those through branch prediction. + + +=============================== +IMPLICIT KERNEL MEMORY BARRIERS +=============================== + +Some of the other functions in the linux kernel imply memory barriers, amongst +which are locking and scheduling functions. + +This specification is a _minimum_ guarantee; any particular architecture may +provide more substantial guarantees, but these may not be relied upon outside +of arch specific code. + + +LOCKING FUNCTIONS +----------------- + +The Linux kernel has a number of locking constructs: + + (*) spin locks + (*) R/W spin locks + (*) mutexes + (*) semaphores + (*) R/W semaphores + +In all cases there are variants on "LOCK" operations and "UNLOCK" operations +for each construct. These operations all imply certain barriers: + + (*) LOCK operation implication: + + Memory accesses issued after the LOCK will be completed after the LOCK + accesses have completed. + + Memory accesses issued before the LOCK may be completed after the LOCK + accesses have completed. + + (*) UNLOCK operation implication: + + Memory accesses issued before the UNLOCK will be completed before the + UNLOCK accesses have completed. + + Memory accesses issued after the UNLOCK may be completed before the UNLOCK + accesses have completed. + + (*) LOCK vs UNLOCK implication: + + The LOCK accesses will be completed before the UNLOCK accesses. + + Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, + but a LOCK followed by an UNLOCK is not. + + (*) Failed conditional LOCK implication: + + Certain variants of the LOCK operation may fail, either due to being + unable to get the lock immediately, or due to receiving an unblocked + signal whilst asleep waiting for the lock to become available. Failed + locks do not imply any sort of barrier. + +Locks and semaphores may not provide any guarantee of ordering on UP compiled +systems, and so cannot be counted on in such a situation to actually achieve +anything at all - especially with respect to I/O accesses - unless combined +with interrupt disabling operations. + +See also the section on "Inter-CPU locking barrier effects". + + +As an example, consider the following: + + *A = a; + *B = b; + LOCK + *C = c; + *D = d; + UNLOCK + *E = e; + *F = f; + +The following sequence of events is acceptable: + + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK + +But none of the following are: + + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E + + +INTERRUPT DISABLING FUNCTIONS +----------------------------- + +Functions that disable interrupts (LOCK equivalent) and enable interrupts +(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O +barriers are required in such a situation, they must be provided from some +other means. + + +MISCELLANEOUS FUNCTIONS +----------------------- + +Other functions that imply barriers: + + (*) schedule() and similar imply full memory barriers. + + +================================= +INTER-CPU LOCKING BARRIER EFFECTS +================================= + +On SMP systems locking primitives give a more substantial form of barrier: one +that does affect memory access ordering on other CPUs, within the context of +conflict on any particular lock. + + +LOCKS VS MEMORY ACCESSES +------------------------ + +Consider the following: the system has a pair of spinlocks (N) and (Q), and +three CPUs; then should the following sequence of events occur: + + CPU 1 CPU 2 + =============================== =============================== + *A = a; *E = e; + LOCK M LOCK Q + *B = b; *F = f; + *C = c; *G = g; + UNLOCK M UNLOCK Q + *D = d; *H = h; + +Then there is no guarantee as to what order CPU #3 will see the accesses to *A +through *H occur in, other than the constraints imposed by the separate locks +on the separate CPUs. It might, for example, see: + + *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M + +But it won't see any of: + + *B, *C or *D preceding LOCK M + *A, *B or *C following UNLOCK M + *F, *G or *H preceding LOCK Q + *E, *F or *G following UNLOCK Q + + +However, if the following occurs: + + CPU 1 CPU 2 + =============================== =============================== + *A = a; + LOCK M [1] + *B = b; + *C = c; + UNLOCK M [1] + *D = d; *E = e; + LOCK M [2] + *F = f; + *G = g; + UNLOCK M [2] + *H = h; + +CPU #3 might see: + + *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], + LOCK M [2], *H, *F, *G, UNLOCK M [2], *D + +But assuming CPU #1 gets the lock first, it won't see any of: + + *B, *C, *D, *F, *G or *H preceding LOCK M [1] + *A, *B or *C following UNLOCK M [1] + *F, *G or *H preceding LOCK M [2] + *A, *B, *C, *E, *F or *G following UNLOCK M [2] + + +LOCKS VS I/O ACCESSES +--------------------- + +Under certain circumstances (such as NUMA), I/O accesses within two spinlocked +sections on two different CPUs may be seen as interleaved by the PCI bridge. + +For example: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + writel(1, DATA); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + writel(5, DATA); + spin_unlock(Q); + +may be seen by the PCI bridge as follows: + + WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5 + +which would probably break. + +What is necessary here is to insert an mmiowb() before dropping the spinlock, +for example: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + writel(1, DATA); + mmiowb(); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + writel(5, DATA); + mmiowb(); + spin_unlock(Q); + +this will ensure that the two writes issued on CPU #1 appear at the PCI bridge +before either of the writes issued on CPU #2. + + +Furthermore, following a write by a read to the same device is okay, because +the read forces the write to complete before the read is performed: + + CPU 1 CPU 2 + =============================== =============================== + spin_lock(Q) + writel(0, ADDR) + a = readl(DATA); + spin_unlock(Q); + spin_lock(Q); + writel(4, ADDR); + b = readl(DATA); + spin_unlock(Q); + + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +========================== +KERNEL I/O BARRIER EFFECTS +========================== + +When accessing I/O memory, drivers should use the appropriate accessor +functions: + + (*) inX(), outX(): + + These are intended to talk to I/O space rather than memory space, but + that's primarily a CPU-specific concept. The i386 and x86_64 processors do + indeed have special I/O space access cycles and instructions, but many + CPUs don't have such a concept. + + The PCI bus, amongst others, defines an I/O space concept - which on such + CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O + space. However, it may also mapped as a virtual I/O space in the CPU's + memory map, particularly on those CPUs that don't support alternate + I/O spaces. + + Accesses to this space may be fully synchronous (as on i386), but + intermediary bridges (such as the PCI host bridge) may not fully honour + that. + + They are guaranteed to be fully ordered with respect to each other. + + They are not guaranteed to be fully ordered with respect to other types of + memory and I/O operation. + + (*) readX(), writeX(): + + Whether these are guaranteed to be fully ordered and uncombined with + respect to each other on the issuing CPU depends on the characteristics + defined for the memory window through which they're accessing. On later + i386 architecture machines, for example, this is controlled by way of the + MTRR registers. + + Ordinarily, these will be guaranteed to be fully ordered and uncombined,, + provided they're not accessing a prefetchable device. + + However, intermediary hardware (such as a PCI bridge) may indulge in + deferral if it so wishes; to flush a write, a read from the same location + is preferred[*], but a read from the same device or from configuration + space should suffice for PCI. + + [*] NOTE! attempting to read from the same location as was written to may + cause a malfunction - consider the 16550 Rx/Tx serial registers for + example. + + Used with prefetchable I/O memory, an mmiowb() barrier may be required to + force writes to be ordered. + + Please refer to the PCI specification for more information on interactions + between PCI transactions. + + (*) readX_relaxed() + + These are similar to readX(), but are not guaranteed to be ordered in any + way. Be aware that there is no I/O read barrier available. + + (*) ioreadX(), iowriteX() + + These will perform as appropriate for the type of access they're actually + doing, be it inX()/outX() or readX()/writeX(). + + +========== +REFERENCES +========== + +AMD64 Architecture Programmer's Manual Volume 2: System Programming + Chapter 7.1: Memory-Access Ordering + Chapter 7.4: Buffering and Combining Memory Writes + +IA-32 Intel Architecture Software Developer's Manual, Volume 3: +System Programming Guide + Chapter 7.1: Locked Atomic Operations + Chapter 7.2: Memory Ordering + Chapter 7.4: Serializing Instructions + +The SPARC Architecture Manual, Version 9 + Chapter 8: Memory Models + Appendix D: Formal Specification of the Memory Models + Appendix J: Programming with the Memory Models + +UltraSPARC Programmer Reference Manual + Chapter 5: Memory Accesses and Cacheability + Chapter 15: Sparc-V9 Memory Models + +UltraSPARC III Cu User's Manual + Chapter 9: Memory Models + +UltraSPARC IIIi Processor User's Manual + Chapter 8: Memory Models + +UltraSPARC Architecture 2005 + Chapter 9: Memory + Appendix D: Formal Specifications of the Memory Models + +UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 + Chapter 8: Memory Models + Appendix F: Caches and Cache Coherency + +Solaris Internals, Core Kernel Architecture, p63-68: + Chapter 3.3: Hardware Considerations for Locks and + Synchronization + +Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching +for Kernel Programmers: + Chapter 13: Other Memory Models + +Intel Itanium Architecture Software Developer's Manual: Volume 1: + Section 2.6: Speculation + Section 4.4: Memory Access ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-15 14:23 ` [PATCH] Document Linux's memory barriers [try #5] David Howells @ 2006-03-16 23:17 ` Paul E. McKenney 2006-03-16 23:55 ` Linus Torvalds 2006-03-23 18:34 ` David Howells [not found] ` <21253.1142509812@warthog.cambridge.redhat.com> 1 sibling, 2 replies; 33+ messages in thread From: Paul E. McKenney @ 2006-03-16 23:17 UTC (permalink / raw) To: David Howells; +Cc: torvalds, akpm, linux-arch, linuxppc64-dev, linux-kernel On Wed, Mar 15, 2006 at 02:23:19PM +0000, David Howells wrote: > > The attached patch documents the Linux kernel's memory barriers. > > I've updated it from the comments I've been given. > > The per-arch notes sections are gone because it's clear that there are so many > exceptions, that it's not worth having them. > > I've added a list of references to other documents. > > I've tried to get rid of the concept of memory accesses appearing on the bus; > what matters is apparent behaviour with respect to other observers in the > system. > > Interrupts barrier effects are now considered to be non-existent. They may be > there, but you may not rely on them. > > I've added a couple of definition sections at the top of the document: one to > specify the minimum execution model that may be assumed, the other to specify > what this document refers to by the term "memory". > > I've made greater mention of the use of mmiowb(). > > I've adjusted the way in which caches are described, and described the fun > that can be had with cache coherence maintenance being unordered and data > dependency not being necessarily implicit. > > I've described (smp_)read_barrier_depends(). Good stuff!!! Please see comments interspersed, search for empty lines. One particularly serious issue involve your smp_read_barrier_depends() example. > Signed-Off-By: David Howells <dhowells@redhat.com> > --- > warthog>diffstat -p1 /tmp/mb.diff > Documentation/memory-barriers.txt | 1039 ++++++++++++++++++++++++++++++++++++++ > 1 files changed, 1039 insertions(+) > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > new file mode 100644 > index 0000000..fd7a6f1 > --- /dev/null > +++ b/Documentation/memory-barriers.txt > @@ -0,0 +1,1039 @@ > + ============================ > + LINUX KERNEL MEMORY BARRIERS > + ============================ > + > +Contents: > + > + (*) Assumed minimum execution ordering model. > + > + (*) What is considered memory? Suggest the following -- most people need to rethink what memory means: (*) What is memory? Or maybe "What is 'memory'"? > + > + - Cached interactions. > + - Cache coherency. > + - Uncached interactions. > + > + (*) What are memory barriers? > + > + (*) Where are memory barriers needed? > + > + - Accessing devices. > + - Multiprocessor interaction. > + - Interrupts. > + > + (*) Explicit kernel compiler barriers. > + > + (*) Explicit kernel memory barriers. > + > + - Barrier pairing. > + > + (*) Implicit kernel memory barriers. > + > + - Locking functions. > + - Interrupt disabling functions. > + - Miscellaneous functions. > + > + (*) Inter-CPU locking barrier effects. > + > + - Locks vs memory accesses. > + - Locks vs I/O accesses. > + > + (*) Kernel I/O barrier effects. > + > + (*) References. > + > + > +======================================== > +ASSUMED MINIMUM EXECUTION ORDERING MODEL > +======================================== > + > +It has to be assumed that the conceptual CPU is weakly-ordered in all respects > +but that it will maintain the appearance of program causality with respect to > +itself. Some CPUs (such as i386 or x86_64) are more constrained than others > +(such as powerpc or frv), and so the most relaxed case must be assumed outside Might as well call it out: +(such as powerpc or frv), and so the most relaxed case (namely DEC Alpha) +must be assumed outside > +of arch-specific code. Also, I have some verbiage and diagrams of Alpha's operation at http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.03.13a.pdf Feel free to take any that helps. (Source for paper is Latex and xfig, for whatever that is worth.) > +This means that it must be considered that the CPU will execute its instruction > +stream in any order it feels like - or even in parallel - provided that if an > +instruction in the stream depends on the an earlier instruction, then that > +earlier instruction must be sufficiently complete[*] before the later > +instruction may proceed. Suggest replacing the last two lines with: +instruction may proceed, in other words, provided that the appearance of +causality is maintained. > + Suggest just folding the following footnotes into the above paragraph: > + [*] Some instructions have more than one effect[**] and different instructions > + may depend on different effects. > + > + [**] Eg: changes to condition codes and registers; memory events. > + > +A CPU may also discard any instruction sequence that ultimately winds up having > +no effect. For example if two adjacent instructions both load an immediate > +value into the same register, the first may be discarded. > + > + > +Similarly, it has to be assumed that compiler might reorder the instruction > +stream in any way it sees fit, again provided the appearance of causality is > +maintained. > + > + > +========================== > +WHAT IS CONSIDERED MEMORY? > +========================== +=============== +WHAT IS MEMORY? +=============== > +For the purpose of this specification what's meant by "memory" needs to be > +defined, and the division between CPU and memory needs to be marked out. > + > + > +CACHED INTERACTIONS > +------------------- > + > +As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU > +caches in the system. Although any particular read or write may not actually > +appear outside of the CPU that issued it (the CPU may may have been able to > +satisfy it from its own cache), it's still as if the memory access had taken > +place as far as the other CPUs are concerned since the cache coherency and > +ejection mechanisms will propegate the effects upon conflict. > + > + [*] Also applies to CPU vs device when accessed through a cache. > + > +The system can be considered logically as: Suggest showing the (typical) possibility of MMIO bypassing the CPU cache, as happens in many systems. See below for a hacky suggested change. > + <--- CPU ---> : <----------- Memory -----------> > + : > + +--------+ +--------+ : +--------+ +-----------+ > + | | | | : | | | | +--------+ > + | CPU | | Memory | : | CPU | | | | | > + | Core |--->| Access |----->| Cache |<-->| | | | > + | | | Queue | : | | | |--->| Memory | > + | | | | : | | | | | | > + +--------+ +--------+ : +--------+ | | | | > + V : | Cache | +--------+ > + | : | Coherency | > + | : | Mechanism | +--------+ > + +--------+ +--------+ : +--------+ | | | | > + | | | | : | | | | | | > + | CPU | | Memory | : | CPU | | |--->| Device | > + | Core |--->| Access |----->| Cache |<-->| | | | > + | | | Queue | : | | | | | | > + | | | | : | | | | +----+---+ > + +--------+ +--------+ : +--------+ +-----------+ ^ > + V : | > + | : | > + +----------------------------------------------+ > + : > + : In the past, there have been systems in which the devices did -not- see coherent memory, where (for example) CPU caches had to be manually flushed before a DMA operation was allowed to procede. > +The CPU core may execute instructions in any order it deems fit, provided the > +expected program causality appears to be maintained. Some of the instructions > +generate load and store operations which then go into the memory access queue > +to be performed. The core may place these in the queue in any order it wishes, > +and continue execution until it is forced to wait for an instruction to > +complete. > + > +What memory barriers are concerned with is controlling the order in which > +accesses cross from the CPU side of things to the memory side of things, and > +the order in which the effects are perceived to happen by the other observers > +in the system. Suggest emphasizing this point: +Memory barriers are -not- needed within a given CPU, as CPUs always see +their own loads and stores as if they had happened in program order. Might also want to define program order, execution order, and perceived order, or similar terms. > +CACHE COHERENCY > +--------------- > + > +Life isn't quite as simple as it may appear above, however: for while the > +caches are expected to be coherent, there's no guarantee that that coherency > +will be ordered. This means that whilst changes made on one CPU will > +eventually become visible on all CPUs, there's no guarantee that they will > +become apparent in the same order on those other CPUs. Nit, perhaps not worth calling out -- a given change might well be overwritten by a second change, so that the first change is never seen by other CPUs. > +Consider dealing with a pair of CPUs (1 & 2), each of which has a pair of > +parallel data caches (A,B and C,D) with the following properties: > + > + (*) a cacheline may be in one of the caches; > + > + (*) whilst the CPU core is interrogating one cache, the other cache may be > + making use of the bus to access the rest of the system - perhaps to > + displace a dirty cacheline or to do a speculative read; > + > + (*) each cache has a queue of operations that need to be applied to that cache > + to maintain coherency with the rest of the system; > + > + (*) the coherency queue is not flushed by normal reads to lines already > + present in the cache, even though the contents of the queue may > + potentially effect those reads. Suggest the following: + (*) the coherency queue is not flushed by normal reads to lines already + present in the cache, however, the reads will access the data in + the coherency queue in preference to the (obsoleted) data in the cache. + This is necessary to give the CPU the illusion that its operations + are being performed in order. > +Imagine, then, two writes made on the first CPU, with a barrier between them to > +guarantee that they will reach that CPU's caches in the requisite order: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p Suggest the more accurate: + smp_wmb(); Make sure change to b visible before p The CPU is within its rights to execute these out of order as long as all the other CPUs perceive the change to b as preceding the change to p. And I believe that many CPUs in fact do just this sort of sneaky reordering, e.g., via speculative execution. > + <A:modify b=2> The cacheline of b is now ours > + p = &b; > + <A:modify p=&b> The cacheline of p is now ours > + > +The write memory barrier forces the local CPU caches to be updated in the > +correct order. It actually forces the other CPUs to see the updates in the correct order, the updates to the local CPU caches might well be misordered. Also, shouldn't one or the other of the "A:"s above be "B:"? I am assuming that they are supposed to refer to the "parallel data caches (A,B and C,D)" mentioned earlier. If they don't, I have no idea what they are supposed to mean. > + But now imagine that the second CPU that wants to read those > +values: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + ... > + q = p; > + d = *q; > + > +The above pair of reads may then fail to apparently happen in expected order, > +as the cacheline holding p may get updated in one of the second CPU's caches > +whilst the update to the cacheline holding b is delayed in the other of the > +second CPU's caches by some other cache event: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p > + <A:modify b=2> <C:busy> > + <C:queue b=2> > + p = &b; q = p; > + <D:request p> > + <A:modify p=&b> <D:commit p=&b> > + <D:read p> > + d = *q; > + <C:read *q> Reads from b before b updated in cache > + <C:unbusy> > + <C:commit b=2> I don't understand the "A:", "B:", etc. in the above example. Is variable "b" in cacheline A or in cacheline C? Or do A and C mean something other than "cacheline" as indicated earlier. > +Basically, whilst both cachelines will be updated on CPU 2 eventually, there's > +no guarantee that, without intervention, the order of update will be the same > +as that committed on CPU 1. > + > + > +To intervene, we need to emplace a data dependency barrier or a read barrier. > +This will force the cache to commit its coherency queue before processing any > +further requests: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p > + <A:modify b=2> <C:busy> > + <C:queue b=2> > + p = &b; q = p; > + <D:request p> > + <A:modify p=&b> <D:commit p=&b> > + <D:read p> > + smp_read_barrier_depends() > + <C:unbusy> > + <C:commit b=2> > + d = *q; > + <C:read *q> Reads from b after b updated in cache Again, I don't understand the "A:", "B:", etc. in the above example. Is variable "b" in cacheline A or in cacheline C? Or do A and C mean something other than "cacheline" as indicated earlier. > +This sort of problem can be encountered on Alpha processors as they have a > +split cache that improves performance by making better use of the data bus. > +Whilst most CPUs do imply a data dependency barrier on the read when a memory > +access depends on a read, not all do, so it may not be relied on. > + > + > +UNCACHED INTERACTIONS > +--------------------- > + > +Note that the above model does not show uncached memory or I/O accesses. These > +procede directly from the queue to the memory or the devices, bypassing any > +caches or cache coherency: > + > + <--- CPU ---> : > + : +-----+ There are some gratuitious tabs in the above line. > + +--------+ +--------+ : | | > + | | | | : | | +---------+ > + | CPU | | Memory | : | | | | > + | Core |--->| Access |--------------->| | | | > + | | | Queue | : | |------------->| Memory | > + | | | | : | | | | > + +--------+ +--------+ : | | | | > + : | | +---------+ > + : | Bus | > + : | | +---------+ > + +--------+ +--------+ : | | | | > + | | | | : | | | | > + | CPU | | Memory | : | |<------------>| Device | > + | Core |--->| Access |--------------->| | | | > + | | | Queue | : | | | | > + | | | | : | | +---------+ > + +--------+ +--------+ : | | > + : +-----+ > + : > + > + > +========================= > +WHAT ARE MEMORY BARRIERS? > +========================= > + > +Memory barriers are instructions to both the compiler and the CPU to impose an > +apparent partial ordering between the memory access operations specified either Suggest s/apparent/perceived/ > +side of the barrier. They request that the sequence of memory events generated > +appears to other components of the system as if the barrier is effective on > +that CPU. > + > +Note that: > + > + (*) there's no guarantee that the sequence of memory events is _actually_ so > + ordered. It's possible for the CPU to do out-of-order accesses _as long > + as no-one is looking_, and then fix up the memory if someone else tries to > + see what's going on (for instance a bus master device); what matters is > + the _apparent_ order as far as other processors and devices are concerned; > + and > + > + (*) memory barriers are only guaranteed to act within the CPU processing them, > + and are not, for the most part, guaranteed to percolate down to other CPUs > + in the system or to any I/O hardware that that CPU may communicate with. Suggest s/not, for the most part,/_not_ necessarily/ > +For example, a programmer might take it for granted that the CPU will perform > +memory accesses in exactly the order specified, so that if a CPU is, for > +example, given the following piece of code: > + > + a = *A; > + *B = b; > + c = *C; > + d = *D; > + *E = e; > + > +They would then expect that the CPU will complete the memory access for each > +instruction before moving on to the next one, leading to a definite sequence of > +operations as seen by external observers in the system: > + > + read *A, write *B, read *C, read *D, write *E. > + > + > +Reality is, of course, much messier. With many CPUs and compilers, this isn't > +always true because: > + > + (*) reads are more likely to need to be completed immediately to permit > + execution progress, whereas writes can often be deferred without a > + problem; > + > + (*) reads can be done speculatively, and then the result discarded should it > + prove not to be required; > + > + (*) the order of the memory accesses may be rearranged to promote better use > + of the CPU buses and caches; > + > + (*) reads and writes may be combined to improve performance when talking to > + the memory or I/O hardware that can do batched accesses of adjacent > + locations, thus cutting down on transaction setup costs (memory and PCI > + devices may be able to do this); and > + > + (*) the CPU's data cache may affect the ordering, and whilst cache-coherency > + mechanisms may alleviate this - once the write has actually hit the cache > + - there's no guarantee that the coherency management will be propegated in > + order to other CPUs. > + > +So what another CPU, say, might actually observe from the above piece of code > +is: > + > + read *A, read {*C,*D}, write *E, write *B > + > + (By "read {*C,*D}" I mean a combined single read). > + > + > +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ > +accesses appear to be correctly ordered, without the need for a memory barrier. > +For instance with the following code: > + > + X = *A; > + *A = Y; > + Z = *A; > + > +assuming no intervention by an external influence, it can be taken that: > + > + (*) X will hold the old value of *A, and will never happen after the write and > + thus end up being given the value that was assigned to *A from Y instead; > + and > + > + (*) Z will always be given the value in *A that was assigned there from Y, and > + will never happen before the write, and thus end up with the same value > + that was in *A initially. > + > +(This is ignoring the fact that the value initially in *A may appear to be the > +same as the value assigned to *A from Y). > + > + > +================================= > +WHERE ARE MEMORY BARRIERS NEEDED? > +================================= > + > +Under normal operation, access reordering is probably not going to be a problem > +as a linear program will still appear to operate correctly. There are, > +however, three circumstances where reordering definitely _could_ be a problem: Suggest listing the three circumstances (accessing devices, multiprocessor interaction, interrupts) before the individual sections. Don't leave us in suspense! ;-) > +ACCESSING DEVICES > +----------------- > + > +Many devices can be memory mapped, and so appear to the CPU as if they're just > +memory locations. However, to control the device, the driver has to make the > +right accesses in exactly the right order. > + > +Consider, for example, an ethernet chipset such as the AMD PCnet32. It > +presents to the CPU an "address register" and a bunch of "data registers". The > +way it's accessed is to write the index of the internal register to be accessed > +to the address register, and then read or write the appropriate data register > +to access the chip's internal register, which could - theoretically - be done > +by: > + > + *ADR = ctl_reg_3; > + reg = *DATA; > + > +The problem with a clever CPU or a clever compiler is that the write to the > +address register isn't guaranteed to happen before the access to the data > +register, if the CPU or the compiler thinks it is more efficient to defer the > +address write: > + > + read *DATA, write *ADR > + > +then things will break. > + > + > +In the Linux kernel, however, I/O should be done through the appropriate > +accessor routines - such as inb() or writel() - which know how to make such > +accesses appropriately sequential. > + > +On some systems, I/O writes are not strongly ordered across all CPUs, and so > +locking should be used, and mmiowb() must be issued prior to unlocking the > +critical section. > + > +See Documentation/DocBook/deviceiobook.tmpl for more information. > + > + > +MULTIPROCESSOR INTERACTION > +-------------------------- > + > +When there's a system with more than one processor, the CPUs in the system may > +be working on the same set of data at the same time. This can cause > +synchronisation problems, and the usual way of dealing with them is to use > +locks - but locks are quite expensive, and so it may be preferable to operate > +without the use of a lock if at all possible. In such a case accesses that > +affect both CPUs may have to be carefully ordered to prevent error. > + > +Consider the R/W semaphore slow path. In that, a waiting process is queued on > +the semaphore, as noted by it having a record on its stack linked to the > +semaphore's list: > + > + struct rw_semaphore { > + ... > + struct list_head waiters; > + }; > + > + struct rwsem_waiter { > + struct list_head list; > + struct task_struct *task; > + }; > + > +To wake up the waiter, the up_read() or up_write() functions have to read the > +pointer from this record to know as to where the next waiter record is, clear > +the task pointer, call wake_up_process() on the task, and release the reference > +held on the waiter's task struct: > + > + READ waiter->list.next; > + READ waiter->task; > + WRITE waiter->task; > + CALL wakeup > + RELEASE task > + > +If any of these steps occur out of order, then the whole thing may fail. > + > +Note that the waiter does not get the semaphore lock again - it just waits for > +its task pointer to be cleared. Since the record is on its stack, this means > +that if the task pointer is cleared _before_ the next pointer in the list is > +read, another CPU might start processing the waiter and it might clobber its > +stack before up*() functions have a chance to read the next pointer. > + > + CPU 1 CPU 2 > + =============================== =============================== > + down_xxx() > + Queue waiter > + Sleep > + up_yyy() > + READ waiter->task; > + WRITE waiter->task; > + <preempt> > + Resume processing > + down_xxx() returns > + call foo() > + foo() clobbers *waiter > + </preempt> > + READ waiter->list.next; > + --- OOPS --- > + > +This could be dealt with using a spinlock, but then the down_xxx() function has > +to get the spinlock again after it's been woken up, which is a waste of > +resources. > + > +The way to deal with this is to insert an SMP memory barrier: > + > + READ waiter->list.next; > + READ waiter->task; > + smp_mb(); > + WRITE waiter->task; > + CALL wakeup > + RELEASE task > + > +In this case, the barrier makes a guarantee that all memory accesses before the > +barrier will appear to happen before all the memory accesses after the barrier > +with respect to the other CPUs on the system. It does _not_ guarantee that all > +the memory accesses before the barrier will be complete by the time the barrier > +itself is complete. > + > +SMP memory barriers are normally nothing more than compiler barriers on a Suggest s/barriers are/barriers (e.g., smp_mb(), as opposed to mb()) are/ > +kernel compiled for a UP system because the CPU orders overlapping accesses > +with respect to itself, and so CPU barriers aren't needed. > + > + > +INTERRUPTS > +---------- > + > +A driver may be interrupted by its own interrupt service routine, and thus they > +may interfere with each other's attempts to control or access the device. > + > +This may be alleviated - at least in part - by disabling interrupts (a form of > +locking), such that the critical operations are all contained within the > +interrupt-disabled section in the driver. Whilst the driver's interrupt > +routine is executing, the driver's core may not run on the same CPU, and its > +interrupt is not permitted to happen again until the current interrupt has been > +handled, thus the interrupt handler does not need to lock against that. > + > +However, consider a driver was talking to an ethernet card that sports an > +address register and a data register. If that driver's core is talks to the > +card under interrupt-disablement and then the driver's interrupt handler is > +invoked: > + > + LOCAL IRQ DISABLE > + writew(ADDR, ctl_reg_3); > + writew(DATA, y); > + LOCAL IRQ ENABLE > + <interrupt> > + writew(ADDR, ctl_reg_4); > + q = readw(DATA); > + </interrupt> > + > +If ordering rules are sufficiently relaxed, the write to the data register > +might happen after the second write to the address register. > + > +If ordering rules are relaxed, it must be assumed that accesses done inside an > +interrupt disabled section may leak outside of it and may interleave with > +accesses performed in an interrupt and vice versa unless implicit or explicit > +barriers are used. > + > +Normally this won't be a problem because the I/O accesses done inside such > +sections will include synchronous read operations on strictly ordered I/O > +registers that form implicit I/O barriers. If this isn't sufficient then an > +mmiowb() may need to be used explicitly. > + > + > +A similar situation may occur between an interrupt routine and two routines > +running on separate CPUs that communicate with each other. If such a case is > +likely, then interrupt-disabling locks should be used to guarantee ordering. > + > + > +================================= > +EXPLICIT KERNEL COMPILER BARRIERS > +================================= > + > +The Linux kernel has an explicit compiler barrier function that prevents the > +compiler from moving the memory accesses either side of it to the other side: > + > + barrier(); > + > +This has no direct effect on the CPU, which may then reorder things however it > +wishes. > + > + > +=============================== > +EXPLICIT KERNEL MEMORY BARRIERS > +=============================== > + > +The Linux kernel has eight basic CPU memory barriers: > + > + TYPE MANDATORY SMP CONDITIONAL > + =============== ======================= =========================== > + GENERAL mb() smp_mb() > + WRITE wmb() smp_wmb() > + READ rmb() smp_rmb() > + DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() > + > +General memory barriers give a guarantee that all memory accesses specified > +before the barrier will appear to happen before all memory accesses specified > +after the barrier with respect to the other components of the system. > + > +Read and write memory barriers give similar guarantees, but only for memory > +reads versus memory reads and memory writes versus memory writes respectively. > + > +Data dependency memory barriers ensure that if two reads are issued such that > +the second depends on the result of the first, then the first read is completed > +and the cache coherence is up to date before the second read is performed. The > +primary case is where the address used in a subsequent read is calculated from > +the result of the first read: > + > + CPU 1 CPU 2 > + =============== =============== > + { a == 0, b == 1 and p == &a, q == &a } > + ... > + b = 2; > + smp_wmb(); > + p = &b; q = p; > + smp_read_barrier_depends(); > + d = *q; > + > +Without the data dependency barrier, any of the following results could be > +seen: > + > + POSSIBLE RESULT PERMISSIBLE ORIGIN > + =============== =============== ======================================= > + q == &a, d == 0 Yes > + q == &b, d == 2 Yes > + q == &b, d == 1 No Cache coherency maintenance delay Either s/No/Yes/ in the preceding line, or s/Without/With/ earlier. I believe the former is better. In absence of the smp_read_barrier_depends(), you really -can- see the (q == &b, d == 1) case on DEC Alpha!!! So the preceding line should instead be: + q == &b, d == 1 Yes This extremely counterintuitive situation arises most easily on machines with split caches, so that (for example) one cache bank processes even-numbered cache lines and the other bank processes odd-numbered cache lines. The pointer p might be stored in an odd-numbered cache line, and the variable b might be stored in an even-numbered cache line. Then, if the even-numbered bank of the reading CPU's cache is extremely busy while the odd-numbered bank is idle, one can see the new value of the pointer (&b), but the old value of the variable (1). I must confess to having strenuously resisted this scenario when first introduced to it, and Wayne Cardoza should be commended on his patience in leading me through it... Please see http://www.openvms.compaq.com/wizard/wiz_2637.html if you think that I am making all this up! This DEC-Alpha example is extremely important, since it defines Linux's memory-barrier semantics. I have some verbiage and diagrams in the aforementioned PDF, feel free to use them. > + q == &b, d == 0 No q read after a > + > +See the "Cache Coherency" section above. > + > + > +All memory barriers imply compiler barriers. > + > +Read memory barriers imply data dependency barriers. > + > +SMP memory barriers are reduced to compiler barriers on uniprocessor compiled > +systems because it is assumed that a CPU will be apparently self-consistent, > +and will order overlapping accesses correctly with respect to itself. Suggest adding: However, mandatory memory barriers still have effect in uniprocessor kernel builds due to the need to correctly order reads and writes to device registers. to this paragraph. > +There is no guarantee that any of the memory accesses specified before a memory > +barrier will be complete by the completion of a memory barrier; the barrier can > +be considered to draw a line in that CPU's access queue that accesses of the > +appropriate type may not cross. > + > +There is no guarantee that issuing a memory barrier on one CPU will have any > +direct effect on another CPU or any other hardware in the system. The indirect > +effect will be the order in which the second CPU sees the effects of the first > +CPU's accesses occur. And the second CPU may need to execute memory barriers of its own to reliably see the ordering effects. Might be good to explicitly note this. > +There is no guarantee that some intervening piece of off-the-CPU hardware[*] > +will not reorder the memory accesses. CPU cache coherency mechanisms should > +propegate the indirect effects of a memory barrier between CPUs, but may not do Suggest s/propegate/propagate/ > +so in order (see above). > + > + [*] For information on bus mastering DMA and coherency please read: > + > + Documentation/pci.txt > + Documentation/DMA-mapping.txt > + Documentation/DMA-API.txt > + > +Note that these are the _minimum_ guarantees. Different architectures may give > +more substantial guarantees, but they may not be relied upon outside of arch > +specific code. > + > + > +There are some more advanced barrier functions: > + > + (*) set_mb(var, value) > + (*) set_wmb(var, value) > + > + These assign the value to the variable and then insert at least a write > + barrier after it, depending on the function. They aren't guaranteed to > + insert anything more than a compiler barrier in a UP compilation. smp_mb__before_atomic_dec() and friends as well? > +BARRIER PAIRING > +--------------- Great section, good how-to information!!! > +Certain types of memory barrier should always be paired. A lack of an > +appropriate pairing is almost certainly an error. > + > +An smp_wmb() should always be paired with an smp_read_barrier_depends() or an > +smp_rmb(), though an smp_mb() would also be acceptable. Similarly an smp_rmb() > +or an smp_read_barrier_depends() should always be paired with at least an > +smp_wmb(), though, again, an smp_mb() is acceptable: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================= > + a = 1; > + smp_wmb(); Could be smp_mb() > + b = 2; > + x = a; > + smp_rmb(); Could be smp_mb() > + y = b; Need to note that (for example) set_mb(), smp_mb__before_atomic_dec(), and friends can stand in for smp_mb(). Similarly with related primitives. > +The data dependency barrier - smp_read_barrier_depends() - is a very specific > +optimisation. It's a weaker form of the read memory barrier, for use in the > +case where there's a dependency between two reads - since on some architectures > +we can't rely on the CPU to imply a data dependency barrier. The typical case > +is where a read is used to generate an address for a subsequent memory access: > + > + CPU 1 CPU 2 COMMENT > + =============== =============================== ======================= > + a = 1; > + smp_wmb(); Could be smp_mb() > + b = &a; > + x = b; > + smp_read_barrier_depends(); Or smp_rmb()/smp_mb() > + y = *x; > + > +This is used in the RCU system. Strictly speaking, this is not dependent on RCU, but probably close enough for now... Might be worth noting that array accesses are data-dependent on the corresponding index. Some architectures have more expansive definition of data dependency, including then- and else-clauses being data-dependent on the if-condition, but this is probably too much detail. > +If the two reads are independent, an smp_read_barrier_depends() is not > +sufficient, and an smp_rmb() or better must be employed. > + > +Basically, the read barrier always has to be there, even though it can be of > +the "weaker" type. > + > +Note also that the address really has to have a _data_ dependency, not a > +control dependency. If the address is dependent on the first read, but the > +dependency is through a conditional rather than actually reading the address > +itself, then it's a control dependency, and existing CPU's already deal with > +those through branch prediction. > + > + > +=============================== > +IMPLICIT KERNEL MEMORY BARRIERS > +=============================== > + > +Some of the other functions in the linux kernel imply memory barriers, amongst > +which are locking and scheduling functions. > + > +This specification is a _minimum_ guarantee; any particular architecture may > +provide more substantial guarantees, but these may not be relied upon outside > +of arch specific code. > + > + > +LOCKING FUNCTIONS > +----------------- > + > +The Linux kernel has a number of locking constructs: > + > + (*) spin locks > + (*) R/W spin locks > + (*) mutexes > + (*) semaphores > + (*) R/W semaphores > + > +In all cases there are variants on "LOCK" operations and "UNLOCK" operations > +for each construct. These operations all imply certain barriers: > + > + (*) LOCK operation implication: > + > + Memory accesses issued after the LOCK will be completed after the LOCK > + accesses have completed. Suggest s/accesses have/operation has/ here and below. > + Memory accesses issued before the LOCK may be completed after the LOCK > + accesses have completed. Suggest making the consequence very clear, perhaps adding ", so that code preceding a lock acquisition can "bleed" into the critical section". > + (*) UNLOCK operation implication: > + > + Memory accesses issued before the UNLOCK will be completed before the > + UNLOCK accesses have completed. > + > + Memory accesses issued after the UNLOCK may be completed before the UNLOCK > + accesses have completed. Again, suggest making the consequence very clear, perhaps adding ", so that code following a lock release can "bleed into" the critical section". > + (*) LOCK vs UNLOCK implication: > + > + The LOCK accesses will be completed before the UNLOCK accesses. > + > + Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, > + but a LOCK followed by an UNLOCK is not. > + > + (*) Failed conditional LOCK implication: > + > + Certain variants of the LOCK operation may fail, either due to being > + unable to get the lock immediately, or due to receiving an unblocked > + signal whilst asleep waiting for the lock to become available. Failed > + locks do not imply any sort of barrier. > + > +Locks and semaphores may not provide any guarantee of ordering on UP compiled > +systems, and so cannot be counted on in such a situation to actually achieve > +anything at all - especially with respect to I/O accesses - unless combined > +with interrupt disabling operations. > + > +See also the section on "Inter-CPU locking barrier effects". > + > + > +As an example, consider the following: > + > + *A = a; > + *B = b; > + LOCK > + *C = c; > + *D = d; > + UNLOCK > + *E = e; > + *F = f; > + > +The following sequence of events is acceptable: > + > + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK > + > +But none of the following are: > + > + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E > + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F > + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F > + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E > + > + > +INTERRUPT DISABLING FUNCTIONS > +----------------------------- > + > +Functions that disable interrupts (LOCK equivalent) and enable interrupts > +(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O > +barriers are required in such a situation, they must be provided from some > +other means. > + > + > +MISCELLANEOUS FUNCTIONS > +----------------------- > + > +Other functions that imply barriers: > + > + (*) schedule() and similar imply full memory barriers. > + > + > +================================= > +INTER-CPU LOCKING BARRIER EFFECTS > +================================= > + > +On SMP systems locking primitives give a more substantial form of barrier: one > +that does affect memory access ordering on other CPUs, within the context of > +conflict on any particular lock. > + > + > +LOCKS VS MEMORY ACCESSES > +------------------------ > + > +Consider the following: the system has a pair of spinlocks (N) and (Q), and > +three CPUs; then should the following sequence of events occur: > + > + CPU 1 CPU 2 > + =============================== =============================== > + *A = a; *E = e; > + LOCK M LOCK Q > + *B = b; *F = f; > + *C = c; *G = g; > + UNLOCK M UNLOCK Q > + *D = d; *H = h; > + > +Then there is no guarantee as to what order CPU #3 will see the accesses to *A > +through *H occur in, other than the constraints imposed by the separate locks > +on the separate CPUs. It might, for example, see: > + > + *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M > + > +But it won't see any of: > + > + *B, *C or *D preceding LOCK M > + *A, *B or *C following UNLOCK M > + *F, *G or *H preceding LOCK Q > + *E, *F or *G following UNLOCK Q > + > + > +However, if the following occurs: > + > + CPU 1 CPU 2 > + =============================== =============================== > + *A = a; > + LOCK M [1] > + *B = b; > + *C = c; > + UNLOCK M [1] > + *D = d; *E = e; > + LOCK M [2] > + *F = f; > + *G = g; > + UNLOCK M [2] > + *H = h; > + > +CPU #3 might see: > + > + *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], > + LOCK M [2], *H, *F, *G, UNLOCK M [2], *D > + > +But assuming CPU #1 gets the lock first, it won't see any of: > + > + *B, *C, *D, *F, *G or *H preceding LOCK M [1] > + *A, *B or *C following UNLOCK M [1] > + *F, *G or *H preceding LOCK M [2] > + *A, *B, *C, *E, *F or *G following UNLOCK M [2] > + > + > +LOCKS VS I/O ACCESSES > +--------------------- > + > +Under certain circumstances (such as NUMA), I/O accesses within two spinlocked > +sections on two different CPUs may be seen as interleaved by the PCI bridge. Suggest adding something to the effect of ", because the PCI bridge is not participating in the cache-coherence protocol, and therefore is incapable of issuing the required read-side memory barriers." > +For example: > + > + CPU 1 CPU 2 > + =============================== =============================== > + spin_lock(Q) > + writel(0, ADDR) > + writel(1, DATA); > + spin_unlock(Q); > + spin_lock(Q); > + writel(4, ADDR); > + writel(5, DATA); > + spin_unlock(Q); > + > +may be seen by the PCI bridge as follows: > + > + WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5 > + > +which would probably break. > + > +What is necessary here is to insert an mmiowb() before dropping the spinlock, > +for example: > + > + CPU 1 CPU 2 > + =============================== =============================== > + spin_lock(Q) > + writel(0, ADDR) > + writel(1, DATA); > + mmiowb(); > + spin_unlock(Q); > + spin_lock(Q); > + writel(4, ADDR); > + writel(5, DATA); > + mmiowb(); > + spin_unlock(Q); > + > +this will ensure that the two writes issued on CPU #1 appear at the PCI bridge > +before either of the writes issued on CPU #2. > + > + > +Furthermore, following a write by a read to the same device is okay, because > +the read forces the write to complete before the read is performed: > + > + CPU 1 CPU 2 > + =============================== =============================== > + spin_lock(Q) > + writel(0, ADDR) > + a = readl(DATA); > + spin_unlock(Q); > + spin_lock(Q); > + writel(4, ADDR); > + b = readl(DATA); > + spin_unlock(Q); > + > + > +See Documentation/DocBook/deviceiobook.tmpl for more information. > + > + > +========================== > +KERNEL I/O BARRIER EFFECTS > +========================== > + > +When accessing I/O memory, drivers should use the appropriate accessor > +functions: > + > + (*) inX(), outX(): > + > + These are intended to talk to I/O space rather than memory space, but > + that's primarily a CPU-specific concept. The i386 and x86_64 processors do > + indeed have special I/O space access cycles and instructions, but many > + CPUs don't have such a concept. > + > + The PCI bus, amongst others, defines an I/O space concept - which on such > + CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O > + space. However, it may also mapped as a virtual I/O space in the CPU's > + memory map, particularly on those CPUs that don't support alternate > + I/O spaces. > + > + Accesses to this space may be fully synchronous (as on i386), but > + intermediary bridges (such as the PCI host bridge) may not fully honour > + that. > + > + They are guaranteed to be fully ordered with respect to each other. > + > + They are not guaranteed to be fully ordered with respect to other types of > + memory and I/O operation. > + > + (*) readX(), writeX(): > + > + Whether these are guaranteed to be fully ordered and uncombined with > + respect to each other on the issuing CPU depends on the characteristics > + defined for the memory window through which they're accessing. On later > + i386 architecture machines, for example, this is controlled by way of the > + MTRR registers. > + > + Ordinarily, these will be guaranteed to be fully ordered and uncombined,, > + provided they're not accessing a prefetchable device. > + > + However, intermediary hardware (such as a PCI bridge) may indulge in > + deferral if it so wishes; to flush a write, a read from the same location > + is preferred[*], but a read from the same device or from configuration > + space should suffice for PCI. > + > + [*] NOTE! attempting to read from the same location as was written to may > + cause a malfunction - consider the 16550 Rx/Tx serial registers for > + example. > + > + Used with prefetchable I/O memory, an mmiowb() barrier may be required to > + force writes to be ordered. > + > + Please refer to the PCI specification for more information on interactions > + between PCI transactions. > + > + (*) readX_relaxed() > + > + These are similar to readX(), but are not guaranteed to be ordered in any > + way. Be aware that there is no I/O read barrier available. > + > + (*) ioreadX(), iowriteX() > + > + These will perform as appropriate for the type of access they're actually > + doing, be it inX()/outX() or readX()/writeX(). > + > + > +========== > +REFERENCES > +========== Suggest adding: Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek, Digital Press) Chapter 5.2: Physical Address Space Characteristics Chapter 5.4: Caches and Write Buffers Chapter 5.5: Data Sharing Chapter 5.6: Read/Write Ordering > +AMD64 Architecture Programmer's Manual Volume 2: System Programming > + Chapter 7.1: Memory-Access Ordering > + Chapter 7.4: Buffering and Combining Memory Writes > + > +IA-32 Intel Architecture Software Developer's Manual, Volume 3: > +System Programming Guide > + Chapter 7.1: Locked Atomic Operations > + Chapter 7.2: Memory Ordering > + Chapter 7.4: Serializing Instructions > + > +The SPARC Architecture Manual, Version 9 > + Chapter 8: Memory Models > + Appendix D: Formal Specification of the Memory Models > + Appendix J: Programming with the Memory Models > + > +UltraSPARC Programmer Reference Manual > + Chapter 5: Memory Accesses and Cacheability > + Chapter 15: Sparc-V9 Memory Models > + > +UltraSPARC III Cu User's Manual > + Chapter 9: Memory Models > + > +UltraSPARC IIIi Processor User's Manual > + Chapter 8: Memory Models > + > +UltraSPARC Architecture 2005 > + Chapter 9: Memory > + Appendix D: Formal Specifications of the Memory Models > + > +UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 > + Chapter 8: Memory Models > + Appendix F: Caches and Cache Coherency > + > +Solaris Internals, Core Kernel Architecture, p63-68: > + Chapter 3.3: Hardware Considerations for Locks and > + Synchronization > + > +Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching > +for Kernel Programmers: > + Chapter 13: Other Memory Models > + > +Intel Itanium Architecture Software Developer's Manual: Volume 1: > + Section 2.6: Speculation > + Section 4.4: Memory Access > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-16 23:17 ` Paul E. McKenney @ 2006-03-16 23:55 ` Linus Torvalds 2006-03-17 1:29 ` Paul E. McKenney 2006-03-23 18:34 ` David Howells 1 sibling, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2006-03-16 23:55 UTC (permalink / raw) To: Paul E. McKenney Cc: David Howells, Andrew Morton, linux-arch, linuxppc64-dev, Linux Kernel Mailing List On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > Also, I have some verbiage and diagrams of Alpha's operation at > http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.03.13a.pdf > Feel free to take any that helps. (Source for paper is Latex and xfig, > for whatever that is worth.) This paper too claims that x86-64 has somehow different memory ordering constraints than regular x86. Do you actually have a source for that statement, or is it just a continuation of what looks like confusion in the Linux x86-64 header files? (Also, x86 doesn't have an incoherent instruction cache - some older x86 cores have an incoherent instruction decode _buffer_, but that's a slightly different issue with basically no effect on any sane program). Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-16 23:55 ` Linus Torvalds @ 2006-03-17 1:29 ` Paul E. McKenney 2006-03-17 5:32 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Paul E. McKenney @ 2006-03-17 1:29 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Andrew Morton, linux-arch, linuxppc64-dev, Linux Kernel Mailing List On Thu, Mar 16, 2006 at 03:55:07PM -0800, Linus Torvalds wrote: > > > On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > > > Also, I have some verbiage and diagrams of Alpha's operation at > > http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.03.13a.pdf > > Feel free to take any that helps. (Source for paper is Latex and xfig, > > for whatever that is worth.) > > This paper too claims that x86-64 has somehow different memory ordering > constraints than regular x86. Do you actually have a source for that > statement, or is it just a continuation of what looks like confusion in > the Linux x86-64 header files? I based the table in that paper on "AMD x86-64 Architecture Programmer's Manual, Volume 2: System Programming", AMD publication #24593 rev 3.07 published in September 2002. This is a hardcopy (but soft-bound) book. In section 7.1.1 on page 195, it says: For cacheable memory types, the following rules govern read ordering: o Out-of-order reads are allowed. Out-of-order reads can occur as a result of out-of-order instruction execution or speculative execution. The processor can read memory out-of-order to allow out-of-order to proceed. o Speculative reads are allows ... [but no effect on ordering beyond that given in the other rules, near as I can tell] o Reads can be reordered ahead of writes. Reads are generally given a higher priority by the processor than writes because instruction execution stalls if the read data required by an instruction is not immediately available. Allowing reads ahead of writes usually maximizes software performance. o [additional constraints if read and write are to the same location] In section 7.1.2 on pages 195-6, it says: o Generally, out-of-order writes are -not- allowed. Write instructions executed out of order cannot commit (write) their result to memory until all previous instructions have completed in program order. The processor can, however, hold the result of an out-of-order write instruction in a private buffer (not visible to software) until that result can be committed to memory. o [write-combining memory -- but we are not talking about frame buffers.] o Speculative writes are -not- allowed. As with out-of-order writes, speculative write instructions cannot commit their result to memory until all previous instructions have completed in program order. Processors can hold the result in a private buffer (not visible to the software) until the result can be committed. o Write buffering is allowed. When a write instruction completes and commits its result, that result can be buffered before actually writing the result into a memory location in program order. Although the write buffer itself is not directly accessible by software, the results in the buffer are accessible during memory accesses to he locations that are buffered. For cacheable memory types, the write buffer can be read out-of-order and speculatively read, just like memory. o Write combining is allowed ... [but we aren't worried about frame buffers.] Of course, I might well have misinterpreted something. It would not be the first time. ;-) > (Also, x86 doesn't have an incoherent instruction cache - some older x86 > cores have an incoherent instruction decode _buffer_, but that's a > slightly different issue with basically no effect on any sane program). Newer cores check the linear address, so code generated in a different address space now needs to do CPUID. This is admittedly an unusual case -- perhaps I was getting overly worked up about it. I based this on Section 10.6 on page 10-21 (physical page 405) of Intel's "IA-32 Intel Architecture Software Developer's Manual Volume 3: System Programming Guide", 2004. PDF available (as of 2/16/2005) from: ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-17 1:29 ` Paul E. McKenney @ 2006-03-17 5:32 ` Linus Torvalds 2006-03-17 6:23 ` Paul E. McKenney 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2006-03-17 5:32 UTC (permalink / raw) To: Paul E. McKenney Cc: David Howells, Andrew Morton, linux-arch, linuxppc64-dev, Linux Kernel Mailing List On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > In section 7.1.1 on page 195, it says: > > For cacheable memory types, the following rules govern > read ordering: > > o Out-of-order reads are allowed. Out-of-order reads > can occur as a result of out-of-order instruction > execution or speculative execution. The processor > can read memory out-of-order to allow out-of-order > to proceed. > > o Speculative reads are allows ... [but no effect on > ordering beyond that given in the other rules, near > as I can tell] > > o Reads can be reordered ahead of writes. Reads are > generally given a higher priority by the processor > than writes because instruction execution stalls > if the read data required by an instruction is not > immediately available. Allowing reads ahead of > writes usually maximizes software performance. These are just the same as the x86 ordering. Notice how reads can pass (earlier) writes, but won't be pushed back after later writes. That's very much the x86 ordering (together with the "CPU ordering" for writes). > > (Also, x86 doesn't have an incoherent instruction cache - some older x86 > > cores have an incoherent instruction decode _buffer_, but that's a > > slightly different issue with basically no effect on any sane program). > > Newer cores check the linear address, so code generated in a different > address space now needs to do CPUID. This is admittedly an unusual > case -- perhaps I was getting overly worked up about it. I based this > on Section 10.6 on page 10-21 (physical page 405) of Intel's "IA-32 > Intel Architecture Software Developer's Manual Volume 3: System > Programming Guide", 2004. PDF available (as of 2/16/2005) from: > > ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf Not according to the docs I have. The _prefetch_ queue is invalidated based on the linear address, but not the caches. The caches are coherent, and the prefetch is also coherent in modern cores wrt linear address (but old cores, like the original i386, would literally not see the write, so you could do movl $1234,1f 1: xorl %eax,%eax and the "movl" would overwrite the "xorl", but the "xorl" would still get executed if it was in the 16-byte prefetch buffer or whatever). Modern cores will generally be _totally_ serialized, so if you write to the next instruction, I think most modern cores will notice it. It's only if you use paging or something to write to the physical address to something that is in the prefetch buffers that it can get you. Now, the prefetching has gotten longer over time, but it is basically still just a few tens of instructions, and any serializing instruction will force it to be serialized with the cache. It's really a non-issue, because regular self-modifying code will trigger the linear address check, and system code will always end up doing an "iret" or other serializing instruction, so it really doesn't trigger. So in practice, you really should see it as being entirely coherent. You have to do some _really_ strange sh*t to ever see anything different. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-17 5:32 ` Linus Torvalds @ 2006-03-17 6:23 ` Paul E. McKenney 0 siblings, 0 replies; 33+ messages in thread From: Paul E. McKenney @ 2006-03-17 6:23 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Andrew Morton, linux-arch, linuxppc64-dev, Linux Kernel Mailing List On Thu, Mar 16, 2006 at 09:32:03PM -0800, Linus Torvalds wrote: > > > On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > > > In section 7.1.1 on page 195, it says: > > > > For cacheable memory types, the following rules govern > > read ordering: > > > > o Out-of-order reads are allowed. Out-of-order reads > > can occur as a result of out-of-order instruction > > execution or speculative execution. The processor > > can read memory out-of-order to allow out-of-order > > to proceed. > > > > o Speculative reads are allows ... [but no effect on > > ordering beyond that given in the other rules, near > > as I can tell] > > > > o Reads can be reordered ahead of writes. Reads are > > generally given a higher priority by the processor > > than writes because instruction execution stalls > > if the read data required by an instruction is not > > immediately available. Allowing reads ahead of > > writes usually maximizes software performance. > > These are just the same as the x86 ordering. Notice how reads can pass > (earlier) writes, but won't be pushed back after later writes. That's very > much the x86 ordering (together with the "CPU ordering" for writes). OK, so you are not arguing with the "AMD" row, but rather with the x86 row. So I was looking in the wrong manual. Specifically, you are saying that the x86's "Loads Reordered After Stores" cell should be blank rather than "Y", right? > > > (Also, x86 doesn't have an incoherent instruction cache - some older x86 > > > cores have an incoherent instruction decode _buffer_, but that's a > > > slightly different issue with basically no effect on any sane program). > > > > Newer cores check the linear address, so code generated in a different > > address space now needs to do CPUID. This is admittedly an unusual > > case -- perhaps I was getting overly worked up about it. I based this > > on Section 10.6 on page 10-21 (physical page 405) of Intel's "IA-32 > > Intel Architecture Software Developer's Manual Volume 3: System > > Programming Guide", 2004. PDF available (as of 2/16/2005) from: > > > > ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf > > Not according to the docs I have. > > The _prefetch_ queue is invalidated based on the linear address, but not > the caches. The caches are coherent, and the prefetch is also coherent in > modern cores wrt linear address (but old cores, like the original i386, > would literally not see the write, so you could do > > movl $1234,1f > 1: xorl %eax,%eax > > and the "movl" would overwrite the "xorl", but the "xorl" would still get > executed if it was in the 16-byte prefetch buffer or whatever). > > Modern cores will generally be _totally_ serialized, so if you write to > the next instruction, I think most modern cores will notice it. It's only > if you use paging or something to write to the physical address to > something that is in the prefetch buffers that it can get you. Yep. But I would not put it past some JIT writer to actually do something like double-mapping the JITed code to two different linear addresses. > Now, the prefetching has gotten longer over time, but it is basically > still just a few tens of instructions, and any serializing instruction > will force it to be serialized with the cache. Agreed. > It's really a non-issue, because regular self-modifying code will trigger > the linear address check, and system code will always end up doing an > "iret" or other serializing instruction, so it really doesn't trigger. Only if there is a context switch between the writing of the instruction and the executing of it. Might not be the case if someone double-maps the memory or some other similar stunt. And I agree modern cores seem to be getting less aggressive in their search for instruction-level parallelism, but it doesn't take too much speculative-execution capability to (sometimes!) get some pretty strange stuff loaded into the instruction prefetch buffer. > So in practice, you really should see it as being entirely coherent. You > have to do some _really_ strange sh*t to ever see anything different. No argument with your last sentence! (And, believe it or not, there was a time when self-modifying code was considered manly rather than strange. But that was back in the days of 4096-byte main memories...) I believe we are in violent agreement on this one. The column label in the table is "Incoherent Instruction Cache/Pipeline". You are saying that only the pipeline can be incoherent, and even then, only in strange situations. But the only situation in which I would leave a cell in this column blank would be if -both- the cache -and- the pipeline were -always- coherent, even in strange situations. So I believe that this one still needs to stay "Y". I would rather see someone's JIT execute an extra CPUID after generating a new chunk of code than to see it fail strangely -- but only sometimes, and not reproducibly. Would it help if the column were instead labelled "Incoherent Instruction Cache or Pipeline", replacing the current "/" with "or"? Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-16 23:17 ` Paul E. McKenney 2006-03-16 23:55 ` Linus Torvalds @ 2006-03-23 18:34 ` David Howells 2006-03-23 19:28 ` Linus Torvalds 2006-03-23 22:26 ` Paul E. McKenney 1 sibling, 2 replies; 33+ messages in thread From: David Howells @ 2006-03-23 18:34 UTC (permalink / raw) To: paulmck, davem Cc: David Howells, torvalds, akpm, linux-arch, linuxppc64-dev, linux-kernel Paul E. McKenney <paulmck@us.ibm.com> wrote: > smp_mb__before_atomic_dec() and friends as well? These seem to be something Sparc64 related; or, at least, Sparc64 seems to do something weird with them. What are these meant to achieve anyway? They seems to just be barrier() on a lot of systems, even SMP ones. > Some architectures have more expansive definition of data dependency, > including then- and else-clauses being data-dependent on the if-condition, > but this is probably too much detail. Linus calls that a "control dependency" and doesn't seem to think that's a problem as it's sorted out by branch prediction. What you said makes me wonder about conditional instructions (such as conditional move). Anyway, I've incorporated your comments as well as reworking the document, which I'll shortly push upstream once again. David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-23 18:34 ` David Howells @ 2006-03-23 19:28 ` Linus Torvalds 2006-03-23 22:26 ` Paul E. McKenney 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2006-03-23 19:28 UTC (permalink / raw) To: David Howells Cc: paulmck, davem, akpm, linux-arch, linuxppc64-dev, linux-kernel On Thu, 23 Mar 2006, David Howells wrote: > > > Some architectures have more expansive definition of data dependency, > > including then- and else-clauses being data-dependent on the if-condition, > > but this is probably too much detail. > > Linus calls that a "control dependency" and doesn't seem to think that's a > problem as it's sorted out by branch prediction. What you said makes me > wonder about conditional instructions (such as conditional move). I'd put it the other way: a control dependency is not "sorted out" by branch prediction, it is effectively _nullified_ by branch prediction. Basically, control dependencies aren't dependencies at all. There is absolutely _zero_ dependency between the following two loads: if (load a) load b; because the "load b" can happen before the "load a" because of control prediction. So if you need a read barrier where there is a _control_ dependency in between loading a and loading b, you need to make it a full "smp_rmb()". It is _not_ sufficient to make this a "read_barrier_depends", because the load of b really doesn't depend on the load of a at all. So data dependencies that result in control dependencies aren't dependencies at all. Not even if the address depends on the control dependency. So int *address_of_b; address_of_b = load(&a); smp_read_barrier_depends(); b = load(address_of_b); is correct, but int *address_of_b = default_address_of_b; if (load(&a)) address_of_b = another_address_of_b; smp_read_barrier_depends(); b = load(address_of_b); is NOT correct, because there is no data dependency on the load of b, just a control dependency that the CPU may short-circuit with prediction, and that second case thus needs a real smp_rmb(). And yes, if we ever hit a CPU that does actual data prediction, not just control prediction, that will force smp_read_barrier_depends() to be the same as smp_rmb() on such an architecture. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Document Linux's memory barriers [try #5] 2006-03-23 18:34 ` David Howells 2006-03-23 19:28 ` Linus Torvalds @ 2006-03-23 22:26 ` Paul E. McKenney 1 sibling, 0 replies; 33+ messages in thread From: Paul E. McKenney @ 2006-03-23 22:26 UTC (permalink / raw) To: David Howells Cc: davem, torvalds, akpm, linux-arch, linuxppc64-dev, linux-kernel On Thu, Mar 23, 2006 at 06:34:27PM +0000, David Howells wrote: > Paul E. McKenney <paulmck@us.ibm.com> wrote: > > > smp_mb__before_atomic_dec() and friends as well? > > These seem to be something Sparc64 related; or, at least, Sparc64 seems to do > something weird with them. > > What are these meant to achieve anyway? They seems to just be barrier() on a > lot of systems, even SMP ones. On architectures such as x86 where atomic_dec() implies an smp_mb(), they do nothing. On other architectures, they supply whatever memory barrier is required. So, on x86: smp_mb(); atomic_dec(&my_atomic_counter); would result in -two- atomic instructions, but the smp_mb() would be absolutely required on CPUs with weaker memory-consistency models. So your choice is to (1) be inefficient on x86 or (2) be unsafe on weak-memory-consistency systems. What we can do instead is: smp_mb__before_atomic_dec(); atomic_dec(&my_atomic_counter); This allows x86 to generate efficient code -and- allows weak-memory machines (e.g., Alpha, MIPS, PA-RISC(!), ppc, s390, SPARC64) to generate safe code. Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
[parent not found: <21253.1142509812@warthog.cambridge.redhat.com>]
[parent not found: <Pine.LNX.4.64.0603160914410.3618@g5.osdl.org>]
* Re: [PATCH] Document Linux's memory barriers [try #5] [not found] ` <Pine.LNX.4.64.0603160914410.3618@g5.osdl.org> @ 2006-03-17 1:20 ` Nick Piggin 0 siblings, 0 replies; 33+ messages in thread From: Nick Piggin @ 2006-03-17 1:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Howells, Andrew Morton, linux-arch, linux-kernel Linus Torvalds wrote: > Well, the argument I have against mentioning caches is that cache > coherency order is _not_ the only thing that is relevant. As already > mentioned, data speculation can cause exactly the same "non-causal" effect > as cache update ordering, so from a _conceptual_ standpoint, the cache is > really just one implementation detail in the much bigger picture of > buffering and speculative work re-ordering operations... > Agreed. I wonder how you can best illustrate the theoretical machine model (from a programmer's point of view). I guess each consistency domain has an horizon (barrier, but I'm trying to avoid that word in this context) over which memory barriers will provide some partial ordering on transactions going to and/or from the horizon. +---+ : |c |CPU|---:---|a m +---+ : |c e : |h m +---+ : |e o |CPU|---:---|d r +---+ : | y I guess you could think of a smp_wmb() in this case being a wall that moves out from the CPU along with other memory stores, and stops them from passing (the problem with a "wall" is that it doesn't easily apply to a full smp_mb() unless you have loads travelling the same way, which has its own intuitiveness problems). > So it might be a good idea to first explain the memory barriers in a more > abstract sense without talking about what exactly goes on, and then have > the section that gives _examples_ of what the CPU actually is doing that > causes these barriers to be needed. And make it clear that the examples > are just that - examples. > Yeah that would be nice. The abstract explanation will be tricky. Maybe it isn't so critical to be easily understandable if it is backed by examples. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2006-03-23 22:26 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20060315200956.4a9e2cb3.akpm@osdl.org>
2006-03-09 20:29 ` [PATCH] Document Linux's memory barriers [try #4] David Howells
2006-03-09 23:34 ` Paul Mackerras
2006-03-09 23:45 ` Michael Buesch
2006-03-09 23:56 ` Linus Torvalds
2006-03-10 0:07 ` Michael Buesch
2006-03-10 0:48 ` Alan Cox
2006-03-10 0:54 ` Paul Mackerras
2006-03-10 15:19 ` David Howells
2006-03-11 0:01 ` Paul Mackerras
2006-03-10 5:28 ` Nick Piggin
2006-03-15 11:10 ` David Howells
2006-03-15 11:51 ` Nick Piggin
2006-03-15 13:47 ` David Howells
2006-03-15 23:21 ` Nick Piggin
2006-03-12 17:15 ` Eric W. Biederman
2006-03-14 21:26 ` David Howells
2006-03-14 21:48 ` Paul Mackerras
2006-03-14 23:59 ` David Howells
2006-03-15 0:20 ` Linus Torvalds
2006-03-15 1:19 ` David Howells
2006-03-15 1:47 ` Linus Torvalds
2006-03-15 1:25 ` Nick Piggin
2006-03-15 0:54 ` Paul Mackerras
2006-03-15 14:23 ` [PATCH] Document Linux's memory barriers [try #5] David Howells
2006-03-16 23:17 ` Paul E. McKenney
2006-03-16 23:55 ` Linus Torvalds
2006-03-17 1:29 ` Paul E. McKenney
2006-03-17 5:32 ` Linus Torvalds
2006-03-17 6:23 ` Paul E. McKenney
2006-03-23 18:34 ` David Howells
2006-03-23 19:28 ` Linus Torvalds
2006-03-23 22:26 ` Paul E. McKenney
[not found] ` <21253.1142509812@warthog.cambridge.redhat.com>
[not found] ` <Pine.LNX.4.64.0603160914410.3618@g5.osdl.org>
2006-03-17 1:20 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox