[PATCH] Document Linux's memory barriers

From: David Howells <dhowells@redhat.com>
To: torvalds@osdl.org, akpm@osdl.org, mingo@redhat.com
Cc: linux-arch@vger.kernel.org, linuxppc64-dev@ozlabs.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH] Document Linux's memory barriers
Date: Tue, 07 Mar 2006 17:40:45 +0000
Message-ID: <31492.1141753245@warthog.cambridge.redhat.com>


The attached patch documents the Linux kernel's memory barriers.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 mb.diff 
 Documentation/memory-barriers.txt |  359 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 359 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..c2fc51b
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,359 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implied kernel memory barriers.
+
+ (*) i386 and x86_64 arch specific notes.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose a
+partial ordering between the memory access operations specified either side of
+the barrier.
+
+Older and less complex CPUs will perform memory accesses in exactly the order
+specified, so if one is given the following piece of code:
+
+	a = *A;
+	*B = b;
+	c = *C;
+	d = *D;
+	*E = e;
+
+It can be guaranteed that it will complete the memory access for each
+instruction before moving on to the next line, leading to a definite sequence
+of operations on the bus:
+
+	read *A, write *B, read *C, read *D, write *E.
+
+However, with newer and more complex CPUs, this isn't always true because:
+
+ (*) they can rearrange the order of the memory accesses to promote better use
+     of the CPU buses and caches;
+
+ (*) reads are synchronous and may need to be done immediately to permit
+     progress, whereas writes can often be deferred without a problem;
+
+ (*) and they are able to combine reads and writes to improve performance when
+     talking to the SDRAM (modern SDRAM chips can do batched accesses of
+     adjacent locations, cutting down on transaction setup costs).
+
+So what you might actually get from the above piece of code is:
+
+	read *A, read *C+*D, write *E, write *B
+
+Under normal operation, this is probably not going to be a problem; however,
+there are two circumstances where it definitely _can_ be a problem:
+
+ (1) I/O
+
+     Many I/O devices can be memory mapped, and so appear to the CPU as if
+     they're just memory locations. However, to control the device, the driver
+     has to make the right accesses in exactly the right order.
+
+     Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+     presents to the CPU an "address register" and a bunch of "data registers".
+     The way it's accessed is to write the index of the internal register you
+     want to access to the address register, and then read or write the
+     appropriate data register to access the chip's internal register:
+
+	*ADR = ctl_reg_3;
+	reg = *DATA;
+
+     The problem with a clever CPU or a clever compiler is that the write to
+     the address register isn't guaranteed to happen before the access to the
+     data register. If the CPU or the compiler thinks it is more efficient to
+     defer the address write:
+
+	read *DATA, write *ADR
+
+     then things will break.
+
+     The way to deal with this is to insert an I/O memory barrier between the
+     two accesses:
+
+	*ADR = ctl_reg_3;
+	mb();
+	reg = *DATA;
+
+     In this case, the barrier makes a guarantee that all memory accesses
+     before the barrier will happen before all the memory accesses after the
+     barrier. It does _not_ guarantee that all memory accesses before the
+     barrier will be complete by the time the barrier is complete.
+
+ (2) Multiprocessor interaction
+
+     In a system with more than one processor, the CPUs may be working on the
+     same set of data while trying to avoid locks, as locks are quite
+     expensive. This means that accesses that affect both CPUs may have to be
+     carefully ordered to prevent error.
+
+     Consider the R/W semaphore slow path. In that, a waiting process is
+     queued on the semaphore, as noted by it having a record on its stack
+     linked to the semaphore's list:
+
+	struct rw_semaphore {
+		...
+		struct list_head waiters;
+	};
+
+	struct rwsem_waiter {
+		struct list_head list;
+		struct task_struct *task;
+	};
+
+     To wake up the waiter, the up_read() or up_write() functions have to read
+     the pointer from this record to find out where the next waiter record is,
+     clear the task pointer, call wake_up_process() on the task, and release
+     the task struct reference held:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+     If any of these steps occur out of order, then the whole thing may fail.
+
+     Note that the waiter does not get the semaphore lock again - it just waits
+     for its task pointer to be cleared. Since the record is on its stack, this
+     means that if the task pointer is cleared _before_ the next pointer in the
+     list is read, then another CPU might start processing the waiter and it
+     might clobber its stack before the up*() functions have a chance to read
+     the next pointer.
+
+	CPU 0				CPU 1
+	===============================	===============================
+					down_xxx()
+					Queue waiter
+					Sleep
+	up_yyy()
+	READ waiter->task;
+	WRITE waiter->task;
+	<preempt>
+					Resume processing
+					down_xxx() returns
+					call foo()
+					foo() clobbers *waiter
+	</preempt>
+	READ waiter->list.next;
+	--- OOPS ---
+
+     This could be dealt with using a spinlock, but then the down_xxx()
+     function has to get the spinlock again after it's been woken up, which is
+     a waste of resources.
+
+     The way to deal with this is to insert an SMP memory barrier:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	smp_mb();
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+     In this case, the barrier makes a guarantee that all memory accesses
+     before the barrier will happen before all the memory accesses after the
+     barrier. It does _not_ guarantee that all memory accesses before the
+     barrier will be complete by the time the barrier is complete.
+
+     SMP memory barriers are normally no-ops on a UP system because the CPU
+     orders overlapping accesses with respect to itself.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic memory barriers:
+
+		MANDATORY (I/O)	SMP
+		===============	================
+	GENERAL	mb()		smp_mb()
+	READ	rmb()		smp_rmb()
+	WRITE	wmb()		smp_wmb()
+
+General memory barriers make a guarantee that all memory accesses specified
+before the barrier will happen before all memory accesses specified after the
+barrier.
+
+Read memory barriers make a guarantee that all memory reads specified before
+the barrier will happen before all memory reads specified after the barrier.
+
+Write memory barriers make a guarantee that all memory writes specified before
+the barrier will happen before all memory writes specified after the barrier.
+
+SMP memory barriers are no-ops on uniprocessor compiled systems because it is
+assumed that a CPU will be self-consistent, and will order overlapping accesses
+with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in the access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order the first CPU commits its accesses to the bus.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function.
+
+
+==============================
+IMPLIED KERNEL MEMORY BARRIERS
+==============================
+
+Some of the other functions in the Linux kernel imply memory barriers. For
+instance, all of the following (pseudo-)locking functions imply barriers:
+
+ (*) interrupt disablement and/or interrupts
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the UNLOCK accesses.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do
+anything at all, especially with respect to I/O memory barriering.
+
+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
+memory and I/O accesses individually, or interrupt handling will barrier
+memory and I/O accesses on entry and on exit. This prevents an interrupt
+routine interfering with accesses made in a disabled-interrupt section of code
+and vice versa.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events on the bus is acceptable:
+
+	LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
+
+But none of the following are:
+
+	*F+*A, *B,	LOCK, *C, *D,	UNLOCK, *E
+	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
+	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
+	*B,		LOCK, *C, *D,	UNLOCK, *F+*A, *E
+
+
+Consider also the following (going back to the AMD PCnet example):
+
+	DISABLE IRQ
+	*ADR = ctl_reg_3;
+	mb();
+	x = *DATA;
+	*ADR = ctl_reg_4;
+	mb();
+	*DATA = y;
+	*ADR = ctl_reg_5;
+	mb();
+	z = *DATA;
+	ENABLE IRQ
+	<interrupt>
+	*ADR = ctl_reg_7;
+	mb();
+	q = *DATA;
+	</interrupt>
+
+What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
+wrong register? (There's no guarantee that the process of handling an
+interrupt will barrier memory accesses in any way.)
+
+
+==============================
+I386 AND X86_64 SPECIFIC NOTES
+==============================
+
+Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
+bus appear in program order - and so there's no requirement for any sort of
+explicit memory barriers.
+
+From the Pentium-III onwards, there are three new memory barrier instructions:
+LFENCE, SFENCE and MFENCE, which correspond to the kernel memory barrier
+functions rmb(), wmb() and mb(). However, there are additional implicit memory
+barriers in the CPU implementation:
+
+ (*) Interrupt processing implies mb().
+
+ (*) The LOCK prefix adds an implied mb() to whatever instruction it is
+     attached to.
+
+ (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
+     required].
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+     after that write, but reads after a write may complete before the write
+     (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+     of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered, and its read and write accesses may be
+completed generally in any order. Its memory barriers are also to some extent
+more substantial than the minimum requirement, and may directly affect
+hardware outside of the CPU.
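
Not part of the patch: as a rough userspace illustration of the write and read
barriers documented above, the sketch below hands a value from a writer to a
reader using C11 fences. Treating atomic_thread_fence() as a stand-in for
smp_wmb() and smp_rmb() is an assumption of this example and only an
approximation (C11 release and acquire fences are somewhat stronger than
write-only and read-only barriers). struct mailbox, mailbox_publish() and
mailbox_consume() are hypothetical names that do not exist in the kernel.

```c
/* Userspace sketch only: C11 fences stand in for smp_wmb()/smp_rmb(). */
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
	int payload;		/* plain data, written before 'ready' */
	atomic_bool ready;	/* flag that publishes the payload */
};

/* Writer side: store the data, then the flag, with a fence in between. */
static void mailbox_publish(struct mailbox *m, int value)
{
	m->payload = value;
	/* Roughly where kernel code would use smp_wmb() (assumed mapping). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&m->ready, true, memory_order_relaxed);
}

/* Reader side: load the flag, then the data, with a fence in between. */
static bool mailbox_consume(struct mailbox *m, int *out)
{
	if (!atomic_load_explicit(&m->ready, memory_order_relaxed))
		return false;	/* nothing has been published yet */
	/* Roughly where kernel code would use smp_rmb() (assumed mapping). */
	atomic_thread_fence(memory_order_acquire);
	*out = m->payload;
	return true;
}
```

Each fence only orders the accesses of the side that issues it. The guarantee
comes from the pairing: the writer's fence sits between the data store and the
flag store, and the reader's fence sits between the flag load and the data
load, so a reader that sees the flag set also sees the payload.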
