* [PATCH v5 tip/core/locking] Memory-barrier documentation updates + smp_mb__after_unlock_lock()
@ 2013-12-10 1:27 Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:27 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw
Hello!
This series applies some long-needed updates to memory-barriers.txt and
adds an smp_mb__after_unlock_lock():
1. Add ACCESS_ONCE() calls where needed to ensure their inclusion
in code copy-and-pasted from this file.
2. Add long atomic examples alongside the existing atomics.
3. Prohibit architectures supporting the Linux kernel from
speculating stores.
4. Document what ACCESS_ONCE() does along with a number of situations
requiring its use.
5. Downgrade UNLOCK+LOCK to no longer imply a full barrier, at least
in the absence of smp_mb__after_unlock_lock(). See the LKML thread
that includes http://www.spinics.net/lists/linux-mm/msg65653.html
for more information.
6. Added smp_mb__after_unlock_lock() for all architectures. Because
all architectures are presumably providing full-barrier semantics
for UNLOCK+LOCK, these are all no-ops. Some will change if
low-latency-handoff queued locks are accepted.
7.	Applied smp_mb__after_unlock_lock() as needed in RCU (a brief
	usage sketch follows this list).
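For concreteness, the usage pattern being added looks like the following
sketch (the lock and the variables A and B are hypothetical; the pattern
itself mirrors the example added in patch 5):

	*A = a;
	spin_unlock(&mylock);
	/* ...possibly on another CPU... */
	spin_lock(&mylock);
	smp_mb__after_unlock_lock();	/* UNLOCK+LOCK now acts as a full barrier. */
	*B = b;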
Changes from v4:
o Added Josh Triplett's Reviewed-by for 1-4.
o Applied feedback from Ingo Molnar and Jonathan Corbet.
o Trimmed Cc lists as suggested by David Miller.
o Added smp_mb__after_unlock_lock() changes.
Changes from v3:
o Fix typos noted by Peter Zijlstra.
o Added the documentation about ACCESS_ONCE(), which expands on
http://thread.gmane.org/gmane.linux.kernel.mm/82891/focus=14696,
ably summarized by Jon Corbet at http://lwn.net/Articles/508991/.
Changes from v2:
o	Update examples so that the load against which the subsequent
store is to be ordered is part of the "if" condition.
o Add an example showing how the compiler can remove "if"
conditions and how to prevent it from doing so.
o Add ACCESS_ONCE() to the compiler-barrier section.
o Add a sentence noting that transitivity requires smp_mb().
Changes from v1:
o Combined with Peter Zijlstra's speculative-store-prohibition patch.
o Added more pitfalls to avoid when prohibiting speculative
stores, along with how to avoid them.
o Applied Josh Triplett's review comments.
Thanx, Paul
------------------------------------------------------------------------
b/Documentation/memory-barriers.txt | 760 +++++++++++++++++++++++++++-------
b/arch/arm/include/asm/barrier.h | 2
b/arch/arm64/include/asm/barrier.h | 2
b/arch/ia64/include/asm/barrier.h | 2
b/arch/metag/include/asm/barrier.h | 2
b/arch/mips/include/asm/barrier.h | 2
b/arch/powerpc/include/asm/barrier.h | 2
b/arch/s390/include/asm/barrier.h | 2
b/arch/sparc/include/asm/barrier_64.h | 2
b/arch/x86/include/asm/barrier.h | 2
b/include/asm-generic/barrier.h | 2
b/kernel/rcu/tree.c | 18
b/kernel/rcu/tree_plugin.h | 13
13 files changed, 670 insertions(+), 141 deletions(-)
* [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt
2013-12-10 1:27 [PATCH v5 tip/core/locking] Memory-barrier documentation updates + smp_mb__after_unlock_lock() Paul E. McKenney
@ 2013-12-10 1:27 ` Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 2/7] Documentation/memory-barriers.txt: Add long atomic examples " Paul E. McKenney
` (5 more replies)
0 siblings, 6 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:27 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
The Documentation/memory-barriers.txt file was written before the need
for ACCESS_ONCE() was fully appreciated. It therefore contains no
ACCESS_ONCE() calls, which can be a problem when people lift examples
from it. This commit therefore adds ACCESS_ONCE() calls.
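For readers who do lift examples: ACCESS_ONCE() is the volatile cast from
include/linux/compiler.h (roughly (*(volatile typeof(x) *)&(x))). A minimal
usage sketch, with a hypothetical flag shared between CPUs:

	#include <linux/compiler.h>	/* ACCESS_ONCE() */
	#include <asm/processor.h>	/* cpu_relax() */

	static int shared_flag;		/* Hypothetical shared variable. */

	static void wait_for_flag(void)
	{
		/* ACCESS_ONCE() forces a fresh load on every pass. */
		while (!ACCESS_ONCE(shared_flag))
			cpu_relax();
	}

	static void set_flag(void)
	{
		/* ACCESS_ONCE() forces a single, untorn store. */
		ACCESS_ONCE(shared_flag) = 1;
	}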
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
Documentation/memory-barriers.txt | 206 +++++++++++++++++++++++---------------
1 file changed, 126 insertions(+), 80 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 020cccdbdd0c..1d067235b0bc 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -194,18 +194,22 @@ There are some minimal guarantees that may be expected of a CPU:
(*) On any given CPU, dependent memory accesses will be issued in order, with
respect to itself. This means that for:
- Q = P; D = *Q;
+ ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q);
the CPU will issue the following memory operations:
Q = LOAD P, D = LOAD *Q
- and always in that order.
+ and always in that order. On most systems, smp_read_barrier_depends()
+ does nothing, but it is required for DEC Alpha. The ACCESS_ONCE()
+ is required to prevent compiler mischief. Please note that you
+ should normally use something like rcu_dereference() instead of
+ open-coding smp_read_barrier_depends().
(*) Overlapping loads and stores within a particular CPU will appear to be
ordered within that CPU. This means that for:
- a = *X; *X = b;
+ a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b;
the CPU will only issue the following sequence of memory operations:
@@ -213,7 +217,7 @@ There are some minimal guarantees that may be expected of a CPU:
And for:
- *X = c; d = *X;
+ ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X);
the CPU will only issue:
@@ -224,6 +228,41 @@ There are some minimal guarantees that may be expected of a CPU:
And there are a number of things that _must_ or _must_not_ be assumed:
+ (*) It _must_not_ be assumed that the compiler will do what you want with
+ memory references that are not protected by ACCESS_ONCE(). Without
+ ACCESS_ONCE(), the compiler is within its rights to do all sorts
+ of "creative" transformations:
+
+ (-) Repeat the load, possibly getting a different value on the second
+ and subsequent loads. This is especially prone to happen when
+ register pressure is high.
+
+ (-) Merge adjacent loads and stores to the same location. The most
+ familiar example is the transformation from:
+
+ while (a)
+ do_something();
+
+ to something like:
+
+ if (a)
+ for (;;)
+ do_something();
+
+ Using ACCESS_ONCE() as follows prevents this sort of optimization:
+
+ while (ACCESS_ONCE(a))
+ do_something();
+
+ (-) "Store tearing", where a single store in the source code is split
+ into smaller stores in the object code. Note that gcc really
+ will do this on some architectures when storing certain constants.
+ It can be cheaper to do a series of immediate stores than to
+ form the constant in a register and then to store that register.
+
+ (-) "Load tearing", which splits loads in a manner analogous to
+ store tearing.
+
(*) It _must_not_ be assumed that independent loads and stores will be issued
in the order given. This means that for:
@@ -450,14 +489,14 @@ The usage requirements of data dependency barriers are a little subtle, and
it's not always obvious that they're needed. To illustrate, consider the
following sequence of events:
- CPU 1 CPU 2
- =============== ===============
+ CPU 1 CPU 2
+ =============== ===============
{ A == 1, B == 2, C = 3, P == &A, Q == &C }
B = 4;
<write barrier>
- P = &B
- Q = P;
- D = *Q;
+ ACCESS_ONCE(P) = &B
+ Q = ACCESS_ONCE(P);
+ D = *Q;
There's a clear data dependency here, and it would seem that by the end of the
sequence, Q must be either &A or &B, and that:
@@ -477,15 +516,15 @@ Alpha).
To deal with this, a data dependency barrier or better must be inserted
between the address load and the data load:
- CPU 1 CPU 2
- =============== ===============
+ CPU 1 CPU 2
+ =============== ===============
{ A == 1, B == 2, C = 3, P == &A, Q == &C }
B = 4;
<write barrier>
- P = &B
- Q = P;
- <data dependency barrier>
- D = *Q;
+ ACCESS_ONCE(P) = &B
+ Q = ACCESS_ONCE(P);
+ <data dependency barrier>
+ D = *Q;
This enforces the occurrence of one of the two implications, and prevents the
third possibility from arising.
@@ -504,21 +543,22 @@ Another example of where data dependency barriers might be required is where a
number is read from memory and then used to calculate the index for an array
access:
- CPU 1 CPU 2
- =============== ===============
+ CPU 1 CPU 2
+ =============== ===============
{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
M[1] = 4;
<write barrier>
- P = 1
- Q = P;
- <data dependency barrier>
- D = M[Q];
+ ACCESS_ONCE(P) = 1
+ Q = ACCESS_ONCE(P);
+ <data dependency barrier>
+ D = M[Q];
-The data dependency barrier is very important to the RCU system, for example.
-See rcu_dereference() in include/linux/rcupdate.h. This permits the current
-target of an RCU'd pointer to be replaced with a new modified target, without
-the replacement target appearing to be incompletely initialised.
+The data dependency barrier is very important to the RCU system,
+for example. See rcu_assign_pointer() and rcu_dereference() in
+include/linux/rcupdate.h. This permits the current target of an RCU'd
+pointer to be replaced with a new modified target, without the replacement
+target appearing to be incompletely initialised.
See also the subsection on "Cache Coherency" for a more thorough example.
@@ -530,22 +570,23 @@ A control dependency requires a full read memory barrier, not simply a data
dependency barrier to make it work correctly. Consider the following bit of
code:
- q = &a;
+ q = ACCESS_ONCE(a);
if (p) {
<data dependency barrier>
- q = &b;
+ q = ACCESS_ONCE(b);
}
x = *q;
This will not have the desired effect because there is no actual data
-dependency, but rather a control dependency that the CPU may short-circuit by
-attempting to predict the outcome in advance. In such a case what's actually
-required is:
+dependency, but rather a control dependency that the CPU may short-circuit
+by attempting to predict the outcome in advance, so that other CPUs see
+the load from b as having happened before the load from a. In such a
+case what's actually required is:
- q = &a;
+ q = ACCESS_ONCE(a);
if (p) {
<read barrier>
- q = &b;
+ q = ACCESS_ONCE(b);
}
x = *q;
@@ -561,23 +602,23 @@ barrier, though a general barrier would also be viable. Similarly a read
barrier or a data dependency barrier should always be paired with at least an
write barrier, though, again, a general barrier is viable:
- CPU 1 CPU 2
- =============== ===============
- a = 1;
+ CPU 1 CPU 2
+ =============== ===============
+ ACCESS_ONCE(a) = 1;
<write barrier>
- b = 2; x = b;
- <read barrier>
- y = a;
+ ACCESS_ONCE(b) = 2; x = ACCESS_ONCE(b);
+ <read barrier>
+ y = ACCESS_ONCE(a);
Or:
- CPU 1 CPU 2
- =============== ===============================
+ CPU 1 CPU 2
+ =============== ===============================
a = 1;
<write barrier>
- b = &a; x = b;
- <data dependency barrier>
- y = *x;
+ ACCESS_ONCE(b) = &a; x = ACCESS_ONCE(b);
+ <data dependency barrier>
+ y = *x;
Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.
@@ -586,13 +627,13 @@ the "weaker" type.
match the loads after the read barrier or the data dependency barrier, and vice
versa:
- CPU 1 CPU 2
- =============== ===============
- a = 1; }---- --->{ v = c
- b = 2; } \ / { w = d
- <write barrier> \ <read barrier>
- c = 3; } / \ { x = a;
- d = 4; }---- --->{ y = b;
+ CPU 1 CPU 2
+ =================== ===================
+ ACCESS_ONCE(a) = 1; }---- --->{ v = ACCESS_ONCE(c);
+ ACCESS_ONCE(b) = 2; } \ / { w = ACCESS_ONCE(d);
+ <write barrier> \ <read barrier>
+ ACCESS_ONCE(c) = 3; } / \ { x = ACCESS_ONCE(a);
+ ACCESS_ONCE(d) = 4; }---- --->{ y = ACCESS_ONCE(b);
EXAMPLES OF MEMORY BARRIER SEQUENCES
@@ -1435,12 +1476,12 @@ three CPUs; then should the following sequence of events occur:
CPU 1 CPU 2
=============================== ===============================
- *A = a; *E = e;
+ ACCESS_ONCE(*A) = a; ACCESS_ONCE(*E) = e;
LOCK M LOCK Q
- *B = b; *F = f;
- *C = c; *G = g;
+ ACCESS_ONCE(*B) = b; ACCESS_ONCE(*F) = f;
+ ACCESS_ONCE(*C) = c; ACCESS_ONCE(*G) = g;
UNLOCK M UNLOCK Q
- *D = d; *H = h;
+ ACCESS_ONCE(*D) = d; ACCESS_ONCE(*H) = h;
Then there is no guarantee as to what order CPU 3 will see the accesses to *A
through *H occur in, other than the constraints imposed by the separate locks
@@ -1460,17 +1501,17 @@ However, if the following occurs:
CPU 1 CPU 2
=============================== ===============================
- *A = a;
- LOCK M [1]
- *B = b;
- *C = c;
- UNLOCK M [1]
- *D = d; *E = e;
- LOCK M [2]
- *F = f;
- *G = g;
- UNLOCK M [2]
- *H = h;
+ ACCESS_ONCE(*A) = a;
+ LOCK M [1]
+ ACCESS_ONCE(*B) = b;
+ ACCESS_ONCE(*C) = c;
+ UNLOCK M [1]
+ ACCESS_ONCE(*D) = d; ACCESS_ONCE(*E) = e;
+ LOCK M [2]
+ ACCESS_ONCE(*F) = f;
+ ACCESS_ONCE(*G) = g;
+ UNLOCK M [2]
+ ACCESS_ONCE(*H) = h;
CPU 3 might see:
@@ -2177,11 +2218,11 @@ A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if the CPU is, for example,
given the following piece of code to execute:
- a = *A;
- *B = b;
- c = *C;
- d = *D;
- *E = e;
+ a = ACCESS_ONCE(*A);
+ ACCESS_ONCE(*B) = b;
+ c = ACCESS_ONCE(*C);
+ d = ACCESS_ONCE(*D);
+ ACCESS_ONCE(*E) = e;
they would then expect that the CPU will complete the memory operation for each
instruction before moving on to the next one, leading to a definite sequence of
@@ -2228,12 +2269,12 @@ However, it is guaranteed that a CPU will be self-consistent: it will see its
_own_ accesses appear to be correctly ordered, without the need for a memory
barrier. For instance with the following code:
- U = *A;
- *A = V;
- *A = W;
- X = *A;
- *A = Y;
- Z = *A;
+ U = ACCESS_ONCE(*A);
+ ACCESS_ONCE(*A) = V;
+ ACCESS_ONCE(*A) = W;
+ X = ACCESS_ONCE(*A);
+ ACCESS_ONCE(*A) = Y;
+ Z = ACCESS_ONCE(*A);
and assuming no intervention by an external influence, it can be assumed that
the final result will appear to be:
@@ -2250,7 +2291,12 @@ accesses:
in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view of
-the world remains consistent.
+the world remains consistent. Note that ACCESS_ONCE() is -not- optional
+in the above example, as there are architectures where a given CPU might
+interchange successive loads to the same location. On such architectures,
+ACCESS_ONCE() does whatever is necessary to prevent this, for example, on
+Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the
+special ld.acq and st.rel instructions that prevent such reordering.
The compiler may also combine, discard or defer elements of the sequence before
the CPU even sees them.
@@ -2264,13 +2310,13 @@ may be reduced to:
*A = W;
-since, without a write barrier, it can be assumed that the effect of the
-storage of V to *A is lost. Similarly:
+since, without either a write barrier or an ACCESS_ONCE(), it can be
+assumed that the effect of the storage of V to *A is lost. Similarly:
*A = Y;
Z = *A;
-may, without a memory barrier, be reduced to:
+may, without a memory barrier or an ACCESS_ONCE(), be reduced to:
*A = Y;
Z = Y;
--
1.8.1.5
* [PATCH v5 tip/core/locking 2/7] Documentation/memory-barriers.txt: Add long atomic examples to memory-barriers.txt
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
@ 2013-12-10 1:27 ` Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 3/7] Documentation/memory-barriers.txt: Prohibit speculative writes Paul E. McKenney
` (4 subsequent siblings)
5 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:27 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Although the atomic_long_t functions are quite useful, they are a bit
obscure. This commit therefore adds the common ones alongside their
atomic_t counterparts in Documentation/memory-barriers.txt.
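For illustration, here is a sketch of a reference-count idiom using the
long variants; the counter and message are made up for this example, but
the barrier properties are those documented below:

	#include <linux/atomic.h>
	#include <linux/printk.h>

	static atomic_long_t nr_refs = ATOMIC_LONG_INIT(1);

	static void thing_get(void)
	{
		atomic_long_inc(&nr_refs);		/* No implied barrier. */
	}

	static void thing_put(void)
	{
		/* Value-returning operation: implies a full memory barrier. */
		if (atomic_long_dec_and_test(&nr_refs))
			pr_info("last reference dropped\n");
	}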
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
Documentation/memory-barriers.txt | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1d067235b0bc..2d22da095a60 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1728,21 +1728,23 @@ explicit lock operations, described later). These include:
xchg();
cmpxchg();
- atomic_xchg();
- atomic_cmpxchg();
- atomic_inc_return();
- atomic_dec_return();
- atomic_add_return();
- atomic_sub_return();
- atomic_inc_and_test();
- atomic_dec_and_test();
- atomic_sub_and_test();
- atomic_add_negative();
- atomic_add_unless(); /* when succeeds (returns 1) */
+ atomic_xchg(); atomic_long_xchg();
+ atomic_cmpxchg(); atomic_long_cmpxchg();
+ atomic_inc_return(); atomic_long_inc_return();
+ atomic_dec_return(); atomic_long_dec_return();
+ atomic_add_return(); atomic_long_add_return();
+ atomic_sub_return(); atomic_long_sub_return();
+ atomic_inc_and_test(); atomic_long_inc_and_test();
+ atomic_dec_and_test(); atomic_long_dec_and_test();
+ atomic_sub_and_test(); atomic_long_sub_and_test();
+ atomic_add_negative(); atomic_long_add_negative();
test_and_set_bit();
test_and_clear_bit();
test_and_change_bit();
+ /* when succeeds (returns 1) */
+ atomic_add_unless(); atomic_long_add_unless();
+
These are used for such things as implementing LOCK-class and UNLOCK-class
operations and adjusting reference counters towards object destruction, and as
such the implicit memory barrier effects are necessary.
--
1.8.1.5
* [PATCH v5 tip/core/locking 3/7] Documentation/memory-barriers.txt: Prohibit speculative writes
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 2/7] Documentation/memory-barriers.txt: Add long atomic examples " Paul E. McKenney
@ 2013-12-10 1:27 ` Paul E. McKenney
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 4/7] Documentation/memory-barriers.txt: Document ACCESS_ONCE() Paul E. McKenney
` (3 subsequent siblings)
5 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:27 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney, Linux-Arch
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
No SMP architecture currently supporting Linux allows speculative writes,
so this commit updates Documentation/memory-barriers.txt to prohibit
them in Linux core code. It also records restrictions on their use.
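The pattern at issue is the control dependency: a store guarded by a
conditional that is fed from a prior load. A minimal sketch with
hypothetical shared variables 'a' and 'b' (the patch text below spells out
the ACCESS_ONCE() requirements that keep the compiler from breaking it):

	q = ACCESS_ONCE(a);		/* Load feeding the condition. */
	if (q)
		ACCESS_ONCE(b) = 1;	/* Stores are not speculated, so this
					 * store is ordered after the load. */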
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linux-Arch <linux-arch@vger.kernel.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
Documentation/memory-barriers.txt | 183 ++++++++++++++++++++++++++++++++++++--
1 file changed, 175 insertions(+), 8 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 2d22da095a60..deafa36aeea1 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -571,11 +571,10 @@ dependency barrier to make it work correctly. Consider the following bit of
code:
q = ACCESS_ONCE(a);
- if (p) {
- <data dependency barrier>
- q = ACCESS_ONCE(b);
+ if (q) {
+ <data dependency barrier> /* BUG: No data dependency!!! */
+ p = ACCESS_ONCE(b);
}
- x = *q;
This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
@@ -584,11 +583,176 @@ the load from b as having happened before the load from a. In such a
case what's actually required is:
q = ACCESS_ONCE(a);
- if (p) {
+ if (q) {
<read barrier>
- q = ACCESS_ONCE(b);
+ p = ACCESS_ONCE(b);
}
- x = *q;
+
+However, stores are not speculated. This means that ordering -is- provided
+in the following example:
+
+ q = ACCESS_ONCE(a);
+ if (ACCESS_ONCE(q)) {
+ ACCESS_ONCE(b) = p;
+ }
+
+Please note that ACCESS_ONCE() is not optional! Without the ACCESS_ONCE(),
+the compiler is within its rights to transform this example:
+
+ q = a;
+ if (q) {
+ b = p; /* BUG: Compiler can reorder!!! */
+ do_something();
+ } else {
+ b = p; /* BUG: Compiler can reorder!!! */
+ do_something_else();
+ }
+
+into this, which of course defeats the ordering:
+
+ b = p;
+ q = a;
+ if (q)
+ do_something();
+ else
+ do_something_else();
+
+Worse yet, if the compiler is able to prove (say) that the value of
+variable 'a' is always non-zero, it would be well within its rights
+to optimize the original example by eliminating the "if" statement
+as follows:
+
+ q = a;
+ b = p; /* BUG: Compiler can reorder!!! */
+ do_something();
+
+The solution is again ACCESS_ONCE(), which preserves the ordering between
+the load from variable 'a' and the store to variable 'b':
+
+ q = ACCESS_ONCE(a);
+ if (q) {
+ ACCESS_ONCE(b) = p;
+ do_something();
+ } else {
+ ACCESS_ONCE(b) = p;
+ do_something_else();
+ }
+
+You could also use barrier() to prevent the compiler from moving
+the stores to variable 'b', but barrier() would not prevent the
+compiler from proving to itself that a==1 always, so ACCESS_ONCE()
+is also needed.
+
+It is important to note that control dependencies absolutely require a
+conditional. For example, the following "optimized" version of
+the above example breaks ordering:
+
+ q = ACCESS_ONCE(a);
+ ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
+ if (q) {
+ /* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */
+ do_something();
+ } else {
+ /* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */
+ do_something_else();
+ }
+
+It is of course legal for the prior load to be part of the conditional,
+for example, as follows:
+
+ if (ACCESS_ONCE(a) > 0) {
+ ACCESS_ONCE(b) = q / 2;
+ do_something();
+ } else {
+ ACCESS_ONCE(b) = q / 3;
+ do_something_else();
+ }
+
+This will again ensure that the load from variable 'a' is ordered before the
+stores to variable 'b'.
+
+In addition, you need to be careful what you do with the local variable 'q',
+otherwise the compiler might be able to guess the value and again remove
+the needed conditional. For example:
+
+ q = ACCESS_ONCE(a);
+ if (q % MAX) {
+ ACCESS_ONCE(b) = p;
+ do_something();
+ } else {
+ ACCESS_ONCE(b) = p;
+ do_something_else();
+ }
+
+If MAX is defined to be 1, then the compiler knows that (q % MAX) is
+equal to zero, in which case the compiler is within its rights to
+transform the above code into the following:
+
+ q = ACCESS_ONCE(a);
+ ACCESS_ONCE(b) = p;
+ do_something_else();
+
+This transformation loses the ordering between the load from variable 'a'
+and the store to variable 'b'. If you are relying on this ordering, you
+should do something like the following:
+
+ q = ACCESS_ONCE(a);
+ BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
+ if (q % MAX) {
+ ACCESS_ONCE(b) = p;
+ do_something();
+ } else {
+ ACCESS_ONCE(b) = p;
+ do_something_else();
+ }
+
+Finally, control dependencies do -not- provide transitivity. This is
+demonstrated by two related examples:
+
+ CPU 0 CPU 1
+ ===================== =====================
+ r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y);
+ if (r1 >= 0) if (r2 >= 0)
+ ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1;
+
+ assert(!(r1 == 1 && r2 == 1));
+
+The above two-CPU example will never trigger the assert(). However,
+if control dependencies guaranteed transitivity (which they do not),
+then adding the following two CPUs would guarantee a related assertion:
+
+ CPU 2 CPU 3
+ ===================== =====================
+ ACCESS_ONCE(x) = 2; ACCESS_ONCE(y) = 2;
+
+ assert(!(r1 == 2 && r2 == 2 && x == 1 && y == 1)); /* FAILS!!! */
+
+But because control dependencies do -not- provide transitivity, the
+above assertion can fail after the combined four-CPU example completes.
+If you need the four-CPU example to provide ordering, you will need
+smp_mb() between the loads and stores in the CPU 0 and CPU 1 code fragments.
+
+In summary:
+
+ (*) Control dependencies can order prior loads against later stores.
+ However, they do -not- guarantee any other sort of ordering:
+ Not prior loads against later loads, nor prior stores against
+ later anything. If you need these other forms of ordering,
+ use smp_rmb(), smp_wmb(), or, in the case of prior stores and
+ later loads, smp_mb().
+
+ (*) Control dependencies require at least one run-time conditional
+ between the prior load and the subsequent store. If the compiler
+ is able to optimize the conditional away, it will have also
+ optimized away the ordering. Careful use of ACCESS_ONCE() can
+ help to preserve the needed conditional.
+
+ (*) Control dependencies require that the compiler avoid reordering the
+ dependency into nonexistence. Careful use of ACCESS_ONCE() or
+ barrier() can help to preserve your control dependency.
+
+ (*) Control dependencies do -not- provide transitivity. If you
+ need transitivity, use smp_mb().
SMP BARRIER PAIRING
@@ -1083,7 +1247,10 @@ compiler from moving the memory accesses either side of it to the other side:
barrier();
-This is a general barrier - lesser varieties of compiler barrier do not exist.
+This is a general barrier -- there are no read-read or write-write variants
+of barrier(). However, ACCESS_ONCE() can be thought of as a weak form
+for barrier() that affects only the specific accesses flagged by the
+ACCESS_ONCE().
The compiler barrier has no direct effect on the CPU, which may then reorder
things however it wishes.
--
1.8.1.5
* [PATCH v5 tip/core/locking 4/7] Documentation/memory-barriers.txt: Document ACCESS_ONCE()
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 2/7] Documentation/memory-barriers.txt: Add long atomic examples " Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 3/7] Documentation/memory-barriers.txt: Prohibit speculative writes Paul E. McKenney
@ 2013-12-10 1:28 ` Paul E. McKenney
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK Paul E. McKenney
` (2 subsequent siblings)
5 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:28 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney, Oleg Nesterov, Jonathan Corbet, Rusty Russell
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
The situations in which ACCESS_ONCE() is required are not well documented,
so this commit adds some verbiage to memory-barriers.txt.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
Documentation/memory-barriers.txt | 296 +++++++++++++++++++++++++++++++++-----
1 file changed, 262 insertions(+), 34 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index deafa36aeea1..a0763db314ff 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -231,37 +231,8 @@ And there are a number of things that _must_ or _must_not_ be assumed:
(*) It _must_not_ be assumed that the compiler will do what you want with
memory references that are not protected by ACCESS_ONCE(). Without
ACCESS_ONCE(), the compiler is within its rights to do all sorts
- of "creative" transformations:
-
- (-) Repeat the load, possibly getting a different value on the second
- and subsequent loads. This is especially prone to happen when
- register pressure is high.
-
- (-) Merge adjacent loads and stores to the same location. The most
- familiar example is the transformation from:
-
- while (a)
- do_something();
-
- to something like:
-
- if (a)
- for (;;)
- do_something();
-
- Using ACCESS_ONCE() as follows prevents this sort of optimization:
-
- while (ACCESS_ONCE(a))
- do_something();
-
- (-) "Store tearing", where a single store in the source code is split
- into smaller stores in the object code. Note that gcc really
- will do this on some architectures when storing certain constants.
- It can be cheaper to do a series of immediate stores than to
- form the constant in a register and then to store that register.
-
- (-) "Load tearing", which splits loads in a manner analogous to
- store tearing.
+ of "creative" transformations, which are covered in the Compiler
+ Barrier section.
(*) It _must_not_ be assumed that independent loads and stores will be issued
in the order given. This means that for:
@@ -749,7 +720,8 @@ In summary:
(*) Control dependencies require that the compiler avoid reordering the
dependency into nonexistence. Careful use of ACCESS_ONCE() or
- barrier() can help to preserve your control dependency.
+ barrier() can help to preserve your control dependency. Please
+ see the Compiler Barrier section for more information.
(*) Control dependencies do -not- provide transitivity. If you
need transitivity, use smp_mb().
@@ -1252,8 +1224,264 @@ of barrier(). However, ACCESS_ONCE() can be thought of as a weak form
for barrier() that affects only the specific accesses flagged by the
ACCESS_ONCE().
-The compiler barrier has no direct effect on the CPU, which may then reorder
-things however it wishes.
+The barrier() function has the following effects:
+
+ (*) Prevents the compiler from reordering accesses following the
+ barrier() to precede any accesses preceding the barrier().
+ One example use for this property is to ease communication between
+ interrupt-handler code and the code that was interrupted.
+
+ (*) Within a loop, forces the compiler to load the variables used
+ in that loop's conditional on each pass through that loop.
+
+The ACCESS_ONCE() function can prevent any number of optimizations that,
+while perfectly safe in single-threaded code, can be fatal in concurrent
+code. Here are some examples of these sorts of optimizations:
+
+ (*) The compiler is within its rights to merge successive loads from
+ the same variable. Such merging can cause the compiler to "optimize"
+ the following code:
+
+ while (tmp = a)
+ do_something_with(tmp);
+
+ into the following code, which, although in some sense legitimate
+ for single-threaded code, is almost certainly not what the developer
+ intended:
+
+ if (tmp = a)
+ for (;;)
+ do_something_with(tmp);
+
+ Use ACCESS_ONCE() to prevent the compiler from doing this to you:
+
+ while (tmp = ACCESS_ONCE(a))
+ do_something_with(tmp);
+
+ (*) The compiler is within its rights to reload a variable, for example,
+ in cases where high register pressure prevents the compiler from
+ keeping all data of interest in registers. The compiler might
+ therefore optimize the variable 'tmp' out of our previous example:
+
+ while (tmp = a)
+ do_something_with(tmp);
+
+ This could result in the following code, which is perfectly safe in
+ single-threaded code, but can be fatal in concurrent code:
+
+ while (a)
+ do_something_with(a);
+
+ For example, the optimized version of this code could result in
+ passing a zero to do_something_with() in the case where the variable
+ a was modified by some other CPU between the "while" statement and
+ the call to do_something_with().
+
+ Again, use ACCESS_ONCE() to prevent the compiler from doing this:
+
+ while (tmp = ACCESS_ONCE(a))
+ do_something_with(tmp);
+
+ Note that if the compiler runs short of registers, it might save
+ tmp onto the stack. The overhead of this saving and later restoring
+ is why compilers reload variables. Doing so is perfectly safe for
+ single-threaded code, so you need to tell the compiler about cases
+ where it is not safe.
+
+ (*) The compiler is within its rights to omit a load entirely if it knows
+ what the value will be. For example, if the compiler can prove that
+ the value of variable 'a' is always zero, it can optimize this code:
+
+ while (tmp = a)
+ do_something_with(tmp);
+
+ Into this:
+
+ do { } while (0);
+
+ This transformation is a win for single-threaded code because it gets
+ rid of a load and a branch. The problem is that the compiler will
+ carry out its proof assuming that the current CPU is the only one
+ updating variable 'a'. If variable 'a' is shared, then the compiler's
+ proof will be erroneous. Use ACCESS_ONCE() to tell the compiler
+ that it doesn't know as much as it thinks it does:
+
+ while (tmp = ACCESS_ONCE(a))
+ do_something_with(tmp);
+
+ But please note that the compiler is also closely watching what you
+ do with the value after the ACCESS_ONCE(). For example, suppose you
+ do the following and MAX is a preprocessor macro with the value 1:
+
+ while ((tmp = ACCESS_ONCE(a)) % MAX)
+ do_something_with(tmp);
+
+ Then the compiler knows that the result of the "%" operator applied
+ to MAX will always be zero, again allowing the compiler to optimize
+ the code into near-nonexistence. (It will still load from the
+ variable 'a'.)
+
+ (*) Similarly, the compiler is within its rights to omit a store entirely
+ if it knows that the variable already has the value being stored.
+ Again, the compiler assumes that the current CPU is the only one
+ storing into the variable, which can cause the compiler to do the
+ wrong thing for shared variables. For example, suppose you have
+ the following:
+
+ a = 0;
+ /* Code that does not store to variable a. */
+ a = 0;
+
+ The compiler sees that the value of variable 'a' is already zero, so
+ it might well omit the second store. This would come as a fatal
+ surprise if some other CPU might have stored to variable 'a' in the
+ meantime.
+
+ Use ACCESS_ONCE() to prevent the compiler from making this sort of
+ wrong guess:
+
+ ACCESS_ONCE(a) = 0;
+ /* Code that does not store to variable a. */
+ ACCESS_ONCE(a) = 0;
+
+ (*) The compiler is within its rights to reorder memory accesses unless
+ you tell it not to. For example, consider the following interaction
+ between process-level code and an interrupt handler:
+
+ void process_level(void)
+ {
+ msg = get_message();
+ flag = true;
+ }
+
+ void interrupt_handler(void)
+ {
+ if (flag)
+ process_message(msg);
+ }
+
+ There is nothing to prevent the compiler from transforming
+ process_level() to the following; in fact, this might well be a
+ win for single-threaded code:
+
+ void process_level(void)
+ {
+ flag = true;
+ msg = get_message();
+ }
+
+ If the interrupt occurs between these two statements, then
+ interrupt_handler() might be passed a garbled msg. Use ACCESS_ONCE()
+ to prevent this as follows:
+
+ void process_level(void)
+ {
+ ACCESS_ONCE(msg) = get_message();
+ ACCESS_ONCE(flag) = true;
+ }
+
+ void interrupt_handler(void)
+ {
+ if (ACCESS_ONCE(flag))
+ process_message(ACCESS_ONCE(msg));
+ }
+
+ Note that the ACCESS_ONCE() wrappers in interrupt_handler()
+ are needed if this interrupt handler can itself be interrupted
+ by something that also accesses 'flag' and 'msg', for example,
+ a nested interrupt or an NMI. Otherwise, ACCESS_ONCE() is not
+ needed in interrupt_handler() other than for documentation purposes.
+
+ This effect could also be achieved using barrier(), but ACCESS_ONCE()
+ is more selective: With ACCESS_ONCE(), the compiler need only forget
+ the contents of the indicated memory location, while with barrier()
+ the compiler must discard the value of all memory locations that
+ it has currently cached in any machine registers.
+
+ (*) The compiler is within its rights to invent stores to a variable,
+ as in the following example:
+
+ if (a)
+ b = a;
+ else
+ b = 42;
+
+ The compiler might save a branch by optimizing this as follows:
+
+ b = 42;
+ if (a)
+ b = a;
+
+ In single-threaded code, this is not only safe, but also saves
+ a branch. Unfortunately, in concurrent code, this optimization
+ could cause some other CPU to see a spurious value of 42 -- even
+ if variable 'a' was never zero -- when loading variable 'b'.
+ Use ACCESS_ONCE() to prevent this as follows:
+
+ if (a)
+ ACCESS_ONCE(b) = a;
+ else
+ ACCESS_ONCE(b) = 42;
+
+ The compiler can also invent loads. These are usually less
+ damaging, but they can result in cache-line bouncing and thus in
+ poor performance and scalability. Use ACCESS_ONCE() to prevent
+ invented loads.
+
+ (*) For aligned memory locations whose size allows them to be accessed
+ with a single memory-reference instruction, ACCESS_ONCE() prevents "load tearing"
+ and "store tearing," in which a single large access is replaced by
+ multiple smaller accesses. For example, given an architecture having
+ 16-bit store instructions with 7-bit immediate fields, the compiler
+ might be tempted to use two 16-bit store-immediate instructions to
+ implement the following 32-bit store:
+
+ p = 0x00010002;
+
+ Please note that GCC really does use this sort of optimization,
+ which is not surprising given that it would likely take more
+ than two instructions to build the constant and then store it.
+ This optimization can therefore be a win in single-threaded code.
+ In fact, a recent bug (since fixed) caused GCC to incorrectly use
+ this optimization in a volatile store. In the absence of such bugs,
+ use of ACCESS_ONCE() prevents store tearing in the following example:
+
+ ACCESS_ONCE(p) = 0x00010002;
+
+ Use of packed structures can also result in load and store tearing,
+ as in this example:
+
+ struct __attribute__((__packed__)) foo {
+ short a;
+ int b;
+ short c;
+ };
+ struct foo foo1, foo2;
+ ...
+
+ foo2.a = foo1.a;
+ foo2.b = foo1.b;
+ foo2.c = foo1.c;
+
+ Because there are no ACCESS_ONCE() wrappers and no volatile markings,
+ the compiler would be well within its rights to implement these three
+ assignment statements as a pair of 32-bit loads followed by a pair
+ of 32-bit stores. This would result in load tearing on 'foo1.b'
+ and store tearing on 'foo2.b'. ACCESS_ONCE() again prevents tearing
+ in this example:
+
+ foo2.a = foo1.a;
+ ACCESS_ONCE(foo2.b) = ACCESS_ONCE(foo1.b);
+ foo2.c = foo1.c;
+
+All that aside, it is never necessary to use ACCESS_ONCE() on a variable
+that has been marked volatile. For example, because 'jiffies' is marked
+volatile, it is never necessary to say ACCESS_ONCE(jiffies). The reason
+for this is that ACCESS_ONCE() is implemented as a volatile cast, which
+has no effect when its argument is already marked volatile.
+
+Please note that these compiler barriers have no direct effect on the CPU,
+which may then reorder things however it wishes.
CPU MEMORY BARRIERS
--
1.8.1.5
* [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
` (2 preceding siblings ...)
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 4/7] Documentation/memory-barriers.txt: Document ACCESS_ONCE() Paul E. McKenney
@ 2013-12-10 1:28 ` Paul E. McKenney
2013-12-10 1:32 ` Josh Triplett
` (2 more replies)
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier Paul E. McKenney
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 7/7] rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods Paul E. McKenney
5 siblings, 3 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:28 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney, Ingo Molnar, Oleg Nesterov, Linus Torvalds,
Will Deacon, Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Historically, an UNLOCK+LOCK pair executed by one CPU, by one task,
or on a given lock variable has implied a full memory barrier. In a
recent LKML thread, the wisdom of this historical approach was called
into question: http://www.spinics.net/lists/linux-mm/msg65653.html,
in part due to the memory-order complexities of low-handoff-overhead
queued locks on x86 systems.
This patch therefore removes this guarantee from the documentation, and
further documents how to restore it via a new smp_mb__after_unlock_lock()
primitive.
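In cross-CPU terms, the restored guarantee looks like the following sketch
(hypothetical lock and variables; this is case (b) below, where the UNLOCK
and the LOCK act on the same lock variable):

	CPU 1				CPU 2
	====================		==============================
	ACCESS_ONCE(*A) = a;
	spin_unlock(&mylock);
					spin_lock(&mylock);
					smp_mb__after_unlock_lock();
					ACCESS_ONCE(*B) = b;

With the barrier, a third CPU observes the store to *A before the store to
*B; without it, there is no such guarantee.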
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
Documentation/memory-barriers.txt | 51 +++++++++++++++++++++++++++++++++------
1 file changed, 44 insertions(+), 7 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a0763db314ff..efb791d33e5a 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1626,7 +1626,10 @@ for each construct. These operations all imply certain barriers:
operation has completed.
Memory operations issued before the LOCK may be completed after the LOCK
- operation has completed.
+ operation has completed. An smp_mb__before_spinlock(), combined
+ with a following LOCK, acts as an smp_wmb(). Note the "w",
+ this is smp_wmb(), not smp_mb(). The smp_mb__before_spinlock()
+ primitive is free on many architectures.
(2) UNLOCK operation implication:
@@ -1646,9 +1649,6 @@ for each construct. These operations all imply certain barriers:
All LOCK operations issued before an UNLOCK operation will be completed
before the UNLOCK operation.
- All UNLOCK operations issued before a LOCK operation will be completed
- before the LOCK operation.
-
(5) Failed conditional LOCK implication:
Certain variants of the LOCK operation may fail, either due to being
@@ -1656,9 +1656,6 @@ for each construct. These operations all imply certain barriers:
signal whilst asleep waiting for the lock to become available. Failed
locks do not imply any sort of barrier.
-Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
-equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
-
[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
barriers is that the effects of instructions outside of a critical section
may seep into the inside of the critical section.
@@ -1677,6 +1674,40 @@ may occur as:
LOCK, STORE *B, STORE *A, UNLOCK
+An UNLOCK followed by a LOCK may -not- be assumed to be a full memory
+barrier because it is possible for a preceding UNLOCK to pass a later LOCK
+from the viewpoint of the CPU, but not from the viewpoint of the compiler.
+Note that deadlocks cannot be introduced by this interchange because if
+such a deadlock threatened, the UNLOCK would simply complete. If it is
+necessary for an UNLOCK-LOCK pair to produce a full barrier, the LOCK
+can be followed by an smp_mb__after_unlock_lock() invocation. This will
+produce a full barrier if either (a) the UNLOCK and the LOCK are executed
+by the same CPU or task, or (b) the UNLOCK and LOCK act on the same
+lock variable. The smp_mb__after_unlock_lock() primitive is free on
+many architectures. Without smp_mb__after_unlock_lock(), the UNLOCK
+and LOCK can cross:
+
+ *A = a;
+ UNLOCK
+ LOCK
+ *B = b;
+
+may occur as:
+
+ LOCK, STORE *B, STORE *A, UNLOCK
+
+With smp_mb__after_unlock_lock(), they cannot, so that:
+
+ *A = a;
+ UNLOCK
+ LOCK
+ smp_mb__after_unlock_lock();
+ *B = b;
+
+will always occur as:
+
+ STORE *A, UNLOCK, LOCK, STORE *B
+
Locks and semaphores may not provide any guarantee of ordering on UP compiled
systems, and so cannot be counted on in such a situation to actually achieve
anything at all - especially with respect to I/O accesses - unless combined
@@ -1903,6 +1934,7 @@ However, if the following occurs:
UNLOCK M [1]
ACCESS_ONCE(*D) = d; ACCESS_ONCE(*E) = e;
LOCK M [2]
+ smp_mb__after_unlock_lock();
ACCESS_ONCE(*F) = f;
ACCESS_ONCE(*G) = g;
UNLOCK M [2]
@@ -1920,6 +1952,11 @@ But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
*F, *G or *H preceding LOCK M [2]
*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+Note that the smp_mb__after_unlock_lock() is critically important
+here: Without it CPU 3 might see some of the above orderings.
+Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
+to be seen in order unless CPU 3 holds lock M.
+
LOCKS VS I/O ACCESSES
---------------------
--
1.8.1.5
* [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
` (3 preceding siblings ...)
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK Paul E. McKenney
@ 2013-12-10 1:28 ` Paul E. McKenney
2013-12-10 1:34 ` Josh Triplett
` (2 more replies)
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 7/7] rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods Paul E. McKenney
5 siblings, 3 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:28 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney, Linux-Arch, Ingo Molnar, Oleg Nesterov,
Linus Torvalds
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
The Linux kernel has traditionally required that an UNLOCK+LOCK pair
act as a full memory barrier when either (1) that UNLOCK+LOCK pair
was executed by the same CPU or task, or (2) the same lock variable
was used for the UNLOCK and LOCK. It now seems likely that very few
places in the kernel rely on this full-memory-barrier semantic, and
with the advent of queued locks, providing this semantic either requires
complex reasoning, or for some architectures, added overhead.
This commit therefore adds an smp_mb__after_unlock_lock(), which may be
placed after a LOCK primitive to restore the full-memory-barrier semantic.
All definitions are currently no-ops, but will be upgraded for some
architectures when queued locks arrive.
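The definitions are as simple as that sounds; the upgraded form in the
comment below is an assumption about what a queued-lock architecture might
eventually need, not something this patch introduces:

	/* As added by this patch: UNLOCK+LOCK is already a full barrier. */
	#define smp_mb__after_unlock_lock()	do { } while (0)

	/*
	 * Hypothetical upgrade for an architecture whose queued locks no
	 * longer provide full-barrier UNLOCK+LOCK semantics (it would
	 * replace the no-op above):
	 *
	 *	#define smp_mb__after_unlock_lock()	smp_mb()
	 */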
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linux-Arch <linux-arch@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
arch/arm/include/asm/barrier.h | 2 ++
arch/arm64/include/asm/barrier.h | 2 ++
arch/ia64/include/asm/barrier.h | 2 ++
arch/metag/include/asm/barrier.h | 2 ++
arch/mips/include/asm/barrier.h | 2 ++
arch/powerpc/include/asm/barrier.h | 2 ++
arch/s390/include/asm/barrier.h | 2 ++
arch/sparc/include/asm/barrier_64.h | 2 ++
arch/x86/include/asm/barrier.h | 2 ++
include/asm-generic/barrier.h | 2 ++
10 files changed, 20 insertions(+)
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 2f59f7443396..133aa543dfdb 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -74,6 +74,8 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#define read_barrier_depends() do { } while(0)
#define smp_read_barrier_depends() do { } while(0)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 78e20ba8806b..f97efda22e0a 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -91,6 +91,8 @@ do { \
#endif
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#define read_barrier_depends() do { } while(0)
#define smp_read_barrier_depends() do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index bddcfe91d9c8..fb64ead98e65 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -101,6 +101,8 @@ do { \
})
#endif
+#define smp_mb__after_unlock_lock() do { } while (0)
+
/*
* XXX check on this ---I suspect what Linus really wants here is
* acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index 5d6b4b407dda..76a3ff950607 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -97,4 +97,6 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#endif /* _ASM_METAG_BARRIER_H */
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index e1aa4e4c2984..6f9b586241d6 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -195,4 +195,6 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#endif /* __ASM_BARRIER_H */
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index f89da808ce31..abf645799991 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -84,4 +84,6 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#endif /* _ASM_POWERPC_BARRIER_H */
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 578680f6207a..c8f12244aed3 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -47,4 +47,6 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index b5aad964558e..a2b73329f89c 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -68,4 +68,6 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#endif /* !(__SPARC64_BARRIER_H) */
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index f7053f354ac3..0371fd15f7ee 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -115,6 +115,8 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
/*
* Stop RDTSC speculation. This is needed when you need to use RDTSC
* (or get_cycles or vread that possibly accesses the TSC) in a defined
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 6f692f8ac664..938ecf3a7cd8 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -77,5 +77,7 @@ do { \
___p1; \
})
+#define smp_mb__after_unlock_lock() do { } while (0)
+
#endif /* !__ASSEMBLY__ */
#endif /* __ASM_GENERIC_BARRIER_H */
--
1.8.1.5
* [PATCH v5 tip/core/locking 7/7] rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
` (4 preceding siblings ...)
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier Paul E. McKenney
@ 2013-12-10 1:28 ` Paul E. McKenney
5 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 1:28 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
Paul E. McKenney
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
RCU must ensure that there is the equivalent of a full memory barrier
between any memory access preceding a grace period and any memory access
following that same grace period, regardless of which CPU(s) happen to
execute the two memory accesses. Therefore, downgrading UNLOCK+LOCK to
no longer imply a full memory barrier requires some adjustments to RCU.
This commit therefore adds smp_mb__after_unlock_lock() invocations as
needed after the RCU lock acquisitions that need to be part of a
full-memory-barrier UNLOCK+LOCK.
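The resulting pattern is mechanical: each affected rcu_node lock
acquisition is followed immediately by the barrier, as in this
representative fragment from the patch below:

	raw_spin_lock_irq(&rnp->lock);
	smp_mb__after_unlock_lock();	/* Order against prior UNLOCKs on any CPU. */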
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
kernel/rcu/tree.c | 18 +++++++++++++++++-
kernel/rcu/tree_plugin.h | 13 +++++++++++++
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dd081987a8ec..a6205a05b5e4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1133,8 +1133,10 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp)
* hold it, acquire the root rcu_node structure's lock in order to
* start one (if needed).
*/
- if (rnp != rnp_root)
+ if (rnp != rnp_root) {
raw_spin_lock(&rnp_root->lock);
+ smp_mb__after_unlock_lock();
+ }
/*
* Get a new grace-period number. If there really is no grace
@@ -1354,6 +1356,7 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
local_irq_restore(flags);
return;
}
+ smp_mb__after_unlock_lock();
__note_gp_changes(rsp, rnp, rdp);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}
@@ -1368,6 +1371,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
rcu_bind_gp_kthread();
raw_spin_lock_irq(&rnp->lock);
+ smp_mb__after_unlock_lock();
if (rsp->gp_flags == 0) {
/* Spurious wakeup, tell caller to go back to sleep. */
raw_spin_unlock_irq(&rnp->lock);
@@ -1409,6 +1413,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
*/
rcu_for_each_node_breadth_first(rsp, rnp) {
raw_spin_lock_irq(&rnp->lock);
+ smp_mb__after_unlock_lock();
rdp = this_cpu_ptr(rsp->rda);
rcu_preempt_check_blocked_tasks(rnp);
rnp->qsmask = rnp->qsmaskinit;
@@ -1463,6 +1468,7 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
/* Clear flag to prevent immediate re-entry. */
if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
raw_spin_lock_irq(&rnp->lock);
+ smp_mb__after_unlock_lock();
rsp->gp_flags &= ~RCU_GP_FLAG_FQS;
raw_spin_unlock_irq(&rnp->lock);
}
@@ -1480,6 +1486,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
struct rcu_node *rnp = rcu_get_root(rsp);
raw_spin_lock_irq(&rnp->lock);
+ smp_mb__after_unlock_lock();
gp_duration = jiffies - rsp->gp_start;
if (gp_duration > rsp->gp_max)
rsp->gp_max = gp_duration;
@@ -1505,6 +1512,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
*/
rcu_for_each_node_breadth_first(rsp, rnp) {
raw_spin_lock_irq(&rnp->lock);
+ smp_mb__after_unlock_lock();
ACCESS_ONCE(rnp->completed) = rsp->gpnum;
rdp = this_cpu_ptr(rsp->rda);
if (rnp == rdp->mynode)
@@ -1515,6 +1523,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
}
rnp = rcu_get_root(rsp);
raw_spin_lock_irq(&rnp->lock);
+ smp_mb__after_unlock_lock();
rcu_nocb_gp_set(rnp, nocb);
rsp->completed = rsp->gpnum; /* Declare grace period done. */
@@ -1749,6 +1758,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
rnp_c = rnp;
rnp = rnp->parent;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
WARN_ON_ONCE(rnp_c->qsmask);
}
@@ -1778,6 +1788,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
rnp = rdp->mynode;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
if (rdp->passed_quiesce == 0 || rdp->gpnum != rnp->gpnum ||
rnp->completed == rnp->gpnum) {
@@ -1992,6 +2003,7 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
mask = rdp->grpmask; /* rnp->grplo is constant. */
do {
raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+ smp_mb__after_unlock_lock();
rnp->qsmaskinit &= ~mask;
if (rnp->qsmaskinit != 0) {
if (rnp != rdp->mynode)
@@ -2202,6 +2214,7 @@ static void force_qs_rnp(struct rcu_state *rsp,
cond_resched();
mask = 0;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
if (!rcu_gp_in_progress(rsp)) {
raw_spin_unlock_irqrestore(&rnp->lock, flags);
return;
@@ -2231,6 +2244,7 @@ static void force_qs_rnp(struct rcu_state *rsp,
rnp = rcu_get_root(rsp);
if (rnp->qsmask == 0) {
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
rcu_initiate_boost(rnp, flags); /* releases rnp->lock. */
}
}
@@ -2263,6 +2277,7 @@ static void force_quiescent_state(struct rcu_state *rsp)
/* Reached the root of the rcu_node tree, acquire lock. */
raw_spin_lock_irqsave(&rnp_old->lock, flags);
+ smp_mb__after_unlock_lock();
raw_spin_unlock(&rnp_old->fqslock);
if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
rsp->n_force_qs_lh++;
@@ -2378,6 +2393,7 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
struct rcu_node *rnp_root = rcu_get_root(rsp);
raw_spin_lock(&rnp_root->lock);
+ smp_mb__after_unlock_lock();
rcu_start_gp(rsp);
raw_spin_unlock(&rnp_root->lock);
} else {
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6abb03dff5c0..eb161d80cbde 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -204,6 +204,7 @@ static void rcu_preempt_note_context_switch(int cpu)
rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
rnp = rdp->mynode;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BLOCKED;
t->rcu_blocked_node = rnp;
@@ -312,6 +313,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
mask = rnp->grpmask;
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
raw_spin_lock(&rnp_p->lock); /* irqs already disabled. */
+ smp_mb__after_unlock_lock();
rcu_report_qs_rnp(mask, &rcu_preempt_state, rnp_p, flags);
}
@@ -381,6 +383,7 @@ void rcu_read_unlock_special(struct task_struct *t)
for (;;) {
rnp = t->rcu_blocked_node;
raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+ smp_mb__after_unlock_lock();
if (rnp == t->rcu_blocked_node)
break;
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
@@ -605,6 +608,7 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
while (!list_empty(lp)) {
t = list_entry(lp->next, typeof(*t), rcu_node_entry);
raw_spin_lock(&rnp_root->lock); /* irqs already disabled */
+ smp_mb__after_unlock_lock();
list_del(&t->rcu_node_entry);
t->rcu_blocked_node = rnp_root;
list_add(&t->rcu_node_entry, lp_root);
@@ -629,6 +633,7 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
* in this case.
*/
raw_spin_lock(&rnp_root->lock); /* irqs already disabled */
+ smp_mb__after_unlock_lock();
if (rnp_root->boost_tasks != NULL &&
rnp_root->boost_tasks != rnp_root->gp_tasks &&
rnp_root->boost_tasks != rnp_root->exp_tasks)
@@ -772,6 +777,7 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
unsigned long mask;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
for (;;) {
if (!sync_rcu_preempt_exp_done(rnp)) {
raw_spin_unlock_irqrestore(&rnp->lock, flags);
@@ -787,6 +793,7 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
rnp = rnp->parent;
raw_spin_lock(&rnp->lock); /* irqs already disabled */
+ smp_mb__after_unlock_lock();
rnp->expmask &= ~mask;
}
}
@@ -806,6 +813,7 @@ sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
int must_wait = 0;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
if (list_empty(&rnp->blkd_tasks)) {
raw_spin_unlock_irqrestore(&rnp->lock, flags);
} else {
@@ -886,6 +894,7 @@ void synchronize_rcu_expedited(void)
/* Initialize ->expmask for all non-leaf rcu_node structures. */
rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
rnp->expmask = rnp->qsmaskinit;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}
@@ -1191,6 +1200,7 @@ static int rcu_boost(struct rcu_node *rnp)
return 0; /* Nothing left to boost. */
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
/*
* Recheck under the lock: all tasks in need of boosting
@@ -1377,6 +1387,7 @@ static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
if (IS_ERR(t))
return PTR_ERR(t);
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
rnp->boost_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = RCU_BOOST_PRIO;
@@ -1769,6 +1780,7 @@ static void rcu_prepare_for_idle(int cpu)
continue;
rnp = rdp->mynode;
raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+ smp_mb__after_unlock_lock();
rcu_accelerate_cbs(rsp, rnp, rdp);
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
}
@@ -2209,6 +2221,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
struct rcu_node *rnp = rdp->mynode;
raw_spin_lock_irqsave(&rnp->lock, flags);
+ smp_mb__after_unlock_lock();
c = rcu_start_future_gp(rnp, rdp);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
--
1.8.1.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK Paul E. McKenney
@ 2013-12-10 1:32 ` Josh Triplett
2013-12-10 5:19 ` Paul E. McKenney
2013-12-10 13:14 ` Peter Zijlstra
2013-12-10 16:44 ` Oleg Nesterov
2 siblings, 1 reply; 31+ messages in thread
From: Josh Triplett @ 2013-12-10 1:32 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Mon, Dec 09, 2013 at 05:28:01PM -0800, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>
> Historically, an UNLOCK+LOCK pair executed by one CPU, by one task,
> or on a given lock variable has implied a full memory barrier. In a
> recent LKML thread, the wisdom of this historical approach was called
> into question: http://www.spinics.net/lists/linux-mm/msg65653.html,
> in part due to the memory-order complexities of low-handoff-overhead
> queued locks on x86 systems.
>
> This patch therefore removes this guarantee from the documentation, and
> further documents how to restore it via a new smp_mb__after_unlock_lock()
> primitive.
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Waiman Long <waiman.long@hp.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Hurley <peter@hurleysoftware.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
> Documentation/memory-barriers.txt | 51 +++++++++++++++++++++++++++++++++------
> 1 file changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a0763db314ff..efb791d33e5a 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1626,7 +1626,10 @@ for each construct. These operations all imply certain barriers:
> operation has completed.
>
> Memory operations issued before the LOCK may be completed after the LOCK
> - operation has completed.
> + operation has completed. An smp_mb__before_spinlock(), combined
> + with a following LOCK, acts as an smp_wmb(). Note the "w",
> + this is smp_wmb(), not smp_mb(). The smp_mb__before_spinlock()
> + primitive is free on many architectures.
Gah. That seems highly error-prone; why isn't that
"smp_wmb__before_spinlock()"?
- Josh Triplett
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier Paul E. McKenney
@ 2013-12-10 1:34 ` Josh Triplett
2013-12-10 5:26 ` Paul E. McKenney
2013-12-10 12:37 ` Peter Zijlstra
2013-12-10 17:04 ` Oleg Nesterov
2 siblings, 1 reply; 31+ messages in thread
From: Josh Triplett @ 2013-12-10 1:34 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov, Linus Torvalds
On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>
> The Linux kernel has traditionally required that an UNLOCK+LOCK pair
> act as a full memory barrier when either (1) that UNLOCK+LOCK pair
> was executed by the same CPU or task, or (2) the same lock variable
> was used for the UNLOCK and LOCK. It now seems likely that very few
> places in the kernel rely on this full-memory-barrier semantic, and
> with the advent of queued locks, providing this semantic either requires
> complex reasoning, or for some architectures, added overhead.
>
> This commit therefore adds a smp_mb__after_unlock_lock(), which may be
> placed after a LOCK primitive to restore the full-memory-barrier semantic.
> All definitions are currently no-ops, but will be upgraded for some
> architectures when queued locks arrive.
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Linux-Arch <linux-arch@vger.kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
It seems quite unfortunate that this isn't in some common location, and
then only overridden by architectures that need to do so.
More importantly: you document this earlier in the patch series than you
introduce it.
- Josh Triplett
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 1:32 ` Josh Triplett
@ 2013-12-10 5:19 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 5:19 UTC (permalink / raw)
To: Josh Triplett
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Mon, Dec 09, 2013 at 05:32:31PM -0800, Josh Triplett wrote:
> On Mon, Dec 09, 2013 at 05:28:01PM -0800, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> >
> > Historically, an UNLOCK+LOCK pair executed by one CPU, by one task,
> > or on a given lock variable has implied a full memory barrier. In a
> > recent LKML thread, the wisdom of this historical approach was called
> > into question: http://www.spinics.net/lists/linux-mm/msg65653.html,
> > in part due to the memory-order complexities of low-handoff-overhead
> > queued locks on x86 systems.
> >
> > This patch therefore removes this guarantee from the documentation, and
> > further documents how to restore it via a new smp_mb__after_unlock_lock()
> > primitive.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: Tim Chen <tim.c.chen@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Waiman Long <waiman.long@hp.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Andi Kleen <andi@firstfloor.org>
> > Cc: Michel Lespinasse <walken@google.com>
> > Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Peter Hurley <peter@hurleysoftware.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> > Documentation/memory-barriers.txt | 51 +++++++++++++++++++++++++++++++++------
> > 1 file changed, 44 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index a0763db314ff..efb791d33e5a 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1626,7 +1626,10 @@ for each construct. These operations all imply certain barriers:
> > operation has completed.
> >
> > Memory operations issued before the LOCK may be completed after the LOCK
> > - operation has completed.
> > + operation has completed. An smp_mb__before_spinlock(), combined
> > + with a following LOCK, acts as an smp_wmb(). Note the "w",
> > + this is smp_wmb(), not smp_mb(). The smp_mb__before_spinlock()
> > + primitive is free on many architectures.
>
> Gah. That seems highly error-prone; why isn't that
> "smp_wmb__before_spinlock()"?
I must confess that I wondered that myself. I didn't create it, I am
just documenting it.
Might be worth a change, though.
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 1:34 ` Josh Triplett
@ 2013-12-10 5:26 ` Paul E. McKenney
2013-12-10 18:53 ` Paul E. McKenney
0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 5:26 UTC (permalink / raw)
To: Josh Triplett
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov, Linus Torvalds
On Mon, Dec 09, 2013 at 05:34:17PM -0800, Josh Triplett wrote:
> On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> >
> > The Linux kernel has traditionally required that an UNLOCK+LOCK pair
> > act as a full memory barrier when either (1) that UNLOCK+LOCK pair
> > was executed by the same CPU or task, or (2) the same lock variable
> > was used for the UNLOCK and LOCK. It now seems likely that very few
> > places in the kernel rely on this full-memory-barrier semantic, and
> > with the advent of queued locks, providing this semantic either requires
> > complex reasoning, or for some architectures, added overhead.
> >
> > This commit therefore adds a smp_mb__after_unlock_lock(), which may be
> > placed after a LOCK primitive to restore the full-memory-barrier semantic.
> > All definitions are currently no-ops, but will be upgraded for some
> > architectures when queued locks arrive.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Linux-Arch <linux-arch@vger.kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
>
> It seems quite unfortunate that this isn't in some common location, and
> then only overridden by architectures that need to do so.
I was thinking that include/asm-generic/barrier.h was the place, but
it is all-or-nothing, used by UP architectures, from what I can see.
I figured that if there is such a common location, posting this patch
might flush it out. I am not sure that this single definition is worth
the creation of a common place -- or even this definition combined with
smp_read_barrier_depends().
> More importantly: you document this earlier in the patch series than you
> introduce it.
Fair point, I reversed the order of those two patches.
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier Paul E. McKenney
2013-12-10 1:34 ` Josh Triplett
@ 2013-12-10 12:37 ` Peter Zijlstra
2013-12-10 17:17 ` Paul E. McKenney
2013-12-10 17:45 ` Josh Triplett
2013-12-10 17:04 ` Oleg Nesterov
2 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2013-12-10 12:37 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov, Linus Torvalds
On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index f89da808ce31..abf645799991 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -84,4 +84,6 @@ do { \
> ___p1; \
> })
>
> +#define smp_mb__after_unlock_lock() do { } while (0)
> +
> #endif /* _ASM_POWERPC_BARRIER_H */
Didn't Ben say that ppc actually violates the current unlock+lock
assumption, and therefore this barrier wouldn't actually be a no-op on ppc?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK Paul E. McKenney
2013-12-10 1:32 ` Josh Triplett
@ 2013-12-10 13:14 ` Peter Zijlstra
2013-12-10 17:12 ` Paul E. McKenney
2013-12-10 16:44 ` Oleg Nesterov
2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2013-12-10 13:14 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Mon, Dec 09, 2013 at 05:28:01PM -0800, Paul E. McKenney wrote:
> +An UNLOCK followed by a LOCK may -not- be assumed to be a full memory
> +barrier because it is possible for a preceding UNLOCK to pass a later LOCK
> +from the viewpoint of the CPU, but not from the viewpoint of the compiler.
> +Note that deadlocks cannot be introduced by this interchange because if
> +such a deadlock threatened, the UNLOCK would simply complete.
For me its easier to read if we start a new paragraph here.
> If it is
> +necessary for an UNLOCK-LOCK pair to produce a full barrier, the LOCK
> +can be followed by an smp_mb__after_unlock_lock() invocation. This will
> +produce a full barrier if either (a) the UNLOCK and the LOCK are executed
> +by the same CPU or task, or (b) the UNLOCK and LOCK act on the same
> +lock variable. The smp_mb__after_unlock_lock() primitive is free on
> +many architectures.
The way I read the above it says that you need
smp_mb__after_unlock_lock() when the UNLOCK and LOCK are on the same
variable. That doesn't make sense, I thought that was the one case we
all agreed on it would indeed be a full barrier without extra trickery.
So I would expect something like:
"If it is necessary for an UNLOCK-LOCK pair to produce a full barrier,
you must either ensure they operate on the same lock variable, or place
smp_mb__after_unlock_lock() after the LOCK."
> Without smp_mb__after_unlock_lock(), the UNLOCK
> +and LOCK can cross:
> +
> + *A = a;
> + UNLOCK
> + LOCK
> + *B = b;
> +
> +may occur as:
> +
> + LOCK, STORE *B, STORE *A, UNLOCK
> +
> +With smp_mb__after_unlock_lock(), they cannot, so that:
> +
> + *A = a;
> + UNLOCK
> + LOCK
> + smp_mb__after_unlock_lock();
> + *B = b;
> +
> +will always occur as:
> +
> + STORE *A, UNLOCK, LOCK, STORE *B
> +
Since we introduced the concept of lock variables -- since it now
matters if the UNLOCK and LOCK act on the same one or not, we should
reflect that in the above examples (and maybe throughout the document).
That is; we should clarify:
*A = a
UNLOCK x
LOCK y
*B = b
Being different from:
*A = a
UNLOCK x
LOCK x
*B = b
I also find the wording slightly weird in that LOCK and UNLOCK are
stopped from crossing by smp_mb__after_unlock_lock(). They are not; what
is stopped is *B = b from moving up and the rest from moving down. The
UNLOCK and LOCK can still cross -- they happened before we issued the
barrier after all.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK Paul E. McKenney
2013-12-10 1:32 ` Josh Triplett
2013-12-10 13:14 ` Peter Zijlstra
@ 2013-12-10 16:44 ` Oleg Nesterov
2013-12-10 17:15 ` Paul E. McKenney
2 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2013-12-10 16:44 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Ingo Molnar, Linus Torvalds, Will Deacon, Tim Chen,
Waiman Long, Andrea Arcangeli, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Rik van Riel, Peter Hurley, H. Peter Anvin,
Arnd Bergmann, Benjamin Herrenschmidt
On 12/09, Paul E. McKenney wrote:
>
> @@ -1626,7 +1626,10 @@ for each construct. These operations all imply certain barriers:
> operation has completed.
>
> Memory operations issued before the LOCK may be completed after the LOCK
> - operation has completed.
> + operation has completed. An smp_mb__before_spinlock(), combined
> + with a following LOCK, acts as an smp_wmb(). Note the "w",
> + this is smp_wmb(), not smp_mb().
Well, but smp_mb__before_spinlock + LOCK is not wmb... But it is not
the full barrier. It should guarantee that, say,
CONDITION = true; // 1
// try_to_wake_up
smp_mb__before_spinlock();
spin_lock(&task->pi_lock);
if (!(p->state & state)) // 2
return;
can't race with set_current_state() + check(CONDITION), which means
that 1 and 2 above must not be reordered.
But a LOAD before spin_lock() can leak into the critical section.
Perhaps this should be clarified somehow, or perhaps it should actually
imply mb (if combined with LOCK).
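For completeness, the waiter side this pairs with might look like the
sketch below; as above, CONDITION is just a placeholder, not a real symbol:
	set_current_state(TASK_INTERRUPTIBLE);	/* store ->state, then full mb */
	if (!CONDITION)				/* must not be reordered before the store */
		schedule();
	__set_current_state(TASK_RUNNING);
Here set_current_state() supplies the barrier on the waiter's side, so the
open question is only about the waker's side shown earlier.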
Oleg
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier Paul E. McKenney
2013-12-10 1:34 ` Josh Triplett
2013-12-10 12:37 ` Peter Zijlstra
@ 2013-12-10 17:04 ` Oleg Nesterov
2013-12-10 17:18 ` Paul E. McKenney
2 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2013-12-10 17:04 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Linux-Arch, Ingo Molnar, Linus Torvalds
On 12/09, Paul E. McKenney wrote:
>
> This commit therefore adds a smp_mb__after_unlock_lock(), which may be
> placed after a LOCK primitive to restore the full-memory-barrier semantic.
> All definitions are currently no-ops, but will be upgraded for some
> architectures when queued locks arrive.
I am wondering, perhaps smp_mb__after_unlock() makes more sense?
Note that it already has the potential user:
--- x/kernel/sched/wait.c
+++ x/kernel/sched/wait.c
@@ -176,8 +176,9 @@ prepare_to_wait(wait_queue_head_t *q, wa
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
- set_current_state(state);
+ __set_current_state(state);
spin_unlock_irqrestore(&q->lock, flags);
+ smp_mb__after_unlock();
}
EXPORT_SYMBOL(prepare_to_wait);
@@ -190,8 +191,9 @@ prepare_to_wait_exclusive(wait_queue_hea
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
__add_wait_queue_tail(q, wait);
- set_current_state(state);
+ __set_current_state(state);
spin_unlock_irqrestore(&q->lock, flags);
+ smp_mb__after_unlock();
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);
Assuming it can also be used "later", after another LOCK, like in
your example in 5/7.
Oleg.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 13:14 ` Peter Zijlstra
@ 2013-12-10 17:12 ` Paul E. McKenney
2013-12-10 17:25 ` Peter Zijlstra
2013-12-10 17:43 ` Peter Zijlstra
0 siblings, 2 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 17:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 02:14:22PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 09, 2013 at 05:28:01PM -0800, Paul E. McKenney wrote:
> > +An UNLOCK followed by a LOCK may -not- be assumed to be a full memory
> > +barrier because it is possible for a preceding UNLOCK to pass a later LOCK
> > +from the viewpoint of the CPU, but not from the viewpoint of the compiler.
> > +Note that deadlocks cannot be introduced by this interchange because if
> > +such a deadlock threatened, the UNLOCK would simply complete.
>
> For me its easier to read if we start a new paragraph here.
Works for me!
> > If it is
> > +necessary for an UNLOCK-LOCK pair to produce a full barrier, the LOCK
> > +can be followed by an smp_mb__after_unlock_lock() invocation. This will
> > +produce a full barrier if either (a) the UNLOCK and the LOCK are executed
> > +by the same CPU or task, or (b) the UNLOCK and LOCK act on the same
> > +lock variable. The smp_mb__after_unlock_lock() primitive is free on
> > +many architectures.
>
> The way I read the above it says that you need
> smp_mb__after_unlock_lock() when the UNLOCK and LOCK are on the same
> variable. That doesn't make sense, I thought that was the one case we
> all agreed on it would indeed be a full barrier without extra trickery.
On x86, sure, but smp_mb__after_unlock_lock() is nothingness on x86
anyway. Other architectures might benefit from requiring that the
smp_mb__after_unlock_lock() be used in this case.
> So I would expect something like:
>
> "If it is necessary for an UNLOCK-LOCK pair to produce a full barrier,
> you must either ensure they operate on the same lock variable, or place
> smp_mb__after_unlock_lock() after the LOCK."
I respectfully disagree. Requiring the smp_mb__after_unlock_lock()
isn't going to hurt anything and making the full-barrier assumption
explicit will make it a lot easier to inspect this stuff.
> > Without smp_mb__after_unlock_lock(), the UNLOCK
> > +and LOCK can cross:
> > +
> > + *A = a;
> > + UNLOCK
> > + LOCK
> > + *B = b;
> > +
> > +may occur as:
> > +
> > + LOCK, STORE *B, STORE *A, UNLOCK
> > +
> > +With smp_mb__after_unlock_lock(), they cannot, so that:
> > +
> > + *A = a;
> > + UNLOCK
> > + LOCK
> > + smp_mb__after_unlock_lock();
> > + *B = b;
> > +
> > +will always occur as:
> > +
> > + STORE *A, UNLOCK, LOCK, STORE *B
> > +
>
> Since we introduced the concept of lock variables -- since it now
> matters if the UNLOCK and LOCK act on the same one or not, we should
> reflect that in the above examples (and maybe throughout the document).
>
> That is; we should clarify:
>
> *A = a
> UNLOCK x
> LOCK y
> *B = b
>
> Being different from:
>
> *A = a
> UNLOCK x
> LOCK x
> *B = b
>
> I also find the wording slightly weird in that LOCK and UNLOCK are
> stopped from crossing by smp_mb__after_unlock_lock(). They are not; what
> is stopped is *B = b from moving up and the rest from moving down. The
> UNLOCK and LOCK can still cross -- they happened before we issued the
> barrier after all.
Good point -- the UNLOCK and LOCK are guaranteed to be ordered only
if they both operate on the same lock variable. OK, I will make the
example use different lock variables and show the different outcomes.
How about the following?
If it is necessary for an UNLOCK-LOCK pair to
produce a full barrier, the LOCK can be followed by an
smp_mb__after_unlock_lock() invocation. This will produce a
full barrier if either (a) the UNLOCK and the LOCK are executed
by the same CPU or task, or (b) the UNLOCK and LOCK act on the
same lock variable. The smp_mb__after_unlock_lock() primitive is
free on many architectures. Without smp_mb__after_unlock_lock(),
the UNLOCK and LOCK can cross:
*A = a;
UNLOCK M
LOCK N
*B = b;
could occur as:
LOCK N, STORE *B, STORE *A, UNLOCK M
With smp_mb__after_unlock_lock(), they cannot, so that:
*A = a;
UNLOCK M
LOCK N
smp_mb__after_unlock_lock();
*B = b;
will always occur as either of the following:
STORE *A, UNLOCK, LOCK, STORE *B
STORE *A, LOCK, UNLOCK, STORE *B
If the UNLOCK and LOCK were instead both operating on the same
lock variable, only the first of these two alternatives can occur.
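In (hypothetical) kernel C, the second example above might look something
like the following sketch; the lock and variable names are invented for
illustration:
	static DEFINE_SPINLOCK(m);		/* plays the role of M */
	static DEFINE_SPINLOCK(n);		/* plays the role of N */
	static int a, b;
	static void example(void)
	{
		spin_lock(&m);
		a = 1;				/* *A = a */
		spin_unlock(&m);		/* UNLOCK M */
		spin_lock(&n);			/* LOCK N */
		smp_mb__after_unlock_lock();	/* orders the store to "a" */
		b = 1;				/* *B = b */
		spin_unlock(&n);
	}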
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 16:44 ` Oleg Nesterov
@ 2013-12-10 17:15 ` Paul E. McKenney
2013-12-10 17:35 ` Oleg Nesterov
0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 17:15 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Ingo Molnar, Linus Torvalds, Will Deacon, Tim Chen,
Waiman Long, Andrea Arcangeli, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Rik van Riel, Peter Hurley, H. Peter Anvin,
Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 05:44:37PM +0100, Oleg Nesterov wrote:
> On 12/09, Paul E. McKenney wrote:
> >
> > @@ -1626,7 +1626,10 @@ for each construct. These operations all imply certain barriers:
> > operation has completed.
> >
> > Memory operations issued before the LOCK may be completed after the LOCK
> > - operation has completed.
> > + operation has completed. An smp_mb__before_spinlock(), combined
> > + with a following LOCK, acts as an smp_wmb(). Note the "w",
> > + this is smp_wmb(), not smp_mb().
>
> Well, but smp_mb__before_spinlock + LOCK is not wmb... But it is not
> the full barrier. It should guarantee that, say,
>
> CONDITION = true; // 1
>
> // try_to_wake_up
> smp_mb__before_spinlock();
> spin_lock(&task->pi_lock);
>
> if (!(p->state & state)) // 2
> return;
>
> can't race with set_current_state() + check(CONDITION), which means
> that 1 and 2 above must not be reordered.
>
> But a LOAD before spin_lock() can leak into the critical section.
>
> Perhaps this should be clarified somehow, or perhaps it should actually
> imply mb (if combined with LOCK).
If we leave the implementation the same, does the following capture the
constraints?
Memory operations issued before the LOCK may be completed after
the LOCK operation has completed. An smp_mb__before_spinlock(),
combined with a following LOCK, orders prior loads against
subsequent stores and stores and prior stores against
subsequent stores. Note that this is weaker than smp_mb()! The
smp_mb__before_spinlock() primitive is free on many architectures.
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 12:37 ` Peter Zijlstra
@ 2013-12-10 17:17 ` Paul E. McKenney
2013-12-10 17:45 ` Josh Triplett
1 sibling, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 17:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov, Linus Torvalds, benh
On Tue, Dec 10, 2013 at 01:37:26PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > index f89da808ce31..abf645799991 100644
> > --- a/arch/powerpc/include/asm/barrier.h
> > +++ b/arch/powerpc/include/asm/barrier.h
> > @@ -84,4 +84,6 @@ do { \
> > ___p1; \
> > })
> >
> > +#define smp_mb__after_unlock_lock() do { } while (0)
> > +
> > #endif /* _ASM_POWERPC_BARRIER_H */
>
> Didn't Ben say that ppc actually violates the current unlock+lock
> assumption, and therefore this barrier wouldn't actually be a no-op on ppc?
Last I knew, I was saying that it did in theory, but wasn't able to
demonstrate it in practice. But yes, I would be more comfortable with
it being smp_mb().
Ben?
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 17:04 ` Oleg Nesterov
@ 2013-12-10 17:18 ` Paul E. McKenney
2013-12-10 17:32 ` Oleg Nesterov
0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 17:18 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Linux-Arch, Ingo Molnar, Linus Torvalds
On Tue, Dec 10, 2013 at 06:04:04PM +0100, Oleg Nesterov wrote:
> On 12/09, Paul E. McKenney wrote:
> >
> > This commit therefore adds a smp_mb__after_unlock_lock(), which may be
> > placed after a LOCK primitive to restore the full-memory-barrier semantic.
> > All definitions are currently no-ops, but will be upgraded for some
> > architectures when queued locks arrive.
>
> I am wondering, perhaps smp_mb__after_unlock() makes more sense?
>
> Note that it already has the potential user:
>
> --- x/kernel/sched/wait.c
> +++ x/kernel/sched/wait.c
> @@ -176,8 +176,9 @@ prepare_to_wait(wait_queue_head_t *q, wa
> spin_lock_irqsave(&q->lock, flags);
> if (list_empty(&wait->task_list))
> __add_wait_queue(q, wait);
> - set_current_state(state);
> + __set_current_state(state);
> spin_unlock_irqrestore(&q->lock, flags);
> + smp_mb__after_unlock();
> }
> EXPORT_SYMBOL(prepare_to_wait);
>
> @@ -190,8 +191,9 @@ prepare_to_wait_exclusive(wait_queue_hea
> spin_lock_irqsave(&q->lock, flags);
> if (list_empty(&wait->task_list))
> __add_wait_queue_tail(q, wait);
> - set_current_state(state);
> + __set_current_state(state);
> spin_unlock_irqrestore(&q->lock, flags);
> + smp_mb__after_unlock();
> }
> EXPORT_SYMBOL(prepare_to_wait_exclusive);
>
>
> Assuming it can also be used "later", after another LOCK, like in
> your example in 5/7.
I am fine either way. But there was an objection to tying this to the
unlock because it costs more on many architectures than tying this to
the lock.
But if you are saying "in addition to" rather than "instead of" that
would be a different conversation.
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:12 ` Paul E. McKenney
@ 2013-12-10 17:25 ` Peter Zijlstra
2013-12-10 17:43 ` Josh Triplett
2013-12-10 17:49 ` Paul E. McKenney
2013-12-10 17:43 ` Peter Zijlstra
1 sibling, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2013-12-10 17:25 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 09:12:47AM -0800, Paul E. McKenney wrote:
> > The way I read the above it says that you need
> > smp_mb__after_unlock_lock() when the UNLOCK and LOCK are on the same
> > variable. That doesn't make sense, I thought that was the one case we
> > all agreed on it would indeed be a full barrier without extra trickery.
>
> On x86, sure, but smp_mb__after_unlock_lock() is nothingness on x86
> anyway. Other architectures might benefit from requiring that the
> smp_mb__after_unlock_lock() be used in this case.
Confused, UNLOCK X, LOCK X, must always be fully serializing. That's the
entire purpose of the thing.
The only place you can go play games (and clearly we are going there) is
when the UNLOCK and LOCK are on different variables.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 17:18 ` Paul E. McKenney
@ 2013-12-10 17:32 ` Oleg Nesterov
0 siblings, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2013-12-10 17:32 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Linux-Arch, Ingo Molnar, Linus Torvalds
On 12/10, Paul E. McKenney wrote:
>
> On Tue, Dec 10, 2013 at 06:04:04PM +0100, Oleg Nesterov wrote:
> >
> > I am wondering, perhaps smp_mb__after_unlock() makes more sense?
> >
> > Note that it already has the potential user:
> >
> > --- x/kernel/sched/wait.c
> > +++ x/kernel/sched/wait.c
> > @@ -176,8 +176,9 @@ prepare_to_wait(wait_queue_head_t *q, wa
> > spin_lock_irqsave(&q->lock, flags);
> > if (list_empty(&wait->task_list))
> > __add_wait_queue(q, wait);
> > - set_current_state(state);
> > + __set_current_state(state);
> > spin_unlock_irqrestore(&q->lock, flags);
> > + smp_mb__after_unlock();
> > }
> > EXPORT_SYMBOL(prepare_to_wait);
> >
> > @@ -190,8 +191,9 @@ prepare_to_wait_exclusive(wait_queue_hea
> > spin_lock_irqsave(&q->lock, flags);
> > if (list_empty(&wait->task_list))
> > __add_wait_queue_tail(q, wait);
> > - set_current_state(state);
> > + __set_current_state(state);
> > spin_unlock_irqrestore(&q->lock, flags);
> > + smp_mb__after_unlock();
> > }
> > EXPORT_SYMBOL(prepare_to_wait_exclusive);
> >
> >
> > Assuming it can also be used "later", after another LOCK, like in
> > your example in 5/7.
>
> I am fine either way. But there was an objection to tying this to the
> unlock because it costs more on many architectures than tying this to
> the lock.
OK, I see, thanks.
> But if you are saying "in addition to" rather than "instead of" that
> would be a different conversation.
Yes, please forget for now.
Oleg.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:15 ` Paul E. McKenney
@ 2013-12-10 17:35 ` Oleg Nesterov
0 siblings, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2013-12-10 17:35 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Ingo Molnar, Linus Torvalds, Will Deacon, Tim Chen,
Waiman Long, Andrea Arcangeli, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Rik van Riel, Peter Hurley, H. Peter Anvin,
Arnd Bergmann, Benjamin Herrenschmidt
On 12/10, Paul E. McKenney wrote:
>
> On Tue, Dec 10, 2013 at 05:44:37PM +0100, Oleg Nesterov wrote:
> >
> > Well, but smp_mb__before_spinlock + LOCK is not wmb... But it is not
> > the full barrier. It should guarantee that, say,
> >
> > CONDITION = true; // 1
> >
> > // try_to_wake_up
> > smp_mb__before_spinlock();
> > spin_lock(&task->pi_lock);
> >
> > if (!(p->state & state)) // 2
> > return;
> >
> > can't race with set_current_state() + check(CONDITION), which means
> > that 1 and 2 above must not be reordered.
> >
> > But a LOAD before spin_lock() can leak into the critical section.
> >
> > Perhaps this should be clarified somehow, or perhaps it should actually
> > imply mb (if combined with LOCK).
>
> If we leave the implementation the same, does the following capture the
> constraints?
>
> Memory operations issued before the LOCK may be completed after
> the LOCK operation has completed. An smp_mb__before_spinlock(),
> combined with a following LOCK, orders prior loads against
> subsequent stores
prior stores against subsequent loads ;)
Otherwise - thanks!
Oleg.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:25 ` Peter Zijlstra
@ 2013-12-10 17:43 ` Josh Triplett
2013-12-10 18:05 ` Paul E. McKenney
2013-12-10 17:49 ` Paul E. McKenney
1 sibling, 1 reply; 31+ messages in thread
From: Josh Triplett @ 2013-12-10 17:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
mathieu.desnoyers, niv, tglx, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds,
Will Deacon, Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 06:25:28PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 10, 2013 at 09:12:47AM -0800, Paul E. McKenney wrote:
> > > The way I read the above it says that you need
> > > smp_mb__after_unlock_lock() when the UNLOCK and LOCK are on the same
> > > variable. That doesn't make sense, I thought that was the one case we
> > > all agreed on it would indeed be a full barrier without extra trickery.
> >
> > On x86, sure, but smp_mb__after_unlock_lock() is nothingness on x86
> > anyway. Other architectures might benefit from requiring that the
> > smp_mb__after_unlock_lock() be used in this case.
>
> Confused, UNLOCK X, LOCK X, must always be fully serializing. That's the
> entire purpose of the thing.
>
> The only place you can go play games (and clearly we are going there) is
> when the UNLOCK and LOCK are on different variables.
That would certainly be a good assumption to preserve, and it would
eliminate most of the need for smp_mb__after_unlock_lock().
- Josh Triplett
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:12 ` Paul E. McKenney
2013-12-10 17:25 ` Peter Zijlstra
@ 2013-12-10 17:43 ` Peter Zijlstra
2013-12-10 18:49 ` Paul E. McKenney
1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2013-12-10 17:43 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 09:12:47AM -0800, Paul E. McKenney wrote:
> Good point -- the UNLOCK and LOCK are guaranteed to be ordered only
> if they both operate on the same lock variable. OK, I will make the
> example use different lock variables and show the different outcomes.
> How about the following?
>
> If it is necessary for an UNLOCK-LOCK pair to
> produce a full barrier, the LOCK can be followed by an
> smp_mb__after_unlock_lock() invocation. This will produce a
> full barrier if either (a) the UNLOCK and the LOCK are executed
> by the same CPU or task, or (b) the UNLOCK and LOCK act on the
> same lock variable.
So you're still requiring smp_mb__after_unlock_lock() even if they're on
the same variable?
> The smp_mb__after_unlock_lock() primitive is
> free on many architectures. Without smp_mb__after_unlock_lock(),
> the UNLOCK and LOCK can cross:
Contradicted below :-)
> *A = a;
> UNLOCK M
> LOCK N
> *B = b;
>
> could occur as:
>
> LOCK N, STORE *B, STORE *A, UNLOCK M
>
> With smp_mb__after_unlock_lock(), they cannot, so that:
>
> *A = a;
> UNLOCK M
> LOCK N
> smp_mb__after_unlock_lock();
> *B = b;
>
> will always occur as either of the following:
>
> STORE *A, UNLOCK, LOCK, STORE *B
> STORE *A, LOCK, UNLOCK, STORE *B
See, UNLOCK and LOCK can still cross :-)
> If the UNLOCK and LOCK were instead both operating on the same
> lock variable, only the first of these two alternatives can occur.
Agreed.
Sorry for being a pedant. :-)
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 12:37 ` Peter Zijlstra
2013-12-10 17:17 ` Paul E. McKenney
@ 2013-12-10 17:45 ` Josh Triplett
2013-12-10 20:11 ` Paul E. McKenney
1 sibling, 1 reply; 31+ messages in thread
From: Josh Triplett @ 2013-12-10 17:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
mathieu.desnoyers, niv, tglx, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov,
Linus Torvalds
On Tue, Dec 10, 2013 at 01:37:26PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > index f89da808ce31..abf645799991 100644
> > --- a/arch/powerpc/include/asm/barrier.h
> > +++ b/arch/powerpc/include/asm/barrier.h
> > @@ -84,4 +84,6 @@ do { \
> > ___p1; \
> > })
> >
> > +#define smp_mb__after_unlock_lock() do { } while (0)
> > +
> > #endif /* _ASM_POWERPC_BARRIER_H */
>
> Didn't Ben say that ppc actually violates the current unlock+lock
> assumption, and therefore this barrier wouldn't actually be a no-op on ppc?
Or, ppc could fix its lock primitives to preserve the unlock+lock
assumption, and avoid subtle breakage across half the kernel.
- Josh Triplett
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:25 ` Peter Zijlstra
2013-12-10 17:43 ` Josh Triplett
@ 2013-12-10 17:49 ` Paul E. McKenney
1 sibling, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 17:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 06:25:28PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 10, 2013 at 09:12:47AM -0800, Paul E. McKenney wrote:
> > > The way I read the above it says that you need
> > > smp_mb__after_unlock_lock() when the UNLOCK and LOCK are on the same
> > > variable. That doesn't make sense, I thought that was the one case we
> > > all agreed on it would indeed be a full barrier without extra trickery.
> >
> > On x86, sure, but smp_mb__after_unlock_lock() is nothingness on x86
> > anyway. Other architectures might benefit from requiring that the
> > smp_mb__after_unlock_lock() be used in this case.
>
> Confused, UNLOCK X, LOCK X, must always be fully serializing. That's the
> entire purpose of the thing.
From the viewpoint of some CPU holding the lock, yes. If some CPU does
not hold the lock, the guarantee is a choice.
> The only place you can go play games (and clearly we are going there) is
> when the UNLOCK and LOCK are on different variables.
I agree that if UNLOCK and LOCK are on different variables, we do not
need to guarantee ordering of the two critical sections. But if they
are on the same variable, why would it be different from the viewpoint
of some CPU not holding the lock?
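To make the not-holding-the-lock case concrete, here is an illustrative
litmus-style sketch; the lock and variable names are invented:
	CPU 0			CPU 1			CPU 2
	-----			-----			-----
	spin_lock(&l);		spin_lock(&l);		r1 = ACCESS_ONCE(y);
	ACCESS_ONCE(x) = 1;	ACCESS_ONCE(y) = 1;	smp_mb();
	spin_unlock(&l);	spin_unlock(&l);	r2 = ACCESS_ONCE(x);
Assuming CPU 1 acquires the lock after CPU 0 releases it, the question is
whether CPU 2 can observe r1 == 1 && r2 == 0. Treating the same-variable
UNLOCK+LOCK as a full barrier (or placing smp_mb__after_unlock_lock()
after CPU 1's LOCK) forbids that outcome.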
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:43 ` Josh Triplett
@ 2013-12-10 18:05 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 18:05 UTC (permalink / raw)
To: Josh Triplett
Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
mathieu.desnoyers, niv, tglx, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds,
Will Deacon, Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 09:43:20AM -0800, Josh Triplett wrote:
> On Tue, Dec 10, 2013 at 06:25:28PM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 10, 2013 at 09:12:47AM -0800, Paul E. McKenney wrote:
> > > > The way I read the above it says that you need
> > > > smp_mb__after_unlock_lock() when the UNLOCK and LOCK are on the same
> > > > variable. That doesn't make sense, I thought that was the one case we
> > > > all agreed on it would indeed be a full barrier without extra trickery.
> > >
> > > On x86, sure, but smp_mb__after_unlock_lock() is nothingness on x86
> > > anyway. Other architectures might benefit from requiring that the
> > > smp_mb__after_unlock_lock() be used in this case.
> >
> > Confused, UNLOCK X, LOCK X, must always be fully serializing. That's the
> > entire purpose of the thing.
> >
> > The only place you can go play games (and clearly we are going there) is
> > when the UNLOCK and LOCK are on different variables.
>
> That would certainly be a good assumption to preserve, and it would
> eliminate most of the need for smp_mb__after_unlock_lock().
Perhaps RCU is an outlier, but most of the places where I added
smp_mb__after_unlock_lock() had an unlock of one rcu_node structure's
->lock followed by a lock of another rcu_node structure's ->lock.
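Abstracted from the tree.c hunks earlier in this series, the recurring
pattern is roughly the following sketch:
	/* Release one rcu_node structure's ->lock... */
	raw_spin_unlock_irqrestore(&rnp->lock, flags);
	/* ...then acquire another, for example the parent's. */
	rnp = rnp->parent;
	raw_spin_lock_irqsave(&rnp->lock, flags);
	smp_mb__after_unlock_lock();	/* UNLOCK and LOCK on different locks */
so in these cases the UNLOCK and the LOCK act on different lock variables.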
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK
2013-12-10 17:43 ` Peter Zijlstra
@ 2013-12-10 18:49 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 18:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Ingo Molnar, Oleg Nesterov, Linus Torvalds, Will Deacon,
Tim Chen, Waiman Long, Andrea Arcangeli, Andi Kleen,
Michel Lespinasse, Davidlohr Bueso, Rik van Riel, Peter Hurley,
H. Peter Anvin, Arnd Bergmann, Benjamin Herrenschmidt
On Tue, Dec 10, 2013 at 06:43:45PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 10, 2013 at 09:12:47AM -0800, Paul E. McKenney wrote:
> > Good point -- the UNLOCK and LOCK are guaranteed to be ordered only
> > if they both operate on the same lock variable. OK, I will make the
> > example use different lock variables and show the different outcomes.
> > How about the following?
> >
> > If it is necessary for an UNLOCK-LOCK pair to
> > produce a full barrier, the LOCK can be followed by an
> > smp_mb__after_unlock_lock() invocation. This will produce a
> > full barrier if either (a) the UNLOCK and the LOCK are executed
> > by the same CPU or task, or (b) the UNLOCK and LOCK act on the
> > same lock variable.
>
> So you're still requiring smp_mb__after_unlock_lock() even if they're on
> the same variable?
Yep!
> > The smp_mb__after_unlock_lock() primitive is
> > free on many architectures. Without smp_mb__after_unlock_lock(),
> > the UNLOCK and LOCK can cross:
>
> Contradicted below :-)
Good eyes! I changed this to:
The smp_mb__after_unlock_lock() primitive is free on many
architectures. Without smp_mb__after_unlock_lock(), the critical
sections corresponding to the UNLOCK and the LOCK can cross:
Is that better?
> > *A = a;
> > UNLOCK M
> > LOCK N
> > *B = b;
> >
> > could occur as:
> >
> > LOCK N, STORE *B, STORE *A, UNLOCK M
> >
> > With smp_mb__after_unlock_lock(), they cannot, so that:
> >
> > *A = a;
> > UNLOCK M
> > LOCK N
> > smp_mb__after_unlock_lock();
> > *B = b;
> >
> > will always occur as either of the following:
> >
> > STORE *A, UNLOCK, LOCK, STORE *B
> > STORE *A, LOCK, UNLOCK, STORE *B
>
> See, UNLOCK and LOCK can still cross :-)
Indeed they can! ;-)
> > If the UNLOCK and LOCK were instead both operating on the same
> > lock variable, only the first of these two alternatives can occur.
>
> Agreed.
>
> Sorry for being a pedant. :-)
;-) ;-) ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 5:26 ` Paul E. McKenney
@ 2013-12-10 18:53 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 18:53 UTC (permalink / raw)
To: Josh Triplett
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov, Linus Torvalds
On Mon, Dec 09, 2013 at 09:26:41PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 09, 2013 at 05:34:17PM -0800, Josh Triplett wrote:
> > On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > >
> > > The Linux kernel has traditionally required that an UNLOCK+LOCK pair
> > > act as a full memory barrier when either (1) that UNLOCK+LOCK pair
> > > was executed by the same CPU or task, or (2) the same lock variable
> > > was used for the UNLOCK and LOCK. It now seems likely that very few
> > > places in the kernel rely on this full-memory-barrier semantic, and
> > > with the advent of queued locks, providing this semantic either requires
> > > complex reasoning, or for some architectures, added overhead.
> > >
> > > This commit therefore adds a smp_mb__after_unlock_lock(), which may be
> > > placed after a LOCK primitive to restore the full-memory-barrier semantic.
> > > All definitions are currently no-ops, but will be upgraded for some
> > > architectures when queued locks arrive.
> > >
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Linux-Arch <linux-arch@vger.kernel.org>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >
> > It seems quite unfortunate that this isn't in some common location, and
> > then only overridden by architectures that need to do so.
>
> I was thinking that include/asm-generic/barrier.h was the place, but
> it is all-or-nothing, used by UP architectures, from what I can see.
> I figured that if there is such a common location, posting this patch
> might flush it out. I am not sure that this single definition is worth
> the creation of a common place -- or even this definition combined with
> smp_read_barrier_depends().
And of course the right place to put this is include/linux/spinlock.h,
the same place where smp_mb__before_spinlock() is defined. Exceptions
then go into the corresponding arch-specific spinlock.h files.
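Something like the following sketch, perhaps -- the #ifndef override
convention here is just my guess, not something from the posted patches:
	/* include/linux/spinlock.h */
	#ifndef smp_mb__after_unlock_lock
	#define smp_mb__after_unlock_lock()	do { } while (0)
	#endif
	/* arch/<arch>/include/asm/spinlock.h, only where needed */
	#define smp_mb__after_unlock_lock()	smp_mb()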
Much better that way, thank you for calling this out!
Thanx, Paul
> > More importantly: you document this earlier in the patch series than you
> > introduce it.
>
> Fair point, I reversed the order of those two patches.
>
> Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier
2013-12-10 17:45 ` Josh Triplett
@ 2013-12-10 20:11 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2013-12-10 20:11 UTC (permalink / raw)
To: Josh Triplett
Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
mathieu.desnoyers, niv, tglx, rostedt, dhowells, edumazet, darren,
fweisbec, sbw, Linux-Arch, Ingo Molnar, Oleg Nesterov,
Linus Torvalds
On Tue, Dec 10, 2013 at 09:45:08AM -0800, Josh Triplett wrote:
> On Tue, Dec 10, 2013 at 01:37:26PM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 09, 2013 at 05:28:02PM -0800, Paul E. McKenney wrote:
> > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > index f89da808ce31..abf645799991 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -84,4 +84,6 @@ do { \
> > > ___p1; \
> > > })
> > >
> > > +#define smp_mb__after_unlock_lock() do { } while (0)
> > > +
> > > #endif /* _ASM_POWERPC_BARRIER_H */
> >
> > Didn't Ben say that ppc actually violates the current unlock+lock
> > assumption, and therefore this barrier wouldn't actually be a no-op on ppc?
>
> Or, ppc could fix its lock primitives to preserve the unlock+lock
> assumption, and avoid subtle breakage across half the kernel.
Indeed. However, another motivation for this change was the difficulty
in proving that x86 really provided the equivalent of a full barrier
for the MCS lock handoff case:
http://www.spinics.net/lists/linux-mm/msg65653.html
Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2013-12-10 20:12 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-10 1:27 [PATCH v5 tip/core/locking] Memory-barrier documentation updates + smp_mb__after_unlock_lock() Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 1/7] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 2/7] Documentation/memory-barriers.txt: Add long atomic examples " Paul E. McKenney
2013-12-10 1:27 ` [PATCH v5 tip/core/locking 3/7] Documentation/memory-barriers.txt: Prohibit speculative writes Paul E. McKenney
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 4/7] Documentation/memory-barriers.txt: Document ACCESS_ONCE() Paul E. McKenney
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 5/7] Documentation/memory-barriers.txt: Downgrade UNLOCK+LOCK Paul E. McKenney
2013-12-10 1:32 ` Josh Triplett
2013-12-10 5:19 ` Paul E. McKenney
2013-12-10 13:14 ` Peter Zijlstra
2013-12-10 17:12 ` Paul E. McKenney
2013-12-10 17:25 ` Peter Zijlstra
2013-12-10 17:43 ` Josh Triplett
2013-12-10 18:05 ` Paul E. McKenney
2013-12-10 17:49 ` Paul E. McKenney
2013-12-10 17:43 ` Peter Zijlstra
2013-12-10 18:49 ` Paul E. McKenney
2013-12-10 16:44 ` Oleg Nesterov
2013-12-10 17:15 ` Paul E. McKenney
2013-12-10 17:35 ` Oleg Nesterov
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 6/7] locking: Add an smp_mb__after_unlock_lock() for UNLOCK+LOCK barrier Paul E. McKenney
2013-12-10 1:34 ` Josh Triplett
2013-12-10 5:26 ` Paul E. McKenney
2013-12-10 18:53 ` Paul E. McKenney
2013-12-10 12:37 ` Peter Zijlstra
2013-12-10 17:17 ` Paul E. McKenney
2013-12-10 17:45 ` Josh Triplett
2013-12-10 20:11 ` Paul E. McKenney
2013-12-10 17:04 ` Oleg Nesterov
2013-12-10 17:18 ` Paul E. McKenney
2013-12-10 17:32 ` Oleg Nesterov
2013-12-10 1:28 ` [PATCH v5 tip/core/locking 7/7] rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods Paul E. McKenney