[PATCH] barriers: introduce smp_mb__release_acquire and update documentation

linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
@ 2015-09-15 16:13 Will Deacon
  2015-09-15 17:47 ` Paul E. McKenney
  2015-09-16 11:49 ` Boqun Feng
  0 siblings, 2 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-15 16:13 UTC (permalink / raw)
  To: linux-arch; +Cc: linux-kernel, Will Deacon, Paul E. McKenney, Peter Zijlstra

As much as we'd like to live in a world where RELEASE -> ACQUIRE is
always cheaply ordered and can be used to construct UNLOCK -> LOCK
definitions with similar guarantees, the grim reality is that this isn't
even possible on x86 (thanks to Paul for bringing us crashing down to
Earth).

This patch handles the issue by introducing a new barrier macro,
smp_mb__release_acquire, that can be placed between a RELEASE and a
subsequent ACQUIRE operation in order to upgrade them to a full memory
barrier. At the moment, it doesn't have any users, so its existence
serves mainly as a documentation aid.

Documentation/memory-barriers.txt is updated to describe more clearly
the ACQUIRE and RELEASE ordering in this area and to show an example of
the new barrier in action.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---

Following our discussion at [1], I thought I'd try to write something
down...

[1] http://lkml.kernel.org/r/20150828104854.GB16853@twins.programming.kicks-ass.net

 Documentation/memory-barriers.txt  | 23 ++++++++++++++++++++++-
 arch/powerpc/include/asm/barrier.h |  1 +
 arch/x86/include/asm/barrier.h     |  2 ++
 include/asm-generic/barrier.h      |  4 ++++
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 2ba8461b0631..46a85abb77c6 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -459,11 +459,18 @@ And a couple of implicit varieties:
      RELEASE on that same variable are guaranteed to be visible.  In other
      words, within a given variable's critical section, all accesses of all
      previous critical sections for that variable are guaranteed to have
-     completed.
+     completed.  If the RELEASE and ACQUIRE operations act on independent
+     variables, an smp_mb__release_acquire() barrier can be placed between
+     them to upgrade the sequence to a full barrier.
 
      This means that ACQUIRE acts as a minimal "acquire" operation and
      RELEASE acts as a minimal "release" operation.
 
+A subset of the atomic operations described in atomic_ops.txt have ACQUIRE
+and RELEASE variants in addition to fully-ordered and relaxed definitions.
+For compound atomics performing both a load and a store, ACQUIRE semantics
+apply only to the load and RELEASE semantics only to the store portion of
+the operation.
 
 Memory barriers are only required where there's a possibility of interaction
 between two CPUs or between a CPU and a device.  If it can be guaranteed that
@@ -1895,6 +1902,20 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
+If necessary, ordering can be enforced by use of an
+smp_mb__release_acquire() barrier:
+
+	*A = a;
+	RELEASE M
+	smp_mb__release_acquire();
+	ACQUIRE N
+	*B = b;
+
+in which case, the only permitted sequences are:
+
+	STORE *A, RELEASE M, ACQUIRE N, STORE *B
+	STORE *A, ACQUIRE N, RELEASE M, STORE *B
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index 0eca6efc0631..919624634d0a 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -87,6 +87,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_mb__release_acquire()   smp_mb()
 #define smp_mb__before_atomic()     smp_mb()
 #define smp_mb__after_atomic()      smp_mb()
 #define smp_mb__before_spinlock()   smp_mb()
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0681d2532527..1c61ad251e0e 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -85,6 +85,8 @@ do {									\
 	___p1;								\
 })
 
+#define smp_mb__release_acquire()	smp_mb()
+
 #endif
 
 /* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index b42afada1280..61ae95199397 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -119,5 +119,9 @@ do {									\
 	___p1;								\
 })
 
+#ifndef smp_mb__release_acquire
+#define smp_mb__release_acquire()	do { } while (0)
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-15 16:13 [PATCH] barriers: introduce smp_mb__release_acquire and update documentation Will Deacon
@ 2015-09-15 17:47 ` Paul E. McKenney
  2015-09-16  9:14   ` Peter Zijlstra
  2015-09-16 11:49 ` Boqun Feng
  1 sibling, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-15 17:47 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-arch, linux-kernel, Peter Zijlstra

On Tue, Sep 15, 2015 at 05:13:30PM +0100, Will Deacon wrote:
> As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> always cheaply ordered and can be used to construct UNLOCK -> LOCK
> definitions with similar guarantees, the grim reality is that this isn't
> even possible on x86 (thanks to Paul for bringing us crashing down to
> Earth).

"It is a service that I provide."  ;-)

> This patch handles the issue by introducing a new barrier macro,
> smp_mb__release_acquire, that can be placed between a RELEASE and a
> subsequent ACQUIRE operation in order to upgrade them to a full memory
> barrier. At the moment, it doesn't have any users, so its existence
> serves mainly as a documentation aid.
> 
> Documentation/memory-barriers.txt is updated to describe more clearly
> the ACQUIRE and RELEASE ordering in this area and to show an example of
> the new barrier in action.
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Some questions and comments below.

							Thanx, Paul

> ---
> 
> Following our discussion at [1], I thought I'd try to write something
> down...
> 
> [1] http://lkml.kernel.org/r/20150828104854.GB16853@twins.programming.kicks-ass.net
> 
>  Documentation/memory-barriers.txt  | 23 ++++++++++++++++++++++-
>  arch/powerpc/include/asm/barrier.h |  1 +
>  arch/x86/include/asm/barrier.h     |  2 ++
>  include/asm-generic/barrier.h      |  4 ++++
>  4 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 2ba8461b0631..46a85abb77c6 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -459,11 +459,18 @@ And a couple of implicit varieties:
>       RELEASE on that same variable are guaranteed to be visible.  In other
>       words, within a given variable's critical section, all accesses of all
>       previous critical sections for that variable are guaranteed to have
> -     completed.
> +     completed.  If the RELEASE and ACQUIRE operations act on independent
> +     variables, an smp_mb__release_acquire() barrier can be placed between
> +     them to upgrade the sequence to a full barrier.
> 
>       This means that ACQUIRE acts as a minimal "acquire" operation and
>       RELEASE acts as a minimal "release" operation.
> 
> +A subset of the atomic operations described in atomic_ops.txt have ACQUIRE
> +and RELEASE variants in addition to fully-ordered and relaxed definitions.
> +For compound atomics performing both a load and a store, ACQUIRE semantics
> +apply only to the load and RELEASE semantics only to the store portion of
> +the operation.
> 
>  Memory barriers are only required where there's a possibility of interaction
>  between two CPUs or between a CPU and a device.  If it can be guaranteed that
> @@ -1895,6 +1902,20 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
>  	a sleep-unlock race, but the locking primitive needs to resolve
>  	such races properly in any case.
> 
> +If necessary, ordering can be enforced by use of an
> +smp_mb__release_acquire() barrier:
> +
> +	*A = a;
> +	RELEASE M
> +	smp_mb__release_acquire();
> +	ACQUIRE N
> +	*B = b;
> +
> +in which case, the only permitted sequences are:
> +
> +	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> +	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> +
>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
>  anything at all - especially with respect to I/O accesses - unless combined
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index 0eca6efc0631..919624634d0a 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -87,6 +87,7 @@ do {									\
>  	___p1;								\
>  })
> 
> +#define smp_mb__release_acquire()   smp_mb()

If we are handling locking the same as atomic acquire and release
operations, this could also be placed between the unlock and the lock.

However, independently of the unlock/lock case, this definition and
use of smp_mb__release_acquire() does not handle full ordering of a
release by one CPU and an acquire of that same variable by another.
In that case, we need roughly the same setup as the much-maligned
smp_mb__after_unlock_lock().  So, do we care about this case?  (RCU does,
though not 100% sure about any other subsystems.)

>  #define smp_mb__before_atomic()     smp_mb()
>  #define smp_mb__after_atomic()      smp_mb()
>  #define smp_mb__before_spinlock()   smp_mb()
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 0681d2532527..1c61ad251e0e 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -85,6 +85,8 @@ do {									\
>  	___p1;								\
>  })
> 
> +#define smp_mb__release_acquire()	smp_mb()
> +
>  #endif
> 
>  /* Atomic operations are already serializing on x86 */
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index b42afada1280..61ae95199397 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -119,5 +119,9 @@ do {									\
>  	___p1;								\
>  })
> 
> +#ifndef smp_mb__release_acquire
> +#define smp_mb__release_acquire()	do { } while (0)

Doesn't this need to be barrier() in the case where one variable was
released and another was acquired?

> +#endif
> +
>  #endif /* !__ASSEMBLY__ */
>  #endif /* __ASM_GENERIC_BARRIER_H */
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-15 17:47 ` Paul E. McKenney
@ 2015-09-16  9:14   ` Peter Zijlstra
  2015-09-16 10:29     ` Will Deacon
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-16  9:14 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, linux-arch, linux-kernel

On Tue, Sep 15, 2015 at 10:47:24AM -0700, Paul E. McKenney wrote:
> > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > index 0eca6efc0631..919624634d0a 100644
> > --- a/arch/powerpc/include/asm/barrier.h
> > +++ b/arch/powerpc/include/asm/barrier.h
> > @@ -87,6 +87,7 @@ do {									\
> >  	___p1;								\
> >  })
> > 
> > +#define smp_mb__release_acquire()   smp_mb()
> 
> If we are handling locking the same as atomic acquire and release
> operations, this could also be placed between the unlock and the lock.

I think the point was exactly that we need to separate LOCK/UNLOCK from
ACQUIRE/RELEASE.

> However, independently of the unlock/lock case, this definition and
> use of smp_mb__release_acquire() does not handle full ordering of a
> release by one CPU and an acquire of that same variable by another.

> In that case, we need roughly the same setup as the much-maligned
> smp_mb__after_unlock_lock().  So, do we care about this case?  (RCU does,
> though not 100% sure about any other subsystems.)

Indeed, that is a hole in the definition, that I think we should close.

> >  #define smp_mb__before_atomic()     smp_mb()
> >  #define smp_mb__after_atomic()      smp_mb()
> >  #define smp_mb__before_spinlock()   smp_mb()
> > diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> > index 0681d2532527..1c61ad251e0e 100644
> > --- a/arch/x86/include/asm/barrier.h
> > +++ b/arch/x86/include/asm/barrier.h
> > @@ -85,6 +85,8 @@ do {									\
> >  	___p1;								\
> >  })
> > 
> > +#define smp_mb__release_acquire()	smp_mb()
> > +
> >  #endif
> > 

All TSO archs would want this.

> >  /* Atomic operations are already serializing on x86 */
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index b42afada1280..61ae95199397 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -119,5 +119,9 @@ do {									\
> >  	___p1;								\
> >  })
> > 
> > +#ifndef smp_mb__release_acquire
> > +#define smp_mb__release_acquire()	do { } while (0)
> 
> Doesn't this need to be barrier() in the case where one variable was
> released and another was acquired?

Yes, I think its very prudent to never let any barrier degrade to less
than barrier().

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16  9:14   ` Peter Zijlstra
@ 2015-09-16 10:29     ` Will Deacon
  2015-09-16 10:29       ` Will Deacon
  2015-09-16 10:43       ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-16 10:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

Hi Paul, Peter,

Thanks for the comments. More below...

On Wed, Sep 16, 2015 at 10:14:52AM +0100, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 10:47:24AM -0700, Paul E. McKenney wrote:
> > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > index 0eca6efc0631..919624634d0a 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -87,6 +87,7 @@ do {									\
> > >  	___p1;								\
> > >  })
> > > 
> > > +#define smp_mb__release_acquire()   smp_mb()
> > 
> > If we are handling locking the same as atomic acquire and release
> > operations, this could also be placed between the unlock and the lock.
> 
> I think the point was exactly that we need to separate LOCK/UNLOCK from
> ACQUIRE/RELEASE.

Yes, pending the PPC investigation, I'd like to keep this separate for
now.

> > However, independently of the unlock/lock case, this definition and
> > use of smp_mb__release_acquire() does not handle full ordering of a
> > release by one CPU and an acquire of that same variable by another.
> 
> > In that case, we need roughly the same setup as the much-maligned
> > smp_mb__after_unlock_lock().  So, do we care about this case?  (RCU does,
> > though not 100% sure about any other subsystems.)
> 
> Indeed, that is a hole in the definition, that I think we should close.

I'm struggling to understand the hole, but here's my intuition. If an
ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
observe all memory accessed performed by CPUy prior to the RELEASE
before it observes the RELEASE itself, regardless of this new barrier.
I think this matches what we currently have in memory-barriers.txt (i.e.
acquire/release are neither transitive or multi-copy atomic).

Do we have use-cases that need these extra guarantees (outside of the
single RCU case, which is using smp_mb__after_unlock_lock)? I'd rather
not augment smp_mb__release_acquire unless we really have to, so I'd
prefer to document that it only applies when the RELEASE and ACQUIRE are
performed by the same CPU. Thoughts?

> > >  #define smp_mb__before_atomic()     smp_mb()
> > >  #define smp_mb__after_atomic()      smp_mb()
> > >  #define smp_mb__before_spinlock()   smp_mb()
> > > diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> > > index 0681d2532527..1c61ad251e0e 100644
> > > --- a/arch/x86/include/asm/barrier.h
> > > +++ b/arch/x86/include/asm/barrier.h
> > > @@ -85,6 +85,8 @@ do {									\
> > >  	___p1;								\
> > >  })
> > > 
> > > +#define smp_mb__release_acquire()	smp_mb()
> > > +
> > >  #endif
> > > 
> 
> All TSO archs would want this.

If we look at all architectures that implement smp_store_release without
an smp_mb already, we get:

  ia64
  powerpc
  s390
  sparc
  x86

so it should be enough to provide those with definitions. I'll do that
once we've settled on the documentation bits.

> > >  /* Atomic operations are already serializing on x86 */
> > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > index b42afada1280..61ae95199397 100644
> > > --- a/include/asm-generic/barrier.h
> > > +++ b/include/asm-generic/barrier.h
> > > @@ -119,5 +119,9 @@ do {									\
> > >  	___p1;								\
> > >  })
> > > 
> > > +#ifndef smp_mb__release_acquire
> > > +#define smp_mb__release_acquire()	do { } while (0)
> > 
> > Doesn't this need to be barrier() in the case where one variable was
> > released and another was acquired?
> 
> Yes, I think its very prudent to never let any barrier degrade to less
> than barrier().

Hey, I just copied read_barrier_depends from the same file! Both
smp_load_acquire and smp_store_release should already provide at least
barrier(), so the above should be sufficient.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 10:29     ` Will Deacon
@ 2015-09-16 10:29       ` Will Deacon
  2015-09-16 10:43       ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-16 10:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

Hi Paul, Peter,

Thanks for the comments. More below...

On Wed, Sep 16, 2015 at 10:14:52AM +0100, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 10:47:24AM -0700, Paul E. McKenney wrote:
> > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > index 0eca6efc0631..919624634d0a 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -87,6 +87,7 @@ do {									\
> > >  	___p1;								\
> > >  })
> > > 
> > > +#define smp_mb__release_acquire()   smp_mb()
> > 
> > If we are handling locking the same as atomic acquire and release
> > operations, this could also be placed between the unlock and the lock.
> 
> I think the point was exactly that we need to separate LOCK/UNLOCK from
> ACQUIRE/RELEASE.

Yes, pending the PPC investigation, I'd like to keep this separate for
now.

> > However, independently of the unlock/lock case, this definition and
> > use of smp_mb__release_acquire() does not handle full ordering of a
> > release by one CPU and an acquire of that same variable by another.
> 
> > In that case, we need roughly the same setup as the much-maligned
> > smp_mb__after_unlock_lock().  So, do we care about this case?  (RCU does,
> > though not 100% sure about any other subsystems.)
> 
> Indeed, that is a hole in the definition, that I think we should close.

I'm struggling to understand the hole, but here's my intuition. If an
ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
observe all memory accessed performed by CPUy prior to the RELEASE
before it observes the RELEASE itself, regardless of this new barrier.
I think this matches what we currently have in memory-barriers.txt (i.e.
acquire/release are neither transitive or multi-copy atomic).

Do we have use-cases that need these extra guarantees (outside of the
single RCU case, which is using smp_mb__after_unlock_lock)? I'd rather
not augment smp_mb__release_acquire unless we really have to, so I'd
prefer to document that it only applies when the RELEASE and ACQUIRE are
performed by the same CPU. Thoughts?

> > >  #define smp_mb__before_atomic()     smp_mb()
> > >  #define smp_mb__after_atomic()      smp_mb()
> > >  #define smp_mb__before_spinlock()   smp_mb()
> > > diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> > > index 0681d2532527..1c61ad251e0e 100644
> > > --- a/arch/x86/include/asm/barrier.h
> > > +++ b/arch/x86/include/asm/barrier.h
> > > @@ -85,6 +85,8 @@ do {									\
> > >  	___p1;								\
> > >  })
> > > 
> > > +#define smp_mb__release_acquire()	smp_mb()
> > > +
> > >  #endif
> > > 
> 
> All TSO archs would want this.

If we look at all architectures that implement smp_store_release without
an smp_mb already, we get:

  ia64
  powerpc
  s390
  sparc
  x86

so it should be enough to provide those with definitions. I'll do that
once we've settled on the documentation bits.

> > >  /* Atomic operations are already serializing on x86 */
> > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > index b42afada1280..61ae95199397 100644
> > > --- a/include/asm-generic/barrier.h
> > > +++ b/include/asm-generic/barrier.h
> > > @@ -119,5 +119,9 @@ do {									\
> > >  	___p1;								\
> > >  })
> > > 
> > > +#ifndef smp_mb__release_acquire
> > > +#define smp_mb__release_acquire()	do { } while (0)
> > 
> > Doesn't this need to be barrier() in the case where one variable was
> > released and another was acquired?
> 
> Yes, I think its very prudent to never let any barrier degrade to less
> than barrier().

Hey, I just copied read_barrier_depends from the same file! Both
smp_load_acquire and smp_store_release should already provide at least
barrier(), so the above should be sufficient.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 10:29     ` Will Deacon
  2015-09-16 10:29       ` Will Deacon
@ 2015-09-16 10:43       ` Peter Zijlstra
  2015-09-16 10:43         ` Peter Zijlstra
  2015-09-16 11:07         ` Will Deacon
  1 sibling, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-16 10:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > Indeed, that is a hole in the definition, that I think we should close.

> I'm struggling to understand the hole, but here's my intuition. If an
> ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> observe all memory accessed performed by CPUy prior to the RELEASE
> before it observes the RELEASE itself, regardless of this new barrier.
> I think this matches what we currently have in memory-barriers.txt (i.e.
> acquire/release are neither transitive or multi-copy atomic).

Ah agreed. I seem to have gotten my brain in a tangle.

Basically where a program order release+acquire relies on an address
dependency, a cross cpu release+acquire relies on causality. If we
observe the release, we must also observe everything prior to it etc.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 10:43       ` Peter Zijlstra
@ 2015-09-16 10:43         ` Peter Zijlstra
  2015-09-16 11:07         ` Will Deacon
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-16 10:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > Indeed, that is a hole in the definition, that I think we should close.

> I'm struggling to understand the hole, but here's my intuition. If an
> ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> observe all memory accessed performed by CPUy prior to the RELEASE
> before it observes the RELEASE itself, regardless of this new barrier.
> I think this matches what we currently have in memory-barriers.txt (i.e.
> acquire/release are neither transitive or multi-copy atomic).

Ah agreed. I seem to have gotten my brain in a tangle.

Basically where a program order release+acquire relies on an address
dependency, a cross cpu release+acquire relies on causality. If we
observe the release, we must also observe everything prior to it etc.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 10:43       ` Peter Zijlstra
  2015-09-16 10:43         ` Peter Zijlstra
@ 2015-09-16 11:07         ` Will Deacon
  2015-09-16 11:07           ` Will Deacon
  2015-09-17  2:50           ` Boqun Feng
  1 sibling, 2 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-16 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > Indeed, that is a hole in the definition, that I think we should close.
> 
> > I'm struggling to understand the hole, but here's my intuition. If an
> > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > observe all memory accessed performed by CPUy prior to the RELEASE
> > before it observes the RELEASE itself, regardless of this new barrier.
> > I think this matches what we currently have in memory-barriers.txt (i.e.
> > acquire/release are neither transitive or multi-copy atomic).
> 
> Ah agreed. I seem to have gotten my brain in a tangle.
> 
> Basically where a program order release+acquire relies on an address
> dependency, a cross cpu release+acquire relies on causality. If we
> observe the release, we must also observe everything prior to it etc.

Yes, and crucially, the "everything prior to it" only encompasses accesses
made by the releasing CPU itself (in the absence of other barriers and
synchronisation).

Given that we managed to get confused, it doesn't hurt to call this out
explicitly in the doc, so I can add the following extra text.

Will

--->8

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 46a85abb77c6..794d102d06df 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
-If necessary, ordering can be enforced by use of an
-smp_mb__release_acquire() barrier:
+Where the RELEASE and ACQUIRE operations are performed by the same CPU,
+ordering can be enforced by use of an smp_mb__release_acquire() barrier:
 
 	*A = a;
 	RELEASE M
@@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
 	STORE *A, RELEASE M, ACQUIRE N, STORE *B
 	STORE *A, ACQUIRE N, RELEASE M, STORE *B
 
+Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
+operations performed by other CPUs, even if they are to the same variable.
+In cases where transitivity is required, smp_mb() should be used explicitly.
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 11:07         ` Will Deacon
@ 2015-09-16 11:07           ` Will Deacon
  2015-09-17  2:50           ` Boqun Feng
  1 sibling, 0 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-16 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > Indeed, that is a hole in the definition, that I think we should close.
> 
> > I'm struggling to understand the hole, but here's my intuition. If an
> > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > observe all memory accessed performed by CPUy prior to the RELEASE
> > before it observes the RELEASE itself, regardless of this new barrier.
> > I think this matches what we currently have in memory-barriers.txt (i.e.
> > acquire/release are neither transitive or multi-copy atomic).
> 
> Ah agreed. I seem to have gotten my brain in a tangle.
> 
> Basically where a program order release+acquire relies on an address
> dependency, a cross cpu release+acquire relies on causality. If we
> observe the release, we must also observe everything prior to it etc.

Yes, and crucially, the "everything prior to it" only encompasses accesses
made by the releasing CPU itself (in the absence of other barriers and
synchronisation).

Given that we managed to get confused, it doesn't hurt to call this out
explicitly in the doc, so I can add the following extra text.

Will

--->8

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 46a85abb77c6..794d102d06df 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
-If necessary, ordering can be enforced by use of an
-smp_mb__release_acquire() barrier:
+Where the RELEASE and ACQUIRE operations are performed by the same CPU,
+ordering can be enforced by use of an smp_mb__release_acquire() barrier:
 
 	*A = a;
 	RELEASE M
@@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
 	STORE *A, RELEASE M, ACQUIRE N, STORE *B
 	STORE *A, ACQUIRE N, RELEASE M, STORE *B
 
+Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
+operations performed by other CPUs, even if they are to the same variable.
+In cases where transitivity is required, smp_mb() should be used explicitly.
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 11:07         ` Will Deacon
  2015-09-16 11:07           ` Will Deacon
@ 2015-09-17  2:50           ` Boqun Feng
  2015-09-17  7:57             ` Boqun Feng
  2015-09-17 18:00             ` Will Deacon
  1 sibling, 2 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-17  2:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4396 bytes --]

On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > Indeed, that is a hole in the definition, that I think we should close.
> > 
> > > I'm struggling to understand the hole, but here's my intuition. If an
> > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > before it observes the RELEASE itself, regardless of this new barrier.
> > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > acquire/release are neither transitive or multi-copy atomic).
> > 
> > Ah agreed. I seem to have gotten my brain in a tangle.
> > 
> > Basically where a program order release+acquire relies on an address
> > dependency, a cross cpu release+acquire relies on causality. If we
> > observe the release, we must also observe everything prior to it etc.
> 
> Yes, and crucially, the "everything prior to it" only encompasses accesses
> made by the releasing CPU itself (in the absence of other barriers and
> synchronisation).
> 

Just want to make sure I understand you correctly, do you mean that in
the following case:

CPU 1			CPU 2				CPU 3
==============		============================	===============
{ A = 0, B = 0 }
WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
			smp_store_release(&B, 1);	r3 = READ_ONCE(A);

r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?

However, according to the discussion of Paul and Peter:

https://lkml.org/lkml/2015/9/15/707

I think that's prohibitted on architectures except s390 for sure. And
for s390, we are waiting for the maintainers to verify this. If s390
also prohibits this, then a release-acquire pair(on different CPUs) to
the same variable does guarantee transitivity.

Did I misunderstand you or miss something here?

> Given that we managed to get confused, it doesn't hurt to call this out
> explicitly in the doc, so I can add the following extra text.
> 
> Will
> 
> --->8
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 46a85abb77c6..794d102d06df 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
>  	a sleep-unlock race, but the locking primitive needs to resolve
>  	such races properly in any case.
>  
> -If necessary, ordering can be enforced by use of an
> -smp_mb__release_acquire() barrier:
> +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
>  
>  	*A = a;
>  	RELEASE M
> @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
>  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
>  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
>  
> +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> +operations performed by other CPUs, even if they are to the same variable.
> +In cases where transitivity is required, smp_mb() should be used explicitly.
> +

Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:

If an ACQUIRE loads the value of stored by a RELEASE, then on the CPU
executing the ACQUIRE operation, all the memory operations after the
ACQUIRE operation will perceive all the memory operations before the
RELEASE operation on the CPU executing the RELEASE operation.

This could cover both the "on the same CPU" and "on different CPUs"
cases.

Of course, this may has nothing to do with smp_mb__release_acquire(),
but I think we can take this chance to document the memory order effect
of RELEASE+ACQUIRE well.


Regards,
Boqun

>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
>  anything at all - especially with respect to I/O accesses - unless combined
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-17  2:50           ` Boqun Feng
@ 2015-09-17  7:57             ` Boqun Feng
  2015-09-17  7:57               ` Boqun Feng
  2015-09-17 18:00             ` Will Deacon
  1 sibling, 1 reply; 28+ messages in thread
From: Boqun Feng @ 2015-09-17  7:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5147 bytes --]

On Thu, Sep 17, 2015 at 10:50:12AM +0800, Boqun Feng wrote:
> On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > 
> > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > acquire/release are neither transitive or multi-copy atomic).
> > > 
> > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > 
> > > Basically where a program order release+acquire relies on an address
> > > dependency, a cross cpu release+acquire relies on causality. If we
> > > observe the release, we must also observe everything prior to it etc.
> > 
> > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > made by the releasing CPU itself (in the absence of other barriers and
> > synchronisation).
> > 
> 
> Just want to make sure I understand you correctly, do you mean that in
> the following case:
> 
> CPU 1			CPU 2				CPU 3
> ==============		============================	===============
> { A = 0, B = 0 }
> WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> 
> r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> 
> However, according to the discussion of Paul and Peter:
> 
> https://lkml.org/lkml/2015/9/15/707
> 
> I think that's prohibitted on architectures except s390 for sure. And
> for s390, we are waiting for the maintainers to verify this. If s390
> also prohibits this, then a release-acquire pair(on different CPUs) to
> the same variable does guarantee transitivity.
> 
> Did I misunderstand you or miss something here?
> 
> > Given that we managed to get confused, it doesn't hurt to call this out
> > explicitly in the doc, so I can add the following extra text.
> > 
> > Will
> > 
> > --->8
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 46a85abb77c6..794d102d06df 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> >  	a sleep-unlock race, but the locking primitive needs to resolve
> >  	such races properly in any case.
> >  
> > -If necessary, ordering can be enforced by use of an
> > -smp_mb__release_acquire() barrier:
> > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> >  
> >  	*A = a;
> >  	RELEASE M
> > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> >  
> > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > +operations performed by other CPUs, even if they are to the same variable.
> > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > +
> 
> Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
> 
> If an ACQUIRE loads the value of stored by a RELEASE, then on the CPU
> executing the ACQUIRE operation, all the memory operations after the
> ACQUIRE operation will perceive all the memory operations before the
> RELEASE operation on the CPU executing the RELEASE operation.
> 

Ah.. I think I lost my mind while writting this. Should be:

If an ACQUIRE loads the value of stored by a RELEASE, then after the
ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
all the memory operations that have been perceived by the CPU executing
the RELEASE operation before the RELEASE operation. 

Which means a release+acquire pair to the same variable guarantees
transitivity.

Sorry for the misleading paragraph..

Regards,
Boqun

> This could cover both the "on the same CPU" and "on different CPUs"
> cases.
> 
> Of course, this may has nothing to do with smp_mb__release_acquire(),
> but I think we can take this chance to document the memory order effect
> of RELEASE+ACQUIRE well.
> 
> 
> Regards,
> Boqun
> 
> >  Locks and semaphores may not provide any guarantee of ordering on UP compiled
> >  systems, and so cannot be counted on in such a situation to actually achieve
> >  anything at all - especially with respect to I/O accesses - unless combined
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-17  7:57             ` Boqun Feng
@ 2015-09-17  7:57               ` Boqun Feng
  0 siblings, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-17  7:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5147 bytes --]

On Thu, Sep 17, 2015 at 10:50:12AM +0800, Boqun Feng wrote:
> On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > 
> > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > acquire/release are neither transitive or multi-copy atomic).
> > > 
> > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > 
> > > Basically where a program order release+acquire relies on an address
> > > dependency, a cross cpu release+acquire relies on causality. If we
> > > observe the release, we must also observe everything prior to it etc.
> > 
> > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > made by the releasing CPU itself (in the absence of other barriers and
> > synchronisation).
> > 
> 
> Just want to make sure I understand you correctly, do you mean that in
> the following case:
> 
> CPU 1			CPU 2				CPU 3
> ==============		============================	===============
> { A = 0, B = 0 }
> WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> 
> r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> 
> However, according to the discussion of Paul and Peter:
> 
> https://lkml.org/lkml/2015/9/15/707
> 
> I think that's prohibitted on architectures except s390 for sure. And
> for s390, we are waiting for the maintainers to verify this. If s390
> also prohibits this, then a release-acquire pair(on different CPUs) to
> the same variable does guarantee transitivity.
> 
> Did I misunderstand you or miss something here?
> 
> > Given that we managed to get confused, it doesn't hurt to call this out
> > explicitly in the doc, so I can add the following extra text.
> > 
> > Will
> > 
> > --->8
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 46a85abb77c6..794d102d06df 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> >  	a sleep-unlock race, but the locking primitive needs to resolve
> >  	such races properly in any case.
> >  
> > -If necessary, ordering can be enforced by use of an
> > -smp_mb__release_acquire() barrier:
> > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> >  
> >  	*A = a;
> >  	RELEASE M
> > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> >  
> > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > +operations performed by other CPUs, even if they are to the same variable.
> > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > +
> 
> Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
> 
> If an ACQUIRE loads the value of stored by a RELEASE, then on the CPU
> executing the ACQUIRE operation, all the memory operations after the
> ACQUIRE operation will perceive all the memory operations before the
> RELEASE operation on the CPU executing the RELEASE operation.
> 

Ah.. I think I lost my mind while writting this. Should be:

If an ACQUIRE loads the value of stored by a RELEASE, then after the
ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
all the memory operations that have been perceived by the CPU executing
the RELEASE operation before the RELEASE operation. 

Which means a release+acquire pair to the same variable guarantees
transitivity.

Sorry for the misleading paragraph..

Regards,
Boqun

> This could cover both the "on the same CPU" and "on different CPUs"
> cases.
> 
> Of course, this may has nothing to do with smp_mb__release_acquire(),
> but I think we can take this chance to document the memory order effect
> of RELEASE+ACQUIRE well.
> 
> 
> Regards,
> Boqun
> 
> >  Locks and semaphores may not provide any guarantee of ordering on UP compiled
> >  systems, and so cannot be counted on in such a situation to actually achieve
> >  anything at all - especially with respect to I/O accesses - unless combined
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-17  2:50           ` Boqun Feng
  2015-09-17  7:57             ` Boqun Feng
@ 2015-09-17 18:00             ` Will Deacon
  2015-09-21 13:45               ` Boqun Feng
  1 sibling, 1 reply; 28+ messages in thread
From: Will Deacon @ 2015-09-17 18:00 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > 
> > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > acquire/release are neither transitive or multi-copy atomic).
> > > 
> > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > 
> > > Basically where a program order release+acquire relies on an address
> > > dependency, a cross cpu release+acquire relies on causality. If we
> > > observe the release, we must also observe everything prior to it etc.
> > 
> > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > made by the releasing CPU itself (in the absence of other barriers and
> > synchronisation).
> > 
> 
> Just want to make sure I understand you correctly, do you mean that in
> the following case:
> 
> CPU 1			CPU 2				CPU 3
> ==============		============================	===============
> { A = 0, B = 0 }
> WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> 
> r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> 
> However, according to the discussion of Paul and Peter:
> 
> https://lkml.org/lkml/2015/9/15/707
> 
> I think that's prohibitted on architectures except s390 for sure. And
> for s390, we are waiting for the maintainers to verify this. If s390
> also prohibits this, then a release-acquire pair(on different CPUs) to
> the same variable does guarantee transitivity.
> 
> Did I misunderstand you or miss something here?

That certainly works on arm and arm64, so if it works everywhere else too,
then we can strengthen this (but see below).

> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 46a85abb77c6..794d102d06df 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> >  	a sleep-unlock race, but the locking primitive needs to resolve
> >  	such races properly in any case.
> >  
> > -If necessary, ordering can be enforced by use of an
> > -smp_mb__release_acquire() barrier:
> > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> >  
> >  	*A = a;
> >  	RELEASE M
> > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> >  
> > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > +operations performed by other CPUs, even if they are to the same variable.
> > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > +
> 
> Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:

[updated from your reply]

> If an ACQUIRE loads the value of stored by a RELEASE, then after the
> ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> all the memory operations that have been perceived by the CPU executing
> the RELEASE operation before the RELEASE operation. 
> 
> Which means a release+acquire pair to the same variable guarantees
> transitivity.

Almost, but on arm64 at least, "all the memory operations" above doesn't
include reads by other CPUs. I'm struggling to figure out whether that's
actually an issue.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-17 18:00             ` Will Deacon
@ 2015-09-21 13:45               ` Boqun Feng
  2015-09-21 13:45                 ` Boqun Feng
  2015-09-21 14:10                 ` Boqun Feng
  0 siblings, 2 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-21 13:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5620 bytes --]

On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > > 
> > > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > > acquire/release are neither transitive or multi-copy atomic).
> > > > 
> > > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > > 
> > > > Basically where a program order release+acquire relies on an address
> > > > dependency, a cross cpu release+acquire relies on causality. If we
> > > > observe the release, we must also observe everything prior to it etc.
> > > 
> > > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > > made by the releasing CPU itself (in the absence of other barriers and
> > > synchronisation).
> > > 
> > 
> > Just want to make sure I understand you correctly, do you mean that in
> > the following case:
> > 
> > CPU 1			CPU 2				CPU 3
> > ==============		============================	===============
> > { A = 0, B = 0 }
> > WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> > 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> > 
> > r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> > 
> > However, according to the discussion of Paul and Peter:
> > 
> > https://lkml.org/lkml/2015/9/15/707
> > 
> > I think that's prohibitted on architectures except s390 for sure. And
> > for s390, we are waiting for the maintainers to verify this. If s390
> > also prohibits this, then a release-acquire pair(on different CPUs) to
> > the same variable does guarantee transitivity.
> > 
> > Did I misunderstand you or miss something here?
> 
> That certainly works on arm and arm64, so if it works everywhere else too,
> then we can strengthen this (but see below).
> 
> > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > index 46a85abb77c6..794d102d06df 100644
> > > --- a/Documentation/memory-barriers.txt
> > > +++ b/Documentation/memory-barriers.txt
> > > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> > >  	a sleep-unlock race, but the locking primitive needs to resolve
> > >  	such races properly in any case.
> > >  
> > > -If necessary, ordering can be enforced by use of an
> > > -smp_mb__release_acquire() barrier:
> > > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> > >  
> > >  	*A = a;
> > >  	RELEASE M
> > > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> > >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> > >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> > >  
> > > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > > +operations performed by other CPUs, even if they are to the same variable.
> > > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > > +
> > 
> > Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
> 
> [updated from your reply]
> 
> > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > all the memory operations that have been perceived by the CPU executing
> > the RELEASE operation before the RELEASE operation. 
> > 
> > Which means a release+acquire pair to the same variable guarantees
> > transitivity.
> 
> Almost, but on arm64 at least, "all the memory operations" above doesn't
> include reads by other CPUs. I'm struggling to figure out whether that's
> actually an issue.
> 

Ah.. that's indeed an issue! for example:

CPU 0			CPU 1				CPU 2
=====================	==========================	================
{a = 0, b = 0, c = 0}
r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
r2 = READ_ONCE(b)

where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
least on POWER.

However, I think that doens't mean a release+acquire pair to the same
variable doesn't guarantee transitivity, because the transitivity is
actually broken at the smp_rmb(). But yes, my document is incorrect.
How about:

If an ACQUIRE loads the value of stored by a RELEASE, then after the
ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
all the memory operations that have been perceived by the CPU executing
the RELEASE operation *transitively* before the RELEASE operation. 
("transitively before" means that a memory operation is either executed
on the same CPU before the other, or guaranteed executed before the
other by a transitive barrier).

Which means a release+acquire pair to the same variable guarantees
transitivity.


Maybe we can avoid to use term "transitively before" here, but it's not
bad to distinguish different kinds of "before"s.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 13:45               ` Boqun Feng
@ 2015-09-21 13:45                 ` Boqun Feng
  2015-09-21 14:10                 ` Boqun Feng
  1 sibling, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-21 13:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5620 bytes --]

On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > > 
> > > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > > acquire/release are neither transitive or multi-copy atomic).
> > > > 
> > > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > > 
> > > > Basically where a program order release+acquire relies on an address
> > > > dependency, a cross cpu release+acquire relies on causality. If we
> > > > observe the release, we must also observe everything prior to it etc.
> > > 
> > > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > > made by the releasing CPU itself (in the absence of other barriers and
> > > synchronisation).
> > > 
> > 
> > Just want to make sure I understand you correctly, do you mean that in
> > the following case:
> > 
> > CPU 1			CPU 2				CPU 3
> > ==============		============================	===============
> > { A = 0, B = 0 }
> > WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> > 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> > 
> > r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> > 
> > However, according to the discussion of Paul and Peter:
> > 
> > https://lkml.org/lkml/2015/9/15/707
> > 
> > I think that's prohibitted on architectures except s390 for sure. And
> > for s390, we are waiting for the maintainers to verify this. If s390
> > also prohibits this, then a release-acquire pair(on different CPUs) to
> > the same variable does guarantee transitivity.
> > 
> > Did I misunderstand you or miss something here?
> 
> That certainly works on arm and arm64, so if it works everywhere else too,
> then we can strengthen this (but see below).
> 
> > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > index 46a85abb77c6..794d102d06df 100644
> > > --- a/Documentation/memory-barriers.txt
> > > +++ b/Documentation/memory-barriers.txt
> > > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> > >  	a sleep-unlock race, but the locking primitive needs to resolve
> > >  	such races properly in any case.
> > >  
> > > -If necessary, ordering can be enforced by use of an
> > > -smp_mb__release_acquire() barrier:
> > > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> > >  
> > >  	*A = a;
> > >  	RELEASE M
> > > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> > >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> > >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> > >  
> > > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > > +operations performed by other CPUs, even if they are to the same variable.
> > > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > > +
> > 
> > Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
> 
> [updated from your reply]
> 
> > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > all the memory operations that have been perceived by the CPU executing
> > the RELEASE operation before the RELEASE operation. 
> > 
> > Which means a release+acquire pair to the same variable guarantees
> > transitivity.
> 
> Almost, but on arm64 at least, "all the memory operations" above doesn't
> include reads by other CPUs. I'm struggling to figure out whether that's
> actually an issue.
> 

Ah.. that's indeed an issue! for example:

CPU 0			CPU 1				CPU 2
=====================	==========================	================
{a = 0, b = 0, c = 0}
r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
r2 = READ_ONCE(b)

where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
least on POWER.

However, I think that doens't mean a release+acquire pair to the same
variable doesn't guarantee transitivity, because the transitivity is
actually broken at the smp_rmb(). But yes, my document is incorrect.
How about:

If an ACQUIRE loads the value of stored by a RELEASE, then after the
ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
all the memory operations that have been perceived by the CPU executing
the RELEASE operation *transitively* before the RELEASE operation. 
("transitively before" means that a memory operation is either executed
on the same CPU before the other, or guaranteed executed before the
other by a transitive barrier).

Which means a release+acquire pair to the same variable guarantees
transitivity.


Maybe we can avoid to use term "transitively before" here, but it's not
bad to distinguish different kinds of "before"s.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 13:45               ` Boqun Feng
  2015-09-21 13:45                 ` Boqun Feng
@ 2015-09-21 14:10                 ` Boqun Feng
  2015-09-21 14:10                   ` Boqun Feng
  2015-09-21 22:23                   ` Will Deacon
  1 sibling, 2 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-21 14:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 6175 bytes --]

On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > > > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > > > 
> > > > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > > > acquire/release are neither transitive or multi-copy atomic).
> > > > > 
> > > > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > > > 
> > > > > Basically where a program order release+acquire relies on an address
> > > > > dependency, a cross cpu release+acquire relies on causality. If we
> > > > > observe the release, we must also observe everything prior to it etc.
> > > > 
> > > > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > > > made by the releasing CPU itself (in the absence of other barriers and
> > > > synchronisation).
> > > > 
> > > 
> > > Just want to make sure I understand you correctly, do you mean that in
> > > the following case:
> > > 
> > > CPU 1			CPU 2				CPU 3
> > > ==============		============================	===============
> > > { A = 0, B = 0 }
> > > WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> > > 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> > > 
> > > r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> > > 
> > > However, according to the discussion of Paul and Peter:
> > > 
> > > https://lkml.org/lkml/2015/9/15/707
> > > 
> > > I think that's prohibitted on architectures except s390 for sure. And
> > > for s390, we are waiting for the maintainers to verify this. If s390
> > > also prohibits this, then a release-acquire pair(on different CPUs) to
> > > the same variable does guarantee transitivity.
> > > 
> > > Did I misunderstand you or miss something here?
> > 
> > That certainly works on arm and arm64, so if it works everywhere else too,
> > then we can strengthen this (but see below).
> > 
> > > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > > index 46a85abb77c6..794d102d06df 100644
> > > > --- a/Documentation/memory-barriers.txt
> > > > +++ b/Documentation/memory-barriers.txt
> > > > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> > > >  	a sleep-unlock race, but the locking primitive needs to resolve
> > > >  	such races properly in any case.
> > > >  
> > > > -If necessary, ordering can be enforced by use of an
> > > > -smp_mb__release_acquire() barrier:
> > > > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > > > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> > > >  
> > > >  	*A = a;
> > > >  	RELEASE M
> > > > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> > > >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> > > >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> > > >  
> > > > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > > > +operations performed by other CPUs, even if they are to the same variable.
> > > > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > > > +
> > > 
> > > Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
> > 
> > [updated from your reply]
> > 
> > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > all the memory operations that have been perceived by the CPU executing
> > > the RELEASE operation before the RELEASE operation. 
> > > 
> > > Which means a release+acquire pair to the same variable guarantees
> > > transitivity.
> > 
> > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > include reads by other CPUs. I'm struggling to figure out whether that's
> > actually an issue.
> > 
> 
> Ah.. that's indeed an issue! for example:
> 
> CPU 0			CPU 1				CPU 2
> =====================	==========================	================
> {a = 0, b = 0, c = 0}
> r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> r2 = READ_ONCE(b)
> 
> where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> least on POWER.
> 

Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
for the misleading. How about the behavior of that on arm and arm64?

If prohibitted too, please ignore below.

Apologies again for that..

Regards,
Boqun

> However, I think that doens't mean a release+acquire pair to the same
> variable doesn't guarantee transitivity, because the transitivity is
> actually broken at the smp_rmb(). But yes, my document is incorrect.
> How about:
> 
> If an ACQUIRE loads the value of stored by a RELEASE, then after the
> ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> all the memory operations that have been perceived by the CPU executing
> the RELEASE operation *transitively* before the RELEASE operation. 
> ("transitively before" means that a memory operation is either executed
> on the same CPU before the other, or guaranteed executed before the
> other by a transitive barrier).
> 
> Which means a release+acquire pair to the same variable guarantees
> transitivity.
> 
> 
> Maybe we can avoid to use term "transitively before" here, but it's not
> bad to distinguish different kinds of "before"s.
> 
> Regards,
> Boqun



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 14:10                 ` Boqun Feng
@ 2015-09-21 14:10                   ` Boqun Feng
  2015-09-21 22:23                   ` Will Deacon
  1 sibling, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-21 14:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 6175 bytes --]

On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> > > > On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> > > > > > > Indeed, that is a hole in the definition, that I think we should close.
> > > > > 
> > > > > > I'm struggling to understand the hole, but here's my intuition. If an
> > > > > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > > > > observe all memory accessed performed by CPUy prior to the RELEASE
> > > > > > before it observes the RELEASE itself, regardless of this new barrier.
> > > > > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > > > > acquire/release are neither transitive or multi-copy atomic).
> > > > > 
> > > > > Ah agreed. I seem to have gotten my brain in a tangle.
> > > > > 
> > > > > Basically where a program order release+acquire relies on an address
> > > > > dependency, a cross cpu release+acquire relies on causality. If we
> > > > > observe the release, we must also observe everything prior to it etc.
> > > > 
> > > > Yes, and crucially, the "everything prior to it" only encompasses accesses
> > > > made by the releasing CPU itself (in the absence of other barriers and
> > > > synchronisation).
> > > > 
> > > 
> > > Just want to make sure I understand you correctly, do you mean that in
> > > the following case:
> > > 
> > > CPU 1			CPU 2				CPU 3
> > > ==============		============================	===============
> > > { A = 0, B = 0 }
> > > WRITE_ONCE(A,1);	r1 = READ_ONCE(A);		r2 = smp_load_acquire(&B);
> > > 			smp_store_release(&B, 1);	r3 = READ_ONCE(A);
> > > 
> > > r1 == 1 && r2 == 1 && r3 == 0 is not prohibitted?
> > > 
> > > However, according to the discussion of Paul and Peter:
> > > 
> > > https://lkml.org/lkml/2015/9/15/707
> > > 
> > > I think that's prohibitted on architectures except s390 for sure. And
> > > for s390, we are waiting for the maintainers to verify this. If s390
> > > also prohibits this, then a release-acquire pair(on different CPUs) to
> > > the same variable does guarantee transitivity.
> > > 
> > > Did I misunderstand you or miss something here?
> > 
> > That certainly works on arm and arm64, so if it works everywhere else too,
> > then we can strengthen this (but see below).
> > 
> > > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > > index 46a85abb77c6..794d102d06df 100644
> > > > --- a/Documentation/memory-barriers.txt
> > > > +++ b/Documentation/memory-barriers.txt
> > > > @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> > > >  	a sleep-unlock race, but the locking primitive needs to resolve
> > > >  	such races properly in any case.
> > > >  
> > > > -If necessary, ordering can be enforced by use of an
> > > > -smp_mb__release_acquire() barrier:
> > > > +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> > > > +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
> > > >  
> > > >  	*A = a;
> > > >  	RELEASE M
> > > > @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
> > > >  	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> > > >  	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> > > >  
> > > > +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> > > > +operations performed by other CPUs, even if they are to the same variable.
> > > > +In cases where transitivity is required, smp_mb() should be used explicitly.
> > > > +
> > > 
> > > Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
> > 
> > [updated from your reply]
> > 
> > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > all the memory operations that have been perceived by the CPU executing
> > > the RELEASE operation before the RELEASE operation. 
> > > 
> > > Which means a release+acquire pair to the same variable guarantees
> > > transitivity.
> > 
> > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > include reads by other CPUs. I'm struggling to figure out whether that's
> > actually an issue.
> > 
> 
> Ah.. that's indeed an issue! for example:
> 
> CPU 0			CPU 1				CPU 2
> =====================	==========================	================
> {a = 0, b = 0, c = 0}
> r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> r2 = READ_ONCE(b)
> 
> where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> least on POWER.
> 

Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
for the misleading. How about the behavior of that on arm and arm64?

If prohibitted too, please ignore below.

Apologies again for that..

Regards,
Boqun

> However, I think that doens't mean a release+acquire pair to the same
> variable doesn't guarantee transitivity, because the transitivity is
> actually broken at the smp_rmb(). But yes, my document is incorrect.
> How about:
> 
> If an ACQUIRE loads the value of stored by a RELEASE, then after the
> ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> all the memory operations that have been perceived by the CPU executing
> the RELEASE operation *transitively* before the RELEASE operation. 
> ("transitively before" means that a memory operation is either executed
> on the same CPU before the other, or guaranteed executed before the
> other by a transitive barrier).
> 
> Which means a release+acquire pair to the same variable guarantees
> transitivity.
> 
> 
> Maybe we can avoid to use term "transitively before" here, but it's not
> bad to distinguish different kinds of "before"s.
> 
> Regards,
> Boqun



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 14:10                 ` Boqun Feng
  2015-09-21 14:10                   ` Boqun Feng
@ 2015-09-21 22:23                   ` Will Deacon
  2015-09-21 22:23                     ` Will Deacon
                                       ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-21 22:23 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > > all the memory operations that have been perceived by the CPU executing
> > > > the RELEASE operation before the RELEASE operation. 
> > > > 
> > > > Which means a release+acquire pair to the same variable guarantees
> > > > transitivity.
> > > 
> > > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > > include reads by other CPUs. I'm struggling to figure out whether that's
> > > actually an issue.
> > > 
> > 
> > Ah.. that's indeed an issue! for example:
> > 
> > CPU 0			CPU 1				CPU 2
> > =====================	==========================	================
> > {a = 0, b = 0, c = 0}
> > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > r2 = READ_ONCE(b)
> > 
> > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > least on POWER.
> > 
> 
> Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> for the misleading. How about the behavior of that on arm and arm64?

That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
but if you rewrite it as (LDAR is acquire, STLR is release):


  {
  0:X1=x; 0:X3=y;
  1:X1=y; 1:X2=z;
  2:X1=z; 2:X3=x;
  }
   P0           | P1           | P2                ;
   LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
   LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
                | STLR W0,[X2] | STR W2,[X3]       ;

  Observed
      0:X0=1; 0:X2=0; 2:X0=1;


then it is permitted on arm64. Note that herd currently claims that this
is forbidden, but I'm talking to the authors about getting that fixed :)

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 22:23                   ` Will Deacon
@ 2015-09-21 22:23                     ` Will Deacon
  2015-09-21 23:42                     ` Boqun Feng
  2015-09-22 15:22                     ` Paul E. McKenney
  2 siblings, 0 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-21 22:23 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > > all the memory operations that have been perceived by the CPU executing
> > > > the RELEASE operation before the RELEASE operation. 
> > > > 
> > > > Which means a release+acquire pair to the same variable guarantees
> > > > transitivity.
> > > 
> > > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > > include reads by other CPUs. I'm struggling to figure out whether that's
> > > actually an issue.
> > > 
> > 
> > Ah.. that's indeed an issue! for example:
> > 
> > CPU 0			CPU 1				CPU 2
> > =====================	==========================	================
> > {a = 0, b = 0, c = 0}
> > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > r2 = READ_ONCE(b)
> > 
> > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > least on POWER.
> > 
> 
> Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> for the misleading. How about the behavior of that on arm and arm64?

That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
but if you rewrite it as (LDAR is acquire, STLR is release):


  {
  0:X1=x; 0:X3=y;
  1:X1=y; 1:X2=z;
  2:X1=z; 2:X3=x;
  }
   P0           | P1           | P2                ;
   LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
   LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
                | STLR W0,[X2] | STR W2,[X3]       ;

  Observed
      0:X0=1; 0:X2=0; 2:X0=1;


then it is permitted on arm64. Note that herd currently claims that this
is forbidden, but I'm talking to the authors about getting that fixed :)

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 22:23                   ` Will Deacon
  2015-09-21 22:23                     ` Will Deacon
@ 2015-09-21 23:42                     ` Boqun Feng
  2015-09-22 15:22                     ` Paul E. McKenney
  2 siblings, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-21 23:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2480 bytes --]

On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > > > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > > > all the memory operations that have been perceived by the CPU executing
> > > > > the RELEASE operation before the RELEASE operation. 
> > > > > 
> > > > > Which means a release+acquire pair to the same variable guarantees
> > > > > transitivity.
> > > > 
> > > > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > > > include reads by other CPUs. I'm struggling to figure out whether that's
> > > > actually an issue.
> > > > 
> > > 
> > > Ah.. that's indeed an issue! for example:
> > > 
> > > CPU 0			CPU 1				CPU 2
> > > =====================	==========================	================
> > > {a = 0, b = 0, c = 0}
> > > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > > r2 = READ_ONCE(b)
> > > 
> > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > > least on POWER.
> > > 
> > 
> > Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> > for the misleading. How about the behavior of that on arm and arm64?
> 
> That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> but if you rewrite it as (LDAR is acquire, STLR is release):
> 
> 
>   {
>   0:X1=x; 0:X3=y;
>   1:X1=y; 1:X2=z;
>   2:X1=z; 2:X3=x;
>   }
>    P0           | P1           | P2                ;
>    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
>    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
>                 | STLR W0,[X2] | STR W2,[X3]       ;
> 
>   Observed
>       0:X0=1; 0:X2=0; 2:X0=1;
> 

X0 is W0, etc. Right?

> 
> then it is permitted on arm64. Note that herd currently claims that this
> is forbidden, but I'm talking to the authors about getting that fixed :)
> 

Good to know ;-) I think this actually means two things:

1.	ACQUIRE doesn't provide transitivity itself

and 

2.	We still need the term like "transitively before".

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-21 22:23                   ` Will Deacon
  2015-09-21 22:23                     ` Will Deacon
  2015-09-21 23:42                     ` Boqun Feng
@ 2015-09-22 15:22                     ` Paul E. McKenney
  2015-09-22 15:22                       ` Paul E. McKenney
  2015-09-22 15:58                       ` Will Deacon
  2 siblings, 2 replies; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-22 15:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, Peter Zijlstra, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > > > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > > > all the memory operations that have been perceived by the CPU executing
> > > > > the RELEASE operation before the RELEASE operation. 
> > > > > 
> > > > > Which means a release+acquire pair to the same variable guarantees
> > > > > transitivity.
> > > > 
> > > > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > > > include reads by other CPUs. I'm struggling to figure out whether that's
> > > > actually an issue.
> > > > 
> > > 
> > > Ah.. that's indeed an issue! for example:
> > > 
> > > CPU 0			CPU 1				CPU 2
> > > =====================	==========================	================
> > > {a = 0, b = 0, c = 0}
> > > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > > r2 = READ_ONCE(b)
> > > 
> > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > > least on POWER.
> > > 
> > 
> > Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> > for the misleading. How about the behavior of that on arm and arm64?
> 
> That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> but if you rewrite it as (LDAR is acquire, STLR is release):
> 
> 
>   {
>   0:X1=x; 0:X3=y;
>   1:X1=y; 1:X2=z;
>   2:X1=z; 2:X3=x;
>   }
>    P0           | P1           | P2                ;
>    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
>    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
>                 | STLR W0,[X2] | STR W2,[X3]       ;
> 
>   Observed
>       0:X0=1; 0:X2=0; 2:X0=1;
> 
> 
> then it is permitted on arm64. Note that herd currently claims that this
> is forbidden, but I'm talking to the authors about getting that fixed :)

But a pure store-release/load-acquire chain would be forbidden in
hardware as well as by herd, correct?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-22 15:22                     ` Paul E. McKenney
@ 2015-09-22 15:22                       ` Paul E. McKenney
  2015-09-22 15:58                       ` Will Deacon
  1 sibling, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-22 15:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, Peter Zijlstra, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > > > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > > > If an ACQUIRE loads the value of stored by a RELEASE, then after the
> > > > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > > > all the memory operations that have been perceived by the CPU executing
> > > > > the RELEASE operation before the RELEASE operation. 
> > > > > 
> > > > > Which means a release+acquire pair to the same variable guarantees
> > > > > transitivity.
> > > > 
> > > > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > > > include reads by other CPUs. I'm struggling to figure out whether that's
> > > > actually an issue.
> > > > 
> > > 
> > > Ah.. that's indeed an issue! for example:
> > > 
> > > CPU 0			CPU 1				CPU 2
> > > =====================	==========================	================
> > > {a = 0, b = 0, c = 0}
> > > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > > r2 = READ_ONCE(b)
> > > 
> > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > > least on POWER.
> > > 
> > 
> > Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> > for the misleading. How about the behavior of that on arm and arm64?
> 
> That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> but if you rewrite it as (LDAR is acquire, STLR is release):
> 
> 
>   {
>   0:X1=x; 0:X3=y;
>   1:X1=y; 1:X2=z;
>   2:X1=z; 2:X3=x;
>   }
>    P0           | P1           | P2                ;
>    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
>    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
>                 | STLR W0,[X2] | STR W2,[X3]       ;
> 
>   Observed
>       0:X0=1; 0:X2=0; 2:X0=1;
> 
> 
> then it is permitted on arm64. Note that herd currently claims that this
> is forbidden, but I'm talking to the authors about getting that fixed :)

But a pure store-release/load-acquire chain would be forbidden in
hardware as well as by herd, correct?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-22 15:22                     ` Paul E. McKenney
  2015-09-22 15:22                       ` Paul E. McKenney
@ 2015-09-22 15:58                       ` Will Deacon
  2015-09-22 15:58                         ` Will Deacon
  2015-09-22 16:38                         ` Paul E. McKenney
  1 sibling, 2 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-22 15:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Peter Zijlstra, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

Hi Paul,

On Tue, Sep 22, 2015 at 04:22:41PM +0100, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> > On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > > 
> > > > Ah.. that's indeed an issue! for example:
> > > > 
> > > > CPU 0			CPU 1				CPU 2
> > > > =====================	==========================	================
> > > > {a = 0, b = 0, c = 0}
> > > > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > > > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > > > r2 = READ_ONCE(b)
> > > > 
> > > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > > > least on POWER.
> > > > 
> > > 
> > > Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> > > for the misleading. How about the behavior of that on arm and arm64?
> > 
> > That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> > but if you rewrite it as (LDAR is acquire, STLR is release):
> > 
> > 
> >   {
> >   0:X1=x; 0:X3=y;
> >   1:X1=y; 1:X2=z;
> >   2:X1=z; 2:X3=x;
> >   }
> >    P0           | P1           | P2                ;
> >    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
> >    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
> >                 | STLR W0,[X2] | STR W2,[X3]       ;
> > 
> >   Observed
> >       0:X0=1; 0:X2=0; 2:X0=1;
> > 
> > 
> > then it is permitted on arm64. Note that herd currently claims that this
> > is forbidden, but I'm talking to the authors about getting that fixed :)
> 
> But a pure store-release/load-acquire chain would be forbidden in
> hardware as well as by herd, correct?

Yup, and since that's likely the common use-case, I think that's precisely
the scenario where it makes sense for us to require transitivity in the
kernel.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-22 15:58                       ` Will Deacon
@ 2015-09-22 15:58                         ` Will Deacon
  2015-09-22 16:38                         ` Paul E. McKenney
  1 sibling, 0 replies; 28+ messages in thread
From: Will Deacon @ 2015-09-22 15:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Peter Zijlstra, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

Hi Paul,

On Tue, Sep 22, 2015 at 04:22:41PM +0100, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> > On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > > 
> > > > Ah.. that's indeed an issue! for example:
> > > > 
> > > > CPU 0			CPU 1				CPU 2
> > > > =====================	==========================	================
> > > > {a = 0, b = 0, c = 0}
> > > > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > > > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > > > r2 = READ_ONCE(b)
> > > > 
> > > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > > > least on POWER.
> > > > 
> > > 
> > > Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> > > for the misleading. How about the behavior of that on arm and arm64?
> > 
> > That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> > but if you rewrite it as (LDAR is acquire, STLR is release):
> > 
> > 
> >   {
> >   0:X1=x; 0:X3=y;
> >   1:X1=y; 1:X2=z;
> >   2:X1=z; 2:X3=x;
> >   }
> >    P0           | P1           | P2                ;
> >    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
> >    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
> >                 | STLR W0,[X2] | STR W2,[X3]       ;
> > 
> >   Observed
> >       0:X0=1; 0:X2=0; 2:X0=1;
> > 
> > 
> > then it is permitted on arm64. Note that herd currently claims that this
> > is forbidden, but I'm talking to the authors about getting that fixed :)
> 
> But a pure store-release/load-acquire chain would be forbidden in
> hardware as well as by herd, correct?

Yup, and since that's likely the common use-case, I think that's precisely
the scenario where it makes sense for us to require transitivity in the
kernel.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-22 15:58                       ` Will Deacon
  2015-09-22 15:58                         ` Will Deacon
@ 2015-09-22 16:38                         ` Paul E. McKenney
  1 sibling, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-22 16:38 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, Peter Zijlstra, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Tue, Sep 22, 2015 at 04:58:28PM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Tue, Sep 22, 2015 at 04:22:41PM +0100, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> > > On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > > > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > > > 
> > > > > Ah.. that's indeed an issue! for example:
> > > > > 
> > > > > CPU 0			CPU 1				CPU 2
> > > > > =====================	==========================	================
> > > > > {a = 0, b = 0, c = 0}
> > > > > r1 = READ_ONCE(a);	WRITE_ONCE(b, 1);		r3 = smp_load_acquire(&c);
> > > > > smp_rmb();		smp_store_release(&c, 1);	WRITE_ONCE(a, 1);
> > > > > r2 = READ_ONCE(b)
> > > > > 
> > > > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibitted, at
> > > > > least on POWER.
> > > > > 
> > > > 
> > > > Oops.. I use wrong litmus here.. so this is prohibitted on POWER. Sorry
> > > > for the misleading. How about the behavior of that on arm and arm64?
> > > 
> > > That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> > > but if you rewrite it as (LDAR is acquire, STLR is release):
> > > 
> > > 
> > >   {
> > >   0:X1=x; 0:X3=y;
> > >   1:X1=y; 1:X2=z;
> > >   2:X1=z; 2:X3=x;
> > >   }
> > >    P0           | P1           | P2                ;
> > >    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1]      ;
> > >    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1         ;
> > >                 | STLR W0,[X2] | STR W2,[X3]       ;
> > > 
> > >   Observed
> > >       0:X0=1; 0:X2=0; 2:X0=1;
> > > 
> > > 
> > > then it is permitted on arm64. Note that herd currently claims that this
> > > is forbidden, but I'm talking to the authors about getting that fixed :)
> > 
> > But a pure store-release/load-acquire chain would be forbidden in
> > hardware as well as by herd, correct?
> 
> Yup, and since that's likely the common use-case, I think that's precisely
> the scenario where it makes sense for us to require transitivity in the
> kernel.

Agreed.  And again I believe that we need to err on the side of restricting
what the developer can expect.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-15 16:13 [PATCH] barriers: introduce smp_mb__release_acquire and update documentation Will Deacon
  2015-09-15 17:47 ` Paul E. McKenney
@ 2015-09-16 11:49 ` Boqun Feng
  2015-09-16 16:38   ` Will Deacon
  1 sibling, 1 reply; 28+ messages in thread
From: Boqun Feng @ 2015-09-16 11:49 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-arch, linux-kernel, Paul E. McKenney, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 3646 bytes --]

Hi Will,

On Tue, Sep 15, 2015 at 05:13:30PM +0100, Will Deacon wrote:
> As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> always cheaply ordered and can be used to construct UNLOCK -> LOCK
> definitions with similar guarantees, the grim reality is that this isn't
> even possible on x86 (thanks to Paul for bringing us crashing down to
> Earth).
> 
> This patch handles the issue by introducing a new barrier macro,
> smp_mb__release_acquire, that can be placed between a RELEASE and a
> subsequent ACQUIRE operation in order to upgrade them to a full memory
> barrier. At the moment, it doesn't have any users, so its existence
> serves mainly as a documentation aid.
> 
> Documentation/memory-barriers.txt is updated to describe more clearly
> the ACQUIRE and RELEASE ordering in this area and to show an example of
> the new barrier in action.
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> 
> Following our discussion at [1], I thought I'd try to write something
> down...
> 
> [1] http://lkml.kernel.org/r/20150828104854.GB16853@twins.programming.kicks-ass.net
> 
>  Documentation/memory-barriers.txt  | 23 ++++++++++++++++++++++-
>  arch/powerpc/include/asm/barrier.h |  1 +
>  arch/x86/include/asm/barrier.h     |  2 ++
>  include/asm-generic/barrier.h      |  4 ++++
>  4 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 2ba8461b0631..46a85abb77c6 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -459,11 +459,18 @@ And a couple of implicit varieties:
>       RELEASE on that same variable are guaranteed to be visible.  In other
>       words, within a given variable's critical section, all accesses of all
>       previous critical sections for that variable are guaranteed to have
> -     completed.
> +     completed.  If the RELEASE and ACQUIRE operations act on independent
> +     variables, an smp_mb__release_acquire() barrier can be placed between
> +     them to upgrade the sequence to a full barrier.
>  
>       This means that ACQUIRE acts as a minimal "acquire" operation and
>       RELEASE acts as a minimal "release" operation.
>  
> +A subset of the atomic operations described in atomic_ops.txt have ACQUIRE
> +and RELEASE variants in addition to fully-ordered and relaxed definitions.
> +For compound atomics performing both a load and a store, ACQUIRE semantics
> +apply only to the load and RELEASE semantics only to the store portion of
> +the operation.
>  
>  Memory barriers are only required where there's a possibility of interaction
>  between two CPUs or between a CPU and a device.  If it can be guaranteed that
> @@ -1895,6 +1902,20 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
>  	a sleep-unlock race, but the locking primitive needs to resolve
>  	such races properly in any case.
>  
> +If necessary, ordering can be enforced by use of an
> +smp_mb__release_acquire() barrier:
> +
> +	*A = a;
> +	RELEASE M
> +	smp_mb__release_acquire();

Should this barrier be placed after the ACQUIRE? Because we do actually
want(?) and allow RELEASE and ACQUIRE operations to reorder in this
case, like your following example, right?

Regards,
Boqun

> +	ACQUIRE N
> +	*B = b;
> +
> +in which case, the only permitted sequences are:
> +
> +	STORE *A, RELEASE M, ACQUIRE N, STORE *B
> +	STORE *A, ACQUIRE N, RELEASE M, STORE *B
> +

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 11:49 ` Boqun Feng
@ 2015-09-16 16:38   ` Will Deacon
  2015-09-17  1:56     ` Boqun Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Will Deacon @ 2015-09-16 16:38 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
	Paul E. McKenney, Peter Zijlstra

On Wed, Sep 16, 2015 at 12:49:18PM +0100, Boqun Feng wrote:
> Hi Will,

Hello,

> On Tue, Sep 15, 2015 at 05:13:30PM +0100, Will Deacon wrote:
> > +If necessary, ordering can be enforced by use of an
> > +smp_mb__release_acquire() barrier:
> > +
> > +	*A = a;
> > +	RELEASE M
> > +	smp_mb__release_acquire();
> 
> Should this barrier be placed after the ACQUIRE? Because we do actually
> want(?) and allow RELEASE and ACQUIRE operations to reorder in this
> case, like your following example, right?

I think it's a lot simpler to keep it where it is, in all honesty. The
relaxation for the RELEASE/ACQUIRE access ordering is mainly there to
allow architectures building those operations out of explicit barriers
to get away without a definition of smp_mb__release_acquire.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] barriers: introduce smp_mb__release_acquire and update documentation
  2015-09-16 16:38   ` Will Deacon
@ 2015-09-17  1:56     ` Boqun Feng
  0 siblings, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2015-09-17  1:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
	Paul E. McKenney, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1077 bytes --]

On Wed, Sep 16, 2015 at 05:38:14PM +0100, Will Deacon wrote:
> On Wed, Sep 16, 2015 at 12:49:18PM +0100, Boqun Feng wrote:
> > Hi Will,
> 
> Hello,
> 
> > On Tue, Sep 15, 2015 at 05:13:30PM +0100, Will Deacon wrote:
> > > +If necessary, ordering can be enforced by use of an
> > > +smp_mb__release_acquire() barrier:
> > > +
> > > +	*A = a;
> > > +	RELEASE M
> > > +	smp_mb__release_acquire();
> > 
> > Should this barrier be placed after the ACQUIRE? Because we do actually
> > want(?) and allow RELEASE and ACQUIRE operations to reorder in this
> > case, like your following example, right?
> 
> I think it's a lot simpler to keep it where it is, in all honesty. The
> relaxation for the RELEASE/ACQUIRE access ordering is mainly there to
> allow architectures building those operations out of explicit barriers
> to get away without a definition of smp_mb__release_acquire.
> 

Fair enough, and plus there is actually no user(even potential user) of
this for now, it may be too early to argue where the barrier should be
put.

Regards,
Boqun


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-09-22 16:38 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-15 16:13 [PATCH] barriers: introduce smp_mb__release_acquire and update documentation Will Deacon
2015-09-15 17:47 ` Paul E. McKenney
2015-09-16  9:14   ` Peter Zijlstra
2015-09-16 10:29     ` Will Deacon
2015-09-16 10:29       ` Will Deacon
2015-09-16 10:43       ` Peter Zijlstra
2015-09-16 10:43         ` Peter Zijlstra
2015-09-16 11:07         ` Will Deacon
2015-09-16 11:07           ` Will Deacon
2015-09-17  2:50           ` Boqun Feng
2015-09-17  7:57             ` Boqun Feng
2015-09-17  7:57               ` Boqun Feng
2015-09-17 18:00             ` Will Deacon
2015-09-21 13:45               ` Boqun Feng
2015-09-21 13:45                 ` Boqun Feng
2015-09-21 14:10                 ` Boqun Feng
2015-09-21 14:10                   ` Boqun Feng
2015-09-21 22:23                   ` Will Deacon
2015-09-21 22:23                     ` Will Deacon
2015-09-21 23:42                     ` Boqun Feng
2015-09-22 15:22                     ` Paul E. McKenney
2015-09-22 15:22                       ` Paul E. McKenney
2015-09-22 15:58                       ` Will Deacon
2015-09-22 15:58                         ` Will Deacon
2015-09-22 16:38                         ` Paul E. McKenney
2015-09-16 11:49 ` Boqun Feng
2015-09-16 16:38   ` Will Deacon
2015-09-17  1:56     ` Boqun Feng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).