* [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
[not found] <cover.1384885312.git.tim.c.chen@linux.intel.com>
@ 2013-11-20 1:37 ` Tim Chen
2013-11-20 10:19 ` Will Deacon
2013-11-20 1:37 ` [PATCH v6 1/5] MCS Lock: Restructure the MCS lock defines and locking code into its own file Tim Chen
` (4 subsequent siblings)
5 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-20 1:37 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton, Thomas Gleixner
Cc: linux-kernel, linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Tim Chen,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Will Deacon,
Figo.zhang
In this patch series, we separated out the MCS lock code which was
previously embedded in mutex.c. This allows for easier reuse of the
MCS lock in other places like rwsem and qrwlock. We also did some
micro-optimizations and barrier cleanup.
The original code can let memory operations leak across critical section
boundaries, which was not a problem when MCS was embedded within the mutex
but needs to be corrected when allowing the MCS lock to be used by itself
for other locking purposes.
Proper barriers are now embedded with the usage of smp_load_acquire() in
mcs_spin_lock() and smp_store_release() in mcs_spin_unlock(). See
http://marc.info/?l=linux-arch&m=138386254111507 for info on the
new smp_load_acquire() and smp_store_release() functions.
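For reference, a condensed sketch of what the contended lock/unlock paths
look like with these primitives (simplified from the patches in this
series; the real code ends up in kernel/locking/mcs_spinlock.c):

void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
	struct mcs_spinlock *prev;

	node->locked = 0;
	node->next = NULL;

	prev = xchg(lock, node);		/* full memory barrier */
	if (likely(prev == NULL))
		return;				/* lock acquired */
	ACCESS_ONCE(prev->next) = node;
	/* acquire semantics: critical-section accesses stay after this */
	while (!(smp_load_acquire(&node->locked)))
		arch_mutex_cpu_relax();
}

void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
	struct mcs_spinlock *next = ACCESS_ONCE(node->next);

	if (likely(!next)) {
		/* Release the lock by setting it to NULL */
		if (likely(cmpxchg(lock, node, NULL) == node))
			return;
		/* Wait until the next pointer is set */
		while (!(next = ACCESS_ONCE(node->next)))
			arch_mutex_cpu_relax();
	}
	/* release semantics: critical section completes before the handoff */
	smp_store_release(&next->locked, 1);
}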
These patches were previously part of the rwsem optimization patch series
but have now been separated out.
We have also added hooks to allow for architecture-specific
implementations of the mcs_spin_lock and mcs_spin_unlock functions.
Will, do you want to take a crack at adding an implementation for ARM
with the wfe instruction?
Tim
v6:
1. Fixed a bug with an improper xchg_acquire and an extra space in the
barrier correction patch.
2. Added extra hooks to allow architecture-specific versions
of mcs_spin_lock and mcs_spin_unlock to be used.
v5:
1. Reworked the barrier correction patch. We now use smp_load_acquire()
in mcs_spin_lock() and smp_store_release() in
mcs_spin_unlock() to allow architecture-dependent barriers to be
used automatically. This is clean and will provide the right
barriers for all architectures.
v4:
1. Move patch series to the latest tip after v3.12
v3:
1. Modified memory barriers to support non-x86 architectures that have
weak memory ordering.
v2:
1. Changed mcs_spin_lock to a GPL export symbol.
2. corrected mcs_spin_lock to references
Jason Low (1):
MCS Lock: optimizations and extra comments
Tim Chen (2):
MCS Lock: Restructure the MCS lock defines and locking code into its
own file
MCS Lock: Allows for architecture specific mcs lock and unlock
Waiman Long (2):
MCS Lock: Move mcs_lock/unlock function into its own file
MCS Lock: Barrier corrections
arch/Kconfig | 3 ++
include/linux/mcs_spinlock.h | 30 +++++++++++++
include/linux/mutex.h | 5 ++-
kernel/locking/Makefile | 6 +--
kernel/locking/mcs_spinlock.c | 98 +++++++++++++++++++++++++++++++++++++++++++
kernel/locking/mutex.c | 60 ++++----------------------
6 files changed, 144 insertions(+), 58 deletions(-)
create mode 100644 include/linux/mcs_spinlock.h
create mode 100644 kernel/locking/mcs_spinlock.c
--
1.7.11.7
* Re: [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
2013-11-20 1:37 ` [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations Tim Chen
@ 2013-11-20 10:19 ` Will Deacon
2013-11-20 12:50 ` Paul E. McKenney
2013-11-20 17:00 ` Tim Chen
0 siblings, 2 replies; 116+ messages in thread
From: Will Deacon @ 2013-11-20 10:19 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Raghavendra K T,
George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
Hi Tim,
On Wed, Nov 20, 2013 at 01:37:26AM +0000, Tim Chen wrote:
> In this patch series, we separated out the MCS lock code which was
> previously embedded in the mutex.c. This allows for easier reuse of
> MCS lock in other places like rwsem and qrwlock. We also did some micro
> optimizations and barrier cleanup.
>
> The original code has potential leaks between critical sections, which
> was not a problem when MCS was embedded within the mutex but needs
> to be corrected when allowing the MCS lock to be used by itself for
> other locking purposes.
>
> Proper barriers are now embedded with the usage of smp_load_acquire() in
> mcs_spin_lock() and smp_store_release() in mcs_spin_unlock. See
> http://marc.info/?l=linux-arch&m=138386254111507 for info on the
> new smp_load_acquire() and smp_store_release() functions.
>
> This patches were previously part of the rwsem optimization patch series
> but now we spearate them out.
>
> We have also added hooks to allow for architecture specific
> implementation of the mcs_spin_lock and mcs_spin_unlock functions.
>
> Will, do you want to take a crack at adding implementation for ARM
> with wfe instruction?
Sure, I'll have a go this week. Thanks for keeping that as a consideration!
As an aside: what are you using to test this code, so that I can make sure I
don't break it?
Will
* Re: [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
2013-11-20 10:19 ` Will Deacon
@ 2013-11-20 12:50 ` Paul E. McKenney
2013-11-20 17:00 ` Will Deacon
2013-11-20 17:00 ` Tim Chen
1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 12:50 UTC (permalink / raw)
To: Will Deacon
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 10:19:57AM +0000, Will Deacon wrote:
> Hi Tim,
>
> On Wed, Nov 20, 2013 at 01:37:26AM +0000, Tim Chen wrote:
> > In this patch series, we separated out the MCS lock code which was
> > previously embedded in the mutex.c. This allows for easier reuse of
> > MCS lock in other places like rwsem and qrwlock. We also did some micro
> > optimizations and barrier cleanup.
> >
> > The original code has potential leaks between critical sections, which
> > was not a problem when MCS was embedded within the mutex but needs
> > to be corrected when allowing the MCS lock to be used by itself for
> > other locking purposes.
> >
> > Proper barriers are now embedded with the usage of smp_load_acquire() in
> > mcs_spin_lock() and smp_store_release() in mcs_spin_unlock. See
> > http://marc.info/?l=linux-arch&m=138386254111507 for info on the
> > new smp_load_acquire() and smp_store_release() functions.
> >
> > This patches were previously part of the rwsem optimization patch series
> > but now we spearate them out.
> >
> > We have also added hooks to allow for architecture specific
> > implementation of the mcs_spin_lock and mcs_spin_unlock functions.
> >
> > Will, do you want to take a crack at adding implementation for ARM
> > with wfe instruction?
>
> Sure, I'll have a go this week. Thanks for keeping that as a consideration!
>
> As an aside: what are you using to test this code, so that I can make sure I
> don't break it?
+1 to that! In fact, it would be nice to have the test code in-tree,
especially if it can test a wide variety of locks. (/me needs to look
at what test code for locks might already be in tree, for that matter...)
Thanx, Paul
* Re: [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
2013-11-20 12:50 ` Paul E. McKenney
@ 2013-11-20 17:00 ` Will Deacon
2013-11-20 17:14 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-20 17:00 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 12:50:23PM +0000, Paul E. McKenney wrote:
> On Wed, Nov 20, 2013 at 10:19:57AM +0000, Will Deacon wrote:
> > On Wed, Nov 20, 2013 at 01:37:26AM +0000, Tim Chen wrote:
> > > Will, do you want to take a crack at adding implementation for ARM
> > > with wfe instruction?
> >
> > Sure, I'll have a go this week. Thanks for keeping that as a consideration!
> >
> > As an aside: what are you using to test this code, so that I can make sure I
> > don't break it?
>
> +1 to that! In fact, it would be nice to have the test code in-tree,
> especially if it can test a wide variety of locks. (/me needs to look
> at what test code for locks might already be in tree, for that matter...)
Well, in the absence of those tests, I've implemented something that I think
will work for ARM and could be easily extended to arm64.
Tim: I reverted your final patch and went with Paul's suggestion just to
look into the contended case. I'm also not sure about adding
asm/mcs_spinlock.h. This stuff might be better in asm/spinlock.h, which
already exists and contains both spinlocks and rwlocks. Depends on how much
people dislike the Kconfig symbol + conditional #include.
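A rough sketch of that asm/spinlock.h alternative (hypothetical, not part
of the patches below) would be to include the existing per-arch header
unconditionally from include/linux/mcs_spinlock.h and fall back to generic
hooks when the architecture does not provide any:

#include <asm/spinlock.h>	/* arch may (or may not) define the hooks */

#ifndef arch_mcs_spin_lock_contended
#define arch_mcs_spin_lock_contended(l)			\
	while (!(smp_load_acquire(l)))			\
		arch_mutex_cpu_relax()
#endif

#ifndef arch_mcs_spin_unlock_contended
#define arch_mcs_spin_unlock_contended(l)		\
	smp_store_release((l), 1)
#endif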
Anyway, patches below. I included the ARM bits for reference, but please
don't include them in your series!
Cheers,
Will
--->8
* Re: [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
2013-11-20 17:00 ` Will Deacon
@ 2013-11-20 17:14 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 17:14 UTC (permalink / raw)
To: Will Deacon
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 05:00:17PM +0000, Will Deacon wrote:
> On Wed, Nov 20, 2013 at 12:50:23PM +0000, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2013 at 10:19:57AM +0000, Will Deacon wrote:
> > > On Wed, Nov 20, 2013 at 01:37:26AM +0000, Tim Chen wrote:
> > > > Will, do you want to take a crack at adding implementation for ARM
> > > > with wfe instruction?
> > >
> > > Sure, I'll have a go this week. Thanks for keeping that as a consideration!
> > >
> > > As an aside: what are you using to test this code, so that I can make sure I
> > > don't break it?
> >
> > +1 to that! In fact, it would be nice to have the test code in-tree,
> > especially if it can test a wide variety of locks. (/me needs to look
> > at what test code for locks might already be in tree, for that matter...)
>
> Well, in the absence of those tests, I've implemented something that I think
> will work for ARM and could be easily extended to arm64.
>
> Tim: I reverted your final patch and went with Paul's suggestion just to
> look into the contended case. I'm also not sure about adding
> asm/mcs_spinlock.h. This stuff might be better in asm/spinlock.h, which
> already exists and contains both spinlocks and rwlocks. Depends on how much
> people dislike the Kconfig symbol + conditional #include.
>
> Anyway, patches below. I included the ARM bits for reference, but please
> don't include them in your series!
This approach does look way better than replicating the entire MCS-lock
implementation on a bunch of architectures! ;-)
Thanx, Paul
> Cheers,
>
> Will
>
> --->8
>
> >From 074f4cdf9ddc97454467b9ad9f85128ee67c5604 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Wed, 20 Nov 2013 16:14:04 +0000
> Subject: [PATCH 1/3] MCS Lock: allow architectures to hook in to contended
> paths
>
> When contended, architectures may be able to reduce the polling overhead
> in ways which aren't expressible using a simple relax() primitive.
>
> This patch allows architectures to hook into the mcs_{lock,unlock}
> functions for the contended cases only.
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> kernel/locking/mcs_spinlock.c | 47 +++++++++++++++++++++++++------------------
> 1 file changed, 27 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/locking/mcs_spinlock.c b/kernel/locking/mcs_spinlock.c
> index 6f2ce8efb006..853070b8a86d 100644
> --- a/kernel/locking/mcs_spinlock.c
> +++ b/kernel/locking/mcs_spinlock.c
> @@ -7,19 +7,34 @@
> * It avoids expensive cache bouncings that common test-and-set spin-lock
> * implementations incur.
> */
> -/*
> - * asm/processor.h may define arch_mutex_cpu_relax().
> - * If it is not defined, cpu_relax() will be used.
> - */
> +
> #include <asm/barrier.h>
> #include <asm/cmpxchg.h>
> #include <asm/processor.h>
> #include <linux/compiler.h>
> #include <linux/mcs_spinlock.h>
> +#include <linux/mutex.h>
> #include <linux/export.h>
>
> -#ifndef arch_mutex_cpu_relax
> -# define arch_mutex_cpu_relax() cpu_relax()
> +#ifndef arch_mcs_spin_lock_contended
> +/*
> + * Using smp_load_acquire() provides a memory barrier that ensures
> + * subsequent operations happen after the lock is acquired.
> + */
> +#define arch_mcs_spin_lock_contended(l) \
> + while (!(smp_load_acquire(l))) { \
> + arch_mutex_cpu_relax(); \
> + }
> +#endif
> +
> +#ifndef arch_mcs_spin_unlock_contended
> +/*
> + * smp_store_release() provides a memory barrier to ensure all
> + * operations in the critical section has been completed before
> + * unlocking.
> + */
> +#define arch_mcs_spin_unlock_contended(l) \
> + smp_store_release((l), 1)
> #endif
>
> /*
> @@ -44,13 +59,9 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> return;
> }
> ACCESS_ONCE(prev->next) = node;
> - /*
> - * Wait until the lock holder passes the lock down.
> - * Using smp_load_acquire() provides a memory barrier that
> - * ensures subsequent operations happen after the lock is acquired.
> - */
> - while (!(smp_load_acquire(&node->locked)))
> - arch_mutex_cpu_relax();
> +
> + /* Wait until the lock holder passes the lock down. */
> + arch_mcs_spin_lock_contended(&node->locked);
> }
> EXPORT_SYMBOL_GPL(mcs_spin_lock);
>
> @@ -72,12 +83,8 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> while (!(next = ACCESS_ONCE(node->next)))
> arch_mutex_cpu_relax();
> }
> - /*
> - * Pass lock to next waiter.
> - * smp_store_release() provides a memory barrier to ensure
> - * all operations in the critical section has been completed
> - * before unlocking.
> - */
> - smp_store_release(&next->locked, 1);
> +
> + /* Pass lock to next waiter. */
> + arch_mcs_spin_unlock_contended(&next->locked);
> }
> EXPORT_SYMBOL_GPL(mcs_spin_unlock);
> --
> 1.8.2.2
>
>
> >From faa48f77a17cfd99562b1e36de278367aa4d389c Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Wed, 20 Nov 2013 16:10:57 +0000
> Subject: [PATCH 2/3] MCS Lock: add Kconfig entries to allow arch-specific
> hooks
>
> This patch adds Kconfig entries to allow architectures to hook into the
> MCS lock/unlock functions in the contended case.
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> arch/Kconfig | 3 +++
> include/linux/mcs_spinlock.h | 8 ++++++++
> 2 files changed, 11 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index f1cf895c040f..ae738f706325 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -303,6 +303,9 @@ config HAVE_CMPXCHG_LOCAL
> config HAVE_CMPXCHG_DOUBLE
> bool
>
> +config HAVE_ARCH_MCS_LOCK
> + bool
> +
> config ARCH_WANT_IPC_PARSE_VERSION
> bool
>
> diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> index d54bb232a238..d2c02adb0bbd 100644
> --- a/include/linux/mcs_spinlock.h
> +++ b/include/linux/mcs_spinlock.h
> @@ -12,6 +12,14 @@
> #ifndef __LINUX_MCS_SPINLOCK_H
> #define __LINUX_MCS_SPINLOCK_H
>
> +/*
> + * An architecture may provide its own lock/unlock functions for the
> + * contended case.
> + */
> +#ifdef CONFIG_HAVE_ARCH_MCS_LOCK
> +#include <asm/mcs_spinlock.h>
> +#endif
> +
> struct mcs_spinlock {
> struct mcs_spinlock *next;
> int locked; /* 1 if lock acquired */
> --
> 1.8.2.2
>
>
> >From 21f047d40002ec4f1b780eee88f16a1870ab00ef Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Wed, 20 Nov 2013 16:15:31 +0000
> Subject: [PATCH 3/3] ARM: mcs lock: implement wfe-based polling for MCS
> locking
>
> This patch introduces a wfe-based polling loop for spinning on contended
> MCS locks and waking up corresponding waiters when the lock is released.
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> arch/arm/Kconfig | 1 +
> arch/arm/include/asm/mcs_spinlock.h | 20 ++++++++++++++++++++
> 2 files changed, 21 insertions(+)
> create mode 100644 arch/arm/include/asm/mcs_spinlock.h
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 214b698cefea..ab9fb84599ac 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -25,6 +25,7 @@ config ARM
> select HARDIRQS_SW_RESEND
> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL
> select HAVE_ARCH_KGDB
> + select HAVE_ARCH_MCS_LOCK
> select HAVE_ARCH_SECCOMP_FILTER
> select HAVE_ARCH_TRACEHOOK
> select HAVE_BPF_JIT
> diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
> new file mode 100644
> index 000000000000..f32f97e81471
> --- /dev/null
> +++ b/arch/arm/include/asm/mcs_spinlock.h
> @@ -0,0 +1,20 @@
> +#ifndef __ASM_MCS_LOCK_H
> +#define __ASM_MCS_LOCK_H
> +
> +/* MCS spin-locking. */
> +#define arch_mcs_spin_lock_contended(lock) \
> +do { \
> + /* Ensure prior stores are observed before we enter wfe. */ \
> + smp_mb(); \
> + while (!(smp_load_acquire(lock))) \
> + wfe(); \
> +} while (0) \
> +
> +#define arch_mcs_spin_unlock_contended(lock) \
> +do { \
> + smp_store_release(lock, 1); \
> + dsb(ishst); \
> + sev(); \
> +} while (0)
> +
> +#endif /* __ASM_MCS_LOCK_H */
> --
> 1.8.2.2
>
* Re: [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
2013-11-20 10:19 ` Will Deacon
2013-11-20 12:50 ` Paul E. McKenney
@ 2013-11-20 17:00 ` Tim Chen
2013-11-20 17:16 ` Paul E. McKenney
1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-20 17:00 UTC (permalink / raw)
To: Will Deacon
Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Raghavendra K T,
George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Wed, 2013-11-20 at 10:19 +0000, Will Deacon wrote:
> Hi Tim,
>
> On Wed, Nov 20, 2013 at 01:37:26AM +0000, Tim Chen wrote:
> > In this patch series, we separated out the MCS lock code which was
> > previously embedded in the mutex.c. This allows for easier reuse of
> > MCS lock in other places like rwsem and qrwlock. We also did some micro
> > optimizations and barrier cleanup.
> >
> > The original code has potential leaks between critical sections, which
> > was not a problem when MCS was embedded within the mutex but needs
> > to be corrected when allowing the MCS lock to be used by itself for
> > other locking purposes.
> >
> > Proper barriers are now embedded with the usage of smp_load_acquire() in
> > mcs_spin_lock() and smp_store_release() in mcs_spin_unlock. See
> > http://marc.info/?l=linux-arch&m=138386254111507 for info on the
> > new smp_load_acquire() and smp_store_release() functions.
> >
> > This patches were previously part of the rwsem optimization patch series
> > but now we spearate them out.
> >
> > We have also added hooks to allow for architecture specific
> > implementation of the mcs_spin_lock and mcs_spin_unlock functions.
> >
> > Will, do you want to take a crack at adding implementation for ARM
> > with wfe instruction?
>
> Sure, I'll have a go this week. Thanks for keeping that as a consideration!
>
> As an aside: what are you using to test this code, so that I can make sure I
> don't break it?
>
I was testing it against my rwsem patches. But any workload that
exercises mutexes also exercises this code, as it is now part of the mutex.
Tim
> Will
* Re: [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations
2013-11-20 17:00 ` Tim Chen
@ 2013-11-20 17:16 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 17:16 UTC (permalink / raw)
To: Tim Chen
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 09:00:47AM -0800, Tim Chen wrote:
> On Wed, 2013-11-20 at 10:19 +0000, Will Deacon wrote:
> > Hi Tim,
> >
> > On Wed, Nov 20, 2013 at 01:37:26AM +0000, Tim Chen wrote:
> > > In this patch series, we separated out the MCS lock code which was
> > > previously embedded in the mutex.c. This allows for easier reuse of
> > > MCS lock in other places like rwsem and qrwlock. We also did some micro
> > > optimizations and barrier cleanup.
> > >
> > > The original code has potential leaks between critical sections, which
> > > was not a problem when MCS was embedded within the mutex but needs
> > > to be corrected when allowing the MCS lock to be used by itself for
> > > other locking purposes.
> > >
> > > Proper barriers are now embedded with the usage of smp_load_acquire() in
> > > mcs_spin_lock() and smp_store_release() in mcs_spin_unlock. See
> > > http://marc.info/?l=linux-arch&m=138386254111507 for info on the
> > > new smp_load_acquire() and smp_store_release() functions.
> > >
> > > This patches were previously part of the rwsem optimization patch series
> > > but now we spearate them out.
> > >
> > > We have also added hooks to allow for architecture specific
> > > implementation of the mcs_spin_lock and mcs_spin_unlock functions.
> > >
> > > Will, do you want to take a crack at adding implementation for ARM
> > > with wfe instruction?
> >
> > Sure, I'll have a go this week. Thanks for keeping that as a consideration!
> >
> > As an aside: what are you using to test this code, so that I can make sure I
> > don't break it?
>
> I was testing it against my rwsem patches. But any workload that
> exercises mutex should also use this code, as this is part of mutex.
It would be good to have more focused tests. It is amazing how reliable
broken synchronization-primitive implementations can be. Of course, in
this case, they will fail at the least opportune and most
difficult-to-debug situation...
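As a purely illustrative sketch of what such a focused test could look
like (hypothetical module, not part of this series; it assumes only the
standard kthread and mutex APIs, and the names are made up), a handful of
kthreads hammering a mutex will exercise the MCS mid-path under contention:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/mutex.h>
#include <linux/sched.h>

static DEFINE_MUTEX(stress_mutex);
static struct task_struct *stress_threads[4];
static unsigned long stress_count;

static int stress_fn(void *unused)
{
	while (!kthread_should_stop()) {
		mutex_lock(&stress_mutex);
		stress_count++;			/* contended critical section */
		mutex_unlock(&stress_mutex);
		cond_resched();
	}
	return 0;
}

static int __init mcs_stress_init(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(stress_threads); i++)
		stress_threads[i] = kthread_run(stress_fn, NULL,
						"mcs-stress/%d", i);
	return 0;
}

static void __exit mcs_stress_exit(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(stress_threads); i++)
		if (!IS_ERR_OR_NULL(stress_threads[i]))
			kthread_stop(stress_threads[i]);
	pr_info("mcs-stress: %lu iterations\n", stress_count);
}

module_init(mcs_stress_init);
module_exit(mcs_stress_exit);
MODULE_LICENSE("GPL");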
Thanx, Paul
* [PATCH v6 1/5] MCS Lock: Restructure the MCS lock defines and locking code into its own file
[not found] <cover.1384885312.git.tim.c.chen@linux.intel.com>
2013-11-20 1:37 ` [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations Tim Chen
@ 2013-11-20 1:37 ` Tim Chen
2013-11-20 1:37 ` [PATCH v6 2/5] MCS Lock: optimizations and extra comments Tim Chen
` (3 subsequent siblings)
5 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2013-11-20 1:37 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton, Thomas Gleixner
Cc: linux-kernel, linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Tim Chen,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Will Deacon,
Figo.zhang
We will need the MCS lock code for doing optimistic spinning for rwsem
and the queue rwlock. Extracting the MCS code from mutex.c and putting it
into its own file allows us to reuse this code easily.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
include/linux/mcs_spinlock.h | 64 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/mutex.h | 5 ++--
kernel/locking/mutex.c | 60 +++++------------------------------------
3 files changed, 74 insertions(+), 55 deletions(-)
create mode 100644 include/linux/mcs_spinlock.h
diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
new file mode 100644
index 0000000..b5de3b0
--- /dev/null
+++ b/include/linux/mcs_spinlock.h
@@ -0,0 +1,64 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bouncings that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+struct mcs_spinlock {
+ struct mcs_spinlock *next;
+ int locked; /* 1 if lock acquired */
+};
+
+/*
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static noinline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+ struct mcs_spinlock *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+ struct mcs_spinlock *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index bab49da..32a32e6 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,6 +46,7 @@
* - detects multi-task circular deadlocks and prints out all affected
* locks and tasks (and only those tasks)
*/
+struct mcs_spinlock;
struct mutex {
/* 1: unlocked, 0: locked, negative: locked, possible waiters */
atomic_t count;
@@ -55,7 +56,7 @@ struct mutex {
struct task_struct *owner;
#endif
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
- void *spin_mlock; /* Spinner MCS lock */
+ struct mcs_spinlock *mcs_lock; /* Spinner MCS lock */
#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
@@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
# define arch_mutex_cpu_relax() cpu_relax()
#endif
-#endif
+#endif /* __LINUX_MUTEX_H */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d24105b..e08b183 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -25,6 +25,7 @@
#include <linux/spinlock.h>
#include <linux/interrupt.h>
#include <linux/debug_locks.h>
+#include <linux/mcs_spinlock.h>
/*
* In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
- lock->spin_mlock = NULL;
+ lock->mcs_lock = NULL;
#endif
debug_mutex_init(lock, name, key);
@@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
* more or less simultaneously, the spinners need to acquire a MCS lock
* first before spinning on the owner field.
*
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
*/
-struct mspin_node {
- struct mspin_node *next ;
- int locked; /* 1 if lock acquired */
-};
-#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
-
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
- struct mspin_node *prev;
-
- /* Init node */
- node->locked = 0;
- node->next = NULL;
-
- prev = xchg(lock, node);
- if (likely(prev == NULL)) {
- /* Lock acquired */
- node->locked = 1;
- return;
- }
- ACCESS_ONCE(prev->next) = node;
- smp_wmb();
- /* Wait until the lock holder passes the lock down */
- while (!ACCESS_ONCE(node->locked))
- arch_mutex_cpu_relax();
-}
-
-static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
-{
- struct mspin_node *next = ACCESS_ONCE(node->next);
-
- if (likely(!next)) {
- /*
- * Release the lock by setting it to NULL
- */
- if (cmpxchg(lock, node, NULL) == node)
- return;
- /* Wait until the next pointer is set */
- while (!(next = ACCESS_ONCE(node->next)))
- arch_mutex_cpu_relax();
- }
- ACCESS_ONCE(next->locked) = 1;
- smp_wmb();
-}
/*
* Mutex spinning code migrated from kernel/sched/core.c
@@ -448,7 +402,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
for (;;) {
struct task_struct *owner;
- struct mspin_node node;
+ struct mcs_spinlock node;
if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -470,10 +424,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mspin_lock(MLOCK(lock), &node);
+ mcs_spin_lock(&lock->mcs_lock, &node);
owner = ACCESS_ONCE(lock->owner);
if (owner && !mutex_spin_on_owner(lock, owner)) {
- mspin_unlock(MLOCK(lock), &node);
+ mcs_spin_unlock(&lock->mcs_lock, &node);
goto slowpath;
}
@@ -488,11 +442,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
}
mutex_set_owner(lock);
- mspin_unlock(MLOCK(lock), &node);
+ mcs_spin_unlock(&lock->mcs_lock, &node);
preempt_enable();
return 0;
}
- mspin_unlock(MLOCK(lock), &node);
+ mcs_spin_unlock(&lock->mcs_lock, &node);
/*
* When there's no owner, we might have preempted between the
--
1.7.11.7
* [PATCH v6 2/5] MCS Lock: optimizations and extra comments
[not found] <cover.1384885312.git.tim.c.chen@linux.intel.com>
2013-11-20 1:37 ` [PATCH v6 0/5] MCS Lock: MCS lock code cleanup and optimizations Tim Chen
2013-11-20 1:37 ` [PATCH v6 1/5] MCS Lock: Restructure the MCS lock defines and locking code into its own file Tim Chen
@ 2013-11-20 1:37 ` Tim Chen
2013-11-20 1:37 ` [PATCH v6 3/5] MCS Lock: Move mcs_lock/unlock function into its own file Tim Chen
` (2 subsequent siblings)
5 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2013-11-20 1:37 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton, Thomas Gleixner
Cc: linux-kernel, linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Tim Chen,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Will Deacon,
Figo.zhang
Remove an unnecessary operation and mark the cmpxchg(lock, node, NULL) == node
check in mcs_spin_unlock() as likely(), since it is likely that a race did
not occur.
Also add more comments describing how the local node is used in MCS locks.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
include/linux/mcs_spinlock.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index b5de3b0..96f14299 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -18,6 +18,12 @@ struct mcs_spinlock {
};
/*
+ * In order to acquire the lock, the caller should declare a local node and
+ * pass a reference of the node to this function in addition to the lock.
+ * If the lock has already been acquired, then this will proceed to spin
+ * on this node->locked until the previous lock holder sets the node->locked
+ * in mcs_spin_unlock().
+ *
* We don't inline mcs_spin_lock() so that perf can correctly account for the
* time spent in this lock function.
*/
@@ -33,7 +39,6 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
prev = xchg(lock, node);
if (likely(prev == NULL)) {
/* Lock acquired */
- node->locked = 1;
return;
}
ACCESS_ONCE(prev->next) = node;
@@ -43,6 +48,10 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
arch_mutex_cpu_relax();
}
+/*
+ * Releases the lock. The caller should pass in the corresponding node that
+ * was used to acquire the lock.
+ */
static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
struct mcs_spinlock *next = ACCESS_ONCE(node->next);
@@ -51,7 +60,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod
/*
* Release the lock by setting it to NULL
*/
- if (cmpxchg(lock, node, NULL) == node)
+ if (likely(cmpxchg(lock, node, NULL) == node))
return;
/* Wait until the next pointer is set */
while (!(next = ACCESS_ONCE(node->next)))
--
1.7.11.7
* [PATCH v6 3/5] MCS Lock: Move mcs_lock/unlock function into its own file
[not found] <cover.1384885312.git.tim.c.chen@linux.intel.com>
` (2 preceding siblings ...)
2013-11-20 1:37 ` [PATCH v6 2/5] MCS Lock: optimizations and extra comments Tim Chen
@ 2013-11-20 1:37 ` Tim Chen
2013-11-20 1:37 ` [PATCH v6 4/5] MCS Lock: Barrier corrections Tim Chen
2013-11-20 1:37 ` [PATCH v6 5/5] MCS Lock: Allows for architecture specific mcs lock and unlock Tim Chen
5 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2013-11-20 1:37 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton, Thomas Gleixner
Cc: linux-kernel, linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Tim Chen,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Will Deacon,
Figo.zhang
The following changes are made:
1) Create a new mcs_spinlock.c file to contain the
mcs_spin_lock() and mcs_spin_unlock() functions.
2) Include a number of prerequisite header files and define
arch_mutex_cpu_relax(), if not previously defined, so the
mcs functions can be compiled for multiple architectures without
causing problems.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
include/linux/mcs_spinlock.h | 56 ++--------------------
kernel/locking/Makefile | 6 +--
.../locking/mcs_spinlock.c | 33 ++++++-------
3 files changed, 24 insertions(+), 71 deletions(-)
copy include/linux/mcs_spinlock.h => kernel/locking/mcs_spinlock.c (75%)
diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index 96f14299..d54bb23 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -17,57 +17,9 @@ struct mcs_spinlock {
int locked; /* 1 if lock acquired */
};
-/*
- * In order to acquire the lock, the caller should declare a local node and
- * pass a reference of the node to this function in addition to the lock.
- * If the lock has already been acquired, then this will proceed to spin
- * on this node->locked until the previous lock holder sets the node->locked
- * in mcs_spin_unlock().
- *
- * We don't inline mcs_spin_lock() so that perf can correctly account for the
- * time spent in this lock function.
- */
-static noinline
-void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
-{
- struct mcs_spinlock *prev;
-
- /* Init node */
- node->locked = 0;
- node->next = NULL;
-
- prev = xchg(lock, node);
- if (likely(prev == NULL)) {
- /* Lock acquired */
- return;
- }
- ACCESS_ONCE(prev->next) = node;
- smp_wmb();
- /* Wait until the lock holder passes the lock down */
- while (!ACCESS_ONCE(node->locked))
- arch_mutex_cpu_relax();
-}
-
-/*
- * Releases the lock. The caller should pass in the corresponding node that
- * was used to acquire the lock.
- */
-static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
-{
- struct mcs_spinlock *next = ACCESS_ONCE(node->next);
-
- if (likely(!next)) {
- /*
- * Release the lock by setting it to NULL
- */
- if (likely(cmpxchg(lock, node, NULL) == node))
- return;
- /* Wait until the next pointer is set */
- while (!(next = ACCESS_ONCE(node->next)))
- arch_mutex_cpu_relax();
- }
- ACCESS_ONCE(next->locked) = 1;
- smp_wmb();
-}
+extern
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node);
+extern
+void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node);
#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..20d9d5c 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -13,12 +13,12 @@ obj-$(CONFIG_LOCKDEP) += lockdep.o
ifeq ($(CONFIG_PROC_FS),y)
obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
-obj-$(CONFIG_SMP) += spinlock.o
-obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_SMP) += spinlock.o mcs_spinlock.o
+obj-$(CONFIG_PROVE_LOCKING) += spinlock.o mcs_spinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
-obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
+obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o mcs_spinlock.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
diff --git a/include/linux/mcs_spinlock.h b/kernel/locking/mcs_spinlock.c
similarity index 75%
copy from include/linux/mcs_spinlock.h
copy to kernel/locking/mcs_spinlock.c
index 96f14299..44fb092 100644
--- a/include/linux/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.c
@@ -1,7 +1,5 @@
/*
- * MCS lock defines
- *
- * This file contains the main data structure and API definitions of MCS lock.
+ * MCS lock
*
* The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
* with the desirable properties of being fair, and with each cpu trying
@@ -9,13 +7,20 @@
* It avoids expensive cache bouncings that common test-and-set spin-lock
* implementations incur.
*/
-#ifndef __LINUX_MCS_SPINLOCK_H
-#define __LINUX_MCS_SPINLOCK_H
+/*
+ * asm/processor.h may define arch_mutex_cpu_relax().
+ * If it is not defined, cpu_relax() will be used.
+ */
+#include <asm/barrier.h>
+#include <asm/cmpxchg.h>
+#include <asm/processor.h>
+#include <linux/compiler.h>
+#include <linux/mcs_spinlock.h>
+#include <linux/export.h>
-struct mcs_spinlock {
- struct mcs_spinlock *next;
- int locked; /* 1 if lock acquired */
-};
+#ifndef arch_mutex_cpu_relax
+# define arch_mutex_cpu_relax() cpu_relax()
+#endif
/*
* In order to acquire the lock, the caller should declare a local node and
@@ -23,11 +28,7 @@ struct mcs_spinlock {
* If the lock has already been acquired, then this will proceed to spin
* on this node->locked until the previous lock holder sets the node->locked
* in mcs_spin_unlock().
- *
- * We don't inline mcs_spin_lock() so that perf can correctly account for the
- * time spent in this lock function.
*/
-static noinline
void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
struct mcs_spinlock *prev;
@@ -47,12 +48,13 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
while (!ACCESS_ONCE(node->locked))
arch_mutex_cpu_relax();
}
+EXPORT_SYMBOL_GPL(mcs_spin_lock);
/*
* Releases the lock. The caller should pass in the corresponding node that
* was used to acquire the lock.
*/
-static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
struct mcs_spinlock *next = ACCESS_ONCE(node->next);
@@ -69,5 +71,4 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod
ACCESS_ONCE(next->locked) = 1;
smp_wmb();
}
-
-#endif /* __LINUX_MCS_SPINLOCK_H */
+EXPORT_SYMBOL_GPL(mcs_spin_unlock);
--
1.7.11.7
* [PATCH v6 4/5] MCS Lock: Barrier corrections
[not found] <cover.1384885312.git.tim.c.chen@linux.intel.com>
` (3 preceding siblings ...)
2013-11-20 1:37 ` [PATCH v6 3/5] MCS Lock: Move mcs_lock/unlock function into its own file Tim Chen
@ 2013-11-20 1:37 ` Tim Chen
2013-11-20 15:31 ` Paul E. McKenney
2013-11-20 1:37 ` [PATCH v6 5/5] MCS Lock: Allows for architecture specific mcs lock and unlock Tim Chen
5 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-20 1:37 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton, Thomas Gleixner
Cc: linux-kernel, linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Tim Chen,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Will Deacon,
Figo.zhang
This patch corrects the way memory barriers are used in the MCS lock
with the smp_load_acquire() and smp_store_release() functions.
It removes barriers that are not needed.
It uses architecture-specific load-acquire and store-release
primitives for synchronization, if available. Generic implementations
are provided in case they are not defined, even though they may not
be optimal. These generic implementations could be removed later on
once changes are made in all the relevant header files.
Suggested-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/locking/mcs_spinlock.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/mcs_spinlock.c b/kernel/locking/mcs_spinlock.c
index 44fb092..6f2ce8e 100644
--- a/kernel/locking/mcs_spinlock.c
+++ b/kernel/locking/mcs_spinlock.c
@@ -37,15 +37,19 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
node->locked = 0;
node->next = NULL;
+ /* xchg() provides a memory barrier */
prev = xchg(lock, node);
if (likely(prev == NULL)) {
/* Lock acquired */
return;
}
ACCESS_ONCE(prev->next) = node;
- smp_wmb();
- /* Wait until the lock holder passes the lock down */
- while (!ACCESS_ONCE(node->locked))
+ /*
+ * Wait until the lock holder passes the lock down.
+ * Using smp_load_acquire() provides a memory barrier that
+ * ensures subsequent operations happen after the lock is acquired.
+ */
+ while (!(smp_load_acquire(&node->locked)))
arch_mutex_cpu_relax();
}
EXPORT_SYMBOL_GPL(mcs_spin_lock);
@@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
while (!(next = ACCESS_ONCE(node->next)))
arch_mutex_cpu_relax();
}
- ACCESS_ONCE(next->locked) = 1;
- smp_wmb();
+ /*
+ * Pass lock to next waiter.
+ * smp_store_release() provides a memory barrier to ensure
+ * all operations in the critical section has been completed
+ * before unlocking.
+ */
+ smp_store_release(&next->locked, 1);
}
EXPORT_SYMBOL_GPL(mcs_spin_unlock);
--
1.7.11.7
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 1:37 ` [PATCH v6 4/5] MCS Lock: Barrier corrections Tim Chen
@ 2013-11-20 15:31 ` Paul E. McKenney
2013-11-20 15:46 ` Will Deacon
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 15:31 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, linux-kernel,
linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Will Deacon, Figo.zhang
On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> This patch corrects the way memory barriers are used in the MCS lock
> with smp_load_acquire and smp_store_release fucnction.
> It removes ones that are not needed.
>
> It uses architecture specific load-acquire and store-release
> primitives for synchronization, if available. Generic implementations
> are provided in case they are not defined even though they may not
> be optimal. These generic implementation could be removed later on
> once changes are made in all the relevant header files.
>
> Suggested-by: Michel Lespinasse <walken@google.com>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/locking/mcs_spinlock.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/mcs_spinlock.c b/kernel/locking/mcs_spinlock.c
> index 44fb092..6f2ce8e 100644
> --- a/kernel/locking/mcs_spinlock.c
> +++ b/kernel/locking/mcs_spinlock.c
> @@ -37,15 +37,19 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> node->locked = 0;
> node->next = NULL;
>
> + /* xchg() provides a memory barrier */
> prev = xchg(lock, node);
> if (likely(prev == NULL)) {
> /* Lock acquired */
> return;
> }
> ACCESS_ONCE(prev->next) = node;
> - smp_wmb();
> - /* Wait until the lock holder passes the lock down */
> - while (!ACCESS_ONCE(node->locked))
> + /*
> + * Wait until the lock holder passes the lock down.
> + * Using smp_load_acquire() provides a memory barrier that
> + * ensures subsequent operations happen after the lock is acquired.
> + */
> + while (!(smp_load_acquire(&node->locked)))
> arch_mutex_cpu_relax();
> }
> EXPORT_SYMBOL_GPL(mcs_spin_lock);
> @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> while (!(next = ACCESS_ONCE(node->next)))
> arch_mutex_cpu_relax();
> }
> - ACCESS_ONCE(next->locked) = 1;
> - smp_wmb();
> + /*
> + * Pass lock to next waiter.
> + * smp_store_release() provides a memory barrier to ensure
> + * all operations in the critical section has been completed
> + * before unlocking.
> + */
> + smp_store_release(&next->locked, 1);
However, there is one problem with this that I missed yesterday.
Documentation/memory-barriers.txt requires that an unlock-lock pair
provide a full barrier, but this is not guaranteed if we use
smp_store_release() for unlock and smp_load_acquire() for lock.
At least one of these needs a full memory barrier.
Thanx, Paul
> }
> EXPORT_SYMBOL_GPL(mcs_spin_unlock);
> --
> 1.7.11.7
>
>
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 15:31 ` Paul E. McKenney
@ 2013-11-20 15:46 ` Will Deacon
2013-11-20 17:14 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-20 15:46 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
Hi Paul,
On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > while (!(next = ACCESS_ONCE(node->next)))
> > arch_mutex_cpu_relax();
> > }
> > - ACCESS_ONCE(next->locked) = 1;
> > - smp_wmb();
> > + /*
> > + * Pass lock to next waiter.
> > + * smp_store_release() provides a memory barrier to ensure
> > + * all operations in the critical section has been completed
> > + * before unlocking.
> > + */
> > + smp_store_release(&next->locked, 1);
>
> However, there is one problem with this that I missed yesterday.
>
> Documentation/memory-barriers.txt requires that an unlock-lock pair
> provide a full barrier, but this is not guaranteed if we use
> smp_store_release() for unlock and smp_load_acquire() for lock.
> At least one of these needs a full memory barrier.
Hmm, so in the following case:
Access A
unlock() /* release semantics */
lock() /* acquire semantics */
Access B
A cannot pass beyond the unlock() and B cannot pass before the lock().
I agree that accesses between the unlock and the lock can be move across both
A and B, but that doesn't seem to matter by my reading of the above.
What is the problematic scenario you have in mind? Are you thinking of the
lock() moving before the unlock()? That's only permitted by RCpc afaiu,
which I don't think any architectures supported by Linux implement...
(ARMv8 acquire/release is RCsc).
Will
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 15:46 ` Will Deacon
@ 2013-11-20 17:14 ` Paul E. McKenney
2013-11-20 18:43 ` Tim Chen
2013-11-21 11:03 ` Peter Zijlstra
0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 17:14 UTC (permalink / raw)
To: Will Deacon
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> Hi Paul,
>
> On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > while (!(next = ACCESS_ONCE(node->next)))
> > > arch_mutex_cpu_relax();
> > > }
> > > - ACCESS_ONCE(next->locked) = 1;
> > > - smp_wmb();
> > > + /*
> > > + * Pass lock to next waiter.
> > > + * smp_store_release() provides a memory barrier to ensure
> > > + * all operations in the critical section has been completed
> > > + * before unlocking.
> > > + */
> > > + smp_store_release(&next->locked, 1);
> >
> > However, there is one problem with this that I missed yesterday.
> >
> > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > provide a full barrier, but this is not guaranteed if we use
> > smp_store_release() for unlock and smp_load_acquire() for lock.
> > At least one of these needs a full memory barrier.
>
> Hmm, so in the following case:
>
> Access A
> unlock() /* release semantics */
> lock() /* acquire semantics */
> Access B
>
> A cannot pass beyond the unlock() and B cannot pass the before the lock().
>
> I agree that accesses between the unlock and the lock can be move across both
> A and B, but that doesn't seem to matter by my reading of the above.
>
> What is the problematic scenario you have in mind? Are you thinking of the
> lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> which I don't think any architectures supported by Linux implement...
> (ARMv8 acquire/release is RCsc).
If smp_load_acquire() and smp_store_release() are both implemented using
lwsync on powerpc, and if Access A is a store and Access B is a load,
then Access A and Access B can be reordered.
Of course, if every other architecture will be providing RCsc implementations
for smp_load_acquire() and smp_store_release(), which would not be a bad
thing, then another approach is for powerpc to use sync rather than lwsync
for one or the other of smp_load_acquire() or smp_store_release().
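To make the scenario concrete, here is a sketch of the same-CPU sequence
in question, using the mcs_spin_* API from this series (illustrative only;
the variable names are made up):

static struct mcs_spinlock *lock_x, *lock_y;
static int a, b;

int unlock_lock_example(void)
{
	struct mcs_spinlock node_x, node_y;
	int r;

	mcs_spin_lock(&lock_x, &node_x);
	ACCESS_ONCE(a) = 1;			/* Access A: store */
	mcs_spin_unlock(&lock_x, &node_x);	/* release: smp_store_release() */

	mcs_spin_lock(&lock_y, &node_y);	/* acquire: smp_load_acquire() (contended path) */
	r = ACCESS_ONCE(b);			/* Access B: load */
	mcs_spin_unlock(&lock_y, &node_y);

	/*
	 * With both primitives mapped to lwsync on powerpc, the store to
	 * 'a' and the load from 'b' are not ordered by a full barrier, so
	 * another CPU can observe them out of order; this is what the
	 * unlock+lock "full barrier" requirement is meant to rule out.
	 */
	return r;
}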
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 17:14 ` Paul E. McKenney
@ 2013-11-20 18:43 ` Tim Chen
2013-11-20 19:06 ` Paul E. McKenney
2013-11-21 11:03 ` Peter Zijlstra
1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-20 18:43 UTC (permalink / raw)
To: paulmck
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, 2013-11-20 at 09:14 -0800, Paul E. McKenney wrote:
> On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> > Hi Paul,
> >
> > On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > > while (!(next = ACCESS_ONCE(node->next)))
> > > > arch_mutex_cpu_relax();
> > > > }
> > > > - ACCESS_ONCE(next->locked) = 1;
> > > > - smp_wmb();
> > > > + /*
> > > > + * Pass lock to next waiter.
> > > > + * smp_store_release() provides a memory barrier to ensure
> > > > + * all operations in the critical section has been completed
> > > > + * before unlocking.
> > > > + */
> > > > + smp_store_release(&next->locked, 1);
> > >
> > > However, there is one problem with this that I missed yesterday.
> > >
> > > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > > provide a full barrier, but this is not guaranteed if we use
> > > smp_store_release() for unlock and smp_load_acquire() for lock.
> > > At least one of these needs a full memory barrier.
> >
> > Hmm, so in the following case:
> >
> > Access A
> > unlock() /* release semantics */
> > lock() /* acquire semantics */
> > Access B
> >
> > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> >
> > I agree that accesses between the unlock and the lock can be move across both
> > A and B, but that doesn't seem to matter by my reading of the above.
> >
> > What is the problematic scenario you have in mind? Are you thinking of the
> > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > which I don't think any architectures supported by Linux implement...
> > (ARMv8 acquire/release is RCsc).
>
> If smp_load_acquire() and smp_store_release() are both implemented using
> lwsync on powerpc, and if Access A is a store and Access B is a load,
> then Access A and Access B can be reordered.
>
> Of course, if every other architecture will be providing RCsc implementations
> for smp_load_acquire() and smp_store_release(), which would not be a bad
> thing, then another approach is for powerpc to use sync rather than lwsync
> for one or the other of smp_load_acquire() or smp_store_release().
Can we count on the xchg() at the beginning of mcs_spin_lock() to
provide a memory barrier? It should provide an implicit memory
barrier according to the memory-barriers document.
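
For reference, the lock side being discussed looks roughly like this
(a condensed sketch; details may differ slightly from the patch):

	void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
	{
		struct mcs_spinlock *prev;

		node->locked = 0;
		node->next = NULL;

		prev = xchg(lock, node);	/* implicit full barrier here */
		if (likely(prev == NULL))
			return;			/* uncontended: lock acquired */

		ACCESS_ONCE(prev->next) = node;
		/* Contended: wait for the previous holder to hand the lock over. */
		while (!smp_load_acquire(&node->locked))
			arch_mutex_cpu_relax();
	}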
Thanks.
Tim
>
> Thanx, Paul
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 18:43 ` Tim Chen
@ 2013-11-20 19:06 ` Paul E. McKenney
2013-11-20 20:36 ` Tim Chen
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 19:06 UTC (permalink / raw)
To: Tim Chen
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 10:43:46AM -0800, Tim Chen wrote:
> On Wed, 2013-11-20 at 09:14 -0800, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> > > Hi Paul,
> > >
> > > On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > > > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > > > while (!(next = ACCESS_ONCE(node->next)))
> > > > > arch_mutex_cpu_relax();
> > > > > }
> > > > > - ACCESS_ONCE(next->locked) = 1;
> > > > > - smp_wmb();
> > > > > + /*
> > > > > + * Pass lock to next waiter.
> > > > > + * smp_store_release() provides a memory barrier to ensure
> > > > > + * all operations in the critical section has been completed
> > > > > + * before unlocking.
> > > > > + */
> > > > > + smp_store_release(&next->locked, 1);
> > > >
> > > > However, there is one problem with this that I missed yesterday.
> > > >
> > > > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > > > provide a full barrier, but this is not guaranteed if we use
> > > > smp_store_release() for unlock and smp_load_acquire() for lock.
> > > > At least one of these needs a full memory barrier.
> > >
> > > Hmm, so in the following case:
> > >
> > > Access A
> > > unlock() /* release semantics */
> > > lock() /* acquire semantics */
> > > Access B
> > >
> > > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> > >
> > > I agree that accesses between the unlock and the lock can be move across both
> > > A and B, but that doesn't seem to matter by my reading of the above.
> > >
> > > What is the problematic scenario you have in mind? Are you thinking of the
> > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > which I don't think any architectures supported by Linux implement...
> > > (ARMv8 acquire/release is RCsc).
> >
> > If smp_load_acquire() and smp_store_release() are both implemented using
> > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > then Access A and Access B can be reordered.
> >
> > Of course, if every other architecture will be providing RCsc implementations
> > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > thing, then another approach is for powerpc to use sync rather than lwsync
> > for one or the other of smp_load_acquire() or smp_store_release().
>
> Can we count on the xchg function in the beginning of mcs_lock to
> provide a memory barrier? It should provide an implicit memory
> barrier according to the memory-barriers document.
The problem with the implicit full barrier associated with the xchg()
function is that it is in the wrong place if the lock is contended.
We need to ensure that the previous lock holder's critical section
is seen by everyone to precede that of the next lock holder, and
we need transitivity. The only operations that are in the right place
to force the needed ordering in the contended case are those involved
in the lock handoff. :-(
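
For reference, the handoff lives in mcs_spin_unlock(); a condensed sketch
(details may differ slightly from the patch), where the release store at
the end is the only ordering operation in the contended handoff:

	void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
	{
		struct mcs_spinlock *next = ACCESS_ONCE(node->next);

		if (likely(!next)) {
			/* No successor visible: try to free the lock. */
			if (cmpxchg(lock, node, NULL) == node)
				return;
			/* A successor is queueing itself; wait for it to appear. */
			while (!(next = ACCESS_ONCE(node->next)))
				arch_mutex_cpu_relax();
		}
		/* Contended handoff: pass the lock to the next waiter. */
		smp_store_release(&next->locked, 1);
	}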
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 19:06 ` Paul E. McKenney
@ 2013-11-20 20:36 ` Tim Chen
2013-11-20 21:44 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-20 20:36 UTC (permalink / raw)
To: paulmck
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, 2013-11-20 at 11:06 -0800, Paul E. McKenney wrote:
> On Wed, Nov 20, 2013 at 10:43:46AM -0800, Tim Chen wrote:
> > On Wed, 2013-11-20 at 09:14 -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> > > > Hi Paul,
> > > >
> > > > On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > > > > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > > > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > > > > while (!(next = ACCESS_ONCE(node->next)))
> > > > > > arch_mutex_cpu_relax();
> > > > > > }
> > > > > > - ACCESS_ONCE(next->locked) = 1;
> > > > > > - smp_wmb();
> > > > > > + /*
> > > > > > + * Pass lock to next waiter.
> > > > > > + * smp_store_release() provides a memory barrier to ensure
> > > > > > + * all operations in the critical section has been completed
> > > > > > + * before unlocking.
> > > > > > + */
> > > > > > + smp_store_release(&next->locked, 1);
> > > > >
> > > > > However, there is one problem with this that I missed yesterday.
> > > > >
> > > > > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > > > > provide a full barrier, but this is not guaranteed if we use
> > > > > smp_store_release() for unlock and smp_load_acquire() for lock.
> > > > > At least one of these needs a full memory barrier.
> > > >
> > > > Hmm, so in the following case:
> > > >
> > > > Access A
> > > > unlock() /* release semantics */
> > > > lock() /* acquire semantics */
> > > > Access B
> > > >
> > > > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> > > >
> > > > I agree that accesses between the unlock and the lock can be move across both
> > > > A and B, but that doesn't seem to matter by my reading of the above.
> > > >
> > > > What is the problematic scenario you have in mind? Are you thinking of the
> > > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > > which I don't think any architectures supported by Linux implement...
> > > > (ARMv8 acquire/release is RCsc).
> > >
> > > If smp_load_acquire() and smp_store_release() are both implemented using
> > > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > > then Access A and Access B can be reordered.
> > >
> > > Of course, if every other architecture will be providing RCsc implementations
> > > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > > thing, then another approach is for powerpc to use sync rather than lwsync
> > > for one or the other of smp_load_acquire() or smp_store_release().
> >
> > Can we count on the xchg function in the beginning of mcs_lock to
> > provide a memory barrier? It should provide an implicit memory
> > barrier according to the memory-barriers document.
>
> The problem with the implicit full barrier associated with the xchg()
> function is that it is in the wrong place if the lock is contended.
> We need to ensure that the previous lock holder's critical section
> is seen by everyone to precede that of the next lock holder, and
> we need transitivity. The only operations that are in the right place
> to force the needed ordering in the contended case are those involved
> in the lock handoff. :-(
>
Paul,
I'm still scratching my head on how ACCESS A
and ACCESS B could get reordered.
The smp_store_release() in the unlock path should guarantee that
all memory operations in the previous lock holder's critical section have
been completed and seen by everyone before the store operation
that passes the lock to the next holder is seen. And the
smp_load_acquire() should guarantee that all memory operations
of the next lock holder happen after it sees that it has got the lock.
So it seems like the two critical sections should not overlap.
Does using lwsync mean that these smp_load_acquire()
and smp_store_release() guarantees no longer hold?
Tim
> Thanx, Paul
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 20:36 ` Tim Chen
@ 2013-11-20 21:44 ` Paul E. McKenney
2013-11-20 23:51 ` Tim Chen
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-20 21:44 UTC (permalink / raw)
To: Tim Chen
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 12:36:07PM -0800, Tim Chen wrote:
> On Wed, 2013-11-20 at 11:06 -0800, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2013 at 10:43:46AM -0800, Tim Chen wrote:
> > > On Wed, 2013-11-20 at 09:14 -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> > > > > Hi Paul,
> > > > >
> > > > > On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > > > > > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > > > > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > > > > > while (!(next = ACCESS_ONCE(node->next)))
> > > > > > > arch_mutex_cpu_relax();
> > > > > > > }
> > > > > > > - ACCESS_ONCE(next->locked) = 1;
> > > > > > > - smp_wmb();
> > > > > > > + /*
> > > > > > > + * Pass lock to next waiter.
> > > > > > > + * smp_store_release() provides a memory barrier to ensure
> > > > > > > + * all operations in the critical section has been completed
> > > > > > > + * before unlocking.
> > > > > > > + */
> > > > > > > + smp_store_release(&next->locked, 1);
> > > > > >
> > > > > > However, there is one problem with this that I missed yesterday.
> > > > > >
> > > > > > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > > > > > provide a full barrier, but this is not guaranteed if we use
> > > > > > smp_store_release() for unlock and smp_load_acquire() for lock.
> > > > > > At least one of these needs a full memory barrier.
> > > > >
> > > > > Hmm, so in the following case:
> > > > >
> > > > > Access A
> > > > > unlock() /* release semantics */
> > > > > lock() /* acquire semantics */
> > > > > Access B
> > > > >
> > > > > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> > > > >
> > > > > I agree that accesses between the unlock and the lock can be move across both
> > > > > A and B, but that doesn't seem to matter by my reading of the above.
> > > > >
> > > > > What is the problematic scenario you have in mind? Are you thinking of the
> > > > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > > > which I don't think any architectures supported by Linux implement...
> > > > > (ARMv8 acquire/release is RCsc).
> > > >
> > > > If smp_load_acquire() and smp_store_release() are both implemented using
> > > > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > > > then Access A and Access B can be reordered.
> > > >
> > > > Of course, if every other architecture will be providing RCsc implementations
> > > > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > > > thing, then another approach is for powerpc to use sync rather than lwsync
> > > > for one or the other of smp_load_acquire() or smp_store_release().
> > >
> > > Can we count on the xchg function in the beginning of mcs_lock to
> > > provide a memory barrier? It should provide an implicit memory
> > > barrier according to the memory-barriers document.
> >
> > The problem with the implicit full barrier associated with the xchg()
> > function is that it is in the wrong place if the lock is contended.
> > We need to ensure that the previous lock holder's critical section
> > is seen by everyone to precede that of the next lock holder, and
> > we need transitivity. The only operations that are in the right place
> > to force the needed ordering in the contended case are those involved
> > in the lock handoff. :-(
> >
>
> Paul,
>
> I'm still scratching my head on how ACCESS A
> and ACCESS B could get reordered.
>
> The smp_store_release instruction in unlock should guarantee that
> all memory operations in the previous lock holder's critical section has
> been completed and seen by everyone, before the store operation
> to set the lock for the next holder is seen. And the
> smp_load_acquire should guarantee that all memory operations
> for next lock holder happen after checking that it has got lock.
> So it seems like the two critical sections should not overlap.
>
> Does using lwsync means that these smp_load_acquire
> and smp_store_release guarantees are no longer true?
Suppose that CPU 0 stores to a variable, then releases a lock,
CPU 1 acquires that same lock and reads a second variable,
and that CPU 2 writes the second variable, does smp_mb(), and
then reads the first variable. Like this, where we replace the
spinloop by a check in the assertion:
CPU 0                CPU 1                    CPU 2
-----                -----                    -----
x = 1;               r1 = SLA(lock);          y = 1;
SSR(lock, 1);        r2 = y;                  smp_mb();
                                              r3 = x;
The SSR() is an abbreviation for smp_store_release() and the SLA()
is an abbreviation for smp_load_acquire().
Now, if an unlock and following lock act as a full memory barrier, and
given lock, x, and y all initially zero, it should not be possible to
see the following situation:
r1 == 1 && r2 == 0 && r3 == 0
The "r1 == 1" means that the lock was released, the "r2 == 1" means that
CPU 1's load from y happened before CPU 2's assignment to y, and the
"r3 == 0" means that CPU 2's load from x happened before CPU 0's store
to x. If the unlock/lock combination was acting like a full barrier,
this would be impossible. But if you implement both SSR() and SLA() with
lwsync on powerpc, this condition can in fact happen. See scenario W+RWC
on page 2 of: http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf.
This may seem strange, but when you say "lwsync" you are saying "don't
bother flushing the store buffer", which in turn allows this outcome.
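Spelled out as code for concreteness (lock, x, y and r1-r3 all initially
zero; SSR/SLA expanded to the primitives themselves):

	int x, y, lock;
	int r1, r2, r3;

	void cpu0(void) { x = 1; smp_store_release(&lock, 1); }	/* SSR() */
	void cpu1(void) { r1 = smp_load_acquire(&lock); r2 = y; }	/* SLA() */
	void cpu2(void) { y = 1; smp_mb(); r3 = x; }

	/*
	 * Forbidden if unlock+lock is a full barrier, but observable when
	 * both primitives map to lwsync: r1 == 1 && r2 == 0 && r3 == 0
	 * (the W+RWC pattern).
	 */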
So if we require that smp_load_acquire() and smp_store_release() have
RCsc semantics, which we might well want to do, then your use becomes
legal and powerpc needs smp_store_release() to have a sync instruction
rather than the lighter-weight lwsync instruction. Otherwise, you need
an smp_mb() in the lock-release handoff path.
Thoughts?
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 21:44 ` Paul E. McKenney
@ 2013-11-20 23:51 ` Tim Chen
2013-11-21 4:53 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-20 23:51 UTC (permalink / raw)
To: paulmck
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, 2013-11-20 at 13:44 -0800, Paul E. McKenney wrote:
> On Wed, Nov 20, 2013 at 12:36:07PM -0800, Tim Chen wrote:
> > On Wed, 2013-11-20 at 11:06 -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 20, 2013 at 10:43:46AM -0800, Tim Chen wrote:
> > > > On Wed, 2013-11-20 at 09:14 -0800, Paul E. McKenney wrote:
> > > > > On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> > > > > > Hi Paul,
> > > > > >
> > > > > > On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > > > > > > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > > > > > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > > > > > > while (!(next = ACCESS_ONCE(node->next)))
> > > > > > > > arch_mutex_cpu_relax();
> > > > > > > > }
> > > > > > > > - ACCESS_ONCE(next->locked) = 1;
> > > > > > > > - smp_wmb();
> > > > > > > > + /*
> > > > > > > > + * Pass lock to next waiter.
> > > > > > > > + * smp_store_release() provides a memory barrier to ensure
> > > > > > > > + * all operations in the critical section has been completed
> > > > > > > > + * before unlocking.
> > > > > > > > + */
> > > > > > > > + smp_store_release(&next->locked, 1);
> > > > > > >
> > > > > > > However, there is one problem with this that I missed yesterday.
> > > > > > >
> > > > > > > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > > > > > > provide a full barrier, but this is not guaranteed if we use
> > > > > > > smp_store_release() for unlock and smp_load_acquire() for lock.
> > > > > > > At least one of these needs a full memory barrier.
> > > > > >
> > > > > > Hmm, so in the following case:
> > > > > >
> > > > > > Access A
> > > > > > unlock() /* release semantics */
> > > > > > lock() /* acquire semantics */
> > > > > > Access B
> > > > > >
> > > > > > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> > > > > >
> > > > > > I agree that accesses between the unlock and the lock can be move across both
> > > > > > A and B, but that doesn't seem to matter by my reading of the above.
> > > > > >
> > > > > > What is the problematic scenario you have in mind? Are you thinking of the
> > > > > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > > > > which I don't think any architectures supported by Linux implement...
> > > > > > (ARMv8 acquire/release is RCsc).
> > > > >
> > > > > If smp_load_acquire() and smp_store_release() are both implemented using
> > > > > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > > > > then Access A and Access B can be reordered.
> > > > >
> > > > > Of course, if every other architecture will be providing RCsc implementations
> > > > > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > > > > thing, then another approach is for powerpc to use sync rather than lwsync
> > > > > for one or the other of smp_load_acquire() or smp_store_release().
> > > >
> > > > Can we count on the xchg function in the beginning of mcs_lock to
> > > > provide a memory barrier? It should provide an implicit memory
> > > > barrier according to the memory-barriers document.
> > >
> > > The problem with the implicit full barrier associated with the xchg()
> > > function is that it is in the wrong place if the lock is contended.
> > > We need to ensure that the previous lock holder's critical section
> > > is seen by everyone to precede that of the next lock holder, and
> > > we need transitivity. The only operations that are in the right place
> > > to force the needed ordering in the contended case are those involved
> > > in the lock handoff. :-(
> > >
> >
> > Paul,
> >
> > I'm still scratching my head on how ACCESS A
> > and ACCESS B could get reordered.
> >
> > The smp_store_release instruction in unlock should guarantee that
> > all memory operations in the previous lock holder's critical section has
> > been completed and seen by everyone, before the store operation
> > to set the lock for the next holder is seen. And the
> > smp_load_acquire should guarantee that all memory operations
> > for next lock holder happen after checking that it has got lock.
> > So it seems like the two critical sections should not overlap.
> >
> > Does using lwsync means that these smp_load_acquire
> > and smp_store_release guarantees are no longer true?
>
Thanks for the detailed explanation.
> Suppose that CPU 0 stores to a variable, then releases a lock,
> CPU 1 acquires that same lock and reads a second variable,
> and that CPU 2 writes the second variable, does smp_mb(), and
> then reads the first variable. Like this, where we replace the
> spinloop by a check in the assertion:
>
> CPU 0                CPU 1                    CPU 2
> -----                -----                    -----
> x = 1;               r1 = SLA(lock);          y = 1;
> SSR(lock, 1);        r2 = y;                  smp_mb();
>                                               r3 = x;
>
> The SSR() is an abbreviation for smp_store_release() and the SLA()
> is an abbreviation for smp_load_acquire().
>
> Now, if an unlock and following lock act as a full memory barrier, and
> given lock, x, and y all initially zero, it should not be possible to
> see the following situation:
>
> r1 == 1 && r2 == 0 && r3 == 0
>
> The "r1 == 1" means that the lock was released, the "r2 == 1" means that
You mean "r2 == 0"?
> CPU 1's load from y happened before CPU 2's assignment to y, and the
> "r3 == 0" means that CPU 2's load from x happened before CPU 0's store
> to x. If the unlock/lock combination was acting like a full barrier,
> this would be impossible. But if you implement both SSR() and SLA() with
> lwsync on powerpc, this condition can in fact happen. See scenario W+RWC
> on page 2 of: http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf.
>
> This may seem strange, but when you say "lwsync" you are saying "don't
> bother flushing the store buffer", which in turn allows this outcome.
Yes, this outcome is certainly not expected. I find the behavior
somewhat at odds with the memory barrier documentation:
"The use of ACQUIRE and RELEASE operations generally precludes the need
for other sorts of memory barrier (but note the exceptions mentioned in
the subsection "MMIO write barrier")."
>
> So if we require that smp_load_acquire() and smp_store_release() have
> RCsc semantics, which we might well want to do, then your use becomes
> legal and powerpc needs smp_store_release() to have a sync instruction
> rather than the lighter-weight lwsync instruction. Otherwise, you need
> an smp_mb() in the lock-release handoff path.
>
> Thoughts?
If we intend to use smp_load_acquire() and smp_store_release() extensively
for locks, making RCsc semantics the default will simplify things a lot.
Thanks.
Tim
>
> Thanx, Paul
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 23:51 ` Tim Chen
@ 2013-11-21 4:53 ` Paul E. McKenney
2013-11-21 10:17 ` Will Deacon
` (2 more replies)
0 siblings, 3 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 4:53 UTC (permalink / raw)
To: Tim Chen
Cc: Will Deacon, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Wed, Nov 20, 2013 at 03:51:54PM -0800, Tim Chen wrote:
> On Wed, 2013-11-20 at 13:44 -0800, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2013 at 12:36:07PM -0800, Tim Chen wrote:
> > > On Wed, 2013-11-20 at 11:06 -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 20, 2013 at 10:43:46AM -0800, Tim Chen wrote:
> > > > > On Wed, 2013-11-20 at 09:14 -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Nov 20, 2013 at 03:46:43PM +0000, Will Deacon wrote:
> > > > > > > Hi Paul,
> > > > > > >
> > > > > > > On Wed, Nov 20, 2013 at 03:31:23PM +0000, Paul E. McKenney wrote:
> > > > > > > > On Tue, Nov 19, 2013 at 05:37:43PM -0800, Tim Chen wrote:
> > > > > > > > > @@ -68,7 +72,12 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> > > > > > > > > while (!(next = ACCESS_ONCE(node->next)))
> > > > > > > > > arch_mutex_cpu_relax();
> > > > > > > > > }
> > > > > > > > > - ACCESS_ONCE(next->locked) = 1;
> > > > > > > > > - smp_wmb();
> > > > > > > > > + /*
> > > > > > > > > + * Pass lock to next waiter.
> > > > > > > > > + * smp_store_release() provides a memory barrier to ensure
> > > > > > > > > + * all operations in the critical section has been completed
> > > > > > > > > + * before unlocking.
> > > > > > > > > + */
> > > > > > > > > + smp_store_release(&next->locked, 1);
> > > > > > > >
> > > > > > > > However, there is one problem with this that I missed yesterday.
> > > > > > > >
> > > > > > > > Documentation/memory-barriers.txt requires that an unlock-lock pair
> > > > > > > > provide a full barrier, but this is not guaranteed if we use
> > > > > > > > smp_store_release() for unlock and smp_load_acquire() for lock.
> > > > > > > > At least one of these needs a full memory barrier.
> > > > > > >
> > > > > > > Hmm, so in the following case:
> > > > > > >
> > > > > > > Access A
> > > > > > > unlock() /* release semantics */
> > > > > > > lock() /* acquire semantics */
> > > > > > > Access B
> > > > > > >
> > > > > > > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> > > > > > >
> > > > > > > I agree that accesses between the unlock and the lock can be move across both
> > > > > > > A and B, but that doesn't seem to matter by my reading of the above.
> > > > > > >
> > > > > > > What is the problematic scenario you have in mind? Are you thinking of the
> > > > > > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > > > > > which I don't think any architectures supported by Linux implement...
> > > > > > > (ARMv8 acquire/release is RCsc).
> > > > > >
> > > > > > If smp_load_acquire() and smp_store_release() are both implemented using
> > > > > > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > > > > > then Access A and Access B can be reordered.
> > > > > >
> > > > > > Of course, if every other architecture will be providing RCsc implementations
> > > > > > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > > > > > thing, then another approach is for powerpc to use sync rather than lwsync
> > > > > > for one or the other of smp_load_acquire() or smp_store_release().
> > > > >
> > > > > Can we count on the xchg function in the beginning of mcs_lock to
> > > > > provide a memory barrier? It should provide an implicit memory
> > > > > barrier according to the memory-barriers document.
> > > >
> > > > The problem with the implicit full barrier associated with the xchg()
> > > > function is that it is in the wrong place if the lock is contended.
> > > > We need to ensure that the previous lock holder's critical section
> > > > is seen by everyone to precede that of the next lock holder, and
> > > > we need transitivity. The only operations that are in the right place
> > > > to force the needed ordering in the contended case are those involved
> > > > in the lock handoff. :-(
> > > >
> > >
> > > Paul,
> > >
> > > I'm still scratching my head on how ACCESS A
> > > and ACCESS B could get reordered.
> > >
> > > The smp_store_release instruction in unlock should guarantee that
> > > all memory operations in the previous lock holder's critical section has
> > > been completed and seen by everyone, before the store operation
> > > to set the lock for the next holder is seen. And the
> > > smp_load_acquire should guarantee that all memory operations
> > > for next lock holder happen after checking that it has got lock.
> > > So it seems like the two critical sections should not overlap.
> > >
> > > Does using lwsync means that these smp_load_acquire
> > > and smp_store_release guarantees are no longer true?
> >
>
> Thanks for the detailed explanation.
>
> > Suppose that CPU 0 stores to a variable, then releases a lock,
> > CPU 1 acquires that same lock and reads a second variable,
> > and that CPU 2 writes the second variable, does smp_mb(), and
> > then reads the first variable. Like this, where we replace the
> > spinloop by a check in the assertion:
> >
> > CPU 0                CPU 1                    CPU 2
> > -----                -----                    -----
> > x = 1;               r1 = SLA(lock);          y = 1;
> > SSR(lock, 1);        r2 = y;                  smp_mb();
> >                                               r3 = x;
> >
> > The SSR() is an abbreviation for smp_store_release() and the SLA()
> > is an abbreviation for smp_load_acquire().
> >
> > Now, if an unlock and following lock act as a full memory barrier, and
> > given lock, x, and y all initially zero, it should not be possible to
> > see the following situation:
> >
> > r1 == 1 && r2 == 0 && r3 == 0
> >
> > The "r1 == 1" means that the lock was released, the "r2 == 1" means that
>
> You mean "r2 == 0"?
I do, good catch!
> > CPU 1's load from y happened before CPU 2's assignment to y, and the
> > "r3 == 0" means that CPU 2's load from x happened before CPU 0's store
> > to x. If the unlock/lock combination was acting like a full barrier,
> > this would be impossible. But if you implement both SSR() and SLA() with
> > lwsync on powerpc, this condition can in fact happen. See scenario W+RWC
> > on page 2 of: http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf.
> >
> > This may seem strange, but when you say "lwsync" you are saying "don't
> > bother flushing the store buffer", which in turn allows this outcome.
>
> Yes, this outcome is certainly not expected. I find the behavior
> somewhat at odds with the memory barrier documentation:
>
> "The use of ACQUIRE and RELEASE operations generally precludes the need
> for other sorts of memory barrier (but note the exceptions mentioned in
> the subsection "MMIO write barrier")."
Well, ACQUIRE and RELEASE can do a great number of things, just not
everything.
> > So if we require that smp_load_acquire() and smp_store_release() have
> > RCsc semantics, which we might well want to do, then your use becomes
> > legal and powerpc needs smp_store_release() to have a sync instruction
> > rather than the lighter-weight lwsync instruction. Otherwise, you need
> > an smp_mb() in the lock-release handoff path.
> >
> > Thoughts?
>
> If we intend to use smp_load_acquire and smp_store_release extensively
> for locks, making RCsc semantics the default will simply things a lot.
The other option is to weaken lock semantics so that unlock-lock no
longer implies a full barrier, but I believe that we would regret taking
that path. (It would be OK by me, I would just add a few smp_mb()
calls on various slowpaths in RCU. But...)
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 4:53 ` Paul E. McKenney
@ 2013-11-21 10:17 ` Will Deacon
2013-11-21 13:16 ` Paul E. McKenney
2013-11-21 10:45 ` Peter Zijlstra
2013-11-21 22:27 ` Linus Torvalds
2 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-21 10:17 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Thu, Nov 21, 2013 at 04:53:33AM +0000, Paul E. McKenney wrote:
> On Wed, Nov 20, 2013 at 03:51:54PM -0800, Tim Chen wrote:
> > If we intend to use smp_load_acquire and smp_store_release extensively
> > for locks, making RCsc semantics the default will simply things a lot.
>
> The other option is to weaken lock semantics so that unlock-lock no
> longer implies a full barrier, but I believe that we would regret taking
> that path. (It would be OK by me, I would just add a few smp_mb()
> calls on various slowpaths in RCU. But...)
Unsurprisingly, my vote is for RCsc semantics.
One major advantage (in my opinion) of the acquire/release accessors is that
they feel intuitive in an area where intuition is hardly rife. I believe
that the additional reordering permitted by RCpc detracts from the relative
simplicity of what is currently being proposed.
Will
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 10:17 ` Will Deacon
@ 2013-11-21 13:16 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 13:16 UTC (permalink / raw)
To: Will Deacon
Cc: Tim Chen, Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Raghavendra K T, George Spelvin,
H. Peter Anvin, Arnd Bergmann, Aswin Chandramouleeswaran,
Scott J Norton, Figo.zhang
On Thu, Nov 21, 2013 at 10:17:36AM +0000, Will Deacon wrote:
> On Thu, Nov 21, 2013 at 04:53:33AM +0000, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2013 at 03:51:54PM -0800, Tim Chen wrote:
> > > If we intend to use smp_load_acquire and smp_store_release extensively
> > > for locks, making RCsc semantics the default will simply things a lot.
> >
> > The other option is to weaken lock semantics so that unlock-lock no
> > longer implies a full barrier, but I believe that we would regret taking
> > that path. (It would be OK by me, I would just add a few smp_mb()
> > calls on various slowpaths in RCU. But...)
>
> Unsurprisingly, my vote is for RCsc semantics.
That was in fact my guess. ;-)
> One major advantage (in my opinion) of the acquire/release accessors is that
> they feel intuitive in an area where intuition is hardly rife. I believe
> that the additional reordering permitted by RCpc detracts from the relative
> simplicity of what is currently being proposed.
Fair point! Let's see what others (both hackers and architectures) say.
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 4:53 ` Paul E. McKenney
2013-11-21 10:17 ` Will Deacon
@ 2013-11-21 10:45 ` Peter Zijlstra
2013-11-21 13:18 ` Paul E. McKenney
2013-11-21 22:27 ` Linus Torvalds
2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-21 10:45 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Wed, Nov 20, 2013 at 08:53:33PM -0800, Paul E. McKenney wrote:
> The other option is to weaken lock semantics so that unlock-lock no
> longer implies a full barrier, but I believe that we would regret taking
> that path. (It would be OK by me, I would just add a few smp_mb()
> calls on various slowpaths in RCU. But...)
Please no, I know we rely on it in a number of places, I just can't
remember where all those were :/
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 10:45 ` Peter Zijlstra
@ 2013-11-21 13:18 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 13:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 11:45:03AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 08:53:33PM -0800, Paul E. McKenney wrote:
> > The other option is to weaken lock semantics so that unlock-lock no
> > longer implies a full barrier, but I believe that we would regret taking
> > that path. (It would be OK by me, I would just add a few smp_mb()
> > calls on various slowpaths in RCU. But...)
>
> Please no, I know we rely on it in a number of places, I just can't
> remember where all those were :/
;-) ;-) ;-)
Yeah, I would also have to overprovision smp_mb()s in a number of
places. Then again, I know that I don't rely on this on any of
RCU's fastpaths.
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 4:53 ` Paul E. McKenney
2013-11-21 10:17 ` Will Deacon
2013-11-21 10:45 ` Peter Zijlstra
@ 2013-11-21 22:27 ` Linus Torvalds
2013-11-21 22:52 ` Paul E. McKenney
2 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-21 22:27 UTC (permalink / raw)
To: Paul McKenney
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Wed, Nov 20, 2013 at 8:53 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> The other option is to weaken lock semantics so that unlock-lock no
> longer implies a full barrier, but I believe that we would regret taking
> that path. (It would be OK by me, I would just add a few smp_mb()
> calls on various slowpaths in RCU. But...)
Hmm. I *thought* we already did that, exactly because some
architecture already hit this issue, and we got rid of some of the
more subtle "this works because.."
No?
Anyway, isn't "unlock+lock" fundamentally guaranteed to be a memory
barrier? Anything before the unlock cannot possibly migrate down below
the unlock, and anything after the lock must not possibly migrate up
to before the lock? If either of those happens, then something has
migrated out of the critical region, which is against the whole point
of locking..
It's the "lock+unlock" where it's possible that something before the
lock might migrate *into* the critical region (ie after the lock), and
something after the unlock might similarly migrate to precede the
unlock, so you could end up having out-of-order accesses across a
lock/unlock sequence (that both happen "inside" the lock, but there is
no guaranteed ordering between the two accesses themselves).
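As a sketch of that lock+unlock case (hypothetical lock "l" and
variables a, b, r; illustrative only):

	static DEFINE_SPINLOCK(l);
	int a, b, r;

	void example(void)
	{
		a = 1;			/* may be delayed past spin_lock() */
		spin_lock(&l);
		/* ... critical section ... */
		spin_unlock(&l);
		r = b;			/* may be hoisted above spin_unlock() */

		/*
		 * Both accesses can end up "inside" the lock, so their order
		 * relative to each other is not guaranteed, even though
		 * neither escapes the lock/unlock pair.
		 */
	}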
Or am I confused? The one major reason for strong memory ordering is
that weak ordering is too f*cking easy to get wrong on a software
level, and even people who know about it will make mistakes.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 22:27 ` Linus Torvalds
@ 2013-11-21 22:52 ` Paul E. McKenney
2013-11-22 0:09 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 22:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 02:27:01PM -0800, Linus Torvalds wrote:
> On Wed, Nov 20, 2013 at 8:53 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > The other option is to weaken lock semantics so that unlock-lock no
> > longer implies a full barrier, but I believe that we would regret taking
> > that path. (It would be OK by me, I would just add a few smp_mb()
> > calls on various slowpaths in RCU. But...)
>
> Hmm. I *thought* we already did that, exactly because some
> architecture already hit this issue, and we got rid of some of the
> more subtle "this works because.."
>
> No?
>
> Anyway, isn't "unlock+lock" fundamentally guaranteed to be a memory
> barrier? Anything before the unlock cannot possibly migrate down below
> the unlock, and anything after the lock must not possibly migrate up
> to before the lock? If either of those happens, then something has
> migrated out of the critical region, which is against the whole point
> of locking..
Actually, the weakest forms of locking only guarantee a consistent view
of memory if you are actually holding the lock. Not "a" lock, but "the"
lock. The trick is that use of a common lock variable short-circuits
the transitivity that would otherwise be required, which in turn
allows cheaper memory barriers to be used. But when implementing these
weakest forms of locking (which Peter and Tim inadvertently did with the
combination of MCS lock and a PPC implementation of smp_load_acquire()
and smp_store_release() that used lwsync), then "unlock+lock" is no
longer guaranteed to be a memory barrier.
Which is why I (admittedly belatedly) complained.
So the three fixes I know of at the moment are:
1. Upgrade smp_store_release()'s PPC implementation from lwsync
to sync.
What about ARM? ARM platforms that have the load-acquire and
store-release instructions could use them, but other ARM
platforms have to use dmb. ARM avoids PPC's lwsync issue
because it has no equivalent to lwsync.
2. Place an explicit smp_mb() into the MCS-lock queued handoff
code.
3. Remove the requirement that "unlock+lock" be a full memory
barrier.
We have been leaning towards #1, but before making any hard decision
on this we are looking more closely at what the situation is on other
architectures.
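
Concretely, option #2 would amount to something like the following in the
queued-handoff path (a sketch assuming the smp_mb() sits just before the
handoff store; not a posted patch):

	/* mcs_spin_unlock(), contended handoff, with option #2 applied: */
	smp_mb();				/* full barrier in the handoff path */
	smp_store_release(&next->locked, 1);	/* pass the lock as before */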
> It's the "lock+unlock" where it's possible that something before the
> lock might migrate *into* the critical region (ie after the lock), and
> something after the unlock might similarly migrate to precede the
> unlock, so you could end up having out-of-order accesses across a
> lock/unlock sequence (that both happen "inside" the lock, but there is
> no guaranteed ordering between the two accesses themselves).
Agreed.
> Or am I confused? The one major reason for strong memory ordering is
> that weak ordering is too f*cking easy to get wrong on a software
> level, and even people who know about it will make mistakes.
Guilty to charges as read! ;-)
That is a major reason why I am leaning towards #1 on the list above.
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 22:52 ` Paul E. McKenney
@ 2013-11-22 0:09 ` Linus Torvalds
2013-11-22 4:08 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-22 0:09 UTC (permalink / raw)
To: Paul McKenney
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 2:52 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Actually, the weakest forms of locking only guarantee a consistent view
> of memory if you are actually holding the lock. Not "a" lock, but "the"
> lock.
I don't think we necessarily support any architecture that does that,
though. And afaik, it's almost impossible to actually do sanely in
hardware with any sane cache coherency, so..
So realistically, I think we only really need to worry about memory
ordering that is tied to cache coherency protocols, where even locking
rules tend to be about memory ordering (although extended rules like
acquire/release rather than the broken pure barrier model).
Do you know any actual architecture where this isn't the case?
> So the three fixes I know of at the moment are:
>
> 1. Upgrade smp_store_release()'s PPC implementation from lwsync
> to sync.
>
> What about ARM? ARM platforms that have the load-acquire and
> store-release instructions could use them, but other ARM
> platforms have to use dmb. ARM avoids PPC's lwsync issue
> because it has no equivalent to lwsync.
>
> 2. Place an explicit smp_mb() into the MCS-lock queued handoff
> code.
>
> 3. Remove the requirement that "unlock+lock" be a full memory
> barrier.
>
> We have been leaning towards #1, but before making any hard decision
> on this we are looking more closely at what the situation is on other
> architectures.
So I might be inclined to lean towards #1 simply because of test coverage.
We have no sane test coverage of weakly ordered models. Sure, ARM may
be weakly ordered (with saner acquire/release in ARM64), but
realistically, no existing ARM platforms actually gives us any
reasonable test *coverage* for things like this, despite having tons
of chips out there running Linux. Very few people debug problems in
that world. The PPC people probably have much better testing and are
more likely to figure out the bugs, but don't have the pure number of
machines. So x86 tends to still remain the main platform where serious
testing gets done.
That said, I'd still be perfectly happy with #3, since - unlike, say,
the PCI ordering issues with drivers - at least people *can* try to
think about this somewhat analytically, even if it's ripe for
confusion and subtle mistakes. And I still think you got the ordering
wrong, and should be talking about "lock+unlock" rather than
"unlock+lock".
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 0:09 ` Linus Torvalds
@ 2013-11-22 4:08 ` Paul E. McKenney
2013-11-22 4:25 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 4:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 04:09:49PM -0800, Linus Torvalds wrote:
> On Thu, Nov 21, 2013 at 2:52 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Actually, the weakest forms of locking only guarantee a consistent view
> > of memory if you are actually holding the lock. Not "a" lock, but "the"
> > lock.
>
> I don't think we necessarily support any architecture that does that,
> though. And afaik, it's almost impossible to actually do sanely in
> hardware with any sane cache coherency, so..
It is not the architecture that matters here, it is just a definition of
what ordering guarantees the locking primitives provide, independent of
the architecture. In the Linux kernel, we just happen to have picked
a strongly ordered definition of locking. And we have probably made
the right choice. But there really are environments where unlock+lock
does -not- provide a full memory barrier. If you never refer to any
lock-protected memory without holding the lock, this difference doesn't
matter to you -- there will be nothing you can do to detect the difference
under this restriction. Of course, if you want to scale, you -will-
find yourself accessing lock-protected memory without holding locks, so
in the Linux kernel, this difference in unlock+lock ordering semantics
really does matter.
> So realistically, I think we only really need to worry about memory
> ordering that is tied to cache coherency protocols, where even locking
> rules tend to be about memory ordering (although extended rules like
> acquire/release rather than the broken pure barrier model).
>
> Do you know any actual architecture where this isn't the case?
You can implement something that acts like a lock (just not like a
-Linux- -kernel- lock) but where unlock+lock does not provide a full
barrier on architectures that provide weak memory barriers. And there
are software environments that provide these weaker locks. Which does
-not- necessarily mean that the Linux kernel should do this, of course!
???
OK, part of the problem is that this discussion has spanned at least
three different threads over the past week or so. I will try to
summarize. Others can correct me if I blow it.
a. We found out that some of the circular-buffer code is unsafe.
I first insisted on inserting a full memory barrier, but it was
eventually determined that weaker memory barriers could suffice.
This was the thread where I proposed the smp_tmb(), which name you
(quite rightly) objected to. This proposal eventually morphed
into smp_load_acquire() and smp_store_release().
b. For circular buffers, you need really minimal ordering semantics
out of smp_load_acquire() and smp_store_release(). In particular,
on powerpc, the lwsync instruction suffices. Peter therefore came
up with an implementation matching those weak semantics.
c. In a couple of threads involving MCS lock, I complained about
insufficient barriers, again suggesting adding smp_mb(). (Almost
seems like a theme here...) Someone noted that ARM64's shiny new
load-acquire and store-release instructions sufficed (which does
appear to be true). Tim therefore came up with an implementation
based on Peter's smp_load_acquire() and smp_store_release().
The smp_store_release() is used when the lock holder hands off
to the next in the queue, and the smp_load_acquire() is used when
the next in the queue notices that the lock has been handed off.
So we really are talking about the unlock+lock case!!!
d. Unfortunately (or fortunately, depending on your viewpoint),
locking as defined by the Linux kernel requires stronger
smp_load_acquire() and smp_store_release() semantics than are
required by the circular buffer. In particular, the weaker
smp_load_acquire() and smp_store_release() semantics provided by
the original powerpc implementation do not provide a full memory
barrier for the unlock+lock handoff on the queue. I (finally)
noticed this and complained.
e. The question then was "how to fix this?" There are a number
of ways, namely these guys from two emails ago plus one more:
> > So the three fixes I know of at the moment are:
> >
> > 1. Upgrade smp_store_release()'s PPC implementation from lwsync
> > to sync.
We are going down this path. I produced what I believe
to be a valid proof that the x86 versions do provide
a full barrier for unlock+lock, which Peter will check
tomorrow (Friday) morning. Itanium is trivial (famous
last words), ARM is also trivial (again, famous last
words), and if x86 works, then so does s390. And so on.
My alleged proof for x86 is here, should anyone else
like to take a crack at it:
http://www.spinics.net/lists/linux-mm/msg65462.html
> > What about ARM? ARM platforms that have the load-acquire and
> > store-release instructions could use them, but other ARM
> > platforms have to use dmb. ARM avoids PPC's lwsync issue
> > because it has no equivalent to lwsync.
> >
> > 2. Place an explicit smp_mb() into the MCS-lock queued handoff
> > code.
This would allow unlock+lock to be a full memory barrier,
but would allow the weaker and cheaper semantics for
smp_load_acquire() and smp_store_release(); a rough sketch
of what this might look like follows the list.
> > 3. Remove the requirement that "unlock+lock" be a full memory
> > barrier.
This would allow cheaper locking primitives on some
architectures, but would require more care when
making unlocked accesses to variables protected by
locks.
4. Have two parallel APIs, smp_store_release_weak(),
smp_store_release(), and so on. My reaction to this
is "just say no". It is not like we exactly have a
shortage of memory-barrier APIs at the moment.
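For concreteness, #2 might look something like the sketch below on the
acquire side of the queued handoff (just my illustration of the idea,
not an actual patch; the surrounding code is as quoted later in this
thread):

	/* Queued-handoff acquire path, option #2: explicit barrier. */
	while (!(smp_load_acquire(&node->locked)))
		arch_mutex_cpu_relax();
	smp_mb();  /* so that unlock+lock still implies a full barrier */

The weak smp_load_acquire()/smp_store_release() would stay as they
are, and the handoff itself would supply the extra ordering.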
> > We have been leaning towards #1, but before making any hard decision
> > on this we are looking more closely at what the situation is on other
> > architectures.
>
> So I might be inclined to lean towards #1 simply because of test coverage.
Another reason is "Who knows what code in the Linux kernel might be
relying on unlock+lock providing a full barrier?"
> We have no sane test coverage of weakly ordered models. Sure, ARM may
> be weakly ordered (with saner acquire/release in ARM64), but
> realistically, no existing ARM platforms actually gives us any
> reasonable test *coverage* for things like this, despite having tons
> of chips out there running Linux. Very few people debug problems in
> that world. The PPC people probably have much better testing and are
> more likely to figure out the bugs, but don't have the pure number of
> machines. So x86 tends to still remain the main platform where serious
> testing gets done.
We clearly need something like rcutorture, but for locking. No two
ways about it. But we need that regardless of whether or not we
change the ordering semantics of unlock+lock.
> That said, I'd still be perfectly happy with #3, since - unlike, say,
> the PCI ordering issues with drivers - at least people *can* try to
> think about this somewhat analytically, even if it's ripe for
> confusion and subtle mistakes. And I still think you got the ordering
> wrong, and should be talking about "lock+unlock" rather than
> "unlock+lock".
No, I really am talking about unlock+lock. The MCS queue handoff is an
unlock followed by a lock, and that is what got weakened to no longer
imply a full memory barrier on all architectures when the MCS patch
started using smp_load_acquire() and smp_store_release().
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 4:08 ` Paul E. McKenney
@ 2013-11-22 4:25 ` Linus Torvalds
2013-11-22 6:23 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-22 4:25 UTC (permalink / raw)
To: Paul McKenney
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 8:08 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> It is not the architecture that matters here, it is just a definition of
> what ordering guarantees the locking primitives provide, independent of
> the architecture.
So we definitely come from very different backgrounds.
I don't care one *whit* about theoretical lock orderings. Not a bit.
I do care deeply about reality, particularly of architectures that
actually matter. To me, a spinlock in some theoretical case is
uninteresting, but an efficient spinlock implementation on a real
architecture is a big deal that matters a lot.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 4:25 ` Linus Torvalds
@ 2013-11-22 6:23 ` Paul E. McKenney
2013-11-22 15:16 ` Ingo Molnar
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 6:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 08:25:59PM -0800, Linus Torvalds wrote:
> On Thu, Nov 21, 2013 at 8:08 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > It is not the architecture that matters here, it is just a definition of
> > what ordering guarantees the locking primitives provide, independent of
> > the architecture.
>
> So we definitely come from very different backgrounds.
Agreed, and I am pretty sure we are talking past each other.
> I don't care one *whit* about theoretical lock orderings. Not a bit.
If by theoretical lock orderings, you mean whether or not unlock+lock
acts as a full memory barrier, we really do have some code in the kernel
that relies on this. So we either need to find and change this
code or we need unlock+lock to continue to act as a full memory barrier.
Making RCU stop relying on this is easy because all the code that assumes
unlock+lock is a full barrier is on slow paths anyway. Other subsystems
might be in different situations.
If you mean something else by theoretical lock orderings, I am sorry,
but I am completely failing to see what it might be.
> I do care deeply about reality, particularly of architectures that
> actually matter. To me, a spinlock in some theoretical case is
> uninteresting, but an efficient spinlock implementation on a real
> architecture is a big deal that matters a lot.
Agreed, reality and efficiency are the prime concerns. Theory serves
reality and efficiency, but definitely not the other way around.
But if we want locking primitives that don't rely solely on atomic
instructions (such as the queued locks that people have been putting
forward), we are going to need to wade through a fair bit of theory
to make sure that they actually work on real hardware. Subtle bugs in
locking primitives is a type of reality that I think we can both agree
that we should avoid.
Or am I missing your point?
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 6:23 ` Paul E. McKenney
@ 2013-11-22 15:16 ` Ingo Molnar
2013-11-22 18:49 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2013-11-22 15:16 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Nov 21, 2013 at 08:25:59PM -0800, Linus Torvalds wrote:
>
> [...]
>
> > I do care deeply about reality, particularly of architectures that
> > actually matter. To me, a spinlock in some theoretical case is
> > uninteresting, but an efficient spinlock implementation on a real
> > architecture is a big deal that matters a lot.
>
> Agreed, reality and efficiency are the prime concerns. Theory
> serves reality and efficiency, but definitely not the other way
> around.
>
> But if we want locking primitives that don't rely solely on atomic
> instructions (such as the queued locks that people have been putting
> forward), we are going to need to wade through a fair bit of theory
> to make sure that they actually work on real hardware. Subtle bugs
> in locking primitives is a type of reality that I think we can both
> agree that we should avoid.
>
> Or am I missing your point?
I think one point Linus wanted to make is that it's not true that Linux
has to offer a barrier and locking model that panders to the weakest
(and craziest!) memory ordering model amongst all the possible Linux
platforms - theoretical or real metal.
Instead what we want to do is to consciously, intelligently _pick_ a
sane, maintainable memory model and offer primitives for that - at
least as far as generic code is concerned. Each architecture can map
those primitives to the best of its abilities.
Because as we increase abstraction, as we allow more and more complex
memory ordering details, so do maintainability and robustness
decrease. So there's a very real crossover point beyond which
increased smarts will actually hurt our code in real life.
[ Same goes for compilers, we draw a line: for example we generally
turn off strict aliasing optimizations, or we turn off NULL pointer
check elimination optimizations. ]
I'm not saying this to not discuss theoretical complexities - I'm just
saying that the craziest memory ordering complexities are probably
best dealt with by agreeing not to use them ;-)
Thanks,
Ingo
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 15:16 ` Ingo Molnar
@ 2013-11-22 18:49 ` Paul E. McKenney
2013-11-22 19:06 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 18:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 04:16:00PM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>
> > On Thu, Nov 21, 2013 at 08:25:59PM -0800, Linus Torvalds wrote:
> >
> > [...]
> >
> > > I do care deeply about reality, particularly of architectures that
> > > actually matter. To me, a spinlock in some theoretical case is
> > > uninteresting, but an efficient spinlock implementation on a real
> > > architecture is a big deal that matters a lot.
> >
> > Agreed, reality and efficiency are the prime concerns. Theory
> > serves reality and efficiency, but definitely not the other way
> > around.
> >
> > But if we want locking primitives that don't rely solely on atomic
> > instructions (such as the queued locks that people have been putting
> > forward), we are going to need to wade through a fair bit of theory
> > to make sure that they actually work on real hardware. Subtle bugs
> > in locking primitives is a type of reality that I think we can both
> > agree that we should avoid.
> >
> > Or am I missing your point?
>
> I think one point Linus wanted to make is that it's not true that Linux
> has to offer a barrier and locking model that panders to the weakest
> (and craziest!) memory ordering model amongst all the possible Linux
> platforms - theoretical or real metal.
>
> Instead what we want to do is to consciously, intelligently _pick_ a
> sane, maintainable memory model and offer primitives for that - at
> least as far as generic code is concerned. Each architecture can map
> those primitives to the best of its abilities.
>
> Because as we increase abstraction, as we allow more and more complex
> memory ordering details, so do maintainability and robustness
> decrease. So there's a very real crossover point beyond which
> increased smarts will actually hurt our code in real life.
>
> [ Same goes for compilers, we draw a line: for example we generally
> turn off strict aliasing optimizations, or we turn off NULL pointer
> check elimination optimizations. ]
>
> I'm not saying this to not discuss theoretical complexities - I'm just
> saying that the craziest memory ordering complexities are probably
> best dealt with by agreeing not to use them ;-)
Thank you for the explanation, Ingo! I do agree with these principles.
That said, I remain really confused. My best guess is that you are
advising me to ask Peter to stiffen up smp_store_release() so that
it preserves the guarantee that unlock+lock provides a full barrier,
thus allowing it to be used in the queued spinlocks as well as in its
original circular-buffer use case. But even that doesn't completely
fit because that was the direction I was going beforehand.
You see, my problem is not the "crazy ordering" DEC Alpha, Itanium,
PowerPC, or even ARM. It is really obvious what instructions to use in
a stiffened-up smp_store_release() for those guys: "mb" for DEC Alpha,
"st.rel" for Itanium, "sync" for PowerPC, and "dmb" for ARM. Believe it
or not, my problem is instead with good old tightly ordered x86.
We -could- just put an mfence into x86's smp_store_release() and
be done with it, but it currently looks like we get the effect of
a full memory barrier without it, at least in the special case of
the high-contention queued-lock handoff. To repeat, it looks like we
preserve the full-memory-barrier property of unlock+lock for x86 -even-
-though- the queued-lock high-contention handoff code contains neither
atomic instructions nor memory-barrier instructions. This is a bit
surprising to me, to say the least. Hence my digging into the theory
to check it -- after all, we cannot prove it correct by testing it.
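For reference, the x86 smp_store_release() in question is essentially
just a compiler barrier plus a plain store -- no mfence anywhere --
something along these lines:

	#define smp_store_release(p, v)					\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		barrier();	/* compiler barrier, no fence insn */	\
		ACCESS_ONCE(*p) = (v);					\
	} while (0)

So "stiffening it up" on x86 would mean adding an mfence right there.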
Here are some other things that you and Linus might be trying to tell me:
o Just say "no" to queued locks. (I am OK with this. NAKs are
after all easier than beating my head against memory models.)
o Don't add store-after-conditional control dependencies to
Documentation/memory-barriers.txt because it is too complicated.
(I am OK with this, I suppose -- but some people really want to
rely on them.)
o Just add general control dependencies, because that is what
people expect. (I have more trouble with this because there
is a -lot- of work needed in many projects to make this happen,
including on ARM, but some work on x86 as well.)
Anything I am missing here?
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 18:49 ` Paul E. McKenney
@ 2013-11-22 19:06 ` Linus Torvalds
2013-11-22 20:06 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-22 19:06 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 10:49 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> You see, my problem is not the "crazy ordering" DEC Alpha, Itanium,
> PowerPC, or even ARM. It is really obvious what instructions to use in
> a stiffened-up smp_store_release() for those guys: "mb" for DEC Alpha,
> "st.rel" for Itanium, "sync" for PowerPC, and "dmb" for ARM. Believe it
> or not, my problem is instead with good old tightly ordered x86.
>
> We -could- just put an mfence into x86's smp_store_release() and
> be done with it
Why would you bother? The *acquire* has a memory barrier. End of
story. On x86, it has to (since otherwise a load inside the locked
region could be re-ordered wrt the write that takes the lock).
Basically, any time you think you need to add a memory barrier on x86,
you should go "I'm doing something wrong". It's that simple.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 19:06 ` Linus Torvalds
@ 2013-11-22 20:06 ` Paul E. McKenney
2013-11-22 20:09 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 20:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 11:06:41AM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 10:49 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > You see, my problem is not the "crazy ordering" DEC Alpha, Itanium,
> > PowerPC, or even ARM. It is really obvious what instructions to use in
> > a stiffened-up smp_store_release() for those guys: "mb" for DEC Alpha,
> > "st.rel" for Itanium, "sync" for PowerPC, and "dmb" for ARM. Believe it
> > or not, my problem is instead with good old tightly ordered x86.
> >
> > We -could- just put an mfence into x86's smp_store_release() and
> > be done with it
>
> Why would you bother? The *acquire* has a memory barrier. End of
> story. On x86, it has to (since otherwise a load inside the locked
> region could be re-ordered wrt the write that takes the lock).
I am sorry, but that is not always correct. For example, in the contended
case for Tim Chen's MCS queued locks, the x86 acquisition-side handoff
code does -not- contain any stores or memory-barrier instructions.
Here is that portion of the arch_mcs_spin_lock() code, along with the
x86 definition for smp_load_acquire:
+	while (!(smp_load_acquire(&node->locked)))			\
+		arch_mutex_cpu_relax();					\

+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
No stores, no memory-barrier instructions.
Of course, the fact that there are no stores means that on x86 the
critical section cannot leak out, even with no memory barrier. That is
the easy part. The hard part is if we want unlock+lock to be a full
memory barrier for MCS lock from the viewpoint of code not holding
that lock. We clearly cannot rely on the non-existent memory barrier
in the acquisition handoff code.
And yes, there is a full barrier implied by the xchg() further up in
arch_mcs_spin_lock(), shown in full below, but that barrier is before
the handoff code, so that xchg() cannot have any effect on the handoff.
That xchg() therefore cannot force unlock+lock to act as a full memory
barrier in the contended queue-handoff case.
> Basically, any time you think you need to add a memory barrier on x86,
> you should go "I'm doing something wrong". It's that simple.
It -appears- that the MCS queue handoff code is one of the many cases
where we don't need a memory barrier on x86, even if we do want MCS
unlock+lock to be a full memory barrier. But I wouldn't call it simple.
I -think- we do have a proof, but I don't yet totally trust it.
Thanx, Paul
> Linus
------------------------------------------------------------------------
+#define arch_mcs_spin_lock(lock, node)					\
+{									\
+	struct mcs_spinlock *prev;					\
+									\
+	/* Init node */							\
+	node->locked = 0;						\
+	node->next = NULL;						\
+									\
+	/* xchg() provides a memory barrier */				\
+	prev = xchg(lock, node);					\
+	if (likely(prev == NULL)) {					\
+		/* Lock acquired */					\
+		return;							\
+	}								\
+	ACCESS_ONCE(prev->next) = node;					\
+	/*								\
+	 * Wait until the lock holder passes the lock down.		\
+	 * Using smp_load_acquire() provides a memory barrier that	\
+	 * ensures subsequent operations happen after the lock is	\
+	 * acquired.							\
+	 */								\
+	while (!(smp_load_acquire(&node->locked)))			\
+		arch_mutex_cpu_relax();					\
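For completeness, the matching release side of the handoff looks
roughly like this (quoting from memory, so details such as the
cmpxchg fallback when there is no successor yet might differ slightly
from the actual patch):

	#define arch_mcs_spin_unlock(lock, node)			\
	{								\
		struct mcs_spinlock *next = ACCESS_ONCE(node->next);	\
									\
		if (likely(!next)) {					\
			/* No successor: release by clearing the tail. */ \
			if (likely(cmpxchg(lock, node, NULL) == node))	\
				return;					\
			/* A successor is enqueueing; wait for ->next. */ \
			while (!(next = ACCESS_ONCE(node->next)))	\
				arch_mutex_cpu_relax();			\
		}							\
		/* Hand the lock to the next waiter in the queue. */	\
		smp_store_release(&next->locked, 1);			\
	}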
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 20:06 ` Paul E. McKenney
@ 2013-11-22 20:09 ` Linus Torvalds
2013-11-22 20:37 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-22 20:09 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 12:06 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> I am sorry, but that is not always correct. For example, in the contended
> case for Tim Chen's MCS queued locks, the x86 acquisition-side handoff
> code does -not- contain any stores or memory-barrier instructions.
So? In order to get *into* that contention code, you will have to go
through the fast-case code. Which will contain a locked instruction.
So I repeat: a "lock" sequence will always be a memory barrier on x86.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 20:09 ` Linus Torvalds
@ 2013-11-22 20:37 ` Paul E. McKenney
2013-11-22 21:01 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 20:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 12:09:31PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 12:06 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > I am sorry, but that is not always correct. For example, in the contended
> > case for Tim Chen's MCS queued locks, the x86 acquisition-side handoff
> > code does -not- contain any stores or memory-barrier instructions.
>
> So? In order to get *into* that contention code, you will have to go
> through the fast-case code. Which will contain a locked instruction.
So you must also maintain ordering against the critical section that just
ended on some other CPU. And that just-ended critical section might
well have started -after- you passed through your own fast-case code.
In that case, the barriers in your fast-case code cannot possibly
help you. Instead, ordering must be supplied by the code in the two
handoff code sequences. And in the case of the most recent version of
Tim Chen's MCS lock on x86, the two handoff code sequences (release
and corresponding acquire) contain neither atomic instructions nor
memory-barrier instructions.
The weird thing is that it looks like those two handoff code sequences
nevertheless provide the unlock+lock guarantee on x86. But I need to
look at it some more, and eventually run it by experts from Intel and
AMD.
Thanx, Paul
> So I repeat: a "lock" sequence will always be a memory barrier on x86.
>
> Linus
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 20:37 ` Paul E. McKenney
@ 2013-11-22 21:01 ` Linus Torvalds
2013-11-22 21:52 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-22 21:01 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 12:37 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 22, 2013 at 12:09:31PM -0800, Linus Torvalds wrote:
>>
>> So? In order to get *into* that contention code, you will have to go
>> through the fast-case code. Which will contain a locked instruction.
>
> So you must also maintain ordering against the critical section that just
> ended on some other CPU.
But that's completely irrelevant to what you yourself have been saying
in this thread.
Your stated concern in this thread has been whether the "unlock+lock"
sequence implies an ordering that is at least equivalent to a memory
barrier. And it clearly does, because the lock clearly contains a
memory barrier inside of it.
The fact that the locking sequence contains *other* things too is
irrelevant for that question. Those other things are at most relevant
then for *other* questions, ie from the standpoint of somebody wanting
to convince himself that the locking actually works as a lock, but
that wasn't what we were actually talking about earlier.
The x86 memory ordering doesn't follow the traditional theoretical
operations, no. Tough. It's generally superior to the alternatives
because of its somewhat unorthodox rules (in that it then makes the
many other common barriers generally be no-ops). If you try to
describe the x86 ops in terms of the theory, you will have pain. So
just don't do it. Think of them in the context of their own rules, not
somehow trying to translate them to non-x86 rules.
I think you can try to approximate the x86 rules as "every load is a
RCpc acquire, every store is a RCpc release", and then to make
yourself happier you can say that the lock sequence always starts out
with a serializing operation (which is obviously the actual locked
r-m-w op) so that on a lock/unlock level (as opposed to an individual
memory op level) you get the RCsc behavior of the acquire/releases not
re-ordering across separate locking events.
I'm not actually convinced that that is really a full and true
description of the x86 semantics, but it may _approximate_ being true
to the degree that you might translate it to some of the academic
papers that talk about these things.
(Side note: this is also true when the locked r-m-w instruction has
been replaced with a xbegin/xend. Intel documents that an RTM region
has the "same ordering semantics as a LOCK prefixed instruction": see
section 15.3.6 in the intel x86 architecture sw manual)
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 21:01 ` Linus Torvalds
@ 2013-11-22 21:52 ` Paul E. McKenney
2013-11-22 22:19 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 21:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 01:01:14PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 12:37 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Nov 22, 2013 at 12:09:31PM -0800, Linus Torvalds wrote:
> >>
> >> So? In order to get *into* that contention code, you will have to go
> >> through the fast-case code. Which will contain a locked instruction.
> >
> > So you must also maintain ordering against the critical section that just
> > ended on some other CPU.
>
> But that's completely irrelevant to what you yourself have been saying
> in this thread.
>
> Your stated concern in this thread has been whether the "unlock+lock"
> sequence implies an ordering that is at least equivalent to a memory
> barrier. And it clearly does, because the lock clearly contains a
> memory barrier inside of it.
You seem to be assuming that the unlock+lock rule applies only when the
unlock and the lock are executed by the same CPU. This is not always
the case. For example, when the unlock and lock are operating on the
same lock variable, the critical sections must appear to be ordered from
the perspective of some other CPU, even when that CPU is not holding
any lock. Please see the last example in "LOCKS VS MEMORY ACCESSES"
in Documentation/memory-barriers.txt, which was added in March 2006:
------------------------------------------------------------------------
CPU 1                           CPU 2
=============================== ===============================
*A = a;                         *E = e;
LOCK M [1]                      LOCK M [2]
*B = b;                         *F = f;
*C = c;                         *G = g;
UNLOCK M [1]                    UNLOCK M [2]
*D = d;                         *H = h;
CPU 3 might see:
*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
*B, *C, *D, *F, *G or *H preceding LOCK M [1]
*A, *B or *C following UNLOCK M [1]
*F, *G or *H preceding LOCK M [2]
*A, *B, *C, *E, *F or *G following UNLOCK M [2]
------------------------------------------------------------------------
The code that CPU 2 executes after acquiring lock M must be seen by some
other CPU not holding any lock as following CPU 1's release of lock M.
And the other three sets of ordering constraints must hold as well.
Admittedly, this example only shows stores, but then again so do the
earlier examples that illustrate single-CPU unlock-lock acting as a full
memory barrier. The intent was that unlock and a subsequent lock of a
given lock variable act as a full memory barrier regardless of whether
or not the unlock and lock were executed by the same CPU.
> The fact that the locking sequence contains *other* things too is
> irrelevant for that question. Those other things are at most relevant
> then for *other* questions, ie from the standpoint of somebody wanting
> to convince himself that the locking actually works as a lock, but
> that wasn't what we were actually talking about earlier.
Also from the standpoint of somebody wanting to convince himself
that an unlock on one CPU and a lock of that same lock on another
CPU provides ordering for some other CPU not holding that lock.
Which in fact was the case I was worried about.
> The x86 memory ordering doesn't follow the traditional theoretical
> operations, no. Tough. It's generally superior than the alternatives
> because of its somewhat unorthodox rules (in that it then makes the
> many other common barriers generally be no-ops). If you try to
> describe the x86 ops in terms of the theory, you will have pain. So
> just don't do it. Think of them in the context of their own rules, not
> somehow trying to translate them to non-x86 rules.
>
> I think you can try to approximate the x86 rules as "every load is a
> RCpc acquire, every store is a RCpc release", and then to make
> yourself happier you can say that the lock sequence always starts out
> with a serializing operation (which is obviously the actual locked
> r-m-w op) so that on a lock/unlock level (as opposed to an individual
> memory op level) you get the RCsc behavior of the acquire/releases not
> re-ordering across separate locking events.
>
> I'm not actually convinced that that is really a full and true
> description of the x86 semantics, but it may _approximate_ being true
> to the degree that you might translate it to some of the academic
> papers that talk about these things.
This approach is fine most of the time. But when faced with something
as strange as "got a full barrier despite having no atomic instructions
and no memory-barrier instructions", I feel the need to look at it from
multiple viewpoints. The multiple viewpoints I have used thus far do
seem to agree with each other, which does give me some confidence in
the result.
> (Side note: this is also true when the locked r-m-w instruction has
> been replaced with a xbegin/xend. Intel documents that an RTM region
> has the "same ordering semantics as a LOCK prefixed instruction": see
> section 15.3.6 in the intel x86 architecture sw manual)
Understood. So, yes, it would be possible to implement locking with RTM,
as long as you had a non-RTM fallback path. The fallback path would be
very rarely used, but I suspect that you could exercise it by putting
it in userspace and attempting to single-step through the transaction.
But in the handoff case, there are no locked r-m-w instructions, so I
think I lost the thread somewhere in your side note. Unless you are
simply saying that hardware transactional memory can be a good thing,
in which case I agree, at least for transactions that are small enough
to fit in the cache and to not need to be debugged via single-stepping.
I don't buy the infinite-composition argument of some transactional
memory academics, though. Too many corner cases, such as having a remote
procedure call between the two transactions to be composed. In fact,
one can argue that transactions are composable only to about the same
degree as are locks. Not popular among those who want to believe that
transactions are infinitely composable and locks are not, but I never
have been popular among those people anyway. ;-)
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 21:52 ` Paul E. McKenney
@ 2013-11-22 22:19 ` Linus Torvalds
2013-11-23 0:25 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-22 22:19 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 1:52 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> You seem to be assuming that the unlock+lock rule applies only when the
> unlock and the lock are executed by the same CPU. This is not always
> the case. For example, when the unlock and lock are operating on the
> same lock variable, the critical sections must appear to be ordered from
> the perspective of some other CPU, even when that CPU is not holding
> any lock.
Umm. Isn't that pretty much *guaranteed* by any cache-coherent locking scheme.
The unlock - by virtue of being an unlock - means that all ops within
the first critical region must be visible in the cache coherency
protocol before the unlock is visible. Same goes for the lock on the
other CPU wrt the memory accesses within that locked region.
IOW, I'd argue that any locking model that depends on cache coherency
- as opposed to some magic external locks independent of cache
coherency - *has* to follow the rules in that section as far as I can
see. Or it's not a locking model at all, and lets the cache accesses
leak outside of the critical section.
Btw, you can see the difference in the very next section, where you
have *non-cache-coherent* (IO) accesses. So once you have different
rules for the data and the lock accesses, you can get different
results. And yes, there have been broken SMP models (historically)
where locking was "separate" from the memory system, and you could get
coherence only by taking the right lock. But I really don't think we
care about such locking models (for memory - again, IO accesses are
different, exactly because locking and data are in different "ordering
domains").
IOW, I don't think you *can* violate that "locks vs memory accesses"
model with any system where locking is in the same ordering domain as
the data (ie we lock by using cache coherency). And locking using
cache coherency is imnsho the only valid model for SMP. No?
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 22:19 ` Linus Torvalds
@ 2013-11-23 0:25 ` Paul E. McKenney
2013-11-23 0:42 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-23 0:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 02:19:15PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 1:52 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > You seem to be assuming that the unlock+lock rule applies only when the
> > unlock and the lock are executed by the same CPU. This is not always
> > the case. For example, when the unlock and lock are operating on the
> > same lock variable, the critical sections must appear to be ordered from
> > the perspective of some other CPU, even when that CPU is not holding
> > any lock.
>
> Umm. Isn't that pretty much *guaranteed* by any cache-coherent locking scheme.
No, there really are exceptions. In fact, one such exception showed up
a few days ago on this very list, which is why I started complaining.
> The unlock - by virtue of being an unlock - means that all ops within
> the first critical region must be visible in the cache coherency
> protocol before the unlock is visible. Same goes for the lock on the
> other CPU wrt the memory accesses within that locked region.
>
> IOW, I'd argue that any locking model that depends on cache coherency
> - as opposed to some magic external locks independent of cache
> coherenecy - *has* to follow the rules in that section as far as I can
> see. Or it's not a locking model at all, and lets the cache accesses
> leak outside of the critical section.
Start with Tim Chen's most recent patches for MCS locking, the ones that
do the lock handoff using smp_store_release() and smp_load_acquire().
Add to that Peter Zijlstra's patch that uses PowerPC lwsync for both
smp_store_release() and smp_load_acquire(). Run the resulting lock
at high contention, so that all lock handoffs are done via the queue.
Then you will have something that acts like a lock from the viewpoint
of CPU holding that lock, but which does -not- guarantee that an
unlock+lock acts like a full memory barrier if the unlock and lock run
on two different CPUs, and if the observer is running on a third CPU.
Easy fix -- make powerpc's smp_store_release() use sync instead of lwsync.
Slows down the PowerPC circular-buffer implementation a bit, but I believe
that this is fixable separately. More on that later.
And if you, the Intel guys, and the AMD guys all say that the x86 code
path does the right thing, then I won't argue, especially since the
formalisms seem to agree. Quite surprising to me, but if that is the
way it works, well and good. That said, I will check a few other CPU
families for completeness.
> Btw, you can see the difference in the very next section, where you
> have *non-cache-coherent* (IO) accesses. So once you have different
> rules for the data and the lock accesses, you can get different
> results. And yes, there have been broken SMP models (historically)
> where locking was "separate" from the memory system, and you could get
> coherence only by taking the right lock. But I really don't think we
> care about such locking models (for memory - again, IO accesses are
> different, exactly because locking and data are in different "ordering
> domains").
Yes, MMIO accesses add another set of rules. I have not been talking
about MMIO accesses, however.
> IOW, I don't think you *can* violate that "locks vs memory accesses"
> model with any system where locking is in the same ordering domain as
> the data (ie we lock by using cache coherency). And locking using
> cache coherency is imnsho the only valid model for SMP. No?
No, I have not been considering trying to make these locks work in the
absence of cache coherence. Not that crazy, not today, anyway.
But even with cache coherence, you really can create a lock that
acts like a lock from the viewpoint of CPUs holding that lock, but
which violates the "locks vs memory accesses" model. For example, the
combination of Tim's most recent MCS lock patches with Peter's most recent
smp_store_release()/smp_load_acquire() patch that I called out above.
Sheesh, and I haven't even started reviewing the qrwlock... :-/
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 0:25 ` Paul E. McKenney
@ 2013-11-23 0:42 ` Linus Torvalds
2013-11-23 1:36 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-23 0:42 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 4:25 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Start with Tim Chen's most recent patches for MCS locking, the ones that
> do the lock handoff using smp_store_release() and smp_load_acquire().
> Add to that Peter Zijlstra's patch that uses PowerPC lwsync for both
> smp_store_release() and smp_load_acquire(). Run the resulting lock
> at high contention, so that all lock handoffs are done via the queue.
> Then you will have something that acts like a lock from the viewpoint
> of CPU holding that lock, but which does -not- guarantee that an
> unlock+lock acts like a full memory barrier if the unlock and lock run
> on two different CPUs, and if the observer is running on a third CPU.
Umm. If the unlock and the lock run on different CPU's, then the lock
handoff cannot be done through the queue (I assume that what you mean
by "queue" is the write buffer).
And yes, the write buffer is why running unlock+lock on the *same* CPU
is a special case and can generate more re-ordering than is visible
externally (and I generally do agree that we should strive for
serialization at that point), but even it does not actually violate
the rules mentioned in Documentation/memory-barriers.txt wrt an
external CPU because the write that releases the lock isn't actually
visible at that point in the cache, and if the same CPU re-acquires it
by doing a read that bypasses the write and hits in the write buffer
or the unlock, the unlocked state in between won't even be seen
outside of that CPU.
See? The local write buffer is special. It very much bypasses the
cache, but the thing about it is that it's local to that CPU.
Now, I do have to admit that cache coherency protocols are really
subtle, and there may be something else I'm missing, but the thing you
brought up is not one of those things, afaik.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 0:42 ` Linus Torvalds
@ 2013-11-23 1:36 ` Paul E. McKenney
2013-11-23 2:11 ` Linus Torvalds
2013-11-23 20:21 ` Linus Torvalds
0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-23 1:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 04:42:37PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:25 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Start with Tim Chen's most recent patches for MCS locking, the ones that
> > do the lock handoff using smp_store_release() and smp_load_acquire().
> > Add to that Peter Zijlstra's patch that uses PowerPC lwsync for both
> > smp_store_release() and smp_load_acquire(). Run the resulting lock
> > at high contention, so that all lock handoffs are done via the queue.
> > Then you will have something that acts like a lock from the viewpoint
> > of CPU holding that lock, but which does -not- guarantee that an
> > unlock+lock acts like a full memory barrier if the unlock and lock run
> > on two different CPUs, and if the observer is running on a third CPU.
>
> Umm. If the unlock and the lock run on different CPU's, then the lock
> handoff cannot be done through the queue (I assume that what you mean
> by "queue" is the write buffer).
No, by "queue" I mean the MCS lock's queue of waiters. Software, not hardware.
You know, this really isn't all -that- difficult.
Here is how Tim's MCS lock hands off to the next requester on the queue:
+ smp_store_release(&next->locked, 1); \
Given Peter's powerpc implementation, this is an lwsync followed by
a store.
Here is how Tim's MCS lock has the next requester take the handoff:
+ while (!(smp_load_acquire(&node->locked))) \
+ arch_mutex_cpu_relax(); \
Given Peter's powerpc implementation, this is a load followed by an
lwsync.
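For reference, the lwsync-based powerpc definitions I have in mind
look roughly like this (a sketch; the exact patch may differ in
detail):

	#define smp_store_release(p, v)					\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		__asm__ __volatile__ ("lwsync" : : : "memory");		\
		ACCESS_ONCE(*p) = (v);					\
	} while (0)

	#define smp_load_acquire(p)					\
	({								\
		typeof(*p) ___p1 = ACCESS_ONCE(*p);			\
		compiletime_assert_atomic_type(*p);			\
		__asm__ __volatile__ ("lwsync" : : : "memory");		\
		___p1;							\
	})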
So a lock handoff looks like this, where the variable lock is initially 1
(held by CPU 0):
CPU 0 (releasing)               CPU 1 (acquiring)
-----                           -----
CS0                             while (ACCESS_ONCE(lock) == 1)
lwsync                                  continue;
ACCESS_ONCE(lock) = 0;          lwsync
                                CS1
Because lwsync orders both loads and stores before stores, CPU 0's
lwsync does the ordering required to keep CS0 from bleeding out.
Because lwsync orders loads before both loads and stores, CPU 1's lwsync
does the ordering required to keep CS1 from bleeding out. It even works
transitively because we use the same lock variable throughout, all
from the perspective of a CPU holding "lock".
Therefore, Tim's MCS lock combined with Peter's powerpc implementations
of smp_load_acquire() and smp_store_release() really does act like a
lock from the viewpoint of whoever is holding the lock.
But this does -not- guarantee that some other non-lock-holding CPU 2 will
see CS0 and CS1 in order. To see this, let's fill in the two critical
sections, where variables X and Y are both initially zero:
CPU 0 (releasing)               CPU 1 (acquiring)
-----                           -----
ACCESS_ONCE(X) = 1;             while (ACCESS_ONCE(lock) == 1)
lwsync                                  continue;
ACCESS_ONCE(lock) = 0;          lwsync
                                r1 = ACCESS_ONCE(Y);
Then let's add in the observer CPU 2:
CPU 2
-----
ACCESS_ONCE(Y) = 1;
sync
r2 = ACCESS_ONCE(X);
If unlock+lock acted as a full memory barrier, it would be impossible to
end up with (r1 == 0 && r2 == 0). After all, (r1 == 0) implies that
CPU 2's store to Y happened after CPU 1's load from Y, and (r2 == 0)
implies that CPU 0's store to X happened after CPU 2's load from X.
So if CPU 0's unlock combined with CPU 1's lock really acted like a
full memory barrier, we would end up with CPU 0's store to X happening
before CPU 1's load from Y, which happens before CPU 2's store to Y,
which happens before CPU 2's load from X, which happens before CPU 0's
store to X -- a cycle, in other words a contradiction.
However, the outcome (r1 == 0 && r2 == 0) really does happen both
in theory and on real hardware. Therefore, although this acts as
a lock from the viewpoint of a CPU holding the lock, the unlock+lock
combination does -not- act as a full memory barrier.
So there is your example. It really can and does happen.
Again, easy fix. Just change powerpc's smp_store_release() from lwsync
to smp_mb(). That fixes the problem and doesn't hurt anyone but powerpc.
OK?
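For anyone who wants to poke at this shape from user space, here is a
rough C11 translation of the three CPUs above. This is purely my own
illustration, not anything from the patches, and as far as I know the
C11 acquire/release rules likewise permit the r1 == 0 && r2 == 0
outcome here (though any particular run on a given machine may well
never show it):

------------------------------------------------------------------------

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int X, Y, lck;    /* lck plays the role of "lock" above */
static int r1, r2;

static void *cpu0(void *arg)    /* releasing CPU */
{
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        /* Release store: lwsync + store on powerpc. */
        atomic_store_explicit(&lck, 0, memory_order_release);
        return arg;
}

static void *cpu1(void *arg)    /* acquiring CPU */
{
        /* Acquire load: load + lwsync on powerpc. */
        while (atomic_load_explicit(&lck, memory_order_acquire) == 1)
                continue;
        r1 = atomic_load_explicit(&Y, memory_order_relaxed);
        return arg;
}

static void *cpu2(void *arg)    /* observer CPU */
{
        atomic_store_explicit(&Y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* sync */
        r2 = atomic_load_explicit(&X, memory_order_relaxed);
        return arg;
}

int main(void)
{
        pthread_t t0, t1, t2;

        atomic_store(&lck, 1);  /* lock initially held by cpu0 */
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t2, NULL, cpu2, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("r1 = %d, r2 = %d\n", r1, r2);
        return 0;
}

------------------------------------------------------------------------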
Thanx, Paul
> And yes, the write buffer is why running unlock+lock on the *same* CPU
> is a special case and can generate more re-ordering than is visible
> externally (and I generally do agree that we should strive for
> serialization at that point), but even it does not actually violate
> the rules mentioned in Documentation/memory-barriers.txt wrt an
> external CPU because the write that releases the lock isn't actually
> visible at that point in the cache, and if the same CPU re-acquires it
> by doing a read that bypasses the write and hits in the write buffer
> or the unlock, the unlocked state in between won't even be seen
> outside of that CPU.
>
> See? The local write buffer is special. It very much bypasses the
> cache, but the thing about it is that it's local to that CPU.
>
> Now, I do have to admit that cache coherency protocols are really
> subtle, and there may be something else I'm missing, but the thing you
> brought up is not one of those things, afaik.
>
> Linus
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 1:36 ` Paul E. McKenney
@ 2013-11-23 2:11 ` Linus Torvalds
2013-11-23 4:05 ` Paul E. McKenney
2013-11-23 20:21 ` Linus Torvalds
1 sibling, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-23 2:11 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 5:36 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> So there is your example. It really can and does happen.
>
> Again, easy fix. Just change powerpc's smp_store_release() from lwsync
> to smp_mb(). That fixes the problem and doesn't hurt anyone but powerpc.
>
> OK?
Hmm. Ok
Except now I'm worried it can happen on x86 too because my mental
model was clearly wrong.
x86 does have that extra "Memory ordering obeys causality (memory
ordering respects transitive visibility)." rule, and the example in
the architecture manual (section 8.2.3.6 "Stores Are Transitively
Visible") seems to very much about this, but your particular example
is subtly different, so..
I will have to ruminate on this.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 2:11 ` Linus Torvalds
@ 2013-11-23 4:05 ` Paul E. McKenney
2013-11-23 11:24 ` Ingo Molnar
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-23 4:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 06:11:52PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 5:36 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > So there is your example. It really can and does happen.
> >
> > Again, easy fix. Just change powerpc's smp_store_release() from lwsync
> > to smp_mb(). That fixes the problem and doesn't hurt anyone but powerpc.
> >
> > OK?
>
> Hmm. Ok
>
> Except now I'm worried it can happen on x86 too because my mental
> model was clearly wrong.
>
> x86 does have that extra "Memory ordering obeys causality (memory
> ordering respects transitive visibility)." rule, and the example in
> the architecture manual (section 8.2.3.6 "Stores Are Transitively
> Visible") seems to very much about this, but your particular example
> is subtly different, so..
Indeed, my example needs CPU 1's -load- from y to be transitively visible,
so I am nervous about this one as well.
> I will have to ruminate on this.
The rules on the left-hand column of page 5 of the below URL apply to
this example more straightforwardly, but I don't know that Intel and
AMD stand behind them:
http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
My guess is that x86 does guarantee this ordering, but at this point I
would have to ask someone from Intel and AMD.
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 4:05 ` Paul E. McKenney
@ 2013-11-23 11:24 ` Ingo Molnar
2013-11-23 17:06 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2013-11-23 11:24 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > x86 does have that extra "Memory ordering obeys causality (memory
> > ordering respects transitive visibility)." rule, and the example
> > in the architecture manual (section 8.2.3.6 "Stores Are
> > Transitively Visible") seems to be very much about this, but your
> > particular example is subtly different, so..
>
> Indeed, my example needs CPU 1's -load- from y to be transitively
> visible, so I am nervous about this one as well.
>
> > I will have to ruminate on this.
>
> The rules on the left-hand column of page 5 of the below URL apply
> to this example more straightforwardly, but I don't know that Intel
> and AMD stand behind them:
>
> http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
>
> My guess is that x86 does guarantee this ordering, but at this point
> I would have to ask someone from Intel and AMD.
An additional option might be to create a user-space testcase
engineered to hit all the exotic ordering situations, one that might
disprove any particular assumptions we have about the behavior of
hardware. (Back a decade ago when the x86 space first introduced quad
core CPUs with newfangled on-die cache coherency I managed to
demonstrate a causality violation by simulating kernel locks in
user-space, which turned out to be a hardware bug. Also, when
Hyperthreading/SMT was new it demonstrated many interesting bugs never
seen in practice before. So running stuff on real hardware is useful.)
And a cache coherency (and/or locking) test suite would be very useful
anyway, for so many other purposes as well: such as a new platform/CPU
bootstrap, or to prove the correctness of some fancy new locking
scheme people want to add. Maybe as an extension to rcutorture, or a
generalization of it?
Thanks,
Ingo
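For concreteness, a minimal sketch of the kind of user-space testcase described
above, using one well-known pattern: store-buffering, where each of two threads
stores to one variable and then loads the other. On x86 the r1 == 0 && r2 == 0
outcome is allowed unless a real fence (mfence) separates the store from the
load, so a loop like this can confirm or refute an assumed ordering on the
machine at hand. File name, structure and iteration count are all illustrative,
not taken from any posted patch; build with gcc -O2 -pthread.
/* sb_test.c -- minimal store-buffering check (illustrative sketch).
 *
 * Thread 0: X = 1; <fence?>; r1 = Y;
 * Thread 1: Y = 1; <fence?>; r2 = X;
 *
 * Without mfence, x86 may produce r1 == 0 && r2 == 0.
 * Build: gcc -O2 -pthread sb_test.c -o sb_test   (add -DUSE_MFENCE to compare)
 */
#include <pthread.h>
#include <stdio.h>
#define RUNS 1000000
static volatile int X, Y, r1, r2;
static pthread_barrier_t go, done;
#ifdef USE_MFENCE
#define FENCE() __asm__ __volatile__("mfence" ::: "memory")
#else
#define FENCE() __asm__ __volatile__("" ::: "memory")   /* compiler barrier only */
#endif
static void *t0(void *arg)
{
	for (long i = 0; i < RUNS; i++) {
		pthread_barrier_wait(&go);
		X = 1;
		FENCE();
		r1 = Y;
		pthread_barrier_wait(&done);
	}
	return NULL;
}
static void *t1(void *arg)
{
	for (long i = 0; i < RUNS; i++) {
		pthread_barrier_wait(&go);
		Y = 1;
		FENCE();
		r2 = X;
		pthread_barrier_wait(&done);
	}
	return NULL;
}
int main(void)
{
	pthread_t a, b;
	long hits = 0;
	pthread_barrier_init(&go, NULL, 3);
	pthread_barrier_init(&done, NULL, 3);
	pthread_create(&a, NULL, t0, NULL);
	pthread_create(&b, NULL, t1, NULL);
	for (long i = 0; i < RUNS; i++) {
		X = Y = 0;                      /* reset, then release both threads */
		pthread_barrier_wait(&go);
		pthread_barrier_wait(&done);    /* wait for both results */
		if (r1 == 0 && r2 == 0)
			hits++;
	}
	printf("r1 == 0 && r2 == 0 seen %ld times in %d runs\n", hits, RUNS);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}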
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 11:24 ` Ingo Molnar
@ 2013-11-23 17:06 ` Paul E. McKenney
2013-11-26 12:02 ` Ingo Molnar
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-23 17:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 12:24:50PM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>
> > > x86 does have that extra "Memory ordering obeys causality (memory
> > > ordering respects transitive visibility)." rule, and the example
> > > in the architecture manual (section 8.2.3.6 "Stores Are
> > > Transitively Visible") seems to be very much about this, but your
> > > particular example is subtly different, so..
> >
> > Indeed, my example needs CPU 1's -load- from y to be transitively
> > visible, so I am nervous about this one as well.
> >
> > > I will have to ruminate on this.
> >
> > The rules on the left-hand column of page 5 of the below URL apply
> > to this example more straightforwardly, but I don't know that Intel
> > and AMD stand behind them:
> >
> > http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
> >
> > My guess is that x86 does guarantee this ordering, but at this point
> > I would have to ask someone from Intel and AMD.
>
> An additional option might be to create a user-space testcase
> engineered to hit all the exotic ordering situations, one that might
> disprove any particular assumptions we have about the behavior of
> hardware. (Back a decade ago when the x86 space first introduced quad
> core CPUs with newfangled on-die cache coherency I managed to
> demonstrate a causality violation by simulating kernel locks in
> user-space, which turned out to be a hardware bug. Also, when
> Hyperthreading/SMT was new it demonstrated many interesting bugs never
> seen in practice before. So running stuff on real hardware is useful.)
>
> And a cache coherency (and/or locking) test suite would be very useful
> anyway, for so many other purposes as well: such as a new platform/CPU
> bootstrap, or to prove the correctness of some fancy new locking
> scheme people want to add. Maybe as an extension to rcutorture, or a
> generalization of it?
I have the locking counterpart of rcutorture on my todo list. ;-)
Of course, we cannot prove locks correct via testing, but a quick test
can often find a bug faster and more reliably than manual inspection.
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 17:06 ` Paul E. McKenney
@ 2013-11-26 12:02 ` Ingo Molnar
2013-11-26 19:28 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2013-11-26 12:02 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Sat, Nov 23, 2013 at 12:24:50PM +0100, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >
> > > > x86 does have that extra "Memory ordering obeys causality (memory
> > > > ordering respects transitive visibility)." rule, and the example
> > > > in the architecture manual (section 8.2.3.6 "Stores Are
> > > > Transitively Visible") seems to be very much about this, but your
> > > > particular example is subtly different, so..
> > >
> > > Indeed, my example needs CPU 1's -load- from y to be transitively
> > > visible, so I am nervous about this one as well.
> > >
> > > > I will have to ruminate on this.
> > >
> > > The rules on the left-hand column of page 5 of the below URL apply
> > > to this example more straightforwardly, but I don't know that Intel
> > > and AMD stand behind them:
> > >
> > > http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
> > >
> > > My guess is that x86 does guarantee this ordering, but at this point
> > > I would have to ask someone from Intel and AMD.
> >
> > An additional option might be to create a user-space testcase
> > engineered to hit all the exotic ordering situations, one that
> > might disprove any particular assumptions we have about the
> > behavior of hardware. (Back a decade ago when the x86 space first
> > introduced quad core CPUs with newfangled on-die cache coherency I
> > managed to demonstrate a causality violation by simulating kernel
> > locks in user-space, which turned out to be a hardware bug. Also,
> > when Hyperthreading/SMT was new it demonstrated many interesting
> > bugs never seen in practice before. So running stuff on real
> > hardware is useful.)
> >
> > And a cache coherency (and/or locking) test suite would be very
> > useful anyway, for so many other purposes as well: such as a new
> > platform/CPU bootstrap, or to prove the correctness of some fancy
> > new locking scheme people want to add. Maybe as an extension to
> > rcutorture, or a generalization of it?
>
> I have the locking counterpart of rcutorture on my todo list. ;-)
>
> Of course, we cannot prove locks correct via testing, but a quick
> test can often find a bug faster and more reliably than manual
> inspection.
We cannot prove them correct via testing, but we can test our
hypothesis about how the platform works and chances are that if the
tests are smart enough then we will be proven wrong via an actual
failure if our assumptions are wrong.
Thanks,
Ingo
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 12:02 ` Ingo Molnar
@ 2013-11-26 19:28 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-26 19:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Tue, Nov 26, 2013 at 01:02:18PM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>
> > On Sat, Nov 23, 2013 at 12:24:50PM +0100, Ingo Molnar wrote:
> > >
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > > > x86 does have that extra "Memory ordering obeys causality (memory
> > > > > ordering respects transitive visibility)." rule, and the example
> > > > > in the architecture manual (section 8.2.3.6 "Stores Are
> > > > > Transitively Visible") seems to be very much about this, but your
> > > > > particular example is subtly different, so..
> > > >
> > > > Indeed, my example needs CPU 1's -load- from y to be transitively
> > > > visible, so I am nervous about this one as well.
> > > >
> > > > > I will have to ruminate on this.
> > > >
> > > > The rules on the left-hand column of page 5 of the below URL apply
> > > > to this example more straightforwardly, but I don't know that Intel
> > > > and AMD stand behind them:
> > > >
> > > > http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
> > > >
> > > > My guess is that x86 does guarantee this ordering, but at this point
> > > > I would have to ask someone from Intel and AMD.
> > >
> > > An additional option might be to create a user-space testcase
> > > engineered to hit all the exotic ordering situations, one that
> > > might disprove any particular assumptions we have about the
> > > behavior of hardware. (Back a decade ago when the x86 space first
> > > introduced quad core CPUs with newfangled on-die cache coherency I
> > > managed to demonstrate a causality violation by simulating kernel
> > > locks in user-space, which turned out to be a hardware bug. Also,
> > > when Hyperthreading/SMT was new it demonstrated many interesting
> > > bugs never seen in practice before. So running stuff on real
> > > hardware is useful.)
> > >
> > > And a cache coherency (and/or locking) test suite would be very
> > > useful anyway, for so many other purposes as well: such as a new
> > > platform/CPU bootstrap, or to prove the correctness of some fancy
> > > new locking scheme people want to add. Maybe as an extension to
> > > rcutorture, or a generalization of it?
> >
> > I have the locking counterpart of rcutorture on my todo list. ;-)
> >
> > Of course, we cannot prove locks correct via testing, but a quick
> > test can often find a bug faster and more reliably than manual
> > inspection.
>
> We cannot prove them correct via testing, but we can test our
> hypothesis about how the platform works and chances are that if the
> tests are smart enough then we will be proven wrong via an actual
> failure if our assumptions are wrong.
There actually is an open-source program designed to test this sort
of hypothesis... http://diy.inria.fr/ Don't miss the advertisement
at the bottom of the page.
That said, you do need some machine time. Some of the invalid hypotheses
have failure rates in the 1-in-a-billion range. ;-)
Or you can read some of the papers that this group has written, some
of which include failure rates from empirical testing. Here is the
one for ARM and Power:
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
Thanx, Paul
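For concreteness, the three-CPU example being debated in this thread can itself
be written as a user-space test in the same spirit as the litmus tools above.
This is a sketch only: the names are made up, the lwsync/sync asm makes it
build only with gcc on Power, and actually hitting the outcome in question
would take suitable thread placement and a very large number of runs, as the
failure rates above suggest.
/* wrwc_power.c -- user-space version of the three-CPU example in this
 * thread (illustrative sketch; PowerPC only, gcc -O2 -pthread).
 * Initially x == y == 0 and lock == 1.
 */
#include <pthread.h>
#include <stdio.h>
#define RUNS 1000000
static volatile int x, y, lock = 1;
static volatile int r1, r2;
static pthread_barrier_t start, stop;
static void *cpu0(void *arg)    /* releasing: x = 1; lwsync; lock = 0 */
{
	for (long i = 0; i < RUNS; i++) {
		pthread_barrier_wait(&start);
		x = 1;
		__asm__ __volatile__("lwsync" ::: "memory");
		lock = 0;
		pthread_barrier_wait(&stop);
	}
	return NULL;
}
static void *cpu1(void *arg)    /* acquiring: spin on lock; lwsync; r1 = y */
{
	for (long i = 0; i < RUNS; i++) {
		pthread_barrier_wait(&start);
		while (lock == 1)
			continue;
		__asm__ __volatile__("lwsync" ::: "memory");
		r1 = y;
		pthread_barrier_wait(&stop);
	}
	return NULL;
}
static void *cpu2(void *arg)    /* observer: y = 1; sync; r2 = x */
{
	for (long i = 0; i < RUNS; i++) {
		pthread_barrier_wait(&start);
		y = 1;
		__asm__ __volatile__("sync" ::: "memory");
		r2 = x;
		pthread_barrier_wait(&stop);
	}
	return NULL;
}
int main(void)
{
	pthread_t t[3];
	long hits = 0;
	pthread_barrier_init(&start, NULL, 4);
	pthread_barrier_init(&stop, NULL, 4);
	pthread_create(&t[0], NULL, cpu0, NULL);
	pthread_create(&t[1], NULL, cpu1, NULL);
	pthread_create(&t[2], NULL, cpu2, NULL);
	for (long i = 0; i < RUNS; i++) {
		x = y = 0;
		lock = 1;
		pthread_barrier_wait(&start);
		pthread_barrier_wait(&stop);
		if (r1 == 0 && r2 == 0)   /* CPU 1 saw y == 0, CPU 2 saw x == 0 */
			hits++;
	}
	printf("outcome in question seen %ld times in %d runs\n", hits, RUNS);
	for (int i = 0; i < 3; i++)
		pthread_join(t[i], NULL);
	return 0;
}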
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 1:36 ` Paul E. McKenney
2013-11-23 2:11 ` Linus Torvalds
@ 2013-11-23 20:21 ` Linus Torvalds
2013-11-23 20:39 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-23 20:21 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 5:36 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> But this does -not- guarantee that some other non-lock-holding CPU 2 will
> see CS0 and CS1 in order. To see this, let's fill in the two critical
> sections, where variables X and Y are both initially zero:
>
> CPU 0 (releasing)          CPU 1 (acquiring)
> -----                      -----
> ACCESS_ONCE(X) = 1;        while (ACCESS_ONCE(lock) == 1)
> lwsync                             continue;
> ACCESS_ONCE(lock) = 0;     lwsync
>                            r1 = ACCESS_ONCE(Y);
>
> Then let's add in the observer CPU 2:
>
> CPU 2
> -----
> ACCESS_ONCE(Y) = 1;
> sync
> r2 = ACCESS_ONCE(X);
>
> If unlock+lock act as a full memory barrier, it would be impossible to
> end up with (r1 == 0 && r2 == 0). After all, (r1 == 0) implies that
> CPU 2's store to Y happened after CPU 1's load from Y, and (r2 == 0)
> implies that CPU 0's load from X happened after CPU 2's store to X.
> If CPU 0's unlock combined with CPU 1's lock really acted like a full
> memory barrier, we end up with CPU 0's load happening before CPU 1's
> store happening before CPU 2's store happening before CPU 2's load
> happening before CPU 0's load.
>
> However, the outcome (r1 == 0 && r2 == 0) really does happen both
> in theory and on real hardware.
Ok, so I have ruminated.
But even after having ruminated, the one thing I cannot find is
support for your "(r1 == 0 && r2 == 0) really does happen on
hardware".
Ignore theory, and assume just cache coherency, ie the notion that in
order to write to a cacheline, you have to have that cacheline in some
exclusive state. We have three cachelines, X, Y and lock (and our
cachelines only have a single bit, starting out as 0,0,1
respectively).
CPU0:
write X = 1;
lwsync
write lock = 0;
doesn't even really require any cache accesses at all per se, but it
*does* require that the two stores be ordered in the store buffer on
CPU0 in such a way that cacheline X gets updated (ie is in
exclusive/dirty state in CPU0 with the value 1) before cacheline
'lock' gets released from its exclusive/dirty state after having
itself been updated to 0.
So basically we know that *within* CPU0, by the time the lock actually
makes it out of the CPU, the cacheline containing X will have been in
dirty mode with the value "1". The core might actually have written
'lock' first, but it can't release that cacheline from exclusive state
(so that it is visible anywhere else) until it has _also_ gotten 'X'
into exclusive state (once both cachelines are exclusive within CPU0,
it can do any ordering, because the ordering won't be externally
visible).
And this is all just looking at CPU0, nothing else. But if it is
exclusive/dirty on CPU0, then it cannot be shared in any other CPU
(although a previous stale value may obviously still be "in flight"
somewhere else outside of the coherency domain).
So let's look at CPU2, which is similar, but now the second access is
a read (of zero), not a write:
CPU2:
write Y = 1;
sync
read X as zero
So if 'sync' is a true memory barrier between the write and the read,
then we know that the following is true: CPU2 must have gotten
cacheline 'Y' into exclusive state and acquired (or held on to, which
is equivalent) cacheline 'X' in shared state _after_ it got that Y
into exclusive state. It can't rely on some "in flight" previously
read value of 'X' until after it got Y into exclusive state. Otherwise
it wouldn't be an ordering between the write and the read, agreed?
The pattern on CPU1, meanwhile, is somewhat different. But I'm going
to ignore the "while" part, and just concentrate on the last iteration
of the loop, and it turns into:
CPU1:
read lock as zero
lwsync
read Y as zero
It only does reads, so it is happy with a shared cacheline, but in
order for the lwsync to actually act as an acquire, it does mean that
the cacheline 'Y' needs to be in some shared state within the cache
coherency domain after (or, again, across) cacheline 'lock' having
been in a shared state with value == 0 on CPU1. No "in flight" values.
Agreed?
So the above isn't really about lwsync/sync/memory ordering any more,
the above is basically rewriting things purely about cacheline states
as seen by individual processors. And I know the POWER cache coherency
is really complex (iirc cachelines can have 11+ states - we're not
talking about MOESI any more), but even there you have to have a
notion of "exclusive access" to be able to write in the end. So my
states are certainly simplified, but I don't see how that basic rule
can be violated (and still be called cache coherent).
And let's just look at the individual "events" on these CPU's:
- A = CPU0 acquires exclusive access to cacheline X (in order to
write 1 into it)
- B = CPU0 releases its exclusive access to cacheline lock (after
having written 0 into it)
- C = CPU1 reads shared cacheline lock as being zero
- D = CPU1 reads shared cacheline Y as being zero
- E = CPU2 acquires exclusive access to cacheline Y (in order to
write 1 into it)
- F = CPU2 reads shared cacheline X as being zero
and you cannot order these events arbitrarily, but there *ARE* certain
orderings you can rely on:
- within a CPU due to the barriers. This gives you
A < B, C < D and E < F
- with regards to a particular cacheline because of that cacheline
coming exclusive (or, in the case of 'B', moving out of exclusive
state) and the value it has at that point:
B < C, F < A and D < E
And as far as I can tell, the above gives you: A < B < C < D < E < F <
A. Which doesn't look possible.
So which step in my rumination here is actually wrong? Because I
really don't see how you can get that "(r1 == 0 && r2 == 0)" on real
hardware using cache coherency.
*SOME* assumption of mine must be wrong. But I don't see which one.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 20:21 ` Linus Torvalds
@ 2013-11-23 20:39 ` Linus Torvalds
2013-11-25 12:09 ` Peter Zijlstra
2013-11-25 17:54 ` Paul E. McKenney
2013-11-23 21:29 ` Peter Zijlstra
2013-11-25 17:53 ` Paul E. McKenney
2 siblings, 2 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-23 20:39 UTC (permalink / raw)
To: Paul McKenney
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 12:21 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And as far as I can tell, the above gives you: A < B < C < D < E < F <
> A. Which doesn't look possible.
Hmm.. I guess technically all of those cases aren't "strictly
precedes" as much as "cannot have happened in the opposite order". So
the "<" might be "<=". Which I guess *is* possible: "it all happened
at the same time". And then the difference between your suggested
"lwsync" and "sync" in the unlock path on CPU0 basically approximating
the difference between "A <= B" and "A < B"..
Ho humm.
Linus
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 20:39 ` Linus Torvalds
@ 2013-11-25 12:09 ` Peter Zijlstra
2013-11-25 17:18 ` Will Deacon
2013-11-25 17:54 ` Paul E. McKenney
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-25 12:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 12:39:53PM -0800, Linus Torvalds wrote:
> On Sat, Nov 23, 2013 at 12:21 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And as far as I can tell, the above gives you: A < B < C < D < E < F <
> > A. Which doesn't look possible.
>
> Hmm.. I guess technically all of those cases aren't "strictly
> precedes" as much as "cannot have happened in the opposite order". So
> the "<" might be "<=". Which I guess *is* possible: "it all happened
> at the same time". And then the difference between your suggested
> "lwsync" and "sync" in the unlock path on CPU0 basically approximating
> the difference between "A <= B" and "A < B"..
>
> Ho humm.
But remember, there's an actual full proper barrier between E and F, so
at best you'd end up with something like:
A <= B <= C <= D <= E < F <= A
Which is still an impossibility.
I'm hoping others will explain things, as I'm very much on shaky ground
myself wrt transitivity.
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 12:09 ` Peter Zijlstra
@ 2013-11-25 17:18 ` Will Deacon
2013-11-25 17:56 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-25 17:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Paul McKenney, Ingo Molnar, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
Hi Peter, Linus,
On Mon, Nov 25, 2013 at 12:09:02PM +0000, Peter Zijlstra wrote:
> On Sat, Nov 23, 2013 at 12:39:53PM -0800, Linus Torvalds wrote:
> > On Sat, Nov 23, 2013 at 12:21 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And as far as I can tell, the above gives you: A < B < C < D < E < F <
> > > A. Which doesn't look possible.
> >
> > Hmm.. I guess technically all of those cases aren't "strictly
> > precedes" as much as "cannot have happened in the opposite order". So
> > the "<" might be "<=". Which I guess *is* possible: "it all happened
> > at the same time". And then the difference between your suggested
> > "lwsync" and "sync" in the unlock path on CPU0 basically approximating
> > the difference between "A <= B" and "A < B"..
> >
> > Ho humm.
>
> But remember, there's an actual full proper barrier between E and F, so
> at best you'd end up with something like:
>
> A <= B <= C <= D <= E < F <= A
>
> Which is still an impossibility.
>
> I'm hoping others will explain things, as I'm very much on shaky ground
> myself wrt transitivity.
The transitivity issues come about by having multiple, valid copies of the
same data at a given moment in time (hence the term `multi-copy atomicity',
where all of these copies appear to be updated at once).
Now, I'm not familiar with the Power memory model and the implementation
intricacies between lwsync and sync, but I think a better way to think
about this is to think of the cacheline state changes being broadcast as
asynchronous requests, rather than necessarily responding to snoops from a
canonical source.
So, in Paul's example, the upgrade requests on X and lock (shared -> invalid)
may have reached CPU1, but not CPU2 by the time CPU2 reads X and therefore
reads 0 from its shared line. It really depends on the multi-copy semantics
you give to the different barrier instructions.
The other thing worth noting is that exclusive access instructions (e.g.
ldrex and strex on ARM) may interact differently with barriers than conventional
accesses, so lighter weight barriers can sometimes be acceptable for things
like locks and atomics.
Does that help at all?
Will
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 17:18 ` Will Deacon
@ 2013-11-25 17:56 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 17:56 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Linus Torvalds, Ingo Molnar, Tim Chen,
Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Mon, Nov 25, 2013 at 05:18:23PM +0000, Will Deacon wrote:
> Hi Peter, Linus,
>
> On Mon, Nov 25, 2013 at 12:09:02PM +0000, Peter Zijlstra wrote:
> > On Sat, Nov 23, 2013 at 12:39:53PM -0800, Linus Torvalds wrote:
> > > On Sat, Nov 23, 2013 at 12:21 PM, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > And as far as I can tell, the above gives you: A < B < C < D < E < F <
> > > > A. Which doesn't look possible.
> > >
> > > Hmm.. I guess technically all of those cases aren't "strictly
> > > precedes" as much as "cannot have happened in the opposite order". So
> > > the "<" might be "<=". Which I guess *is* possible: "it all happened
> > > at the same time". And then the difference between your suggested
> > > "lwsync" and "sync" in the unlock path on CPU0 basically approximating
> > > the difference between "A <= B" and "A < B"..
> > >
> > > Ho humm.
> >
> > But remember, there's an actual full proper barrier between E and F, so
> > at best you'd end up with something like:
> >
> > A <= B <= C <= D <= E < F <= A
> >
> > Which is still an impossibility.
> >
> > I'm hoping others will explain things, as I'm very much on shaky ground
> > myself wrt transitivity.
>
> The transitivity issues come about by having multiple, valid copies of the
> same data at a given moment in time (hence the term `multi-copy atomicity',
> where all of these copies appear to be updated at once).
>
> Now, I'm not familiar with the Power memory model and the implementation
> intricacies between lwsync and sync, but I think a better way to think
> about this is to think of the cacheline state changes being broadcast as
> asynchronous requests, rather than necessarily responding to snoops from a
> canonical source.
>
> So, in Paul's example, the upgrade requests on X and lock (shared -> invalid)
> may have reached CPU1, but not CPU2 by the time CPU2 reads X and therefore
> reads 0 from its shared line. It really depends on the multi-copy semantics
> you give to the different barrier instructions.
Exactly! ;-)
> The other thing worth noting is that exclusive access instructions (e.g.
> ldrex and strex on ARM) may interact differently with barriers than conventional
> accesses, so lighter weight barriers can sometimes be acceptable for things
> like locks and atomics.
>
> Does that help at all?
The differences between ldrex/strex and larx/stcx cannot come into play
in this example because there are only normal loads and stores, no atomic
instructions.
Thanx, Paul
> Will
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 20:39 ` Linus Torvalds
2013-11-25 12:09 ` Peter Zijlstra
@ 2013-11-25 17:54 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 17:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 12:39:53PM -0800, Linus Torvalds wrote:
> On Sat, Nov 23, 2013 at 12:21 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And as far as I can tell, the above gives you: A < B < C < D < E < F <
> > A. Which doesn't look possible.
>
> Hmm.. I guess technically all of those cases aren't "strictly
> precedes" as much as "cannot have happened in the opposite order". So
> the "<" might be "<=". Which I guess *is* possible: "it all happened
> at the same time". And then the difference between your suggested
> "lwsync" and "sync" in the unlock path on CPU0 basically approximating
> the difference between "A <= B" and "A < B"..
>
> Ho humm.
Indeed, the difference between "<" and "<=" does not matter here.
Thanx, Paul
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 20:21 ` Linus Torvalds
2013-11-23 20:39 ` Linus Torvalds
@ 2013-11-23 21:29 ` Peter Zijlstra
2013-11-23 22:24 ` Linus Torvalds
2013-11-25 17:53 ` Paul E. McKenney
2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-23 21:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 12:21:13PM -0800, Linus Torvalds wrote:
> *SOME* assumption of mine must be wrong. But I don't see which one.
I haven't read your email in full detail yet, but one thing I did miss
was cache-snoops.
One of the reasons for failing transitive / multi-copy atomicity is that
CPUs might have different views of the memory state depending on from
which caches they can get snoops.
Eg. if CPU0 and CPU1 share a cache level but CPU2 does not, CPU1 might
observe a write before CPU2 can.
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 21:29 ` Peter Zijlstra
@ 2013-11-23 22:24 ` Linus Torvalds
0 siblings, 0 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-23 22:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul McKenney, Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 1:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> I haven't read your email in full detail yet, but one thing I did miss
> was cache-snoops.
>
> One of the reasons for failing transitive / multi-copy atomicity is that
> CPUs might have different views of the memory state depending on from
> which caches they can get snoops.
My argument doesn't depend on that or care about that.
My argument depends purely on:
- barriers have a certain sane local meaning (because I don't think
they can work without that)
- cache coherency implies that actually changing the cacheline
requires exclusive access to that cacheline at some point (where the
actual time of that "at some point" is not actually all that important
and you can hold on to the exclusive access for some arbitrary time,
but the particular points where the exclusive state changes matter
for the barrier semantics in order for barriers to work).
Nothing else matters for my argument. So cache snooping details are
irrelevant. Or rather, it is relevant only in the sense that the CPU's
that participate in the cache coherency protocol then have to have
barriers that work properly in the presence of said snooping. You
can't allow snooping to "break" the barriers.
Now, as I said in my follow-up, I think one "explanation" might be
that "everyting happens at the same time" approach, and while that
actually may work for "lwsync" and the sequence in question, I'm not
convinced that kind of lock really is a proper lock.
Because if you accept the "everything happens at once" model to
explain why the unlock+lock sequence doesn't act as a memory barrier,
then I actually think that you can build up an argument where multiple
concurrent spinlock'ed accesses (ie you make one of 'X'/'Y' be the
queue entry for the *next* MCS lock waiter trying to acquire it) can
get insane results that aren't consistent with actual exclusion.
Because if you accept that "a unlock and getting that lock on another
CPU can happen at the same time" argument (so that you can have that
circular chain of mutual dependencies), then you can extend the chain
further, since all the lock/unlock operations in question apparently
have zero latency and can thus see previous values for the same reason
CPU2 can see the original zero value of "X".
So I think anything that allows that
A <= B <= C <= D <= E <= F <= A
situation is not necessarily a valid locking model (because you could
basically add a "lwsync + ACCESS_ONCE(lock-chain)=0" to the work CPU1
does, and since we've already established that there are zero
latencies in all this, CPU2 could have gotten *that* lock that we now
released "at the same time" as reading X as zero, even though CPU0 set
X to one in its "critical region").
So locks don't just imply "you can't let anything out of the critical
region". They also imply exclusion, and that "everything happens all
at once" model would seem to literally allow *EVERYTHING* to happen at
once, breaking the *exclusion* requirement.
But I dunno. My gut feel is that the "everything happens at once"
explanation is not actually then a valid model for locking, which in
turn would mean that using "lwsync" in both unlock and lock is not
sufficient.
Stated another way, let's say that you have multiple CPU's doing this:
lock
if (x == 0)
x = 1;
unlock
then we had *better* have only one of the CPU's actually set "x=1".
Otherwise the lock isn't a lock. Agreed?
Paul argued that "lwsync" is valid in both lock and unlock becuse it
doesn't allow anything to leak out of the locked region, but I'm
arguing that if it means that we allow that "everything happens at
once" model, then *every* CPU can do that "if (x == 0) x = 1" logic
all at the same time, *all* of them decide to set x to 1, and *none*
of them "leak" their accesses outside their locked region (they'd all
set 'x' to 1 at the same time), but the end result is still wrong.
So locking is not just "accesses inside of the lock cannot leak
outside the lock". It also implies "accesses inside of it have to be
_ordered_ wrt the lock", and that in turn disallows the "A <= .. <= F
<= A" model. One of the "<=" has to be a "<" for the lock to be a
lock, methinks.
But hey, maybe somebody can point to where I screwed up. I just do not
think "snooping" is it.
Linus
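The exclusion requirement spelled out above -- at most one CPU may ever see
x == 0 and set it -- is easy to turn into a user-space check. The sketch below
uses a pthread mutex, for which the count must always come out as exactly 1;
the argument in this thread is about whether an lwsync-only unlock/lock gives
the same guarantee. Names are illustrative.
/* one_winner.c -- check that only one thread sees x == 0 and sets it.
 * Build: gcc -O2 -pthread one_winner.c
 */
#include <pthread.h>
#include <stdio.h>
#define NTHREADS 8
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int x;
static int winners;
static void *worker(void *arg)
{
	pthread_mutex_lock(&lock);
	if (x == 0) {
		x = 1;
		winners++;      /* protected by the same lock */
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}
int main(void)
{
	pthread_t t[NTHREADS];
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	/* A lock that provides exclusion must yield exactly one winner. */
	printf("winners = %d (must be 1)\n", winners);
	return winners == 1 ? 0 : 1;
}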
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-23 20:21 ` Linus Torvalds
2013-11-23 20:39 ` Linus Torvalds
2013-11-23 21:29 ` Peter Zijlstra
@ 2013-11-25 17:53 ` Paul E. McKenney
2013-11-25 18:21 ` Peter Zijlstra
2 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 17:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Sat, Nov 23, 2013 at 12:21:13PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 5:36 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > But this does -not- guarantee that some other non-lock-holding CPU 2 will
> > see CS0 and CS1 in order. To see this, let's fill in the two critical
> > sections, where variables X and Y are both initially zero:
> >
> > CPU 0 (releasing)          CPU 1 (acquiring)
> > -----                      -----
> > ACCESS_ONCE(X) = 1;        while (ACCESS_ONCE(lock) == 1)
> > lwsync                             continue;
> > ACCESS_ONCE(lock) = 0;     lwsync
> >                            r1 = ACCESS_ONCE(Y);
> >
> > Then let's add in the observer CPU 2:
> >
> > CPU 2
> > -----
> > ACCESS_ONCE(Y) = 1;
> > sync
> > r2 = ACCESS_ONCE(X);
> >
> > If unlock+lock act as a full memory barrier, it would be impossible to
> > end up with (r1 == 0 && r2 == 0). After all, (r1 == 0) implies that
> > CPU 2's store to Y happened after CPU 1's load from Y, and (r2 == 0)
> > implies that CPU 0's load from X happened after CPU 2's store to X.
> > If CPU 0's unlock combined with CPU 1's lock really acted like a full
> > memory barrier, we end up with CPU 0's load happening before CPU 1's
> > store happening before CPU 2's store happening before CPU 2's load
> > happening before CPU 0's load.
> >
> > However, the outcome (r1 == 0 && r2 == 0) really does happen both
> > in theory and on real hardware.
>
> Ok, so I have ruminated.
>
> But even after having ruminated, the one thing I cannot find is
> support for your "(r1 == 0 && r2 == 0) really does happen on
> hardware".
>
> Ignore theory, and assume just cache coherency,
Sorry, but that is not ignoring theory. Ignoring theory would instead
mean confining oneself to running tests on real hardware. And there is
real hardware that does allow the assertion to trigger. You are instead
asking me to use your personal theory instead of a theory that has been
shown to match reality.
Let's see how that plays out.
> ie the notion that in
> order to write to a cacheline, you have to have that cacheline in some
> exclusive state. We have three cachelines, X, Y and lock (and our
> cachelines only have a single bit, starting out as 0,0,1
> respectively).
>
> CPU0:
> write X = 1;
> lwsync
> write lock = 0;
>
> doesn't even really require any cache accesses at all per se, but it
> *does* require that the two stores be ordered in the store buffer on
> CPU0 in such a way that cacheline X gets updated (ie is in
> exclusive/dirty state in CPU0 with the value 1) before cacheline
> 'lock' gets released from its exclusive/dirty state after having
> itself been updated to 0.
So far, so good.
> So basically we know that *within* CPU0, by the time the lock actually
> makes it out of the CPU, the cacheline containing X will have been in
> dirty mode with the value "1".
You seem to be assuming that the only way for the cache line to make it
out of the CPU is via the caches. This assumption is incorrect given
hardware multithreading, in which case the four hardware threads in a
core (in the case of Power 7) can share a store buffer, and can thus
communicate via that store buffer.
Another way of putting this is that you are assuming multi-copy atomic
behavior, which is in fact guaranteed on x86 by the bullet in 8.2.2
of the "Intel 64 and IA-32 Architectures Software Developer's Manual"
which reads:
"Any two stores are seen in a consistent order by processors
other than those performing the stores."
The reality is that not all architectures guarantee multi-copy atomic
behavior.
> The core might actually have written
> 'lock' first, but it can't release that cacheline from exclusive state
> (so that it is visible anywhere else) until it has _also_ gotten 'X'
> into exclusive state (once both cachelines are exclusive within CPU0,
> it can do any ordering, because the ordering won't be externally
> visible).
Not always true when store buffers are shared among hardware threads!
In particular, consider the case where CPU 0 and CPU 1 share a store
buffer and CPU 2 is on some other core. CPU 1 sees CPU 2's accesses
in order, but the lwsync instructions do not order prior stores against
later loads. Therefore, it is legal for CPU 0's store to X to be released
from the core -after- CPU 1's load from Y. CPU 2's sync cannot help in
this case, so the assertion can trigger.
Please note that this does -not- violate cache coherence: All three
CPUs agree on the order of accesses to each individual memory location.
(Or do you mean something else by "cache coherence"?)
> And this is all just looking at CPU0, nothing else. But if it is
> exclusive/dirty on CPU0, then it cannot be shared in any other CPU
> (although a previous stale value may obviously still be "in flight"
> somewhere else outside of the coherency domain).
>
> So let's look at CPU2, which is similar, but now the second access is
> a read (of zero), not a write:
>
> CPU2:
> write Y = 1;
> sync
> read X as zero
>
> So if 'sync' is a true memory barrier between the write and the read,
> then we know that the following is true: CPU2 must have gotten
> cacheline 'Y' into exclusive state and acquired (or held on to, which
> is equivalent) cacheline 'X' in shared state _after_ it got that Y
> into exclusive state. It can't rely on some "in flight" previously
> read value of 'X' until after it got Y into exclusive state. Otherwise
> it wouldn't be an ordering between the write and the read, agreed?
Again, this line of reasoning does not take into account the possibility
of store buffers being shared between hardware threads within a single
core. The key point is that CPU 0 and CPU 1 can be sharing the new value
of X prior to its reaching the cache, in other words, before CPU 2 can
see it.
So, no, I do not agree that this holds for all real hardware.
> The pattern on CPU1, meanwhile, is somewhat different. But I'm going
> to ignore the "while" part, and just concentrate on the last iteration
> of the loop, and it turns into:
>
> CPU1:
> read lock as zero
> lwsync
> read Y as zero
>
> It only does reads, so it is happy with a shared cacheline, but in
> order for the lwsync to actually act as an acquire, it does mean that
> the cacheline 'Y' needs to be in some shared state within the cache
> coherency domain after (or, again, across) cacheline 'lock' having
> been in a shared state with value == 0 on CPU1. No "in flight" values.
Or, if CPU 0 and CPU 1 are hardware threads in the same core, it is
happy with a shared store-buffer entry that might not yet be visible
to hardware threads in other cores.
> Agreed?
Sorry, but no.
> So the above isn't really about lwsync/sync/memory ordering any more,
> the above is basically rewriting things purely about cacheline states
> as seen by individual processors. And I know the POWER cache coherency
> is really complex (iirc cachelines can have 11+ states - we're not
> talking about MOESI any more), but even there you have to have a
> notion of "exclusive access" to be able to write in the end. So my
> states are certainly simplified, but I don't see how that basic rule
> can be violated (and still be called cache coherent).
>
> And let's just look at the individual "events" on these CPU's:
>
> - A = CPU0 acquires exclusive access to cacheline X (in order to
> write 1 into it)
> - B = CPU0 releases its exclusive access to cacheline lock (after
> having written 0 into it)
> - C = CPU1 reads shared cacheline lock as being zero
> - D = CPU1 reads shared cacheline Y as being zero
> - E = CPU2 acquires exclusive access to cacheline Y (in order to
> write 1 into it)
> - F = CPU2 reads shared cacheline X as being zero
>
> and you cannot order these events arbitrarily, but there *ARE* certain
> orderings you can rely on:
>
> - within a CPU due to the barriers. This gives you
>
> A < B, C < D and E < F
Again, consider the case of shared store buffers. In that case,
A < B and C < D does -not- imply A < D because the two lwsyncs will
-not- order a prior store (CPU 0's store to X) with a later load
(CPU 1's load from Y). To get that guarantee, at least one of the
lwsync instructions needs to instead be a sync.
> - with regards to a particular cacheline because of that cacheline
> coming exclusive (or, in the case of 'B', moving out of exclusive
> state) and the value it has at that point:
>
> B < C, F < A and D < E
>
> And as far as I can tell, the above gives you: A < B < C < D < E < F <
> A. Which doesn't look possible.
>
> So which step in my rumination here is actually wrong? Because I
> really don't see how you can get that "(r1 == 0 && r2 == 0)" on real
> hardware using cache coherency.
>
> *SOME* assumption of mine must be wrong. But I don't see which one.
Wrong assumption 1. Each hardware thread has its own store buffer.
Wrong assumption 2. All architectures guarantee multi-copy atomic behavior.
Thanx, Paul
> Linus
>
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 17:53 ` Paul E. McKenney
@ 2013-11-25 18:21 ` Peter Zijlstra
0 siblings, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-25 18:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Ingo Molnar, Tim Chen, Will Deacon, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, Nov 25, 2013 at 09:53:15AM -0800, Paul E. McKenney wrote:
> but the lwsync instructions do not order prior stores against
> later loads.
Bah I always forget that one.. :/
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-20 17:14 ` Paul E. McKenney
2013-11-20 18:43 ` Tim Chen
@ 2013-11-21 11:03 ` Peter Zijlstra
2013-11-21 12:56 ` Peter Zijlstra
2013-11-21 13:19 ` Paul E. McKenney
1 sibling, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-21 11:03 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Wed, Nov 20, 2013 at 09:14:00AM -0800, Paul E. McKenney wrote:
> > Hmm, so in the following case:
> >
> > Access A
> > unlock() /* release semantics */
> > lock() /* acquire semantics */
> > Access B
> >
> > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> >
> > I agree that accesses between the unlock and the lock can be move across both
> > A and B, but that doesn't seem to matter by my reading of the above.
> >
> > What is the problematic scenario you have in mind? Are you thinking of the
> > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > which I don't think any architectures supported by Linux implement...
> > (ARMv8 acquire/release is RCsc).
>
> If smp_load_acquire() and smp_store_release() are both implemented using
> lwsync on powerpc, and if Access A is a store and Access B is a load,
> then Access A and Access B can be reordered.
>
> Of course, if every other architecture will be providing RCsc implementations
> for smp_load_acquire() and smp_store_release(), which would not be a bad
> thing, then another approach is for powerpc to use sync rather than lwsync
> for one or the other of smp_load_acquire() or smp_store_release().
So which of the two would make most sense?
As per the Document, loads/stores should not be able to pass up through
an ACQUIRE and loads/stores should not be able to pass down through a
RELEASE.
I think PPC would match that if we use sync for smp_store_release() such
that it will flush the store buffer, and thereby guarantee all stores
are kept within the required section.
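Concretely, the two powerpc flavours being weighed here would look roughly
like the sketch below; the macro names are made up for illustration and this
is not the actual in-flight smp_store_release() patch.
/*
 * Sketch only -- made-up names, not the actual patch.  The release
 * barrier goes before the store; the question is whether lwsync
 * (which does not order prior stores against later loads) is enough,
 * or whether a full sync is needed so that unlock followed by lock on
 * another CPU adds up to a full barrier.
 */
#define ppc_store_release_lwsync(p, v) \
do { \
	__asm__ __volatile__("lwsync" : : : "memory"); \
	ACCESS_ONCE(*(p)) = (v); \
} while (0)

#define ppc_store_release_sync(p, v) \
do { \
	__asm__ __volatile__("sync" : : : "memory"); \
	ACCESS_ONCE(*(p)) = (v); \
} while (0)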
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 11:03 ` Peter Zijlstra
@ 2013-11-21 12:56 ` Peter Zijlstra
2013-11-21 13:20 ` Paul E. McKenney
2013-11-21 13:19 ` Paul E. McKenney
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-21 12:56 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 12:03:08PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 09:14:00AM -0800, Paul E. McKenney wrote:
> > > Hmm, so in the following case:
> > >
> > > Access A
> > > unlock() /* release semantics */
> > > lock() /* acquire semantics */
> > > Access B
> > >
> > > A cannot pass beyond the unlock() and B cannot pass the before the lock().
> > >
> > > I agree that accesses between the unlock and the lock can be move across both
> > > A and B, but that doesn't seem to matter by my reading of the above.
> > >
> > > What is the problematic scenario you have in mind? Are you thinking of the
> > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > which I don't think any architectures supported by Linux implement...
> > > (ARMv8 acquire/release is RCsc).
> >
> > If smp_load_acquire() and smp_store_release() are both implemented using
> > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > then Access A and Access B can be reordered.
> >
> > Of course, if every other architecture will be providing RCsc implementations
> > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > thing, then another approach is for powerpc to use sync rather than lwsync
> > for one or the other of smp_load_acquire() or smp_store_release().
>
> So which of the two would make most sense?
>
> As per the Document, loads/stores should not be able to pass up through
> an ACQUIRE and loads/stores should not be able to pass down through a
> RELEASE.
>
> I think PPC would match that if we use sync for smp_store_release() such
> that it will flush the store buffer, and thereby guarantee all stores
> are kept within the required section.
Wouldn't that also mean that TSO archs need the full barrier on
RELEASE?
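On x86 the question comes down to whether the release store can stay a plain
store behind a compiler barrier, or whether it would need a real fence.
Roughly, and again with made-up names rather than actual kernel definitions:
/* The two candidate shapes behind the question above -- sketch only. */
#define x86_store_release_plain(p, v) \
do { \
	barrier();   /* compiler barrier; hardware ordering from TSO */ \
	ACCESS_ONCE(*(p)) = (v); \
} while (0)

#define x86_store_release_mb(p, v) \
do { \
	__asm__ __volatile__("mfence" : : : "memory"); \
	ACCESS_ONCE(*(p)) = (v); \
} while (0)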
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 12:56 ` Peter Zijlstra
@ 2013-11-21 13:20 ` Paul E. McKenney
2013-11-21 17:25 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 13:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 01:56:16PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 12:03:08PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2013 at 09:14:00AM -0800, Paul E. McKenney wrote:
> > > > Hmm, so in the following case:
> > > >
> > > > Access A
> > > > unlock() /* release semantics */
> > > > lock() /* acquire semantics */
> > > > Access B
> > > >
> > > > A cannot pass beyond the unlock() and B cannot pass before the lock().
> > > >
> > > > I agree that accesses between the unlock and the lock can be moved across both
> > > > A and B, but that doesn't seem to matter by my reading of the above.
> > > >
> > > > What is the problematic scenario you have in mind? Are you thinking of the
> > > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > > which I don't think any architectures supported by Linux implement...
> > > > (ARMv8 acquire/release is RCsc).
> > >
> > > If smp_load_acquire() and smp_store_release() are both implemented using
> > > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > > then Access A and Access B can be reordered.
> > >
> > > Of course, if every other architecture will be providing RCsc implementations
> > > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > > thing, then another approach is for powerpc to use sync rather than lwsync
> > > for one or the other of smp_load_acquire() or smp_store_release().
> >
> > So which of the two would make most sense?
> >
> > As per the Document, loads/stores should not be able to pass up through
> > an ACQUIRE and loads/stores should not be able to pass down through a
> > RELEASE.
> >
> > I think PPC would match that if we use sync for smp_store_release() such
> > that it will flush the store buffer, and thereby guarantee all stores
> > are kept within the required section.
>
> Wouldn't that also mean that TSO archs need the full barrier on
> RELEASE?
It just might... I was thinking not, but I do need to check.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 13:20 ` Paul E. McKenney
@ 2013-11-21 17:25 ` Paul E. McKenney
2013-11-21 21:52 ` Peter Zijlstra
2013-12-04 21:26 ` Andi Kleen
0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 17:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 05:20:41AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 21, 2013 at 01:56:16PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 21, 2013 at 12:03:08PM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 20, 2013 at 09:14:00AM -0800, Paul E. McKenney wrote:
> > > > > Hmm, so in the following case:
> > > > >
> > > > > Access A
> > > > > unlock() /* release semantics */
> > > > > lock() /* acquire semantics */
> > > > > Access B
> > > > >
> > > > > A cannot pass beyond the unlock() and B cannot pass before the lock().
> > > > >
> > > > > I agree that accesses between the unlock and the lock can be moved across both
> > > > > A and B, but that doesn't seem to matter by my reading of the above.
> > > > >
> > > > > What is the problematic scenario you have in mind? Are you thinking of the
> > > > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > > > which I don't think any architectures supported by Linux implement...
> > > > > (ARMv8 acquire/release is RCsc).
> > > >
> > > > If smp_load_acquire() and smp_store_release() are both implemented using
> > > > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > > > then Access A and Access B can be reordered.
> > > >
> > > > Of course, if every other architecture will be providing RCsc implementations
> > > > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > > > thing, then another approach is for powerpc to use sync rather than lwsync
> > > > for one or the other of smp_load_acquire() or smp_store_release().
> > >
> > > So which of the two would make most sense?
> > >
> > > As per the Document, loads/stores should not be able to pass up through
> > > an ACQUIRE and loads/stores should not be able to pass down through a
> > > RELEASE.
> > >
> > > I think PPC would match that if we use sync for smp_store_release() such
> > > that it will flush the store buffer, and thereby guarantee all stores
> > > are kept within the required section.
> >
> > Wouldn't that also mean that TSO archs need the full barrier on
> > RELEASE?
>
> It just might... I was thinking not, but I do need to check.
I am still thinking not, at least for x86, given Section 8.2.2 of
"Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
dated March 2013 from:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
Also from Sewell et al. "x86-TSO: A Rigorous and Usable Programmer's
Model for x86 Multiprocessors" in 2010 CACM.
Let's apply the Intel manual to the earlier example:
CPU 0               CPU 1                   CPU 2
-----               -----                   -----
x = 1;              r1 = SLA(lock);         y = 1;
SSR(lock, 1);       r2 = y;                 smp_mb();
                                            r3 = x;
assert(!(r1 == 1 && r2 == 0 && r3 == 0));
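For reference, the same test written out as a C-style sketch, with SLA()/SSR()
expanded into smp_load_acquire()/smp_store_release() (the cpuN() function names
and the pointer parameters are just illustrative):

int x, y, lock;

void cpu0(void)
{
        ACCESS_ONCE(x) = 1;
        smp_store_release(&lock, 1);            /* SSR(lock, 1) */
}

void cpu1(int *r1, int *r2)
{
        *r1 = smp_load_acquire(&lock);          /* SLA(lock) */
        *r2 = ACCESS_ONCE(y);
}

void cpu2(int *r3)
{
        ACCESS_ONCE(y) = 1;
        smp_mb();
        *r3 = ACCESS_ONCE(x);
}

/* Per the assertion, r1 == 1 && r2 == 0 && r3 == 0 should be impossible. */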
Let's try applying this to x86:
o Stores from a given processor are ordered, so everyone
agrees that CPU 0's store to x happens before the store-release
to lock.
o Reads from a given processor are ordered, so everyone agrees
that CPU 1's load from lock precedes its load from y.
o Because r1 == 1, we know that CPU 0's store to lock happened
before CPU 1's load from lock.
o Causality (AKA transitive visibility) means that everyone
agrees that CPU 0's store to x happens before CPU 1's load
from y. (This is a crucial point, so it would be good to
have confirmation/debunking from someone who understands
the x86 architecture better than I do.)
o CPU 2's memory barrier prohibits CPU 2's store to y from
being reordered with its load from x.
o Because r2 == 0, we know that CPU 1's load from y happened
before CPU 2's store to y.
o At this point, it looks to me that (r1 == 1 && r2 == 0)
implies r3 == 1.
Sewell's model plays out as follows:
o Rule 2 never applies in this example because no processor
is reading its own write.
o Rules 3 and 4 force CPU 0's writes to be seen in order.
o Rule 1 combined with the ordered-instruction nature of
the model forces CPU 1's reads to happen in order.
o Rule 4 means that if r1 == 1, CPU 0's write to x is
globally visible before CPU 1 loads from y.
o The fact that r2 == 0 combined with rules 1, 3, and 4
means that CPU 1's load from y happens before CPU 2 makes
its store to y visible.
o Rule 5 means that CPU 1 cannot execute its load from x
until it has made its store to y globally visible.
o Therefore, when CPU 2 executes its load from x, CPU 0's
store to x must be visible, ruling out r3 == 0, and
preventing the assertion from firing.
The other three orderings would play out similarly. (These are read
before lock release and read after subsequent lock acquisition, read
before lock release and write after subsequent lock acquisition, and
write before lock release and write after subsequent lock acquisition.)
But before chasing those down, is the analysis above sound?
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 17:25 ` Paul E. McKenney
@ 2013-11-21 21:52 ` Peter Zijlstra
2013-11-21 22:18 ` Paul E. McKenney
2013-12-04 21:26 ` Andi Kleen
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-21 21:52 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 09:25:58AM -0800, Paul E. McKenney wrote:
> I am still thinking not, at least for x86, given Section 8.2.2 of
> "Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
> dated March 2013 from:
>
> http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
>
> Also from Sewell et al. "x86-TSO: A Rigorous and Usable Programmer's
> Model for x86 Multiprocessors" in 2010 CACM.
Should be this one:
http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
And the rules referenced below are on page 5; left-hand column.
> Let's apply the Intel manual to the earlier example:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> x = 1; r1 = SLA(lock); y = 1;
> SSR(lock, 1); r2 = y; smp_mb();
> r3 = x;
>
> assert(!(r1 == 1 && r2 == 0 && r3 == 0));
>
> Let's try applying this to x86:
>
> o Stores from a given processor are ordered, so everyone
> agrees that CPU 0's store to x happens before the store-release
> to lock.
>
> o Reads from a given processor are ordered, so everyone agrees
> that CPU 1's load from lock precedes its load from y.
>
> o Because r1 == 1, we know that CPU 0's store to lock happened
> before CPU 1's load from lock.
>
> o Causality (AKA transitive visibility) means that everyone
> agrees that CPU 0's store to x happens before CPU 1's load
> from y. (This is a crucial point, so it would be good to
> have confirmation/debunking from someone who understands
> the x86 architecture better than I do.)
>
> o CPU 2's memory barrier prohibits CPU 2's store to y from
> being reordered with its load from x.
>
> o Because r2 == 0, we know that CPU 1's load from y happened
> before CPU 2's store to y.
>
> o At this point, it looks to me that (r1 == 1 && r2 == 0)
> implies r3 == 1.
>
> Sewell's model plays out as follows:
>
> o Rule 2 never applies in this example because no processor
> is reading its own write.
>
> o Rules 3 and 4 force CPU 0's writes to be seen in order.
>
> o Rule 1 combined with the ordered-instruction nature of
> the model force CPU 1's reads to happen in order.
>
> o Rule 4 means that if r1 == 1, CPU 0's write to x is
> globally visible before CPU 1 loads from y.
>
> o The fact that r2 == 0 combined with rules 1, 3, and 4
> mean that CPU 1's load from y happens before CPU 2 makes
> its store to y visible.
>
> o Rule 5 means that CPU 1 cannot execute its load from x
> until it has made its store to y globally visible.
>
> o Therefore, when CPU 2 executes its load from x, CPU 0's
> store to x must be visible, ruling out r3 == 0, and
> preventing the assertion from firing.
>
> The other three orderings would play out similarly. (These are read
> before lock release and read after subsequent lock acquisition, read
> before lock release and write after subsequent lock acquisition, and
> write before lock release and write after subsequent lock acquisition.)
>
> But before chasing those down, is the analysis above sound?
I _think_ so... but it's late. I'm also struggling to find where lwsync
goes bad in this story, because, as you say, lwsync does everything except
flush the store buffer, which sounds like TSO... but clearly it is not quite
the same.
For one TSO has multi-copy atomicity, whereas ARM/PPC do not have this.
The explanation of lwsync given in 3.3 of 'A Tutorial Introduction to
the ARM and POWER Relaxed Memory Models'
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
Leaves me slightly puzzled as to the exact differences between the 2 WW
variants.
Anyway, hopefully a little sleep will cure some of my confusion,
otherwise I'll try and confuse you more tomorrow ;-)
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 21:52 ` Peter Zijlstra
@ 2013-11-21 22:18 ` Paul E. McKenney
2013-11-22 15:58 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 22:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 10:52:49PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 09:25:58AM -0800, Paul E. McKenney wrote:
> > I am still thinking not, at least for x86, given Section 8.2.2 of
> > "Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
> > dated March 2013 from:
> >
> > http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
> >
> > Also from Sewell et al. "x86-TSO: A Rigorous and Usable Programmer's
> > Model for x86 Multiprocessors" in 2010 CACM.
>
> Should be this one:
>
> http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
>
> And the rules referenced below are on page 5; left-hand column.
Yep, that is the one! (I was relying on my ACM Digital Library
subscription.)
> > Let's apply the Intel manual to the earlier example:
> >
> > CPU 0 CPU 1 CPU 2
> > ----- ----- -----
> > x = 1; r1 = SLA(lock); y = 1;
> > SSR(lock, 1); r2 = y; smp_mb();
> > r3 = x;
> >
> > assert(!(r1 == 1 && r2 == 0 && r3 == 0));
> >
> > Let's try applying this to x86:
> >
> > o Stores from a given processor are ordered, so everyone
> > agrees that CPU 0's store to x happens before the store-release
> > to lock.
> >
> > o Reads from a given processor are ordered, so everyone agrees
> > that CPU 1's load from lock precedes its load from y.
> >
> > o Because r1 == 1, we know that CPU 0's store to lock happened
> > before CPU 1's load from lock.
> >
> > o Causality (AKA transitive visibility) means that everyone
> > agrees that CPU 0's store to x happens before CPU 1's load
> > from y. (This is a crucial point, so it would be good to
> > have confirmation/debunking from someone who understands
> > the x86 architecture better than I do.)
> >
> > o CPU 2's memory barrier prohibits CPU 2's store to y from
> > being reordered with its load from x.
> >
> > o Because r2 == 0, we know that CPU 1's load from y happened
> > before CPU 2's store to y.
> >
> > o At this point, it looks to me that (r1 == 1 && r2 == 0)
> > implies r3 == 1.
> >
> > Sewell's model plays out as follows:
> >
> > o Rule 2 never applies in this example because no processor
> > is reading its own write.
> >
> > o Rules 3 and 4 force CPU 0's writes to be seen in order.
> >
> > o Rule 1 combined with the ordered-instruction nature of
> > the model force CPU 1's reads to happen in order.
> >
> > o Rule 4 means that if r1 == 1, CPU 0's write to x is
> > globally visible before CPU 1 loads from y.
> >
> > o The fact that r2 == 0 combined with rules 1, 3, and 4
> > mean that CPU 1's load from y happens before CPU 2 makes
> > its store to y visible.
> >
> > o Rule 5 means that CPU 1 cannot execute its load from x
> > until it has made its store to y globally visible.
> >
> > o Therefore, when CPU 2 executes its load from x, CPU 0's
> > store to x must be visible, ruling out r3 == 0, and
> > preventing the assertion from firing.
> >
> > The other three orderings would play out similarly. (These are read
> > before lock release and read after subsequent lock acquisition, read
> > before lock release and write after subsequent lock acquisition, and
> > write before lock release and write after subsequent lock acquisition.)
> >
> > But before chasing those down, is the analysis above sound?
>
> I _think_ so.. but its late. I'm also struggling to find where lwsync
> goes bad in this story, because as you say lwsync does all except flush
> the store buffer, which sounds like TSO.. but clearly is not quite the
> same.
>
> For one TSO has multi-copy atomicity, whereas ARM/PPC do not have this.
At least PPC lwsync does not have multi-copy atomicity. The heavier sync
instruction does. Last I checked, ARM did not have a direct replacement
for lwsync, though it does have something sort of like eieio.
But yes, we were trying to use lwsync for both smp_load_acquire() and
smp_store_release(), which does not provide exactly the same guarantees
that TSO does. Though it does come close in many cases.
> The explanation of lwsync given in 3.3 of 'A Tutorial Introduction to
> the ARM and POWER Relaxed Memory Models'
>
> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>
> Leaves me slightly puzzled as to the exact differences between the 2 WW
> variants.
The WW at the top of the page is discussing the ARM "dmb" instruction
and the PPC "sync" instruction, both of which are full barriers, and
both of which can therefore be thought of as flushing the store buffer.
In contrast, the WW at the bottom of the page is discussing the PPC
lwsync instruction, which most definitely is not guaranteed to flush
the write buffer.
> Anyway, hopefully a little sleep will cure some of my confusion,
> otherwise I'll try and confuse you more tomorrow ;-)
Fair enough! ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 22:18 ` Paul E. McKenney
@ 2013-11-22 15:58 ` Peter Zijlstra
2013-11-22 18:26 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-22 15:58 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 02:18:59PM -0800, Paul E. McKenney wrote:
> > > Let's apply the Intel manual to the earlier example:
> > >
> > > CPU 0 CPU 1 CPU 2
> > > ----- ----- -----
> > > x = 1; r1 = SLA(lock); y = 1;
> > > SSR(lock, 1); r2 = y; smp_mb();
> > > r3 = x;
> > >
> > > assert(!(r1 == 1 && r2 == 0 && r3 == 0));
> > >
> > > Let's try applying this to x86:
> > >
> > > o Stores from a given processor are ordered, so everyone
> > > agrees that CPU 0's store to x happens before the store-release
> > > to lock.
> > >
> > > o Reads from a given processor are ordered, so everyone agrees
> > > that CPU 1's load from lock precedes its load from y.
> > >
> > > o Because r1 == 1, we know that CPU 0's store to lock happened
> > > before CPU 1's load from lock.
> > >
> > > o Causality (AKA transitive visibility) means that everyone
> > > agrees that CPU 0's store to x happens before CPU 1's load
> > > from y. (This is a crucial point, so it would be good to
> > > have confirmation/debunking from someone who understands
> > > the x86 architecture better than I do.)
> > >
> > > o CPU 2's memory barrier prohibits CPU 2's store to y from
> > > being reordered with its load from x.
> > >
> > > o Because r2 == 0, we know that CPU 1's load from y happened
> > > before CPU 2's store to y.
> > >
> > > o At this point, it looks to me that (r1 == 1 && r2 == 0)
> > > implies r3 == 1.
Agreed, and I now fully appreciate the transitive point. I can't say if
x86 does in fact do this, but I can agree that rules in the SDM support
your logic.
> > > Sewell's model plays out as follows:
I have problems with these rules, for instance:
> > > o Rules 3 and 4 force CPU 0's writes to be seen in order.
Nothing in those rules states that the store buffer is a strict FIFO; it
might be suggested by rule 4's use of 'oldest', but, being a pedant, I note
that the text as given doesn't disallow store reordering in the store buffer.
Suppose an address a was written to two times; a store buffer might
simply update the entry for the first write with the new value. The
entry would still become oldest at some point and get flushed.
(Note the above is ambiguous as to whether the entry's time stamp is updated
or not -- also note that both cases violate TSO and therefore it doesn't
matter.)
That violates FIFO (and TSO) but not the definitions.
Similarly, it's not at all clear from rule 2 that reads are not
re-ordered.
So I'll ignore this section for now.
OK, so reading back a little, he does describe the abstract machine and
does say the store buffer is FIFO, but what use are rules if you first
need more 'rules'?
Rules should be self-supporting.
> > I _think_ so.. but its late. I'm also struggling to find where lwsync
> > goes bad in this story, because as you say lwsync does all except flush
> > the store buffer, which sounds like TSO.. but clearly is not quite the
> > same.
> >
> > For one TSO has multi-copy atomicity, whereas ARM/PPC do not have this.
>
> At least PPC lwsync does not have multi-copy atomicity.
Right, which is why it lacks transitivity.
> The heavier sync
> instruction does. Last I checked, ARM did not have a direct replacement
> for lwsync, though it does have something sort of like eieio.
>
> But yes, we were trying to use lwsync for both smp_load_acquire() and
> smp_store_release(), which does not provide exactly the same guarantees
> that TSO does. Though it does come close in many cases.
Right, and this lack of transitivity is what kills it.
> > The explanation of lwsync given in 3.3 of 'A Tutorial Introduction to
> > the ARM and POWER Relaxed Memory Models'
> >
> > http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> >
> > Leaves me slightly puzzled as to the exact differences between the 2 WW
> > variants.
>
> The WW at the top of the page is discussing the ARM "dmb" instruction
> and the PPC "sync" instruction, both of which are full barriers, and
> both of which can therefore be thought of as flushing the store buffer.
>
> In contrast, the WW at the bottom of the page is discussing the PPC
> lwsync instruction, which most definitely is not guaranteed to flush
> the write buffer.
Right, my complaint was that from the two definitions given the factual
difference in behaviour was not clear to me.
I knew that upgrading the SSR's lwsync to sync would 'fix' the thing,
and that is exactly the WW case, but given these two definitions I was
at a loss to find the hole.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 15:58 ` Peter Zijlstra
@ 2013-11-22 18:26 ` Paul E. McKenney
2013-11-22 18:51 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 18:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 04:58:35PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 02:18:59PM -0800, Paul E. McKenney wrote:
> > > > Let's apply the Intel manual to the earlier example:
> > > >
> > > > CPU 0 CPU 1 CPU 2
> > > > ----- ----- -----
> > > > x = 1; r1 = SLA(lock); y = 1;
> > > > SSR(lock, 1); r2 = y; smp_mb();
> > > > r3 = x;
> > > >
> > > > assert(!(r1 == 1 && r2 == 0 && r3 == 0));
> > > >
> > > > Let's try applying this to x86:
> > > >
> > > > o Stores from a given processor are ordered, so everyone
> > > > agrees that CPU 0's store to x happens before the store-release
> > > > to lock.
> > > >
> > > > o Reads from a given processor are ordered, so everyone agrees
> > > > that CPU 1's load from lock precedes its load from y.
> > > >
> > > > o Because r1 == 1, we know that CPU 0's store to lock happened
> > > > before CPU 1's load from lock.
> > > >
> > > > o Causality (AKA transitive visibility) means that everyone
> > > > agrees that CPU 0's store to x happens before CPU 1's load
> > > > from y. (This is a crucial point, so it would be good to
> > > > have confirmation/debunking from someone who understands
> > > > the x86 architecture better than I do.)
> > > >
> > > > o CPU 2's memory barrier prohibits CPU 2's store to y from
> > > > being reordered with its load from x.
> > > >
> > > > o Because r2 == 0, we know that CPU 1's load from y happened
> > > > before CPU 2's store to y.
> > > >
> > > > o At this point, it looks to me that (r1 == 1 && r2 == 0)
> > > > implies r3 == 1.
>
> Agreed, and I now fully appreciate the transitive point. I can't say if
> x86 does in fact do this, but I can agree that rules in the SDM support
> your logic.
OK, thank you for looking it over!
> > > > Sewell's model plays out as follows:
>
> I have problems with these rules, for instance:
>
> > > > o Rules 3 and 4 force CPU 0's writes to be seen in order.
>
> Nothing in those rules states that the store buffer is a strict FIFO; it
> might be suggested by rule 4's use of 'oldest', but, being a pedant, I note
> that the text as given doesn't disallow store reordering in the store buffer.
>
> Suppose an address a was written to two times; a store buffer might
> simply update the entry for the first write with the new value. The
> entry would still become oldest at some point and get flushed.
>
> (Note the above is ambiguous as to whether the entry's time stamp is updated
> or not -- also note that both cases violate TSO and therefore it doesn't
> matter.)
>
> That violates FIFO (and TSO) but not the definitions.
>
> Similarly, it's not at all clear from rule 2 that reads are not
> re-ordered.
>
> So I'll ignore this section for now.
>
>
> OK, so reading back a little, he does describe the abstract machine and
> does say the store buffer is FIFO, but what use are rules if you first
> need more 'rules'?
>
> Rules should be self-supporting.
Better than the older versions, but yes, it could be better. The key point
is that if a store updates an existing store-buffer entry, that entry becomes
the youngest. It would be interesting to see what a continuous series
of stores to the same variable from different CPUs did to the hardware.
By this set of rules, the hardware would be within its rights to never
propagate any of the stored values to memory. ;-)
> > > I _think_ so.. but its late. I'm also struggling to find where lwsync
> > > goes bad in this story, because as you say lwsync does all except flush
> > > the store buffer, which sounds like TSO.. but clearly is not quite the
> > > same.
> > >
> > > For one TSO has multi-copy atomicity, whereas ARM/PPC do not have this.
> >
> > At least PPC lwsync does not have multi-copy atomicity.
>
> Right, which is why it lacks transitivity.
>
> > The heavier sync
> > instruction does. Last I checked, ARM did not have a direct replacement
> > for lwsync, though it does have something sort of like eieio.
> >
> > But yes, we were trying to use lwsync for both smp_load_acquire() and
> > smp_store_release(), which does not provide exactly the same guarantees
> > that TSO does. Though it does come close in many cases.
>
> Right, and this lack of transitivity is what kills it.
I am thinking that Linus's email and Ingo's follow-on say that the
powerpc version of smp_store_release() needs to use sync. Though to
be honest, it is x86 that has been causing me the most cognitive pain.
And to be even more honest, I must confess that I still don't understand
the nature of the complaint. Ah well, that is the next email to
reply to. ;-)
The real source of my cognitive pain is that here we have a sequence of
code that has neither atomic instructions nor memory-barrier instructions,
but it looks like it still manages to act as a full memory barrier.
Still not quite sure I should trust it...
> > > The explanation of lwsync given in 3.3 of 'A Tutorial Introduction to
> > > the ARM and POWER Relaxed Memory Models'
> > >
> > > http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> > >
> > > Leaves me slightly puzzled as to the exact differences between the 2 WW
> > > variants.
> >
> > The WW at the top of the page is discussing the ARM "dmb" instruction
> > and the PPC "sync" instruction, both of which are full barriers, and
> > both of which can therefore be thought of as flushing the store buffer.
> >
> > In contrast, the WW at the bottom of the page is discussing the PPC
> > lwsync instruction, which most definitely is not guaranteed to flush
> > the write buffer.
>
> Right, my complaint was that from the two definitions given the factual
> difference in behaviour was not clear to me.
>
> I knew that upgrading the SSR's lwsync to sync would 'fix' the thing,
> and that is exactly the WW case, but given these two definitions I was
> at a loss to find the hole.
Well, it certainly shows that placing lwsync between each pair of memory
references does not bring powerpc all the way to TSO.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 18:26 ` Paul E. McKenney
@ 2013-11-22 18:51 ` Peter Zijlstra
2013-11-22 18:59 ` Paul E. McKenney
2013-11-25 17:35 ` Peter Zijlstra
0 siblings, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-22 18:51 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 10:26:32AM -0800, Paul E. McKenney wrote:
> The real source of my cognitive pain is that here we have a sequence of
> code that has neither atomic instructions nor memory-barrier instructions,
> but it looks like it still manages to act as a full memory barrier.
> Still not quite sure I should trust it...
Yes, this is something that puzzles me too.
That said, the two rules that:
1) stores aren't re-ordered against other stores
2) reads aren't re-ordered against other reads
Do make that:
STORE x
LOAD x
form a fence that neither stores nor loads can pass through from
either side; note however that they themselves rely on the data
dependency to not reorder against themselves.
If you put them the other way around:
LOAD x
STORE y
we seem to get a stronger variant because stores are not re-ordered
against older reads.
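To make that concrete, a minimal sketch of the pattern (the helper name is
made up; the comments just restate the reasoning above):

static inline void store_then_load_same_location(int *x)
{
        ACCESS_ONCE(*x) = 1;    /* STORE x */
        (void)ACCESS_ONCE(*x);  /* LOAD x: reads the location just stored,
                                 * so it cannot be hoisted above the store */
}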
There is however the exception clause for rule 1) above, which includes
clflush, non-temporal stores and string ops; the actual mfence
instruction doesn't seem to have this exception and would thus be
slightly stronger still.
Still a confusing situation all round.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 18:51 ` Peter Zijlstra
@ 2013-11-22 18:59 ` Paul E. McKenney
2013-11-25 17:35 ` Peter Zijlstra
1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-22 18:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 07:51:07PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 10:26:32AM -0800, Paul E. McKenney wrote:
> > The real source of my cognitive pain is that here we have a sequence of
> > code that has neither atomic instructions nor memory-barrier instructions,
> > but it looks like it still manages to act as a full memory barrier.
> > Still not quite sure I should trust it...
>
> Yes, this is something that puzzles me too.
>
> That said, the two rules that:
>
> 1) stores aren't re-ordered against other stores
> 2) reads aren't re-ordered against other reads
>
> Do make that:
>
> STORE x
> LOAD x
>
> form a fence that neither stores nor loads can pass through from
> either side; note however that they themselves rely on the data
> dependency to not reorder against themselves.
>
> If you put them the other way around:
>
> LOAD x
> STORE y
>
> we seem to get a stronger variant because stores are not re-ordered
> against older reads.
>
> There is however the exception clause for rule 1) above, which includes
> clflush, non-temporal stores and string ops; the actual mfence
> instruction doesn't seem to have this exception and would thus be
> slightly stronger still.
>
> Still a confusing situation all round.
At some point, we need people from Intel and AMD to look at it.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-22 18:51 ` Peter Zijlstra
2013-11-22 18:59 ` Paul E. McKenney
@ 2013-11-25 17:35 ` Peter Zijlstra
2013-11-25 18:02 ` Paul E. McKenney
2013-11-25 18:52 ` H. Peter Anvin
1 sibling, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-25 17:35 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Fri, Nov 22, 2013 at 07:51:07PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 10:26:32AM -0800, Paul E. McKenney wrote:
> > The real source of my cognitive pain is that here we have a sequence of
> > code that has neither atomic instructions nor memory-barrier instructions,
> > but it looks like it still manages to act as a full memory barrier.
> > Still not quite sure I should trust it...
>
> Yes, this is something that puzzles me too.
>
> That said, the two rules that:
>
> 1) stores aren't re-ordered against other stores
> 2) reads aren't re-ordered against other reads
>
> Do make that:
>
> STORE x
> LOAD x
>
> form a fence that neither stores nor loads can pass through from
> either side; note however that they themselves rely on the data
> dependency to not reorder against themselves.
>
> If you put them the other way around:
>
> LOAD x
> STORE y
>
> we seem to get a stronger variant because stores are not re-ordered
> against older reads.
>
> > There is however the exception clause for rule 1) above, which includes
> clflush, non-temporal stores and string ops; the actual mfence
> instruction doesn't seem to have this exception and would thus be
> slightly stronger still.
>
> Still a confusing situation all round.
I think this means x86 needs help too.
Consider:
x = y = 0
w[x] = 1 | w[y] = 1
mfence | mfence
r[y] = 0 | r[x] = 0
This is generally an impossible case, right? (Since if we observe y=0
this means that w[y]=1 has not yet happened, and therefore x=1, and
vice-versa).
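In C-style form, that mfence case would be something like the sketch below
(names illustrative; smp_mb() is mfence on x86):

int x, y;

void cpu0(int *r_y)
{
        ACCESS_ONCE(x) = 1;     /* w[x] = 1 */
        smp_mb();               /* mfence   */
        *r_y = ACCESS_ONCE(y);  /* r[y]     */
}

void cpu1(int *r_x)
{
        ACCESS_ONCE(y) = 1;     /* w[y] = 1 */
        smp_mb();               /* mfence   */
        *r_x = ACCESS_ONCE(x);  /* r[x]     */
}

/* *r_y == 0 && *r_x == 0 together should be impossible. */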
Now replace one of the mfences with smp_store_release(l1);
smp_load_acquire(l2); such that we have a RELEASE+ACQUIRE pair that
_should_ form a full barrier:
w[x] = 1 | w[y] = 1
w[l1] = 1 | mfence
r[l2] = 0 | r[x] = 0
r[y] = 0 |
At which point we can observe the impossible, because as per the rule:
'reads may be reordered with older writes to different locations'
Our r[y] can slip before the w[x]=1.
Thus x86's smp_store_release() would need to be:
+#define smp_store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ smp_mb(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
Or: (void)xchg((p), (v));
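Spelled out, that xchg() variant might look like the following sketch (it
relies on x86's implicitly LOCK-prefixed xchg acting as a full barrier):

#define smp_store_release(p, v)                         \
do {                                                    \
        compiletime_assert_atomic_type(*p);             \
        (void)xchg((p), (v));                           \
} while (0)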
Idem for s390 and sparc I suppose.
The only reason your example worked is because the unlock and lock were
for the same lock.
This of course leaves us without joy for circular buffers, which can do
without this LOCK'ed op and without sync on PPC. Now I'm not at all sure
we've got enough of those to justify primitives just for them.
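For what it's worth, the circular-buffer case is the classic single-producer/
single-consumer ring, roughly like the sketch below, where RELEASE/ACQUIRE
ordering between the two threads is all that is wanted (sizes and names are
illustrative, and it assumes exactly one producer and one consumer):

#define RING_SIZE 16    /* any power of two */

struct ring {
        unsigned int head;      /* written only by the producer */
        unsigned int tail;      /* written only by the consumer */
        int buf[RING_SIZE];
};

static int ring_put(struct ring *r, int v)
{
        unsigned int head = r->head;
        unsigned int tail = smp_load_acquire(&r->tail);

        if (head - tail >= RING_SIZE)
                return 0;                       /* full */
        r->buf[head & (RING_SIZE - 1)] = v;
        smp_store_release(&r->head, head + 1);  /* publish the element */
        return 1;
}

static int ring_get(struct ring *r, int *v)
{
        unsigned int tail = r->tail;
        unsigned int head = smp_load_acquire(&r->head);

        if (tail == head)
                return 0;                       /* empty */
        *v = r->buf[tail & (RING_SIZE - 1)];
        smp_store_release(&r->tail, tail + 1);  /* free the slot */
        return 1;
}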
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 17:35 ` Peter Zijlstra
@ 2013-11-25 18:02 ` Paul E. McKenney
2013-11-25 18:24 ` Peter Zijlstra
` (2 more replies)
2013-11-25 18:52 ` H. Peter Anvin
1 sibling, 3 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 18:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, Nov 25, 2013 at 06:35:40PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 07:51:07PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 10:26:32AM -0800, Paul E. McKenney wrote:
> > > The real source of my cognitive pain is that here we have a sequence of
> > > code that has neither atomic instructions nor memory-barrier instructions,
> > > but it looks like it still manages to act as a full memory barrier.
> > > Still not quite sure I should trust it...
> >
> > Yes, this is something that puzzles me too.
> >
> > That said, the two rules that:
> >
> > 1) stores aren't re-ordered against other stores
> > 2) reads aren't re-ordered against other reads
> >
> > Do make that:
> >
> > STORE x
> > LOAD x
> >
> > form a fence that neither stores nor loads can pass through from
> > either side; note however that they themselves rely on the data
> > dependency to not reorder against themselves.
> >
> > If you put them the other way around:
> >
> > LOAD x
> > STORE y
> >
> > we seem to get a stronger variant because stores are not re-ordered
> > against older reads.
> >
> > There is however the exception clause for rule 1) above, which includes
> > clflush, non-temporal stores and string ops; the actual mfence
> > instruction doesn't seem to have this exception and would thus be
> > slightly stronger still.
> >
> > Still a confusing situation all round.
>
> I think this means x86 needs help too.
I still do not believe that it does. Again, strangely enough.
We need to ask someone in Intel that understands this all the way down
to the silicon. The guy I used to rely on for this no longer works
at Intel.
Do you know someone who fits this description, or should I start sending
cold-call emails to various Intel contacts?
> Consider:
>
> x = y = 0
>
> w[x] = 1 | w[y] = 1
> mfence | mfence
> r[y] = 0 | r[x] = 0
>
> This is generally an impossible case, right? (Since if we observe y=0
> this means that w[y]=1 has not yet happened, and therefore x=1, and
> vice-versa).
>
> Now replace one of the mfences with smp_store_release(l1);
> smp_load_acquire(l2); such that we have a RELEASE+ACQUIRE pair that
> _should_ form a full barrier:
>
> w[x] = 1 | w[y] = 1
> w[l1] = 1 | mfence
> r[l2] = 0 | r[x] = 0
> r[y] = 0 |
>
> At which point we can observe the impossible, because as per the rule:
>
> 'reads may be reordered with older writes to different locations'
>
> Our r[y] can slip before the w[x]=1.
>
> Thus x86's smp_store_release() would need to be:
>
> +#define smp_store_release(p, v) \
> +do { \
> + compiletime_assert_atomic_type(*p); \
> + smp_mb(); \
> + ACCESS_ONCE(*p) = (v); \
> +} while (0)
>
> Or: (void)xchg((p), (v));
>
> Idem for s390 and sparc I suppose.
>
> The only reason your example worked is because the unlock and lock were
> for the same lock.
Exactly!!!
And if the two locks are different, then the guarantee applies only
when the unlock and lock are on the same CPU, in which case, as Linus
noted, the xchg() on entry to the slow path does the job for us.
> This of course leaves us without joy for circular buffers, which can do
> without this LOCK'ed op and without sync on PPC. Now I'm not at all sure
> we've got enough of those to justify primitives just for them.
I am beginning to think that we do, but that is a separate discussion.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:02 ` Paul E. McKenney
@ 2013-11-25 18:24 ` Peter Zijlstra
2013-11-25 18:34 ` Tim Chen
2013-11-25 18:27 ` Peter Zijlstra
2013-11-25 23:55 ` H. Peter Anvin
2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-25 18:24 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, Nov 25, 2013 at 10:02:50AM -0800, Paul E. McKenney wrote:
> I still do not believe that it does. Again, strangely enough.
>
> We need to ask someone in Intel that understands this all the way down
> to the silicon. The guy I used to rely on for this no longer works
> at Intel.
>
> Do you know someone who fits this description, or should I start sending
> cold-call emails to various Intel contacts?
There's a whole bunch of Intel folks on the Cc. list; could one of you
find a suitable HW engineer and put him onto this thread?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:24 ` Peter Zijlstra
@ 2013-11-25 18:34 ` Tim Chen
0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2013-11-25 18:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, 2013-11-25 at 19:24 +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 10:02:50AM -0800, Paul E. McKenney wrote:
> > I still do not believe that it does. Again, strangely enough.
> >
> > We need to ask someone in Intel that understands this all the way down
> > to the silicon. The guy I used to rely on for this no longer works
> > at Intel.
> >
> > Do you know someone who fits this description, or should I start sending
> > cold-call emails to various Intel contacts?
>
> There's a whole bunch of Intel folks on the Cc. list; could one of you
> find a suitable HW engineer and put him onto this thread?
I'll try to do some asking around.
Tim
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:02 ` Paul E. McKenney
2013-11-25 18:24 ` Peter Zijlstra
@ 2013-11-25 18:27 ` Peter Zijlstra
2013-11-25 23:52 ` Paul E. McKenney
2013-11-25 23:55 ` H. Peter Anvin
2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-25 18:27 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, Nov 25, 2013 at 10:02:50AM -0800, Paul E. McKenney wrote:
> And if the two locks are different, then the guarantee applies only
> when the unlock and lock are on the same CPU, in which case, as Linus
> noted, the xchg() on entry to the slow path does the job for us.
But in that case we rely on the fact that the thing is part of a
composite, and we should no longer call it load_acquire, because frankly
it doesn't have acquire semantics anymore: the read can escape out.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:27 ` Peter Zijlstra
@ 2013-11-25 23:52 ` Paul E. McKenney
2013-11-26 9:59 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 23:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, Nov 25, 2013 at 07:27:15PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 10:02:50AM -0800, Paul E. McKenney wrote:
> > And if the two locks are different, then the guarantee applies only
> > when the unlock and lock are on the same CPU, in which case, as Linus
> > noted, the xchg() on entry to the slow path does the job for us.
>
> But in that case we rely on the fact that the thing is part of a
> composite and we should no longer call it load_acquire, because frankly
> it doesn't have acquire semantics anymore because the read can escape
> out.
Actually, load-acquire and store-release are only required to provide
ordering in the threads/CPUs doing the load-acquire/store-release
operations. It is just that we require something stronger than minimal
load-acquire/store-release to make a Linux-kernel lock.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 23:52 ` Paul E. McKenney
@ 2013-11-26 9:59 ` Peter Zijlstra
2013-11-26 17:11 ` Paul E. McKenney
2013-11-26 19:00 ` Linus Torvalds
0 siblings, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-26 9:59 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Mon, Nov 25, 2013 at 03:52:52PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 25, 2013 at 07:27:15PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 25, 2013 at 10:02:50AM -0800, Paul E. McKenney wrote:
> > > And if the two locks are different, then the guarantee applies only
> > > when the unlock and lock are on the same CPU, in which case, as Linus
> > > noted, the xchg() on entry to the slow path does the job for us.
> >
> > But in that case we rely on the fact that the thing is part of a
> > composite and we should no longer call it load_acquire, because frankly
> > it doesn't have acquire semantics anymore because the read can escape
> > out.
>
> Actually, load-acquire and store-release are only required to provide
> ordering in the threads/CPUs doing the load-acquire/store-release
> operations. It is just that we require something stronger than minimal
> load-acquire/store-release to make a Linux-kernel lock.
I suspect we're talking past one another here; but our Document
describes ACQUIRE/RELEASE semantics such that
RELEASE
ACQUIRE
matches a full barrier, regardless of whether it is the same lock or
not.
If you now want to weaken this definition, then that needs consideration
because we actually rely on things like
spin_unlock(l1);
spin_lock(l2);
being full barriers.
Now granted, for lock operations we have actual atomic ops in between
which would cure x86, but it would leave us confused with the barrier
semantics.
So please; either:
A) we have the strong ACQUIRE/RELEASE semantics as currently described;
and therefore any RELEASE+ACQUIRE pair must form a full barrier; and
our proposed primitives are non-compliant and need strengthening.
B) we go fudge about with the definitions.
But given the current description of our ACQUIRE barrier, we simply
cannot claim the proposed primitives are good on x86 IMO.
Also, instead of the smp_store_release() I would argue that
smp_load_acquire() is the one that needs the full barrier, even on PPC.
Because our ACQUIRE dis-allows loads/stores leaking out upwards, and
both TSO and PPC lwsync allow just that, so the smp_load_acquire() is
the one that needs the full barrier.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 9:59 ` Peter Zijlstra
@ 2013-11-26 17:11 ` Paul E. McKenney
2013-11-26 17:18 ` Peter Zijlstra
2013-11-26 19:00 ` Linus Torvalds
1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-26 17:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Tue, Nov 26, 2013 at 10:59:45AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 03:52:52PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 25, 2013 at 07:27:15PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 25, 2013 at 10:02:50AM -0800, Paul E. McKenney wrote:
> > > > And if the two locks are different, then the guarantee applies only
> > > > when the unlock and lock are on the same CPU, in which case, as Linus
> > > > noted, the xchg() on entry to the slow path does the job for us.
> > >
> > > But in that case we rely on the fact that the thing is part of a
> > > composite and we should no longer call it load_acquire, because frankly
> > > it doesn't have acquire semantics anymore because the read can escape
> > > out.
> >
> > Actually, load-acquire and store-release are only required to provide
> > ordering in the threads/CPUs doing the load-acquire/store-release
> > operations. It is just that we require something stronger than minimal
> > load-acquire/store-release to make a Linux-kernel lock.
>
> I suspect we're talking past one another here; but our Document
> describes ACQUIRE/RELEASE semantics such that
>
> RELEASE
> ACQUIRE
>
> matches a full barrier, regardless of whether it is the same lock or
> not.
Ah, got it!
> If you now want to weaken this definition, then that needs consideration
> because we actually rely on things like
>
> spin_unlock(l1);
> spin_lock(l2);
>
> being full barriers.
>
> Now granted, for lock operations we have actual atomic ops in between
> which would cure x86, but it would leave us confused with the barrier
> semantics.
>
> So please; either:
>
> A) we have the strong ACQUIRE/RELEASE semantics as currently described;
> and therefore any RELEASE+ACQUIRE pair must form a full barrier; and
> our proposed primitives are non-compliant and need strengthening.
>
> B) we go fudge about with the definitions.
Another approach would be to have local and global variants, so that
the local variants have acquire/release semantics that are guaranteed
to be visible only in the involved threads (sufficient for circular
buffers) while the global ones are visible globally, thus sufficient
for queued locks.
> But given the current description of our ACQUIRE barrier, we simply
> cannot claim the proposed primitives are good on x86 IMO.
>
> Also, instead of the smp_store_release() I would argue that
> smp_load_acquire() is the one that needs the full barrier, even on PPC.
>
> Because our ACQUIRE dis-allows loads/stores leaking out upwards, and
> both TSO and PPC lwsync allow just that, so the smp_load_acquire() is
> the one that needs the full barrier.
You lost me on this one. Here is x86 ACQUIRE for X:
r1 = ACCESS_ONCE(X);
<loads and stores>
Since x86 does not reorder loads with later loads or stores, this should
be sufficient.
For powerpc:
r1 = ACCESS_ONCE(X);
lwsync;
<loads and stores>
And lwsync does not allow prior loads to be reordered with later loads or
stores, so this should also be sufficient.
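Packaged up as a primitive, that powerpc sequence might look like this sketch
(not the actual arch code; the ___v temporary is illustrative):

#define smp_load_acquire(p)                                     \
({                                                              \
        typeof(*(p)) ___v = ACCESS_ONCE(*(p));                  \
        __asm__ __volatile__ ("lwsync" : : : "memory");         \
        ___v;                                                   \
})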
In both cases, a RELEASE+ACQUIRE provides a full barrier as long as
RELEASE has the right stuff in it.
So what am I missing?
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 17:11 ` Paul E. McKenney
@ 2013-11-26 17:18 ` Peter Zijlstra
0 siblings, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-26 17:18 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Tue, Nov 26, 2013 at 09:11:06AM -0800, Paul E. McKenney wrote:
> So what am I missing?
I got loads and stores mixed up again...
it's loads that can be reordered against earlier stores, not the other
way around.
/me dons brown paper hat
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 9:59 ` Peter Zijlstra
2013-11-26 17:11 ` Paul E. McKenney
@ 2013-11-26 19:00 ` Linus Torvalds
2013-11-26 19:20 ` Paul E. McKenney
` (2 more replies)
1 sibling, 3 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-26 19:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Will Deacon, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Tue, Nov 26, 2013 at 1:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> If you now want to weaken this definition, then that needs consideration
> because we actually rely on things like
>
> spin_unlock(l1);
> spin_lock(l2);
>
> being full barriers.
Btw, maybe we should just stop that assumption. The complexity of this
discussion makes me go "maybe we should stop with subtle assumptions
that happen to be obviously true on x86 due to historical
implementations, but aren't obviously true even *there* any more with
the MCS lock".
We already have a concept of
smp_mb__before_spinlock();
spin_lock():
for sequences where we *know* we need to make getting a spin-lock be a
full memory barrier. It's free on x86 (and remains so even with the
MCS lock, regardless of any subtle issues, if only because even the
MCS lock starts out with a locked atomic, never mind the contention
slow-case). Of course, that macro is only used inside the scheduler,
and is actually documented to not really be a full memory barrier, but
it handles the case we actually care about.
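For reference, the generic fallback definition of smp_mb__before_spinlock()
at this point looks roughly like the following (architectures may override
it; treat this as a sketch rather than the authoritative source):

	/*
	 * Not a full barrier: it only guarantees that a STORE issued before
	 * the lock cannot be reordered with a LOAD inside the critical
	 * section, which is the case the scheduler cares about.
	 */
	#ifndef smp_mb__before_spinlock
	#define smp_mb__before_spinlock()	smp_wmb()
	#endif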
IOW, where do we really care about the "unlock+lock" is a memory
barrier? And could we make those places explicit, and then do
something similar to the above to them?
Linus
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 19:00 ` Linus Torvalds
@ 2013-11-26 19:20 ` Paul E. McKenney
2013-11-26 19:32 ` Linus Torvalds
2013-11-26 19:21 ` Peter Zijlstra
2013-11-26 23:08 ` Benjamin Herrenschmidt
2 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-26 19:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 11:00:50AM -0800, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 1:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > If you now want to weaken this definition, then that needs consideration
> > because we actually rely on things like
> >
> > spin_unlock(l1);
> > spin_lock(l2);
> >
> > being full barriers.
>
> Btw, maybe we should just stop that assumption. The complexity of this
> discussion makes me go "maybe we should stop with subtle assumptions
> that happen to be obviously true on x86 due to historical
> implementations, but aren't obviously true even *there* any more with
> the MCS lock".
From an RCU viewpoint, I am OK with that approach. From the viewpoint
of documenting our assumptions, I really really like that approach.
> We already have a concept of
>
> smp_mb__before_spinlock();
> spin_lock():
>
> for sequences where we *know* we need to make getting a spin-lock be a
> full memory barrier. It's free on x86 (and remains so even with the
> MCS lock, regardless of any subtle issues, if only because even the
> MCS lock starts out with a locked atomic, never mind the contention
> slow-case). Of course, that macro is only used inside the scheduler,
> and is actually documented to not really be a full memory barrier, but
> it handles the case we actually care about.
This would work well if we made it be smp_mb__after_spinlock(), used
as follows:
spin_lock();
smp_mb__after_spinlock();
The reason that it must go after rather than before is to handle the
MCS-style low-overhead handoffs. During the handoff, you can count
on code at the beginning of the critical section being executed, but
things before the lock cannot possibly help you.
We should also have something for lock releases, for example:
smp_mb__before_spinunlock();
spin_unlock();
This allows architectures to choose where to put the overhead, and
also very clearly documents which unlock+lock pairs need the full
barriers.
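A sketch of how the proposed pair would be used, following Peter's earlier
single-CPU example (hypothetical -- neither primitive exists yet at this
point in the thread):

	ACCESS_ONCE(A) = 1;
	smp_mb__before_spinunlock();
	spin_unlock(&l1);
	spin_lock(&l2);
	smp_mb__after_spinlock();
	r1 = ACCESS_ONCE(B);

	/*
	 * Intent: the annotated unlock+lock sequence acts as a full memory
	 * barrier between the store to A and the load from B; whether that
	 * ordering is also visible to a third CPU is exactly what the rest
	 * of this thread goes on to debate.
	 */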
Heh. Must smp_mb__after_spinlock() and smp_mb__before_spinunlock()
provide a full barrier when used separately, or only when used together?
The unlock+lock guarantee requires that they provide a full barrier
only when used together. However, I believe that the scheduler's use
of smp_mb__before_spinlock() needs a full barrier without pairing.
I have no idea whether or not we should have a separate API for each
flavor of lock.
> IOW, where do we really care about the "unlock+lock" is a memory
> barrier? And could we make those places explicit, and then do
> something similar to the above to them?
There are several places in RCU that assume unlock+lock is a full
memory barrier, but I would be more than happy to fix them up given
an smp_mb__after_spinlock() and an smp_mb__before_spinunlock(), or
something similar.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 19:20 ` Paul E. McKenney
@ 2013-11-26 19:32 ` Linus Torvalds
2013-11-26 22:51 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-26 19:32 UTC (permalink / raw)
To: Paul McKenney
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 11:20 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> There are several places in RCU that assume unlock+lock is a full
> memory barrier, but I would be more than happy to fix them up given
> an smp_mb__after_spinlock() and an smp_mb__before_spinunlock(), or
> something similar.
A "before_spinunlock" would actually be expensive on x86.
So I'd *much* rather see the "after_spinlock()" version, if that is
sufficient for all users. And it should be, since that's the
traditional x86 behavior that we had before the MCS lock discussion.
Because it's worth noting that a spin_lock() is still a full memory
barrier on x86, even with the MCS code, *assuming it is done in the
context of the thread needing the memory barrier*. And I suspect that
is much more generally true than just x86. It's the final MCS hand-off
of a lock that is pretty weak with just a local read. The full lock
sequence is always going to be much stronger, if only because it will
contain a write somewhere shared as well.
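For reference, the asymmetry Linus describes maps onto the lock path in this
series roughly as follows (a simplified sketch, not the exact patch text):

	void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
	{
		struct mcs_spinlock *prev;

		node->locked = 0;
		node->next = NULL;

		prev = xchg(lock, node);	/* locked instruction: full
						 * barrier on x86          */
		if (likely(prev == NULL))
			return;			/* uncontended fast path   */

		ACCESS_ONCE(prev->next) = node;
		/* Contended hand-off: only a local acquire read of our node. */
		while (!smp_load_acquire(&node->locked))
			arch_mutex_cpu_relax();
	}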
Linus
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 19:32 ` Linus Torvalds
@ 2013-11-26 22:51 ` Paul E. McKenney
2013-11-26 23:58 ` Linus Torvalds
2013-11-27 10:16 ` Will Deacon
0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-26 22:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 11:32:25AM -0800, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 11:20 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > There are several places in RCU that assume unlock+lock is a full
> > memory barrier, but I would be more than happy to fix them up given
> > an smp_mb__after_spinlock() and an smp_mb__before_spinunlock(), or
> > something similar.
>
> A "before_spinunlock" would actually be expensive on x86.
Good point, on x86 the typical non-queued spin-lock acquisition path
has an atomic operation with full memory barrier in any case. I believe
that this is the case for the other TSO architectures. For the non-TSO
architectures:
o ARM has an smp_mb() during lock acquisition, so after_spinlock()
can be a no-op for them.
o Itanium will require more thought, but it looks like it doesn't
care whether it gets after_spinlock() or before_spinunlock(). I have
to defer to the maintainers.
o PowerPC is OK either way.
> So I'd *much* rather see the "after_spinlock()" version, if that is
> sufficient for all users. And it should be, since that's the
> traditional x86 behavior that we had before the MCS lock discussion.
>
> Because it's worth noting that a spin_lock() is still a full memory
> barrier on x86, even with the MCS code, *assuming it is done in the
> context of the thread needing the memory barrier*. And I suspect that
> is much more generally true than just x86. It's the final MCS hand-off
> of a lock that is pretty weak with just a local read. The full lock
> sequence is always going to be much stronger, if only because it will
> contain a write somewhere shared as well.
Good points, and after_spinlock() works for me from an RCU perspective.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 22:51 ` Paul E. McKenney
@ 2013-11-26 23:58 ` Linus Torvalds
2013-11-27 0:21 ` Thomas Gleixner
2013-11-27 0:39 ` Paul E. McKenney
2013-11-27 10:16 ` Will Deacon
1 sibling, 2 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-26 23:58 UTC (permalink / raw)
To: Paul McKenney
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 2:51 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Good points, and after_spinlock() works for me from an RCU perspective.
Note that there's still a semantic question about exactly what that
"after_spinlock()" is: would it be a memory barrier *only* for the CPU
that actually does the spinlock? Or is it that "third CPU" order?
IOW, it would still not necessarily make your "unlock+lock" (on
different CPU's) be an actual barrier as far as a third CPU was
concerned, because you could still have the "unlock happened after
contention was going on, so the final unlock only released the MCS
waiter, and there was no barrier".
See what I'm saying? We could guarantee that if somebody does
write A;
spin_lock()
mb__after_spinlock();
read B
then the "write A" -> "read B" would be ordered. That's one thing.
But your
- CPU 1:
write A
spin_unlock()
- CPU 2
spin_lock()
mb__after_spinlock();
read B
ordering as far as a *third* CPU is concerned is a whole different
thing again, and wouldn't be at all the same thing.
Is it really that cross-CPU ordering you care about?
Linus
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 23:58 ` Linus Torvalds
@ 2013-11-27 0:21 ` Thomas Gleixner
2013-11-27 0:39 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Thomas Gleixner @ 2013-11-27 0:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar,
Andrew Morton, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
Linus,
On Tue, 26 Nov 2013, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 2:51 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Good points, and after_spinlock() works for me from an RCU perspective.
>
> Note that there's still a semantic question about exactly what that
> "after_spinlock()" is: would it be a memory barrier *only* for the CPU
> that actually does the spinlock? Or is it that "third CPU" order?
>
> IOW, it would still not necessarily make your "unlock+lock" (on
> different CPU's) be an actual barrier as far as a third CPU was
> concerned, because you could still have the "unlock happened after
> contention was going on, so the final unlock only released the MCS
> waiter, and there was no barrier".
>
> See what I'm saying? We could guarantee that if somebody does
>
> write A;
> spin_lock()
> mb__after_spinlock();
> read B
>
> then the "write A" -> "read B" would be ordered. That's one thing.
>
> But your
>
> - CPU 1:
>
> write A
> spin_unlock()
>
> - CPU 2
>
> spin_lock()
> mb__after_spinlock();
> read B
>
> ordering as far as a *third* CPU is concerned is a whole different
> thing again, and wouldn't be at all the same thing.
>
> Is it really that cross-CPU ordering you care about?
Depends on the use case. In the futex case we discussed in parallel we
very much care that
w[A] | w[B]
mb | mb
r[B] | r[A]
provides the correct ordering. Until today the spinlock semantics
provided that.
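Spelling that pattern out (an illustrative sketch with made-up lock and
variable names, not the actual futex code):

	/* CPU 0 */
	ACCESS_ONCE(A) = 1;
	spin_unlock(&l);	/* these two are assumed to act as   */
	spin_lock(&m);		/* the left-hand "mb" in the diagram */
	r0 = ACCESS_ONCE(B);

	/* CPU 1 */
	ACCESS_ONCE(B) = 1;
	smp_mb();
	r1 = ACCESS_ONCE(A);

	/* The assumption being relied on: the outcome r0 == 0 && r1 == 0
	 * is forbidden. */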
I bet that more code than the cursed futexes is relying on that
assumption.
RCU being one example I'm aware of. Though RCU is one of the simple
cases where the maintainer is actually aware of the problem and
indicated that he is willing to adjust it.
Though I doubt that other places which silently rely on that ordering
have the faintest clue why the heck it works at all.
I'm all for the change, but we need to be painfully aware of the
lurking (hard to decode) wreckage ahead.
Thanks,
tglx
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 23:58 ` Linus Torvalds
2013-11-27 0:21 ` Thomas Gleixner
@ 2013-11-27 0:39 ` Paul E. McKenney
2013-11-27 1:05 ` Linus Torvalds
1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-27 0:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 03:58:11PM -0800, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 2:51 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Good points, and after_spinlock() works for me from an RCU perspective.
>
> Note that there's still a semantic question about exactly what that
> "after_spinlock()" is: would it be a memory barrier *only* for the CPU
> that actually does the spinlock? Or is it that "third CPU" order?
>
> IOW, it would still not necessarily make your "unlock+lock" (on
> different CPU's) be an actual barrier as far as a third CPU was
> concerned, because you could still have the "unlock happened after
> contention was going on, so the final unlock only released the MCS
> waiter, and there was no barrier".
>
> See what I'm saying? We could guarantee that if somebody does
>
> write A;
> spin_lock()
> mb__after_spinlock();
> read B
>
> then the "write A" -> "read B" would be ordered. That's one thing.
>
> But your
>
> - CPU 1:
>
> write A
> spin_unlock()
>
> - CPU 2
>
> spin_lock()
> mb__after_spinlock();
> read B
>
> ordering as far as a *third* CPU is concerned is a whole different
> thing again, and wouldn't be at all the same thing.
>
> Is it really that cross-CPU ordering you care about?
Cross-CPU ordering. I have to guarantee the grace period across all
CPUs, and I currently rely on a series of lock acquisitions to provide
that ordering. On the other hand, I only rely on unlock+lock pairs,
so that I don't need any particular lock or unlock operation to be
a full barrier in and of itself.
If that turns out to be problematic, I could of course insert smp_mb()s
everywhere, but they would be redundant on most architectures.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 0:39 ` Paul E. McKenney
@ 2013-11-27 1:05 ` Linus Torvalds
2013-11-27 1:31 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2013-11-27 1:05 UTC (permalink / raw)
To: Paul McKenney
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 4:39 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Cross-CPU ordering.
Ok, in that case I *suspect* we want an actual "spin_lock_mb()"
primitive, because if we go with the MCS lock approach, it's quite
possible that we find cases where the fast-case is already a barrier
(like it is on x86 by virtue of the locked instruction) but the MCS
case then is not. And then a separate barrier wouldn't be able to make
that kind of judgement.
Or maybe we don't care enough. It *sounds* like on x86, we do probably
already get the cross-cpu case for free, and on other architectures we
may always need the memory barrier, so maybe the whole
"mb_after_spin_lock()" thing is fine.
Ugh.
Linus
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 1:05 ` Linus Torvalds
@ 2013-11-27 1:31 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-27 1:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 05:05:14PM -0800, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 4:39 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Cross-CPU ordering.
>
> Ok, in that case I *suspect* we want an actual "spin_lock_mb()"
> primitive, because if we go with the MCS lock approach, it's quite
> possible that we find cases where the fast-case is already a barrier
> (like it is on x86 by virtue of the locked instruction) but the MCS
> case then is not. And then a separate barrier wouldn't be able to make
> that kind of judgement.
>
> Or maybe we don't care enough. It *sounds* like on x86, we do probably
> already get the cross-cpu case for free, and on other architectures we
> may always need the memory barrier, so maybe the whole
> "mb_after_spin_lock()" thing is fine.
>
> Ugh.
Indeed! I don't know any way to deal with it other than enumerating
the architectures and checking each. My first cut at that was earlier
in this thread.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 22:51 ` Paul E. McKenney
2013-11-26 23:58 ` Linus Torvalds
@ 2013-11-27 10:16 ` Will Deacon
2013-11-27 17:11 ` Paul E. McKenney
1 sibling, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-27 10:16 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Tue, Nov 26, 2013 at 10:51:36PM +0000, Paul E. McKenney wrote:
> On Tue, Nov 26, 2013 at 11:32:25AM -0800, Linus Torvalds wrote:
> > On Tue, Nov 26, 2013 at 11:20 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > There are several places in RCU that assume unlock+lock is a full
> > > memory barrier, but I would be more than happy to fix them up given
> > > an smp_mb__after_spinlock() and an smp_mb__before_spinunlock(), or
> > > something similar.
> >
> > A "before_spinunlock" would actually be expensive on x86.
>
> Good point, on x86 the typical non-queued spin-lock acquisition path
> has an atomic operation with full memory barrier in any case. I believe
> that this is the case for the other TSO architectures. For the non-TSO
> architectures:
>
> o ARM has an smp_mb() during lock acquisition, so after_spinlock()
> can be a no-op for them.
Ok, but what about arm64? We use acquire for lock() and release for
unlock(), so in Linus' example:
write A;
spin_lock()
mb__after_spinlock();
read B
Then A could very well be reordered after B if mb__after_spinlock() is a nop.
Making that a full barrier kind of defeats the point of using acquire in the
first place...
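To make the reordering concrete (an illustrative sketch, not actual arm64
code -- an acquire only prevents later accesses from moving up past it, it
does nothing to hold the earlier store down):

	ACCESS_ONCE(A) = 1;	/* plain store, no release semantics        */
	spin_lock(&lock);	/* LDAXR: acquire only -- the store to A    */
				/* is free to drift below this point        */
	r1 = ACCESS_ONCE(B);	/* so this load can be satisfied before the */
				/* store to A becomes visible to others     */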
It's one thing ordering unlock -> lock, but another getting those two to
behave as full barriers for any arbitrary memory accesses.
Will
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 10:16 ` Will Deacon
@ 2013-11-27 17:11 ` Paul E. McKenney
2013-11-28 11:40 ` Will Deacon
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-27 17:11 UTC (permalink / raw)
To: Will Deacon
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Wed, Nov 27, 2013 at 10:16:13AM +0000, Will Deacon wrote:
> On Tue, Nov 26, 2013 at 10:51:36PM +0000, Paul E. McKenney wrote:
> > On Tue, Nov 26, 2013 at 11:32:25AM -0800, Linus Torvalds wrote:
> > > On Tue, Nov 26, 2013 at 11:20 AM, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > There are several places in RCU that assume unlock+lock is a full
> > > > memory barrier, but I would be more than happy to fix them up given
> > > > an smp_mb__after_spinlock() and an smp_mb__before_spinunlock(), or
> > > > something similar.
> > >
> > > A "before_spinunlock" would actually be expensive on x86.
> >
> > Good point, on x86 the typical non-queued spin-lock acquisition path
> > has an atomic operation with full memory barrier in any case. I believe
> > that this is the case for the other TSO architectures. For the non-TSO
> > architectures:
> >
> > o ARM has an smp_mb() during lock acquisition, so after_spinlock()
> > can be a no-op for them.
>
> Ok, but what about arm64? We use acquire for lock() and release for
> unlock(), so in Linus' example:
Right, I did forget the arm vs. arm64 split!
> write A;
> spin_lock()
> mb__after_spinlock();
> read B
>
> Then A could very well be reordered after B if mb__after_spinlock() is a nop.
> Making that a full barrier kind of defeats the point of using acquire in the
> first place...
The trick is that you don't have mb__after_spinlock() unless you need the
ordering, which we expect in a small minority of the lock acquisitions.
So you would normally get the benefit of acquire/release efficiency.
> It's one thing ordering unlock -> lock, but another getting those two to
> behave as full barriers for any arbitrary memory accesses.
And in fact the unlock+lock barrier is all that RCU needs. I guess the
question is whether it is worth having two flavors of __after_spinlock(),
one that is a full barrier with just the lock, and another that is
only guaranteed to be a full barrier with unlock+lock.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 17:11 ` Paul E. McKenney
@ 2013-11-28 11:40 ` Will Deacon
2013-11-28 17:38 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-28 11:40 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Wed, Nov 27, 2013 at 05:11:43PM +0000, Paul E. McKenney wrote:
> On Wed, Nov 27, 2013 at 10:16:13AM +0000, Will Deacon wrote:
> > On Tue, Nov 26, 2013 at 10:51:36PM +0000, Paul E. McKenney wrote:
> > > On Tue, Nov 26, 2013 at 11:32:25AM -0800, Linus Torvalds wrote:
> > > > On Tue, Nov 26, 2013 at 11:20 AM, Paul E. McKenney wrote:
> > > o ARM has an smp_mb() during lock acquisition, so after_spinlock()
> > > can be a no-op for them.
> >
> > Ok, but what about arm64? We use acquire for lock() and release for
> > unlock(), so in Linus' example:
>
> Right, I did forget the arm vs. arm64 split!
>
> > write A;
> > spin_lock()
> > mb__after_spinlock();
> > read B
> >
> > Then A could very well be reordered after B if mb__after_spinlock() is a nop.
> > Making that a full barrier kind of defeats the point of using acquire in the
> > first place...
>
> The trick is that you don't have mb__after_spinlock() unless you need the
> ordering, which we expect in a small minority of the lock acquisitions.
> So you would normally get the benefit of acquire/release efficiency.
Ok, understood. I take it this means that you don't care about ordering the
write to A with the actual locking operation? (that would require the mb to
be *inside* the spin_lock() implementation).
> > It's one thing ordering unlock -> lock, but another getting those two to
> > behave as full barriers for any arbitrary memory accesses.
>
> And in fact the unlock+lock barrier is all that RCU needs. I guess the
> question is whether it is worth having two flavors of __after_spinlock(),
> one that is a full barrier with just the lock, and another that is
> only guaranteed to be a full barrier with unlock+lock.
I think it's worth distinguishing those cases because, in my mind, one is
potentially a lot heavier than the other. The risk is that we end up
producing a set of strangely named barrier abstractions that nobody can
figure out how to use properly:
/*
* Prevent re-ordering of arbitrary accesses across spin_lock and
* spin_unlock.
*/
mb__after_spin_lock()
mb__after_spin_unlock()
/*
* Order spin_lock() vs spin_unlock()
*/
mb__between_spin_unlock_lock() /* Horrible name! */
We could potentially replace the first set of barriers with spin_lock_mb()
and spin_unlock_mb() variants (which would be more efficient than half
barrier + full barrier), and then we only end up with the one strangely named barrier
which applies to the non _mb() spinlock routines.
Will
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-28 11:40 ` Will Deacon
@ 2013-11-28 17:38 ` Paul E. McKenney
2013-11-28 18:03 ` Will Deacon
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-28 17:38 UTC (permalink / raw)
To: Will Deacon
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 28, 2013 at 11:40:59AM +0000, Will Deacon wrote:
> On Wed, Nov 27, 2013 at 05:11:43PM +0000, Paul E. McKenney wrote:
> > On Wed, Nov 27, 2013 at 10:16:13AM +0000, Will Deacon wrote:
> > > On Tue, Nov 26, 2013 at 10:51:36PM +0000, Paul E. McKenney wrote:
> > > > On Tue, Nov 26, 2013 at 11:32:25AM -0800, Linus Torvalds wrote:
> > > > > On Tue, Nov 26, 2013 at 11:20 AM, Paul E. McKenney wrote:
> > > > o ARM has an smp_mb() during lock acquisition, so after_spinlock()
> > > > can be a no-op for them.
> > >
> > > Ok, but what about arm64? We use acquire for lock() and release for
> > > unlock(), so in Linus' example:
> >
> > Right, I did forget the arm vs. arm64 split!
> >
> > > write A;
> > > spin_lock()
> > > mb__after_spinlock();
> > > read B
> > >
> > > Then A could very well be reordered after B if mb__after_spinlock() is a nop.
> > > Making that a full barrier kind of defeats the point of using acquire in the
> > > first place...
> >
> > The trick is that you don't have mb__after_spinlock() unless you need the
> > ordering, which we expect in a small minority of the lock acquisitions.
> > So you would normally get the benefit of acquire/release efficiency.
>
> Ok, understood. I take it this means that you don't care about ordering the
> write to A with the actual locking operation? (that would require the mb to
> be *inside* the spin_lock() implementation).
Or it would require an mb__before_spinlock(). More on this below...
> > > It's one thing ordering unlock -> lock, but another getting those two to
> > > behave as full barriers for any arbitrary memory accesses.
> >
> > And in fact the unlock+lock barrier is all that RCU needs. I guess the
> > question is whether it is worth having two flavors of __after_spinlock(),
> > one that is a full barrier with just the lock, and another that is
> > only guaranteed to be a full barrier with unlock+lock.
>
> I think it's worth distinguishing those cases because, in my mind, one is
> potentially a lot heavier than the other. The risk is that we end up
> producing a set of strangely named barrier abstractions that nobody can
> figure out how to use properly:
>
>
> /*
> * Prevent re-ordering of arbitrary accesses across spin_lock and
> * spin_unlock.
> */
> mb__after_spin_lock()
> mb__after_spin_unlock()
>
> /*
> * Order spin_lock() vs spin_unlock()
> */
> mb__between_spin_unlock_lock() /* Horrible name! */
>
>
> We could potentially replace the first set of barriers with spin_lock_mb()
> and spin_unlock_mb() variants (which would be more efficient than half
> barrier + full barrier), then we only end up with strangely named barrier
> which applies to the non _mb() spinlock routines.
How about the current mb__before_spinlock() making the acquisition be
a full barrier, and an mb_after_spinlock() making a prior release plus
this acquisition be a full barrier?
Yes, we might need better names, but I believe that this approach does
what you need.
Thoughts?
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-28 17:38 ` Paul E. McKenney
@ 2013-11-28 18:03 ` Will Deacon
2013-11-28 18:27 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-28 18:03 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 28, 2013 at 05:38:53PM +0000, Paul E. McKenney wrote:
> On Thu, Nov 28, 2013 at 11:40:59AM +0000, Will Deacon wrote:
> > On Wed, Nov 27, 2013 at 05:11:43PM +0000, Paul E. McKenney wrote:
> > > And in fact the unlock+lock barrier is all that RCU needs. I guess the
> > > question is whether it is worth having two flavors of __after_spinlock(),
> > > one that is a full barrier with just the lock, and another that is
> > > only guaranteed to be a full barrier with unlock+lock.
> >
> > I think it's worth distinguishing those cases because, in my mind, one is
> > potentially a lot heavier than the other. The risk is that we end up
> > producing a set of strangely named barrier abstractions that nobody can
> > figure out how to use properly:
> >
> >
> > /*
> > * Prevent re-ordering of arbitrary accesses across spin_lock and
> > * spin_unlock.
> > */
> > mb__after_spin_lock()
> > mb__after_spin_unlock()
> >
> > /*
> > * Order spin_lock() vs spin_unlock()
> > */
> > mb__between_spin_unlock_lock() /* Horrible name! */
> >
> >
> > We could potentially replace the first set of barriers with spin_lock_mb()
> > and spin_unlock_mb() variants (which would be more efficient than half
> > barrier + full barrier), then we only end up with strangely named barrier
> > which applies to the non _mb() spinlock routines.
>
> How about the current mb__before_spinlock() making the acquisition be
> a full barrier, and an mb_after_spinlock() making a prior release plus
> this acquisition be a full barrier?
Hmm, without horrible hacks to keep track of whether we've done an
mb__before_spinlock() without a matching spinlock(), that's going to end up
with full-barrier + pointless half-barrier (similarly on the release path).
> Yes, we might need better names, but I believe that this approach does
> what you need.
>
> Thoughts?
I still think we need to draw the distinction between ordering all accesses
against a lock and ordering an unlock against a lock. The latter is free for
arm64 (STLR => LDAR is ordered) but the former requires a DMB.
Not sure I completely got your drift...
Will
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-28 18:03 ` Will Deacon
@ 2013-11-28 18:27 ` Paul E. McKenney
2013-11-28 18:53 ` Will Deacon
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-28 18:27 UTC (permalink / raw)
To: Will Deacon
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 28, 2013 at 06:03:18PM +0000, Will Deacon wrote:
> On Thu, Nov 28, 2013 at 05:38:53PM +0000, Paul E. McKenney wrote:
> > On Thu, Nov 28, 2013 at 11:40:59AM +0000, Will Deacon wrote:
> > > On Wed, Nov 27, 2013 at 05:11:43PM +0000, Paul E. McKenney wrote:
> > > > And in fact the unlock+lock barrier is all that RCU needs. I guess the
> > > > question is whether it is worth having two flavors of __after_spinlock(),
> > > > one that is a full barrier with just the lock, and another that is
> > > > only guaranteed to be a full barrier with unlock+lock.
> > >
> > > I think it's worth distinguishing those cases because, in my mind, one is
> > > potentially a lot heavier than the other. The risk is that we end up
> > > producing a set of strangely named barrier abstractions that nobody can
> > > figure out how to use properly:
> > >
> > >
> > > /*
> > > * Prevent re-ordering of arbitrary accesses across spin_lock and
> > > * spin_unlock.
> > > */
> > > mb__after_spin_lock()
> > > mb__after_spin_unlock()
> > >
> > > /*
> > > * Order spin_lock() vs spin_unlock()
> > > */
> > > mb__between_spin_unlock_lock() /* Horrible name! */
> > >
> > >
> > > We could potentially replace the first set of barriers with spin_lock_mb()
> > > and spin_unlock_mb() variants (which would be more efficient than half
> > > barrier + full barrier), then we only end up with strangely named barrier
> > > which applies to the non _mb() spinlock routines.
> >
> > How about the current mb__before_spinlock() making the acquisition be
> > a full barrier, and an mb_after_spinlock() making a prior release plus
> > this acquisition be a full barrier?
>
> Hmm, without horrible hacks to keep track of whether we've done an
> mb__before_spinlock() without a matching spinlock(), that's going to end up
> with full-barrier + pointless half-barrier (similarly on the release path).
We should be able to detect mb__before_spinlock() without a matching
spinlock via static analysis, right?
Or am I missing your point?
> > Yes, we might need better names, but I believe that this approach does
> > what you need.
> >
> > Thoughts?
>
> I still think we need to draw the distinction between ordering all accesses
> against a lock and ordering an unlock against a lock. The latter is free for
> arm64 (STLR => LDAR is ordered) but the former requires a DMB.
>
> Not sure I completely got your drift...
Here is what I am suggesting:
o mb__before_spinlock():
o Must appear immediately before a lock acquisition.
o Upgrades a lock acquisition to a full barrier.
o Emits DMB on ARM64.
o mb_after_spinlock():
o Must appear immediately after a lock acquisition.
o Upgrades an unlock+lock pair to a full barrier.
o Emits a no-op on ARM64, as in "do { } while (0)".
o Might need a separate flavor for queued locks on
some platforms, but no sign of that yet.
Does that make sense?
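Under that suggestion, the arm64 definitions would presumably end up looking
something like the following (hypothetical -- nothing of the sort exists in
the tree):

	/* Upgrade the following lock acquisition to a full barrier. */
	#define smp_mb__before_spinlock()	smp_mb()	/* DMB ISH */

	/* The prior STLR unlock plus this LDAR lock already order an
	 * unlock+lock pair, so nothing extra is needed here. */
	#define smp_mb__after_spinlock()	do { } while (0)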
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-28 18:27 ` Paul E. McKenney
@ 2013-11-28 18:53 ` Will Deacon
2013-11-28 19:50 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-28 18:53 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 28, 2013 at 06:27:12PM +0000, Paul E. McKenney wrote:
> On Thu, Nov 28, 2013 at 06:03:18PM +0000, Will Deacon wrote:
> > Hmm, without horrible hacks to keep track of whether we've done an
> > mb__before_spinlock() without a matching spinlock(), that's going to end up
> > with full-barrier + pointless half-barrier (similarly on the release path).
>
> We should be able to detect mb__before_spinlock() without a matching
> spinlock via static analysis, right?
>
> Or am I missing your point?
See below...
> > > Yes, we might need better names, but I believe that this approach does
> > > what you need.
> > >
> > > Thoughts?
> >
> > I still think we need to draw the distinction between ordering all accesses
> > against a lock and ordering an unlock against a lock. The latter is free for
> > arm64 (STLR => LDAR is ordered) but the former requires a DMB.
> >
> > Not sure I completely got your drift...
>
> Here is what I am suggesting:
>
> o mb__before_spinlock():
>
> o Must appear immediately before a lock acquisition.
> o Upgrades a lock acquisition to a full barrier.
> o Emits DMB on ARM64.
Ok, so that then means that:
mb__before_spinlock();
spin_lock();
on ARM64 expands to:
dmb ish
ldaxr ...
so there's a redundant half-barrier there. If we want to get rid of that, we
need mb__before_spinlock() to set a flag, then we could conditionalise
ldaxr/ldxr but it's really horrible and you have to deal with interrupts
etc. so in reality we just end up having extra barriers.
Or we have a separate spin_lock_mb() function.
> o mb_after_spinlock():
>
> o Must appear immediately after a lock acquisition.
> o Upgrades an unlock+lock pair to a full barrier.
> o Emits a no-op on ARM64, as in "do { } while (0)".
> o Might need a separate flavor for queued locks on
> some platforms, but no sign of that yet.
Ok, so mb__after_spinlock() doesn't imply a full barrier but
mb__before_spinlock() does? I think people will get that wrong :)
Will
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-28 18:53 ` Will Deacon
@ 2013-11-28 19:50 ` Paul E. McKenney
2013-11-29 16:17 ` Will Deacon
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-28 19:50 UTC (permalink / raw)
To: Will Deacon
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 28, 2013 at 06:53:41PM +0000, Will Deacon wrote:
> On Thu, Nov 28, 2013 at 06:27:12PM +0000, Paul E. McKenney wrote:
> > On Thu, Nov 28, 2013 at 06:03:18PM +0000, Will Deacon wrote:
> > > Hmm, without horrible hacks to keep track of whether we've done an
> > > mb__before_spinlock() without a matching spinlock(), that's going to end up
> > > with full-barrier + pointless half-barrier (similarly on the release path).
> >
> > We should be able to detect mb__before_spinlock() without a matching
> > spinlock via static analysis, right?
> >
> > Or am I missing your point?
>
> See below...
>
> > > > Yes, we might need better names, but I believe that this approach does
> > > > what you need.
> > > >
> > > > Thoughts?
> > >
> > > I still think we need to draw the distinction between ordering all accesses
> > > against a lock and ordering an unlock against a lock. The latter is free for
> > > arm64 (STLR => LDAR is ordered) but the former requires a DMB.
> > >
> > > Not sure I completely got your drift...
> >
> > Here is what I am suggesting:
> >
> > o mb__before_spinlock():
> >
> > o Must appear immediately before a lock acquisition.
> > o Upgrades a lock acquisition to a full barrier.
> > o Emits DMB on ARM64.
>
> Ok, so that then means that:
>
> mb__before_spinlock();
> spin_lock();
>
> on ARM64 expands to:
>
> dmb ish
> ldaxr ...
>
> so there's a redundant half-barrier there. If we want to get rid of that, we
> need mb__before_spinlock() to set a flag, then we could conditionalise
> ldaxr/ldxr but it's really horrible and you have to deal with interrupts
> etc. so in reality we just end up having extra barriers.
Given that there was just a dmb, how much does the ish &c really hurt?
Would the performance difference be measurable at the system level?
> Or we have a separate spin_lock_mb() function.
And mutex_lock_mb(). And spin_lock_irqsave_mb(). And spin_lock_irq_mb().
And...
Admittedly this is not yet a problem given the current very low usage
of smp_mb__before_spinlock(), but the potential for API explosion is
non-trivial.
That said, if the effect on ARM64 is measurable at the system level, I
won't stand in the way of the additional APIs.
> > o mb_after_spinlock():
> >
> > o Must appear immediately after a lock acquisition.
> > o Upgrades an unlock+lock pair to a full barrier.
> > o Emits a no-op on ARM64, as in "do { } while (0)".
> > o Might need a separate flavor for queued locks on
> > some platforms, but no sign of that yet.
>
> Ok, so mb__after_spinlock() doesn't imply a full barrier but
> mb__before_spinlock() does? I think people will get that wrong :)
As I said earlier in the thread, I am open to better names.
How about smp_mb__after_spin_unlock_lock_pair()? That said, I am sure that
I could come up with something longer given enough time. ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-28 19:50 ` Paul E. McKenney
@ 2013-11-29 16:17 ` Will Deacon
2013-11-29 16:44 ` Linus Torvalds
0 siblings, 1 reply; 116+ messages in thread
From: Will Deacon @ 2013-11-29 16:17 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Peter Zijlstra, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 28, 2013 at 07:50:40PM +0000, Paul E. McKenney wrote:
> On Thu, Nov 28, 2013 at 06:53:41PM +0000, Will Deacon wrote:
> > Ok, so that then means that:
> >
> > mb__before_spinlock();
> > spin_lock();
> >
> > on ARM64 expands to:
> >
> > dmb ish
> > ldaxr ...
> >
> > so there's a redundant half-barrier there. If we want to get rid of that, we
> > need mb__before_spinlock() to set a flag, then we could conditionalise
> > ldaxr/ldxr but it's really horrible and you have to deal with interrupts
> > etc. so in reality we just end up having extra barriers.
>
> Given that there was just a dmb, how much does the ish &c really hurt?
> Would the performance difference be measurable at the system level?
There's no definitive answer, as it depends heavily on a combination of the
microarchitecture and specific platform implementation. To get some sort of
idea, I tried adding a dmb to the start of spin_unlock on ARMv7 and I saw a
3% performance hit in hackbench on my dual-cluster board.
Whether or not that's a big deal, I'm not sure, especially given that this
should be rare.
> > Or we have separate a spin_lock_mb() function.
>
> And mutex_lock_mb(). And spin_lock_irqsave_mb(). And spin_lock_irq_mb().
> And...
Ok, point taken.
> Admittedly this is not yet a problem given the current very low usage
> of smp_mb__before_spinlock(), but the potential for API explosion is
> non-trivial.
>
> That said, if the effect on ARM64 is measurable at the system level, I
> won't stand in the way of the additional APIs.
>
> > > o mb_after_spinlock():
> > >
> > > o Must appear immediately after a lock acquisition.
> > > o Upgrades an unlock+lock pair to a full barrier.
> > > o Emits a no-op on ARM64, as in "do { } while (0)".
> > > o Might need a separate flavor for queued locks on
> > > some platforms, but no sign of that yet.
> >
> > Ok, so mb__after_spinlock() doesn't imply a full barrier but
> > mb__before_spinlock() does? I think people will get that wrong :)
>
> As I said earlier in the thread, I am open to better names.
>
> How about smp_mb__after_spin_unlock_lock_pair()? That said, I am sure that
> I could come up with something longer given enough time. ;-)
Ha! Well, I think the principles are sound, but the naming is key to making
sure that this interface is used correctly.
Will
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-29 16:17 ` Will Deacon
@ 2013-11-29 16:44 ` Linus Torvalds
2013-11-29 18:18 ` Will Deacon
2013-11-30 17:38 ` Paul E. McKenney
0 siblings, 2 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-29 16:44 UTC (permalink / raw)
To: Will Deacon
Cc: Arnd Bergmann, Figo. zhang, Aswin Chandramouleeswaran,
Rik van Riel, Waiman Long, linux-kernel@vger.kernel.org,
Raghavendra K T, linux-arch@vger.kernel.org, Andi Kleen,
George Spelvin, Tim Chen, Michel Lespinasse, Ingo Molnar,
Paul E. McKenney, Peter Hurley, H. Peter Anvin, Andrew Morton,
linux-mm, Alex Shi, Andrea Arcangeli, Scott J Norton,
Thomas Gleixner, Dave Hansen, Peter Zijlstra, Matthew R Wilcox,
Davidlohr Bueso
On Nov 29, 2013 8:18 AM, "Will Deacon" <will.deacon@arm.com> wrote:
>
> To get some sort of
> idea, I tried adding a dmb to the start of spin_unlock on ARMv7 and I saw a
> 3% performance hit in hackbench on my dual-cluster board.
Don't do a dmb. Just do a dummy release. You just said that on arm64 an
unlock+lock is a memory barrier, so just make the mb__before_spinlock() be
a dummy store with release to the stack.
That should be noticeably cheaper than a full dmb.
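A sketch of what that might look like on arm64 (hypothetical and untested;
the inline asm just mirrors the existing smp_store_release() pattern):

	/*
	 * Dummy store-release to the stack: the LDAR inside the following
	 * spin_lock() then completes a release->acquire sequence, which the
	 * architecture orders fully -- no DMB needed.
	 */
	#define smp_mb__before_spinlock()				\
	do {								\
		unsigned long __dummy;					\
		asm volatile("stlr %1, %0"				\
			     : "=Q" (__dummy)				\
			     : "r" (0UL)				\
			     : "memory");				\
	} while (0)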
Linus
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-29 16:44 ` Linus Torvalds
@ 2013-11-29 18:18 ` Will Deacon
2013-11-30 17:38 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Will Deacon @ 2013-11-29 18:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arnd Bergmann, Figo. zhang, Aswin Chandramouleeswaran,
Rik van Riel, Waiman Long, linux-kernel@vger.kernel.org,
Raghavendra K T, linux-arch@vger.kernel.org, Andi Kleen,
George Spelvin, Tim Chen, Michel Lespinasse, Ingo Molnar,
Paul E. McKenney, Peter Hurley, H. Peter Anvin, Andrew Morton,
linux-mm, Alex Shi, Andrea Arcangeli, Scott J Norton,
Thomas Gleixner, Dave Hansen, Peter Zijlstra, Matthew R Wilcox,
Davidlohr Bueso
On Fri, Nov 29, 2013 at 04:44:41PM +0000, Linus Torvalds wrote:
>
> On Nov 29, 2013 8:18 AM, "Will Deacon"
> <will.deacon@arm.com> wrote:
> >
> > To get some sort of idea, I tried adding a dmb to the start of
> > spin_unlock on ARMv7 and I saw a 3% performance hit in hackbench on my
> > dual-cluster board.
>
> Don't do a dmb. Just do a dummy release. You just said that on arm64 a
> unlock+lock is a memory barrier, so just make the mb__before_spinlock() be
> a dummy store with release to the stack..
Good idea! That should work quite nicely (I don't have anything sane I can
benchmark it on), so I think that solves the issue I was moaning about.
Will
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-29 16:44 ` Linus Torvalds
2013-11-29 18:18 ` Will Deacon
@ 2013-11-30 17:38 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-30 17:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Will Deacon, Arnd Bergmann, Figo. zhang,
Aswin Chandramouleeswaran, Rik van Riel, Waiman Long,
linux-kernel@vger.kernel.org, Raghavendra K T,
linux-arch@vger.kernel.org, Andi Kleen, George Spelvin, Tim Chen,
Michel Lespinasse, Ingo Molnar, Peter Hurley, H. Peter Anvin,
Andrew Morton, linux-mm, Alex Shi, Andrea Arcangeli,
Scott J Norton, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
Matthew R Wilcox, Davidlohr Bueso
On Fri, Nov 29, 2013 at 08:44:41AM -0800, Linus Torvalds wrote:
> On Nov 29, 2013 8:18 AM, "Will Deacon" <will.deacon@arm.com> wrote:
> >
> > To get some sort of idea, I tried adding a dmb to the start of
> > spin_unlock on ARMv7 and I saw a 3% performance hit in hackbench on my
> > dual-cluster board.
>
> Don't do a dmb. Just do a dummy release. You just said that on arm64 an
> unlock+lock is a memory barrier, so just make the mb__before_spinlock() be
> a dummy store with release to the stack.
>
> That should be noticeably cheaper than a full dmb.
Cute! I like it! ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 19:00 ` Linus Torvalds
2013-11-26 19:20 ` Paul E. McKenney
@ 2013-11-26 19:21 ` Peter Zijlstra
2013-11-27 16:58 ` Oleg Nesterov
2013-11-26 23:08 ` Benjamin Herrenschmidt
2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2013-11-26 19:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Will Deacon, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang, Oleg Nesterov
On Tue, Nov 26, 2013 at 11:00:50AM -0800, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 1:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > If you now want to weaken this definition, then that needs consideration
> > because we actually rely on things like
> >
> > spin_unlock(l1);
> > spin_lock(l2);
> >
> > being full barriers.
>
> Btw, maybe we should just stop that assumption.
I'd be fine with that; it was one of the options listed. I was just
somewhat concerned that the definitions given by the Document and the
reality of the proposed implementations were drifting apart.
> IOW, where do we really care about the "unlock+lock" is a memory
> barrier? And could we make those places explicit, and then do
> something similar to the above to them?
So I don't know :-(
I do know that Oleg and I have often talked about it, and I'm fairly
sure we must have used it at some point.
I think the introduction of smp_mb__before_spinlock() actually killed a
few of those, but I can't recall.
Oleg doesn't actually seem to be on the CC list -- let's amend that.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 19:21 ` Peter Zijlstra
@ 2013-11-27 16:58 ` Oleg Nesterov
0 siblings, 0 replies; 116+ messages in thread
From: Oleg Nesterov @ 2013-11-27 16:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Paul E. McKenney, Will Deacon, Tim Chen,
Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On 11/26, Peter Zijlstra wrote:
>
> On Tue, Nov 26, 2013 at 11:00:50AM -0800, Linus Torvalds wrote:
>
> > IOW, where do we really care about the "unlock+lock" is a memory
> > barrier? And could we make those places explicit, and then do
> > something similar to the above to them?
>
> So I don't know :-(
>
> I do know myself and Oleg have often talked about it, and I'm fairly
> sure we must have used it at some point.
No... I can't recall any particular place which explicitly relies
on "unlock+lock => mb()".
(Although I know the out-of-tree example, which can be ignored ;)
I can only recall that this was mentioned in the context like
"no, the lack of mb() can't explain the problem because we have
unlock+lock".
Oleg.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 19:00 ` Linus Torvalds
2013-11-26 19:20 ` Paul E. McKenney
2013-11-26 19:21 ` Peter Zijlstra
@ 2013-11-26 23:08 ` Benjamin Herrenschmidt
2 siblings, 0 replies; 116+ messages in thread
From: Benjamin Herrenschmidt @ 2013-11-26 23:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Paul E. McKenney, Will Deacon, Tim Chen,
Ingo Molnar, Andrew Morton, Thomas Gleixner,
linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Waiman Long, Andrea Arcangeli,
Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, 2013-11-26 at 11:00 -0800, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 1:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > If you now want to weaken this definition, then that needs consideration
> > because we actually rely on things like
> >
> > spin_unlock(l1);
> > spin_lock(l2);
> >
> > being full barriers.
>
> Btw, maybe we should just stop that assumption. The complexity of this
> discussion makes me go "maybe we should stop with subtle assumptions
> that happen to be obviously true on x86 due to historical
> implementations, but aren't obviously true even *there* any more with
> the MCS lock".
I would love to get rid of that assumption because it's one of the
things that we currently violate on PowerPC and to get it completely
right we would have to upgrade at least one side to a full sync.
> We already have a concept of
>
> smp_mb__before_spinlock();
> spin_lock():
>
> for sequences where we *know* we need to make getting a spin-lock be a
> full memory barrier. It's free on x86 (and remains so even with the
> MCS lock, regardless of any subtle issues, if only because even the
> MCS lock starts out with a locked atomic, never mind the contention
> slow-case). Of course, that macro is only used inside the scheduler,
> and is actually documented to not really be a full memory barrier, but
> it handles the case we actually care about.
>
> IOW, where do we really care about the "unlock+lock" is a memory
> barrier? And could we make those places explicit, and then do
> something similar to the above to them?
I personally am in favor of either that or an explicit variant of
spin_lock_mb (or unlock) that has the built-in full barrier, whichever;
that could be a *little bit* more efficient, since we wouldn't accumulate
both the semi-permeable barrier in the lock/unlock *and* the added full
barrier.
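As a sketch only (the name is made up, not a concrete proposal), a
generic fallback for such a variant would simply stack the two barriers;
the saving would come from an architecture folding the full barrier into
its lock-acquire atomic instead:

static inline void spin_lock_mb(spinlock_t *lock)
{
	spin_lock(lock);	/* acquire semantics */
	smp_mb();		/* upgrade to a full memory barrier */
}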
Cheers,
Ben.
> Linus
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:02 ` Paul E. McKenney
2013-11-25 18:24 ` Peter Zijlstra
2013-11-25 18:27 ` Peter Zijlstra
@ 2013-11-25 23:55 ` H. Peter Anvin
2013-11-26 3:16 ` Paul E. McKenney
2 siblings, 1 reply; 116+ messages in thread
From: H. Peter Anvin @ 2013-11-25 23:55 UTC (permalink / raw)
To: paulmck, Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On 11/25/2013 10:02 AM, Paul E. McKenney wrote:
>
> I still do not believe that it does. Again, strangely enough.
>
> We need to ask someone in Intel that understands this all the way down
> to the silicon. The guy I used to rely on for this no longer works
> at Intel.
>
> Do you know someone who fits this description, or should I start sending
> cold-call emails to various Intel contacts?
>
Feel free to poke me if you need any help.
-hpa
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 23:55 ` H. Peter Anvin
@ 2013-11-26 3:16 ` Paul E. McKenney
2013-11-27 0:46 ` H. Peter Anvin
0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-26 3:16 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Mon, Nov 25, 2013 at 03:55:43PM -0800, H. Peter Anvin wrote:
> On 11/25/2013 10:02 AM, Paul E. McKenney wrote:
> >
> > I still do not believe that it does. Again, strangely enough.
> >
> > We need to ask someone in Intel that understands this all the way down
> > to the silicon. The guy I used to rely on for this no longer works
> > at Intel.
> >
> > Do you know someone who fits this description, or should I start sending
> > cold-call emails to various Intel contacts?
>
> Feel free to poke me if you need any help.
My biggest question is the definition of "Memory ordering obeys causality
(memory ordering respects transitive visibility)" in Section 3.2.2 of
the "Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
dated March 2013 from:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
I am guessing that it orders loads as well as stores, so that a load
is said to be "visible" to some other CPU once that CPU no longer has
the opportunity to affect the return value from the load. Is that a
reasonable interpretation?
More generally, is the model put forward by Sewell et al. in "x86-TSO:
A Rigorous and Usable Programmer's Model for x86 Multiprocessors"
accurate? This is on pages 4 and 5 here:
http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-26 3:16 ` Paul E. McKenney
@ 2013-11-27 0:46 ` H. Peter Anvin
2013-11-27 1:07 ` Linus Torvalds
2013-11-27 1:27 ` Paul E. McKenney
0 siblings, 2 replies; 116+ messages in thread
From: H. Peter Anvin @ 2013-11-27 0:46 UTC (permalink / raw)
To: paulmck
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On 11/25/2013 07:16 PM, Paul E. McKenney wrote:
>
> My biggest question is the definition of "Memory ordering obeys causality
> (memory ordering respects transitive visibility)" in Section 3.2.2 of
> the "Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
> dated March 2013 from:
>
> http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
>
> I am guessing that it orders loads as well as stores, so that a load
> is said to be "visible" to some other CPU once that CPU no longer has
> the opportunity to affect the return value from the load. Is that a
> reasonable interpretation?
>
The best pointer I can give is the example in section 8.2.3.6 of the
current SDM (version 048, dated September 2013). It is a bit more
complex than what you have described above.
> More generally, is the model put forward by Sewell et al. in "x86-TSO:
> A Rigorous and Usable Programmer's Model for x86 Multiprocessors"
> accurate? This is on pages 4 and 5 here:
>
> http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
I think for Intel to give that one a formal stamp of approval would take
some serious analysis.
-hpa
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 0:46 ` H. Peter Anvin
@ 2013-11-27 1:07 ` Linus Torvalds
2013-11-27 1:27 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Linus Torvalds @ 2013-11-27 1:07 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Paul McKenney, Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 4:46 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> The best pointer I can give is the example in section 8.2.3.6 of the
> current SDM (version 048, dated September 2013). It is a bit more
> complex than what you have described above.
That 8.2.3.6 thing (and the whole "causally related" argument) does
seem to say that the MCS lock is fine on x86 without any extra
barriers. My A < B .. < F < A argument was very much a causality-based
one.
Linus
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 0:46 ` H. Peter Anvin
2013-11-27 1:07 ` Linus Torvalds
@ 2013-11-27 1:27 ` Paul E. McKenney
2013-11-27 2:59 ` H. Peter Anvin
1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-27 1:27 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Tue, Nov 26, 2013 at 04:46:54PM -0800, H. Peter Anvin wrote:
> On 11/25/2013 07:16 PM, Paul E. McKenney wrote:
> >
> > My biggest question is the definition of "Memory ordering obeys causality
> > (memory ordering respects transitive visibility)" in Section 3.2.2 of
> > the "Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
> > dated March 2013 from:
> >
> > http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
> >
> > I am guessing that it orders loads as well as stores, so that a load
> > is said to be "visible" to some other CPU once that CPU no longer has
> > the opportunity to affect the return value from the load. Is that a
> > reasonable interpretation?
>
> The best pointer I can give is the example in section 8.2.3.6 of the
> current SDM (version 048, dated September 2013). It is a bit more
> complex than what you have described above.
OK, I did see that example. It is similar to the one we are chasing
in this thread, but there are some important differences. But you
did mention that that other example operated as expected on x86, so
we are good for the moment. I was hoping to gain more general
understanding, but I would guess that there will be other examples
to help towards that goal. ;-)
> > More generally, is the model put forward by Sewell et al. in "x86-TSO:
> > A Rigorous and Usable Programmer's Model for x86 Multiprocessors"
> > accurate? This is on pages 4 and 5 here:
> >
> > http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
>
> I think for Intel to give that one a formal stamp of approval would take
> some serious analysis.
I bet!!!
Hey, I had to ask! ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-27 1:27 ` Paul E. McKenney
@ 2013-11-27 2:59 ` H. Peter Anvin
0 siblings, 0 replies; 116+ messages in thread
From: H. Peter Anvin @ 2013-11-27 2:59 UTC (permalink / raw)
To: paulmck
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
Yes, if you have concrete scenarios we can discuss them.
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>On Tue, Nov 26, 2013 at 04:46:54PM -0800, H. Peter Anvin wrote:
>> On 11/25/2013 07:16 PM, Paul E. McKenney wrote:
>> >
>> > My biggest question is the definition of "Memory ordering obeys causality
>> > (memory ordering respects transitive visibility)" in Section 3.2.2 of
>> > the "Intel(R) 64 and IA-32 Architectures Developer's Manual: Vol. 3A"
>> > dated March 2013 from:
>> >
>> > http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
>> >
>> > I am guessing that it orders loads as well as stores, so that a load
>> > is said to be "visible" to some other CPU once that CPU no longer has
>> > the opportunity to affect the return value from the load. Is that a
>> > reasonable interpretation?
>>
>> The best pointer I can give is the example in section 8.2.3.6 of the
>> current SDM (version 048, dated September 2013). It is a bit more
>> complex than what you have described above.
>
>OK, I did see that example. It is similar to the one we are chasing
>in this thread, but there are some important differences. But you
>did mention that that other example operated as expected on x86, so
>we are good for the moment. I was hoping to gain more general
>understanding, but I would guess that there will be other examples
>to help towards that goal. ;-)
>
>> > More generally, is the model put forward by Sewell et al. in "x86-TSO:
>> > A Rigorous and Usable Programmer's Model for x86 Multiprocessors"
>> > accurate? This is on pages 4 and 5 here:
>> >
>> > http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
>>
>> I think for Intel to give that one a formal stamp of approval would take
>> some serious analysis.
>
>I bet!!!
>
>Hey, I had to ask! ;-)
>
> Thanx, Paul
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 17:35 ` Peter Zijlstra
2013-11-25 18:02 ` Paul E. McKenney
@ 2013-11-25 18:52 ` H. Peter Anvin
2013-11-25 22:58 ` Tim Chen
2013-11-25 23:36 ` Paul E. McKenney
1 sibling, 2 replies; 116+ messages in thread
From: H. Peter Anvin @ 2013-11-25 18:52 UTC (permalink / raw)
To: Peter Zijlstra, Paul E. McKenney
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On 11/25/2013 09:35 AM, Peter Zijlstra wrote:
>
> I think this means x86 needs help too.
>
> Consider:
>
> x = y = 0
>
> w[x] = 1 | w[y] = 1
> mfence | mfence
> r[y] = 0 | r[x] = 0
>
> This is generally an impossible case, right? (Since if we observe y=0
> this means that w[y]=1 has not yet happened, and therefore x=1, and
> vice-versa).
>
> Now replace one of the mfences with smp_store_release(l1);
> smp_load_acquire(l2); such that we have a RELEASE+ACQUIRE pair that
> _should_ form a full barrier:
>
> w[x] = 1 | w[y] = 1
> w[l1] = 1 | mfence
> r[l2] = 0 | r[x] = 0
> r[y] = 0 |
>
> At which point we can observe the impossible, because as per the rule:
>
> 'reads may be reordered with older writes to different locations'
>
> Our r[y] can slip before the w[x]=1.
>
Yes, because although r[l2] and r[y] are ordered with respect to each
other, they are allowed to be executed before w[x] and w[l1]. In other
words, smp_store_release() followed by smp_load_acquire() to a different
location do not form a full barrier. To the *same* location, they will.
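Rendered as kernel-style C, that release/acquire version is (a sketch;
the r_* variable names are invented):

int x, y, l1, l2;			/* all initially 0 */
int r_l2, r_y, r_x;

void cpu0(void)				/* left-hand column above */
{
	ACCESS_ONCE(x) = 1;
	smp_store_release(&l1, 1);
	r_l2 = smp_load_acquire(&l2);
	r_y = ACCESS_ONCE(y);
}

void cpu1(void)				/* right-hand column above */
{
	ACCESS_ONCE(y) = 1;
	smp_mb();
	r_x = ACCESS_ONCE(x);
}

and by the reordering above, r_y == 0 && r_x == 0 is allowed; with
smp_mb() on both sides instead of the release/acquire pair it is not.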
-hpa
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:52 ` H. Peter Anvin
@ 2013-11-25 22:58 ` Tim Chen
2013-11-25 23:28 ` H. Peter Anvin
2013-11-25 23:36 ` Paul E. McKenney
1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2013-11-25 22:58 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, Paul E. McKenney, Will Deacon, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Mon, 2013-11-25 at 10:52 -0800, H. Peter Anvin wrote:
> On 11/25/2013 09:35 AM, Peter Zijlstra wrote:
> >
> > I think this means x86 needs help too.
> >
> > Consider:
> >
> > x = y = 0
> >
> > w[x] = 1 | w[y] = 1
> > mfence | mfence
> > r[y] = 0 | r[x] = 0
> >
> > This is generally an impossible case, right? (Since if we observe y=0
> > this means that w[y]=1 has not yet happened, and therefore x=1, and
> > vice-versa).
> >
> > Now replace one of the mfences with smp_store_release(l1);
> > smp_load_acquire(l2); such that we have a RELEASE+ACQUIRE pair that
> > _should_ form a full barrier:
> >
> > w[x] = 1 | w[y] = 1
> > w[l1] = 1 | mfence
> > r[l2] = 0 | r[x] = 0
> > r[y] = 0 |
> >
> > At which point we can observe the impossible, because as per the rule:
> >
> > 'reads may be reordered with older writes to different locations'
> >
> > Our r[y] can slip before the w[x]=1.
> >
>
> Yes, because although r[l2] and r[y] are ordered with respect to each
> other, they are allowed to be executed before w[x] and w[l1]. In other
> words, smp_store_release() followed by smp_load_acquire() to a different
> location do not form a full barrier. To the *same* location, they will.
>
> -hpa
>
Peter,
Want to check with you on Paul's example,
where we are indeed writing and reading to the same
lock location when passing the lock on x86 with smp_store_release and
smp_load_acquire. So the unlock and lock sequence looks like:
CPU 0 (releasing)          CPU 1 (acquiring)
-----                      -----
ACCESS_ONCE(X) = 1;        while (ACCESS_ONCE(lock) == 1)
                                   continue;
ACCESS_ONCE(lock) = 0;
                           r1 = ACCESS_ONCE(Y);
observer CPU 2:
CPU 2
-----
ACCESS_ONCE(Y) = 1;
smp_mb();
r2 = ACCESS_ONCE(X);
If the write and read to lock act as a full memory barrier,
it would be impossible to
end up with (r1 == 0 && r2 == 0), correct?
Tim
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 22:58 ` Tim Chen
@ 2013-11-25 23:28 ` H. Peter Anvin
2013-11-25 23:51 ` Paul E. McKenney
0 siblings, 1 reply; 116+ messages in thread
From: H. Peter Anvin @ 2013-11-25 23:28 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Paul E. McKenney, Will Deacon, Ingo Molnar,
Andrew Morton, Thomas Gleixner, linux-kernel@vger.kernel.org,
linux-mm, linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On 11/25/2013 02:58 PM, Tim Chen wrote:
>
> Peter,
>
> Want to check with you on Paul's example,
> where we are indeed writing and reading to the same
> lock location when passing the lock on x86 with smp_store_release and
> smp_load_acquire. So the unlock and lock sequence looks like:
>
> CPU 0 (releasing)          CPU 1 (acquiring)
> -----                      -----
> ACCESS_ONCE(X) = 1;        while (ACCESS_ONCE(lock) == 1)
>                                    continue;
> ACCESS_ONCE(lock) = 0;
>                            r1 = ACCESS_ONCE(Y);
>
Here we can definitely state that the read from Y must have happened
after X was set to 1 (assuming lock starts out as 1).
> observer CPU 2:
>
> CPU 2
> -----
> ACCESS_ONCE(Y) = 1;
> smp_mb();
> r2 = ACCESS_ONCE(X);
>
> If the write and read to lock act as a full memory barrier,
> it would be impossible to
> end up with (r1 == 0 && r2 == 0), correct?
>
It would be impossible to end up with r1 == 1 && r2 == 0, I presume
that's what you meant.
-hpa
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 23:28 ` H. Peter Anvin
@ 2013-11-25 23:51 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 23:51 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Tim Chen, Peter Zijlstra, Will Deacon, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Mon, Nov 25, 2013 at 03:28:32PM -0800, H. Peter Anvin wrote:
> On 11/25/2013 02:58 PM, Tim Chen wrote:
> >
> > Peter,
> >
> > Want to check with you on Paul's example,
> > where we are indeed writing and reading to the same
> > lock location when passing the lock on x86 with smp_store_release and
> > smp_load_acquire. So the unlock and lock sequence looks like:
> >
> > CPU 0 (releasing)          CPU 1 (acquiring)
> > -----                      -----
> > ACCESS_ONCE(X) = 1;        while (ACCESS_ONCE(lock) == 1)
> >                                    continue;
> > ACCESS_ONCE(lock) = 0;
> >                            r1 = ACCESS_ONCE(Y);
> >
>
> Here we can definitely state that the read from Y must have happened
> after X was set to 1 (assuming lock starts out as 1).
>
> > observer CPU 2:
> >
> > CPU 2
> > -----
> > ACCESS_ONCE(Y) = 1;
> > smp_mb();
> > r2 = ACCESS_ONCE(X);
> >
> > If the write and read to lock act as a full memory barrier,
> > it would be impossible to
> > end up with (r1 == 0 && r2 == 0), correct?
> >
>
> It would be impossible to end up with r1 == 1 && r2 == 0, I presume
> that's what you meant.
Yes, that is the correct impossibility. Thank you, Peter!
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-25 18:52 ` H. Peter Anvin
2013-11-25 22:58 ` Tim Chen
@ 2013-11-25 23:36 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-25 23:36 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Mon, Nov 25, 2013 at 10:52:10AM -0800, H. Peter Anvin wrote:
> On 11/25/2013 09:35 AM, Peter Zijlstra wrote:
> >
> > I think this means x86 needs help too.
> >
> > Consider:
> >
> > x = y = 0
> >
> > w[x] = 1 | w[y] = 1
> > mfence | mfence
> > r[y] = 0 | r[x] = 0
> >
> > This is generally an impossible case, right? (Since if we observe y=0
> > this means that w[y]=1 has not yet happened, and therefore x=1, and
> > vice-versa).
> >
> > Now replace one of the mfences with smp_store_release(l1);
> > smp_load_acquire(l2); such that we have a RELEASE+ACQUIRE pair that
> > _should_ form a full barrier:
> >
> > w[x] = 1 | w[y] = 1
> > w[l1] = 1 | mfence
> > r[l2] = 0 | r[x] = 0
> > r[y] = 0 |
> >
> > At which point we can observe the impossible, because as per the rule:
> >
> > 'reads may be reordered with older writes to different locations'
> >
> > Our r[y] can slip before the w[x]=1.
>
> Yes, because although r[l2] and r[y] are ordered with respect to each
> other, they are allowed to be executed before w[x] and w[l1]. In other
> words, smp_store_release() followed by smp_load_acquire() to a different
> location do not form a full barrier. To the *same* location, they will.
In the case where we have a single CPU doing an unlock of one lock
followed by a lock of another lock using Tim Chen's MCS lock, there
will be an xchg() that will provide the needed full barrier.
If the unlock is from one CPU and the lock is from another CPU, then the
Linux kernel only requires a full barrier in the case where both
the unlock and lock are acting on the same lock variable. Which is
the scenario under investigation.
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 17:25 ` Paul E. McKenney
2013-11-21 21:52 ` Peter Zijlstra
@ 2013-12-04 21:26 ` Andi Kleen
2013-12-04 22:07 ` Paul E. McKenney
1 sibling, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2013-12-04 21:26 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
> Let's apply the Intel manual to the earlier example:
>
> CPU 0              CPU 1                  CPU 2
> -----              -----                  -----
> x = 1;             r1 = SLA(lock);        y = 1;
> SSR(lock, 1);      r2 = y;                smp_mb();
>                                           r3 = x;
>
> assert(!(r1 == 1 && r2 == 0 && r3 == 0));
Hi Paul,
We discussed this example with CPU architects and they
agreed that it is valid to rely on (r1 == 1 && r2 == 0 && r3 == 0)
never happening.
So the MCS code is good without additional barriers.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-12-04 21:26 ` Andi Kleen
@ 2013-12-04 22:07 ` Paul E. McKenney
0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-12-04 22:07 UTC (permalink / raw)
To: Andi Kleen
Cc: Peter Zijlstra, Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Michel Lespinasse, Davidlohr Bueso,
Matthew R Wilcox, Dave Hansen, Rik van Riel, Peter Hurley,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Figo.zhang
On Wed, Dec 04, 2013 at 10:26:13PM +0100, Andi Kleen wrote:
> > Let's apply the Intel manual to the earlier example:
> >
> > CPU 0              CPU 1                  CPU 2
> > -----              -----                  -----
> > x = 1;             r1 = SLA(lock);        y = 1;
> > SSR(lock, 1);      r2 = y;                smp_mb();
> >                                           r3 = x;
> >
> > assert(!(r1 == 1 && r2 == 0 && r3 == 0));
>
> Hi Paul,
>
> We discussed this example with CPU architects and they
> agreed that it is valid to rely on (r1 == 1 && r2 == 0 && r3 == 0)
> never happening.
>
> So the MCS code is good without additional barriers.
Good to hear!!! Thank you, Andi!
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v6 4/5] MCS Lock: Barrier corrections
2013-11-21 11:03 ` Peter Zijlstra
2013-11-21 12:56 ` Peter Zijlstra
@ 2013-11-21 13:19 ` Paul E. McKenney
1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2013-11-21 13:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Will Deacon, Tim Chen, Ingo Molnar, Andrew Morton,
Thomas Gleixner, linux-kernel@vger.kernel.org, linux-mm,
linux-arch@vger.kernel.org, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
Peter Hurley, Raghavendra K T, George Spelvin, H. Peter Anvin,
Arnd Bergmann, Aswin Chandramouleeswaran, Scott J Norton,
Figo.zhang
On Thu, Nov 21, 2013 at 12:03:08PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 09:14:00AM -0800, Paul E. McKenney wrote:
> > > Hmm, so in the following case:
> > >
> > > Access A
> > > unlock() /* release semantics */
> > > lock() /* acquire semantics */
> > > Access B
> > >
> > > A cannot pass beyond the unlock() and B cannot pass before the lock().
> > >
> > > I agree that accesses between the unlock and the lock can be moved across both
> > > A and B, but that doesn't seem to matter by my reading of the above.
> > >
> > > What is the problematic scenario you have in mind? Are you thinking of the
> > > lock() moving before the unlock()? That's only permitted by RCpc afaiu,
> > > which I don't think any architectures supported by Linux implement...
> > > (ARMv8 acquire/release is RCsc).
> >
> > If smp_load_acquire() and smp_store_release() are both implemented using
> > lwsync on powerpc, and if Access A is a store and Access B is a load,
> > then Access A and Access B can be reordered.
> >
> > Of course, if every other architecture will be providing RCsc implementations
> > for smp_load_acquire() and smp_store_release(), which would not be a bad
> > thing, then another approach is for powerpc to use sync rather than lwsync
> > for one or the other of smp_load_acquire() or smp_store_release().
>
> So which of the two would make most sense?
>
> As per the Document, loads/stores should not be able to pass up through
> an ACQUIRE and loads/stores should not be able to pass down through a
> RELEASE.
>
> I think PPC would match that if we use sync for smp_store_release() such
> that it will flush the store buffer, and thereby guarantee all stores
> are kept within the required section.
Yep, for PPC we can just use sync for smp_store_release(). We just need
to check the other architectures.
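Concretely, that could look something like this (a sketch only; smp_mb()
already expands to sync on powerpc):

#define smp_store_release(p, v)						\
do {									\
	smp_mb();	/* full "sync": orders prior loads and stores */\
	ACCESS_ONCE(*p) = (v);						\
} while (0)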
Thanx, Paul
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v6 5/5] MCS Lock: Allows for architecture specific mcs lock and unlock
[not found] <cover.1384885312.git.tim.c.chen@linux.intel.com>
` (4 preceding siblings ...)
2013-11-20 1:37 ` [PATCH v6 4/5] MCS Lock: Barrier corrections Tim Chen
@ 2013-11-20 1:37 ` Tim Chen
5 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2013-11-20 1:37 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton, Thomas Gleixner
Cc: linux-kernel, linux-mm, linux-arch, Linus Torvalds, Waiman Long,
Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
Rik van Riel, Peter Hurley, Paul E.McKenney, Tim Chen,
Raghavendra K T, George Spelvin, H. Peter Anvin, Arnd Bergmann,
Aswin Chandramouleeswaran, Scott J Norton, Will Deacon,
Figo.zhang
Restructure the code to allow for architecture-specific definitions
of the arch_mcs_spin_lock and arch_mcs_spin_unlock functions
that can be optimized for a specific architecture.  These
arch-specific functions can be placed in asm/mcs_spinlock.h.
Otherwise, the default arch_mcs_spin_lock and arch_mcs_spin_unlock
will be used.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
arch/Kconfig | 3 ++
include/linux/mcs_spinlock.h | 5 +++
kernel/locking/mcs_spinlock.c | 93 +++++++++++++++++++++++++------------------
3 files changed, 62 insertions(+), 39 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index ded747c..c96c696 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -306,6 +306,9 @@ config HAVE_CMPXCHG_LOCAL
config HAVE_CMPXCHG_DOUBLE
bool
+config HAVE_ARCH_MCS_LOCK
+ bool
+
config ARCH_WANT_IPC_PARSE_VERSION
bool
diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index d54bb23..d64786a 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -12,6 +12,11 @@
#ifndef __LINUX_MCS_SPINLOCK_H
#define __LINUX_MCS_SPINLOCK_H
+/* arch specific mcs lock and unlock functions defined here */
+#ifdef CONFIG_HAVE_ARCH_MCS_LOCK
+#include <asm/mcs_spinlock.h>
+#endif
+
struct mcs_spinlock {
struct mcs_spinlock *next;
int locked; /* 1 if lock acquired */
diff --git a/kernel/locking/mcs_spinlock.c b/kernel/locking/mcs_spinlock.c
index 6f2ce8e..582584a 100644
--- a/kernel/locking/mcs_spinlock.c
+++ b/kernel/locking/mcs_spinlock.c
@@ -29,28 +29,36 @@
* on this node->locked until the previous lock holder sets the node->locked
* in mcs_spin_unlock().
*/
+#ifndef arch_mcs_spin_lock
+#define arch_mcs_spin_lock(lock, node) \
+{ \
+ struct mcs_spinlock *prev; \
+ \
+ /* Init node */ \
+ node->locked = 0; \
+ node->next = NULL; \
+ \
+ /* xchg() provides a memory barrier */ \
+ prev = xchg(lock, node); \
+ if (likely(prev == NULL)) { \
+ /* Lock acquired */ \
+ return; \
+ } \
+ ACCESS_ONCE(prev->next) = node; \
+ /* \
+ * Wait until the lock holder passes the lock down. \
+ * Using smp_load_acquire() provides a memory barrier that \
+ * ensures subsequent operations happen after the lock is \
+ * acquired. \
+ */ \
+ while (!(smp_load_acquire(&node->locked))) \
+ arch_mutex_cpu_relax(); \
+}
+#endif
+
void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
- struct mcs_spinlock *prev;
-
- /* Init node */
- node->locked = 0;
- node->next = NULL;
-
- /* xchg() provides a memory barrier */
- prev = xchg(lock, node);
- if (likely(prev == NULL)) {
- /* Lock acquired */
- return;
- }
- ACCESS_ONCE(prev->next) = node;
- /*
- * Wait until the lock holder passes the lock down.
- * Using smp_load_acquire() provides a memory barrier that
- * ensures subsequent operations happen after the lock is acquired.
- */
- while (!(smp_load_acquire(&node->locked)))
- arch_mutex_cpu_relax();
+ arch_mcs_spin_lock(lock, node);
}
EXPORT_SYMBOL_GPL(mcs_spin_lock);
@@ -58,26 +66,33 @@ EXPORT_SYMBOL_GPL(mcs_spin_lock);
* Releases the lock. The caller should pass in the corresponding node that
* was used to acquire the lock.
*/
+#ifndef arch_mcs_spin_unlock
+#define arch_mcs_spin_unlock(lock, node) \
+{ \
+ struct mcs_spinlock *next = ACCESS_ONCE(node->next); \
+ \
+ if (likely(!next)) { \
+ /* \
+ * Release the lock by setting it to NULL \
+ */ \
+ if (likely(cmpxchg(lock, node, NULL) == node)) \
+ return; \
+ /* Wait until the next pointer is set */ \
+ while (!(next = ACCESS_ONCE(node->next))) \
+ arch_mutex_cpu_relax(); \
+ } \
+ /* \
+ * Pass lock to next waiter. \
+ * smp_store_release() provides a memory barrier to ensure \
+ * all operations in the critical section has been completed \
+ * before unlocking. \
+ */ \
+ smp_store_release(&next->locked, 1); \
+}
+#endif
+
void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
- struct mcs_spinlock *next = ACCESS_ONCE(node->next);
-
- if (likely(!next)) {
- /*
- * Release the lock by setting it to NULL
- */
- if (likely(cmpxchg(lock, node, NULL) == node))
- return;
- /* Wait until the next pointer is set */
- while (!(next = ACCESS_ONCE(node->next)))
- arch_mutex_cpu_relax();
- }
- /*
- * Pass lock to next waiter.
- * smp_store_release() provides a memory barrier to ensure
- * all operations in the critical section has been completed
- * before unlocking.
- */
- smp_store_release(&next->locked, 1);
+ arch_mcs_spin_unlock(lock, node);
}
EXPORT_SYMBOL_GPL(mcs_spin_unlock);
--
1.7.11.7
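As an illustration of how an architecture could use these hooks (a
hypothetical sketch, not part of this patch; arch_wait_for_locked() is a
made-up low-power wait primitive), the architecture would add
"select HAVE_ARCH_MCS_LOCK" to its Kconfig and supply something like:

/* arch/foo/include/asm/mcs_spinlock.h -- hypothetical example only */
#ifndef _ASM_FOO_MCS_SPINLOCK_H
#define _ASM_FOO_MCS_SPINLOCK_H

#define arch_mcs_spin_lock(lock, node)					\
{									\
	struct mcs_spinlock *prev;					\
									\
	/* Init node */							\
	node->locked = 0;						\
	node->next = NULL;						\
									\
	/* xchg() provides a memory barrier */				\
	prev = xchg(lock, node);					\
	if (likely(prev == NULL)) {					\
		/* Lock acquired */					\
		return;							\
	}								\
	ACCESS_ONCE(prev->next) = node;					\
	/* wait for the lock holder, idling instead of spinning */	\
	while (!(smp_load_acquire(&node->locked)))			\
		arch_wait_for_locked(&node->locked);			\
}

/* arch_mcs_spin_unlock not defined: the generic unlock path is used */

#endif /* _ASM_FOO_MCS_SPINLOCK_H */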
^ permalink raw reply related [flat|nested] 116+ messages in thread