* [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8)
@ 2010-01-31 20:52 Mathieu Desnoyers
2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells
Hi,
Here is an updated version of the sys_membarrier system call. It implements a
system call that allows userspace to distribute the overhead of memory barriers
asymmetrically. It is particularly useful for Userspace RCU (liburcu), to allow
the RCU read-side to only issue compiler barriers, matched by calls to
sys_membarrier() in the userspace synchronize_rcu().
Even though sys_membarrier should be portable to other architectures as-is,
without arch-specific modifications, I propose to incrementally reserve the
system call IDs as we test it on each architecture.
Here is a version that I think should be ready for merging. It is based on the
2.6.33-rc6 kernel. I would appreciate it if people who gave their Acked-by in
the RFC rounds could confirm their Acked-by following the patch set updates.
Thanks,
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
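The asymmetric pattern the cover letter describes can be sketched in userspace C. This is a hypothetical illustration, not liburcu's actual code: the reader side pays only a compiler barrier, while the writer probes for kernel assistance and falls back to an ordinary full barrier. The `membarrier()` syscall number shown is the x86-64 value of the interface that was eventually merged (much later than this thread); both the number and the fallback policy are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier 324	/* x86-64 value; assumption, arch-specific */
#endif

#define MEMBARRIER_CMD_QUERY 0	/* ask the kernel which commands it supports */

/* Reader side: only a compiler barrier, as the cover letter describes. */
static void rcu_read_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/*
 * Writer-side barrier for the urcu-style pattern: probe for kernel help,
 * fall back to a plain full barrier. Returns 0 in both cases.
 */
static int smp_mb_master(void)
{
	if (syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0) < 0) {
		/*
		 * No kernel support: full barrier on this CPU only. A real
		 * fallback would also need barriers on the reader side.
		 */
		__sync_synchronize();
		return 0;
	}
	/*
	 * Kernel support present; a real implementation would now invoke a
	 * process-wide barrier command instead of this local fence.
	 */
	__sync_synchronize();
	return 0;
}
```

The point of the asymmetry is that `rcu_read_barrier()` costs nothing at runtime, while the rare writer in `synchronize_rcu()` pays for the expensive process-wide barrier.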
^ permalink raw reply	[flat|nested] 41+ messages in thread

* [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
@ 2010-01-31 20:52 ` Mathieu Desnoyers
  2010-02-01  7:25   ` Nick Piggin
                     ` (2 more replies)
  2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2 siblings, 3 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells
  Cc: Mathieu Desnoyers

[-- Attachment #1: spinlock-mb.patch --]
[-- Type: text/plain, Size: 6346 bytes --]

Create the primitive family:

spin_lock__no_acquire
spin_unlock__no_release
spin_lock_irq__no_acquire
spin_unlock_irq__no_release

raw_spin_lock__no_acquire
raw_spin_unlock__no_release
raw_spin_lock_irq__no_acquire
raw_spin_unlock_irq__no_release

smp_acquire__after_spin_lock()
smp_release__before_spin_unlock()
smp_mb__after_spin_lock()
smp_mb__before_spin_unlock()

This family of locks makes it possible to combine the acquire/release ordering
implied by the spin lock with full memory barriers placed just after spin lock
and before spin unlock. Combining these memory barriers reduces the overhead.

The first user of these primitives is the scheduler code, where a full memory
barrier is wanted before and after runqueue data structure modifications so
these can be read safely by sys_membarrier without holding the rq lock.
Replacing the spin lock acquire/release barriers with these memory barriers
implies either no overhead (the x86 spinlock atomic instruction already implies
a full mb) or some hopefully small overhead caused by upgrading the spinlock
acquire/release barriers to the more heavyweight smp_mb().

The "generic" version of spinlock-mb.h declares both a mapping to standard
spinlocks and full memory barriers. Each architecture can specialize this
header to fit its own needs and declare CONFIG_HAVE_SPINLOCK_MB to use its own
spinlock-mb.h. This initial implementation only specializes the x86
architecture.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/spinlock-mb.h |   28 ++++++++++++++++++++++++++++
 include/asm-generic/spinlock-mb.h  |   27 +++++++++++++++++++++++++++
 include/linux/spinlock.h           |    6 ++++++
 init/Kconfig                       |    3 +++
 5 files changed, 65 insertions(+)

Index: linux-2.6-lttng/include/asm-generic/spinlock-mb.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-generic/spinlock-mb.h	2010-01-31 15:12:16.000000000 -0500
@@ -0,0 +1,27 @@
+#ifndef ASM_GENERIC_SPINLOCK_MB_H
+#define ASM_GENERIC_SPINLOCK_MB_H
+
+/*
+ * Generic spinlock-mb mappings. Use standard spinlocks with acquire/release
+ * semantics, and define the associated memory barriers as full memory barriers.
+ */
+
+#define spin_lock__no_acquire spin_lock
+#define spin_unlock__no_release spin_unlock
+
+#define spin_lock_irq__no_acquire spin_lock_irq
+#define spin_unlock_irq__no_release spin_unlock_irq
+
+#define raw_spin_lock__no_acquire raw_spin_lock
+#define raw_spin_unlock__no_release raw_spin_unlock
+
+#define raw_spin_lock_irq__no_acquire raw_spin_lock_irq
+#define raw_spin_unlock_irq__no_release raw_spin_unlock_irq
+
+#define smp_acquire__after_spin_lock() do { } while (0)
+#define smp_release__before_spin_unlock() do { } while (0)
+
+#define smp_mb__after_spin_lock() smp_mb()
+#define smp_mb__before_spin_unlock() smp_mb()
+
+#endif /* ASM_GENERIC_SPINLOCK_MB_H */

Index: linux-2.6-lttng/include/linux/spinlock.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/spinlock.h	2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/include/linux/spinlock.h	2010-01-31 15:00:40.000000000 -0500
@@ -393,4 +393,10 @@ extern int _atomic_dec_and_lock(atomic_t
 #define atomic_dec_and_lock(atomic, lock) \
 		__cond_lock(lock, _atomic_dec_and_lock(atomic, lock))
 
+#ifdef CONFIG_HAVE_SPINLOCK_MB
+# include <asm/spinlock-mb.h>
+#else
+# include <asm-generic/spinlock-mb.h>
+#endif
+
 #endif /* __LINUX_SPINLOCK_H */

Index: linux-2.6-lttng/arch/x86/include/asm/spinlock-mb.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/include/asm/spinlock-mb.h	2010-01-31 15:11:28.000000000 -0500
@@ -0,0 +1,28 @@
+#ifndef ASM_X86_SPINLOCK_MB_H
+#define ASM_X86_SPINLOCK_MB_H
+
+/*
+ * X86 spinlock-mb mappings. Use standard spinlocks with acquire/release
+ * semantics. Associated memory barriers are defined as no-ops, because the
+ * spinlock LOCK-prefixed atomic operations imply a full memory barrier.
+ */
+
+#define spin_lock__no_acquire spin_lock
+#define spin_unlock__no_release spin_unlock
+
+#define spin_lock_irq__no_acquire spin_lock_irq
+#define spin_unlock_irq__no_release spin_unlock_irq
+
+#define raw_spin_lock__no_acquire raw_spin_lock
+#define raw_spin_unlock__no_release raw_spin_unlock
+
+#define raw_spin_lock_irq__no_acquire raw_spin_lock_irq
+#define raw_spin_unlock_irq__no_release raw_spin_unlock_irq
+
+#define smp_acquire__after_spin_lock() do { } while (0)
+#define smp_release__before_spin_unlock() do { } while (0)
+
+#define smp_mb__after_spin_lock() do { } while (0)
+#define smp_mb__before_spin_unlock() do { } while (0)
+
+#endif /* ASM_X86_SPINLOCK_MB_H */

Index: linux-2.6-lttng/arch/x86/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig	2010-01-31 14:59:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig	2010-01-31 15:01:09.000000000 -0500
@@ -55,6 +55,7 @@ config X86
 	select ANON_INODES
 	select HAVE_ARCH_KMEMCHECK
 	select HAVE_USER_RETURN_NOTIFIER
+	select HAVE_SPINLOCK_MB
 
 config OUTPUT_FORMAT
 	string

Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/init/Kconfig	2010-01-31 15:00:40.000000000 -0500
@@ -320,6 +320,9 @@ config AUDIT_TREE
 	depends on AUDITSYSCALL
 	select INOTIFY
 
+config HAVE_SPINLOCK_MB
+	def_bool n
+
 menu "RCU Subsystem"
 
 choice

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread
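The intent of the combined primitives in the patch above can be sketched as a userspace analogue. This is a hypothetical illustration, not part of the patch: `rq_lock` and `switch_to` are made-up stand-ins, the test-and-set plays the role of a lock with acquire-only ordering, and an explicit fence upgrades it to a full barrier where the scheduler would need one.

```c
/* rq_lock stands in for rq->lock; __sync_lock_test_and_set gives acquire
 * ordering only, like the proposed spin_lock__no_acquire would on a weakly
 * ordered architecture. */
static volatile int rq_lock;
static int rq_curr;	/* stands in for rq->curr */

static void lock_with_full_mb(void)
{
	while (__sync_lock_test_and_set(&rq_lock, 1))
		;					/* spin_lock__no_acquire */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* smp_mb__after_spin_lock() */
}

static void unlock_with_full_mb(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* smp_mb__before_spin_unlock() */
	__sync_lock_release(&rq_lock);			/* spin_unlock__no_release */
}

/* Toy "context switch": the rq_curr update is now fenced on both sides,
 * so another thread can read it without taking rq_lock. */
static int switch_to(int next)
{
	lock_with_full_mb();
	rq_curr = next;
	unlock_with_full_mb();
	return rq_curr;
}
```

The design point is that one fence serves double duty: it completes the lock's ordering and provides the full barrier sys_membarrier-style readers would rely on, instead of paying for both separately.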
* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
@ 2010-02-01  7:25   ` Nick Piggin
  2010-02-01 14:08     ` Mathieu Desnoyers
  2010-02-01  7:28   ` Nick Piggin
  2010-02-01 15:22   ` Linus Torvalds
  2 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01  7:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> Create the primitive family:
> 
> spin_lock__no_acquire
> spin_unlock__no_release
> spin_lock_irq__no_acquire
> spin_unlock_irq__no_release
> 
> raw_spin_lock__no_acquire
> raw_spin_unlock__no_release
> raw_spin_lock_irq__no_acquire
> raw_spin_unlock_irq__no_release
> 
> smp_acquire__after_spin_lock()
> smp_release__before_spin_unlock()
> smp_mb__after_spin_lock()
> smp_mb__before_spin_unlock()

Wow, someone who likes micro optimising things as much as I do.

However, these have the wrong names.

smp_mb__after_x() means that calling that function after calling x()
will give a smp_mb(), right?

With your functions, this is giving a smp_mb() after calling
x__no_acquire().

I would suggest maybe just don't bother with the __no_acquire
__no_release variants of spin locks, and stick with the expected
semantics for the new smb_mb__xxx functions. x86 still gets the
full benefit.

But, I don't know if this is even worthwhile, given where you are
using it.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-02-01  7:25 ` Nick Piggin
@ 2010-02-01 14:08   ` Mathieu Desnoyers
  0 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 14:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Nick Piggin (npiggin@suse.de) wrote:
> On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> > Create the primitive family:
> > 
> > spin_lock__no_acquire
> > spin_unlock__no_release
> > spin_lock_irq__no_acquire
> > spin_unlock_irq__no_release
> > 
> > raw_spin_lock__no_acquire
> > raw_spin_unlock__no_release
> > raw_spin_lock_irq__no_acquire
> > raw_spin_unlock_irq__no_release
> > 
> > smp_acquire__after_spin_lock()
> > smp_release__before_spin_unlock()
> > smp_mb__after_spin_lock()
> > smp_mb__before_spin_unlock()
> 
> Wow, someone who likes micro optimising things as much as I do.

:-)

> However, these have the wrong names.
> 
> smp_mb__after_x() means that calling that function after calling x()
> will give a smp_mb(), right?

Supposed to.

> With your functions, this is giving a smp_mb() after calling
> x__no_acquire().

Good point.

> I would suggest maybe just don't bother with the __no_acquire
> __no_release variants of spin locks, and stick with the expected
> semantics for the new smb_mb__xxx functions. x86 still gets the
> full benefit.

Well, most other architectures will get a gain by modifying the spin
lock/unlock itself and adding the full memory barriers separately. x86 is a
special case here.

> But, I don't know if this is even worthwhile, given where you are
> using it.

Given that it's used to modify the scheduler fast path, I try to keep it as
fast as possible.

But as you point out later in the thread, we might just consider taking all
the rq locks one after another when issuing a sys_membarrier() instead. I did
these modifications mostly to please Peter, who is concerned about the impact
of taking these rq locks very frequently in a DoS scenario (which you
appropriately point out would not be the first such case in the kernel, and
would actually generate a RoS rather than a DoS).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
  2010-02-01  7:25   ` Nick Piggin
@ 2010-02-01  7:28   ` Nick Piggin
  2010-02-01 14:10     ` Mathieu Desnoyers
  2010-02-01 15:22   ` Linus Torvalds
  2 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01  7:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> +/*
> + * X86 spinlock-mb mappings. Use standard spinlocks with acquire/release
> + * semantics. Associated memory barriers are defined as no-ops, because the
> + * spinlock LOCK-prefixed atomic operations imply a full memory barrier.
> + */
> +
> +#define spin_lock__no_acquire spin_lock
> +#define spin_unlock__no_release spin_unlock
> +
> +#define spin_lock_irq__no_acquire spin_lock_irq
> +#define spin_unlock_irq__no_release spin_unlock_irq
> +
> +#define raw_spin_lock__no_acquire raw_spin_lock
> +#define raw_spin_unlock__no_release raw_spin_unlock
> +
> +#define raw_spin_lock_irq__no_acquire raw_spin_lock_irq
> +#define raw_spin_unlock_irq__no_release raw_spin_unlock_irq
> +
> +#define smp_acquire__after_spin_lock() do { } while (0)
> +#define smp_release__before_spin_unlock() do { } while (0)
> +
> +#define smp_mb__after_spin_lock() do { } while (0)
> +#define smp_mb__before_spin_unlock() do { } while (0)

Oh, and that one's wrong. loads can pass spin_unlock on x86 so it
needs to be smp_mb().

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-02-01  7:28 ` Nick Piggin
@ 2010-02-01 14:10   ` Mathieu Desnoyers
  0 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 14:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Nick Piggin (npiggin@suse.de) wrote:
> On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> > +/*
> > + * X86 spinlock-mb mappings. Use standard spinlocks with acquire/release
> > + * semantics. Associated memory barriers are defined as no-ops, because the
> > + * spinlock LOCK-prefixed atomic operations imply a full memory barrier.
> > + */
> > +
> > +#define spin_lock__no_acquire spin_lock
> > +#define spin_unlock__no_release spin_unlock
> > +
> > +#define spin_lock_irq__no_acquire spin_lock_irq
> > +#define spin_unlock_irq__no_release spin_unlock_irq
> > +
> > +#define raw_spin_lock__no_acquire raw_spin_lock
> > +#define raw_spin_unlock__no_release raw_spin_unlock
> > +
> > +#define raw_spin_lock_irq__no_acquire raw_spin_lock_irq
> > +#define raw_spin_unlock_irq__no_release raw_spin_unlock_irq
> > +
> > +#define smp_acquire__after_spin_lock() do { } while (0)
> > +#define smp_release__before_spin_unlock() do { } while (0)
> > +
> > +#define smp_mb__after_spin_lock() do { } while (0)
> > +#define smp_mb__before_spin_unlock() do { } while (0)
> 
> Oh, and that one's wrong. loads can pass spin_unlock on x86 so it
> needs to be smp_mb().

Good catch !

#if defined(CONFIG_X86_32) && \
	(defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
/*
 * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
 * (PPro errata 66, 92)
 */
# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
# define UNLOCK_LOCK_PREFIX
#endif

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread
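Putting Nick's correction together, the x86 mapping would presumably become asymmetric: the no-op after the lock stays, but the barrier before the unlock must be real. The sketch below is an assumption based on this review, not a merged fix; `smp_mb()` is modeled with `__sync_synchronize()` so it compiles in userspace.

```c
#define smp_mb() __sync_synchronize()	/* userspace stand-in for smp_mb() */

/* The LOCK-prefixed atomic that acquires an x86 spin lock is already a
 * full barrier, so nothing extra is needed after taking the lock. */
#define smp_mb__after_spin_lock()	do { } while (0)

/* x86 spin_unlock is a plain store: later loads may be satisfied before
 * the unlocking store becomes globally visible, so a real fence is
 * required here, per Nick's observation. */
#define smp_mb__before_spin_unlock()	smp_mb()

static int barrier_demo(void)
{
	smp_mb__after_spin_lock();	/* compiles away */
	smp_mb__before_spin_unlock();	/* emits a full fence */
	return 1;
}
```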
* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
  2010-02-01  7:25   ` Nick Piggin
  2010-02-01  7:28   ` Nick Piggin
@ 2010-02-01 15:22   ` Linus Torvalds
  2010-02-01 15:41     ` Steven Rostedt
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 15:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, 31 Jan 2010, Mathieu Desnoyers wrote:
> 
> Create the primitive family:
> 
> spin_lock__no_acquire
> spin_unlock__no_release
> spin_lock_irq__no_acquire
> spin_unlock_irq__no_release

I really really hate this patch.

Locking is really hard to get right anyway, you already had one bug in
this, and on the major architectures, none of this will matter _anyway_,
since you cannot split the acquire from the lock and the release from
the unlock.

So the only operation that actually makes any sense at all would be
"smp_mb__after_spin_lock" (and no new spinlock primitives at all), since
it can be optimized away on x86 (and then add "smp_mb__before_spin_unlock"
just to make them symmetric).

But even that one is very dubious:

  "The first user of these primitives is the scheduler code, where a full
   memory barrier is wanted before and after runqueue data structure
   modifications so these can be read safely by sys_membarrier without
   holding the rq lock."

what kind of crazy model is this? That sounds insane. Locking together
with some new random optimistic usage that we don't even know how
performance-critical it is? Mixing locking and non-locking is simply
wrong. Why would it be any safer to read whatever it is that the lock
protects by adding smp barriers at the lock?

If you need other smp barriers at the lock, then what about the non-locked
accesses _while_ the lock is held? You get no ordering guarantees there.
The whole thing sounds highly dubious.

And all of this for something that is a new system call that nobody
actually uses? To optimize the new and experimental path with some insane
lockfree model, while making the core kernel more complex? A _very_
strong NAK from me.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-02-01 15:22 ` Linus Torvalds
@ 2010-02-01 15:41   ` Steven Rostedt
  0 siblings, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 07:22 -0800, Linus Torvalds wrote:
> 
> If you need other smp barriers at the lock, then what about the non-locked
> accesses _while_ the lock is held? You get no ordering guarantees there.
> The whole thing sounds highly dubious.

The issue is not about protecting data, it was all about ordering an
update of a variable (mm_cpumask) with respect to scheduling. The lock
was just a convenient place to add this protection. The memory barriers
here would allow the syscall to use memory barriers instead of locks.

> 
> And all of this for something that is a new system call that nobody
> actually uses? To optimize the new and experimental path with some insane
> lockfree model, while making the core kernel more complex? A _very_
> strong NAK from me.

I totally agree with this. The updates here came from the fear that
grabbing all rq spinlocks (one at a time) from a syscall would open up a
DoS (or, as Nick said, a RoS - Reduction of Service). If someone called
this syscall within a while(1) loop on some large # CPU box, it could
cause cache thrashing.

But this is all being paranoid, and not worth the complexity in the core
scheduler. We don't even know if this fear is founded.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
@ 2010-01-31 20:52 ` Mathieu Desnoyers
  2010-02-01  7:33   ` Nick Piggin
  2010-02-01 15:27   ` Linus Torvalds
  2010-01-31 20:52 ` [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2 siblings, 2 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells
  Cc: Mathieu Desnoyers

[-- Attachment #1: scheduler-use-spinlock-mb.patch --]
[-- Type: text/plain, Size: 2950 bytes --]

Depends on:
"Create spin lock/spin unlock with distinct memory barrier"

A full memory barrier is wanted before and after runqueue data structure
modifications so these can be read safely by sys_membarrier without holding the
rq lock.

Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
lock/unlock already imply a full memory barrier. Combines the spin lock
acquire/release barriers with the full memory barrier to diminish the
performance impact on other architectures. (per-architecture spinlock-mb.h
should be gradually implemented to replace the generic version)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 kernel/sched.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-31 15:09:51.000000000 -0500
@@ -893,7 +893,12 @@ static inline void finish_lock_switch(st
 	 */
 	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
 
-	raw_spin_unlock_irq(&rq->lock);
+	/*
+	 * Order mm_cpumask and rq->curr updates before following memory
+	 * accesses. Required by sys_membarrier().
+	 */
+	smp_mb__before_spin_unlock();
+	raw_spin_unlock_irq__no_release(&rq->lock);
 }
 
 #else /* __ARCH_WANT_UNLOCKED_CTXSW */
@@ -916,10 +921,15 @@ static inline void prepare_lock_switch(s
 	 */
 	next->oncpu = 1;
 #endif
+	/*
+	 * Order mm_cpumask and rq->curr updates before following memory
+	 * accesses. Required by sys_membarrier().
+	 */
+	smp_mb__before_spin_unlock();
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq__no_release(&rq->lock);
 #else
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock__no_release(&rq->lock);
 #endif
 }
 
@@ -5490,7 +5500,13 @@ need_resched_nonpreemptible:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq__no_acquire(&rq->lock);
+	/*
+	 * Order memory accesses before mm_cpumask and rq->curr updates.
+	 * Required by sys_membarrier() when prev != next. We only learn about
+	 * next later, so we issue this mb() unconditionally.
+	 */
+	smp_mb__after_spin_lock();
 	update_rq_clock(rq);
 	clear_tsk_need_resched(prev);

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
@ 2010-02-01  7:33   ` Nick Piggin
  2010-02-01  9:42     ` Peter Zijlstra
  2010-02-01 15:27   ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01  7:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, Jan 31, 2010 at 03:52:56PM -0500, Mathieu Desnoyers wrote:
> Depends on:
> "Create spin lock/spin unlock with distinct memory barrier"
> 
> A full memory barrier is wanted before and after runqueue data structure
> modifications so these can be read safely by sys_membarrier without holding the
> rq lock.
> 
> Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> lock/unlock already imply a full memory barrier. Combines the spin lock
> acquire/release barriers with the full memory barrier to diminish the
> performance impact on other architectures. (per-architecture spinlock-mb.h
> should be gradually implemented to replace the generic version)

It does add overhead on x86, as well as most other architectures.

This really seems like the wrong optimisation to make, especially
given that there's not likely to be much using librcu yet, right?

I'd go with the simpler and safer version of sys_membarrier that does
not do tricky synchronisation or add overhead to the ctxsw fastpath.
Then if you see some actual improvement in a real program using librcu
one day we can discuss making it faster.

As it is right now, the change will definitely slow down everybody
not using librcu (ie. nearly everything).

> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Nicholas Miell <nmiell@comcast.net>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: mingo@elte.hu
> CC: laijs@cn.fujitsu.com
> CC: dipankar@in.ibm.com
> CC: akpm@linux-foundation.org
> CC: josh@joshtriplett.org
> CC: dvhltc@us.ibm.com
> CC: niv@us.ibm.com
> CC: tglx@linutronix.de
> CC: peterz@infradead.org
> CC: Valdis.Kletnieks@vt.edu
> CC: dhowells@redhat.com
> ---
>  kernel/sched.c |   24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-31 14:59:42.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-31 15:09:51.000000000 -0500
> @@ -893,7 +893,12 @@ static inline void finish_lock_switch(st
>  	 */
>  	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
>  
> -	raw_spin_unlock_irq(&rq->lock);
> +	/*
> +	 * Order mm_cpumask and rq->curr updates before following memory
> +	 * accesses. Required by sys_membarrier().
> +	 */
> +	smp_mb__before_spin_unlock();
> +	raw_spin_unlock_irq__no_release(&rq->lock);
>  }
>  
>  #else /* __ARCH_WANT_UNLOCKED_CTXSW */
> @@ -916,10 +921,15 @@ static inline void prepare_lock_switch(s
>  	 */
>  	next->oncpu = 1;
>  #endif
> +	/*
> +	 * Order mm_cpumask and rq->curr updates before following memory
> +	 * accesses. Required by sys_membarrier().
> +	 */
> +	smp_mb__before_spin_unlock();
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> -	raw_spin_unlock_irq(&rq->lock);
> +	raw_spin_unlock_irq__no_release(&rq->lock);
> #else
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock__no_release(&rq->lock);
> #endif
>  }
>  
> @@ -5490,7 +5500,13 @@ need_resched_nonpreemptible:
>  	if (sched_feat(HRTICK))
>  		hrtick_clear(rq);
>  
> -	raw_spin_lock_irq(&rq->lock);
> +	raw_spin_lock_irq__no_acquire(&rq->lock);
> +	/*
> +	 * Order memory accesses before mm_cpumask and rq->curr updates.
> +	 * Required by sys_membarrier() when prev != next. We only learn about
> +	 * next later, so we issue this mb() unconditionally.
> +	 */
> +	smp_mb__after_spin_lock();
>  	update_rq_clock(rq);
>  	clear_tsk_need_resched(prev);
> 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 41+ messages in thread
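The "simpler and safer" variant Nick refers to can be modeled with a toy userspace sketch. Everything here is hypothetical stand-in code, not the actual patch 3/3: `cpu_curr_mm` plays the role of each runqueue's `curr->mm` (checked under `rq->lock` in the real kernel), and `ipi_smp_mb()` plays the role of an IPI whose handler executes `smp_mb()`. The point is the one-CPU-at-a-time scan, which avoids the lockless fast-path tricks entirely.

```c
#define NR_CPUS 4

/* Toy model: the mm each cpu is currently running (0 = some other mm). */
static int cpu_curr_mm[NR_CPUS];

static int ipis_sent;

/* Stand-in for sending an IPI that runs smp_mb() on the target cpu. */
static void ipi_smp_mb(int cpu)
{
	(void)cpu;
	__sync_synchronize();
	ipis_sent++;
}

/*
 * Sketch of the 'safe' sys_membarrier: walk the cpus one at a time and
 * interrupt only those currently running a thread of the calling process.
 * Returns the number of IPIs issued.
 */
static int toy_membarrier(int my_mm)
{
	int cpu;

	ipis_sent = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		/* In the kernel, this check happens under cpu's rq->lock. */
		if (cpu_curr_mm[cpu] == my_mm)
			ipi_smp_mb(cpu);
	}
	return ipis_sent;
}
```

Because only one rq lock would be held at a time, the worst a hostile `while (1) sys_membarrier()` loop does is serialize against each runqueue briefly, which is the DoS-versus-complexity trade-off the rest of the thread argues about.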
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01  7:33 ` Nick Piggin
@ 2010-02-01  9:42   ` Peter Zijlstra
  2010-02-01 10:11     ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2010-02-01  9:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > lock/unlock already imply a full memory barrier. Combines the spin lock
> > acquire/release barriers with the full memory barrier to diminish the
> > performance impact on other architectures. (per-architecture spinlock-mb.h
> > should be gradually implemented to replace the generic version)
> 
> It does add overhead on x86, as well as most other architectures.
> 
> This really seems like the wrong optimisation to make, especially
> given that there's not likely to be much using librcu yet, right?
> 
> I'd go with the simpler and safer version of sys_membarrier that does
> not do tricky synchronisation or add overhead to the ctxsw fastpath.
> Then if you see some actual improvement in a real program using librcu
> one day we can discuss making it faster.
> 
> As it is right now, the change will definitely slow down everybody
> not using librcu (ie. nearly everything).

Right, so the problem with the 'slow'/'safe' version is that it takes
rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
a quite effective DoS.

Now, I'm not quite charmed by all this. Esp. this patch seems wrong. The
fact is on x86 we have all the required membarriers in place.

There's a number of LOCK insns before we set rq->curr and we have them
after. Adding more, like this patch effectively does
(smp_mb__{before,after}_unlock should be a full mb as Nick pointed out)
doesn't seem like a good idea at all.

And then there's !x86 to consider.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01  9:42 ` Peter Zijlstra
@ 2010-02-01 10:11   ` Nick Piggin
  2010-02-01 10:36     ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01 10:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, Feb 01, 2010 at 10:42:30AM +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > > lock/unlock already imply a full memory barrier. Combines the spin lock
> > > acquire/release barriers with the full memory barrier to diminish the
> > > performance impact on other architectures. (per-architecture spinlock-mb.h
> > > should be gradually implemented to replace the generic version)
> > 
> > It does add overhead on x86, as well as most other architectures.
> > 
> > This really seems like the wrong optimisation to make, especially
> > given that there's not likely to be much using librcu yet, right?
> > 
> > I'd go with the simpler and safer version of sys_membarrier that does
> > not do tricky synchronisation or add overhead to the ctxsw fastpath.
> > Then if you see some actual improvement in a real program using librcu
> > one day we can discuss making it faster.
> > 
> > As it is right now, the change will definitely slow down everybody
> > not using librcu (ie. nearly everything).
> 
> Right, so the problem with the 'slow'/'safe' version is that it takes
> rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
> a quite effective DoS.

All, but one at a time, no?

How much of a DoS really is taking these locks for a handful of
cycles each, per syscall?

I mean, we have LOTS of syscalls that take locks, and for a lot longer
(look at dcache_lock). I think we basically just have to say that
locking primitives should be somewhat fair, and not be held for too
long, and then it should more or less work.

If the locks are getting contended, then the threads calling
sys_membarrier are going to be spinning longer too, using more CPU
time, and will get scheduled away...

If there is some particular problem on -rt because of the rq locks,
then I guess you could consider whether to add more overhead to your
ctxsw path to reduce the problem, or simply not support sys_membarrier
for unprived users in the first place.

> Now, I'm not quite charmed by all this. Esp. this patch seems wrong. The
> fact is on x86 we have all the required membarriers in place.
> 
> There's a number of LOCK insns before we set rq->curr and we have them
> after. Adding more, like this patch effectively does
> (smp_mb__{before,after}_unlock should be a full mb as Nick pointed out)
> doesn't seem like a good idea at all.
> 
> And then there's !x86 to consider.

Yep.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 10:11 ` Nick Piggin @ 2010-02-01 10:36 ` Peter Zijlstra 2010-02-01 10:49 ` Nick Piggin 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2010-02-01 10:36 UTC (permalink / raw) To: Nick Piggin Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote: > On Mon, Feb 01, 2010 at 10:42:30AM +0100, Peter Zijlstra wrote: > > On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote: > > > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin > > > > lock/unlock already imply a full memory barrier. Combines the spin lock > > > > acquire/release barriers with the full memory barrier to diminish the > > > > performance impact on other architectures. (per-architecture spinlock-mb.h > > > > should be gradually implemented to replace the generic version) > > > > > > It does add overhead on x86, as well as most other architectures. > > > > > > This really seems like the wrong optimisation to make, especially > > > given that there's not likely to be much using librcu yet, right? > > > > > > I'd go with the simpler and safer version of sys_membarrier that does > > > not do tricky synchronisation or add overhead to the ctxsw fastpath. > > > Then if you see some actual improvement in a real program using librcu > > > one day we can discuss making it faster. > > > > > > As it is right now, the change will definitely slow down everybody > > > not using librcu (ie. nearly everything). > > > > Right, so the problem with the 'slow'/'safe' version is that it takes > > rq->lock for all relevant rqs. This renders while (1) sys_membarrier() > > in a quite effective DoS. > > All, but one at a time, no? 
How much of a DoS really is taking these > locks for a handful of cycles each, per syscall? I was more worrying about the cacheline thrashing than lock hold times there. > I mean, we have LOTS of syscalls that take locks, and for a lot longer, > (look at dcache_lock). Yeah, and dcache is a massive pain, isn't it ;-) > I think we basically just have to say that locking primitives should be > somewhat fair, and not be held for too long, it should more or less > work. Sure, it'll more or less work, but he's basically making rq->lock a global lock instead of a per-cpu lock. > If the locks are getting contended, then the threads calling > sys_membarrier are going to be spinning longer too, using more CPU time, > and will get scheduled away... Sure, and increased spinning reduces the total throughput. > If there is some particular problem on -rt because of the rq locks, > then I guess you could consider whether to add more overhead to your > ctxsw path to reduce the problem, or simply not support sys_membarrier > for unprived users in the first place. Right, for -rt we might need to do that, but it's just that rq->lock is a very hot lock, and adding basically unlimited thrashing to it didn't seem like a good idea. Also, I'm thinking making it a priv syscall basically renders it useless for Mathieu. Anyway, it might be I'm just paranoid... but archs with large core count and lazy tlb flush seem particularly vulnerable. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 10:36 ` Peter Zijlstra @ 2010-02-01 10:49 ` Nick Piggin 2010-02-01 14:47 ` Mathieu Desnoyers 0 siblings, 1 reply; 41+ messages in thread From: Nick Piggin @ 2010-02-01 10:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells On Mon, Feb 01, 2010 at 11:36:01AM +0100, Peter Zijlstra wrote: > On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote: > > All, but one at a time, no? How much of a DoS really is taking these > > locks for a handful of cycles each, per syscall? > > I was more worrying about the cacheline trashing than lock hold times > there. Well, same issue really. Look at all the unprived files in /proc for example that can look through all per-cpu cachelines. It just takes a single read syscall to do a lot of them too. > > I mean, we have LOTS of syscalls that take locks, and for a lot longer, > > (look at dcache_lock). > > Yeah, and dcache is a massive pain, isn't it ;-) My point is, I don't think it is something we can realistically care much about and it is nowhere near a new or unique problem being added by this one patch. It is really a RoS, reduction of service, rather than a DoS. And any time we allow an unpriv user on our system, we have RoS potential :) > > I think we basically just have to say that locking primitives should be > > somewhat fair, and not be held for too long, it should more or less > > work. > > Sure, it'll more of less work, but he's basically making rq->lock a > global lock instead of a per-cpu lock. > > > If the locks are getting contended, then the threads calling > > sys_membarrier are going to be spinning longer too, using more CPU time, > > and will get scheduled away... 
> > Sure, and increased spinning reduces the total throughput. > > > If there is some particular problem on -rt because of the rq locks, > > then I guess you could consider whether to add more overhead to your > > ctxsw path to reduce the problem, or simply not support sys_membarrier > > for unprived users in the first place. > > Right, for -rt we might need to do that, but its just that rq->lock is a > very hot lock, and adding basically unlimited trashing to it didn't seem > like a good idea. > > Also, I'm thinking making it a priv syscall basically renders it useless > for Mathieu. Well I just mean that it's something for -rt to work out. Apps can still work if the call is unsupported completely. > Anyway, it might be I'm just paranoid... but archs with large core count > and lazy tlb flush seem particularly vulnerable. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 10:49 ` Nick Piggin @ 2010-02-01 14:47 ` Mathieu Desnoyers 2010-02-01 14:58 ` Nick Piggin 0 siblings, 1 reply; 41+ messages in thread From: Mathieu Desnoyers @ 2010-02-01 14:47 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells * Nick Piggin (npiggin@suse.de) wrote: > On Mon, Feb 01, 2010 at 11:36:01AM +0100, Peter Zijlstra wrote: > > On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote: > > > All, but one at a time, no? How much of a DoS really is taking these > > > locks for a handful of cycles each, per syscall? > > > > I was more worrying about the cacheline trashing than lock hold times > > there. > > Well, same issue really. Look at all the unprived files in /proc > for example that can look through all per-cpu cachelines. It just > takes a single read syscall to do a lot of them too. > > > > > I mean, we have LOTS of syscalls that take locks, and for a lot longer, > > > (look at dcache_lock). > > > > Yeah, and dcache is a massive pain, isn't it ;-) > > My point is, I don't think it is something we can realistically > care much about and it is nowhere near a new or unique problem > being added by this one patch. > > It is really a RoS, reduction of service, rather than a DoS. And > any time we allow an unpriv user on our system, we have RoS potential :) > > > > > I think we basically just have to say that locking primitives should be > > > somewhat fair, and not be held for too long, it should more or less > > > work. > > > > Sure, it'll more of less work, but he's basically making rq->lock a > > global lock instead of a per-cpu lock. 
> > > > > If the locks are getting contended, then the threads calling > > > sys_membarrier are going to be spinning longer too, using more CPU time, > > > and will get scheduled away... > > > > Sure, and increased spinning reduces the total throughput. > > > > > If there is some particular problem on -rt because of the rq locks, > > > then I guess you could consider whether to add more overhead to your > > > ctxsw path to reduce the problem, or simply not support sys_membarrier > > > for unprived users in the first place. > > > > Right, for -rt we might need to do that, but its just that rq->lock is a > > very hot lock, and adding basically unlimited trashing to it didn't seem > > like a good idea. > > > > Also, I'm thinking making it a priv syscall basically renders it useless > > for Mathieu. > > Well I just mean that it's something for -rt to work out. Apps can > still work if the call is unsupported completely. OK, so we seem to be settling for the spinlock-based sys_membarrier() this time, which is much less intrusive in terms of scheduler fast path modification, but adds more system overhead each time sys_membarrier() is called. This trade-off makes sense to me, as we expect the scheduler to execute _much_ more often than sys_membarrier(). When I get confirmation that's the route to follow from both of you, I'll go back to the spinlock-based scheme for v9. Thanks, Mathieu > > > > Anyway, it might be I'm just paranoid... but archs with large core count > > and lazy tlb flush seem particularly vulnerable. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 14:47 ` Mathieu Desnoyers @ 2010-02-01 14:58 ` Nick Piggin 2010-02-01 15:23 ` Steven Rostedt 0 siblings, 1 reply; 41+ messages in thread From: Nick Piggin @ 2010-02-01 14:58 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote: > * Nick Piggin (npiggin@suse.de) wrote: > > Well I just mean that it's something for -rt to work out. Apps can > > still work if the call is unsupported completely. > > OK, so we seem to be settling for the spinlock-based sys_membarrier() > this time, which is much less intrusive in terms of scheduler > fast path modification, but adds more system overhead each time > sys_membarrier() is called. This trade-off makes sense to me, as we > expect the scheduler to execute _much_ more often than sys_membarrier(). > > When I get confirmation that's the route to follow from both of you, > I'll go back to the spinlock-based scheme for v9. I think locking or cacheline bouncing DoS is just something we can't realistically worry too much about in the standard kernel. No further than just generally good practice of good scalability, avoiding starvations and long lock hold times etc. So I would prefer the simpler version that doesn't add overhead to ctxsw, at least for the first implementation. Thanks, Nick ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 14:58 ` Nick Piggin @ 2010-02-01 15:23 ` Steven Rostedt 2010-02-01 15:44 ` Steven Rostedt 2010-02-01 16:00 ` Mike Galbraith 0 siblings, 2 replies; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 15:23 UTC (permalink / raw) To: Nick Piggin Cc: Mathieu Desnoyers, Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote: > On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote: > > * Nick Piggin (npiggin@suse.de) wrote: > > > Well I just mean that it's something for -rt to work out. Apps can > > > still work if the call is unsupported completely. > > > > OK, so we seem to be settling for the spinlock-based sys_membarrier() > > this time, which is much less intrusive in terms of scheduler > > fast path modification, but adds more system overhead each time > > sys_membarrier() is called. This trade-off makes sense to me, as we > > expect the scheduler to execute _much_ more often than sys_membarrier(). > > > > When I get confirmation that's the route to follow from both of you, > > I'll go back to the spinlock-based scheme for v9. > > I think locking or cacheline bouncing DoS is just something we can't > realistically worry too much about in the standard kernel. No further > than just generally good practice of good scalability, avoiding > starvations and long lock hold times etc. > > So I would prefer the simpler version that doesn't add overhead to > ctxsw, at least for the first implementation. Acked-by: Steven Rostedt <rostedt@goodmis.org> ;-) -- Steve ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 15:23 ` Steven Rostedt @ 2010-02-01 15:44 ` Steven Rostedt 2010-02-01 16:00 ` Mike Galbraith 1 sibling, 0 replies; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 15:44 UTC (permalink / raw) To: Nick Piggin Cc: Mathieu Desnoyers, Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells On Mon, 2010-02-01 at 10:23 -0500, Steven Rostedt wrote: > On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote: > > On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote: > > > * Nick Piggin (npiggin@suse.de) wrote: > > > > Well I just mean that it's something for -rt to work out. Apps can > > > > still work if the call is unsupported completely. > > > > > > OK, so we seem to be settling for the spinlock-based sys_membarrier() > > > this time, which is much less intrusive in terms of scheduler > > > fast path modification, but adds more system overhead each time > > > sys_membarrier() is called. This trade-off makes sense to me, as we > > > expect the scheduler to execute _much_ more often than sys_membarrier(). > > > > > > When I get confirmation that's the route to follow from both of you, > > > I'll go back to the spinlock-based scheme for v9. > > > > I think locking or cacheline bouncing DoS is just something we can't > > realistically worry too much about in the standard kernel. No further > > than just generally good practice of good scalability, avoiding > > starvations and long lock hold times etc. > > > > So I would prefer the simpler version that doesn't add overhead to > > ctxsw, at least for the first implementation. > > Acked-by: Steven Rostedt <rostedt@goodmis.org> I'm in agreement with Nick on this issue. Peter, I think you are being a bit paranoid here. But that's a good thing. 
Anyone working in the scheduler must be paranoid. That's what brings out issues that people normally do not think about. :-) -- Steve ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 15:23 ` Steven Rostedt 2010-02-01 15:44 ` Steven Rostedt @ 2010-02-01 16:00 ` Mike Galbraith 1 sibling, 0 replies; 41+ messages in thread From: Mike Galbraith @ 2010-02-01 16:00 UTC (permalink / raw) To: rostedt Cc: Nick Piggin, Mathieu Desnoyers, Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells On Mon, 2010-02-01 at 10:23 -0500, Steven Rostedt wrote: > On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote: > > So I would prefer the simpler version that doesn't add overhead to > > ctxsw, at least for the first implementation. > > Acked-by: Steven Rostedt <rostedt@goodmis.org> > > ;-) Yup, fast path needs to _lose_ some weight. -Mike ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers 2010-02-01 7:33 ` Nick Piggin @ 2010-02-01 15:27 ` Linus Torvalds 2010-02-01 16:09 ` Mathieu Desnoyers 2010-02-01 16:11 ` Steven Rostedt 1 sibling, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2010-02-01 15:27 UTC (permalink / raw) To: Mathieu Desnoyers Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Sun, 31 Jan 2010, Mathieu Desnoyers wrote: > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin > lock/unlock already imply a full memory barrier. .. and as Nick pointed out, you're fundamentally incorrect on this. unlock on x86 is no memory barrier at all, since the x86 memory ordering rules are such that a regular store always has release consistency. But more importantly, you don't even explain why the added smp_mb() helps. Why does a smp_mb() at the lock/unlock even matter? Reading accesses by the same CPU sure as hell do _not_ matter, so the whole concept seems totally broken. There is no way in _hell_ that whatever unlocked thing can ever write the variables protected by the lock, only read them. So a full memory barrier makes zero sense to begin with. So what are these magical memory barriers all about? Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 15:27 ` Linus Torvalds @ 2010-02-01 16:09 ` Mathieu Desnoyers 2010-02-01 16:23 ` Linus Torvalds 2010-02-01 16:24 ` Steven Rostedt 2010-02-01 16:11 ` Steven Rostedt 1 sibling, 2 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2010-02-01 16:09 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Sun, 31 Jan 2010, Mathieu Desnoyers wrote: > > > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin > > lock/unlock already imply a full memory barrier. > > .. and as Nick pointed out, you're fundamentally incorrect on this. > unlock on x86 is no memory barrier at all, since the x86 memory ordering > rules are such that a regular store always has release consistency. > > But more importantly, you don't even explain why the addded smp_mb() > helps. > > Why does a smp_mb() at the lock/unlock even matter? Reading accesses by > the same CPU sure as hell do _not_ matter, so the whole concept seems > totally broken. There is no way in _hell_ that whatever unlocked thing > can ever write the variables protected by the lock, only read them. So a > full memory barrier makes zero sense to begin with. > > So what are these magical memory barriers all about? 
>
> 		Linus

To show the ordering required by sys_membarrier, let's first express sys_membarrier() succinctly:

	smp_mb();
	rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
	for_each_cpu(cpu, mm_cpumask(current->mm))
		if (current->mm == cpu_curr(cpu)->mm)
			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
	rcu_read_unlock();
	smp_mb();

We need to guarantee that between the two memory barriers issued by sys_membarrier(), we run a membarrier_ipi on every other running thread belonging to the same process.

Where we have to be careful here is that the context switch does not imply memory barriers both before and after:

- mm_cpumask update
- rq->curr update

Not having memory barriers between user-space instruction execution and these updates performed by the scheduler leaves room for a race between sys_membarrier() and the scheduler, where we would not deliver a membarrier_ipi to a thread which either starts running or is about to be context switched.

One option is to surround the context switch by memory barriers. If we choose not to do that, we are left with these solutions:

We can deal with the rq->curr update by holding the rq lock in each iteration of the for_each_cpu(cpu, mm_cpumask(current->mm)) loop. This ensures that if rq->curr is updated, we have an associated memory barrier issued (e.g. on x86, implied by writing to cr3 while the rq lock is held).

However, this does not deal with the mm_cpumask update, and we cannot use the per-cpu rq lock, as it's a process-wide data structure updated with clear_bit/set_bit in switch_mm(). So at the very least, we would have to add memory barriers in switch_mm() on some architectures to deal with this.

Thanks,

Mathieu

-- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:09 ` Mathieu Desnoyers @ 2010-02-01 16:23 ` Linus Torvalds 2010-02-01 16:48 ` Mathieu Desnoyers 2010-02-01 16:24 ` Steven Rostedt 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2010-02-01 16:23 UTC (permalink / raw) To: Mathieu Desnoyers Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Mon, 1 Feb 2010, Mathieu Desnoyers wrote: > > However, this does not deal with mm_cpumask update, and we cannot use > the per-cpu rq lock, as it's a process-wide data structure updated with > clear_bit/set_bit in switch_mm(). So at the very least, we would have to > add memory barriers in switch_mm() on some architectures to deal with > this. I'd much rather have a "switch_mm()" is a guaranteed memory barrier logic, because quite frankly, I don't see how it ever couldn't be one anyway. It fundamentally needs to do at least a TLB context switch (which may be just switching an ASI around, not flushing the whole TLB, of course), and I bet that for 99% of all architectures, that is already pretty much guaranteed to be equivalent to a memory barrier. It certainly is for x86. "mov to cr0" is serializing (setting any control register except cr8 is serializing). And I strongly suspect other architectures will be too. Btw, one reason to strongly prefer "switch_mm()" over any random context switch is that at least it won't affect inter-thread (kernel or user-land) switching, including switching to/from the idle thread. So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always implies a memory barrier" model than to playing clever games with spinlocks. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:23 ` Linus Torvalds @ 2010-02-01 16:48 ` Mathieu Desnoyers 2010-02-01 16:56 ` Linus Torvalds 2010-02-01 17:13 ` Steven Rostedt 0 siblings, 2 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2010-02-01 16:48 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 1 Feb 2010, Mathieu Desnoyers wrote: > > > > However, this does not deal with mm_cpumask update, and we cannot use > > the per-cpu rq lock, as it's a process-wide data structure updated with > > clear_bit/set_bit in switch_mm(). So at the very least, we would have to > > add memory barriers in switch_mm() on some architectures to deal with > > this. > > I'd much rather have a "switch_mm()" is a guaranteed memory barrier logic, > because quite frankly, I don't see how it ever couldn't be one anyway. It > fundamentally needs to do at least a TLB context switch (which may be just > switching an ASI around, not flushing the whole TLB, of course), and I bet > that for 99% of all architectures, that is already pretty much guaranteed > to be equivalent to a memory barrier. > > It certainly is for x86. "mov to cr0" is serializing (setting any control > register except cr8 is serializing). And I strongly suspect other > architectures will be too. What we have to be careful about here is that it's not enough to just rely on switch_mm() containing a memory barrier. What we really need to enforce is that switch_mm() issues memory barriers both _before_ and _after_ mm_cpumask modification. The "after" part is usually dealt with by the TLB context switch, but the "before" part usually isn't. 
> > Btw, one reason to strongly prefer "switch_mm()" over any random context > switch is that at least it won't affect inter-thread (kernel or user-land) > switching, including switching to/from the idle thread. > > So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always > implies a memory barrier" model than to playing clever games with > spinlocks. If we really want to make this patch less intrusive, we can consider iterating on each online cpu in sys_membarrier() rather than on the mm_cpumask. But it comes at the cost of useless cache-line bouncing on large machines with few threads running in the process, as we would grab the rq locks one by one for all cpus. Thanks, Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:48 ` Mathieu Desnoyers @ 2010-02-01 16:56 ` Linus Torvalds 2010-02-01 17:45 ` Mathieu Desnoyers 2010-02-01 17:13 ` Steven Rostedt 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2010-02-01 16:56 UTC (permalink / raw) To: Mathieu Desnoyers Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Mon, 1 Feb 2010, Mathieu Desnoyers wrote: > > What we have to be careful about here is that it's not enough to just > rely on switch_mm() containing a memory barrier. What we really need to > enforce is that switch_mm() issues memory barriers both _before_ and > _after_ mm_cpumask modification. The "after" part is usually dealt with > by the TLB context switch, but the "before" part usually isn't. Ok, whatever. I vote for not doing anything at all, because this just all sounds like some totally crazy crap. You haven't explained the actual races, you haven't explained anything at all, you're apparently just randomly sprinkling smp_mb's around until you think it's all fine. Show the actual memory ordering constraints as it is related to the kernel data structures. I'm totally uninterested in general handwaving and "we must have smp_mb's here and here" without very explicit explanations of exactly WHAT the memory orderings are. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:56 ` Linus Torvalds @ 2010-02-01 17:45 ` Mathieu Desnoyers 2010-02-01 18:00 ` Steven Rostedt 2010-02-01 18:36 ` Linus Torvalds 0 siblings, 2 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2010-02-01 17:45 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 1 Feb 2010, Mathieu Desnoyers wrote: > > > > What we have to be careful about here is that it's not enough to just > > rely on switch_mm() containing a memory barrier. What we really need to > > enforce is that switch_mm() issues memory barriers both _before_ and > > _after_ mm_cpumask modification. The "after" part is usually dealt with > > by the TLB context switch, but the "before" part usually isn't. > > Ok, whatever. I vote for not doing anything at all, because this just all > sounds like some totally crazy crap. You haven't explained the actual > races, you haven't explained anything at all, you're apparently just > randomly sprinkling smp_mb's around until you think it's all fine. > > Show the actual memory ordering constraints as it is related to the kernel > data structures. I'm totally uninterested in general handwaving and "we > must have smp_mb's here and here" without very explicit explanations of > exactly WHAT the memory orderings are. It's probably worthwhile to summarize here the mb-related discussions we had in the previous RFC rounds. Here is the detailed execution scenario showing the race. The race with unlock showing that we need to order user-space loads before mm_cpumask store. This is showing an execution sequence involving the userspace RCU library and the Linux kernel with sys_membarrier(). 
As we see below, the lack of ordering between "cpumask_clear" and user-space execution creates a race with the scheduler where sys_membarrier() incorrectly considers CPU 1 as not needing a membarrier_ipi.

    CPU 0                                   CPU 1
    --------------                          --------------
    <user space>                            <user space> (already in read-side C.S.)
                                            obj = rcu_dereference(list->next);
                                              -> load list->next
                                            copy = obj->foo;
                                            rcu_read_unlock();
                                              -> store to urcu_reader.ctr
                                            <urcu_reader.ctr store is globally visible>
    list_del(obj);
                                            <preempt>
                                            <kernel space>
                                            schedule();
                                            cpumask_clear(mm_cpumask, cpu);
    sys_membarrier();
    <kernel space>
    smp_mb()
    iterates on mm_cpumask
    smp_mb()
    <user space>
    set global g.p. (urcu_gp_ctr) phase to 1
    wait for all urcu_reader.ctr in phase 0
    set global g.p. (urcu_gp_ctr) phase to 0
    wait for all urcu_reader.ctr in phase 1
    sys_membarrier();
    <kernel space>
    smp_mb()
    iterates on mm_cpumask
    smp_mb()
    <user space>
    free(obj);
                                            <list->next load hits memory>
                                            <obj->foo load hits memory>  <- corruption

There is a similar race scenario (symmetric to the one above) between cpumask_set and a following user-space rcu_read_lock(), which I could provide upon request.

This results in the following comments around smp_mb() in sys_membarrier:

First sys_membarrier smp_mb():

    /*
     * Memory barrier on the caller thread before sending first
     * IPI. Orders all memory accesses prior to sys_membarrier() before
     * mm_cpumask read and membarrier_ipi executions.
     * Matches memory barriers before and after mm_cpumask updates.
     */

Second sys_membarrier smp_mb():

    /*
     * Memory barrier on the caller thread after we finished
     * waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
     * executions before memory accesses following sys_membarrier()
     * execution.
     * Matches memory barriers before and after mm_cpumask updates.
     */

Thanks,

Mathieu

> 
> 		Linus

-- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 17:45 ` Mathieu Desnoyers
@ 2010-02-01 18:00 ` Steven Rostedt
  2010-02-01 18:36 ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 18:00 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
    Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
    niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 12:45 -0500, Mathieu Desnoyers wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:

> It's probably worthwhile to summarize here the mb-related discussions we
> had in the previous RFC rounds.

Actually it is probably worthwhile to include this explanation in the change log. Can't expect everyone to have read a kernel thread that contains hundreds of replies.

> 
> Here is the detailed execution scenario showing the race. The race with
> unlock showing that we need to order user-space loads before mm_cpumask
> store. This is showing an execution sequence involving the userspace RCU
> library and the Linux kernel with sys_membarrier(). As we see below, the
> lack of ordering between "cpumask_clear" and user-space execution
> creates a race with the scheduler where sys_membarrier() incorrectly
> considers CPU 1 as not needing a membarrier_ipi.
> 
> 	CPU 0				CPU 1
> 	--------------			--------------
> 	<user space>			<user space> (already in read-side C.S.)
> 					obj = rcu_dereference(list->next);
> 					  -> load list->next
> 					copy = obj->foo;
> 					rcu_read_unlock();
> 					  -> store to urcu_reader.ctr
> 	<urcu_reader.ctr store is globally visible>
> 	list_del(obj);
> 					<preempt>
> 					<kernel space>
> 					schedule();
> 					cpumask_clear(mm_cpumask, cpu);
> 	sys_membarrier();
> 	<kernel space>
> 	smp_mb()
> 	iterates on mm_cpumask
> 	smp_mb()
> 	<user space>
> 	set global g.p. (urcu_gp_ctr) phase to 1
> 	wait for all urcu_reader.ctr in phase 0
> 	set global g.p. (urcu_gp_ctr) phase to 0
> 	wait for all urcu_reader.ctr in phase 1
> 	sys_membarrier();
> 	<kernel space>
> 	smp_mb()
> 	iterates on mm_cpumask
> 	smp_mb()
> 	<user space>
> 	free(obj);
> 					<list->next load hits memory>
> 					<obj->foo load hits memory> <- corruption
> 
> There is a similar race scenario (symmetric to the one above) between
> cpumask_set and a following user-space rcu_read_lock(), which I could
> provide upon request.

If it would help others to understand, it would be worthwhile in explaining that too. So, I'm requesting it ;-)

> 
> This results in the following comments around smp_mb() in
> sys_membarrier:
> 
> First sys_membarrier smp_mb():
> 
> 	/*
> 	 * Memory barrier on the caller thread before sending first
> 	 * IPI. Orders all memory accesses prior to sys_membarrier() before
> 	 * mm_cpumask read and membarrier_ipi executions.
> 	 * Matches memory barriers before and after mm_cpumask updates.
> 	 */

The above comment does not really explain what it is protecting. It just states that it serializes memory access. Well, duh, that's what smp_mb() does ;-) It's the equivalent to

	i++; /* increment i */

Basically, the comment should explain "why" the memory barrier is needed.

> 
> Second sys_membarrier smp_mb():
> 
> 	/*
> 	 * Memory barrier on the caller thread after we finished
> 	 * waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
> 	 * executions before memory accesses following sys_membarrier()
> 	 * execution.
> 	 * Matches memory barriers before and after mm_cpumask updates.
> 	 */

Ditto.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 17:45 ` Mathieu Desnoyers
  2010-02-01 18:00 ` Steven Rostedt
@ 2010-02-01 18:36 ` Linus Torvalds
  2010-02-01 19:56 ` Mathieu Desnoyers
  2010-02-01 20:33 ` Steven Rostedt
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 18:36 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
    Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
    niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> 
> Here is the detailed execution scenario showing the race.

No. You've added random smp_mb() calls, but you don't actually show what the f*ck they are protecting against.

For example

> First sys_membarrier smp_mb():

I'm not AT ALL interested in the sys_membarrier() parts. You can have a million memory barriers there, and I won't care. I'm interested in what you think the memory barriers elsewhere protect against. It's a barrier between _which_ two operations?

You can't say it's a barrier "around" the

	cpumask_clear(mm_cpumask, cpu);

because a barrier is between two things. So if you want to add two barriers around that mm_cpumask access, you need to describe the _three_ events your barriers are between in that call-path (with mm_cpumask being just one of them)

And then, once you've described _those_ three events, you describe what the sys_membarrier interaction is, and how mm_cpumask is involved there.

I'm not interested in the user-space code. Don't even quote it. It's irrelevant apart from the actual semantics you want to guarantee for the new membarrier() system call. So don't quote the code, just explain what the actual barriers are.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 18:36 ` Linus Torvalds @ 2010-02-01 19:56 ` Mathieu Desnoyers 2010-02-01 20:42 ` Linus Torvalds 2010-02-01 20:33 ` Steven Rostedt 1 sibling, 1 reply; 41+ messages in thread From: Mathieu Desnoyers @ 2010-02-01 19:56 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 1 Feb 2010, Mathieu Desnoyers wrote: > > > > Here is the detailed execution scenario showing the race. > > No. You've added random smp_mb() calls, but you don't actually show what > the f*ck they are protecting against. > > For example > > > First sys_membarrier smp_mb(): > > I'm not AT ALL interested in the sys_membarrier() parts. You can hav ea > million memory barriers there, and I won't care. I'm interested in what > you think the memory barriers elsewhere protect against. It's a barrier > between _which_ two operations? > > You can't say it's a barrier "around" the > > cpumask_clear(mm_cpumask, cpu); > > because a barrier is between two things. So if you want to add two > barriers around that mm_cpumask acces, you need to describe the _three_ > events you're barriers between in that call-path (with mm_cpumask being > just one of them) > > And then, once you've described _those_ three events, you describe what > the sys_membarrier interaction is, and how mm_cpumask is involved there. > > I'm not interested in the user-space code. Don't even quote it. It's > irrelevant apart from the actual semantics you want to guarantee for the > new membarrier() system call. So don't quote the code, just explain what > the actual barriers are. 
> 

The two event pairs we are looking at are:

Pair 1)

* memory accesses (load/stores) performed by user-space thread before
  context switch.
* cpumask_clear_cpu(cpu, mm_cpumask(prev));

Pair 2)

* cpumask_set_cpu(cpu, mm_cpumask(next));
* memory accesses (load/stores) performed by user-space thread after
  context switch.

I can see two ways to add memory barriers in switch_mm that would provide ordering for these two memory access pairs:

Either

A)	switch_mm()
	  smp_mb__before_clear_bit();
	  cpumask_clear_cpu(cpu, mm_cpumask(prev));
	  cpumask_set_cpu(cpu, mm_cpumask(next));
	  smp_mb__after_set_bit();

or

B)	switch_mm()
	  cpumask_set_cpu(cpu, mm_cpumask(next));
	  smp_mb__before_clear_bit();
	  cpumask_clear_cpu(cpu, mm_cpumask(prev));

(B) seems like a clear win, as we get the ordering right for both pairs with a single memory barrier, but I don't know if changing the set/clear bit order could have nasty side-effects on other mm_cpumask users.

sys_membarrier uses the mm_cpumask to iterate on all CPUs on which the current process's mm is in use, so it can issue a smp_mb() through an IPI on all CPUs that need it. Without appropriate ordering of pairs 1-2 detailed above, we could miss a CPU that actually needs a memory barrier.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread
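Spelled out in context, option (B) would sit in switch_mm() roughly as follows. This is a sketch against a hypothetical generic switch_mm(); real implementations are per-arch, and load_new_page_tables() is a placeholder name, not a real kernel function:

```c
static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned int cpu = smp_processor_id();

	cpumask_set_cpu(cpu, mm_cpumask(next));
	/*
	 * A single full barrier here serves both pairs:
	 * - Pair 2: the set_cpu above is ordered before the incoming
	 *   thread's user-space accesses;
	 * - Pair 1: the outgoing thread's user-space accesses are
	 *   ordered before the clear_cpu below.
	 */
	smp_mb__before_clear_bit();
	cpumask_clear_cpu(cpu, mm_cpumask(prev));

	load_new_page_tables(next);	/* placeholder: arch-specific switch */
}
```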
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 19:56 ` Mathieu Desnoyers
@ 2010-02-01 20:42 ` Linus Torvalds
  2010-02-01 22:42 ` Mathieu Desnoyers
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 20:42 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
    Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
    niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> 
> The two event pairs we are looking at are:
> 
> Pair 1)
> 
> * memory accesses (load/stores) performed by user-space thread before
>   context switch.
> * cpumask_clear_cpu(cpu, mm_cpumask(prev));
> 
> Pair 2)
> 
> * cpumask_set_cpu(cpu, mm_cpumask(next));
> * memory accesses (load/stores) performed by user-space thread after
>   context switch.

So explain why does that smp_mb() in between the two _help_?

The user of this will do a

	for_each_cpu(mm_cpumask)
		send_IPI(cpu, smp_mb);

but that's not an atomic op _anyway_. So you're reading mm_cpumask somewhere earlier, and doing the send_IPI later. So look at the whole scenario 2:

	cpumask_set_cpu(cpu, mm_cpumask(next));
	memory accesses performed by user-space

and think about it from the perspective of another CPU. What does an smp_mb() in between the two do?

I'll tell you - it does NOTHING. Because it doesn't matter. I see no possible way another CPU can care, because let's assume that the other CPU is doing that

	for_each_cpu(mm_cpumask)
		send_ipi(smp_mb);

and you have to realize that the other CPU needs to read that mm_cpumask early in order to do that.

So you have this situation:

	CPU1				CPU2
	----				----

	cpumask_set_cpu
					read mm_cpumask
	smp_mb
					smp_mb
	user memory accesses
					send_ipi

and exactly _what_ is that "smp_mb" on CPU1 protecting against?
Realize that CPU2 is not ordered (because you wanted to avoid the locking), so the "read mm_cpumask" can happen before or after that cpumask_set_cpu. And it can happen before or after REGARDLESS of that smp_mb. The smp_mb doesn't make any difference to CPU2 that I can see.

So the question becomes one of "How can CPU2 care about whether CPU1 is in the mask"? Considering that CPU2 doesn't do any locking, I don't see any way you can get a "consistent" CPU mask _regardless_ of any smp_mb's in there. When it does the "read mm_cpumask()" it might get the value _before_ the cpumask_set_cpu, and it might get the value _after_, and that's true regardless of whether there is a smp_mb there or not.

See what I'm asking for? I'm asking for why it matters that we have a memory barrier, and why that mm_cpumask is so magical that _that_ access matters so much.

Maybe I'm dense. But if somebody puts memory barriers in the code, I want to know exactly what the reason for the barrier is. Memory ordering is too subtle and non-intuitive to go by gut feel.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 20:42 ` Linus Torvalds @ 2010-02-01 22:42 ` Mathieu Desnoyers 0 siblings, 0 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2010-02-01 22:42 UTC (permalink / raw) To: Linus Torvalds Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 1 Feb 2010, Mathieu Desnoyers wrote: > > > > The two event pairs we are looking at are: > > > > Pair 1) > > > > * memory accesses (load/stores) performed by user-space thread before > > context switch. > > * cpumask_clear_cpu(cpu, mm_cpumask(prev)); > > > > Pair 2) > > > > * cpumask_set_cpu(cpu, mm_cpumask(next)); > > * memory accessses (load/stores) performed by user-space thread after > > context switch. > > So explain why does that smp_mb() in between the two _help_? > > The user of this will do a > > for_each_cpu(mm_cpumask) > send_IPI(cpu, smp_mb); > > but that's not an atomic op _anyway_. So you're reading mm_cpumask > somewhere earlier, and doing the send_IPI later. So look at the whole > scenario 2: > > cpumask_set_cpu(cpu, mm_cpumask(next)); > memory accessses performed by user-space > > and think about it from the perspective of another CPU. What does an > smp_mb() in between the two do? > > I'll tell you - it does NOTHING. Because it doesn't matter. I see no > possible way another CPU can care, because let's assume that the other CPU > is doing that > > for_each_cpu(mm_cpumask) > send_ipi(smp_mb); > > and you have to realize that the other CPU needs to read that mm_cpumask > early in order to do that. > > So you have this situation: > > CPU1 CPU2 > ---- ---- > > cpumask_set_cpu > read mm_cpumask > smp_mb > smp_mb > user memory accessses > send_ipi > > and exactly _what_ is that "smp_mb" on CPU1 protecting against? 
> > Realize that CPU2 is not ordered (because you wanted to avoid the > locking), so the "read mm_cpumask" can happen before or after that > cpumask_set_cpu. And it can happen before or after REGARDLESS of that > smp_mb. The smp_mb doesn't make any difference to CPU2 that I can see. > > So the question becomes one of "How can CPU2 care about whether CPU1 is in > the mask"? Considering that CPU2 doesn't do any locking, I don't see any > way you can get a "consistent" CPU mask _regardless_ of any smp_mb's in > there. When it does the "read mm_cpumask()" it might get the value > _before_ the cpumask_set_cpu, and it might get the value _after_, and > that's true regardless of whether there is a smp_mb there or not. > > See what I'm asking for? I'm asking for why it matters that we have a > memory barrier, and why that mm_cpumask is so magical that _that_ access > matters so much. > > Maybe I'm dense. But If somebody puts memory barriers in the code, I want > to know exactly what the reason for the barrier is. Memory ordering is too > subtle and non-intuitive to go by gut feel. > I am 100% fine with your request for explanation. I'll do my best to explain the requirements and ordering as clearly as I can. Let's define the semantic of this system call clearly, this will help us below: it orders memory accesses performed by a thread before and after calling sys_membarrier with respect to _all_ memory accesses (in program order) performed by all other threads of the same process. The barriers are not there to "protect" against anything, they merely enforce ordering of memory accesses. sys_membarrier ensures that a memory barrier will be issued on all running threads of the current process executing on other cpus, between two memory barriers issued locally. A very important point is that the memory barriers _need_ to be issued on the other running threads after the sys_membarrier first mb and before its second mb. 
The second important aspect is the definition of a running vs non-running thread. Here, any CPU that would still have unflushed write buffers (or pending reordered read operations) affecting a thread which belongs to our process should be considered as "running", because it needs to have memory barriers issued.

Now why is mm_cpumask so important here? This is the only non-lock-protected data structure we access to find out which CPUs are running threads belonging to our process.

I'm modifying your scenario to match the requirements expressed above. In this first scenario, we send an IPI guaranteeing that the newly scheduled thread (which happens to have set its cpu in the mm_cpumask before we read it) issues a memory barrier. We end up issuing one extra memory barrier, as both the scheduler and IPI issue the barriers:

	CPU1				CPU2
	----				----
					smp_mb
	cpumask_set_cpu
					read mm_cpumask
					send_ipi -------------\
	smp_mb				                      |
	user memory accesses		                      |
					                      |
	smp_mb  <---------------------------------------------/
					smp_mb

Now, let's see what happens if the read mm_cpumask occurs right before the cpumask_set_cpu. In this scenario, CPU1 is considered as "not running", because it cannot have any user-space memory accesses reordered prior to the first smp_mb on CPU2.

	CPU1				CPU2
	----				----
					smp_mb
					read mm_cpumask
					(no ipi to send)
	cpumask_set_cpu
	smp_mb
					smp_mb
	user memory accesses

As we see, the smp_mb issued on CPU1 between the cpumask set and user memory accesses takes care to ensure that no user memory accesses spill over the barrier. So if sys_membarrier encounters a mm_cpumask with a cleared cpu bit, it knows for sure that there are no unflushed write buffers containing user-space data, nor reordered read or write accesses to user-space data. Removing this barrier could let CPU1 reorder the user memory accesses before the cpumask_set_cpu, which fails to satisfy the sys_membarrier guarantees.
Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 18:36 ` Linus Torvalds
  2010-02-01 19:56 ` Mathieu Desnoyers
@ 2010-02-01 20:33 ` Steven Rostedt
  2010-02-01 20:52 ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 20:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
    Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
    niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 10:36 -0800, Linus Torvalds wrote:
> 
> I'm not interested in the user-space code. Don't even quote it. It's
> irrelevant apart from the actual semantics you want to guarantee for the
> new membarrier() system call. So don't quote the code, just explain what
> the actual barriers are.

OK, but first we must establish that the sys_membarrier() system call guarantees that all running threads of this process have an mb() performed on them before this syscall returns.

The simplest implementation would be to just do an IPI on all CPUs and have that IPI perform the mb(). But that would interfere with other tasks, so we want to limit it to only sending the mb()'s to the threads of the process that are currently running. We use the mm_cpumask to find out what threads are associated with this task, and only send the IPI to the CPUs running threads of the current process.

From the kernel point of view, the goal is to make sure an mb() happens on all running threads of the calling process. The code does the following:

	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		if (current->mm == cpu_curr(cpu)->mm)
			send_ipi();
	}

But a race exists between the reading of the mm_cpumask and sending the IPI. There are in fact two different problems with this race. One is that a thread scheduled away, but never issued an mb(); the other is that a running task just came in and we never saw it.
Here:

	CPU 0				CPU 1
	-----------			-----------
					< same thread >
					schedule()
					clear_bit();
	sys_membarrier();
	current->mm == cpu_curr(1)->mm
		<<< failed
	return
					context_switch();

The above fails the situation, because we missed our thread before it actually switched to another task. This fails the guarantee that the syscall sys_membarrier() implies.

Second scenario, for non-x86 archs that do not imply a mb() on switch_mm():

	CPU 0				CPU 1
	-----------			-----------
					< different thread >
					schedule();
					clear_bit();
					set_bit();
					schedule();
					< same thread >
	sys_membarrier();
	current->mm == cpu_curr(1)->mm
		<<<<< failed

This scenario happens if the switch_mm() does not imply a mb(). That is, the syscall sys_membarrier() was called after CPU 1 scheduled a thread of the same process, but the switch_mm() did not force the mb(), causing CPU 0 to see the old value of the mm_cpumask.

The above does not take any user-space into account. It only tries to fulfill the kernel's obligation of sys_membarrier to ensure that all threads of the calling process have an mb() performed on them.

Mathieu, from this point of view, you can explain the necessary mb()s that are within the kernel proper.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 20:33 ` Steven Rostedt @ 2010-02-01 20:52 ` Linus Torvalds 2010-02-01 22:39 ` Steven Rostedt 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2010-02-01 20:52 UTC (permalink / raw) To: Steven Rostedt Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Mon, 1 Feb 2010, Steven Rostedt wrote: > > But a race exists between the reading of the mm_cpumask and sending the > IPI. There is in fact two different problems with this race. One is that > a thread scheduled away, but never issued an mb(), the other is that a > running task just came in and we never saw it. I get it. But the thing I object to here is that Mathieu claims that we need _two_ memory barriers in the switch_mm() code. And I'm still not seeing it. You claim that the rule is that "you have to do a mb on all threads", and that there is a race if a threads switches away just as we're about to do that. Fine. But why _two_? And what's so magical about the mm_cpumask that it needs to be around it? If the rule is that we do a memory barrier as we switch an mm, then why does that single one not just handle it? Either the CPU kept running that mm (and the IPI will do the memory barrier), or the CPU didn't (and the switch_mm had a memory barrier). Without locking, I don't see how you can really have any stronger guarantees, and as per my previous email, I don't see what the smp_mb() around mm_cpumask accesses help - because the other CPU is still not going to atomically "see the mask and IPI". It's going to see one value or the other, and the smp_mb() around the access doesn't seem to have anything to do with which value it sees. So I can kind of understand the "We want to guarantee that switching MM's around wants to be a memory barrier". 
Quite frankly, I haven't thought even that through entirely, so who knows... But the "we need to have memory barriers on both sides of the bit setting/clearing" I don't get.

IOW, show me why that cpumask is _so_ important that the placement of the memory barriers around it matters, to the point where you want to have it on both sides.

Maybe you've really thought about this very deeply, but the explanations aren't getting through to me. Educate me.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 20:52 ` Linus Torvalds @ 2010-02-01 22:39 ` Steven Rostedt 2010-02-01 23:09 ` Steven Rostedt 0 siblings, 1 reply; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 22:39 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Mon, 2010-02-01 at 12:52 -0800, Linus Torvalds wrote: > > On Mon, 1 Feb 2010, Steven Rostedt wrote: > > > > But a race exists between the reading of the mm_cpumask and sending the > > IPI. There is in fact two different problems with this race. One is that > > a thread scheduled away, but never issued an mb(), the other is that a > > running task just came in and we never saw it. > > I get it. But the thing I object to here is that Mathieu claims that we > need _two_ memory barriers in the switch_mm() code. Note, on x86, we don't need two memory barriers, because the cr3 load acts as one, and the set_bit() acts as the other. > > And I'm still not seeing it. > > You claim that the rule is that "you have to do a mb on all threads", and > that there is a race if a threads switches away just as we're about to do > that. > > Fine. But why _two_? And what's so magical about the mm_cpumask that it > needs to be around it? > > If the rule is that we do a memory barrier as we switch an mm, then why > does that single one not just handle it? Either the CPU kept running that > mm (and the IPI will do the memory barrier), or the CPU didn't (and the > switch_mm had a memory barrier). The one before the mm_cpumask actually does the mb() that we want to send, in case we miss it. We use mm_cpumask just to limit our sending of mb()s to CPUs. The unlikely case that we miss a thread, this mb() will perform the work mb() for us. 
	CPU 0				CPU 1
	-----------			-----------
					< same thread >
					clear_bit(mm_cpumask)
	sys_membarrier();
	<miss sending CPU 1 mb()>
	return back to user-space
	modify data
					< CPU 0 data is read > <<- corruption
					schedule()

Earlier you asked what the three things are that the two mb()s are protecting. Actually, it is really just two things that can be accessed in:

	A
	mb();
	B
	mb();
	A

where we access A then B and then A again, and need to worry about that ordering. The two things are: one, the mm_cpumask; and two, the user-space data that the sys_membarrier() is guaranteeing will be flushed when it returns.

If we add a mb() before the clearing of the cpu bit in the mm_cpumask(), we don't care if we get an old value that does not contain a missing cpu that is running one of the process's threads. Because the mb() would have made all the necessary data of the thread visible, and no other mb() is needed.

The mb() after the set bit makes sure the mask is visible for this:

	CPU 0				CPU 1
	-----------			-----------
					set_bit();
					schedule();
					return to userspace
					and access critical data
	sys_membarrier();
	miss sending IPI.
	return;
	update critical data

I guess the simple answer is that the first mb() is doing the mb() we missed. The second mb() makes sure that a thread does not get back to userspace without the sys_membarrier() seeing it.

> 
> Without locking, I don't see how you can really have any stronger
> guarantees, and as per my previous email, I don't see what the smp_mb()
> around mm_cpumask accesses help - because the other CPU is still not going
> to atomically "see the mask and IPI". It's going to see one value or the
> other, and the smp_mb() around the access doesn't seem to have anything to
> do with which value it sees.

I guess you can say we are cheating here. If we miss the mm_cpumask update, the only time it is an issue for us is if one of our threads scheduled out. We use the mb() there to perform the mb() that we missed.
We really don't care about the mm_cpumask, but we do care that a mb() is performed.

> 
> So I can kind of understand the "We want to guarantee that switching MM's
> around wants to be a memory barrier". Quite frankly, I haven't thought even
> that through entirely, so who knows... But the "we need to have memory
> barriers on both sides of the bit setting/clearing" I don't get.
> 
> IOW, show me why that cpumask is _so_ important that the placement of the
> memory barriers around it matters, to the point where you want to have it
> on both sides.

Actually the cpumask is not as important as the mb(). We use the cpumask to find out what CPUs we need to make do a mb(), and the mb() before the cpumask is in case we miss one that is caused by the race.

> 
> Maybe you've really thought about this very deeply, but the explanations
> aren't getting through to me. Educate me.

The issue is that sys_membarrier() is guaranteeing that all the other threads of the process will have a mb() performed before the sys_membarrier() returns. We only care about the user-space data of the threads, thus the mb() here is quite unique. We are using the mm_cpumask as a short cut to sending a mb() to only the CPUs that are running our threads.

The mb() before the mm_cpumask() is to make sure that the thread does a mb() in case we miss it when reading the mm_cpumask(). The mb() after the mm_cpumask() is to make sure that the cpumask is visible before the thread reads any data in user-space, so that the sys_membarrier() will send out the IPI to that thread.

For x86 these are not issues, because clear_bit() is an mb() and so is the load_cr3().
Now if other arch maintainers do not want to add two mb()s in switch_mm(), then perhaps if they just guarantee that one exists after the clear/set bits, the following code may work (posted in a previous email):

	sys_membarrier(void) {

	again:
		tmp_mask = mm_cpumask(current->mm);
		smp_mb();
		rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
		for_each_cpu(cpu, tmp_mask) {
			spin_lock_irq(&cpu_rq(cpu)->lock);
			ret = current->mm == cpu_curr(cpu)->mm;
			spin_unlock_irq(&cpu_rq(cpu)->lock);
			if (ret)
				smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
		}
		rcu_read_unlock();
		smp_mb();
		if (tmp_mask != mm_cpumask(current->mm)) {
			/* do check for signals here */
			goto again;
		}

Maybe even the locks are not needed.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 22:39 ` Steven Rostedt @ 2010-02-01 23:09 ` Steven Rostedt 0 siblings, 0 replies; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 23:09 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Mon, 2010-02-01 at 17:39 -0500, Steven Rostedt wrote: > > CPU 0 CPU 1 > ----------- ----------- > < same thread > > clear_bit(mm_cpumask) > > sys_membarrier(); > <miss sending CPU 1 mb()> > return back to user-space > > modify data > > < CPU 0 data is read > <<- corruption > > schedule() > > > (posted in a previous email) > > sys_membarrier(void) { > > again: > tmp_mask = mm_cpumask(current->mm); > smp_mb(); > rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */ > for_each_cpu(cpu, tmp_mask) { > spin_lock_irq(&cpu_rq(cpu)->lock); > ret = current->mm == cpu_curr(cpu)->mm; > spin_unlock_irq(&cpu_rq(cpu)->lock); > if (ret) > smp_call_function_single(cpu, membarrier_ipi, NULL, 1); > } > rcu_read_unlock(); > smp_mb(); > if (tmp_mask != mm_cpumask(current->mm)) { > /* do check for signals here */ > goto again; > } Doh, this misses the above case if a mb() is not before the caller. Nevermind, it looks like the only other case is to iterate over all CPUs. -- Steve ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:48 ` Mathieu Desnoyers
  2010-02-01 16:56 ` Linus Torvalds
@ 2010-02-01 17:13 ` Steven Rostedt
  2010-02-01 17:34 ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 17:13 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
    Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
    niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 11:48 -0500, Mathieu Desnoyers wrote:

> What we have to be careful about here is that it's not enough to just
> rely on switch_mm() containing a memory barrier. What we really need to
> enforce is that switch_mm() issues memory barriers both _before_ and
> _after_ mm_cpumask modification. The "after" part is usually dealt with
> by the TLB context switch, but the "before" part usually isn't.

Then we add a smp_mb__before_clear_bit() in the switch_mm() on all archs that do not have clear_bit imply a smp_mb().

> > 
> > Btw, one reason to strongly prefer "switch_mm()" over any random context
> > switch is that at least it won't affect inter-thread (kernel or user-land)
> > switching, including switching to/from the idle thread.
> > 
> > So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always
> > implies a memory barrier" model than to playing clever games with
> > spinlocks.
> 
> If we really want to make this patch less intrusive, we can consider
> iterating on each online cpu in sys_membarrier() rather than on the
> mm_cpumask. But it comes at the cost of useless cache-line bouncing on
> large machines with few threads running in the process, as we would grab
> the rq locks one by one for all cpus.

I still think modifying the switch_mm() is better than the full iteration.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 17:13 ` Steven Rostedt @ 2010-02-01 17:34 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2010-02-01 17:34 UTC (permalink / raw) To: Steven Rostedt Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Steven Rostedt wrote:
> On Mon, 2010-02-01 at 11:48 -0500, Mathieu Desnoyers wrote:
> >
> > What we have to be careful about here is that it's not enough to just
> > rely on switch_mm() containing a memory barrier. What we really need to
> > enforce is that switch_mm() issues memory barriers both _before_ and
> > _after_ mm_cpumask modification. The "after" part is usually dealt with
> > by the TLB context switch, but the "before" part usually isn't.
>
> Then we add a smp_mb__before_clear_bit() in the switch_mm() on all archs
> that do not have clear_bit imply a smp_mb().

No. I want to get an _explanation_ first. And no, "I want to do user-level RCU and I'm looking at mm_cpumask" is not an explanation.

In order to explain memory ordering rules, you need to actually show the actual accesses on the different cpu's and the ordering requirements between them, and explain them. IOW, you need to show

	thread1                    thread2
	-------                    -------
	some-particular-memop      mm_cpumask access
	memory barrier             memory barrier
	mm_cpumask access          some-particular-memop
	memory barrier             some-other-particular-memop

and explain exactly why the memory barriers are needed. In particular, now I've heard the INSANE statement that we need "memory barriers both _before_ and _after_ mm_cpumask modification." and people need to realize that such statements are totally worthless. The fact is, a memory barrier with regards to a single location modification makes no sense. Not before, not after.
Putting a barrier around a single access (even if that "single" access is then a read-modify-read one) is a meaningless operation - it has _zero_ semantic information. Memory barriers need to be _between_ two operations, and even then they never make sense on their own - you need to have another actor that also has a memory barrier, and that you are ordering things with regards to. Saying that they are "around" one operation is crazy talk. It's a nonsensical statement. It shows a total lack of understanding of what a barrier is about. You can't put a barrier 'around' anything at all. So before we go any further, I want to see the graph of barriers AND THE THINGS THEY ARE BARRIERS BETWEEN. On both sides. I want to know that you guys even know _what_ you are protecting against, and I want it documented so that when people say "this would solve it", they can show that yes, you actually need barriers at both points. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:09 ` Mathieu Desnoyers 2010-02-01 16:23 ` Linus Torvalds @ 2010-02-01 16:24 ` Steven Rostedt 2010-02-01 16:29 ` Peter Zijlstra 1 sibling, 1 reply; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 16:24 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 11:09 -0500, Mathieu Desnoyers wrote:
> We can deal with the rq->cur update by holding the rq lock in each
> iteration of the for_each_cpu(cpu, mm_cpumask(current->mm)) loop. This
> ensures that if rq->cur is updated, we have an associated memory barrier
> issued (e.g. on x86, implied by writing to cr3 while the rq lock is held).
>
> However, this does not deal with mm_cpumask update, and we cannot use
> the per-cpu rq lock, as it's a process-wide data structure updated with
> clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> add memory barriers in switch_mm() on some architectures to deal with
> this.

Doesn't set_bit imply a wmb()? If so, couldn't we do something like the following?

again:
	tmp_mask = mm_cpumask(current->mm);
	smp_mb();
	rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
	for_each_cpu(cpu, tmp_mask) {
		spin_lock_irq(&cpu_rq(cpu)->lock);
		ret = current->mm == cpu_curr(cpu)->mm;
		spin_unlock_irq(&cpu_rq(cpu)->lock);
		if (ret)
			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
	}
	rcu_read_unlock();
	smp_mb();
	if (tmp_mask != mm_cpumask(current->mm)) {
		/* do check for signals here */
		goto again;
	}

Would the above work?

-- Steve

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:24 ` Steven Rostedt @ 2010-02-01 16:29 ` Peter Zijlstra 2010-02-01 16:46 ` Steven Rostedt 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2010-02-01 16:29 UTC (permalink / raw) To: rostedt Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 11:24 -0500, Steven Rostedt wrote:
>
> Doesn't set_bit imply a wmb()? If so couldn't we do:

Nope, that's what we have smp_mb__{before,after}_set_bit() for.

On x86 it's a LOCK'ed op, so sure, it implies a full membarrier, but not in generic code.

On x86 we have plenty of serializing instructions before and after rq->curr is set, so none of the craziness is needed at all. The only thing is !x86.

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 16:29 ` Peter Zijlstra @ 2010-02-01 16:46 ` Steven Rostedt 0 siblings, 0 replies; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 16:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks, dhowells, linux-arch [ Added linux-arch ] On Mon, 2010-02-01 at 17:29 +0100, Peter Zijlstra wrote: > On Mon, 2010-02-01 at 11:24 -0500, Steven Rostedt wrote: > > > > Doesn't set_bit imply a wmb()? If so couldn't we do: > > Nope, that's what we have smp_mb__{before,after}_set_bit() for. > > On x86 its a LOCK'ed op, so sure it implies a full membarrier, but not > in generic. > > on x86 we have plenty serializing instructions before and after rq->curr > is set so none of the crazyness is needed at all. The only thing is ! > x86. > So if we go with Linus's approach and make all archs guarantee that switch_mm() implies a full mb(), then we can simplify the sys_membarrier() code? That looks like the best approach. -- Steve ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock 2010-02-01 15:27 ` Linus Torvalds 2010-02-01 16:09 ` Mathieu Desnoyers @ 2010-02-01 16:11 ` Steven Rostedt 1 sibling, 0 replies; 41+ messages in thread From: Steven Rostedt @ 2010-02-01 16:11 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells On Mon, 2010-02-01 at 07:27 -0800, Linus Torvalds wrote: > So what are these magical memory barriers all about? Mathieu is implementing userspace RCU. In order to make the rcu_read_locks() fast, they can not be calling memory barriers. What is needed is on the synchronize_rcu() the writer has to force a mb() on all CPUs running one of the readers. The first simple approach that Mathieu made, was to simply send an IPI to all CPUs and force the mb() to be made. But this lets one process interfere with other processes needlessly. And us Real-Time folks balked at the idea since it would allow any process to mess with the running of a real-time thread. The next approach was to use the mm_cpumask of the thread and only send IPIs to the CPUs that are running the thread. But there's a race between the update of the mm_cpumask and the scheduling of the task. If we send an IPI to a CPU that is not running the process's thread, it may cause a little interference with the other thread but nothing to worry about. The issue is if we miss sending to a process's thread. Then the reader could be accessing a stale pointer that the writer is modifying after the userspace synchronize_rcu() call. The taking of the rq locks was a way to make sure that the update of the mm_cpumask and the scheduling is in sync. And we know that we are sending an IPI that is running the process's thread and not missing any other ones. 
But all this got a bit ugly when we tried to avoid grabbing the run queue locks in the loop to send out IPIs. Note, I believe that x86 is not affected, since the act of doing the schedule is in itself a mb(). But this may not be the case on all archs. -- Steve ^ permalink raw reply [flat|nested] 41+ messages in thread
* [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8) 2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers 2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers 2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers @ 2010-01-31 20:52 ` Mathieu Desnoyers 2 siblings, 0 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw) To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells Cc: Mathieu Desnoyers [-- Attachment #1: sys-membarrier.patch --] [-- Type: text/plain, Size: 20205 bytes --]

Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads of the current process. It aims at greatly simplifying and enhancing the current signal-based liburcu userspace RCU synchronize_rcu() implementation. (found at http://lttng.org/urcu)

Depends on:
"scheduler: add full memory barriers upon task switch at runqueue lock/unlock"
"Create spin lock/spin unlock with distinct memory barrier"

Changelog since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Changelog since v2:
- Simply send-to-many to the mm_cpumask. It contains the list of processors we have to IPI (which use the mm), and this mask is updated atomically.

Changelog since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before sending an IPI. Ensures that we do not disturb RT tasks in the presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changelog since v4:
- Add "int expedited" parameter, use synchronize_sched() in the non-expedited case. Thanks to Lai Jiangshan for making us seriously consider using synchronize_sched() to provide the low-overhead membarrier scheme.
- Check num_online_cpus() == 1, quickly return without doing anything.

Changelog since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks to the "flags" system call parameter. Past experience with accept4(), signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and inotify_init1() indicates that this is the kind of thing we want to plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changelog since v6:
- Remove some unlikely() that were not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule() and finish_lock_switch(), where they clearly document that all data protected by the rq lock is guaranteed to have memory barriers issued between the scheduler update and the task execution. Replacing the spin lock acquire/release barriers with these memory barriers implies either no overhead (the x86 spinlock atomic instruction already implies a full mb) or some hopefully small overhead caused by the upgrade of the spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to standard spinlocks and full memory barriers. Each architecture can specialize this header following its own needs and declare CONFIG_HAVE_SPINLOCK_MB to use its own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h implementations on a wide range of architectures would be welcome.

Changelog since v7:
- Move spinlock-mb and scheduler-related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned to incrementally reserve syscall IDs on other architectures as these are tested.

Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process thread (as we currently do), we diminish the number of unnecessary wake-ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such a barrier anyway, because these are implied by the scheduler context switches.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are ordering memory accesses with respect to smp_mb() present in rcu_read_lock/unlock(), we can change all smp_mb() from synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from rcu_read_lock/unlock() into compiler barriers "barrier()".
Before the change, we had, for each smp_mb() pair:

Thread A                 Thread B
prev mem accesses        prev mem accesses
smp_mb()                 smp_mb()
follow mem accesses      follow mem accesses

After the change, these pairs become:

Thread A                 Thread B
prev mem accesses        prev mem accesses
sys_membarrier()         barrier()
follow mem accesses      follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                 Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                         prev mem accesses
                         barrier()
                         follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses:

Thread A                 Thread B
prev mem accesses        prev mem accesses
sys_membarrier()         barrier()
follow mem accesses      follow mem accesses

In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() thanks to the IPIs executing memory barriers on each active system thread. Non-running process threads are intrinsically serialized by the scheduler.

For my Intel Xeon E5405 (one thread is doing the sys_membarrier, the others are busy looping):

* expedited 10,000,000 sys_membarrier calls:
T=1: 0m20.173s
T=2: 0m20.506s
T=3: 0m22.632s
T=4: 0m24.759s
T=5: 0m26.633s
T=6: 0m29.654s
T=7: 0m30.669s

That is 2-3 microseconds per call.

* non-expedited 1000 sys_membarrier calls:
T=1-7: 0m16.002s

That is 16 milliseconds per call (~5000-8000 times slower than expedited).

The expected top pattern for the expedited scheme, when using 1 CPU for a thread doing sys_membarrier() in a loop and the other 7 threads busy-waiting in user-space on a variable, shows that the thread doing sys_membarrier is doing mostly system calls, and the other threads are mostly running in user-space.
Note that IPI handlers are not taken into account in the cpu time sampling.

Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 99.7%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu2 : 99.3%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.7%hi, 0.0%si, 0.0%st
Cpu3 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 96.0%us, 1.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 2.6%si, 0.0%st
Cpu6 : 1.3%us, 98.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 96.1%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st

Results in liburcu: Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes

(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes

So the dynamic sys_membarrier availability check adds some overhead to the read-side, but besides that, with the expedited scheme, we can see that we are close to the read-side performance of the signal-based scheme and also close (5/8) to the performance of the memory-barrier write-side. We have a write-side speedup of 400:1 over the signal-based scheme by using the sys_membarrier system call. This allows a 4.5:1 read-side speedup over the memory barrier scheme. The non-expedited scheme indeed adds a much lower overhead on the read-side, both because we do not send IPIs and because we perform fewer updates, which in turn generates fewer cache-line exchanges. The write-side latency becomes even higher than with the signal-based scheme.
The advantage of the non-expedited sys_membarrier() scheme over signal-based scheme is that it does not require to wake up all the process threads. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: Nicholas Miell <nmiell@comcast.net> CC: mingo@elte.hu CC: laijs@cn.fujitsu.com CC: dipankar@in.ibm.com CC: akpm@linux-foundation.org CC: josh@joshtriplett.org CC: dvhltc@us.ibm.com CC: niv@us.ibm.com CC: tglx@linutronix.de CC: peterz@infradead.org CC: Valdis.Kletnieks@vt.edu CC: dhowells@redhat.com --- arch/x86/ia32/ia32entry.S | 1 arch/x86/include/asm/unistd_32.h | 3 arch/x86/include/asm/unistd_64.h | 2 arch/x86/kernel/syscall_table_32.S | 1 include/linux/Kbuild | 1 include/linux/membarrier.h | 25 ++++++ kernel/sched.c | 148 +++++++++++++++++++++++++++++++++++++ 7 files changed, 180 insertions(+), 1 deletion(-) Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h =================================================================== --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h 2010-01-31 14:59:32.000000000 -0500 +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h 2010-01-31 15:13:13.000000000 -0500 @@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt __SYSCALL(__NR_perf_event_open, sys_perf_event_open) #define __NR_recvmmsg 299 __SYSCALL(__NR_recvmmsg, sys_recvmmsg) +#define __NR_membarrier 300 +__SYSCALL(__NR_membarrier, sys_membarrier) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6-lttng/kernel/sched.c =================================================================== --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-31 15:12:20.000000000 -0500 +++ linux-2.6-lttng/kernel/sched.c 2010-01-31 15:12:25.000000000 -0500 @@ -71,6 +71,7 @@ #include <linux/debugfs.h> #include <linux/ctype.h> #include <linux/ftrace.h> +#include <linux/membarrier.h> #include 
<asm/tlb.h> #include <asm/irq_regs.h> @@ -10930,6 +10931,153 @@ struct cgroup_subsys cpuacct_subsys = { }; #endif /* CONFIG_CGROUP_CPUACCT */ +#ifdef CONFIG_SMP + +/* + * Execute a memory barrier on all active threads from the current process + * on SMP systems. Do not rely on implicit barriers in IPI handler execution, + * because batched IPI lists are synchronized with spinlocks rather than full + * memory barriers. This is not the bulk of the overhead anyway, so let's stay + * on the safe side. + */ +static void membarrier_ipi(void *unused) +{ + smp_mb(); +} + +/* + * Handle out-of-mem by sending per-cpu IPIs instead. + */ +static void membarrier_retry(void) +{ + int cpu; + + for_each_cpu(cpu, mm_cpumask(current->mm)) { + if (current->mm == cpu_curr(cpu)->mm) + smp_call_function_single(cpu, membarrier_ipi, NULL, 1); + } +} + +#endif /* #ifdef CONFIG_SMP */ + +/* + * sys_membarrier - issue memory barrier on current process running threads + * @flags: One of these must be set: + * MEMBARRIER_EXPEDITED + * Adds some overhead, fast execution (few microseconds) + * MEMBARRIER_DELAYED + * Low overhead, but slow execution (few milliseconds) + * + * MEMBARRIER_QUERY + * This optional flag can be set to query if the kernel supports + * a set of flags. + * + * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel + * sys_membarrier support can be done by checking for -ENOSYS return value. + * Return values >= 0 indicate success. For a given set of flags on a given + * kernel, this system call will always return the same value. It is therefore + * correct to check the return value only once at library load, passing the + * MEMBARRIER_QUERY flag in addition to only check if the flags are supported, + * without performing any synchronization. + * + * This system call executes a memory barrier on all running threads of the + * current process. 
Upon completion, the caller thread is ensured that all + * process threads have passed through a state where memory accesses match + * program order. (non-running threads are de facto in such a state) + * + * Using the non-expedited mode is recommended for applications which can + * afford leaving the caller thread waiting for a few milliseconds. A good + * example would be a thread dedicated to execute RCU callbacks, which waits + * for callbacks to enqueue most of the time anyway. + * + * The expedited mode is recommended whenever the application needs to have + * control returning to the caller thread as quickly as possible. An example + * of such application would be one which uses the same thread to perform + * data structure updates and issue the RCU synchronization. + * + * It is perfectly safe to call both expedited and non-expedited + * sys_membarrier() in a process. + * + * mm_cpumask is used as an approximation. It is a superset of the cpumask + * to which we must send IPIs, mainly due to lazy TLB shootdown. Therefore, + * we check each runqueue with the rq lock held to make sure our ->mm is indeed + * running on them. We hold the RCU read lock to ensure the task structs stay + * valid. We rely on memory barriers around context switch (paired with the + * scheduler rq lock) to ensure that tasks context switched concurrently with + * our ->mm or mm_cpumask accesses are issuing memory barriers between the ->mm + * or mm_cpumask updates and any memory access performed by the thread before or + * after the call to the scheduler. + * + * In addition to use the mm_cpumask approximation, checking the ->mm reduces + * the risk of disturbing a RT task by sending unnecessary IPIs. There is still + * a slight chance to disturb an unrelated task, because we do not lock the + * runqueues while sending IPIs, but the real-time effect of this heavy locking + * would be worse than the comparatively small disruption of an IPI. 
+ * + * On uniprocessor systems, this system call simply returns 0 without doing + * anything, so user-space knows it is implemented. + * + * The flags argument has room for extensibility, with 16 lower bits holding + * mandatory flags for which older kernels will fail if they encounter an + * unknown flag. The high 16 bits are used for optional flags, which older + * kernels don't have to care about. + */ +SYSCALL_DEFINE1(membarrier, unsigned int, flags) +{ +#ifdef CONFIG_SMP + cpumask_var_t tmpmask; + int cpu; + + /* + * Expect _only_ one of expedited or delayed flags. + * Don't care about optional mask for now. + */ + switch (flags & MEMBARRIER_MANDATORY_MASK) { + case MEMBARRIER_EXPEDITED: + case MEMBARRIER_DELAYED: + break; + default: + return -EINVAL; + } + if (unlikely(flags & MEMBARRIER_QUERY + || thread_group_empty(current)) + || num_online_cpus() == 1) + return 0; + if (flags & MEMBARRIER_DELAYED) { + synchronize_sched(); + return 0; + } + /* + * Memory barrier on the caller thread _before_ sending first + * IPI. Matches memory barriers paired with scheduler rq lock. + */ + smp_mb(); + rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */ + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { + membarrier_retry(); + goto out; + } + cpumask_copy(tmpmask, mm_cpumask(current->mm)); + preempt_disable(); + cpumask_clear_cpu(smp_processor_id(), tmpmask); + for_each_cpu(cpu, tmpmask) + if (current->mm != cpu_curr(cpu)->mm) + cpumask_clear_cpu(cpu, tmpmask); + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1); + preempt_enable(); + free_cpumask_var(tmpmask); +out: + rcu_read_unlock(); + /* + * Memory barrier on the caller thread _after_ we finished + * waiting for the last IPI. Matches memory barriers paired with + * scheduler rq lock. 
+ */ + smp_mb(); +#endif /* #ifdef CONFIG_SMP */ + return 0; +} + #ifndef CONFIG_SMP int rcu_expedited_torture_stats(char *page) Index: linux-2.6-lttng/include/linux/membarrier.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6-lttng/include/linux/membarrier.h 2010-01-31 15:12:25.000000000 -0500 @@ -0,0 +1,25 @@ +#ifndef _LINUX_MEMBARRIER_H +#define _LINUX_MEMBARRIER_H + +/* First argument to membarrier syscall */ + +/* + * Mandatory flags to the membarrier system call that the kernel must + * understand are in the low 16 bits. + */ +#define MEMBARRIER_MANDATORY_MASK 0x0000FFFF /* Mandatory flags */ + +/* + * Optional hints that the kernel can ignore are in the high 16 bits. + */ +#define MEMBARRIER_OPTIONAL_MASK 0xFFFF0000 /* Optional hints */ + +/* Expedited: adds some overhead, fast execution (few microseconds) */ +#define MEMBARRIER_EXPEDITED (1 << 0) +/* Delayed: Low overhead, but slow execution (few milliseconds) */ +#define MEMBARRIER_DELAYED (1 << 1) + +/* Query flag support, without performing synchronization */ +#define MEMBARRIER_QUERY (1 << 16) + +#endif Index: linux-2.6-lttng/include/linux/Kbuild =================================================================== --- linux-2.6-lttng.orig/include/linux/Kbuild 2010-01-31 14:59:41.000000000 -0500 +++ linux-2.6-lttng/include/linux/Kbuild 2010-01-31 15:12:25.000000000 -0500 @@ -110,6 +110,7 @@ header-y += magic.h header-y += major.h header-y += map_to_7segment.h header-y += matroxfb.h +header-y += membarrier.h header-y += meye.h header-y += minix_fs.h header-y += mmtimer.h Index: linux-2.6-lttng/arch/x86/include/asm/unistd_32.h =================================================================== --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_32.h 2010-01-31 15:13:56.000000000 -0500 +++ linux-2.6-lttng/arch/x86/include/asm/unistd_32.h 2010-01-31 15:14:38.000000000 -0500 @@ -343,10 +343,11 @@ #define __NR_rt_tgsigqueueinfo 
335 #define __NR_perf_event_open 336 #define __NR_recvmmsg 337 +#define __NR_membarrier 338 #ifdef __KERNEL__ -#define NR_syscalls 338 +#define NR_syscalls 339 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6-lttng/arch/x86/ia32/ia32entry.S =================================================================== --- linux-2.6-lttng.orig/arch/x86/ia32/ia32entry.S 2010-01-31 15:16:36.000000000 -0500 +++ linux-2.6-lttng/arch/x86/ia32/ia32entry.S 2010-01-31 15:17:24.000000000 -0500 @@ -842,4 +842,5 @@ ia32_sys_call_table: .quad compat_sys_rt_tgsigqueueinfo /* 335 */ .quad sys_perf_event_open .quad compat_sys_recvmmsg + .quad sys_membarrier ia32_syscall_end: Index: linux-2.6-lttng/arch/x86/kernel/syscall_table_32.S =================================================================== --- linux-2.6-lttng.orig/arch/x86/kernel/syscall_table_32.S 2010-01-31 15:16:32.000000000 -0500 +++ linux-2.6-lttng/arch/x86/kernel/syscall_table_32.S 2010-01-31 15:17:08.000000000 -0500 @@ -337,3 +337,4 @@ ENTRY(sys_call_table) .long sys_rt_tgsigqueueinfo /* 335 */ .long sys_perf_event_open .long sys_recvmmsg + .long sys_membarrier -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 41+ messages in thread