[patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8)
@ 2010-01-31 20:52 Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

Hi,

Here is an updated version of the sys_membarrier system call. It implements a
system call that allows userspace to distribute the overhead of memory barriers
asymmetrically. It is particularly useful for Userspace RCU (liburcu), to allow
the RCU read-side to only issue compiler barriers, matched by calls to
sys_membarrier() in the userspace synchronize_rcu().

Even though sys_membarrier should be portable to other architecture as-is
without arch-specific modifications, I propose to incrementally reserve the
system call IDs as we test it on each architecture.

Here is a version that I think should be ready for merging. It is based on the
2.6.33-rc6 kernel. I would appreciate if people who gave their Acked-by in the
RFC rounds could confirm their Acked-by following the patch set updates.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
@ 2010-01-31 20:52 ` Mathieu Desnoyers
  2010-02-01  7:25   ` Nick Piggin
                     ` (2 more replies)
  2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2 siblings, 3 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells
  Cc: Mathieu Desnoyers

[-- Attachment #1: spinlock-mb.patch --]
[-- Type: text/plain, Size: 6346 bytes --]

Create the primitive family:

spin_lock__no_acquire
spin_unlock__no_release
spin_lock_irq__no_acquire
spin_unlock_irq__no_release

raw_spin_lock__no_acquire
raw_spin_unlock__no_release
raw_spin_lock_irq__no_acquire
raw_spin_unlock_irq__no_release

smp_acquire__after_spin_lock()
smp_release__before_spin_unlock()
smp_mb__after_spin_lock()
smp_mb__before_spin_unlock()

This family of locks permits to combine the acquire/release ordering implied by
the spin lock and full memory barriers placed just after spin lock and before
spin unlock. Combining these memory barriers permits to decrease the overhead.

The first user of these primitives is the scheduler code, where a full memory
barrier is wanted before and after runqueue data structure modifications so
these can be read safely by sys_membarrier without holding the rq lock.

Replacing the spin lock acquire/release barriers with these memory barriers
imply either no overhead (x86 spinlock atomic instruction already implies a full
mb) or some hopefully small overhead caused by the upgrade of the spinlock
acquire/release barriers to more heavyweight smp_mb().

The "generic" version of spinlock-mb.h declares both a mapping to standard
spinlocks and full memory barriers. Each architecture can specialize this header
following their own need and declare CONFIG_HAVE_SPINLOCK_MB to use their own
spinlock-mb.h. This initial implementation only specializes the x86
architecture.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/spinlock-mb.h |   28 ++++++++++++++++++++++++++++
 include/asm-generic/spinlock-mb.h  |   27 +++++++++++++++++++++++++++
 include/linux/spinlock.h           |    6 ++++++
 init/Kconfig                       |    3 +++
 5 files changed, 65 insertions(+)

Index: linux-2.6-lttng/include/asm-generic/spinlock-mb.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-generic/spinlock-mb.h	2010-01-31 15:12:16.000000000 -0500
@@ -0,0 +1,27 @@
+#ifndef ASM_GENERIC_SPINLOCK_MB_H
+#define ASM_GENERIC_SPINLOCK_MB_H
+
+/*
+ * Generic spinlock-mb mappings. Use standard spinlocks with acquire/release
+ * semantics, and define the associated memory barriers as full memory barriers.
+ */
+
+#define spin_lock__no_acquire			spin_lock
+#define spin_unlock__no_release			spin_unlock
+
+#define spin_lock_irq__no_acquire		spin_lock_irq
+#define spin_unlock_irq__no_release		spin_unlock_irq
+
+#define raw_spin_lock__no_acquire		raw_spin_lock
+#define raw_spin_unlock__no_release		raw_spin_unlock
+
+#define raw_spin_lock_irq__no_acquire		raw_spin_lock_irq
+#define raw_spin_unlock_irq__no_release		raw_spin_unlock_irq
+
+#define smp_acquire__after_spin_lock()		do { } while (0)
+#define smp_release__before_spin_unlock()	do { } while (0)
+
+#define smp_mb__after_spin_lock()		smp_mb()
+#define smp_mb__before_spin_unlock()		smp_mb()
+
+#endif /* ASM_GENERIC_SPINLOCK_MB_H */
Index: linux-2.6-lttng/include/linux/spinlock.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/spinlock.h	2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/include/linux/spinlock.h	2010-01-31 15:00:40.000000000 -0500
@@ -393,4 +393,10 @@ extern int _atomic_dec_and_lock(atomic_t
 #define atomic_dec_and_lock(atomic, lock) \
 		__cond_lock(lock, _atomic_dec_and_lock(atomic, lock))
 
+#ifdef CONFIG_HAVE_SPINLOCK_MB
+# include <asm/spinlock-mb.h>
+#else
+# include <asm-generic/spinlock-mb.h>
+#endif
+
 #endif /* __LINUX_SPINLOCK_H */
Index: linux-2.6-lttng/arch/x86/include/asm/spinlock-mb.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/include/asm/spinlock-mb.h	2010-01-31 15:11:28.000000000 -0500
@@ -0,0 +1,28 @@
+#ifndef ASM_X86_SPINLOCK_MB_H
+#define ASM_X86_SPINLOCK_MB_H
+
+/*
+ * X86 spinlock-mb mappings. Use standard spinlocks with acquire/release
+ * semantics. Associated memory barriers are defined as no-ops, because the
+ * spinlock LOCK-prefixed atomic operations imply a full memory barrier.
+ */
+
+#define spin_lock__no_acquire			spin_lock
+#define spin_unlock__no_release			spin_unlock
+
+#define spin_lock_irq__no_acquire		spin_lock_irq
+#define spin_unlock_irq__no_release		spin_unlock_irq
+
+#define raw_spin_lock__no_acquire		raw_spin_lock
+#define raw_spin_unlock__no_release		raw_spin_unlock
+
+#define raw_spin_lock_irq__no_acquire		raw_spin_lock_irq
+#define raw_spin_unlock_irq__no_release		raw_spin_unlock_irq
+
+#define smp_acquire__after_spin_lock()		do { } while (0)
+#define smp_release__before_spin_unlock()	do { } while (0)
+
+#define smp_mb__after_spin_lock()		do { } while (0)
+#define smp_mb__before_spin_unlock()		do { } while (0)
+
+#endif /* ASM_X86_SPINLOCK_MB_H */
Index: linux-2.6-lttng/arch/x86/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig	2010-01-31 14:59:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig	2010-01-31 15:01:09.000000000 -0500
@@ -55,6 +55,7 @@ config X86
 	select ANON_INODES
 	select HAVE_ARCH_KMEMCHECK
 	select HAVE_USER_RETURN_NOTIFIER
+	select HAVE_SPINLOCK_MB
 
 config OUTPUT_FORMAT
 	string
Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/init/Kconfig	2010-01-31 15:00:40.000000000 -0500
@@ -320,6 +320,9 @@ config AUDIT_TREE
 	depends on AUDITSYSCALL
 	select INOTIFY
 
+config HAVE_SPINLOCK_MB
+	def_bool n
+
 menu "RCU Subsystem"
 
 choice

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
@ 2010-01-31 20:52 ` Mathieu Desnoyers
  2010-02-01  7:33   ` Nick Piggin
  2010-02-01 15:27   ` Linus Torvalds
  2010-01-31 20:52 ` [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2 siblings, 2 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells
  Cc: Mathieu Desnoyers

[-- Attachment #1: scheduler-use-spinlock-mb.patch --]
[-- Type: text/plain, Size: 2950 bytes --]

Depends on:
"Create spin lock/spin unlock with distinct memory barrier"

A full memory barrier is wanted before and after runqueue data structure
modifications so these can be read safely by sys_membarrier without holding the
rq lock.

Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
lock/unlock already imply a full memory barrier. Combines the spin lock
acquire/release barriers with the full memory barrier to diminish the
performance impact on other architectures. (per-architecture spinlock-mb.h
should be gradually implemented to replace the generic version)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 kernel/sched.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-31 15:09:51.000000000 -0500
@@ -893,7 +893,12 @@ static inline void finish_lock_switch(st
 	 */
 	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
 
-	raw_spin_unlock_irq(&rq->lock);
+	/*
+	 * Order mm_cpumask and rq->curr updates before following memory
+	 * accesses. Required by sys_membarrier().
+	 */
+	smp_mb__before_spin_unlock();
+	raw_spin_unlock_irq__no_release(&rq->lock);
 }
 
 #else /* __ARCH_WANT_UNLOCKED_CTXSW */
@@ -916,10 +921,15 @@ static inline void prepare_lock_switch(s
 	 */
 	next->oncpu = 1;
 #endif
+	/*
+	 * Order mm_cpumask and rq->curr updates before following memory
+	 * accesses. Required by sys_membarrier().
+	 */
+	smp_mb__before_spin_unlock();
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq__no_release(&rq->lock);
 #else
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock__no_release(&rq->lock);
 #endif
 }
 
@@ -5490,7 +5500,13 @@ need_resched_nonpreemptible:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq__no_acquire(&rq->lock);
+	/*
+	 * Order memory accesses before mm_cpumask and rq->curr updates.
+	 * Required by sys_membarrier() when prev != next. We only learn about
+	 * next later, so we issue this mb() unconditionally.
+	 */
+	smp_mb__after_spin_lock();
 	update_rq_clock(rq);
 	clear_tsk_need_resched(prev);
 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8)
  2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
  2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
@ 2010-01-31 20:52 ` Mathieu Desnoyers
  2 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-01-31 20:52 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells
  Cc: Mathieu Desnoyers

[-- Attachment #1: sys-membarrier.patch --]
[-- Type: text/plain, Size: 20205 bytes --]

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.
 
It aims at greatly simplifying and enhancing the current signal-based
liburcu userspace RCU synchronize_rcu() implementation.
(found at http://lttng.org/urcu)

Depends on:
"scheduler: add full memory barriers upon task switch at runqueue lock/unlock"
"Create spin lock/spin unlock with distinct memory barrier"

Changelog since v1:

- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptative IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Changelog since v2:
- simply send-to-many to the mm_cpumask. It contains the list of processors we
  have to IPI to (which use the mm), and this mask is updated atomically.

Changelog since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before sending an
  IPI. Ensures that we do not disturb RT tasks in the presence of lazy TLB
  shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changelog since v4:
- Add "int expedited" parameter, use synchronize_sched() in the non-expedited
  case. Thanks to Lai Jiangshan for making us consider seriously using
  synchronize_sched() to provide the low-overhead membarrier scheme.
- Check num_online_cpus() == 1, quickly return without doing nothing.

Changelog since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks to the
  "flags" system call parameter. Past experience with accept4(), signalfd4(),
  eventfd2(), epoll_create1(), dup3(), pipe2(), and inotify_init1() indicates
  that this is the kind of thing we want to plan for. Return -EINVAL if the
  mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changelog since v6:
- Remove some unlikely() not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU read lock
  in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule() and
  finish_lock_switch(), where they clearly document that all data protected by
  the rq lock is guaranteed to have memory barriers issued between the scheduler
  update and the task execution. Replacing the spin lock acquire/release
  barriers with these memory barriers imply either no overhead (x86 spinlock
  atomic instruction already implies a full mb) or some hopefully small
  overhead caused by the upgrade of the spinlock acquire/release barriers to
  more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to standard
  spinlocks and full memory barriers. Each architecture can specialize this
  header following their own need and declare CONFIG_HAVE_SPINLOCK_MB to use
  their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
  implementations on a wide range of architecture would be welcome.

Changelog since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned to
  incrementally reserve syscall IDs on other architectures as these are tested.

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the 
write-side are turned into an invokation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

To explain the benefit of this scheme, let's introduce two example threads:
 
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are
ordering memory accesses with respect to smp_mb() present in 
rcu_read_lock/unlock(), we can change all smp_mb() from
synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
rcu_read_lock/unlock() into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() thanks to the IPIs executing memory barriers on each active
system threads. Each non-running process threads are intrinsically
serialized by the scheduler.

For my Intel Xeon E5405
(one thread is doing the sys_membarrier, the others are busy looping)

* expedited

10,000,000 sys_membarrier calls:

T=1: 0m20.173s
T=2: 0m20.506s
T=3: 0m22.632s
T=4: 0m24.759s
T=5: 0m26.633s
T=6: 0m29.654s
T=7: 0m30.669s

For a 2-3 microseconds/call.

* non-expedited

1000 sys_membarrier calls:

T=1-7: 0m16.002s

For a 16 milliseconds/call. (~5000-8000 times slower than expedited)

The expected top pattern for the expedited scheme, when using 1 CPU for a thread
doing sys_membarrier() in a loop and other 7 threads busy-waiting in user-space
on a variable shows that the thread doing sys_membarrier is doing mostly system
calls, and other threads are mostly running in user-space. Note that IPI
handlers are not taken into account in the cpu time sampling.

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme:      6289946025 reads,   1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme:    4316818891 reads, 503790 writes

(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme:    8698725501 reads,    313 writes

So the dynamic sys_membarrier availability check adds some overhead to the
read-side, but besides that, with the expedited scheme, we can see that we are
close to the read-side performance of the signal-based scheme and also close
(5/8) to the performance of the memory-barrier write-side. We have a write-side
speedup of 400:1 over the signal-based scheme by using the sys_membarrier system
call. This allows a 4.5:1 read-side speedup over the memory barrier scheme.

The non-expedited scheme adds indeed a much lower overhead on the read-side
both because we do not send IPIs and because we perform less updates, which in
turn generates less cache-line exchanges. The write-side latency becomes even
higher than with the signal-based scheme. The advantage of the non-expedited
sys_membarrier() scheme over signal-based scheme is that it does not require to
wake up all the process threads.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Nicholas Miell <nmiell@comcast.net>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/ia32/ia32entry.S          |    1 
 arch/x86/include/asm/unistd_32.h   |    3 
 arch/x86/include/asm/unistd_64.h   |    2 
 arch/x86/kernel/syscall_table_32.S |    1 
 include/linux/Kbuild               |    1 
 include/linux/membarrier.h         |   25 ++++++
 kernel/sched.c                     |  148 +++++++++++++++++++++++++++++++++++++
 7 files changed, 180 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-31 14:59:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-31 15:13:13.000000000 -0500
@@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_membarrier				300
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-31 15:12:20.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-31 15:12:25.000000000 -0500
@@ -71,6 +71,7 @@
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
+#include <linux/membarrier.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -10930,6 +10931,153 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+#ifdef CONFIG_SMP
+
+/*
+ * Execute a memory barrier on all active threads from the current process
+ * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
+ * because batched IPI lists are synchronized with spinlocks rather than full
+ * memory barriers. This is not the bulk of the overhead anyway, so let's stay
+ * on the safe side.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * Handle out-of-mem by sending per-cpu IPIs instead.
+ */
+static void membarrier_retry(void)
+{
+	int cpu;
+
+	for_each_cpu(cpu, mm_cpumask(current->mm)) {
+		if (current->mm == cpu_curr(cpu)->mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+#endif /* #ifdef CONFIG_SMP */
+
+/*
+ * sys_membarrier - issue memory barrier on current process running threads
+ * @flags: One of these must be set:
+ *         MEMBARRIER_EXPEDITED
+ *             Adds some overhead, fast execution (few microseconds)
+ *         MEMBARRIER_DELAYED
+ *             Low overhead, but slow execution (few milliseconds)
+ *
+ *         MEMBARRIER_QUERY
+ *           This optional flag can be set to query if the kernel supports
+ *           a set of flags.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
+ * sys_membarrier support can be done by checking for -ENOSYS return value.
+ * Return values >= 0 indicate success. For a given set of flags on a given
+ * kernel, this system call will always return the same value. It is therefore
+ * correct to check the return value only once at library load, passing the
+ * MEMBARRIER_QUERY flag in addition to only check if the flags are supported,
+ * without performing any synchronization.
+ *
+ * This system call executes a memory barrier on all running threads of the
+ * current process. Upon completion, the caller thread is ensured that all
+ * process threads have passed through a state where memory accesses match
+ * program order. (non-running threads are de facto in such a state)
+ *
+ * Using the non-expedited mode is recommended for applications which can
+ * afford leaving the caller thread waiting for a few milliseconds. A good
+ * example would be a thread dedicated to execute RCU callbacks, which waits
+ * for callbacks to enqueue most of the time anyway.
+ *
+ * The expedited mode is recommended whenever the application needs to have
+ * control returning to the caller thread as quickly as possible. An example
+ * of such application would be one which uses the same thread to perform
+ * data structure updates and issue the RCU synchronization.
+ *
+ * It is perfectly safe to call both expedited and non-expedited
+ * sys_membarrier() in a process.
+ *
+ * mm_cpumask is used as an approximation. It is a superset of the cpumask
+ * to which we must send IPIs, mainly due to lazy TLB shootdown. Therefore,
+ * we check each runqueue with the rq lock held to make sure our ->mm is indeed
+ * running on them. We hold the RCU read lock to ensure the task structs stay
+ * valid. We rely on memory barriers around context switch (paired with the
+ * scheduler rq lock) to ensure that tasks context switched concurrently with
+ * our ->mm or mm_cpumask accesses are issuing memory barriers between the ->mm
+ * or mm_cpumask updates and any memory access performed by the thread before or
+ * after the call to the scheduler.
+ *
+ * In addition to use the mm_cpumask approximation, checking the ->mm reduces
+ * the risk of disturbing a RT task by sending unnecessary IPIs. There is still
+ * a slight chance to disturb an unrelated task, because we do not lock the
+ * runqueues while sending IPIs, but the real-time effect of this heavy locking
+ * would be worse than the comparatively small disruption of an IPI.
+ *
+ * On uniprocessor systems, this system call simply returns 0 without doing
+ * anything, so user-space knows it is implemented.
+ *
+ * The flags argument has room for extensibility, with 16 lower bits holding
+ * mandatory flags for which older kernels will fail if they encounter an
+ * unknown flag. The high 16 bits are used for optional flags, which older
+ * kernels don't have to care about.
+ */
+SYSCALL_DEFINE1(membarrier, unsigned int, flags)
+{
+#ifdef CONFIG_SMP
+	cpumask_var_t tmpmask;
+	int cpu;
+
+	/*
+	 * Expect _only_ one of expedited or delayed flags.
+	 * Don't care about optional mask for now.
+	 */
+	switch (flags & MEMBARRIER_MANDATORY_MASK) {
+	case MEMBARRIER_EXPEDITED:
+	case MEMBARRIER_DELAYED:
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (unlikely(flags & MEMBARRIER_QUERY
+		     || thread_group_empty(current))
+		     || num_online_cpus() == 1)
+		return 0;
+	if (flags & MEMBARRIER_DELAYED) {
+		synchronize_sched();
+		return 0;
+	}
+	/*
+	 * Memory barrier on the caller thread _before_ sending first
+	 * IPI. Matches memory barriers paired with scheduler rq lock.
+	 */
+	smp_mb();
+	rcu_read_lock();	/* ensures validity of cpu_curr(cpu) tasks */
+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		membarrier_retry();
+		goto out;
+	}
+	cpumask_copy(tmpmask, mm_cpumask(current->mm));
+	preempt_disable();
+	cpumask_clear_cpu(smp_processor_id(), tmpmask);
+	for_each_cpu(cpu, tmpmask)
+		if (current->mm != cpu_curr(cpu)->mm)
+			cpumask_clear_cpu(cpu, tmpmask);
+	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+	preempt_enable();
+	free_cpumask_var(tmpmask);
+out:
+	rcu_read_unlock();
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI. Matches memory barriers paired with
+	 * scheduler rq lock.
+	 */
+	smp_mb();
+#endif /* #ifdef CONFIG_SMP */
+	return 0;
+}
+
 #ifndef CONFIG_SMP
 
 int rcu_expedited_torture_stats(char *page)
Index: linux-2.6-lttng/include/linux/membarrier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/membarrier.h	2010-01-31 15:12:25.000000000 -0500
@@ -0,0 +1,25 @@
+#ifndef _LINUX_MEMBARRIER_H
+#define _LINUX_MEMBARRIER_H
+
+/* First argument to membarrier syscall */
+
+/*
+ * Mandatory flags to the membarrier system call that the kernel must
+ * understand are in the low 16 bits.
+ */
+#define MEMBARRIER_MANDATORY_MASK	0x0000FFFF	/* Mandatory flags */
+
+/*
+ * Optional hints that the kernel can ignore are in the high 16 bits.
+ */
+#define MEMBARRIER_OPTIONAL_MASK	0xFFFF0000	/* Optional hints */
+
+/* Expedited: adds some overhead, fast execution (few microseconds) */
+#define MEMBARRIER_EXPEDITED		(1 << 0)
+/* Delayed: Low overhead, but slow execution (few milliseconds) */
+#define MEMBARRIER_DELAYED		(1 << 1)
+
+/* Query flag support, without performing synchronization */
+#define MEMBARRIER_QUERY		(1 << 16)
+
+#endif
Index: linux-2.6-lttng/include/linux/Kbuild
===================================================================
--- linux-2.6-lttng.orig/include/linux/Kbuild	2010-01-31 14:59:41.000000000 -0500
+++ linux-2.6-lttng/include/linux/Kbuild	2010-01-31 15:12:25.000000000 -0500
@@ -110,6 +110,7 @@ header-y += magic.h
 header-y += major.h
 header-y += map_to_7segment.h
 header-y += matroxfb.h
+header-y += membarrier.h
 header-y += meye.h
 header-y += minix_fs.h
 header-y += mmtimer.h
Index: linux-2.6-lttng/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_32.h	2010-01-31 15:13:56.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_32.h	2010-01-31 15:14:38.000000000 -0500
@@ -343,10 +343,11 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_membarrier		338
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 339
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/arch/x86/ia32/ia32entry.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/ia32/ia32entry.S	2010-01-31 15:16:36.000000000 -0500
+++ linux-2.6-lttng/arch/x86/ia32/ia32entry.S	2010-01-31 15:17:24.000000000 -0500
@@ -842,4 +842,5 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad sys_membarrier
 ia32_syscall_end:
Index: linux-2.6-lttng/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/syscall_table_32.S	2010-01-31 15:16:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/syscall_table_32.S	2010-01-31 15:17:08.000000000 -0500
@@ -337,3 +337,4 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long sys_membarrier

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
@ 2010-02-01  7:25   ` Nick Piggin
  2010-02-01 14:08     ` Mathieu Desnoyers
  2010-02-01  7:28   ` Nick Piggin
  2010-02-01 15:22   ` Linus Torvalds
  2 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01  7:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> Create the primitive family:
> 
> spin_lock__no_acquire
> spin_unlock__no_release
> spin_lock_irq__no_acquire
> spin_unlock_irq__no_release
> 
> raw_spin_lock__no_acquire
> raw_spin_unlock__no_release
> raw_spin_lock_irq__no_acquire
> raw_spin_unlock_irq__no_release
> 
> smp_acquire__after_spin_lock()
> smp_release__before_spin_unlock()
> smp_mb__after_spin_lock()
> smp_mb__before_spin_unlock()

Wow, someone who likes micro optimising things as much as I do.
However, these have the wrong names.

smp_mb__after_x() means that calling that function after calling x()
will give a smp_mb(), right?

With your functions, this is giving a smp_mb() after calling
x__no_acquire().

I would suggest maybe just don't bother with the __no_acquire
__no_release variants of spin locks, and stick with the expected
semantics for the new smb_mb__xxx functions. x86 still gets the
full benefit.

But, I don't know if this is even worthwhile, given where you are
using it.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
  2010-02-01  7:25   ` Nick Piggin
@ 2010-02-01  7:28   ` Nick Piggin
  2010-02-01 14:10     ` Mathieu Desnoyers
  2010-02-01 15:22   ` Linus Torvalds
  2 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01  7:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> +/*
> + * X86 spinlock-mb mappings. Use standard spinlocks with acquire/release
> + * semantics. Associated memory barriers are defined as no-ops, because the
> + * spinlock LOCK-prefixed atomic operations imply a full memory barrier.
> + */
> +
> +#define spin_lock__no_acquire			spin_lock
> +#define spin_unlock__no_release			spin_unlock
> +
> +#define spin_lock_irq__no_acquire		spin_lock_irq
> +#define spin_unlock_irq__no_release		spin_unlock_irq
> +
> +#define raw_spin_lock__no_acquire		raw_spin_lock
> +#define raw_spin_unlock__no_release		raw_spin_unlock
> +
> +#define raw_spin_lock_irq__no_acquire		raw_spin_lock_irq
> +#define raw_spin_unlock_irq__no_release		raw_spin_unlock_irq
> +
> +#define smp_acquire__after_spin_lock()		do { } while (0)
> +#define smp_release__before_spin_unlock()	do { } while (0)
> +
> +#define smp_mb__after_spin_lock()		do { } while (0)
> +#define smp_mb__before_spin_unlock()		do { } while (0)

Oh, and that one's wrong. loads can pass spin_unlock on x86 so it
needs to be smp_mb().


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
@ 2010-02-01  7:33   ` Nick Piggin
  2010-02-01  9:42     ` Peter Zijlstra
  2010-02-01 15:27   ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01  7:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, Jan 31, 2010 at 03:52:56PM -0500, Mathieu Desnoyers wrote:
> Depends on:
> "Create spin lock/spin unlock with distinct memory barrier"
> 
> A full memory barrier is wanted before and after runqueue data structure
> modifications so these can be read safely by sys_membarrier without holding the
> rq lock.
> 
> Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> lock/unlock already imply a full memory barrier. Combines the spin lock
> acquire/release barriers with the full memory barrier to diminish the
> performance impact on other architectures. (per-architecture spinlock-mb.h
> should be gradually implemented to replace the generic version)

It does add overhead on x86, as well as most other architectures.

This really seems like the wrong optimisation to make, especially
given that there's not likely to be much using librcu yet, right?

I'd go with the simpler and safer version of sys_membarrier that does
not do tricky synchronisation or add overhead to the ctxsw fastpath.
Then if you see some actual improvement in a real program using librcu
one day we can discuss making it faster.

As it is right now, the change will definitely slow down everybody
not using librcu (ie. nearly everything).

> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Nicholas Miell <nmiell@comcast.net>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: mingo@elte.hu
> CC: laijs@cn.fujitsu.com
> CC: dipankar@in.ibm.com
> CC: akpm@linux-foundation.org
> CC: josh@joshtriplett.org
> CC: dvhltc@us.ibm.com
> CC: niv@us.ibm.com
> CC: tglx@linutronix.de
> CC: peterz@infradead.org
> CC: Valdis.Kletnieks@vt.edu
> CC: dhowells@redhat.com
> ---
>  kernel/sched.c |   24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-31 14:59:42.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-31 15:09:51.000000000 -0500
> @@ -893,7 +893,12 @@ static inline void finish_lock_switch(st
>  	 */
>  	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
>  
> -	raw_spin_unlock_irq(&rq->lock);
> +	/*
> +	 * Order mm_cpumask and rq->curr updates before following memory
> +	 * accesses. Required by sys_membarrier().
> +	 */
> +	smp_mb__before_spin_unlock();
> +	raw_spin_unlock_irq__no_release(&rq->lock);
>  }
>  
>  #else /* __ARCH_WANT_UNLOCKED_CTXSW */
> @@ -916,10 +921,15 @@ static inline void prepare_lock_switch(s
>  	 */
>  	next->oncpu = 1;
>  #endif
> +	/*
> +	 * Order mm_cpumask and rq->curr updates before following memory
> +	 * accesses. Required by sys_membarrier().
> +	 */
> +	smp_mb__before_spin_unlock();
>  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> -	raw_spin_unlock_irq(&rq->lock);
> +	raw_spin_unlock_irq__no_release(&rq->lock);
>  #else
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock__no_release(&rq->lock);
>  #endif
>  }
>  
> @@ -5490,7 +5500,13 @@ need_resched_nonpreemptible:
>  	if (sched_feat(HRTICK))
>  		hrtick_clear(rq);
>  
> -	raw_spin_lock_irq(&rq->lock);
> +	raw_spin_lock_irq__no_acquire(&rq->lock);
> +	/*
> +	 * Order memory accesses before mm_cpumask and rq->curr updates.
> +	 * Required by sys_membarrier() when prev != next. We only learn about
> +	 * next later, so we issue this mb() unconditionally.
> +	 */
> +	smp_mb__after_spin_lock();
>  	update_rq_clock(rq);
>  	clear_tsk_need_resched(prev);
>  
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01  7:33   ` Nick Piggin
@ 2010-02-01  9:42     ` Peter Zijlstra
  2010-02-01 10:11       ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2010-02-01  9:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > lock/unlock already imply a full memory barrier. Combines the spin lock
> > acquire/release barriers with the full memory barrier to diminish the
> > performance impact on other architectures. (per-architecture spinlock-mb.h
> > should be gradually implemented to replace the generic version)
> 
> It does add overhead on x86, as well as most other architectures.
> 
> This really seems like the wrong optimisation to make, especially
> given that there's not likely to be much using librcu yet, right?
> 
> I'd go with the simpler and safer version of sys_membarrier that does
> not do tricky synchronisation or add overhead to the ctxsw fastpath.
> Then if you see some actual improvement in a real program using librcu
> one day we can discuss making it faster.
> 
> As it is right now, the change will definitely slow down everybody
> not using librcu (ie. nearly everything). 

Right, so the problem with the 'slow'/'safe' version is that it takes
rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
in a quite effective DoS.

Now, I'm not quite charmed by all this. Esp. this patch seems wrong. The
fact is on x86 we have all the required membarriers in place.

There's a number of LOCK ins before we set rq->curr and we have them
after. Adding more, like this patch effectively does
(smp_mb__{before,after}_unlock should be a full mb as Nick pointed out)
doesn't seem like a good idea at all.

And then there's !x86 to consider.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01  9:42     ` Peter Zijlstra
@ 2010-02-01 10:11       ` Nick Piggin
  2010-02-01 10:36         ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01 10:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, Feb 01, 2010 at 10:42:30AM +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > > lock/unlock already imply a full memory barrier. Combines the spin lock
> > > acquire/release barriers with the full memory barrier to diminish the
> > > performance impact on other architectures. (per-architecture spinlock-mb.h
> > > should be gradually implemented to replace the generic version)
> > 
> > It does add overhead on x86, as well as most other architectures.
> > 
> > This really seems like the wrong optimisation to make, especially
> > given that there's not likely to be much using librcu yet, right?
> > 
> > I'd go with the simpler and safer version of sys_membarrier that does
> > not do tricky synchronisation or add overhead to the ctxsw fastpath.
> > Then if you see some actual improvement in a real program using librcu
> > one day we can discuss making it faster.
> > 
> > As it is right now, the change will definitely slow down everybody
> > not using librcu (ie. nearly everything). 
> 
> Right, so the problem with the 'slow'/'safe' version is that it takes
> rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
> in a quite effective DoS.

All, but one at a time, no? How much of a DoS really is taking these
locks for a handful of cycles each, per syscall?

I mean, we have LOTS of syscalls that take locks, and for a lot longer,
(look at dcache_lock).

I think we basically just have to say that locking primitives should be
somewhat fair, and not be held for too long, it should more or less
work.

If the locks are getting contended, then the threads calling
sys_membarrier are going to be spinning longer too, using more CPU time,
and will get scheduled away...

If there is some particular problem on -rt because of the rq locks,
then I guess you could consider whether to add more overhead to your
ctxsw path to reduce the problem, or simply not support sys_membarrier
for unprived users in the first place.

 
> Now, I'm not quite charmed by all this. Esp. this patch seems wrong. The
> fact is on x86 we have all the required membarriers in place.
> 
> There's a number of LOCK ins before we set rq->curr and we have them
> after. Adding more, like this patch effectively does
> (smp_mb__{before,after}_unlock should be a full mb as Nick pointed out)
> doesn't seem like a good idea at all.
> 
> And then there's !x86 to consider.

Yep.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 10:11       ` Nick Piggin
@ 2010-02-01 10:36         ` Peter Zijlstra
  2010-02-01 10:49           ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2010-02-01 10:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote:
> On Mon, Feb 01, 2010 at 10:42:30AM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > > > lock/unlock already imply a full memory barrier. Combines the spin lock
> > > > acquire/release barriers with the full memory barrier to diminish the
> > > > performance impact on other architectures. (per-architecture spinlock-mb.h
> > > > should be gradually implemented to replace the generic version)
> > > 
> > > It does add overhead on x86, as well as most other architectures.
> > > 
> > > This really seems like the wrong optimisation to make, especially
> > > given that there's not likely to be much using librcu yet, right?
> > > 
> > > I'd go with the simpler and safer version of sys_membarrier that does
> > > not do tricky synchronisation or add overhead to the ctxsw fastpath.
> > > Then if you see some actual improvement in a real program using librcu
> > > one day we can discuss making it faster.
> > > 
> > > As it is right now, the change will definitely slow down everybody
> > > not using librcu (ie. nearly everything). 
> > 
> > Right, so the problem with the 'slow'/'safe' version is that it takes
> > rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
> > in a quite effective DoS.
> 
> All, but one at a time, no? How much of a DoS really is taking these
> locks for a handful of cycles each, per syscall?

I was more worrying about the cacheline trashing than lock hold times
there.

> I mean, we have LOTS of syscalls that take locks, and for a lot longer,
> (look at dcache_lock).

Yeah, and dcache is a massive pain, isn't it ;-)

> I think we basically just have to say that locking primitives should be
> somewhat fair, and not be held for too long, it should more or less
> work.

Sure, it'll more of less work, but he's basically making rq->lock a
global lock instead of a per-cpu lock.

> If the locks are getting contended, then the threads calling
> sys_membarrier are going to be spinning longer too, using more CPU time,
> and will get scheduled away...

Sure, and increased spinning reduces the total throughput.

> If there is some particular problem on -rt because of the rq locks,
> then I guess you could consider whether to add more overhead to your
> ctxsw path to reduce the problem, or simply not support sys_membarrier
> for unprived users in the first place.

Right, for -rt we might need to do that, but its just that rq->lock is a
very hot lock, and adding basically unlimited trashing to it didn't seem
like a good idea.

Also, I'm thinking making it a priv syscall basically renders it useless
for Mathieu.

Anyway, it might be I'm just paranoid... but archs with large core count
and lazy tlb flush seem particularly vulnerable.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 10:36         ` Peter Zijlstra
@ 2010-02-01 10:49           ` Nick Piggin
  2010-02-01 14:47             ` Mathieu Desnoyers
  0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, Feb 01, 2010 at 11:36:01AM +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote:
> > All, but one at a time, no? How much of a DoS really is taking these
> > locks for a handful of cycles each, per syscall?
> 
> I was more worrying about the cacheline trashing than lock hold times
> there.

Well, same issue really. Look at all the unprived files in /proc
for example that can look through all per-cpu cachelines. It just
takes a single read syscall to do a lot of them too.

 
> > I mean, we have LOTS of syscalls that take locks, and for a lot longer,
> > (look at dcache_lock).
> 
> Yeah, and dcache is a massive pain, isn't it ;-)

My point is, I don't think it is something we can realistically
care much about and it is nowhere near a new or unique problem
being added by this one patch.

It is really a RoS, reduction of service, rather than a DoS. And
any time we allow an unpriv user on our system, we have RoS potential :)

 
> > I think we basically just have to say that locking primitives should be
> > somewhat fair, and not be held for too long, it should more or less
> > work.
> 
> Sure, it'll more of less work, but he's basically making rq->lock a
> global lock instead of a per-cpu lock.
> 
> > If the locks are getting contended, then the threads calling
> > sys_membarrier are going to be spinning longer too, using more CPU time,
> > and will get scheduled away...
> 
> Sure, and increased spinning reduces the total throughput.
> 
> > If there is some particular problem on -rt because of the rq locks,
> > then I guess you could consider whether to add more overhead to your
> > ctxsw path to reduce the problem, or simply not support sys_membarrier
> > for unprived users in the first place.
> 
> Right, for -rt we might need to do that, but its just that rq->lock is a
> very hot lock, and adding basically unlimited trashing to it didn't seem
> like a good idea.
> 
> Also, I'm thinking making it a priv syscall basically renders it useless
> for Mathieu.

Well I just mean that it's something for -rt to work out. Apps can
still work if the call is unsupported completely.
 

> Anyway, it might be I'm just paranoid... but archs with large core count
> and lazy tlb flush seem particularly vulnerable.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-02-01  7:25   ` Nick Piggin
@ 2010-02-01 14:08     ` Mathieu Desnoyers
  0 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 14:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Nick Piggin (npiggin@suse.de) wrote:
> On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> > Create the primitive family:
> > 
> > spin_lock__no_acquire
> > spin_unlock__no_release
> > spin_lock_irq__no_acquire
> > spin_unlock_irq__no_release
> > 
> > raw_spin_lock__no_acquire
> > raw_spin_unlock__no_release
> > raw_spin_lock_irq__no_acquire
> > raw_spin_unlock_irq__no_release
> > 
> > smp_acquire__after_spin_lock()
> > smp_release__before_spin_unlock()
> > smp_mb__after_spin_lock()
> > smp_mb__before_spin_unlock()
> 
> Wow, someone who likes micro optimising things as much as I do.

:-)

> However, these have the wrong names.
> 
> smp_mb__after_x() means that calling that function after calling x()
> will give a smp_mb(), right?

Supposed to.

> 
> With your functions, this is giving a smp_mb() after calling
> x__no_acquire().

Good point.

> 
> I would suggest maybe just don't bother with the __no_acquire
> __no_release variants of spin locks, and stick with the expected
> semantics for the new smb_mb__xxx functions. x86 still gets the
> full benefit.

Well, most other architectures will get a gain by modifying the spin
lock/unlock itself and adding the full memory barriers separately. x86
is a special case here.

> 
> But, I don't know if this is even worthwhile, given where you are
> using it.

Given that it's used to modify the scheduler fast path, I try to keep it
as fast as possible.  But as you point out later in the thread, we might
just consider taking all the rq locks one after another when issuing a
sys_membarrier() instead. I did these modifications mostly to please
Peter who is concerned about the impact of taking these rq lock very
frequently in a DoS scenario (which you appropriately point out would
not be the first case in the kernel, and would actually generate a RoS
rather than DoS).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-02-01  7:28   ` Nick Piggin
@ 2010-02-01 14:10     ` Mathieu Desnoyers
  0 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 14:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Paul E. McKenney, Nicholas Miell, laijs, dipankar,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Nick Piggin (npiggin@suse.de) wrote:
> On Sun, Jan 31, 2010 at 03:52:55PM -0500, Mathieu Desnoyers wrote:
> > +/*
> > + * X86 spinlock-mb mappings. Use standard spinlocks with acquire/release
> > + * semantics. Associated memory barriers are defined as no-ops, because the
> > + * spinlock LOCK-prefixed atomic operations imply a full memory barrier.
> > + */
> > +
> > +#define spin_lock__no_acquire			spin_lock
> > +#define spin_unlock__no_release			spin_unlock
> > +
> > +#define spin_lock_irq__no_acquire		spin_lock_irq
> > +#define spin_unlock_irq__no_release		spin_unlock_irq
> > +
> > +#define raw_spin_lock__no_acquire		raw_spin_lock
> > +#define raw_spin_unlock__no_release		raw_spin_unlock
> > +
> > +#define raw_spin_lock_irq__no_acquire		raw_spin_lock_irq
> > +#define raw_spin_unlock_irq__no_release		raw_spin_unlock_irq
> > +
> > +#define smp_acquire__after_spin_lock()		do { } while (0)
> > +#define smp_release__before_spin_unlock()	do { } while (0)
> > +
> > +#define smp_mb__after_spin_lock()		do { } while (0)
> > +#define smp_mb__before_spin_unlock()		do { } while (0)
> 
> Oh, and that one's wrong. loads can pass spin_unlock on x86 so it
> needs to be smp_mb().
> 

Good catch !

#if defined(CONFIG_X86_32) && \
        (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
/*
 * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
 * (PPro errata 66, 92)
 */
# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
# define UNLOCK_LOCK_PREFIX
#endif

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 10:49           ` Nick Piggin
@ 2010-02-01 14:47             ` Mathieu Desnoyers
  2010-02-01 14:58               ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 14:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell,
	laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks,
	dhowells

* Nick Piggin (npiggin@suse.de) wrote:
> On Mon, Feb 01, 2010 at 11:36:01AM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote:
> > > All, but one at a time, no? How much of a DoS really is taking these
> > > locks for a handful of cycles each, per syscall?
> > 
> > I was more worrying about the cacheline trashing than lock hold times
> > there.
> 
> Well, same issue really. Look at all the unprived files in /proc
> for example that can look through all per-cpu cachelines. It just
> takes a single read syscall to do a lot of them too.
> 
>  
> > > I mean, we have LOTS of syscalls that take locks, and for a lot longer,
> > > (look at dcache_lock).
> > 
> > Yeah, and dcache is a massive pain, isn't it ;-)
> 
> My point is, I don't think it is something we can realistically
> care much about and it is nowhere near a new or unique problem
> being added by this one patch.
> 
> It is really a RoS, reduction of service, rather than a DoS. And
> any time we allow an unpriv user on our system, we have RoS potential :)
> 
>  
> > > I think we basically just have to say that locking primitives should be
> > > somewhat fair, and not be held for too long, it should more or less
> > > work.
> > 
> > Sure, it'll more of less work, but he's basically making rq->lock a
> > global lock instead of a per-cpu lock.
> > 
> > > If the locks are getting contended, then the threads calling
> > > sys_membarrier are going to be spinning longer too, using more CPU time,
> > > and will get scheduled away...
> > 
> > Sure, and increased spinning reduces the total throughput.
> > 
> > > If there is some particular problem on -rt because of the rq locks,
> > > then I guess you could consider whether to add more overhead to your
> > > ctxsw path to reduce the problem, or simply not support sys_membarrier
> > > for unprived users in the first place.
> > 
> > Right, for -rt we might need to do that, but its just that rq->lock is a
> > very hot lock, and adding basically unlimited trashing to it didn't seem
> > like a good idea.
> > 
> > Also, I'm thinking making it a priv syscall basically renders it useless
> > for Mathieu.
> 
> Well I just mean that it's something for -rt to work out. Apps can
> still work if the call is unsupported completely.

OK, so we seem to be settling for the spinlock-based sys_membarrier()
this time, which is much less intrusive in terms of scheduler
fast path modification, but adds more system overhead each time
sys_membarrier() is called. This trade-off makes sense to me, as we
expect the scheduler to execute _much_ more often than sys_membarrier().

When I get confirmation that's the route to follow from both of you,
I'll go back to the spinlock-based scheme for v9.

Thanks,

Mathieu

>  
> 
> > Anyway, it might be I'm just paranoid... but archs with large core count
> > and lazy tlb flush seem particularly vulnerable.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 14:47             ` Mathieu Desnoyers
@ 2010-02-01 14:58               ` Nick Piggin
  2010-02-01 15:23                 ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-02-01 14:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Linus Torvalds, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney, Nicholas Miell,
	laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks,
	dhowells

On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote:
> * Nick Piggin (npiggin@suse.de) wrote:
> > Well I just mean that it's something for -rt to work out. Apps can
> > still work if the call is unsupported completely.
> 
> OK, so we seem to be settling for the spinlock-based sys_membarrier()
> this time, which is much less intrusive in terms of scheduler
> fast path modification, but adds more system overhead each time
> sys_membarrier() is called. This trade-off makes sense to me, as we
> expect the scheduler to execute _much_ more often than sys_membarrier().
> 
> When I get confirmation that's the route to follow from both of you,
> I'll go back to the spinlock-based scheme for v9.

I think locking or cacheline bouncing DoS is just something we can't
realistically worry too much about in the standard kernel. No further
than just generally good practice of good scalability, avoiding
starvations and long lock hold times etc.

So I would prefer the simpler version that doesn't add overhead to
ctxsw, at least for the first implementation.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
  2010-02-01  7:25   ` Nick Piggin
  2010-02-01  7:28   ` Nick Piggin
@ 2010-02-01 15:22   ` Linus Torvalds
  2010-02-01 15:41     ` Steven Rostedt
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 15:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, 31 Jan 2010, Mathieu Desnoyers wrote:
>
> Create the primitive family:
> 
> spin_lock__no_acquire
> spin_unlock__no_release
> spin_lock_irq__no_acquire
> spin_unlock_irq__no_release

I really really hate this patch.

Locking is really hard to get right anyway, you already had one bug in 
this, and on the major architectures, none of this will matter _anyway_, 
since you cannot split the acquire from the lock and the release from the 
unlock.

So the only operation that actually makes any sense at all would be 
"smp_mb__after_spin_lock" (and no new spinlock primitives at all), since 
it can be optimized away on x86 (and then add "smp_mb__before_spin_unlock" 
just to make them symmetric). But even that one is very dubious:

  "The first user of these primitives is the scheduler code, where a full 
   memory barrier is wanted before and after runqueue data structure 
   modifications so these can be read safely by sys_membarrier without 
   holding the rq lock."

what kind of crazy model is this? That sounds insane. Locking together 
with some new random optimistic usage that we don't even know how 
performance-critical it is? Mixing locking and non-locking is simply 
wrong. Why would it be any safer to read whatever that the lock protects 
by adding smp barriers at the lock?

If you need other smp barriers at the lock, then what about the non-locked 
accesses _while_ the lock is held? You get no ordering guarantees there. 
The whole thing sounds highly dubious. 

And all of this for something that is a new system call that nobody 
actually uses? To optimize the new and experimental path with some insane 
lockfree model, while making the core kernel more complex?  A _very_ 
strong NAK from me.

			Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 14:58               ` Nick Piggin
@ 2010-02-01 15:23                 ` Steven Rostedt
  2010-02-01 15:44                   ` Steven Rostedt
  2010-02-01 16:00                   ` Mike Galbraith
  0 siblings, 2 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 15:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mathieu Desnoyers, Peter Zijlstra, Linus Torvalds, akpm,
	Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote:
> On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote:
> > * Nick Piggin (npiggin@suse.de) wrote:
> > > Well I just mean that it's something for -rt to work out. Apps can
> > > still work if the call is unsupported completely.
> > 
> > OK, so we seem to be settling for the spinlock-based sys_membarrier()
> > this time, which is much less intrusive in terms of scheduler
> > fast path modification, but adds more system overhead each time
> > sys_membarrier() is called. This trade-off makes sense to me, as we
> > expect the scheduler to execute _much_ more often than sys_membarrier().
> > 
> > When I get confirmation that's the route to follow from both of you,
> > I'll go back to the spinlock-based scheme for v9.
> 
> I think locking or cacheline bouncing DoS is just something we can't
> realistically worry too much about in the standard kernel. No further
> than just generally good practice of good scalability, avoiding
> starvations and long lock hold times etc.
> 
> So I would prefer the simpler version that doesn't add overhead to
> ctxsw, at least for the first implementation.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

;-)

-- Steve



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
  2010-02-01  7:33   ` Nick Piggin
@ 2010-02-01 15:27   ` Linus Torvalds
  2010-02-01 16:09     ` Mathieu Desnoyers
  2010-02-01 16:11     ` Steven Rostedt
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 15:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, 31 Jan 2010, Mathieu Desnoyers wrote:
> 
> Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> lock/unlock already imply a full memory barrier.

.. and as Nick pointed out, you're fundamentally incorrect on this. 
unlock on x86 is no memory barrier at all, since the x86 memory ordering 
rules are such that a regular store always has release consistency.

But more importantly, you don't even explain why the addded smp_mb() 
helps.

Why does a smp_mb() at the lock/unlock even matter? Reading accesses by 
the same CPU sure as hell do _not_ matter, so the whole concept seems 
totally broken. There is no way in _hell_ that whatever unlocked thing 
can ever write the variables protected by the lock, only read them. So a 
full memory barrier makes zero sense to begin with.

So what are these magical memory barriers all about?

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 1/3] Create spin lock/spin unlock with distinct memory barrier
  2010-02-01 15:22   ` Linus Torvalds
@ 2010-02-01 15:41     ` Steven Rostedt
  0 siblings, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 07:22 -0800, Linus Torvalds wrote:
> 

> If you need other smp barriers at the lock, then what about the non-locked 
> accesses _while_ the lock is held? You get no ordering guarantees there. 
> The whole thing sounds highly dubious. 

The issues is not about protecting data, it was all about ordering an
update of a variable (mm_cpumask) with respect to scheduling. The lock
was just a convenient place to add this protection. The memory barriers
here would allow the syscall to use memory barriers instead of locks.

> 
> And all of this for something that is a new system call that nobody 
> actually uses? To optimize the new and experimental path with some insane 
> lockfree model, while making the core kernel more complex?  A _very_ 
> strong NAK from me.

I totally agree with this. The updates here were from the fear of
grabbing all rq spinlocks (one at a time) called by a syscall would open
up a DoS (or as Nick said RoS - Reduction of Service). If someone called
this syscall within a while(1) loop on some large # CPU box, it could
cause cache thrashing.

But this is all being paranoid, and not worth the complexity in the core
scheduler. We don't even know if this fear is founded.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 15:23                 ` Steven Rostedt
@ 2010-02-01 15:44                   ` Steven Rostedt
  2010-02-01 16:00                   ` Mike Galbraith
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 15:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mathieu Desnoyers, Peter Zijlstra, Linus Torvalds, akpm,
	Ingo Molnar, linux-kernel, KOSAKI Motohiro, Paul E. McKenney,
	Nicholas Miell, laijs, dipankar, josh, dvhltc, niv, tglx,
	Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 10:23 -0500, Steven Rostedt wrote:
> On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote:
> > On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote:
> > > * Nick Piggin (npiggin@suse.de) wrote:
> > > > Well I just mean that it's something for -rt to work out. Apps can
> > > > still work if the call is unsupported completely.
> > > 
> > > OK, so we seem to be settling for the spinlock-based sys_membarrier()
> > > this time, which is much less intrusive in terms of scheduler
> > > fast path modification, but adds more system overhead each time
> > > sys_membarrier() is called. This trade-off makes sense to me, as we
> > > expect the scheduler to execute _much_ more often than sys_membarrier().
> > > 
> > > When I get confirmation that's the route to follow from both of you,
> > > I'll go back to the spinlock-based scheme for v9.
> > 
> > I think locking or cacheline bouncing DoS is just something we can't
> > realistically worry too much about in the standard kernel. No further
> > than just generally good practice of good scalability, avoiding
> > starvations and long lock hold times etc.
> > 
> > So I would prefer the simpler version that doesn't add overhead to
> > ctxsw, at least for the first implementation.
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>

I'm in agreement with Nick on this issue. Peter, I think you are being a
bit paranoid here. But that's a good thing. Anyone working in the
scheduler must be paranoid. That's what brings out issues that people
normally do no think about. :-)

-- Steve



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 15:23                 ` Steven Rostedt
  2010-02-01 15:44                   ` Steven Rostedt
@ 2010-02-01 16:00                   ` Mike Galbraith
  1 sibling, 0 replies; 41+ messages in thread
From: Mike Galbraith @ 2010-02-01 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: Nick Piggin, Mathieu Desnoyers, Peter Zijlstra, Linus Torvalds,
	akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 10:23 -0500, Steven Rostedt wrote:
> On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote:

> > So I would prefer the simpler version that doesn't add overhead to
> > ctxsw, at least for the first implementation.
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>
> 
> ;-)

Yup, fast path needs to _lose_ some weight.

	-Mike


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 15:27   ` Linus Torvalds
@ 2010-02-01 16:09     ` Mathieu Desnoyers
  2010-02-01 16:23       ` Linus Torvalds
  2010-02-01 16:24       ` Steven Rostedt
  2010-02-01 16:11     ` Steven Rostedt
  1 sibling, 2 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 16:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sun, 31 Jan 2010, Mathieu Desnoyers wrote:
> > 
> > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > lock/unlock already imply a full memory barrier.
> 
> .. and as Nick pointed out, you're fundamentally incorrect on this. 
> unlock on x86 is no memory barrier at all, since the x86 memory ordering 
> rules are such that a regular store always has release consistency.
> 
> But more importantly, you don't even explain why the addded smp_mb() 
> helps.
> 
> Why does a smp_mb() at the lock/unlock even matter? Reading accesses by 
> the same CPU sure as hell do _not_ matter, so the whole concept seems 
> totally broken. There is no way in _hell_ that whatever unlocked thing 
> can ever write the variables protected by the lock, only read them. So a 
> full memory barrier makes zero sense to begin with.
> 
> So what are these magical memory barriers all about?
> 
> 		Linus

To show the ordering required by sys_membarrier, let's first express
sys_membarrier() succinctly:

  smp_mb();
  rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
  for_each_cpu(cpu, mm_cpumask(current->mm))
    if (current->mm == cpu_curr(cpu)->mm)
      smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
  rcu_read_unlock();
  smp_mb();

We need to guarantee that between the two memory barriers issued by
sys_membarrier(), we run a membarrier_ipi on every other running thread
belonging to the same process.

Where we have to be careful here is that the context switch does not
imply memory barriers both before and after:

- mm_cpumask update
- rq->cur update

Not having memory barriers between user-space instruction execution and
these updates performed by the scheduler leaves room for a race between
sys_membarrier() and the scheduler, where we would not deliver a
membarrier_ipi to a thread which either starts running or is about to be
context switched.

One option is to surround the context switch by memory barriers. If we
choose not to do that, we are left with these solutions:

We can deal with the rq->cur update by holding the rq lock in each
iteration of the for_each_cpu(cpu, mm_cpumask(current->mm)) loop. This
ensures that if rq->cur is updated, we have an associated memory barrier
issued (e.g. on x86, implied by writing to cr3 while the rq lock is held).

However, this does not deal with mm_cpumask update, and we cannot use
the per-cpu rq lock, as it's a process-wide data structure updated with
clear_bit/set_bit in switch_mm(). So at the very least, we would have to
add memory barriers in switch_mm() on some architectures to deal with
this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 15:27   ` Linus Torvalds
  2010-02-01 16:09     ` Mathieu Desnoyers
@ 2010-02-01 16:11     ` Steven Rostedt
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 16:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 07:27 -0800, Linus Torvalds wrote:

> So what are these magical memory barriers all about?

Mathieu is implementing userspace RCU. In order to make the
rcu_read_locks() fast, they can not be calling memory barriers. What is
needed is on the synchronize_rcu() the writer has to force a mb() on all
CPUs running one of the readers.

The first simple approach that Mathieu made, was to simply send an IPI
to all CPUs and force the mb() to be made. But this lets one process
interfere with other processes needlessly. And us Real-Time folks balked
at the idea since it would allow any process to mess with the running of
a real-time thread.

The next approach was to use the mm_cpumask of the thread and only send
IPIs to the CPUs that are running the thread. But there's a race between
the update of the mm_cpumask and the scheduling of the task. If we send
an IPI to a CPU that is not running the process's thread, it may cause a
little interference with the other thread but nothing to worry about.

The issue is if we miss sending to a process's thread. Then the reader
could be accessing a stale pointer that the writer is modifying after
the userspace synchronize_rcu() call.

The taking of the rq locks was a way to make sure that the update of the
mm_cpumask and the scheduling is in sync. And we know that we are
sending an IPI that is running the process's thread and not missing any
other ones.

But all this got a bit ugly when we tried to avoid grabbing the run
queue locks in the loop to send out IPIs. 

Note, I believe that x86 is not affected, since the act of doing the
schedule is in itself a mb(). But this may not be the case on all archs.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:09     ` Mathieu Desnoyers
@ 2010-02-01 16:23       ` Linus Torvalds
  2010-02-01 16:48         ` Mathieu Desnoyers
  2010-02-01 16:24       ` Steven Rostedt
  1 sibling, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 16:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> 
> However, this does not deal with mm_cpumask update, and we cannot use
> the per-cpu rq lock, as it's a process-wide data structure updated with
> clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> add memory barriers in switch_mm() on some architectures to deal with
> this.

I'd much rather have a "switch_mm()" is a guaranteed memory barrier logic, 
because quite frankly, I don't see how it ever couldn't be one anyway. It 
fundamentally needs to do at least a TLB context switch (which may be just 
switching an ASI around, not flushing the whole TLB, of course), and I bet 
that for 99% of all architectures, that is already pretty much guaranteed 
to be equivalent to a memory barrier.

It certainly is for x86. "mov to cr0" is serializing (setting any control 
register except cr8 is serializing). And I strongly suspect other 
architectures will be too.

Btw, one reason to strongly prefer "switch_mm()" over any random context 
switch is that at least it won't affect inter-thread (kernel or user-land) 
switching, including switching to/from the idle thread.

So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always 
implies a memory barrier" model than to playing clever games with 
spinlocks.

			Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:09     ` Mathieu Desnoyers
  2010-02-01 16:23       ` Linus Torvalds
@ 2010-02-01 16:24       ` Steven Rostedt
  2010-02-01 16:29         ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 16:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 11:09 -0500, Mathieu Desnoyers wrote:

> We can deal with the rq->cur update by holding the rq lock in each
> iteration of the for_each_cpu(cpu, mm_cpumask(current->mm)) loop. This
> ensures that if rq->cur is updated, we have an associated memory barrier
> issued (e.g. on x86, implied by writing to cr3 while the rq lock is held).
> 
> However, this does not deal with mm_cpumask update, and we cannot use
> the per-cpu rq lock, as it's a process-wide data structure updated with
> clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> add memory barriers in switch_mm() on some architectures to deal with
> this.
> 


Doesn't set_bit imply a wmb()? If so couldn't we do:

What about:

again:
  tmp_mask = mm_cpumask(current->mm);
  smp_mb();
   rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
   for_each_cpu(cpu, tmp_mask) {
     spin_lock_irq(&cpu_rq(cpu)->lock);
     ret = current->mm == cpu_curr(cpu)->mm;
     spin_unlock_irq(&cpu_rq(cpu)->lock);
     if (ret)
       smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
   }
   rcu_read_unlock();
   smp_mb();
   if (tmp_mask != mm_cpumask(current->mm)) {
      /* do check for signals here */
      goto again;
   }

Would the above work?

-- Steve



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:24       ` Steven Rostedt
@ 2010-02-01 16:29         ` Peter Zijlstra
  2010-02-01 16:46           ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2010-02-01 16:29 UTC (permalink / raw)
  To: rostedt
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell,
	laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 11:24 -0500, Steven Rostedt wrote:
> 
> Doesn't set_bit imply a wmb()? If so couldn't we do:

Nope, that's what we have smp_mb__{before,after}_set_bit() for.

On x86 its a LOCK'ed op, so sure it implies a full membarrier, but not
in generic.

on x86 we have plenty serializing instructions before and after rq->curr
is set so none of the crazyness is needed at all. The only thing is !
x86.




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:29         ` Peter Zijlstra
@ 2010-02-01 16:46           ` Steven Rostedt
  0 siblings, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Linus Torvalds, akpm, Ingo Molnar,
	linux-kernel, KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell,
	laijs, dipankar, josh, dvhltc, niv, tglx, Valdis.Kletnieks,
	dhowells, linux-arch

[ Added linux-arch ]

On Mon, 2010-02-01 at 17:29 +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 11:24 -0500, Steven Rostedt wrote:
> > 
> > Doesn't set_bit imply a wmb()? If so couldn't we do:
> 
> Nope, that's what we have smp_mb__{before,after}_set_bit() for.
> 
> On x86 its a LOCK'ed op, so sure it implies a full membarrier, but not
> in generic.
> 
> on x86 we have plenty serializing instructions before and after rq->curr
> is set so none of the crazyness is needed at all. The only thing is !
> x86.
> 

So if we go with Linus's approach and make all archs guarantee that
switch_mm() implies a full mb(), then we can simplify the
sys_membarrier() code?

That looks like the best approach.

-- Steve




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:23       ` Linus Torvalds
@ 2010-02-01 16:48         ` Mathieu Desnoyers
  2010-02-01 16:56           ` Linus Torvalds
  2010-02-01 17:13           ` Steven Rostedt
  0 siblings, 2 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> > 
> > However, this does not deal with mm_cpumask update, and we cannot use
> > the per-cpu rq lock, as it's a process-wide data structure updated with
> > clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> > add memory barriers in switch_mm() on some architectures to deal with
> > this.
> 
> I'd much rather have a "switch_mm()" is a guaranteed memory barrier logic, 
> because quite frankly, I don't see how it ever couldn't be one anyway. It 
> fundamentally needs to do at least a TLB context switch (which may be just 
> switching an ASI around, not flushing the whole TLB, of course), and I bet 
> that for 99% of all architectures, that is already pretty much guaranteed 
> to be equivalent to a memory barrier.
> 
> It certainly is for x86. "mov to cr0" is serializing (setting any control 
> register except cr8 is serializing). And I strongly suspect other 
> architectures will be too.

What we have to be careful about here is that it's not enough to just
rely on switch_mm() containing a memory barrier. What we really need to
enforce is that switch_mm() issues memory barriers both _before_ and
_after_ mm_cpumask modification. The "after" part is usually dealt with
by the TLB context switch, but the "before" part usually isn't.

> 
> Btw, one reason to strongly prefer "switch_mm()" over any random context 
> switch is that at least it won't affect inter-thread (kernel or user-land) 
> switching, including switching to/from the idle thread.
> 
> So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always 
> implies a memory barrier" model than to playing clever games with 
> spinlocks.

If we really want to make this patch less intrusive, we can consider
iterating on each online cpu in sys_membarrier() rather than on the
mm_cpumask. But it comes at the cost of useless cache-line bouncing on
large machines with few threads running in the process, as we would grab
the rq locks one by one for all cpus.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:48         ` Mathieu Desnoyers
@ 2010-02-01 16:56           ` Linus Torvalds
  2010-02-01 17:45             ` Mathieu Desnoyers
  2010-02-01 17:13           ` Steven Rostedt
  1 sibling, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 16:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> 
> What we have to be careful about here is that it's not enough to just
> rely on switch_mm() containing a memory barrier. What we really need to
> enforce is that switch_mm() issues memory barriers both _before_ and
> _after_ mm_cpumask modification. The "after" part is usually dealt with
> by the TLB context switch, but the "before" part usually isn't.

Ok, whatever. I vote for not doing anything at all, because this just all 
sounds like some totally crazy crap. You haven't explained the actual 
races, you haven't explained anything at all, you're apparently just 
randomly sprinkling smp_mb's around until you think it's all fine.

Show the actual memory ordering constraints as it is related to the kernel 
data structures. I'm totally uninterested in general handwaving and "we 
must have smp_mb's here and here" without very explicit explanations of 
exactly WHAT the memory orderings are.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:48         ` Mathieu Desnoyers
  2010-02-01 16:56           ` Linus Torvalds
@ 2010-02-01 17:13           ` Steven Rostedt
  2010-02-01 17:34             ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 17:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 11:48 -0500, Mathieu Desnoyers wrote:

> What we have to be careful about here is that it's not enough to just
> rely on switch_mm() containing a memory barrier. What we really need to
> enforce is that switch_mm() issues memory barriers both _before_ and
> _after_ mm_cpumask modification. The "after" part is usually dealt with
> by the TLB context switch, but the "before" part usually isn't.

Then we add a smp_mb__before_clear_bit() in the switch_mm() on all archs
that do not have clear_bit imply a smp_mb().

> 
> > 
> > Btw, one reason to strongly prefer "switch_mm()" over any random context 
> > switch is that at least it won't affect inter-thread (kernel or user-land) 
> > switching, including switching to/from the idle thread.
> > 
> > So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always 
> > implies a memory barrier" model than to playing clever games with 
> > spinlocks.
> 
> If we really want to make this patch less intrusive, we can consider
> iterating on each online cpu in sys_membarrier() rather than on the
> mm_cpumask. But it comes at the cost of useless cache-line bouncing on
> large machines with few threads running in the process, as we would grab
> the rq locks one by one for all cpus.

I still think modifying the switch_mm() is better than the full
iteration.

-- Steve



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 17:13           ` Steven Rostedt
@ 2010-02-01 17:34             ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 17:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 1 Feb 2010, Steven Rostedt wrote:

> On Mon, 2010-02-01 at 11:48 -0500, Mathieu Desnoyers wrote:
> 
> > What we have to be careful about here is that it's not enough to just
> > rely on switch_mm() containing a memory barrier. What we really need to
> > enforce is that switch_mm() issues memory barriers both _before_ and
> > _after_ mm_cpumask modification. The "after" part is usually dealt with
> > by the TLB context switch, but the "before" part usually isn't.
> 
> Then we add a smp_mb__before_clear_bit() in the switch_mm() on all archs
> that do not have clear_bit imply a smp_mb().

No. I want to get an _explanation_ first.

And no, "I want to do user-level RCU and I'm looking at mm_cpumask" is not 
an explanation.

In order to explain memory ordering rules, you need to actually show the 
actual accesses on the different cpu's and the ordering requirements 
between them, and explain them. IOW, you need to show

	thread1			thread2
	-------			-------

	some-particular-memop
				mm_cpumask access
	memory barrier
				memory barrier
	mm_cpumask access
				some-particular-memop	
	memory barrier

	some-other-particular-memop

and explain exactly why the memory barriers are needed.

In particular, now I've heard the INSANE statement that we need

  "memory barriers both _before_ and _after_ mm_cpumask modification."

and people need to realize that such statements are totally worthless. 

The fact is, a memory barrier with regards to a single location 
modification makes no sense. Not before, not after. Putting a barrier 
around a single access (even if that "single" access is then a 
read-modify-read one) is a meaningless operation - it has _zero_ semantic 
information.

Memory barriers need to be _between_ two operations, and even then they 
never make sense on their own - you need to have another actor that also 
has a memory barrier, and that you are ordering things with regards to.

Saying that they are "around" one operation is crazy talk. It's a 
nonsensical statement. It shows a total lack of understanding of what a 
barrier is about. You can't put a barrier 'around' anything at all.

So before we go any further, I want to see the graph of barriers AND THE 
THINGS THEY ARE BARRIERS BETWEEN. On both sides. I want to know that you 
guys even know _what_ you are protecting against, and I want it documented 
so that when people say "this would solve it", they can show that yes, you 
actually need barriers at both points.

			Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 16:56           ` Linus Torvalds
@ 2010-02-01 17:45             ` Mathieu Desnoyers
  2010-02-01 18:00               ` Steven Rostedt
  2010-02-01 18:36               ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> > 
> > What we have to be careful about here is that it's not enough to just
> > rely on switch_mm() containing a memory barrier. What we really need to
> > enforce is that switch_mm() issues memory barriers both _before_ and
> > _after_ mm_cpumask modification. The "after" part is usually dealt with
> > by the TLB context switch, but the "before" part usually isn't.
> 
> Ok, whatever. I vote for not doing anything at all, because this just all 
> sounds like some totally crazy crap. You haven't explained the actual 
> races, you haven't explained anything at all, you're apparently just 
> randomly sprinkling smp_mb's around until you think it's all fine.
> 
> Show the actual memory ordering constraints as it is related to the kernel 
> data structures. I'm totally uninterested in general handwaving and "we 
> must have smp_mb's here and here" without very explicit explanations of 
> exactly WHAT the memory orderings are.

It's probably worthwhile to summarize here the mb-related discussions we
had in the previous RFC rounds.

Here is the detailed execution scenario showing the race. The race with
unlock showing that we need to order user-space loads before mm_cpumask
store. This is showing an execution sequence involving the userspace RCU
library and the Linux kernel with sys_membarrier(). As we see below, the
lack of ordering between "cpumask_clear" and user-space execution
creates a race with the scheduler where sys_membarrier() incorrectly
considers CPU 1 as not needing a membarrier_ipi.


        CPU 0                   CPU 1
        --------------          --------------
        <user space>            <user space> (already in read-side C.S.)
                                obj = rcu_dereference(list->next);
                                  -> load list->next
                                copy = obj->foo;
                                rcu_read_unlock();
                                  -> store to urcu_reader.ctr
                                <urcu_reader.ctr store is globally visible>
        list_del(obj);
                                <preempt>
                                <kernel space>

                                schedule();
                                cpumask_clear(mm_cpumask, cpu);

        sys_membarrier();
          <kernel space>
          smp_mb()
          iterates on mm_cpumask
          smp_mb()
        <user space>
        set global g.p. (urcu_gp_ctr) phase to 1
        wait for all urcu_reader.ctr in phase 0
        set global g.p. (urcu_gp_ctr) phase to 0
        wait for all urcu_reader.ctr in phase 1
        sys_membarrier();
          <kernel space>
          smp_mb()
          iterates on mm_cpumask
          smp_mb()
        <user space>
        free(obj);
                                <list->next load hits memory>
                                <obj->foo load hits memory> <- corruption


There is a similar race scenario (symmetric to the one above) between
cpumask_set and a following user-space rcu_read_lock(), which I could
provide upon request.

This results in the following comments around smp_mb() in
sys_membarrier:

First sys_membarrier smp_mb():

        /*
         * Memory barrier on the caller thread before sending first
         * IPI. Orders all memory accesses prior to sys_membarrier() before
         * mm_cpumask read and membarrier_ipi executions.
         * Matches memory barriers before and after mm_cpumask updates.
         */

Second sys_membarrier smp_mb():

        /*
         * Memory barrier on the caller thread after we finished
         * waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
         * executions before memory accesses following sys_membarrier()
         * execution.
         * Matches memory barriers before and after mm_cpumask updates.
         */

Thanks,

Mathieu


> 
> 		Linus

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 17:45             ` Mathieu Desnoyers
@ 2010-02-01 18:00               ` Steven Rostedt
  2010-02-01 18:36               ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 18:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-02-01 at 12:45 -0500, Mathieu Desnoyers wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:

> It's probably worthwhile to summarize here the mb-related discussions we
> had in the previous RFC rounds.

Actually it is probably worthwhile to include this explanation in the
change log. Can't expect everyone to have read a kernel thread that
contains hundreds of replies.

> 
> Here is the detailed execution scenario showing the race. The race with
> unlock showing that we need to order user-space loads before mm_cpumask
> store. This is showing an execution sequence involving the userspace RCU
> library and the Linux kernel with sys_membarrier(). As we see below, the
> lack of ordering between "cpumask_clear" and user-space execution
> creates a race with the scheduler where sys_membarrier() incorrectly
> considers CPU 1 as not needing a membarrier_ipi.
> 
> 
>         CPU 0                   CPU 1
>         --------------          --------------
>         <user space>            <user space> (already in read-side C.S.)
>                                 obj = rcu_dereference(list->next);
>                                   -> load list->next
>                                 copy = obj->foo;
>                                 rcu_read_unlock();
>                                   -> store to urcu_reader.ctr
>                                 <urcu_reader.ctr store is globally visible>
>         list_del(obj);
>                                 <preempt>
>                                 <kernel space>
> 
>                                 schedule();
>                                 cpumask_clear(mm_cpumask, cpu);
> 
>         sys_membarrier();
>           <kernel space>
>           smp_mb()
>           iterates on mm_cpumask
>           smp_mb()
>         <user space>
>         set global g.p. (urcu_gp_ctr) phase to 1
>         wait for all urcu_reader.ctr in phase 0
>         set global g.p. (urcu_gp_ctr) phase to 0
>         wait for all urcu_reader.ctr in phase 1
>         sys_membarrier();
>           <kernel space>
>           smp_mb()
>           iterates on mm_cpumask
>           smp_mb()
>         <user space>
>         free(obj);
>                                 <list->next load hits memory>
>                                 <obj->foo load hits memory> <- corruption
> 
> 
> There is a similar race scenario (symmetric to the one above) between
> cpumask_set and a following user-space rcu_read_lock(), which I could
> provide upon request.

If it would help others to understand, it would be worthwhile in
explaining that too. So, I'm requesting it ;-)

> 
> This results in the following comments around smp_mb() in
> sys_membarrier:
> 
> First sys_membarrier smp_mb():
> 
>         /*
>          * Memory barrier on the caller thread before sending first
>          * IPI. Orders all memory accesses prior to sys_membarrier() before
>          * mm_cpumask read and membarrier_ipi executions.
>          * Matches memory barriers before and after mm_cpumask updates.
>          */

The above comment does not really explain what it is protecting. It just
states that it serializes memory access. Well, duh, that's what smp_mb()
does ;-)  It's the equivalent to i++; /* increment i */

Basically, the comment should explain "why" the memory barrier is
needed.

> 
> Second sys_membarrier smp_mb():
> 
>         /*
>          * Memory barrier on the caller thread after we finished
>          * waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
>          * executions before memory accesses following sys_membarrier()
>          * execution.
>          * Matches memory barriers before and after mm_cpumask updates.
>          */

Ditto.

-- Steve



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 17:45             ` Mathieu Desnoyers
  2010-02-01 18:00               ` Steven Rostedt
@ 2010-02-01 18:36               ` Linus Torvalds
  2010-02-01 19:56                 ` Mathieu Desnoyers
  2010-02-01 20:33                 ` Steven Rostedt
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 18:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> 
> Here is the detailed execution scenario showing the race.

No. You've added random smp_mb() calls, but you don't actually show what 
the f*ck they are protecting against.

For example

> First sys_membarrier smp_mb():

I'm not AT ALL interested in the sys_membarrier() parts. You can hav ea 
million memory barriers there, and I won't care. I'm interested in what 
you think the memory barriers elsewhere protect against. It's a barrier 
between _which_ two operations? 

You can't say it's a barrier "around" the

	cpumask_clear(mm_cpumask, cpu);

because a barrier is between two things. So if you want to add two 
barriers around that mm_cpumask acces, you need to describe the _three_ 
events you're barriers between in that call-path (with mm_cpumask being 
just one of them)

And then, once you've described _those_ three events, you describe what 
the sys_membarrier interaction is, and how mm_cpumask is involved there.

I'm not interested in the user-space code. Don't even quote it. It's 
irrelevant apart from the actual semantics you want to guarantee for the 
new membarrier() system call. So don't quote the code, just explain what 
the actual barriers are.

			Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 18:36               ` Linus Torvalds
@ 2010-02-01 19:56                 ` Mathieu Desnoyers
  2010-02-01 20:42                   ` Linus Torvalds
  2010-02-01 20:33                 ` Steven Rostedt
  1 sibling, 1 reply; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> > 
> > Here is the detailed execution scenario showing the race.
> 
> No. You've added random smp_mb() calls, but you don't actually show what 
> the f*ck they are protecting against.
> 
> For example
> 
> > First sys_membarrier smp_mb():
> 
> I'm not AT ALL interested in the sys_membarrier() parts. You can hav ea 
> million memory barriers there, and I won't care. I'm interested in what 
> you think the memory barriers elsewhere protect against. It's a barrier 
> between _which_ two operations? 
> 
> You can't say it's a barrier "around" the
> 
> 	cpumask_clear(mm_cpumask, cpu);
> 
> because a barrier is between two things. So if you want to add two 
> barriers around that mm_cpumask acces, you need to describe the _three_ 
> events you're barriers between in that call-path (with mm_cpumask being 
> just one of them)
> 
> And then, once you've described _those_ three events, you describe what 
> the sys_membarrier interaction is, and how mm_cpumask is involved there.
> 
> I'm not interested in the user-space code. Don't even quote it. It's 
> irrelevant apart from the actual semantics you want to guarantee for the 
> new membarrier() system call. So don't quote the code, just explain what 
> the actual barriers are.
> 

The two event pairs we are looking at are:

Pair 1)

* memory accesses (load/stores) performed by user-space thread before
  context switch.
* cpumask_clear_cpu(cpu, mm_cpumask(prev));

Pair 2)

* cpumask_set_cpu(cpu, mm_cpumask(next));
* memory accessses (load/stores) performed by user-space thread after
  context switch.

I can see two ways to add memory barriers in switch_mm that would
provide ordering for these two memory access pairs:

Either A)

switch_mm()
  smp_mb__before_clear_bit();
  cpumask_clear_cpu(cpu, mm_cpumask(prev));
  cpumask_set_cpu(cpu, mm_cpumask(next));
  smp_mb__after_set_bit();

or B)

switch_mm()
  cpumask_set_cpu(cpu, mm_cpumask(next));
  smp_mb__before_clear_bit();
  cpumask_clear_cpu(cpu, mm_cpumask(prev));

(B) seems like a clear win, as we get the ordering right for both pairs
with a single memory barrier, but I don't know if changing the set/clear
bit order could have nasty side-effects on other mm_cpumask users.

sys_membarrier uses the mm_cpumask to iterate on all CPUs on which the
current process's mm is in use, so it can issue a smp_mb() through an
IPI on all CPUs that need it. Without appropriate ordering of pairs 1-2
detailed above, we could miss a CPU that actually needs a memory
barrier.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 18:36               ` Linus Torvalds
  2010-02-01 19:56                 ` Mathieu Desnoyers
@ 2010-02-01 20:33                 ` Steven Rostedt
  2010-02-01 20:52                   ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 10:36 -0800, Linus Torvalds wrote:
> 

> I'm not interested in the user-space code. Don't even quote it. It's 
> irrelevant apart from the actual semantics you want to guarantee for
> the 
> new membarrier() system call. So don't quote the code, just explain
> what 
> the actual barriers are.

OK, but first we must establish that the sys_membarrier() system call
guarantees that all running threads of this process have an mb()
performed on them before this syscall returns.

The simplest implementation would be to just do an IPI on all CPUs and
have that IPI perform the mb(). But that would interfere with other
tasks, so we want to limit it to only sending the mb()'s to the threads
of the process that are currently running. We use the mm_cpumask to find
out what threads are associated with this task, and only send the IPI to
the CPUs running threads of the current process.

With the kernel point of view, the goal is to make sure a mb() happens
on all running threads of the calling process.

The code does the following:

	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		if (current->mm == cpu_curr(cpu)->mm)
			send_ipi();
	}

But a race exists between the reading of the mm_cpumask and sending the
IPI. There is in fact two different problems with this race. One is that
a thread scheduled away, but never issued an mb(), the other is that a
running task just came in and we never saw it.

Here:

	   CPU 0		   CPU 1
	-----------		-----------
				< same thread >
				schedule()
				clear_bit();

	current->mm == cpu_curr(1)->mm <<< failed
	return sys_membarrier();

				context_switch();

The above fails the situation, because we missed our thread before it
actually switched to another task. This fails the guarantee that the
syscall sys_membarrier() implies.

Second scenario, for non-x86 archs that do not imply a mb() on
switch_mm():

	   CPU 0		   CPU 1
	-----------		-----------
				< different thread >
				schedule();
				clear_bit();
				set_bit();
				schedule();
				< same thread >

	sys_membarrier();
	current->mm == cpu_curr(1)->mm <<<<< failed

This scenario happens if the switch_mm() does not imply a mb(). That is,
the syscall sys_membarrier() was called after CPU 1 scheduled a thread
of the same process, but the switch_mm() did not force the mb() causing
CPU 0 to see the old value of the mm_cpumask.

The above does not take any user-space into account. It only tries to
fulfill the kernel's obligation of sys_membarrier to ensure that all
threads of the calling process has an mb() performed on them.

Mathieu, from this point of view, you can explain the necessary mb()s
that are within the kernel proper.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 19:56                 ` Mathieu Desnoyers
@ 2010-02-01 20:42                   ` Linus Torvalds
  2010-02-01 22:42                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 20:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> 
> The two event pairs we are looking at are:
> 
> Pair 1)
> 
> * memory accesses (load/stores) performed by user-space thread before
>   context switch.
> * cpumask_clear_cpu(cpu, mm_cpumask(prev));
> 
> Pair 2)
>
> * cpumask_set_cpu(cpu, mm_cpumask(next));
> * memory accessses (load/stores) performed by user-space thread after
>   context switch.

So explain why does that smp_mb() in between the two _help_?

The user of this will do a

	for_each_cpu(mm_cpumask)
		send_IPI(cpu, smp_mb);

but that's not an atomic op _anyway_. So you're reading mm_cpumask 
somewhere earlier, and doing the send_IPI later. So look at the whole 
scenario 2:

	cpumask_set_cpu(cpu, mm_cpumask(next));
	memory accessses performed by user-space

and think about it from the perspective of another CPU. What does an 
smp_mb() in between the two do?

I'll tell you - it does NOTHING. Because it doesn't matter. I see no 
possible way another CPU can care, because let's assume that the other CPU 
is doing that

	for_each_cpu(mm_cpumask)
		send_ipi(smp_mb);

and you have to realize that the other CPU needs to read that mm_cpumask 
early in order to do that.

So you have this situation:

	CPU1			CPU2
	----			----

	cpumask_set_cpu
				read mm_cpumask
	smp_mb
				smp_mb
	user memory accessses
				send_ipi

and exactly _what_ is that "smp_mb" on CPU1 protecting against?

Realize that CPU2 is not ordered (because you wanted to avoid the 
locking), so the "read mm_cpumask" can happen before or after that 
cpumask_set_cpu. And it can happen before or after REGARDLESS of that 
smp_mb. The smp_mb doesn't make any difference to CPU2 that I can see. 

So the question becomes one of "How can CPU2 care about whether CPU1 is in 
the mask"? Considering that CPU2 doesn't do any locking, I don't see any 
way you can get a "consistent" CPU mask _regardless_ of any smp_mb's in 
there. When it does the "read mm_cpumask()" it might get the value 
_before_ the cpumask_set_cpu, and it might get the value _after_, and 
that's true regardless of whether there is a smp_mb there or not. 

See what I'm asking for? I'm asking for why it matters that we have a 
memory barrier, and why that mm_cpumask is so magical that _that_ access 
matters so much. 

Maybe I'm dense. But If somebody puts memory barriers in the code, I want 
to know exactly what the reason for the barrier is. Memory ordering is too 
subtle and non-intuitive to go by gut feel.

				Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 20:33                 ` Steven Rostedt
@ 2010-02-01 20:52                   ` Linus Torvalds
  2010-02-01 22:39                     ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2010-02-01 20:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 1 Feb 2010, Steven Rostedt wrote:
> 
> But a race exists between the reading of the mm_cpumask and sending the
> IPI. There is in fact two different problems with this race. One is that
> a thread scheduled away, but never issued an mb(), the other is that a
> running task just came in and we never saw it.

I get it. But the thing I object to here is that Mathieu claims that we 
need _two_ memory barriers in the switch_mm() code.

And I'm still not seeing it.

You claim that the rule is that "you have to do a mb on all threads", and 
that there is a race if a threads switches away just as we're about to do 
that.

Fine. But why _two_? And what's so magical about the mm_cpumask that it 
needs to be around it?

If the rule is that we do a memory barrier as we switch an mm, then why 
does that single one not just handle it? Either the CPU kept running that 
mm (and the IPI will do the memory barrier), or the CPU didn't (and the 
switch_mm had a memory barrier).

Without locking, I don't see how you can really have any stronger 
guarantees, and as per my previous email, I don't see what the smp_mb() 
around mm_cpumask accesses help - because the other CPU is still not going 
to atomically "see the mask and IPI". It's going to see one value or the 
other, and the smp_mb() around the access doesn't seem to have anything to 
do with which value it sees.

So I can kind of understand the "We want to guarantee that switching MM's 
around wants to be a memory barrier". Quite frankly, I haven't though even 
that through entirely, so who knows... But the "we need to have memory 
barriers on both sides of the bit setting/clearing" I don't get. 

IOW, show me why that cpumask is _so_ important that the placement of the 
memory barriers around it matters, to the point where you want to have it 
on both sides.

Maybe you've really thought about this very deeply, but the explanations 
aren't getting through to me. Educate me.

			Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 20:52                   ` Linus Torvalds
@ 2010-02-01 22:39                     ` Steven Rostedt
  2010-02-01 23:09                       ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 12:52 -0800, Linus Torvalds wrote:
> 
> On Mon, 1 Feb 2010, Steven Rostedt wrote:
> > 
> > But a race exists between the reading of the mm_cpumask and sending the
> > IPI. There is in fact two different problems with this race. One is that
> > a thread scheduled away, but never issued an mb(), the other is that a
> > running task just came in and we never saw it.
> 
> I get it. But the thing I object to here is that Mathieu claims that we 
> need _two_ memory barriers in the switch_mm() code.

Note, on x86, we don't need two memory barriers, because the cr3 load
acts as one, and the set_bit() acts as the other.

> 
> And I'm still not seeing it.
> 
> You claim that the rule is that "you have to do a mb on all threads", and 
> that there is a race if a threads switches away just as we're about to do 
> that.
> 
> Fine. But why _two_? And what's so magical about the mm_cpumask that it 
> needs to be around it?
> 
> If the rule is that we do a memory barrier as we switch an mm, then why 
> does that single one not just handle it? Either the CPU kept running that 
> mm (and the IPI will do the memory barrier), or the CPU didn't (and the 
> switch_mm had a memory barrier).

The one before the mm_cpumask actually does the mb() that we want to
send, in case we miss it. We use mm_cpumask just to limit our sending of
mb()s to CPUs. The unlikely case that we miss a thread, this mb() will
perform the work mb() for us.

	   CPU 0		   CPU 1
	-----------		-----------
				< same thread >
				clear_bit(mm_cpumask)

	sys_membarrier();
	<miss sending CPU 1 mb()>
	return back to user-space

	modify data

				< CPU 0 data is read > <<- corruption

				schedule()

Earlier you asked what is the three things the two mb()'s are
protecting. Actually, it is really just two things that can be accessed
in:
	 A
	mb();
	 B
	mb();
	 A

Where, we access A then B and then A again, and need to worry about that
ordering. The two things are, the mm_cpumask and the second thing is the
user-space data that the sys_membarrier() is guaranteeing will be
flushed when it returns.

If we add a mb() before the clearing of the cpu bit in the mm_cpumask(),
we don't care if we get an old value that does not contain a missing cpu
that is running one of the process's threads. Because the mb() would
have made all the necessary data of the thread visible, and no other
mb() is needed.

The mb() after the set bit makes sure the mask is visible for this:

	   CPU 0		   CPU 1
	-----------		-----------
				set_bit();
				schedule();
				return to userspace and access
				 critical data

	sys_membarrier();
	miss sending IPI.
	return;

	update critical data

I guess the simple answer is that the first mb() is doing the mb() we
missed. The second mb() makes sure that a thread does not get back to
userspace without the sys_membarrier() seeing it.

> 
> Without locking, I don't see how you can really have any stronger 
> guarantees, and as per my previous email, I don't see what the smp_mb() 
> around mm_cpumask accesses help - because the other CPU is still not going 
> to atomically "see the mask and IPI". It's going to see one value or the 
> other, and the smp_mb() around the access doesn't seem to have anything to 
> do with which value it sees.

I guess you can say we are cheating here. If we miss the mm_cpumask
update, the only time it is an issue for us is if one of our threads
scheduled out. We use the mb() there to perform the mb() that we missed.
We really don't care about the mm_cpumask, but we do care that a mb() is
performed.

> 
> So I can kind of understand the "We want to guarantee that switching MM's 
> around wants to be a memory barrier". Quite frankly, I haven't though even 
> that through entirely, so who knows... But the "we need to have memory 
> barriers on both sides of the bit setting/clearing" I don't get. 
> 
> IOW, show me why that cpumask is _so_ important that the placement of the 
> memory barriers around it matters, to the point where you want to have it 
> on both sides.

Actually the cpumask is not as important as the mb(). We use the cpumask
to find out what CPUs we need to make do a mb(), and the mb() before the
cpumask is in the case we miss one that is caused by the race.

> 
> Maybe you've really thought about this very deeply, but the explanations 
> aren't getting through to me. Educate me.

The issues is that sys_membarrier() is guaranteeing that all the other
threads of the process will have a mb() performed before the
sys_membarrier() returns. We only care about the user-space data of the
threads, thus the mb() here is quite unique. We are using the mm_cpumask
as a short cut to sending a mb() to only the CPUs that are running our
threads.

The mb() before the mm_cpumask() is to make sure that the thread does a
mb() in case we miss it when reading the mm_cpumask().

The mb() after the mm_cpumask() is to make sure that the cpumask is
visible before the thread reads any data in user-space, so that the
sys_membarrier() will send out the IPI() to that thread.

For x86 these are a not an issues because clear_bit() is an mb() and so
is the load_cr3().

Now if other arch maintainers do not want to add two mb()s in
switch_mm(), then perhaps if they just guarantee that one exists after
the clear/set bits, the following code may work:

(posted in a previous email)

sys_membarrier(void) {

	again:
		tmp_mask = mm_cpumask(current->mm);
		smp_mb();
		rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
		for_each_cpu(cpu, tmp_mask) {
			spin_lock_irq(&cpu_rq(cpu)->lock);
			ret = current->mm == cpu_curr(cpu)->mm;
			spin_unlock_irq(&cpu_rq(cpu)->lock);
			if (ret)
			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
		}
		rcu_read_unlock();
		smp_mb();
		if (tmp_mask != mm_cpumask(current->mm)) {
			/* do check for signals here */
			goto again;
		}

Maybe even the locks are not needed.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 20:42                   ` Linus Torvalds
@ 2010-02-01 22:42                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2010-02-01 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, Ingo Molnar, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, laijs, dipankar, josh, dvhltc,
	niv, tglx, peterz, Valdis.Kletnieks, dhowells

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> > 
> > The two event pairs we are looking at are:
> > 
> > Pair 1)
> > 
> > * memory accesses (load/stores) performed by user-space thread before
> >   context switch.
> > * cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > 
> > Pair 2)
> >
> > * cpumask_set_cpu(cpu, mm_cpumask(next));
> > * memory accessses (load/stores) performed by user-space thread after
> >   context switch.
> 
> So explain why does that smp_mb() in between the two _help_?
> 
> The user of this will do a
> 
> 	for_each_cpu(mm_cpumask)
> 		send_IPI(cpu, smp_mb);
> 
> but that's not an atomic op _anyway_. So you're reading mm_cpumask 
> somewhere earlier, and doing the send_IPI later. So look at the whole 
> scenario 2:
> 
> 	cpumask_set_cpu(cpu, mm_cpumask(next));
> 	memory accessses performed by user-space
> 
> and think about it from the perspective of another CPU. What does an 
> smp_mb() in between the two do?
> 
> I'll tell you - it does NOTHING. Because it doesn't matter. I see no 
> possible way another CPU can care, because let's assume that the other CPU 
> is doing that
> 
> 	for_each_cpu(mm_cpumask)
> 		send_ipi(smp_mb);
> 
> and you have to realize that the other CPU needs to read that mm_cpumask 
> early in order to do that.
> 
> So you have this situation:
> 
> 	CPU1			CPU2
> 	----			----
> 
> 	cpumask_set_cpu
> 				read mm_cpumask
> 	smp_mb
> 				smp_mb
> 	user memory accessses
> 				send_ipi
> 
> and exactly _what_ is that "smp_mb" on CPU1 protecting against?
> 
> Realize that CPU2 is not ordered (because you wanted to avoid the 
> locking), so the "read mm_cpumask" can happen before or after that 
> cpumask_set_cpu. And it can happen before or after REGARDLESS of that 
> smp_mb. The smp_mb doesn't make any difference to CPU2 that I can see. 
> 
> So the question becomes one of "How can CPU2 care about whether CPU1 is in 
> the mask"? Considering that CPU2 doesn't do any locking, I don't see any 
> way you can get a "consistent" CPU mask _regardless_ of any smp_mb's in 
> there. When it does the "read mm_cpumask()" it might get the value 
> _before_ the cpumask_set_cpu, and it might get the value _after_, and 
> that's true regardless of whether there is a smp_mb there or not. 
> 
> See what I'm asking for? I'm asking for why it matters that we have a 
> memory barrier, and why that mm_cpumask is so magical that _that_ access 
> matters so much. 
> 
> Maybe I'm dense. But If somebody puts memory barriers in the code, I want 
> to know exactly what the reason for the barrier is. Memory ordering is too 
> subtle and non-intuitive to go by gut feel.
> 

I am 100% fine with your request for explanation. I'll do my best to
explain the requirements and ordering as clearly as I can.

Let's define the semantic of this system call clearly, this will help us
below: it orders memory accesses performed by a thread before and after
calling sys_membarrier with respect to _all_ memory accesses (in program
order) performed by all other threads of the same process.

The barriers are not there to "protect" against anything, they merely
enforce ordering of memory accesses. sys_membarrier ensures that a
memory barrier will be issued on all running threads of the current
process executing on other cpus, between two memory barriers issued
locally. A very important point is that the memory barriers _need_ to be
issued on the other running threads after the sys_membarrier first mb
and before its second mb. The second important aspect is the definition
of a running vs non-running thread. Here, any CPU that would still have
unflushed write buffers (or pending reordered read operations) affecting
a thread which belongs to our process should be considered as "running",
because it needs to have memory barriers issued.

Now why is mm_cpumask so important here ? This is the only non-lock
protected data structure we access to find out which CPUs are running
threads belonging to our process.

I'm modifying your scenario to match the requirements expressed above.
In this first scenario, we send an IPI guaranteeing that the newly
scheduled thread (which happens to have set its cpu in the mm_cpumask
before we read it) issues a memory barrier. We end up issuing one extra
memory barrier, as both the scheduler and IPI issue the barriers:

        CPU1                    CPU2
        ----                    ----
                                smp_mb
        cpumask_set_cpu
                                read mm_cpumask
                                send_ipi
                                |
        smp_mb                  |
        user memory accesses    |
                                |
        smp_mb <----------------/
                                smp_mb

Now, let's see what happens if the read mm_cpumask occurs right before
the cpumask_set_cpu. In this scenario, CPU1 is considered as "not
running", because it cannot have any user-space memory accesses
reordered prior to the first smp_mb on CPU2.

        CPU1                    CPU2
        ----                    ----
                                smp_mb
                                read mm_cpumask
                                (no ipi to send)
        cpumask_set_cpu
                                smp_mb
        smp_mb
        user memory accesses

As we see, the smp_mb issued on CPU1 between the cpumask set and user
memory accesses takes care to ensure that no user memory accesses spill
over the barrier. So if sys_membarrier encounters a mm_cpumask with a
cleared cpu bit, it knows for sure that there are no unflushed write
buffers containing user-space data, nor reordered read or write accesses
to user-space data. Removing this barrier could let CPU1 reorder the
user memory accesses before the cpumask_set_cpu, which fails to satisfy
the sys_membarrier guarantees.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock
  2010-02-01 22:39                     ` Steven Rostedt
@ 2010-02-01 23:09                       ` Steven Rostedt
  0 siblings, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2010-02-01 23:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	KOSAKI Motohiro, Paul E. McKenney, Nicholas Miell, laijs,
	dipankar, josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells

On Mon, 2010-02-01 at 17:39 -0500, Steven Rostedt wrote:

> 
> 	   CPU 0		   CPU 1
> 	-----------		-----------
> 				< same thread >
> 				clear_bit(mm_cpumask)
> 
> 	sys_membarrier();
> 	<miss sending CPU 1 mb()>
> 	return back to user-space
> 
> 	modify data
> 
> 				< CPU 0 data is read > <<- corruption
> 
> 				schedule()
> 
> 

> (posted in a previous email)
> 
> sys_membarrier(void) {
> 
> 	again:
> 		tmp_mask = mm_cpumask(current->mm);
> 		smp_mb();
> 		rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
> 		for_each_cpu(cpu, tmp_mask) {
> 			spin_lock_irq(&cpu_rq(cpu)->lock);
> 			ret = current->mm == cpu_curr(cpu)->mm;
> 			spin_unlock_irq(&cpu_rq(cpu)->lock);
> 			if (ret)
> 			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> 		}
> 		rcu_read_unlock();
> 		smp_mb();
> 		if (tmp_mask != mm_cpumask(current->mm)) {
> 			/* do check for signals here */
> 			goto again;
> 		}

Doh, this misses the above case if a mb() is not before the caller.
Nevermind, it looks like the only other case is to iterate over all
CPUs.

-- Steve



^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2010-02-01 23:09 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-31 20:52 [patch 0/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers
2010-01-31 20:52 ` [patch 1/3] Create spin lock/spin unlock with distinct memory barrier Mathieu Desnoyers
2010-02-01  7:25   ` Nick Piggin
2010-02-01 14:08     ` Mathieu Desnoyers
2010-02-01  7:28   ` Nick Piggin
2010-02-01 14:10     ` Mathieu Desnoyers
2010-02-01 15:22   ` Linus Torvalds
2010-02-01 15:41     ` Steven Rostedt
2010-01-31 20:52 ` [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock Mathieu Desnoyers
2010-02-01  7:33   ` Nick Piggin
2010-02-01  9:42     ` Peter Zijlstra
2010-02-01 10:11       ` Nick Piggin
2010-02-01 10:36         ` Peter Zijlstra
2010-02-01 10:49           ` Nick Piggin
2010-02-01 14:47             ` Mathieu Desnoyers
2010-02-01 14:58               ` Nick Piggin
2010-02-01 15:23                 ` Steven Rostedt
2010-02-01 15:44                   ` Steven Rostedt
2010-02-01 16:00                   ` Mike Galbraith
2010-02-01 15:27   ` Linus Torvalds
2010-02-01 16:09     ` Mathieu Desnoyers
2010-02-01 16:23       ` Linus Torvalds
2010-02-01 16:48         ` Mathieu Desnoyers
2010-02-01 16:56           ` Linus Torvalds
2010-02-01 17:45             ` Mathieu Desnoyers
2010-02-01 18:00               ` Steven Rostedt
2010-02-01 18:36               ` Linus Torvalds
2010-02-01 19:56                 ` Mathieu Desnoyers
2010-02-01 20:42                   ` Linus Torvalds
2010-02-01 22:42                     ` Mathieu Desnoyers
2010-02-01 20:33                 ` Steven Rostedt
2010-02-01 20:52                   ` Linus Torvalds
2010-02-01 22:39                     ` Steven Rostedt
2010-02-01 23:09                       ` Steven Rostedt
2010-02-01 17:13           ` Steven Rostedt
2010-02-01 17:34             ` Linus Torvalds
2010-02-01 16:24       ` Steven Rostedt
2010-02-01 16:29         ` Peter Zijlstra
2010-02-01 16:46           ` Steven Rostedt
2010-02-01 16:11     ` Steven Rostedt
2010-01-31 20:52 ` [patch 3/3] introduce sys_membarrier(): process-wide memory barrier (v8) Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).