public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v2)
@ 2010-01-08 23:56 Mathieu Desnoyers
  2010-01-09 10:47 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Mathieu Desnoyers @ 2010-01-08 23:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, paulmck, Peter Zijlstra, Josh Triplett,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.
 
It aims at greatly simplifying and enhancing the current signal-based
liburcu userspace RCU synchronize_rcu() implementation.
(found at http://lttng.org/urcu)

Changelog since v1:

- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPIs with threshold).
- Issue smp_mb() at the beginning and end of the system call.

TODO:
- We must make sure that the scheduler issues a memory barrier between:
  - the last user-space instruction executed and the runqueue ->mm update,
  - the runqueue ->mm update and the first user-space instruction executed,
  and document the barriers' locations (so they won't be removed).

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we reduce the number of unnecessary
wakeups and only issue the memory barriers on active threads.
Non-running threads do not need to execute such a barrier anyway,
because it is implied by the scheduler's context switches.

To explain the benefit of this scheme, let's introduce two example threads:
 
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are
ordering memory accesses with respect to smp_mb() present in 
rcu_read_lock/unlock(), we can change all smp_mb() from
synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
rcu_read_lock/unlock() into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B's accesses, which are kept in program order by
the compiler barrier, are "upgraded" to full smp_mb() semantics by the
IPIs executing memory barriers on each CPU actively running a thread of
the process. Non-running threads of the process are intrinsically
serialized by the scheduler.
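To make the transformation concrete, here is a hedged userspace sketch of the primitives after the change. Names (rcu_read_lock_sketch and friends) are illustrative, not liburcu's actual API, and sys_membarrier() is stubbed with a CPU-local full barrier so the sketch is self-contained:

```c
/*
 * Illustrative sketch of the read-side/write-side transformation
 * described above. Function and variable names are hypothetical;
 * liburcu's real implementation differs.
 */
#define barrier() __asm__ __volatile__("" ::: "memory")

static __thread int reader_nesting;	/* per-thread read-side nesting count */

static void rcu_read_lock_sketch(void)
{
	reader_nesting++;
	barrier();	/* was: smp_mb(); now only a compiler barrier */
}

static void rcu_read_unlock_sketch(void)
{
	barrier();	/* was: smp_mb(); now only a compiler barrier */
	reader_nesting--;
}

/*
 * Write side: each smp_mb() in synchronize_rcu() becomes a call to
 * sys_membarrier(). Modeled by a local full barrier here, since the
 * system call is only being proposed in this patch.
 */
static void write_side_mb_sketch(void)
{
	__sync_synchronize();	/* stand-in for sys_membarrier() */
}
```

The point of the sketch is only the asymmetry: the frequent read-side path pays nothing at the CPU level, while the infrequent write-side path pays for ordering on behalf of everyone.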

Just tried with a cache-hot kernel compilation using 6/8 CPUs.

Normally:                                              real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s

So... 2x slower. That hurts.

So let's try allocating a cpu mask for PeterZ's scheme. I prefer to pay
a small allocation overhead and benefit from the cpumask broadcast if
possible, so we scale better. But that all depends on how big the
allocation overhead is.

Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls; one thread is doing the sys_membarrier, the others are busy
looping):

IPI to all:                                            real 0m44.708s
alloc cpumask+local mb()+broadcast IPI to 1 thread:    real 1m2.034s

So, roughly, the cpumask allocation overhead is 17s here, not exactly
cheap. So let's see when it becomes better than single IPIs:

local mb()+single IPI to 1 thread:                     real 0m42.434s
local mb()+single IPI to 7 threads:                    real 3m29.795s

So, roughly, the single-IPI overhead is 137s here for 6 more threads,
i.e., 22.8s per thread.

Here is what we can do: given that the cpumask allocation costs almost
half as much as sending a single IPI, we iterate over the CPUs until we
either find more than N matches or have iterated over all CPUs. If we
find N matches or fewer, we send single IPIs. If we need more than
that, we switch to the cpumask allocation and send a broadcast IPI to
the cpumask we construct for the matching CPUs. Let's call it the
"adaptive IPI scheme".
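As a userspace model of the decision at the heart of the scheme (the enum and function name are illustrative; the kernel patch below is the real implementation), the choice reduces to a threshold comparison on the number of matching CPUs found so far:

```c
/*
 * Userspace model of the adaptive IPI decision described above.
 * ADAPT_IPI_THRESHOLD mirrors the patch; the enum is illustrative.
 */
#define ADAPT_IPI_THRESHOLD	1

enum ipi_mode {
	IPI_SINGLE,	/* few targets: one smp_call_function_single() each */
	IPI_BROADCAST,	/* many targets: allocate a cpumask, broadcast once */
};

/*
 * nr_matching_cpus counts CPUs (other than the caller's) currently
 * running a thread of this process. At or below the threshold, single
 * IPIs avoid the cpumask allocation; above it, one broadcast wins.
 */
static enum ipi_mode pick_ipi_mode(int nr_matching_cpus)
{
	return nr_matching_cpus <= ADAPT_IPI_THRESHOLD ?
		IPI_SINGLE : IPI_BROADCAST;
}
```

With the numbers measured above, a threshold of 1 means that as soon as two or more other CPUs need the barrier, the cpumask allocation overhead is already amortized by the cheaper broadcast.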

For my Intel Xeon E5405 (new set of results, with kernel debugging disabled):

Just doing local mb()+single IPI to T other threads:

T=1: 0m18.801s
T=2: 0m29.086s
T=3: 0m46.841s
T=4: 0m53.758s
T=5: 1m10.856s
T=6: 1m21.142s
T=7: 1m38.362s

Just doing cpumask alloc+IPI-many to T other threads:

T=1: 0m21.778s
T=2: 0m22.741s
T=3: 0m22.185s
T=4: 0m24.660s
T=5: 0m26.855s
T=6: 0m30.841s
T=7: 0m29.551s

So I think the right threshold should be 1 thread (assuming other
architectures behave like mine). So starting with 2 threads, we
allocate the cpumask and send IPIs.

The resulting adaptive scheme:

T=1: 0m13.911s
T=2: 0m20.730s
T=3: 0m21.474s
T=4: 0m27.952s
T=5: 0m26.286s
T=6: 0m27.855s
T=7: 0m29.695s

A benchmark taking the runqueue locks (and disabling irqs) for each cpu
shows that it's significantly slower. Thus, as long as we can rely
solely on the scheduler's memory barriers between rq ->mm updates and
execution of userland code, we don't need the spinlocks.

With runqueue locks:

T=1: 0m15.802s
T=2: 0m22.484s
T=3: 0m24.751s
T=4: 0m29.134s
T=5: 0m30.094s
T=6: 0m33.090s
T=7: 0m33.897s

After adding proper memory barriers at the beginning/end of the system call
(they were missing in the previous tests, needed for correctness), without
spinlocks, we end up with:

T=1: 0m14.326s
T=2: 0m20.278s
T=3: 0m21.661s
T=4: 0m22.320s
T=5: 0m27.088s
T=6: 0m29.378s
T=7: 0m30.523s

The expected top pattern, when using one CPU for a thread doing
sys_membarrier() in a loop while the other threads busy-wait in
user-space on a variable, shows that the thread doing sys_membarrier()
spends most of its time in system calls, while the other threads run
mostly in user-space. As a side-note: in this test, it's important to
check that individual threads are not always fully at 100% user-space
time (they should range between ~95% and 100%), because a thread that
stays at 100% on the same CPU is not receiving the IPI at all. (I
actually found a bug in my own code while developing it with this test.)

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st

The system call number is only assigned for x86_64 in this RFC patch.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: rostedt@goodmis.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/include/asm/unistd_64.h |    2 
 kernel/sched.c                   |  123 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-06 23:23:34.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-07 00:23:45.000000000 -0500
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_membarrier				299
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 23:23:34.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-08 18:17:44.000000000 -0500
@@ -119,6 +119,11 @@
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+/*
+ * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
+ */
+#define ADAPT_IPI_THRESHOLD	1
+
 static inline int rt_policy(int policy)
 {
 	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
@@ -10822,6 +10827,124 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * Execute a memory barrier on all CPUs on SMP systems.
+ * Do not rely on implicit barriers in smp_call_function(), just in case they
+ * are ever relaxed in the future.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * Handle out-of-mem by sending per-cpu IPIs instead.
+ */
+static void membarrier_retry(void)
+{
+	int cpu;
+
+	for_each_cpu(cpu, mm_cpumask(current->mm)) {
+		if (cpu_curr(cpu)->mm == current->mm)
+			smp_call_function_single(cpu, membarrier_ipi,
+						 NULL, 1);
+	}
+}
+
+/*
+ * sys_membarrier - issue a memory barrier on running threads of the current process
+ *
+ * Execute a memory barrier on all running threads of the current process.
+ * Upon completion, the caller thread is ensured that all process threads
+ * have passed through a state where memory accesses match program order.
+ * (non-running threads are de facto in such a state)
+ *
+ * The current implementation simply executes a memory barrier in an IPI handler
+ * on each active cpu. Going through the hassle of taking run queue locks and
+ * checking if the thread running on each online CPU belongs to the current
+ * process seems more heavyweight than the cost of the IPI itself.
+ *
+ * The cpu_curr(cpu)->mm read is racy on purpose. It relies on the fact that we
+ * race only with the scheduler, and that the scheduler issues a smp_mb() between
+ * the ->mm update and returning to the newly scheduled thread, and between
+ * interruption of a running thread and the ->mm update. The worst that could
+ * result from this race is that we issue an IPI for nothing, which doesn't hurt.
+ */
+SYSCALL_DEFINE0(membarrier)
+{
+#ifdef CONFIG_SMP
+	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
+	cpumask_var_t tmpmask;
+	int this_cpu;
+
+	if (likely(!thread_group_empty(current))) {
+		rcu_read_lock();	/* protect cpu_curr(cpu)-> access */
+		/*
+		 * We don't need to include ourself in IPI, as we already
+		 * surround our execution with memory barriers. We also
+		 * don't have to disable preemption here, because if we
+		 * migrate out of "this_cpu", then there is an implied memory
+		 * barrier for the thread now running on "this_cpu".
+		 */
+		this_cpu = raw_smp_processor_id();
+		/*
+		 * Memory barrier on the caller thread _before_ the first
+		 * cpu_curr(cpu)->mm read and also before sending first IPI.
+		 */
+		smp_mb();
+		/* Get CPU IDs up to threshold */
+		for_each_cpu(cpu, mm_cpumask(current->mm)) {
+			if (unlikely(cpu == this_cpu))
+				continue;
+			if (cpu_curr(cpu)->mm == current->mm) {
+				if (nr_cpus == ADAPT_IPI_THRESHOLD) {
+					nr_cpus++;
+					break;
+				}
+				cpu_ipi[nr_cpus++] = cpu;
+			}
+		}
+		if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
+			for (i = 0; i < nr_cpus; i++) {
+				smp_call_function_single(cpu_ipi[i],
+							 membarrier_ipi,
+							 NULL, 1);
+			}
+		} else {
+			if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+				membarrier_retry();
+				goto unlock;
+			}
+			for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
+				cpumask_set_cpu(cpu_ipi[i], tmpmask);
+			/* Continue previous for_each_cpu() */
+			do {
+				/* Skip self: our own barriers cover us. */
+				if (likely(cpu != this_cpu) &&
+				    cpu_curr(cpu)->mm == current->mm)
+					cpumask_set_cpu(cpu, tmpmask);
+				cpu = cpumask_next(cpu,
+						   mm_cpumask(current->mm));
+			} while (cpu < nr_cpu_ids);
+			preempt_disable();	/* explicitly required */
+			smp_call_function_many(tmpmask, membarrier_ipi, NULL,
+					       1);
+			preempt_enable();
+			free_cpumask_var(tmpmask);
+		}
+unlock:
+		/*
+		 * Memory barrier on the caller thread _after_ we finished
+		 * waiting for the last IPI and also after reading the last
+		 * cpu_curr(cpu)->mm.
+		 */
+		smp_mb();
+		rcu_read_unlock();
+	}
+#endif	/* #ifdef CONFIG_SMP */
+	return 0;
+}
+
 #ifndef CONFIG_SMP
 
 int rcu_expedited_torture_stats(char *page)
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v2)
  2010-01-08 23:56 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v2) Mathieu Desnoyers
@ 2010-01-09 10:47 ` Peter Zijlstra
  2010-01-09 16:16   ` Mathieu Desnoyers
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2010-01-09 10:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Steven Rostedt, paulmck, Josh Triplett, Ingo Molnar,
	akpm, tglx, Valdis.Kletnieks, dhowells, laijs, dipankar

On Fri, 2010-01-08 at 18:56 -0500, Mathieu Desnoyers wrote:

> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 23:23:34.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-08 18:17:44.000000000 -0500
> @@ -119,6 +119,11 @@
>   */
>  #define RUNTIME_INF	((u64)~0ULL)
>  
> +/*
> + * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
> + */
> +#define ADAPT_IPI_THRESHOLD	1
> +
>  static inline int rt_policy(int policy)
>  {
>  	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
> @@ -10822,6 +10827,124 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
>  
> +/*
> + * Execute a memory barrier on all CPUs on SMP systems.
> + * Do not rely on implicit barriers in smp_call_function(), just in case they
> + * are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * Handle out-of-mem by sending per-cpu IPIs instead.
> + */
> +static void membarrier_retry(void)
> +{
> +	int cpu;
> +
> +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> +		if (cpu_curr(cpu)->mm == current->mm)
> +			smp_call_function_single(cpu, membarrier_ipi,
> +						 NULL, 1);
> +	}
> +}


> +SYSCALL_DEFINE0(membarrier)
> +{
> +#ifdef CONFIG_SMP
> +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> +	cpumask_var_t tmpmask;
> +	int this_cpu;
> +
> +	if (likely(!thread_group_empty(current))) {
> +		rcu_read_lock();	/* protect cpu_curr(cpu)-> access */
> +		/*
> +		 * We don't need to include ourself in IPI, as we already
> +		 * surround our execution with memory barriers. We also
> +		 * don't have to disable preemption here, because if we
> +		 * migrate out of "this_cpu", then there is an implied memory
> +		 * barrier for the thread now running on "this_cpu".
> +		 */
> +		this_cpu = raw_smp_processor_id();

How is this not a bug?

> +		/*
> +		 * Memory barrier on the caller thread _before_ the first
> +		 * cpu_curr(cpu)->mm read and also before sending first IPI.
> +		 */
> +		smp_mb();
> +		/* Get CPU IDs up to threshold */
> +		for_each_cpu(cpu, mm_cpumask(current->mm)) {
> +			if (unlikely(cpu == this_cpu))
> +				continue;
> +			if (cpu_curr(cpu)->mm == current->mm) {
> +				if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> +					nr_cpus++;
> +					break;
> +				}
> +				cpu_ipi[nr_cpus++] = cpu;
> +			}
> +		}
> +		if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> +			for (i = 0; i < nr_cpus; i++) {
> +				smp_call_function_single(cpu_ipi[i],
> +							 membarrier_ipi,
> +							 NULL, 1);
> +			}
> +		} else {
> +			if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> +				membarrier_retry();
> +				goto unlock;
> +			}
> +			for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> +				cpumask_set_cpu(cpu_ipi[i], tmpmask);
> +			/* Continue previous for_each_cpu() */
> +			do {
> +				/* Skip self: our own barriers cover us. */
> +				if (likely(cpu != this_cpu) &&
> +				    cpu_curr(cpu)->mm == current->mm)
> +					cpumask_set_cpu(cpu, tmpmask);
> +				cpu = cpumask_next(cpu,
> +						   mm_cpumask(current->mm));
> +			} while (cpu < nr_cpu_ids);
> +			preempt_disable();	/* explicitly required */

This seems to indicate the same.

> +			smp_call_function_many(tmpmask, membarrier_ipi, NULL,
> +					       1);
> +			preempt_enable();
> +			free_cpumask_var(tmpmask);
> +		}
> +unlock:
> +		/*
> +		 * Memory barrier on the caller thread _after_ we finished
> +		 * waiting for the last IPI and also after reading the last
> +		 * cpu_curr(cpu)->mm.
> +		 */
> +		smp_mb();
> +		rcu_read_unlock();
> +	}
> +#endif	/* #ifdef CONFIG_SMP */
> +	return 0;
> +}
> +
>  #ifndef CONFIG_SMP
>  
>  int rcu_expedited_torture_stats(char *page)



* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v2)
  2010-01-09 10:47 ` Peter Zijlstra
@ 2010-01-09 16:16   ` Mathieu Desnoyers
  0 siblings, 0 replies; 3+ messages in thread
From: Mathieu Desnoyers @ 2010-01-09 16:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Steven Rostedt, paulmck, Josh Triplett, Ingo Molnar,
	akpm, tglx, Valdis.Kletnieks, dhowells, laijs, dipankar

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-01-08 at 18:56 -0500, Mathieu Desnoyers wrote:
> 
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 23:23:34.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-08 18:17:44.000000000 -0500
> > @@ -119,6 +119,11 @@
> >   */
> >  #define RUNTIME_INF	((u64)~0ULL)
> >  
> > +/*
> > + * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
> > + */
> > +#define ADAPT_IPI_THRESHOLD	1
> > +
> >  static inline int rt_policy(int policy)
> >  {
> >  	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
> > @@ -10822,6 +10827,124 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> >  
> > +/*
> > + * Execute a memory barrier on all CPUs on SMP systems.
> > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > + * are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_retry(void)
> > +{
> > +	int cpu;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > +		if (cpu_curr(cpu)->mm == current->mm)
> > +			smp_call_function_single(cpu, membarrier_ipi,
> > +						 NULL, 1);
> > +	}
> > +}
> 
> 
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +#ifdef CONFIG_SMP
> > +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> > +	cpumask_var_t tmpmask;
> > +	int this_cpu;
> > +
> > +	if (likely(!thread_group_empty(current))) {
> > +		rcu_read_lock();	/* protect cpu_curr(cpu)-> access */
> > +		/*
> > +		 * We don't need to include ourself in IPI, as we already
> > +		 * surround our execution with memory barriers. We also
> > +		 * don't have to disable preemption here, because if we
> > +		 * migrate out of "this_cpu", then there is an implied memory
> > +		 * barrier for the thread now running on "this_cpu".
> > +		 */
> > +		this_cpu = raw_smp_processor_id();
> 
> How is this not a bug?

It should be moved right below the following smp_mb(). See below,

> 
> > +		/*
> > +		 * Memory barrier on the caller thread _before_ the first
> > +		 * cpu_curr(cpu)->mm read and also before sending first IPI.
> > +		 */
> > +		smp_mb();

So let's move it here. Now, why this should be ok:

The requirement of this algorithm is that, between the two smp_mb()
issued at the beginning and end of sys_membarrier() execution, we can be
sure that smp_mb() are issued on all processors running threads which
belong to our process.

So if we are preempted after reading the processor ID and migrated (and
another thread belonging to our process runs on our CPU), the
scheduler activity happening between the two smp_mb() surrounding
sys_membarrier() guarantees that memory barriers are issued for the
newly scheduled thread. Therefore, we can skip the CPU on which we read
the processor ID, even if we happen to be moved to a different CPU.

It is based on the same assumption as the "racy" cpu_curr(cpu)->mm read:
we need to be sure that the scheduler issues a smp_mb() between the
execution of user-level instructions and scheduler activity (but in this
case, the scenario also involves a migration).


> > +		/* Get CPU IDs up to threshold */
> > +		for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > +			if (unlikely(cpu == this_cpu))
> > +				continue;
> > +			if (cpu_curr(cpu)->mm == current->mm) {
> > +				if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> > +					nr_cpus++;
> > +					break;
> > +				}
> > +				cpu_ipi[nr_cpus++] = cpu;
> > +			}
> > +		}
> > +		if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> > +			for (i = 0; i < nr_cpus; i++) {
> > +				smp_call_function_single(cpu_ipi[i],
> > +							 membarrier_ipi,
> > +							 NULL, 1);
> > +			}
> > +		} else {
> > +			if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > +				membarrier_retry();
> > +				goto unlock;
> > +			}
> > +			for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> > +				cpumask_set_cpu(cpu_ipi[i], tmpmask);
> > +			/* Continue previous for_each_cpu() */
> > +			do {
> > +				/* Skip self: our own barriers cover us. */
> > +				if (likely(cpu != this_cpu) &&
> > +				    cpu_curr(cpu)->mm == current->mm)
> > +					cpumask_set_cpu(cpu, tmpmask);
> > +				cpu = cpumask_next(cpu,
> > +						   mm_cpumask(current->mm));
> > +			} while (cpu < nr_cpu_ids);
> > +			preempt_disable();	/* explicitly required */
> 
> This seems to indicate the same.

I added these preempt disable/enable solely because they are explicitly
required by smp_call_function_many() (see the comment at the beginning
of its implementation). I have not checked whether this is indeed
required in this very specific case, or somehow permitted per my
explanation above. Given that smp_processor_id() is used in
smp_call_function_many(), it would complain if preemption were not
disabled, so it's better to stick with disabling preemption anyway.

So it all depends: I'm a bit reluctant to disable preemption when
iterating over all CPUs, as that lengthens the preempt-off section as
we add more cores to the system. So if possible I'd like to keep
preemption on. However, given that we already disable preemption while
waiting for the other processors to acknowledge the IPI, we might as
well bite the bullet and disable it around the whole system call
altogether.

Anyway, I don't see the impact of keeping preemption on as being as
important as taking all the runqueue locks, so I could live with
disabling preemption if it keeps the complexity level down.

Thoughts ?

Thanks,

Mathieu

> 
> > +			smp_call_function_many(tmpmask, membarrier_ipi, NULL,
> > +					       1);
> > +			preempt_enable();
> > +			free_cpumask_var(tmpmask);
> > +		}
> > +unlock:
> > +		/*
> > +		 * Memory barrier on the caller thread _after_ we finished
> > +		 * waiting for the last IPI and also after reading the last
> > +		 * cpu_curr(cpu)->mm.
> > +		 */
> > +		smp_mb();
> > +		rcu_read_unlock();
> > +	}
> > +#endif	/* #ifdef CONFIG_SMP */
> > +	return 0;
> > +}
> > +
> >  #ifndef CONFIG_SMP
> >  
> >  int rcu_expedited_torture_stats(char *page)
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

