[RFC PATCH v3] membarrier: expedited private command

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH v3] membarrier: expedited private command
@ 2017-07-28 16:54 Mathieu Desnoyers
  2017-07-28 17:01 ` Peter Zijlstra
  2017-07-28 18:29 ` Paul E. McKenney
  0 siblings, 2 replies; 3+ messages in thread
From: Mathieu Desnoyers @ 2017-07-28 16:54 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney
  Cc: linux-kernel, Mathieu Desnoyers, Boqun Feng, Andrew Hunter,
	Maged Michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman

Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as the
thread calling sys_membarrier.

Scheduler-wise, it requires a memory barrier before and after context
switching between processes (which have different mm). The memory
barrier before context switch is already present. After context switch,
finish_lock_switch() acts as a RELEASE barrier, which is a full memory
barrier on all architectures except PowerPC. Add a
smp_mb__after_unlock_lock() to promote this barrier to a full memory
barrier. This is a no-op on all architectures but PowerPC.

* Benchmark

A stress-test benchmark of sched pipe shows that it does not add
significant overhead to the scheduler switching between processes:

100 runs of:

taskset 01 ./perf bench sched pipe

Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes

Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

A) With 4.13.0-rc2+
   at commit a97fb594bc7d ("virtio-net: fix module unloading")

avg.:     2.923 usecs/op
std.dev:  0.057 usecs/op

B) With this commit:

avg.:     2.916 usecs/op
std.dev:  0.043 usecs/op

Changes since v2:
- Address comments from Peter Zijlstra,
- Add smp_mb__after_unlock_lock() after finish_lock_switch() in
  finish_task_switch() to add the memory barrier we need after storing
  to rq->curr. This is much simpler than the previous approach relying
  on atomic_dec_and_test() in mmdrop(), which actually added a memory
  barrier in the common case of switching between userspace processes.
- Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
  kernel, rather than having the whole membarrier system call returning
  -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
  Adapt the CMD_QUERY mask accordingly.

Changes since v1:
- move membarrier code under kernel/sched/ because it uses the
  scheduler runqueue,
- only add the barrier when we switch from a kernel thread. The case
  where we switch from a user-space thread is already handled by
  the atomic_dec_and_test() in mmdrop().
- add a comment to mmdrop() documenting the requirement on the implicit
  memory barrier.

CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 MAINTAINERS                     |   2 +-
 include/uapi/linux/membarrier.h |  23 +++++-
 kernel/Makefile                 |   1 -
 kernel/membarrier.c             |  70 ------------------
 kernel/sched/Makefile           |   1 +
 kernel/sched/core.c             |   5 ++
 kernel/sched/membarrier.c       | 152 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 180 insertions(+), 74 deletions(-)
 delete mode 100644 kernel/membarrier.c
 create mode 100644 kernel/sched/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f66488dfdbc9..3b035584272f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8621,7 +8621,7 @@ M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
 M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
 L:	linux-kernel@vger.kernel.org
 S:	Supported
-F:	kernel/membarrier.c
+F:	kernel/sched/membarrier.c
 F:	include/uapi/linux/membarrier.h
 
 MEMORY MANAGEMENT
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..6d47b3249d8a 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,14 +40,33 @@
  *                          (non-running threads are de facto in such a
  *                          state). This covers threads from all processes
  *                          running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
+ *                          Execute a memory barrier on each running
+ *                          thread belonging to the same process as the current
+ *                          thread. Upon return from system call, the
+ *                          caller thread is ensured that all its running
+ *                          threads siblings have passed through a state
+ *                          where all memory accesses to user-space
+ *                          addresses match program order between entry
+ *                          to and return from the system call
+ *                          (non-running threads are de facto in such a
+ *                          state). This only covers threads from the
+ *                          same processes as the caller thread. This
+ *                          command returns 0. The "expedited" commands
+ *                          complete faster than the non-expedited ones,
+ *                          they never block, but have the downside of
+ *                          causing extra overhead.
  *
  * Command to be passed to the membarrier system call. The commands need to
  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
  * the value 0.
  */
 enum membarrier_cmd {
-	MEMBARRIER_CMD_QUERY = 0,
-	MEMBARRIER_CMD_SHARED = (1 << 0),
+	MEMBARRIER_CMD_QUERY			= 0,
+	MEMBARRIER_CMD_SHARED			= (1 << 0),
+	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
+	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
+	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
 };
 
 #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb8e8b23c6e..9c323a6daa46 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
-obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
 
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
deleted file mode 100644
index 9f9284f37f8d..000000000000
--- a/kernel/membarrier.c
+++ /dev/null
@@ -1,70 +0,0 @@
-/*
- * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
- *
- * membarrier system call
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- */
-
-#include <linux/syscalls.h>
-#include <linux/membarrier.h>
-#include <linux/tick.h>
-
-/*
- * Bitmask made from a "or" of all commands within enum membarrier_cmd,
- * except MEMBARRIER_CMD_QUERY.
- */
-#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
-
-/**
- * sys_membarrier - issue memory barriers on a set of threads
- * @cmd:   Takes command values defined in enum membarrier_cmd.
- * @flags: Currently needs to be 0. For future extensions.
- *
- * If this system call is not implemented, -ENOSYS is returned. If the
- * command specified does not exist, or if the command argument is invalid,
- * this system call returns -EINVAL. For a given command, with flags argument
- * set to 0, this system call is guaranteed to always return the same value
- * until reboot.
- *
- * All memory accesses performed in program order from each targeted thread
- * is guaranteed to be ordered with respect to sys_membarrier(). If we use
- * the semantic "barrier()" to represent a compiler barrier forcing memory
- * accesses to be performed in program order across the barrier, and
- * smp_mb() to represent explicit memory barriers forcing full memory
- * ordering across the barrier, we have the following ordering table for
- * each pair of barrier(), sys_membarrier() and smp_mb():
- *
- * The pair ordering is detailed as (O: ordered, X: not ordered):
- *
- *                        barrier()   smp_mb() sys_membarrier()
- *        barrier()          X           X            O
- *        smp_mb()           X           O            O
- *        sys_membarrier()   O           O            O
- */
-SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
-{
-	/* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
-	if (tick_nohz_full_enabled())
-		return -ENOSYS;
-	if (unlikely(flags))
-		return -EINVAL;
-	switch (cmd) {
-	case MEMBARRIER_CMD_QUERY:
-		return MEMBARRIER_CMD_BITMASK;
-	case MEMBARRIER_CMD_SHARED:
-		if (num_online_cpus() > 1)
-			synchronize_sched();
-		return 0;
-	default:
-		return -EINVAL;
-	}
-}
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 53f0164ed362..78f54932ea1d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17c667b427b4..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
 	finish_lock_switch(rq, prev);
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 */
+	smp_mb__after_unlock_lock();
 	finish_arch_post_lock_switch();
 
 	fire_sched_in_preempt_notifiers(current);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
new file mode 100644
index 000000000000..287a3f88f9b2
--- /dev/null
+++ b/kernel/sched/membarrier.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * membarrier system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>
+#include <linux/tick.h>
+#include <linux/cpumask.h>
+
+#include "sched.h"	/* for cpu_rq(). */
+
+/*
+ * Bitmask made from a "or" of all commands within enum membarrier_cmd,
+ * except MEMBARRIER_CMD_QUERY.
+ */
+#define MEMBARRIER_CMD_BITMASK	\
+	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+
+static void ipi_mb(void *info)
+{
+	smp_mb();	/* IPIs should be serializing but paranoid. */
+}
+
+static void membarrier_private_expedited(void)
+{
+	int cpu;
+	bool fallback = false;
+	cpumask_var_t tmpmask;
+
+	if (num_online_cpus() == 1)
+		return;
+
+	/*
+	 * Matches memory barriers around rq->curr modification in
+	 * scheduler.
+	 */
+	smp_mb();	/* system call entry is not a mb. */
+
+	/*
+	 * Expedited membarrier commands guarantee that they won't
+	 * block, hence the GFP_NOWAIT allocation flag and fallback
+	 * implementation.
+	 */
+	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		/* Fallback for OOM. */
+		fallback = true;
+	}
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct task_struct *p;
+
+		/*
+		 * Skipping the current CPU is OK even through we can be
+		 * migrated at any point. The current CPU, at the point
+		 * where we read raw_smp_processor_id(), is ensured to
+		 * be in program order with respect to the caller
+		 * thread. Therefore, we can skip this CPU from the
+		 * iteration.
+		 */
+		if (cpu == raw_smp_processor_id())
+			continue;
+		rcu_read_lock();
+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		if (p && p->mm == current->mm) {
+			if (!fallback)
+				__cpumask_set_cpu(cpu, tmpmask);
+			else
+				smp_call_function_single(cpu, ipi_mb, NULL, 1);
+		}
+		rcu_read_unlock();
+	}
+	if (!fallback) {
+		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+		free_cpumask_var(tmpmask);
+	}
+	cpus_read_unlock();
+
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI. Matches memory barriers around
+	 * rq->curr modification in scheduler.
+	 */
+	smp_mb();	/* exit from system call is not a mb */
+}
+
+/**
+ * sys_membarrier - issue memory barriers on a set of threads
+ * @cmd:   Takes command values defined in enum membarrier_cmd.
+ * @flags: Currently needs to be 0. For future extensions.
+ *
+ * If this system call is not implemented, -ENOSYS is returned. If the
+ * command specified does not exist, not available on the running
+ * kernel, or if the command argument is invalid, this system call
+ * returns -EINVAL. For a given command, with flags argument set to 0,
+ * this system call is guaranteed to always return the same value until
+ * reboot.
+ *
+ * All memory accesses performed in program order from each targeted thread
+ * is guaranteed to be ordered with respect to sys_membarrier(). If we use
+ * the semantic "barrier()" to represent a compiler barrier forcing memory
+ * accesses to be performed in program order across the barrier, and
+ * smp_mb() to represent explicit memory barriers forcing full memory
+ * ordering across the barrier, we have the following ordering table for
+ * each pair of barrier(), sys_membarrier() and smp_mb():
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb() sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ */
+SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
+{
+	if (unlikely(flags))
+		return -EINVAL;
+	switch (cmd) {
+	case MEMBARRIER_CMD_QUERY:
+	{
+		int cmd_mask = MEMBARRIER_CMD_BITMASK;
+
+		if (tick_nohz_full_enabled())
+			cmd_mask &= ~MEMBARRIER_CMD_SHARED;
+		return cmd_mask;
+	}
+	case MEMBARRIER_CMD_SHARED:
+		/* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
+		if (tick_nohz_full_enabled())
+			return -EINVAL;
+		if (num_online_cpus() > 1)
+			synchronize_sched();
+		return 0;
+	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
+		membarrier_private_expedited();
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH v3] membarrier: expedited private command
  2017-07-28 16:54 [RFC PATCH v3] membarrier: expedited private command Mathieu Desnoyers
@ 2017-07-28 17:01 ` Peter Zijlstra
  2017-07-28 18:29 ` Paul E. McKenney
  1 sibling, 0 replies; 3+ messages in thread
From: Peter Zijlstra @ 2017-07-28 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, linux-kernel, Boqun Feng, Andrew Hunter,
	Maged Michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman

On Fri, Jul 28, 2017 at 12:54:29PM -0400, Mathieu Desnoyers wrote:
> Scheduler-wise, it requires a memory barrier before and after context
> switching between processes (which have different mm). The memory
> barrier before context switch is already present. After context switch,
> finish_lock_switch() acts as a RELEASE barrier, which is a full memory
> barrier on all architectures except PowerPC. Add a
> smp_mb__after_unlock_lock() to promote this barrier to a full memory
> barrier. This is a no-op on all architectures but PowerPC.

Not quite complete:

Our TSO archs can do RELEASE without being a full barrier. Look at x86's
spin_unlock() being a regular STORE for example.  But for those archs,
all atomics imply smp_mb and all of them have atomic ops in switch_mm()
for mm_cpumask().

>From all weakly ordered machines, only ARM64 and PPC can do RELEASE, the
rest does indeed to smp_mb(), so there the spin_unlock() is a full
barrier and we're good.

ARM64 has a very heavy barrier in switch_to(), which suffices.

PPC just removed its barrier from switch_to(), but appears to be talking
about adding something to switch_mm().

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH v3] membarrier: expedited private command
  2017-07-28 16:54 [RFC PATCH v3] membarrier: expedited private command Mathieu Desnoyers
  2017-07-28 17:01 ` Peter Zijlstra
@ 2017-07-28 18:29 ` Paul E. McKenney
  1 sibling, 0 replies; 3+ messages in thread
From: Paul E. McKenney @ 2017-07-28 18:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Boqun Feng, Andrew Hunter,
	Maged Michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman

On Fri, Jul 28, 2017 at 12:54:29PM -0400, Mathieu Desnoyers wrote:
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> from all runqueues for which current thread's mm is the same as the
> thread calling sys_membarrier.
> 
> Scheduler-wise, it requires a memory barrier before and after context
> switching between processes (which have different mm). The memory
> barrier before context switch is already present. After context switch,
> finish_lock_switch() acts as a RELEASE barrier, which is a full memory
> barrier on all architectures except PowerPC. Add a
> smp_mb__after_unlock_lock() to promote this barrier to a full memory
> barrier. This is a no-op on all architectures but PowerPC.
> 
> * Benchmark
> 
> A stress-test benchmark of sched pipe shows that it does not add
> significant overhead to the scheduler switching between processes:
> 
> 100 runs of:
> 
> taskset 01 ./perf bench sched pipe
> 
> Running 'sched/pipe' benchmark:
> Executed 1000000 pipe operations between two processes
> 
> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> 
> A) With 4.13.0-rc2+
>    at commit a97fb594bc7d ("virtio-net: fix module unloading")
> 
> avg.:     2.923 usecs/op
> std.dev:  0.057 usecs/op
> 
> B) With this commit:
> 
> avg.:     2.916 usecs/op
> std.dev:  0.043 usecs/op
> 
> Changes since v2:
> - Address comments from Peter Zijlstra,
> - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
>   finish_task_switch() to add the memory barrier we need after storing
>   to rq->curr. This is much simpler than the previous approach relying
>   on atomic_dec_and_test() in mmdrop(), which actually added a memory
>   barrier in the common case of switching between userspace processes.
> - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
>   kernel, rather than having the whole membarrier system call returning
>   -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
>   Adapt the CMD_QUERY mask accordingly.
> 
> Changes since v1:
> - move membarrier code under kernel/sched/ because it uses the
>   scheduler runqueue,
> - only add the barrier when we switch from a kernel thread. The case
>   where we switch from a user-space thread is already handled by
>   the atomic_dec_and_test() in mmdrop().
> - add a comment to mmdrop() documenting the requirement on the implicit
>   memory barrier.
> 
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> CC: Boqun Feng <boqun.feng@gmail.com>
> CC: Andrew Hunter <ahh@google.com>
> CC: Maged Michael <maged.michael@gmail.com>
> CC: gromer@google.com
> CC: Avi Kivity <avi@scylladb.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

I have queued this in place of your v3, and will do likewise for the
rumored upcoming v4.  ;-)

							Thanx, Paul

> ---
>  MAINTAINERS                     |   2 +-
>  include/uapi/linux/membarrier.h |  23 +++++-
>  kernel/Makefile                 |   1 -
>  kernel/membarrier.c             |  70 ------------------
>  kernel/sched/Makefile           |   1 +
>  kernel/sched/core.c             |   5 ++
>  kernel/sched/membarrier.c       | 152 ++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 180 insertions(+), 74 deletions(-)
>  delete mode 100644 kernel/membarrier.c
>  create mode 100644 kernel/sched/membarrier.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f66488dfdbc9..3b035584272f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8621,7 +8621,7 @@ M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>  M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>  L:	linux-kernel@vger.kernel.org
>  S:	Supported
> -F:	kernel/membarrier.c
> +F:	kernel/sched/membarrier.c
>  F:	include/uapi/linux/membarrier.h
> 
>  MEMORY MANAGEMENT
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..6d47b3249d8a 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,14 +40,33 @@
>   *                          (non-running threads are de facto in such a
>   *                          state). This covers threads from all processes
>   *                          running on the system. This command returns 0.
> + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> + *                          Execute a memory barrier on each running
> + *                          thread belonging to the same process as the current
> + *                          thread. Upon return from system call, the
> + *                          caller thread is ensured that all its running
> + *                          threads siblings have passed through a state
> + *                          where all memory accesses to user-space
> + *                          addresses match program order between entry
> + *                          to and return from the system call
> + *                          (non-running threads are de facto in such a
> + *                          state). This only covers threads from the
> + *                          same processes as the caller thread. This
> + *                          command returns 0. The "expedited" commands
> + *                          complete faster than the non-expedited ones,
> + *                          they never block, but have the downside of
> + *                          causing extra overhead.
>   *
>   * Command to be passed to the membarrier system call. The commands need to
>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>   * the value 0.
>   */
>  enum membarrier_cmd {
> -	MEMBARRIER_CMD_QUERY = 0,
> -	MEMBARRIER_CMD_SHARED = (1 << 0),
> +	MEMBARRIER_CMD_QUERY			= 0,
> +	MEMBARRIER_CMD_SHARED			= (1 << 0),
> +	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> +	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> +	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>  };
> 
>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4cb8e8b23c6e..9c323a6daa46 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>  obj-$(CONFIG_TORTURE_TEST) += torture.o
> -obj-$(CONFIG_MEMBARRIER) += membarrier.o
> 
>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
> 
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> deleted file mode 100644
> index 9f9284f37f8d..000000000000
> --- a/kernel/membarrier.c
> +++ /dev/null
> @@ -1,70 +0,0 @@
> -/*
> - * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> - *
> - * membarrier system call
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; either version 2 of the License, or
> - * (at your option) any later version.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - */
> -
> -#include <linux/syscalls.h>
> -#include <linux/membarrier.h>
> -#include <linux/tick.h>
> -
> -/*
> - * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> - * except MEMBARRIER_CMD_QUERY.
> - */
> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> -
> -/**
> - * sys_membarrier - issue memory barriers on a set of threads
> - * @cmd:   Takes command values defined in enum membarrier_cmd.
> - * @flags: Currently needs to be 0. For future extensions.
> - *
> - * If this system call is not implemented, -ENOSYS is returned. If the
> - * command specified does not exist, or if the command argument is invalid,
> - * this system call returns -EINVAL. For a given command, with flags argument
> - * set to 0, this system call is guaranteed to always return the same value
> - * until reboot.
> - *
> - * All memory accesses performed in program order from each targeted thread
> - * is guaranteed to be ordered with respect to sys_membarrier(). If we use
> - * the semantic "barrier()" to represent a compiler barrier forcing memory
> - * accesses to be performed in program order across the barrier, and
> - * smp_mb() to represent explicit memory barriers forcing full memory
> - * ordering across the barrier, we have the following ordering table for
> - * each pair of barrier(), sys_membarrier() and smp_mb():
> - *
> - * The pair ordering is detailed as (O: ordered, X: not ordered):
> - *
> - *                        barrier()   smp_mb() sys_membarrier()
> - *        barrier()          X           X            O
> - *        smp_mb()           X           O            O
> - *        sys_membarrier()   O           O            O
> - */
> -SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> -{
> -	/* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
> -	if (tick_nohz_full_enabled())
> -		return -ENOSYS;
> -	if (unlikely(flags))
> -		return -EINVAL;
> -	switch (cmd) {
> -	case MEMBARRIER_CMD_QUERY:
> -		return MEMBARRIER_CMD_BITMASK;
> -	case MEMBARRIER_CMD_SHARED:
> -		if (num_online_cpus() > 1)
> -			synchronize_sched();
> -		return 0;
> -	default:
> -		return -EINVAL;
> -	}
> -}
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 53f0164ed362..78f54932ea1d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>  obj-$(CONFIG_CPU_FREQ) += cpufreq.o
>  obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 17c667b427b4..dd677fb2ee92 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	vtime_task_switch(prev);
>  	perf_event_task_sched_in(prev, current);
>  	finish_lock_switch(rq, prev);
> +	/*
> +	 * The membarrier system call requires a full memory barrier
> +	 * after storing to rq->curr, before going back to user-space.
> +	 */
> +	smp_mb__after_unlock_lock();
>  	finish_arch_post_lock_switch();
> 
>  	fire_sched_in_preempt_notifiers(current);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> new file mode 100644
> index 000000000000..287a3f88f9b2
> --- /dev/null
> +++ b/kernel/sched/membarrier.c
> @@ -0,0 +1,152 @@
> +/*
> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + *
> + * membarrier system call
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/membarrier.h>
> +#include <linux/tick.h>
> +#include <linux/cpumask.h>
> +
> +#include "sched.h"	/* for cpu_rq(). */
> +
> +/*
> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> + * except MEMBARRIER_CMD_QUERY.
> + */
> +#define MEMBARRIER_CMD_BITMASK	\
> +	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +
> +static void ipi_mb(void *info)
> +{
> +	smp_mb();	/* IPIs should be serializing but paranoid. */
> +}
> +
> +static void membarrier_private_expedited(void)
> +{
> +	int cpu;
> +	bool fallback = false;
> +	cpumask_var_t tmpmask;
> +
> +	if (num_online_cpus() == 1)
> +		return;
> +
> +	/*
> +	 * Matches memory barriers around rq->curr modification in
> +	 * scheduler.
> +	 */
> +	smp_mb();	/* system call entry is not a mb. */
> +
> +	/*
> +	 * Expedited membarrier commands guarantee that they won't
> +	 * block, hence the GFP_NOWAIT allocation flag and fallback
> +	 * implementation.
> +	 */
> +	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +		/* Fallback for OOM. */
> +		fallback = true;
> +	}
> +
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		struct task_struct *p;
> +
> +		/*
> +		 * Skipping the current CPU is OK even through we can be
> +		 * migrated at any point. The current CPU, at the point
> +		 * where we read raw_smp_processor_id(), is ensured to
> +		 * be in program order with respect to the caller
> +		 * thread. Therefore, we can skip this CPU from the
> +		 * iteration.
> +		 */
> +		if (cpu == raw_smp_processor_id())
> +			continue;
> +		rcu_read_lock();
> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		if (p && p->mm == current->mm) {
> +			if (!fallback)
> +				__cpumask_set_cpu(cpu, tmpmask);
> +			else
> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
> +		}
> +		rcu_read_unlock();
> +	}
> +	if (!fallback) {
> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> +		free_cpumask_var(tmpmask);
> +	}
> +	cpus_read_unlock();
> +
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI. Matches memory barriers around
> +	 * rq->curr modification in scheduler.
> +	 */
> +	smp_mb();	/* exit from system call is not a mb */
> +}
> +
> +/**
> + * sys_membarrier - issue memory barriers on a set of threads
> + * @cmd:   Takes command values defined in enum membarrier_cmd.
> + * @flags: Currently needs to be 0. For future extensions.
> + *
> + * If this system call is not implemented, -ENOSYS is returned. If the
> + * command specified does not exist, not available on the running
> + * kernel, or if the command argument is invalid, this system call
> + * returns -EINVAL. For a given command, with flags argument set to 0,
> + * this system call is guaranteed to always return the same value until
> + * reboot.
> + *
> + * All memory accesses performed in program order from each targeted thread
> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
> + * the semantic "barrier()" to represent a compiler barrier forcing memory
> + * accesses to be performed in program order across the barrier, and
> + * smp_mb() to represent explicit memory barriers forcing full memory
> + * ordering across the barrier, we have the following ordering table for
> + * each pair of barrier(), sys_membarrier() and smp_mb():
> + *
> + * The pair ordering is detailed as (O: ordered, X: not ordered):
> + *
> + *                        barrier()   smp_mb() sys_membarrier()
> + *        barrier()          X           X            O
> + *        smp_mb()           X           O            O
> + *        sys_membarrier()   O           O            O
> + */
> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> +{
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	switch (cmd) {
> +	case MEMBARRIER_CMD_QUERY:
> +	{
> +		int cmd_mask = MEMBARRIER_CMD_BITMASK;
> +
> +		if (tick_nohz_full_enabled())
> +			cmd_mask &= ~MEMBARRIER_CMD_SHARED;
> +		return cmd_mask;
> +	}
> +	case MEMBARRIER_CMD_SHARED:
> +		/* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
> +		if (tick_nohz_full_enabled())
> +			return -EINVAL;
> +		if (num_online_cpus() > 1)
> +			synchronize_sched();
> +		return 0;
> +	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> +		membarrier_private_expedited();
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-07-28 18:29 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-07-28 16:54 [RFC PATCH v3] membarrier: expedited private command Mathieu Desnoyers
2017-07-28 17:01 ` Peter Zijlstra
2017-07-28 18:29 ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox