* [RFC PATCH v3] membarrier: provide core serialization
From: Mathieu Desnoyers @ 2017-09-01 16:10 UTC
To: Paul E . McKenney, Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Boqun Feng, Andrew Hunter,
Maged Michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Dave Watson, Andy Lutomirski,
Will Deacon, Hans Boehm
Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
system call. It allows membarrier to issue core serializing barriers in
addition to memory barriers on target threads whenever a membarrier
command is performed.
It is relevant for reclaim of JIT code, which requires issuing core
serializing barriers on all threads running on behalf of a process,
after ensuring the old code is no longer visible, before re-using the
memory for new code.
The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED command, used with the
MEMBARRIER_FLAG_SYNC_CORE flag, registers the current process as
requiring core serialization. It may block. It can be used to ensure
that MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time
it is invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
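A minimal userspace usage sketch follows (the command and flag values
are the ones introduced by this patch and are not in released uapi
headers, so they are defined locally; error handling is reduced to
perror):

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  /* Values taken from this patch's uapi additions. */
  #define MEMBARRIER_CMD_PRIVATE_EXPEDITED          (1 << 3)
  #define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED (1 << 4)
  #define MEMBARRIER_FLAG_SYNC_CORE                 (1 << 0)

  static int membarrier(int cmd, int flags)
  {
          return syscall(__NR_membarrier, cmd, flags);
  }

  int main(void)
  {
          /* Register once at startup; this call may block. */
          if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
                         MEMBARRIER_FLAG_SYNC_CORE))
                  perror("membarrier register");
          /*
           * Later, e.g. when reclaiming JIT code; does not block once
           * the process is registered.
           */
          if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED,
                         MEMBARRIER_FLAG_SYNC_CORE))
                  perror("membarrier private expedited");
          return 0;
  }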
* Scheduler Overhead Benchmarks
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Linux v4.13-rc6
Inter-thread scheduling
taskset 01 ./perf bench sched pipe -T
Avg. usecs/op Std.Dev. usecs/op
Before this change: 2.55 0.10
With this change: 2.49 0.08
SYNC_CORE processes: 2.70 0.10
Inter-process scheduling
taskset 01 ./perf bench sched pipe
Before this change: 2.93 0.13
With this change: 2.93 0.13
SYNC_CORE processes: 3.20 0.06
Changes since v2:
- Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
- Introduce the "MEMBARRIER_FLAG_SYNC_CORE" flag.
- Introduce CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE, only implemented by
x86 32/64 initially.
- Introduce arch_membarrier_user_icache_flush, a no-op on x86 32/64,
which can be implemented on architectures with incoherent data and
instruction caches. It is associated with
CONFIG_ARCH_HAS_MEMBARRIER_USER_ICACHE_FLUSH.
- Introduce the membarrier_sync_core_active counter, used for the shared
system-wide membarrier with the MEMBARRIER_FLAG_SYNC_CORE flag. When
non-zero, it causes sync_core to be issued on sched_out.
- The membarrier_sync_core per-thread flag still issues a sync_core()
on sched_out, but now issues both sync_core and icache flush on
sched_in, only when the current->mm changes between prev and next.
Changes since v1:
- Add missing MEMBARRIER_CMD_REGISTER_SYNC_CORE header documentation,
- Add benchmarks to commit message.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Will Deacon <will.deacon@arm.com>
CC: Hans Boehm <hboehm@google.com>
---
arch/x86/Kconfig | 1 +
fs/exec.c | 1 +
include/linux/sched.h | 82 +++++++++++++++++++++++++
include/uapi/linux/membarrier.h | 32 ++++++++--
init/Kconfig | 6 ++
kernel/fork.c | 2 +
kernel/sched/core.c | 3 +
kernel/sched/membarrier.c | 133 +++++++++++++++++++++++++++++++++++-----
8 files changed, 240 insertions(+), 20 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 323cb065be5e..d39ae515632e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -62,6 +62,7 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_ZONE_DEVICE if X86_64
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/fs/exec.c b/fs/exec.c
index 62175cbcc801..a4ab3253bac7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1794,6 +1794,7 @@ static int do_execveat_common(int fd, struct filename *filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+ membarrier_execve(current);
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8337e2db0bb2..113d9c03a21c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1086,6 +1086,9 @@ struct task_struct {
/* Used by LSM modules for access restriction: */
void *security;
#endif
+#ifdef CONFIG_MEMBARRIER
+ int membarrier_sync_core;
+#endif
/*
* New fields for task_struct should be added above here, so that
@@ -1623,4 +1626,83 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_USER_ICACHE_FLUSH
+/*
+ * Architectures with incoherent data and instruction caches are
+ * required to implement arch_membarrier_user_icache_flush() if they
+ * want to support the MEMBARRIER_FLAG_SYNC_CORE flag.
+ */
+extern void arch_membarrier_user_icache_flush(void);
+#else
+static inline void arch_membarrier_user_icache_flush(void)
+{
+}
+#endif
+
+#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)
+extern atomic_long_t membarrier_sync_core_active;
+
+static inline void membarrier_fork(struct task_struct *t,
+ unsigned long clone_flags)
+{
+ /*
+ * Coherence of membarrier_sync_core against thread fork is
+ * protected by siglock. membarrier_fork is called with siglock
+ * held.
+ */
+ t->membarrier_sync_core = current->membarrier_sync_core;
+}
+static inline void membarrier_execve(struct task_struct *t)
+{
+ t->membarrier_sync_core = 0;
+}
+static inline void membarrier_sched_out(struct task_struct *t)
+{
+ /*
+ * Core serialization is performed before the memory barrier
+ * preceding the store to rq->curr. A non-zero sync_core_active
+ * implies that a core serializing shared membarrier is in
+ * progress.
+ */
+ if (unlikely(READ_ONCE(t->membarrier_sync_core)
+ || atomic_long_read(&membarrier_sync_core_active)))
+ sync_core();
+ /*
+ * Flushing icache on each scheduler entry when a shared
+ * membarrier requiring core serialization is in progress.
+ */
+ if (unlikely(atomic_long_read(&membarrier_sync_core_active)))
+ arch_membarrier_user_icache_flush();
+}
+static inline void membarrier_sched_in(struct task_struct *prev,
+ struct task_struct *next)
+{
+ /*
+ * Core serialization is performed after the memory barrier
+ * following the store to rq->curr.
+ */
+ if (unlikely(READ_ONCE(next->membarrier_sync_core))) {
+ if (unlikely(prev->mm != next->mm)) {
+ sync_core();
+ arch_membarrier_user_icache_flush();
+ }
+ }
+}
+#else
+static inline void membarrier_fork(struct task_struct *t,
+ unsigned long clone_flags)
+{
+}
+static inline void membarrier_execve(struct task_struct *t)
+{
+}
+static inline void membarrier_sched_out(struct task_struct *t)
+{
+}
+static inline void membarrier_sched_in(struct task_struct *prev,
+ struct task_struct *next)
+{
+}
+#endif
+
#endif
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 6d47b3249d8a..4c8682026500 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -54,19 +54,41 @@
* same processes as the caller thread. This
* command returns 0. The "expedited" commands
* complete faster than the non-expedited ones,
- * they never block, but have the downside of
- * causing extra overhead.
+ * they usually never block, but have the
+ * downside of causing extra overhead. The only
+ * case where it can block is the first time it
+ * is called by a process with the
+ * MEMBARRIER_FLAG_SYNC_CORE flag, if there has
+ * not been any prior registration of that
+ * process with
+ * MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED
+ * and the same flag.
+ * @MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
+ * When used with MEMBARRIER_FLAG_SYNC_CORE,
+ * register the current process as requiring
+ * core serialization when a private expedited
+ * membarrier is issued. It may block. It can
+ * be used to ensure
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED never
+ * blocks, even the first time it is invoked by
+ * a process with the MEMBARRIER_FLAG_SYNC_CORE
+ * flag.
*
* Command to be passed to the membarrier system call. The commands need to
* be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
* the value 0.
*/
enum membarrier_cmd {
- MEMBARRIER_CMD_QUERY = 0,
- MEMBARRIER_CMD_SHARED = (1 << 0),
+ MEMBARRIER_CMD_QUERY = 0,
+ MEMBARRIER_CMD_SHARED = (1 << 0),
/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
- MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
+ MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+};
+
+enum membarrier_flags {
+ MEMBARRIER_FLAG_SYNC_CORE = (1 << 0),
};
#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..e74baef9f347 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -615,6 +615,12 @@ config ARCH_SUPPORTS_INT128
config ARCH_WANT_NUMA_VARIABLE_LOCALITY
bool
+# For architectures implementing membarrier core synchronization,
+# required by the membarrier sync_core registration.
+#
+config ARCH_HAS_MEMBARRIER_SYNC_CORE
+ bool
+
config NUMA_BALANCING
bool "Memory placement aware NUMA scheduler"
depends on ARCH_SUPPORTS_NUMA_BALANCING
diff --git a/kernel/fork.c b/kernel/fork.c
index e075b7780421..1d44d7250431 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1840,6 +1840,8 @@ static __latent_entropy struct task_struct *copy_process(
*/
copy_seccomp(p);
+ membarrier_fork(p, clone_flags);
+
/*
* Process group and session signals need to be delivered to just the
* parent before the fork or both the parent and the child after the
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d57553551ad6..98aac5a44604 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3292,6 +3292,8 @@ static void __sched notrace __schedule(bool preempt)
local_irq_disable();
rcu_note_context_switch(preempt);
+ membarrier_sched_out(prev);
+
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
@@ -3364,6 +3366,7 @@ static void __sched notrace __schedule(bool preempt)
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
+ membarrier_sched_in(prev, next);
} else {
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unlock_irq(rq, &rf);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 7eec6914d2d2..8c8a25e17a50 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -18,6 +18,7 @@
#include <linux/membarrier.h>
#include <linux/tick.h>
#include <linux/cpumask.h>
+#include <linux/atomic.h>
#include "sched.h" /* for cpu_rq(). */
@@ -25,22 +26,118 @@
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
-#define MEMBARRIER_CMD_BITMASK \
- (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+#define MEMBARRIER_CMD_BITMASK \
+ (MEMBARRIER_CMD_SHARED \
+ | MEMBARRIER_CMD_PRIVATE_EXPEDITED \
+ | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)
+
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+atomic_long_t membarrier_sync_core_active;
+
+static void membarrier_shared_sync_core_begin(int flags)
+{
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE)
+ atomic_long_inc(&membarrier_sync_core_active);
+}
+
+static void membarrier_shared_sync_core_end(int flags)
+{
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE)
+ atomic_long_dec(&membarrier_sync_core_active);
+}
+
+static int membarrier_register_private_expedited_sync_core(void)
+{
+ struct task_struct *p = current, *t;
+
+ if (READ_ONCE(p->membarrier_sync_core))
+ return 0;
+ if (get_nr_threads(p) == 1) {
+ p->membarrier_sync_core = 1;
+ return 0;
+ }
+
+ /*
+ * Coherence of membarrier_sync_core against thread fork is
+ * protected by siglock.
+ */
+ spin_lock(&p->sighand->siglock);
+ for_each_thread(p, t)
+ WRITE_ONCE(t->membarrier_sync_core, 1);
+ spin_unlock(&p->sighand->siglock);
+ /*
+ * Ensure all future scheduler execution will observe the new
+ * membarrier_sync_core state for this process.
+ */
+ synchronize_sched();
+ return 0;
+}
+static void membarrier_sync_core(void)
+{
+ sync_core();
+}
+#else
+static void membarrier_shared_sync_core_begin(int flags)
+{
+}
+static void membarrier_shared_sync_core_end(int flags)
+{
+}
+static int membarrier_register_private_expedited_sync_core(void)
+{
+ return -EINVAL;
+}
+static void membarrier_sync_core(void)
+{
+}
+#endif
+
+static int membarrier_shared(int flags)
+{
+ if (unlikely(flags & ~MEMBARRIER_FLAG_SYNC_CORE))
+ return -EINVAL;
+ /* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
+ if (tick_nohz_full_enabled())
+ return -EINVAL;
+ if (num_online_cpus() == 1)
+ return 0;
+
+ membarrier_shared_sync_core_begin(flags);
+ synchronize_sched();
+ membarrier_shared_sync_core_end(flags);
+
+ return 0;
+}
static void ipi_mb(void *info)
{
- smp_mb(); /* IPIs should be serializing but paranoid. */
+ /* IPIs should be serializing but paranoid. */
+ smp_mb();
+ membarrier_sync_core();
+ arch_membarrier_user_icache_flush();
}
-static void membarrier_private_expedited(void)
+static int membarrier_private_expedited(int flags)
{
int cpu;
bool fallback = false;
cpumask_var_t tmpmask;
+ if (unlikely(flags & ~MEMBARRIER_FLAG_SYNC_CORE))
+ return -EINVAL;
+ /*
+ * Do the process registration ourself if it has not been
+ * performed by an explicit register command.
+ */
+ if (unlikely(flags & MEMBARRIER_FLAG_SYNC_CORE)) {
+ int ret;
+
+ ret = membarrier_register_private_expedited_sync_core();
+ if (ret)
+ return ret;
+ }
if (num_online_cpus() == 1 || get_nr_threads(current) == 1)
- return;
+ return 0;
/*
* Matches memory barriers around rq->curr modification in
@@ -94,6 +191,16 @@ static void membarrier_private_expedited(void)
* rq->curr modification in scheduler.
*/
smp_mb(); /* exit from system call is not a mb */
+ return 0;
+}
+
+static int membarrier_register_private_expedited(int flags)
+{
+ if (unlikely(flags & ~MEMBARRIER_FLAG_SYNC_CORE))
+ return -EINVAL;
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE)
+ return membarrier_register_private_expedited_sync_core();
+ return 0;
}
/**
@@ -125,27 +232,23 @@ static void membarrier_private_expedited(void)
*/
SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
{
- if (unlikely(flags))
- return -EINVAL;
switch (cmd) {
case MEMBARRIER_CMD_QUERY:
{
int cmd_mask = MEMBARRIER_CMD_BITMASK;
+ if (unlikely(flags))
+ return -EINVAL;
if (tick_nohz_full_enabled())
cmd_mask &= ~MEMBARRIER_CMD_SHARED;
return cmd_mask;
}
case MEMBARRIER_CMD_SHARED:
- /* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
- if (tick_nohz_full_enabled())
- return -EINVAL;
- if (num_online_cpus() > 1)
- synchronize_sched();
- return 0;
+ return membarrier_shared(flags);
case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
- membarrier_private_expedited();
- return 0;
+ return membarrier_private_expedited(flags);
+ case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
+ return membarrier_register_private_expedited(flags);
default:
return -EINVAL;
}
--
2.11.0
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Will Deacon @ 2017-09-01 16:25 UTC
To: Mathieu Desnoyers
Cc: Paul E . McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
Andrew Hunter, Maged Michael, gromer, Avi Kivity,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Dave Watson, Andy Lutomirski, Hans Boehm
On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
> system call. It allows membarrier to issue core serializing barriers in
> addition to memory barriers on target threads whenever a membarrier
> command is performed.
>
> It is relevant for reclaim of JIT code, which requires to issue core
> serializing barriers on all threads running on behalf of a process
> after ensuring the old code is not visible anymore, before re-using
> memory for new code.
>
> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED used with
> MEMBARRIER_FLAG_SYNC_CORE flag registers the current process as
> requiring core serialization. It may block. It can be used to ensure
> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
>
> * Scheduler Overhead Benchmarks
>
> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> Linux v4.13-rc6
>
> Inter-thread scheduling
> taskset 01 ./perf bench sched pipe -T
>
> Avg. usecs/op Std.Dev. usecs/op
> Before this change: 2.55 0.10
> With this change: 2.49 0.08
> SYNC_CORE processes: 2.70 0.10
>
> Inter-process scheduling
> taskset 01 ./perf bench sched pipe
>
> Before this change: 2.93 0.13
> With this change: 2.93 0.13
> SYNC_CORE processes: 3.20 0.06
>
> Changes since v2:
> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
I'm still not convinced that this registration step is needed (at least
for arm, power and x86), but my previous comments were ignored.
> - Introduce the "MEMBARRIER_FLAG_SYNC_CORE" flag.
> - Introduce CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE, only implemented by
> x86 32/64 initially.
> - Introduce arch_membarrier_user_icache_flush, a no-op on x86 32/64,
> which can be implemented on architectures with incoherent data and
> instruction caches. It is associated with
> CONFIG_ARCH_HAS_MEMBARRIER_USER_ICACHE_FLUSH.
Given that MEMBARRIER_FLAG_SYNC_CORE is about flushing the internal CPU
pipeline (iiuc), could we rename this so that it doesn't mention the
I-cache, please? I-cache flushing is a very different operation on most
architectures I'm aware of, and on arm64 it's even available to userspace
(and broadcast in hardware to other cores).
Will
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Mathieu Desnoyers @ 2017-09-01 17:00 UTC
To: Will Deacon
Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
Andrew Hunter, maged michael, gromer, Avi Kivity,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Dave Watson, Andy Lutomirski, Hans Boehm
----- On Sep 1, 2017, at 12:25 PM, Will Deacon will.deacon@arm.com wrote:
> On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
>> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
>> system call. It allows membarrier to issue core serializing barriers in
>> addition to memory barriers on target threads whenever a membarrier
>> command is performed.
>>
>> It is relevant for reclaim of JIT code, which requires to issue core
>> serializing barriers on all threads running on behalf of a process
>> after ensuring the old code is not visible anymore, before re-using
>> memory for new code.
>>
>> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED used with
>> MEMBARRIER_FLAG_SYNC_CORE flag registers the current process as
>> requiring core serialization. It may block. It can be used to ensure
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
>> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
>>
>> * Scheduler Overhead Benchmarks
>>
>> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> Linux v4.13-rc6
>>
>> Inter-thread scheduling
>> taskset 01 ./perf bench sched pipe -T
>>
>> Avg. usecs/op Std.Dev. usecs/op
>> Before this change: 2.55 0.10
>> With this change: 2.49 0.08
>> SYNC_CORE processes: 2.70 0.10
>>
>> Inter-process scheduling
>> taskset 01 ./perf bench sched pipe
>>
>> Before this change: 2.93 0.13
>> With this change: 2.93 0.13
>> SYNC_CORE processes: 3.20 0.06
>>
>> Changes since v2:
>> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
>> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
>
> I'm still not convinced that this registration step is needed (at least
> for arm, power and x86), but my previous comments were ignored.
I mistakenly thought that your previous comments were addressed in
other legs of the previous thread, sorry about that.
Let's take x86 as an example. The private expedited membarrier
command iterates on all cpu runqueues, checking if rq->curr->mm
matches current->mm, and only sends an IPI if it matches.
We can very well have a CPU for which the scheduler goes back
and forth between a user-space thread and a kernel thread, in
which case the mm state is kept as is, and rq->curr->mm is
temporarily saved into rq->curr->active_mm.
This means that while that CPU is executing a kthread, we
won't send any IPI to that CPU, but it could then schedule
back a thread belonging to the original process, and then
we go back to executing user-space code without having issued
any kind of core serializing barrier (assuming we return to
userspace with sysexit).
Now about arm64: given that, as you say, it issues a core serializing
barrier when returning to user-space, and has a strong barrier
in switch_to, the explicit sync_core() in sched_in
is not needed.
However, AFAIU, arm64 does not guarantee consistent data and instruction
caches.
I'm actually trying to wrap my head around what would be the sequence
of operations for a JIT trying to reclaim memory. Can we combine
core serialization and instruction cache flushing into a single
system call invocation, or do we need to split this into two separate
operations?
The JIT reclaim usage scheme I envision is:
- userspace unpublish all reference to old code,
- userspace ensure no thread use the old code anymore,
- sys_membarrier
- for each executing threads
- issue core serializing barrier
- userspace use a separate system call to issue data cache flush for
the modified range
- sys_membarrier
- for each executing threads
- issue instruction cache flush
So my current thinking is that we may need to change the membarrier
system call so one command serializes the core, and a separate command
issues cache flush.
By the way, is there a system call on arm64 and arm32 allowing user-space
to flush a range of user data cache?
>
>> - Introduce the "MEMBARRIER_FLAG_SYNC_CORE" flag.
>> - Introduce CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE, only implemented by
>> x86 32/64 initially.
>> - Introduce arch_membarrier_user_icache_flush, a no-op on x86 32/64,
>> which can be implemented on architectures with incoherent data and
>> instruction caches. It is associated with
>> CONFIG_ARCH_HAS_MEMBARRIER_USER_ICACHE_FLUSH.
>
> Given that MEMBARRIER_FLAG_SYNC_CORE is about flushing the internal CPU
> pipeline (iiuc), could we rename this so that it doesn't mention the
> I-cache, please? I-cache flushing is a very different operation on most
> architectures I'm aware of, and on arm64 it's even available to userspace
> (and broadcast in hardware to other cores).
I'm starting to think we may need to expose separate membarrier commands
for core_sync and icache flush. Am I on the right path, or missing something
here?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Will Deacon @ 2017-09-01 17:10 UTC
To: Mathieu Desnoyers
Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
Andrew Hunter, maged michael, gromer, Avi Kivity,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Dave Watson, Andy Lutomirski, Hans Boehm
Hi Mathieu,
On Fri, Sep 01, 2017 at 05:00:38PM +0000, Mathieu Desnoyers wrote:
> ----- On Sep 1, 2017, at 12:25 PM, Will Deacon will.deacon@arm.com wrote:
>
> > On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
> >> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
> >> system call. It allows membarrier to issue core serializing barriers in
> >> addition to memory barriers on target threads whenever a membarrier
> >> command is performed.
> >>
> >> It is relevant for reclaim of JIT code, which requires to issue core
> >> serializing barriers on all threads running on behalf of a process
> >> after ensuring the old code is not visible anymore, before re-using
> >> memory for new code.
> >>
> >> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED used with
> >> MEMBARRIER_FLAG_SYNC_CORE flag registers the current process as
> >> requiring core serialization. It may block. It can be used to ensure
> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
> >> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
> >>
> >> * Scheduler Overhead Benchmarks
> >>
> >> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> >> Linux v4.13-rc6
> >>
> >> Inter-thread scheduling
> >> taskset 01 ./perf bench sched pipe -T
> >>
> >> Avg. usecs/op Std.Dev. usecs/op
> >> Before this change: 2.55 0.10
> >> With this change: 2.49 0.08
> >> SYNC_CORE processes: 2.70 0.10
> >>
> >> Inter-process scheduling
> >> taskset 01 ./perf bench sched pipe
> >>
> >> Before this change: 2.93 0.13
> >> With this change: 2.93 0.13
> >> SYNC_CORE processes: 3.20 0.06
> >>
> >> Changes since v2:
> >> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
> >> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
> >
> > I'm still not convinced that this registration step is needed (at least
> > for arm, power and x86), but my previous comments were ignored.
>
> I mistakenly thought that your previous comments were addressed in
> other legs of the previous thread, sorry about that.
No problem, thanks for replying this time!
> Let's take x86 as an example. The private expedited membarrier
> command iterates on all cpu runqueues, checking if rq->curr->mm
> match current->mm, and only IPI if it matches.
>
> We can very well have a CPU for which the scheduler goes back
> and forth between user-space thread and a kernel thread, in
> which case the mm state is kept as is, and rq->curr->mm is
> temporarily saved into rq->curr->active_mm.
>
> This means that while that CPU is executing a kthread, we
> won't send any IPI that that CPU, but it could then schedule
> back a thread belonging to the original process, and then
> we go back executing user-space code without having issued
> any kind of core serializing barrier (assuming we return to
> userspace with sysexit).
Right, ok. I forgot about Andy's sysexit optimisation on x86.
> Now about arm64, given that as you say it issues a core serializing
> barrier when returning to user-space, and has a strong barrier
> in switch_to, this means that the explicit sync_core() in sched_in
> is not needed.
Good, that's what I thought.
> However, AFAIU, arm64 does not guarantee consistent data and instruction
> caches.
Correct, but:
* On 32-bit arm, we have a syscall to do that (and this is already used by
JITs and things like __builtin_clear_cache)
* On arm64, cache maintenance instructions are directly available to
userspace
In both cases, the maintenance is broadcast by the hardware to all CPUs.
The only part that cannot be broadcast is the pipeline flush, which is
the part we need to do above and is implicit on exception return.
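For reference, a hedged sketch of how a JIT can perform that cache
maintenance portably from the writing thread (code_buf and len are
made-up names; the builtin wraps the cacheflush syscall on arm32 and
the user-space maintenance instructions on arm64):

  #include <stddef.h>

  /* code_buf/len are hypothetical: the freshly written JIT region. */
  static void clean_jit_range(void *code_buf, size_t len)
  {
          /*
           * GCC/Clang builtin: on arm32 it ends up in the cacheflush
           * syscall, on arm64 it typically expands to DC CVAU/IC IVAU
           * plus barriers; the maintenance is broadcast to all CPUs.
           */
          __builtin___clear_cache((char *)code_buf,
                                  (char *)code_buf + len);
  }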
> I'm actually trying to wrap my head around what would be the sequence
> of operations of a JIT trying to reclaim memory. Can we combine
> core serialization and instruction cache flushing into a single
> system call invocation, or we need to split this into two separate
> operations ?
I think that cache-flushing and pipeline-flushing should be separated,
as they tend to be in the CPU architectures I'm familiar with.
> The JIT reclaim usage scheme I envision is:
>
> - userspace unpublish all reference to old code,
> - userspace ensure no thread use the old code anymore,
> - sys_membarrier
> - for each executing threads
> - issue core serializing barrier
> - userspace use a separate system call to issue data cache flush for
> the modified range
> - sys_membarrier
> - for each executing threads
> - issue instruction cache flush
>
> So my current thinking is that we may need to change the membarrier
> system call so one command serializes the core, and a separate command
> issues cache flush.
Yeah, and the sequence is slightly different I think, as we need the
pipeline flush to come *after* the I-cache invalidation (otherwise the
stale instructions can just be refetched).
If you're at LPC in a week's time, this might be a good thing to sit down
and bash our heads against (espec. if we can grab PPC and x86 folks too).
Will
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Mathieu Desnoyers @ 2017-09-01 18:45 UTC
To: Will Deacon
Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
Andrew Hunter, maged michael, gromer, Avi Kivity,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Dave Watson, Andy Lutomirski, Hans Boehm, Russell King
----- On Sep 1, 2017, at 1:10 PM, Will Deacon will.deacon@arm.com wrote:
> Hi Mathieu,
>
> On Fri, Sep 01, 2017 at 05:00:38PM +0000, Mathieu Desnoyers wrote:
>> ----- On Sep 1, 2017, at 12:25 PM, Will Deacon will.deacon@arm.com wrote:
>>
>> > On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
>> >> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
>> >> system call. It allows membarrier to issue core serializing barriers in
>> >> addition to memory barriers on target threads whenever a membarrier
>> >> command is performed.
>> >>
>> >> It is relevant for reclaim of JIT code, which requires to issue core
>> >> serializing barriers on all threads running on behalf of a process
>> >> after ensuring the old code is not visible anymore, before re-using
>> >> memory for new code.
>> >>
>> >> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED used with
>> >> MEMBARRIER_FLAG_SYNC_CORE flag registers the current process as
>> >> requiring core serialization. It may block. It can be used to ensure
>> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
>> >> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
>> >>
>> >> * Scheduler Overhead Benchmarks
>> >>
>> >> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> >> Linux v4.13-rc6
>> >>
>> >> Inter-thread scheduling
>> >> taskset 01 ./perf bench sched pipe -T
>> >>
>> >> Avg. usecs/op Std.Dev. usecs/op
>> >> Before this change: 2.55 0.10
>> >> With this change: 2.49 0.08
>> >> SYNC_CORE processes: 2.70 0.10
>> >>
>> >> Inter-process scheduling
>> >> taskset 01 ./perf bench sched pipe
>> >>
>> >> Before this change: 2.93 0.13
>> >> With this change: 2.93 0.13
>> >> SYNC_CORE processes: 3.20 0.06
>> >>
>> >> Changes since v2:
>> >> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
>> >> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
>> >
>> > I'm still not convinced that this registration step is needed (at least
>> > for arm, power and x86), but my previous comments were ignored.
>>
>> I mistakenly thought that your previous comments were addressed in
>> other legs of the previous thread, sorry about that.
>
> No problem, thanks for replying this time!
And thanks for the reminder :)
>
>> Let's take x86 as an example. The private expedited membarrier
>> command iterates on all cpu runqueues, checking if rq->curr->mm
>> match current->mm, and only IPI if it matches.
>>
>> We can very well have a CPU for which the scheduler goes back
>> and forth between user-space thread and a kernel thread, in
>> which case the mm state is kept as is, and rq->curr->mm is
>> temporarily saved into rq->curr->active_mm.
>>
>> This means that while that CPU is executing a kthread, we
>> won't send any IPI that that CPU, but it could then schedule
>> back a thread belonging to the original process, and then
>> we go back executing user-space code without having issued
>> any kind of core serializing barrier (assuming we return to
>> userspace with sysexit).
>
> Right, ok. I forgot about Andy's sysexit optimisation on x86.
>
>> Now about arm64, given that as you say it issues a core serializing
>> barrier when returning to user-space, and has a strong barrier
>> in switch_to, this means that the explicit sync_core() in sched_in
>> is not needed.
>
> Good, that's what I thought.
And now that I think about it further, I think we could do without the
sync_core in sched_out; we may just need a core serializing instruction
between the full barrier after the rq->curr store and the return to
user-space.
The rationale is that we just want to issue a core serializing instruction
before executing further user-space instructions. We might not care
about ordering wrt instructions executed by user-space prior to
entering the scheduler. We'd need advice from architecture
maintainers on this point though.
>
>> However, AFAIU, arm64 does not guarantee consistent data and instruction
>> caches.
>
> Correct, but:
>
> * On 32-bit arm, we have a syscall to do that (and this is already used by
> JITs and things like __builtin_clear_cache)
>
> * On arm64, cache maintenance instructions are directly available to
> userspace
Good!
>
> In both cases, the maintenance is broadcast by the hardware to all CPUs.
> The only part that cannot be broadcast is the pipeline flush, which is
> the part we need to do above and is implicit on exception return.
>
>> I'm actually trying to wrap my head around what would be the sequence
>> of operations of a JIT trying to reclaim memory. Can we combine
>> core serialization and instruction cache flushing into a single
>> system call invocation, or we need to split this into two separate
>> operations ?
>
> I think that cache-flushing and pipeline-flushing should be separated,
> as they tend to be in the CPU architectures I'm familiar with.
Indeed, if we make the icache flushing separate, then we can apply it
to specific address ranges and such, without having
to flush the entire user icache.
>
>> The JIT reclaim usage scheme I envision is:
>>
>> - userspace unpublish all reference to old code,
>> - userspace ensure no thread use the old code anymore,
>> - sys_membarrier
>> - for each executing threads
>> - issue core serializing barrier
>> - userspace use a separate system call to issue data cache flush for
>> the modified range
>> - sys_membarrier
>> - for each executing threads
>> - issue instruction cache flush
>>
>> So my current thinking is that we may need to change the membarrier
>> system call so one command serializes the core, and a separate command
>> issues cache flush.
>
> Yeah, and the sequence is slightly different I think, as we need the
> pipeline flush to come *after* the I-cache invalidation (otherwise the
> stale instructions can just be refetched).
In my scenario, notice that userspace first unpublishes all refs to old
code, and does its own waiting for any thread still seeing the old
code (e.g. by using RCU). However, URCU currently only has full barriers,
not core serializing barriers.
This means that when sys_membarrier is invoked, no thread can branch
into that old code anymore. What we actually want there is to synchronize
the core and flush the icache before we eventually publish a reference
to the new code.
What I wonder is whether the simple fact that some cores still hold the
old unsynchronized state prevents us from safely overwriting the old
code at that point, even though none of them can execute it going forward,
or if we need the sync core on every core before we even overwrite the
old code (conservative approach).
Assuming we don't need a sync core before updating the old code, an
aggressive approach would be:
reclaim and re-use (aggressive):
1- userspace unpublish all reference to old code,
2- userspace ensure no thread use the old code anymore (e.g. URCU),
3- userspace updates old code -> new code
4- issue data cache flush for the modified range (if needed)
5- sys_membarrier
- for each executing threads
- issue core serializing barrier
6- issue instruction cache flush for the modified range (if needed)
(may be required on all active threads on some architectures)
7- userspace publish reference to new code
However, if we do need a sync core before updating the old code,
the conservative approach looks like:
reclaim and re-use (conservative):
1- userspace unpublish all reference to old code,
2- userspace ensure no thread use the old code anymore (e.g. URCU),
3- sys_membarrier
- for each executing threads
- issue core serializing barrier
4- userspace updates old code -> new code
5- issue data cache flush for the modified range (if needed)
6- issue instruction cache flush for the modified range (if needed)
(may be required on all active threads on some architectures)
7- userspace publish reference to new code
Knowing whether the more "aggressive" approach above is correct should
allow us to find out whether we need to add a sync_core() in sched_out
as well.
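To make the aggressive variant above concrete, here is a hedged sketch
(the unpublish/publish and cache flush helpers are placeholders for
JIT- and arch-specific mechanisms, synchronize_rcu() stands in for the
URCU grace period, and the membarrier values are the ones proposed in
this RFC):

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <urcu.h>   /* liburcu: synchronize_rcu() */

  #define MEMBARRIER_CMD_PRIVATE_EXPEDITED (1 << 3)
  #define MEMBARRIER_FLAG_SYNC_CORE        (1 << 0)

  /* Placeholder prototypes; implementations are JIT/arch specific. */
  extern void unpublish_old_code(void *region);
  extern void publish_new_code(void *region);
  extern void flush_dcache_range(void *start, size_t len);
  extern void flush_icache_range(void *start, size_t len);

  static void reclaim_and_reuse(void *region, const void *new_code,
                                size_t len)
  {
          unpublish_old_code(region);              /* step 1 */
          synchronize_rcu();                       /* step 2 */
          memcpy(region, new_code, len);           /* step 3 */
          flush_dcache_range(region, len);         /* step 4 */
          syscall(__NR_membarrier,                 /* step 5 */
                  MEMBARRIER_CMD_PRIVATE_EXPEDITED,
                  MEMBARRIER_FLAG_SYNC_CORE);
          flush_icache_range(region, len);         /* step 6 */
          publish_new_code(region);                /* step 7 */
  }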
>
> If you're at LPC in a week's time, this might be a good thing to sit down
> and bash our heads against (espec. if we can grab PPC and x86 folks too).
Yes, I'll be there! I could even suggest a microconf topic about this. It
could fit either in Paul's Linux-Kernel Memory Model Workshop track [1], the
Wildcard track [2] or the Hallway track if we can get everyone together.
Thoughts?
Thanks,
Mathieu
[1] http://www.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/632
[2] http://www.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/629
>
> Will
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Will Deacon @ 2017-09-18 17:01 UTC
To: Hans Boehm
Cc: Mathieu Desnoyers, Paul E. McKenney, Peter Zijlstra, linux-kernel,
Boqun Feng, Andrew Hunter, maged michael, gromer, Avi Kivity,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Dave Watson, Andy Lutomirski, Russell King, Greg Hackmann
On Thu, Sep 07, 2017 at 05:03:49PM -0700, Hans Boehm wrote:
> > [Mathieu: ]
> >
> > Assuming we don't need a sync core before updating the old code, an
> > aggressive approach would be:
> >
> > reclaim and re-use (aggressive):
> >
> > 1- userspace unpublish all reference to old code,
> > 2- userspace ensure no thread use the old code anymore (e.g. URCU),
> > 3- userspace updates old code -> new code
> > 4- issue data cache flush for the modified range (if needed)
> > 5- sys_membarrier
> > - for each executing threads
> > - issue core serializing barrier
> > 6- issue instruction cache flush for the modified range (if needed)
> > (may be required on all active threads on some architectures)
> > 7- userspace publish reference to new code
> >
> My assumption was that the right sequence here, at least on Aarch64, is to
> do 5 and 6 in the opposite order; flush the icache, which I believe can
> be done from the thread that wrote the code, and then issue a sys_membarrier
> for the core serializing barrier.
>
> It would be useful to get that clarified.
FWIW, Mathieu and I spent a while talking about this during LPC last week
and ended up agreeing that the ISB (core serialisation) is required *after*
the cache-maintenance to publish the new code has completed.
Will
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Mathieu Desnoyers @ 2017-10-06 20:57 UTC
To: Hans Boehm
Cc: Paul E. McKenney, Peter Zijlstra, Boqun Feng, Andrew Hunter,
maged michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Dave Watson, Andy Lutomirski,
Russell King, ARM Linux, Greg Hackmann, Will Deacon, David Sehr,
linux-kernel, linux-arch
----- On Oct 6, 2017, at 4:14 PM, Hans Boehm hboehm@google.com wrote:
> What's the status of MEMBARRIER_FLAG_SYNC_CORE? The discussion I saw left it
> unclear whether this would be a separate flag, or included by default. Did I
> miss something? I think we're fine with either, but we do have a strong
> interest in getting this in in some form...
> I also believe we're fine with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED. And
> that seems to me like a reasonable way to deal with the added overhead.
[ re-sending with lkml and linux-arch in CC, making sure to send in plain text. ]
Hi Hans,
I'm currently making sure the MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED
command makes its way into the 4.14 kernel before the end of the release candidates.
Once that is done, I plan to post a patch adding a new MEMBARRIER_FLAG_SYNC_CORE
flag for the 4.15 merge window.
I have done a bit of research on the various architecture requirements for core serialization.
Here are my findings so far about instructions providing core serialization on the main
architectures supported by Linux.
There are two places where we need it: in the interrupt handler for the membarrier IPI, and
between scheduler execution (which can change the current "mm") and return to user-space.
Please let me know if I missed anything.
x86: iret, cpuid, wbinvd
-> iret currently provides core serialization when going back to userspace and at the end of
the IPI. There are plans to implement a return path without iret in the future, in which case
I would need to issue an explicit "cpuid" instruction (sync_core()) in switch_mm() if the
process is registered with MEMBARRIER_FLAG_SYNC_CORE.
powerpc: rfi
-> "rfi" instruction provides core serialization when going back to user-space. I believe this
is used at the end of the membarrier IPI as well. (to be confirmed)
arm32: returning to user-space provides core serialization. Same at the end of membarrier
IPI (to be confirmed).
aarch64: ERET instruction used when returning to user-space provides core sync. Same
at the end of membarrier IPI (to be confirmed).
s390/s390x: lpswe provides core sync when returning to user-space. Not sure about end of IPI.
ia64: rfi instruction provides core sync when returning to user-space. Probably the same at the
end of IPI (to be confirmed).
http://refspecs.linuxbase.org/IA64-softdevman-vol2 4.4.6.2
parisc: core serialization is ensured by issuing at least 7 instructions. We should have
at least that when going back to user-space (to be confirmed). Similar for IPI.
https://parisc.wiki.kernel.org/images-parisc/6/68/Pa11_acd.pdf 5-152
mips: eret instruction used when going back to user-space provides core sync on all
SMP architectures. Probably same for IPI (to be confirmed).
https://www.cs.cornell.edu/courses/cs3410/2008fa/MIPS_Vol2.pdf p. 121
on R3k and TX39XX, rfe is used instead, but those are uniprocessor, so they
do not matter.
http://os161.eecs.harvard.edu/documentation/sys161/mips.html
alpha: an explicit "imb" instruction seems to be required to perform core sync.
Not sure if this is implicit by returning to user-space in any way.
https://www2.cs.arizona.edu/projects/alto/Doc/local/alphahb2.pdf 5-23
sparc: seems to require an explicit "flush" instruction followed by at most 5 instructions
to perform core serialization. Not sure if implied by return to user-space in any
way.
Based on my current understanding, only three architectures would require
a special flag test in switch_mm():
x86, when it implements an iret-free resume to userspace in the future,
alpha: seems to require an explicit "imb" instruction,
sparc: seems to require an explicit "flush" + 5 instructions.
Those three cases would benefit from having an explicit registration of
processes which want to use the private expedited core serializing membarrier,
so we don't slow down unrelated context switching. It's also a good reason for
making the core serializing behavior separate from the basic private expedited
membarrier: some processes may only care about load/store ordering, so
they should not have to take the performance hit of core serialization at context
switch.
It would be appreciated if architecture experts could fill in the missing
architecture-specific details, or correct any misinterpretation of the
documentation on my part.
Thanks,
Mathieu
> Thanks!
> On Mon, Sep 18, 2017 at 10:01 AM, Will Deacon <will.deacon@arm.com> wrote:
>> On Thu, Sep 07, 2017 at 05:03:49PM -0700, Hans Boehm wrote:
>> > > [Mathieu: ]
>> > > Assuming we don't need a sync core before updating the old code, an
>> > > aggressive approach would be:
>> > > reclaim and re-use (aggressive):
>> > > 1- userspace unpublish all reference to old code,
>> > > 2- userspace ensure no thread use the old code anymore (e.g. URCU),
>> > > 3- userspace updates old code -> new code
>> > > 4- issue data cache flush for the modified range (if needed)
>> > > 5- sys_membarrier
>> > > - for each executing threads
>> > > - issue core serializing barrier
>> > > 6- issue instruction cache flush for the modified range (if needed)
>> > > (may be required on all active threads on some architectures)
>> > > 7- userspace publish reference to new code
>> > My assumption was that the right sequence here, at least on Aarch64, is to
>> > do 5 and 6 in the opposite order; flush the icache, which I believe can
>> > be done from the thread that wrote the code, and then issue a sys_membarrier
>> > for the core serializing barrier.
>> > It would be useful to get that clarified.
>> FWIW, Mathieu and I spent a while talking about this during LPC last week
>> and ended up agreeing that the ISB (core serialisation) is required *after*
>> the cache-maintenance to publish the new code has completed.
>> Will
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Peter Zijlstra @ 2017-10-06 21:08 UTC
To: Mathieu Desnoyers
Cc: Hans Boehm, Paul E. McKenney, Boqun Feng, Andrew Hunter,
maged michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Dave Watson, Andy Lutomirski,
Russell King, ARM Linux, Greg Hackmann, Will Deacon, David Sehr,
linux-kernel, linux-arch, ralf
On Fri, Oct 06, 2017 at 08:57:56PM +0000, Mathieu Desnoyers wrote:
> Hi Hans,
>
> I'm currently making sure the
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED command makes its way into
> the 4.14 kernel before the end of the release candidates. Once that
> is done, I plan to post a patch adding a new MEMBARRIER_FLAG_SYNC_CORE
> flag for the 4.15 merge window.
>
> I have done a bit of research on the various architecture requirements
> for core serialization. Here are my findings so far about
> instructions providing core serialization on the main architectures
> supported by Linux.
>
> There are two places where we need it: in the interrupt handler for
> the membarrier IPI, and between scheduler execution (which can change
> the current "mm") and return to user-space.
>
> Please let me know if I missed anything.
>
> x86: iret, cpuid, wbinvd -> iret currently provides core serialization
> when going back to userspace and at the end of the IPI. There are
> plans to implement a return path without iret in the future, in which
> case I would need to issue an explicit "cpuid" instruction
> (sync_core()) in switch_mm() if the process is registered with
> MEMBARRIER_FLAG_SYNC_CORE.
I would much prefer setting a TIF flag that forces the IRET path instead
of doing additional work in switch_mm().
> arm32: returning to user-space provides core serialization. Same at
> the end of membarrier IPI (to be confirmed). aarch64: ERET
> instruction used when returning to user-space provides core sync. Same
> at the end of membarrier IPI (to be confirmed).
I thought Will already confirmed ERET did what we need, no?
> parisc: core serialization is ensured by issuing at least 7
> instructions. We should have at least that when going back to
> user-space (to be confirmed). Similar for IPI.
> [ https://parisc.wiki.kernel.org/images-parisc/6/68/Pa11_acd.pdf |
> https://parisc.wiki.kernel.org/images-parisc/6/68/Pa11_acd.pdf ] 5-152
>
> mips: eret instruction used when going back to user-space provides
> core sync on all SMP architectures. Probably same for IPI (to be
> confirmed).
> [ https://www.cs.cornell.edu/courses/cs3410/2008fa/MIPS_Vol2.pdf |
> https://www.cs.cornell.edu/courses/cs3410/2008fa/MIPS_Vol2.pdf ] p. 121
> on R3k and TX39XX, rfe is used instead, but those are uniprocessor, so
> they do not matter.
> [ http://os161.eecs.harvard.edu/documentation/sys161/mips.html |
> http://os161.eecs.harvard.edu/documentation/sys161/mips.html ]
> sparc: seems to require an explicit "flush" instruction followed by at
> most 5 instructions to perform core serialization. Not sure if implied
> by return to user-space in any way.
We still have the problem that, on virtually indexed archs, we need
to flush the I$ on all CPUs.
Some archs have an instruction for this, others do not (or botched it).
So while some archs have a syscall to effect this, it is an integral
part of the use-case for MEMBAR_SYNC_CORE and I feel we must not gloss
over it.
* Re: [RFC PATCH v3] membarrier: provide core serialization
From: Will Deacon @ 2017-10-09 8:32 UTC
To: Peter Zijlstra
Cc: Mathieu Desnoyers, Hans Boehm, Paul E. McKenney, Boqun Feng,
Andrew Hunter, maged michael, gromer, Avi Kivity,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Dave Watson, Andy Lutomirski, Russell King, ARM Linux,
Greg Hackmann, David Sehr, linux-kernel, linux-arch, ralf
On Fri, Oct 06, 2017 at 11:08:25PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 08:57:56PM +0000, Mathieu Desnoyers wrote:
> > Hi Hans,
> >
> > I'm currently making sure the
> > MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED command makes its way into
> > the 4.14 kernel before the end of the release candidates. Once that
> > is done, I plan to post a patch adding a new MEMBARRIER_FLAG_SYNC_CORE
> > flag for the 4.15 merge window.
> >
> > I have done a bit of research on the various architecture requirements
> > for core serialization. Here are my findings so far about
> > instructions providing core serialization on the main architectures
> > supported by Linux.
> >
> > There are two places where we need it: in the interrupt handler for
> > the membarrier IPI, and between scheduler execution (which can change
> > the current "mm") and return to user-space.
> >
> > Please let me know if I missed anything.
> >
> > x86: iret, cpuid, wbinvd -> iret currently provides core serialization
> > when going back to userspace and at the end of the IPI. There are
> > plans to implement a return path without iret in the future, in which
> > case I would need to issue an explicit "cpuid" instruction
> > (sync_core()) in switch_mm() if the process is registered with
> > MEMBARRIER_FLAG_SYNC_CORE.
>
> I would much prefer setting a TIF flag that forces the IRET path instead
> of doing additional work in switch_mm().
>
> > arm32: returning to user-space provides core serialization. Same at
> > the end of membarrier IPI (to be confirmed). aarch64: ERET
> > instruction used when returning to user-space provides core sync. Same
> > at the end of membarrier IPI (to be confirmed).
>
> I thought Will already confirmed ERET did what we need, no?
Yes.
Will