* [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-04 19:34 Chris Metcalf
  2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf
It has been a couple of months since the v8 version of this patch,
since various other priorities came up at work.  Since it's been
a while I will try to summarize where I think we got to on the 
various issues that were raised with v8.
1. Andy Lutomirski raised the issue of whether it really made sense to
   only attempt to set up the conditions for task isolation, ask the kernel
   nicely for it, and then wait until it happened.  He wondered if a
   SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
   also suggested having an interface that would force everything else
   off a core to enable SCHED_ISOLATED to succeed.  Frederic added
   some concerns about enforcing the test that the process was in a
   good state to enter task isolation.
   I tried to address the different design philosophies for what I called
   the original "polite" mode and the reviewers' suggestions for an
   "aggressive" mode in this email:
   https://lkml.org/lkml/2015/10/26/625
   As I said there, on balance I think the "polite" option is still
   better.  Obviously folks are welcome to disagree and I'm happy to
   continue that conversation (or perhaps I convinced everyone).
2. Andy didn't like the idea of having a "STRICT" mode which
   delivered a signal to a process for violating the contract that it
   will promise to stay out of the kernel.  Gilad Ben Yossef argued that
   it made sense to have a way for the kernel to enforce the requested
   correctness guarantee of never being interrupted.  Andy pointed out
   that we should then really deliver such a signal when the kernel
   delivers an asynchronous interrupt to the core as well.  In particular
   this is a concern for the application-error case of a process that
   calls munmap() on one core while a thread on another core is running
   STRICT, and thus gets an unexpected TLB flush.
   This patch series addresses that concern by including support for
   IRQs, IPIs, and similar asynchronous interrupts to also send the
   STRICT signal to the process.  We don't try to send the signal if
   we are in an NMI, and instead just force a console backtrace like
   you would get in task_isolation_debug mode.
3. Frederic nack'ed my patch for a boot flag to disable the 1Hz
   periodic scheduler tick.
   I'm still hoping he's open to changing his mind about that, but in
   this patch series I have removed that boot flag.
Various other changes have been introduced since v8:
https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@ezchip.com
- Rebased to Linux 4.4-rc5.
- Since nohz_full and isolcpus have been separated back out again in
  4.4, I introduced a new task_isolation=MASK boot argument that sets
  both of them.  The task isolation support now requires that this
  boot flag have been used; it intentionally doesn't work if you've
  just enabled nohz_full and isolcpus separately.  I could be
  convinced that doing it the other way around makes sense, though.
- I folded the two STRICT mode patches together since there didn't
  seem to be much value in having the second patch that just enabled
  having a settable signal.  I also refactored the various routines
  that report on interrupts/exceptions/etc to make it easier to hook
  in from the case where we are interrupted asynchronously.
- For the debug support, I moved most of the functionality into
  kernel/isolation.c and out of kernel/sched/core.c, leaving only a
  small hook to handle mapping a remote cpu to a task struct safely.
  In addition to implementing Andy's suggestion of signalling a task
  when it is interrupted asynchronously, I also added a ratelimit
  hook so we won't spam the console if (for example) a timer interrupt
  runs amok - particularly since when this happens without ratelimit,
  it can end up self-perpetuating the timer interrupt.
- I added a task_isolation_debug_cpumask() helper function to check
  all the cpus in a mask to see if they are being interrupted
  inappropriately.
- I made the check for irq_enter() robust to architectures that
  have already entered user mode context_tracking before calling
  irq_enter() by testing user_mode(get_irq_regs()) instead of
  context_tracking_in_user(), and split out the code to a separate
  inlined function so I could comment it better.
- For arm64, I added a task_isolation_debug_cpumask() hook for
  smp_cross_call(), which I had missed in the earlier versions.
- I generalized the fix for tile to set up a clockevents hook for
  set_state_oneshot_stopped() to also apply to the arm_arch_timer,
  which I realized was showing the same problem.  For both cases,
  this seems to be what Viresh had in mind with commit 8fff52fd509345
  ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").
- For tile, I adopted the arm model of doing user_exit() calls in the
  early assembly code (a new patch in this series).  I also added a
  missing task_isolation_debug hook for tile's IPI and remote cache
  flush code.
Chris Metcalf (12):
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: move user_exit() to early kernel entry sequence
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing
 Documentation/kernel-parameters.txt  |  16 +++
 arch/arm64/include/asm/thread_info.h |  18 ++-
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  12 +-
 arch/arm64/kernel/signal.c           |  35 ++++--
 arch/arm64/kernel/smp.c              |   2 +
 arch/arm64/mm/fault.c                |   4 +
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 +-
 arch/tile/kernel/intvec_32.S         |  51 +++-----
 arch/tile/kernel/intvec_64.S         |  54 +++------
 arch/tile/kernel/process.c           |  83 +++++++------
 arch/tile/kernel/ptrace.c            |  19 +--
 arch/tile/kernel/single_step.c       |   8 +-
 arch/tile/kernel/smp.c               |  26 ++--
 arch/tile/kernel/time.c              |   1 +
 arch/tile/kernel/traps.c             |  13 +-
 arch/tile/kernel/unaligned.c         |  16 ++-
 arch/tile/mm/fault.c                 |   6 +-
 arch/tile/mm/homecache.c             |   2 +
 arch/x86/entry/common.c              |  10 +-
 arch/x86/kernel/traps.c              |   2 +
 arch/x86/mm/fault.c                  |   2 +
 drivers/clocksource/arm_arch_timer.c |   2 +
 include/linux/isolation.h            |  80 +++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/swap.h                 |   1 +
 include/linux/vmstat.h               |   4 +
 include/uapi/linux/prctl.h           |   8 ++
 init/Kconfig                         |  20 ++++
 kernel/Makefile                      |   1 +
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  18 +++
 kernel/signal.c                      |   5 +
 kernel/smp.c                         |   6 +-
 kernel/softirq.c                     |  33 +++++
 kernel/sys.c                         |   9 ++
 mm/swap.c                            |  13 +-
 mm/vmstat.c                          |  24 ++++
 40 files changed, 665 insertions(+), 188 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c
-- 
2.1.2
^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH v9 04/13] task_isolation: add initial support
  2016-01-04 19:34 [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
@ 2016-01-04 19:34 ` Chris Metcalf
  2016-01-19 15:42   ` Frederic Weisbecker
  2016-01-04 19:34 ` [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
       [not found] ` <1451936091-29247-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.
However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.
This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
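As an illustration only (not part of this patch), a userspace caller
might wrap the new prctl() roughly as follows.  The helper names are
hypothetical, and the constants are copied from this series because
they are not in mainline <linux/prctl.h>.  The sketch assumes the task
has already been affinitized to a single task_isolation cpu; on a
kernel without this series the call is expected to fail with EINVAL.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values proposed by this series; assumed here, not in mainline headers. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_GET_TASK_ISOLATION		49
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

/*
 * Hypothetical helper: request task isolation for the calling task.
 * Returns 0 on success or a negative errno value on failure.
 */
static int task_isolation_enable(void)
{
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: read back the current flags word.
 * Returns the flags (>= 0) or a negative errno value on failure.
 */
static int task_isolation_query(void)
{
	int flags = prctl(PR_GET_TASK_ISOLATION, 0, 0, 0, 0);

	return flags < 0 ? -errno : flags;
}
```

A caller would typically pin itself first (for example with
sched_setaffinity()), call task_isolation_enable(), and treat any
negative return as "isolation not available on this cpu or kernel".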
The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flags, to the value
passed by prctl().  When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future. 
The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent.  It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state.  In particular, if it sees that the scheduler
tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify
the scheduler to attempt to schedule a different task.
Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).
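The interaction between the two routines can be pictured with a small
userspace model.  This is only a sketch under stated assumptions, not
the kernel code: struct cpu_state and the model_*() functions are
hypothetical stand-ins, and it assumes that one pass through the
scheduler is enough for the tick to stop.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-cpu state the kernel would inspect. */
struct cpu_state {
	bool lru_pending;	/* per-cpu LRU cache still holds pages */
	bool vmstat_pending;	/* vmstat differentials not yet folded */
	bool tick_running;	/* scheduler tick has not been stopped */
	bool need_resched;	/* models TIF_NEED_RESCHED */
};

/* Models task_isolation_ready(): inspect state, maybe ask for a resched. */
static bool model_ready(struct cpu_state *s)
{
	if (s->lru_pending || s->vmstat_pending)
		return false;
	if (s->tick_running) {
		s->need_resched = true;
		return false;
	}
	return true;
}

/* Models task_isolation_enter(): quiesce what can be quiesced now. */
static void model_enter(struct cpu_state *s)
{
	s->lru_pending = false;
	s->vmstat_pending = false;
}

/* Models schedule(); assumes the tick can stop once other work has run. */
static void model_schedule(struct cpu_state *s)
{
	s->need_resched = false;
	s->tick_running = false;
}

/* Models the exit-to-usermode loop; returns the number of passes taken. */
static int model_exit_to_usermode(struct cpu_state *s)
{
	int passes = 0;

	for (;;) {
		passes++;
		if (s->need_resched)
			model_schedule(s);
		model_enter(s);
		if (model_ready(s))
			return passes;
	}
}
```

In this model a cpu with pending LRU, vmstat and tick work needs two
passes before the task may return to userspace, while an already quiet
cpu returns after one.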
As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.
Separate patches that follow provide these changes for x86, arm64,
and tile.
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |   8 +++
 include/linux/isolation.h           |  50 +++++++++++++++++
 include/linux/sched.h               |   3 ++
 include/uapi/linux/prctl.h          |   5 ++
 init/Kconfig                        |  20 +++++++
 kernel/Makefile                     |   1 +
 kernel/isolation.c                  | 105 ++++++++++++++++++++++++++++++++++++
 kernel/sys.c                        |   9 ++++
 8 files changed, 201 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d18fc8..e035679e646e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3665,6 +3665,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, set
+			the list of cpus on which tasks will be able
+			to use prctl(PR_SET_TASK_ISOLATION) to set up task
+			isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..ed1bfc793c5a
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,50 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return tick_nohz_full_enabled() &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+static inline bool task_isolation_enabled(void)
+{
+	return task_isolation_possible(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+	return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+	if (task_isolation_enabled())
+		_task_isolation_enter();
+}
+
+#else
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a43edea..d439ee4f2ce2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1812,6 +1812,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2c0d20..fb0c707e527f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for userspace tasks that are running hard real-time
+	 tasks in userspace, such as a 10 Gbit network driver in userspace.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..68a9f7457bc0
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,105 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
+	cpumask_copy(cpu_isolated_map, task_isolation_map);
+
+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
+	tick_nohz_full_running = true;
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core or we will
+ * return EINVAL.  Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+	    !task_isolation_possible(smp_processor_id()))
+		return -EINVAL;
+
+	current->task_isolation_flags = flags;
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  To handle
+ * the periodic scheduler tick, we test to make sure that the tick is
+ * stopped, and if it isn't yet, we request a reschedule so that if
+ * another task needs to run to completion first, it can do so.
+ * Similarly, if any other subsystems require quiescing, we will need
+ * to do that before we return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	/* If we need to drain the LRU cache, we're not ready. */
+	if (lru_add_drain_needed(smp_processor_id()))
+		return false;
+
+	/* If vmstats need updating, we're not ready. */
+	if (!vmstat_idle())
+		return false;
+
+	/* Request rescheduling unless we are in full dynticks mode. */
+	if (!tick_nohz_tick_stopped()) {
+		set_tsk_need_resched(current);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 6af9212ab5aa..7c97227dfb39 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2
^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2016-01-04 19:34 [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
  2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-01-04 19:34 ` Chris Metcalf
       [not found] ` <1451936091-29247-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
  2 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.
By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.
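As a sketch of how the signal number travels in the flags word (the
macros are copied from this patch; the two helper functions are
hypothetical and exist only for illustration):

```c
#include <signal.h>

/* Bit layout as proposed in this patch. */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_STRICT	(1 << 1)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
#define PR_TASK_ISOLATION_GET_SIG(bits)	(((bits) >> 8) & 0x7f)

/* Hypothetical helper: build the prctl() argument for strict mode. */
static unsigned int strict_flags(int sig)
{
	return PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	       PR_TASK_ISOLATION_SET_SIG(sig);
}

/* Hypothetical helper: signal implied by a flags word (0 means SIGKILL). */
static int strict_signal(unsigned int flags)
{
	int sig = PR_TASK_ISOLATION_GET_SIG(flags);

	return sig ? sig : SIGKILL;
}
```

The signal number occupies bits 8-14, so it does not collide with the
_ENABLE and _STRICT bits, and a zero field selects the SIGKILL default.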
To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
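The exemption rule can be modelled in userspace as below.  This is a
hedged sketch: strict_syscall_is_violation() is a hypothetical name, and
it uses the SYS_* constants from <sys/syscall.h> rather than the
kernel's __NR_* names.

```c
#include <stdbool.h>
#include <sys/syscall.h>

/*
 * Hypothetical model of the strict-mode check: returns true if issuing
 * syscall number 'nr' would count as a violation, false if it is one of
 * the exempt calls (prctl, exit, exit_group).
 */
static bool strict_syscall_is_violation(long nr)
{
	switch (nr) {
	case SYS_prctl:
	case SYS_exit:
	case SYS_exit_group:
		return false;
	default:
		return true;
	}
}
```

Under this rule a strict task can still call prctl() to leave the mode
or exit cleanly, while any other call such as write() is treated as a
violation.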
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/isolation.h  | 25 +++++++++++++++++++
 include/uapi/linux/prctl.h |  3 +++
 kernel/isolation.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index ed1bfc793c5a..69a3e4c59ab3 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -40,11 +40,36 @@ static inline void task_isolation_enter(void)
 		_task_isolation_enter();
 }
 
+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+	return (task_isolation_possible(smp_processor_id()) &&
+		(current->task_isolation_flags &
+		 (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+		(PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT));
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+	return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...)			\
+	do {								\
+		if (task_isolation_strict())				\
+			task_isolation_exception(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
 #else
 static inline bool task_isolation_possible(int cpu) { return false; }
 static inline bool task_isolation_enabled(void) { return false; }
 static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 68a9f7457bc0..29ffb21ada0b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -103,3 +104,62 @@ void _task_isolation_enter(void)
 	/* Quieten the vmstat worker so it won't interrupt us. */
 	quiet_vmstat();
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+	siginfo_t info = {};
+	int sig;
+
+	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+		task->comm, task->pid, buf);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+
+	/*
+	 * Turn off task isolation mode entirely to avoid spamming
+	 * the process with signals.  It can re-enable task isolation
+	 * mode in the signal handler if it wants to.
+	 */
+	task->task_isolation_flags = 0;
+	info.si_signo = sig;
+	send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_interrupt(current, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return false;
+	}
+
+	task_isolation_exception("syscall %d", syscall);
+	return true;
+}
-- 
2.1.2
^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
       [not found] ` <1451936091-29247-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
@ 2016-01-11 21:15   ` Chris Metcalf
  2016-01-12 10:07     ` Will Deacon
       [not found]     ` <56941B86.9090009-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 29+ messages in thread
From: Chris Metcalf @ 2016-01-11 21:15 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
Ping!  There has been no substantive feedback to this version of
the patch in the week since I posted it, which optimistically suggests
to me that people may be satisfied with it.  If that's true, Frederic,
I assume this would be pulled into your tree?
I have slightly updated the v9 patch series since this posting:
- Incorporated a fix to initialize task_isolation_map early if no
   task_isolation= boot argument was given, to avoid crashing on
   CPUMASK_OFFSTACK platforms.
- Incorporated Mark Rutland's changes to convert arm64
   assembly to C code instead of using my own version.
The updated patch series is available in the branch at
git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
I will post a v10 with those couple of small changes if I don't hear
any other feedback, or of course feel free to pull from the git repo.
On 01/04/2016 02:34 PM, Chris Metcalf wrote:
> It has been a couple of months since the v8 version of this patch,
> since various other priorities came up at work.  Since it's been
> a while I will try to summarize where I think we got to on the
> various issues that were raised with v8.
>
> 1. Andy Lutomirski raised the issue of whether it really made sense to
>     only attempt to set up the conditions for task isolation, ask the kernel
>     nicely for it, and then wait until it happened.  He wondered if a
>     SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
>     also suggested having an interface that would force everything else
>     off a core to enable SCHED_ISOLATED to succeed.  Frederic added
>     some concerns about enforcing the test that the process was in a
>     good state to enter task isolation.
>
>     I tried to address the different design philosophies for what I called
>     the original "polite" mode and the reviewers' suggestions for an
>     "aggressive" mode in this email:
>
>     https://lkml.org/lkml/2015/10/26/625
>
>     As I said there, on balance I think the "polite" option is still
>     better.  Obviously folks are welcome to disagree and I'm happy to
>     continue that conversation (or perhaps I convinced everyone).
>
> 2. Andy didn't like the idea of having a "STRICT" mode which
>     delivered a signal to a process for violating the contract that it
>     will promise to stay out of the kernel.  Gilad Ben Yossef argued that
>     it made sense to have a way for the kernel to enforce the requested
>     correctness guarantee of never being interrupted.  Andy pointed out
>     that we should then really deliver such a signal when the kernel
>     delivers an asynchronous interrupt to the core as well.  In particular
>     this is a concern for the application-error case of a process that
>     calls munmap() on one core while a thread on another core is running
>     STRICT, and thus gets an unexpected TLB flush.
>
>     This patch series addresses that concern by including support for
>     IRQs, IPIs, and similar asynchronous interrupts to also send the
>     STRICT signal to the process.  We don't try to send the signal if
>     we are in an NMI, and instead just force a console backtrace like
>     you would get in task_isolation_debug mode.
>
> 3. Frederic nack'ed my patch for a boot flag to disable the 1Hz
>     periodic scheduler tick.
>
>     I'm still hoping he's open to changing his mind about that, but in
>     this patch series I have removed that boot flag.
>
> Various other changes have been introduced since v8:
>
> https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@ezchip.com
>
> - Rebased to Linux 4.4-rc5.
>
> - Since nohz_full and isolcpus have been separated back out again in
>    4.4, I introduced a new task_isolation=MASK boot argument that sets
>    both of them.  The task isolation support now requires that this
>    boot flag have been used; it intentionally doesn't work if you've
>    just enabled nohz_full and isolcpus separately.  I could be
>    convinced that doing it the other way around makes sense, though.
>
> - I folded the two STRICT mode patches together since there didn't
>    seem to be much value in having the second patch that just enabled
>    having a settable signal.  I also refactored the various routines
>    that report on interrupts/exceptions/etc to make it easier to hook
>    in from the case where we are interrupted asynchronously.
>
> - For the debug support, I moved most of the functionality into
>    kernel/isolation.c and out of kernel/sched/core.c, leaving only a
>    small hook to handle mapping a remote cpu to a task struct safely.
>    In addition to implementing Andy's suggestion of signalling a task
>    when it is interrupted asynchronously, I also added a ratelimit
>    hook so we won't spam the console if (for example) a timer interrupt
>    runs amok - particularly since when this happens without ratelimit,
>    it can end up self-perpetuating the timer interrupt.
>
> - I added a task_isolation_debug_cpumask() helper function to check
>    all the cpus in a mask to see if they are being interrupted
>    inappropriately.
>
> - I made the check for irq_enter() robust to architectures that
>    have already entered user mode context_tracking before calling
>    irq_enter() by testing user_mode(get_irq_regs()) instead of
>    context_tracking_in_user(), and split out the code to a separate
>    inlined function so I could comment it better.
>
> - For arm64, I added a task_isolation_debug_cpumask() hook for
>    smp_cross_call(), which I had missed in the earlier versions.
>
> - I generalized the fix for tile to set up a clockevents hook for
>    set_state_oneshot_stopped() to also apply to the arm_arch_timer,
>    which I realized was showing the same problem.  For both cases,
>    this seems to be what Viresh had in mind with commit 8fff52fd509345
>    ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").
>
> - For tile, I adopted the arm model of doing user_exit() calls in the
>    early assembly code (a new patch in this series).  I also added a
>    missing task_isolation_debug hook for tile's IPI and remote cache
>    flush code.
>
> Chris Metcalf (12):
>    vmstat: add vmstat_idle function
>    lru_add_drain_all: factor out lru_add_drain_needed
>    task_isolation: add initial support
>    task_isolation: support PR_TASK_ISOLATION_STRICT mode
>    task_isolation: add debug boot flag
>    arch/x86: enable task isolation functionality
>    arch/arm64: adopt prepare_exit_to_usermode() model from x86
>    arch/arm64: enable task isolation functionality
>    arch/tile: adopt prepare_exit_to_usermode() model from x86
>    arch/tile: move user_exit() to early kernel entry sequence
>    arch/tile: enable task isolation functionality
>    arm, tile: turn off timer tick for oneshot_stopped state
>
> Christoph Lameter (1):
>    vmstat: provide a function to quiet down the diff processing
>
>   Documentation/kernel-parameters.txt  |  16 +++
>   arch/arm64/include/asm/thread_info.h |  18 ++-
>   arch/arm64/kernel/entry.S            |   6 +-
>   arch/arm64/kernel/ptrace.c           |  12 +-
>   arch/arm64/kernel/signal.c           |  35 ++++--
>   arch/arm64/kernel/smp.c              |   2 +
>   arch/arm64/mm/fault.c                |   4 +
>   arch/tile/include/asm/processor.h    |   2 +-
>   arch/tile/include/asm/thread_info.h  |   8 +-
>   arch/tile/kernel/intvec_32.S         |  51 +++-----
>   arch/tile/kernel/intvec_64.S         |  54 +++------
>   arch/tile/kernel/process.c           |  83 +++++++------
>   arch/tile/kernel/ptrace.c            |  19 +--
>   arch/tile/kernel/single_step.c       |   8 +-
>   arch/tile/kernel/smp.c               |  26 ++--
>   arch/tile/kernel/time.c              |   1 +
>   arch/tile/kernel/traps.c             |  13 +-
>   arch/tile/kernel/unaligned.c         |  16 ++-
>   arch/tile/mm/fault.c                 |   6 +-
>   arch/tile/mm/homecache.c             |   2 +
>   arch/x86/entry/common.c              |  10 +-
>   arch/x86/kernel/traps.c              |   2 +
>   arch/x86/mm/fault.c                  |   2 +
>   drivers/clocksource/arm_arch_timer.c |   2 +
>   include/linux/isolation.h            |  80 +++++++++++++
>   include/linux/sched.h                |   3 +
>   include/linux/swap.h                 |   1 +
>   include/linux/vmstat.h               |   4 +
>   include/uapi/linux/prctl.h           |   8 ++
>   init/Kconfig                         |  20 ++++
>   kernel/Makefile                      |   1 +
>   kernel/irq_work.c                    |   5 +-
>   kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
>   kernel/sched/core.c                  |  18 +++
>   kernel/signal.c                      |   5 +
>   kernel/smp.c                         |   6 +-
>   kernel/softirq.c                     |  33 +++++
>   kernel/sys.c                         |   9 ++
>   mm/swap.c                            |  13 +-
>   mm/vmstat.c                          |  24 ++++
>   40 files changed, 665 insertions(+), 188 deletions(-)
>   create mode 100644 include/linux/isolation.h
>   create mode 100644 kernel/isolation.c
>
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-11 21:15   ` [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
@ 2016-01-12 10:07     ` Will Deacon
       [not found]       ` <20160112100708.GA15737-5wv7dgnIgG8@public.gmane.org>
       [not found]     ` <56941B86.9090009-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 29+ messages in thread
From: Will Deacon @ 2016-01-12 10:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel
On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> Ping!  There has been no substantive feedback to this version of
> the patch in the week since I posted it, which optimistically suggests
> to me that people may be satisfied with it.  If that's true, Frederic,
> I assume this would be pulled into your tree?
> 
> I have slightly updated the v9 patch series since this posting:
> 
> - Incorporated a fix to initialize cpu_isolation_mask early if no
>   cpu_isolation= boot argument was given, to avoid crashing on
>   CPUMASK_OFFSTACK platforms.
> 
> - Incorporated Mark Rutland's changes to convert arm64
>   assembly to C code instead of using my own version.
Please avoid queuing these patches -- the first is already in the arm64
queue for 4.5 and the second was found to introduce a substantial
performance regression on the syscall entry/exit path. I think Mark had
an updated version to address that, so it would be easier not to have
an old version sitting in some other queue!
Cheers,
Will
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
       [not found]     ` <56941B86.9090009-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
@ 2016-01-12 10:53       ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2016-01-12 10:53 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel
* Chris Metcalf <cmetcalf@ezchip.com> wrote:
> Ping!  There has been no substantive feedback to this version of
> the patch in the week since I posted it, which optimistically suggests
> to me that people may be satisfied with it.  If that's true, Frederic,
> I assume this would be pulled into your tree?
We are right before (and into) the merge window; don't expect substantial feedback
in those timeframes, as most kernel maintainers are very busy.
Thanks,
	Ingo
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
       [not found]       ` <20160112100708.GA15737-5wv7dgnIgG8@public.gmane.org>
@ 2016-01-12 17:49         ` Chris Metcalf
       [not found]           ` <56953CBA.9090208-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-01-12 17:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel, Mark Rutland
(Adding Mark to cc's)
On 01/12/2016 05:07 AM, Will Deacon wrote:
> On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
>> Ping!  There has been no substantive feedback to this version of
>> the patch in the week since I posted it, which optimistically suggests
>> to me that people may be satisfied with it.  If that's true, Frederic,
>> I assume this would be pulled into your tree?
>>
>> I have slightly updated the v9 patch series since this posting:
>>
>> [...]
>>
>> - Incorporated Mark Rutland's changes to convert arm64
>>    assembly to C code instead of using my own version.
> Please avoid queuing these patches -- the first is already in the arm64
> queue for 4.5 and the second was found to introduce a substantial
> performance regression on the syscall entry/exit path. I think Mark had
> an updated version to address that, so it would be easier not to have
> an old version sitting in some other queue!
I am not formally queueing them anywhere (like linux-next), though
now that you mention it, that's a pretty good idea - I'll talk to Steven
about that, assuming this merge window closes without the task
isolation stuff going in.
In the arch/tile code, we load the thread_info_flags and test them
against a bitmask before we call into C code, to avoid the various
overheads involved in the C path.  Perhaps that same strategy is all
that's needed for the arm64 code?  Hopefully you can get that
code merged up during the 4.5 window so I can use it as the new
baseline for the task isolation stuff.
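
To make the suggestion concrete, here is a rough user-space model of that
strategy, written in C rather than assembly.  The flag values, WORK_MASK,
TIF_TASK_ISOLATION and the slow_path() stand-in are all made up for
illustration; they are not taken from the tile or arm64 sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; real TIF_* values differ per architecture. */
#define TIF_SIGPENDING      (1u << 0)
#define TIF_NEED_RESCHED    (1u << 1)
#define TIF_NOTIFY_RESUME   (1u << 2)
#define TIF_TASK_ISOLATION  (1u << 3)   /* hypothetical name and value */

/* Any of these bits forces the slow path through C. */
#define WORK_MASK (TIF_SIGPENDING | TIF_NEED_RESCHED | \
		   TIF_NOTIFY_RESUME | TIF_TASK_ISOLATION)

static unsigned int slow_path_calls;

/* Stand-in for the C work loop (something like do_notify_resume()). */
static void slow_path(uint32_t flags)
{
	(void)flags;
	slow_path_calls++;
}

/* Models the assembly-side test: one load, one AND, one branch. */
static bool return_to_user(uint32_t thread_flags)
{
	if ((thread_flags & WORK_MASK) == 0)
		return false;	/* fast path: no call into C */
	slow_path(thread_flags);
	return true;
}
```

The point is only that the common case (no work bits set) never pays for
the call into C; whether that recovers the lost syscall performance on
arm64 would still need to be measured.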
Thanks!
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
       [not found]           ` <56953CBA.9090208-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
@ 2016-01-13 10:44             ` Ingo Molnar
  2016-01-13 21:19               ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2016-01-13 10:44 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Will Deacon, Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel, Mark Rutland
* Chris Metcalf <cmetcalf@ezchip.com> wrote:
> (Adding Mark to cc's)
> 
> On 01/12/2016 05:07 AM, Will Deacon wrote:
> >On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> >>Ping!  There has been no substantive feedback to this version of
> >>the patch in the week since I posted it, which optimistically suggests
> >>to me that people may be satisfied with it.  If that's true, Frederic,
> >>I assume this would be pulled into your tree?
> >>
> >>I have slightly updated the v9 patch series since this posting:
> >>
> >>[...]
> >>
> >>- Incorporated Mark Rutland's changes to convert arm64
> >>   assembly to C code instead of using my own version.
> >Please avoid queuing these patches -- the first is already in the arm64
> >queue for 4.5 and the second was found to introduce a substantial
> >performance regression on the syscall entry/exit path. I think Mark had
> >an updated version to address that, so it would be easier not to have
> >an old version sitting in some other queue!
> 
> I am not formally queueing them anywhere (like linux-next), though
> now that you mention it, that's a pretty good idea - I'll talk to Steven
> about that, assuming this merge window closes without the task
> isolation stuff going in.
NAK. Given the controversy, no way should this stuff go outside the primary trees 
it affects: the scheduler, timer, irq, etc. trees.
We can merge this up in -tip once everyone is happy... but as I said, don't expect 
many replies before and during the merge window.
Thanks,
	Ingo
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-13 10:44             ` Ingo Molnar
@ 2016-01-13 21:19               ` Chris Metcalf
  2016-01-20 13:27                 ` Mark Rutland
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-01-13 21:19 UTC (permalink / raw)
  To: Ingo Molnar, Mark Rutland
  Cc: Will Deacon, Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel
On 01/13/2016 05:44 AM, Ingo Molnar wrote:
> * Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>> (Adding Mark to cc's)
>>
>> On 01/12/2016 05:07 AM, Will Deacon wrote:
>>> On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
>>>> Ping!  There has been no substantive feedback to this version of
>>>> the patch in the week since I posted it, which optimistically suggests
>>>> to me that people may be satisfied with it.  If that's true, Frederic,
>>>> I assume this would be pulled into your tree?
>>>>
>>>> I have slightly updated the v9 patch series since this posting:
>>>>
>>>> [...]
>>>>
>>>> - Incorporated Mark Rutland's changes to convert arm64
>>>>    assembly to C code instead of using my own version.
>>> Please avoid queuing these patches -- the first is already in the arm64
>>> queue for 4.5 and the second was found to introduce a substantial
>>> performance regression on the syscall entry/exit path. I think Mark had
>>> an updated version to address that, so it would be easier not to have
>>> an old version sitting in some other queue!
>> I am not formally queueing them anywhere (like linux-next), though
>> now that you mention it, that's a pretty good idea - I'll talk to Steven
>> about that, assuming this merge window closes without the task
>> isolation stuff going in.
> NAK. Given the controversy, no way should this stuff go outside the primary trees
> it affects: the scheduler, timer, irq, etc. trees.
Fair enough.  I'll plan to do v10 once the merge window closes.
Mark, let me know when/if you get a new version of the de-asm stuff
for do_notify_resume() - thanks.  Or, would it be helpful if I worked up
the option I suggested, where we check the thread_info flags in the
assembly code before calling out to the new loop in do_notify_resume()?
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-01-19 15:42   ` Frederic Weisbecker
  2016-01-19 20:45     ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 15:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Mon, Jan 04, 2016 at 02:34:42PM -0500, Chris Metcalf wrote:
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..68a9f7457bc0
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,105 @@
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation for task isolation.
> + *
> + *  Distributed under GPLv2.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include "time/tick-sched.h"
> +
> +cpumask_var_t task_isolation_map;
> +
> +/*
> + * Isolation requires both nohz and isolcpus support from the scheduler.
> + * We provide a boot flag that enables both for now, and which we can
> + * add other functionality to over time if needed.  Note that just
> + * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> + */
> +static int __init task_isolation_setup(char *str)
> +{
> +	alloc_bootmem_cpumask_var(&task_isolation_map);
> +	if (cpulist_parse(str, task_isolation_map) < 0) {
> +		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> +		return 1;
> +	}
> +
> +	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> +	cpumask_copy(cpu_isolated_map, task_isolation_map);
> +
> +	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> +	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> +	tick_nohz_full_running = true;
How about calling tick_nohz_full_setup() instead? I'd rather prefer
that nohz full implementation details stay in tick-sched.c
Also what happens if nohz_full= is given as well as task_isolation= ?
Don't we risk a memory leak and maybe breaking the fact that
((nohz_full & task_isolation) == task_isolation), which is really a requirement?
> +
> +	return 1;
> +}
> +__setup("task_isolation=", task_isolation_setup);
> +
> +/*
> + * This routine controls whether we can enable task-isolation mode.
> + * The task must be affinitized to a single task_isolation core or we will
> + * return EINVAL.  Although the application could later re-affinitize
> + * to a housekeeping core and lose task isolation semantics, this
> + * initial test should catch 99% of bugs with task placement prior to
> + * enabling task isolation.
> + */
> +int task_isolation_set(unsigned int flags)
> +{
> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> +	    !task_isolation_possible(smp_processor_id()))
> +		return -EINVAL;
> +
> +	current->task_isolation_flags = flags;
> +	return 0;
> +}
What if we concurrently change the task's affinity? Also it seems that preemption
isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
seen warnings with smp_processor_id().
Also we should protect against task's affinity change when task_isolation_flags
is set.
> +
> +/*
> + * In task isolation mode we try to return to userspace only after
> + * attempting to make sure we won't be interrupted again.  To handle
> + * the periodic scheduler tick, we test to make sure that the tick is
> + * stopped, and if it isn't yet, we request a reschedule so that if
> + * another task needs to run to completion first, it can do so.
> + * Similarly, if any other subsystems require quiescing, we will need
> + * to do that before we return to userspace.
> + */
> +bool _task_isolation_ready(void)
> +{
> +	WARN_ON_ONCE(!irqs_disabled());
> +
> +	/* If we need to drain the LRU cache, we're not ready. */
> +	if (lru_add_drain_needed(smp_processor_id()))
> +		return false;
> +
> +	/* If vmstats need updating, we're not ready. */
> +	if (!vmstat_idle())
> +		return false;
> +
> +	/* Request rescheduling unless we are in full dynticks mode. */
> +	if (!tick_nohz_tick_stopped()) {
> +		set_tsk_need_resched(current);
I'm not sure doing this will help get the tick stopped.
> +		return false;
> +	}
> +
> +	return true;
> +}
Thanks!
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-19 15:42   ` Frederic Weisbecker
@ 2016-01-19 20:45     ` Chris Metcalf
  2016-01-28  0:28       ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-01-19 20:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> On Mon, Jan 04, 2016 at 02:34:42PM -0500, Chris Metcalf wrote:
>> diff --git a/kernel/isolation.c b/kernel/isolation.c
>> new file mode 100644
>> index 000000000000..68a9f7457bc0
>> --- /dev/null
>> +++ b/kernel/isolation.c
>> @@ -0,0 +1,105 @@
>> +/*
>> + *  linux/kernel/isolation.c
>> + *
>> + *  Implementation for task isolation.
>> + *
>> + *  Distributed under GPLv2.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/swap.h>
>> +#include <linux/vmstat.h>
>> +#include <linux/isolation.h>
>> +#include <linux/syscalls.h>
>> +#include "time/tick-sched.h"
>> +
>> +cpumask_var_t task_isolation_map;
>> +
>> +/*
>> + * Isolation requires both nohz and isolcpus support from the scheduler.
>> + * We provide a boot flag that enables both for now, and which we can
>> + * add other functionality to over time if needed.  Note that just
>> + * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
>> + */
>> +static int __init task_isolation_setup(char *str)
>> +{
>> +	alloc_bootmem_cpumask_var(&task_isolation_map);
>> +	if (cpulist_parse(str, task_isolation_map) < 0) {
>> +		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
>> +		return 1;
>> +	}
>> +
>> +	alloc_bootmem_cpumask_var(&cpu_isolated_map);
>> +	cpumask_copy(cpu_isolated_map, task_isolation_map);
>> +
>> +	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
>> +	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
>> +	tick_nohz_full_running = true;
> How about calling tick_nohz_full_setup() instead? I'd rather prefer
> that nohz full implementation details stay in tick-sched.c
>
> Also what happens if nohz_full= is given as well as task_isolation= ?
> Don't we risk a memory leak and maybe breaking the fact that
> (nohz_full & task_isolation != task_isolation) which is really a requirement?
Yeah, this is a good point.  I'm not sure what the best way is to make
this happen.  It's already true that we will leak memory if you
specify "nohz_full=" more than once on the command line, but it's
awkward to fix (assuming we want the last value to win) so maybe
we can just ignore this problem - it's a pretty small amount of memory
after all.  If so, then making tick_nohz_full_setup() and
isolated_cpu_setup() both non-static and calling them from
task_isolation_setup() might be the cleanest approach.  What do you think?
You asked what happens if nohz_full= is given as well, which is a very
good question.  Perhaps the right answer is to have an early_initcall
that suppresses task isolation on any cores that lost their nohz_full
or isolcpus status due to later boot command line arguments (and
generate a console warning, obviously).
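
As a rough sketch of what such an initcall would compute, here is a
user-space model that uses a plain 64-bit word in place of cpumask_var_t.
The function name and the warning text are invented for illustration and
do not exist in the patch series.

```c
#include <stdint.h>
#include <stdio.h>

/* One bit per CPU; a stand-in for the kernel's cpumask_var_t. */
typedef uint64_t cpumask_bits;

/*
 * Return the CPUs that may keep task isolation: those still present in
 * both the nohz_full and isolcpus masks.  Warn about any that were dropped.
 */
static cpumask_bits sanitize_task_isolation(cpumask_bits task_isolation,
					    cpumask_bits nohz_full,
					    cpumask_bits isolcpus)
{
	cpumask_bits keep = task_isolation & nohz_full & isolcpus;
	cpumask_bits lost = task_isolation & ~keep;

	if (lost)
		fprintf(stderr, "task_isolation: disabling on cpu mask %#llx\n",
			(unsigned long long)lost);
	return keep;
}
```

For example, with task_isolation=1-3 but a later nohz_full=2-3, cpu 1
would be dropped from the task isolation mask and reported on the console.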
>> +
>> +	return 1;
>> +}
>> +__setup("task_isolation=", task_isolation_setup);
>> +
>> +/*
>> + * This routine controls whether we can enable task-isolation mode.
>> + * The task must be affinitized to a single task_isolation core or we will
>> + * return EINVAL.  Although the application could later re-affinitize
>> + * to a housekeeping core and lose task isolation semantics, this
>> + * initial test should catch 99% of bugs with task placement prior to
>> + * enabling task isolation.
>> + */
>> +int task_isolation_set(unsigned int flags)
>> +{
>> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
>> +	    !task_isolation_possible(smp_processor_id()))
>> +		return -EINVAL;
>> +
>> +	current->task_isolation_flags = flags;
>> +	return 0;
>> +}
> What if we concurrently change the task's affinity? Also it seems that preemption
> isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
> seen warnings with smp_processor_id().
>
> Also we should protect against task's affinity change when task_isolation_flags
> is set.
I talked about this a bit when you raised it for the v8 patch series:
   http://lkml.kernel.org/r/562FA8FD.8080502@ezchip.com
I'd be curious to hear your take on the arguments I made there.
You're absolutely right about the preemption warnings, which I only fixed
a few days ago.  In this case I use raw_smp_processor_id() since with a
fixed single-core cpu affinity, we're not going anywhere, so the warning
from smp_processor_id() would be bogus.  And although the warning is
technically still correct (we could race with another task resetting the
affinity of this one), that race is in any case equivalent to having the
other task reset the affinity on return from the prctl(), which I've
already claimed isn't an interesting use case to try to handle.  But let
me know what you think!
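
For what it's worth, one alternative (not what the patch does) would be
to avoid asking for the current cpu at all: with a single-cpu affinity the
allowed mask already names the cpu.  A minimal user-space model, assuming
a 64-bit word stands in for the cpumask and with hypothetical helper names:

```c
#include <errno.h>
#include <stdint.h>

/* One bit per CPU; a stand-in for the kernel's cpumask_var_t. */
typedef uint64_t cpumask_bits;

/* Number of set bits, i.e. the equivalent of cpumask_weight(). */
static int mask_weight(cpumask_bits m)
{
	int n = 0;

	for (; m; m &= m - 1)
		n++;
	return n;
}

/*
 * Model of the check in task_isolation_set(): the task must be pinned to
 * exactly one CPU, and that CPU must be in the task-isolation mask.
 * The allowed mask itself identifies the CPU, so no processor-id helper
 * is needed.
 */
static int isolation_check(cpumask_bits allowed, cpumask_bits isolation_map)
{
	if (mask_weight(allowed) != 1 || (allowed & isolation_map) == 0)
		return -EINVAL;
	return 0;
}
```

This does nothing about a later affinity change, of course; it only
removes the dependency on which cpu we happen to be running on during the
check.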
>> +
>> +/*
>> + * In task isolation mode we try to return to userspace only after
>> + * attempting to make sure we won't be interrupted again.  To handle
>> + * the periodic scheduler tick, we test to make sure that the tick is
>> + * stopped, and if it isn't yet, we request a reschedule so that if
>> + * another task needs to run to completion first, it can do so.
>> + * Similarly, if any other subsystems require quiescing, we will need
>> + * to do that before we return to userspace.
>> + */
>> +bool _task_isolation_ready(void)
>> +{
>> +	WARN_ON_ONCE(!irqs_disabled());
>> +
>> +	/* If we need to drain the LRU cache, we're not ready. */
>> +	if (lru_add_drain_needed(smp_processor_id()))
>> +		return false;
>> +
>> +	/* If vmstats need updating, we're not ready. */
>> +	if (!vmstat_idle())
>> +		return false;
>> +
>> +	/* Request rescheduling unless we are in full dynticks mode. */
>> +	if (!tick_nohz_tick_stopped()) {
>> +		set_tsk_need_resched(current);
> I'm not sure doing this will help getting the tick to get stopped.
Well, I don't know that there is anything else we CAN do, right?  If there's
another task that can run, great - it may be that that's why full dynticks
isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
there's nothing else we can do, in which case we basically spend our time
going around through the scheduler code and back out to the
task_isolation_ready() test, but again, there's really nothing more
useful we can be doing at this point.  Once the RCU tick fires (or whatever
it was that was preventing full dynticks from engaging), we will pass this
test and return to user space.
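
The loop described above can be modeled in user space roughly as follows.
Everything here is a simulation: the struct, fake_schedule() and the idea
of counting "pending ticks" are illustrative stand-ins, not kernel code.

```c
#include <stdbool.h>

/* Simulated per-CPU state; all names here are illustrative only. */
struct cpu_state {
	int  rcu_ticks_pending;	/* ticks still needed before the tick can stop */
	bool need_resched;
};

static bool tick_stopped(const struct cpu_state *s)
{
	return s->rcu_ticks_pending == 0;
}

/* Models _task_isolation_ready(): not ready until the tick has stopped. */
static bool isolation_ready(struct cpu_state *s)
{
	if (!tick_stopped(s)) {
		s->need_resched = true;
		return false;
	}
	return true;
}

/* Models schedule(): nothing else to run, but time passes and a tick fires. */
static void fake_schedule(struct cpu_state *s)
{
	s->need_resched = false;
	if (s->rcu_ticks_pending > 0)
		s->rcu_ticks_pending--;
}

/* Exit-to-user loop; returns how many trips through the scheduler it took. */
static int exit_to_user(struct cpu_state *s)
{
	int trips = 0;

	while (!isolation_ready(s)) {
		fake_schedule(s);
		trips++;
	}
	return trips;
}
```

The task just keeps cycling through the scheduler until whatever was
holding the tick on goes away, and only then returns to user space.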
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-13 21:19               ` Chris Metcalf
@ 2016-01-20 13:27                 ` Mark Rutland
  0 siblings, 0 replies; 29+ messages in thread
From: Mark Rutland @ 2016-01-20 13:27 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Ingo Molnar, Will Deacon, Gilad Ben Yossef, Steven Rostedt,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
Hi Chris,
Sorry for the delay. I had intended to take a look at this and so held
off replying, but my time has been taken up elsewhere.
On Wed, Jan 13, 2016 at 04:19:56PM -0500, Chris Metcalf wrote:
> On 01/13/2016 05:44 AM, Ingo Molnar wrote:
> >* Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >>(Adding Mark to cc's)
> >>
> >>On 01/12/2016 05:07 AM, Will Deacon wrote:
> >>>On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> >>>>Ping!  There has been no substantive feedback to this version of
> >>>>the patch in the week since I posted it, which optimistically suggests
> >>>>to me that people may be satisfied with it.  If that's true, Frederic,
> >>>>I assume this would be pulled into your tree?
> >>>>
> >>>>I have slightly updated the v9 patch series since this posting:
> >>>>
> >>>>[...]
> >>>>
> >>>>- Incorporated Mark Rutland's changes to convert arm64
> >>>>   assembly to C code instead of using my own version.
> >>>Please avoid queuing these patches -- the first is already in the arm64
> >>>queue for 4.5 and the second was found to introduce a substantial
> >>>performance regression on the syscall entry/exit path. I think Mark had
> >>>an updated version to address that, so it would be easier not to have
> >>>an old version sitting in some other queue!
> >>I am not formally queueing them anywhere (like linux-next), though
> >>now that you mention it, that's a pretty good idea - I'll talk to Steven
> >>about that, assuming this merge window closes without the task
> >>isolation stuff going in.
> >NAK. Given the controversy, no way should this stuff go outside the primary trees
> >it affects: the scheduler, timer, irq, etc. trees.
> 
> Fair enough.  I'll plan to do v10 once the merge window closes.
> 
> Mark, let me know when/if you get a new version of the de-asm stuff
> for do_notify_resume() - thanks.
If I get the chance soon, I will do, though I suspect I won't have the
chance to give that the time it deserves over the next week or two. 
> Or, would it be helpful if I worked up the option I suggested, where
> we check the thread_info flags in the assembly code before calling out
> to the new loop in do_notify_resume()?
That would probably be for the best.
Thanks,
Mark.
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-19 20:45     ` Chris Metcalf
@ 2016-01-28  0:28       ` Frederic Weisbecker
  2016-01-29 18:18         ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-01-28  0:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> >>+/*
> >>+ * Isolation requires both nohz and isolcpus support from the scheduler.
> >>+ * We provide a boot flag that enables both for now, and which we can
> >>+ * add other functionality to over time if needed.  Note that just
> >>+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> >>+ */
> >>+static int __init task_isolation_setup(char *str)
> >>+{
> >>+	alloc_bootmem_cpumask_var(&task_isolation_map);
> >>+	if (cpulist_parse(str, task_isolation_map) < 0) {
> >>+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> >>+		return 1;
> >>+	}
> >>+
> >>+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> >>+	cpumask_copy(cpu_isolated_map, task_isolation_map);
> >>+
> >>+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> >>+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> >>+	tick_nohz_full_running = true;
> >How about calling tick_nohz_full_setup() instead? I'd rather prefer
> >that nohz full implementation details stay in tick-sched.c
> >
> >Also what happens if nohz_full= is given as well as task_isolation= ?
> >Don't we risk a memory leak and maybe breaking the fact that
> >(nohz_full & task_isolation != task_isolation) which is really a requirement?
> 
> Yeah, this is a good point.  I'm not sure what the best way is to make
> this happen.  It's already true that we will leak memory if you
> specify "nohz_full=" more than once on the command line, but it's
> awkward to fix (assuming we want the last value to win) so maybe
> we can just ignore this problem - it's a pretty small amount of memory
> after all.  If so, then making tick_nohz_full_setup() and
> isolated_cpu_setup() both non-static and calling them from
> task_isolation_setup() might be the cleanest approach.  What do you think?
I think we can reuse tick_nohz_full_setup() indeed, or some of its internals
and encapsulate that in a function so that isolation.c can initialize nohz full
without fiddling with internal variables.
> 
> You asked what happens if nohz_full= is given as well, which is a very
> good question.  Perhaps the right answer is to have an early_initcall
> that suppresses task isolation on any cores that lost their nohz_full
> or isolcpus status due to later boot command line arguments (and
> generate a console warning, obviously).
I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation="
That's the easiest way to deal with it: both nohz and task isolation can call
a common initializer that takes care of the allocation and adds the cpus to the mask.
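As a rough illustration of the union semantics above, here is a minimal
userspace sketch in C.  It is not the kernel implementation: parse_cpulist(),
nohz_full_add_cpus() and nohz_full_cpus() are hypothetical names, the mask is a
single 64-bit word rather than a cpumask_var_t, and calloc() stands in for the
bootmem allocation.  The only point is that the mask is allocated on first use
and every later option ORs into it, so repeating or mixing "nohz_full=" and
"task_isolation=" neither leaks memory nor drops cpus.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUS 64	/* assumption: one 64-bit word is enough for the sketch */

/* Hypothetical stand-in for the nohz full mask; NULL until first use. */
static uint64_t *nohz_mask;

/* Parse a cpulist such as "1-3,5" into a bitmask; returns -1 on bad input. */
static int parse_cpulist(const char *s, uint64_t *out)
{
	uint64_t m = 0;

	while (*s) {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;		/* no digits where a cpu was expected */
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= MAX_CPUS)
			return -1;
		for (long c = lo; c <= hi; c++)
			m |= UINT64_C(1) << c;
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	*out = m;
	return 0;
}

/* Common initializer: every boot option funnels through here. */
int nohz_full_add_cpus(const char *cpulist)
{
	uint64_t m;

	assert(cpulist != NULL);
	if (parse_cpulist(cpulist, &m) < 0)
		return -1;			/* mask is left untouched */
	if (!nohz_mask) {			/* allocate exactly once */
		nohz_mask = calloc(1, sizeof(*nohz_mask));
		if (!nohz_mask)
			return -1;
	}
	*nohz_mask |= m;			/* union, never overwrite */
	return 0;
}

uint64_t nohz_full_cpus(void)
{
	return nohz_mask ? *nohz_mask : 0;
}
```

With this shape, a "nohz_full=" handler and a "task_isolation=" handler would
both reduce to one call to the shared helper, and the order in which the
options appear on the command line stops mattering.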
> >>+int task_isolation_set(unsigned int flags)
> >>+{
> >>+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> >>+	    !task_isolation_possible(smp_processor_id()))
> >>+		return -EINVAL;
> >>+
> >>+	current->task_isolation_flags = flags;
> >>+	return 0;
> >>+}
> >What if we concurrently change the task's affinity? Also it seems that preemption
> >isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
> >seen warnings with smp_processor_id().
> >
> >Also we should protect against task's affinity change when task_isolation_flags
> >is set.
> 
> I talked about this a bit when you raised it for the v8 patch series:
> 
>   http://lkml.kernel.org/r/562FA8FD.8080502@ezchip.com
> 
> I'd be curious to hear your take on the arguments I made there.
Oh ok, I'm going to reply there then :)
> 
> You're absolutely right about the preemption warnings, which I only fixed
> a few days ago.  In this case I use raw_smp_processor_id() since with a
> fixed single-core cpu affinity, we're not going anywhere, so the warning
> from smp_processor_id() would be bogus.  And although technically it is
> still correct (racing with another task resetting the task affinity on this
> one), it is in any case equivalent to having that other task reset the
> affinity on return from the prctl(), which I've already claimed isn't an
> interesting use case to try to handle.  But let me know what you think!
Ok it's very much tied to the affinity issue. If we deal with affinity changes
properly I think we can use the raw_ version.
> 
> >>+
> >>+/*
> >>+ * In task isolation mode we try to return to userspace only after
> >>+ * attempting to make sure we won't be interrupted again.  To handle
> >>+ * the periodic scheduler tick, we test to make sure that the tick is
> >>+ * stopped, and if it isn't yet, we request a reschedule so that if
> >>+ * another task needs to run to completion first, it can do so.
> >>+ * Similarly, if any other subsystems require quiescing, we will need
> >>+ * to do that before we return to userspace.
> >>+ */
> >>+bool _task_isolation_ready(void)
> >>+{
> >>+	WARN_ON_ONCE(!irqs_disabled());
> >>+
> >>+	/* If we need to drain the LRU cache, we're not ready. */
> >>+	if (lru_add_drain_needed(smp_processor_id()))
> >>+		return false;
> >>+
> >>+	/* If vmstats need updating, we're not ready. */
> >>+	if (!vmstat_idle())
> >>+		return false;
> >>+
> >>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>+	if (!tick_nohz_tick_stopped()) {
> >>+		set_tsk_need_resched(current);
> >I'm not sure doing this will help getting the tick to get stopped.
> 
> Well, I don't know that there is anything else we CAN do, right?  If there's
> another task that can run, great - it may be that that's why full dynticks
> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> there's nothing else we can do, in which case we basically spend our time
> going around through the scheduler code and back out to the
> task_isolation_ready() test, but again, there's really nothing else more
> useful we can be doing at this point.  Once the RCU tick fires (or whatever
> it was that was preventing full dynticks from engaging), we will pass this
> test and return to user space.
There is nothing at all you can do and setting TIF_RESCHED won't help either.
If there is another task that can run, the scheduler takes care of resched
by itself :-)
Thanks.
^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-28  0:28       ` Frederic Weisbecker
@ 2016-01-29 18:18         ` Chris Metcalf
       [not found]           ` <56ABACDD.5090500-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-01-29 18:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
>> You asked what happens if nohz_full= is given as well, which is a very
>> good question.  Perhaps the right answer is to have an early_initcall
>> that suppresses task isolation on any cores that lost their nohz_full
>> or isolcpus status due to later boot command line arguments (and
>> generate a console warning, obviously).
> I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation="
> That's the easiest way to deal with it: both nohz and task isolation can call
> a common initializer that takes care of the allocation and adds the cpus to the mask.
I like it!
And by the same token, the final isolcpus cpumask is
"isolcpus=" | "task_isolation="?
That seems like something we'd want to do to keep things parallel.
>>>> +bool _task_isolation_ready(void)
>>>> +{
>>>> +	WARN_ON_ONCE(!irqs_disabled());
>>>> +
>>>> +	/* If we need to drain the LRU cache, we're not ready. */
>>>> +	if (lru_add_drain_needed(smp_processor_id()))
>>>> +		return false;
>>>> +
>>>> +	/* If vmstats need updating, we're not ready. */
>>>> +	if (!vmstat_idle())
>>>> +		return false;
>>>> +
>>>> +	/* Request rescheduling unless we are in full dynticks mode. */
>>>> +	if (!tick_nohz_tick_stopped()) {
>>>> +		set_tsk_need_resched(current);
>>> I'm not sure doing this will help getting the tick to get stopped.
>> Well, I don't know that there is anything else we CAN do, right?  If there's
>> another task that can run, great - it may be that that's why full dynticks
>> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
>> there's nothing else we can do, in which case we basically spend our time
>> going around through the scheduler code and back out to the
>> task_isolation_ready() test, but again, there's really nothing else more
>> useful we can be doing at this point.  Once the RCU tick fires (or whatever
>> it was that was preventing full dynticks from engaging), we will pass this
>> test and return to user space.
> There is nothing at all you can do and setting TIF_RESCHED won't help either.
> If there is another task that can run, the scheduler takes care of resched
> by itself :-)
The problem is that the scheduler will only take care of resched at a
later time, typically when we get a timer interrupt later.  By invoking the
scheduler here, we allow any tasks that are ready to run to run
immediately, rather than waiting for an interrupt to wake the scheduler.
Plenty of places in the kernel just call schedule() directly when they are
waiting.  Since we're waiting here regardless, we might as well
immediately get any other runnable tasks dealt with.
We could also just return "false" in _task_isolation_ready(), and then
check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
call schedule() explicitly there, but that seems a little more roundabout.
Admittedly it's more usual to see kernel code call schedule() directly
to yield the processor, but in this case I'm not convinced it's cleaner
given we're already in a loop where the caller is checking TIF_RESCHED
and then calling schedule() when it's set.
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v9 04/13] task_isolation: add initial support
       [not found]           ` <56ABACDD.5090500-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
@ 2016-01-30 21:11             ` Frederic Weisbecker
  2016-02-11 19:24               ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-01-30 21:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> >On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> >>You asked what happens if nohz_full= is given as well, which is a very
> >>good question.  Perhaps the right answer is to have an early_initcall
> >>that suppresses task isolation on any cores that lost their nohz_full
> >>or isolcpus status due to later boot command line arguments (and
> >>generate a console warning, obviously).
> >I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation="
> >That's the easiest way to deal with it: both nohz and task isolation can call
> >a common initializer that takes care of the allocation and adds the cpus to the mask.
> 
> I like it!
> 
> And by the same token, the final isolcpus cpumask is "isolcpus=" |
> "task_isolation="?
> That seems like we'd want to do it to keep things parallel.
We have reverted the patch that made isolcpus |= nohz_full. Too
many people complained about unusable machines with NO_HZ_FULL_ALL.
But the user can still set that parameter manually.
> 
> >>>>+bool _task_isolation_ready(void)
> >>>>+{
> >>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>+
> >>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>+		return false;
> >>>>+
> >>>>+	/* If vmstats need updating, we're not ready. */
> >>>>+	if (!vmstat_idle())
> >>>>+		return false;
> >>>>+
> >>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>+		set_tsk_need_resched(current);
> >>>I'm not sure doing this will help getting the tick to get stopped.
> >>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>another task that can run, great - it may be that that's why full dynticks
> >>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>there's nothing else we can do, in which case we basically spend our time
> >>going around through the scheduler code and back out to the
> >>task_isolation_ready() test, but again, there's really nothing else more
> >>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>it was that was preventing full dynticks from engaging), we will pass this
> >>test and return to user space.
> >There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >If there is another task that can run, the scheduler takes care of resched
> >by itself :-)
> 
> The problem is that the scheduler will only take care of resched at a
> later time, typically when we get a timer interrupt later.
When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
target is remote it sends an IPI; if it's local then we wait for the next reschedule
point (preemption points, voluntary reschedule, interrupts). There is just nothing
you can do to accelerate that.
> By invoking the scheduler here, we allow any tasks that are ready to run to run
> immediately, rather than waiting for an interrupt to wake the scheduler.
Well, in this case here we are interested in the current CPU. And if a task
got awoken and waits for the current CPU, it will have an opportunity to get
scheduled on syscall exit.
> Plenty of places in the kernel just call schedule() directly when they are
> waiting.  Since we're waiting here regardless, we might as well
> immediately get any other runnable tasks dealt with.
> 
> We could also just return "false" in _task_isolation_ready(), and then
> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> call schedule() explicitly there, but that seems a little more roundabout.
> Admittedly it's more usual to see kernel code call schedule() directly
> to yield the processor, but in this case I'm not convinced it's cleaner
> given we're already in a loop where the caller is checking TIF_RESCHED
> and then calling schedule() when it's set.
You could call cond_resched(), but really syscall exit is enough for what
you want. And the problem here if a task prevents the CPU from stopping the
tick is that task itself, not the fact it doesn't get scheduled. If we have
other tasks than the current isolated one on the CPU, it means that the
environment is not ready for hard isolation.
And in general: we shouldn't loop at all there: if something depends on the tick,
the CPU is not ready for isolation and something needs to be done: setting
some task affinity, etc... So we should just fail the prctl and let the user
deal with it.
> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 
^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-30 21:11             ` Frederic Weisbecker
@ 2016-02-11 19:24               ` Chris Metcalf
  2016-03-04 12:56                 ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-02-11 19:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
>> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
>>> On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
>>>> You asked what happens if nohz_full= is given as well, which is a very
>>>> good question.  Perhaps the right answer is to have an early_initcall
>>>> that suppresses task isolation on any cores that lost their nohz_full
>>>> or isolcpus status due to later boot command line arguments (and
>>>> generate a console warning, obviously).
>>> I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation="
>>> That's the easiest way to deal with it: both nohz and task isolation can call
>>> a common initializer that takes care of the allocation and adds the cpus to the mask.
>> I like it!
>>
>> And by the same token, the final isolcpus cpumask is "isolcpus=" |
>> "task_isolation="?
>> That seems like we'd want to do it to keep things parallel.
> We have reverted the patch that made isolcpus |= nohz_full. Too
> many people complained about unusable machines with NO_HZ_FULL_ALL
>
> But the user can still set that parameter manually.
Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
we should add cpus X-Y to both the nohz_full set and the isolcpus set.
I've changed it to work that way for the v10 patch series.
>>>>>> +bool _task_isolation_ready(void)
>>>>>> +{
>>>>>> +	WARN_ON_ONCE(!irqs_disabled());
>>>>>> +
>>>>>> +	/* If we need to drain the LRU cache, we're not ready. */
>>>>>> +	if (lru_add_drain_needed(smp_processor_id()))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* If vmstats need updating, we're not ready. */
>>>>>> +	if (!vmstat_idle())
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* Request rescheduling unless we are in full dynticks mode. */
>>>>>> +	if (!tick_nohz_tick_stopped()) {
>>>>>> +		set_tsk_need_resched(current);
>>>>> I'm not sure doing this will help getting the tick to get stopped.
>>>> Well, I don't know that there is anything else we CAN do, right?  If there's
>>>> another task that can run, great - it may be that that's why full dynticks
>>>> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
>>>> there's nothing else we can do, in which case we basically spend our time
>>>> going around through the scheduler code and back out to the
>>>> task_isolation_ready() test, but again, there's really nothing else more
>>>> useful we can be doing at this point.  Once the RCU tick fires (or whatever
>>>> it was that was preventing full dynticks from engaging), we will pass this
>>>> test and return to user space.
>>> There is nothing at all you can do and setting TIF_RESCHED won't help either.
>>> If there is another task that can run, the scheduler takes care of resched
>>> by itself :-)
>> The problem is that the scheduler will only take care of resched at a
>> later time, typically when we get a timer interrupt later.
> When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> target is remote it sends an IPI; if it's local then we wait for the next reschedule
> point (preemption points, voluntary reschedule, interrupts). There is just nothing
> you can do to accelerate that.
But that's exactly what I'm saying.  If we're sitting in a loop here waiting
for some short-lived process (maybe kernel thread) to run and get out of
the way, we don't want to just spin sitting in prepare_exit_to_usermode().
We want to call schedule(), get the short-lived process to run, then when
it calls schedule() again, we're back in prepare_exit_to_usermode but now
we can return to userspace.
We don't want to wait for preemption points or interrupts, and there are
no other voluntary reschedules in the prepare_exit_to_usermode() loop.
If the other task had been woken up for some completion, then yes we would
already have had TIF_RESCHED set, but if the other runnable task was (for
example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
this point, and thus we might need to call schedule() explicitly.
Note that the prepare_exit_to_usermode() loop is exactly the point at
which we normally call schedule() if we are in syscall exit, so we are
just encouraging that schedule() to happen if otherwise it might not.
>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>> immediately, rather than waiting for an interrupt to wake the scheduler.
> Well, in this case here we are interested in the current CPU. And if a task
> got awoken and waits for the current CPU, it will have an opportunity to get
> scheduled on syscall exit.
That's true if TIF_RESCHED was set because a completion occurred that
the other task was waiting for.  But there might not be any such completion
and the task just got preempted earlier and is still ready to run.
My point is that setting TIF_RESCHED is never harmful, and there are
cases like involuntary preemption where it might help.
>> Plenty of places in the kernel just call schedule() directly when they are
>> waiting.  Since we're waiting here regardless, we might as well
>> immediately get any other runnable tasks dealt with.
>>
>> We could also just return "false" in _task_isolation_ready(), and then
>> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
>> call schedule() explicitly there, but that seems a little more roundabout.
>> Admittedly it's more usual to see kernel code call schedule() directly
>> to yield the processor, but in this case I'm not convinced it's cleaner
>> given we're already in a loop where the caller is checking TIF_RESCHED
>> and then calling schedule() when it's set.
> You could call cond_resched(), but really syscall exit is enough for what
> you want. And the problem here if a task prevents the CPU from stopping the
> tick is that task itself, not the fact it doesn't get scheduled.
True, although in that case we just need to wait (e.g. for an RCU tick
to occur to quiesce); we could spin, but spinning through the scheduler
seems no better or worse in that case than just spinning with
interrupts enabled in a loop.  And (as I said above) it could help.
> If we have
> other tasks than the current isolated one on the CPU, it means that the
> environment is not ready for hard isolation.
Right.  But the model is that in that case, the task that wants hard
isolation is just going to have to wait to return to userspace.
> And in general: we shouldn't loop at all there: if something depends on the tick,
> the CPU is not ready for isolation and something needs to be done: setting
> some task affinity, etc... So we should just fail the prctl and let the user
> deal with it.
So there are potentially two cases here:
(1) When we initially do the prctl(), should we check to see if there are
other schedulable tasks, etc., and fail the prctl() if so?  You could make a
case for this, but I think in practice userspace would just end up looping
back to retry the prctl if we created that semantic in the kernel.
(2) What about times when we are leaving the kernel after already
doing the prctl()?  For example a core doing packet forwarding might
want to report some error condition up to the kernel, and remove itself
from the set of cores handling packets, then do some syscall(s) to generate
logging data, and then go back and continue handling packets.  Or, the
process might have created some large anonymous mapping where
every now and then it needs to cross a page boundary for some structure
and touch a new page, and it knows to expect a page fault in that case.
In those cases we are returning from the kernel, not at prctl() time, and
we still want to enforce the semantics that no further interrupts will
occur to disturb the task.  These kinds of use cases are why we have
as general-purpose a mechanism as we do for task isolation.
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-02-11 19:24               ` Chris Metcalf
@ 2016-03-04 12:56                 ` Frederic Weisbecker
  2016-03-09 19:39                   ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-03-04 12:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
> On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> >We have reverted the patch that made isolcpus |= nohz_full. Too
> >many people complained about unusable machines with NO_HZ_FULL_ALL
> >
> >But the user can still set that parameter manually.
> 
> Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
> we should add cpus X-Y to both the nohz_full set and the isolcpus set.
> I've changed it to work that way for the v10 patch series.
Ok.
> 
> 
> >>>>>>+bool _task_isolation_ready(void)
> >>>>>>+{
> >>>>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>>>+
> >>>>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* If vmstats need updating, we're not ready. */
> >>>>>>+	if (!vmstat_idle())
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>>>+		set_tsk_need_resched(current);
> >>>>>I'm not sure doing this will help getting the tick to get stopped.
> >>>>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>>>another task that can run, great - it may be that that's why full dynticks
> >>>>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>>>there's nothing else we can do, in which case we basically spend our time
> >>>>going around through the scheduler code and back out to the
> >>>>task_isolation_ready() test, but again, there's really nothing else more
> >>>>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>>>it was that was preventing full dynticks from engaging), we will pass this
> >>>>test and return to user space.
> >>>There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >>>If there is another task that can run, the scheduler takes care of resched
> >>>by itself :-)
> >>The problem is that the scheduler will only take care of resched at a
> >>later time, typically when we get a timer interrupt later.
> >When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> >target is remote it sends an IPI; if it's local then we wait for the next reschedule
> >point (preemption points, voluntary reschedule, interrupts). There is just nothing
> >you can do to accelerate that.
> 
> But that's exactly what I'm saying.  If we're sitting in a loop here waiting
> for some short-lived process (maybe kernel thread) to run and get out of
> the way, we don't want to just spin sitting in prepare_exit_to_usermode().
> We want to call schedule(), get the short-lived process to run, then when
> it calls schedule() again, we're back in prepare_exit_to_usermode but now
> we can return to userspace.
Maybe, although I think returning to userspace with -EAGAIN or -EBUSY (or something
like that) would be better, so that userspace retries a bit later with prctl. Otherwise
we may well be waiting forever in kernel mode.
> 
> We don't want to wait for preemption points or interrupts, and there are
> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
> 
> If the other task had been woken up for some completion, then yes we would
> already have had TIF_RESCHED set, but if the other runnable task was (for
> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
> this point, and thus we might need to call schedule() explicitly.
There can't be another task in the runqueue waiting to be preempted since
we (the current task) are running on the CPU.
Besides, if we aren't alone in the runqueue, this breaks the task isolation
mode.
> 
> Note that the prepare_exit_to_usermode() loop is exactly the point at
> which we normally call schedule() if we are in syscall exit, so we are
> just encouraging that schedule() to happen if otherwise it might not.
> 
> >>By invoking the scheduler here, we allow any tasks that are ready to run to run
> >>immediately, rather than waiting for an interrupt to wake the scheduler.
> >Well, in this case here we are interested in the current CPU. And if a task
> >got awoken and waits for the current CPU, it will have an opportunity to get
> >scheduled on syscall exit.
> 
> That's true if TIF_RESCHED was set because a completion occurred that
> the other task was waiting for.  But there might not be any such completion
> and the task just got preempted earlier and is still ready to run.
But if another task waits for the CPU, this breaks task isolation mode. Now
assuming we want a pending task to resume such that we get the CPU for ourselves,
we have no idea if the scheduler is going to schedule that task; it depends on
vruntime and other things. TIF_RESCHED only makes us enter the scheduler, it doesn't
guarantee any context switch.
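To make the last point concrete, here is a deliberately tiny toy model in C.
It is not how the real scheduler works: struct toy_task, struct toy_rq and
toy_schedule() are invented for illustration, and the only selection criterion
is a single vruntime-like key (the real scheduler weighs "other things" too).
It only shows that entering the scheduler with the resched flag set can hand
the CPU straight back to the task that was already running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: tasks ordered by one vruntime-like key. */
struct toy_task {
	const char *name;
	unsigned long vruntime;
};

struct toy_rq {
	struct toy_task *tasks;
	size_t nr;		/* number of runnable tasks, including current */
	size_t curr;		/* index of the task that is running */
	bool need_resched;	/* stand-in for the resched flag */
};

/*
 * "Enter the scheduler": pick the runnable task with the smallest key.
 * Ties keep the current task.  Returns true only if a context switch
 * actually happened.
 */
bool toy_schedule(struct toy_rq *rq)
{
	size_t best;

	assert(rq != NULL && rq->nr > 0 && rq->curr < rq->nr);
	best = rq->curr;
	for (size_t i = 0; i < rq->nr; i++)
		if (rq->tasks[i].vruntime < rq->tasks[best].vruntime)
			best = i;

	rq->need_resched = false;	/* the request has been serviced */
	if (best == rq->curr)
		return false;		/* scheduler ran, same task continues */
	rq->curr = best;
	return true;
}
```

If the would-be isolated task has the smallest key, setting need_resched and
calling toy_schedule() returns false and nothing changes; the other task only
runs once the keys say it should.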
> My point is that setting TIF_RESCHED is never harmful, and there are
> cases like involuntary preemption where it might help.
Sure but we don't write code just because it doesn't harm. Strange code hurts
the brain of reviewers.
Now concerning involuntary preemption, it's a matter of a millisecond; userspace
needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
what can be useful, as we leave the CPU for the resuming task.
Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
correctly or there is a kernel thread that might run again at some time ahead.
> 
> >>Plenty of places in the kernel just call schedule() directly when they are
> >>waiting.  Since we're waiting here regardless, we might as well
> >>immediately get any other runnable tasks dealt with.
> >>
> >>We could also just return "false" in _task_isolation_ready(), and then
> >>check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> >>call schedule() explicitly there, but that seems a little more roundabout.
> >>Admittedly it's more usual to see kernel code call schedule() directly
> >>to yield the processor, but in this case I'm not convinced it's cleaner
> >>given we're already in a loop where the caller is checking TIF_RESCHED
> >>and then calling schedule() when it's set.
> >You could call cond_resched(), but really syscall exit is enough for what
> >you want. And the problem here if a task prevents the CPU from stopping the
> >tick is that task itself, not the fact it doesn't get scheduled.
> 
> True, although in that case we just need to wait (e.g. for an RCU tick
> to occur to quiesce); we could spin, but spinning through the scheduler
> seems no better or worse in that case than just spinning with
> interrupts enabled in a loop.  And (as I said above) it could help.
Let's just leave that waiting to userspace. Just sleep a few milliseconds.
> 
> >If we have
> >other tasks than the current isolated one on the CPU, it means that the
> >environment is not ready for hard isolation.
> 
> Right.  But the model is that in that case, the task that wants hard
> isolation is just going to have to wait to return to userspace.
I think we shouldn't do that wait for isolation in the kernel.
> 
> 
> >And in general: we shouldn't loop at all there: if something depends on the tick,
> >the CPU is not ready for isolation and something needs to be done: setting
> >some task affinity, etc... So we should just fail the prctl and let the user
> >deal with it.
> 
> So there are potentially two cases here:
> 
> (1) When we initially do the prctl(), should we check to see if there are
> other schedulable tasks, etc., and fail the prctl() if so?  You could make a
> case for this, but I think in practice userspace would just end up looping
> back to retry the prctl if we created that semantic in the kernel.
That sounds saner to me. And if we still fail after one second, then just give up.
In fact if it doesn't work the first time, that's a bad sign like I said above.
The task that is running on the CPU may well come again later. Some preconditions
are not met.
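A userspace retry loop along those lines could look roughly like the sketch
below.  Everything here is an assumption for illustration: the kernel interface
is hidden behind a try_enter() callback (which in a real program would wrap the
task isolation prctl() from this series and return 0 or a negative errno),
sleep_ms() would wrap nanosleep(), and the fake_* helpers are test doubles, not
kernel behaviour.  The error codes follow the -EAGAIN/-EBUSY idea floated
earlier in this thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical hooks; both are assumptions of this sketch. */
typedef int (*try_enter_fn)(void *ctx);		/* 0 or negative errno */
typedef void (*sleep_ms_fn)(void *ctx, int ms);

/*
 * Ask for isolation, sleeping step_ms between attempts, and give up once
 * budget_ms of sleeping has been spent (e.g. 1000 for "one second").
 */
int enter_isolation_with_retry(try_enter_fn try_enter, sleep_ms_fn sleep_ms,
			       void *ctx, int budget_ms, int step_ms)
{
	int waited = 0;

	assert(try_enter && sleep_ms);
	if (step_ms <= 0)
		return -EINVAL;

	for (;;) {
		int ret = try_enter(ctx);

		if (ret == 0)
			return 0;		/* isolated */
		if (ret != -EAGAIN && ret != -EBUSY)
			return ret;		/* hard error: do not retry */
		if (waited + step_ms > budget_ms)
			return -ETIMEDOUT;	/* preconditions are not met */
		sleep_ms(ctx, step_ms);		/* leave the CPU to other tasks */
		waited += step_ms;
	}
}

/* Test double: reports "busy" for the first busy_calls attempts. */
struct fake_kernel {
	int busy_calls;
	int attempts;
	int slept_ms;
};

int fake_try_enter(void *ctx)
{
	struct fake_kernel *k = ctx;

	return ++k->attempts <= k->busy_calls ? -EAGAIN : 0;
}

void fake_sleep_ms(void *ctx, int ms)
{
	struct fake_kernel *k = ctx;

	k->slept_ms += ms;	/* record instead of really sleeping */
}
```

Hitting -ETIMEDOUT is then the signal to go and fix the environment (task
affinities, stray kernel threads) rather than to keep waiting.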
> 
> (2) What about times when we are leaving the kernel after already
> doing the prctl()?  For example a core doing packet forwarding might
> want to report some error condition up to the kernel, and remove itself
> from the set of cores handling packets, then do some syscall(s) to generate
> logging data, and then go back and continue handling packets.  Or, the
> process might have created some large anonymous mapping where
> every now and then it needs to cross a page boundary for some structure
> and touch a new page, and it knows to expect a page fault in that case.
> In those cases we are returning from the kernel, not at prctl() time, and
> we still want to enforce the semantics that no further interrupts will
> occur to disturb the task.  These kinds of use cases are why we have
> as general-purpose a mechanism as we do for task isolation.
If any interrupt or any kind of disturbance happens, we should leave that
task isolation mode and warn the isolated task about that. SIGTERM?
Thanks.
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-03-04 12:56                 ` Frederic Weisbecker
@ 2016-03-09 19:39                   ` Chris Metcalf
  2016-04-08 13:56                     ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-03-09 19:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
Frederic,
Thanks for the detailed feedback on the task isolation stuff.
This reply kind of turned into an essay, so I've added a little "TL;DR"
sentence before each section.
   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =====
An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:
"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.
"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued.  This is what you recommended in your last
email.
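To make the distinction concrete, here is a toy userspace model of the two
options.  It is only a sketch, not code from the patch series, and every
name in it (iso_task, iso_request, iso_kernel_entry, and so on) is
hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: a toy model, not kernel code. */
enum iso_model { ISO_PERSISTENT, ISO_ONESHOT };

struct iso_task {
	enum iso_model model;
	bool isolated;		/* is the task-isolation flag currently set? */
};

/* Models a successful prctl() request for isolation. */
void iso_request(struct iso_task *t, enum iso_model model)
{
	assert(t != NULL);
	t->model = model;
	t->isolated = true;
}

/* Models a later kernel entry (syscall, page fault, ...). */
void iso_kernel_entry(struct iso_task *t)
{
	assert(t != NULL);
	if (t->model == ISO_ONESHOT)
		t->isolated = false;	/* off until another prctl() */
	/* Persistent: the flag stays set across the kernel entry. */
}

/* Must the kernel quiesce the core before returning to userspace? */
bool iso_must_quiesce_on_exit(const struct iso_task *t)
{
	assert(t != NULL);
	return t->isolated;
}
```

In this model a persistent task still has to be quiesced on every return
to userspace, while a one-shot task returns normally until it calls
iso_request() again.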
There are a number of pros and cons to the two models.  I think on
balance I still like the "persistent mode" approach, but here's all
the pros/cons I can think of:
PRO for persistent mode: A somewhat easier programming model.  Users
can just imagine "task isolation" as a way for them to still be able
to use the kernel exactly as they always have; it's just slower to get
back out of the kernel so you use it judiciously.  For example, a
process is free to call write() on a socket to perform a diagnostic,
but when returning from the write() syscall, the kernel will hold the
task in kernel mode until any timer ticks (perhaps from networking
stuff) are complete, and then let it return to userspace to continue
in task isolation mode.  This is convenient to the user since they
don't have to fret about re-enabling task isolation after that
syscall, page fault, or whatever; they can just continue running.
With your suggestion, the user pretty much has to leave STRICT mode
enabled so he gets notified of any unexpected return to kernel space
(in fact we might make it required so you always get a signal when
leaving task isolation unless it's via a prctl or exit syscall).
PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc.  With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed or the new
process affinitizes itself away or stops running).  By contrast, in
one-shot mode, any return to kernel space turns off task isolation
anyway, so it's very clear what the interaction looks like.  I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.
CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad
actions that bring us into the kernel when we're not expecting it.
Right now there are things that bring us into the kernel that we don't
bother annotating for task isolation STRICT mode, just because they're
visible to the user anyway: e.g., a bus fault or segmentation
violation.
I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
isolation at the next syscall entry, page fault, etc.  Then we can
think more specifically about whether we want to remove the flag or
not, and if we remove it, whether we want to make the code that was
controlled by it unconditionally true or unconditionally false
(i.e. remove it again).
   TL;DR: We should be more willing to return -EINVAL from prctl().
   =====
One thing you've argued is that we should be more aggressive about
failing the prctl() call.  I think, in any case, that this is probably
reasonable.  We already check that the task's affinity is limited to
the current core and that that core is a task_isolation cpu; I think we
can also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series).  This will mean you can't
even try to go into task isolation mode if another task is
schedulable, among other things, which seems like a good thing.
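As a rough illustration of the checks listed above, here is a userspace
model of the decision.  It is a sketch under stated assumptions: the
struct and function names are hypothetical, and the tick_can_stop field
merely stands in for whatever can_stop_full_tick() (or its successor)
would report in the kernel:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the prctl() would inspect. */
struct iso_precheck {
	unsigned int nr_allowed_cpus;	/* size of the task's affinity mask */
	int allowed_cpu;		/* the cpu in that mask, if only one */
	int current_cpu;		/* cpu the task is running on */
	bool cpu_is_task_isolation;	/* is current_cpu a task_isolation cpu? */
	bool tick_can_stop;		/* stand-in for can_stop_full_tick() */
};

/* Returns 0 if isolation may be attempted, -EINVAL otherwise. */
int iso_prctl_check(const struct iso_precheck *s)
{
	assert(s != NULL);

	/* Affinity must be limited to the current core. */
	if (s->nr_allowed_cpus != 1 || s->allowed_cpu != s->current_cpu)
		return -EINVAL;

	/* That core must be a task_isolation cpu. */
	if (!s->cpu_is_task_isolation)
		return -EINVAL;

	/* Proposed extra check: fail if the tick cannot be stopped,
	 * e.g. because another task is schedulable on this core. */
	if (!s->tick_can_stop)
		return -EINVAL;

	return 0;
}
```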
However, it is important to note that the current task_isolation_ready
and task_isolation_enter calls that are in the prepare_exit_to_usermode
routine are still required even with your proposed one-shot mode.  We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is make sure task_isolation_ready returns true
with interrupts disabled, and interrupts are not then re-enabled before
return to userspace.  Anything else is just keeping your fingers
crossed and guessing.
   TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
   =====
Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.
First, note that this is only possible for one-shot mode.  For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls.  So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.
But even in one-shot mode, I'm not really sure what the advantage is
here.  We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core.  So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway.  You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man.  See the next point.
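For reference, the userspace side of the -EBUSY proposal (sleep a few
milliseconds between attempts, give up after about a second) could be
sketched as below.  This is only an illustration: the try_enter callback
is a hypothetical stand-in for a prctl() that returns -EBUSY, and
fake_try_enter exists purely so the sketch can be exercised without the
patch series applied:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns 0 on success, -EBUSY if the core is
 * not yet quiet, or another negative errno on a hard failure. */
typedef int (*iso_try_fn)(void *ctx);

/* Retry try_enter every interval_ms until it stops returning -EBUSY
 * or roughly timeout_ms have been spent sleeping. */
int iso_enter_with_retry(iso_try_fn try_enter, void *ctx,
			 long interval_ms, long timeout_ms)
{
	assert(try_enter != NULL && interval_ms > 0 && timeout_ms >= 0);

	struct timespec delay = {
		.tv_sec = interval_ms / 1000,
		.tv_nsec = (interval_ms % 1000) * 1000000L,
	};

	for (long waited = 0; ; waited += interval_ms) {
		int ret = try_enter(ctx);

		if (ret != -EBUSY)
			return ret;	/* success or a hard error */
		if (waited >= timeout_ms)
			return -EBUSY;	/* give up; preconditions not met */

		/* Sleeping lets any other runnable task use the cpu.
		 * An early wakeup by a signal is tolerated here. */
		nanosleep(&delay, NULL);
	}
}

/* Test stand-in: fails with -EBUSY *ctx times, then succeeds. */
int fake_try_enter(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -EBUSY;
	}
	return 0;
}
```

Note that a loop like this only covers the initial prctl(); it has no
equivalent for later returns from the kernel in persistent mode.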
   TL;DR: Should we arrange to actually use a completion in
   task_isolation_enter when dynticks are ticking, and call complete()
   in tick-sched.c when we shut down dynticks, or, just spin in
   schedule() and not worry about burning a little cpu?
   =====
One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over.  Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?  My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power.  But I'm willing to be convinced...
   TL;DR: We should turn off task isolation mode for signals.
   =====
One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode.  This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.
Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.
   TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
   =====
This is being investigated in a separate email thread with Andy
Lutomirski.  Whether it gets included in v11 is still TBD.
   TL;DR: Various minor issues in answer to Frederic's comments :-)
   =====
On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.
My earlier sentence may not have been clear.  By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice.  I know you can't have a different task in the
runqueue waiting to be preempted, since that doesn't make sense :-)
> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.
Indeed.  We can and will do better catching that at prctl() time.
So the question is, if we adopt the "persistent mode", how do we
handle this case on some other return from kernel space?
>>>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>>>> immediately, rather than waiting for an interrupt to wake the scheduler.
>>> Well, in this case here we are interested in the current CPU. And if a task
>>> got awoken and waits for the current CPU, it will have an opportunity to get
>>> scheduled on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for.  But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this breaks task isolation mode. Now
> assuming we want a pending task to resume such that we get the CPU for ourselves,
> we have no idea if the scheduler is going to schedule that task, it depends on
> vruntime and other things. TIF_RESCHED only makes us enter the scheduler, it doesn't
> guarantee any context switch.
Yes, true.  So we have to decide if we feel spinning into the
scheduler is so harmful that we should set up a new completion driven
by entering dynticks fullmode, and handle it that way instead.
>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure but we don't write code just because it doesn't harm. Strange code hurts
> the brain of reviewers.
Fair enough, and certainly means at a minimum we need a good comment there!
> Now concerning involuntary preemption, it's a matter of a millisecond, userspace
> needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
> what can be useful as we leave the CPU for the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
> or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
> correctly or there is a kernel thread that might run again at some time ahead.
Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again.  Then we can context switch back to
our task isolation task, and safely return from it to userspace.
>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()?  For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to generate
>> logging data, and then go back and continue handling packets.  Or, the
>> process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time, and
>> we still want to enforce the semantics that no further interrupts will
>> occur to disturb the task.  These kinds of use cases are why we have
>> as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?
That's the goal of STRICT mode.  By default it uses SIGTERM.  You can
also choose a different signal via the prctl() API.
Thanks again, Frederic!  I'll work to put together a new version of
the patch incorporating a selectable one-shot mode, plus the other
things mentioned in this email.  I think I will still not add the
suggested "dynticks full enabled completion" thing for now, and just
add a big comment on the code that makes us call schedule(), unless folks
really agree it's a necessary thing to have there.
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-03-09 19:39                   ` Chris Metcalf
@ 2016-04-08 13:56                     ` Frederic Weisbecker
  2016-04-08 16:34                       ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-04-08 13:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> Frederic,
> 
> Thanks for the detailed feedback on the task isolation stuff.
> 
> This reply kind of turned into an essay, so I've added a little "TL;DR"
> sentence before each section.
I think I'm going to cut my reply into several threads, because really
I can't get myself to make a giant reply all at once :-)
> 
> 
>   TL;DR: Let's make an explicit decision about whether task isolation
>   should be "persistent" or "one-shot".  Both have some advantages.
>   =====
> 
> An important high-level issue is how "sticky" task isolation mode is.
> We need to choose one of these two options:
> 
> "Persistent mode": A task switches state to "task isolation" mode
> (kind of a level-triggered analogy) and stays there indefinitely.  It
> can make a syscall, take a page fault, etc., if it wants to, but the
> kernel protects it from incurring any further asynchronous interrupts.
> This is the model I've been advocating for.
But then in this mode, what happens when an interrupt triggers?
> 
> "One-shot mode": A task requests isolation via prctl(), the kernel
> ensures it is isolated on return from the prctl(), but then as soon as
> it enters the kernel again, task isolation is switched off until
> another prctl is issued.  This is what you recommended in your last
> email.
No, I think we can issue syscalls, for example. But asynchronous interruptions
such as exceptions (actually somewhat synchronous but can be unexpected) and
interrupts are what we want to avoid.
> 
> There are a number of pros and cons to the two models.  I think on
> balance I still like the "persistent mode" approach, but here's all
> the pros/cons I can think of:
> 
> PRO for persistent mode: A somewhat easier programming model.  Users
> can just imagine "task isolation" as a way for them to still be able
> to use the kernel exactly as they always have; it's just slower to get
> back out of the kernel so you use it judiciously. For example, a
> process is free to call write() on a socket to perform a diagnostic,
> but when returning from the write() syscall, the kernel will hold the
> task in kernel mode until any timer ticks (perhaps from networking
> stuff) are complete, and then let it return to userspace to continue
> in task isolation mode.
So this is not hard isolation anymore. This is rather soft isolation with
best efforts to avoid disturbance.
Surely we can have different levels of isolation.
I'm still wondering what to do if the task migrates to another CPU. In fact,
perhaps what you're trying to do is rather a CPU property than a process property?
> This is convenient to the user since they
> don't have to fret about re-enabling task isolation after that
> syscall, page fault, or whatever; they can just continue running.
> With your suggestion, the user pretty much has to leave STRICT mode
> enabled so he gets notified of any unexpected return to kernel space
> (in fact we might make it required so you always get a signal when
> leaving task isolation unless it's via a prctl or exit syscall).
Right. Although we can allow all syscalls in this mode actually.
> 
> PRO for one-shot mode: A somewhat crisper interaction with
> sched_setaffinity() etc.  With a persistent mode approach, a task can
> start up task isolation, then later another task can be placed on its
> cpu and break it (it won't return to userspace until killed or the new
> process affinitizes itself away or stops running).  By contrast, in
> one-shot mode, any return to kernel space turns off task isolation
> anyway, so it's very clear what the interaction looks like.  I suspect
> this is more a theoretical advantage to one-shot mode than a practical
> one, though.
I think I heard about workloads that need such strict hard isolation.
Workloads that really can not afford any disturbance. They even
use userspace network stack. Maybe HFT?
> CON for one-shot mode: It's actually hard to catch every kernel entry
> so we can turn the task-isolation flag off again - and we really do
> need to have a flag, just so that we can suitably debug any bad
> actions that bring us into the kernel when we're not expecting it.
> Right now there are things that bring us into the kernel that we don't
> bother annotating for task isolation STRICT mode, just because they're
> visible to the user anyway: e.g., a bus fault or segmentation
> violation.
> 
> I think we can actually make both modes available to users with just
> another flag bit, so maybe we can look at what that looks like in v11:
> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> isolation at the next syscall entry, page fault, etc.  Then we can
> think more specifically about whether we want to remove the flag or
> not, and if we remove it, whether we want to make the code that was
> controlled by it unconditionally true or unconditionally false
> (i.e. remove it again).
I think we shouldn't bother with strict hard isolation if we don't need
it yet. The implementation may well be invasive. Let's wait for someone
who really needs it.
> 
> 
>   TL;DR: We should be more willing to return -EINVAL from prctl().
>   =====
> 
> One thing you've argued is that we should be more aggressive about
> failing the prctl() call.  I think, in any case, that this is probably
> reasonable.  We already check that the task's affinity is limited to
> the current core and that that core is a task_isolation cpu; I think we
> can also require that can_stop_full_tick() return true (or the moral
> equivalent given your recent patch series).  This will mean you can't
> even try to go into task isolation mode if another task is
> schedulable, among other things, which seems like a good thing.
> 
> However, it is important to note that the current task_isolation_ready
> and task_isolation_enter calls that are in the prepare_exit_to_usermode
> routine are still required even with your proposed one-shot mode.  We
> have to be sure that no interrupts occur on the way back to userspace
> that might then in principle lead to timer interrupts being scheduled,
> and the way to do that is make sure task_isolation_ready returns true
> with interrupts disabled, and interrupts are not then re-enabled before
> return to userspace.  Anything else is just keeping your fingers
> crossed and guessing.
So your requirements are actually hard isolation but in userspace?
And what happens if you get interrupted in userspace? What about page
faults and other exceptions?
Thanks.
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 13:56                     ` Frederic Weisbecker
@ 2016-04-08 16:34                       ` Chris Metcalf
       [not found]                         ` <5707DDA8.10600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                                           ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Chris Metcalf @ 2016-04-08 16:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >   TL;DR: Let's make an explicit decision about whether task isolation
> >   should be "persistent" or "one-shot".  Both have some advantages.
> >   =====
> >
> > An important high-level issue is how "sticky" task isolation mode is.
> > We need to choose one of these two options:
> >
> > "Persistent mode": A task switches state to "task isolation" mode
> > (kind of a level-triggered analogy) and stays there indefinitely.  It
> > can make a syscall, take a page fault, etc., if it wants to, but the
> > kernel protects it from incurring any further asynchronous interrupts.
> > This is the model I've been advocating for.
>
> But then in this mode, what happens when an interrupt triggers?
So here I'm taking "interrupt" to mean an external, asynchronous
interrupt, from another core or device, or asynchronously triggered
on the local core, like a timer interrupt.  By contrast I use "exception"
or "fault" to refer to synchronous, locally-triggered interruptions.
So for interrupts, the short answer is, it's a bug! :-)
An interrupt could be a kernel bug, in which case we consider it a
"true" bug.  This could be a timer interrupt occurring even after the
task isolation code thought there were none pending, or a hardware
device that incorrectly distributes interrupts to a task-isolation
cpu, or a global IPI that should be sent to fewer cores, or a kernel
TLB flush that could be deferred until the task-isolation task
re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
bug.  I'm sure there are more such bugs that we can continue to fix
going forward; it depends on how arbitrary you want to allow code
running on other cores to be.  For example, can another core unload a
kernel module without interrupting a task-isolation task?  Not right now.
Or, it could be an application bug: the standard example is if you
have an application with task-isolated cores that also does occasional
unmaps on another thread in the same process, on another core.  This
causes TLB flush interrupts under application control.  The
application shouldn't do this, and we tell our customers not to build
their applications this way.  The typical way we encourage our
customers to arrange this kind of "multi-threading" is by having a
pure memory API between the task isolation threads and what are
typically "control" threads running on non-task-isolated cores.  The
two types of threads just both mmap some common, shared memory but run
as different processes.
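A minimal sketch of what such a "pure memory API" might look like is a
single-producer, single-consumer ring placed in the shared mapping, so
that neither side ever needs a syscall to exchange messages.  The names
and the slot count below are hypothetical, and the sketch assumes exactly
one producer (say, the control process) and one consumer (the
task-isolation process):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 256u		/* hypothetical size; must be a power of two */

/* Would live in memory that both processes have mapped shared. */
struct msg_ring {
	_Atomic uint32_t head;	/* next slot to write; advanced by producer */
	_Atomic uint32_t tail;	/* next slot to read; advanced by consumer */
	uint64_t slot[RING_SLOTS];
};

static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0,
	      "RING_SLOTS must be a power of two");

/* Producer side: returns false if the ring is full. */
bool ring_push(struct msg_ring *r, uint64_t v)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SLOTS)
		return false;
	r->slot[head % RING_SLOTS] = v;
	/* Release: publish the slot contents before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct msg_ring *r, uint64_t *out)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;
	*out = r->slot[tail % RING_SLOTS];
	/* Release: finish reading the slot before handing it back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

The isolated side only ever polls memory here, so it never enters the
kernel and never triggers TLB flushes on behalf of the control side.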
So what happens if an interrupt does occur?
In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.
If you enable "strict" mode, we disable task isolation mode for that
core and deliver a signal to it.  This lets the application know that
an interrupt occurred, and it can take whatever kind of logging or
debugging action it wants to, re-enable task isolation if it wants to
and continue, or just exit or abort, etc.
If you don't enable "strict" mode, but you do have
task_isolation_debug enabled as a boot flag, you will at least get a
console dump with a backtrace and whatever other data we have.
(Sometimes the debug info actually includes a backtrace of the
interrupting core, if it's an IPI or TLB flush from another core,
which can be pretty useful.)
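The three cases above can be summarized as a small decision function.
This is a toy model with hypothetical names, not the patch's code, and it
assumes nothing about what the debug flag does when strict mode is also
enabled, since that combination is not described here:

```c
#include <stdbool.h>

/* Hypothetical action bits for an unexpected interrupt. */
enum iso_action {
	ISO_ACT_QUIESCE      = 1 << 0,	/* wait for ticks etc. to settle, return */
	ISO_ACT_DISABLE      = 1 << 1,	/* turn task isolation off */
	ISO_ACT_SIGNAL       = 1 << 2,	/* deliver the strict-mode signal */
	ISO_ACT_CONSOLE_DUMP = 1 << 3,	/* backtrace on the console */
};

/* What happens when an interrupt hits an isolated task. */
unsigned int iso_on_interrupt(bool strict, bool debug_boot_flag)
{
	if (strict)
		return ISO_ACT_DISABLE | ISO_ACT_SIGNAL;

	/* "Base" mode: take the interrupt, then re-quiesce. */
	unsigned int actions = ISO_ACT_QUIESCE;

	if (debug_boot_flag)
		actions |= ISO_ACT_CONSOLE_DUMP;
	return actions;
}
```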
> > "One-shot mode": A task requests isolation via prctl(), the kernel
> > ensures it is isolated on return from the prctl(), but then as soon as
> > it enters the kernel again, task isolation is switched off until
> > another prctl is issued.  This is what you recommended in your last
> > email.
>
> No, I think we can issue syscalls, for example. But asynchronous interruptions
> such as exceptions (actually somewhat synchronous but can be unexpected) and
> interrupts are what we want to avoid.
Hmm, so I think I'm not really understanding what you are suggesting.
We're certainly in agreement that avoiding interrupts and exceptions
is important.  I'm arguing that the way to deal with them is to
generate appropriate signals/printks, etc.  I'm not actually sure what
you're recommending we do to avoid exceptions.  Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them.  For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region.  I'd argue it's an application
bug; one should enable "strict" mode to catch and deal with such bugs.
(Typically the recommendation is to do an mlockall() before starting
task isolation mode, to handle the case of page faults.  But you can
do that and still be screwed by another thread in your process doing a
fork() and then your pages end up read-only for COW and you have to
fault them back in.  But, that's an application bug for a
task-isolation thread, and should just be treated as such.)
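As a sketch of that recommendation, an application might lock its memory
and touch its working buffers up front, before requesting isolation.  The
helper names below are hypothetical; mlockall() and sysconf() are the
standard calls.  As noted above, this does not protect against a later
fork() in the same process making the pages copy-on-write again:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Lock all current and future mappings into RAM.  Returns 0 on
 * success, or -1 with errno set (for example when RLIMIT_MEMLOCK
 * is too low). */
int lock_all_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Write to one byte in every page of buf so each page is faulted in
 * (and made writable) now rather than later.  Contents are preserved.
 * Returns the number of pages touched. */
size_t prefault(volatile unsigned char *buf, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t touched = 0;

	assert(page > 0);
	for (size_t off = 0; off < len; off += (size_t)page) {
		buf[off] = buf[off];	/* volatile: a real read and write */
		touched++;
	}
	return touched;
}
```

A typical order would be lock_all_memory(), then prefault() on each
buffer the isolated loop will use, and only then the request for task
isolation.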
> > There are a number of pros and cons to the two models.  I think on
> > balance I still like the "persistent mode" approach, but here's all
> > the pros/cons I can think of:
> >
> > PRO for persistent mode: A somewhat easier programming model.  Users
> > can just imagine "task isolation" as a way for them to still be able
> > to use the kernel exactly as they always have; it's just slower to get
> > back out of the kernel so you use it judiciously. For example, a
> > process is free to call write() on a socket to perform a diagnostic,
> > but when returning from the write() syscall, the kernel will hold the
> > task in kernel mode until any timer ticks (perhaps from networking
> > stuff) are complete, and then let it return to userspace to continue
> > in task isolation mode.
>
> So this is not hard isolation anymore. This is rather soft isolation with
> best efforts to avoid disturbance.
No, it's still hard isolation.  The distinction is that we offer a way
to get in and out of the kernel "safely" if you want to run in that
mode.  The syscalls can take a long time if the syscall ends up
requiring some additional timer ticks to finish sorting out whatever
it was you asked the kernel to do, but once you're back in userspace
you immediately regain "hard" isolation.  It's under program control.
Or, you can enable "strict" mode, and then you get hard isolation
without the ability to get in and out of the kernel at all: the kernel
just kills you if you try to leave hard isolation other than by an
explicit prctl().
> Surely we can have different levels of isolation.
Well, we have nohz_full now, and by adding task-isolation, we have
two.  Or three if you count "base" and "strict" mode task isolation as
two separate levels.
> I'm still wondering what to do if the task migrates to another CPU. In fact,
> perhaps what you're trying to do is rather a CPU property than a
> process property?
Well, we did go around on this issue once already (last August) and at
the time you were encouraging isolation to be a "task" property, not a
"cpu" property:
https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
You convinced me at the time :-)
You're right that migration conflicts with task isolation.  But
certainly, if a task has enabled "strict" semantics, it can't migrate;
it will lose task isolation entirely and get a signal instead,
regardless of whether it calls sched_setaffinity() on itself, or if
someone else changes its affinity and it gets a kick.
However, if a task doesn't have strict mode enabled, it can call
sched_setaffinity() and force itself onto a non-task_isolation cpu and
it won't get any isolation until it schedules itself back onto a
task_isolation cpu, at which point it wakes up on the new cpu with
hard isolation still in effect.  I can make up reasons why this sort
of thing might be useful, but it's probably a corner case.
However, this makes me wonder if "strict" mode should be the default
for task isolation?  That way task isolation really doesn't conflict
semantically with migration.  And we could provide a "weak" mode, or a
"kernel-friendly" mode, or some such nomenclature, and define the
migration semantics just for that case, where it makes it clear it's a
bit unusual.
> I think I heard about workloads that need such strict hard isolation.
> Workloads that really can not afford any disturbance. They even
> use userspace network stack. Maybe HFT?
Certainly HFT is one case.
A lot of TILE-Gx customers using task isolation (which we call
"dataplane" or "Zero-Overhead Linux") are doing high-speed network
applications with user-space networking stacks.  It can be DPDK, or it
can be another TCP/IP stack (we ship one called tStack) or it
could just be an application directly messing with the network
hardware from userspace.  These are exactly the applications that led
me into this part of kernel development in the first place.
Googling "Zero-Overhead Linux" does take you to some discussions
of customers that have used this functionality.
> > I think we can actually make both modes available to users with just
> > another flag bit, so maybe we can look at what that looks like in v11:
> > adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> > isolation at the next syscall entry, page fault, etc.  Then we can
> > think more specifically about whether we want to remove the flag or
> > not, and if we remove it, whether we want to make the code that was
> > controlled by it unconditionally true or unconditionally false
> > (i.e. remove it again).
>
> I think we shouldn't bother with strict hard isolation if we don't need
> it yet. The implementation may well be invasive. Let's wait for someone
> who really needs it.
I'm not sure what part of the patch series you're saying you don't
think we need yet.  I'd argue the whole patch series is "hard
isolation", and that the "strict" mode introduced in patch 06/13 isn't
particularly invasive.
> So your requirements are actually hard isolation but in userspace?
Yes, exactly.  Were you thinking about a kernel-level hard isolation?
That would have some similarities, I guess, but in some ways might
actually be a harder problem.
> And what happens if you get interrupted in userspace? What about page
> faults and other exceptions?
See above :-)
I hope we're converging here.  If you want to talk live or chat online
to help finish converging, perhaps that would make sense?  I'd be
happy to take notes and publish a summary of wherever we get to.
Thanks for taking the time to review this!
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
       [not found]                         ` <5707DDA8.10600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-04-12 18:41                           ` Chris Metcalf
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2016-04-12 18:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
On 4/8/2016 12:34 PM, Chris Metcalf wrote:
> However, this makes me wonder if "strict" mode should be the default
> for task isolation??  That way task isolation really doesn't conflict
> semantically with migration.  And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual. 
I noodled around with this and decided it was a better default,
so I've made the changes and pushed it up to the branch:
     git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
Now, by default when you enter task isolation mode, you are in
what I used to call "strict" mode, i.e. you can't use the kernel.
You can select a user-specified signal you want to deliver instead of
the default SIGKILL, and if you select signal 0, then you don't get
a signal at all and instead you get to keep running in task
isolation mode after making a syscall, page fault, etc.
Thus the API now looks like this in <linux/prctl.h>:
#define PR_SET_TASK_ISOLATION		48
#define PR_GET_TASK_ISOLATION		49
# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
I think this better matches what people should want to do in
their applications, and also matches the expectations people
have about what it means to go into task isolation mode in the
first place.
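As a rough illustration of how the flags word above is meant to be put together, here is a minimal sketch. The macro values are copied from the list above and are only proposed in this series (they are not in mainline headers), so the sketch defines them locally. The helper names task_isolation_flags() and task_isolation_signal() are hypothetical and exist only for this example; passing a negative signal number to mean "keep the default SIGKILL" is likewise an assumption of the sketch.

```c
#include <assert.h>
#include <signal.h>

/* Proposed values from this series; not present in mainline headers. */
#define PR_SET_TASK_ISOLATION		48
#define PR_GET_TASK_ISOLATION		49
# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))

/*
 * Hypothetical helper: build the flags word for entering task isolation.
 * sig < 0 keeps the default (SIGKILL); sig == 0 asks for no signal at all;
 * any other value selects a user-specified signal.
 */
static unsigned long task_isolation_flags(int sig)
{
	unsigned long flags = PR_TASK_ISOLATION_ENABLE;

	if (sig >= 0) {
		assert(sig <= 0x7f);	/* only 7 bits are reserved for the signal */
		flags |= PR_TASK_ISOLATION_USERSIG;
		flags |= (unsigned long)PR_TASK_ISOLATION_SET_SIG(sig);
	}
	return flags;
}

/*
 * Hypothetical helper: which signal a given flags word describes.
 * Without USERSIG the default SIGKILL applies; 0 means "no signal".
 */
static int task_isolation_signal(unsigned long flags)
{
	if (!(flags & PR_TASK_ISOLATION_USERSIG))
		return SIGKILL;
	return (int)PR_TASK_ISOLATION_GET_SIG(flags);
}
```

On a kernel carrying this series, the result would presumably be handed to prctl() with PR_SET_TASK_ISOLATION as the option; task_isolation_flags(0) yields the same bits as PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG.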
I got rid of the ONESHOT mode that I added in the v12 series, since
it didn't seem like it was what Frederic had been asking for anyway,
and it didn't seem particularly useful on its own.
Frederic, how does this align with your intuition for this stuff?
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 16:34                       ` Chris Metcalf
       [not found]                         ` <5707DDA8.10600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-04-22 13:16                         ` Frederic Weisbecker
  2016-04-25 20:36                           ` Chris Metcalf
  2016-05-26  1:07                         ` Frederic Weisbecker
  2 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-04-22 13:16 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers.
> 
> So here I'm taking "interrupt" to mean an external, asynchronous
> interrupt, from another core or device, or asynchronously triggered
> on the local core, like a timer interrupt.  By contrast I use "exception"
> or "fault" to refer to synchronous, locally-triggered interruptions.
Ok.
> So for interrupts, the short answer is, it's a bug! :-)
> 
> An interrupt could be a kernel bug, in which case we consider it a
> "true" bug.  This could be a timer interrupt occurring even after the
> task isolation code thought there were none pending, or a hardware
> device that incorrectly distributes interrupts to a task-isolation
> cpu, or a global IPI that should be sent to fewer cores, or a kernel
> TLB flush that could be deferred until the task-isolation task
> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
> bug.  I'm sure there are more such bugs that we can continue to fix
> going forward; it depends on how arbitrary you want to allow code
> running on other cores to be.  For example, can another core unload a
> kernel module without interrupting a task-isolation task?  Not right now.
> 
> Or, it could be an application bug: the standard example is if you
> have an application with task-isolated cores that also does occasional
> unmaps on another thread in the same process, on another core.  This
> causes TLB flush interrupts under application control.  The
> application shouldn't do this, and we tell our customers not to build
> their applications this way.  The typical way we encourage our
> customers to arrange this kind of "multi-threading" is by having a
> pure memory API between the task isolation threads and what are
> typically "control" threads running on non-task-isolated cores.  The
> two types of threads just both mmap some common, shared memory but run
> as different processes.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.
So if we take an interrupt that we didn't expect, we want to wait at the
end of that interrupt for things to quiesce again?
That doesn't look right. Things should be quiesced once and for all on
return from the initial prctl() call. We can't even expect to quiesce again
after an interruption; the tick can't be forced off anyway.
> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.
That sounds sensible.
> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)
Ok.
> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No, I think we can issue syscalls, for example. But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
> 
> Hmm, so I think I'm not really understanding what you are suggesting.
> 
> We're certainly in agreement that avoiding interrupts and exceptions
> is important.  I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc.  I'm not actually sure what
> you're recommending we do to avoid exceptions.  Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them.  For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region.  I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.
Ok, that looks right.
> 
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults.  But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in.  But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)
Ok.
> 
> >> There are a number of pros and cons to the two models.  I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model.  Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
> 
> No, it's still hard isolation.  The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode.  The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation.  It's under program control.
Yeah indeed, a task should be allowed to perform syscalls. So we can assume
that interrupts are fine when they fire in kernel mode.
> 
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().
That would be an extreme strict mode, yeah. We can still add such a mode later
if any user requests it.
Thanks.
(I'll reply to the rest of the email soonish)
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-22 13:16                         ` Frederic Weisbecker
@ 2016-04-25 20:36                           ` Chris Metcalf
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2016-04-25 20:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
On 4/22/2016 9:16 AM, Frederic Weisbecker wrote:
> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>    TL;DR: Let's make an explicit decision about whether task isolation
>>>>    should be "persistent" or "one-shot".  Both have some advantages.
>>>>    =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers.
>> So here I'm taking "interrupt" to mean an external, asynchronous
>> interrupt, from another core or device, or asynchronously triggered
>> on the local core, like a timer interrupt.  By contrast I use "exception"
>> or "fault" to refer to synchronous, locally-triggered interruptions.
> Ok.
>
>> So for interrupts, the short answer is, it's a bug! :-)
>>
>> An interrupt could be a kernel bug, in which case we consider it a
>> "true" bug.  This could be a timer interrupt occurring even after the
>> task isolation code thought there were none pending, or a hardware
>> device that incorrectly distributes interrupts to a task-isolation
>> cpu, or a global IPI that should be sent to fewer cores, or a kernel
>> TLB flush that could be deferred until the task-isolation task
>> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
>> bug.  I'm sure there are more such bugs that we can continue to fix
>> going forward; it depends on how arbitrary you want to allow code
>> running on other cores to be.  For example, can another core unload a
>> kernel module without interrupting a task-isolation task?  Not right now.
>>
>> Or, it could be an application bug: the standard example is if you
>> have an application with task-isolated cores that also does occasional
>> unmaps on another thread in the same process, on another core.  This
>> causes TLB flush interrupts under application control.  The
>> application shouldn't do this, and we tell our customers not to build
>> their applications this way.  The typical way we encourage our
>> customers to arrange this kind of "multi-threading" is by having a
>> pure memory API between the task isolation threads and what are
>> typically "control" threads running on non-task-isolated cores.  The
>> two types of threads just both mmap some common, shared memory but run
>> as different processes.
>>
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process.  This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> So if we take an interrupt that we didn't expect, we want to wait at the
> end of that interrupt for things to quiesce again?
I think it's actually pretty plausible.
Consider the "application bug" case, where you're running some code that does
packet dispatch to different cores.  If a core seems to back up you stop
dispatching packets to it.
Now, we get a TLB flush.  If handling the flush causes us to restart the tick
(maybe just as a side effect of entering the kernel in the first place) we
really are better off staying in the kernel until the tick is handled and
things are quiesced again.  That way, although we may end up dropping a
bunch of packets that were queued up to that core, we only do so ONCE - we
don't do it again when the tick fires a little bit later on, when the core
has already caught up and is claiming to be able to handle packets again.
Also, pragmatically, we would require a whole bunch of machinery in the
kernel to figure out whether we were returning from a syscall, an exception,
or an interrupt, and only skip the task-isolation work for interrupts.  We
don't actually have that information available to us at the moment we are
returning to userspace right now, so we'd need to add that tracking state
in each platform's code somehow.
> That doesn't look right. Things should be quiesced once and for all on
> return from the initial prctl() call. We can't even expect to quiesce again
> after an interruption; the tick can't be forced off anyway.
Yes, things are quiesced once and for all after prctl().  We also need to
be prepared to handle unexpected interrupts, though.  It's true that we can't
force the tick off, but as I suggested above, just waiting for the tick may
well be a better strategy than subjecting the application to another interrupt
after some fraction of a second.
>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> That would be an extreme strict mode, yeah. We can still add such a mode later
> if any user requests it.
So, humorously, I have become totally convinced that "extreme strict mode"
is really the right default for isolation.  It gives semantics that are easily
understandable: you stay in userspace until you do a prctl() to turn off
the flag, or exit(), or else the kernel kills you.  And, it's probably what
people want by default anyway for userspace driver code.  For code that
legitimately wants to make syscalls in this mode, you can just prctl() the
mode off, do whatever you need to do, then prctl() the mode back on again.
It's nominally a bit of overhead, but as a task-isolated application you
should be expecting tons of overhead from going into the kernel anyway.
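The "prctl() the mode off, do the work, prctl() it back on" pattern might look roughly like the sketch below. It assumes the PR_SET_TASK_ISOLATION option and PR_TASK_ISOLATION_ENABLE flag proposed in this thread, defined locally because they are not in mainline headers (where the calls are expected to fail with EINVAL). run_outside_isolation() and bump_counter() are hypothetical names used only for illustration.

```c
#include <sys/prctl.h>

/* Proposed in this series; not in mainline headers. */
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)

/*
 * Hypothetical helper: drop task isolation, run fn(arg) (which is free to
 * make syscalls), then request isolation again.  Returns fn's result and
 * sets *reenabled to 1 if the final prctl() succeeded, else 0.
 */
static int run_outside_isolation(int (*fn)(void *), void *arg, int *reenabled)
{
	int ret;

	/* Leave isolation explicitly so entering the kernel is not a violation. */
	(void)prctl(PR_SET_TASK_ISOLATION, 0UL, 0UL, 0UL, 0UL);

	ret = fn(arg);

	/* Ask for isolation again; a kernel without this series refuses. */
	*reenabled = prctl(PR_SET_TASK_ISOLATION,
			   (unsigned long)PR_TASK_ISOLATION_ENABLE,
			   0UL, 0UL, 0UL) == 0;
	return ret;
}

/* Stand-in for "whatever you need to do" while isolation is off. */
static int bump_counter(void *arg)
{
	int *counter = arg;

	return ++*counter;
}
```

The caller should check the reenabled result: if isolation cannot be re-established, continuing as though it had been would defeat the point of the mode.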
The "less extreme strict mode" is arguably reasonable if you want to allow
people to make occasional syscalls, but it has confusing performance
characteristics (sometimes the syscalls happen quickly, but sometimes they
take multiple ticks while we wait for interrupts to quiesce), and it has
confusing semantics (what happens if a third party re-affinitizes you to
a non-isolated core).  So I like the idea of just having a separate flag
(PR_TASK_ISOLATION_NOSIG) that tells the kernel to let the user play in
the kernel without getting killed.
> (I'll reply to the rest of the email soonish)
Thanks for the feedback.  It makes me feel like we may get there eventually :-)
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 16:34                       ` Chris Metcalf
       [not found]                         ` <5707DDA8.10600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-04-22 13:16                         ` Frederic Weisbecker
@ 2016-05-26  1:07                         ` Frederic Weisbecker
  2016-06-03 19:32                           ` Chris Metcalf
  2 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-05-26  1:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
I don't remember how much of this email I answered, but I need to finish that :-)
On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.
Good, although that quiescing on kernel return must be an option.
> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.
Good.
> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)
Right, I suggest we use trace events btw.
> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No, I think we can issue syscalls, for example. But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
> 
> Hmm, so I think I'm not really understanding what you are suggesting.
> 
> We're certainly in agreement that avoiding interrupts and exceptions
> is important.  I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc.
Yes.
> I'm not actually sure what
> you're recommending we do to avoid exceptions.  Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them.  For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region.  I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.
They are not all deterministic. For example a breakpoint, a step, a trap
can be set up by another process. So this is not entirely under the control
of the user.
> 
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults.  But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in.  But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)
Now how do you determine which exception is a bug and which is expected?
Strict mode should refuse all of them.
> >> There are a number of pros and cons to the two models.  I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model.  Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
> 
> No, it's still hard isolation.  The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode.  The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation.  It's under program control.
> 
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().
Well, hard isolation is what I would call strict mode.
> 
> >Surely we can have different levels of isolation.
> 
> Well, we have nohz_full now, and by adding task-isolation, we have
> two.  Or three if you count "base" and "strict" mode task isolation as
> two separate levels.
Right.
> 
> >I'm still wondering what to do if the task migrates to another CPU. In fact,
> >perhaps what you're trying to do is rather a CPU property than a
> >process property?
> 
> Well, we did go around on this issue once already (last August) and at
> the time you were encouraging isolation to be a "task" property, not a
> "cpu" property:
> 
> https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
> 
> You convinced me at the time :-)
Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
> 
> You're right that migration conflicts with task isolation.  But
> certainly, if a task has enabled "strict" semantics, it can't migrate;
> it will lose task isolation entirely and get a signal instead,
> regardless of whether it calls sched_setaffinity() on itself, or if
> someone else changes its affinity and it gets a kick.
Yes.
> 
> However, if a task doesn't have strict mode enabled, it can call
> sched_setaffinity() and force itself onto a non-task_isolation cpu and
> it won't get any isolation until it schedules itself back onto a
> task_isolation cpu, at which point it wakes up on the new cpu with
> hard isolation still in effect.  I can make up reasons why this sort
> of thing might be useful, but it's probably a corner case.
That doesn't look sane. The user asks the kernel to stay out of the way as
much as it can, but if we are on a non-nohz_full CPU we know we can't provide
that service (or rather that non-service).
So we would refuse to enter task isolation mode if the task isn't running on a
full dynticks CPU, yet accept that it later migrates to a periodic CPU?
This isn't consistent.
> 
> However, this makes me wonder if "strict" mode should be the default
> for task isolation??  That way task isolation really doesn't conflict
> semantically with migration.  And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual.
Well we can't really implement that strict mode until we fix the 1Hz issue, right?
Besides, is this something that anyone needs now?
> 
> >I think I heard about workloads that need such strict hard isolation.
> >Workloads that really can not afford any disturbance. They even
> >use userspace network stack. Maybe HFT?
> 
> Certainly HFT is one case.
> 
> A lot of TILE-Gx customers using task isolation (which we call
> "dataplane" or "Zero-Overhead Linux") are doing high-speed network
> applications with user-space networking stacks.  It can be DPDK, or it
> can be another TCP/IP stack (we ship one called tStack) or it
> could just be an application directly messing with the network
> hardware from userspace.  These are exactly the applications that led
> me into this part of kernel development in the first place.
> Googling "Zero-Overhead Linux" does take you to some discussions
> of customers that have used this functionality.
So those workloads couldn't stand an interrupt? Like they would want a signal
and to exit strict mode if one happens?
I think that we need to wait for somebody who explicitly requests that feature
before we work on it, so we can be sure the semantics really match someone's
real workload.
> 
> >> I think we can actually make both modes available to users with just
> >> another flag bit, so maybe we can look at what that looks like in v11:
> >> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> >> isolation at the next syscall entry, page fault, etc.  Then we can
> >> think more specifically about whether we want to remove the flag or
> >> not, and if we remove it, whether we want to make the code that was
> >> controlled by it unconditionally true or unconditionally false
> >> (i.e. remove it again).
> >
> >I think we shouldn't bother with strict hard isolation if we don't need
> >it yet. The implementation may well be invasive. Let's wait for someone
> >who really needs it.
> 
> I'm not sure what part of the patch series you're saying you don't
> think we need yet.  I'd argue the whole patch series is "hard
> isolation", and that the "strict" mode introduced in patch 06/13 isn't
> particularly invasive.
It's not in the patch series, I'm talking about the strict mode :-)
> 
> >So your requirements are actually hard isolation but in userspace?
> 
> Yes, exactly.  Were you thinking about a kernel-level hard isolation?
> That would have some similarities, I guess, but in some ways might
> actually be a harder problem.
> 
> >And what happens if you get interrupted in userspace? What about page
> >faults and other exceptions?
> 
> See above :-)
> 
> I hope we're converging here.  If you want to talk live or chat online
> to help finish converging, perhaps that would make sense?  I'd be
> happy to take notes and publish a summary of wherever we get to.
> 
> Thanks for taking the time to review this!
Ok, so thinking about that talk, I'm wondering if we need some flags
such as:
         ISOLATION_SIGNAL_SYSCALL
         ISOLATION_SIGNAL_EXCEPTIONS
         ISOLATION_SIGNAL_INTERRUPTS
Strict mode would be the three above OR'ed. It's just some random thoughts
but that would help define which level of kernel intrusion the user is ready
to tolerate.
I'm just not sure how granular we want that interface to be.
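To make the idea concrete, here is a rough sketch of how such flags could
compose. The bit values, the kernel_entry enum and the helper name are all
made up for illustration; only the three flag names come from the list above.

```c
#include <stdbool.h>

/* Hypothetical bit values; only the flag names appear in this thread. */
#define ISOLATION_SIGNAL_SYSCALL     (1u << 0)
#define ISOLATION_SIGNAL_EXCEPTIONS  (1u << 1)
#define ISOLATION_SIGNAL_INTERRUPTS  (1u << 2)

/* Strict mode: the three flags OR'ed, i.e. signal on any kernel entry. */
#define ISOLATION_STRICT (ISOLATION_SIGNAL_SYSCALL | \
                          ISOLATION_SIGNAL_EXCEPTIONS | \
                          ISOLATION_SIGNAL_INTERRUPTS)

/* Illustrative classification of how the task entered the kernel. */
enum kernel_entry { ENTRY_SYSCALL, ENTRY_EXCEPTION, ENTRY_INTERRUPT };

/* Return true if the task asked to be signaled for this kind of entry. */
static bool isolation_wants_signal(unsigned int flags, enum kernel_entry e)
{
    switch (e) {
    case ENTRY_SYSCALL:
        return (flags & ISOLATION_SIGNAL_SYSCALL) != 0;
    case ENTRY_EXCEPTION:
        return (flags & ISOLATION_SIGNAL_EXCEPTIONS) != 0;
    case ENTRY_INTERRUPT:
        return (flags & ISOLATION_SIGNAL_INTERRUPTS) != 0;
    }
    return false;
}
```

A task that tolerates syscalls but nothing else would pass
ISOLATION_SIGNAL_EXCEPTIONS | ISOLATION_SIGNAL_INTERRUPTS.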
> 
> -- 
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
> 
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-05-26  1:07                         ` Frederic Weisbecker
@ 2016-06-03 19:32                           ` Chris Metcalf
  2016-06-29 15:18                             ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-06-03 19:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> I don't remember how much I answered this email, but I need to finish that :-)
Sorry for the slow response - it's been a busy week.
> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>    TL;DR: Let's make an explicit decision about whether task isolation
>>>>    should be "persistent" or "one-shot".  Both have some advantages.
>>>>    =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers?
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process.  This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> Good, although that quiescing on kernel return must be an option.
Can you spell out why you think turning it off is helpful?  I'll admit
this is the default mode in the commercial version of task isolation
that we ship, and was also the default in the first LKML patch series.
But on consideration I haven't found scenarios where skipping the
quiescing is helpful.  Admittedly you get out of the kernel faster,
but then you're back in userspace and vulnerable to yet more
unexpected interrupts until the timer quiesces.  If you're asking for
task isolation, this is surely not what you want.
>> If you enable "strict" mode, we disable task isolation mode for that
>> core and deliver a signal to it.  This lets the application know that
>> an interrupt occurred, and it can take whatever kind of logging or
>> debugging action it wants to, re-enable task isolation if it wants to
>> and continue, or just exit or abort, etc.
> Good.
>
>> If you don't enable "strict" mode, but you do have
>> task_isolation_debug enabled as a boot flag, you will at least get a
>> console dump with a backtrace and whatever other data we have.
>> (Sometimes the debug info actually includes a backtrace of the
>> interrupting core, if it's an IPI or TLB flush from another core,
>> which can be pretty useful.)
> Right, I suggest we use trace events btw.
This is probably a good idea, although I wonder if it's worth deferring
until after the main patch series goes in - I'm reluctant to expand the scope
of this patch series and add more reasons for it to get delayed :-)
What do you think?
>>>> "One-shot mode": A task requests isolation via prctl(), the kernel
>>>> ensures it is isolated on return from the prctl(), but then as soon as
>>>> it enters the kernel again, task isolation is switched off until
>>>> another prctl is issued.  This is what you recommended in your last
>>>> email.
>>> No I think we can issue syscalls for example. But asynchronous interruptions
>>> such as exceptions (actually somewhat synchronous but can be unexpected) and
>>> interrupts are what we want to avoid.
>> Hmm, so I think I'm not really understanding what you are suggesting.
>>
>> We're certainly in agreement that avoiding interrupts and exceptions
>> is important.  I'm arguing that the way to deal with them is to
>> generate appropriate signals/printks, etc.
> Yes.
>
>> I'm not actually sure what
>> you're recommending we do to avoid exceptions.  Since they're
>> synchronous and deterministic, we can't really avoid them if the
>> program wants to issue them.  For example, mmap() some anonymous
>> memory and then start running, and you'll take exceptions each time
>> you touch a page in that mapped region.  I'd argue it's an application
>> bug; one should enable "strict" mode to catch and deal with such bugs.
> They are not all deterministic. For example a breakpoint, a step, a trap
> can be set up by another process. So this is not entirely under the control
> of the user.
That's true, but I'd argue the behavior in that case should be that you can
raise that kind of exception validly (so you can debug), and then you should
quiesce on return to userspace so the application doesn't see additional
exceptions.  There are two ways you could handle debugging:
1. Require the program to set the flag that says it doesn't want a signal
when it is interrupted (so you can interrupt it to debug it, and not kill it);
2. Or have debugging automatically set that flag in the target process.
Similarly, we could just say that if a debugger is attached, we never
generate the kill signal for task isolation.
>> (Typically the recommendation is to do an mlockall() before starting
>> task isolation mode, to handle the case of page faults.  But you can
>> do that and still be screwed by another thread in your process doing a
>> fork() and then your pages end up read-only for COW and you have to
>> fault them back in.  But, that's an application bug for a
>> task-isolation thread, and should just be treated as such.)
> Now how do you determine which exception is a bug and which is expected?
> Strict mode should refuse all of them.
Yes, exactly.  Task isolation will complain about everything. :-)
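For the mlockall() recommendation quoted above, a minimal userspace sketch
might look like the following. mlockall() and sysconf() are the standard
calls; the helper names are made up, and prefault() assumes the buffer is
page-aligned (otherwise a trailing partial page can be missed). It does
nothing about the fork()/COW case, which remains an application bug.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Lock all current and future mappings into RAM so they cannot be paged
 * out later.  Returns 0 on success, -1 with errno set (for example when
 * RLIMIT_MEMLOCK is too low).
 */
static int lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Touch one byte in every page of buf with a read followed by a write,
 * so each page is faulted in and made writable before isolation starts.
 * The contents are left unchanged.  Returns the number of pages touched.
 */
static size_t prefault(volatile unsigned char *buf, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page) {
        buf[off] = buf[off];
        touched++;
    }
    return touched;
}
```

The intended order is lock_all_memory(), then prefault() on the working
buffers, and only then the request for task isolation.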
>>>> There are a number of pros and cons to the two models.  I think on
>>>> balance I still like the "persistent mode" approach, but here's all
>>>> the pros/cons I can think of:
>>>>
>>>> PRO for persistent mode: A somewhat easier programming model.  Users
>>>> can just imagine "task isolation" as a way for them to still be able
>>>> to use the kernel exactly as they always have; it's just slower to get
>>>> back out of the kernel so you use it judiciously. For example, a
>>>> process is free to call write() on a socket to perform a diagnostic,
>>>> but when returning from the write() syscall, the kernel will hold the
>>>> task in kernel mode until any timer ticks (perhaps from networking
>>>> stuff) are complete, and then let it return to userspace to continue
>>>> in task isolation mode.
>>> So this is not hard isolation anymore. This is rather soft isolation with
>>> best efforts to avoid disturbance.
>> No, it's still hard isolation.  The distinction is that we offer a way
>> to get in and out of the kernel "safely" if you want to run in that
>> mode.  The syscalls can take a long time if the syscall ends up
>> requiring some additional timer ticks to finish sorting out whatever
>> it was you asked the kernel to do, but once you're back in userspace
>> you immediately regain "hard" isolation.  It's under program control.
>>
>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> Well, hard isolation is what I would call strict mode.
Here's what I am inclined towards:
  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.
  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
    return, no signals.  But asynchronous interrupts still cause a signal since they are
    not expected to occur.
  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
    on return to userspace, and asynchronous interrupts don't even cause a signal.
    It's basically "best effort", just nohz_full plus the code that tries to get things
    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
    "value add" to make this a separate mode, though.
>>> Surely we can have different levels of isolation.
>> Well, we have nohz_full now, and by adding task-isolation, we have
>> two.  Or three if you count "base" and "strict" mode task isolation as
>> two separate levels.
> Right.
>
>>> I'm still wondering what to do if the task migrates to another CPU. In fact,
>>> perhaps what you're trying to do is rather a CPU property than a
>>> process property?
>> Well, we did go around on this issue once already (last August) and at
>> the time you were encouraging isolation to be a "task" property, not a
>> "cpu" property:
>>
>> https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
>>
>> You convinced me at the time :-)
> Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
>> You're right that migration conflicts with task isolation.  But
>> certainly, if a task has enabled "strict" semantics, it can't migrate;
>> it will lose task isolation entirely and get a signal instead,
>> regardless of whether it calls sched_setaffinity() on itself, or if
>> someone else changes its affinity and it gets a kick.
> Yes.
>
>> However, if a task doesn't have strict mode enabled, it can call
>> sched_setaffinity() and force itself onto a non-task_isolation cpu and
>> it won't get any isolation until it schedules itself back onto a
>> task_isolation cpu, at which point it wakes up on the new cpu with
>> hard isolation still in effect.  I can make up reasons why this sort
>> of thing might be useful, but it's probably a corner case.
> That doesn't look sane. The user asks the kernel to get away as much
> as it can but if we are in a non-nohz-full CPU we know we can't provide that
> service (or rather that non-service).
>
> So we would refuse to enter task isolation mode if it doesn't run on a
> full dynticks CPU, whereas we accept that it migrates later to a periodic
> CPU? This isn't consistent.
Yes, and originally I made that consistent by not checking when it started
up, either, but I was subsequently convinced that the checks were good for
sanity.
Another answer is just to say that the full strict mode is the only mode, and
that if the task leaves userspace, it leaves task isolation mode until the mode
is re-enabled.  In the context of receiving a signal each time, this is more plausible.
You can always re-enable task isolation in the signal handler if you want.
I still suspect that the "hybrid" mode where you can leave userspace for things
like syscalls, but quiesce on return, is useful.  I agree that it leaves some question
about task migration.  We can refuse to honor a task's request to migrate itself
in that case, perhaps.  I don't know what to think about when someone else tries
to migrate the task - perhaps it only succeeds if the caller is root, and otherwise
fails, when the task is in task isolation mode?  It gets tricky and that's why I
was inclined to go with a simple "it always works, but it produces results
that you have to read the documentation to understand" (i.e. task isolation
mode goes dormant until you schedule back to a task isolation cpu).
On balance this is still the approach that I like best.
Which approach seems best to you?
>> However, this makes me wonder if "strict" mode should be the default
>> for task isolation??  That way task isolation really doesn't conflict
>> semantically with migration.  And we could provide a "weak" mode, or a
>> "kernel-friendly" mode, or some such nomenclature, and define the
>> migration semantics just for that case, where it makes it clear it's a
>> bit unusual.
> Well we can't really implement that strict mode until we fix the 1Hz issue, right?
> Besides, is this something that anyone needs now?
Certainly all of this is assuming that we have "solved" the 1Hz tick problem,
either by commenting out the max_deferment call, or at such time as we have
really fixed the underlying issues and removed the max deferment entirely.
At that point, I'm not sure it's a question of people needing strict mode per se;
I think it's more about picking the mode that is the best from both a user experience
and a quality of implementation perspective.
>>> I think I heard about workloads that need such strict hard isolation.
>>> Workloads that really can not afford any disturbance. They even
>>> use userspace network stack. Maybe HFT?
>> Certainly HFT is one case.
>>
>> A lot of TILE-Gx customers using task isolation (which we call
>> "dataplane" or "Zero-Overhead Linux") are doing high-speed network
>> applications with user-space networking stacks.  It can be DPDK, or it
>> can be another TCP/IP stack (we ship one called tStack) or it
>> could just be an application directly messing with the network
>> hardware from userspace.  These are exactly the applications that led
>> me into this part of kernel development in the first place.
>> Googling "Zero-Overhead Linux" does take you to some discussions
>> of customers that have used this functionality.
> So those workloads couldn't stand an interrupt? Like they would like a signal
> and exit the strict mode if it happens?
Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
be dropped and some kind of logging would fire to report the problem.
> I think that we need to wait for somebody who explicitly requests that feature
> before we work on it, so we can be sure the semantics really agree with someone's
> real load case.
This is really the scenario that Tilera's customers use, so I'm pretty familiar with
what they expect.
>>>> I think we can actually make both modes available to users with just
>>>> another flag bit, so maybe we can look at what that looks like in v11:
>>>> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
>>>> isolation at the next syscall entry, page fault, etc.  Then we can
>>>> think more specifically about whether we want to remove the flag or
>>>> not, and if we remove it, whether we want to make the code that was
>>>> controlled by it unconditionally true or unconditionally false
>>>> (i.e. remove it again).
>>> I think we shouldn't bother with strict hard isolation if we don't need
>>> it yet. The implementation may well be invasive. Let's wait for someone
>>> who really needs it.
>> I'm not sure what part of the patch series you're saying you don't
>> think we need yet.  I'd argue the whole patch series is "hard
>> isolation", and that the "strict" mode introduced in patch 06/13 isn't
>> particularly invasive.
> It's not in the patch series, I'm talking about the strict mode :-)
>
>>> So your requirements are actually hard isolation but in userspace?
>> Yes, exactly.  Were you thinking about a kernel-level hard isolation?
>> That would have some similarities, I guess, but in some ways might
>> actually be a harder problem.
>>
>>> And what happens if you get interrupted in userspace? What about page
>>> faults and other exceptions?
>> See above :-)
>>
>> I hope we're converging here.  If you want to talk live or chat online
>> to help finish converging, perhaps that would make sense?  I'd be
>> happy to take notes and publish a summary of wherever we get to.
>>
>> Thanks for taking the time to review this!
> Ok, so thinking about that talk, I'm wondering if we need some flags
> such as:
>
>           ISOLATION_SIGNAL_SYSCALL
>           ISOLATION_SIGNAL_EXCEPTIONS
>           ISOLATION_SIGNAL_INTERRUPTS
>
> Strict mode would be the three above OR'ed. It's just some random thoughts
> but that would help define which level of kernel intrusion the user is ready
> to tolerate.
>
> I'm just not sure how granular we want that interface to be.
Yes, you could certainly imagine being more granular.  For example, if you expected
to make syscalls but not receive exceptions or interrupts, that might be a useful
mode.  Or, you were willing to make syscalls and take exceptions, but not receive
interrupts.  (I think you should never be willing to receive asynchronous interrupts,
since that kind of defeats the purpose of task isolation in the first place.)
So maybe something like this:
PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
It might make sense to say you would allow page faults, for example, but not general
exceptions.  But my guess is that the exception-related stuff really does need an
application use case to account for it.  I would say for the initial support of task
isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
like generating diagnostics on error or slow paths), but not really a model for
understanding why users would want to take exceptions, so I'd say let's omit
that initially, and maybe just add the _ALLOW_SYSCALLS flag.
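A sketch of the argument check this would imply, purely for discussion: the
bit values are invented, and the policy of rejecting _ALLOW_EXCEPTIONS until
there is a use case is my assumption about how "omit that initially" might
be enforced, not something already in the patch series.

```c
#include <errno.h>

/* Hypothetical bit assignments for the flags proposed above. */
#define PR_TASK_ISOLATION_ENABLE            (1u << 0)
#define PR_TASK_ISOLATION_ALLOW_SYSCALLS    (1u << 1)
#define PR_TASK_ISOLATION_ALLOW_EXCEPTIONS  (1u << 2)  /* deferred for now */

/* Flags accepted by the initial version under this sketch. */
#define TASK_ISOLATION_INITIAL_FLAGS \
    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_ALLOW_SYSCALLS)

/*
 * Validate the flags argument of the prctl().  Returns 0 if acceptable,
 * -EINVAL otherwise.  A value of 0 means "turn task isolation off".
 */
static int task_isolation_check_flags(unsigned int flags)
{
    if (flags == 0)
        return 0;
    if (flags & ~TASK_ISOLATION_INITIAL_FLAGS)
        return -EINVAL;         /* unknown or not-yet-supported bit */
    if (!(flags & PR_TASK_ISOLATION_ENABLE))
        return -EINVAL;         /* ALLOW_* is meaningless without ENABLE */
    return 0;
}
```

Asynchronous interrupts deliberately have no ALLOW bit here, since tolerating
them defeats the purpose of the mode.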
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-06-03 19:32                           ` Chris Metcalf
@ 2016-06-29 15:18                             ` Frederic Weisbecker
  2016-07-01 20:59                               ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-06-29 15:18 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> >I don't remember how much I answered this email, but I need to finish that :-)
> 
> Sorry for the slow response - it's been a busy week.
I'm certainly much slower ;-)
> 
> >On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> >>On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >>>On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>>>   TL;DR: Let's make an explicit decision about whether task isolation
> >>>>   should be "persistent" or "one-shot".  Both have some advantages.
> >>>>   =====
> >>>>
> >>>>An important high-level issue is how "sticky" task isolation mode is.
> >>>>We need to choose one of these two options:
> >>>>
> >>>>"Persistent mode": A task switches state to "task isolation" mode
> >>>>(kind of a level-triggered analogy) and stays there indefinitely.  It
> >>>>can make a syscall, take a page fault, etc., if it wants to, but the
> >>>>kernel protects it from incurring any further asynchronous interrupts.
> >>>>This is the model I've been advocating for.
> >>>But then in this mode, what happens when an interrupt triggers?
> >>So what happens if an interrupt does occur?
> >>
> >>In the "base" task isolation mode, you just take the interrupt, then
> >>wait to quiesce any further kernel timer ticks, etc., and return to
> >>the process.  This at least limits the damage to being a single
> >>interruption rather than potentially additional ones, if the interrupt
> >>also caused timers to get queued, etc.
> >Good, although that quiescing on kernel return must be an option.
> 
> Can you spell out why you think turning it off is helpful?  I'll admit
> this is the default mode in the commercial version of task isolation
> that we ship, and was also the default in the first LKML patch series.
> But on consideration I haven't found scenarios where skipping the
> quiescing is helpful.  Admittedly you get out of the kernel faster,
> but then you're back in userspace and vulnerable to yet more
> unexpected interrupts until the timer quiesces.  If you're asking for
> task isolation, this is surely not what you want.
I just feel that quiescing, on the way back to user after an unwanted
interruption, is awkward. The quiescing should work once and for all
on return back from the prctl. If we still get disturbed afterward,
either the quiescing is buggy or incomplete, or something is in the
way that cannot be quiesced.
> 
> >>If you enable "strict" mode, we disable task isolation mode for that
> >>core and deliver a signal to it.  This lets the application know that
> >>an interrupt occurred, and it can take whatever kind of logging or
> >>debugging action it wants to, re-enable task isolation if it wants to
> >>and continue, or just exit or abort, etc.
> >Good.
> >
> >>If you don't enable "strict" mode, but you do have
> >>task_isolation_debug enabled as a boot flag, you will at least get a
> >>console dump with a backtrace and whatever other data we have.
> >>(Sometimes the debug info actually includes a backtrace of the
> >>interrupting core, if it's an IPI or TLB flush from another core,
> >>which can be pretty useful.)
> >Right, I suggest we use trace events btw.
> 
> This is probably a good idea, although I wonder if it's worth deferring
> until after the main patch series goes in - I'm reluctant to expand the scope
> of this patch series and add more reasons for it to get delayed :-)
> What do you think?
Yeah definitely, the patchset is big enough :-)
> 
> >>>>"One-shot mode": A task requests isolation via prctl(), the kernel
> >>>>ensures it is isolated on return from the prctl(), but then as soon as
> >>>>it enters the kernel again, task isolation is switched off until
> >>>>another prctl is issued.  This is what you recommended in your last
> >>>>email.
> >>>No I think we can issue syscalls for example. But asynchronous interruptions
> >>>such as exceptions (actually somewhat synchronous but can be unexpected) and
> >>>interrupts are what we want to avoid.
> >>Hmm, so I think I'm not really understanding what you are suggesting.
> >>
> >>We're certainly in agreement that avoiding interrupts and exceptions
> >>is important.  I'm arguing that the way to deal with them is to
> >>generate appropriate signals/printks, etc.
> >Yes.
> >
> >>I'm not actually sure what
> >>you're recommending we do to avoid exceptions.  Since they're
> >>synchronous and deterministic, we can't really avoid them if the
> >>program wants to issue them.  For example, mmap() some anonymous
> >>memory and then start running, and you'll take exceptions each time
> >>you touch a page in that mapped region.  I'd argue it's an application
> >>bug; one should enable "strict" mode to catch and deal with such bugs.
> >They are not all deterministic. For example a breakpoint, a step, a trap
> >can be set up by another process. So this is not entirely under the control
> >of the user.
> 
> That's true, but I'd argue the behavior in that case should be that you can
> raise that kind of exception validly (so you can debug), and then you should
> quiesce on return to userspace so the application doesn't see additional
> exceptions.
I don't see how we can quiesce such things.
> There are two ways you could handle debugging:
> 
> 1. Require the program to set the flag that says it doesn't want a signal
> when it is interrupted (so you can interrupt it to debug it, and not kill it);
That's rather about exceptions, right?
> 
> 2. Or have debugging automatically set that flag in the target process.
> Similarly, we could just say that if a debugger is attached, we never
> generate the kill signal for task isolation.
> 
> >>(Typically the recommendation is to do an mlockall() before starting
> >>task isolation mode, to handle the case of page faults.  But you can
> >>do that and still be screwed by another thread in your process doing a
> >>fork() and then your pages end up read-only for COW and you have to
> >>fault them back in.  But, that's an application bug for a
> >>task-isolation thread, and should just be treated as such.)
> >Now how do you determine which exception is a bug and which is expected?
> >Strict mode should refuse all of them.
> 
> Yes, exactly.  Task isolation will complain about everything. :-)
Ok :-)
> 
> >>>>There are a number of pros and cons to the two models.  I think on
> >>>>balance I still like the "persistent mode" approach, but here's all
> >>>>the pros/cons I can think of:
> >>>>
> >>>>PRO for persistent mode: A somewhat easier programming model.  Users
> >>>>can just imagine "task isolation" as a way for them to still be able
> >>>>to use the kernel exactly as they always have; it's just slower to get
> >>>>back out of the kernel so you use it judiciously. For example, a
> >>>>process is free to call write() on a socket to perform a diagnostic,
> >>>>but when returning from the write() syscall, the kernel will hold the
> >>>>task in kernel mode until any timer ticks (perhaps from networking
> >>>>stuff) are complete, and then let it return to userspace to continue
> >>>>in task isolation mode.
> >>>So this is not hard isolation anymore. This is rather soft isolation with
> >>>best efforts to avoid disturbance.
> >>No, it's still hard isolation.  The distinction is that we offer a way
> >>to get in and out of the kernel "safely" if you want to run in that
> >>mode.  The syscalls can take a long time if the syscall ends up
> >>requiring some additional timer ticks to finish sorting out whatever
> >>it was you asked the kernel to do, but once you're back in userspace
> >>you immediately regain "hard" isolation.  It's under program control.
> >>
> >>Or, you can enable "strict" mode, and then you get hard isolation
> >>without the ability to get in and out of the kernel at all: the kernel
> >>just kills you if you try to leave hard isolation other than by an
> >>explicit prctl().
> >Well, hard isolation is what I would call strict mode.
> 
> Here's what I am inclined towards:
> 
>  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.
Ok.
> 
>  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
>    return, no signals.  But asynchronous interrupts still cause a signal since they are
>    not expected to occur.
So only interrupts cause a signal in this mode? Exceptions and syscalls are permitted, right?
> 
>  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
>    on return to userspace, and asynchronous interrupts don't even cause a signal.
>    It's basically "best effort", just nohz_full plus the code that tries to get things
>    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
>    "value add" to make this a separate mode, though.
I can imagine HPC users wanting this mode.
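To check that I read the three modes quoted above correctly, here they are
as a small decision table. All the enum and function names are made up for
illustration; the behavior per mode is taken from the descriptions above.

```c
/* Hypothetical names for the three modes described above. */
enum iso_mode { ISO_STRICT, ISO_NO_SIGNAL, ISO_SOFT };

/* How the task left userspace. */
enum iso_event { EV_SYSCALL, EV_EXCEPTION, EV_INTERRUPT };

enum iso_action {
    ACT_SIGNAL,   /* signal the task for leaving userspace */
    ACT_QUIESCE,  /* hold the task in the kernel until things are quiet */
    ACT_NONE,     /* best effort only: no signal, no quiescing */
};

/* Map (mode, event) to what the kernel would do under this reading. */
static enum iso_action iso_action_for(enum iso_mode mode, enum iso_event ev)
{
    switch (mode) {
    case ISO_STRICT:
        /* Any exit from userspace gets a signal, no exceptions. */
        return ACT_SIGNAL;
    case ISO_NO_SIGNAL:
        /* Synchronous entries are fine; async interrupts still signal. */
        return ev == EV_INTERRUPT ? ACT_SIGNAL : ACT_QUIESCE;
    case ISO_SOFT:
        /* nohz_full plus best-effort cleanup, nothing enforced. */
        return ACT_NONE;
    }
    return ACT_SIGNAL;  /* unknown mode: fall back to the strictest case */
}
```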
> 
> >>>Surely we can have different levels of isolation.
> >>Well, we have nohz_full now, and by adding task-isolation, we have
> >>two.  Or three if you count "base" and "strict" mode task isolation as
> >>two separate levels.
> >Right.
> >
> >>>I'm still wondering what to do if the task migrates to another CPU. In fact,
> >>>perhaps what you're trying to do is rather a CPU property than a
> >>>process property?
> >>Well, we did go around on this issue once already (last August) and at
> >>the time you were encouraging isolation to be a "task" property, not a
> >>"cpu" property:
> >>
> >>https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
> >>
> >>You convinced me at the time :-)
> >Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
> >>You're right that migration conflicts with task isolation.  But
> >>certainly, if a task has enabled "strict" semantics, it can't migrate;
> >>it will lose task isolation entirely and get a signal instead,
> >>regardless of whether it calls sched_setaffinity() on itself, or if
> >>someone else changes its affinity and it gets a kick.
> >Yes.
> >
> >>However, if a task doesn't have strict mode enabled, it can call
> >>sched_setaffinity() and force itself onto a non-task_isolation cpu and
> >>it won't get any isolation until it schedules itself back onto a
> >>task_isolation cpu, at which point it wakes up on the new cpu with
> >>hard isolation still in effect.  I can make up reasons why this sort
> >>of thing might be useful, but it's probably a corner case.
> >That doesn't look sane. The user asks the kernel to get away as much
> >as it can but if we are in a non-nohz-full CPU we know we can't provide that
> >service (or rather that non-service).
> >
> >So we would refuse to enter task isolation mode if it doesn't run on a
> >full dynticks CPU, whereas we accept that it migrates later to a periodic
> >CPU? This isn't consistent.
> 
> Yes, and originally I made that consistent by not checking when it started
> up, either, but I was subsequently convinced that the checks were good for
> sanity.
Sure, sanity checks are good, but if you refuse the prctl by returning an error
on the basis of this sanity condition, the task shouldn't be able to later reach
that insane state without being properly kicked out of the feature provided by
the prctl().
Otherwise perhaps just drop a warning.
> 
> Another answer is just to say that the full strict mode is the only mode, and
> that if the task leaves userspace, it leaves task isolation mode until the mode
> is re-enabled.  In the context of receiving a signal each time, this is more plausible.
> You can always re-enable task isolation in the signal handler if you want.
I would be afraid that, on workloads that can live with a few interrupts, those signals
would be a burden.
> 
> I still suspect that the "hybrid" mode where you can leave userspace for things
> like syscalls, but quiesce on return, is useful.  I agree that it leaves some question
> about task migration.  We can refuse to honor a task's request to migrate itself
> in that case, perhaps.  I don't know what to think about when someone else tries
> to migrate the task - perhaps it only succeeds if the caller is root, and otherwise
> fails, when the task is in task isolation mode?  It gets tricky and that's why I
> was inclined to go with a simple "it always works, but it produces results
> that you have to read the documentation to understand" (i.e. task isolation
> mode goes dormant until you schedule back to a task isolation cpu).
> On balance this is still the approach that I like best.
> 
> Which approach seems best to you?
Indeed, forbidding the task from running on a non-nohz-full CPU would be very tricky.
We would need to take care of all possible races, which needs to be done under the
rq lock, so it requires complicating scheduler internals. And eventually, if the
CPU gets offlined, we still need to find the task a place to run. Moreover this
raises some privilege issues.
That's not quite an option so this leaves two others:
* Make sure that as soon as the task gets scheduled onto a non-nohz-full CPU, it loses
  the flag and gets a signal. That's possible but again it requires some scheduler
  internals.
* Just don't care and schedule the task anywhere, it will be warned soon enough about
  the problem.
The last one looks like a viable and simple enough solution.
> 
> >>However, this makes me wonder if "strict" mode should be the default
> >>for task isolation??  That way task isolation really doesn't conflict
> >>semantically with migration.  And we could provide a "weak" mode, or a
> >>"kernel-friendly" mode, or some such nomenclature, and define the
> >>migration semantics just for that case, where it makes it clear it's a
> >>bit unusual.
> >Well we can't really implement that strict mode until we fix the 1Hz issue, right?
> >Besides, is this something that anyone needs now?
> 
> Certainly all of this is assuming that we have "solved" the 1Hz tick problem,
> either by commenting out the max_deferment call, or at such time as we have
> really fixed the underlying issues and remove the max deferment entirely.
> 
> At that point, I'm not sure it's a question of people needing strict mode per se;
> I think it's more about picking the mode that is the best from both a user experience
> and a quality of implementation perspective.
Sure, ideally we need to start with the mode that people need most and leave room
in the interface for extension.
> 
> >>>I think I heard about workloads that need such strict hard isolation.
> >>>Workloads that really can not afford any disturbance. They even
> >>>use userspace network stack. Maybe HFT?
> >>Certainly HFT is one case.
> >>
> >>A lot of TILE-Gx customers using task isolation (which we call
> >>"dataplane" or "Zero-Overhead Linux") are doing high-speed network
> >>applications with user-space networking stacks.  It can be DPDK, or it
> >>can be another TCP/IP stack (we ship one called tStack) or it
> >>could just be an application directly messing with the network
> >>hardware from userspace.  These are exactly the applications that led
> >>me into this part of kernel development in the first place.
> >>Googling "Zero-Overhead Linux" does take you to some discussions
> >>of customers that have used this functionality.
> >So those workloads couldn't stand an interrupt? Like they would like a signal
> >and exit the strict mode if it happens?
> 
> Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
> be dropped and some kind of logging would fire to report the problem.
Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> >I think that we need to wait for somebody who explicitly request that feature
> >before we work on it, so we get sure the semantics really agree with someone's
> >real load case.
> 
> This is really the scenario that Tilera's customers use, so I'm pretty familiar with
> what they expect.
Ok, so let's take that direction.
> 
> >Ok, so thinking about that talk, I'm wondering if we need some flags
> >such as:
> >
> >          ISOLATION_SIGNAL_SYSCALL
> >          ISOLATION_SIGNAL_EXCEPTIONS
> >          ISOLATION_SIGNAL_INTERRUPTS
> >
> >Strict mode would be the three above OR'ed. It's just some random thoughts
> >but that would help define which level of kernel intrusion the user is ready
> >to tolerate.
> >
> >I'm just not sure how granular we want that interface to be.
> 
> Yes, you could certainly imagine being more granular.  For example, if you expected
> to make syscalls but not receive exceptions or interrupts, that might be a useful
> mode.  Or, you were willing to make syscalls and take exceptions, but not receive
> interrupts.  (I think you should never be willing to receive asynchronous interrupts,
> since that kind of defeats the purpose of task isolation in the first place.)
> 
> So maybe something like this:
> 
> PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
> PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
> PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
> 
> It might make sense to say you would allow page faults, for example, but not general
> exceptions.  But my guess is that the exception-related stuff really does need an
> application use case to account for it.  I would say for the initial support of task
> isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
> like generating diagnostics on error or slow paths), but not really a model for
> understanding why users would want to take exceptions, so I'd say let's omit
> that initially, and maybe just add the _ALLOW_SYSCALLS flag.
Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
does strict pure isolation mode and have future flags for more granularity.
I guess the last thing I'm uncomfortable with is the quiescing that needs to be re-done
every time we get interrupted.
Thanks.
> 
> -- 
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
> 
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-06-29 15:18                             ` Frederic Weisbecker
@ 2016-07-01 20:59                               ` Chris Metcalf
       [not found]                                 ` <25c4ace1-6903-abb3-59e9-aedc11ac32fc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2016-07-01 20:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel
On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
>> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
>>> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>>>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>>>   TL;DR: Let's make an explicit decision about whether task isolation
>>>>>>   should be "persistent" or "one-shot".  Both have some advantages.
>>>>>>   =====
>>>>>>
>>>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>>>> We need to choose one of these two options:
>>>>>>
>>>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>>>> This is the model I've been advocating for.
>>>>> But then in this mode, what happens when an interrupt triggers.
>>>> So what happens if an interrupt does occur?
>>>>
>>>> In the "base" task isolation mode, you just take the interrupt, then
>>>> wait to quiesce any further kernel timer ticks, etc., and return to
>>>> the process.  This at least limits the damage to being a single
>>>> interruption rather than potentially additional ones, if the interrupt
>>>> also caused timers to get queued, etc.
>>> Good, although that quiescing on kernel return must be an option.
>>
>> Can you spell out why you think turning it off is helpful?  I'll admit
>> this is the default mode in the commercial version of task isolation
>> that we ship, and was also the default in the first LKML patch series.
>> But on consideration I haven't found scenarios where skipping the
>> quiescing is helpful.  Admittedly you get out of the kernel faster,
>> but then you're back in userspace and vulnerable to yet more
>> unexpected interrupts until the timer quiesces.  If you're asking for
>> task isolation, this is surely not what you want.
>
> I just feel that quiescing, on the way back to user after an unwanted
> interruption, is awkward. The quiescing should work once and for all
> on return back from the prctl. If we still get disturbed afterward,
> either the quiescing is buggy or incomplete, or something is on the
> way that can not be quiesced.
If we are thinking of an initial implementation that doesn't allow any
subsequent kernel entry to be valid, then this all gets much easier,
since any subsequent kernel entry except for a prctl() syscall will
result in a signal, which will turn off task isolation, and we will
never have to worry about additional quiescing.  I think that's where
we got from the discussion at the bottom of this email.
So for your question here, we're really just thinking about future
directions as far as how to handle interrupts, and if in the future we
add support for allowing syscalls and/or exceptions without leaving
task isolation mode, then we have to think about how that interacts
with interrupts.  The problem is that it's hard to tell, as you're
returning to userspace, whether you're returning from an exception or
an interrupt; you typically don't have that information available.  So
from a purely ease-of-implementation perspective, we'd likely want to
handle exceptions and interrupts the same way, and quiesce both.
In general, I think it would also be a better explanation to users of
task isolation to say "every enter/exit to the kernel is either an
error that causes a signal, or it quiesces on return".  It's a simpler
semantic, and I think it also is better for interrupts anyway, since
it potentially avoids multiple interrupts to the application (whatever
interrupted to begin with, plus potential timer interrupts later).
But that said, if we start with "pure strict" mode only, all of this
becomes hypothetical, and we may in fact choose never to allow "safe"
modes of entering the kernel.
>>>> I'm not actually sure what
>>>> you're recommending we do to avoid exceptions.  Since they're
>>>> synchronous and deterministic, we can't really avoid them if the
>>>> program wants to issue them.  For example, mmap() some anonymous
>>>> memory and then start running, and you'll take exceptions each time
>>>> you touch a page in that mapped region.  I'd argue it's an application
>>>> bug; one should enable "strict" mode to catch and deal with such bugs.
>>> They are not all deterministic. For example a breakpoint, a step, a trap
>>> can be set up by another process. So this is not entirely under the control
>>> of the user.
>>
>> That's true, but I'd argue the behavior in that case should be that you can
>> raise that kind of exception validly (so you can debug), and then you should
>> quiesce on return to userspace so the application doesn't see additional
>> exceptions.
>
> I don't see how we can quiesce such things.
I'm imagining task A is in dataplane mode, and task B wants to debug
it by writing a breakpoint into its text.  When task A hits the
breakpoint, it will enter the kernel, and hold there while task B
pokes at it with ptrace.  When task A finally is allowed to return to
userspace, it should quiesce before entering userspace in case any
timer interrupts got scheduled (again, maybe due to softirqs or
whatever, or random other kernel activity targeting that core while it
was in the kernel, or whatever).  This is just the same kind of
quiescing we do on return from the initial prctl().
With a "pure strict" mode it does get a little tricky, since we will
end up killing task A as it comes back from its breakpoint.  We might
just choose to say that task A should not enable task isolation if it
is going to be debugged (some runtime switch).  This isn't really a
great solution; I do kind of feel that the nicest thing to do is
quiesce the task again at this point.  This feels like the biggest
argument in favor of supporting a mode where a task-isolated task can
safely enter the kernel for exceptions.  What do you think?
>> There are two ways you could handle debugging:
>>
>> 1. Require the program to set the flag that says it doesn't want a signal
>> when it is interrupted (so you can interrupt it to debug it, and not kill it);
>
> That's rather about exceptions, right?
Yes, with the task A/task B example above, you're right.  I was
thinking there was a kick given by task B to task A.  I think that
might even be true in some circumstances, but anyway, it's a detail.
>> Here's what I am inclined towards:
>>
>>  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.
>
> Ok.
>
>>
>>  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
>>    return, no signals.  But asynchronous interrupts still cause a signal since they are
>>    not expected to occur.
>
> So only interrupt cause a signal in this mode? Exceptions and syscalls are permitted, right?
Yes, correct.
>>  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
>>    on return to userspace, and asynchronous interrupts don't even cause a signal.
>>    It's basically "best effort", just nohz_full plus the code that tries to get things
>>    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
>>    "value add" to make this a separate mode, though.
>
> I can imagine HPC to be willing this mode.
Yes, perhaps.  I'm not convinced we want to target HPC without a much
clearer sense of why this is better than nohz_full, though.  I fear
people might think "task isolation" is better by definition and not
think too much about it, but I'm really not sure it is better for the
HPC use case, necessarily.
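The modes discussed above can be summarized as a small decision table. The following is only a sketch: the flag values, the `isol_disposition()` helper and the enum names are hypothetical and do not come from the patch series. It also ignores the special case that prctl() itself must always be allowed.

```c
#include <assert.h>

/* Hypothetical flag values, named after the flags proposed in this thread. */
#define ISOL_ENABLE            (1u << 0)
#define ISOL_ALLOW_SYSCALLS    (1u << 1)
#define ISOL_ALLOW_EXCEPTIONS  (1u << 2)

enum entry_kind  { ENTRY_SYSCALL, ENTRY_EXCEPTION, ENTRY_INTERRUPT };
enum disposition { DISP_NONE, DISP_QUIESCE, DISP_SIGNAL };

/* Decide what happens when a task enters the kernel for the given reason. */
enum disposition isol_disposition(unsigned int flags, enum entry_kind kind)
{
	if (!(flags & ISOL_ENABLE))
		return DISP_NONE;	/* task is not isolated: nothing special */

	switch (kind) {
	case ENTRY_SYSCALL:
		/* Permitted syscalls quiesce on return; otherwise signal. */
		return (flags & ISOL_ALLOW_SYSCALLS) ? DISP_QUIESCE : DISP_SIGNAL;
	case ENTRY_EXCEPTION:
		return (flags & ISOL_ALLOW_EXCEPTIONS) ? DISP_QUIESCE : DISP_SIGNAL;
	case ENTRY_INTERRUPT:
		/* Asynchronous interrupts are never expected: always signal. */
		return DISP_SIGNAL;
	}
	assert(!"unknown entry kind");
	return DISP_SIGNAL;
}
```

With only `ISOL_ENABLE` set this models the default "strict" mode; setting both allow flags models the "no signal" mode. The "soft" mode is deliberately not represented.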
>>>> You're right that migration conflicts with task isolation.  But
>>>> certainly, if a task has enabled "strict" semantics, it can't migrate;
>>>> it will lose task isolation entirely and get a signal instead,
>>>> regardless of whether it calls sched_setaffinity() on itself, or if
>>>> someone else changes its affinity and it gets a kick.
>>> Yes.
>>>
>>>> However, if a task doesn't have strict mode enabled, it can call
>>>> sched_setaffinity() and force itself onto a non-task_isolation cpu and
>>>> it won't get any isolation until it schedules itself back onto a
>>>> task_isolation cpu, at which point it wakes up on the new cpu with
>>>> hard isolation still in effect.  I can make up reasons why this sort
>>>> of thing might be useful, but it's probably a corner case.
>>> That doesn't look sane. The user asks the kernel to get away as much
>>> as it can but if we are in a non-nohz-full CPU we know we can't provide that
>>> service (or rather that non-service).
>>>
>>> So we would refuse to enter in task isolation mode if it doesn't run in a
>>> full dynticks CPUs whereas we accept that it migrates later to a periodic
>>> CPU?. This isn't consistent.
>>
>> Yes, and originally I made that consistent by not checking when it started
>> up, either, but I was subsequently convinced that the checks were good for
>> sanity.
>
> Sure sanity checks are good but if you refuse the prctl with returning an error
> on the basis of this sanity condition, the task shouldn't be able to later reach
> that insanity state without being properly kicked out of the feature provided by
> the prctl().
>
> Otherwise perhaps just drop a warning.
Are you saying that we should printk a warning in the prctl() rather
than returning an error in the case where it's not on a full dynticks
cpu?  I could be convinced by that just to keep things consistent.
How about doing it this way?  If you invoke prctl() with the default
"strict" mode where any kernel entry results in a signal, the prctl()
will be strict, and require you to be affinitized to a single, full
dynticks cpu.
But, if you enable the "allow syscalls" mode, then the prctl isn't
strict either, since you can use syscalls to get into a state where
you're not on a full dynticks cpu, and you just get a console warning
if you enter task isolation on the wrong cpu.  (Of course, we may end
up not doing the "allow syscalls" mode for the first version of this
patch anyway, as we discuss below.)
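As a rough sketch of the check proposed above: CPUs are modelled as bits in a 64-bit mask, and the helper name, the flag values and the choice of -EINVAL as the error code are all assumptions made for illustration, not taken from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, named after the flags proposed in this thread. */
#define ISOL_ENABLE          (1u << 0)
#define ISOL_ALLOW_SYSCALLS  (1u << 1)

/*
 * Model of the prctl() entry check.
 * Returns 0 if isolation may be entered, -EINVAL if it is refused.
 * *warned is set to 1 when entry succeeds with only a console warning.
 */
int isol_check_entry(uint64_t affinity, uint64_t nohz_full,
		     unsigned int flags, int *warned)
{
	assert(warned);
	*warned = 0;

	/* Exactly one CPU in the affinity mask, and that CPU is nohz_full. */
	int single = affinity != 0 && (affinity & (affinity - 1)) == 0;
	int ok = single && (affinity & nohz_full) != 0;

	if (ok)
		return 0;

	/* "Allow syscalls" mode: the task could migrate later anyway. */
	if (flags & ISOL_ALLOW_SYSCALLS) {
		*warned = 1;
		return 0;
	}

	/* Default strict mode: refuse the request. */
	return -EINVAL;
}
```

This covers only the check made at prctl() time. It does not address later affinity changes, which is the race discussed elsewhere in the thread.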
>>>> Googling "Zero-Overhead Linux" does take you to some discussions
>>>> of customers that have used this functionality.
>>> So those workloads couldn't stand an interrupt? Like they would like a signal
>>> and exit the strict mode if it happens?
>>
>> Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
>> be dropped and some kind of logging would fire to report the problem.
>
> Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
In this mode we don't worry about quiescing for interrupts, since we
are generating a signal, and when you send a signal, you first have to
disable task isolation mode to avoid getting into various bad states
(sending too many signals, or worse, getting deadlocked because you
are signalling the task BECAUSE it was about to receive a signal).  So
we only quiesce after syscalls/exceptions.
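To illustrate the ordering described above, here is a toy model in which isolation is dropped before the signal is queued, so repeated violations cannot generate repeated signals. The struct and function names are hypothetical, not the kernel's.

```c
#include <assert.h>

/* Toy model of the per-task state relevant to signalling. */
struct isol_task {
	int isolated;		/* 1 while task isolation is active */
	int signals_sent;	/* isolation-violation signals queued so far */
};

/* Called on any kernel entry that violates the isolation contract. */
void isol_report_violation(struct isol_task *t)
{
	assert(t);

	if (!t->isolated)
		return;		/* already kicked out: do not signal again */

	t->isolated = 0;	/* disable isolation first ... */
	t->signals_sent++;	/* ... then queue exactly one signal */
}
```

Because the flag is cleared first, any kernel work done while delivering the signal is no longer treated as a violation.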
>> So maybe something like this:
>>
>> PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
>> PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
>> PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
>>
>> It might make sense to say you would allow page faults, for example, but not general
>> exceptions.  But my guess is that the exception-related stuff really does need an
>> application use case to account for it.  I would say for the initial support of task
>> isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
>> like generating diagnostics on error or slow paths), but not really a model for
>> understanding why users would want to take exceptions, so I'd say let's omit
>> that initially, and maybe just add the _ALLOW_SYSCALLS flag.
>
> Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> does strict pure isolation mode and have future flags for more granularity.
I think just implementing the basic _ENABLE mode with pure strict task
isolation makes sense for now.  We can wait to enable syscalls or
exceptions until we have a better use case.  Meanwhile, even without
support for allowing syscalls, you can always use prctl() to turn off
task isolation, and then you can do your syscalls, and prctl() it back
on again.  prctl() to disable task isolation always has to work :-)
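The disable, syscall, re-enable pattern can be sketched with a small userspace-side model. Nothing here calls the real prctl(); the flag value and all `model_*` names are hypothetical stand-ins used to show the intended semantics of pure strict mode.

```c
#include <assert.h>

#define PR_TASK_ISOLATION_ENABLE (1u << 0)	/* hypothetical value */

struct task_model {
	unsigned int isol_flags;	/* current task isolation flags */
	int signals;			/* violation signals delivered */
};

/* Models the isolation prctl(): always allowed, never signals. */
void model_prctl_set_isolation(struct task_model *t, unsigned int flags)
{
	assert(t);
	t->isol_flags = flags;
}

/* Models any other syscall: in strict mode it signals and drops isolation. */
void model_syscall(struct task_model *t)
{
	assert(t);
	if (t->isol_flags & PR_TASK_ISOLATION_ENABLE) {
		t->isol_flags = 0;
		t->signals++;
	}
}

/* Slow path: leave isolation, do the syscalls, then re-enter isolation. */
void model_slow_path(struct task_model *t, int nr_syscalls)
{
	model_prctl_set_isolation(t, 0);
	for (int i = 0; i < nr_syscalls; i++)
		model_syscall(t);
	model_prctl_set_isolation(t, PR_TASK_ISOLATION_ENABLE);
}
```

In this model a slow path of any length costs two prctl() calls and no signals, while a stray syscall made under isolation costs one signal and ends isolation.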
Or, if we want to make it easy to do debugging, and as a result maybe
also support the plausible mode where task-isolation tasks make
occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
above implies syscalls as well, and support that mode.  Perhaps that
makes the most sense...
I'll spin it as a new patch series and you can take a look.
Thanks!
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
* Re: [PATCH v9 04/13] task_isolation: add initial support
       [not found]                                 ` <25c4ace1-6903-abb3-59e9-aedc11ac32fc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-07-05 14:41                                   ` Frederic Weisbecker
  2016-07-05 17:47                                     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 14:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
On Fri, Jul 01, 2016 at 04:59:26PM -0400, Chris Metcalf wrote:
> On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> >
> >I just feel that quiescing, on the way back to user after an unwanted
> >interruption, is awkward. The quiescing should work once and for all
> >on return back from the prctl. If we still get disturbed afterward,
> >either the quiescing is buggy or incomplete, or something is on the
> >way that can not be quiesced.
> 
> If we are thinking of an initial implementation that doesn't allow any
> subsequent kernel entry to be valid, then this all gets much easier,
> since any subsequent kernel entry except for a prctl() syscall will
> result in a signal, which will turn off task isolation, and we will
> never have to worry about additional quiescing.  I think that's where
> we got from the discussion at the bottom of this email.
Right.
> 
> So for your question here, we're really just thinking about future
> directions as far as how to handle interrupts, and if in the future we
> add support for allowing syscalls and/or exceptions without leaving
> task isolation mode, then we have to think about how that interacts
> with interrupts.  The problem is that it's hard to tell, as you're
> returning to userspace, whether you're returning from an exception or
> an interrupt; you typically don't have that information available.  So
> from a purely ease-of-implementation perspective, we'd likely want to
> handle exceptions and interrupts the same way, and quiesce both.
Sure, but what I don't understand is why we need to quiesce more than
once (i.e. at the prctl() call). Quiescing should be a single operation
that prevents any further disturbance, like offloading anything
we can to other CPUs. And entering the kernel again shouldn't break
that.
> 
> In general, I think it would also be a better explanation to users of
> task isolation to say "every enter/exit to the kernel is either an
> error that causes a signal, or it quiesces on return".  It's a simpler
> semantic, and I think it also is better for interrupts anyway, since
> it potentially avoids multiple interrupts to the application (whatever
> interrupted to begin with, plus potential timer interrupts later).
> 
> But that said, if we start with "pure strict" mode only, all of this
> becomes hypothetical, and we may in fact choose never to allow "safe"
> modes of entering the kernel.
Right. And starting with pure strict mode would be a good first step,
provided it is a mode you need.
> >>That's true, but I'd argue the behavior in that case should be that you can
> >>raise that kind of exception validly (so you can debug), and then you should
> >>quiesce on return to userspace so the application doesn't see additional
> >>exceptions.
> >
> >I don't see how we can quiesce such things.
> 
> I'm imagining task A is in dataplane mode, and task B wants to debug
> it by writing a breakpoint into its text.  When task A hits the
> breakpoint, it will enter the kernel, and hold there while task B
> pokes at it with ptrace.  When task A finally is allowed to return to
> userspace, it should quiesce before entering userspace in case any
> timer interrupts got scheduled (again, maybe due to softirqs or
> whatever, or random other kernel activity targeting that core while it
> was in the kernel, or whatever).  This is just the same kind of
> quiescing we do on return from the initial prctl().
Well again I think it shouldn't happen. Quiescing should be done once
and for all.
> With a "pure strict" mode it does get a little tricky, since we will
> end up killing task A as it comes back from its breakpoint.  We might
> just choose to say that task A should not enable task isolation if it
> is going to be debugged (some runtime switch).  This isn't really a
> great solution; I do kind of feel that the nicest thing to do is
> quiesce the task again at this point.  This feels like the biggest
> argument in favor of supporting a mode where a task-isolated task can
> safely enter the kernel for exceptions.  What do you think?
Yeah, we'll probably need to introduce some sort of debuggability. Allow
debug/trap exceptions only, for example.
> >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> >>   "value add" to make this a separate mode, though.
> >
> >I can imagine HPC to be willing this mode.
> 
> Yes, perhaps.  I'm not convinced we want to target HPC without a much
> clearer sense of why this is better than nohz_full, though.  I fear
> people might think "task isolation" is better by definition and not
> think too much about it, but I'm really not sure it is better for the
> HPC use case, necessarily.
I don't know. Perhaps HPC could just consist of quiescing once and for all
(offload everything that can be) and not signalling when there is a rare disturbance.
> >Otherwise perhaps just drop a warning.
> 
> Are you saying that we should printk a warning in the prctl() rather
> than returning an error in the case where it's not on a full dynticks
> cpu?  I could be convinced by that just to keep things consistent.
Yeah that's what I meant.
> How about doing it this way?  If you invoke prctl() with the default
> "strict" mode where any kernel entry results in a signal, the prctl()
> will be strict, and require you to be affinitized to a single, full
> dynticks cpu.
But if you do that, you need to do it properly and care about races against
affinity changes. It involves heavy synchronization against scheduler code.
I tend to think we shouldn't bother with that. If we enter task isolation
mode on a non-nohz-full CPU, the task will be signalled and kicked out of task
isolation mode by the next tick, which happens very soon after the prctl().
> But, if you enable the "allow syscalls" mode, then the prctl isn't
> strict either, since you can use syscalls to get into a state where
> you're not on a full dynticks cpu, and you just get a console warning
> if you enter task isolation on the wrong cpu.  (Of course, we may end
> up not doing the "allow syscalls" mode for the first version of this
> patch anyway, as we discuss below.)
Right.
> >Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> In this mode we don't worry about quiescing for interrupts, since we
> are generating a signal, and when you send a signal, you first have to
> disable task isolation mode to avoid getting into various bad states
> (sending too many signals, or worse, getting deadlocked because you
> are signalling the task BECAUSE it was about to receive a signal).  So
> we only quiesce after syscalls/exceptions.
Ok. And are you interested in such strict mode? :-)
If so it would be nice to start with just that and iterate on top of it.
> >Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> >does strict pure isolation mode and have future flags for more granularity.
> 
> I think just implementing the basic _ENABLE mode with pure strict task
> isolation makes sense for now.  We can wait to enable syscalls or
> exceptions until we have a better use case.  Meanwhile, even without
> support for allowing syscalls, you can always use prctl() to turn off
> task isolation, and then you can do your syscalls, and prctl() it back
> on again.  prctl() to disable task isolation always has to work :-)
Perfect!
> Or, if we want to make it easy to do debugging, and as a result maybe
> also support the plausible mode where task-isolation tasks make
> occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
> above implies syscalls as well, and support that mode.  Perhaps that
> makes the most sense...
I fear that _ALLOW_EXCEPTIONS is too wide for a special case if all we
want is to allow debugging.
The most granular way to express custom isolation would be to use BPF.
Not sure we want to go that far though.
> I'll spin it as a new patch series and you can take a look.
Ok. Ideally it would be nice to respin a simple version (strict mode)
on top of which we can later iterate.
Thanks.
* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-07-05 14:41                                   ` Frederic Weisbecker
@ 2016-07-05 17:47                                     ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2016-07-05 17:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel
On Tue, 5 Jul 2016, Frederic Weisbecker wrote:
> > >>That's true, but I'd argue the behavior in that case should be that you can
> > >>raise that kind of exception validly (so you can debug), and then you should
> > >>quiesce on return to userspace so the application doesn't see additional
> > >>exceptions.
> > >
> > >I don't see how we can quiesce such things.
> >
> > I'm imagining task A is in dataplane mode, and task B wants to debug
> > it by writing a breakpoint into its text.  When task A hits the
> > breakpoint, it will enter the kernel, and hold there while task B
> > pokes at it with ptrace.  When task A finally is allowed to return to
> > userspace, it should quiesce before entering userspace in case any
> > timer interrupts got scheduled (again, maybe due to softirqs or
> > whatever, or random other kernel activity targeting that core while it
> > was in the kernel, or whatever).  This is just the same kind of
> > quiescing we do on return from the initial prctl().
>
> Well again I think it shouldn't happen. Quiescing should be done once
> and for all.
For debugging, something like that would be helpful. And yes, for the
realtime use cases quiescing is once and for all (until we enter a different
operation mode, if requested by the app).
> > >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> > >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> > >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> > >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> > >>   "value add" to make this a separate mode, though.
> > >
> > >I can imagine HPC to be willing this mode.
> >
> > Yes, perhaps.  I'm not convinced we want to target HPC without a much
> > clearer sense of why this is better than nohz_full, though.  I fear
> > people might think "task isolation" is better by definition and not
> > think too much about it, but I'm really not sure it is better for the
> > HPC use case, necessarily.
HPC folks generally like to actually understand what is going on in order
to get the best performance. Just expose the knobs for us please.
Thread overview: 29+ messages
2016-01-04 19:34 [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
2016-01-19 15:42   ` Frederic Weisbecker
2016-01-19 20:45     ` Chris Metcalf
2016-01-28  0:28       ` Frederic Weisbecker
2016-01-29 18:18         ` Chris Metcalf
     [not found]           ` <56ABACDD.5090500-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
2016-01-30 21:11             ` Frederic Weisbecker
2016-02-11 19:24               ` Chris Metcalf
2016-03-04 12:56                 ` Frederic Weisbecker
2016-03-09 19:39                   ` Chris Metcalf
2016-04-08 13:56                     ` Frederic Weisbecker
2016-04-08 16:34                       ` Chris Metcalf
     [not found]                         ` <5707DDA8.10600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-04-12 18:41                           ` Chris Metcalf
2016-04-22 13:16                         ` Frederic Weisbecker
2016-04-25 20:36                           ` Chris Metcalf
2016-05-26  1:07                         ` Frederic Weisbecker
2016-06-03 19:32                           ` Chris Metcalf
2016-06-29 15:18                             ` Frederic Weisbecker
2016-07-01 20:59                               ` Chris Metcalf
     [not found]                                 ` <25c4ace1-6903-abb3-59e9-aedc11ac32fc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-07-05 14:41                                   ` Frederic Weisbecker
2016-07-05 17:47                                     ` Christoph Lameter
2016-01-04 19:34 ` [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
     [not found] ` <1451936091-29247-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
2016-01-11 21:15   ` [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-12 10:07     ` Will Deacon
     [not found]       ` <20160112100708.GA15737-5wv7dgnIgG8@public.gmane.org>
2016-01-12 17:49         ` Chris Metcalf
     [not found]           ` <56953CBA.9090208-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
2016-01-13 10:44             ` Ingo Molnar
2016-01-13 21:19               ` Chris Metcalf
2016-01-20 13:27                 ` Mark Rutland
     [not found]     ` <56941B86.9090009-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>
2016-01-12 10:53       ` Ingo Molnar