linux-api.vger.kernel.org archive mirror
* [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-14 20:48 Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 04/12] task_isolation: add initial support Chris Metcalf
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Here is a respin of the task-isolation patch set.  This primarily
reflects feedback from Frederic and Peter Z.

Changes since v12:

- Rebased on v4.7-rc7.

- New default "strict" model for task isolation: tasks return to
  userspace from the initial prctl(), and the only legal way back
  into the kernel is to call prctl() again to turn off isolation.
  Any other kernel entry results in a SIGKILL by default.

- New optional "relaxed" mode, where the application can receive some
  signal other than SIGKILL, or no signal at all, when it re-enters
  the kernel.  Since by default task isolation is now strict, there is
  no longer an additional "STRICT" mode, but rather a new "NOSIG" mode
  that builds on top of the "USERSIG" support for setting a signal
  other than SIGKILL to be delivered to the process.  The "NOSIG" mode
  also relaxes the required criteria for entering task isolation mode;
  we just issue a warning if the affinity isn't set right, and we
  don't fail with EAGAIN if the kernel isn't ready to stop the tick.

  Running your task-isolation application in this "NOSIG" mode is also
  necessary when debugging, since otherwise hitting breakpoints, etc.,
  will cause a fatal signal to be sent to the process.

  Frederic has suggested we might want to defer this functionality
  until later, but (in addition to the debuggability aspect) there is
  some thought that it might be useful for e.g. HPC, so I have just
  broken out the additional semantics into a single separate patch at
  the end of the series.

- Function naming has been changed and comments have been added to try
  to clarify the role of the task-isolation reporting on kernel
  entries that do NOT cause signals.  This hopefully clarifies why we
  only invoke the renamed task_isolation_quiet_exception() in a few
  places, since all the other places generate signals anyway. [PeterZ]

- The task_isolation_debug() call now has an inline piece that checks
  to see if the target is a task_isolation cpu before actually
  calling. [PeterZ]

- In _task_isolation_debug(), we use the new task_struct_trylock()
  call that is in linux-next now; for now I just have a static copy of
  the function, which I will switch to using the version from
  linux-next in the next rebasing. [PeterZ]

- We now pass a string describing the interrupt up from
  task_isolation_debug() so there is more information on where the
  interrupt came from beyond just the stack backtrace. [PeterZ]

- I added task_isolation_debug() hooks to smp_sched_reschedule() on
  x86, which was missing before, and removed the hooks in the tile
  send_IPI_*() routines, since there were already hooks in the
  callers.  Likewise I moved the hook for arm64 from the generic
  smp_cross_call() routine to the only caller that wasn't already
  hooked, smp_send_reschedule().  The commit message clarifies the
  rationale for where hooks are placed.

- I moved the page fault reporting so that it only reports in the case
  that we are not also sending a SIGSEGV/SIGBUS, for consistency with
  other uses of task_isolation_quiet_exception().

The previous (v12) patch series is here:

https://lkml.kernel.org/g/1459877922-15512-1-git-send-email-cmetcalf@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal

 Documentation/kernel-parameters.txt    |  16 ++
 arch/arm64/Kconfig                     |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S              |  12 +-
 arch/arm64/kernel/ptrace.c             |  15 +-
 arch/arm64/kernel/signal.c             |  42 +++-
 arch/arm64/kernel/smp.c                |   2 +
 arch/arm64/mm/fault.c                  |   8 +-
 arch/tile/Kconfig                      |   1 +
 arch/tile/include/asm/thread_info.h    |   4 +-
 arch/tile/kernel/process.c             |   9 +
 arch/tile/kernel/ptrace.c              |   7 +
 arch/tile/kernel/single_step.c         |   7 +
 arch/tile/kernel/smp.c                 |  26 +--
 arch/tile/kernel/time.c                |   1 +
 arch/tile/kernel/unaligned.c           |   4 +
 arch/tile/mm/fault.c                   |  13 +-
 arch/tile/mm/homecache.c               |   2 +
 arch/x86/Kconfig                       |   1 +
 arch/x86/entry/common.c                |  18 +-
 arch/x86/include/asm/thread_info.h     |   2 +
 arch/x86/kernel/smp.c                  |   2 +
 arch/x86/kernel/traps.c                |   3 +
 arch/x86/mm/fault.c                    |   5 +
 drivers/base/cpu.c                     |  18 ++
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h              |  73 +++++++
 include/linux/sched.h                  |   3 +
 include/linux/swap.h                   |   1 +
 include/linux/tick.h                   |   2 +
 include/linux/vmstat.h                 |   4 +
 include/uapi/linux/prctl.h             |  10 +
 init/Kconfig                           |  37 ++++
 kernel/Makefile                        |   1 +
 kernel/fork.c                          |   3 +
 kernel/irq_work.c                      |   5 +-
 kernel/isolation.c                     | 337 +++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    |  42 ++++
 kernel/signal.c                        |  15 ++
 kernel/smp.c                           |   6 +-
 kernel/softirq.c                       |  33 ++++
 kernel/sys.c                           |   9 +
 kernel/time/tick-sched.c               |  36 ++--
 mm/swap.c                              |  15 +-
 mm/vmstat.c                            |  19 ++
 46 files changed, 827 insertions(+), 56 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.7.2


* [PATCH v13 04/12] task_isolation: add initial support
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 12/12] task_isolation: add user-settable notification signal Chris Metcalf
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overhead in the kernel entry/exit path
and to provide 100% kernel semantics.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flags, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
without executing, with its return value set to -ERESTARTNOINTR.

To allow the state to be entered and exited, the syscall-checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group so that a task can exit without
a pointless fatal signal being raised as it does so.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++
 include/linux/isolation.h           |  60 ++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 +
 init/Kconfig                        |  27 +++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 217 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   8 ++
 kernel/sys.c                        |   9 ++
 kernel/time/tick-sched.c            |  36 +++---
 13 files changed, 384 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 82b42c958d1c..3db9bea08ed6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3892,6 +3892,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, set
+			the specified list of CPUs on which tasks will be
+			able to use prctl(PR_SET_TASK_ISOLATION) to set up
+			task isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..d9288b85b41f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,60 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+extern int task_isolation_syscall(int nr);
+
+/* Report on exceptions that don't cause a signal for the user process. */
+extern void _task_isolation_quiet_exception(const char *fmt, ...);
+#define task_isolation_quiet_exception(fmt, ...)			\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f29ade..8195c14d021a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..2a49d0d2940a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c02d89777713..fc71444f9c30 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -783,6 +783,33 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  Prior to returning to userspace,
+	 isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  By default, attempting to re-enter the kernel
+	 while in this mode will cause the task to be terminated
+	 with a signal; you must explicitly use prctl() to disable
+	 task isolation before resuming normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for
+	 userspace tasks that are running hard real-time workloads,
+	 such as a 10 Gbit network driver running in userspace.
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single
+	 userspace process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e2b952..91ff1615f4d6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4a7ec0c6c88c..e1ab8f034a95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1535,6 +1536,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..bf3ebb0a727c
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,217 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id())) {
+			/* Invalid task affinity setting. */
+			return -EINVAL;
+		}
+		if (!can_stop_my_full_tick_now()) {
+			/* System not yet ready for task isolation. */
+			return -EAGAIN;
+		}
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get pre-empted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach here would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
+
+static void task_isolation_deliver_signal(struct task_struct *task,
+					  const char *buf)
+{
+	siginfo_t info = {};
+
+	info.si_signo = SIGKILL;
+
+	/*
+	 * Report on the fact that isolation was violated for the task.
+	 * It may not be the task's fault (e.g. a TLB flush from another
+	 * core) but we are not blaming it, just reporting that it lost
+	 * its isolation status.
+	 */
+	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
+		task->comm, task->pid, buf);
+
+	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_quiet_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_deliver_signal(task, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in), and prevents most syscalls from executing and raises a
+ * signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	char buf[20];
+
+	if (syscall == __NR_prctl ||
+	    syscall == __NR_exit ||
+	    syscall == __NR_exit_group)
+		return 0;
+
+	snprintf(buf, sizeof(buf), "syscall %d", syscall);
+	task_isolation_deliver_signal(current, buf);
+
+	syscall_set_return_value(current, current_pt_regs(),
+					 -ERESTARTNOINTR, -1);
+	return -1;
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..4ff9bafd5af0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,13 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/*
+		 * Disable task isolation when delivering a signal.
+		 * The isolation model requires users to reset task
+		 * isolation from the signal handler if desired.
+		 */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..4df84af425e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 536ada80f6dd..5cfde92a3785 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2



* [PATCH v13 12/12] task_isolation: add user-settable notification signal
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 04/12] task_isolation: add initial support Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
       [not found] ` <1468529299-27929-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-07-27 14:01 ` Christoph Lameter
  3 siblings, 0 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return to
continue execution.

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/uapi/linux/prctl.h |  5 ++++
 kernel/isolation.c         | 62 ++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 5e6cd67dfb0c..aca5de5e2e05 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
 	return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+	if (flags & PR_TASK_ISOLATION_USERSIG)
+		return PR_TASK_ISOLATION_GET_SIG(flags);
+	else
+		return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
 	if (flags != 0) {
+		int sig = task_isolation_sig(flags);
+
 		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
 		    !task_isolation_possible(raw_smp_processor_id())) {
 			/* Invalid task affinity setting. */
-			return -EINVAL;
+			if (sig)
+				return -EINVAL;
+			else
+				pr_warn("%s/%d: enabling non-signalling task isolation\n"
+					"and not bound to a single task isolation core\n",
+					current->comm, current->pid);
 		}
-		if (!can_stop_my_full_tick_now()) {
+
+		if (sig && !can_stop_my_full_tick_now()) {
 			/* System not yet ready for task isolation. */
 			return -EAGAIN;
 		}
@@ -160,11 +183,11 @@ void task_isolation_enter(void)
 }
 
 static void task_isolation_deliver_signal(struct task_struct *task,
-					  const char *buf)
+					  const char *buf, int sig)
 {
 	siginfo_t info = {};
 
-	info.si_signo = SIGKILL;
+	info.si_signo = sig;
 
 	/*
 	 * Report on the fact that isolation was violated for the task.
@@ -175,7 +198,10 @@ static void task_isolation_deliver_signal(struct task_struct *task,
 	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
 		task->comm, task->pid, buf);
 
-	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	/*
+	 * Turn off task isolation mode to avoid further isolation callbacks.
+	 * It can choose to re-enable task isolation mode in the signal handler.
+	 */
 	task_isolation_set_flags(task, 0);
 
 	send_sig_info(info.si_signo, &info, task);
@@ -190,15 +216,20 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 	struct task_struct *task = current;
 	va_list args;
 	char buf[100];
+	int sig;
 
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
 
+	sig = task_isolation_sig(task->task_isolation_flags);
+	if (sig == 0)
+		return;
+
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 
-	task_isolation_deliver_signal(task, buf);
+	task_isolation_deliver_signal(task, buf, sig);
 }
 
 /*
@@ -209,14 +240,19 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 int task_isolation_syscall(int syscall)
 {
 	char buf[20];
+	int sig;
 
 	if (syscall == __NR_prctl ||
 	    syscall == __NR_exit ||
 	    syscall == __NR_exit_group)
 		return 0;
 
+	sig = task_isolation_sig(current->task_isolation_flags);
+	if (sig == 0)
+		return 0;
+
 	snprintf(buf, sizeof(buf), "syscall %d", syscall);
-	task_isolation_deliver_signal(current, buf);
+	task_isolation_deliver_signal(current, buf, sig);
 
 	syscall_set_return_value(current, current_pt_regs(),
 					 -ERESTARTNOINTR, -1);
@@ -236,6 +272,7 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 {
 	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
 	bool force_debug = false;
+	int sig;
 
 	/*
 	 * Our caller made sure the task was running on a task isolation
@@ -266,10 +303,13 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 	 * and instead just treat it as if "debug" mode was enabled,
 	 * since that's pretty much all we can do.
 	 */
-	if (in_nmi())
-		force_debug = true;
-	else
-		task_isolation_deliver_signal(p, type);
+	sig = task_isolation_sig(p->task_isolation_flags);
+	if (sig != 0) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_deliver_signal(p, type, sig);
+	}
 
 	/*
 	 * If (for example) the timer interrupt starts ticking
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
       [not found] ` <1468529299-27929-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-07-14 21:03   ` Andy Lutomirski
       [not found]     ` <CALCETrVddfd7ZDGpYs4CdkAMEmQCb6a-_5Um9bb4FO+XwWzOAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-07-21  2:04   ` Christoph Lameter
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-07-14 21:03 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf <cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> Here is a respin of the task-isolation patch set.  This primarily
> reflects feedback from Frederic and Peter Z.

I still think this is the wrong approach, at least at this point.  The
first step should be to instrument things if necessary and fix the
obvious cases where the kernel gets entered asynchronously.  Only once
there's a credible reason to believe it can work well should any form
of strictness be applied.

As an example, enough vmalloc/vfree activity will eventually cause
flush_tlb_kernel_range to be called and *boom*, there goes your shiny
production dataplane application.  Once virtually mapped kernel stacks
happen, the frequency with which this happens will only increase.

On very brief inspection, __kmem_cache_shutdown will be a problem on
some workloads as well.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
       [not found]     ` <CALCETrVddfd7ZDGpYs4CdkAMEmQCb6a-_5Um9bb4FO+XwWzOAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-07-14 21:22       ` Chris Metcalf
  2016-07-18 22:11         ` Andy Lutomirski
  2016-07-18  0:42       ` Christoph Lameter
  1 sibling, 1 reply; 38+ messages in thread
From: Chris Metcalf @ 2016-07-14 21:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 7/14/2016 5:03 PM, Andy Lutomirski wrote:
> On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf <cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> Here is a respin of the task-isolation patch set.  This primarily
>> reflects feedback from Frederic and Peter Z.
> I still think this is the wrong approach, at least at this point.  The
> first step should be to instrument things if necessary and fix the
> obvious cases where the kernel gets entered asynchronously.

Note, however, that the task_isolation_debug mode is a very convenient
way of discovering what is going on when things do go wrong for task isolation.

> Only once
> there's a credible reason to believe it can work well should any form
> of strictness be applied.

I'm not sure what criteria you need for this, though.  Certainly we've been
shipping our version of task isolation to customers since 2008, and there
are quite a few customer applications in production that are working well.
I'd argue that's a credible reason.

> As an example, enough vmalloc/vfree activity will eventually cause
> flush_tlb_kernel_range to be called and *boom*, there goes your shiny
> production dataplane application.

Well, that's actually a refinement that I did not inflict on this patch series.

In our code base, we have a hook for kernel TLB flushes that defers such
flushes for cores that are running in userspace, because, after all, they
don't yet care about such flushes.  Instead, we atomically set a flag that
is checked on entry to the kernel, and that causes the TLB flush to occur
at that point.

> On very brief inspection, __kmem_cache_shutdown will be a problem on
> some workloads as well.

That looks like it should be amenable to a version of the same fix I pushed
upstream in 5fbc461636c32efd ("mm: make lru_add_drain_all() selective").
You would basically check which cores have non-empty caches, and only
interrupt those cores.  For extra credit, you empty the cache on your local cpu
when you are entering task isolation mode.  Now you don't get interrupted.

To be fair, I've never seen this particular path cause an interruption.  And I
think this speaks to the fact that there really can't be a black and white
decision about when you have removed enough possible interrupt paths.
It really does depend on what else is running on your machine in addition
to the task isolation code, and that will vary from application to application.
And, as the kernel evolves, new ways of interrupting task isolation cores
will get added and need to be dealt with.  There really isn't a perfect time
you can wait for and then declare that all the asynchronous entry cases
have been dealt with and now things are safe for task isolation.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
       [not found]     ` <CALCETrVddfd7ZDGpYs4CdkAMEmQCb6a-_5Um9bb4FO+XwWzOAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-07-14 21:22       ` Chris Metcalf
@ 2016-07-18  0:42       ` Christoph Lameter
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2016-07-18  0:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Thu, 14 Jul 2016, Andy Lutomirski wrote:

> As an example, enough vmalloc/vfree activity will eventually cause
> flush_tlb_kernel_range to be called and *boom*, there goes your shiny
> production dataplane application.  Once virtually mapped kernel stacks
> happen, the frequency with which this happens will only increase.

But then vmalloc/vfree activity is not to be expected if only user space is
running.  Since the kernel is not active and this affects kernel address
space only, it could be deferred.  Such events will cause OS activity that
causes a number of high-latency events, but then the system will quiet down
again.

> On very brief inspection, __kmem_cache_shutdown will be a problem on
> some workloads as well.

These are all corner cases that can be worked on over time if they are
significant. The main issue here is to reduce the obvious and relatively
frequent causes for ticks and allow easier detection of events that cause
tick activity.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-14 21:22       ` Chris Metcalf
@ 2016-07-18 22:11         ` Andy Lutomirski
  2016-07-18 22:50           ` Chris Metcalf
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-07-18 22:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc@vger.kernel.org, Linux API,
	linux-kernel@vger.kernel.org

On Thu, Jul 14, 2016 at 2:22 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 7/14/2016 5:03 PM, Andy Lutomirski wrote:
>>
>> On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf <cmetcalf@mellanox.com>
>> wrote:
>>>
>>> Here is a respin of the task-isolation patch set.  This primarily
>>> reflects feedback from Frederic and Peter Z.
>>
>> I still think this is the wrong approach, at least at this point.  The
>> first step should be to instrument things if necessary and fix the
>> obvious cases where the kernel gets entered asynchronously.
>
>
> Note, however, that the task_isolation_debug mode is a very convenient
> way of discovering what is going on when things do go wrong for task
> isolation.
>
>> Only once
>> there's a credible reason to believe it can work well should any form
>> of strictness be applied.
>
>
> I'm not sure what criteria you need for this, though.  Certainly we've been
> shipping our version of task isolation to customers since 2008, and there
> are quite a few customer applications in production that are working well.
> I'd argue that's a credible reason.
>
>> As an example, enough vmalloc/vfree activity will eventually cause
>> flush_tlb_kernel_range to be called and *boom*, there goes your shiny
>> production dataplane application.
>
>
> Well, that's actually a refinement that I did not inflict on this patch
> series.

Submit it separately, perhaps?

The "kill the process if it goofs" thing, while there are known goofs
in the kernel, apparently with patches written but unsent, seems
questionable.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-18 22:11         ` Andy Lutomirski
@ 2016-07-18 22:50           ` Chris Metcalf
  0 siblings, 0 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-07-18 22:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc@vger.kernel.org, Linux API,
	linux-kernel@vger.kernel.org

On 7/18/2016 6:11 PM, Andy Lutomirski wrote:
>>> As an example, enough vmalloc/vfree activity will eventually cause
>>> flush_tlb_kernel_range to be called and*boom*, there goes your shiny
>>> production dataplane application.
>>
>> Well, that's actually a refinement that I did not inflict on this patch
>> series.
> Submit it separately, perhaps?
>
> The "kill the process if it goofs" thing while there are known goofs
> in the kernel, apparently with patches written but unsent, seems
> questionable.

Sure, that's a good idea.

I think what I will plan to do is, once the patch series is accepted into
some tree, return to this piece.  I'll have to go back and look at the internal
Tilera version of this code, since we have diverged quite a ways from that
in the 13 versions of the patch series, but my memory is that the kernel TLB
flush management was the only substantial piece of additional code not in
the initial batch of changes.  The extra requirement is a hook very early
in the kernel entry path that is invoked on all entry paths;
arm64 has the ct_user_exit macro and tile has the finish_interrupt_save macro,
but I'm not sure there's something equivalent on x86 to catch all entries.

It's worth noting, though, that the typical deployment target for task
isolation (at least in our experience) is a pretty dedicated machine, with
the primary application running in task isolation mode almost all of the
time, so you are generally in pretty good control of all aspects of the
system, including whether or not you are generating kernel TLB flushes from
your non-task-isolation cores.  So I would argue the kernel TLB flush management piece is
an improvement to, not a requirement for, the main patch series.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
       [not found] ` <1468529299-27929-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-07-14 21:03   ` [PATCH v13 00/12] support "task_isolation" mode Andy Lutomirski
@ 2016-07-21  2:04   ` Christoph Lameter
       [not found]     ` <alpine.DEB.2.20.1607202059180.25838-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-21  2:04 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

We are trying to test the patchset on x86 and are getting strange
backtraces and aborts. It seems that the cpu before the cpu we are running
on creates an irq_work event that causes a latency event on the next cpu.

This is weird. Is there a new round robin IPI feature in the kernel that I
am not aware of?

Backtraces from dmesg:

[  956.603223] latencytest/7928: task_isolation mode lost due to irq_work
[  956.610817] cpu 12: irq_work violating task isolation for latencytest/7928 on cpu 13
[  956.619985] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
[  956.628765] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
[  956.637642]  0000000000000086 ce6735c7b39e7b81 ffff88103e783d00 ffffffff8134f6ff
[  956.646739]  ffff88102c50d700 000000000000000d ffff88103e783d28 ffffffff811986f4
[  956.655828]  ffff88102c50d700 ffff88203cf97f80 000000000000000d ffff88103e783d68
[  956.664924] Call Trace:
[  956.667945]  <IRQ>  [<ffffffff8134f6ff>] dump_stack+0x63/0x84
[  956.674740]  [<ffffffff811986f4>] task_isolation_debug_task+0xb4/0xd0
[  956.682229]  [<ffffffff810b4a13>] _task_isolation_debug+0x83/0xc0
[  956.689331]  [<ffffffff81179c0c>] irq_work_queue_on+0x9c/0x120
[  956.696142]  [<ffffffff811075e4>] tick_nohz_full_kick_cpu+0x44/0x50
[  956.703438]  [<ffffffff810b48d9>] wake_up_nohz_cpu+0x99/0x110
[  956.710150]  [<ffffffff810f57e1>] internal_add_timer+0x71/0xb0
[  956.716959]  [<ffffffff810f696b>] add_timer_on+0xbb/0x140
[  956.723283]  [<ffffffff81100ca0>] clocksource_watchdog+0x230/0x300
[  956.730480]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[  956.738555]  [<ffffffff810f5615>] call_timer_fn+0x35/0x120
[  956.744973]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[  956.753046]  [<ffffffff810f64cc>] run_timer_softirq+0x23c/0x2f0
[  956.759952]  [<ffffffff816d4397>] __do_softirq+0xd7/0x2c5
[  956.766272]  [<ffffffff81091245>] irq_exit+0xf5/0x100
[  956.772209]  [<ffffffff816d41d2>] smp_apic_timer_interrupt+0x42/0x50
[  956.779600]  [<ffffffff816d231c>] apic_timer_interrupt+0x8c/0xa0
[  956.786602]  <EOI>  [<ffffffff81569eb0>] ? poll_idle+0x40/0x80
[  956.793490]  [<ffffffff815697dc>] cpuidle_enter_state+0x9c/0x260
[  956.800498]  [<ffffffff815699d7>] cpuidle_enter+0x17/0x20
[  956.806810]  [<ffffffff810cf497>] cpu_startup_entry+0x2b7/0x3a0
[  956.813717]  [<ffffffff81050e6c>] start_secondary+0x15c/0x1a0
[ 1036.601758] cpu 12: irq_work violating task isolation for latencytest/8447 on cpu 13
[ 1036.610922] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
[ 1036.619692] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
[ 1036.628551]  0000000000000086 ce6735c7b39e7b81 ffff88103e783d00 ffffffff8134f6ff
[ 1036.637648]  ffff88102dca0000 000000000000000d ffff88103e783d28 ffffffff811986f4
[ 1036.646741]  ffff88102dca0000 ffff88203cf97f80 000000000000000d ffff88103e783d68
[ 1036.655833] Call Trace:
[ 1036.658852]  <IRQ>  [<ffffffff8134f6ff>] dump_stack+0x63/0x84
[ 1036.665649]  [<ffffffff811986f4>] task_isolation_debug_task+0xb4/0xd0
[ 1036.673136]  [<ffffffff810b4a13>] _task_isolation_debug+0x83/0xc0
[ 1036.680237]  [<ffffffff81179c0c>] irq_work_queue_on+0x9c/0x120
[ 1036.687091]  [<ffffffff811075e4>] tick_nohz_full_kick_cpu+0x44/0x50
[ 1036.694388]  [<ffffffff810b48d9>] wake_up_nohz_cpu+0x99/0x110
[ 1036.701089]  [<ffffffff810f57e1>] internal_add_timer+0x71/0xb0
[ 1036.707896]  [<ffffffff810f696b>] add_timer_on+0xbb/0x140
[ 1036.714210]  [<ffffffff81100ca0>] clocksource_watchdog+0x230/0x300
[ 1036.721411]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[ 1036.729478]  [<ffffffff810f5615>] call_timer_fn+0x35/0x120
[ 1036.735899]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[ 1036.743970]  [<ffffffff810f64cc>] run_timer_softirq+0x23c/0x2f0
[ 1036.750878]  [<ffffffff816d4397>] __do_softirq+0xd7/0x2c5
[ 1036.757199]  [<ffffffff81091245>] irq_exit+0xf5/0x100
[ 1036.763132]  [<ffffffff816d41d2>] smp_apic_timer_interrupt+0x42/0x50
[ 1036.770520]  [<ffffffff816d231c>] apic_timer_interrupt+0x8c/0xa0
[ 1036.777520]  <EOI>  [<ffffffff81569eb0>] ? poll_idle+0x40/0x80
[ 1036.784410]  [<ffffffff815697dc>] cpuidle_enter_state+0x9c/0x260
[ 1036.791413]  [<ffffffff815699d7>] cpuidle_enter+0x17/0x20
[ 1036.797734]  [<ffffffff810cf497>] cpu_startup_entry+0x2b7/0x3a0
[ 1036.804641]  [<ffffffff81050e6c>] start_secondary+0x15c/0x1a0

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
       [not found]     ` <alpine.DEB.2.20.1607202059180.25838-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2016-07-21 14:06       ` Chris Metcalf
  2016-07-22  2:20         ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Metcalf @ 2016-07-21 14:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 7/20/2016 10:04 PM, Christoph Lameter wrote:
> We are trying to test the patchset on x86 and are getting strange
> backtraces and aborts. It seems that the cpu before the cpu we are running
> on creates an irq_work event that causes a latency event on the next cpu.
>
> This is weird. Is there a new round robin IPI feature in the kernel that I
> am not aware of?

This seems to be from your clocksource declaring itself to be
unstable, and then scheduling work to safely remove that timer.
I haven't looked at this code before (in kernel/time/clocksource.c
under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
arm64 and tile aren't unstable.  Is it possible to boot your machine
with a stable clocksource?


> Backtraces from dmesg:
>
> [ full backtrace trimmed; quoted in its entirety earlier in the thread ]

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-21 14:06       ` Chris Metcalf
@ 2016-07-22  2:20         ` Christoph Lameter
  2016-07-22 12:50           ` Chris Metcalf
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-22  2:20 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel


On Thu, 21 Jul 2016, Chris Metcalf wrote:
> On 7/20/2016 10:04 PM, Christoph Lameter wrote:

> unstable, and then scheduling work to safely remove that timer.
> I haven't looked at this code before (in kernel/time/clocksource.c
> under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
> arm64 and tile aren't unstable.  Is it possible to boot your machine
> with a stable clocksource?

It already has a stable clocksource.  Sorry, but that was one of the criteria
for the servers when we ordered them.  Could this be clock adjustments?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-22  2:20         ` Christoph Lameter
@ 2016-07-22 12:50           ` Chris Metcalf
  2016-07-25 16:35             ` Christoph Lameter
  2016-08-11  8:27             ` [PATCH v13 00/12] support "task_isolation" mode Peter Zijlstra
  0 siblings, 2 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-07-22 12:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/21/2016 10:20 PM, Christoph Lameter wrote:
> On Thu, 21 Jul 2016, Chris Metcalf wrote:
>> On 7/20/2016 10:04 PM, Christoph Lameter wrote:
>> unstable, and then scheduling work to safely remove that timer.
>> I haven't looked at this code before (in kernel/time/clocksource.c
>> under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
>> arm64 and tile aren't unstable.  Is it possible to boot your machine
>> with a stable clocksource?
> It already as a stable clocksource. Sorry but that was one of the criteria
> for the server when we ordered them. Could this be clock adjustments?

We probably need to get clock folks to jump in on this thread!

Maybe it's disabling some built-in unstable clock just as part of
falling back to using the better, stable clock that you also have?
So maybe there's a way of just disabling that clocksource from the
get-go instead of having it be marked unstable later.

If you run the test again after this storm of unstable marking, does
it all happen again?  Or is it a persistent state in the kernel?
If so, maybe you can just arrange to get to that state before starting
your application's task-isolation code.

Or, if you think it's clock adjustments, perhaps running your test with
ntpd disabled would make it work better?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-22 12:50           ` Chris Metcalf
@ 2016-07-25 16:35             ` Christoph Lameter
       [not found]               ` <alpine.DEB.2.20.1607251133450.25354-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2016-08-11  8:27             ` [PATCH v13 00/12] support "task_isolation" mode Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-25 16:35 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Fri, 22 Jul 2016, Chris Metcalf wrote:

> > It already has a stable clocksource. Sorry, but that was one of the criteria
> > for the servers when we ordered them. Could this be clock adjustments?
>
> We probably need to get clock folks to jump in on this thread!

Guess so. I will have a look at this when I get some time again.

> Maybe it's disabling some built-in unstable clock just as part of
> falling back to using the better, stable clock that you also have?
> So maybe there's a way of just disabling that clocksource from the
> get-go instead of having it be marked unstable later.

This is a standard Dell server. No clocksources are marked as unstable as
far as I can tell.

> If you run the test again after this storm of unstable marking, does
> it all happen again?  Or is it a persistent state in the kernel?

This happens anytime we try to run with prctl().

I hope to get some more detail once I get some time to look at this. But
this is likely an x86-specific problem.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]               ` <alpine.DEB.2.20.1607251133450.25354-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2016-07-27 13:55                 ` Christoph Lameter
  2016-07-27 14:12                   ` Chris Metcalf
  2016-08-10 22:16                   ` Frederic Weisbecker
  0 siblings, 2 replies; 38+ messages in thread
From: Christoph Lameter @ 2016-07-27 13:55 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon, 25 Jul 2016, Christoph Lameter wrote:

> Guess so. I will have a look at this when I get some time again.

Ok so the problem is the clocksource_watchdog() function in
kernel/time/clocksource.c. This function is active if
CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the time sources of
each processor for being within bounds and then reschedule itself on the
next one.

The purpose of the function seems to be to determine *if* a clocksource is
unstable. It does not mean that the clocksource *is* unstable.

The critical piece of code is this:

        /*
         * Cycle through CPUs to check if the CPUs stay synchronized
         * to each other.
         */
        next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
        if (next_cpu >= nr_cpu_ids)
                next_cpu = cpumask_first(cpu_online_mask);
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);


Should we just cycle through the cpus that are not isolated? Otherwise we
need to have some means to check the clocksources for accuracy remotely
(probably impossible for TSC etc).

The WATCHDOG_INTERVAL is 1 second so this causes an interrupt every
second.

Note that we are running with the patch that removes the 1 HZ minimum time
tick. With an older kernel code base (Red Hat) we can keep the kernel quiet
for minutes. The clocksource watchdog causes timers to fire again.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (2 preceding siblings ...)
       [not found] ` <1468529299-27929-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-07-27 14:01 ` Christoph Lameter
  3 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2016-07-27 14:01 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel


We tested this with 4.7-rc7 and aside from the issue with
clocksource_watchdog() this is working fine.

Tested-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 13:55                 ` clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode) Christoph Lameter
@ 2016-07-27 14:12                   ` Chris Metcalf
       [not found]                     ` <f8d72e47-7e84-cbd0-869f-69bf452a8bfb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-08-10 22:16                   ` Frederic Weisbecker
  1 sibling, 1 reply; 38+ messages in thread
From: Chris Metcalf @ 2016-07-27 14:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 9:55 AM, Christoph Lameter wrote:
> The critical piece of code is this:
>
>          /*
>           * Cycle through CPUs to check if the CPUs stay synchronized
>           * to each other.
>           */
>          next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>          if (next_cpu >= nr_cpu_ids)
>                  next_cpu = cpumask_first(cpu_online_mask);
>          watchdog_timer.expires += WATCHDOG_INTERVAL;
>          add_timer_on(&watchdog_timer, next_cpu);
>
>
> Should we just cycle through the cpus that are not isolated? Otherwise we
> need to have some means to check the clocksources for accuracy remotely
> (probably impossible for TSC etc).

That sounds like the right idea - use the housekeeping cpu mask instead of the
cpu online mask.  Should be a straightforward patch; do you want to do that
and test it in your configuration, and I'll include it in the next spin of the
patch series?

Thanks for your testing!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]                     ` <f8d72e47-7e84-cbd0-869f-69bf452a8bfb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-07-27 15:23                       ` Christoph Lameter
       [not found]                         ` <alpine.DEB.2.20.1607271022130.25729-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-27 15:23 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Wed, 27 Jul 2016, Chris Metcalf wrote:

> > Should we just cycle through the cpus that are not isolated? Otherwise we
> > need to have some means to check the clocksources for accuracy remotely
> > (probably impossible for TSC etc).
>
> That sounds like the right idea - use the housekeeping cpu mask instead of the
> cpu online mask.  Should be a straightforward patch; do you want to do that
> and test it in your configuration, and I'll include it in the next spin of the
> patch series?

Sadly, housekeeping_cpumask() is defined the following way:

static inline const struct cpumask *housekeeping_cpumask(void)
{
#ifdef CONFIG_NO_HZ_FULL
        if (tick_nohz_full_enabled())
                return housekeeping_mask;
#endif
        return cpu_possible_mask;
}

Why is it not returning cpu_online_mask?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]                         ` <alpine.DEB.2.20.1607271022130.25729-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2016-07-27 15:31                           ` Christoph Lameter
  2016-07-27 17:06                             ` Chris Metcalf
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-27 15:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Ok here is a possible patch that explicitly checks for housekeeping cpus:

Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus

watchdog checks can only run on housekeeping capable cpus. Otherwise
we will be generating noise that we would like to avoid on the isolated
processors.

Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/kernel/time/clocksource.c
===================================================================
--- linux.orig/kernel/time/clocksource.c	2016-07-27 08:41:17.109862517 -0500
+++ linux/kernel/time/clocksource.c	2016-07-27 10:28:31.172447732 -0500
@@ -269,9 +269,12 @@ static void clocksource_watchdog(unsigne
 	 * Cycle through CPUs to check if the CPUs stay synchronized
 	 * to each other.
 	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
-	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
+	do {
+		next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+		if (next_cpu >= nr_cpu_ids)
+			next_cpu = cpumask_first(cpu_online_mask);
+	} while (!is_housekeeping_cpu(next_cpu));
+
 	watchdog_timer.expires += WATCHDOG_INTERVAL;
 	add_timer_on(&watchdog_timer, next_cpu);
 out:

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 15:31                           ` Christoph Lameter
@ 2016-07-27 17:06                             ` Chris Metcalf
  2016-07-27 18:56                               ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Metcalf @ 2016-07-27 17:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 11:31 AM, Christoph Lameter wrote:
> Ok here is a possible patch that explicitly checks for housekeeping cpus:
>
> Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus
>
> watchdog checks can only run on housekeeping capable cpus. Otherwise
> we will be generating noise that we would like to avoid on the isolated
> processors.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/kernel/time/clocksource.c
> ===================================================================
> --- linux.orig/kernel/time/clocksource.c	2016-07-27 08:41:17.109862517 -0500
> +++ linux/kernel/time/clocksource.c	2016-07-27 10:28:31.172447732 -0500
> @@ -269,9 +269,12 @@ static void clocksource_watchdog(unsigne
>   	 * Cycle through CPUs to check if the CPUs stay synchronized
>   	 * to each other.
>   	 */
> -	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
> -	if (next_cpu >= nr_cpu_ids)
> -		next_cpu = cpumask_first(cpu_online_mask);
> +	do {
> +		next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
> +		if (next_cpu >= nr_cpu_ids)
> +			next_cpu = cpumask_first(cpu_online_mask);
> +	} while (!is_housekeeping_cpu(next_cpu));
> +
>   	watchdog_timer.expires += WATCHDOG_INTERVAL;
>   	add_timer_on(&watchdog_timer, next_cpu);
>   out:

How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?

Note that you should also use cpumask_first_and() in clocksource_start_watchdog(),
just to be complete.

Hopefully the init code runs after tick_init().  It seems like that's probably true.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 17:06                             ` Chris Metcalf
@ 2016-07-27 18:56                               ` Christoph Lameter
  2016-07-27 19:49                                 ` Chris Metcalf
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-27 18:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, 27 Jul 2016, Chris Metcalf wrote:

> How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
> housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?

Ok here is V2:


Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus V2

watchdog checks can only run on housekeeping capable cpus. Otherwise
we will be generating noise that we would like to avoid on the isolated
processors.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/kernel/time/clocksource.c
===================================================================
--- linux.orig/kernel/time/clocksource.c
+++ linux/kernel/time/clocksource.c
@@ -269,9 +269,10 @@ static void clocksource_watchdog(unsigne
 	 * Cycle through CPUs to check if the CPUs stay synchronized
 	 * to each other.
 	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+	next_cpu = cpumask_next_and(raw_smp_processor_id(), cpu_online_mask, housekeeping_cpumask());
 	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
+		next_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask());
+
 	watchdog_timer.expires += WATCHDOG_INTERVAL;
 	add_timer_on(&watchdog_timer, next_cpu);
 out:

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 18:56                               ` Christoph Lameter
@ 2016-07-27 19:49                                 ` Chris Metcalf
  2016-07-27 19:53                                   ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Metcalf @ 2016-07-27 19:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 2:56 PM, Christoph Lameter wrote:
> On Wed, 27 Jul 2016, Chris Metcalf wrote:
>
>> How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
>> housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?
> Ok here is V2:
>
>
> Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus V2
>
> watchdog checks can only run on housekeeping capable cpus. Otherwise
> we will be generating noise that we would like to avoid on the isolated
> processors.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/kernel/time/clocksource.c
> ===================================================================
> --- linux.orig/kernel/time/clocksource.c
> +++ linux/kernel/time/clocksource.c
> @@ -269,9 +269,10 @@ static void clocksource_watchdog(unsigne
>   	 * Cycle through CPUs to check if the CPUs stay synchronized
>   	 * to each other.
>   	 */
> -	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
> +	next_cpu = cpumask_next_and(raw_smp_processor_id(), cpu_online_mask, housekeeping_cpumask());
>   	if (next_cpu >= nr_cpu_ids)
> -		next_cpu = cpumask_first(cpu_online_mask);
> +		next_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask());
> +
>   	watchdog_timer.expires += WATCHDOG_INTERVAL;
>   	add_timer_on(&watchdog_timer, next_cpu);
>   out:

Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
on purpose?  For now I just took your change, but tweaked it to add the
equivalent diff with cpumask_first_and() there.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 19:49                                 ` Chris Metcalf
@ 2016-07-27 19:53                                   ` Christoph Lameter
  2016-07-27 19:58                                     ` Chris Metcalf
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-07-27 19:53 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, 27 Jul 2016, Chris Metcalf wrote:

> Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
> on purpose?  For now I just took your change, but tweaked it to add the
> equivalent diff with cpumask_first_and() there.

Can the watchdog be started on an isolated cpu at all? I would expect that
the code would start a watchdog only on a housekeeping cpu.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 19:53                                   ` Christoph Lameter
@ 2016-07-27 19:58                                     ` Chris Metcalf
       [not found]                                       ` <2fefa17d-37c6-9669-724e-9ee0d841e7b2-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Metcalf @ 2016-07-27 19:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 3:53 PM, Christoph Lameter wrote:
> On Wed, 27 Jul 2016, Chris Metcalf wrote:
>
>> Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
>> on purpose?  For now I just took your change, but tweaked it to add the
>> equivalent diff with cpumask_first_and() there.
> Can the watchdog be started on an isolated cpu at all? I would expect that
> the code would start a watchdog only on a housekeeping cpu.

The code just starts the watchdog initially on the first online cpu.
In principle you could have configured that as an isolated cpu, so
without any change to that code, you'd interrupt that cpu.

I guess another way to slice it would be to start the watchdog on the
current core.  But just using the same idiom as in clocksource_watchdog()
seems cleanest to me.

I added your patch to the series and pushed it up (along with adding your
Tested-by to the x86 enablement commit).  It's still based on 4.6 so I'll need
to rebase it once the merge window closes.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]                                       ` <2fefa17d-37c6-9669-724e-9ee0d841e7b2-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-07-29 18:31                                         ` Francis Giraldeau
       [not found]                                           ` <CAC6yHM4LON5ASooVa_eUaDYsN1W0HYTMX76yHDxf8Mff0mKqiA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Francis Giraldeau @ 2016-07-29 18:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

I tested this patch on 4.7 and confirm that irq_work no longer occurs on
the isolated cpu. Thanks!

I don't know of any utility to test the task isolation feature, so I started
one:

    https://github.com/giraldeau/taskisol

The script exp.sh runs taskisol to test five different conditions, but some
of the behavior is not what I would expect.

At startup, it does:
 - register a custom signal handler for SIGUSR1
 - sched_setaffinity() on CPU 1, which is isolated
 - mlockall(MCL_CURRENT) to prevent undesired page faults

The default strict mode is set with:

    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE)

Then the syscall write() is called. From the previous discussion, a SIGKILL
should be sent, but it does not occur. When instead of calling write() we force
a page fault, the SIGKILL is correctly sent.

When we instead set a custom signal handler with SIGUSR1:

    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_USERSIG |
                      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));

The signal is never delivered, neither when the syscall is issued nor when the
page fault occurs.

I can confirm that, if two taskisol are created on the same CPU, the second one
fails with Resource temporarily unavailable, so that's fine.

I can add more test cases depending on your comments, such as the TLB events
triggered by another thread on a non-isolated core. But maybe there is already
a test suite?

Francis

2016-07-27 15:58 GMT-04:00 Chris Metcalf <cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>:
> On 7/27/2016 3:53 PM, Christoph Lameter wrote:
>>
>> On Wed, 27 Jul 2016, Chris Metcalf wrote:
>>
>>> Looks good.  Did you omit the equivalent fix in
>>> clocksource_start_watchdog()
>>> on purpose?  For now I just took your change, but tweaked it to add the
>>> equivalent diff with cpumask_first_and() there.
>>
>> Can the watchdog be started on an isolated cpu at all? I would expect that
>> the code would start a watchdog only on a housekeeping cpu.
>
>
> The code just starts the watchdog initially on the first online cpu.
> In principle you could have configured that as an isolated cpu, so
> without any change to that code, you'd interrupt that cpu.
>
> I guess another way to slice it would be to start the watchdog on the
> current core.  But just using the same idiom as in clocksource_watchdog()
> seems cleanest to me.
>
> I added your patch to the series and pushed it up (along with adding your
> Tested-by to the x86 enablement commit).  It's still based on 4.6 so I'll
> need
> to rebase it once the merge window closes.
>
>
> --
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]                                           ` <CAC6yHM4LON5ASooVa_eUaDYsN1W0HYTMX76yHDxf8Mff0mKqiA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-07-29 21:04                                             ` Chris Metcalf
  0 siblings, 0 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-07-29 21:04 UTC (permalink / raw)
  To: Francis Giraldeau
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 7/29/2016 2:31 PM, Francis Giraldeau wrote:
> I tested this patch on 4.7 and confirm that irq_work no longer occurs on
> the isolated cpu. Thanks!

Great!  Let me know if you'd like me to add your Tested-by in the patch series.

> I don't know of any utility to test the task isolation feature, so I started
> one:
>
>      https://github.com/giraldeau/taskisol
>
> The script exp.sh runs taskisol to test five different conditions, but some
> of the behavior is not what I would expect.
>
> At startup, it does:
>   - register a custom signal handler for SIGUSR1
>   - sched_setaffinity() on CPU 1, which is isolated
>   - mlockall(MCL_CURRENT) to prevent undesired page faults
>
> The default strict mode is set with:
>
>      prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE)
>
> Then the syscall write() is called. From the previous discussion, a SIGKILL
> should be sent, but it does not occur. When instead of calling write() we force
> a page fault, the SIGKILL is correctly sent.

This looks like it may be a bug in the x86-specific part of the kernel support.
On tilegx and arm64, running your test does the right thing:

# ./taskisol default syscall
taskisol run
taskisol/1855: task_isolation mode lost due to syscall 64
Killed

I think the x86 support doesn't properly return right away from a bad
syscall.  The patch below should fix that; can you try it?  However, it's
not clear to me why the signal isn't getting delivered.  Perhaps you can
try adding some tracing to the syscall_trace_enter() path and see if we're
actually running this code as expected?  Thank you!  :-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -90,8 +90,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
  
      /* In isolation mode, we may prevent the syscall from running. */
      if (work & _TIF_TASK_ISOLATION) {
-        if (task_isolation_syscall(regs->orig_ax) == -1)
-            return -1;
+        if (task_isolation_syscall(regs->orig_ax) == -1) {
+            regs->orig_ax = -1;
+            return 0;
+        }
          work &= ~_TIF_TASK_ISOLATION;
      }

I updated my dataplane branch on kernel.org with this fix.

> When we instead set a custom signal handler with SIGUSR1:
>
>      prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_USERSIG |
>                        PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
>
> The signal is never delivered, neither when the syscall is issued nor when the
> page fault occurs.

This is a bug in your test program.  Try again with this fix:

--- a/taskisol.c
+++ b/taskisol.c
@@ -79,8 +79,9 @@ int main(int argc, char *argv[])
           * The program completes when using USERSIG,
           * but actually no signal is delivered
           */
-        if (strcmp(argv[1], "signal") == 0) {
-            if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_USERSIG |
+        else if (strcmp(argv[1], "signal") == 0) {
+            if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
+                      PR_TASK_ISOLATION_USERSIG |
                        PR_TASK_ISOLATION_SET_SIG(SIGUSR1)) < 0) {
                  perror("prctl sigusr");
                  return -1;

The prctl() API is intended to be one-shot, i.e. you set all the state you
want with a single prctl().  The next call to prctl() will reset the state
to whatever you specify (including if you don't specify "enable").

(Also, as a side note, I'd expect your Makefile to invoke $(CC) for taskisol,
not $(CXX) - there doesn't seem to be any actual C++ in the program.)

> I can confirm that, if two taskisol are created on the same CPU, the second one
> fails with Resource temporarily unavailable, so that's fine.
>
> I can add more test cases depending on your comments, such as the TLB events
> triggered by another thread on a non-isolated core. But maybe there is already
> a test suite?

The appended code is what I've been using as a test harness.  It passes on
tilegx and arm64.  No guarantees as to production-level code quality :-)

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
#include <errno.h>
#include <sched.h>
#include <pthread.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION   // Not in system headers yet?
# define PR_SET_TASK_ISOLATION		48
# define PR_GET_TASK_ISOLATION		49
# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
     (PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
#endif

// The cpu we are using for isolation tests.
static int task_isolation_cpu;

// Overall status, maintained as tests run.
static int exit_status = EXIT_SUCCESS;

// Set affinity to a single cpu.
int set_my_cpu(int cpu)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(cpu_set_t), &set);
}

// Run a child process in task isolation mode and report its status.
// The child does mlockall() and moves itself to the task isolation cpu.
// It then runs SETUP_FUNC (if specified), calls prctl(PR_SET_TASK_ISOLATION, )
// with FLAGS (if non-zero), and then invokes TEST_FUNC and exits
// with its status.
static int run_test(void (*setup_func)(), int (*test_func)(), int flags)
{
	fflush(stdout);
	int pid = fork();
	assert(pid >= 0);
	if (pid != 0) {
		// In parent; wait for child and return its status.
		int status;
		waitpid(pid, &status, 0);
		return status;
	}

	// In child.
	int rc = mlockall(MCL_CURRENT);
	assert(rc == 0);
	rc = set_my_cpu(task_isolation_cpu);
	assert(rc == 0);
	if (setup_func)
		setup_func();
	if (flags) {
		do
			rc = prctl(PR_SET_TASK_ISOLATION, flags);
		while (rc != 0 && errno == EAGAIN);
		if (rc != 0) {
			printf("couldn't enable isolation (%d): FAIL\n", errno);
			exit(EXIT_FAILURE);
		}
	}
	rc = test_func();
	exit(rc);
}

// Run a test and ensure it is killed with SIGKILL by default,
// for whatever misdemeanor is committed in TEST_FUNC.
// Then rerun it requesting SIGUSR1, to check that user-specified
// signals work as well.
static void test_killed(const char *testname, void (*setup_func)(),
			int (*test_func)())
{
	int status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
		printf("%s: OK\n", testname);
	} else {
		printf("%s: FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}

	status = run_test(setup_func, test_func,
			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_USERSIG |
			  PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1) {
		printf("%s (SIGUSR1): OK\n", testname);
	} else {
		printf("%s (SIGUSR1): FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}
}

// Run a test and make sure it exits with success.
static void test_ok(const char *testname, void (*setup_func)(),
		    int (*test_func)())
{
	int status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
	if (status == EXIT_SUCCESS) {
		printf("%s: OK\n", testname);
	} else {
		printf("%s: FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}
}

// Run a test with no signals and make sure it exits with success.
static void test_nosig(const char *testname, void (*setup_func)(),
		       int (*test_func)())
{
	int status =
		run_test(setup_func, test_func,
			 PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
	if (status == EXIT_SUCCESS) {
		printf("%s: OK\n", testname);
	} else {
		printf("%s: FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}
}

// Mapping address passed from setup function to test function.
static char *fault_file_mapping;

// mmap() a file in so we can test touching an unmapped page.
static void setup_fault(void)
{
	char fault_file[] = "/tmp/isolation_XXXXXX";
	int fd = mkstemp(fault_file);
	assert(fd >= 0);
	int rc = ftruncate(fd, getpagesize());
	assert(rc == 0);
	fault_file_mapping = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
	assert(fault_file_mapping != MAP_FAILED);
	close(fd);
	unlink(fault_file);
}

// Now touch the not-yet-faulted page (and be killed by the page fault).
static int do_fault(void)
{
	*fault_file_mapping = 1;
	return EXIT_FAILURE;
}

// Make a syscall (and be killed).
static int do_syscall(void)
{
	write(STDOUT_FILENO, "goodbye, world\n", 15);
	return EXIT_FAILURE;
}

// Turn isolation back off and don't be killed.
static int do_syscall_off(void)
{
	prctl(PR_SET_TASK_ISOLATION, 0);
	write(STDOUT_FILENO, "==> hello, world\n", 17);
	return EXIT_SUCCESS;
}

// If we're not getting a signal, make sure we can do multiple system calls.
static int do_syscall_multi(void)
{
	write(STDOUT_FILENO, "==> hello, world 1\n", 19);
	write(STDOUT_FILENO, "==> hello, world 2\n", 19);
	return EXIT_SUCCESS;
}

#ifdef __aarch64__
/* ARM64 uses tlbi instructions, so it doesn't need to interrupt the remote core. */
static void test_munmap(void) {}
#else

// Start a thread that will munmap() after a short while.
// The munmap() will deliver a TLB flush to the task-isolation core.

static void *start_munmap(void *p)
{
	usleep(500000);   // 0.5s
	munmap(p, getpagesize());
	return 0;
}

static void setup_munmap(void)
{
	// First, go back to cpu 0 and allocate some memory.
	set_my_cpu(0);
	void *p = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
		       MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);
	assert(p != MAP_FAILED);

	// Now fire up a thread that will wait half a second on cpu 0
	// and then munmap the mapping.
	pthread_t thr;
	int rc = pthread_create(&thr, NULL, start_munmap, p);
	assert(rc == 0);

	// Back to the task-isolation cpu.
	set_my_cpu(task_isolation_cpu);
}

// Global variable to avoid the compiler outsmarting us.
volatile int munmap_spin;

static int do_munmap(void)
{
	while (munmap_spin < 1000000000)
		++munmap_spin;
	return EXIT_FAILURE;
}

static void test_munmap(void)
{
	test_killed("test_munmap", setup_munmap, do_munmap);
}
#endif

#ifdef __tilegx__
// Make an unaligned access (and be killed).
// Only for tilegx, since other platforms don't do in-kernel fixups.
static int
do_unaligned(void)
{
	static int buf[2];
	volatile int* addr = (volatile int *)((char *)buf + 1);

	*addr;	/* unaligned load; the in-kernel fixup gets us killed */

	asm("nop");
	return EXIT_FAILURE;
}

static void test_unaligned(void)
{
	test_killed("test_unaligned", NULL, do_unaligned);
}
#else
static void test_unaligned(void) {}
#endif

// Fork a process that will spin annoyingly on the same core
// for half a second.  Since prctl() won't succeed while another task
// is runnable on the cpu, we follow this handshake sequence:
//
// 1. Child (in setup_quiesce, here) starts up, advances *childstate so
//    the parent can see how far it has gotten, and busy-waits for the
//    parent to set *statep to 1.
// 2. Parent (in do_quiesce, below) enters isolation mode via prctl(),
//    sets *statep to 1, and spins waiting for it to become 2.
// 3. Child sees *statep become 1, migrates to the task-isolation cpu,
//    sets *statep to 2, and spins for half a second before exiting.
//    While it spins, the parent cannot be scheduled back in.
// 4. Parent sees *statep become 2, then makes one syscall.  The syscall
//    should take about half a second, since the parent stays trapped in
//    the kernel until the child stops spinning.

volatile int *statep, *childstate;
struct timeval quiesce_start, quiesce_end;
int child_pid;

static void setup_quiesce(void)
{
	// First, go back to cpu 0 and allocate some shared memory.
	set_my_cpu(0);
	statep = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
		      MAP_ANONYMOUS|MAP_SHARED, 0, 0);
	assert(statep != MAP_FAILED);
	childstate = statep + 1;

	gettimeofday(&quiesce_start, NULL);

	// Fork and fault in all memory in both.
	child_pid = fork();
	assert(child_pid >= 0);
	if (child_pid == 0)
		*childstate = 1;
	int rc = mlockall(MCL_CURRENT);
	assert(rc == 0);
	if (child_pid != 0) {
		set_my_cpu(task_isolation_cpu);
		return;
	}

	// In child.  Wait until parent notifies us that it has completed
	// its prctl, then jump to its cpu and let it know.
	*childstate = 2;
	while (*statep == 0)
		;
	*childstate = 3;
	//  printf("child: jumping to cpu %d\n", task_isolation_cpu);
	set_my_cpu(task_isolation_cpu);
	//  printf("child: jumped to cpu %d\n", task_isolation_cpu);
	*statep = 2;
	*childstate = 4;

	// Now we are competing for the runqueue on task_isolation_cpu.
	// Spin for half a second to ensure the parent gets caught in kernel space.
	struct timeval start, tv;
	gettimeofday(&start, NULL);
	while (1) {
		gettimeofday(&tv, NULL);
		double time = (tv.tv_sec - start.tv_sec) +
			(tv.tv_usec - start.tv_usec) / 1000000.0;
		if (time >= 0.5)
			exit(0);
	}
}

static int do_quiesce(void)
{
	double time;
	int rc;

	rc = prctl(PR_SET_TASK_ISOLATION,
		   PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
	if (rc != 0) {
		prctl(PR_SET_TASK_ISOLATION, 0);
		printf("prctl failed: rc %d\n", rc);
		goto fail;
	}
	*statep = 1;
     
	// Wait for child to come disturb us.
	while (*statep == 1) {
		gettimeofday(&quiesce_end, NULL);
		time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
			(quiesce_end.tv_usec - quiesce_start.tv_usec)/1000000.0;
		if (time > 0.1 && *statep == 1)	{
			prctl(PR_SET_TASK_ISOLATION, 0);
			printf("timed out at %gs in child migrate loop (%d)\n",
			       time, *childstate);
			char buf[100];
			sprintf(buf, "cat /proc/%d/stack", child_pid);
			system(buf);
			goto fail;
		}
	}
	assert(*statep == 2);

	// At this point the child is spinning, so any interrupt will keep us
	// in kernel space.  Make a syscall to make sure it happens at least
	// once during the second that the child is spinning.
	kill(0, 0);
	gettimeofday(&quiesce_end, NULL);
	prctl(PR_SET_TASK_ISOLATION, 0);
	time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
		(quiesce_end.tv_usec - quiesce_start.tv_usec) / 1000000.0;
	if (time < 0.4 || time > 0.6) {
		printf("expected ~0.5s wait after quiesce: was %g\n", time);
		goto fail;
	}
	kill(child_pid, SIGKILL);
	return EXIT_SUCCESS;

fail:
	kill(child_pid, SIGKILL);
	return EXIT_FAILURE;
}

int main(int argc, char **argv)
{
	/* How many seconds to wait after running the other tests? */
	double waittime;
	if (argc == 1)
		waittime = 10;
	else if (argc == 2)
		waittime = strtof(argv[1], NULL);
	else {
		printf("syntax: isolation [seconds]\n");
		exit(EXIT_FAILURE);
	}

	/* Test that the /sys device is present and pick a cpu. */
	FILE *f = fopen("/sys/devices/system/cpu/task_isolation", "r");
	if (f == NULL) {
		printf("/sys device: FAIL\n");
		exit(EXIT_FAILURE);
	}
	char buf[100];
	char *result = fgets(buf, sizeof(buf), f);
	assert(result == buf);
	fclose(f);
	char *end;
	task_isolation_cpu = strtol(buf, &end, 10);
	assert(end != buf);
	assert(*end == ',' || *end == '-' || *end == '\n');
	assert(task_isolation_cpu >= 0);
	printf("/sys device: OK\n");

	// Check that prctl() fails if no single-cpu affinity mask is set.
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
	    errno != EINVAL) {
		printf("prctl unaffinitized: FAIL\n");
		exit_status = EXIT_FAILURE;
	} else {
		printf("prctl unaffinitized: OK\n");
	}

	// Or if affinitized to the wrong cpu.
	set_my_cpu(0);
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
	    errno != EINVAL) {
		printf("prctl on cpu 0: FAIL\n");
		exit_status = EXIT_FAILURE;
	} else {
		printf("prctl on cpu 0: OK\n");
	}

	// Run the tests.
	test_killed("test_fault", setup_fault, do_fault);
	test_killed("test_syscall", NULL, do_syscall);
	test_munmap();
	test_unaligned();
	test_ok("test_off", NULL, do_syscall_off);
	test_nosig("test_multi", NULL, do_syscall_multi);
	test_nosig("test_quiesce", setup_quiesce, do_quiesce);

	// Exit failure if any test failed.
	if (exit_status != EXIT_SUCCESS)
		return exit_status;

	// Wait for however long was requested on the command line.
	// Note that this requires a vDSO implementation of gettimeofday();
	// if it's not available, we could just spin a fixed number of
	// iterations instead.
	struct timeval start, tv;
	gettimeofday(&start, NULL);
	while (1) {
		gettimeofday(&tv, NULL);
		double time = (tv.tv_sec - start.tv_sec) +
			(tv.tv_usec - start.tv_usec) / 1000000.0;
		if (time >= waittime)
			break;
	}

	return EXIT_SUCCESS;
}

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 13:55                 ` clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode) Christoph Lameter
  2016-07-27 14:12                   ` Chris Metcalf
@ 2016-08-10 22:16                   ` Frederic Weisbecker
  2016-08-10 22:26                     ` Chris Metcalf
  2016-08-11  8:40                     ` Peter Zijlstra
  1 sibling, 2 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2016-08-10 22:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, Jul 27, 2016 at 08:55:28AM -0500, Christoph Lameter wrote:
> On Mon, 25 Jul 2016, Christoph Lameter wrote:
> 
> > Guess so. I will have a look at this when I get some time again.
> 
> Ok so the problem is the clocksource_watchdog() function in
> kernel/time/clocksource.c. This function is active if
> CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the clocksources of
> each processor for being within bounds and then reschedule itself on the
> next one.
> 
> The purpose of the function seems to be to determine *if* a clocksource is
> unstable. It does not mean that the clocksource *is* unstable.
> 
> The critical piece of code is this:
> 
>         /*
>          * Cycle through CPUs to check if the CPUs stay synchronized
>          * to each other.
>          */
>         next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>         if (next_cpu >= nr_cpu_ids)
>                 next_cpu = cpumask_first(cpu_online_mask);
>         watchdog_timer.expires += WATCHDOG_INTERVAL;
>         add_timer_on(&watchdog_timer, next_cpu);
> 
> 
> Should we just cycle through the cpus that are not isolated? Otherwise we
> need to have some means to check the clocksources for accuracy remotely
> (probably impossible for TSC etc).
> 
> The WATCHDOG_INTERVAL is 1 second so this causes an interrupt every
> second.
> 
> Note that we are running with the patch that removes the 1 HZ minimum time
> tick. With an older kernel code base (redhat) we can keep the kernel quiet
> for minutes. The clocksource watchdog causes timers to fire again.

I had similar issues, this seems to happen when the tsc is considered not reliable
(which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
flag).

IIRC, this _has_ to execute on all online CPUs because every TSCs of running CPUs
are concerned.

I personally override that with passing the tsc=reliable kernel parameter. Of course
use it at your own risk.

But eventually I don't think we can offline that to housekeeping only CPUs.


* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-10 22:16                   ` Frederic Weisbecker
@ 2016-08-10 22:26                     ` Chris Metcalf
  2016-08-11  8:40                     ` Peter Zijlstra
  1 sibling, 0 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-08-10 22:26 UTC (permalink / raw)
  To: Frederic Weisbecker, Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 8/10/2016 6:16 PM, Frederic Weisbecker wrote:
> On Wed, Jul 27, 2016 at 08:55:28AM -0500, Christoph Lameter wrote:
>> On Mon, 25 Jul 2016, Christoph Lameter wrote:
>>
>>> Guess so. I will have a look at this when I get some time again.
>> Ok so the problem is the clocksource_watchdog() function in
>> kernel/time/clocksource.c. This function is active if
>> CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the clocksources of
>> each processor for being within bounds and then reschedule itself on the
>> next one.
>>
>> The purpose of the function seems to be to determine *if* a clocksource is
>> unstable. It does not mean that the clocksource *is* unstable.
>>
>> The critical piece of code is this:
>>
>>          /*
>>           * Cycle through CPUs to check if the CPUs stay synchronized
>>           * to each other.
>>           */
>>          next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>>          if (next_cpu >= nr_cpu_ids)
>>                  next_cpu = cpumask_first(cpu_online_mask);
>>          watchdog_timer.expires += WATCHDOG_INTERVAL;
>>          add_timer_on(&watchdog_timer, next_cpu);
>>
>>
>> Should we just cycle through the cpus that are not isolated? Otherwise we
>> need to have some means to check the clocksources for accuracy remotely
>> (probably impossible for TSC etc).
>>
>> The WATCHDOG_INTERVAL is 1 second so this causes an interrupt every
>> second.
>>
>> Note that we are running with the patch that removes the 1 HZ minimum time
>> tick. With an older kernel code base (redhat) we can keep the kernel quiet
>> for minutes. The clocksource watchdog causes timers to fire again.
> I had similar issues, this seems to happen when the tsc is considered not reliable
> (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> flag).
>
> IIRC, this _has_ to execute on all online CPUs because every TSCs of running CPUs
> are concerned.
>
> I personally override that with passing the tsc=reliable kernel parameter. Of course
> use it at your own risk.
>
> But eventually I don't think we can offline that to housekeeping only CPUs.

Maybe the eventual model here is that as task-isolation cores
re-enter the kernel, they catch a hook that tells them to go
call the unreliable-tsc stuff and see what the state of it is.

This would be the same hook that we could use to defer
kernel TLB flushes, also.

The hard part is that on some platforms it may be fairly
intrusive to get all the hooks in.  Arm64 has a nice consistent
set of assembly routines for entering the kernel, which is how it
manages context tracking as well, but I fear that x86 may have
many more entry paths.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-22 12:50           ` Chris Metcalf
  2016-07-25 16:35             ` Christoph Lameter
@ 2016-08-11  8:27             ` Peter Zijlstra
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2016-08-11  8:27 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Fri, Jul 22, 2016 at 08:50:44AM -0400, Chris Metcalf wrote:
> On 7/21/2016 10:20 PM, Christoph Lameter wrote:
> >On Thu, 21 Jul 2016, Chris Metcalf wrote:
> >>On 7/20/2016 10:04 PM, Christoph Lameter wrote:
> >>unstable, and then scheduling work to safely remove that timer.
> >>I haven't looked at this code before (in kernel/time/clocksource.c
> >>under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
> >>arm64 and tile aren't unstable.  Is it possible to boot your machine
> >>with a stable clocksource?
> >It already has a stable clocksource. Sorry but that was one of the criteria
> >for the server when we ordered them. Could this be clock adjustments?
> 
> We probably need to get clock folks to jump in on this thread!

Boot with: tsc=reliable, this disables the watchdog.

We (sadly) have to have this thing running on most x86 because the TSC,
even if initially stable, can do weird things once it's running.

We have seen:

 - SMI
 - hotplug
 - suspend
 - multi-socket

mess up the TSC, even if it was deemed 'good' at boot time.

If you _know_ your TSC to be solid, boot with tsc=reliable and be happy.


* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-10 22:16                   ` Frederic Weisbecker
  2016-08-10 22:26                     ` Chris Metcalf
@ 2016-08-11  8:40                     ` Peter Zijlstra
  2016-08-11 11:58                       ` Frederic Weisbecker
  2016-08-11 16:00                       ` Paul E. McKenney
  1 sibling, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2016-08-11  8:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Chris Metcalf, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> I had similar issues, this seems to happen when the tsc is considered not reliable
> (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> flag).

Right, as per the other email, in general we cannot know/assume the TSC
to be working as intended :/

> IIRC, this _has_ to execute on all online CPUs because every TSCs of running CPUs
> are concerned.

With modern Intel we could run it on one CPU per package I think, but at
the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
it doesn't make sense to me to keep the watchdog running, when it
triggers it would also have to kill all NOHZ_FULL stuff, which would
probably bring the entire machine down..

Arguably we should issue a boot time warning if NOHZ_FULL is configured
and the TSC watchdog is running.

> I personally override that with passing the tsc=reliable kernel
> parameter. Of course use it at your own risk.

Yes, that is (sadly) our only option. Manually assert our hardware is
solid under the intended workload and then manually disabling the
watchdog.



* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11  8:40                     ` Peter Zijlstra
@ 2016-08-11 11:58                       ` Frederic Weisbecker
  2016-08-15 15:03                         ` Chris Metcalf
  2016-08-11 16:00                       ` Paul E. McKenney
  1 sibling, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2016-08-11 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Chris Metcalf, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 10:40:02AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> > I had similar issues, this seems to happen when the tsc is considered not reliable
> > (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> > flag).
> 
> Right, as per the other email, in general we cannot know/assume the TSC
> to be working as intended :/

Yeah, I remember you explained me that a little while ago.

> 
> > IIRC, this _has_ to execute on all online CPUs because every TSCs of running CPUs
> > are concerned.
> 
> With modern Intel we could run it on one CPU per package I think, but at
> the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> it doesn't make sense to me to keep the watchdog running, when it
> triggers it would also have to kill all NOHZ_FULL stuff, which would
> probably bring the entire machine down..
> 
> Arguably we should issue a boot time warning if NOHZ_FULL is configured
> and the TSC watchdog is running.

That's a very good idea! We do that when the tsc is unstable, but indeed
we can't seriously run NOHZ_FULL on a non-reliable tsc.

I'll take care of that warning.

> 
> > I personally override that with passing the tsc=reliable kernel
> > parameter. Of course use it at your own risk.
> 
> Yes, that is (sadly) our only option. Manually assert our hardware is
> solid under the intended workload and then manually disabling the
> watchdog.

Right, I'll mention that in the warning.

Thanks for those details!


* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11  8:40                     ` Peter Zijlstra
  2016-08-11 11:58                       ` Frederic Weisbecker
@ 2016-08-11 16:00                       ` Paul E. McKenney
  2016-08-11 23:02                         ` Christoph Lameter
  1 sibling, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2016-08-11 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Christoph Lameter, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 10:40:02AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> > I had similar issues, this seems to happen when the tsc is considered not reliable
> > (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> > flag).
> 
> Right, as per the other email, in general we cannot know/assume the TSC
> to be working as intended :/
> 
> > IIRC, this _has_ to execute on all online CPUs because every TSCs of running CPUs
> > are concerned.
> 
> With modern Intel we could run it on one CPU per package I think, but at
> the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> it doesn't make sense to me to keep the watchdog running, when it
> triggers it would also have to kill all NOHZ_FULL stuff, which would
> probably bring the entire machine down..

Well, you -could- force a very low priority CPU-bound task to run on
all nohz_full CPUs.  Not necessarily a good idea, but a relatively
non-intrusive response to that particular error condition.

							Thanx, Paul

> Arguably we should issue a boot time warning if NOHZ_FULL is configured
> and the TSC watchdog is running.
> 
> > I personally override that with passing the tsc=reliable kernel
> > parameter. Of course use it at your own risk.
> 
> Yes, that is (sadly) our only option. Manually assert our hardware is
> solid under the intended workload and then manually disabling the
> watchdog.
> 



* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 16:00                       ` Paul E. McKenney
@ 2016-08-11 23:02                         ` Christoph Lameter
  2016-08-11 23:47                           ` Paul E. McKenney
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-08-11 23:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Frederic Weisbecker, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, 11 Aug 2016, Paul E. McKenney wrote:

> > With modern Intel we could run it on one CPU per package I think, but at
> > the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> > it doesn't make sense to me to keep the watchdog running, when it
> > triggers it would also have to kill all NOHZ_FULL stuff, which would
> > probably bring the entire machine down..
>
> Well, you -could- force a very low priority CPU-bound task to run on
> all nohz_full CPUs.  Not necessarily a good idea, but a relatively
> non-intrusive response to that particular error condition.

Given that we want the cpu to run only the user task, I would think that
is not a good idea.


* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 23:02                         ` Christoph Lameter
@ 2016-08-11 23:47                           ` Paul E. McKenney
  2016-08-12 14:23                             ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2016-08-11 23:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Frederic Weisbecker, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 06:02:34PM -0500, Christoph Lameter wrote:
> On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> 
> > > With modern Intel we could run it on one CPU per package I think, but at
> > > the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> > > it doesn't make sense to me to keep the watchdog running, when it
> > > triggers it would also have to kill all NOHZ_FULL stuff, which would
> > > probably bring the entire machine down..
> >
> > Well, you -could- force a very low priority CPU-bound task to run on
> > all nohz_full CPUs.  Not necessarily a good idea, but a relatively
> > non-intrusive response to that particular error condition.
> 
> Given that we want the cpu to run only the user task, I would think that
> is not a good idea.

Heh!  The only really good idea is for clocks to be reliably in sync.

But if they go out of sync, what do you want to do instead?

							Thanx, Paul



* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 23:47                           ` Paul E. McKenney
@ 2016-08-12 14:23                             ` Christoph Lameter
       [not found]                               ` <alpine.DEB.2.20.1608120922450.20310-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2016-08-12 14:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Frederic Weisbecker, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, 11 Aug 2016, Paul E. McKenney wrote:

> Heh!  The only really good idea is for clocks to be reliably in sync.
>
> But if they go out of sync, what do you want to do instead?

For a NOHZ task? Write a message to the syslog and reenable tick.



* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]                               ` <alpine.DEB.2.20.1608120922450.20310-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2016-08-12 14:26                                 ` Frederic Weisbecker
  2016-08-12 16:19                                   ` Paul E. McKenney
  0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2016-08-12 14:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> 
> > Heh!  The only really good idea is for clocks to be reliably in sync.
> >
> > But if they go out of sync, what do you want to do instead?
> 
> For a NOHZ task? Write a message to the syslog and reenable tick.

Indeed, a strong clocksource is a requirement for a full tickless machine.


* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-12 14:26                                 ` Frederic Weisbecker
@ 2016-08-12 16:19                                   ` Paul E. McKenney
       [not found]                                     ` <20160812161919.GV3482-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2016-08-12 16:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Peter Zijlstra, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Fri, Aug 12, 2016 at 04:26:13PM +0200, Frederic Weisbecker wrote:
> On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> > On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> > 
> > > Heh!  The only really good idea is for clocks to be reliably in sync.
> > >
> > > But if they go out of sync, what do you want to do instead?
> > 
> > For a NOHZ task? Write a message to the syslog and reenable tick.

Fair enough!  Kicking off a low-priority task would achieve the latter
but not necessarily the former.  And of course this assumes that the
worker thread is at real-time priority with various scheduler
anti-starvation features disabled.

> Indeed, a strong clocksource is a requirement for a full tickless machine.

No disagreement here!  ;-)

							Thanx, Paul



* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
       [not found]                                     ` <20160812161919.GV3482-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2016-08-13 15:39                                       ` Frederic Weisbecker
  0 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2016-08-13 15:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, Peter Zijlstra, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Aug 12, 2016 at 09:19:19AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 12, 2016 at 04:26:13PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> > > On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> > > 
> > > > Heh!  The only really good idea is for clocks to be reliably in sync.
> > > >
> > > > But if they go out of sync, what do you want to do instead?
> > > 
> > > For a NOHZ task? Write a message to the syslog and reenable tick.
> 
> Fair enough!  Kicking off a low-priority task would achieve the latter
> but not necessarily the former.  And of course this assumes that the
> worker thread is at real-time priority with various scheduler
> anti-starvation features disabled.
> 
> > Indeed, a strong clocksource is a requirement for a full tickless machine.
> 
> No disagreement here!  ;-)

I have a bot in my mind that randomly posts obvious statements about nohz_full
now and then :-)


* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 11:58                       ` Frederic Weisbecker
@ 2016-08-15 15:03                         ` Chris Metcalf
  0 siblings, 0 replies; 38+ messages in thread
From: Chris Metcalf @ 2016-08-15 15:03 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra, Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

On 8/11/2016 7:58 AM, Frederic Weisbecker wrote:
>> Arguably we should issue a boot time warning if NOHZ_FULL is configured
>> and the TSC watchdog is running.
> That's a very good idea! We do that when tsc is unstable but indeed we can't
> seriously run NOHZ_FULL on a non-reliable tsc.
>
> I'll take care of that warning.

Thanks.  So I will drop Christoph's patch to run the TSC watchdog on just
housekeeping cores and we will rely on the "boot time warning" instead.
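The warning agreed on above boils down to a single boot-time check: nohz_full was requested, but the current clocksource is one the watchdog may later demote. A userspace approximation (illustrative only — the real check belongs in kernel code; /proc/cmdline and the sysfs clocksource file are real interfaces, but the set of "reliable" clocksource names is an assumption for the example):

```python
# Clocksources generally considered stable enough for full tickless
# operation; this set is an illustrative assumption, not a kernel list.
RELIABLE_CLOCKSOURCES = {"tsc", "arch_sys_counter"}


def nohz_full_warning(cmdline, clocksource):
    """Return a warning string if nohz_full= is on the kernel command line
    while the current clocksource may be marked unstable by the watchdog;
    return None when there is nothing to warn about."""
    if "nohz_full=" not in cmdline:
        return None  # no full-tickless CPUs requested
    if clocksource in RELIABLE_CLOCKSOURCES:
        return None
    return ("nohz_full configured but clocksource '%s' may be marked "
            "unstable by the watchdog" % clocksource)


if __name__ == "__main__":
    try:
        cmdline = open("/proc/cmdline").read()
        cs = open("/sys/devices/system/clocksource/clocksource0/"
                  "current_clocksource").read().strip()
    except OSError:
        cmdline, cs = "", "unknown"  # not on Linux, or sysfs unavailable
    print(nohz_full_warning(cmdline, cs)
          or "ok: no nohz_full/clocksource conflict detected")
```

Run on a machine booted with nohz_full=2-7 and an hpet clocksource, this would print the warning; the in-kernel version Frederic proposed would instead emit it once at boot.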

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


end of thread, other threads:[~2016-08-15 15:03 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 04/12] task_isolation: add initial support Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 12/12] task_isolation: add user-settable notification signal Chris Metcalf
     [not found] ` <1468529299-27929-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-07-14 21:03   ` [PATCH v13 00/12] support "task_isolation" mode Andy Lutomirski
     [not found]     ` <CALCETrVddfd7ZDGpYs4CdkAMEmQCb6a-_5Um9bb4FO+XwWzOAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-07-14 21:22       ` Chris Metcalf
2016-07-18 22:11         ` Andy Lutomirski
2016-07-18 22:50           ` Chris Metcalf
2016-07-18  0:42       ` Christoph Lameter
2016-07-21  2:04   ` Christoph Lameter
     [not found]     ` <alpine.DEB.2.20.1607202059180.25838-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-07-21 14:06       ` Chris Metcalf
2016-07-22  2:20         ` Christoph Lameter
2016-07-22 12:50           ` Chris Metcalf
2016-07-25 16:35             ` Christoph Lameter
     [not found]               ` <alpine.DEB.2.20.1607251133450.25354-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-07-27 13:55                 ` clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode) Christoph Lameter
2016-07-27 14:12                   ` Chris Metcalf
     [not found]                     ` <f8d72e47-7e84-cbd0-869f-69bf452a8bfb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-07-27 15:23                       ` Christoph Lameter
     [not found]                         ` <alpine.DEB.2.20.1607271022130.25729-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-07-27 15:31                           ` Christoph Lameter
2016-07-27 17:06                             ` Chris Metcalf
2016-07-27 18:56                               ` Christoph Lameter
2016-07-27 19:49                                 ` Chris Metcalf
2016-07-27 19:53                                   ` Christoph Lameter
2016-07-27 19:58                                     ` Chris Metcalf
     [not found]                                       ` <2fefa17d-37c6-9669-724e-9ee0d841e7b2-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-07-29 18:31                                         ` Francis Giraldeau
     [not found]                                           ` <CAC6yHM4LON5ASooVa_eUaDYsN1W0HYTMX76yHDxf8Mff0mKqiA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-07-29 21:04                                             ` Chris Metcalf
2016-08-10 22:16                   ` Frederic Weisbecker
2016-08-10 22:26                     ` Chris Metcalf
2016-08-11  8:40                     ` Peter Zijlstra
2016-08-11 11:58                       ` Frederic Weisbecker
2016-08-15 15:03                         ` Chris Metcalf
2016-08-11 16:00                       ` Paul E. McKenney
2016-08-11 23:02                         ` Christoph Lameter
2016-08-11 23:47                           ` Paul E. McKenney
2016-08-12 14:23                             ` Christoph Lameter
     [not found]                               ` <alpine.DEB.2.20.1608120922450.20310-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-08-12 14:26                                 ` Frederic Weisbecker
2016-08-12 16:19                                   ` Paul E. McKenney
     [not found]                                     ` <20160812161919.GV3482-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2016-08-13 15:39                                       ` Frederic Weisbecker
2016-08-11  8:27             ` [PATCH v13 00/12] support "task_isolation" mode Peter Zijlstra
2016-07-27 14:01 ` Christoph Lameter
