* [PATCH v10 00/12] support "task_isolation" mode
@ 2016-03-02 20:09 Chris Metcalf
2016-03-02 20:09 ` [PATCH v10 04/12] task_isolation: add initial support Chris Metcalf
2016-03-02 20:09 ` [PATCH v10 06/12] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
0 siblings, 2 replies; 3+ messages in thread
From: Chris Metcalf @ 2016-03-02 20:09 UTC
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
Daniel Lezcano, linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf
Here is the latest version of the task-isolation patch set, adopting
various suggestions made about the v9 patch series, including some
feedback from testing on the new EZchip NPS ARC platform (although the
new arch/arc support is not included in this patch series). All of the
suggestions were relatively non-controversial.
Perhaps we are getting close to being able to merge this. :-)
Changes since v9:
- task_isolation is now set up by adding its required cpus to both the
nohz_full and isolcpus cpumasks. This allows users to separately
specify all three flags, if so desired, and still get reasonably
sane semantics. This is done with a new tick_nohz_full_add_cpus()
method for nohz, and just directly updating the isolcpus cpumask.
- We add a /sys/devices/system/cpu/task_isolation file to parallel
the equivalent nohz_full file. (This should have been in v8 since
once task_isolation isn't equivalent to nohz_full, it needs its
own way to let userspace know where to run.)
- We add a new Kconfig option, TASK_ISOLATION_ALL, which sets all but
the boot processor to run in task isolation mode. This parallels
the existing NO_HZ_FULL_ALL and works around the fact that you can't
easily specify a boot argument with the desired semantics.
- For task_isolation_debug, we add a check of the context_tracking
state of the remote cpu before issuing a warning; if the remote cpu
is actually in the kernel, we don't need to warn.
- A cloned child of a task_isolation task is not enabled for
task_isolation, since otherwise they would both fight over who could
safely return to userspace without requiring scheduling interrupts.
- The quiet_vmstat() function's semantics were changed since the v9
patch series, so I introduce a quiet_vmstat_sync() for isolation.
- The lru_add_drain_needed() function is updated to know about the new
lru_deactivate_pvecs variable.
- The arm64 patch factoring assembly into C has been modified based
on an earlier patch by Mark Rutland.
- I simplified the enabling patch for arm64 by realizing we could just
test TIF_NOHZ as the only bit for TIF_WORK_MASK for task isolation,
so I didn't have to renumber all the TIF_xxx bits.
- Small fixes to avoid preemption warnings.
- Rebased on v4.5-rc5
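For userspace that wants to consume the new
/sys/devices/system/cpu/task_isolation file listed above, here is a
minimal sketch of reading it. The kernel side prints the mask with
"%*pbl", i.e. the usual cpulist format ("1-3,5"). The helper names
parse_cpulist() and read_task_isolation_cpus() are illustrative only
and not part of this series, and the sketch assumes at most 64 cpus so
that the result fits in one word.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a cpulist such as "1-3,5\n" into a 64-bit mask.
 * Returns 0 on success, -1 on malformed input or cpu numbers above 63.
 * Hypothetical helper, not part of the patch series.
 */
int parse_cpulist(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;		/* no digits where a cpu was expected */
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi > 63)
			return -1;
		for (long cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;

		if (*end == ',')
			end++;			/* move on to the next entry */
		else if (*end && *end != '\n')
			return -1;		/* unexpected trailing character */
		s = end;
	}
	return 0;
}

/* Read the sysfs file added by this series; -1 if it is absent. */
int read_task_isolation_cpus(uint64_t *mask)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/task_isolation", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_cpulist(buf, mask);
}
```

An application would typically pick one cpu out of the returned mask
and pin itself there before asking for isolation.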
For changes in earlier versions of the patch series, please see:
http://lkml.kernel.org/r/1451936091-29247-1-git-send-email-cmetcalf@ezchip.com
A couple of the tile patches that refactored the context tracking
code were taken into 4.5 so are no longer present in this series.
This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.
It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
The series is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
Chris Metcalf (12):
vmstat: add quiet_vmstat_sync function
vmstat: add vmstat_idle function
lru_add_drain_all: factor out lru_add_drain_needed
task_isolation: add initial support
task_isolation: support CONFIG_TASK_ISOLATION_ALL
task_isolation: support PR_TASK_ISOLATION_STRICT mode
task_isolation: add debug boot flag
arm, tile: turn off timer tick for oneshot_stopped state
arch/x86: enable task isolation functionality
arch/tile: enable task isolation functionality
arm64: factor work_pending state machine to C
arch/arm64: enable task isolation functionality
Documentation/kernel-parameters.txt | 16 ++
arch/arm64/include/asm/thread_info.h | 8 +-
arch/arm64/kernel/entry.S | 12 +-
arch/arm64/kernel/ptrace.c | 12 +-
arch/arm64/kernel/signal.c | 34 ++++-
arch/arm64/kernel/smp.c | 2 +
arch/arm64/mm/fault.c | 4 +
arch/tile/kernel/process.c | 6 +-
arch/tile/kernel/ptrace.c | 6 +
arch/tile/kernel/single_step.c | 5 +
arch/tile/kernel/smp.c | 28 ++--
arch/tile/kernel/time.c | 1 +
arch/tile/kernel/unaligned.c | 3 +
arch/tile/mm/fault.c | 3 +
arch/tile/mm/homecache.c | 2 +
arch/x86/entry/common.c | 18 ++-
arch/x86/kernel/traps.c | 2 +
arch/x86/mm/fault.c | 2 +
drivers/base/cpu.c | 18 +++
drivers/clocksource/arm_arch_timer.c | 2 +
include/linux/context_tracking_state.h | 6 +
include/linux/isolation.h | 83 +++++++++++
include/linux/sched.h | 3 +
include/linux/swap.h | 1 +
include/linux/tick.h | 1 +
include/linux/vmstat.h | 4 +
include/uapi/linux/prctl.h | 8 +
init/Kconfig | 30 ++++
kernel/Makefile | 1 +
kernel/fork.c | 3 +
kernel/irq_work.c | 5 +-
kernel/isolation.c | 261 +++++++++++++++++++++++++++++++++
kernel/sched/core.c | 18 +++
kernel/signal.c | 5 +
kernel/smp.c | 6 +-
kernel/softirq.c | 33 +++++
kernel/sys.c | 9 ++
kernel/time/tick-sched.c | 31 ++--
mm/swap.c | 15 +-
mm/vmstat.c | 24 +++
40 files changed, 676 insertions(+), 55 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c
--
2.1.2
* [PATCH v10 04/12] task_isolation: add initial support
From: Chris Metcalf @ 2016-03-02 20:09 UTC
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.
However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.
This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well. The "task_isolation" state is then indicated by
setting a new task_struct field, task_isolation_flags, to the value
passed by prctl(). When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.
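As an illustration of the userspace side, here is a minimal sketch of
how a task might opt in. It assumes a kernel carrying this series and
booted with task_isolation=CPULIST; the prctl constants are copied
from this patch because no released uapi header defines them, and
enter_task_isolation() is a hypothetical helper name. The task must
be affinitized to exactly one task_isolation cpu first, or
task_isolation_set() returns -EINVAL. The usual route for pinning is
sched_setaffinity(2) with a cpu_set_t; the raw syscall is used here
only to keep the sketch free of feature-test macros.

```c
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values proposed by this patch; not present in released headers. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

#define MAX_CPUS	1024	/* arbitrary bound for this sketch */
#define BITS_PER_WORD	(8 * sizeof(unsigned long))

/*
 * Pin the calling thread to a single cpu, then request task isolation.
 * Returns 0 on success, -1 with errno set on failure.
 * Hypothetical helper, not part of the patch series.
 */
int enter_task_isolation(int cpu)
{
	unsigned long mask[MAX_CPUS / BITS_PER_WORD];

	if (cpu < 0 || cpu >= MAX_CPUS) {
		errno = EINVAL;
		return -1;
	}

	/* Build an affinity mask with exactly one bit set. */
	memset(mask, 0, sizeof(mask));
	mask[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
	if (syscall(SYS_sched_setaffinity, 0, sizeof(mask), mask) != 0)
		return -1;

	/* Fails with EINVAL on kernels without this series. */
	return prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);
}
```

On a kernel without CONFIG_TASK_ISOLATION the prctl() simply fails,
so callers can fall back to plain nohz_full behavior.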
The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent. It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state. In particular, if it sees that the scheduler
tick is still enabled, it reports that it is not yet safe.
Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache). In addition, it
requests rescheduling if the scheduler dyntick is still running.
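To make the interplay of the two routines concrete, here is a toy
userspace model of the return-to-userspace loop described above. It
is not kernel code: struct cpu_state and every function in it are
stand-ins invented for this sketch, and toy_schedule() simply assumes
that the tick can be stopped once a reschedule has happened. The real
ready check runs with interrupts disabled and the enter step with
interrupts enabled, which the model ignores.

```c
#include <stdbool.h>

/* Toy stand-in for per-cpu kernel state; all names are hypothetical. */
struct cpu_state {
	int lru_pending;	/* pages sitting in per-cpu pagevecs */
	int vmstat_pending;	/* per-cpu vmstat deltas not yet folded */
	bool tick_stopped;	/* has the scheduler tick been stopped? */
	bool need_resched;	/* models TIF_NEED_RESCHED */
};

/* Models _task_isolation_ready(): is it safe to return to userspace? */
static bool toy_isolation_ready(const struct cpu_state *s)
{
	return s->lru_pending == 0 && s->vmstat_pending == 0 &&
	       s->tick_stopped;
}

/* Models _task_isolation_enter(): quiesce what can be quiesced now. */
static void toy_isolation_enter(struct cpu_state *s)
{
	s->lru_pending = 0;		/* lru_add_drain() */
	s->vmstat_pending = 0;		/* quiet_vmstat_sync() */
	if (!s->tick_stopped)
		s->need_resched = true;	/* set_tsk_need_resched() */
}

/* Models schedule(); assumes the tick stops once nothing else runs. */
static void toy_schedule(struct cpu_state *s)
{
	s->need_resched = false;
	s->tick_stopped = true;
}

/*
 * Loop until no work is pending and the isolation check passes.
 * Returns the number of passes through the loop.
 */
int toy_exit_to_usermode(struct cpu_state *s)
{
	int passes = 0;

	while (s->need_resched || !toy_isolation_ready(s)) {
		passes++;
		if (s->need_resched)
			toy_schedule(s);
		toy_isolation_enter(s);
	}
	return passes;
}
```

The point of the model is only that the ready check acts as one more
loop condition alongside the TIF work bits, while the enter step does
the quiescing on each pass.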
As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow. However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.
Separate patches that follow provide these changes for x86, arm64,
and tile.
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
Documentation/kernel-parameters.txt | 8 +++
drivers/base/cpu.c | 18 ++++++
include/linux/isolation.h | 53 ++++++++++++++++
include/linux/sched.h | 3 +
include/linux/tick.h | 1 +
include/uapi/linux/prctl.h | 5 ++
init/Kconfig | 20 ++++++
kernel/Makefile | 1 +
kernel/fork.c | 3 +
kernel/isolation.c | 118 ++++++++++++++++++++++++++++++++++++
kernel/sys.c | 9 +++
kernel/time/tick-sched.c | 31 ++++++----
12 files changed, 257 insertions(+), 13 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9a53c929f017..c8d0b42d984a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3747,6 +3747,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
+ task_isolation= [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION=y, specify
+ the list of CPUs on which tasks will be able to use
+ prctl(PR_SET_TASK_ISOLATION) to set up task
+ isolation mode. Setting this boot flag implicitly
+ also sets up nohz_full and isolcpus mode for the
+ listed set of CPUs.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
#include <linux/of.h>
#include <linux/cpufeature.h>
#include <linux/tick.h>
+#include <linux/isolation.h>
#include "base.h"
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
#endif
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ int n = 0, len = PAGE_SIZE-2;
+
+ n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+ return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
static void cpu_device_release(struct device *dev)
{
/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
#ifdef CONFIG_NO_HZ_FULL
&dev_attr_nohz_full.attr,
#endif
+#ifdef CONFIG_TASK_ISOLATION
+ &dev_attr_task_isolation.attr,
+#endif
#ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
#endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..c564cf1886bb
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,53 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+ return tick_nohz_full_enabled() &&
+ cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+static inline bool task_isolation_enabled(void)
+{
+ return (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE) &&
+ task_isolation_possible(raw_smp_processor_id());
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+ return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+ if (task_isolation_enabled())
+ _task_isolation_enter();
+}
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..df6eb22510c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1830,6 +1830,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+ unsigned int task_isolation_flags;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 97fd4e543846..b4c0f4dc909f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -158,6 +158,7 @@ extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
#else
static inline int housekeeping_any_cpu(void)
{
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION 48
+#define PR_GET_TASK_ISOLATION 49
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 22320804fbaf..6cab348fe454 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,26 @@ config RCU_EXPEDITE_BOOT
endmenu # "RCU Subsystem"
+config TASK_ISOLATION
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on task_isolation
+ cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+ themselves from the kernel. On return to userspace,
+ isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace. This "hard" isolation from the kernel is
+ required for userspace tasks with hard real-time requirements,
+ such as a 10 Gbit network driver running in userspace.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a good-faith, "soft" effort to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
config BUILD_BIN2C
bool
default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
$(obj)/configs.o: $(obj)/config_data.h
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c754ae7..1890ea0dcd5b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1413,6 +1413,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->sequential_io = 0;
p->sequential_io_avg = 0;
#endif
+#ifdef CONFIG_TASK_ISOLATION
+ p->task_isolation_flags = 0; /* do not isolate */
+#endif
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..e954afd8cce8
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,118 @@
+/*
+ * linux/kernel/isolation.c
+ *
+ * Implementation for task isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed. Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+ saw_boot_arg = true;
+
+ alloc_bootmem_cpumask_var(&task_isolation_map);
+ if (cpulist_parse(str, task_isolation_map) < 0) {
+ pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+ return 1;
+ }
+
+ return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+ /* For offstack cpumask, ensure we allocate an empty cpumask early. */
+ if (!saw_boot_arg) {
+ zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+ return 0;
+ }
+
+ /*
+ * Add our task_isolation cpus to nohz_full and isolcpus. Note
+ * that we are called relatively early in boot, from tick_init();
+ * at this point neither nohz_full nor isolcpus has been used
+ * to configure the system, but isolcpus has been allocated
+ * already in sched_init().
+ */
+ tick_nohz_full_add_cpus(task_isolation_map);
+ cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+ return 0;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core or we will
+ * return EINVAL. Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+ if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+ !task_isolation_possible(raw_smp_processor_id()))
+ return -EINVAL;
+
+ current->task_isolation_flags = flags;
+ return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again. This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+
+ return (!lru_add_drain_needed(smp_processor_id()) &&
+ vmstat_idle() &&
+ tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+ WARN_ON_ONCE(irqs_disabled());
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat_sync();
+
+ /*
+ * Request rescheduling unless we are in full dynticks mode.
+ * We would eventually get pre-empted without this, and if there's
+ * another task waiting, it would run; but by explicitly requesting
+ * the reschedule, we reduce the latency. We could directly call
+ * schedule() here as well, but since our caller is the standard
+ * place where schedule() is called, we defer to the caller.
+ */
+ if (!tick_nohz_tick_stopped())
+ set_tsk_need_resched(current);
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 78947de6f969..c1d621f1137e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
#include <linux/syscore_ops.h>
#include <linux/version.h>
#include <linux/ctype.h>
+#include <linux/isolation.h>
#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_TASK_ISOLATION
+ case PR_SET_TASK_ISOLATION:
+ error = task_isolation_set(arg2);
+ break;
+ case PR_GET_TASK_ISOLATION:
+ error = me->task_isolation_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 0b17424349eb..14eb7dd06f26 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>
#include <asm/irq_regs.h>
@@ -311,30 +312,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
return NOTIFY_OK;
}
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
{
- int err = -1;
+ if (!cpumask_weight(mask))
+ return;
-#ifdef CONFIG_NO_HZ_FULL_ALL
- if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+ if (tick_nohz_full_mask == NULL &&
+ !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
- return err;
+ return;
}
- err = 0;
- cpumask_setall(tick_nohz_full_mask);
+
+ cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
tick_nohz_full_running = true;
-#endif
- return err;
}
void __init tick_nohz_init(void)
{
int cpu;
- if (!tick_nohz_full_running) {
- if (tick_nohz_init_all() < 0)
- return;
- }
+ task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+ if (!tick_nohz_full_running)
+ tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+ if (!tick_nohz_full_running)
+ return;
if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
--
2.1.2
* [PATCH v10 06/12] task_isolation: support PR_TASK_ISOLATION_STRICT mode
From: Chris Metcalf @ 2016-03-02 20:09 UTC
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.
By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.
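As a sketch of how the signal request is packed into the flags word,
the following uses the macros exactly as this patch defines them
(copied locally, since no released header carries them). The two
helper functions are hypothetical names for illustration; the second
mirrors the "0 means SIGKILL" default described above.

```c
#include <signal.h>

/* Values proposed by this series; not present in released headers. */
#ifndef PR_TASK_ISOLATION_ENABLE
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif
#ifndef PR_TASK_ISOLATION_STRICT
#define PR_TASK_ISOLATION_STRICT	(1 << 1)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
#define PR_TASK_ISOLATION_GET_SIG(bits)	(((bits) >> 8) & 0x7f)
#endif

/* Build the prctl() argument for strict mode with a chosen signal. */
unsigned int strict_flags_with_signal(int sig)
{
	return PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	       PR_TASK_ISOLATION_SET_SIG(sig);
}

/* Recover the signal to deliver; an unset field means SIGKILL. */
int signal_for_flags(unsigned int flags)
{
	int sig = PR_TASK_ISOLATION_GET_SIG(flags);

	return sig ? sig : SIGKILL;
}
```

A task that prefers to be notified rather than killed would pass
strict_flags_with_signal(SIGUSR1) as the second prctl() argument and
install a handler for SIGUSR1 beforehand.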
To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
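The exemption rule can be modeled in userspace as below. This is
only an illustration of the policy, using the SYS_* numbers from
<sys/syscall.h>; strict_mode_allows() is a hypothetical name, and note
that its return value is inverted relative to the kernel's
task_isolation_syscall(), which returns true when the syscall is a
violation.

```c
#include <stdbool.h>
#include <sys/syscall.h>

/*
 * Returns true if a task in strict mode may issue this syscall
 * without being signalled, per the rule described above.
 * Hypothetical helper, not part of the patch series.
 */
bool strict_mode_allows(long nr)
{
	switch (nr) {
	case SYS_prctl:		/* needed to clear the bit again */
	case SYS_exit:		/* let the task exit quietly */
	case SYS_exit_group:
		return true;
	default:
		return false;	/* any other kernel entry is fatal */
	}
}
```

Everything else, including apparently harmless calls such as
getpid(), counts as a violation in this model.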
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
include/linux/isolation.h | 25 +++++++++++++++++++
include/uapi/linux/prctl.h | 3 +++
kernel/isolation.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 88 insertions(+)
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index c564cf1886bb..ba6c4d510db8 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -42,12 +42,37 @@ static inline void task_isolation_enter(void)
_task_isolation_enter();
}
+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+ return ((current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) &&
+ task_isolation_possible(raw_smp_processor_id());
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+ return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...) \
+ do { \
+ if (task_isolation_strict()) \
+ task_isolation_exception(fmt, ## __VA_ARGS__); \
+ } while (0)
+
#else
static inline void task_isolation_init(void) { }
static inline bool task_isolation_possible(int cpu) { return false; }
static inline bool task_isolation_enabled(void) { return false; }
static inline bool task_isolation_ready(void) { return true; }
static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
#endif
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
#define PR_SET_TASK_ISOLATION 48
#define PR_GET_TASK_ISOLATION 49
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 42ad7a746a1e..5621fdf15b17 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
#include <linux/vmstat.h>
#include <linux/isolation.h>
#include <linux/syscalls.h>
+#include <asm/unistd.h>
#include "time/tick-sched.h"
cpumask_var_t task_isolation_map;
@@ -122,3 +123,62 @@ void _task_isolation_enter(void)
if (!tick_nohz_tick_stopped())
set_tsk_need_resched(current);
}
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+ siginfo_t info = {};
+ int sig;
+
+ pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+ task->comm, task->pid, buf);
+
+ sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+
+ /*
+ * Turn off task isolation mode entirely (after reading the signal
+ * above) to avoid spamming the process with signals. It can
+ * re-enable task isolation mode in the signal handler if it wants.
+ */
+ task->task_isolation_flags = 0;
+ if (sig == 0)
+ sig = SIGKILL;
+ info.si_signo = sig;
+ send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ /* RCU should have been enabled prior to this point. */
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+
+ task_isolation_interrupt(current, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return false;
+ }
+
+ task_isolation_exception("syscall %d", syscall);
+ return true;
+}
--
2.1.2