* [PATCH v14 02/14] vmstat: add vmstat_idle function [not found] <1470774596-17341-1-git-send-email-cmetcalf@mellanox.com> @ 2016-08-09 20:29 ` Chris Metcalf 2016-08-09 20:29 ` [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf 2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf 2 siblings, 0 replies; 13+ messages in thread From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-mm, linux-kernel Cc: Chris Metcalf This function checks to see if a vmstat worker is not running, and the vmstat diffs don't require an update. The function is called from the task-isolation code to see if we need to actually do some work to quiet vmstat. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 10 ++++++++++ 2 files changed, 12 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index fab62aa74079..69b6cc4be909 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -235,6 +235,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); void quiet_vmstat_sync(void); +bool vmstat_idle(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -338,6 +339,7 @@ static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } static inline void quiet_vmstat_sync(void) { } +static inline bool vmstat_idle(void) { return true; } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 57fc29750da6..7dd17c06d3a7 100644 --- 
a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1763,6 +1763,16 @@ void quiet_vmstat_sync(void) } /* + * Report on whether vmstat processing is quiesced on the core currently: + * no vmstat worker running and no vmstat updates to perform. + */ +bool vmstat_idle(void) +{ + return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) && + !need_update(smp_processor_id()); +} + +/* * Shepherd worker thread that checks the * differentials of processors that have their worker * threads for vm statistics updates disabled because of -- 2.7.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 13+ messages in thread
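The intended call pattern — using vmstat_idle() as a cheap guard before invoking the heavier quiet_vmstat_sync() — can be sketched in plain C. The following is an illustrative userspace simulation, with hypothetical stand-ins for the per-cpu worker and differentials; it is not the kernel implementation:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for one cpu's vmstat state. */
struct cpu_vmstat {
	bool worker_pending;	/* models delayed_work_pending(&vmstat_work) */
	int diffs[4];		/* models the per-cpu stat differentials */
};

/* Models need_update(): any differentials left to fold? */
static bool need_update(const struct cpu_vmstat *c)
{
	for (int i = 0; i < 4; i++)
		if (c->diffs[i] != 0)
			return true;
	return false;
}

/* Mirrors the patch: idle == no worker pending && no diffs to fold. */
static bool vmstat_idle(const struct cpu_vmstat *c)
{
	return !c->worker_pending && !need_update(c);
}

/* Models quiet_vmstat_sync(): fold the diffs and cancel the worker. */
static void quiet_vmstat_sync(struct cpu_vmstat *c)
{
	memset(c->diffs, 0, sizeof(c->diffs));
	c->worker_pending = false;
}

/* The guard pattern: only pay for quiescing when there is work to do. */
static void quiesce_if_needed(struct cpu_vmstat *c)
{
	if (!vmstat_idle(c))
		quiet_vmstat_sync(c);
}
```

The point of the predicate is exactly this guard: a caller on the return-to-userspace path can skip the synchronous quiesce entirely when the core is already quiet.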
* [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed [not found] <1470774596-17341-1-git-send-email-cmetcalf@mellanox.com> 2016-08-09 20:29 ` [PATCH v14 02/14] vmstat: add vmstat_idle function Chris Metcalf @ 2016-08-09 20:29 ` Chris Metcalf 2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf 2 siblings, 0 replies; 13+ messages in thread From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-mm, linux-kernel Cc: Chris Metcalf This per-cpu check was being done in the loop in lru_add_drain_all(), but having it be callable for a particular cpu is helpful for the task-isolation patches. Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> --- include/linux/swap.h | 1 + mm/swap.c | 15 ++++++++++----- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index b17cc4830fa6..58966a235298 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -295,6 +295,7 @@ extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); +extern bool lru_add_drain_needed(int cpu); extern void lru_add_drain_all(void); extern void rotate_reclaimable_page(struct page *page); extern void deactivate_file_page(struct page *page); diff --git a/mm/swap.c b/mm/swap.c index 75c63bb2a1da..a2be6f0931b5 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -655,6 +655,15 @@ void deactivate_page(struct page *page) } } +bool lru_add_drain_needed(int cpu) +{ + return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || + pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || + pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) || + 
pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || + need_activate_page_drain(cpu)); +} + void lru_add_drain(void) { lru_add_drain_cpu(get_cpu()); @@ -699,11 +708,7 @@ void lru_add_drain_all(void) for_each_online_cpu(cpu) { struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); - if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || - pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || - pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) || - pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || - need_activate_page_drain(cpu)) { + if (lru_add_drain_needed(cpu)) { INIT_WORK(work, lru_add_drain_per_cpu); queue_work_on(cpu, lru_add_drain_wq, work); cpumask_set_cpu(cpu, &has_work); -- 2.7.2
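The refactoring above simply lifts the open-coded pagevec checks into a named predicate so that other callers (the later task-isolation patches) can test a single cpu. The shape of the pattern, as a self-contained sketch with hypothetical arrays standing in for the per-cpu pagevecs and drain workqueue:

```c
#include <stdbool.h>

#define NCPUS 4
#define NPVECS 4

/* Hypothetical stand-ins for the per-cpu pagevec counts. */
static int pvec_count[NCPUS][NPVECS];
static bool activate_drain_pending[NCPUS];

/* The factored-out predicate, mirroring lru_add_drain_needed():
 * true iff this cpu has any cached pages that a drain would flush. */
static bool lru_add_drain_needed(int cpu)
{
	for (int i = 0; i < NPVECS; i++)
		if (pvec_count[cpu][i])
			return true;
	return activate_drain_pending[cpu];
}

/* Models the loop in lru_add_drain_all(): queue drain work only on
 * the cpus that actually need it.  Returns how many were queued. */
static int queue_drain_work(bool queued[NCPUS])
{
	int n = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++) {
		queued[cpu] = lru_add_drain_needed(cpu);
		if (queued[cpu])
			n++;
	}
	return n;
}
```

Factoring the check out keeps lru_add_drain_all() behavior identical while letting an isolated core ask "would a drain on my cpu be a no-op?" without scheduling any work.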
* [PATCH v14 04/14] task_isolation: add initial support [not found] <1470774596-17341-1-git-send-email-cmetcalf@mellanox.com> 2016-08-09 20:29 ` [PATCH v14 02/14] vmstat: add vmstat_idle function Chris Metcalf 2016-08-09 20:29 ` [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf @ 2016-08-09 20:29 ` Chris Metcalf 2016-08-11 18:11 ` Frederic Weisbecker 2 siblings, 1 reply; 13+ messages in thread From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate task_isolation=CPULIST boot argument, which enables nohz_full and isolcpus as well. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in thread_info flags. 
When task isolation is enabled for a task, and it is returning to userspace on a task isolation core, it calls the new task_isolation_ready() / task_isolation_enter() routines to take additional actions to help the task avoid being interrupted in the future. The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is set in prepare_exit_to_usermode() or its architectural equivalent, and forces the loop to retry if the system is not ready. It is called with interrupts disabled and inspects the kernel state to determine if it is safe to return into an isolated state. In particular, if it sees that the scheduler tick is still enabled, it reports that it is not yet safe. Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION is set, we call the new task_isolation_enter() routine. This takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache). In addition, it requests rescheduling if the scheduler dyntick is still running. Once the task has returned to userspace after issuing the prctl(), if it enters the kernel again via system call, page fault, or any of a number of other synchronous traps, the kernel will kill it with SIGKILL. For system calls, this test is performed immediately before the SECCOMP test and causes the syscall to return immediately with ENOSYS. To allow the state to be entered and exited, the syscall checking test ignores the prctl() syscall so that we can clear the bit again later, and ignores exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. A new /sys/devices/system/cpu/task_isolation pseudo-file is added, parallel to the comparable nohz_full file. Separate patches that follow provide these changes for x86, tile, and arm64. 
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> --- Documentation/kernel-parameters.txt | 8 ++ drivers/base/cpu.c | 18 +++ include/linux/isolation.h | 60 ++++++++++ include/linux/sched.h | 3 + include/linux/tick.h | 2 + include/uapi/linux/prctl.h | 5 + init/Kconfig | 27 +++++ kernel/Makefile | 1 + kernel/fork.c | 3 + kernel/isolation.c | 217 ++++++++++++++++++++++++++++++++++++ kernel/signal.c | 8 ++ kernel/sys.c | 9 ++ kernel/time/tick-sched.c | 36 +++--- 13 files changed, 384 insertions(+), 13 deletions(-) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 46c030a49186..7f1336b50dcc 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3943,6 +3943,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted. neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. + task_isolation= [KNL] + In kernels built with CONFIG_TASK_ISOLATION=y, set + the specified list of CPUs where cpus will be able + to use prctl(PR_SET_TASK_ISOLATION) to set up task + isolation mode. Setting this boot flag implicitly + also sets up nohz_full and isolcpus mode for the + listed set of cpus. + tcpmhash_entries= [KNL,NET] Set the number of tcp_metrics_hash slots. 
Default value is 8192 or 16384 depending on total diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 691eeea2f19a..eaf40f4264ee 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -17,6 +17,7 @@ #include <linux/of.h> #include <linux/cpufeature.h> #include <linux/tick.h> +#include <linux/isolation.h> #include "base.h" @@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev, static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL); #endif +#ifdef CONFIG_TASK_ISOLATION +static ssize_t print_cpus_task_isolation(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + int n = 0, len = PAGE_SIZE-2; + + n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map)); + + return n; +} +static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL); +#endif + static void cpu_device_release(struct device *dev) { /* @@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = { #ifdef CONFIG_NO_HZ_FULL &dev_attr_nohz_full.attr, #endif +#ifdef CONFIG_TASK_ISOLATION + &dev_attr_task_isolation.attr, +#endif #ifdef CONFIG_GENERIC_CPU_AUTOPROBE &dev_attr_modalias.attr, #endif diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..d9288b85b41f --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,60 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION + +/* cpus that are configured to support task isolation */ +extern cpumask_var_t task_isolation_map; + +extern int task_isolation_init(void); + +static inline bool task_isolation_possible(int cpu) +{ + return task_isolation_map != NULL && + cpumask_test_cpu(cpu, task_isolation_map); +} + +extern int task_isolation_set(unsigned int flags); + +extern bool task_isolation_ready(void); +extern void task_isolation_enter(void); + +static inline void 
task_isolation_set_flags(struct task_struct *p, + unsigned int flags) +{ + p->task_isolation_flags = flags; + + if (flags & PR_TASK_ISOLATION_ENABLE) + set_tsk_thread_flag(p, TIF_TASK_ISOLATION); + else + clear_tsk_thread_flag(p, TIF_TASK_ISOLATION); +} + +extern int task_isolation_syscall(int nr); + +/* Report on exceptions that don't cause a signal for the user process. */ +extern void _task_isolation_quiet_exception(const char *fmt, ...); +#define task_isolation_quiet_exception(fmt, ...) \ + do { \ + if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \ + _task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \ + } while (0) + +#else +static inline void task_isolation_init(void) { } +static inline bool task_isolation_possible(int cpu) { return false; } +static inline bool task_isolation_ready(void) { return true; } +static inline void task_isolation_enter(void) { } +static inline void task_isolation_set_flags(struct task_struct *p, + unsigned int flags) { } +static inline int task_isolation_syscall(int nr) { return 0; } +static inline void task_isolation_quiet_exception(const char *fmt, ...) 
{ } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 62c68e513e39..77dc12cd4fe8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1923,6 +1923,9 @@ struct task_struct { #ifdef CONFIG_MMU struct task_struct *oom_reaper_list; #endif +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/linux/tick.h b/include/linux/tick.h index 62be0786d6d0..fbd81e322860 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal, extern void tick_nohz_full_kick_cpu(int cpu); extern void __tick_nohz_task_switch(void); +extern void tick_nohz_full_add_cpus(const struct cpumask *mask); +extern bool can_stop_my_full_tick(void); #else static inline int housekeeping_any_cpu(void) { diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a8d0759a9e40..2a49d0d2940a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -197,4 +197,9 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */ +#define PR_SET_TASK_ISOLATION 48 +#define PR_GET_TASK_ISOLATION 49 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index 69886493ff1e..85a4b6dd26f2 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -786,6 +786,33 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config HAVE_ARCH_TASK_ISOLATION + bool + +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION + help + Allow userspace processes to place themselves on task_isolation + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. 
Prior to returning to userspace, + isolated tasks will arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. By default, attempting to re-enter the kernel + while in this mode will cause the task to be terminated + with a signal; you must explicitly use prctl() to disable + task isolation before resuming normal use of the kernel. + + This "hard" isolation from the kernel is required for + userspace tasks that are running hard real-time tasks in + userspace, such as a 10 Gbit network driver in userspace. + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a best-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. + config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index e2ec54e2b952..91ff1615f4d6 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_HAS_IOMEM) += memremap.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/fork.c b/kernel/fork.c index 52e725d4a866..54542266d7a8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -76,6 +76,7 @@ #include <linux/compiler.h> #include <linux/sysctl.h> #include <linux/kcov.h> +#include <linux/isolation.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -1533,6 +1534,8 @@ static struct task_struct *copy_process(unsigned long clone_flags, #endif clear_all_latency_tracing(p); + task_isolation_set_flags(p, 0); + /* ok, now we should be set up.. 
*/ p->pid = pid_nr(pid); if (clone_flags & CLONE_THREAD) { diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..bf3ebb0a727c --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,217 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include <linux/syscalls.h> +#include <asm/unistd.h> +#include <asm/syscall.h> +#include "time/tick-sched.h" + +cpumask_var_t task_isolation_map; +static bool saw_boot_arg; + +/* + * Isolation requires both nohz and isolcpus support from the scheduler. + * We provide a boot flag that enables both for now, and which we can + * add other functionality to over time if needed. Note that just + * specifying "nohz_full=... isolcpus=..." does not enable task isolation. + */ +static int __init task_isolation_setup(char *str) +{ + saw_boot_arg = true; + + alloc_bootmem_cpumask_var(&task_isolation_map); + if (cpulist_parse(str, task_isolation_map) < 0) { + pr_warn("task_isolation: Incorrect cpumask '%s'\n", str); + return 1; + } + + return 1; +} +__setup("task_isolation=", task_isolation_setup); + +int __init task_isolation_init(void) +{ + /* For offstack cpumask, ensure we allocate an empty cpumask early. */ + if (!saw_boot_arg) { + zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL); + return 0; + } + + /* + * Add our task_isolation cpus to nohz_full and isolcpus. Note + * that we are called relatively early in boot, from tick_init(); + * at this point neither nohz_full nor isolcpus has been used + * to configure the system, but isolcpus has been allocated + * already in sched_init(). + */ + tick_nohz_full_add_cpus(task_isolation_map); + cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map); + + return 0; +} + +/* + * Get a snapshot of whether, at this moment, it would be possible to + * stop the tick. 
This test normally requires interrupts disabled since + * the condition can change if an interrupt is delivered. However, in + * this case we are using it in an advisory capacity to see if there + * is anything obviously indicating that the task isolation + * preconditions have not been met, so it's OK that in principle it + * might not still be true later in the prctl() syscall path. + */ +static bool can_stop_my_full_tick_now(void) +{ + bool ret; + + local_irq_disable(); + ret = can_stop_my_full_tick(); + local_irq_enable(); + return ret; +} + +/* + * This routine controls whether we can enable task-isolation mode. + * The task must be affinitized to a single task_isolation core, or + * else we return EINVAL. And, it must be at least statically able to + * stop the nohz_full tick (e.g., no other schedulable tasks currently + * running, no POSIX cpu timers currently set up, etc.); if not, we + * return EAGAIN. + */ +int task_isolation_set(unsigned int flags) +{ + if (flags != 0) { + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || + !task_isolation_possible(raw_smp_processor_id())) { + /* Invalid task affinity setting. */ + return -EINVAL; + } + if (!can_stop_my_full_tick_now()) { + /* System not yet ready for task isolation. */ + return -EAGAIN; + } + } + + task_isolation_set_flags(current, flags); + return 0; +} + +/* + * In task isolation mode we try to return to userspace only after + * attempting to make sure we won't be interrupted again. This test + * is run with interrupts disabled to test that everything we need + * to be true is true before we can return to userspace. 
+ */ +bool task_isolation_ready(void) +{ + WARN_ON_ONCE(!irqs_disabled()); + + return (!lru_add_drain_needed(smp_processor_id()) && + vmstat_idle() && + tick_nohz_tick_stopped()); +} + +/* + * Each time we try to prepare for return to userspace in a process + * with task isolation enabled, we run this code to quiesce whatever + * subsystems we can readily quiesce to avoid later interrupts. + */ +void task_isolation_enter(void) +{ + WARN_ON_ONCE(irqs_disabled()); + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat_sync(); + + /* + * Request rescheduling unless we are in full dynticks mode. + * We would eventually get pre-empted without this, and if + * there's another task waiting, it would run; but by + * explicitly requesting the reschedule, we may reduce the + * latency. We could directly call schedule() here as well, + * but since our caller is the standard place where schedule() + * is called, we defer to the caller. + * + * A more substantive approach here would be to use a struct + * completion here explicitly, and complete it when we shut + * down dynticks, but since we presumably have nothing better + * to do on this core anyway, just spinning seems plausible. + */ + if (!tick_nohz_tick_stopped()) + set_tsk_need_resched(current); +} + +static void task_isolation_deliver_signal(struct task_struct *task, + const char *buf) +{ + siginfo_t info = {}; + + info.si_signo = SIGKILL; + + /* + * Report on the fact that isolation was violated for the task. + * It may not be the task's fault (e.g. a TLB flush from another + * core) but we are not blaming it, just reporting that it lost + * its isolation status. + */ + pr_warn("%s/%d: task_isolation mode lost due to %s\n", + task->comm, task->pid, buf); + + /* Turn off task isolation mode to avoid further isolation callbacks. 
*/ + task_isolation_set_flags(task, 0); + + send_sig_info(info.si_signo, &info, task); +} + +/* + * This routine is called from any userspace exception that doesn't + * otherwise trigger a signal to the user process (e.g. simple page fault). + */ +void _task_isolation_quiet_exception(const char *fmt, ...) +{ + struct task_struct *task = current; + va_list args; + char buf[100]; + + /* RCU should have been enabled prior to this point. */ + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); + + va_start(args, fmt); + vsnprintf(buf, sizeof(buf), fmt, args); + va_end(args); + + task_isolation_deliver_signal(task, buf); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in), and prevents most syscalls from executing and raises a + * signal to notify the process. + */ +int task_isolation_syscall(int syscall) +{ + char buf[20]; + + if (syscall == __NR_prctl || + syscall == __NR_exit || + syscall == __NR_exit_group) + return 0; + + snprintf(buf, sizeof(buf), "syscall %d", syscall); + task_isolation_deliver_signal(current, buf); + + syscall_set_return_value(current, current_pt_regs(), + -ERESTARTNOINTR, -1); + return -1; +} diff --git a/kernel/signal.c b/kernel/signal.c index af21afc00d08..895f547ff66f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -34,6 +34,7 @@ #include <linux/compat.h> #include <linux/cn_proc.h> #include <linux/compiler.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/signal.h> @@ -2213,6 +2214,13 @@ relock: /* Trace actually delivered signals. */ trace_signal_deliver(signr, &ksig->info, ka); + /* + * Disable task isolation when delivering a signal. + * The isolation model requires users to reset task + * isolation from the signal handler if desired. + */ + task_isolation_set_flags(current, 0); + if (ka->sa.sa_handler == SIG_IGN) /* Do nothing. 
*/ continue; if (ka->sa.sa_handler != SIG_DFL) { diff --git a/kernel/sys.c b/kernel/sys.c index 89d5be418157..4df84af425e3 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -41,6 +41,7 @@ #include <linux/syscore_ops.h> #include <linux/version.h> #include <linux/ctype.h> +#include <linux/isolation.h> #include <linux/compat.h> #include <linux/syscalls.h> @@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + error = task_isolation_set(arg2); + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 204fdc86863d..a6e29527743e 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -23,6 +23,7 @@ #include <linux/irq_work.h> #include <linux/posix-timers.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/irq_regs.h> @@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts) return true; } +bool can_stop_my_full_tick(void) +{ + return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched)); +} + static void nohz_full_kick_func(struct irq_work *work) { /* Empty, the tick restart happens on tick_nohz_irq_exit() */ @@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb, return NOTIFY_OK; } -static int tick_nohz_init_all(void) +void tick_nohz_full_add_cpus(const struct cpumask *mask) { - int err = -1; + if (!cpumask_weight(mask)) + return; -#ifdef CONFIG_NO_HZ_FULL_ALL - if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) { + if (tick_nohz_full_mask == NULL && + !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) { WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n"); - return err; + return; } - err = 0; - cpumask_setall(tick_nohz_full_mask); + + cpumask_or(tick_nohz_full_mask, 
tick_nohz_full_mask, mask); tick_nohz_full_running = true; -#endif - return err; } void __init tick_nohz_init(void) { int cpu; - if (!tick_nohz_full_running) { - if (tick_nohz_init_all() < 0) - return; - } + task_isolation_init(); + +#ifdef CONFIG_NO_HZ_FULL_ALL + if (!tick_nohz_full_running) + tick_nohz_full_add_cpus(cpu_possible_mask); +#endif + + if (!tick_nohz_full_running) + return; if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) { WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n"); -- 2.7.2
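From userspace, the interface added by this patch is a plain prctl(). A minimal usage sketch follows; the PR_* values are copied from the uapi hunk above and defined locally since stock headers do not carry them, and on a kernel without this patch the prctl() simply fails with EINVAL, so the helper degrades gracefully:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/prctl.h>

/* Values from this patch's <uapi/linux/prctl.h> hunk; defined locally
 * because unpatched headers do not have them. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_GET_TASK_ISOLATION		49
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

/*
 * Try to enter task-isolation mode on one (task_isolation= boot arg)
 * core.  Returns 0 on success, -errno on failure: -EINVAL if the cpu
 * is not an isolated cpu (or the kernel lacks this patch), -EAGAIN if
 * the core is not yet quiesced and the caller should retry.
 */
static int enter_isolation(int cpu)
{
	cpu_set_t set;

	/* The kernel requires affinity to exactly one isolated core. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return -errno;

	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
		  0, 0, 0) != 0)
		return -errno;
	return 0;
}

/* Leave the mode again; per the patch, prctl() is one of the few
 * syscalls that remains legal while isolated. */
static int leave_isolation(void)
{
	return prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0) ? -errno : 0;
}
```

After a successful enter_isolation(), the task would run its busy loop entirely in userspace; any other syscall or trap would terminate it until leave_isolation() is called.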
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf @ 2016-08-11 18:11 ` Frederic Weisbecker 2016-08-11 18:50 ` Christoph Lameter 0 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2016-08-11 18:11 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On Tue, Aug 09, 2016 at 04:29:46PM -0400, Chris Metcalf wrote: > +/* > + * Each time we try to prepare for return to userspace in a process > + * with task isolation enabled, we run this code to quiesce whatever > + * subsystems we can readily quiesce to avoid later interrupts. > + */ > +void task_isolation_enter(void) > +{ > + WARN_ON_ONCE(irqs_disabled()); > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + /* Quieten the vmstat worker so it won't interrupt us. */ > + quiet_vmstat_sync(); So, this is going to be called everytime we resume to userspace while in task isolation mode, right? Do we need to quiesce vmstat everytime before entering userspace? I thought that vmstat only need to be offlined once and for all? And how about lru? > + > + /* > + * Request rescheduling unless we are in full dynticks mode. > + * We would eventually get pre-empted without this, and if > + * there's another task waiting, it would run; but by > + * explicitly requesting the reschedule, we may reduce the > + * latency. We could directly call schedule() here as well, > + * but since our caller is the standard place where schedule() > + * is called, we defer to the caller. 
> + * > + * A more substantive approach > + * completion here explicitly, and complete it when we shut > + * down dynticks, but since we presumably have nothing better > + * to do on this core anyway, just spinning seems plausible. > + */ > + if (!tick_nohz_tick_stopped()) > + set_tsk_need_resched(current); Again, that won't help :-)
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-11 18:11 ` Frederic Weisbecker @ 2016-08-11 18:50 ` Christoph Lameter 2016-08-15 14:59 ` Chris Metcalf 0 siblings, 1 reply; 13+ messages in thread From: Christoph Lameter @ 2016-08-11 18:50 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On Thu, 11 Aug 2016, Frederic Weisbecker wrote: > Do we need to quiesce vmstat everytime before entering userspace? > I thought that vmstat only need to be offlined once and for all? Once is sufficient after disabling the tick.
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-11 18:50 ` Christoph Lameter @ 2016-08-15 14:59 ` Chris Metcalf 2016-08-30 0:55 ` Frederic Weisbecker 0 siblings, 1 reply; 13+ messages in thread From: Chris Metcalf @ 2016-08-15 14:59 UTC (permalink / raw) To: Christoph Lameter, Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On 8/11/2016 2:50 PM, Christoph Lameter wrote: > On Thu, 11 Aug 2016, Frederic Weisbecker wrote: > >> Do we need to quiesce vmstat everytime before entering userspace? >> I thought that vmstat only need to be offlined once and for all? > Once is sufficient after disabling the tick. It's true that task_isolation_enter() is called every time before returning to user space while task isolation is enabled. But once we enter the kernel again after returning from the initial prctl() -- assuming we are in NOSIG mode so doing so is legal in the first place -- almost anything can happen, certainly including restarting the tick. Thus, we have to make sure that normal quiescing happens again before we return to userspace. For vmstat, you're right that it's somewhat heavyweight to do the quiesce, and if we don't need it, it's wasted time on the return path. So I will add a guard call to the new vmstat_idle() before invoking quiet_vmstat_sync(). This slows down the path where it turns out we do need to quieten vmstat, but not by too much. The LRU quiesce is quite light-weight. We just check pagevec_count() on a handful of pagevec's, confirm they are all zero, and return without further work. So for that one, adding a separate lru_add_drain_needed() guard test would just be wasted effort. 
The thing to remember is that this is only relevant if the user has explicitly requested the NOSIG behavior from task isolation, which we don't really expect to be the default - we are implicitly encouraging use of the default semantics of "you can't enter the kernel again until you turn off isolation". > > + if (!tick_nohz_tick_stopped()) > > + set_tsk_need_resched(current); > > Again, that won't help It won't be better than spinning in a loop if there aren't any other schedulable processes, but it won't be worse either. If there is another schedulable process, we at least will schedule it sooner than if we just sat in a busy loop and waited for the scheduler to kick us. But there's nothing else we can do anyway if we want to maintain the guarantee that the dyn tick is stopped before return to userspace. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-15 14:59 ` Chris Metcalf @ 2016-08-30 0:55 ` Frederic Weisbecker 2016-08-30 15:41 ` Chris Metcalf 0 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2016-08-30 0:55 UTC (permalink / raw) To: Chris Metcalf Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote: > On 8/11/2016 2:50 PM, Christoph Lameter wrote: > >On Thu, 11 Aug 2016, Frederic Weisbecker wrote: > > > >>Do we need to quiesce vmstat everytime before entering userspace? > >>I thought that vmstat only need to be offlined once and for all? > >Once is sufficient after disabling the tick. > > It's true that task_isolation_enter() is called every time before > returning to user space while task isolation is enabled. > > But once we enter the kernel again after returning from the initial > prctl() -- assuming we are in NOSIG mode so doing so is legal in the > first place -- almost anything can happen, certainly including > restarting the tick. Thus, we have to make sure that normal quiescing > happens again before we return to userspace. Yes but we need to sort out what needs to be called only once on prctl(). Once vmstat is quiesced, it's not going to need quiescing again even if we restart the tick. > > For vmstat, you're right that it's somewhat heavyweight to do the > quiesce, and if we don't need it, it's wasted time on the return path. > So I will add a guard call to the new vmstat_idle() before invoking > quiet_vmstat_sync(). This slows down the path where it turns out we > do need to quieten vmstat, but not by too much. Why not do this on prctl() only? > The LRU quiesce is quite light-weight. 
We just check pagevec_count() > on a handful of pagevec's, confirm they are all zero, and return > without further work. So for that one, adding a separate > lru_add_drain_needed() guard test would just be wasted effort. Ok if this one is justified, like LRU may need update everytime we re-enter the kernel, then we can keep it (I can't tell, I don't know much about -mm). > The thing to remember is that this is only relevant if the user has > explicitly requested the NOSIG behavior from task isolation, which we > don't really expect to be the default - we are implicitly encouraging > use of the default semantics of "you can't enter the kernel again > until you turn off isolation". That's right. Although NOSIG is the only thing we can afford as long as we drag around the 1Hz. > > >> + if (!tick_nohz_tick_stopped()) > >> + set_tsk_need_resched(current); > >> Again, that won't help > > It won't be better than spinning in a loop if there aren't any other > schedulable processes, but it won't be worse either. If there is > another schedulable process, we at least will schedule it sooner than > if we just sat in a busy loop and waited for the scheduler to kick > us. But there's nothing else we can do anyway if we want to maintain > the guarantee that the dyn tick is stopped before return to userspace. I don't think it helps either way. If reschedule is pending, the current task already has TIF_RESCHED set.
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-30 0:55 ` Frederic Weisbecker @ 2016-08-30 15:41 ` Chris Metcalf 2016-08-30 17:10 ` Frederic Weisbecker 0 siblings, 1 reply; 13+ messages in thread From: Chris Metcalf @ 2016-08-30 15:41 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On 8/29/2016 8:55 PM, Frederic Weisbecker wrote: > On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote: >> On 8/11/2016 2:50 PM, Christoph Lameter wrote: >>> On Thu, 11 Aug 2016, Frederic Weisbecker wrote: >>> >>>> Do we need to quiesce vmstat everytime before entering userspace? >>>> I thought that vmstat only need to be offlined once and for all? >>> Once is sufficient after disabling the tick. >> It's true that task_isolation_enter() is called every time before >> returning to user space while task isolation is enabled. >> >> But once we enter the kernel again after returning from the initial >> prctl() -- assuming we are in NOSIG mode so doing so is legal in the >> first place -- almost anything can happen, certainly including >> restarting the tick. Thus, we have to make sure that normal quiescing >> happens again before we return to userspace. > Yes but we need to sort out what needs to be called only once on prctl(). > > Once vmstat is quiesced, it's not going to need quiescing again even if we > restart the tick. That's true, but I really do like the idea of having a clean structure where we verify all our prerequisites in task_isolation_ready(), and have code to try to get things fixed up in task_isolation_enter(). If we start moving some things here and some things there, it gets harder to manage. 
I think by testing "!vmstat_idle()" in task_isolation_enter() we are avoiding any substantial overhead. I think it would be clearer to rename task_isolation_enter() to task_isolation_prepare(); it might be less confusing. Remember too that in general, we really don't need to think about return-to-userspace as a hot path for task isolation, unlike how we think about it all the rest of the time. So it makes sense to prioritize keeping things clean from a software development perspective first, and high-performance only secondarily. >> The thing to remember is that this is only relevant if the user has >> explicitly requested the NOSIG behavior from task isolation, which we >> don't really expect to be the default - we are implicitly encouraging >> use of the default semantics of "you can't enter the kernel again >> until you turn off isolation". > That's right. Although NOSIG is the only thing we can afford as long as > we drag around the 1Hz. True enough. Hopefully we'll finish sorting that out soon enough. >>>> + if (!tick_nohz_tick_stopped()) >>>> + set_tsk_need_resched(current); >>>> Again, that won't help >> It won't be better than spinning in a loop if there aren't any other >> schedulable processes, but it won't be worse either. If there is >> another schedulable process, we at least will schedule it sooner than >> if we just sat in a busy loop and waited for the scheduler to kick >> us. But there's nothing else we can do anyway if we want to maintain >> the guarantee that the dyn tick is stopped before return to userspace. > I don't think it helps either way. If reschedule is pending, the current > task already has TIF_RESCHED set. See the other thread with Peter Z for the longer discussion of this. 
At this point I'm leaning towards replacing the set_tsk_need_resched() with set_current_state(TASK_INTERRUPTIBLE); schedule(); __set_current_state(TASK_RUNNING); -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-30 15:41 ` Chris Metcalf @ 2016-08-30 17:10 ` Frederic Weisbecker 2016-08-30 17:36 ` Chris Metcalf 0 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2016-08-30 17:10 UTC (permalink / raw) To: Chris Metcalf Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote: > On 8/29/2016 8:55 PM, Frederic Weisbecker wrote: > >On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote: > >>On 8/11/2016 2:50 PM, Christoph Lameter wrote: > >>>On Thu, 11 Aug 2016, Frederic Weisbecker wrote: > >>> > >>>>Do we need to quiesce vmstat everytime before entering userspace? > >>>>I thought that vmstat only need to be offlined once and for all? > >>>Once is sufficient after disabling the tick. > >>It's true that task_isolation_enter() is called every time before > >>returning to user space while task isolation is enabled. > >> > >>But once we enter the kernel again after returning from the initial > >>prctl() -- assuming we are in NOSIG mode so doing so is legal in the > >>first place -- almost anything can happen, certainly including > >>restarting the tick. Thus, we have to make sure that normal quiescing > >>happens again before we return to userspace. > >Yes but we need to sort out what needs to be called only once on prctl(). > > > >Once vmstat is quiesced, it's not going to need quiescing again even if we > >restart the tick. > > That's true, but I really do like the idea of having a clean structure > where we verify all our prerequisites in task_isolation_ready(), and > have code to try to get things fixed up in task_isolation_enter(). 
> If we start moving some things here and some things there, it gets > harder to manage. I think by testing "!vmstat_idle()" in > task_isolation_enter() we are avoiding any substantial overhead. I think that making the code clearer on what needs to be done once for all on prctl() and what needs to be done on all actual syscall return is more important for readability. > > I think it would be clearer to rename task_isolation_enter() > to task_isolation_prepare(); it might be less confusing. For the prctl part, why not. > > Remember too that in general, we really don't need to think about > return-to-userspace as a hot path for task isolation, unlike how we > think about it all the rest of the time. So it makes sense to > prioritize keeping things clean from a software development > perspective first, and high-performance only secondarily. > > >>The thing to remember is that this is only relevant if the user has > >>explicitly requested the NOSIG behavior from task isolation, which we > >>don't really expect to be the default - we are implicitly encouraging > >>use of the default semantics of "you can't enter the kernel again > >>until you turn off isolation". > >That's right. Although NOSIG is the only thing we can afford as long as > >we drag around the 1Hz. > > True enough. Hopefully we'll finish sorting that out soon enough. > > >>>>+ if (!tick_nohz_tick_stopped()) > >>>>+ set_tsk_need_resched(current); > >>>>Again, that won't help > >>It won't be better than spinning in a loop if there aren't any other > >>schedulable processes, but it won't be worse either. If there is > >>another schedulable process, we at least will schedule it sooner than > >>if we just sat in a busy loop and waited for the scheduler to kick > >>us. But there's nothing else we can do anyway if we want to maintain > >>the guarantee that the dyn tick is stopped before return to userspace. > >I don't think it helps either way. 
If reschedule is pending, the current > >task already has TIF_RESCHED set. > > See the other thread with Peter Z for the longer discussion of this. > At this point I'm leaning towards replacing the set_tsk_need_resched() with > > set_current_state(TASK_INTERRUPTIBLE); > schedule(); > __set_current_state(TASK_RUNNING); I don't see how that helps. What will wake the thread up except a signal?
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-30 17:10 ` Frederic Weisbecker @ 2016-08-30 17:36 ` Chris Metcalf 2016-08-30 18:17 ` Chris Metcalf 0 siblings, 1 reply; 13+ messages in thread From: Chris Metcalf @ 2016-08-30 17:36 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On 8/30/2016 1:10 PM, Frederic Weisbecker wrote: > On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote: >> On 8/29/2016 8:55 PM, Frederic Weisbecker wrote: >>> On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote: >>>> On 8/11/2016 2:50 PM, Christoph Lameter wrote: >>>>> On Thu, 11 Aug 2016, Frederic Weisbecker wrote: >>>>> >>>>>> Do we need to quiesce vmstat everytime before entering userspace? >>>>>> I thought that vmstat only need to be offlined once and for all? >>>>> Once is sufficient after disabling the tick. >>>> It's true that task_isolation_enter() is called every time before >>>> returning to user space while task isolation is enabled. >>>> >>>> But once we enter the kernel again after returning from the initial >>>> prctl() -- assuming we are in NOSIG mode so doing so is legal in the >>>> first place -- almost anything can happen, certainly including >>>> restarting the tick. Thus, we have to make sure that normal quiescing >>>> happens again before we return to userspace. >>> Yes but we need to sort out what needs to be called only once on prctl(). >>> >>> Once vmstat is quiesced, it's not going to need quiescing again even if we >>> restart the tick. >> That's true, but I really do like the idea of having a clean structure >> where we verify all our prerequisites in task_isolation_ready(), and >> have code to try to get things fixed up in task_isolation_enter(). 
>> If we start moving some things here and some things there, it gets >> harder to manage. I think by testing "!vmstat_idle()" in >> task_isolation_enter() we are avoiding any substantial overhead. > I think that making the code clearer on what needs to be done once for > all on prctl() and what needs to be done on all actual syscall return > is more important for readability. We don't need to just do it on prctl(), though. We also need to do it whenever we have been in the kernel for another reason, which can happen with NOSIG. So we need to do this on the common return to userspace path no matter what, right? Or am I missing something? It seems like if, for example, we do mmaps or page faults, then on return to userspace, some of those counters will have been incremented and we'll need to run the quiet_vmstat_sync() code. >>>>>> + if (!tick_nohz_tick_stopped()) >>>>>> + set_tsk_need_resched(current); >>>>>> Again, that won't help >>>> It won't be better than spinning in a loop if there aren't any other >>>> schedulable processes, but it won't be worse either. If there is >>>> another schedulable process, we at least will schedule it sooner than >>>> if we just sat in a busy loop and waited for the scheduler to kick >>>> us. But there's nothing else we can do anyway if we want to maintain >>>> the guarantee that the dyn tick is stopped before return to userspace. >>> I don't think it helps either way. If reschedule is pending, the current >>> task already has TIF_RESCHED set. >> See the other thread with Peter Z for the longer discussion of this. >> At this point I'm leaning towards replacing the set_tsk_need_resched() with >> >> set_current_state(TASK_INTERRUPTIBLE); >> schedule(); >> __set_current_state(TASK_RUNNING); > I don't see how that helps. What will wake the thread up except a signal? 
The answer is that the scheduler will keep bringing us back to this point (either after running another runnable task if there is one, or else just returning to this point immediately without doing a context switch), and we will then go around the "prepare exit to userspace" loop and perhaps discover that enough time has gone by that the last dyntick interrupt has triggered and the kernel has quiesced the dynticks. At that point we stop calling schedule() over and over and can return normally to userspace. It's very counter-intuitive to burn cpu time intentionally in the kernel. I really don't see another way to resolve this, though. I don't think it would be safe, for example, to just promote the next dyntick to running immediately (rather than waiting a few microseconds until it is scheduled to go off). -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-30 17:36 ` Chris Metcalf @ 2016-08-30 18:17 ` Chris Metcalf 2016-09-03 15:31 ` Frederic Weisbecker 0 siblings, 1 reply; 13+ messages in thread From: Chris Metcalf @ 2016-08-30 18:17 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On 8/30/2016 1:36 PM, Chris Metcalf wrote: >>> See the other thread with Peter Z for the longer discussion of this. >>> At this point I'm leaning towards replacing the set_tsk_need_resched() with >>> >>> set_current_state(TASK_INTERRUPTIBLE); >>> schedule(); >>> __set_current_state(TASK_RUNNING); >> I don't see how that helps. What will wake the thread up except a signal? > > The answer is that the scheduler will keep bringing us back to this > point (either after running another runnable task if there is one, > or else just returning to this point immediately without doing a > context switch), and we will then go around the "prepare exit to > userspace" loop and perhaps discover that enough time has gone > by that the last dyntick interrupt has triggered and the kernel has > quiesced the dynticks. At that point we stop calling schedule() > over and over and can return normally to userspace. Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call schedule(), I never get control back. So I don't want to do that. I suppose I could do a schedule_timeout() here instead and try to figure out how long to wait for the next dyntick. But since we don't expect anything else running on the core anyway, it seems like we are probably working too hard at this point. I don't think it's worth it just to go into the idle task and (possibly) save some power for a few microseconds. 
The more I think about this, the more I think I am micro-optimizing by trying to poke the scheduler prior to some external thing setting need_resched, so I think the thing to do here is in fact, nothing. I won't worry about rescheduling but will just continue going around the prepare-exit-to-userspace loop until the last dyn tick fires. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-08-30 18:17 ` Chris Metcalf @ 2016-09-03 15:31 ` Frederic Weisbecker 2016-09-09 18:54 ` Chris Metcalf 0 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2016-09-03 15:31 UTC (permalink / raw) To: Chris Metcalf Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On Tue, Aug 30, 2016 at 02:17:27PM -0400, Chris Metcalf wrote: > On 8/30/2016 1:36 PM, Chris Metcalf wrote: > >>>See the other thread with Peter Z for the longer discussion of this. > >>>At this point I'm leaning towards replacing the set_tsk_need_resched() with > >>> > >>> set_current_state(TASK_INTERRUPTIBLE); > >>> schedule(); > >>> __set_current_state(TASK_RUNNING); > >>I don't see how that helps. What will wake the thread up except a signal? > > > >The answer is that the scheduler will keep bringing us back to this > >point (either after running another runnable task if there is one, > >or else just returning to this point immediately without doing a > >context switch), and we will then go around the "prepare exit to > >userspace" loop and perhaps discover that enough time has gone > >by that the last dyntick interrupt has triggered and the kernel has > >quiesced the dynticks. At that point we stop calling schedule() > >over and over and can return normally to userspace. > > Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call > schedule(), I never get control back. So I don't want to do that. > > I suppose I could do a schedule_timeout() here instead and try > to figure out how long to wait for the next dyntick. But since we > don't expect anything else running on the core anyway, it seems > like we are probably working too hard at this point. 
I don't think > it's worth it just to go into the idle task and (possibly) save some > power for a few microseconds. > > The more I think about this, the more I think I am micro-optimizing > by trying to poke the scheduler prior to some external thing setting > need_resched, so I think the thing to do here is in fact, nothing. Exactly, I fear there is nothing you can do about that. > I won't worry about rescheduling but will just continue going around > the prepare-exit-to-userspace loop until the last dyn tick fires. You mean waiting in prepare-exit-to-userspace until the last tick fires? I'm not sure it's a good idea either, this could take ages, it could as well never happen. I'd rather say that if we are in signal mode, fire such, otherwise just return to userspace. If there is a tick, it means that the environment is not suitable for isolation anyway. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v14 04/14] task_isolation: add initial support 2016-09-03 15:31 ` Frederic Weisbecker @ 2016-09-09 18:54 ` Chris Metcalf 0 siblings, 0 replies; 13+ messages in thread From: Chris Metcalf @ 2016-09-09 18:54 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel On 9/3/2016 11:31 AM, Frederic Weisbecker wrote: > On Tue, Aug 30, 2016 at 02:17:27PM -0400, Chris Metcalf wrote: >> On 8/30/2016 1:36 PM, Chris Metcalf wrote: >>>>> See the other thread with Peter Z for the longer discussion of this. >>>>> At this point I'm leaning towards replacing the set_tsk_need_resched() with >>>>> >>>>> set_current_state(TASK_INTERRUPTIBLE); >>>>> schedule(); >>>>> __set_current_state(TASK_RUNNING); >>>> I don't see how that helps. What will wake the thread up except a signal? >>> The answer is that the scheduler will keep bringing us back to this >>> point (either after running another runnable task if there is one, >>> or else just returning to this point immediately without doing a >>> context switch), and we will then go around the "prepare exit to >>> userspace" loop and perhaps discover that enough time has gone >>> by that the last dyntick interrupt has triggered and the kernel has >>> quiesced the dynticks. At that point we stop calling schedule() >>> over and over and can return normally to userspace. >> Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call >> schedule(), I never get control back. So I don't want to do that. >> >> I suppose I could do a schedule_timeout() here instead and try >> to figure out how long to wait for the next dyntick. But since we >> don't expect anything else running on the core anyway, it seems >> like we are probably working too hard at this point. 
I don't think >> it's worth it just to go into the idle task and (possibly) save some >> power for a few microseconds. >> >> The more I think about this, the more I think I am micro-optimizing >> by trying to poke the scheduler prior to some external thing setting >> need_resched, so I think the thing to do here is in fact, nothing. > Exactly, I fear there is nothing you can do about that. > >> I won't worry about rescheduling but will just continue going around >> the prepare-exit-to-userspace loop until the last dyn tick fires. > You mean waiting in prepare-exit-to-userspace until the last tick fires? > I'm not sure it's a good idea either, this could take ages, it could as > well never happen. If you don't mind, let's take this to the other thread discussing what to do at return-to-userspace time: https://lkml.kernel.org/r/440e20d1-441a-3228-6b37-6e71e9fce47c@mellanox.com In general, I think if your task ends up waiting forever for the dyntick to stop, with the approach suggested in that thread you will at least be able to tell more easily, since the core will be running the idle task and your task will be waiting on a dyntick-specific completion. > I'd rather say that if we are in signal mode, fire such, otherwise just > return to userspace. If there is a tick, it means that the environment is > not suitable for isolation anyway. True if there is an ongoing tick, but not if the tick is about to stop and we just need to wait for the last tick to fire. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Thread overview: 13+ messages
[not found] <1470774596-17341-1-git-send-email-cmetcalf@mellanox.com>
2016-08-09 20:29 ` [PATCH v14 02/14] vmstat: add vmstat_idle function Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf
2016-08-11 18:11 ` Frederic Weisbecker
2016-08-11 18:50 ` Christoph Lameter
2016-08-15 14:59 ` Chris Metcalf
2016-08-30 0:55 ` Frederic Weisbecker
2016-08-30 15:41 ` Chris Metcalf
2016-08-30 17:10 ` Frederic Weisbecker
2016-08-30 17:36 ` Chris Metcalf
2016-08-30 18:17 ` Chris Metcalf
2016-09-03 15:31 ` Frederic Weisbecker
2016-09-09 18:54 ` Chris Metcalf