* [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-08 17:58 Chris Metcalf 2015-05-08 17:58 ` [PATCH 1/6] nohz_full: add support for "dataplane" mode Chris Metcalf ` (3 more replies) 0 siblings, 4 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. A prctl() option (PR_SET_DATAPLANE) is added to control whether processes have requested these stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging of where unexpected interrupts are coming from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. 
Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on dataplane cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series (based currently on my arch/tile master tree for 4.2, in turn based on 4.1-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane Chris Metcalf (6): nohz_full: add support for "dataplane" mode nohz: dataplane: allow tick to be fully disabled for dataplane dataplane nohz: run softirqs synchronously on user entry nohz: support PR_DATAPLANE_QUIESCE nohz: support PR_DATAPLANE_STRICT mode nohz: add dataplane_debug boot flag Documentation/kernel-parameters.txt | 6 ++ arch/tile/mm/homecache.c | 5 +- include/linux/sched.h | 3 + include/linux/tick.h | 12 ++++ include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 3 + kernel/irq_work.c | 4 +- kernel/sched/core.c | 18 ++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 15 ++++- kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 112 +++++++++++++++++++++++++++++++++++- 13 files changed, 198 insertions(+), 5 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH 1/6] nohz_full: add support for "dataplane" mode 2015-05-08 17:58 [PATCH 0/6] support "dataplane" mode for nohz_full Chris Metcalf @ 2015-05-08 17:58 ` Chris Metcalf 2015-05-08 17:58 ` [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE Chris Metcalf ` (2 subsequent siblings) 3 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The dataplane state is indicated by setting a new task struct field, dataplane_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_dataplane_enter() routine to take additional actions to help the task avoid being interrupted in the future. For this first patch, the only action taken is to call lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/sched.h | 3 +++ include/linux/tick.h | 10 ++++++++++ include/uapi/linux/prctl.h | 5 +++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 ++++++++ kernel/time/tick-sched.c | 13 +++++++++++++ 6 files changed, 42 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 8222ae40ecb0..3680aa07c9ea 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1732,6 +1732,9 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif +#ifdef CONFIG_NO_HZ_FULL + unsigned int dataplane_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/include/linux/tick.h b/include/linux/tick.h index f8492da57ad3..d191cda9b71a 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu) return cpumask_test_cpu(cpu, tick_nohz_full_mask); } +static inline bool tick_nohz_is_dataplane(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->dataplane_flags & PR_DATAPLANE_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_dataplane_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct 
*tsk) { } +static inline bool tick_nohz_is_dataplane(void) { return false; } +static inline void tick_nohz_dataplane_enter(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..1aa8fa8a8b05 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query dataplane mode for NO_HZ_FULL kernels. */ +#define PR_SET_DATAPLANE 47 +#define PR_GET_DATAPLANE 48 +# define PR_DATAPLANE_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 72d59a1a6eb6..dd6bdd6197b6 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/tick.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. 
*/ if (state == CONTEXT_USER) { + if (tick_nohz_is_dataplane()) + tick_nohz_dataplane_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index a4e372b798a5..930b750aefde 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NO_HZ_FULL + case PR_SET_DATAPLANE: + me->dataplane_flags = arg2; + break; + case PR_GET_DATAPLANE: + error = me->dataplane_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 914259128145..31c674719647 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/swap.h> #include <asm/irq_regs.h> @@ -389,6 +390,18 @@ void __init tick_nohz_init(void) pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", cpumask_pr_args(tick_nohz_full_mask)); } + +/* + * When returning to userspace on a nohz_full core after doing + * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try + * more aggressively to prevent this core from being interrupted later. + */ +void tick_nohz_dataplane_enter(void) +{ + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-08 17:58 [PATCH 0/6] support "dataplane" mode for nohz_full Chris Metcalf 2015-05-08 17:58 ` [PATCH 1/6] nohz_full: add support for "dataplane" mode Chris Metcalf @ 2015-05-08 17:58 ` Chris Metcalf [not found] ` <1431107927-13998-5-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-05-08 17:58 ` [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode Chris Metcalf [not found] ` <1431107927-13998-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the kernel to quiesce any pending timer interrupts prior to returning to userspace. When running with this mode set, syscalls (and page faults, etc.) can be inordinately slow. However, user applications that want to guarantee that no unexpected interrupts will occur (even if they call into the kernel) can set this flag to guarantee those semantics. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 1 + kernel/time/tick-sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 55 insertions(+) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 1aa8fa8a8b05..8b735651304a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_DATAPLANE 47 #define PR_GET_DATAPLANE 48 # define PR_DATAPLANE_ENABLE (1 << 0) +# define PR_DATAPLANE_QUIESCE (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index fd0e6e5c931c..69d908c6cef8 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -392,6 +392,53 @@ void __init tick_nohz_init(void) } /* + * We normally return immediately to userspace. + * + * The PR_DATAPLANE_QUIESCE flag causes us to wait until no more + * interrupts are pending. Otherwise we nap with interrupts enabled + * and wait for the next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two processes on the same core and both + * specify PR_DATAPLANE_QUIESCE, neither will ever leave the kernel, + * and one will have to be killed manually. Otherwise in situations + * where another process is in the runqueue on this cpu, this task + * will just wait for that other task to go idle before returning to + * user space. 
+ */ +static void dataplane_quiesce(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: dataplane task blocked for %ld jiffies\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start)); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + + /* Idle with interrupts enabled and wait for the tick. */ + set_current_state(TASK_INTERRUPTIBLE); + arch_cpu_idle(); + set_current_state(TASK_RUNNING); + } + if (warned) { + pr_warn("%s/%d: cpu %d: dataplane task unblocked after %ld jiffies\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start)); + dump_stack(); + } +} + +/* * When returning to userspace on a nohz_full core after doing * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try * more aggressively to prevent this core from being interrupted later. @@ -411,6 +458,13 @@ void tick_nohz_dataplane_enter(void) lru_add_drain(); /* + * Quiesce any timer ticks if requested. On return from this + * function, no timer ticks are pending. + */ + if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0) + dataplane_quiesce(); + + /* * Disable interrupts again since other code running in this * function may have enabled them, and the caller expects * interrupts to be disabled on return. Enabling them during -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE [not found] ` <1431107927-13998-5-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-05-12 9:33 ` Peter Zijlstra [not found] ` <20150512093349.GH21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2015-05-14 20:54 ` Chris Metcalf 0 siblings, 2 replies; 159+ messages in thread From: Peter Zijlstra @ 2015-05-12 9:33 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the > kernel to quiesce any pending timer interrupts prior to returning > to userspace. When running with this mode set, sys calls (and page > faults, etc.) can be inordinately slow. However, user applications > that want to guarantee that no unexpected interrupts will occur > (even if they call into the kernel) can set this flag to guarantee > that semantics. Currently people hot-unplug and hot-plug the CPU to do this. Obviously that's a wee bit horrible :-) Not sure if a prctl like this is any better though. This is a CPU property, not a process one. ISTR people talking about a 'quiesce' sysfs file, alongside the hotplug stuff, I can't quite remember. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE [not found] ` <20150512093349.GH21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2015-05-12 9:50 ` Ingo Molnar [not found] ` <20150512095030.GD11477-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Ingo Molnar @ 2015-05-12 9:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the > > kernel to quiesce any pending timer interrupts prior to returning > > to userspace. When running with this mode set, sys calls (and page > > faults, etc.) can be inordinately slow. However, user applications > > that want to guarantee that no unexpected interrupts will occur > > (even if they call into the kernel) can set this flag to guarantee > > that semantics. > > Currently people hot-unplug and hot-plug the CPU to do this. > Obviously that's a wee bit horrible :-) > > Not sure if a prctl like this is any better though. This is a CPU > properly not a process one. So if then a prctl() (or other system call) could be a shortcut to: - move the task to an isolated CPU - make sure there _is_ such an isolated domain available I.e. have some programmatic, kernel provided way for an application to be sure it's running in the right environment. Relying on random administration flags here and there won't cut it. Thanks, Ingo ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE [not found] ` <20150512095030.GD11477-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-12 10:38 ` Peter Zijlstra 2015-05-12 12:52 ` Ingo Molnar 0 siblings, 1 reply; 159+ messages in thread From: Peter Zijlstra @ 2015-05-12 10:38 UTC (permalink / raw) To: Ingo Molnar Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, May 12, 2015 at 11:50:30AM +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: > > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the > > > kernel to quiesce any pending timer interrupts prior to returning > > > to userspace. When running with this mode set, sys calls (and page > > > faults, etc.) can be inordinately slow. However, user applications > > > that want to guarantee that no unexpected interrupts will occur > > > (even if they call into the kernel) can set this flag to guarantee > > > that semantics. > > > > Currently people hot-unplug and hot-plug the CPU to do this. > > Obviously that's a wee bit horrible :-) > > > > Not sure if a prctl like this is any better though. This is a CPU > > properly not a process one. > > So if then a prctl() (or other system call) could be a shortcut to: > > - move the task to an isolated CPU > - make sure there _is_ such an isolated domain available > > I.e. have some programmatic, kernel provided way for an application to > be sure it's running in the right environment. Relying on random > administration flags here and there won't cut it. No, we already have sched_setaffinity() and we should not duplicate its ability to move tasks about. 
What this is about is 'clearing' CPU state, it's nothing to do with tasks. Ideally we'd never have to clear the state because it should be impossible to get into this predicament in the first place. The typical example here is a periodic timer that found its way onto the cpu and stays there. We're actually working on allowing such self-arming timers to migrate, so once we have that sorted this could be fixed properly I think. Not sure if there's more pollution that people worry about. The hotplug hack worked because unplug force migrates the timers away. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-12 10:38 ` Peter Zijlstra @ 2015-05-12 12:52 ` Ingo Molnar 2015-05-13 4:35 ` Andy Lutomirski 0 siblings, 1 reply; 159+ messages in thread From: Ingo Molnar @ 2015-05-12 12:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > > So if then a prctl() (or other system call) could be a shortcut > > to: > > > > - move the task to an isolated CPU > > - make sure there _is_ such an isolated domain available > > > > I.e. have some programmatic, kernel provided way for an > > application to be sure it's running in the right environment. > > Relying on random administration flags here and there won't cut > > it. > > No, we already have sched_setaffinity() and we should not duplicate > its ability to move tasks about. But sched_setaffinity() does not guarantee isolation - it's just a syscall to move a task to a set of CPUs, which might be isolated or not. What I suggested is that it might make sense to offer a system call, for example a sched_setparam() variant, that makes such guarantees. Say if user-space does: ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); ... then we would get the task moved to an isolated domain and get a 0 return code if the kernel is able to do all that and if the current uid/namespace/etc. has the required permissions and such. ( BIND_ISOLATED will not replace the current p->policy value, so it's still possible to use the regular policies as well on top of this. ) I.e. make it programatic instead of relying on a fragile, kernel version dependent combination of sysctl, sysfs, kernel config and boot parameter details to get us this result. I.e. 
provide a central hub to offer this feature in a more structured, easier to use fashion. We might still require the admin (or distro) to separately set up the domain of isolated CPUs, and it would still be possible to simply 'move' tasks there using existing syscalls - but I say that it's not a bad idea at all to offer a single central syscall interface for apps to request such treatment. > What this is about is 'clearing' CPU state, its nothing to do with > tasks. > > Ideally we'd never have to clear the state because it should be > impossible to get into this predicament in the first place. That I absolutely agree about, that bit is nonsense. We might offer debugging facilities to debug such bugs, but we won't work around it or hack it. Thanks, Ingo ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-12 12:52 ` Ingo Molnar @ 2015-05-13 4:35 ` Andy Lutomirski 2015-05-13 17:51 ` Paul E. McKenney 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-05-13 4:35 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > >> > So if then a prctl() (or other system call) could be a shortcut >> > to: >> > >> > - move the task to an isolated CPU >> > - make sure there _is_ such an isolated domain available >> > >> > I.e. have some programmatic, kernel provided way for an >> > application to be sure it's running in the right environment. >> > Relying on random administration flags here and there won't cut >> > it. >> >> No, we already have sched_setaffinity() and we should not duplicate >> its ability to move tasks about. > > But sched_setaffinity() does not guarantee isolation - it's just a > syscall to move a task to a set of CPUs, which might be isolated or > not. > > What I suggested is that it might make sense to offer a system call, > for example a sched_setparam() variant, that makes such guarantees. > > Say if user-space does: > > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); > > ... then we would get the task moved to an isolated domain and get a 0 > return code if the kernel is able to do all that and if the current > uid/namespace/etc. has the required permissions and such. > > ( BIND_ISOLATED will not replace the current p->policy value, so it's > still possible to use the regular policies as well on top of this. ) I think we shouldn't have magic selection of an isolated domain. 
Anyone using this has already configured some isolated CPUs and probably wants to choose the CPU and, especially, NUMA node themselves. Also, maybe it should be a special type of realtime class/priority -- doing this should require RT permission IMO. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-13 4:35 ` Andy Lutomirski @ 2015-05-13 17:51 ` Paul E. McKenney [not found] ` <20150513175150.GL6776-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Paul E. McKenney @ 2015-05-13 17:51 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Christoph Lameter, Srivatsa S. Bhat, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, May 12, 2015 at 09:35:25PM -0700, Andy Lutomirski wrote: > On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > >> > So if then a prctl() (or other system call) could be a shortcut > >> > to: > >> > > >> > - move the task to an isolated CPU > >> > - make sure there _is_ such an isolated domain available > >> > > >> > I.e. have some programmatic, kernel provided way for an > >> > application to be sure it's running in the right environment. > >> > Relying on random administration flags here and there won't cut > >> > it. > >> > >> No, we already have sched_setaffinity() and we should not duplicate > >> its ability to move tasks about. > > > > But sched_setaffinity() does not guarantee isolation - it's just a > > syscall to move a task to a set of CPUs, which might be isolated or > > not. > > > > What I suggested is that it might make sense to offer a system call, > > for example a sched_setparam() variant, that makes such guarantees. > > > > Say if user-space does: > > > > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); > > > > ... then we would get the task moved to an isolated domain and get a 0 > > return code if the kernel is able to do all that and if the current > > uid/namespace/etc. has the required permissions and such. 
> > > > ( BIND_ISOLATED will not replace the current p->policy value, so it's > > still possible to use the regular policies as well on top of this. ) > > I think we shouldn't have magic selection of an isolated domain. > Anyone using this has already configured some isolated CPUs and > probably wants to choose the CPU and, especially, NUMA node > themselves. Also, maybe it should be a special type of realtime > class/priority -- doing this should require RT permission IMO. I have no real argument against special permissions, but this feature is totally orthogonal to realtime classes/priorities. It is perfectly legitimate for a given CPU's single runnable task to be SCHED_OTHER, for example. Thanx, Paul ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE [not found] ` <20150513175150.GL6776-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2015-05-14 20:55 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-14 20:55 UTC (permalink / raw) To: paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Andy Lutomirski Cc: Ingo Molnar, Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Christoph Lameter, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 05/12/2015 08:52 AM, Ingo Molnar wrote: > What I suggested is that it might make sense to offer a system call, > for example a sched_setparam() variant, that makes such guarantees. > > Say if user-space does: > > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); > > ... then we would get the task moved to an isolated domain and get a 0 > return code if the kernel is able to do all that and if the current > uid/namespace/etc. has the required permissions and such. Unfortunately I don't know nearly as much about the scheduler and scheduler policies as I might, since I mostly focused on making the scheduler stay out of the way. :-) This does seem like another way to set a policy bit on a process. I assume you could only validly issue this call on a nohz_full core, and that you're not assuming it migrates the task to such a core? You suggested that BIND_ISOLATED would not replace the usual scheduler policies, but perhaps SCHED_ISOLATED as a full replacement would make sense - it would make it an error to have any other schedulable task on that core. 
I guess that brings it around to whether the "cpu_isolated" task just loses when another task is scheduled on the core with it (the current approach I'm proposing) or if it ends up truly owning the core and other processes can be denied the right to run there: which in that case clearly does get us into the area of requiring privileges to set up, as Andy pointed out later. This would leave the notion of "strict" as proposed elsewhere as a separate thing, but presumably it could still be a prctl() as originally proposed. I admit I don't know enough to say whether this sounds like a better approach than just using a prctl() to set the cpu_isolated state. My instinct is that it's cleanest to avoid requiring permissions to do this, and to simply enable the quiescing semantics the process requested when it happens to be alone on a core. If so, it's somewhat orthogonal to the actual scheduler policy in force, so best not to conflate it with the notion of scheduler code at all via sched_setscheduler()? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-12 9:33 ` Peter Zijlstra [not found] ` <20150512093349.GH21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2015-05-14 20:54 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-14 20:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel On 05/12/2015 05:33 AM, Peter Zijlstra wrote: > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: >> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the >> kernel to quiesce any pending timer interrupts prior to returning >> to userspace. When running with this mode set, sys calls (and page >> faults, etc.) can be inordinately slow. However, user applications >> that want to guarantee that no unexpected interrupts will occur >> (even if they call into the kernel) can set this flag to guarantee >> that semantics. > Currently people hot-unplug and hot-plug the CPU to do this. Obviously > that's a wee bit horrible :-) > > Not sure if a prctl like this is any better though. This is a CPU > properly not a process one. The CPU property aspects, I think, should be largely handled by fixing kernel bugs that let work end up running on nohz_full cores without having been explicitly requested to run there. As you said in a follow-up email: On 05/12/2015 06:38 AM, Peter Zijlstra wrote: > Ideally we'd never have to clear the state because it should be > impossible to get into this predicament in the first place. What my prctl() proposal does is quiesce things that end up happening specifically because the user process called on purpose into the kernel. For example, perhaps RCU was invoked in the kernel, and the core has to wait a timer tick to quiesce RCU. 
Whatever causes it, the intent is that you're not allowed back into userspace until everything has settled down from your call into the kernel; the presumption is that it's all due to the kernel entry that was just made, and not from other stray work. In that sense, it's very appropriate for it to be a process property. > ISTR people talking about 'quiesce' sysfs file, along side the hotplug > stuff, I can't quite remember. It seems somewhat similar (adding Viresh to the cc's) but does seem like it might have been more intended to address the CPU properties rather than process properties: https://lkml.org/lkml/2014/4/4/99 One thing the original Tilera dataplane code did was to require setting dataplane flags to succeed only on dataplane cores, and only when the task had been affinitized to that single core. This did not protect the task from later being re-affinitized in a way that broke those assumptions, but I suppose you could also imagine making sched_setaffinity() fail for such a process. This is somewhat unrelated, but it occurred to me in the context of this reply, so what do you think? I can certainly add this to the patch series if it seems like it makes setting the prctl() flags more conservative. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-08 17:58 [PATCH 0/6] support "dataplane" mode for nohz_full Chris Metcalf 2015-05-08 17:58 ` [PATCH 1/6] nohz_full: add support for "dataplane" mode Chris Metcalf 2015-05-08 17:58 ` [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE Chris Metcalf @ 2015-05-08 17:58 ` Chris Metcalf 2015-05-09 7:28 ` Andy Lutomirski 2015-05-12 9:38 ` Peter Zijlstra [not found] ` <1431107927-13998-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 3 siblings, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With QUIESCE mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->dataplane_flags that is set when prctl() sets the flags. That way, when we are exiting the kernel after calling prctl() to forbid future kernel exits, we don't get immediately killed. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/sys.c | 2 +- kernel/time/tick-sched.c | 17 +++++++++++++++++ 3 files changed, 20 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 8b735651304a..9cf79aa1e73f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_DATAPLANE 48 # define PR_DATAPLANE_ENABLE (1 << 0) # define PR_DATAPLANE_QUIESCE (1 << 1) +# define PR_DATAPLANE_STRICT (1 << 2) +# define PR_DATAPLANE_PRCTL (1U << 31) /* kernel internal */ #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 930b750aefde..8102433c9edd 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2245,7 +2245,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, break; #ifdef CONFIG_NO_HZ_FULL case PR_SET_DATAPLANE: - me->dataplane_flags = arg2; + me->dataplane_flags = arg2 | PR_DATAPLANE_PRCTL; break; case PR_GET_DATAPLANE: error = me->dataplane_flags; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 69d908c6cef8..22ed0decb363 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -436,6 +436,20 @@ static void dataplane_quiesce(void) (jiffies - start)); dump_stack(); } + + /* + * Kill the process if it violates STRICT mode. Note that this + * code also results in killing the task if a kernel bug causes an + * irq to be delivered to this core. + */ + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL)) + == PR_DATAPLANE_STRICT) { + pr_warn("Dataplane STRICT mode violated; process killed.\n"); + dump_stack(); + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE; + local_irq_enable(); + do_group_exit(SIGKILL); + } } /* @@ -464,6 +478,9 @@ void tick_nohz_dataplane_enter(void) if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0) dataplane_quiesce(); + /* Clear the bit set by prctl() when it updates the flags. 
*/ + current->dataplane_flags &= ~PR_DATAPLANE_PRCTL; + /* * Disable interrupts again since other code running in this * function may have enabled them, and the caller expects -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-08 17:58 ` [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode Chris Metcalf @ 2015-05-09 7:28 ` Andy Lutomirski 2015-05-09 10:37 ` Gilad Ben Yossef [not found] ` <CALCETrUoptUPVUxL87jUgry1pFac0rDPpnZ790zDKyK4a0FARA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-12 9:38 ` Peter Zijlstra 1 sibling, 2 replies; 159+ messages in thread From: Andy Lutomirski @ 2015-05-09 7:28 UTC (permalink / raw) To: Chris Metcalf Cc: Srivatsa S. Bhat, Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc@vger.kernel.org, Andrew Morton, linux-kernel@vger.kernel.org, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > With QUIESCE mode, the task is in principle guaranteed not to be > interrupted by the kernel, but only if it behaves. In particular, > if it enters the kernel via system call, page fault, or any of > a number of other synchronous traps, it may be unexpectedly > exposed to long latencies. Add a simple flag that puts the process > into a state where any such kernel entry is fatal. > > To allow the state to be entered and exited, we add an internal > bit to current->dataplane_flags that is set when prctl() sets the > flags. That way, when we are exiting the kernel after calling > prctl() to forbid future kernel exits, we don't get immediately > killed. Is there any reason this can't already be addressed in userspace using /proc/interrupts or perf_events? ISTM the real goal here is to detect when we screw up and fail to avoid an interrupt, and killing the task seems like overkill to me. Also, can we please stop further torturing the exit paths? We have a disaster of assembly code that calls into syscall_trace_leave and do_notify_resume. 
Those functions, in turn, *both* call user_enter (WTF?), and on very brief inspection user_enter makes it into the nohz code through multiple levels of indirection, which, with these patches, has yet another conditionally enabled helper, which does this new stuff. It's getting to be impossible to tell what happens when we exit to user space any more. Also, I think your code is buggy. There's no particular guarantee that user_enter is only called once between sys_prctl and the final exit to user mode (see the above WTF), so you might spuriously kill the process. Also, I think that most users will be quite surprised if "strict dataplane" code causes any machine check on the system to kill your dataplane task. Similarly, a user accidentally running perf record -a probably should have some reasonable semantics. /proc/interrupts gets that right as is. Sure, MCEs will hurt your RT performance, but Intel screwed up the way that MCEs work, so we should make do. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-09 7:28 ` Andy Lutomirski @ 2015-05-09 10:37 ` Gilad Ben Yossef [not found] ` <CALCETrUoptUPVUxL87jUgry1pFac0rDPpnZ790zDKyK4a0FARA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 159+ messages in thread From: Gilad Ben Yossef @ 2015-05-09 10:37 UTC (permalink / raw) To: Andy Lutomirski, Chris Metcalf Cc: Srivatsa S. Bhat, Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc@vger.kernel.org, Andrew Morton, linux-kernel@vger.kernel.org, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter, Linux API > From: Andy Lutomirski [mailto:luto@amacapital.net] > Sent: Saturday, May 09, 2015 10:29 AM > To: Chris Metcalf > Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar; > Rik van Riel; linux-doc@vger.kernel.org; Andrew Morton; linux- > kernel@vger.kernel.org; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven > Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API > Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode > > On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > > > With QUIESCE mode, the task is in principle guaranteed not to be > > interrupted by the kernel, but only if it behaves. In particular, > > if it enters the kernel via system call, page fault, or any of > > a number of other synchronous traps, it may be unexpectedly > > exposed to long latencies. Add a simple flag that puts the process > > into a state where any such kernel entry is fatal. > > > > To allow the state to be entered and exited, we add an internal > > bit to current->dataplane_flags that is set when prctl() sets the > > flags. That way, when we are exiting the kernel after calling > > prctl() to forbid future kernel exits, we don't get immediately > > killed. > > Is there any reason this can't already be addressed in userspace using > /proc/interrupts or perf_events? 
ISTM the real goal here is to detect > when we screw up and fail to avoid an interrupt, and killing the task > seems like overkill to me. > > Also, can we please stop further torturing the exit paths? So, I don't know if it is a practical suggestion or not, but would it be better/easier to mark a pending signal on kernel entry for this case? The upsides I see are that the user gets her notification (killing the task or just logging the event in a signal handler) and that, since return to userspace with a pending signal is already handled, we hopefully don't need new code in the exit path? Gilad ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <CALCETrUoptUPVUxL87jUgry1pFac0rDPpnZ790zDKyK4a0FARA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode [not found] ` <CALCETrUoptUPVUxL87jUgry1pFac0rDPpnZ790zDKyK4a0FARA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-11 19:13 ` Chris Metcalf 2015-05-11 22:28 ` Andy Lutomirski 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-05-11 19:13 UTC (permalink / raw) To: Andy Lutomirski Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API On 05/09/2015 03:28 AM, Andy Lutomirski wrote: > On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> With QUIESCE mode, the task is in principle guaranteed not to be >> interrupted by the kernel, but only if it behaves. In particular, >> if it enters the kernel via system call, page fault, or any of >> a number of other synchronous traps, it may be unexpectedly >> exposed to long latencies. Add a simple flag that puts the process >> into a state where any such kernel entry is fatal. >> >> To allow the state to be entered and exited, we add an internal >> bit to current->dataplane_flags that is set when prctl() sets the >> flags. That way, when we are exiting the kernel after calling >> prctl() to forbid future kernel exits, we don't get immediately >> killed. > Is there any reason this can't already be addressed in userspace using > /proc/interrupts or perf_events? ISTM the real goal here is to detect > when we screw up and fail to avoid an interrupt, and killing the task > seems like overkill to me. Patch 6/6 proposes a mechanism to track down times when the kernel screws up and delivers an IRQ to a userspace-only task. 
Here, we're just trying to identify the times when an application screws itself up out of cluelessness, and provide a mechanism that allows the developer to easily figure out why and fix it. In particular, /proc/interrupts won't show syscalls or page faults, which are two easy ways applications can screw themselves when they think they're in userspace-only mode. Also, they don't provide sufficient precision to make it clear what part of the application caused the undesired kernel entry. In this case, killing the task is appropriate, since that's exactly the semantics that have been asked for - it's like on architectures that don't natively support unaligned accesses, but fake it relatively slowly in the kernel, and in development you just say "give me a SIGBUS when that happens" and in production you might say "fix it up and let's try to keep going". You can argue that this is something that can be done by ftrace, but certainly you'd want to have a way to programmatically turn on ftrace at the moment when you're entering userspace-only mode, so we'd want some API around that anyway. And honestly, it's so easy to test a task state bit in a couple of places and generate the failure on the spot, vs. the relative complexity of setting up and understanding ftrace, that I think it merits inclusion on that basis alone. > Also, can we please stop further torturing the exit paths? We have a > disaster of assembly code that calls into syscall_trace_leave and > do_notify_resume. 
There's no particular guarantee > that user_enter is only called once between sys_prctl and the final > exit to user mode (see the above WTF), so you might spuriously kill > the process. This is a good point; I also find the x86 kernel entry and exit paths confusing, although I've reviewed them a bunch of times. The tile architecture paths are a little easier to understand. That said, I think the answer here is to avoid non-idempotent actions in the dataplane code, such as clearing a syscall bit. A better implementation, I think, is to put the tests for "you screwed up and synchronously entered the kernel" in the syscall_trace_enter() code, which TIF_NOHZ already gets us into; there, if the dataplane "strict" bit is set and the syscall is not prctl(), we generate the error. (We'd exclude exit and exit_group here too, since we don't need to shoot down a task that's just trying to kill itself.) This needs a bit of platform-specific code for each platform, but that doesn't seem like too big a problem. Likewise we can test in exception_enter() since that's called only for synchronous user entries like page faults. > Also, I think that most users will be quite surprised if "strict > dataplane" code causes any machine check on the system to kill your > dataplane task. Fair point, and avoided by testing as described above instead. (Though presumably in development it's not such a big deal, and as I said you'd likely turn it off in production.) > Similarly, a user accidentally running perf record -a > probably should have some reasonable semantics. Yes, also avoided by doing this as above, though I'd argue we could also just say that running perf disables this mode. But it's not as clean as the above suggestion. On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote: > So, I don't know if it is a practical suggestion or not, but would it better/easier to mark a pending signal on kernel entry for this case? 
> The upsides I see is that the user gets her notification (killing the task or just logging the event in a signal handler) and hopefully since return to userspace with a pending signal is already handled we don't need new code in the exit path? We could certainly do this now that I'm planning to do the test at kernel entry rather than super-late in kernel exit. Rather than just do_group_exit(SIGKILL), we should raise a proper SIGKILL signal via send_sig(SIGKILL, current, 1), and then we could catch it in the debugger; the pc should help identify if it was a syscall, page fault, or other trap. I'm not sure there's an argument to be made for the user process being able to catch the signal itself; presumably in production you don't turn this mode on anyway, and in development, assuming a debugger is probably fine. But if you want to argue for another signal (SIGILL?) please do; I'm curious to hear if you think it would make more sense. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-11 19:13 ` Chris Metcalf @ 2015-05-11 22:28 ` Andy Lutomirski 2015-05-12 21:06 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-05-11 22:28 UTC (permalink / raw) To: Chris Metcalf, Peter Zijlstra Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc@vger.kernel.org, Andrew Morton, linux-kernel@vger.kernel.org, Thomas Gleixner, Tejun Heo, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API [add peterz due to perf stuff] On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 05/09/2015 03:28 AM, Andy Lutomirski wrote: >> >> On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: >>> >>> With QUIESCE mode, the task is in principle guaranteed not to be >>> interrupted by the kernel, but only if it behaves. In particular, >>> if it enters the kernel via system call, page fault, or any of >>> a number of other synchronous traps, it may be unexpectedly >>> exposed to long latencies. Add a simple flag that puts the process >>> into a state where any such kernel entry is fatal. >>> >>> To allow the state to be entered and exited, we add an internal >>> bit to current->dataplane_flags that is set when prctl() sets the >>> flags. That way, when we are exiting the kernel after calling >>> prctl() to forbid future kernel exits, we don't get immediately >>> killed. >> >> Is there any reason this can't already be addressed in userspace using >> /proc/interrupts or perf_events? ISTM the real goal here is to detect >> when we screw up and fail to avoid an interrupt, and killing the task >> seems like overkill to me. > > > Patch 6/6 proposes a mechanism to track down times when the > kernel screws up and delivers an IRQ to a userspace-only task. 
> Here, we're just trying to identify the times when an application > screws itself up out of cluelessness, and provide a mechanism > that allows the developer to easily figure out why and fix it. > > In particular, /proc/interrupts won't show syscalls or page faults, > which are two easy ways applications can screw themselves > when they think they're in userspace-only mode. Also, they don't > provide sufficient precision to make it clear what part of the > application caused the undesired kernel entry. Perf does, though, complete with context. > > In this case, killing the task is appropriate, since that's exactly > the semantics that have been asked for - it's like on architectures > that don't natively support unaligned accesses, but fake it relatively > slowly in the kernel, and in development you just say "give me a > SIGBUS when that happens" and in production you might say > "fix it up and let's try to keep going". I think more control is needed. I also think that, if we go this route, we should distinguish syscalls, synchronous non-syscall entries, and asynchronous non-syscall entries. They're quite different. > > You can argue that this is something that can be done by ftrace, > but certainly you'd want to have a way to programmatically > turn on ftrace at the moment when you're entering userspace-only > mode, so we'd want some API around that anyway. And honestly, > it's so easy to test a task state bit in a couple of places and > generate the failurel on the spot, vs. the relative complexity > of setting up and understanding ftrace, that I think it merits > inclusion on that basis alone. perf_event, not ftrace. > >> Also, can we please stop further torturing the exit paths? We have a >> disaster of assembly code that calls into syscall_trace_leave and >> do_notify_resume. 
Those functions, in turn, *both* call user_enter >> (WTF?), and on very brief inspection user_enter makes it into the nohz >> code through multiple levels of indirection, which, with these >> patches, has yet another conditionally enabled helper, which does this >> new stuff. It's getting to be impossible to tell what happens when we >> exit to user space any more. >> >> Also, I think your code is buggy. There's no particular guarantee >> that user_enter is only called once between sys_prctl and the final >> exit to user mode (see the above WTF), so you might spuriously kill >> the process. > > > This is a good point; I also find the x86 kernel entry and exit > paths confusing, although I've reviewed them a bunch of times. > The tile architecture paths are a little easier to understand. > > That said, I think the answer here is avoid non-idempotent > actions in the dataplane code, such as clearing a syscall bit. > > A better implementation, I think, is to put the tests for "you > screwed up and synchronously entered the kernel" in > the syscall_trace_enter() code, which TIF_NOHZ already > gets us into; No, not unless you're planning on using that to distinguish syscalls from other stuff *and* people think that's justified. It's far too easy to just make a tiny change to the entry code. Add a tiny trivial change here, a few lines of asm (that's you, audit!) there, some weird written-in-asm scheduling code over here, and you end up with the truly awful mess that we currently have. If it really makes sense for this stuff to go with context tracking, then fine, but we should *fix* the context tracking first rather than kludging around it. I already have a prototype patch for the relevant part of that. > there, we can test if the dataplane "strict" bit is > set and the syscall is not prctl(), then we generate the error. > (We'd exclude exit and exit_group here too, since we don't > need to shoot down a task that's just trying to kill itself.) 
> This needs a bit of platform-specific code for each platform, > but that doesn't seem like too big a problem. I'd rather avoid that, too. This feature isn't really arch-specific, so let's avoid the arch stuff if at all possible. > > Likewise we can test in exception_enter() since that's only > called for all the synchronous user entries like page faults. Let's try to generalize a bit. There's also irq_entry and ist_enter, and some of the exception_enter cases are for synchronous entries while (IIRC -- could be wrong) others aren't always like that. > >> Also, I think that most users will be quite surprised if "strict >> dataplane" code causes any machine check on the system to kill your >> dataplane task. > > > Fair point, and avoided by testing as described above instead. > (Though presumably in development it's not such a big deal, > and as I said you'd likely turn it off in production.) Until you forget to turn it off in production because it worked so nicely in development. What if we added a mode to perf where delivery of a sample synchronously (or semi-synchronously by catching it on the next exit to userspace) freezes the delivering task? It would be like debugger support via perf. peterz, do you think this would be a sensible thing to add to perf? It would only make sense for some types of events (tracepoints and hw_breakpoints mostly, I think). >> So, I don't know if it is a practical suggestion or not, but would it >> better/easier to mark a pending signal on kernel entry for this case? >> The upsides I see is that the user gets her notification (killing the task >> or just logging the event in a signal handler) and hopefully since return to >> userspace with a pending signal is already handled we don't need new code in >> the exit path? > > > We could certainly do this now that I'm planning to do the > test at kernel entry rather than super-late in kernel exit. 
> Rather than just do_group_exit(SIGKILL), we should raise > a proper SIGKILL signal via send_sig(SIGKILL, current, 1), > and then we could catch it in the debugger; the pc should > help identify if it was a syscall, page fault, or other trap. > > I'm not sure there's an argument to be made for the user > process being able to catch the signal itself; presumably in > production you don't turn this mode on anyway, and in > development, assuming a debugger is probably fine. > > But if you want to argue for another signal (SIGILL?) please > do; I'm curious to hear if you think it would make more sense. Make it configurable as part of the prctl. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-11 22:28 ` Andy Lutomirski @ 2015-05-12 21:06 ` Chris Metcalf 2015-05-12 22:23 ` Andy Lutomirski 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-05-12 21:06 UTC (permalink / raw) To: Andy Lutomirski, Peter Zijlstra Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc@vger.kernel.org, Andrew Morton, linux-kernel@vger.kernel.org, Thomas Gleixner, Tejun Heo, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API On 05/11/2015 06:28 PM, Andy Lutomirski wrote: > [add peterz due to perf stuff] > > On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> Patch 6/6 proposes a mechanism to track down times when the >> kernel screws up and delivers an IRQ to a userspace-only task. >> Here, we're just trying to identify the times when an application >> screws itself up out of cluelessness, and provide a mechanism >> that allows the developer to easily figure out why and fix it. >> >> In particular, /proc/interrupts won't show syscalls or page faults, >> which are two easy ways applications can screw themselves >> when they think they're in userspace-only mode. Also, they don't >> provide sufficient precision to make it clear what part of the >> application caused the undesired kernel entry. > Perf does, though, complete with context. The perf_event suggestions are interesting, but I think it's plausible for this to be an alternate way to debug the issues that STRICT addresses. >> In this case, killing the task is appropriate, since that's exactly >> the semantics that have been asked for - it's like on architectures >> that don't natively support unaligned accesses, but fake it relatively >> slowly in the kernel, and in development you just say "give me a >> SIGBUS when that happens" and in production you might say >> "fix it up and let's try to keep going". > I think more control is needed. 
I also think that, if we go this > route, we should distinguish syscalls, synchronous non-syscall > entries, and asynchronous non-syscall entries. They're quite > different. I don't think it's necessary to distinguish the types. As long as we have a PC pointing to the instruction that triggered the problem, we can see if it's a system call instruction, a memory write that caused a page fault, a trap instruction, etc. We certainly could add infrastructure to capture syscall numbers, fault/signal numbers, etc etc, but I think it's overkill if it adds kernel overhead on entry/exit. >> A better implementation, I think, is to put the tests for "you >> screwed up and synchronously entered the kernel" in >> the syscall_trace_enter() code, which TIF_NOHZ already >> gets us into; > No, not unless you're planning on using that to distinguish syscalls > from other stuff *and* people think that's justified. So, the question is how we separate synchronous entries from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated tasks), and synchronous entries are application bugs. We'd like to deliver a signal for the latter, and do some kind of kernel diagnostics for the former. So we can't just add the test in the context tracking code, which doesn't actually know why we're entering or exiting. That's why I was thinking that the syscall_trace_entry and exception_enter paths were the best choices. I'm fairly sure that exception_enter is only done for synchronous traps, page faults, etc. Certainly on the tile architecture we include the trap number in the pt_regs, so it's possible to just examine the pt_regs and know why you entered or are exiting the kernel, but I don't think we can rely on that for all architectures. > It's far to easy to just make a tiny change to the entry code. Add a > tiny trivial change here, a few lines of asm (that's you, audit!) 
> there, some weird written-in-asm scheduling code over here, and you > end up with the truly awful mess that we currently have. > > If it really makes sense for this stuff to go with context tracking, > then fine, but we should *fix* the context tracking first rather than > kludging around it. I already have a prototype patch for the relevant > part of that. > >> there, we can test if the dataplane "strict" bit is >> set and the syscall is not prctl(), then we generate the error. >> (We'd exclude exit and exit_group here too, since we don't >> need to shoot down a task that's just trying to kill itself.) >> This needs a bit of platform-specific code for each platform, >> but that doesn't seem like too big a problem. > I'd rather avoid that, too. This feature isn't really arch-specific, > so let's avoid the arch stuff if at all possible. I'll put out a v2 of my patch that does both the things you advise against :-) just so we can have a strawman to think about how to do it better - unless you have a suggestion offhand as to how we can better differentiate sync and async entries into the kernel in a platform-independent way. I could imagine modifying user_exit() and exception_enter() to pass an identifier into the context system saying why they were changing contexts, so we could have syscalls, trap numbers, fault numbers, etc., and some way to query as to whether they were synchronous or asynchronous, and build this scheme on top of that, but I'm not sure the extra infrastructure is worthwhile. >> Likewise we can test in exception_enter() since that's only >> called for all the synchronous user entries like page faults. > Let's try to generalize a bit. There's also irq_entry and ist_enter, > and some of the exception_enter cases are for synchronous entries > while (IIRC -- could be wrong) others aren't always like that. I don't think we need to generalize this piece. irq_entry() shouldn't be reported by the STRICT mechanism but by kernel bug reporting. 
For ist_enter(), it looks like if you're coming from userspace it's just handled with exception_enter(). I'm more familiar with the tile architecture mechanisms than with x86, though, to be honest. >>> Also, I think that most users will be quite surprised if "strict >>> dataplane" code causes any machine check on the system to kill your >>> dataplane task. >> >> Fair point, and avoided by testing as described above instead. >> (Though presumably in development it's not such a big deal, >> and as I said you'd likely turn it off in production.) > Until you forget to turn it off in production because it worked so > nicely in development. I guess that's an argument for using a non-fatal signal with a handler from the get-go, since then even in production you'll just end up with a slightly heavier-weight kernel overhead (whatever stupid thing your application did, plus the time spent in the signal handler), but then after that you can get back to processing packets or whatever the app is doing. You had mentioned some alternatives to a catchable signal (a signal to some other process, or queuing to an fd); I think it still seems reasonable to just deliver a signal to the process, configurably by the prctl, and not do anything more complex. Does this seem reasonable to you at this point? > What if we added a mode to perf where delivery of a sample > synchronously (or semi-synchronously by catching it on the next exit > to userspace) freezes the delivering task? It would be like debugger > support via perf. > > peterz, do you think this would be a sensible thing to add to perf? > It would only make sense for some types of events (tracepoints and > hw_breakpoints mostly, I think). I suspect it's reasonable to consider this orthogonal, particularly if there is some skid between the actual violation by the application, and the freeze happening. 
You pushed back somewhat on prctl() in favor of a quiesce() syscall in your email, but it seemed like at the end of your email you were adopting the prctl() perspective. Is that true? I admit the prctl() still seems cleaner from my perspective. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-12 21:06 ` Chris Metcalf @ 2015-05-12 22:23 ` Andy Lutomirski 2015-05-15 21:25 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-05-12 22:23 UTC (permalink / raw) To: Chris Metcalf Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel@vger.kernel.org, Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc@vger.kernel.org, Christoph Lameter, Gilad Ben Yossef, Ingo Molnar On May 13, 2015 6:06 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > On 05/11/2015 06:28 PM, Andy Lutomirski wrote: >> >> [add peterz due to perf stuff] >> >> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >>> >>> Patch 6/6 proposes a mechanism to track down times when the >>> kernel screws up and delivers an IRQ to a userspace-only task. >>> Here, we're just trying to identify the times when an application >>> screws itself up out of cluelessness, and provide a mechanism >>> that allows the developer to easily figure out why and fix it. >>> >>> In particular, /proc/interrupts won't show syscalls or page faults, >>> which are two easy ways applications can screw themselves >>> when they think they're in userspace-only mode. Also, they don't >>> provide sufficient precision to make it clear what part of the >>> application caused the undesired kernel entry. >> >> Perf does, though, complete with context. > > > The perf_event suggestions are interesting, but I think it's plausible > for this to be an alternate way to debug the issues that STRICT > addresses. 
> > >>> In this case, killing the task is appropriate, since that's exactly >>> the semantics that have been asked for - it's like on architectures >>> that don't natively support unaligned accesses, but fake it relatively >>> slowly in the kernel, and in development you just say "give me a >>> SIGBUS when that happens" and in production you might say >>> "fix it up and let's try to keep going". >> >> I think more control is needed. I also think that, if we go this >> route, we should distinguish syscalls, synchronous non-syscall >> entries, and asynchronous non-syscall entries. They're quite >> different. > > > I don't think it's necessary to distinguish the types. As long as we > have a PC pointing to the instruction that triggered the problem, > we can see if it's a system call instruction, a memory write that > caused a page fault, a trap instruction, etc. Not true. PC right after a syscall insn could be any type of kernel entry, and you can't even reliably tell whether the syscall insn was executed or, on x86, whether it was a syscall at all. (x86 insns can't be reliably decoded backwards.) PC pointing at a load could be a page fault or an IPI. > We certainly could > add infrastructure to capture syscall numbers, fault/signal numbers, > etc etc, but I think it's overkill if it adds kernel overhead on > entry/exit. > None of these should add overhead. > >>> A better implementation, I think, is to put the tests for "you >>> screwed up and synchronously entered the kernel" in >>> the syscall_trace_enter() code, which TIF_NOHZ already >>> gets us into; >> >> No, not unless you're planning on using that to distinguish syscalls >> from other stuff *and* people think that's justified. > > > So, the question is how we separate synchronous entries > from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated > tasks), and synchronous entries are application bugs. 
We'd > like to deliver a signal for the latter, and do some kind of > kernel diagnostics for the former. So we can't just add the > test in the context tracking code, which doesn't actually know > why we're entering or exiting. Synchronous entries could be VM bugs, too. > > That's why I was thinking that the syscall_trace_entry and > exception_enter paths were the best choices. I'm fairly sure > that exception_enter is only done for synchronous traps, > page faults, etc. Maybe. Doing it through the actual entry/exit slow paths would be overhead-free, although I'm not sure that IRQs have real slow paths for entry. > > Certainly on the tile architecture we include the trap number > in the pt_regs, so it's possible to just examine the pt_regs and > know why you entered or are exiting the kernel, but I don't > think we can rely on that for all architectures. x86 can't do this. > I'll put out a v2 of my patch that does both the things you > advise against :-) just so we can have a strawman to think > about how to do it better - unless you have a suggestion > offhand as to how we can better differentiate sync and async > entries into the kernel in a platform-independent way. > > I could imagine modifying user_exit() and exception_enter() > to pass an identifier into the context system saying why they > were changing contexts, so we could have syscalls, trap > numbers, fault numbers, etc., and some way to query as > to whether they were synchronous or asynchronous, and > build this scheme on top of that, but I'm not sure the extra > infrastructure is worthwhile. > I'll take a look. Again, though, I think we really do need to distinguish at least MCE and NMI (on x86) from the others. > >> What if we added a mode to perf where delivery of a sample >> synchronously (or semi-synchronously by catching it on the next exit >> to userspace) freezes the delivering task? It would be like debugger >> support via perf. 
>> >> peterz, do you think this would be a sensible thing to add to perf? >> It would only make sense for some types of events (tracepoints and >> hw_breakpoints mostly, I think). > > > I suspect it's reasonable to consider this orthogonal, particularly > if there is some skid between the actual violation by the > application, and the freeze happening. > I think it could be done without skid, except for async entries, but for async entries we don't care about exact user state anyway. > You pushed back somewhat on prctl() in favor of a quiesce() > syscall in your email, but it seemed like at the end of your > email you were adopting the prctl() perspective. Is that true? > I admit the prctl() still seems cleaner from my perspective. > Prctl for the strict thing seems much more reasonable to me than prctl for quiescing. Also, the scheduler people seem to think that quiescing should be automatic. Anyway, I'll happily look at code and maybe even write more coherent emails when I'm back in town in a week. Since you're thinking that async entries should give kernel diagnostics instead of signals, maybe the right thing to do is to separate them out completely and try to address the individual entry types separately and as needed. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-12 22:23 ` Andy Lutomirski @ 2015-05-15 21:25 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-15 21:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel@vger.kernel.org, Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc@vger.kernel.org, Christoph Lameter, Gilad Ben Yossef, Ingo Molnar On 05/12/2015 06:23 PM, Andy Lutomirski wrote: > On May 13, 2015 6:06 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: >> On 05/11/2015 06:28 PM, Andy Lutomirski wrote: >>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>> In this case, killing the task is appropriate, since that's exactly >>>> the semantics that have been asked for - it's like on architectures >>>> that don't natively support unaligned accesses, but fake it relatively >>>> slowly in the kernel, and in development you just say "give me a >>>> SIGBUS when that happens" and in production you might say >>>> "fix it up and let's try to keep going". >>> I think more control is needed. I also think that, if we go this >>> route, we should distinguish syscalls, synchronous non-syscall >>> entries, and asynchronous non-syscall entries. They're quite >>> different. >> >> I don't think it's necessary to distinguish the types. As long as we >> have a PC pointing to the instruction that triggered the problem, >> we can see if it's a system call instruction, a memory write that >> caused a page fault, a trap instruction, etc. > Not true. PC right after a syscall insn could be any type of kernel > entry, and you can't even reliably tell whether the syscall insn was > executed or, on x86, whether it was a syscall at all. (x86 insns > can't be reliably decoded backwards.) > > PC pointing at a load could be a page fault or an IPI. 
All that we are trying to do with this API, though, is distinguish synchronous faults. So IPIs, etc., should not be happening (they would be bugs), and hopefully we are mostly just distinguishing different types of synchronous program entries. That said, I did add a siginfo flag to differentiate syscalls from other synchronous entries, and I'm open to adding more such differentiation if it seems useful. > Again, though, I think we really do need to distinguish at least MCE > and NMI (on x86) from the others. Yes, those are both interesting cases, and I'm not entirely sure what the right way to handle them is - for example, likely disabling STRICT if you are running with perf enabled. I look forward to hearing more when you're back next week! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-08 17:58 ` [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode Chris Metcalf 2015-05-09 7:28 ` Andy Lutomirski @ 2015-05-12 9:38 ` Peter Zijlstra [not found] ` <20150512093858.GI21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 1 sibling, 1 reply; 159+ messages in thread From: Peter Zijlstra @ 2015-05-12 9:38 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote: > +++ b/kernel/time/tick-sched.c > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void) > (jiffies - start)); > dump_stack(); > } > + > + /* > + * Kill the process if it violates STRICT mode. Note that this > + * code also results in killing the task if a kernel bug causes an > + * irq to be delivered to this core. > + */ > + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL)) > + == PR_DATAPLANE_STRICT) { > + pr_warn("Dataplane STRICT mode violated; process killed.\n"); > + dump_stack(); > + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE; > + local_irq_enable(); > + do_group_exit(SIGKILL); > + } > } So while I'm all for hard fails like this, can we not provide a wee bit more information in the siginfo ? And maybe use a slightly less fatal signal, such that userspace can actually catch it and dump state in debug modes? ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <20150512093858.GI21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode [not found] ` <20150512093858.GI21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2015-05-12 13:20 ` Paul E. McKenney 0 siblings, 0 replies; 159+ messages in thread From: Paul E. McKenney @ 2015-05-12 13:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote: > On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote: > > +++ b/kernel/time/tick-sched.c > > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void) > > (jiffies - start)); > > dump_stack(); > > } > > + > > + /* > > + * Kill the process if it violates STRICT mode. Note that this > > + * code also results in killing the task if a kernel bug causes an > > + * irq to be delivered to this core. > > + */ > > + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL)) > > + == PR_DATAPLANE_STRICT) { > > + pr_warn("Dataplane STRICT mode violated; process killed.\n"); > > + dump_stack(); > > + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE; > > + local_irq_enable(); > > + do_group_exit(SIGKILL); > > + } > > } > > So while I'm all for hard fails like this, can we not provide a wee bit > more information in the siginfo ? And maybe use a slightly less fatal > signal, such that userspace can actually catch it and dump state in > debug modes? Agreed, a bit more debug state would be helpful. Thanx, Paul ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <1431107927-13998-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <1431107927-13998-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-05-08 21:18 ` Andrew Morton 2015-05-08 21:22 ` Steven Rostedt 2015-05-15 21:26 ` [PATCH v2 0/5] support "cpu_isolated" " Chris Metcalf 1 sibling, 1 reply; 159+ messages in thread From: Andrew Morton @ 2015-05-08 21:18 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > A prctl() option (PR_SET_DATAPLANE) is added Dumb question: what does the term "dataplane" mean in this context? I can't see the relationship between those words and what this patch does. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-08 21:18 ` [PATCH 0/6] support "dataplane" mode for nohz_full Andrew Morton @ 2015-05-08 21:22 ` Steven Rostedt [not found] ` <20150508172210.559830a9-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Steven Rostedt @ 2015-05-08 21:22 UTC (permalink / raw) To: Andrew Morton Cc: Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, 8 May 2015 14:18:24 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > A prctl() option (PR_SET_DATAPLANE) is added > > Dumb question: what does the term "dataplane" mean in this context? I > can't see the relationship between those words and what this patch > does. I was thinking the same thing. I haven't gotten around to searching DATAPLANE yet. I would assume we want a name that is more meaningful for what is happening. -- Steve ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <20150508172210.559830a9-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <20150508172210.559830a9-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org> @ 2015-05-08 23:11 ` Chris Metcalf 2015-05-08 23:19 ` Andrew Morton 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-05-08 23:11 UTC (permalink / raw) To: Steven Rostedt, Andrew Morton Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 5/8/2015 5:22 PM, Steven Rostedt wrote: > On Fri, 8 May 2015 14:18:24 -0700 > Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote: > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> >>> A prctl() option (PR_SET_DATAPLANE) is added >> Dumb question: what does the term "dataplane" mean in this context? I >> can't see the relationship between those words and what this patch >> does. > I was thinking the same thing. I haven't gotten around to searching > DATAPLANE yet. > > I would assume we want a name that is more meaningful for what is > happening. The text in the commit message and the 0/6 cover letter do try to explain the concept. The terminology comes, I think, from networking line cards, where the "dataplane" is the part of the application that handles all the fast path processing of network packets, and the "control plane" is the part that handles routing updates, etc., generally slow-path stuff. I've probably just been using the terms so long they seem normal to me. That said, what would be clearer? NO_HZ_STRICT as a superset of NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all, we're talking about no interrupts of any kind, and maybe NO_HZ is too limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? 
Or look to vendors who ship bare-metal runtimes and call it BARE_METAL? Borrow the Tilera marketing name and call it ZERO_OVERHEAD? Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, of course :-) -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-08 23:11 ` Chris Metcalf @ 2015-05-08 23:19 ` Andrew Morton 2015-05-09 7:05 ` Ingo Molnar 0 siblings, 1 reply; 159+ messages in thread From: Andrew Morton @ 2015-05-08 23:19 UTC (permalink / raw) To: Chris Metcalf Cc: Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > On Fri, 8 May 2015 14:18:24 -0700 > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> > >>> A prctl() option (PR_SET_DATAPLANE) is added > >> Dumb question: what does the term "dataplane" mean in this context? I > >> can't see the relationship between those words and what this patch > >> does. > > I was thinking the same thing. I haven't gotten around to searching > > DATAPLANE yet. > > > > I would assume we want a name that is more meaningful for what is > > happening. > > The text in the commit message and the 0/6 cover letter do try to explain > the concept. The terminology comes, I think, from networking line cards, > where the "dataplane" is the part of the application that handles all the > fast path processing of network packets, and the "control plane" is the part > that handles routing updates, etc., generally slow-path stuff. I've probably > just been using the terms so long they seem normal to me. > > That said, what would be clearer? NO_HZ_STRICT as a superset of > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all, > we're talking about no interrupts of any kind, and maybe NO_HZ is too > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > to vendors who ship bare-metal runtimes and call it BARE_METAL? 
> Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > of course :-) I like NO_INTERRUPTS. Simple, direct. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-08 23:19 ` Andrew Morton @ 2015-05-09 7:05 ` Ingo Molnar [not found] ` <20150509070538.GA9413-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Ingo Molnar @ 2015-05-09 7:05 UTC (permalink / raw) To: Andrew Morton Cc: Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Andrew Morton <akpm@linux-foundation.org> wrote: > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > > On Fri, 8 May 2015 14:18:24 -0700 > > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > >> > > >>> A prctl() option (PR_SET_DATAPLANE) is added > > >> Dumb question: what does the term "dataplane" mean in this context? I > > >> can't see the relationship between those words and what this patch > > >> does. > > > I was thinking the same thing. I haven't gotten around to searching > > > DATAPLANE yet. > > > > > > I would assume we want a name that is more meaningful for what is > > > happening. > > > > The text in the commit message and the 0/6 cover letter do try to explain > > the concept. The terminology comes, I think, from networking line cards, > > where the "dataplane" is the part of the application that handles all the > > fast path processing of network packets, and the "control plane" is the part > > that handles routing updates, etc., generally slow-path stuff. I've probably > > just been using the terms so long they seem normal to me. > > > > That said, what would be clearer? NO_HZ_STRICT as a superset of > > NO_HZ_FULL? 
Or move away from the NO_HZ terminology a bit; after all, > > we're talking about no interrupts of any kind, and maybe NO_HZ is too > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > > to vendors who ship bare-metal runtimes and call it BARE_METAL? > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > > of course :-) 'baremetal' has uses in virtualization speak, so I think that would be confusing. > I like NO_INTERRUPTS. Simple, direct. NO_HZ_PURE? That's what it's really about: user-space wants to run exclusively, in pure user-mode, without any interrupts. So I don't like 'NO_HZ_NO_INTERRUPTS' for a couple of reasons: - It is similar to a term we use in perf: PERF_PMU_CAP_NO_INTERRUPT. - Another reason is that 'NO_INTERRUPTS', in most existing uses in the kernel generally relates to some sort of hardware weakness, limitation, a negative property: that we try to limp along without having a hardware interrupt and have to poll. In other driver code that uses variants of NO_INTERRUPT it appears to be similar. So I think there's some confusion potential here. - Here the fact that we don't disturb user-space is an absolutely positive property, not a limitation, a kernel feature we work hard to achieve. NO_HZ_PURE would convey that while NO_HZ_NO_INTERRUPTS wouldn't. - NO_HZ_NO_INTERRUPTS has a double negation, and it's also too long, compared to NO_HZ_FULL or NO_HZ_PURE ;-) The term 'no HZ' already expresses that we don't have periodic interruptions. We just duplicate that information with NO_HZ_NO_INTERRUPTS, while NO_HZ_FULL or NO_HZ_PURE qualifies it, makes it a stronger property - which is what we want I think. So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be such a 'zero overhead' mode of operation, where if user-space runs, it won't get interrupted in any way. 
There's no need to add yet another Kconfig variant - lets just enhance the current stuff and maybe rename it to NO_HZ_PURE to better express its intent. Thanks, Ingo ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <20150509070538.GA9413-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <20150509070538.GA9413-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-09 7:19 ` Andy Lutomirski [not found] ` <CALCETrXavog018+xLacXeBLaMLjWtqk0bMU5fUzZ+pkwgu7Y3A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <55510885.9070101@ezchip.com> 2015-05-09 7:19 ` Mike Galbraith 2015-05-11 12:57 ` Steven Rostedt 2 siblings, 2 replies; 159+ messages in thread From: Andy Lutomirski @ 2015-05-09 7:19 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Sat, May 9, 2015 at 12:05 AM, Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > > * Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote: > >> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> >> > On 5/8/2015 5:22 PM, Steven Rostedt wrote: >> > > On Fri, 8 May 2015 14:18:24 -0700 >> > > Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote: >> > > >> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> > >> >> > >>> A prctl() option (PR_SET_DATAPLANE) is added >> > >> Dumb question: what does the term "dataplane" mean in this context? I >> > >> can't see the relationship between those words and what this patch >> > >> does. >> > > I was thinking the same thing. I haven't gotten around to searching >> > > DATAPLANE yet. >> > > >> > > I would assume we want a name that is more meaningful for what is >> > > happening. >> > >> > The text in the commit message and the 0/6 cover letter do try to explain >> > the concept. 
The terminology comes, I think, from networking line cards, >> > where the "dataplane" is the part of the application that handles all the >> > fast path processing of network packets, and the "control plane" is the part >> > that handles routing updates, etc., generally slow-path stuff. I've probably >> > just been using the terms so long they seem normal to me. >> > >> > That said, what would be clearer? NO_HZ_STRICT as a superset of >> > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all, >> > we're talking about no interrupts of any kind, and maybe NO_HZ is too >> > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look >> > to vendors who ship bare-metal runtimes and call it BARE_METAL? >> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? >> > >> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, >> > of course :-) > > 'baremetal' has uses in virtualization speak, so I think that would be > confusing. > >> I like NO_INTERRUPTS. Simple, direct. > > NO_HZ_PURE? > Naming aside, I don't think this should be a per-task flag at all. We already have way too much overhead per syscall in nohz mode, and it would be nice to get the per-syscall overhead as low as possible. We should strive, for all tasks, to keep syscall overhead down *and* avoid as many interrupts as possible. That being said, I do see a legitimate use for a way to tell the kernel "I'm going to run in userspace for a long time; stay away". But shouldn't that be a single operation, not an ongoing flag? IOW, I think that we should have a new syscall quiesce() or something rather than a prctl. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <CALCETrXavog018+xLacXeBLaMLjWtqk0bMU5fUzZ+pkwgu7Y3A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <CALCETrXavog018+xLacXeBLaMLjWtqk0bMU5fUzZ+pkwgu7Y3A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-11 19:54 ` Chris Metcalf [not found] ` <555108FC.3060200-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-05-11 19:54 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar Cc: Andrew Morton, Steven Rostedt, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (Oops, resending and forcing html off.) On 05/09/2015 03:19 AM, Andy Lutomirski wrote: > Naming aside, I don't think this should be a per-task flag at all. We > already have way too much overhead per syscall in nohz mode, and it > would be nice to get the per-syscall overhead as low as possible. We > should strive, for all tasks, to keep syscall overhead down*and* > avoid as many interrupts as possible. > > That being said, I do see a legitimate use for a way to tell the > kernel "I'm going to run in userspace for a long time; stay away". > But shouldn't that be a single operation, not an ongoing flag? IOW, I > think that we should have a new syscall quiesce() or something rather > than a prctl. Yes, if all you are concerned about is quiescing the tick, we could probably do it as a new syscall. I do note that you'd want to try to actually do the quiesce as late as possible - in particular, if you just did it in the usual syscall, you might miss out on a timer that is set by softirq, or even something that happened when you called schedule() on the syscall exit path. Doing it as late as we are doing helps to ensure that that doesn't happen. 
We could still arrange for this semantics by having a new quiesce() syscall set a temporary task bit that was cleared on return to userspace, but as you pointed out in a different email, that gets tricky if you end up doing multiple user_exit() calls on your way back to userspace. More to the point, I think it's actually important to know when an application believes it's in userspace-only mode as an actual state bit, rather than just during its transitional moment. If an application calls the kernel at an unexpected time (third-party code is the usual culprit for our customers, whether it's syscalls, page faults, or other things) we would prefer to have the "quiesce" semantics stay in force and cause the third-party code to be visibly very slow, rather than cause a totally unexpected and hard-to-diagnose interrupt show up later as we are still going around the loop that we thought was safely userspace-only. And, for debugging the kernel, it's crazy helpful to have that state bit in place: see patch 6/6 in the series for how we can diagnose things like "a different core just queued an IPI that will hit a dataplane core unexpectedly". Having that state bit makes this sort of thing a trivial check in the kernel and relatively easy to debug. Finally, I proposed a "strict" mode in patch 5/6 where we kill the process if it voluntarily enters the kernel by mistake after saying it wasn't going to any more. To do this requires a state bit, so carrying another state bit for "quiesce on user entry" seems pretty reasonable. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <555108FC.3060200-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <555108FC.3060200-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-05-11 22:15 ` Andy Lutomirski 0 siblings, 0 replies; 159+ messages in thread From: Andy Lutomirski @ 2015-05-11 22:15 UTC (permalink / raw) To: Chris Metcalf Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Christoph Lameter, Gilad Ben Yossef, Ingo Molnar On May 12, 2015 4:54 AM, "Chris Metcalf" <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > > (Oops, resending and forcing html off.) > > > On 05/09/2015 03:19 AM, Andy Lutomirski wrote: >> >> Naming aside, I don't think this should be a per-task flag at all. We >> already have way too much overhead per syscall in nohz mode, and it >> would be nice to get the per-syscall overhead as low as possible. We >> should strive, for all tasks, to keep syscall overhead down*and* >> avoid as many interrupts as possible. >> >> That being said, I do see a legitimate use for a way to tell the >> kernel "I'm going to run in userspace for a long time; stay away". >> But shouldn't that be a single operation, not an ongoing flag? IOW, I >> think that we should have a new syscall quiesce() or something rather >> than a prctl. > > > Yes, if all you are concerned about is quiescing the tick, we could > probably do it as a new syscall. > > I do note that you'd want to try to actually do the quiesce as late as > possible - in particular, if you just did it in the usual syscall, you > might miss out on a timer that is set by softirq, or even something > that happened when you called schedule() on the syscall exit path. > Doing it as late as we are doing helps to ensure that that doesn't > happen. 
We could still arrange for this semantics by having a new > quiesce() syscall set a temporary task bit that was cleared on > return to userspace, but as you pointed out in a different email, > that gets tricky if you end up doing multiple user_exit() calls on > your way back to userspace. We should fix that, then. A quiesce() syscall can certainly arrange to clean up on final exit. > > More to the point, I think it's actually important to know when an > application believes it's in userspace-only mode as an actual state > bit, rather than just during its transitional moment. We can do that, too, with a new flag that's cleared on the next entry. > If an > application calls the kernel at an unexpected time (third-party code > is the usual culprit for our customers, whether it's syscalls, page > faults, or other things) we would prefer to have the "quiesce" > semantics stay in force and cause the third-party code to be > visibly very slow, rather than cause a totally unexpected and > hard-to-diagnose interrupt show up later as we are still going > around the loop that we thought was safely userspace-only. I'm not really convinced that we should design this feature around ease of debugging userspace screwups. There are already plenty of ways to do that part. Userspace getting an interrupt because userspace accidentally did a syscall is very different from userspace getting interrupted due to an IPI. > > And, for debugging the kernel, it's crazy helpful to have that state > bit in place: see patch 6/6 in the series for how we can diagnose > things like "a different core just queued an IPI that will hit a > dataplane core unexpectedly". Having that state bit makes this sort > of thing a trivial check in the kernel and relatively easy to debug. As above, this can be done with a one-time operation, too. > > Finally, I proposed a "strict" mode in patch 5/6 where we kill the > process if it voluntarily enters the kernel by mistake after saying it > wasn't going to any more. 
To do this requires a state bit, so > carrying another state bit for "quiesce on user entry" seems pretty > reasonable. I still dislike that in the form you chose. It's too deadly to be useful for anyone but the hardest RT users. I think I'd be okay with variants, though: let a suitably privileged process ask for a signal on inadvertent kernel entry or rig up an fd to be notified when one of these bad entries happens. Queueing something to a pollable fd would work, too. See that thread for more comments. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <55510885.9070101@ezchip.com>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <55510885.9070101@ezchip.com> @ 2015-05-12 13:18 ` Paul E. McKenney 0 siblings, 0 replies; 159+ messages in thread From: Paul E. McKenney @ 2015-05-12 13:18 UTC (permalink / raw) To: Chris Metcalf Cc: Andy Lutomirski, Ingo Molnar, Andrew Morton, Steven Rostedt, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Mon, May 11, 2015 at 03:52:37PM -0400, Chris Metcalf wrote: > On 05/09/2015 03:19 AM, Andy Lutomirski wrote: > >Naming aside, I don't think this should be a per-task flag at all. We > >already have way too much overhead per syscall in nohz mode, and it > >would be nice to get the per-syscall overhead as low as possible. We > >should strive, for all tasks, to keep syscall overhead down*and* > >avoid as many interrupts as possible. > > > >That being said, I do see a legitimate use for a way to tell the > >kernel "I'm going to run in userspace for a long time; stay away". > >But shouldn't that be a single operation, not an ongoing flag? IOW, I > >think that we should have a new syscall quiesce() or something rather > >than a prctl. > > Yes, if all you are concerned about is quiescing the tick, we could > probably do it as a new syscall. > > I do note that you'd want to try to actually do the quiesce as late as > possible - in particular, if you just did it in the usual syscall, you > might miss out on a timer that is set by softirq, or even something > that happened when you called schedule() on the syscall exit path. > Doing it as late as we are doing helps to ensure that that doesn't > happen. 
We could still arrange for this semantics by having a new > quiesce() syscall set a temporary task bit that was cleared on > return to userspace, but as you pointed out in a different email, > that gets tricky if you end up doing multiple user_exit() calls on > your way back to userspace. > > More to the point, I think it's actually important to know when an > application believes it's in userspace-only mode as an actual state > bit, rather than just during its transitional moment. If an > application calls the kernel at an unexpected time (third-party code > is the usual culprit for our customers, whether it's syscalls, page > faults, or other things) we would prefer to have the "quiesce" > semantics stay in force and cause the third-party code to be > visibly very slow, rather than cause a totally unexpected and > hard-to-diagnose interrupt show up later as we are still going > around the loop that we thought was safely userspace-only. > > And, for debugging the kernel, it's crazy helpful to have that state > bit in place: see patch 6/6 in the series for how we can diagnose > things like "a different core just queued an IPI that will hit a > dataplane core unexpectedly". Having that state bit makes this sort > of thing a trivial check in the kernel and relatively easy to debug. I agree with this! It is currently a bit painful to debug problems that might result in multiple tasks runnable on a given CPU. If you suspect a problem, you enable tracing and re-run. Not particularly friendly for chasing down intermittent problems, so some sort of improvement would be a very good thing. Thanx, Paul > Finally, I proposed a "strict" mode in patch 5/6 where we kill the > process if it voluntarily enters the kernel by mistake after saying it > wasn't going to any more. To do this requires a state bit, so > carrying another state bit for "quiesce on user entry" seems pretty > reasonable. 
> > -- > Chris Metcalf, EZChip Semiconductor > http://www.ezchip.com > ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <20150509070538.GA9413-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-05-09 7:19 ` Andy Lutomirski @ 2015-05-09 7:19 ` Mike Galbraith [not found] ` <1431155983.3209.131.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-05-11 12:57 ` Steven Rostedt 2 siblings, 1 reply; 159+ messages in thread From: Mike Galbraith @ 2015-05-09 7:19 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote: > * Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote: > > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > > > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > > > On Fri, 8 May 2015 14:18:24 -0700 > > > > Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote: > > > > > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > > > >> > > > >>> A prctl() option (PR_SET_DATAPLANE) is added > > > >> Dumb question: what does the term "dataplane" mean in this context? I > > > >> can't see the relationship between those words and what this patch > > > >> does. > > > > I was thinking the same thing. I haven't gotten around to searching > > > > DATAPLANE yet. > > > > > > > > I would assume we want a name that is more meaningful for what is > > > > happening. > > > > > > The text in the commit message and the 0/6 cover letter do try to explain > > > the concept. 
The terminology comes, I think, from networking line cards, > > > where the "dataplane" is the part of the application that handles all the > > > fast path processing of network packets, and the "control plane" is the part > > > that handles routing updates, etc., generally slow-path stuff. I've probably > > > just been using the terms so long they seem normal to me. > > > > > > That said, what would be clearer? NO_HZ_STRICT as a superset of > > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all, > > > we're talking about no interrupts of any kind, and maybe NO_HZ is too > > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > > > to vendors who ship bare-metal runtimes and call it BARE_METAL? > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > > > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > > > of course :-) > > 'baremetal' has uses in virtualization speak, so I think that would be > confusing. > > > I like NO_INTERRUPTS. Simple, direct. > > NO_HZ_PURE? Hm, coke light, coke zero... OS_LIGHT and OS_ZERO? -Mike ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <1431155983.3209.131.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* RE: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <1431155983.3209.131.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-09 10:18 ` Gilad Ben Yossef 0 siblings, 0 replies; 159+ messages in thread From: Gilad Ben Yossef @ 2015-05-09 10:18 UTC (permalink / raw) To: Mike Galbraith, Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > From: Mike Galbraith [mailto:umgwanakikbuti@gmail.com] > Sent: Saturday, May 09, 2015 10:20 AM > To: Ingo Molnar > Cc: Andrew Morton; Chris Metcalf; Steven Rostedt; Gilad Ben Yossef; Ingo > Molnar; Peter Zijlstra; Rik van Riel; Tejun Heo; Frederic Weisbecker; > Thomas Gleixner; Paul E. McKenney; Christoph Lameter; Srivatsa S. Bhat; > linux-doc@vger.kernel.org; linux-api@vger.kernel.org; linux- > kernel@vger.kernel.org > Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full > > On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> > wrote: > > > > > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > > > > On Fri, 8 May 2015 14:18:24 -0700 > > > > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf > <cmetcalf@ezchip.com> wrote: > > > > >> > > > > >>> A prctl() option (PR_SET_DATAPLANE) is added > > > > >> Dumb question: what does the term "dataplane" mean in this > context? I > > > > >> can't see the relationship between those words and what this > patch > > > > >> does. > > > > > I was thinking the same thing. I haven't gotten around to > searching > > > > > DATAPLANE yet. 
> > > > > > > > > > I would assume we want a name that is more meaningful for what is > > > > > happening. > > > > > > > > The text in the commit message and the 0/6 cover letter do try to > explain > > > > the concept. The terminology comes, I think, from networking line > cards, > > > > where the "dataplane" is the part of the application that handles > all the > > > > fast path processing of network packets, and the "control plane" is > the part > > > > that handles routing updates, etc., generally slow-path stuff. I've > probably > > > > just been using the terms so long they seem normal to me. > > > > > > > > That said, what would be clearer? NO_HZ_STRICT as a superset of > > > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after > all, > > > > we're talking about no interrupts of any kind, and maybe NO_HZ is > too > > > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > > > > to vendors who ship bare-metal runtimes and call it BARE_METAL? > > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > > > > > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > > > > of course :-) > > > > 'baremetal' has uses in virtualization speak, so I think that would be > > confusing. > > > > > I like NO_INTERRUPTS. Simple, direct. > > > > NO_HZ_PURE? > > Hm, coke light, coke zero... OS_LIGHT and OS_ZERO? LOL... you forgot OS_CLASSIC for backwards compatibility :-) How about TASK_SOLO? Yes, you are trying to achieve the least amount of interference but the bigger context is about monopolizing a single CPU for yourself. Anyway it is worth pointing out that while NO_HZ_FULL is very useful in conjunction with this turning the tick off is useful also if you have multiple tasks runnable (e.g. if you know you only need to context switch in 100 ms, why keep a periodic interrupt running?) even though we don't support it *right now*. It might be a good idea not to entangle these concepts too much. 
Gilad Gilad Ben-Yossef Chief Software Architect EZchip Technologies Ltd. 37 Israel Pollak Ave, Kiryat Gat 82025, Israel Tel: +972-4-959-6666 ext. 576, Fax: +972-8-681-1483 Mobile: +972-52-826-0388, US Mobile: +1-973-826-0388 Email: giladb@ezchip.com, Web: http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <20150509070538.GA9413-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-05-09 7:19 ` Andy Lutomirski 2015-05-09 7:19 ` Mike Galbraith @ 2015-05-11 12:57 ` Steven Rostedt 2015-05-11 15:36 ` Frederic Weisbecker ` (2 more replies) 2 siblings, 3 replies; 159+ messages in thread From: Steven Rostedt @ 2015-05-11 12:57 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA NO_HZ_LEAVE_ME_THE_FSCK_ALONE! On Sat, 9 May 2015 09:05:38 +0200 Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be > such a 'zero overhead' mode of operation, where if user-space runs, it > won't get interrupted in any way. All kidding aside, I think this is the real answer. We don't need a new NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly what it was created to do. That should be fixed. Please let's get NO_HZ_FULL up to par. That should be the main focus. -- Steve ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 12:57 ` Steven Rostedt @ 2015-05-11 15:36 ` Frederic Weisbecker 2015-05-11 19:19 ` Mike Galbraith 2015-05-11 17:19 ` Paul E. McKenney [not found] ` <20150511085759.71deeb64-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org> 2 siblings, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-05-11 15:36 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > > On Sat, 9 May 2015 09:05:38 +0200 > Ingo Molnar <mingo@kernel.org> wrote: > > > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep > > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be > > such a 'zero overhead' mode of operation, where if user-space runs, it > > won't get interrupted in any way. > > > All kidding aside, I think this is the real answer. We don't need a new > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > what it was created to do. That should be fixed. > > Please lets get NO_HZ_FULL up to par. That should be the main focus. Now if we can achieve to make NO_HZ_FULL behave in a specific way that fits everyone's usecase, I'll be happy. But some people may expect hard isolation requirement (Real Time, deterministic latency) and others softer isolation (HPC, only interested in performance, can live with one rare random tick, so no need to loop before returning to userspace until we have the no-noise guarantee). I expect some Real Time users may want this kind of dataplane mode where a syscall or whatever sleeps until the system is ready to provide the guarantee that no disturbance is going to happen for a given time. 
I'm not sure HPC users are interested in that. In fact it goes along with the fact that NO_HZ_FULL was really only supposed to be about the tick, and now people are introducing more and more kernel default settings that assume NO_HZ_FULL implies ISOLATION, which is about all kinds of noise (tick, tasks, irqs, ...). Which is true, but what kind of ISOLATION? Probably NO_HZ_FULL should really only be about stopping the tick; then some sort of CONFIG_ISOLATION would drive the kind of isolation we are interested in, and thereby the behaviour of NO_HZ_FULL, workqueues, timers, task affinity, irq affinity, dataplane mode, ... ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 15:36 ` Frederic Weisbecker @ 2015-05-11 19:19 ` Mike Galbraith 2015-05-11 19:25 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Mike Galbraith @ 2015-05-11 19:19 UTC (permalink / raw) To: Frederic Weisbecker Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 2015-05-11 at 17:36 +0200, Frederic Weisbecker wrote: > I expect some Real Time users may want this kind of dataplane mode where a syscall > or whatever sleeps until the system is ready to provide the guarantee that no > disturbance is going to happen for a given time. I'm not sure HPC users are interested > in that. I bet they are. RT is just a different way to spell HPC, and reverse. > In fact it goes along the fact that NO_HZ_FULL was really only supposed to be about > the tick and now people are introducing more and more kernel default presetting that > assume NO_HZ_FULL implies ISOLATION which is about all kind of noise (tick, tasks, irqs, > ...). Which is true but what kind of ISOLATION? True, nohz mode and various isolation measures are distinct properties. NO_HZ_FULL is kinda pointless without isolation measures to go with it, but you're right. I really shouldn't have acked nohz_full -> isolcpus. Beside the fact that old static isolcpus was _supposed_ to crawl off and die, I know beyond doubt that having isolated a cpu as well as you can definitely does NOT imply that said cpu should become tickless. I routinely run a load model that wants all the isolation it can get. It's not single task compute though, rt executive coordinating rt workers, and of course wants every cycle it can get, so nohz_full is less than helpful. -Mike ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 19:19 ` Mike Galbraith @ 2015-05-11 19:25 ` Chris Metcalf 2015-05-12 1:47 ` Mike Galbraith 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-05-11 19:25 UTC (permalink / raw) To: Mike Galbraith, Frederic Weisbecker Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 05/11/2015 03:19 PM, Mike Galbraith wrote: > I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > that old static isolcpus was_supposed_ to crawl off and die, I know > beyond doubt that having isolated a cpu as well as you can definitely > does NOT imply that said cpu should become tickless. True, at a high level, I agree that it would be better to have a top-level concept like Frederic's proposed ISOLATION that includes isolcpus and nohz_cpu (and other stuff as needed). That said, what you wrote above is wrong; even with the patch you acked, setting isolcpus does not automatically turn on nohz_full for a given cpu. The patch made it true the other way around: when you say nohz_full, you automatically get isolcpus on that cpu too. That does, at least, make sense for the semantics of nohz_full. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 19:25 ` Chris Metcalf @ 2015-05-12 1:47 ` Mike Galbraith 2015-05-12 4:35 ` Mike Galbraith 0 siblings, 1 reply; 159+ messages in thread From: Mike Galbraith @ 2015-05-12 1:47 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote: > On 05/11/2015 03:19 PM, Mike Galbraith wrote: > > I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > > that old static isolcpus was_supposed_ to crawl off and die, I know > > beyond doubt that having isolated a cpu as well as you can definitely > > does NOT imply that said cpu should become tickless. > > True, at a high level, I agree that it would be better to have a > top-level concept like Frederic's proposed ISOLATION that includes > isolcpus and nohz_cpu (and other stuff as needed). > > That said, what you wrote above is wrong; even with the patch you > acked, setting isolcpus does not automatically turn on nohz_full for > a given cpu. The patch made it true the other way around: when > you say nohz_full, you automatically get isolcpus on that cpu too. > That does, at least, make sense for the semantics of nohz_full. I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus. Yes, with nohz_full currently being static, the old allegedly dying but also static isolcpus scheduler off switch is a convenient thing to wire the nohz_full CPU SET (<- hint;) property to. -Mike ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-12 1:47 ` Mike Galbraith @ 2015-05-12 4:35 ` Mike Galbraith 0 siblings, 0 replies; 159+ messages in thread From: Mike Galbraith @ 2015-05-12 4:35 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, 2015-05-12 at 03:47 +0200, Mike Galbraith wrote: > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote: > > On 05/11/2015 03:19 PM, Mike Galbraith wrote: > > > I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > > > that old static isolcpus was_supposed_ to crawl off and die, I know > > > beyond doubt that having isolated a cpu as well as you can definitely > > > does NOT imply that said cpu should become tickless. > > > > True, at a high level, I agree that it would be better to have a > > top-level concept like Frederic's proposed ISOLATION that includes > > isolcpus and nohz_cpu (and other stuff as needed). > > > > That said, what you wrote above is wrong; even with the patch you > > acked, setting isolcpus does not automatically turn on nohz_full for > > a given cpu. The patch made it true the other way around: when > > you say nohz_full, you automatically get isolcpus on that cpu too. > > That does, at least, make sense for the semantics of nohz_full. > > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus. > Yes, with nohz_full currently being static, the old allegedly dying but > also static isolcpus scheduler off switch is a convenient thing to wire > the nohz_full CPU SET (<- hint;) property to. BTW, another facet of this: Rik wants to make isolcpus immune to cpusets, which makes some sense, user did say isolcpus=, but that also makes isolcpus truly static. 
If the user now says nohz_full=, they lose the ability to deactivate CPU isolation, making the set fairly useless for anything other than HPC. Currently, the user can flip the isolation switch as he sees fit. He takes a size extra large performance hit for having said nohz_full=, but he doesn't lose generic utility. -Mike ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 12:57 ` Steven Rostedt 2015-05-11 15:36 ` Frederic Weisbecker @ 2015-05-11 17:19 ` Paul E. McKenney 2015-05-11 17:27 ` Andrew Morton [not found] ` <20150511085759.71deeb64-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org> 2 siblings, 1 reply; 159+ messages in thread From: Paul E. McKenney @ 2015-05-11 17:19 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! NO_HZ_OVERFLOWING? Kconfig naming controversy aside, I believe this patchset is addressing a real need. Might need additional adjustment, but something useful. Thanx, Paul > On Sat, 9 May 2015 09:05:38 +0200 > Ingo Molnar <mingo@kernel.org> wrote: > > > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep > > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be > > such a 'zero overhead' mode of operation, where if user-space runs, it > > won't get interrupted in any way. > > > All kidding aside, I think this is the real answer. We don't need a new > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > what it was created to do. That should be fixed. > > Please lets get NO_HZ_FULL up to par. That should be the main focus. > > -- Steve > ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 17:19 ` Paul E. McKenney @ 2015-05-11 17:27 ` Andrew Morton 2015-05-11 17:33 ` Frederic Weisbecker 0 siblings, 1 reply; 159+ messages in thread From: Andrew Morton @ 2015-05-11 17:27 UTC (permalink / raw) To: paulmck Cc: Steven Rostedt, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > NO_HZ_OVERFLOWING? Actually, "NO_HZ" shouldn't appear in the name at all. The objective is to permit userspace to execute without interruption. NO_HZ is a part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical artifact from an early partial implementation. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 17:27 ` Andrew Morton @ 2015-05-11 17:33 ` Frederic Weisbecker 2015-05-11 18:00 ` Steven Rostedt 0 siblings, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-05-11 17:33 UTC (permalink / raw) To: Andrew Morton Cc: paulmck, Steven Rostedt, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote: > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > > > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > > > NO_HZ_OVERFLOWING? > > Actually, "NO_HZ" shouldn't appear in the name at all. The objective > is to permit userspace to execute without interruption. NO_HZ is a > part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical > artifact from an early partial implementation. Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 17:33 ` Frederic Weisbecker @ 2015-05-11 18:00 ` Steven Rostedt 2015-05-11 18:09 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Steven Rostedt @ 2015-05-11 18:00 UTC (permalink / raw) To: Frederic Weisbecker Cc: Andrew Morton, paulmck, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 11 May 2015 19:33:06 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote: > > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > > > > > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > > > > > NO_HZ_OVERFLOWING? > > > > Actually, "NO_HZ" shouldn't appear in the name at all. The objective > > is to permit userspace to execute without interruption. NO_HZ is a > > part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical > > artifact from an early partial implementation. > > Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION. Then we should have CONFIG_LEAVE_ME_THE_FSCK_ALONE. Hmm, I guess that's just an synonym for CONFIG_ISOLATION. -- Steve ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 18:00 ` Steven Rostedt @ 2015-05-11 18:09 ` Chris Metcalf [not found] ` <5550F077.6030906-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-05-12 9:10 ` CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) Ingo Molnar 0 siblings, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-11 18:09 UTC (permalink / raw) To: Steven Rostedt, Frederic Weisbecker Cc: Andrew Morton, paulmck, Ingo Molnar, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel A bunch of issues have been raised by various folks (thanks!) and I'll try to break them down and respond to them in a few different emails. This email is just about the issue of naming and whether the proposed patch series should even have its own "name" or just be part of NO_HZ_FULL. First, Ingo and Steven both suggested that this new "dataplane" mode (or whatever we want to call it; see below) should just be rolled into the existing NO_HZ_FULL and that we should focus on making that work better. Steven writes: > All kidding aside, I think this is the real answer. We don't need a new > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > what it was created to do. That should be fixed. The claim I'm making is that it's worthwhile to differentiate the two semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to avoid periodic interrupts without incurring any serious overhead". My patch series allows an app to request "kernel makes an absolute commitment to avoid all interrupts regardless of cost when leaving kernel space". These are different enough ideas, and serve different enough application needs, that I think they should be kept distinct. 
Frederic actually summed this up very nicely in his recent email when he wrote "some people may expect hard isolation requirement (Real Time, deterministic latency) and others softer isolation (HPC, only interested in performance, can live with one rare random tick, so no need to loop before returning to userspace until we have the no-noise guarantee)." So we need a way for apps to ask for the "harder" mode and let the softer mode be the default. What about naming? We may or may not want to have a Kconfig flag for this, and we may or may not have a separate mode for it, but we still will need some kind of name to talk about it with. (In particular there's the prctl name, if we take that approach, and potential boot command-line flags to consider naming for.) I'll quickly cover the suggestions that have been raised: - DATAPLANE. My suggestion, seemingly broadly disliked by folks who felt it wasn't apparent what it meant. Probably a fair point. - NO_INTERRUPTS (Andrew). Captures some of the sense, but was criticized pretty fairly by Ingo as being too negative, confusing with perf nomenclature, and too long :-) - PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could use it as a name for this new mode. However, I think it's not clear enough how FULL and PURE can/should relate to each other from the names alone. - BARE_METAL (me). Ingo observes it's confusing with respect to virtualization. - TASK_SOLO (Gilad). Not sure this conveys enough of the semantics. - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent ideas :-) - ISOLATION (Frederic). I like this but it conflicts with other uses of "isolation" in the kernel: cgroup isolation, lru page isolation, iommu isolation, scheduler isolation (at least it's a superset of that one), etc. Also, we're not exactly isolating a task - often a "dataplane" app consists of a bunch of interacting threads in userspace, so not exactly isolated. So perhaps it's too confusing. 
- OVERFLOWING (Steven) - not sure I understood this one, honestly. I suggested earlier a few other candidates that I don't love, but no one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD. One thing I'm leaning towards is to remove the intermediate state of DATAPLANE_ENABLE and say that there is really only one primary state, DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no quiesce" state probably isn't that useful, since it doesn't offer the hard guarantee that is the entire point of this patch series. So that opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the word that describes the mode; of course this sort of conflicts with RCU quiesce (though it is a superset of that so maybe that's OK). One new idea I had is to use NO_HZ_HARD to reflect what Frederic was suggesting about "soft" and "hard" requirements for NO_HZ. So enabling NO_HZ_HARD would enable my suggested QUIESCE mode. One way to focus this discussion is on the user API naming. I had prctl(PR_SET_DATAPLANE), which was attractive in being a "positive" noun. A lot of the other suggestions fail this test in various way. Reasonable candidates seem to be: PR_SET_OS_ZERO PR_SET_TASK_SOLO PR_SET_ISOLATION Another possibility: PR_SET_NONSTOP Or take Andrew's NO_INTERRUPTS and have: PR_SET_UNINTERRUPTED I slightly favor ISOLATION at this point despite the overlap with other kernel concepts. Let the bike-shedding continue! :-) -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <5550F077.6030906-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-05-11 18:36 ` Steven Rostedt 0 siblings, 0 replies; 159+ messages in thread From: Steven Rostedt @ 2015-05-11 18:36 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Ingo Molnar, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, 11 May 2015 14:09:59 -0400 Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > Steven writes: > > All kidding aside, I think this is the real answer. We don't need a new > > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > > what it was created to do. That should be fixed. > > The claim I'm making is that it's worthwhile to differentiate the two > semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to > avoid periodic interrupts without incurring any serious overhead". My > patch series allows an app to request "kernel makes an absolute > commitment to avoid all interrupts regardless of cost when leaving > kernel space". These are different enough ideas, and serve different > enough application needs, that I think they should be kept distinct. > > Frederic actually summed this up very nicely in his recent email when > he wrote "some people may expect hard isolation requirement (Real > Time, deterministic latency) and others softer isolation (HPC, only > interested in performance, can live with one rare random tick, so no > need to loop before returning to userspace until we have the no-noise > guarantee)." > > So we need a way for apps to ask for the "harder" mode and let > the softer mode be the default. Fair enough. But I would hope that this would improve on NO_HZ_FULL as well. > > What about naming? 
We may or may not want to have a Kconfig flag > for this, and we may or may not have a separate mode for it, but > we still will need some kind of name to talk about it with. (In > particular there's the prctl name, if we take that approach, and > potential boot command-line flags to consider naming for.) > > I'll quickly cover the suggestions that have been raised: > > - DATAPLANE. My suggestion, seemingly broadly disliked by folks > who felt it wasn't apparent what it meant. Probably a fair point. > > - NO_INTERRUPTS (Andrew). Captures some of the sense, but was > criticized pretty fairly by Ingo as being too negative, confusing > with perf nomenclature, and too long :-) What about NO_INTERRUPTIONS > > - PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could > use it as a name for this new mode. However, I think it's not clear > enough how FULL and PURE can/should relate to each other from the > names alone. I would find the two confusing as well. > > - BARE_METAL (me). Ingo observes it's confusing with respect to > virtualization. This is also confusing. > > - TASK_SOLO (Gilad). Not sure this conveys enough of the semantics. Agreed. > > - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent > ideas :-) At least the LEAVE_ME_ALONE conveys the semantics ;-) > > - ISOLATION (Frederic). I like this but it conflicts with other uses > of "isolation" in the kernel: cgroup isolation, lru page isolation, > iommu isolation, scheduler isolation (at least it's a superset of > that one), etc. Also, we're not exactly isolating a task - often > a "dataplane" app consists of a bunch of interacting threads in > userspace, so not exactly isolated. So perhaps it's too confusing. > > - OVERFLOWING (Steven) - not sure I understood this one, honestly. Actually, that was suggested by Paul McKenney. > > I suggested earlier a few other candidates that I don't love, but no > one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD. 
> > One thing I'm leaning towards is to remove the intermediate state of > DATAPLANE_ENABLE and say that there is really only one primary state, > DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no > quiesce" state probably isn't that useful, since it doesn't offer the > hard guarantee that is the entire point of this patch series. So that > opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the > word that describes the mode; of course this sort of conflicts with > RCU quiesce (though it is a superset of that so maybe that's OK). > > One new idea I had is to use NO_HZ_HARD to reflect what Frederic was > suggesting about "soft" and "hard" requirements for NO_HZ. So > enabling NO_HZ_HARD would enable my suggested QUIESCE mode. > > One way to focus this discussion is on the user API naming. I had > prctl(PR_SET_DATAPLANE), which was attractive in being a "positive" > noun. A lot of the other suggestions fail this test in various way. > Reasonable candidates seem to be: > > PR_SET_OS_ZERO > PR_SET_TASK_SOLO > PR_SET_ISOLATION > > Another possibility: > > PR_SET_NONSTOP > > Or take Andrew's NO_INTERRUPTS and have: > > PR_SET_UNINTERRUPTED For another possible answer, what about SET_TRANQUILITY A state with no disturbances. -- Steve > > I slightly favor ISOLATION at this point despite the overlap with > other kernel concepts. > > Let the bike-shedding continue! :-) > ^ permalink raw reply [flat|nested] 159+ messages in thread
* CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) 2015-05-11 18:09 ` Chris Metcalf [not found] ` <5550F077.6030906-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-05-12 9:10 ` Ingo Molnar [not found] ` <20150512091032.GA10138-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-05-12 21:05 ` CONFIG_ISOLATION=y Chris Metcalf 1 sibling, 2 replies; 159+ messages in thread From: Ingo Molnar @ 2015-05-12 9:10 UTC (permalink / raw) To: Chris Metcalf Cc: Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Chris Metcalf <cmetcalf@ezchip.com> wrote: > - ISOLATION (Frederic). I like this but it conflicts with other uses > of "isolation" in the kernel: cgroup isolation, lru page isolation, > iommu isolation, scheduler isolation (at least it's a superset of > that one), etc. Also, we're not exactly isolating a task - often > a "dataplane" app consists of a bunch of interacting threads in > userspace, so not exactly isolated. So perhaps it's too confusing. So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is a high level kernel feature, so it won't conflict with isolation concepts in lower level subsystems such as IOMMU isolation - and other higher level features like scheduler isolation are basically another partial implementation we want to merge with all this... nohz, RCU tricks, watchdog defaults, isolcpus and various other measures to keep these CPUs and workloads as isolated as possible are (or should become) components of this high level concept. Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost zero overhead on normal workloads and on non-isolated CPUs, so that Linux distributions can enable it. 
Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step needed: just like cpusets, the configuration of isolated CPUs should be a completely boot option free exercise that can be dynamically done and undone by the administrator via an intuitive interface.

Thanks,

	Ingo
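[For readers following the thread: the closest existing approximation to the boot-option-free, administrator-driven workflow Ingo describes is the cpuset interface. A rough sketch of shielding CPUs with it — paths, CPU numbers, and the cpuset name are illustrative, not anything from the patch series, and this assumes a v1 cpuset mount and root privileges:]

```shell
# Illustrative only: shield CPUs 2-3 using the legacy (v1) cpuset interface.
# Assumes /sys/fs/cgroup/cpuset is mounted and we are running as root.
cd /sys/fs/cgroup/cpuset
mkdir -p shielded
echo 2-3 > shielded/cpuset.cpus          # CPUs reserved for the app
echo 0   > shielded/cpuset.mems          # memory node to allocate from
echo 1   > shielded/cpuset.cpu_exclusive # no sibling cpuset may claim them
echo $$  > shielded/tasks                # move this shell (and children) in
```

[This covers scheduler placement only; the point of the thread is that tick, RCU, and IPI behavior would also have to follow the same dynamic configuration for CONFIG_ISOLATION to deliver its guarantee.]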
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) [not found] ` <20150512091032.GA10138-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-12 11:48 ` Peter Zijlstra 2015-05-12 12:34 ` Ingo Molnar 0 siblings, 1 reply; 159+ messages in thread From: Peter Zijlstra @ 2015-05-12 11:48 UTC (permalink / raw) To: Ingo Molnar Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote: > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is > a high level kernel feature, so it won't conflict with isolation > concepts in lower level subsystems such as IOMMU isolation - and other > higher level features like scheduler isolation are basically another > partial implementation we want to merge with all this... > But why do we need a CONFIG flag for something that has no content? That is, I do not see anything much; except the 'I want to stay in userspace and kill me otherwise' flag, and I'm not sure that warrants a CONFIG flag like this. Other than that, its all a combination of NOHZ_FULL and cpusets/isolcpus and whatnot. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) 2015-05-12 11:48 ` Peter Zijlstra @ 2015-05-12 12:34 ` Ingo Molnar [not found] ` <20150512123440.GA16959-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-05-12 15:36 ` Frederic Weisbecker 0 siblings, 2 replies; 159+ messages in thread From: Ingo Molnar @ 2015-05-12 12:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote: > > > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this > > is a high level kernel feature, so it won't conflict with > > isolation concepts in lower level subsystems such as IOMMU > > isolation - and other higher level features like scheduler > > isolation are basically another partial implementation we want to > > merge with all this... > > But why do we need a CONFIG flag for something that has no content? > > That is, I do not see anything much; except the 'I want to stay in > userspace and kill me otherwise' flag, and I'm not sure that > warrants a CONFIG flag like this. > > Other than that, its all a combination of NOHZ_FULL and > cpusets/isolcpus and whatnot. Yes, that's what I meant: CONFIG_ISOLATION would trigger what is NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as an individual Kconfig option? CONFIG_ISOLATION=y would express the guarantee from the kernel that it's possible for user-space to configure itself to run undisturbed - instead of the current inconsistent set of options and facilities. A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks, it also tries to offer various facilities and tune the defaults to turn the kernel hard-rt. Does that make sense to you? 
Thanks,

	Ingo
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) [not found] ` <20150512123440.GA16959-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-12 12:39 ` Peter Zijlstra [not found] ` <20150512123912.GO21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Peter Zijlstra @ 2015-05-12 12:39 UTC (permalink / raw) To: Ingo Molnar Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote: > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as > an individual Kconfig option? Ah, as a rename of nohz_full, sure that might work. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) [not found] ` <20150512123912.GO21418-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2015-05-12 12:43 ` Ingo Molnar 0 siblings, 0 replies; 159+ messages in thread From: Ingo Molnar @ 2015-05-12 12:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote: > > > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is > > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL > > as an individual Kconfig option? > > Ah, as a rename of nohz_full, sure that might work. It could also be named CONFIG_CPU_ISOLATION=y, to make it more explicit what it's about. Thanks, Ingo ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) 2015-05-12 12:34 ` Ingo Molnar [not found] ` <20150512123440.GA16959-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-12 15:36 ` Frederic Weisbecker 1 sibling, 0 replies; 159+ messages in thread From: Frederic Weisbecker @ 2015-05-12 15:36 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Chris Metcalf, Steven Rostedt, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote: > > > > > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this > > > is a high level kernel feature, so it won't conflict with > > > isolation concepts in lower level subsystems such as IOMMU > > > isolation - and other higher level features like scheduler > > > isolation are basically another partial implementation we want to > > > merge with all this... > > > > But why do we need a CONFIG flag for something that has no content? > > > > That is, I do not see anything much; except the 'I want to stay in > > userspace and kill me otherwise' flag, and I'm not sure that > > warrants a CONFIG flag like this. > > > > Other than that, its all a combination of NOHZ_FULL and > > cpusets/isolcpus and whatnot. > > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as > an individual Kconfig option? Right, we could return to what we had previously: CONFIG_NO_HZ. A config that enables dynticks-idle by default and allows full dynticks if nohz_full= boot option is passed (or something driven by higher level isolation interface). Because eventually, distros enable NO_HZ_FULL so that their 0.0001% users can use it. Well at least Red Hat does. 
> > CONFIG_ISOLATION=y would express the guarantee from the kernel that > it's possible for user-space to configure itself to run undisturbed - > instead of the current inconsistent set of options and facilities. > > A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks, > it also tries to offer various facilities and tune the defaults to > turn the kernel hard-rt. > > Does that make sense to you? Right although distros tend to want features to be enabled dynamically so that they have a single kernel to maintain. Things like PREEMPT_RT really need to be a different kernel because fundamental primitives like spinlocks must be implemented statically. But isolation can be a boot-enabled, or even runtime-enabled, as it's only about timer,irq,task affinity. Full Nohz is more complicated but it can be runtime toggled in the future. So we can bring CONFIG_CPU_ISOLATION, at least for distros that are really not interested in that so they can disable it. CONFIG_CPU_ISOLATION=y would bring an ability which is default-disabled and driven dynamically through whatever interface. ^ permalink raw reply [flat|nested] 159+ messages in thread
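[As a concrete reference for the "boot-enabled" state of the world Frederic describes: in the 4.1-era kernel under discussion, the relevant knobs are all kernel command-line parameters. A sketch of a boot line combining them — the CPU list 1-3 is illustrative:]

```shell
# Illustrative kernel command line: isolate CPUs 1-3 as far as 4.1 allows.
#   nohz_full=  stop the tick on these CPUs when they run a single task
#   rcu_nocbs=  offload their RCU callbacks to housekeeping CPUs
#   isolcpus=   keep the scheduler's load balancer away from them
linux ... nohz_full=1-3 rcu_nocbs=1-3 isolcpus=1-3
```

[Frederic's point is that a CONFIG_CPU_ISOLATION umbrella could eventually replace this static boot-time configuration with a runtime-toggled interface.]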
* Re: CONFIG_ISOLATION=y 2015-05-12 9:10 ` CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) Ingo Molnar [not found] ` <20150512091032.GA10138-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-05-12 21:05 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-12 21:05 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 05/12/2015 05:10 AM, Ingo Molnar wrote: > * Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> - ISOLATION (Frederic). I like this but it conflicts with other uses >> of "isolation" in the kernel: cgroup isolation, lru page isolation, >> iommu isolation, scheduler isolation (at least it's a superset of >> that one), etc. Also, we're not exactly isolating a task - often >> a "dataplane" app consists of a bunch of interacting threads in >> userspace, so not exactly isolated. So perhaps it's too confusing. > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is > a high level kernel feature, so it won't conflict with isolation > concepts in lower level subsystems such as IOMMU isolation - and other > higher level features like scheduler isolation are basically another > partial implementation we want to merge with all this... > > nohz, RCU tricks, watchdog defaults, isolcpus and various other > measures to keep these CPUs and workloads as isolated as possible > are (or should become) components of this high level concept. > > Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost > zero overhead on normal workloads and on non-isolated CPUs, so that > Linux distributions can enable it. Using CONFIG_CPU_ISOLATION to capture all this stuff instead of making CONFIG_NO_HZ_FULL do it seems plausible for naming. 
However, this feels like just bombing the current naming to this new name, right? I'd like to argue that this is orthogonal to adding new isolation functionality into no_hz_full, as my patch series has been doing. Perhaps we can defer this to a follow-up patch series? I'm happy to do the work but I'm not sure we want to bundle all that churn into the current patch series under consideration. I can use cpu_isolation_xxx for naming in the current patch series so we don't have to come back and bomb that later. > Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step > needed: just like cpusets, the configuration of isolated CPUs should > be a completely boot option free excercise that can be dynamically > done and undone by the administrator via an intuitive interface. Eventually isolation can be runtime-enabled, but for now I think it makes sense to be boot-enabled. As Frederic suggested, we can arrange full nohz to be runtime toggled in the future. I agree that it should be reasonable to compile it in by default. On 05/12/2015 07:48 AM, Peter Zijlstra wrote: > But why do we need a CONFIG flag for something that has no content? > > That is, I do not see anything much; except the 'I want to stay in > userspace and kill me otherwise' flag, and I'm not sure that warrants a > CONFIG flag like this. > > Other than that, its all a combination of NOHZ_FULL and cpusets/isolcpus > and whatnot. There are three major pieces here - one is the STRICT piece that you allude to, but there is also the piece where we quiesce tasks in the kernel until no timer interrupts are pending, and the piece that allows easy debugging of stray IRQs etc to isolated cpus. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <20150511085759.71deeb64-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org> @ 2015-05-12 10:46 ` Peter Zijlstra 2015-05-15 15:10 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Peter Zijlstra @ 2015-05-12 10:46 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > Please lets get NO_HZ_FULL up to par. That should be the main focus. > ACK, much of this dataplane stuff is (useful) hacks working around the fact that nohz_full just isn't complete. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-12 10:46 ` [PATCH 0/6] support "dataplane" mode for nohz_full Peter Zijlstra @ 2015-05-15 15:10 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-15 15:10 UTC (permalink / raw) To: Peter Zijlstra, Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 05/12/2015 06:46 AM, Peter Zijlstra wrote: > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: >> Please lets get NO_HZ_FULL up to par. That should be the main focus. >> > ACK, much of this dataplane stuff is (useful) hacks working around the > fact that nohz_full just isn't complete. There are enough disjoint threads on this topic that I want to just touch base here and see if you have been convinced on other threads that there is stuff beyond the hacks here: in particular 1. The basic "dataplane" mode to arrange to do extra work on return to kernel space that normally isn't warranted, to avoid future IPIs, and additionally to wait in the kernel until any timer interrupts required by the kernel invocation itself are done; and 2. The "strict" mode to allow a task to tell the kernel it isn't planning on making any more such calls, and have the kernel help diagnose any resulting application bugs. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v2 0/5] support "cpu_isolated" mode for nohz_full [not found] ` <1431107927-13998-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-05-08 21:18 ` [PATCH 0/6] support "dataplane" mode for nohz_full Andrew Morton @ 2015-05-15 21:26 ` Chris Metcalf 2015-05-15 21:27 ` [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf [not found] ` <1431725178-20876-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 1 sibling, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-15 21:26 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. 
Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality.

Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2, in turn based on 4.1-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v2: rename "dataplane" to "cpu_isolated"
    drop ksoftirqd suppression changes (believed no longer needed)
    merge previous "QUIESCE" functionality into baseline functionality
    explicitly track syscalls and exceptions for "STRICT" functionality
    allow configuring a signal to be delivered for STRICT mode failures
    move debug tracking to irq_enter(), not irq_exit()

Note: I have not yet removed the hack to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555).
Chris Metcalf (5):
  nohz_full: add support for "cpu_isolated" mode
  nohz: support PR_CPU_ISOLATED_STRICT mode
  nohz: cpu_isolated strict mode configurable signal
  nohz: add cpu_isolated_debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

 Documentation/kernel-parameters.txt |  6 +++
 arch/tile/kernel/ptrace.c           |  6 ++-
 arch/tile/mm/homecache.c            |  5 +-
 arch/x86/kernel/ptrace.c            |  2 +
 include/linux/context_tracking.h    | 11 +++--
 include/linux/sched.h               |  3 ++
 include/linux/tick.h                | 28 +++++++++++
 include/uapi/linux/prctl.h          |  8 +++
 kernel/context_tracking.c           | 12 +++--
 kernel/irq_work.c                   |  4 +-
 kernel/sched/core.c                 | 18 +++++++
 kernel/signal.c                     |  5 ++
 kernel/smp.c                        |  4 ++
 kernel/softirq.c                    |  6 +++
 kernel/sys.c                        |  8 +++
 kernel/time/tick-sched.c            | 98 ++++++++++++++++++++++++++++++++++++-
 16 files changed, 214 insertions(+), 10 deletions(-)

-- 
2.1.2
* [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode 2015-05-15 21:26 ` [PATCH v2 0/5] support "cpu_isolated" " Chris Metcalf @ 2015-05-15 21:27 ` Chris Metcalf 2015-05-15 21:27 ` [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf [not found] ` <1431725251-20943-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> [not found] ` <1431725178-20876-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 1 sibling, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. 
Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/sched.h | 3 +++ include/linux/tick.h | 10 +++++++++ include/uapi/linux/prctl.h | 5 +++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 ++++++++ kernel/time/tick-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 80 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 8222ae40ecb0..fb4ba400d7e1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1732,6 +1732,9 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif +#ifdef CONFIG_NO_HZ_FULL + unsigned int cpu_isolated_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ diff --git a/include/linux/tick.h b/include/linux/tick.h index f8492da57ad3..ec1953474a65 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu) return cpumask_test_cpu(cpu, tick_nohz_full_mask); } +static inline bool tick_nohz_is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_cpu_isolated_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } +static inline void tick_nohz_cpu_isolated_enter(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. 
*/ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 72d59a1a6eb6..66739d7c1350 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/tick.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. */ if (state == CONTEXT_USER) { + if (tick_nohz_is_cpu_isolated()) + tick_nohz_cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index a4e372b798a5..3fd9e47f8fc8 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NO_HZ_FULL + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 914259128145..f1551c946c45 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/swap.h> #include <asm/irq_regs.h> @@ -389,6 +390,56 @@ void __init tick_nohz_init(void) pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", cpumask_pr_args(tick_nohz_full_mask)); } + +/* + * We normally return immediately to userspace. + * + * In "cpu_isolated" mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. 
+ * + * Note that if you schedule two "cpu_isolated" processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void tick_nohz_cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start)); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + + /* Idle with interrupts enabled and wait for the tick. */ + set_current_state(TASK_INTERRUPTIBLE); + arch_cpu_idle(); + set_current_state(TASK_RUNNING); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start)); + dump_stack(); + } +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-05-15 21:27 ` [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf @ 2015-05-15 21:27 ` Chris Metcalf [not found] ` <1431725251-20943-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->cpu_isolated_flags that is set when prctl() sets the flags. We check the bit on syscall entry as well as on any exception_enter(). The prctl() syscall is ignored to allow clearing the bit again later, and exit/exit_group are ignored to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86 and tile; I am happy to try to add more for additional platforms in the final version. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/ptrace.c | 6 +++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/tick.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++ 7 files changed, 76 insertions(+), 7 deletions(-) diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..d4e43a13bab1 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. */ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall( + regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index a7bc79480719..7f784054ddea 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 2821838256b4..d042f4cda39d 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/tick.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool 
context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); extern void __context_tracking_task_switch(struct task_struct *prev, @@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/tick.h b/include/linux/tick.h index ec1953474a65..b7ffb10337ba 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); +extern void tick_nohz_cpu_isolated_syscall(int nr); +extern void tick_nohz_cpu_isolated_exception(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } +static inline void tick_nohz_cpu_isolated_syscall(int nr) { } +static inline void tick_nohz_cpu_isolated_exception(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) @@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk) __tick_nohz_task_switch(tsk); } +static inline bool tick_nohz_cpu_isolated_strict(void) +{ +#ifdef CONFIG_NO_HZ_FULL + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + 
(PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 66739d7c1350..c82509caa42e 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (__this_cpu_read(context_tracking.state) == state) { @@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state) __this_cpu_write(context_tracking.state, CONTEXT_KERNEL); } local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f1551c946c45..273820cd484a 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -27,6 +27,7 @@ #include <linux/swap.h> #include <asm/irq_regs.h> +#include <asm/unistd.h> #include "tick-internal.h" @@ -440,6 +441,43 @@ void 
tick_nohz_cpu_isolated_enter(void) } } +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void tick_nohz_cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +void tick_nohz_cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal [not found] ` <1431725251-20943-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-05-15 21:27 ` Chris Metcalf 2015-05-15 22:17 ` [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode Thomas Gleixner 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. 
Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/tick-sched.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 273820cd484a..772be78f926c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -441,11 +441,18 @@ void tick_nohz_cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) +static void kill_cpu_isolated_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -464,7 +471,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -475,7 +482,7 @@ void tick_nohz_cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode [not found] ` <1431725251-20943-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-05-15 21:27 ` [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal Chris Metcalf @ 2015-05-15 22:17 ` Thomas Gleixner 2015-05-28 20:38 ` Chris Metcalf 1 sibling, 1 reply; 159+ messages in thread From: Thomas Gleixner @ 2015-05-15 22:17 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, 15 May 2015, Chris Metcalf wrote: > +/* > + * We normally return immediately to userspace. > + * > + * In "cpu_isolated" mode we wait until no more interrupts are > + * pending. Otherwise we nap with interrupts enabled and wait for the > + * next interrupt to fire, then loop back and retry. > + * > + * Note that if you schedule two "cpu_isolated" processes on the same > + * core, neither will ever leave the kernel, and one will have to be > + * killed manually. And why are we not preventing that situation in the first place? The scheduler should be able to figure that out easily.. > + Otherwise in situations where another process is > + * in the runqueue on this cpu, this task will just wait for that > + * other task to go idle before returning to user space. > + */ > +void tick_nohz_cpu_isolated_enter(void) > +{ > + struct clock_event_device *dev = > + __this_cpu_read(tick_cpu_device.evtdev); > + struct task_struct *task = current; > + unsigned long start = jiffies; > + bool warned = false; > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { What's the ACCESS_ONCE for? 
> + if (!warned && (jiffies - start) >= (5 * HZ)) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start)); What additional value has the jiffies delta over a plain human readable '5sec' ? > + warned = true; > + } > + if (should_resched()) > + schedule(); > + if (test_thread_flag(TIF_SIGPENDING)) > + break; > + > + /* Idle with interrupts enabled and wait for the tick. */ > + set_current_state(TASK_INTERRUPTIBLE); > + arch_cpu_idle(); Oh NO! Not another variant of fake idle task. The idle implementations can call into code which rightfully expects that the CPU is actually IDLE. I wasted enough time already debugging the resulting wreckage. Feel free to use it for experimental purposes, but this is not going anywhere near to a mainline kernel. I completely understand WHY you want to do that, but we need proper mechanisms for that and not some duct tape engineering band aids which will create hard to debug side effects. Hint: It's a scheduler job to make sure that the machine has quiesced _BEFORE_ letting the magic task off to user land. > + set_current_state(TASK_RUNNING); > + } > + if (warned) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start)); > + dump_stack(); And that dump_stack() tells us which important information? tick_nohz_cpu_isolated_enter context_tracking_enter context_tracking_user_enter arch_return_to_user_code Thanks, tglx ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode 2015-05-15 22:17 ` [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode Thomas Gleixner @ 2015-05-28 20:38 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-05-28 20:38 UTC (permalink / raw) To: Thomas Gleixner Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Thomas, thanks for the feedback. My reply was delayed by being in meetings all last week and then catching up this week - sorry about that. On 05/15/2015 06:17 PM, Thomas Gleixner wrote: > On Fri, 15 May 2015, Chris Metcalf wrote: >> +/* >> + * We normally return immediately to userspace. >> + * >> + * In "cpu_isolated" mode we wait until no more interrupts are >> + * pending. Otherwise we nap with interrupts enabled and wait for the >> + * next interrupt to fire, then loop back and retry. >> + * >> + * Note that if you schedule two "cpu_isolated" processes on the same >> + * core, neither will ever leave the kernel, and one will have to be >> + * killed manually. > And why are we not preventing that situation in the first place? The > scheduler should be able to figure that out easily.. This is an interesting observation. My instinct is that adding tests in the scheduler costs time on a hot path for all processes, and I'm trying to avoid adding cost where we don't need it. It's pretty much a straight-up application bug if two threads or processes explicitly request the cpu_isolated semantics, and then explicitly schedule themselves onto the same core, so my preference was to let the application writer identify and fix the problem if it comes up. 
However, I'm certainly open to thinking about checking for this failure mode in the scheduler, though I don't know enough about the scheduler to immediately identify where such a change might go. Would it be appropriate to think about this as a follow-on patch, if it's determined that the cost of testing for this condition is worth it? >> + Otherwise in situations where another process is >> + * in the runqueue on this cpu, this task will just wait for that >> + * other task to go idle before returning to user space. >> + */ >> +void tick_nohz_cpu_isolated_enter(void) >> +{ >> + struct clock_event_device *dev = >> + __this_cpu_read(tick_cpu_device.evtdev); >> + struct task_struct *task = current; >> + unsigned long start = jiffies; >> + bool warned = false; >> + >> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ >> + lru_add_drain(); >> + >> + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { > What's the ACCESS_ONCE for? We are technically in a loop here where we are waiting for an interrupt handler to update dev->next_event.tv64, so I felt it was appropriate to flag it as such. If we didn't have function calls inside the loop, the compiler would eliminate the loop. But it's just a style thing, and we can certainly drop it if it seems confusing. In any case I've changed it to READ_ONCE() since that's preferred now anyway; this code was originally written a while ago. >> + if (!warned && (jiffies - start) >= (5 * HZ)) { >> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start)); > What additional value has the jiffies delta over a plain human > readable '5sec' ? Good point. I've changed it to emit a value in seconds. >> + warned = true; >> + } >> + if (should_resched()) >> + schedule(); >> + if (test_thread_flag(TIF_SIGPENDING)) >> + break; >> + >> + /* Idle with interrupts enabled and wait for the tick. 
*/ >> + set_current_state(TASK_INTERRUPTIBLE); >> + arch_cpu_idle(); > Oh NO! Not another variant of fake idle task. The idle implementations > can call into code which rightfully expects that the CPU is actually > IDLE. > > I wasted enough time already debugging the resulting wreckage. Feel > free to use it for experimental purposes, but this is not going > anywhere near to a mainline kernel. > > I completely understand WHY you want to do that, but we need proper > mechanisms for that and not some duct tape engineering band aids which > will create hard to debug side effects. Yes, I worried about that a little when I put it in. In particular it's certainly true that arch_cpu_idle() isn't necessarily designed to behave properly in this context, even if it may do the right thing somewhat by accident. In fact, we don't need the cpu-idling semantics in this loop; the loop can spin quite happily waiting for next_event in the tick_cpu_device to stop being defined (or a signal or scheduling request to occur). I've changed the code to make it opt-in, so that a weak no-op function that just calls cpu_relax() can be replaced by an architecture-defined function that safely waits until an interrupt is delivered, reducing the number of times we spin around in the outer loop. > Hint: It's a scheduler job to make sure that the machine has quiesced > _BEFORE_ letting the magic task off to user land. This is not so clear to me. There may, for example, be RCU events that occur after the scheduler is done with its part, that still require another timer tick on the cpu to finish quiescing RCU. I think we need to check for the timer-quiesced state as late as possible to handle things like this. Arguably the scheduler could also try to do the right thing with a cpu_isolated task, but again, this feels like time spent in the scheduler hot path that affects the non-cpu_isolated tasks. 
For cpu_isolated tasks they should be the only thing that's runnable on the core 99.999% of the time, or you've done something quite wrong anyway. >> + set_current_state(TASK_RUNNING); >> + } >> + if (warned) { >> + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start)); >> + dump_stack(); > And that dump_stack() tells us which important information? > > tick_nohz_cpu_isolated_enter > context_tracking_enter > context_tracking_user_enter > arch_return_to_user_code For tile, the dump_stack() includes the register state, which includes the interrupt type that took us into the kernel, which might be helpful. That said, I'm certainly willing to remove it, or make it call a weak no-op function where architectures can add more info if they have it. Thanks again! I'll put out v3 of the patch series shortly, with changes from your comments incorporated. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full [not found] ` <1431725178-20876-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-06-03 15:29 ` Chris Metcalf 2015-06-03 15:29 ` [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf ` (2 more replies) 0 siblings, 3 replies; 159+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. 
Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series (based currently on my arch/tile master tree for 4.2, in turn based on 4.1-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if we remove the 1Hz tick, cpu_isolated threads will never re-enter userspace since a tick will always be pending. 
Chris Metcalf (5): nohz_full: add support for "cpu_isolated" mode nohz: support PR_CPU_ISOLATED_STRICT mode nohz: cpu_isolated strict mode configurable signal nohz: add cpu_isolated_debug boot flag nohz: cpu_isolated: allow tick to be fully disabled Documentation/kernel-parameters.txt | 6 +++ arch/tile/kernel/process.c | 9 ++++ arch/tile/kernel/ptrace.c | 6 ++- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 ++-- include/linux/sched.h | 3 ++ include/linux/tick.h | 28 ++++++++++ include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 12 +++-- kernel/irq_work.c | 4 +- kernel/sched/core.c | 18 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 6 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++- 17 files changed, 229 insertions(+), 10 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-06-03 15:29 ` [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf @ 2015-06-03 15:29 ` Chris Metcalf [not found] ` <1433345365-29506-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-07-13 19:57 ` [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf 2 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->cpu_isolated_flags that is set when prctl() sets the flags. We check the bit on syscall entry as well as on any exception_enter(). The prctl() syscall is ignored to allow clearing the bit again later, and exit/exit_group are ignored to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86 and tile; I am happy to try to add more for additional platforms in the final version. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/ptrace.c | 6 +++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/tick.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++ 7 files changed, 76 insertions(+), 7 deletions(-) diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..d4e43a13bab1 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. */ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall( + regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index a7bc79480719..7f784054ddea 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 2821838256b4..d042f4cda39d 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/tick.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool 
context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); extern void __context_tracking_task_switch(struct task_struct *prev, @@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/tick.h b/include/linux/tick.h index ec1953474a65..b7ffb10337ba 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); +extern void tick_nohz_cpu_isolated_syscall(int nr); +extern void tick_nohz_cpu_isolated_exception(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } +static inline void tick_nohz_cpu_isolated_syscall(int nr) { } +static inline void tick_nohz_cpu_isolated_exception(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) @@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk) __tick_nohz_task_switch(tsk); } +static inline bool tick_nohz_cpu_isolated_strict(void) +{ +#ifdef CONFIG_NO_HZ_FULL + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + 
(PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 66739d7c1350..c82509caa42e 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (__this_cpu_read(context_tracking.state) == state) { @@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state) __this_cpu_write(context_tracking.state, CONTEXT_KERNEL); } local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f6236b66788f..ce3bcf29a0f6 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -27,6 +27,7 @@ #include <linux/swap.h> #include <asm/irq_regs.h> +#include <asm/unistd.h> #include "tick-internal.h" @@ -446,6 +447,43 @@ void 
tick_nohz_cpu_isolated_enter(void) } } +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void tick_nohz_cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +void tick_nohz_cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode [not found] ` <1433345365-29506-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-06-03 15:29 ` Chris Metcalf 2015-06-03 15:29 ` [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. 
As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> --- arch/tile/kernel/process.c | 9 ++++++++ include/linux/sched.h | 3 +++ include/linux/tick.h | 10 ++++++++ include/uapi/linux/prctl.h | 5 ++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 +++++++ kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 95 insertions(+) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..e20c3f4a6a82 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_NO_HZ_FULL +void tick_nohz_cpu_isolated_wait() +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 8222ae40ecb0..fb4ba400d7e1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1732,6 +1732,9 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif +#ifdef CONFIG_NO_HZ_FULL + unsigned int cpu_isolated_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ diff --git a/include/linux/tick.h b/include/linux/tick.h index f8492da57ad3..ec1953474a65 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu) return cpumask_test_cpu(cpu, tick_nohz_full_mask); } +static inline bool tick_nohz_is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_cpu_isolated_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } +static inline void tick_nohz_cpu_isolated_enter(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. 
*/ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 72d59a1a6eb6..66739d7c1350 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/tick.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. */ if (state == CONTEXT_USER) { + if (tick_nohz_is_cpu_isolated()) + tick_nohz_cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index a4e372b798a5..3fd9e47f8fc8 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NO_HZ_FULL + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 914259128145..f6236b66788f 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/swap.h> #include <asm/irq_regs.h> @@ -389,6 +390,62 @@ void __init tick_nohz_init(void) pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", cpumask_pr_args(tick_nohz_full_mask)); } + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. 
+ */ +void __weak tick_nohz_cpu_isolated_wait() +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In "cpu_isolated" mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two "cpu_isolated" processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void tick_nohz_cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + tick_nohz_cpu_isolated_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal [not found] ` <1433345365-29506-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-06-03 15:29 ` [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf @ 2015-06-03 15:29 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. 
Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/tick-sched.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index ce3bcf29a0f6..f09c003da22f 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) +static void kill_cpu_isolated_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full 2015-06-03 15:29 ` [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf 2015-06-03 15:29 ` [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf [not found] ` <1433345365-29506-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-07-13 19:57 ` Chris Metcalf 2015-07-13 19:57 ` [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf ` (3 more replies) 2 siblings, 4 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This posting of the series is basically a "ping" since there were no comments to the v3 version. I have rebased it to 4.2-rc1, added support for arm64 syscall tracking for "strict" mode, and retested it; are there any remaining concerns? Thomas, I haven't heard from you whether my removal of the cpu_idle calls sufficiently addresses your concerns about that aspect. Are there other concerns with this patch series at this point? Original patch series cover letter follows: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. 
A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested these stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. 
The series (based currently on v4.2-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if we remove the 1Hz tick, cpu_isolated threads will never re-enter userspace since a tick will always be pending. 
Chris Metcalf (5): nohz_full: add support for "cpu_isolated" mode nohz: support PR_CPU_ISOLATED_STRICT mode nohz: cpu_isolated strict mode configurable signal nohz: add cpu_isolated_debug boot flag nohz: cpu_isolated: allow tick to be fully disabled Documentation/kernel-parameters.txt | 6 +++ arch/tile/kernel/process.c | 9 ++++ arch/tile/kernel/ptrace.c | 6 ++- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 ++-- include/linux/sched.h | 3 ++ include/linux/tick.h | 28 ++++++++++ include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 12 +++-- kernel/irq_work.c | 4 +- kernel/sched/core.c | 18 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 6 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++- 17 files changed, 229 insertions(+), 10 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 19:57 ` [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf 2015-07-13 20:40 ` Andy Lutomirski 2015-07-24 13:27 ` Frederic Weisbecker 2015-07-13 19:57 ` [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf ` (2 subsequent siblings) 3 siblings, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. 
As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 9 ++++++++ include/linux/sched.h | 3 +++ include/linux/tick.h | 10 ++++++++ include/uapi/linux/prctl.h | 5 ++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 +++++++ kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 95 insertions(+) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..3625e839ad62 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_NO_HZ_FULL +void tick_nohz_cpu_isolated_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/sched.h b/include/linux/sched.h index ae21f1591615..f350b0c20bbc 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1778,6 +1778,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_NO_HZ_FULL + unsigned int cpu_isolated_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ diff --git a/include/linux/tick.h b/include/linux/tick.h index 3741ba1a652c..cb5569181359 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) cpumask_or(mask, mask, tick_nohz_full_mask); } +static inline bool tick_nohz_is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_cpu_isolated_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } +static inline void tick_nohz_cpu_isolated_enter(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. 
*/ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..f9de3ee12723 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/tick.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. */ if (state == CONTEXT_USER) { + if (tick_nohz_is_cpu_isolated()) + tick_nohz_cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..36eb9a839f1f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NO_HZ_FULL + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c792429e98c6..4cf093c012d1 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/swap.h> #include <asm/irq_regs.h> @@ -389,6 +390,62 @@ void __init tick_nohz_init(void) pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", cpumask_pr_args(tick_nohz_full_mask)); } + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. 
+ */ +void __weak tick_nohz_cpu_isolated_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In "cpu_isolated" mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two "cpu_isolated" processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void tick_nohz_cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + tick_nohz_cpu_isolated_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
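To see the proposed ABI from the application side, here is a minimal userspace sketch. The prctl numbers and the flag bit are taken from the uapi hunk in the patch above; everything else (the helper name, the fallback behavior) is illustrative only. On a kernel without this series, or where the numbers have since been assigned to other options, the calls simply fail or return an unexpected value, which the sketch treats as "unsupported".

```c
/* Minimal sketch of enabling the proposed cpu_isolated mode from
 * userspace.  PR_SET_CPU_ISOLATED/PR_GET_CPU_ISOLATED come from the
 * patch above; on kernels without the series the prctl() calls fail
 * (e.g. with EINVAL), which we report as "unsupported".
 */
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_CPU_ISOLATED
#define PR_SET_CPU_ISOLATED    47        /* values from the patch above */
#define PR_GET_CPU_ISOLATED    48
#define PR_CPU_ISOLATED_ENABLE (1 << 0)
#endif

/* Returns 0 on success, a negative value if the mode is unavailable. */
int enable_cpu_isolated(void)
{
	if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0) < 0)
		return -errno;           /* kernel lacks the patch series */

	/* Read the flags back to verify the setting actually stuck. */
	if (prctl(PR_GET_CPU_ISOLATED, 0, 0, 0, 0) != PR_CPU_ISOLATED_ENABLE)
		return -1;

	return 0;
}
```

Note that per the tick_nohz_is_cpu_isolated() test in the patch, the flag only has an effect once the task is running on a nohz_full core, so a real application would typically pin itself there with sched_setaffinity() first.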
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 19:57 ` [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf @ 2015-07-13 20:40 ` Andy Lutomirski 2015-07-13 21:01 ` Chris Metcalf 2015-07-24 13:27 ` Frederic Weisbecker 1 sibling, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-07-13 20:40 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > The existing nohz_full mode makes tradeoffs to minimize userspace > interruptions while still attempting to avoid overheads in the > kernel entry/exit path, to provide 100% kernel semantics, etc. > > However, some applications require a stronger commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications to elect > to have the stronger semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. I thought the general consensus was that this should be the default behavior and that any associated bugs should be fixed. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 20:40 ` Andy Lutomirski @ 2015-07-13 21:01 ` Chris Metcalf 2015-07-13 21:45 ` Andy Lutomirski 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-07-13 21:01 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 07/13/2015 04:40 PM, Andy Lutomirski wrote: > On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> The existing nohz_full mode makes tradeoffs to minimize userspace >> interruptions while still attempting to avoid overheads in the >> kernel entry/exit path, to provide 100% kernel semantics, etc. >> >> However, some applications require a stronger commitment from the >> kernel to avoid interruptions, in particular userspace device >> driver style applications, such as high-speed networking code. >> >> This change introduces a framework to allow applications to elect >> to have the stronger semantics as needed, specifying >> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >> Subsequent commits will add additional flags and additional >> semantics. > I thought the general consensus was that this should be the default > behavior and that any associated bugs should be fixed. I think it comes down to dividing the set of use cases in two: - "Regular" nohz_full, as used to improve performance and limit interruptions, possibly for power benefits, etc. But, stray interrupts are not particularly bad, and you don't want to take extreme measures to avoid them. - What I'm calling "cpu_isolated" mode where when you return to userspace, you expect that by God, the kernel doesn't interrupt you again, and if it does, it's a flat-out bug. 
There are a few things that cpu_isolated mode currently does to accomplish its goals that are pretty heavy-weight: Processes are held in kernel space until ticks are quiesced; this is not necessarily what every nohz_full task wants. If a task makes a kernel call, there may well be arbitrary timer fallout, and having a way to select whether or not you are willing to take a timer tick after return to userspace is pretty important. Likewise, there are things that you may want to do on return to userspace that are designed to prevent further interruptions in cpu_isolated mode, even at a possible future performance cost if and when you return to the kernel, such as flushing the per-cpu free page list so that you won't be interrupted by an IPI to flush it later. If you're arguing that the cpu_isolated semantic is really the only one that makes sense for nohz_full, my sense is that it might be surprising to many of the folks who do nohz_full work. But, I'm happy to be wrong on this point, and maybe all the nohz_full community is interested in making the same tradeoffs for nohz_full generally that I've proposed in this patch series just for cpu_isolated? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
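The usage pattern Chris describes — take the quiesce cost once on kernel exit, then never re-enter the kernel — implies a particular application structure. The following sketch shows one plausible shape of such a dataplane worker; the prctl constants come from the patch series, while poll_device() and the error handling are purely hypothetical stand-ins for a real userspace driver.

```c
/* Hypothetical shape of a "cpu_isolated" dataplane worker: pin to a
 * nohz_full core, request the stricter semantics, then spin in pure
 * userspace making no further kernel calls.  poll_device() is a
 * placeholder for the application's real userspace driver. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/prctl.h>

#define PR_SET_CPU_ISOLATED    47        /* from the proposed patch */
#define PR_CPU_ISOLATED_ENABLE (1 << 0)

static int poll_device(void) { return 0; }   /* stand-in for a real driver */

void dataplane_worker(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);              /* cpu must be in the nohz_full= set */
	if (sched_setaffinity(0, sizeof(set), &set) < 0)
		exit(1);

	/* After this, each return to userspace is held in the kernel
	 * until the tick is quiesced, so the loop below should run
	 * without further interruptions. */
	prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);

	for (;;)                         /* no syscalls from here on */
		poll_device();
}
```

This also makes the tradeoff under discussion concrete: every syscall the loop does make pays the quiesce cost again on the way out, which is why the mode is opt-in rather than the nohz_full default.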
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 21:01 ` Chris Metcalf @ 2015-07-13 21:45 ` Andy Lutomirski 2015-07-21 19:10 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-07-13 21:45 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 07/13/2015 04:40 PM, Andy Lutomirski wrote: >> >> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> The existing nohz_full mode makes tradeoffs to minimize userspace >>> interruptions while still attempting to avoid overheads in the >>> kernel entry/exit path, to provide 100% kernel semantics, etc. >>> >>> However, some applications require a stronger commitment from the >>> kernel to avoid interruptions, in particular userspace device >>> driver style applications, such as high-speed networking code. >>> >>> This change introduces a framework to allow applications to elect >>> to have the stronger semantics as needed, specifying >>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >>> Subsequent commits will add additional flags and additional >>> semantics. >> >> I thought the general consensus was that this should be the default >> behavior and that any associated bugs should be fixed. > > > I think it comes down to dividing the set of use cases in two: > > - "Regular" nohz_full, as used to improve performance and limit > interruptions, possibly for power benefits, etc. But, stray > interrupts are not particularly bad, and you don't want to take > extreme measures to avoid them. 
> > - What I'm calling "cpu_isolated" mode where when you return to > userspace, you expect that by God, the kernel doesn't interrupt you > again, and if it does, it's a flat-out bug. > > There are a few things that cpu_isolated mode currently does to > accomplish its goals that are pretty heavy-weight: > > Processes are held in kernel space until ticks are quiesced; this is > not necessarily what every nohz_full task wants. If a task makes a > kernel call, there may well be arbitrary timer fallout, and having a > way to select whether or not you are willing to take a timer tick after > return to userspace is pretty important. Then shouldn't deferred work be done immediately in nohz_full mode regardless? What is this delayed work that's being done? > > Likewise, there are things that you may want to do on return to > userspace that are designed to prevent further interruptions in > cpu_isolated mode, even at a possible future performance cost if and > when you return to the kernel, such as flushing the per-cpu free page > list so that you won't be interrupted by an IPI to flush it later. > Why not just kick the per-cpu free page over to whatever cpu is monitoring your RCU state, etc? That should be very quick. > If you're arguing that the cpu_isolated semantic is really the only > one that makes sense for nohz_full, my sense is that it might be > surprising to many of the folks who do nohz_full work. But, I'm happy > to be wrong on this point, and maybe all the nohz_full community is > interested in making the same tradeoffs for nohz_full generally that > I've proposed in this patch series just for cpu_isolated? nohz_full is currently dog slow for no particularly good reasons. I suspect that the interrupts you're seeing are also there for no particularly good reasons as well. Let's fix them instead of adding new ABIs to work around them. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 21:45 ` Andy Lutomirski @ 2015-07-21 19:10 ` Chris Metcalf 2015-07-21 19:26 ` Andy Lutomirski 2015-07-24 14:03 ` Frederic Weisbecker 0 siblings, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-21 19:10 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org Sorry for the delay in responding; some other priorities came up internally. On 07/13/2015 05:45 PM, Andy Lutomirski wrote: > On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> On 07/13/2015 04:40 PM, Andy Lutomirski wrote: >>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >>> wrote: >>>> The existing nohz_full mode makes tradeoffs to minimize userspace >>>> interruptions while still attempting to avoid overheads in the >>>> kernel entry/exit path, to provide 100% kernel semantics, etc. >>>> >>>> However, some applications require a stronger commitment from the >>>> kernel to avoid interruptions, in particular userspace device >>>> driver style applications, such as high-speed networking code. >>>> >>>> This change introduces a framework to allow applications to elect >>>> to have the stronger semantics as needed, specifying >>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >>>> Subsequent commits will add additional flags and additional >>>> semantics. >>> I thought the general consensus was that this should be the default >>> behavior and that any associated bugs should be fixed. >> >> I think it comes down to dividing the set of use cases in two: >> >> - "Regular" nohz_full, as used to improve performance and limit >> interruptions, possibly for power benefits, etc. 
But, stray >> interrupts are not particularly bad, and you don't want to take >> extreme measures to avoid them. >> >> - What I'm calling "cpu_isolated" mode where when you return to >> userspace, you expect that by God, the kernel doesn't interrupt you >> again, and if it does, it's a flat-out bug. >> >> There are a few things that cpu_isolated mode currently does to >> accomplish its goals that are pretty heavy-weight: >> >> Processes are held in kernel space until ticks are quiesced; this is >> not necessarily what every nohz_full task wants. If a task makes a >> kernel call, there may well be arbitrary timer fallout, and having a >> way to select whether or not you are willing to take a timer tick after >> return to userspace is pretty important. > Then shouldn't deferred work be done immediately in nohz_full mode > regardless? What is this delayed work that's being done? I'm thinking of things like needing to wait for an RCU quiesce period to complete. In the current version, there's also the vmstat_update() that may schedule delayed work and interrupt the core again shortly before realizing that there are no more counter updates happening, at which point it quiesces. Currently we handle this in cpu_isolated mode simply by spinning and waiting for the timer interrupts to complete. >> Likewise, there are things that you may want to do on return to >> userspace that are designed to prevent further interruptions in >> cpu_isolated mode, even at a possible future performance cost if and >> when you return to the kernel, such as flushing the per-cpu free page >> list so that you won't be interrupted by an IPI to flush it later. > Why not just kick the per-cpu free page over to whatever cpu is > monitoring your RCU state, etc? That should be very quick. So just for the sake of precision, the thing I'm talking about is the lru_add_drain() call on kernel exit. Are you proposing that we call that for every nohz_full core on kernel exit? 
I'm not opposed to this, but I don't know if other nohz developers feel like this is the right tradeoff. Similarly, addressing the vmstat_update() issue above, in cpu_isolated mode we might want to have a follow-on patch that forces the vmstat system into quiesced state on return to userspace. We would need to do this unconditionally on all nohz_full cores if we tried to combine the current nohz_full with my proposed cpu_isolated functionality. Again, I'm not necessarily opposed, but I suspect other nohz developers might not want this. (I didn't want to introduce such a patch as part of this series since it pulls in even more interested parties, and it gets harder and harder to get to consensus.) >> If you're arguing that the cpu_isolated semantic is really the only >> one that makes sense for nohz_full, my sense is that it might be >> surprising to many of the folks who do nohz_full work. But, I'm happy >> to be wrong on this point, and maybe all the nohz_full community is >> interested in making the same tradeoffs for nohz_full generally that >> I've proposed in this patch series just for cpu_isolated? > nohz_full is currently dog slow for no particularly good reasons. I > suspect that the interrupts you're seeing are also there for no > particularly good reasons as well. > > Let's fix them instead of adding new ABIs to work around them. Well, in principle if we accepted my proposed patch series and then over time came to decide that it was reasonable for nohz_full to have these complete cpu isolation semantics, the one proposed ABI simply becomes a no-op. So it's not as problematic an ABI as some. My issue is this: I'm totally happy with submitting a revised patch series that does all the stuff for pure nohz_full that I'm currently proposing for cpu_isolated. But, is it what the community wants? Should I propose it and see? Frederic, do you have any insight here? Thanks! 
-- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 19:10 ` Chris Metcalf @ 2015-07-21 19:26 ` Andy Lutomirski [not found] ` <CALCETrVoHvofNHG81Q2Vb2i1qc7f2dy=qgkyb5NWNfUgYxhE8Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-07-24 20:22 ` Chris Metcalf 2015-07-24 14:03 ` Frederic Weisbecker 1 sibling, 2 replies; 159+ messages in thread From: Andy Lutomirski @ 2015-07-21 19:26 UTC (permalink / raw) To: Chris Metcalf, Paul McKenney Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > Sorry for the delay in responding; some other priorities came up internally. > > On 07/13/2015 05:45 PM, Andy Lutomirski wrote: >> >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote: >>>> >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >>>> >>>> wrote: >>>>> >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace >>>>> interruptions while still attempting to avoid overheads in the >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc. >>>>> >>>>> However, some applications require a stronger commitment from the >>>>> kernel to avoid interruptions, in particular userspace device >>>>> driver style applications, such as high-speed networking code. >>>>> >>>>> This change introduces a framework to allow applications to elect >>>>> to have the stronger semantics as needed, specifying >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >>>>> Subsequent commits will add additional flags and additional >>>>> semantics. 
>>>> >>>> I thought the general consensus was that this should be the default >>>> behavior and that any associated bugs should be fixed. >>> >>> >>> I think it comes down to dividing the set of use cases in two: >>> >>> - "Regular" nohz_full, as used to improve performance and limit >>> interruptions, possibly for power benefits, etc. But, stray >>> interrupts are not particularly bad, and you don't want to take >>> extreme measures to avoid them. >>> >>> - What I'm calling "cpu_isolated" mode where when you return to >>> userspace, you expect that by God, the kernel doesn't interrupt you >>> again, and if it does, it's a flat-out bug. >>> >>> There are a few things that cpu_isolated mode currently does to >>> accomplish its goals that are pretty heavy-weight: >>> >>> Processes are held in kernel space until ticks are quiesced; this is >>> not necessarily what every nohz_full task wants. If a task makes a >>> kernel call, there may well be arbitrary timer fallout, and having a >>> way to select whether or not you are willing to take a timer tick after >>> return to userspace is pretty important. >> >> Then shouldn't deferred work be done immediately in nohz_full mode >> regardless? What is this delayed work that's being done? > > > I'm thinking of things like needing to wait for an RCU quiesce > period to complete. rcu_nocbs does this, right? > > In the current version, there's also the vmstat_update() that > may schedule delayed work and interrupt the core again > shortly before realizing that there are no more counter updates > happening, at which point it quiesces. Currently we handle > this in cpu_isolated mode simply by spinning and waiting for > the timer interrupts to complete. Perhaps we should fix that? 
> >>> Likewise, there are things that you may want to do on return to >>> userspace that are designed to prevent further interruptions in >>> cpu_isolated mode, even at a possible future performance cost if and >>> when you return to the kernel, such as flushing the per-cpu free page >>> list so that you won't be interrupted by an IPI to flush it later. >> >> Why not just kick the per-cpu free page over to whatever cpu is >> monitoring your RCU state, etc? That should be very quick. > > > So just for the sake of precision, the thing I'm talking about > is the lru_add_drain() call on kernel exit. Are you proposing > that we call that for every nohz_full core on kernel exit? > I'm not opposed to this, but I don't know if other nohz > developers feel like this is the right tradeoff. I'm proposing either that we do that or that we arrange for other cpus to be able to steal our LRU list while we're in RCU user/idle. >> Let's fix them instead of adding new ABIs to work around them. > > > Well, in principle if we accepted my proposed patch series > and then over time came to decide that it was reasonable > for nohz_full to have these complete cpu isolation > semantics, the one proposed ABI simply becomes a no-op. > So it's not as problematic an ABI as some. What if we made it a debugfs thing instead of a prctl? Have a mode where the system tries really hard to quiesce itself even at the cost of performance. > > My issue is this: I'm totally happy with submitting a revised > patch series that does all the stuff for pure nohz_full that > I'm currently proposing for cpu_isolated. But, is it what > the community wants? Should I propose it and see? > > Frederic, do you have any insight here? Thanks! 
> > -- > Chris Metcalf, EZChip Semiconductor > http://www.ezchip.com -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 159+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode [not found] ` <CALCETrVoHvofNHG81Q2Vb2i1qc7f2dy=qgkyb5NWNfUgYxhE8Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-07-21 20:36 ` Paul E. McKenney 2015-07-22 13:57 ` Christoph Lameter 0 siblings, 1 reply; 159+ messages in thread From: Paul E. McKenney @ 2015-07-21 20:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Jul 21, 2015 at 12:26:17PM -0700, Andy Lutomirski wrote: > On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > > Sorry for the delay in responding; some other priorities came up internally. > > > > On 07/13/2015 05:45 PM, Andy Lutomirski wrote: > >> > >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> > >> wrote: > >>> > >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote: > >>>> > >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> > >>>> > >>>> wrote: > >>>>> > >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace > >>>>> interruptions while still attempting to avoid overheads in the > >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc. > >>>>> > >>>>> However, some applications require a stronger commitment from the > >>>>> kernel to avoid interruptions, in particular userspace device > >>>>> driver style applications, such as high-speed networking code. > >>>>> > >>>>> This change introduces a framework to allow applications to elect > >>>>> to have the stronger semantics as needed, specifying > >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. 
> >>>>> Subsequent commits will add additional flags and additional > >>>>> semantics. > >>>> > >>>> I thought the general consensus was that this should be the default > >>>> behavior and that any associated bugs should be fixed. > >>> > >>> > >>> I think it comes down to dividing the set of use cases in two: > >>> > >>> - "Regular" nohz_full, as used to improve performance and limit > >>> interruptions, possibly for power benefits, etc. But, stray > >>> interrupts are not particularly bad, and you don't want to take > >>> extreme measures to avoid them. > >>> > >>> - What I'm calling "cpu_isolated" mode where when you return to > >>> userspace, you expect that by God, the kernel doesn't interrupt you > >>> again, and if it does, it's a flat-out bug. > >>> > >>> There are a few things that cpu_isolated mode currently does to > >>> accomplish its goals that are pretty heavy-weight: > >>> > >>> Processes are held in kernel space until ticks are quiesced; this is > >>> not necessarily what every nohz_full task wants. If a task makes a > >>> kernel call, there may well be arbitrary timer fallout, and having a > >>> way to select whether or not you are willing to take a timer tick after > >>> return to userspace is pretty important. > >> > >> Then shouldn't deferred work be done immediately in nohz_full mode > >> regardless? What is this delayed work that's being done? > > > > I'm thinking of things like needing to wait for an RCU quiesce > > period to complete. > > rcu_nocbs does this, right? CONFIG_RCU_NOCB_CPUS offloads the RCU callbacks to a kthread, which allows the nohz CPU to turn off its scheduling-clock tick more frequently. Chris might have some other reason to wait for an RCU grace period, given that waiting for an RCU grace period would not guarantee no callbacks. Some more might have arrived in the meantime, and there can be some delay between the end of the grace period and the invocation of the callbacks. 
> > In the current version, there's also the vmstat_update() that > > may schedule delayed work and interrupt the core again > > shortly before realizing that there are no more counter updates > > happening, at which point it quiesces. Currently we handle > > this in cpu_isolated mode simply by spinning and waiting for > > the timer interrupts to complete. > > Perhaps we should fix that? Didn't Christoph Lameter fix this? Or is this an additional problem? Thanx, Paul > >>> Likewise, there are things that you may want to do on return to > >>> userspace that are designed to prevent further interruptions in > >>> cpu_isolated mode, even at a possible future performance cost if and > >>> when you return to the kernel, such as flushing the per-cpu free page > >>> list so that you won't be interrupted by an IPI to flush it later. > >> > >> Why not just kick the per-cpu free page over to whatever cpu is > >> monitoring your RCU state, etc? That should be very quick. > > > > > > So just for the sake of precision, the thing I'm talking about > > is the lru_add_drain() call on kernel exit. Are you proposing > > that we call that for every nohz_full core on kernel exit? > > I'm not opposed to this, but I don't know if other nohz > > developers feel like this is the right tradeoff. > > I'm proposing either that we do that or that we arrange for other cpus > to be able to steal our LRU list while we're in RCU user/idle. > > >> Let's fix them instead of adding new ABIs to work around them. > > > > > > Well, in principle if we accepted my proposed patch series > > and then over time came to decide that it was reasonable > > for nohz_full to have these complete cpu isolation > > semantics, the one proposed ABI simply becomes a no-op. > > So it's not as problematic an ABI as some. > > What if we made it a debugfs thing instead of a prctl? Have a mode > where the system tries really hard to quiesce itself even at the cost > of performance. 
> > > > > My issue is this: I'm totally happy with submitting a revised > > patch series that does all the stuff for pure nohz_full that > > I'm currently proposing for cpu_isolated. But, is it what > > the community wants? Should I propose it and see? > > > > Frederic, do you have any insight here? Thanks! > > > > -- > > Chris Metcalf, EZChip Semiconductor > > http://www.ezchip.com > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-api" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Andy Lutomirski > AMA Capital Management, LLC > ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 20:36 ` Paul E. McKenney @ 2015-07-22 13:57 ` Christoph Lameter [not found] ` <alpine.DEB.2.11.1507220856030.17411-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Christoph Lameter @ 2015-07-22 13:57 UTC (permalink / raw) To: Paul E. McKenney Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, 21 Jul 2015, Paul E. McKenney wrote: > > > In the current version, there's also the vmstat_update() that > > > may schedule delayed work and interrupt the core again > > > shortly before realizing that there are no more counter updates > > > happening, at which point it quiesces. Currently we handle > > > this in cpu_isolated mode simply by spinning and waiting for > > > the timer interrupts to complete. > > > > Perhaps we should fix that? > > Didn't Christoph Lameter fix this? Or is this an additional problem? Well the vmstat update must realize first that there are no outstanding updates before switching itself off. So typically there is one extra tick. But we could add another function that will simply fold the differential immediately and turn the kworker task in the expectation that the processor will stay quiet. ^ permalink raw reply [flat|nested] 159+ messages in thread
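Christoph's proposal — fold all pending per-cpu differentials into the global counters at once and switch the kworker off, with a shepherd re-enabling it if new diffs appear — can be illustrated with a toy userspace model. This is only an analogy for the mm/vmstat.c mechanism (all names and the data layout here are invented), not the kernel code itself.

```c
/* Toy model of the vmstat quiescing idea: each cpu accumulates small
 * local diffs; quiescing folds every remaining diff into the global
 * counter and marks that cpu's deferred-update worker idle, while a
 * shepherd re-arms workers on cpus where new diffs show up.  Purely
 * illustrative -- the real logic lives in mm/vmstat.c. */
#include <stdbool.h>

#define NR_CPUS 4

struct cpu_stat {
	int  diff;        /* pending local counter updates */
	bool worker_on;   /* is the deferred-update worker scheduled? */
};

static long global_count;
static struct cpu_stat cpu_stats[NR_CPUS];

/* What a quiet_vmstat()-style call does conceptually for one cpu:
 * fold the differential immediately and stop the worker, so no
 * further tick is needed just to notice the counters went quiet. */
void quiet_cpu(int cpu)
{
	global_count += cpu_stats[cpu].diff;
	cpu_stats[cpu].diff = 0;
	cpu_stats[cpu].worker_on = false;
}

/* The shepherd, running elsewhere, re-enables a worker whenever it
 * detects outstanding differentials on a supposedly quiet cpu. */
void shepherd_scan(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_stats[cpu].diff != 0)
			cpu_stats[cpu].worker_on = true;
}
```

The point of the one-shot fold is visible in the model: without quiet_cpu(), the worker needs one extra pass (one extra tick, in kernel terms) just to observe that diff stayed at zero before turning itself off.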
[parent not found: <alpine.DEB.2.11.1507220856030.17411-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>]
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode [not found] ` <alpine.DEB.2.11.1507220856030.17411-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org> @ 2015-07-22 19:28 ` Paul E. McKenney 2015-07-22 20:02 ` Christoph Lameter 0 siblings, 1 reply; 159+ messages in thread From: Paul E. McKenney @ 2015-07-22 19:28 UTC (permalink / raw) To: Christoph Lameter Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, Jul 22, 2015 at 08:57:45AM -0500, Christoph Lameter wrote: > On Tue, 21 Jul 2015, Paul E. McKenney wrote: > > > > > In the current version, there's also the vmstat_update() that > > > > may schedule delayed work and interrupt the core again > > > > shortly before realizing that there are no more counter updates > > > > happening, at which point it quiesces. Currently we handle > > > > this in cpu_isolated mode simply by spinning and waiting for > > > > the timer interrupts to complete. > > > > > > Perhaps we should fix that? > > > > Didn't Christoph Lameter fix this? Or is this an additional problem? > > Well the vmstat update must realize first that there are no outstanding > updates before switching itself off. So typically there is one extra tick. > But we could add another function that will simply fold the differential > immediately and turn the kworker task in the expectation that the > processor will stay quiet. Got it, thank you! Thanx, Paul ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-22 19:28 ` Paul E. McKenney @ 2015-07-22 20:02 ` Christoph Lameter 2015-07-24 20:21 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Christoph Lameter @ 2015-07-22 20:02 UTC (permalink / raw) To: Paul E. McKenney Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Wed, 22 Jul 2015, Paul E. McKenney wrote: > > > Didn't Christoph Lameter fix this? Or is this an additional problem? > > > > Well the vmstat update must realize first that there are no outstanding > > updates before switching itself off. So typically there is one extra tick. > > But we could add another function that will simply fold the differential > > immediately and turn the kworker task in the expectation that the > > processor will stay quiet. > > Got it, thank you! > > Thanx, Paul Ok here is a function that quiets down the vmstat kworkers. Subject: vmstat: provide a function to quiet down the diff processing quiet_vmstat() can be called in anticipation of an OS "quiet" period where no tick processing should be triggered. quiet_vmstat() will fold all pending differentials into the global counters and disable the vmstat_worker processing. Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes. Signed-off-by: Christoph Lameter <cl@linux.com> Index: linux/mm/vmstat.c =================================================================== --- linux.orig/mm/vmstat.c +++ linux/mm/vmstat.c @@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_st } /* + * Switch off vmstat processing and then fold all the remaining differentials + * until the diffs stay at zero. 
The function is used by NOHZ and can only be + * invoked when tick processing is not active. + */ +void quiet_vmstat(void) +{ + do { + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); + + } while (refresh_cpu_vm_stats()); +} + +/* * Check if the diffs for a certain cpu indicate that * an update is needed. */ Index: linux/include/linux/vmstat.h =================================================================== --- linux.orig/include/linux/vmstat.h +++ linux/include/linux/vmstat.h @@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); +void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -272,6 +273,7 @@ static inline void __dec_zone_page_state static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } +static inline void quiet_vmstat(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-22 20:02 ` Christoph Lameter @ 2015-07-24 20:21 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-24 20:21 UTC (permalink / raw) To: Christoph Lameter, Paul E. McKenney Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 07/22/2015 04:02 PM, Christoph Lameter wrote: > On Wed, 22 Jul 2015, Paul E. McKenney wrote: > >>>> Didn't Christoph Lameter fix this? Or is this an additional problem? >>> Well the vmstat update must realize first that there are no outstanding >>> updates before switching itself off. So typically there is one extra tick. >>> But we could add another function that will simply fold the differential >>> immediately and turn the kworker task in the expectation that the >>> processor will stay quiet. >> Got it, thank you! >> >> Thanx, Paul > Ok here is a function that quiets down the vmstat kworkers. That's great - I will include this patch in my series then, and call it as part of the "hard isolation" mode return to userspace. Thanks! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 19:26 ` Andy Lutomirski [not found] ` <CALCETrVoHvofNHG81Q2Vb2i1qc7f2dy=qgkyb5NWNfUgYxhE8Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-07-24 20:22 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-24 20:22 UTC (permalink / raw) To: Andy Lutomirski, Paul McKenney Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 07/21/2015 03:26 PM, Andy Lutomirski wrote: > On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> So just for the sake of precision, the thing I'm talking about >> is the lru_add_drain() call on kernel exit. Are you proposing >> that we call that for every nohz_full core on kernel exit? >> I'm not opposed to this, but I don't know if other nohz >> developers feel like this is the right tradeoff. > I'm proposing either that we do that or that we arrange for other cpus > to be able to steal our LRU list while we're in RCU user/idle. That seems challenging; there is a lot that has to be done in lru_add_drain() and we may not want to do it for the "soft isolation" mode Frederic alludes to in a later email. And, we would have to add a bunch of locking to allow another process to steal the list from under us, so that's not obviously going to be a performance win in terms of the per-cpu page cache for normal operations. Perhaps there could be a lock taken that nohz_full processes have to take just to exit from userspace, and that other tasks could take to do things on behalf of the nohz_full process that it thinks it can do locklessly. 
It gets complicated, since you'd want to tie that to whether the nohz_full process was currently in the kernel or not, so some kind of atomic update on the context_tracking state or some such, perhaps. Still not really clear if that overhead is worth it (both from a maintenance point of view and the possible performance hit). Limiting it just to the hard isolation mode seems like a good answer since there we really know that userspace does not care about the performance implications of kernel/userspace transitions, and it doesn't cause slowdowns to anyone else. For now I will bundle it in with my respin as part of the "hard isolation" mode Frederic proposed. >> Well, in principle if we accepted my proposed patch series >> and then over time came to decide that it was reasonable >> for nohz_full to have these complete cpu isolation >> semantics, the one proposed ABI simply becomes a no-op. >> So it's not as problematic an ABI as some. > What if we made it a debugfs thing instead of a prctl? Have a mode > where the system tries really hard to quiesce itself even at the cost > of performance. No, since it's really a mode within an individual task that you'd like to switch on and off depending on what the task is trying to do - strict mode while it's running its main fast-path userspace code, but certainly not strict mode during its setup, and possibly leaving strict mode to run some kinds of slow-path, diagnostic, or error-handling code. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 19:10 ` Chris Metcalf 2015-07-21 19:26 ` Andy Lutomirski @ 2015-07-24 14:03 ` Frederic Weisbecker 2015-07-24 20:19 ` Chris Metcalf 1 sibling, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-07-24 14:03 UTC (permalink / raw) To: Chris Metcalf Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org, Mike Galbraith On Tue, Jul 21, 2015 at 03:10:54PM -0400, Chris Metcalf wrote: > >>If you're arguing that the cpu_isolated semantic is really the only > >>one that makes sense for nohz_full, my sense is that it might be > >>surprising to many of the folks who do nohz_full work. But, I'm happy > >>to be wrong on this point, and maybe all the nohz_full community is > >>interested in making the same tradeoffs for nohz_full generally that > >>I've proposed in this patch series just for cpu_isolated? > >nohz_full is currently dog slow for no particularly good reasons. I > >suspect that the interrupts you're seeing are also there for no > >particularly good reasons as well. > > > >Let's fix them instead of adding new ABIs to work around them. > > Well, in principle if we accepted my proposed patch series > and then over time came to decide that it was reasonable > for nohz_full to have these complete cpu isolation > semantics, the one proposed ABI simply becomes a no-op. > So it's not as problematic an ABI as some. > > My issue is this: I'm totally happy with submitting a revised > patch series that does all the stuff for pure nohz_full that > I'm currently proposing for cpu_isolated. But, is it what > the community wants? Should I propose it and see? > > Frederic, do you have any insight here? Thanks! 
So you guys mean that if nohz_full was implemented fully, like we expect it to be, we shouldn't be burdened at all by noise, and that whole patchset would therefore be pointless, right? And that would meet the requirements for those who want hard isolation (critical noise-free guarantee) as well as those who want soft isolation (as little noise as possible, for performance). Well, first of all, nohz is not isolation; it's a significant part of it, but it's not all of isolation. We really want to separate these things and not mess up isolation policies in the tick code. Second, yes, perhaps we can eventually have both soft and hard isolation expectations implemented the same way, through hard isolation. But that will only work if we don't do that polling for noise-free before resuming userspace, which might work for hard isolation that is ready to sacrifice some warm-up before a run to meet guarantees, but it won't work for soft isolation workloads. So the only solution is to offline everything we can to housekeeping CPUs. And if we still have stuff that can't be dealt with that way and which needs to be taken care of with some explicit operation before resuming to userspace, then we can start to think about splitting stuff into several isolation configs. Similarly, offlining everything to housekeepers means that we sacrifice a CPU that could have been used in performance-oriented workloads, so that might not match soft isolation as well. But I think we'll see all that once we manage to have pure noise-free CPUs (some patches are on the way to be posted by Vatika Harlalka concerning the residual 1Hz tick to kill). To summarize, let's first split nohz and isolation. Introduce CONFIG_CPU_ISOLATION and stuff all the isolation policies into kernel/cpu_isolation.c; let's try to implement hard isolation and see if that meets soft isolation workload users as well, and if not, we'll split that later. 
And we can keep the prctl to tell the user when hard isolation has been broken, through SIGKILL or whatever. I think we are doing a similar thing with SCHED_DEADLINE when the task hasn't met its deadline requirement. We might want to do the same. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-24 14:03 ` Frederic Weisbecker @ 2015-07-24 20:19 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-24 20:19 UTC (permalink / raw) To: Frederic Weisbecker Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org, Mike Galbraith On 07/24/2015 10:03 AM, Frederic Weisbecker wrote: > To summarize, let's first split nohz and isolation. Introduce > CONFIG_CPU_ISOLATION and stuff all the isolation policies into > kernel/cpu_isolation.c; let's try to implement hard isolation and see if that > meets soft isolation workload users as well, and if not, we'll split that later. I will do that for v5. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 19:57 ` [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf 2015-07-13 20:40 ` Andy Lutomirski @ 2015-07-24 13:27 ` Frederic Weisbecker 2015-07-24 20:21 ` Chris Metcalf 1 sibling, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-07-24 13:27 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote: > The existing nohz_full mode makes tradeoffs to minimize userspace > interruptions while still attempting to avoid overheads in the > kernel entry/exit path, to provide 100% kernel semantics, etc. > > However, some applications require a stronger commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications to elect > to have the stronger semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. > > The "cpu_isolated" state is indicated by setting a new task struct > field, cpu_isolated_flags, to the value passed by prctl(). When the > _ENABLE bit is set for a task, and it is returning to userspace > on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() > routine to take additional actions to help the task avoid being > interrupted in the future. > > Initially, there are only two actions taken. First, the task > calls lru_add_drain() to prevent being interrupted by a subsequent > lru_add_drain_all() call on another core. Then, the code checks for > pending timer interrupts and quiesces until they are no longer pending. 
> As a result, sys calls (and page faults, etc.) can be inordinately slow. > However, this quiescing guarantees that no unexpected interrupts will > occur, even if the application intentionally calls into the kernel. > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > arch/tile/kernel/process.c | 9 ++++++++ > include/linux/sched.h | 3 +++ > include/linux/tick.h | 10 ++++++++ > include/uapi/linux/prctl.h | 5 ++++ > kernel/context_tracking.c | 3 +++ > kernel/sys.c | 8 +++++++ > kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++ > 7 files changed, 95 insertions(+) > > diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c > index e036c0aa9792..3625e839ad62 100644 > --- a/arch/tile/kernel/process.c > +++ b/arch/tile/kernel/process.c > @@ -70,6 +70,15 @@ void arch_cpu_idle(void) > _cpu_idle(); > } > > +#ifdef CONFIG_NO_HZ_FULL I think this goes way beyond nohz itself. We don't only want the tick to shutdown, we want also the pending timers, workqueues, etc... It's time to create the CONFIG_ISOLATION_foo stuffs. > +void tick_nohz_cpu_isolated_wait(void) > +{ > + set_current_state(TASK_INTERRUPTIBLE); > + _cpu_idle(); > + set_current_state(TASK_RUNNING); > +} > +#endif > + > /* > * Release a thread_info structure > */ > diff --git a/include/linux/sched.h b/include/linux/sched.h > index ae21f1591615..f350b0c20bbc 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1778,6 +1778,9 @@ struct task_struct { > unsigned long task_state_change; > #endif > int pagefault_disabled; > +#ifdef CONFIG_NO_HZ_FULL > + unsigned int cpu_isolated_flags; > +#endif > }; > > /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ > diff --git a/include/linux/tick.h b/include/linux/tick.h > index 3741ba1a652c..cb5569181359 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -10,6 +10,7 @@ > #include <linux/context_tracking_state.h> > #include <linux/cpumask.h> > #include <linux/sched.h> > +#include <linux/prctl.h> > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > extern void __init tick_init(void); > @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) > cpumask_or(mask, mask, tick_nohz_full_mask); > } > > +static inline bool tick_nohz_is_cpu_isolated(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); > +} > + > extern void __tick_nohz_full_check(void); > extern void tick_nohz_full_kick(void); > extern void tick_nohz_full_kick_cpu(int cpu); > extern void tick_nohz_full_kick_all(void); > extern void __tick_nohz_task_switch(struct task_struct *tsk); > +extern void tick_nohz_cpu_isolated_enter(void); > #else > static inline bool tick_nohz_full_enabled(void) { return false; } > static inline bool tick_nohz_full_cpu(int cpu) { return false; } > @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } > static inline void tick_nohz_full_kick(void) { } > static inline void tick_nohz_full_kick_all(void) { } > static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } > +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } > +static inline void tick_nohz_cpu_isolated_enter(void) { } > #endif > > static inline bool is_housekeeping_cpu(int cpu) > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 31891d9535e2..edb40b6b84db 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -190,4 +190,9 @@ struct prctl_mm_map { > # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ > # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ > > +/* Enable/disable or query cpu_isolated mode for 
NO_HZ_FULL kernels. */ > +#define PR_SET_CPU_ISOLATED 47 > +#define PR_GET_CPU_ISOLATED 48 > +# define PR_CPU_ISOLATED_ENABLE (1 << 0) > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c > index 0a495ab35bc7..f9de3ee12723 100644 > --- a/kernel/context_tracking.c > +++ b/kernel/context_tracking.c > @@ -20,6 +20,7 @@ > #include <linux/hardirq.h> > #include <linux/export.h> > #include <linux/kprobes.h> > +#include <linux/tick.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/context_tracking.h> > @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) > * on the tick. > */ > if (state == CONTEXT_USER) { > + if (tick_nohz_is_cpu_isolated()) > + tick_nohz_cpu_isolated_enter(); > trace_user_enter(0); > vtime_user_enter(current); > } > diff --git a/kernel/sys.c b/kernel/sys.c > index 259fda25eb6b..36eb9a839f1f 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > case PR_GET_FP_MODE: > error = GET_FP_MODE(me); > break; > +#ifdef CONFIG_NO_HZ_FULL > + case PR_SET_CPU_ISOLATED: > + me->cpu_isolated_flags = arg2; > + break; > + case PR_GET_CPU_ISOLATED: > + error = me->cpu_isolated_flags; > + break; > +#endif > default: > error = -EINVAL; > break; > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index c792429e98c6..4cf093c012d1 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -24,6 +24,7 @@ > #include <linux/posix-timers.h> > #include <linux/perf_event.h> > #include <linux/context_tracking.h> > +#include <linux/swap.h> > > #include <asm/irq_regs.h> > > @@ -389,6 +390,62 @@ void __init tick_nohz_init(void) > pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", > cpumask_pr_args(tick_nohz_full_mask)); > } > + > +/* > + * Rather than continuously polling for the next_event in the > + * tick_cpu_device, architectures can provide a method to save power > + * by 
sleeping until an interrupt arrives. > + */ > +void __weak tick_nohz_cpu_isolated_wait(void) > +{ > + cpu_relax(); > +} > + > +/* > + * We normally return immediately to userspace. > + * > + * In "cpu_isolated" mode we wait until no more interrupts are > + * pending. Otherwise we nap with interrupts enabled and wait for the > + * next interrupt to fire, then loop back and retry. > + * > + * Note that if you schedule two "cpu_isolated" processes on the same > + * core, neither will ever leave the kernel, and one will have to be > + * killed manually. Otherwise in situations where another process is > + * in the runqueue on this cpu, this task will just wait for that > + * other task to go idle before returning to user space. > + */ > +void tick_nohz_cpu_isolated_enter(void) Similarly, I'd rather see that in kernel/cpu_isolation.c and call it cpu_isolation_enter(). > +{ > + struct clock_event_device *dev = > + __this_cpu_read(tick_cpu_device.evtdev); > + struct task_struct *task = current; > + unsigned long start = jiffies; > + bool warned = false; > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > + if (!warned && (jiffies - start) >= (5 * HZ)) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start) / HZ); > + warned = true; > + } > + if (should_resched()) > + schedule(); > + if (test_thread_flag(TIF_SIGPENDING)) > + break; > + tick_nohz_cpu_isolated_wait(); If we call cpu_idle(), what is going to wake the CPU up if not further interrupt happen? We could either implement some sort of tick waiters with proper wake up once the CPU sees no tick to schedule. Arguably this is all risky because this involve a scheduler wake up and thus the risk for new noise. But it might work. Another possibility is an msleep() based wait. 
But that's about the same, maybe even worse due to repetitive wake ups. > + } > + if (warned) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start) / HZ); > + dump_stack(); > + } > +} > + > #endif > > /* > -- > 2.1.2 > ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-24 13:27 ` Frederic Weisbecker @ 2015-07-24 20:21 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-24 20:21 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 07/24/2015 09:27 AM, Frederic Weisbecker wrote: > On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote: >> +{ >> + struct clock_event_device *dev = >> + __this_cpu_read(tick_cpu_device.evtdev); >> + struct task_struct *task = current; >> + unsigned long start = jiffies; >> + bool warned = false; >> + >> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ >> + lru_add_drain(); >> + >> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { >> + if (!warned && (jiffies - start) >= (5 * HZ)) { >> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start) / HZ); >> + warned = true; >> + } >> + if (should_resched()) >> + schedule(); >> + if (test_thread_flag(TIF_SIGPENDING)) >> + break; >> + tick_nohz_cpu_isolated_wait(); > If we call cpu_idle(), what is going to wake the CPU up if no further interrupt happen? > > We could either implement some sort of tick waiters with proper wake up once the CPU sees > no tick to schedule. Arguably this is all risky because this involve a scheduler wake up > and thus the risk for new noise. But it might work. > > Another possibility is an msleep() based wait. But that's about the same, maybe even worse > due to repetitive wake ups. 
The presumption here is that it is not possible to have tick_cpu_device have a pending next_event without also having a timer interrupt pending to go off. That certainly seems to be true on the architectures I have looked at. Do we think that might ever not be the case? We are running here with interrupts disabled, so this core won't transition from "timer interrupt scheduled" to "no timer interrupt scheduled" before we spin or idle, and presumably no other core can reach across and turn off our timer interrupt either. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-07-13 19:57 ` [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf 2015-07-13 19:57 ` [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf [not found] ` <1436817481-8732-3-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-07-13 19:57 ` [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal Chris Metcalf [not found] ` <1436817481-8732-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->cpu_isolated_flags that is set when prctl() sets the flags. We check the bit on syscall entry as well as on any exception_enter(). The prctl() syscall is ignored to allow clearing the bit again later, and exit/exit_group are ignored to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. 
The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/ptrace.c | 4 ++++ arch/tile/kernel/ptrace.c | 6 +++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/tick.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..7315b1579cbd 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report cpu_isolated violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..d4e43a13bab1 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall( + regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..860f346977e2 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..8b994e2a0330 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/tick.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/tick.h b/include/linux/tick.h index cb5569181359..f79f6945f762 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -157,6 +157,8 @@ extern void tick_nohz_full_kick_cpu(int cpu); extern void 
tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); +extern void tick_nohz_cpu_isolated_syscall(int nr); +extern void tick_nohz_cpu_isolated_exception(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -168,6 +170,8 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } +static inline void tick_nohz_cpu_isolated_syscall(int nr) { } +static inline void tick_nohz_cpu_isolated_exception(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) @@ -200,4 +204,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk) __tick_nohz_task_switch(tsk); } +static inline bool tick_nohz_cpu_isolated_strict(void) +{ +#ifdef CONFIG_NO_HZ_FULL + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index f9de3ee12723..fd051ea290ee 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. 
This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 4cf093c012d1..9f495c7c7dc2 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -27,6 +27,7 @@ #include <linux/swap.h> #include <asm/irq_regs.h> +#include <asm/unistd.h> #include "tick-internal.h" @@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void) } } +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void tick_nohz_cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void tick_nohz_cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread

* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode [not found] ` <1436817481-8732-3-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-07-13 21:47 ` Andy Lutomirski [not found] ` <CALCETrUvg+Dix=jG2_1J=mgQC+uRk4dthCYDcb4E5ooEfQjqtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-07-13 21:47 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > With cpu_isolated mode, the task is in principle guaranteed not to be > interrupted by the kernel, but only if it behaves. In particular, if it > enters the kernel via system call, page fault, or any of a number of other > synchronous traps, it may be unexpectedly exposed to long latencies. > Add a simple flag that puts the process into a state where any such > kernel entry is fatal. > To me, this seems like the wrong design. If nothing else, it seems too much like an abusable anti-debugging mechanism. I can imagine some per-task flag "I think I shouldn't be interrupted now" and a tracepoint that fires if the task is interrupted with that flag set. But the strong cpu isolation stuff requires systemwide configuration, and I think that monitoring that it works should work similarly. More comments below. 
> Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> > --- > arch/arm64/kernel/ptrace.c | 4 ++++ > arch/tile/kernel/ptrace.c | 6 +++++- > arch/x86/kernel/ptrace.c | 2 ++ > include/linux/context_tracking.h | 11 ++++++++--- > include/linux/tick.h | 16 ++++++++++++++++ > include/uapi/linux/prctl.h | 1 + > kernel/context_tracking.c | 9 ++++++--- > kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++ > 8 files changed, 80 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c > index d882b833dbdb..7315b1579cbd 100644 > --- a/arch/arm64/kernel/ptrace.c > +++ b/arch/arm64/kernel/ptrace.c > @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, > > asmlinkage int syscall_trace_enter(struct pt_regs *regs) > { > + /* Ensure we report cpu_isolated violations in all circumstances. */ > + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict()) > + tick_nohz_cpu_isolated_syscall(regs->syscallno); IMO this is pointless. If a user wants a syscall to kill them, use seccomp. The kernel isn't at fault if the user does a syscall when it didn't want to enter the kernel. > @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) > return 0; > > prev_ctx = this_cpu_read(context_tracking.state); > - if (prev_ctx != CONTEXT_KERNEL) > - context_tracking_exit(prev_ctx); > + if (prev_ctx != CONTEXT_KERNEL) { > + if (context_tracking_exit(prev_ctx)) { > + if (tick_nohz_cpu_isolated_strict()) > + tick_nohz_cpu_isolated_exception(); > + } > + } NACK. I'm cautiously optimistic that an x86 kernel 4.3 or newer will simply never call exception_enter. It certainly won't call it frequently unless something goes wrong with the patches that are already in -tip. > --- a/kernel/context_tracking.c > +++ b/kernel/context_tracking.c > @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); > * This call supports re-entrancy. 
This way it can be called from any exception > * handler without needing to know if we came from userspace or not. > */ > -void context_tracking_exit(enum ctx_state state) > +bool context_tracking_exit(enum ctx_state state) > { > unsigned long flags; > + bool from_user = false; > IMO the internal context tracking API (e.g. context_tracking_exit) are mostly of the form "hey context tracking: I don't really know what you're doing or what I'm doing, but let me call you and make both of us feel better." You're making it somewhat worse: now it's all of the above plus "I don't even know whether I just entered the kernel -- maybe you have a better idea". Starting with 4.3, x86 kernels will know *exactly* when they enter the kernel. All of this context tracking what-was-my-previous-state stuff will remain until someone kills it, but when it goes away we'll get a nice performance boost. So, no, let's implement this for real if we're going to implement it. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode [not found] ` <CALCETrUvg+Dix=jG2_1J=mgQC+uRk4dthCYDcb4E5ooEfQjqtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-07-21 19:34 ` Chris Metcalf [not found] ` <55AE9EAC.4010202-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-07-21 19:34 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 07/13/2015 05:47 PM, Andy Lutomirski wrote: > On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> With cpu_isolated mode, the task is in principle guaranteed not to be >> interrupted by the kernel, but only if it behaves. In particular, if it >> enters the kernel via system call, page fault, or any of a number of other >> synchronous traps, it may be unexpectedly exposed to long latencies. >> Add a simple flag that puts the process into a state where any such >> kernel entry is fatal. >> > To me, this seems like the wrong design. If nothing else, it seems > too much like an abusable anti-debugging mechanism. I can imagine > some per-task flag "I think I shouldn't be interrupted now" and a > tracepoint that fires if the task is interrupted with that flag set. > But the strong cpu isolation stuff requires systemwide configuration, > and I think that monitoring that it works should work similarly. First, you mention a per-task flag, but not specifically whether the proposed prctl() mechanism is a reasonable way to set that flag. Just wanted to clarify that this wasn't an issue in and of itself for you. Second, you suggest a tracepoint. 
I'm OK with creating a tracepoint dedicated to cpu_isolated strict failures and making that the only way this mechanism works. But, earlier community feedback seemed to suggest that the signal mechanism was OK; one piece of feedback just requested being able to set which signal was delivered. Do you think the signal idea is a bad one? Are you proposing potentially having a signal and/or a tracepoint? Last, you mention systemwide configuration for monitoring. Can you expand on what you mean by that? We already support the monitoring only on the nohz_full cores, so to that extent it's already systemwide. And the per-task flag has to be set by the running process when it's ready for this state, so that can't really be systemwide configuration. I don't understand your suggestion on this point. >> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c >> index d882b833dbdb..7315b1579cbd 100644 >> --- a/arch/arm64/kernel/ptrace.c >> +++ b/arch/arm64/kernel/ptrace.c >> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, >> >> asmlinkage int syscall_trace_enter(struct pt_regs *regs) >> { >> + /* Ensure we report cpu_isolated violations in all circumstances. */ >> + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict()) >> + tick_nohz_cpu_isolated_syscall(regs->syscallno); > IMO this is pointless. If a user wants a syscall to kill them, use > seccomp. The kernel isn't at fault if the user does a syscall when it > didn't want to enter the kernel. Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT was to what I wanted here. One concern is that there doesn't seem to be a way to "escape" from seccomp strict mode, i.e. you can't call seccomp() again to turn it off - which makes sense for seccomp since it's a security issue, but not so much sense with cpu_isolated. So, do you think there's a good role for the seccomp() API to play in achieving this goal? 
It's certainly not a question of "the kernel at fault" but rather "asking the kernel to help catch user mistakes" (typically third-party libraries in our customers' experience). You could imagine a SECCOMP_SET_MODE_ISOLATED or something. Alternatively, we could stick with the API proposed in my patch series, or something similar, and just try to piggy-back on the seccomp internals to make it happen. It would require Kconfig to ensure that SECCOMP was enabled though, which obviously isn't currently required to do cpu isolation. >> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) >> return 0; >> >> prev_ctx = this_cpu_read(context_tracking.state); >> - if (prev_ctx != CONTEXT_KERNEL) >> - context_tracking_exit(prev_ctx); >> + if (prev_ctx != CONTEXT_KERNEL) { >> + if (context_tracking_exit(prev_ctx)) { >> + if (tick_nohz_cpu_isolated_strict()) >> + tick_nohz_cpu_isolated_exception(); >> + } >> + } > NACK. I'm cautiously optimistic that an x86 kernel 4.3 or newer will > simply never call exception_enter. It certainly won't call it > frequently unless something goes wrong with the patches that are > already in -tip. This is intended to catch user exceptions like page faults, GPV or (on platforms where this would happen) unaligned data traps. The kernel still has a role to play here and cpu_isolated mode needs to let the user know they have accidentally entered the kernel in this case. >> --- a/kernel/context_tracking.c >> +++ b/kernel/context_tracking.c >> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); >> * This call supports re-entrancy. This way it can be called from any exception >> * handler without needing to know if we came from userspace or not. >> */ >> -void context_tracking_exit(enum ctx_state state) >> +bool context_tracking_exit(enum ctx_state state) >> { >> unsigned long flags; >> + bool from_user = false; >> > IMO the internal context tracking API (e.g. 
context_tracking_exit) are > mostly of the form "hey context tracking: I don't really know what > you're doing or what I'm doing, but let me call you and make both of > us feel better." You're making it somewhat worse: now it's all of the > above plus "I don't even know whether I just entered the kernel -- > maybe you have a better idea". > > Starting with 4.3, x86 kernels will know *exactly* when they enter the > kernel. All of this context tracking what-was-my-previous-state stuff > will remain until someone kills it, but when it goes away we'll get a > nice performance boost. > > So, no, let's implement this for real if we're going to implement it. I'm certainly OK with rebasing on top of 4.3 after the context tracking stuff is better. That said, I think it makes sense to continue to debate the intent of the patch series even if we pull this one patch out and defer it until after 4.3, or having it end up pulled into some other repo that includes the improvements and is being pulled for 4.3. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode [not found] ` <55AE9EAC.4010202-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-07-21 19:42 ` Andy Lutomirski 2015-07-24 20:29 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-07-21 19:42 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > On 07/13/2015 05:47 PM, Andy Lutomirski wrote: >> >> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >> wrote: >>> >>> With cpu_isolated mode, the task is in principle guaranteed not to be >>> interrupted by the kernel, but only if it behaves. In particular, if it >>> enters the kernel via system call, page fault, or any of a number of >>> other >>> synchronous traps, it may be unexpectedly exposed to long latencies. >>> Add a simple flag that puts the process into a state where any such >>> kernel entry is fatal. >>> >> To me, this seems like the wrong design. If nothing else, it seems >> too much like an abusable anti-debugging mechanism. I can imagine >> some per-task flag "I think I shouldn't be interrupted now" and a >> tracepoint that fires if the task is interrupted with that flag set. >> But the strong cpu isolation stuff requires systemwide configuration, >> and I think that monitoring that it works should work similarly. > > > First, you mention a per-task flag, but not specifically whether the > proposed prctl() mechanism is a reasonable way to set that flag. > Just wanted to clarify that this wasn't an issue in and of itself for you. 
I think I'm okay with a per-task flag for this and, if you add one, then prctl() is presumably the way to go. Unless people think that nohz should be 100% reliable always, in which case might as well make the flag per-cpu. > > Second, you suggest a tracepoint. I'm OK with creating a tracepoint > dedicated to cpu_isolated strict failures and making that the only > way this mechanism works. But, earlier community feedback seemed to > suggest that the signal mechanism was OK; one piece of feedback > just requested being able to set which signal was delivered. Do you > think the signal idea is a bad one? Are you proposing potentially > having a signal and/or a tracepoint? I prefer the tracepoint. It's friendlier to debuggers, and it's really about diagnosing a kernel problem, not a userspace problem. Also, I really doubt that people should deploy a signal thing in production. What if an NMI fires and kills their realtime program? > > Last, you mention systemwide configuration for monitoring. Can you > expand on what you mean by that? We already support the monitoring > only on the nohz_full cores, so to that extent it's already systemwide. > And the per-task flag has to be set by the running process when it's > ready for this state, so that can't really be systemwide configuration. > I don't understand your suggestion on this point. I'm really thinking about systemwide configuration for isolation. I think we'll always (at least in the nearish term) need the admin's help to set up isolated CPUs. If the admin makes a whole CPU be isolated, then monitoring just that CPU and monitoring it all the time seems sensible. If we really do think that isolating a CPU should require a syscall of some sort because it's too expensive otherwise, then we can do it that way, too. And if full isolation requires some user help (e.g. don't do certain things that break isolation), then having a per-task monitoring flag seems reasonable. We may always need the user's help to avoid IPIs. 
For example, if one thread calls munmap, the other thread is going to get an IPI. There's nothing we can do about that. > I'm certainly OK with rebasing on top of 4.3 after the context > tracking stuff is better. That said, I think it makes sense to continue > to debate the intent of the patch series even if we pull this one > patch out and defer it until after 4.3, or having it end up pulled > into some other repo that includes the improvements and > is being pulled for 4.3. Sure, no problem. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-07-21 19:42 ` Andy Lutomirski @ 2015-07-24 20:29 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-24 20:29 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 07/21/2015 03:42 PM, Andy Lutomirski wrote: > On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> Second, you suggest a tracepoint. I'm OK with creating a tracepoint >> dedicated to cpu_isolated strict failures and making that the only >> way this mechanism works. But, earlier community feedback seemed to >> suggest that the signal mechanism was OK; one piece of feedback >> just requested being able to set which signal was delivered. Do you >> think the signal idea is a bad one? Are you proposing potentially >> having a signal and/or a tracepoint? > I prefer the tracepoint. It's friendlier to debuggers, and it's > really about diagnosing a kernel problem, not a userspace problem. > Also, I really doubt that people should deploy a signal thing in > production. What if an NMI fires and kills their realtime program? No, this piece of the patch series is about diagnosing bugs in the userspace program (likely in third-party code, in our customers' experience). When you violate strict mode, you get a signal and you have a nice pointer to what instruction it was that caused you to enter the kernel. You are right that running this in production is likely not a great idea, as is true for other debugging mechanisms. 
But you might really want to have it as a signal with a signal handler that fires to generate a trace of some kind into the application's existing tracing mechanisms, so the app doesn't just report "wow, I lost a bunch of time in here somewhere, sorry about those packets I dropped on the floor", but "here's where I took a strict signal". You probably drop a few additional packets due to the signal handling and logging, but given you've already fallen away from 100% in this case, the extra diagnostics are almost certainly worth it. In this case it's probably not as helpful to have a tracepoint-based solution, just because you really do want to be able to easily integrate into the app's existing logging framework. My sense, I think, is that we can easily add tracepoints to the strict failure code in the future, so it may not be worth trying to widen the scope of the patch series just now. >> Last, you mention systemwide configuration for monitoring. Can you >> expand on what you mean by that? We already support the monitoring >> only on the nohz_full cores, so to that extent it's already systemwide. >> And the per-task flag has to be set by the running process when it's >> ready for this state, so that can't really be systemwide configuration. >> I don't understand your suggestion on this point. > I'm really thinking about systemwide configuration for isolation. I > think we'll always (at least in the nearish term) need the admin's > help to set up isolated CPUs. If the admin makes a whole CPU be > isolated, then monitoring just that CPU and monitoring it all the time > seems sensible. If we really do think that isolating a CPU should > require a syscall of some sort because it's too expensive otherwise, > then we can do it that way, too. And if full isolation requires some > user help (e.g. don't do certain things that break isolation), then > having a per-task monitoring flag seems reasonable. > > We may always need the user's help to avoid IPIs. 
For example, if one > thread calls munmap, the other thread is going to get an IPI. There's > nothing we can do about that. I think we're mostly agreed on this stuff, though your use of "monitored" doesn't really match the "strict" mode in this patch. It's certainly true that, for example, we advise customers not to run the slow-path code on a housekeeping cpu as a thread in the same process space as the fast-path code on the nohz_full cores, just because things like fclose() on a file descriptor will lead to free() which can lead to munmap() and an IPI to the fast path. >> I'm certainly OK with rebasing on top of 4.3 after the context >> tracking stuff is better. That said, I think it makes sense to continue >> to debate the intent of the patch series even if we pull this one >> patch out and defer it until after 4.3, or having it end up pulled >> into some other repo that includes the improvements and >> is being pulled for 4.3. > Sure, no problem. I will add a comment to the patch and a note to the series about this, but for now I'll keep it in the series. If we can arrange to pull it into Frederic's tree after the context_tracking changes, we can respin it at that point to layer it on top. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal 2015-07-13 19:57 ` [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf 2015-07-13 19:57 ` [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf 2015-07-13 19:57 ` [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf [not found] ` <1436817481-8732-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 3 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/tick-sched.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9f495c7c7dc2..c5eca9c99fad 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) +static void kill_cpu_isolated_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full [not found] ` <1436817481-8732-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-07-28 19:49 ` Chris Metcalf 2015-07-28 19:49 ` [PATCH v5 2/6] cpu_isolated: add initial support Chris Metcalf ` (3 more replies) 0 siblings, 4 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf This version of the patch series incorporates Christoph Lameter's change to add a quiet_vmstat() call, and restructures cpu_isolated as a "hard" isolation mode in contrast to nohz_full's "soft" isolation, breaking it out as a separate CONFIG_CPU_ISOLATED with its own include/linux/cpu_isolated.h and kernel/time/cpu_isolated.c. It is rebased to 4.2-rc3. Thomas: as I mentioned in v4, I haven't heard from you whether my removal of the cpu_idle calls sufficiently addresses your concerns about that aspect. Andy: as I said in email, I've left in the support where cpu_isolated relies on the context_tracking stuff currently in 4.2-rc3. I'm not sure what the cleanest way is for me to pick up the new context_tracking stuff; if that's all that ends up standing between this patch series and having it be pulled, perhaps I can rebase it onto whatever branch it is that has the new context_tracking? Original patch series cover letter follows: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. 
However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_CPU_ISOLATED to take advantage of this new mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. 
The series (based currently on v4.2-rc3) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v5: rebased on kernel v4.2-rc3
    converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
    incorporates Christoph Lameter's quiet_vmstat() call

v4: rebased on kernel v4.2-rc1
    added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3: remove dependency on cpu_idle subsystem (Thomas Gleixner)
    use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
    use seconds for console messages instead of jiffies (Thomas Gleixner)
    updated commit description for patch 5/5

v2: rename "dataplane" to "cpu_isolated"
    drop ksoftirqd suppression changes (believed no longer needed)
    merge previous "QUIESCE" functionality into baseline functionality
    explicitly track syscalls and exceptions for "STRICT" functionality
    allow configuring a signal to be delivered for STRICT mode failures
    move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555). Note also that if we remove the 1Hz tick, cpu_isolated threads will never re-enter userspace, since a tick will always be pending.
Chris Metcalf (5):
  cpu_isolated: add initial support
  cpu_isolated: support PR_CPU_ISOLATED_STRICT mode
  cpu_isolated: provide strict mode configurable signal
  cpu_isolated: add debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt |   7 +++
 arch/arm64/kernel/ptrace.c          |   5 ++
 arch/tile/kernel/process.c          |   9 +++
 arch/tile/kernel/ptrace.c           |   5 +-
 arch/tile/mm/homecache.c            |   5 +-
 arch/x86/kernel/ptrace.c            |   2 +
 include/linux/context_tracking.h    |  11 +++-
 include/linux/cpu_isolated.h        |  42 +++++++++++++
 include/linux/sched.h               |   3 +
 include/linux/vmstat.h              |   2 +
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |  12 +++-
 kernel/irq_work.c                   |   5 +-
 kernel/sched/core.c                 |  21 +++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |   7 +++
 kernel/sys.c                        |   8 +++
 kernel/time/Kconfig                 |  20 +++++++
 kernel/time/Makefile                |   1 +
 kernel/time/cpu_isolated.c          | 116 ++++++++++++++++++++++++++++++++++++
 kernel/time/tick-sched.c            |   3 +-
 mm/vmstat.c                         |  14 +++++
 23 files changed, 305 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/cpu_isolated.h
 create mode 100644 kernel/time/cpu_isolated.c

--
2.1.2

^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v5 2/6] cpu_isolated: add initial support 2015-07-28 19:49 ` [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full Chris Metcalf @ 2015-07-28 19:49 ` Chris Metcalf [not found] ` <1438112980-9981-3-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-07-28 19:49 ` [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf ` (2 subsequent siblings) 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new CPU_ISOLATED Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "cpu_isolated" state is then indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. 
Initially, there are only three actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, it calls quiet_vmstat() to quieten the vmstat worker to avoid a follow-on interrupt. Finally, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 9 ++++++ include/linux/cpu_isolated.h | 24 +++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 ++++ kernel/context_tracking.c | 3 ++ kernel/sys.c | 8 +++++ kernel/time/Kconfig | 20 +++++++++++++ kernel/time/Makefile | 1 + kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 144 insertions(+) create mode 100644 include/linux/cpu_isolated.h create mode 100644 kernel/time/cpu_isolated.c diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..7db6f8386417 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_CPU_ISOLATED +void cpu_isolated_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h new file mode 100644 index 000000000000..a3d17360f7ae --- /dev/null +++ b/include/linux/cpu_isolated.h @@ -0,0 +1,24 @@ +/* + * CPU isolation related global functions + */ +#ifndef _LINUX_CPU_ISOLATED_H +#define _LINUX_CPU_ISOLATED_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_CPU_ISOLATED +static inline bool is_cpu_isolated(void) +{ + return 
tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + +extern void cpu_isolated_enter(void); +extern void cpu_isolated_wait(void); +#else +static inline bool is_cpu_isolated(void) { return false; } +static inline void cpu_isolated_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 04b5ada460b4..0bb248385d88 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1776,6 +1776,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_CPU_ISOLATED + unsigned int cpu_isolated_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..36b6509c3e2a 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/cpu_isolated.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. 
*/ if (state == CONTEXT_USER) { + if (is_cpu_isolated()) + cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..c68417ff4800 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_CPU_ISOLATED + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index 579ce1b929af..141969149994 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -195,5 +195,25 @@ config HIGH_RES_TIMERS hardware is not capable then this option only increases the size of the kernel image. +config CPU_ISOLATED + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate" + themselves from the kernel. On return to userspace, + cpu-isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + tasks in userspace, such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a best-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. 
+ endmenu endif diff --git a/kernel/time/Makefile b/kernel/time/Makefile index 49eca0beed32..984081cce974 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o obj-$(CONFIG_TIMER_STATS) += timer_stats.o obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o obj-$(CONFIG_TEST_UDELAY) += test_udelay.o +obj-$(CONFIG_CPU_ISOLATED) += cpu_isolated.o diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c new file mode 100644 index 000000000000..e27259f30caf --- /dev/null +++ b/kernel/time/cpu_isolated.c @@ -0,0 +1,71 @@ +/* + * linux/kernel/time/cpu_isolated.c + * + * Implementation for cpu isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/cpu_isolated.h> +#include "tick-sched.h" + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. + */ +void __weak cpu_isolated_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In cpu_isolated mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two cpu_isolated processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. 
*/ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + cpu_isolated_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support [not found] ` <1438112980-9981-3-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-08-12 16:00 ` Frederic Weisbecker 2015-08-12 18:22 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-08-12 16:00 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote: > The existing nohz_full mode is designed as a "soft" isolation mode > that makes tradeoffs to minimize userspace interruptions while > still attempting to avoid overheads in the kernel entry/exit path, > to provide 100% kernel semantics, etc. > > However, some applications require a "hard" commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications > to elect to have the "hard" semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. We are doing this at the process level but the isolation works on the CPU scope... Now I wonder if prctl is the right interface. That said the user is rather interested in isolating a task. The CPU being the backend eventually. For example if the task is migrated by accident, we want it to be warned about that. And if the isolation is done on the CPU level instead of the task level, this won't happen. I'm also afraid that the naming clashes with cpu_isolated_map, although it could be a subset of it. 
So probably in this case we should consider talking about task rather than CPU isolation and change naming accordingly (sorry, I know I suggested cpu_isolation.c, I guess I had to see the result to realize). We must sort that out first. Either we consider isolation on the task level (and thus the underlying CPU by backend effect) and we use prctl(). Or we do this on the CPU level and we use a specific syscall or sysfs which takes effect on any task in the relevant isolated CPUs. What do you think? It would be nice to hear others opinions as well. > The kernel must be built with the new CPU_ISOLATED Kconfig flag > to enable this mode, and the kernel booted with an appropriate > nohz_full=CPULIST boot argument. The "cpu_isolated" state is then > indicated by setting a new task struct field, cpu_isolated_flags, > to the value passed by prctl(). When the _ENABLE bit is set for a > task, and it is returning to userspace on a nohz_full core, it calls > the new cpu_isolated_enter() routine to take additional actions > to help the task avoid being interrupted in the future. > > Initially, there are only three actions taken. First, the > task calls lru_add_drain() to prevent being interrupted by a > subsequent lru_add_drain_all() call on another core. Then, it calls > quiet_vmstat() to quieten the vmstat worker to avoid a follow-on > interrupt. Finally, the code checks for pending timer interrupts > and quiesces until they are no longer pending. As a result, sys > calls (and page faults, etc.) can be inordinately slow. However, > this quiescing guarantees that no unexpected interrupts will occur, > even if the application intentionally calls into the kernel. 
> > Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> > --- > arch/tile/kernel/process.c | 9 ++++++ > include/linux/cpu_isolated.h | 24 +++++++++++++++ > include/linux/sched.h | 3 ++ > include/uapi/linux/prctl.h | 5 ++++ > kernel/context_tracking.c | 3 ++ > kernel/sys.c | 8 +++++ > kernel/time/Kconfig | 20 +++++++++++++ > kernel/time/Makefile | 1 + > kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ It's not about time :-) The timer is only a part of the isolation. Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would better fit. kernel/task_isolation.c maybe or just kernel/isolation.c I think I prefer the latter because I'm not only interested in that task hard isolation feature, I would like to also drive all the general isolation operations from there (workqueue affinity, rcu nocb, ...). > 9 files changed, 144 insertions(+) > create mode 100644 include/linux/cpu_isolated.h > create mode 100644 kernel/time/cpu_isolated.c > > diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c > index e036c0aa9792..7db6f8386417 100644 > --- a/arch/tile/kernel/process.c > +++ b/arch/tile/kernel/process.c > @@ -70,6 +70,15 @@ void arch_cpu_idle(void) > _cpu_idle(); > } > > +#ifdef CONFIG_CPU_ISOLATED > +void cpu_isolated_wait(void) > +{ > + set_current_state(TASK_INTERRUPTIBLE); > + _cpu_idle(); > + set_current_state(TASK_RUNNING); > +} I'm still uncomfortable with that. A wake up model could work? 
> +#endif > + > /* > * Release a thread_info structure > */ > diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h > new file mode 100644 > index 000000000000..a3d17360f7ae > --- /dev/null > +++ b/include/linux/cpu_isolated.h > @@ -0,0 +1,24 @@ > +/* > + * CPU isolation related global functions > + */ > +#ifndef _LINUX_CPU_ISOLATED_H > +#define _LINUX_CPU_ISOLATED_H > + > +#include <linux/tick.h> > +#include <linux/prctl.h> > + > +#ifdef CONFIG_CPU_ISOLATED > +static inline bool is_cpu_isolated(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); > +} > + > +extern void cpu_isolated_enter(void); > +extern void cpu_isolated_wait(void); > +#else > +static inline bool is_cpu_isolated(void) { return false; } > +static inline void cpu_isolated_enter(void) { } > +#endif And all the naming should be about task as well, if we take that task direction. > + > +#endif > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 04b5ada460b4..0bb248385d88 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1776,6 +1776,9 @@ struct task_struct { > unsigned long task_state_change; > #endif > int pagefault_disabled; > +#ifdef CONFIG_CPU_ISOLATED > + unsigned int cpu_isolated_flags; > +#endif Can't we add a new flag to tsk->flags? There seem to be some values remaining. Thanks. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support 2015-08-12 16:00 ` Frederic Weisbecker @ 2015-08-12 18:22 ` Chris Metcalf [not found] ` <55CB8ED1.6030806-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-08-12 18:22 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 08/12/2015 12:00 PM, Frederic Weisbecker wrote: > On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote: >> The existing nohz_full mode is designed as a "soft" isolation mode >> that makes tradeoffs to minimize userspace interruptions while >> still attempting to avoid overheads in the kernel entry/exit path, >> to provide 100% kernel semantics, etc. >> >> However, some applications require a "hard" commitment from the >> kernel to avoid interruptions, in particular userspace device >> driver style applications, such as high-speed networking code. >> >> This change introduces a framework to allow applications >> to elect to have the "hard" semantics as needed, specifying >> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >> Subsequent commits will add additional flags and additional >> semantics. > We are doing this at the process level but the isolation works on > the CPU scope... Now I wonder if prctl is the right interface. > > That said the user is rather interested in isolating a task. The CPU > being the backend eventually. > > For example if the task is migrated by accident, we want it to be > warned about that. And if the isolation is done on the CPU level > instead of the task level, this won't happen. 
> > I'm also afraid that the naming clashes with cpu_isolated_map, > although it could be a subset of it. > > So probably in this case we should consider talking about task rather > than CPU isolation and change naming accordingly (sorry, I know I > suggested cpu_isolation.c, I guess I had to see the result to realize). > > We must sort that out first. Either we consider isolation on the task > level (and thus the underlying CPU by backend effect) and we use prctl(). > Or we do this on the CPU level and we use a specific syscall or sysfs > which takes effect on any task in the relevant isolated CPUs. > > What do you think? Yes, definitely task-centric is the right model. With the original tilegx version of this code, we also checked that the process had only a single core in its affinity mask, and that the single core in question was a nohz_full core, before allowing the "task isolated" mode to take effect. I didn't do that in this round of patches because it seemed a little silly in that the user could then immediately reset their affinity to another core and lose the effect, and it wasn't clear how to handle that: do we return EINVAL from sched_setaffinity() after enabling the "task isolated" mode? That seems potentially ugly, maybe standards-violating, etc. So I didn't bother. But you could certainly argue for failing prctl() in that case anyway, as a way to make sure users aren't doing something stupid like calling the prctl() from a task that's running on a housekeeping core. And you could even argue for doing some kind of console spew if you try to migrate a task that is in "task isolation" state - though I suppose if you migrate it to another isolcpus and nohz_full core, maybe that's kind of reasonable and doesn't deserve a warning? I'm not sure. >> The kernel must be built with the new CPU_ISOLATED Kconfig flag >> to enable this mode, and the kernel booted with an appropriate >> nohz_full=CPULIST boot argument. 
The "cpu_isolated" state is then >> indicated by setting a new task struct field, cpu_isolated_flags, >> to the value passed by prctl(). When the _ENABLE bit is set for a >> task, and it is returning to userspace on a nohz_full core, it calls >> the new cpu_isolated_enter() routine to take additional actions >> to help the task avoid being interrupted in the future. >> >> Initially, there are only three actions taken. First, the >> task calls lru_add_drain() to prevent being interrupted by a >> subsequent lru_add_drain_all() call on another core. Then, it calls >> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on >> interrupt. Finally, the code checks for pending timer interrupts >> and quiesces until they are no longer pending. As a result, sys >> calls (and page faults, etc.) can be inordinately slow. However, >> this quiescing guarantees that no unexpected interrupts will occur, >> even if the application intentionally calls into the kernel. >> >> Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >> --- >> arch/tile/kernel/process.c | 9 ++++++ >> include/linux/cpu_isolated.h | 24 +++++++++++++++ >> include/linux/sched.h | 3 ++ >> include/uapi/linux/prctl.h | 5 ++++ >> kernel/context_tracking.c | 3 ++ >> kernel/sys.c | 8 +++++ >> kernel/time/Kconfig | 20 +++++++++++++ >> kernel/time/Makefile | 1 + >> kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ > It's not about time :-) > > The timer is only a part of the isolation. > > Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would > better fit. > > kernel/task_isolation.c maybe or just kernel/isolation.c > > I think I prefer the latter because I'm not only interested in that task > hard isolation feature, I would like to also drive all the general isolation > operations from there (workqueue affinity, rcu nocb, ...). 
That's reasonable, but I think the "task isolation" naming is probably better for all the stuff that we're doing in this patch. In other words, we probably should use "task_isolation" as the prefix for symbols names and API names, even if we put the code in kernel/isolation.c for now in anticipation of non-task isolation being added later. I think my instinct would still be to call it kernel/task_isolation.c until we actually add some non-task isolation, and at that point we can decide if it makes sense to rename the file, or put the new code somewhere else, but I'm OK with doing it the way I described in the previous paragraph if you think it's better. >> +#ifdef CONFIG_CPU_ISOLATED >> +void cpu_isolated_wait(void) >> +{ >> + set_current_state(TASK_INTERRUPTIBLE); >> + _cpu_idle(); >> + set_current_state(TASK_RUNNING); >> +} > I'm still uncomfortable with that. A wake up model could work? I don't know exactly what you have in mind. The theory is that at this point we're ready to return to user space and we're just waiting for a timer tick that is guaranteed to arrive, since there is something pending for the timer. And, this is an arch-specific method anyway; the generic method is actually checking to see if a signal has been delivered, scheduling is needed, etc., each time around the loop, so if you're not sure your architecture will do the right thing, just don't provide a method that idles while waiting. For tilegx I'm sure it works correctly, so I'm OK providing that method. >> +extern void cpu_isolated_enter(void); >> +extern void cpu_isolated_wait(void); >> +#else >> +static inline bool is_cpu_isolated(void) { return false; } >> +static inline void cpu_isolated_enter(void) { } >> +#endif > And all the naming should be about task as well, if we take that task direction. As discussed above, probably task_isolation_enter(), etc. 
>> + >> +#endif >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 04b5ada460b4..0bb248385d88 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1776,6 +1776,9 @@ struct task_struct { >> unsigned long task_state_change; >> #endif >> int pagefault_disabled; >> +#ifdef CONFIG_CPU_ISOLATED >> + unsigned int cpu_isolated_flags; >> +#endif > Can't we add a new flag to tsk->flags? There seem to be some values remaining. Yeah, I thought of that, but it seems like a pretty scarce resource, and I wasn't sure it was the right thing to do. Also, I'm not actually sure why the lowest two bits aren't apparently being used; looks like PF_EXITING (0x4) is the first bit used. And there are only three more bits higher up in the word that are not assigned. Also, right now we are allowing users to customize the signal delivered for STRICT violation, and that signal value is stored in the cpu_isolated_flags word as well, so we really don't have room in tsk->flags for all of that anyway. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support [not found] ` <55CB8ED1.6030806-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-08-26 15:26 ` Frederic Weisbecker 2015-08-26 15:55 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-08-26 15:26 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote: > On 08/12/2015 12:00 PM, Frederic Weisbecker wrote: > >>+#ifdef CONFIG_CPU_ISOLATED > >>+void cpu_isolated_wait(void) > >>+{ > >>+ set_current_state(TASK_INTERRUPTIBLE); > >>+ _cpu_idle(); > >>+ set_current_state(TASK_RUNNING); > >>+} > >I'm still uncomfortable with that. A wake up model could work? > > I don't know exactly what you have in mind. The theory is that > at this point we're ready to return to user space and we're just > waiting for a timer tick that is guaranteed to arrive, since there > is something pending for the timer. Hmm, ok I'm going to discuss that in the new version. One worry is that it gets racy and we sleep there for ever. > > And, this is an arch-specific method anyway; the generic method > is actually checking to see if a signal has been delivered, > scheduling is needed, etc., each time around the loop, so if > you're not sure your architecture will do the right thing, just > don't provide a method that idles while waiting. For tilegx I'm > sure it works correctly, so I'm OK providing that method. Yes but we do busy waiting on all other archs then. And since we can wait for a while there, it doesn't look sane. 
> >>diff --git a/include/linux/sched.h b/include/linux/sched.h > >>index 04b5ada460b4..0bb248385d88 100644 > >>--- a/include/linux/sched.h > >>+++ b/include/linux/sched.h > >>@@ -1776,6 +1776,9 @@ struct task_struct { > >> unsigned long task_state_change; > >> #endif > >> int pagefault_disabled; > >>+#ifdef CONFIG_CPU_ISOLATED > >>+ unsigned int cpu_isolated_flags; > >>+#endif > >Can't we add a new flag to tsk->flags? There seem to be some values remaining. > > Yeah, I thought of that, but it seems like a pretty scarce resource, > and I wasn't sure it was the right thing to do. Also, I'm not actually > sure why the lowest two bits aren't apparently being used Probably they were used but got removed. > looks > like PF_EXITING (0x4) is the first bit used. And there are only three > more bits higher up in the word that are not assigned. Which makes room for 5 :) > > Also, right now we are allowing users to customize the signal delivered > for STRICT violation, and that signal value is stored in the > cpu_isolated_flags word as well, so we really don't have room in > tsk->flags for all of that anyway. Yeah indeed, ok lets keep it that way for now. Thanks. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support 2015-08-26 15:26 ` Frederic Weisbecker @ 2015-08-26 15:55 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-08-26 15:55 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel On 08/26/2015 11:26 AM, Frederic Weisbecker wrote: > On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote: >> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote: >>>> +#ifdef CONFIG_CPU_ISOLATED >>>> +void cpu_isolated_wait(void) >>>> +{ >>>> + set_current_state(TASK_INTERRUPTIBLE); >>>> + _cpu_idle(); >>>> + set_current_state(TASK_RUNNING); >>>> +} >>> I'm still uncomfortable with that. A wake up model could work? >> I don't know exactly what you have in mind. The theory is that >> at this point we're ready to return to user space and we're just >> waiting for a timer tick that is guaranteed to arrive, since there >> is something pending for the timer. > Hmm, ok I'm going to discuss that in the new version. One worry is that > it gets racy and we sleep there for ever. > >> And, this is an arch-specific method anyway; the generic method >> is actually checking to see if a signal has been delivered, >> scheduling is needed, etc., each time around the loop, so if >> you're not sure your architecture will do the right thing, just >> don't provide a method that idles while waiting. For tilegx I'm >> sure it works correctly, so I'm OK providing that method. > Yes but we do busy waiting on all other archs then. And since we can wait > for a while there, it doesn't look sane. We can wait for a while (potentially multiple ticks), which is certainly a long time, but that's what the user asked for. 
Since we're checking signals and scheduling in the busy loop, we definitely won't get into some nasty unkillable state, which would be the real worst-case. I think the question is, could a process just get stuck there somehow in the normal course of events, where there is a future event on the tick_cpu_device, but no interrupt is enabled that will eventually deal with it? This seems like it would be a pretty fundamental timekeeping bug, so my assumption here is that can't happen, but maybe...? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode 2015-07-28 19:49 ` [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full Chris Metcalf 2015-07-28 19:49 ` [PATCH v5 2/6] cpu_isolated: add initial support Chris Metcalf @ 2015-07-28 19:49 ` Chris Metcalf 2015-07-28 19:49 ` [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal Chris Metcalf 2015-08-25 19:55 ` [PATCH v6 0/6] support "task_isolated" mode for nohz_full Chris Metcalf 3 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Note: Andy Lutomirski points out that improvements are coming to the context_tracking code to make it more robust, which may mean that some of the code suggested here for context_tracking may not be necessary. I am keeping it in the series for now since it is required for it to work based on 4.2-rc3. arch/arm64/kernel/ptrace.c | 5 +++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/cpu_isolated.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/cpu_isolated.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..ff83968ab4d4 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/cpu_isolated.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report cpu_isolated violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && cpu_isolated_strict()) + cpu_isolated_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..e54256c54311 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (cpu_isolated_strict()) + cpu_isolated_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..e5aec57e8e25 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (cpu_isolated_strict()) + cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..590414ef2bf1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/cpu_isolated.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (cpu_isolated_strict()) + cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h index a3d17360f7ae..b0f1c2669b2f 100644 --- a/include/linux/cpu_isolated.h +++ b/include/linux/cpu_isolated.h @@ -15,10 +15,26 @@ static inline bool is_cpu_isolated(void) } extern void cpu_isolated_enter(void); +extern void 
cpu_isolated_syscall(int nr); +extern void cpu_isolated_exception(void); extern void cpu_isolated_wait(void); #else static inline bool is_cpu_isolated(void) { return false; } static inline void cpu_isolated_enter(void) { } +static inline void cpu_isolated_syscall(int nr) { } +static inline void cpu_isolated_exception(void) { } #endif +static inline bool cpu_isolated_strict(void) +{ +#ifdef CONFIG_CPU_ISOLATED + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 36b6509c3e2a..c740850eea11 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c index e27259f30caf..d30bf3852897 100644 --- a/kernel/time/cpu_isolated.c +++ b/kernel/time/cpu_isolated.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/cpu_isolated.h> +#include <asm/unistd.h> #include "tick-sched.h" /* @@ -69,3 +70,40 @@ void cpu_isolated_enter(void) dump_stack(); } } + +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal 2015-07-28 19:49 ` [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full Chris Metcalf 2015-07-28 19:49 ` [PATCH v5 2/6] cpu_isolated: add initial support Chris Metcalf 2015-07-28 19:49 ` [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf @ 2015-07-28 19:49 ` Chris Metcalf 2015-08-25 19:55 ` [PATCH v6 0/6] support "task_isolated" mode for nohz_full Chris Metcalf 3 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/cpu_isolated.c | 17 ++++++++++++----- 2 files changed, 14 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c index d30bf3852897..9f8fcbd97770 100644 --- a/kernel/time/cpu_isolated.c +++ b/kernel/time/cpu_isolated.c @@ -71,11 +71,18 @@ void cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) -{ +static void kill_cpu_isolated_strict_task(int is_syscall) + { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -94,7 +101,7 @@ void cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -105,5 +112,5 @@ void cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v6 0/6] support "task_isolated" mode for nohz_full 2015-07-28 19:49 ` [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full Chris Metcalf ` (2 preceding siblings ...) 2015-07-28 19:49 ` [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal Chris Metcalf @ 2015-08-25 19:55 ` Chris Metcalf 2015-08-25 19:55 ` [PATCH v6 2/6] task_isolation: add initial support Chris Metcalf ` (3 more replies) 3 siblings, 4 replies; 159+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The cover email for the patch series is getting a little unwieldy so I will provide a terser summary here, and just update the list of changes from version to version. Please see the previous versions linked by the In-Reply-To for more detailed comments about changes in earlier versions of the patch series. 
v6: restructured to be a "task_isolation" mode not a "cpu_isolated" mode (Frederic)

v5: rebased on kernel v4.2-rc3
    converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
    incorporates Christoph Lameter's quiet_vmstat() call

v4: rebased on kernel v4.2-rc1
    added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3: remove dependency on cpu_idle subsystem (Thomas Gleixner)
    use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
    use seconds for console messages instead of jiffies (Thomas Gleixner)
    updated commit description for patch 5/5

v2: rename "dataplane" to "cpu_isolated"
    drop ksoftirqd suppression changes (believed no longer needed)
    merge previous "QUIESCE" functionality into baseline functionality
    explicitly track syscalls and exceptions for "STRICT" functionality
    allow configuring a signal to be delivered for STRICT mode failures
    move debug tracking to irq_enter(), not irq_exit()

General summary: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_TASK_ISOLATION to take advantage of this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to control whether processes have requested these stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from.
Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality.

Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on task_isolation cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also, if we don't remove the 1Hz tick, task_isolation threads will never re-enter userspace, since a tick will always be pending.
Chris Metcalf (5): task_isolation: add initial support task_isolation: support PR_TASK_ISOLATION_STRICT mode task_isolation: provide strict mode configurable signal task_isolation: add debug boot flag nohz: task_isolation: allow tick to be fully disabled Christoph Lameter (1): vmstat: provide a function to quiet down the diff processing Documentation/kernel-parameters.txt | 7 +++ arch/arm64/kernel/ptrace.c | 5 ++ arch/tile/kernel/process.c | 9 +++ arch/tile/kernel/ptrace.c | 5 +- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 +++- include/linux/isolation.h | 42 +++++++++++++ include/linux/sched.h | 3 + include/linux/vmstat.h | 2 + include/uapi/linux/prctl.h | 8 +++ init/Kconfig | 20 ++++++ kernel/Makefile | 1 + kernel/context_tracking.c | 12 +++- kernel/irq_work.c | 5 +- kernel/isolation.c | 122 ++++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 21 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 7 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 3 +- mm/vmstat.c | 14 +++++ 23 files changed, 311 insertions(+), 10 deletions(-) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c -- 2.1.2 ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v6 2/6] task_isolation: add initial support 2015-08-25 19:55 ` [PATCH v6 0/6] support "task_isolated" mode for nohz_full Chris Metcalf @ 2015-08-25 19:55 ` Chris Metcalf 2015-08-25 19:55 ` [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf ` (2 subsequent siblings) 3 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new task_isolation_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only three actions taken. 
First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, it calls quiet_vmstat() to quieten the vmstat worker to avoid a follow-on interrupt. Finally, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 9 ++++++ include/linux/isolation.h | 24 +++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 ++++ init/Kconfig | 20 +++++++++++++ kernel/Makefile | 1 + kernel/context_tracking.c | 3 ++ kernel/isolation.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 8 +++++ 9 files changed, 148 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..1d9bd2320a50 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_TASK_ISOLATION +void task_isolation_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..fd04011b1c1e --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,24 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION +static inline bool task_isolation_enabled(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & 
PR_TASK_ISOLATION_ENABLE); +} + +extern void task_isolation_enter(void); +extern void task_isolation_wait(void); +#else +static inline bool task_isolation_enabled(void) { return false; } +static inline void task_isolation_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 04b5ada460b4..2acb618189d0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1776,6 +1776,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..79da784fe17a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */ +#define PR_SET_TASK_ISOLATION 47 +#define PR_GET_TASK_ISOLATION 48 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index af09b4fb43d2..82d313cbd70f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -795,6 +795,26 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. On return to userspace, + isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + tasks in userspace, such as a 10 Gbit network driver in userspace. 
+ + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a best-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. + config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index 43c4c920f30a..9ffb5c021767 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o obj-$(CONFIG_TORTURE_TEST) += torture.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..c57c99f5c4d7 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. */ if (state == CONTEXT_USER) { + if (task_isolation_enabled()) + task_isolation_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..d4618cd9e23d --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,75 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include "time/tick-sched.h" + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. 
+ * + * Note that it must be guaranteed for a particular architecture + * that if next_event is not KTIME_MAX, then a timer interrupt will + * occur, otherwise the sleep may never awaken. + */ +void __weak task_isolation_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In task_isolation mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two task_isolation processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void task_isolation_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. 
*/ + quiet_vmstat(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + task_isolation_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..c7024be2d79b 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + me->task_isolation_flags = arg2; + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-25 19:55 ` [PATCH v6 0/6] support "task_isolated" mode for nohz_full Chris Metcalf 2015-08-25 19:55 ` [PATCH v6 2/6] task_isolation: add initial support Chris Metcalf @ 2015-08-25 19:55 ` Chris Metcalf 2015-08-26 10:36 ` Will Deacon [not found] ` <1440532555-15492-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-09-28 15:17 ` [PATCH v7 00/11] support "task_isolated" mode for nohz_full Chris Metcalf 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/ptrace.c | 5 +++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..e3d83a12f3cf 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report task_isolation violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..2f9ce9466daf 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void 
task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..a89a6e9adfb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,40 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-25 19:55 ` [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf @ 2015-08-26 10:36 ` Will Deacon 2015-08-26 15:10 ` Chris Metcalf 2015-08-28 15:31 ` [PATCH v6.1 " Chris Metcalf 0 siblings, 2 replies; 159+ messages in thread From: Will Deacon @ 2015-08-26 10:36 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc@vger.kernel.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org Hi Chris, On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote: > With task_isolation mode, the task is in principle guaranteed not to > be interrupted by the kernel, but only if it behaves. In particular, > if it enters the kernel via system call, page fault, or any of a > number of other synchronous traps, it may be unexpectedly exposed > to long latencies. Add a simple flag that puts the process into > a state where any such kernel entry is fatal. > > To allow the state to be entered and exited, we ignore the prctl() > syscall so that we can clear the bit again later, and we ignore > exit/exit_group to allow exiting the task without a pointless signal > killing you as you try to do so. > > This change adds the syscall-detection hooks only for x86, arm64, > and tile. > > The signature of context_tracking_exit() changes to report whether > we, in fact, are exiting back to user space, so that we can track > user exceptions properly separately from other kernel entries. 
> > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > arch/arm64/kernel/ptrace.c | 5 +++++ > arch/tile/kernel/ptrace.c | 5 ++++- > arch/x86/kernel/ptrace.c | 2 ++ > include/linux/context_tracking.h | 11 ++++++++--- > include/linux/isolation.h | 16 ++++++++++++++++ > include/uapi/linux/prctl.h | 1 + > kernel/context_tracking.c | 9 ++++++--- > kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ > 8 files changed, 80 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c > index d882b833dbdb..e3d83a12f3cf 100644 > --- a/arch/arm64/kernel/ptrace.c > +++ b/arch/arm64/kernel/ptrace.c > @@ -37,6 +37,7 @@ > #include <linux/regset.h> > #include <linux/tracehook.h> > #include <linux/elf.h> > +#include <linux/isolation.h> > > #include <asm/compat.h> > #include <asm/debug-monitors.h> > @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, > > asmlinkage int syscall_trace_enter(struct pt_regs *regs) > { > + /* Ensure we report task_isolation violations in all circumstances. */ > + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) This is going to force us to check TIF_NOHZ on the syscall slowpath even when CONFIG_TASK_ISOLATION=n. > + task_isolation_syscall(regs->syscallno); > + > /* Do the secure computing check first; failures should be fast. */ Here we have the usual priority problems with all the subsystems that hook into the syscall path. If a prctl is later rewritten to a different syscall, do you care about catching it? Either way, the comment about doing secure computing "first" needs fixing. Cheers, Will ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-26 10:36 ` Will Deacon @ 2015-08-26 15:10 ` Chris Metcalf [not found] ` <55DDD6EA.3070307-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-08-28 15:31 ` [PATCH v6.1 " Chris Metcalf 1 sibling, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-08-26 15:10 UTC (permalink / raw) To: Will Deacon Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc@vger.kernel.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org On 08/26/2015 06:36 AM, Will Deacon wrote: > Hi Chris, > > On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote: >> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c >> index d882b833dbdb..e3d83a12f3cf 100644 >> --- a/arch/arm64/kernel/ptrace.c >> +++ b/arch/arm64/kernel/ptrace.c >> @@ -37,6 +37,7 @@ >> #include <linux/regset.h> >> #include <linux/tracehook.h> >> #include <linux/elf.h> >> +#include <linux/isolation.h> >> >> #include <asm/compat.h> >> #include <asm/debug-monitors.h> >> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, >> >> asmlinkage int syscall_trace_enter(struct pt_regs *regs) >> { >> + /* Ensure we report task_isolation violations in all circumstances. */ >> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) > This is going to force us to check TIF_NOHZ on the syscall slowpath even > when CONFIG_TASK_ISOLATION=n. Yes, good catch. I was thinking the "&& false" would suppress the TIF test but I forgot that test_bit() takes a volatile argument, so it gets evaluated even though the result isn't actually used. But I don't want to just reorder the two tests, because when isolation is enabled, testing TIF_NOHZ first is better. 
I think probably the right solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that test, even though that is a little crufty. The alternative is to provide a task_isolation_configured() macro that just returns true or false, and make it a three-part "&&" test with that new macro first, but that seems a little crufty as well. Do you have a preference? >> + task_isolation_syscall(regs->syscallno); >> + >> /* Do the secure computing check first; failures should be fast. */ > Here we have the usual priority problems with all the subsystems that > hook into the syscall path. If a prctl is later rewritten to a different > syscall, do you care about catching it? Either way, the comment about > doing secure computing "first" needs fixing. I admit I am unclear on the utility of rewriting prctl. My instinct is that we are trying to catch userspace invocations of prctl and allow them, and fail most everything else, so doing it pre-rewrite seems OK. I'm not sure if it makes sense to catch it before or after the secure computing check, though. On reflection maybe doing it afterwards makes more sense - what do you think? Thanks! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread

* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode [not found] ` <55DDD6EA.3070307-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-09-02 10:13 ` Will Deacon 0 siblings, 0 replies; 159+ messages in thread From: Will Deacon @ 2015-09-02 10:13 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, Aug 26, 2015 at 04:10:34PM +0100, Chris Metcalf wrote: > On 08/26/2015 06:36 AM, Will Deacon wrote: > > On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote: > >> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c > >> index d882b833dbdb..e3d83a12f3cf 100644 > >> --- a/arch/arm64/kernel/ptrace.c > >> +++ b/arch/arm64/kernel/ptrace.c > >> @@ -37,6 +37,7 @@ > >> #include <linux/regset.h> > >> #include <linux/tracehook.h> > >> #include <linux/elf.h> > >> +#include <linux/isolation.h> > >> > >> #include <asm/compat.h> > >> #include <asm/debug-monitors.h> > >> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, > >> > >> asmlinkage int syscall_trace_enter(struct pt_regs *regs) > >> { > >> + /* Ensure we report task_isolation violations in all circumstances. */ > >> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) > > This is going to force us to check TIF_NOHZ on the syscall slowpath even > > when CONFIG_TASK_ISOLATION=n. > > Yes, good catch. I was thinking the "&& false" would suppress the TIF > test but I forgot that test_bit() takes a volatile argument, so it gets > evaluated even though the result isn't actually used. 
> > But I don't want to just reorder the two tests, because when isolation > is enabled, testing TIF_NOHZ first is better. I think probably the right > solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that > test, even though that is a little crufty. The alternative is to provide > a task_isolation_configured() macro that just returns true or false, and > make it a three-part "&&" test with that new macro first, but > that seems a little crufty as well. Do you have a preference? Maybe use IS_ENABLED(CONFIG_TASK_ISOLATION) ? > >> + task_isolation_syscall(regs->syscallno); > >> + > >> /* Do the secure computing check first; failures should be fast. */ > > Here we have the usual priority problems with all the subsystems that > > hook into the syscall path. If a prctl is later rewritten to a different > > syscall, do you care about catching it? Either way, the comment about > > doing secure computing "first" needs fixing. > > I admit I am unclear on the utility of rewriting prctl. My instinct is that > we are trying to catch userspace invocations of prctl and allow them, > and fail most everything else, so doing it pre-rewrite seems OK. > > I'm not sure if it makes sense to catch it before or after the > secure computing check, though. On reflection maybe doing it > afterwards makes more sense - what do you think? I don't have a strong preference (I really hate all these hooks we have on the syscall entry/exit path), but we do need to make sure that the behaviour is consistent across architectures. Will ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-26 10:36 ` Will Deacon 2015-08-26 15:10 ` Chris Metcalf @ 2015-08-28 15:31 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-08-28 15:31 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. For arm64 we use an explict #ifdef CONFIG_TASK_ISOLATION so we can both achieve no overhead for !TASK_ISOLATION, but also achieve low latency (test TIF_NOHZ first) for TASK_ISOLATION. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- This "v6.1" is just a tweak to the existing v6 series to reflect Will Deacon's suggestions about the arm64 syscall entry code. I've updated the git tree with this updated patch in the series. 
A more disruptive change would be to capture the thread flags up front like x86 and tile, which allows the test itself to be optimized away if the task_isolation call becomes a no-op. arch/arm64/kernel/ptrace.c | 6 ++++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 81 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..5d4284445f70 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs) if (secure_computing() == -1) return -1; +#ifdef CONFIG_TASK_ISOLATION + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); +#endif + if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..2f9ce9466daf 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void 
task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..a89a6e9adfb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,40 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v6 4/6] task_isolation: provide strict mode configurable signal [not found] ` <1440532555-15492-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-08-25 19:55 ` Chris Metcalf 2015-08-28 19:22 ` Andy Lutomirski 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a task_isolation process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. 
Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> --- include/uapi/linux/prctl.h | 2 ++ kernel/isolation.c | 17 +++++++++++++---- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index e16e13911e8a..2a4ddc890e22 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) # define PR_TASK_ISOLATION_STRICT (1 << 1) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index a89a6e9adfb4..b776aa632c8f 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -75,11 +75,20 @@ void task_isolation_enter(void) } } -static void kill_task_isolation_strict_task(void) +static void kill_task_isolation_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); + if (sig == 0) + sig = SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -98,7 +107,7 @@ void task_isolation_syscall(int syscall) pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_task_isolation_strict_task(); + kill_task_isolation_strict_task(1); } /* @@ -109,5 +118,5 @@ void task_isolation_exception(void) { pr_warn("%s/%d: task_isolation strict mode violated by exception\n", current->comm, current->pid); - kill_task_isolation_strict_task(); + kill_task_isolation_strict_task(0); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal 2015-08-25 19:55 ` [PATCH v6 4/6] task_isolation: provide strict mode configurable signal Chris Metcalf @ 2015-08-28 19:22 ` Andy Lutomirski [not found] ` <20150902101347.GF25720-5wv7dgnIgG8@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-08-28 19:22 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, Aug 25, 2015 at 12:55 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > Allow userspace to override the default SIGKILL delivered > when a task_isolation process in STRICT mode does a syscall > or otherwise synchronously enters the kernel. > > In addition to being able to set the signal, we now also > pass whether or not the interruption was from a syscall in > the si_code field of the siginfo. 
> > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > include/uapi/linux/prctl.h | 2 ++ > kernel/isolation.c | 17 +++++++++++++---- > 2 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index e16e13911e8a..2a4ddc890e22 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -195,5 +195,7 @@ struct prctl_mm_map { > #define PR_GET_TASK_ISOLATION 48 > # define PR_TASK_ISOLATION_ENABLE (1 << 0) > # define PR_TASK_ISOLATION_STRICT (1 << 1) > +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) > +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) > > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/isolation.c b/kernel/isolation.c > index a89a6e9adfb4..b776aa632c8f 100644 > --- a/kernel/isolation.c > +++ b/kernel/isolation.c > @@ -75,11 +75,20 @@ void task_isolation_enter(void) > } > } > > -static void kill_task_isolation_strict_task(void) > +static void kill_task_isolation_strict_task(int is_syscall) > { > + siginfo_t info = {}; > + int sig; > + > dump_stack(); > current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; > - send_sig(SIGKILL, current, 1); > + > + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); > + if (sig == 0) > + sig = SIGKILL; > + info.si_signo = sig; > + info.si_code = is_syscall; > + send_sig_info(sig, &info, current); The stuff you're doing here is sufficiently nasty that I think you should add something like: rcu_lockdep_assert(rcu_is_watching(), "some message here"); Because as it stands this is just asking for trouble. For the record, I am *extremely* unhappy with the state of the context tracking hooks. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode [not found] ` <20150902101347.GF25720-5wv7dgnIgG8@public.gmane.org> @ 2015-09-02 18:38 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-09-02 18:38 UTC (permalink / raw) To: Will Deacon, Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: Chris Metcalf This change updates just one patch of the patch series, so rather than spamming out the whole series again, I've just updated this patch: - Will Deacon suggested using IS_ENABLED(CONFIG_TASK_ISOLATION) and also recommended having the same ordering between SECCOMP and TASK_ISOLATION on all platforms, an excellent suggestion. - Andy Lutomirski suggested using rcu_lockdep_assert(rcu_is_watching()) to ensure RCU was properly turned back on during our syscall test-and-kill for strict mode. I will update a full PATCH v7 once there seem to be no further comments on the rest of the v6 series. -- From: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> Date: Tue, 28 Jul 2015 13:25:46 -0400 Subject: [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. 
To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. We specify that it happens immediately after the SECCOMP test, which appropriately should be tested first. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> --- arch/arm64/kernel/ptrace.c | 6 ++++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 10 +++++++++- include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 41 ++++++++++++++++++++++++++++++++++++++++ 8 files changed, 91 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..737f62db8a6f 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs) if (secure_computing() == -1) return -1; + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && + test_thread_flag(TIF_NOHZ) && + task_isolation_strict()) + task_isolation_syscall(regs->syscallno); + if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 
+259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. */ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..821699513a94 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1478,7 +1478,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) */ if (work & _TIF_NOHZ) { user_exit(); - work &= ~_TIF_NOHZ; + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) + work &= ~_TIF_NOHZ; } #ifdef CONFIG_SECCOMP @@ -1527,6 +1528,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) } #endif + /* Now check task isolation, if needed. */ + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { + work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); + } + /* Do our best to finish without phase 2. 
*/ if (work == 0) return ret; /* seccomp and/or nohz only (ret == 0 here) */ diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | 
PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..caa40583fe0b 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,43 @@ 
void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + /* RCU should have been enabled prior to checking the syscall. */ + rcu_lockdep_assert(rcu_is_watching(), "syscall entry without RCU"); + + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v7 00/11] support "task_isolated" mode for nohz_full 2015-08-25 19:55 ` [PATCH v6 0/6] support "task_isolated" mode for nohz_full Chris Metcalf ` (2 preceding siblings ...) [not found] ` <1440532555-15492-1-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-09-28 15:17 ` Chris Metcalf 2015-09-28 15:17 ` [PATCH v7 02/11] task_isolation: add initial support Chris Metcalf ` (3 more replies) 3 siblings, 4 replies; 159+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The cover email for the patch series is getting a little unwieldy so I will provide a terser summary here, and just update the list of changes from version to version. Please see the previous versions linked by the In-Reply-To for more detailed comments about changes in earlier versions of the patch series. v7: The main change in this version is a change in where we call task_isolation_enter(). The arm64 code only invokes the context_tracking code right at kernel entry, and right at kernel exit, and the exit point is too late for task isolation; one of my test cases, when run on arm64, showed that a signal delivered while task isolation is waiting for the timer interrupt to quiesce was not properly handled before returning to userspace. The tilegx code properly handled that case because it ran user_exit() in the work-pending loop. But since arm64 calls user_exit() later, it was too late to go back and handle the signal. 
I decided to make the task isolation work explicit in the "work" loop done on return to userspace, and although I could have done this by hacking up the arm64 assembly code for this purpose, I decided to follow the x86 approach and use the prepare_exit_to_usermode() model, where each architecture handles its work looping in C code. I added that support to arm64 and tile as a prerequisite change, then modified the loop in C to call task isolation appropriately. This both makes the slowpath return-to-user code more maintainable for arm64 and tile going forward, and avoids some of the subtlety where the context tracking code was being asked to invoke task isolation at user_enter() time. As a result of this change, I have moved all the architecture-specific changes to individual patches: two patches to switch arm64 and tile to the prepare_exit_to_usermode() loop, and three patches (one each for x86, arm64, and tile) to add the necessary call to task_isolation(), plus changes to check at syscall entry for strict mode. In addition, since arm64 doesn't use exception_enter(), I added an explicit call to task_isolation_exception() in do_mem_abort() so that page faults would be properly flagged in strict mode. I also added an RCU_LOCKDEP_WARN() at Andy Lutomirski's suggestion. Finally, the patch series is rebased to v4.3-rc1.
v6: restructured to be a "task_isolation" mode not a "cpu_isolated" mode (Frederic)

v5: rebased on kernel v4.2-rc3
    converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
    incorporates Christoph Lameter's quiet_vmstat() call

v4: rebased on kernel v4.2-rc1
    added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3: remove dependency on cpu_idle subsystem (Thomas Gleixner)
    use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
    use seconds for console messages instead of jiffies (Thomas Gleixner)
    updated commit description for patch 5/5

v2: rename "dataplane" to "cpu_isolated"
    drop ksoftirqd suppression changes (believed no longer needed)
    merge previous "QUIESCE" functionality into baseline functionality
    explicitly track syscalls and exceptions for "STRICT" functionality
    allow configuring a signal to be delivered for STRICT mode failures
    move debug tracking to irq_enter(), not irq_exit()

General summary:

The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_TASK_ISOLATION to take advantage of this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to control whether processes have requested these stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from.
Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on task_isolation cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.3-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); note also that the commit is still needed, since without disabling the 1Hz tick, task_isolation threads will never re-enter userspace, as a tick will always be pending.
Chris Metcalf (10):
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: provide strict mode configurable signal
  task_isolation: add debug boot flag
  nohz: task_isolation: allow tick to be fully disabled
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: enable task isolation functionality

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt  |   7 ++
 arch/arm64/include/asm/thread_info.h |  18 +++--
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  10 ++-
 arch/arm64/kernel/signal.c           |  36 +++++++---
 arch/arm64/mm/fault.c                |   8 +++
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 ++-
 arch/tile/kernel/intvec_32.S         |  46 ++++---------
 arch/tile/kernel/intvec_64.S         |  49 +++++---------
 arch/tile/kernel/process.c           |  92 ++++++++++++-----------
 arch/tile/kernel/ptrace.c            |   3 +
 arch/tile/mm/homecache.c             |   5 +-
 arch/x86/entry/common.c              |  45 ++++++++---
 include/linux/context_tracking.h     |  11 ++-
 include/linux/isolation.h            |  42 ++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/vmstat.h               |   2 +
 include/uapi/linux/prctl.h           |   8 +++
 init/Kconfig                         |  20 ++++++
 kernel/Makefile                      |   1 +
 kernel/context_tracking.c            |   9 ++-
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 127 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  21 ++++++
 kernel/signal.c                      |   5 ++
 kernel/smp.c                         |   4 ++
 kernel/softirq.c                     |   7 ++
 kernel/sys.c                         |   8 +++
 kernel/time/tick-sched.c             |   3 +-
 mm/vmstat.c                          |  14 ++++

 31 files changed, 477 insertions(+), 148 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

--
2.1.2

^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v7 02/11] task_isolation: add initial support 2015-09-28 15:17 ` [PATCH v7 00/11] support "task_isolated" mode for nohz_full Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf 2015-10-01 12:14 ` Frederic Weisbecker 2015-09-28 15:17 ` [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf ` (2 subsequent siblings) 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new task_isolation_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only three actions taken.
First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, it calls quiet_vmstat() to quieten the vmstat worker to avoid a follow-on interrupt. Finally, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. The task_isolation_enter() routine must be called just before the hard return to userspace, so it is appropriately placed in the prepare_exit_to_usermode() routine for an individual architecture or some comparable location. Separate patches that follow provide these changes for x86, arm64, and tile. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/isolation.h | 24 +++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 +++ init/Kconfig | 20 ++++++++++++ kernel/Makefile | 1 + kernel/isolation.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 8 +++++ 7 files changed, 138 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..fd04011b1c1e --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,24 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION +static inline bool task_isolation_enabled(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); +} + +extern void task_isolation_enter(void); +extern void task_isolation_wait(void); +#else +static inline bool task_isolation_enabled(void) { return false; } +static inline void 
task_isolation_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index a4ab9daa387c..bd2dc26948a6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1800,6 +1800,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a8d0759a9e40..67224df4b559 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -197,4 +197,9 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */ +#define PR_SET_TASK_ISOLATION 48 +#define PR_GET_TASK_ISOLATION 49 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index c24b6f767bf0..4ff7f052059a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. On return to userspace, + isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + tasks in userspace, such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a best-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. 
+ + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. + config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index 53abf008ecb3..693a2ba35679 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_HAS_IOMEM) += memremap.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..6ace866c69f6 --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,77 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include "time/tick-sched.h" + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. + * + * Note that it must be guaranteed for a particular architecture + * that if next_event is not KTIME_MAX, then a timer interrupt will + * occur, otherwise the sleep may never awaken. + */ +void __weak task_isolation_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In task_isolation mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two task_isolation processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. 
+ */ +void task_isolation_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + if (WARN_ON(irqs_disabled())) + local_irq_enable(); + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + cond_resched(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + task_isolation_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} diff --git a/kernel/sys.c b/kernel/sys.c index fa2f2f671a5c..a2c6eb1d4ad9 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2266,6 +2266,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + me->task_isolation_flags = arg2; + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-09-28 15:17 ` [PATCH v7 02/11] task_isolation: add initial support Chris Metcalf @ 2015-10-01 12:14 ` Frederic Weisbecker 2015-10-01 12:18 ` Thomas Gleixner 2015-10-01 19:25 ` Chris Metcalf 0 siblings, 2 replies; 159+ messages in thread From: Frederic Weisbecker @ 2015-10-01 12:14 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > diff --git a/include/linux/isolation.h b/include/linux/isolation.h > new file mode 100644 > index 000000000000..fd04011b1c1e > --- /dev/null > +++ b/include/linux/isolation.h > @@ -0,0 +1,24 @@ > +/* > + * Task isolation related global functions > + */ > +#ifndef _LINUX_ISOLATION_H > +#define _LINUX_ISOLATION_H > + > +#include <linux/tick.h> > +#include <linux/prctl.h> > + > +#ifdef CONFIG_TASK_ISOLATION > +static inline bool task_isolation_enabled(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); Ok, I may be a bit burdening with that but, how about using the regular existing task flags, and if needed later we can still introduce a new field in struct task_struct? > diff --git a/kernel/isolation.c b/kernel/isolation.c > new file mode 100644 > index 000000000000..6ace866c69f6 > --- /dev/null > +++ b/kernel/isolation.c > @@ -0,0 +1,77 @@ > +/* > + * linux/kernel/isolation.c > + * > + * Implementation for task isolation. > + * > + * Distributed under GPLv2. 
> + */ > + > +#include <linux/mm.h> > +#include <linux/swap.h> > +#include <linux/vmstat.h> > +#include <linux/isolation.h> > +#include "time/tick-sched.h" > + > +/* > + * Rather than continuously polling for the next_event in the > + * tick_cpu_device, architectures can provide a method to save power > + * by sleeping until an interrupt arrives. > + * > + * Note that it must be guaranteed for a particular architecture > + * that if next_event is not KTIME_MAX, then a timer interrupt will > + * occur, otherwise the sleep may never awaken. > + */ > +void __weak task_isolation_wait(void) > +{ > + cpu_relax(); > +} > + > +/* > + * We normally return immediately to userspace. > + * > + * In task_isolation mode we wait until no more interrupts are > + * pending. Otherwise we nap with interrupts enabled and wait for the > + * next interrupt to fire, then loop back and retry. > + * > + * Note that if you schedule two task_isolation processes on the same > + * core, neither will ever leave the kernel, and one will have to be > + * killed manually. Otherwise in situations where another process is > + * in the runqueue on this cpu, this task will just wait for that > + * other task to go idle before returning to user space. > + */ > +void task_isolation_enter(void) > +{ > + struct clock_event_device *dev = > + __this_cpu_read(tick_cpu_device.evtdev); > + struct task_struct *task = current; > + unsigned long start = jiffies; > + bool warned = false; > + > + if (WARN_ON(irqs_disabled())) > + local_irq_enable(); > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + /* Quieten the vmstat worker so it won't interrupt us. */ > + quiet_vmstat(); > + > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { You should add a function in tick-sched.c to get the next tick. This is supposed to be a private field. 
> + if (!warned && (jiffies - start) >= (5 * HZ)) { > + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start) / HZ); > + warned = true; > + } > + cond_resched(); > + if (test_thread_flag(TIF_SIGPENDING)) > + break; Why not use signal_pending()? > + task_isolation_wait(); I still think we could try a wait-wake standard scheme. Thanks. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:14 ` Frederic Weisbecker @ 2015-10-01 12:18 ` Thomas Gleixner 2015-10-01 12:23 ` Frederic Weisbecker 2015-10-01 17:02 ` Chris Metcalf 2015-10-01 19:25 ` Chris Metcalf 1 sibling, 2 replies; 159+ messages in thread From: Thomas Gleixner @ 2015-10-01 12:18 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > > + > > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > > You should add a function in tick-sched.c to get the next tick. This > is supposed to be a private field. Just to make it clear. Neither the above nor a similar check in tick-sched.c is going to happen. This busy waiting is just horrible. Get your act together and solve the problems at the root and do not inflict your quick and dirty 'solutions' on us. Thanks, tglx ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:18 ` Thomas Gleixner @ 2015-10-01 12:23 ` Frederic Weisbecker 2015-10-01 12:31 ` Thomas Gleixner 2015-10-01 17:02 ` Chris Metcalf 1 sibling, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-10-01 12:23 UTC (permalink / raw) To: Thomas Gleixner Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, Oct 01, 2015 at 02:18:42PM +0200, Thomas Gleixner wrote: > On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > > > + > > > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > > > > You should add a function in tick-sched.c to get the next tick. This > > is supposed to be a private field. > > Just to make it clear. Neither the above nor a similar check in > tick-sched.c is going to happen. > > This busy waiting is just horrible. Get your act together and solve > the problems at the root and do not inflict your quick and dirty > 'solutions' on us. That's why I proposed a wait-wake scheme instead with the tick stop code. What's your opinion about such direction? ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:23 ` Frederic Weisbecker @ 2015-10-01 12:31 ` Thomas Gleixner 0 siblings, 0 replies; 159+ messages in thread From: Thomas Gleixner @ 2015-10-01 12:31 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > On Thu, Oct 01, 2015 at 02:18:42PM +0200, Thomas Gleixner wrote: > > On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > > > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > > > > + > > > > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > > > > > > You should add a function in tick-sched.c to get the next tick. This > > > is supposed to be a private field. > > > > Just to make it clear. Neither the above nor a similar check in > > tick-sched.c is going to happen. > > > > This busy waiting is just horrible. Get your act together and solve > > the problems at the root and do not inflict your quick and dirty > > 'solutions' on us. > > That's why I proposed a wait-wake scheme instead with the tick stop > code. What's your opinion about such direction? Definitely more sensible than mindlessly busy looping. Thanks, tglx ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:18 ` Thomas Gleixner 2015-10-01 12:23 ` Frederic Weisbecker @ 2015-10-01 17:02 ` Chris Metcalf [not found] ` <560D6725.9000609-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-10-01 17:02 UTC (permalink / raw) To: Thomas Gleixner, Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/01/2015 08:18 AM, Thomas Gleixner wrote: > On Thu, 1 Oct 2015, Frederic Weisbecker wrote: >> On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: >>> + >>> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { >> You should add a function in tick-sched.c to get the next tick. This >> is supposed to be a private field. > Just to make it clear. Neither the above nor a similar check in > tick-sched.c is going to happen. > > This busy waiting is just horrible. Get your act together and solve > the problems at the root and do not inflict your quick and dirty > 'solutions' on us. Thomas, You've raised a couple of different concerns and I want to make sure I try to address them individually. But first I want to address the question of the basic semantics of the patch series. I wrote up a description of why it's useful in my email yesterday: https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com I haven't directly heard from you as to whether you buy the basic premise of "hard isolation" in terms of protecting tasks from all kernel interrupts while they execute in userspace. I will add here that we've heard from multiple customers that the equivalent Tilera functionality (Zero-Overhead Linux) was the thing that brought them to buy our hardware rather than a competitor's. 
It's allowed them to write code that runs under a full-featured Linux environment rather than doing the thing that they otherwise would have been required to do, which is to target a minimal bare-metal environment. So as a feature, if we can gain consensus on an implementation of it, I think it will be an important step for that class of users, and potential users, of Linux. So I first want to address what is effectively the API concern that you raised, namely that you're concerned that there is a wait loop in the implementation. The nice thing here is that there is in fact no requirement in the API/ABI that we have a wait loop in the kernel at all. Let's say hypothetically that in the future we come up with a way to guarantee, perhaps in some constrained kind of way, that you can enter and exit the kernel and are guaranteed no further timer interrupts, and we are so confident of this property that we don't have to test for it programmatically on kernel exit. (In fact, we would likely still use the task_isolation_debug boot flag to generate a console warning if it ever did happen, but whatever.) At this point we could simply remove the timer interrupt test loop in task_isolation_wait(); the applications would be none the wiser, and the kernel would be that much cleaner. However, today, and I think for the future, I see that loop as an important backstop for whatever timer-elimination coding happens. In general, the hard task-isolation requirement is something that is of particular interest only to a subset of the kernel community. As the kernel grows, adds features, re-implements functionality, etc., it seems entirely likely that odd bits of deferred functionality might be added in the same way that RCU, workqueues, etc., have done in the past. Or, applications might exercise unusual corners of the kernel's semantics and come across an existing mechanism that ends up enabling kernel ticks (maybe only one or two) before returning to userspace. 
The proposed busy-loop just prevents that from damaging the application. I'm skeptical that we can prevent all such possible changes today and in the future, and I think the loop is a simple way of arranging to avoid breaking applications with interrupts, that only triggers for applications that have requested it, on cores that have been configured to support it. One additional insight that argues in favor of a busy-waiting solution is that a task that requests task isolation is almost certainly alone on the core. If multiple tasks are in fact runnable on that core, we have already abandoned the ability to use proper task isolation since we will want to use timer ticks to run the scheduler for pre-emption. So we only busy wait when, in fact, no other useful work is likely to get done on that core anyway. The other questions you raise have to do with the mechanism for ensuring that we wait until no timer interrupts are scheduled. First is the question of how we detect that case. As I said yesterday, the original approach I chose for the Tilera implementation was one where we simply wait until the timer interrupt is masked (as is done via the set_state_shutdown, set_state_oneshot, and tick_resume callbacks in the tile clock_event_device). When unmasked, the timer down-counter just counts down to zero, fires the interrupt, resets to its start value, and counts down again until it fires again. So we use masking of the interrupt to turn off the timer tick. Once we have done so, we are guaranteed no further timer interrupts can occur. I'm less familiar with the timer subsystems of other architectures, but there are clearly per-platform ways to make the same kinds of checks. If this seems like a better approach, I'm happy to work to add the necessary checks on tile, arm64, and x86, though I'd certainly benefit from some guidance on the timer implementation on the latter two platforms. 
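In clockevents terms, the tile-style "timer is masked" test described here could be phrased as a check of the per-cpu event device's state. The following is only a sketch under my own naming, and it assumes a single clock_event_device per cpu — exactly the "multiple timer sources" caveat raised below:

```c
/*
 * Hypothetical per-arch check: the cpu-local timer cannot fire while
 * the event device sits in the shutdown state (on tile, the
 * set_state_shutdown callback masks the down-counter interrupt).
 */
static bool local_timer_quiesced(void)
{
	struct clock_event_device *evt =
		__this_cpu_read(tick_cpu_device.evtdev);

	return clockevent_state_shutdown(evt);
}
```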
One reason this might be necessary is if there is support on some platforms for multiple timer interrupts any of which can fire, not just a single timer driven by the clock_event_device. I'm not sure whether this is ever in fact a problem, but if it is, that would seem like it would almost certainly require per-architecture code to determine whether all the relevant timers were quiesced. However, I'm not sure whether you don't like the fact of checking the next_event in tick_cpu_device per se, or if it's the busy-waiting we do when it indicates a pending timer that bothers you. If you could help clarify this piece, that would be good. The last question is what to do when we detect that there is a timer interrupt scheduled. The current code spins, testing for resched or signal events, and bails out back to the work-pending loop when that happens. As an extension, one can add support for spinning in a lower-power state, as I did for tile, but this isn't required and frankly isn't that important, since we don't anticipate spending much time in the busy-loop state anyway. The suggestion proposed by Frederic and echoed by you is a wake-wait scheme. I'm curious to hear a more fully fleshed-out suggestion. Clearly, we can test for pending timer interrupts and put the task to sleep (pretty late in the return-to-userspace process, but maybe that's OK). The question is, how and when do we wake the task? We could add a hook to the platform timer shutdown code that would also wake any process that was waiting for the no-timer case; that process would then end up getting scheduled sometime later, and hopefully when it came time for it to try exiting to userspace again, the timer would still be shutdown. This could be problematic if the scheduler code or some other part of the kernel sets up the timer again before scheduling the waiting task back in. Arguably we can work to avoid this if it's really a problem. 
And, there is the question of how to handle multiple timer interrupt sources, since they would all have to quiesce before we would want to wake the waiting process, but the "multiple timers" isn't handled by the current code either, and it seems not to be a problem, so perhaps that's OK. Lastly, of course, is the question of what the kernel would end up doing while waiting: and the answer is almost certainly that it would sit in the cpu idle loop, waiting for the pending timer to fire and wake the waiting task. I'm not convinced that the extra complexity here is worth the gain. But I am open and willing to being convinced that I am wrong, and to implement different approaches. Let me know! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support [not found] ` <560D6725.9000609-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-10-01 21:20 ` Thomas Gleixner 2015-10-02 17:15 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Thomas Gleixner @ 2015-10-01 21:20 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, 1 Oct 2015, Chris Metcalf wrote: > But first I want to address the question of the basic semantics > of the patch series. I wrote up a description of why it's useful > in my email yesterday: > > https://lkml.kernel.org/r/560C4CF4.9090601-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org > > I haven't directly heard from you as to whether you buy the > basic premise of "hard isolation" in terms of protecting tasks > from all kernel interrupts while they execute in userspace. Just for the record. The first serious initiative to solve that problem started here in my own company when I guided Frederic through the endeavour of figuring out what needs to be done to achieve that. That was the assignment of his master thesis, which I gave him. So I'm very well aware why this is needed and what needs to be done. I started this because I got tired of half-baked attempts to solve the problem, which were even worse than what you are trying to do now. > So I first want to address what is effectively the API concern that > you raised, namely that you're concerned that there is a wait > loop in the implementation. That wait loop is just a placeholder for the underlying more serious concern I have with this whole approach. And I raised that concern several times in the past and I'm happy to do so again.
The people working on this, especially you, are just dead set on achieving a certain functionality by jamming half-baked mechanisms into the kernel and especially into the low-level entry/exit code. And that's something which really annoys me, simply because you refuse to tackle the problems which have been identified as needing to be solved 5+ years ago when Frederic did his thesis. Remote accounting: ================== It's not an easy problem, but it's not rocket science either. It's just quite some work. I know that you just don't give a shit about it because your use case does not care. But it's an essential part of the problem space. You just work around it by shutting down the tick completely and rely on the fact that it does not explode in your face today. If we accept your hackery, then who is going to fix it when it explodes in half a year from now? Tick shut down: =============== I still have to understand why the tick is needed at all. There is exactly one reason why the tick must run if a cpu is in full isolation mode: More than one SCHED_OTHER task is runnable on that cpu. There is no other reason, period. If there are requirements today to switch on the tick when a task running in full isolation mode enters the kernel, then they need to be fixed first. And again you don't care, because for your particular use case it's good enough to slap a busy-wait loop into every arch's low-level exit code and be done with it. From your mail excusing that approach: > The nice thing here is that there is in fact no requirement in > the API/ABI that we have a wait loop in the kernel at all. Let's > say hypothetically that in the future we come up with a way to > guarantee, perhaps in some constrained kind of way, that you > can enter and exit the kernel and are guaranteed no further > timer interrupts, .... "Let's say hypothetically" tells it all. You are not even trying to find a proper solution. You just try to get your particular interest solved.
That's exactly the attitude which drives me nuts and that's the point where I say no. You can do all of that in an out of tree patch set as many other hard to solve features have done for years. Yes, it's an annoying catchup game, but it forces you to think harder, refactor code and do a lot of extra work to finally get it merged. Thanks, tglx ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 21:20 ` Thomas Gleixner @ 2015-10-02 17:15 ` Chris Metcalf [not found] ` <560EBBC5.7000709-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-10-02 17:15 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/01/2015 05:20 PM, Thomas Gleixner wrote: > On Thu, 1 Oct 2015, Chris Metcalf wrote: >> But first I want to address the question of the basic semantics >> of the patch series. I wrote up a description of why it's useful >> in my email yesterday: >> >> https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com >> >> I haven't directly heard from you as to whether you buy the >> basic premise of "hard isolation" in terms of protecting tasks >> from all kernel interrupts while they execute in userspace. > Just for the record. The first serious initiative to solve that > problem started here in my own company when I guided Frederic through > the endeavour of figuring out what needs to be done to achieve > that. That was the assignment of his master thesis, which I gave him. Thanks for that background. I didn't know you had gotten Frederic started down that path originally. >> So I first want to address what is effectively the API concern that >> you raised, namely that you're concerned that there is a wait >> loop in the implementation. > That wait loop is just a placeholder for the underlying more serious > concern I have with this whole approach. And I raised that concern > several times in the past and I'm happy to do so again.
> > The people working on this, especially you, are just dead set on > achieving a certain functionality by jamming half-baked mechanisms into > the kernel and especially into the low-level entry/exit code. And > that's something which really annoys me, simply because you refuse to > tackle the problems which have been identified as needing to be solved 5+ > years ago when Frederic did his thesis. I think you raise a good point. I still claim my arguments are plausible, but you may be right that this is an instance where forcing a different approach is better for the kernel community as a whole. Given that, what would you think of the following two changes to my proposed patch series: 1. Rather than spinning in a busy loop if timers are pending, we reschedule if more than one task is ready to run. This directly targets the "architected" problem with the scheduler tick, rather than sweeping up the scheduler tick and any other timers into the one catch-all of "any timer ready to fire". (We can use sched_can_stop_tick() to check the case where other tasks can preempt us.) This would then provide part of the semantics of the task-isolation flag. The other part is running whatever code can be run to avoid the various ways tasks might get interrupted later (lru_add_drain(), quiet_vmstat(), etc) that are not appropriate to run unconditionally for tasks that aren't trying to be isolated. 2. Remove the tie between disabling the 1 Hz max deferment and task isolation per se. Instead add a boot flag (e.g. "debug_1hz_tick") that lets us turn off the 1 Hz tick to make it easy to experiment with both the negative effects of the missing tick, as well as to try to learn in parallel what actual timer interrupts are firing "on purpose" rather than just due to the 1 Hz tick to try to eliminate them as well.
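Under change #1, task_isolation_enter() might shrink to something like the following sketch (my own reconstruction from the description above; lru_add_drain(), quiet_vmstat(), and sched_can_stop_tick() are the real kernel functions named in the text, but the overall shape is illustrative, not a posted patch):

```c
/*
 * Sketch of task_isolation_enter() with the busy-wait removed:
 * quiesce the deferred work that would otherwise interrupt us later,
 * and let the "other runnable tasks" case be handled by rescheduling
 * rather than by waiting out pending timers.
 */
void task_isolation_enter(void)
{
	/* Drain the per-cpu lru pagevecs so no drain work fires later. */
	lru_add_drain();

	/* Fold per-cpu vmstat diffs so the vmstat worker stays idle. */
	quiet_vmstat();

	/* If another task could preempt us, yield now instead of
	 * keeping the scheduler tick alive while we run in userspace. */
	if (!sched_can_stop_tick())
		set_tsk_need_resched(current);
}
```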
For #1, I'm not sure if it's better to hack up the scheduler's pick_next_task callback methods to avoid task-isolation tasks when other tasks are also available to run, or just to observe that there are additional tasks ready to run during exit to userspace, and yield the cpu to allow those other tasks to run. The advantage of doing it at exit to userspace is that we can easily yield in a loop and pay attention to whether we seem not to be making forward progress with that task and generate a suitable warning; it also keeps a lot of task-isolation stuff out of the core scheduler code, which may be a plus. With these changes, and booting with the "debug_1hz_tick" flag, I'm seeing a couple of timer ticks hit my task-isolation task in the first 20 ms or so, and then it quiesces. I will plan to work on figuring out what is triggering those interrupts and seeing how to fix them. My hope is that in parallel with that work, other folks can be working on how to fix problems that occur more silently with the scheduler tick max deferment disabled; I'm also happy to work on those problems to the extent that I understand them (and I'm always happy to learn more). As part of the patch series I'd extend the proposed task_isolation_debug flag to also track timer scheduling events against task-isolation tasks that are ready to run in userspace (no other runnable tasks). What do you think of this approach? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
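The exit-to-userspace variant — yield in a loop and warn when the isolated task makes no forward progress — could look roughly like this (hypothetical sketch; the function name is mine, and the 5-second threshold simply mirrors the warning in the posted task_isolation_wait()):

```c
/*
 * Hypothetical yield loop on the return-to-userspace path: give up
 * the cpu while other tasks are runnable, warning once if we appear
 * to be starved.
 */
static void task_isolation_yield(void)
{
	unsigned long start = jiffies;
	bool warned = false;

	while (!sched_can_stop_tick()) {
		if (!warned && time_after(jiffies, start + 5 * HZ)) {
			pr_warn("%s/%d: task_isolation task blocked for %ld seconds\n",
				current->comm, current->pid,
				(jiffies - start) / HZ);
			warned = true;
		}
		schedule();
		if (signal_pending(current))
			break;
	}
}
```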
* Re: [PATCH v7 02/11] task_isolation: add initial support [not found] ` <560EBBC5.7000709-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-10-02 19:02 ` Thomas Gleixner 0 siblings, 0 replies; 159+ messages in thread From: Thomas Gleixner @ 2015-10-02 19:02 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Chris, On Fri, 2 Oct 2015, Chris Metcalf wrote: > 1. Rather than spinning in a busy loop if timers are pending, > we reschedule if more than one task is ready to run. This > directly targets the "architected" problem with the scheduler > tick, rather than sweeping up the scheduler tick and any other > timers into the one catch-all of "any timer ready to fire". > (We can use sched_can_stop_tick() to check the case where > other tasks can preempt us.) This would then provide part > of the semantics of the task-isolation flag. The other part is > running whatever code can be run to avoid the various ways > tasks might get interrupted later (lru_add_drain(), > quiet_vmstat(), etc) that are not appropriate to run > unconditionally for tasks that aren't trying to be isolated. Sounds like a plan > 2. Remove the tie between disabling the 1 Hz max deferment > and task isolation per se. Instead add a boot flag (e.g. > "debug_1hz_tick") that lets us turn off the 1 Hz tick to make it > easy to experiment with both the negative effects of the > missing tick, as well as to try to learn in parallel what actual > timer interrupts are firing "on purpose" rather than just due > to the 1 Hz tick to try to eliminate them as well. 
I have no problem with a debug flag, which allows you to experiment, though I'm not entirely sure whether we need to carry it in mainline or just in an extra isolation git tree. > For #1, I'm not sure if it's better to hack up the scheduler's > pick_next_task callback methods to avoid task-isolation tasks > when other tasks are also available to run, or just to observe > that there are additional tasks ready to run during exit to > userspace, and yield the cpu to allow those other tasks to run. > The advantage of doing it at exit to userspace is that we can > easily yield in a loop and pay attention to whether we seem > not to be making forward progress with that task and generate > a suitable warning; it also keeps a lot of task-isolation stuff > out of the core scheduler code, which may be a plus. You should discuss that with Peter Zijlstra. I see the plus not to have it in the scheduler, but OTOH having it in the core code has its advantages as well. Let's see how ugly it gets. > With these changes, and booting with the "debug_1hz_tick" > flag, I'm seeing a couple of timer ticks hit my task-isolation > task in the first 20 ms or so, and then it quiesces. I will > plan to work on figuring out what is triggering those > interrupts and seeing how to fix them. My hope is that in > parallel with that work, other folks can be working on how to > fix problems that occur more silently with the scheduler > tick max deferment disabled; I'm also happy to work on those > problems to the extent that I understand them (and I'm > always happy to learn more). I like that approach :) > As part of the patch series I'd extend the proposed > task_isolation_debug flag to also track timer scheduling > events against task-isolation tasks that are ready to run > in userspace (no other runnable tasks). > > What do you think of this approach? Makes sense. Thanks, tglx ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:14 ` Frederic Weisbecker 2015-10-01 12:18 ` Thomas Gleixner @ 2015-10-01 19:25 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/01/2015 08:14 AM, Frederic Weisbecker wrote: > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: >> diff --git a/include/linux/isolation.h b/include/linux/isolation.h >> new file mode 100644 >> index 000000000000..fd04011b1c1e >> --- /dev/null >> +++ b/include/linux/isolation.h >> @@ -0,0 +1,24 @@ >> +/* >> + * Task isolation related global functions >> + */ >> +#ifndef _LINUX_ISOLATION_H >> +#define _LINUX_ISOLATION_H >> + >> +#include <linux/tick.h> >> +#include <linux/prctl.h> >> + >> +#ifdef CONFIG_TASK_ISOLATION >> +static inline bool task_isolation_enabled(void) >> +{ >> + return tick_nohz_full_cpu(smp_processor_id()) && >> + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); > Ok, I may be a bit burdening with that but, how about using the regular > existing task flags, and if needed later we can still introduce a new field > in struct task_struct? The problem is still that we have two basic bits ("enabled" and "strict") plus eight bits of signal number to override SIGKILL. So we end up with *something* extra in task_struct no matter what. And, right now it's conveniently the same value as the bits passed to prctl(), so we don't need to marshall and unmarshall the prctl() get/set results. 
If we could convince ourselves not to do the "settable signal" stuff I'd agree that use task flags makes sense, but I was convinced for v2 of the patch series to add a settable signal, and I suspect it still does make sense. >> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > You should add a function in tick-sched.c to get the next tick. This > is supposed to be a private field. Yes. Or probably better, a function that just says whether the timer is quiesced. Obviously I'll wait to hear what Thomas says on this subject first, though. >> + if (!warned && (jiffies - start) >= (5 * HZ)) { >> + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start) / HZ); >> + warned = true; >> + } >> + cond_resched(); >> + if (test_thread_flag(TIF_SIGPENDING)) >> + break; > Why not use signal_pending()? Makes sense, thanks. > I still think we could try a wait-wake standard scheme. I'm curious to hear what you make of my arguments in the other thread on this subject! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-09-28 15:17 ` [PATCH v7 00/11] support "task_isolated" mode for nohz_full Chris Metcalf 2015-09-28 15:17 ` [PATCH v7 02/11] task_isolation: add initial support Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf [not found] ` <1443453446-7827-4-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-09-28 15:17 ` [PATCH v7 04/11] task_isolation: provide strict mode configurable signal Chris Metcalf 2015-10-20 20:35 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal; this is defined as happening immediately after the SECCOMP test. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/isolation.h        | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/isolation.c               | 41 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 008fc67d0d96..a840374f5d29 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 
 extern void context_tracking_cpu_set(int cpu);
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (task_isolation_strict())
+				task_isolation_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
 }
 
 extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
 #endif
 
+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->task_isolation_flags &
+	     (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+	    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..2b8038b0d1e1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION	48
 #define PR_GET_TASK_ISOLATION	49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..ffca3c3fe64a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -166,6 +167,7 @@ void context_tracking_exit(enum ctx_state state)
 		 */
 		rcu_user_exit();
 		if (state == CONTEXT_USER) {
+			from_user = true;
 			vtime_user_exit(current);
 			trace_user_exit(0);
 		}
@@ -175,6 +177,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 6ace866c69f6..3779ba670472 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -75,3 +76,43 @@ void task_isolation_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_task_isolation_strict_task(void)
+{
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	dump_stack();
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_task_isolation_strict_task();
+}
--
2.1.2

^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 20:51 ` Andy Lutomirski
  2015-09-28 21:54   ` Chris Metcalf

From: Andy Lutomirski @ 2015-09-28 20:51 UTC (permalink / raw)
To: Chris Metcalf
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> With task_isolation mode, the task is in principle guaranteed not to
> be interrupted by the kernel, but only if it behaves.  In particular,
> if it enters the kernel via system call, page fault, or any of a
> number of other synchronous traps, it may be unexpectedly exposed
> to long latencies.  Add a simple flag that puts the process into
> a state where any such kernel entry is fatal; this is defined as
> happening immediately after the SECCOMP test.

Why after seccomp?  Seccomp is still an entry, and the code would be
considerably simpler if it were before seccomp.

> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>  		return 0;
>
>  	prev_ctx = this_cpu_read(context_tracking.state);
> -	if (prev_ctx != CONTEXT_KERNEL)
> -		context_tracking_exit(prev_ctx);
> +	if (prev_ctx != CONTEXT_KERNEL) {
> +		if (context_tracking_exit(prev_ctx)) {
> +			if (task_isolation_strict())
> +				task_isolation_exception();
> +		}
> +	}
>
>  	return prev_ctx;
>  }

x86 does not promise to call this function.  In fact, x86 is rather
likely to stop ever calling this function in the reasonably near
future.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>   * This call supports re-entrancy. This way it can be called from any exception
>   * handler without needing to know if we came from userspace or not.
>   */
> -void context_tracking_exit(enum ctx_state state)
> +bool context_tracking_exit(enum ctx_state state)

This needs clear documentation of what the return value means.

> +static void kill_task_isolation_strict_task(void)
> +{
> +	/* RCU should have been enabled prior to this point. */
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
> +
> +	dump_stack();
> +	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
> +	send_sig(SIGKILL, current, 1);
> +}

Wasn't this supposed to be configurable?  Or is that something that
happens later on in the series?

> +
> +/*
> + * This routine is called from syscall entry (with the syscall number
> + * passed in) if the _STRICT flag is set.
> + */
> +void task_isolation_syscall(int syscall)
> +{
> +	/* Ignore prctl() syscalls or any task exit. */
> +	switch (syscall) {
> +	case __NR_prctl:
> +	case __NR_exit:
> +	case __NR_exit_group:
> +		return;
> +	}
> +
> +	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
> +		current->comm, current->pid, syscall);
> +	kill_task_isolation_strict_task();
> +}

Ick.  I guess it works, but this is still quite ugly IMO.

> +void task_isolation_exception(void)
> +{
> +	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
> +		current->comm, current->pid);
> +	kill_task_isolation_strict_task();
> +}

Should this say what exception?

--Andy
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 21:54 ` Chris Metcalf
  2015-09-28 22:38   ` Andy Lutomirski

From: Chris Metcalf @ 2015-09-28 21:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> With task_isolation mode, the task is in principle guaranteed not to
>> be interrupted by the kernel, but only if it behaves.  In particular,
>> if it enters the kernel via system call, page fault, or any of a
>> number of other synchronous traps, it may be unexpectedly exposed
>> to long latencies.  Add a simple flag that puts the process into
>> a state where any such kernel entry is fatal; this is defined as
>> happening immediately after the SECCOMP test.
>
> Why after seccomp?  Seccomp is still an entry, and the code would be
> considerably simpler if it were before seccomp.

I could be convinced to do it either way.  My initial thinking was
that a security violation was more interesting and more important
to report than a strict-mode task-isolation violation.  But see my
comments in response to your email on patch 07/11.

>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>  		return 0;
>>
>>  	prev_ctx = this_cpu_read(context_tracking.state);
>> -	if (prev_ctx != CONTEXT_KERNEL)
>> -		context_tracking_exit(prev_ctx);
>> +	if (prev_ctx != CONTEXT_KERNEL) {
>> +		if (context_tracking_exit(prev_ctx)) {
>> +			if (task_isolation_strict())
>> +				task_isolation_exception();
>> +		}
>> +	}
>>
>>  	return prev_ctx;
>>  }
>
> x86 does not promise to call this function.  In fact, x86 is rather
> likely to stop ever calling this function in the reasonably near
> future.

Yes, in which case we'd have to do it the same way we are doing
it for arm64 (see patch 09/11), by calling task_isolation_exception()
explicitly from within the relevant exception handlers.  If we start
doing that, it's probably worth wrapping up the logic into a single
inline function to keep the added code short and sweet.

If in fact this might happen in the short term, it might be a good
idea to hook the individual exception handlers in x86 now, and not
hook the exception_enter() mechanism at all.

>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>>   * This call supports re-entrancy. This way it can be called from any exception
>>   * handler without needing to know if we came from userspace or not.
>>   */
>> -void context_tracking_exit(enum ctx_state state)
>> +bool context_tracking_exit(enum ctx_state state)
>
> This needs clear documentation of what the return value means.

Added:

 * Return: if called with state == CONTEXT_USER, the function returns
 * true if we were in fact previously in user mode.

>> +static void kill_task_isolation_strict_task(void)
>> +{
>> +	/* RCU should have been enabled prior to this point. */
>> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
>> +
>> +	dump_stack();
>> +	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
>> +	send_sig(SIGKILL, current, 1);
>> +}
>
> Wasn't this supposed to be configurable?  Or is that something that
> happens later on in the series?

Yup, next patch.

>> +void task_isolation_exception(void)
>> +{
>> +	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
>> +		current->comm, current->pid);
>> +	kill_task_isolation_strict_task();
>> +}
>
> Should this say what exception?

I could modify it to take a string argument (and then use it for
the arm64 case at least).  For the exception_enter() caller, we actually
don't have the information available to pass down, and it would
be hard to get it.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 22:38 ` Andy Lutomirski
  2015-09-29 17:35   ` Chris Metcalf

From: Andy Lutomirski @ 2015-09-28 22:38 UTC (permalink / raw)
To: Chris Metcalf
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>> [...]
>>
>> x86 does not promise to call this function.  In fact, x86 is rather
>> likely to stop ever calling this function in the reasonably near
>> future.
>
> Yes, in which case we'd have to do it the same way we are doing
> it for arm64 (see patch 09/11), by calling task_isolation_exception()
> explicitly from within the relevant exception handlers.  If we start
> doing that, it's probably worth wrapping up the logic into a single
> inline function to keep the added code short and sweet.
>
> If in fact this might happen in the short term, it might be a good
> idea to hook the individual exception handlers in x86 now, and not
> hook the exception_enter() mechanism at all.

It's already like that in Linus' tree.

FWIW, most of those exception handlers send signals, so it might pay
to do it in notify_die or die instead.

>>> --- a/kernel/context_tracking.c
>>> +++ b/kernel/context_tracking.c
>>> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>>>   * This call supports re-entrancy. This way it can be called from any
>>>   * exception handler without needing to know if we came from userspace
>>>   * or not.
>>>   */
>>> -void context_tracking_exit(enum ctx_state state)
>>> +bool context_tracking_exit(enum ctx_state state)
>>
>> This needs clear documentation of what the return value means.
>
> Added:
>
>  * Return: if called with state == CONTEXT_USER, the function returns
>  * true if we were in fact previously in user mode.

This should note that it only returns true if context tracking is on.

>>> +void task_isolation_exception(void)
>>> +{
>>> +	pr_warn("%s/%d: task_isolation strict mode violated by
>>> exception\n",
>>> +		current->comm, current->pid);
>>> +	kill_task_isolation_strict_task();
>>> +}
>>
>> Should this say what exception?
>
> I could modify it to take a string argument (and then use it for
> the arm64 case at least).  For the exception_enter() caller, we actually
> don't have the information available to pass down, and it would
> be hard to get it.

For x86, the relevant info might be the actual hw error number
(error_code, which makes it into die) or the signal.  If we send a
death signal, then reporting the error number the usual way might make
sense.

--Andy
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-29 17:35 ` Chris Metcalf
  2015-09-29 17:46   ` Andy Lutomirski

From: Chris Metcalf @ 2015-09-29 17:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On 09/28/2015 06:38 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
>>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>>> [...]
>>>
>>> x86 does not promise to call this function.  In fact, x86 is rather
>>> likely to stop ever calling this function in the reasonably near
>>> future.
>>
>> Yes, in which case we'd have to do it the same way we are doing
>> it for arm64 (see patch 09/11), by calling task_isolation_exception()
>> explicitly from within the relevant exception handlers.  If we start
>> doing that, it's probably worth wrapping up the logic into a single
>> inline function to keep the added code short and sweet.
>>
>> If in fact this might happen in the short term, it might be a good
>> idea to hook the individual exception handlers in x86 now, and not
>> hook the exception_enter() mechanism at all.
>
> It's already like that in Linus' tree.

OK, I will restructure so that it doesn't rely on the context_tracking
code at all, but instead requires a line of code in every relevant
kernel exception handler.

> FWIW, most of those exception handlers send signals, so it might pay
> to do it in notify_die or die instead.

Well, the most interesting category is things that don't actually
trigger a signal (e.g. minor page fault) since those are things that
cause significant issues with task isolation processes
(kernel-induced jitter) but aren't otherwise user-visible,
much like an undiscovered syscall in a third-party library
can cause unexpected jitter.

> For x86, the relevant info might be the actual hw error number
> (error_code, which makes it into die) or the signal.  If we send a
> death signal, then reporting the error number the usual way might make
> sense.

I may just choose to use a task_isolation_exception(fmt, ...)
signature so that code can printk a suitable one-liner before
delivering the SIGKILL (or whatever signal was configured).

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-29 17:46 ` Andy Lutomirski

From: Andy Lutomirski @ 2015-09-29 17:46 UTC (permalink / raw)
To: Chris Metcalf
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 06:38 PM, Andy Lutomirski wrote:
>> On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>> On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
>>>> [...]
>>>> x86 does not promise to call this function.  In fact, x86 is rather
>>>> likely to stop ever calling this function in the reasonably near
>>>> future.
>>>
>>> Yes, in which case we'd have to do it the same way we are doing
>>> it for arm64 (see patch 09/11), by calling task_isolation_exception()
>>> explicitly from within the relevant exception handlers. [...]
>>>
>>> If in fact this might happen in the short term, it might be a good
>>> idea to hook the individual exception handlers in x86 now, and not
>>> hook the exception_enter() mechanism at all.
>>
>> It's already like that in Linus' tree.
>
> OK, I will restructure so that it doesn't rely on the context_tracking
> code at all, but instead requires a line of code in every relevant
> kernel exception handler.
>
>> FWIW, most of those exception handlers send signals, so it might pay
>> to do it in notify_die or die instead.
>
> Well, the most interesting category is things that don't actually
> trigger a signal (e.g. minor page fault) since those are things that
> cause significant issues with task isolation processes
> (kernel-induced jitter) but aren't otherwise user-visible,
> much like an undiscovered syscall in a third-party library
> can cause unexpected jitter.

Would it make sense to exempt the exceptions that result in signals?
After all, those are detectable even without your patches.  Going
through all of the exception types:

divide_error, overflow, invalid_op, coprocessor_segment_overrun,
invalid_TSS, segment_not_present, stack_segment, alignment_check:
these all send signals anyway.

double_fault is fatal.

bounds: MPX faults can be silently fixed up, and those will need
notification.  (Or user code should know not to do that, since it
requires an explicit opt in, and user code can flip it back off to
get the signals.)

general_protection: always signals except in vm86 mode.

int3: silently fixed if uprobes are in use, but I don't think
isolation cares about that.  Otherwise signals.

debug: The perf hw_breakpoint can result in silent fixups, but those
require explicit opt-in from the admin.  Otherwise, unless there's a
bug or a debugger, the user will get a signal.  (As a practical
matter, the only interesting case is the undocumented ICEBP
instruction.)

math_error, simd_coprocessor_error: Sends a signal.

spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK.  We
should just WARN if this hits.

device_not_available: If you're using isolation without an FPU, you
have bigger problems.

page_fault: Needs notification.

NMI, MCE: arguably these should *not* notify or at least not fatally.

So maybe a better approach would be to explicitly notify for the
relevant entries: IRQs, non-signalling page faults, and non-signalling
MPX fixups.  Other arches would have their own lists, but they're
probably also short except for emulated instructions.

--Andy
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-29 17:57 ` Chris Metcalf
  2015-09-29 18:00   ` Andy Lutomirski

From: Chris Metcalf @ 2015-09-29 17:57 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On 09/29/2015 01:46 PM, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> Well, the most interesting category is things that don't actually
>> trigger a signal (e.g. minor page fault) since those are things that
>> cause significant issues with task isolation processes
>> (kernel-induced jitter) but aren't otherwise user-visible,
>> much like an undiscovered syscall in a third-party library
>> can cause unexpected jitter.
>
> Would it make sense to exempt the exceptions that result in signals?
> After all, those are detectable even without your patches.  Going
> through all of the exception types:
>
> [...]
>
> So maybe a better approach would be to explicitly notify for the
> relevant entries: IRQs, non-signalling page faults, and non-signalling
> MPX fixups.  Other arches would have their own lists, but they're
> probably also short except for emulated instructions.

IRQs should get notified via the task_isolation_debug boot flag;
the intent is that they should never get delivered to nohz_full
cores anyway, so we produce a console backtrace if the boot
flag is enabled.  This isn't tied to having a task running with
TASK_ISOLATION enabled, since it just shouldn't ever happen.

Thanks for reviewing the possible exception sources on x86, which
I'm less familiar with than tile.  Non-signalling page faults and
MPX fixups sounds exactly right - and I didn't know about MPX before
your email (other than the userspace side of the notion of bounds
registers), so thanks for the pointer.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-29 18:00 ` Andy Lutomirski

From: Andy Lutomirski @ 2015-09-29 18:00 UTC (permalink / raw)
To: Chris Metcalf
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On Tue, Sep 29, 2015 at 10:57 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/29/2015 01:46 PM, Andy Lutomirski wrote:
>> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>> Well, the most interesting category is things that don't actually
>>> trigger a signal (e.g. minor page fault) since those are things that
>>> cause significant issues with task isolation processes
>>> (kernel-induced jitter) but aren't otherwise user-visible,
>>> much like an undiscovered syscall in a third-party library
>>> can cause unexpected jitter.
>>
>> Would it make sense to exempt the exceptions that result in signals?
>> After all, those are detectable even without your patches.  Going
>> through all of the exception types:
>>
>> [...]
>>
>> So maybe a better approach would be to explicitly notify for the
>> relevant entries: IRQs, non-signalling page faults, and non-signalling
>> MPX fixups.  Other arches would have their own lists, but they're
>> probably also short except for emulated instructions.
>
> IRQs should get notified via the task_isolation_debug boot flag;
> the intent is that they should never get delivered to nohz_full
> cores anyway, so we produce a console backtrace if the boot
> flag is enabled.  This isn't tied to having a task running with
> TASK_ISOLATION enabled, since it just shouldn't ever happen.

OK, I like that.  In that case, maybe NMI and MCE should be in a
similar category.  (IOW if a non-fatal MCE happens and the debug param
is set, we could warn, assuming that anyone is willing to write the
code.  Doing printk from MCE is not entirely trivial, although it's
less bad in recent kernels.)

--Andy
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-10-01 19:25 ` Chris Metcalf

From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API,
    linux-kernel@vger.kernel.org

On 09/29/2015 02:00 PM, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:57 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> On 09/29/2015 01:46 PM, Andy Lutomirski wrote:
>>> Would it make sense to exempt the exceptions that result in signals?
>>> After all, those are detectable even without your patches.  Going
>>> through all of the exception types:
>>>
>>> [...]
>>>
>>> So maybe a better approach would be to explicitly notify for the
>>> relevant entries: IRQs, non-signalling page faults, and non-signalling
>>> MPX fixups.  Other arches would have their own lists, but they're
>>> probably also short except for emulated instructions.
>>
>> IRQs should get notified via the task_isolation_debug boot flag;
>> the intent is that they should never get delivered to nohz_full
>> cores anyway, so we produce a console backtrace if the boot
>> flag is enabled.  This isn't tied to having a task running with
>> TASK_ISOLATION enabled, since it just shouldn't ever happen.
>
> OK, I like that.  In that case, maybe NMI and MCE should be in a
> similar category.  (IOW if a non-fatal MCE happens and the debug param
> is set, we could warn, assuming that anyone is willing to write the
> code.  Doing printk from MCE is not entirely trivial, although it's
> less bad in recent kernels.)

For now I will stay away from tampering with the NMI/MCE handlers,
though if it turns out that it's the cause of mysterious latencies
in task-isolation applications in the future, it will likely make
sense to add some debugging there.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* [PATCH v7 04/11] task_isolation: provide strict mode configurable signal
  2015-09-28 15:17 ` [PATCH v7 00/11] support "task_isolated" mode for nohz_full Chris Metcalf
  2015-09-28 15:17 ` [PATCH v7 02/11] task_isolation: add initial support Chris Metcalf
  2015-09-28 15:17 ` [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
  2015-09-28 15:17 ` Chris Metcalf
  2015-09-28 20:54   ` Andy Lutomirski
  2015-10-20 20:35 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf
  3 siblings, 1 reply; 159+ messages in thread

From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api,
    linux-kernel
Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered when a
task_isolation process in STRICT mode does a syscall or otherwise
synchronously enters the kernel.

In addition to being able to set the signal, we now also pass
whether or not the interruption was from a syscall in the si_code
field of the siginfo.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/isolation.c         | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2b8038b0d1e1..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,5 +202,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION	49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
 # define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)

 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 3779ba670472..44bafcd08bca 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -77,14 +77,23 @@ void task_isolation_enter(void)
 	}
 }

-static void kill_task_isolation_strict_task(void)
+static void kill_task_isolation_strict_task(int is_syscall)
 {
+	siginfo_t info = {};
+	int sig;
+
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");

 	dump_stack();
 	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }

 /*
@@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall)

 	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_task_isolation_strict_task();
+	kill_task_isolation_strict_task(1);
 }

 /*
@@ -114,5 +123,5 @@ void task_isolation_exception(void)
 {
 	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_task_isolation_strict_task();
+	kill_task_isolation_strict_task(0);
 }
--
2.1.2
* Re: [PATCH v7 04/11] task_isolation: provide strict mode configurable signal 2015-09-28 15:17 ` [PATCH v7 04/11] task_isolation: provide strict mode configurable signal Chris Metcalf @ 2015-09-28 20:54 ` Andy Lutomirski [not found] ` <CALCETrXaWaUwWnOz16RAqjFP9tZm=tp74xWacXjqa36TWB9BfQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-09-28 20:54 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > Allow userspace to override the default SIGKILL delivered > when a task_isolation process in STRICT mode does a syscall > or otherwise synchronously enters the kernel. > > In addition to being able to set the signal, we now also > pass whether or not the interruption was from a syscall in > the si_code field of the siginfo. 
> > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > include/uapi/linux/prctl.h | 2 ++ > kernel/isolation.c | 17 +++++++++++++---- > 2 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 2b8038b0d1e1..a5582ace987f 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -202,5 +202,7 @@ struct prctl_mm_map { > #define PR_GET_TASK_ISOLATION 49 > # define PR_TASK_ISOLATION_ENABLE (1 << 0) > # define PR_TASK_ISOLATION_STRICT (1 << 1) > +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) > +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) > > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/isolation.c b/kernel/isolation.c > index 3779ba670472..44bafcd08bca 100644 > --- a/kernel/isolation.c > +++ b/kernel/isolation.c > @@ -77,14 +77,23 @@ void task_isolation_enter(void) > } > } > > -static void kill_task_isolation_strict_task(void) > +static void kill_task_isolation_strict_task(int is_syscall) > { > + siginfo_t info = {}; > + int sig; > + > /* RCU should have been enabled prior to this point. */ > RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); > > dump_stack(); > current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; > - send_sig(SIGKILL, current, 1); > + > + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); > + if (sig == 0) > + sig = SIGKILL; > + info.si_signo = sig; > + info.si_code = is_syscall; I think this needs real SI_ defines. > + send_sig_info(sig, &info, current); > } > > /* > @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall) > > pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", > current->comm, current->pid, syscall); > - kill_task_isolation_strict_task(); > + kill_task_isolation_strict_task(1); No magic numbers please. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v7 04/11] task_isolation: provide strict mode configurable signal [not found] ` <CALCETrXaWaUwWnOz16RAqjFP9tZm=tp74xWacXjqa36TWB9BfQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-28 21:54 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-09-28 21:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 09/28/2015 04:54 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> Allow userspace to override the default SIGKILL delivered >> when a task_isolation process in STRICT mode does a syscall >> or otherwise synchronously enters the kernel. >> >> In addition to being able to set the signal, we now also >> pass whether or not the interruption was from a syscall in >> the si_code field of the siginfo. 
>> >> Signed-off-by: Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >> --- >> include/uapi/linux/prctl.h | 2 ++ >> kernel/isolation.c | 17 +++++++++++++---- >> 2 files changed, 15 insertions(+), 4 deletions(-) >> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> index 2b8038b0d1e1..a5582ace987f 100644 >> --- a/include/uapi/linux/prctl.h >> +++ b/include/uapi/linux/prctl.h >> @@ -202,5 +202,7 @@ struct prctl_mm_map { >> #define PR_GET_TASK_ISOLATION 49 >> # define PR_TASK_ISOLATION_ENABLE (1 << 0) >> # define PR_TASK_ISOLATION_STRICT (1 << 1) >> +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) >> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) >> >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/kernel/isolation.c b/kernel/isolation.c >> index 3779ba670472..44bafcd08bca 100644 >> --- a/kernel/isolation.c >> +++ b/kernel/isolation.c >> @@ -77,14 +77,23 @@ void task_isolation_enter(void) >> } >> } >> >> -static void kill_task_isolation_strict_task(void) >> +static void kill_task_isolation_strict_task(int is_syscall) >> { >> + siginfo_t info = {}; >> + int sig; >> + >> /* RCU should have been enabled prior to this point. */ >> RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); >> >> dump_stack(); >> current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; >> - send_sig(SIGKILL, current, 1); >> + >> + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); >> + if (sig == 0) >> + sig = SIGKILL; >> + info.si_signo = sig; >> + info.si_code = is_syscall; > I think this needs real SI_ defines. Yeah, it's a fair point, but of course SIGKILL has no SI_ defines at all right now. I'm tempted to suggest we just back out setting si_code altogether. It might be worth a one-line console message (a la show_signal_message()), and use that to pack in the extra information, instead of trying to fuss with the siginfo data. 
>> + send_sig_info(sig, &info, current); >> } >> >> /* >> @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall) >> >> pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", >> current->comm, current->pid, syscall); >> - kill_task_isolation_strict_task(); >> + kill_task_isolation_strict_task(1); > No magic numbers please. I think mooted by the above, but, good point. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-09-28 15:17 ` [PATCH v7 00/11] support "task_isolated" mode for nohz_full Chris Metcalf ` (2 preceding siblings ...) 2015-09-28 15:17 ` [PATCH v7 04/11] task_isolation: provide strict mode configurable signal Chris Metcalf @ 2015-10-20 20:35 ` Chris Metcalf 2015-10-20 20:36 ` [PATCH v8 04/14] task_isolation: add initial support Chris Metcalf ` (3 more replies) 3 siblings, 4 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-20 20:35 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This email discusses in detail the changes for v8; please see older versions of the cover letter for details about older versions. v8: The biggest difference in this version is, at Thomas Gleixner's suggestion, I removed the code that busy-waits until there are no scheduler-tick timer events queued. Instead, we now test for higher-level properties when attempting to return to userspace. We check if the core believes it has stopped the scheduler tick (which handles checking for scheduler contention from other tasks, RCU usage of the cpu, posix cpu timers, perf, etc), and if it hasn't, we request that the current process be rescheduled. In addition, we check if there are per-cpu lru pages to be drained, and we check if the vmstat worker has been quiesced. The structure is pretty clean so we can add additional tests as needed there as well. One nice aspect of this revised structure is that if the user actually requests a signal from a timer (for example), we will now return to userspace and let the program run. 
Of course it may get bombed with incremental timer ticks if the timer can't be programmed to the whole time interval in one step, but it still feels more correct this way than holding the process in the kernel until the user-requested timer expires. At Andy Lutomirski's suggestion, we split out a separate task_isolation_ready() test from the previous task_isolation_enter(); the test can be done at the same time as we test the TIF_xxx flags, with interrupts disabled, so we can guarantee that the conditions we test for are still true when we return to userspace. To accomplish this we break out a new vmstat_idle() function that checks whether the vmstat subsystem is quiesced on this core. Similarly, we factor out an lru_add_drain_needed() function from where it used to be in lru_add_drain_all(). Both of these "check" functions can now be called from task_isolation_ready() with interrupts disabled. Also at Andy's suggestion (and aligning with how I had done things previously in the Tilera private fork), the prctl() to enable task isolation will now fail with EINVAL if you attempt to enable task-isolation mode when your affinity does not lock you to a single core, or if that core is not a nohz_full core. We move the "strict" syscall test to just before SECCOMP instead of just after. It's not particularly clear that one is better than the other abstractly, and on a couple of the supported platforms (x86, tile) it makes the code structure work out better because the user_enter() can be done at the same time as the test for strict mode. The integration with context_tracking has been completely dropped; discussing with Andy showed that there are only a few exception sites that need strict-mode checking (the typical one is page faults that don't raise signals), so just putting the checks in the relevant functions feels cleaner than trying to hijack the exception_enter/exception_exit paths, which are being removed for x86 in any case. 
The task_isolation_exception() hook now takes full printf format arguments, so that we can generate a much more useful report as to why we are killing the task. As a result, we also remove the dump_stack() call, whose only utility was pointing the finger at which exception function had triggered. Rather than automatically disabling the 1 Hz maximum scheduler deferment for task-isolation tasks, we now require the user to specify a boot flag ("debug_1hz_tick") to do this. The boot flag allows us to test the case where all the 1 Hz updating subsystems have been fixed before that work actually is finished. An architecture-specific fix is included in this patch series for the tile architecture; I will push it through the tile tree (along with the tile prepare_exit_to_usermode restructuring) if there are no concerns. At issue is that we end up with one gratuitous timer tick when we are shutting down the timer; by setting up the set_state_oneshot_stopped function pointer callback for the tile tick timer we can avoid this problem. (Thomas, I'd particularly appreciate your ack on this fix, which is number 13 out of 14 in this patch series.) Rebased to v4.3-rc6 to pick up the fix for vmstat to properly use schedule_delayed_work_on(), since I was hitting a VM_BUG_ON without the fix (which I separately tracked down - oh well). 
v7: switch to architecture hooks for task_isolation_enter add an RCU_LOCKDEP_WARN() (Andy Lutomirski) rebased to v4.3-rc1 v6: restructured to be a "task_isolation" mode not a "cpu_isolated" mode (Frederic) v5: rebased on kernel v4.2-rc3 converted to use CONFIG_CPU_ISOLATED and separate .c and .h files incorporates Christoph Lameter's quiet_vmstat() call v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() General summary: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_TASK_ISOLATION to take advantage of this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. 
Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts, this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane Chris Metcalf (13): vmstat: add vmstat_idle function lru_add_drain_all: factor out lru_add_drain_needed task_isolation: add initial support task_isolation: support PR_TASK_ISOLATION_STRICT mode task_isolation: provide strict mode configurable signal task_isolation: add debug boot flag nohz_full: allow disabling the 1Hz minimum tick at boot arch/x86: enable task isolation functionality arch/arm64: adopt prepare_exit_to_usermode() model from x86 arch/arm64: enable task isolation functionality arch/tile: adopt prepare_exit_to_usermode() model from x86 arch/tile: turn off timer tick for oneshot_stopped state arch/tile: enable task isolation functionality Christoph Lameter (1): vmstat: provide a function to quiet down the diff processing Documentation/kernel-parameters.txt | 7 ++ arch/arm64/include/asm/thread_info.h | 18 +++-- arch/arm64/kernel/entry.S | 6 +- arch/arm64/kernel/ptrace.c | 12 +++- arch/arm64/kernel/signal.c | 35 +++++++--- arch/arm64/mm/fault.c | 4 ++ arch/tile/include/asm/processor.h | 2 +- arch/tile/include/asm/thread_info.h | 8 ++- arch/tile/kernel/intvec_32.S | 46 ++++--------- arch/tile/kernel/intvec_64.S | 49 +++++--------- 
arch/tile/kernel/process.c | 83 ++++++++++++----------- arch/tile/kernel/ptrace.c | 6 +- arch/tile/kernel/single_step.c | 5 ++ arch/tile/kernel/time.c | 1 + arch/tile/kernel/unaligned.c | 3 + arch/tile/mm/fault.c | 3 + arch/tile/mm/homecache.c | 5 +- arch/x86/entry/common.c | 10 ++- arch/x86/kernel/traps.c | 2 + arch/x86/mm/fault.c | 2 + include/linux/isolation.h | 61 +++++++++++++++++ include/linux/sched.h | 3 + include/linux/swap.h | 1 + include/linux/vmstat.h | 4 ++ include/uapi/linux/prctl.h | 8 +++ init/Kconfig | 20 ++++++ kernel/Makefile | 1 + kernel/irq_work.c | 5 +- kernel/isolation.c | 127 +++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 37 ++++++++++ kernel/signal.c | 13 ++++ kernel/smp.c | 4 ++ kernel/softirq.c | 7 ++ kernel/sys.c | 9 +++ mm/swap.c | 13 ++-- mm/vmstat.c | 24 +++++++ 36 files changed, 507 insertions(+), 137 deletions(-) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c -- 2.1.2 ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v8 04/14] task_isolation: add initial support 2015-10-20 20:35 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf 2015-10-20 20:56 ` Andy Lutomirski [not found] ` <1445373372-6567-5-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 2015-10-20 20:36 ` [PATCH v8 05/14] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf ` (2 subsequent siblings) 3 siblings, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(). 
When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new task_isolation_ready() / task_isolation_enter() routines to take additional actions to help the task avoid being interrupted in the future. The task_isolation_ready() call plays an equivalent role to the TIF_xxx flags when returning to userspace, and should be checked in the loop check of the prepare_exit_to_usermode() routine or its architecture equivalent. It is called with interrupts disabled and inspects the kernel state to determine if it is safe to return into an isolated state. In particular, if it sees that the scheduler tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify the scheduler to attempt to schedule a different task. Each time through the loop of TIF work to do, we call the new task_isolation_enter() routine, which takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache). As a result of these tests on the "return to userspace" path, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Separate patches that follow provide these changes for x86, arm64, and tile. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/isolation.h | 38 ++++++++++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 +++ init/Kconfig | 20 ++++++++++++ kernel/Makefile | 1 + kernel/isolation.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 9 ++++++ 7 files changed, 154 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..4bef90024924 --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,38 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION +extern int task_isolation_set(unsigned int flags); +static inline bool task_isolation_enabled(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); +} + +extern bool _task_isolation_ready(void); +extern void _task_isolation_enter(void); + +static inline bool task_isolation_ready(void) +{ + return !task_isolation_enabled() || _task_isolation_ready(); +} + +static inline void task_isolation_enter(void) +{ + if (task_isolation_enabled()) + _task_isolation_enter(); +} + +#else +static inline bool task_isolation_enabled(void) { return false; } +static inline bool task_isolation_ready(void) { return true; } +static inline void task_isolation_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index b7b9501b41af..7a50f6904675 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1812,6 +1812,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git 
a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a8d0759a9e40..67224df4b559 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -197,4 +197,9 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */ +#define PR_SET_TASK_ISOLATION 48 +#define PR_GET_TASK_ISOLATION 49 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index c24b6f767bf0..4ff7f052059a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. On return to userspace, + isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + tasks in userspace, such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a best-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. 
+ config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index 53abf008ecb3..693a2ba35679 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_HAS_IOMEM) += memremap.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..9a73235db0bb --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,78 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include <linux/syscalls.h> +#include "time/tick-sched.h" + +/* + * This routine controls whether we can enable task-isolation mode. + * The task must be affinitized to a single nohz_full core or we will + * return EINVAL. Although the application could later re-affinitize + * to a housekeeping core and lose task isolation semantics, this + * initial test should catch 99% of bugs with task placement prior to + * enabling task isolation. + */ +int task_isolation_set(unsigned int flags) +{ + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || + !tick_nohz_full_cpu(smp_processor_id())) + return -EINVAL; + + current->task_isolation_flags = flags; + return 0; +} + +/* + * In task isolation mode we try to return to userspace only after + * attempting to make sure we won't be interrupted again. To handle + * the periodic scheduler tick, we test to make sure that the tick is + * stopped, and if it isn't yet, we request a reschedule so that if + * another task needs to run to completion first, it can do so. + * Similarly, if any other subsystems require quiescing, we will need + * to do that before we return to userspace. 
+ */ +bool _task_isolation_ready(void) +{ + WARN_ON_ONCE(!irqs_disabled()); + + /* If we need to drain the LRU cache, we're not ready. */ + if (lru_add_drain_needed(smp_processor_id())) + return false; + + /* If vmstats need updating, we're not ready. */ + if (!vmstat_idle()) + return false; + + /* If the tick is running, request rescheduling; we're not ready. */ + if (!tick_nohz_tick_stopped()) { + set_tsk_need_resched(current); + return false; + } + + return true; +} + +/* + * Each time we try to prepare for return to userspace in a process + * with task isolation enabled, we run this code to quiesce whatever + * subsystems we can readily quiesce to avoid later interrupts. + */ +void _task_isolation_enter(void) +{ + WARN_ON_ONCE(irqs_disabled()); + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat(); +} diff --git a/kernel/sys.c b/kernel/sys.c index fa2f2f671a5c..f1b1d333f74d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -41,6 +41,7 @@ #include <linux/syscore_ops.h> #include <linux/version.h> #include <linux/ctype.h> +#include <linux/isolation.h> #include <linux/compat.h> #include <linux/syscalls.h> @@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + error = task_isolation_set(arg2); + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-20 20:36 ` [PATCH v8 04/14] task_isolation: add initial support Chris Metcalf @ 2015-10-20 20:56 ` Andy Lutomirski [not found] ` <CALCETrWzhrYreizoKG0w6Jtz3RLFjNx9Qk_JLykcLLUQcCXBEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <1445373372-6567-5-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-10-20 20:56 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > +/* > + * In task isolation mode we try to return to userspace only after > + * attempting to make sure we won't be interrupted again. To handle > + * the periodic scheduler tick, we test to make sure that the tick is > + * stopped, and if it isn't yet, we request a reschedule so that if > + * another task needs to run to completion first, it can do so. > + * Similarly, if any other subsystems require quiescing, we will need > + * to do that before we return to userspace. > + */ > +bool _task_isolation_ready(void) > +{ > + WARN_ON_ONCE(!irqs_disabled()); > + > + /* If we need to drain the LRU cache, we're not ready. */ > + if (lru_add_drain_needed(smp_processor_id())) > + return false; > + > + /* If vmstats need updating, we're not ready. */ > + if (!vmstat_idle()) > + return false; > + > + /* If the tick is running, request rescheduling; we're not ready. */ > + if (!tick_nohz_tick_stopped()) { > + set_tsk_need_resched(current); > + return false; > + } > + > + return true; > +} I still don't get why this is a loop. I would argue that this should simply drain the LRU, quiet vmstat, and return. 
If the tick isn't stopped, then there's a reason why it's not stopped (which may involve having SCHED_OTHER tasks around, in which case user code shouldn't do that or there should simply be a requirement that isolation requires a real-time scheduler class). BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? --Andy
* Re: [PATCH v8 04/14] task_isolation: add initial support [not found] ` <CALCETrWzhrYreizoKG0w6Jtz3RLFjNx9Qk_JLykcLLUQcCXBEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-10-20 21:20 ` Chris Metcalf [not found] ` <5626B00E.3010309-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-10-20 21:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 10/20/2015 04:56 PM, Andy Lutomirski wrote: > On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> +/* >> + * In task isolation mode we try to return to userspace only after >> + * attempting to make sure we won't be interrupted again. To handle >> + * the periodic scheduler tick, we test to make sure that the tick is >> + * stopped, and if it isn't yet, we request a reschedule so that if >> + * another task needs to run to completion first, it can do so. >> + * Similarly, if any other subsystems require quiescing, we will need >> + * to do that before we return to userspace. >> + */ >> +bool _task_isolation_ready(void) >> +{ >> + WARN_ON_ONCE(!irqs_disabled()); >> + >> + /* If we need to drain the LRU cache, we're not ready. */ >> + if (lru_add_drain_needed(smp_processor_id())) >> + return false; >> + >> + /* If vmstats need updating, we're not ready. */ >> + if (!vmstat_idle()) >> + return false; >> + >> + /* If the tick is running, request rescheduling; we're not ready. */ >> + if (!tick_nohz_tick_stopped()) { >> + set_tsk_need_resched(current); >> + return false; >> + } >> + >> + return true; >> +} > I still don't get why this is a loop. 
You mean, why is this code called from prepare_exit_to_userspace() in the loop, instead of after the loop? It's because the actual functions that clean up the LRU, vmstat worker, etc., may need interrupts enabled, may reschedule internally, etc. (refresh_cpu_vm_stats() calls cond_resched(), for example.) Even more importantly, we rely on rescheduling to take care of the fact that the scheduler tick may still be running, and therefore loop back to the schedule() call that's run when TIF_NEED_RESCHED gets set. And so, since interrupts and scheduling can happen, we need to run in a loop to retest, just like the existing tests for signal dispatch, need_resched, etc. > I would argue that this should simply drain the LRU, quiet vmstat, and > return. If the tick isn't stopped, then there's a reason why it's not > stopped (which may involve having SCHED_OTHER tasks around, in which > case user code shouldn't do that or there should simply be a > requirement that isolation requires a real-time scheduler class). Sure, there's a reason the tick isn't stopped, but if it's not yet stopped, we need to schedule out and wait for that to happen. A real-time scheduler class won't completely take care of this as you still may have issues like RCU needing the cpu or any of the other cases in can_stop_full_tick(). > BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? So a scheduler class is certainly an interesting idea, although not one I know immediately how to implement. I'm not sure whether it makes sense to require a user be root or have a suitable rtprio rlimit, but perhaps so. The nice thing about the current patch series is that you can affinitize yourself to a nohz_full core and declare that you want to run task-isolated, and none of that requires root nor really is there a reason it should. I guess you could make SCHED_ISOLATED like SCHED_BATCH and perhaps therefore allow non-root users to switch to it? 
In any case it would have to be true that we would still be doing all the other tests we do now, even if we could count on the scheduler to take care of only trying to run it when there were no other runnable processes. So it would certainly add complexity. I'm not sure how to evaluate the utility. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support [not found] ` <5626B00E.3010309-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-10-20 21:26 ` Andy Lutomirski [not found] ` <CALCETrX6e+mqfy-rNV3sA8xGVDNHviQ9vHBBhAPULeLecno7XQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-10-26 20:32 ` Chris Metcalf 0 siblings, 2 replies; 159+ messages in thread From: Andy Lutomirski @ 2015-10-20 21:26 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Oct 20, 2015 at 2:20 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > On 10/20/2015 04:56 PM, Andy Lutomirski wrote: >> >> On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >> wrote: >>> >>> +/* >>> + * In task isolation mode we try to return to userspace only after >>> + * attempting to make sure we won't be interrupted again. To handle >>> + * the periodic scheduler tick, we test to make sure that the tick is >>> + * stopped, and if it isn't yet, we request a reschedule so that if >>> + * another task needs to run to completion first, it can do so. >>> + * Similarly, if any other subsystems require quiescing, we will need >>> + * to do that before we return to userspace. >>> + */ >>> +bool _task_isolation_ready(void) >>> +{ >>> + WARN_ON_ONCE(!irqs_disabled()); >>> + >>> + /* If we need to drain the LRU cache, we're not ready. */ >>> + if (lru_add_drain_needed(smp_processor_id())) >>> + return false; >>> + >>> + /* If vmstats need updating, we're not ready. */ >>> + if (!vmstat_idle()) >>> + return false; >>> + >>> + /* If the tick is running, request rescheduling; we're not ready. 
>>> */ >>> + if (!tick_nohz_tick_stopped()) { >>> + set_tsk_need_resched(current); >>> + return false; >>> + } >>> + >>> + return true; >>> +} >> >> I still don't get why this is a loop. > > > You mean, why is this code called from prepare_exit_to_userspace() > in the loop, instead of after the loop? It's because the actual functions > that clean up the LRU, vmstat worker, etc., may need interrupts enabled, > may reschedule internally, etc. (refresh_cpu_vm_stats() calls > cond_resched(), for example.) Yuck. I guess that's a reasonable argument, although it could also be fixed. > Even more importantly, we rely on > rescheduling to take care of the fact that the scheduler tick may still > be running, and therefore loop back to the schedule() call that's run > when TIF_NEED_RESCHED gets set. This just seems like a mis-design. We don't know why the scheduler tick is on, so we're just going to reschedule until the problem goes away? > >> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? > > > So a scheduler class is an interesting idea certainly, although not > one I know immediately how to implement. I'm not sure whether > it makes sense to require a user be root or have a suitable rtprio > rlimit, but perhaps so. The nice thing about the current patch > series is that you can affinitize yourself to a nohz_full core and > declare that you want to run task-isolated, and none of that > requires root nor really is there a reason it should. Your patches more or less implement "don't run me unless I'm isolated". A scheduler class would be more like "isolate me (and maybe make me super high priority so it actually happens)". I'm not a scheduler person, so I don't know. But "don't run me unless I'm isolated" seems like a design that will, at best, only ever work by dumb luck. 
You have to disable migration, avoid other runnable tasks, hope that the kernel keeps working the way it did when you wrote the patch, hope you continue to get lucky enough that you ever get to user mode in the first place, etc. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support [not found] ` <CALCETrX6e+mqfy-rNV3sA8xGVDNHviQ9vHBBhAPULeLecno7XQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-10-21 0:29 ` Steven Rostedt 2015-10-26 20:19 ` Chris Metcalf 2015-10-26 21:13 ` Chris Metcalf 0 siblings, 2 replies; 159+ messages in thread From: Steven Rostedt @ 2015-10-21 0:29 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, 20 Oct 2015 14:26:34 -0700 Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: > I'm not a scheduler person, so I don't know. But "don't run me unless > I'm isolated" seems like a design that will, at best, only ever work > by dumb luck. You have to disable migration, avoid other runnable > tasks, hope that the kernel keeps working the way it did when you > wrote the patch, hope you continue to get lucky enough that you ever > get to user mode in the first place, etc. Since it only makes sense to run one isolated task per cpu (not more than one on the same CPU), I wonder if we should add a new interface for this, that would force everything else off the CPU that it requests. That is, you bind a task to a CPU, and then change it to SCHED_ISOLATED (or what not), and the kernel will force all other tasks off that CPU. Well, we would still have kernel threads, but that's a different matter. Also, doesn't RCU need to have a few ticks go by before it can safely disable itself from userspace? I recall something like that. Paul? -- Steve ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-21 0:29 ` Steven Rostedt @ 2015-10-26 20:19 ` Chris Metcalf 2015-10-26 21:13 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-26 20:19 UTC (permalink / raw) To: Steven Rostedt, Andy Lutomirski Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org Andy wrote: > Your patches more or less implement "don't run me unless I'm > isolated". A scheduler class would be more like "isolate me (and > maybe make me super high priority so it actually happens)". Steven wrote: > Since it only makes sense to run one isolated task per cpu (not more > than one on the same CPU), I wonder if we should add a new interface > for this, that would force everything else off the CPU that it > requests. That is, you bind a task to a CPU, and then change it to > SCHED_ISOLATED (or what not), and the kernel will force all other tasks > off that CPU. Frederic wrote: > I think you'll have to make sure the task can not be concurrently > reaffined to more CPUs. This may involve setting task_isolation_flags > under the runqueue lock and thus move that tiny part to the scheduler > code. And then we must forbid changing the affinity while the task has > the isolation flag, or deactivate the flag. These comments are all about the same high-level question, so I want to address it in this reply. The question is, should TASK_ISOLATION be "polite" or "aggressive"? The original design was "polite": it worked as long as no other thing on the system tried to mess with it. The suggestions above are for an "aggressive" design. The "polite" design basically tags a task as being interested in having the kernel help it out by staying away from it. 
It relies on running on a nohz_full cpu to keep scheduler ticks away from it. It relies on running on an isolcpus cpu to keep other processes from getting dynamically load-balanced onto it and messing it up. And, of course, it relies on the other applications and users running on the machine not to affinitize themselves onto its core and mess it up that way. But, as long as all those things are true, the kernel will try to help it out by never interrupting it. (And, it allows for the kernel to report when those expectations are violated.) The "aggressive" design would have an API that said "This is my core!". The kernel would enforce keeping other processes off the core. It would require nohz_full semantics on that core. It would lock the task to that core in some way that would override attempts to reset its sched_affinity. It would do whatever else was necessary to make that core unavailable to the rest of the system. Advantages of the "polite" design: - No special privileges required - As a result, no security issues to sort through (capabilities, etc.) - Therefore easy to use when running as an unprivileged user - Won't screw up the occasional kernel task that needs to run Advantages of the "aggressive" design: - Clearer that the application will get the task isolation it wants - More reasonable that it is enforcing kernel performance tweaks on the local core (e.g. flushing the per-cpu LRU cache) The "aggressive" design is certainly tempting, but there may be other negative consequences of this design: for example, if we need to run a usermode helper process as a result of some system call, we do want to ensure that it can run, and we need to allow it to be scheduled, even if it's just a regular scheduler class thing. The "polite" design allows the usermode helper to run and just waits until it's safe for the isolated task to return to userspace. 
Possibly we could arrange for a SCHED_ISOLATED class to allow that kind of behavior, though I'm not familiar enough with the scheduler code to say for sure. I think it's important that we're explicit about which of these two approaches feels like the more appropriate one. Possibly my Tilera background is part of which pushes me towards the "polite" design; we have a lot of cores, so they're a kind of trivial resource that we don't need to aggressively defend, and it's a more conservative design to enable task isolation only when all the relevant criteria have been met, rather than enforcing those criteria up front. I think if we adopt the "aggressive" model, it might likely make sense to express it as a scheduling policy, since it would include core scheduler changes such as denying other tasks the right to call sched_setaffinity() with an affinity that includes cores currently in use by SCHED_ISOLATED tasks. This would be something pretty deeply hooked into the scheduler and therefore might require some more substantial changes. In addition, of course, there's the cost of documenting yet another scheduler policy. In the "polite" model, we certainly could use a SCHED_ISOLATED scheduling policy (with static priority zero) to indicate task-isolation mode, rather than using prctl() to set a task_struct bit. I'm not sure how much it gains, though. It could allow the scheduler to detect that the only "runnable" task actually didn't want to be run, and switch briefly to the idle task, but since this would likely only be for a scheduler tick or two, the power advantages are pretty minimal, for a pretty reasonable additional piece of complexity both in the API (documenting a new scheduler class) and in the implementation (putting new requirements into the scheduler implementations). So I'm somewhat dubious, although willing to be pushed in that direction if that's the consensus. 
On balance I think it still feels to me like the original proposed direction (a "polite" task isolation mode with a prctl bit) feels better than the scheduler-based alternatives that have been proposed. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-21 0:29 ` Steven Rostedt 2015-10-26 20:19 ` Chris Metcalf @ 2015-10-26 21:13 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-26 21:13 UTC (permalink / raw) To: Steven Rostedt, Andy Lutomirski Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 10/20/2015 08:29 PM, Steven Rostedt wrote: > Also, doesn't RCU need to have a few ticks go by before it can safely > disable itself from userspace? I recall something like that. Paul? The current patch series supports that by testing tick_nohz_tick_stopped(), which internally only becomes true after tick_nohz_stop_sched_tick() manages to stop the tick, and it won't if rcu_needs_cpu() is true. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-20 21:26 ` Andy Lutomirski [not found] ` <CALCETrX6e+mqfy-rNV3sA8xGVDNHviQ9vHBBhAPULeLecno7XQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-10-26 20:32 ` Chris Metcalf 1 sibling, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-26 20:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 10/20/2015 05:26 PM, Andy Lutomirski wrote: >> Even more importantly, we rely on >> rescheduling to take care of the fact that the scheduler tick may still >> be running, and therefore loop back to the schedule() call that's run >> when TIF_NEED_RESCHED gets set. > This just seems like a mis-design. We don't know why the scheduler > tick is on, so we're just going to reschedule until the problem goes > away? See my previous email about polite vs aggressive design for more thoughts on this, but, yes. I'm not sure there's a way to do anything else, other than my proposal there to dig deep into the scheduler and allow it to switch to idle for a few tasks - but again, I'm just not sure the complexity is worth the runtime power savings. >>> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? >> >> So a scheduler class is an interesting idea certainly, although not >> one I know immediately how to implement. I'm not sure whether >> it makes sense to require a user be root or have a suitable rtprio >> rlimit, but perhaps so. The nice thing about the current patch >> series is that you can affinitize yourself to a nohz_full core and >> declare that you want to run task-isolated, and none of that >> requires root nor really is there a reason it should. 
> Your patches more or less implement "don't run me unless I'm > isolated". A scheduler class would be more like "isolate me (and > maybe make me super high priority so it actually happens)". > > I'm not a scheduler person, so I don't know. But "don't run me unless > I'm isolated" seems like a design that will, at best, only ever work > by dumb luck. You have to disable migration, avoid other runnable > tasks, hope that the kernel keeps working the way it did when you > wrote the patch, hope you continue to get lucky enough that you ever > get to user mode in the first place, etc. Could you explain the "dumb luck" characterization a bit more? You're definitely right that I need to test for isolcpus separately now that it's been decoupled from nohz_full again, so I will add that to the next release of the series. But the rest of it seems like things you just control for when you are running the application, and if you do it right, the application runs. If you don't (e.g. you intentionally schedule multiple processes on the same core), the app doesn't run, and you fix it in development. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support [not found] ` <1445373372-6567-5-git-send-email-cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2015-10-21 16:12 ` Frederic Weisbecker 2015-10-27 16:40 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2015-10-21 16:12 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: > diff --git a/kernel/isolation.c b/kernel/isolation.c > new file mode 100644 > index 000000000000..9a73235db0bb > --- /dev/null > +++ b/kernel/isolation.c > @@ -0,0 +1,78 @@ > +/* > + * linux/kernel/isolation.c > + * > + * Implementation for task isolation. > + * > + * Distributed under GPLv2. > + */ > + > +#include <linux/mm.h> > +#include <linux/swap.h> > +#include <linux/vmstat.h> > +#include <linux/isolation.h> > +#include <linux/syscalls.h> > +#include "time/tick-sched.h" > + > +/* > + * This routine controls whether we can enable task-isolation mode. > + * The task must be affinitized to a single nohz_full core or we will > + * return EINVAL. Although the application could later re-affinitize > + * to a housekeeping core and lose task isolation semantics, this > + * initial test should catch 99% of bugs with task placement prior to > + * enabling task isolation. > + */ > +int task_isolation_set(unsigned int flags) > +{ > + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || I think you'll have to make sure the task can not be concurrently reaffined to more CPUs. This may involve setting task_isolation_flags under the runqueue lock and thus move that tiny part to the scheduler code. 
And then we must forbid changing the affinity while the task has the isolation flag, or deactivate the flag. In any case this needs some synchronization. > + !tick_nohz_full_cpu(smp_processor_id())) > + return -EINVAL; > + > + current->task_isolation_flags = flags; > + return 0; > +} > + > +/* > + * In task isolation mode we try to return to userspace only after > + * attempting to make sure we won't be interrupted again. To handle > + * the periodic scheduler tick, we test to make sure that the tick is > + * stopped, and if it isn't yet, we request a reschedule so that if > + * another task needs to run to completion first, it can do so. > + * Similarly, if any other subsystems require quiescing, we will need > + * to do that before we return to userspace. > + */ > +bool _task_isolation_ready(void) > +{ > + WARN_ON_ONCE(!irqs_disabled()); > + > + /* If we need to drain the LRU cache, we're not ready. */ > + if (lru_add_drain_needed(smp_processor_id())) > + return false; > + > + /* If vmstats need updating, we're not ready. */ > + if (!vmstat_idle()) > + return false; > + > + /* If the tick is running, request rescheduling; we're not ready. */ > + if (!tick_nohz_tick_stopped()) { Note that this function tells whether the tick is in dynticks mode, which means the tick currently only run on-demand. But it's not necessarily completely stopped. I think we should rename that function and the field it refers to. > + set_tsk_need_resched(current); > + return false; > + } > + > + return true; > +} Thanks. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-21 16:12 ` Frederic Weisbecker @ 2015-10-27 16:40 ` Chris Metcalf [not found] ` <562FA8FD.8080502-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-10-27 16:40 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: > On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: >> +/* >> + * This routine controls whether we can enable task-isolation mode. >> + * The task must be affinitized to a single nohz_full core or we will >> + * return EINVAL. Although the application could later re-affinitize >> + * to a housekeeping core and lose task isolation semantics, this >> + * initial test should catch 99% of bugs with task placement prior to >> + * enabling task isolation. >> + */ >> +int task_isolation_set(unsigned int flags) >> +{ >> + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || > I think you'll have to make sure the task can not be concurrently reaffined > to more CPUs. This may involve setting task_isolation_flags under the runqueue > lock and thus move that tiny part to the scheduler code. And then we must forbid > changing the affinity while the task has the isolation flag, or deactivate the flag. > > In any case this needs some synchronization. Well, as the comment says, this is not intended as a hard guarantee. As written, it might race with a concurrent sched_setaffinity(), but then again, it also is totally OK as written for sched_setaffinity() to change it away after the prctl() is complete, so it's not necessary to do any explicit synchronization. 
This harks back again to the whole "polite vs aggressive" issue with how we envision task isolation. The "polite" model basically allows you to set up the conditions for task isolation to be useful, and then if they are useful, great! What you're suggesting here is a bit more of the "aggressive" model, where we actually fail sched_setaffinity() either for any cpumask after task isolation is set, or perhaps just for resetting it to housekeeping cores. (Note that we could in principle use PF_NO_SETAFFINITY to just hard fail all attempts to call sched_setaffinity once we enable task isolation, so we don't have to add more mechanism on that path.) I'm a little reluctant to ever fail sched_setaffinity() based on the task isolation status with the current "polite" model, since an unprivileged application can set up for task isolation, and then presumably no one can override it via sched_setaffinity() from another task. (I suppose you could do some kind of permissions-based thing where root can always override it, or some suitable capability, etc., but I feel like that gets complicated quickly, for little benefit.) The alternative you mention is that if the task is re-affinitized, it loses its task-isolation status, and that also seems like an unfortunate API, since if you are setting it with prctl(), it's really cleanest just to only be able to unset it with prctl() as well. I think given the current "polite" API, the only question is whether in fact *no* initial test is the best thing, or if an initial test (as introduced in the v8 version) is defensible just as a help for catching an obvious mistake in setting up your task isolation. I decided the advantage of catching the mistake were more important than the "API purity" of being 100% consistent in how we handled the interactions between affinity and isolation, but I am certainly open to argument on that one. Meanwhile I think it still feels like the v8 code is the best compromise. 
>> + /* If the tick is running, request rescheduling; we're not ready. */ >> + if (!tick_nohz_tick_stopped()) { > Note that this function tells whether the tick is in dynticks mode, which means > the tick currently only run on-demand. But it's not necessarily completely stopped. I think in fact this is the semantics we want (and that people requested), e.g. if the user requests an alarm(), we may still be ticking even though tick_nohz_tick_stopped() is true, but that test is still the right condition to use to return to user space, since the user explicitly requested the alarm. > I think we should rename that function and the field it refers to. Sounds like a good idea. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support [not found] ` <562FA8FD.8080502-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> @ 2016-01-28 16:38 ` Frederic Weisbecker 2016-02-11 19:58 ` Chris Metcalf 0 siblings, 1 reply; 159+ messages in thread From: Frederic Weisbecker @ 2016-01-28 16:38 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote: > On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: > >On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: > >>+/* > >>+ * This routine controls whether we can enable task-isolation mode. > >>+ * The task must be affinitized to a single nohz_full core or we will > >>+ * return EINVAL. Although the application could later re-affinitize > >>+ * to a housekeeping core and lose task isolation semantics, this > >>+ * initial test should catch 99% of bugs with task placement prior to > >>+ * enabling task isolation. > >>+ */ > >>+int task_isolation_set(unsigned int flags) > >>+{ > >>+ if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || > >I think you'll have to make sure the task can not be concurrently reaffined > >to more CPUs. This may involve setting task_isolation_flags under the runqueue > >lock and thus move that tiny part to the scheduler code. And then we must forbid > >changing the affinity while the task has the isolation flag, or deactivate the flag. > > > >In any case this needs some synchronization. > > Well, as the comment says, this is not intended as a hard guarantee. 
> As written, it might race with a concurrent sched_setaffinity(), but > then again, it also is totally OK as written for sched_setaffinity() to > change it away after the prctl() is complete, so it's not necessary to > do any explicit synchronization. > > This harks back again to the whole "polite vs aggressive" issue with > how we envision task isolation. > > The "polite" model basically allows you to set up the conditions for > task isolation to be useful, and then if they are useful, great! What > you're suggesting here is a bit more of the "aggressive" model, where > we actually fail sched_setaffinity() either for any cpumask after > task isolation is set, or perhaps just for resetting it to housekeeping > cores. (Note that we could in principle use PF_NO_SETAFFINITY to > just hard fail all attempts to call sched_setaffinity once we enable > task isolation, so we don't have to add more mechanism on that path.) > > I'm a little reluctant to ever fail sched_setaffinity() based on the > task isolation status with the current "polite" model, since an > unprivileged application can set up for task isolation, and then > presumably no one can override it via sched_setaffinity() from another > task. (I suppose you could do some kind of permissions-based thing > where root can always override it, or some suitable capability, etc., > but I feel like that gets complicated quickly, for little benefit.) > > The alternative you mention is that if the task is re-affinitized, it > loses its task-isolation status, and that also seems like an unfortunate > API, since if you are setting it with prctl(), it's really cleanest just to > only be able to unset it with prctl() as well. > > I think given the current "polite" API, the only question is whether in > fact *no* initial test is the best thing, or if an initial test (as > introduced > in the v8 version) is defensible just as a help for catching an obvious > mistake in setting up your task isolation. 
I decided the advantage > of catching the mistake were more important than the "API purity" > of being 100% consistent in how we handled the interactions between > affinity and isolation, but I am certainly open to argument on that one. > > Meanwhile I think it still feels like the v8 code is the best compromise. So what is the way to deal with a migration for example? When the task wakes up on the non-isolated CPU, it gets warned or killed? > > >>+ /* If the tick is running, request rescheduling; we're not ready. */ > >>+ if (!tick_nohz_tick_stopped()) { > >Note that this function tells whether the tick is in dynticks mode, which means > >the tick currently only run on-demand. But it's not necessarily completely stopped. > > I think in fact this is the semantics we want (and that people requested), > e.g. if the user requests an alarm(), we may still be ticking even though > tick_nohz_tick_stopped() is true, but that test is still the right condition > to use to return to user space, since the user explicitly requested the > alarm. It seems to break the initial purpose. If your task really doesn't want to be disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() is really no other indication than the CPU trying to do its best to delay the next tick. But that next tick could be re-armed every two msecs for example. Worse yet, if the tick has been stopped and finally issues a timer that rearms itself every 1 msec, tick_nohz_tick_stopped() will still be true. Thanks. > > >I think we should rename that function and the field it refers to. > > Sounds like a good idea. > > -- > Chris Metcalf, EZChip Semiconductor > http://www.ezchip.com > ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2016-01-28 16:38 ` Frederic Weisbecker @ 2016-02-11 19:58 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2016-02-11 19:58 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 01/28/2016 11:38 AM, Frederic Weisbecker wrote: > On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote: >> On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: >>> On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: >>>> +/* >>>> + * This routine controls whether we can enable task-isolation mode. >>>> + * The task must be affinitized to a single nohz_full core or we will >>>> + * return EINVAL. Although the application could later re-affinitize >>>> + * to a housekeeping core and lose task isolation semantics, this >>>> + * initial test should catch 99% of bugs with task placement prior to >>>> + * enabling task isolation. >>>> + */ >>>> +int task_isolation_set(unsigned int flags) >>>> +{ >>>> + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || >>> I think you'll have to make sure the task can not be concurrently reaffined >>> to more CPUs. This may involve setting task_isolation_flags under the runqueue >>> lock and thus move that tiny part to the scheduler code. And then we must forbid >>> changing the affinity while the task has the isolation flag, or deactivate the flag. >>> >>> In any case this needs some synchronization. >> Well, as the comment says, this is not intended as a hard guarantee. 
>> As written, it might race with a concurrent sched_setaffinity(), but >> then again, it also is totally OK as written for sched_setaffinity() to >> change it away after the prctl() is complete, so it's not necessary to >> do any explicit synchronization. >> >> This harks back again to the whole "polite vs aggressive" issue with >> how we envision task isolation. >> >> The "polite" model basically allows you to set up the conditions for >> task isolation to be useful, and then if they are useful, great! What >> you're suggesting here is a bit more of the "aggressive" model, where >> we actually fail sched_setaffinity() either for any cpumask after >> task isolation is set, or perhaps just for resetting it to housekeeping >> cores. (Note that we could in principle use PF_NO_SETAFFINITY to >> just hard fail all attempts to call sched_setaffinity once we enable >> task isolation, so we don't have to add more mechanism on that path.) >> >> I'm a little reluctant to ever fail sched_setaffinity() based on the >> task isolation status with the current "polite" model, since an >> unprivileged application can set up for task isolation, and then >> presumably no one can override it via sched_setaffinity() from another >> task. (I suppose you could do some kind of permissions-based thing >> where root can always override it, or some suitable capability, etc., >> but I feel like that gets complicated quickly, for little benefit.) >> >> The alternative you mention is that if the task is re-affinitized, it >> loses its task-isolation status, and that also seems like an unfortunate >> API, since if you are setting it with prctl(), it's really cleanest just to >> only be able to unset it with prctl() as well. 
>> >> I think given the current "polite" API, the only question is whether in >> fact *no* initial test is the best thing, or if an initial test (as >> introduced >> in the v8 version) is defensible just as a help for catching an obvious >> mistake in setting up your task isolation. I decided the advantage >> of catching the mistake was more important than the "API purity" >> of being 100% consistent in how we handled the interactions between >> affinity and isolation, but I am certainly open to argument on that one. >> >> Meanwhile I think it still feels like the v8 code is the best compromise. > So what is the way to deal with a migration for example? When the task wakes > up on the non-isolated CPU, it gets warned or killed? Good question! We can only enable task isolation on an isolcpus core, so it must be a manual migration, either externally, or by the program itself calling sched_setaffinity(). So at some level, it's just an application bug. In the current code, if you have enabled STRICT mode task isolation, the process will get killed since it has to go through the kernel to migrate. If not in STRICT mode, then it will hang until it is manually killed since full dynticks will never get turned on once it wakes up on a non-isolated CPU - unless it is then manually migrated back to a proper task-isolation cpu. And, perhaps the intent was to do some cpu offlining and rearrange the task isolation tasks, and therefore that makes sense? So, maybe that semantics is good enough!? I'm not completely sure, but I think I'm willing to claim that for something this much of a corner case, it's probably reasonable. >>>> + /* If the tick is running, request rescheduling; we're not ready. */ >>>> + if (!tick_nohz_tick_stopped()) { >>> Note that this function tells whether the tick is in dynticks mode, which means >>> the tick currently only runs on-demand. But it's not necessarily completely stopped.
>> I think in fact this is the semantics we want (and that people requested), >> e.g. if the user requests an alarm(), we may still be ticking even though >> tick_nohz_tick_stopped() is true, but that test is still the right condition >> to use to return to user space, since the user explicitly requested the >> alarm. > It seems to break the initial purpose. If your task really doesn't want to be > disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() is really no > other indication than the CPU trying to do its best to delay the next tick. But > that next tick could be re-armed every two msecs for example. Worse yet, if the > tick has been stopped and finally issues a timer that rearms itself every 1 msec, > tick_nohz_tick_stopped() will still be true. This is definitely another grey area. Certainly if there's a kernel timer that rearms itself every 1 ms, we're in trouble. (And the existing mechanisms of STRICT mode and task_isolation_debug would help.) But as far as just regular userspace arming a timer via syscall, then if your hardware had a 64-bit down counter for timer interrupts, for example, you might well be able to do something like say "every night at midnight, I can stop driving packets and do system maintenance, so I'd like the kernel to interrupt me". In this case some kind of alarm() would not be incompatible with task isolation. I admit this is kind of an extreme case; and certainly in STRICT mode, as currently written, you'd get a signal if you tried to do this, so you'd have to run with STRICT mode off. However, the reason I specifically decided to do this is community feedback. In http://lkml.kernel.org/r/CALCETrVdZxkEeQd3=V6p_yLYL7T83Y3WfnhfVGi3GwTxF+vPQg@mail.gmail.com, on 9/28/2015, Andy Lutomirski wrote: > Why are we treating alarms as something that should defer entry to > userspace? I think it would be entirely reasonable to set an alarm > for ten minutes, ask for isolation, and then think hard for ten > minutes. > > [...] 
> > ISTM something's suboptimal with the inner workings of all this if > task_isolation_enter needs to sleep to wait for an event that isn't > scheduled for the immediate future (e.g. already queued up as an > interrupt). -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* [PATCH v8 05/14] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-10-20 20:35 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf 2015-10-20 20:36 ` [PATCH v8 04/14] task_isolation: add initial support Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf 2015-10-20 20:36 ` [PATCH v8 06/14] task_isolation: provide strict mode configurable signal Chris Metcalf 2015-10-21 12:39 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Peter Zijlstra 3 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal; this is defined as happening immediately before the SECCOMP test. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/isolation.h | 21 +++++++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/isolation.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 64 insertions(+) diff --git a/include/linux/isolation.h b/include/linux/isolation.h index 4bef90024924..dc14057a359c 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -29,10 +29,31 @@ static inline void task_isolation_enter(void) _task_isolation_enter(); } +extern bool task_isolation_syscall(int nr); +extern bool task_isolation_exception(const char *fmt, ...); + +static inline bool task_isolation_strict(void) +{ + return (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)); +} + +#define task_isolation_check_syscall(nr) \ + (task_isolation_strict() && \ + task_isolation_syscall(nr)) + +#define task_isolation_check_exception(fmt, ...) \ + (task_isolation_strict() && \ + task_isolation_exception(fmt, ## __VA_ARGS__)) + #else static inline bool task_isolation_enabled(void) { return false; } static inline bool task_isolation_ready(void) { return true; } static inline void task_isolation_enter(void) { } +static inline bool task_isolation_check_syscall(int nr) { return false; } +static inline bool task_isolation_check_exception(const char *fmt, ...) 
{ return false; } #endif #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 67224df4b559..2b8038b0d1e1 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -201,5 +201,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 48 #define PR_GET_TASK_ISOLATION 49 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index 9a73235db0bb..30db40098a35 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -11,6 +11,7 @@ #include <linux/vmstat.h> #include <linux/isolation.h> #include <linux/syscalls.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -76,3 +77,44 @@ void _task_isolation_enter(void) /* Quieten the vmstat worker so it won't interrupt us. */ quiet_vmstat(); } + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +bool task_isolation_exception(const char *fmt, ...) +{ + va_list args; + char buf[100]; + + /* RCU should have been enabled prior to this point. */ + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); + + va_start(args, fmt); + vsnprintf(buf, sizeof(buf), fmt, args); + va_end(args); + + pr_warn("%s/%d: task_isolation strict mode violated by %s\n", + current->comm, current->pid, buf); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); + + return true; +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +bool task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return false; + } + + return task_isolation_exception("syscall %d", syscall); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-20 20:35 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf 2015-10-20 20:36 ` [PATCH v8 04/14] task_isolation: add initial support Chris Metcalf 2015-10-20 20:36 ` [PATCH v8 05/14] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf 2015-10-21 0:56 ` Steven Rostedt 2015-10-21 12:39 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Peter Zijlstra 3 siblings, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a task_isolation process in STRICT mode does a syscall or otherwise synchronously enters the kernel. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/isolation.c | 9 ++++++++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 2b8038b0d1e1..a5582ace987f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -202,5 +202,7 @@ struct prctl_mm_map { #define PR_GET_TASK_ISOLATION 49 # define PR_TASK_ISOLATION_ENABLE (1 << 0) # define PR_TASK_ISOLATION_STRICT (1 << 1) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index 30db40098a35..0fa13b081bb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -84,8 +84,10 @@ void _task_isolation_enter(void) */ bool task_isolation_exception(const char *fmt, ...) { + siginfo_t info = {}; va_list args; char buf[100]; + int sig; /* RCU should have been enabled prior to this point. */ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); @@ -97,7 +99,12 @@ bool task_isolation_exception(const char *fmt, ...) pr_warn("%s/%d: task_isolation strict mode violated by %s\n", current->comm, current->pid, buf); current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); + if (sig == 0) + sig = SIGKILL; + info.si_signo = sig; + send_sig_info(sig, &info, current); return true; } -- 2.1.2 ^ permalink raw reply related [flat|nested] 159+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-20 20:36 ` [PATCH v8 06/14] task_isolation: provide strict mode configurable signal Chris Metcalf @ 2015-10-21 0:56 ` Steven Rostedt [not found] ` <20151020205610.51b3d742-2kNGR76GQU9OHLTnHDQRgA@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Steven Rostedt @ 2015-10-21 0:56 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Tue, 20 Oct 2015 16:36:04 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > Allow userspace to override the default SIGKILL delivered > when a task_isolation process in STRICT mode does a syscall > or otherwise synchronously enters the kernel. > Is this really a good idea? This means that there's no way to terminate a task in this mode, even if it goes astray. -- Steve ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <20151020205610.51b3d742-2kNGR76GQU9OHLTnHDQRgA@public.gmane.org>]
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal [not found] ` <20151020205610.51b3d742-2kNGR76GQU9OHLTnHDQRgA@public.gmane.org> @ 2015-10-21 1:30 ` Chris Metcalf 2015-10-21 1:41 ` Steven Rostedt 2015-10-21 1:42 ` Andy Lutomirski 0 siblings, 2 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-21 1:30 UTC (permalink / raw) To: Steven Rostedt Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 10/20/2015 8:56 PM, Steven Rostedt wrote: > On Tue, 20 Oct 2015 16:36:04 -0400 > Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > >> Allow userspace to override the default SIGKILL delivered >> when a task_isolation process in STRICT mode does a syscall >> or otherwise synchronously enters the kernel. >> > Is this really a good idea? This means that there's no way to terminate > a task in this mode, even if it goes astray. It doesn't map SIGKILL to some other signal unconditionally. It just allows the "hey, you broke the STRICT contract and entered the kernel" signal to be something besides the default SIGKILL. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 1:30 ` Chris Metcalf @ 2015-10-21 1:41 ` Steven Rostedt 2015-10-21 1:42 ` Andy Lutomirski 1 sibling, 0 replies; 159+ messages in thread From: Steven Rostedt @ 2015-10-21 1:41 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Tue, 20 Oct 2015 21:30:36 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/20/2015 8:56 PM, Steven Rostedt wrote: > > On Tue, 20 Oct 2015 16:36:04 -0400 > > Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > >> Allow userspace to override the default SIGKILL delivered > >> when a task_isolation process in STRICT mode does a syscall > >> or otherwise synchronously enters the kernel. > >> > > Is this really a good idea? This means that there's no way to terminate > > a task in this mode, even if it goes astray. > > It doesn't map SIGKILL to some other signal unconditionally. It just allows > the "hey, you broke the STRICT contract and entered the kernel" signal > to be something besides the default SIGKILL. > Ah, I misread the change log. Now looking at the actual code, it makes sense. Sorry for the noise ;-) -- Steve ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 1:30 ` Chris Metcalf 2015-10-21 1:41 ` Steven Rostedt @ 2015-10-21 1:42 ` Andy Lutomirski [not found] ` <CALCETrXqDi24EPn79X9SXuz+5sYGZBF3yCRzb8PwdL=YbxVujw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-10-21 1:42 UTC (permalink / raw) To: Chris Metcalf Cc: Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/20/2015 8:56 PM, Steven Rostedt wrote: >> >> On Tue, 20 Oct 2015 16:36:04 -0400 >> Chris Metcalf <cmetcalf@ezchip.com> wrote: >> >>> Allow userspace to override the default SIGKILL delivered >>> when a task_isolation process in STRICT mode does a syscall >>> or otherwise synchronously enters the kernel. >>> >> Is this really a good idea? This means that there's no way to terminate >> a task in this mode, even if it goes astray. > > > It doesn't map SIGKILL to some other signal unconditionally. It just allows > the "hey, you broke the STRICT contract and entered the kernel" signal > to be something besides the default SIGKILL. > ...which has the odd side effect that sending a non-fatal signal from another process will cause the strict process to enter the kernel and receive an extra signal. I still dislike this thing. It seems like a debugging feature being implemented using signals instead of existing APIs. I *still* don't see why perf can't be used to accomplish your goal. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <CALCETrXqDi24EPn79X9SXuz+5sYGZBF3yCRzb8PwdL=YbxVujw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* RE: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal [not found] ` <CALCETrXqDi24EPn79X9SXuz+5sYGZBF3yCRzb8PwdL=YbxVujw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-10-21 6:41 ` Gilad Ben Yossef 2015-10-21 18:53 ` Andy Lutomirski 0 siblings, 1 reply; 159+ messages in thread From: Gilad Ben Yossef @ 2015-10-21 6:41 UTC (permalink / raw) To: Andy Lutomirski, Chris Metcalf Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > From: Andy Lutomirski [mailto:luto@amacapital.net] > Sent: Wednesday, October 21, 2015 4:43 AM > To: Chris Metcalf > Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode > configurable signal > > On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> > wrote: > > On 10/20/2015 8:56 PM, Steven Rostedt wrote: > >> > >> On Tue, 20 Oct 2015 16:36:04 -0400 > >> Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> > >>> Allow userspace to override the default SIGKILL delivered > >>> when a task_isolation process in STRICT mode does a syscall > >>> or otherwise synchronously enters the kernel. > >>> <snip> > > > > It doesn't map SIGKILL to some other signal unconditionally. It just allows > > the "hey, you broke the STRICT contract and entered the kernel" signal > > to be something besides the default SIGKILL. > > > <snip> > > I still dislike this thing. It seems like a debugging feature being > implemented using signals instead of existing APIs. I *still* don't > see why perf can't be used to accomplish your goal. > It is not (just) a debugging feature. There are workloads where not performing an action is much preferred to being late.
Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter. The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter auto stops. The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned context switch to kernel), what you want to do is just stop in place rather than fire the alpha emitter X nanoseconds too late. This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once. For code where isolation is important, the correctness of a calculation is dependent on timing. It's like you would expect the kernel to kill a task if it read from an unmapped virtual address rather than returning garbage data. With an isolated task, the right data acted on later than you think is garbage just the same. I hope this sheds some light on the issue. Thanks, Gilad ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 6:41 ` Gilad Ben Yossef @ 2015-10-21 18:53 ` Andy Lutomirski 2015-10-22 20:44 ` Chris Metcalf [not found] ` <CALCETrVuE_VCk-7_VMJ-orL8pg+0F5vq6qvt4SfgXzt_MRr-SQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 2 replies; 159+ messages in thread From: Andy Lutomirski @ 2015-10-21 18:53 UTC (permalink / raw) To: Gilad Ben Yossef Cc: Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> wrote: > > >> From: Andy Lutomirski [mailto:luto@amacapital.net] >> Sent: Wednesday, October 21, 2015 4:43 AM >> To: Chris Metcalf >> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >> configurable signal >> >> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >> > On 10/20/2015 8:56 PM, Steven Rostedt wrote: >> >> >> >> On Tue, 20 Oct 2015 16:36:04 -0400 >> >> Chris Metcalf <cmetcalf@ezchip.com> wrote: >> >> >> >>> Allow userspace to override the default SIGKILL delivered >> >>> when a task_isolation process in STRICT mode does a syscall >> >>> or otherwise synchronously enters the kernel. >> >>> > <snip> >> > >> > It doesn't map SIGKILL to some other signal unconditionally. It just allows >> > the "hey, you broke the STRICT contract and entered the kernel" signal >> > to be something besides the default SIGKILL. >> > >> > > <snip> >> >> I still dislike this thing. It seems like a debugging feature being >> implemented using signals instead of existing APIs. I *still* don't >> see why perf can't be used to accomplish your goal. >> > > It is not (just) a debugging feature. 
There are workloads where not performing an action is much preferred to being late. > > Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter. > The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an > MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter > auto stops. > > The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in > such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned > context switch to kernel), what you want to do is just stop in place > rather than fire the alpha emitter X nanoseconds too late. > > This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once. That's a fair point. It's risky, though, for quite a few reasons. 1. If someone builds an alpha emitter like this, they did it wrong. The kernel should write a trigger *and* a timestamp to the hardware and the hardware should trigger at the specified time if the time is in the future and throw an error if it's in the past. If you need to check that you made the deadline, check the actual desired condition (did you meet the deadline?) not a proxy (did the signal fire?). 2. This strict mode thing isn't exhaustive. It's missing, at least, coverage for NMI, MCE, and SMI. Sure, you can think that you've disabled all NMI sources, you can try to remember to set the appropriate boot flag that panics on MCE (and hope that you don't get screwed by broadcast MCE on Intel systems before it got fixed (Skylake? Is the fix even available in a released chip?), and, for SMI, good luck... 3. You haven't dealt with IPIs.
The TLB flush code in particular seems like it will break all your assumptions. Maybe it would make sense to whack more of the moles before adding a big assertion that there aren't any moles any more. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 18:53 ` Andy Lutomirski @ 2015-10-22 20:44 ` Chris Metcalf 2015-10-22 21:00 ` Andy Lutomirski [not found] ` <CALCETrVuE_VCk-7_VMJ-orL8pg+0F5vq6qvt4SfgXzt_MRr-SQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 159+ messages in thread From: Chris Metcalf @ 2015-10-22 20:44 UTC (permalink / raw) To: Andy Lutomirski, Gilad Ben Yossef Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On 10/21/2015 02:53 PM, Andy Lutomirski wrote: > On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> wrote: >> >>> From: Andy Lutomirski [mailto:luto@amacapital.net] >>> Sent: Wednesday, October 21, 2015 4:43 AM >>> To: Chris Metcalf >>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>> configurable signal >>> >>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >>> wrote: >>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>>> >>>>>> Allow userspace to override the default SIGKILL delivered >>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>> or otherwise synchronously enters the kernel. >>>>>> >> <snip> >>>> It doesn't map SIGKILL to some other signal unconditionally. It just allows >>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>> to be something besides the default SIGKILL. >>>> >> <snip> >>> I still dislike this thing. It seems like a debugging feature being >>> implemented using signals instead of existing APIs. I *still* don't >>> see why perf can't be used to accomplish your goal. >>> >> It is not (just) a debugging feature. 
There are workloads where not performing an action is much preferred to being late. >> >> Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter. >> The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an >> MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter >> auto stops. >> >> The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in >> such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned >> context switch to kernel), what you want to do is just stop in place >> rather than fire the alpha emitter X nanoseconds too late. >> >> This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once. > That's a fair point. It's risky, though, for quite a few reasons. > > 1. If someone builds an alpha emitter like this, they did it wrong. > The kernel should write a trigger *and* a timestamp to the hardware > and the hardware should trigger at the specified time if the time is > in the future and throw an error if it's in the past. If you need to > check that you made the deadline, check the actual desired condition > (did you meet the deadline?) not a proxy (did the signal fire?). Definitely a better hardware design, but as we all know, hardware designers too rarely consult the software people who have to write the actual code to properly drive the hardware :-) My canonical example is high-performance userspace network drivers, and though dropping a packet is less likely to kill a patient, it's still a pretty bad thing if you're trying to design a robust appliance.
In this case you really want to fix application bugs that cause the code to enter the kernel when you think you're in the internal loop running purely in userspace. Things like unexpected page faults, and third-party code that almost never calls the kernel but in some dusty corner it occasionally does, can screw up your userspace code pretty badly, and mysteriously. The "strict" mode support is not a hypothetical insurance policy but a reaction to lots of Tilera customer support over the years to folks failing to stay in userspace when they thought they were doing the right thing. > 2. This strict mode thing isn't exhaustive. It's missing, at least, > coverage for NMI, MCE, and SMI. Sure, you can think that you've > disabled all NMI sources, you can try to remember to set the > appropriate boot flag that panics on MCE (and hope that you don't get > screwed by broadcast MCE on Intel systems before it got fixed > (Skylake? Is the fix even available in a released chip?), and, for > SMI, good luck... You are confusing this strict mode support with the debug support in patch 07/14. Strict mode is for synchronous application errors. You might be right that there are cases that haven't been covered, but certainly most of them are covered on the three platforms that are supported in this initial series. (You pointed me to one that I would have missed on x86, namely the bounds check exception from a bad bounds setup.) I'm pretty confident I have all of them for tile, since I know that hardware best, and I think we're in good shape for arm64, though I'm still coming up to speed on that architecture. NMIs and machine checks are asynchronous interrupts that don't have to do with what the application is doing, more or less. Those should not be delivered to task-isolation cores at all, so we just generate console spew when you set the task_isolation_debug boot option. 
I honestly don't know enough about system management interrupts to comment on that, though again, I would hope one can configure the system to just not deliver them to nohz_full cores, and I think it would be reasonable to generate some kernel spew if that happens. > 3. You haven't dealt with IPIs. The TLB flush code in particular > seems like it will break all your assumptions. Again, not a synchronous application error that we are trying to catch with this signalling mechanism. That said it could obviously be a more general application error (e.g. a process with threads on both nohz_full and housekeeping cores, where the housekeeping core unmaps some memory and thus requires a TLB flush IPI). But this is covered by the task_isolation_debug patch for kernel/smp.c. > Maybe it would make sense to whack more of the moles before adding a > big assertion that there aren't any moles any more. Maybe, but I've whacked the ones I know how to whack. If there are ones I've missed I'm happy to add them in a subsequent version of this series, or in follow-on patches. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-22 20:44 ` Chris Metcalf @ 2015-10-22 21:00 ` Andy Lutomirski [not found] ` <CALCETrVQXwYwhEwbJsvN18w8qD-qVVCQAa8b9RcXD=RmXSqLiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 159+ messages in thread From: Andy Lutomirski @ 2015-10-22 21:00 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org, Linux API, linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/21/2015 02:53 PM, Andy Lutomirski wrote: >> >> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> >> wrote: >>> >>> >>>> From: Andy Lutomirski [mailto:luto@amacapital.net] >>>> Sent: Wednesday, October 21, 2015 4:43 AM >>>> To: Chris Metcalf >>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>>> configurable signal >>>> >>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >>>> wrote: >>>>> >>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>>> >>>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>>>> >>>>>>> Allow userspace to override the default SIGKILL delivered >>>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>>> or otherwise synchronously enters the kernel. >>>>>>> >>> <snip> >>>>> >>>>> It doesn't map SIGKILL to some other signal unconditionally. It just >>>>> allows >>>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>>> to be something besides the default SIGKILL. >>>>> >>> <snip> >>>> >>>> I still dislike this thing. It seems like a debugging feature being >>>> implemented using signals instead of existing APIs. 
I *still* don't >>>> see why perf can't be used to accomplish your goal. >>>> >>> It is not (just) a debugging feature. There are workloads where not >>> performing an action is much preferred to being late. >>> >>> Consider the following artificial but representative scenario: a task >>> running in strict isolation is controlling a radiotherapy alpha emitter. >>> The code runs in a tight event loop, reading an MMIO register with >>> location data, making some calculation and in response writing an >>> MMIO register that triggers the alpha emitter. As a safety measure, each >>> trigger is for a specific very short time frame - the alpha emitter >>> auto stops. >>> >>> The code has a strict assumption that no more than X cycles pass between >>> reading the value and the response and the system is built in >>> such a way that as long as the code has mastery of the CPU the assumption >>> holds true. If something breaks this assumption (unplanned >>> context switch to kernel), what you want to do is just stop in place >>> rather than fire the alpha emitter X nanoseconds too late. >>> >>> This feature lets you say: if the "contract" of isolation is broken, >>> notify/kill me at once. >> That's a fair point. It's risky, though, for quite a few reasons. >> >> 1. If someone builds an alpha emitter like this, they did it wrong. >> The kernel should write a trigger *and* a timestamp to the hardware >> and the hardware should trigger at the specified time if the time is >> in the future and throw an error if it's in the past. If you need to >> check that you made the deadline, check the actual desired condition >> (did you meet the deadline?) not a proxy (did the signal fire?). > > Definitely a better hardware design, but as we all know, hardware > designers too rarely consult the software people who have to > write the actual code to properly drive the hardware :-) > > My canonical example is high-performance userspace network > drivers, and though dropping a packet is less likely to kill a > patient, it's still a pretty bad thing if you're trying to design > a robust appliance. In this case you really want to fix application > bugs that cause the code to enter the kernel when you think > you're in the internal loop running purely in userspace. Things > like unexpected page faults, and third-party code that almost > never calls the kernel but in some dusty corner it occasionally > does, can screw up your userspace code pretty badly, and > mysteriously. The "strict" mode support is not a hypothetical > insurance policy but a reaction to lots of Tilera customer support > over the years to folks failing to stay in userspace when they > thought they were doing the right thing. But this is *exactly* the case where perf or other out-of-band debugging could be a much better solution. Perf could notify a non-isolated thread that an interrupt happened, you'd still drop a packet or two, but you wouldn't also drop the next ten thousand packets while handling the signal. > >> 2. This strict mode thing isn't exhaustive. It's missing, at least, >> coverage for NMI, MCE, and SMI. Sure, you can think that you've >> disabled all NMI sources, you can try to remember to set the >> appropriate boot flag that panics on MCE (and hope that you don't get >> screwed by broadcast MCE on Intel systems before it got fixed >> (Skylake? Is the fix even available in a released chip?), and, for >> SMI, good luck... > > You are confusing this strict mode support with the debug > support in patch 07/14. Nope. I'm confusing this strict mode with what Gilad described: using strict mode to cause outright shutdown instead of failure to meet a deadline.
(FWIW, you could also use an ordinary hardware watchdog timer to promote your failure to meet a deadline to a shutdown. No new kernel support needed.) > > Strict mode is for synchronous application errors. You might > be right that there are cases that haven't been covered, but > certainly most of them are covered on the three platforms that > are supported in this initial series. (You pointed me to one > that I would have missed on x86, namely the bounds check > exception from a bad bounds setup.) I'm pretty confident I > have all of them for tile, since I know that hardware best, > and I think we're in good shape for arm64, though I'm still > coming up to speed on that architecture. Again, for this definition of strict mode, I still don't see why it's the right design. If you want to debug your application to detect application errors, use a debugging interface. > > NMIs and machine checks are asynchronous interrupts that > don't have to do with what the application is doing, more or less. > Those should not be delivered to task-isolation cores at all, > so we just generate console spew when you set the > task_isolation_debug boot option. I honestly don't know enough > about system management interrupts to comment on that, > though again, I would hope one can configure the system to > just not deliver them to nohz_full cores, and I think it would > be reasonable to generate some kernel spew if that happens. Hah hah yeah right. On most existing Intel CPUs, you *cannot* configure machine checks to do anything other than broadcast to all cores or cause immediate shutdown. And getting any sort of reasonable control over SMI more or less requires special firmware. > >> 3. You haven't dealt with IPIs. The TLB flush code in particular >> seems like it will break all your assumptions. > > > Again, not a synchronous application error that we are trying > to catch with this signalling mechanism. > > That said it could obviously be a more general application error > (e.g. 
a process with threads on both nohz_full and housekeeping > cores, where the housekeeping core unmaps some memory and > thus requires a TLB flush IPI). But this is covered by the > task_isolation_debug patch for kernel/smp.c. > >> Maybe it would make sense to whack more of the moles before adding a >> big assertion that there aren't any moles any more. > > > Maybe, but I've whacked the ones I know how to whack. > If there are ones I've missed I'm happy to add them in a > subsequent version of this series, or in follow-on patches. > I agree that you can, in principle, catch all the synchronous application errors using this mechanism. I'm saying that catching them seems quite useful, but catching them using a prctl that causes a signal and explicitly does *not* solve the deadline enforcement problem seems to have dubious value in the upstream kernel. You can't catch the asynchronous application errors with this mechanism (or at least your ability to catch them depends on which patch version IIRC), which include calling anything like munmap or membarrier in another thread. --Andy ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <CALCETrVQXwYwhEwbJsvN18w8qD-qVVCQAa8b9RcXD=RmXSqLiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal [not found] ` <CALCETrVQXwYwhEwbJsvN18w8qD-qVVCQAa8b9RcXD=RmXSqLiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-10-27 19:37 ` Chris Metcalf 0 siblings, 0 replies; 159+ messages in thread From: Chris Metcalf @ 2015-10-27 19:37 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 10/22/2015 05:00 PM, Andy Lutomirski wrote: > On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> On 10/21/2015 02:53 PM, Andy Lutomirski wrote: >>> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >>> wrote: >>>> >>>>> From: Andy Lutomirski [mailto:luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org] >>>>> Sent: Wednesday, October 21, 2015 4:43 AM >>>>> To: Chris Metcalf >>>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>>>> configurable signal >>>>> >>>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >>>>> wrote: >>>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>>>> Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >>>>>>> >>>>>>>> Allow userspace to override the default SIGKILL delivered >>>>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>>>> or otherwise synchronously enters the kernel. >>>>>>>> >>>> <snip> >>>>>> It doesn't map SIGKILL to some other signal unconditionally. 
It just >>>>>> allows >>>>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>>>> to be something besides the default SIGKILL. >>>>>> >>>> <snip> >>>>> I still dislike this thing. It seems like a debugging feature being >>>>> implemented using signals instead of existing APIs. I *still* don't >>>>> see why perf can't be used to accomplish your goal. >>>>> >>>> It is not (just) a debugging feature. There are workloads where not >>>> performing an action is much preferred to being late. >>>> >>>> Consider the following artificial but representative scenario: a task >>>> running in strict isolation is controlling a radiotherapy alpha emitter. >>>> The code runs in a tight event loop, reading an MMIO register with >>>> location data, making some calculation and in response writing an >>>> MMIO register that triggers the alpha emitter. As a safety measure, each >>>> trigger is for a specific very short time frame - the alpha emitter >>>> auto stops. >>>> >>>> The code has a strict assumption that no more than X cycles pass between >>>> reading the value and the response and the system is built in >>>> such a way that as long as the code has mastery of the CPU the assumption >>>> holds true. If something breaks this assumption (unplanned >>>> context switch to kernel), what you want to do is just stop in place >>>> rather than fire the alpha emitter X nanoseconds too late. >>>> >>>> This feature lets you say: if the "contract" of isolation is broken, >>>> notify/kill me at once. >>> That's a fair point. It's risky, though, for quite a few reasons. >>> >>> 1. If someone builds an alpha emitter like this, they did it wrong. >>> The kernel should write a trigger *and* a timestamp to the hardware >>> and the hardware should trigger at the specified time if the time is >>> in the future and throw an error if it's in the past. If you need to >>> check that you made the deadline, check the actual desired condition >>> (did you meet the deadline?) not a proxy (did the signal fire?). >> >> Definitely a better hardware design, but as we all know, hardware >> designers too rarely consult the software people who have to >> write the actual code to properly drive the hardware :-) >> >> My canonical example is high-performance userspace network >> drivers, and though dropping a packet is less likely to kill a >> patient, it's still a pretty bad thing if you're trying to design >> a robust appliance. In this case you really want to fix application >> bugs that cause the code to enter the kernel when you think >> you're in the internal loop running purely in userspace. Things >> like unexpected page faults, and third-party code that almost >> never calls the kernel but in some dusty corner it occasionally >> does, can screw up your userspace code pretty badly, and >> mysteriously. The "strict" mode support is not a hypothetical >> insurance policy but a reaction to lots of Tilera customer support >> over the years to folks failing to stay in userspace when they >> thought they were doing the right thing. > But this is *exactly* the case where perf or other out-of-band > debugging could be a much better solution. Perf could notify a > non-isolated thread that an interrupt happened, you'd still drop a > packet or two, but you wouldn't also drop the next ten thousand > packets while handling the signal. There's no reason the signal needs to be delivered to one of the nohz_full cores. If you're setting up to catch these signals rather than have them just SIGKILL you, then you want to run a separate thread on a housekeeping core that is doing a sigwait() or equivalent. I'm not sure why using perf to do this is particularly better; I'm most interested in ensuring that it is easy for applications to set this up if they want it, and perf isn't always super-easy to use. That said, maybe it's easier than I think to do that specific thing, and worth considering doing it that way instead.
Is there an easily-explained way to do what you suggest where perf delivers a signal? I assume you have in mind creating a synthetic sampling perf event and using perf_event_open() to get a file descriptor for it, and waiting with poll or SIGIO? (Too bad perf_event_open isn't supported by glibc and we have to use syscall() to even call it.) Seems complex... >>> 2. This strict mode thing isn't exhaustive. It's missing, at least, >>> coverage for NMI, MCE, and SMI. Sure, you can think that you've >>> disabled all NMI sources, you can try to remember to set the >>> appropriate boot flag that panics on MCE (and hope that you don't get >>> screwed by broadcast MCE on Intel systems before it got fixed >>> (Skylake? Is the fix even available in a released chip?), and, for >>> SMI, good luck... >> >> You are confusing this strict mode support with the debug >> support in patch 07/14. > Nope. I'm confusing this strict mode with what Gilad described: using > strict mode to cause outright shutdown instead of failure to meet a > deadline. Yeah, fair point. We certainly could wire up a mode to deliver a signal or whatever for asynchronous interrupts (which I'm claiming are primarily kernel bugs) instead of just synchronous interrupts (which I'm claiming are application bugs). That could be an additional mode bit for prctl(), e.g. PR_TASK_ISOLATION_DEBUG to align with the task_isolation_debug boot variable that enables the kernel printk spew. > (FWIW, you could also use an ordinary hardware watchdog timer to > promote your failure to meet a deadline to a shutdown. No new kernel > support needed.) But more hardware support is needed; there may not be a handy hardware watchdog timer to use out of the box, and you don't want to require the customer to buy new hardware to support a feature like this if you don't have to. >> Strict mode is for synchronous application errors. 
You might >> be right that there are cases that haven't been covered, but >> certainly most of them are covered on the three platforms that >> are supported in this initial series. (You pointed me to one >> that I would have missed on x86, namely the bounds check >> exception from a bad bounds setup.) I'm pretty confident I >> have all of them for tile, since I know that hardware best, >> and I think we're in good shape for arm64, though I'm still >> coming up to speed on that architecture. > Again, for this definition of strict mode, I still don't see why it's > the right design. If you want to debug your application to detect > application errors, use a debugging interface. Maybe. But we basically want a single notification that the app (and/or maybe kernel) screwed up. Invoking all of perf for that seems like overkill and a signal seems totally adequate, whether for development fixing bugs, or production catching bad things. There are a reasonable number of precedents for doing things this way: SIGPIPE and SIGFPE, to name two. >> NMIs and machine checks are asynchronous interrupts that >> don't have to do with what the application is doing, more or less. >> Those should not be delivered to task-isolation cores at all, >> so we just generate console spew when you set the >> task_isolation_debug boot option. I honestly don't know enough >> about system management interrupts to comment on that, >> though again, I would hope one can configure the system to >> just not deliver them to nohz_full cores, and I think it would >> be reasonable to generate some kernel spew if that happens. > Hah hah yeah right. On most existing Intel CPUs, you *cannot* > configure machine checks to do anything other than broadcast to all > cores or cause immediate shutdown. And getting any sort of reasonable > control over SMI more or less requires special firmware. 
Yeah, as Gilad said, x86 may not be the best choice to run a task-isolated application unless you can really set up those kinds of things to stay off your core. >>> 3. You haven't dealt with IPIs. The TLB flush code in particular >>> seems like it will break all your assumptions. >> >> Again, not a synchronous application error that we are trying >> to catch with this signalling mechanism. >> >> That said it could obviously be a more general application error >> (e.g. a process with threads on both nohz_full and housekeeping >> cores, where the housekeeping core unmaps some memory and >> thus requires a TLB flush IPI). But this is covered by the >> task_isolation_debug patch for kernel/smp.c. >> >>> Maybe it would make sense to whack more of the moles before adding a >>> big assertion that there aren't any moles any more. >> >> Maybe, but I've whacked the ones I know how to whack. >> If there are ones I've missed I'm happy to add them in a >> subsequent version of this series, or in follow-on patches. >> > I agree that you can, in principle, catch all the synchronous > application errors using this mechanism. I'm saying that catching > them seems quite useful, but catching them using a prctl that causes a > signal and explicitly does *not* solve the deadline enforcement > problem seems to have dubious value in the upstream kernel. When you say "does not solve the deadline enforcement problem", I'm not sure what point you're making. The application presumably can meet its own deadlines when it's not interrupted; the intent here is to notice when the kernel gets in its way and notify it. Granted you could add separate mechanisms to create deadlines within the application, but that feels like a separate layer that may or may not be desired for any given application. 
> You can't catch the asynchronous application errors with this > mechanism (or at least your ability to catch them depends on which > patch version IIRC), which include calling anything like munmap or > membarrier in another thread. Yes, and munmap in another thread is certainly an application bug at some level, so that's another reason to allow using the same mechanism to notify the application of an asynchronous interrupt. I'll add that for the next version of the patch series. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 159+ messages in thread
[parent not found: <CALCETrVuE_VCk-7_VMJ-orL8pg+0F5vq6qvt4SfgXzt_MRr-SQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* RE: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal [not found] ` <CALCETrVuE_VCk-7_VMJ-orL8pg+0F5vq6qvt4SfgXzt_MRr-SQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-10-24 9:16 ` Gilad Ben Yossef 0 siblings, 0 replies; 159+ messages in thread From: Gilad Ben Yossef @ 2015-10-24 9:16 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Hi Andy, Thanks for the feedback. > From: Andy Lutomirski [mailto:luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org] > Sent: Wednesday, October 21, 2015 9:53 PM > To: Gilad Ben Yossef > Cc: Chris Metcalf; Steven Rostedt; Ingo Molnar; Peter Zijlstra; Andrew > Morton; Rik van Riel; Tejun Heo; Frederic Weisbecker; Thomas Gleixner; Paul > E. McKenney; Christoph Lameter; Viresh Kumar; Catalin Marinas; Will Deacon; > linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Linux API; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode > configurable signal > > >> >> On Tue, 20 Oct 2015 16:36:04 -0400 > >> >> Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: > >> >> > >> >>> Allow userspace to override the default SIGKILL delivered > >> >>> when a task_isolation process in STRICT mode does a syscall > >> >>> or otherwise synchronously enters the kernel. > >> >>> > > <snip> > >> > > >> > It doesn't map SIGKILL to some other signal unconditionally. It just allows > >> > the "hey, you broke the STRICT contract and entered the kernel" signal > >> > to be something besides the default SIGKILL. > >> > > >> > > > > <snip> > >> > >> I still dislike this thing.
It seems like a debugging feature being > >> implemented using signals instead of existing APIs. I *still* don't > >> see why perf can't be used to accomplish your goal. > >> > > > > It is not (just) a debugging feature. There are workloads where not > performing an action is much preferred to being late. > > > > Consider the following artificial but representative scenario: a task running > in strict isolation is controlling a radiotherapy alpha emitter. > > The code runs in a tight event loop, reading an MMIO register with location > data, making some calculation and in response writing an > > MMIO register that triggers the alpha emitter. As a safety measure, each > trigger is for a specific very short time frame - the alpha emitter > > auto stops. > > > > The code has a strict assumption that no more than X cycles pass between > reading the value and the response and the system is built in > > such a way that as long as the code has mastery of the CPU the assumption > holds true. If something breaks this assumption (unplanned > > context switch to kernel), what you want to do is just stop in place > > rather than fire the alpha emitter X nanoseconds too late. > > > > This feature lets you say: if the "contract" of isolation is broken, notify/kill > me at once. > > That's a fair point. It's risky, though, for quite a few reasons. > > 1. If someone builds an alpha emitter like this, they did it wrong. > The kernel should write a trigger *and* a timestamp to the hardware > and the hardware should trigger at the specified time if the time is > in the future and throw an error if it's in the past. If you need to > check that you made the deadline, check the actual desired condition > (did you meet the deadline?) not a proxy (did the signal fire?). > As I wrote above, it is an *artificial* scenario. Yes, hardware and systems can be designed better, but they are not always, and in these kinds of systems, you really do want to have double or triple checks.
Knowing such systems, even IF the hardware was designed as you specified (and I agree it should!) you would still add the software protection. > 2. This strict mode thing isn't exhaustive. It's missing, at least, > coverage for NMI, MCE, and SMI. Sure, you can think that you've > disabled all NMI sources, you can try to remember to set the > appropriate boot flag that panics on MCE (and hope that you don't get > screwed by broadcast MCE on Intel systems before it got fixed > (Skylake? Is the fix even available in a released chip?), and, for > SMI, good luck... You are right - it isn't exhaustive. It is one piece in a bigger puzzle. Many of the other bits are platform specific and some of them have been dealt with on the platforms that care about these things. Yes, we don't have dark magic to detect SMIs. Is that a reason to penalize platforms where there is no such thing as SMI? > 3. You haven't dealt with IPIs. The TLB flush code in particular > seems like it will break all your assumptions. > But we have - in the general context. Consider this patch set from 2012 - https://lwn.net/Articles/479510/ Not finished for sure. But what we have is now useful enough that it is used in the real world for different workloads on different platforms, from packet processing, through HPC to high frequency trading. > Maybe it would make sense to whack more of the moles before adding a > big assertion that there aren't any moles any more. > Hmm, maybe you are reading too much into this specific feature - it's a "notify me, the application, if I asked you to do something that violates my previous request to be isolated", rather than "notify me whenever isolation is broken". Does that make more sense?
Thanks, Gilad ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-10-20 20:35 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf ` (2 preceding siblings ...) 2015-10-20 20:36 ` [PATCH v8 06/14] task_isolation: provide strict mode configurable signal Chris Metcalf @ 2015-10-21 12:39 ` Peter Zijlstra 2015-10-22 20:31 ` Chris Metcalf 3 siblings, 1 reply; 159+ messages in thread From: Peter Zijlstra @ 2015-10-21 12:39 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Can you *please* start a new thread with each posting? This is absolutely unmanageable. ^ permalink raw reply [flat|nested] 159+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full

From: Chris Metcalf @ 2015-10-22 20:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On 10/21/2015 08:39 AM, Peter Zijlstra wrote:
> Can you *please* start a new thread with each posting?
>
> This is absolutely unmanageable.

I've been explicitly threading the multiple patch series on purpose
due to this text in "git help send-email":

    --in-reply-to=<identifier>
        Make the first mail (or all the mails with --no-thread) appear
        as a reply to the given Message-Id, which avoids breaking
        threads to provide a new patch series. The second and
        subsequent emails will be sent as replies according to the
        --[no-]chain-reply-to setting.

        So for example when --thread and --no-chain-reply-to are
        specified, the second and subsequent patches will be replies
        to the first one like in the illustration below where
        [PATCH v2 0/3] is in reply to [PATCH 0/2]:

        [PATCH 0/2] Here is what I did...
          [PATCH 1/2] Clean up and tests
          [PATCH 2/2] Implementation
          [PATCH v2 0/3] Here is a reroll
            [PATCH v2 1/3] Clean up
            [PATCH v2 2/3] New tests
            [PATCH v2 3/3] Implementation

It sounds like this is exactly the behavior you are objecting to.
It's all one to me, because I am not seeing these emails come up in
some hugely nested fashion, but just viewing the responses that I
haven't yet triaged away.

So is your recommendation to avoid the git send-email --in-reply-to
option? If so, would you recommend including an lkml.kernel.org link
in the cover letter pointing to the previous version, or is there
something else that would make your workflow better?

If you think this is actually the wrong thing, is it worth trying to
fix the git docs to deprecate this option? Or is it more a question of
scale, and the 80-odd patches that I've posted so far just pushed an
otherwise good system into a more dysfunctional mode? If so, perhaps
some text in Documentation/SubmittingPatches would be helpful here.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
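For reference, the posting style being discussed corresponds roughly to an invocation along these lines (a hypothetical sketch: the Message-Id, patch directory, and recipient address are placeholders, not the values actually used for this series):

```shell
# Hypothetical sketch: with --thread and --no-chain-reply-to, patches
# 1..N are sent as replies to the cover letter, and --in-reply-to
# additionally chains the whole new series onto the previous
# posting's cover letter -- the behavior Peter is objecting to.
git send-email --thread --no-chain-reply-to \
    --in-reply-to='<cover-letter-message-id-of-previous-version@example.com>' \
    --to=linux-kernel@vger.kernel.org \
    v8/*.patch
```

Dropping the --in-reply-to option is what makes a new posting start its own thread.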
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full

From: Frederic Weisbecker @ 2015-10-23 2:33 UTC (permalink / raw)
To: Chris Metcalf
Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:
> On 10/21/2015 08:39 AM, Peter Zijlstra wrote:
> > Can you *please* start a new thread with each posting?
> >
> > This is absolutely unmanageable.
>
> I've been explicitly threading the multiple patch series on purpose
> due to this text in "git help send-email":
> [...]
> It sounds like this is exactly the behavior you are objecting to.
> It's all one to me, because I am not seeing these emails come up in
> some hugely nested fashion, but just viewing the responses that I
> haven't yet triaged away.

I personally (and I think this is the general LKML behaviour) use
in-reply-to when I post a single patch that is a fix for a bug, or a
small enhancement, discussed on some thread. It works well as it fits
the conversation inline.

But for anything that requires significant changes, namely a patchset,
and that includes a new version of such a patchset, it's usually
better to create a new thread. Otherwise the thread becomes an
infinite mess and it eventually expands beyond the mail client's
columns.

Thanks.
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full

From: Peter Zijlstra @ 2015-10-23 8:49 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Fri, Oct 23, 2015 at 04:33:02AM +0200, Frederic Weisbecker wrote:
> On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:
> > I've been explicitly threading the multiple patch series on purpose
> > due to this text in "git help send-email":
> > [...]
> > It sounds like this is exactly the behavior you are objecting to.
> > It's all one to me, because I am not seeing these emails come up in
> > some hugely nested fashion, but just viewing the responses that I
> > haven't yet triaged away.

Yeah, the git people are not per definition following lkml standards,
even though git originated 'here'. They, for a long time, also
defaulted to --chain-reply-to, which is absolutely insane.

> I personally (and I think this is the general LKML behaviour) use
> in-reply-to when I post a single patch that is a fix for a bug, or a
> small enhancement, discussed on some thread. It works well as it fits
> the conversation inline.
>
> But for anything that requires significant changes, namely a patchset,
> and that includes a new version of such a patchset, it's usually
> better to create a new thread. Otherwise the thread becomes an
> infinite mess and it eventually expands beyond the mail client's
> columns.

Agreed, although for single patches I use my regular mailer (mutt) and
can't be arsed with tools. Also, I don't actually use git-send-email
ever, so I might be biased.
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full

From: Frederic Weisbecker @ 2015-10-23 13:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Fri, Oct 23, 2015 at 10:49:51AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2015 at 04:33:02AM +0200, Frederic Weisbecker wrote:
> > I personally (and I think this is the general LKML behaviour) use
> > in-reply-to when I post a single patch that is a fix for a bug, or a
> > small enhancement, discussed on some thread. It works well as it fits
> > the conversation inline.
> >
> > But for anything that requires significant changes, namely a patchset,
> > and that includes a new version of such a patchset, it's usually
> > better to create a new thread. Otherwise the thread becomes an
> > infinite mess and it eventually expands beyond the mail client's
> > columns.
>
> Agreed, although for single patches I use my regular mailer (mutt) and
> can't be arsed with tools.

Yeah, me too; otherwise I can't write a text before the patch
changelog.

> Also I don't actually use git-send-email ever, so I might be biased.

Ah, it's just so convenient that I wrote my scripts on top of it :-)
But surely many mail sender libraries can post patches just fine as
well.
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full

From: Peter Zijlstra @ 2015-10-23 9:04 UTC (permalink / raw)
To: Chris Metcalf
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:
> So is your recommendation to avoid the git send-email --in-reply-to
> option? If so, would you recommend including an lkml.kernel.org
> link in the cover letter pointing to the previous version, or
> is there something else that would make your workflow better?

Mostly people don't bother with pointing to previous versions, and if
they have the same 0/x subject, they're typically trivial to find
anyway. But if you really feel the need for explicit references to
previous versions, then yes, lkml.kernel.org/r/ links are preferred
over pretty much anything else, I think.

> If you think this is actually the wrong thing, is it worth trying
> to fix the git docs to deprecate this option?

As said in the other email, git has different standards than lkml. By
now we're just one of many, many users of git.

> Or is it more a question
> of scale, and the 80-odd patches that I've posted so far just pushed
> an otherwise good system into a more dysfunctional mode? If so,
> perhaps some text in Documentation/SubmittingPatches would be
> helpful here.

Documentation/email-clients.txt, maybe.
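The workflow recommended here, a fresh thread per reroll with an optional lkml.kernel.org link back to the previous version, might look roughly like this (a hypothetical sketch; the version number, patch count, output directory, and Message-Id are placeholders):

```shell
# Hypothetical sketch: generate the reroll with a cover letter, note
# the previous posting by URL in the cover letter body, and send
# WITHOUT --in-reply-to so the new version starts its own thread.
git format-patch --cover-letter -v9 -o v9/ HEAD~14

# Edit v9/v9-0000-cover-letter.patch and add a line such as:
#   v8: https://lkml.kernel.org/r/<message-id-of-v8-cover-letter>

git send-email --thread --no-chain-reply-to \
    --to=linux-kernel@vger.kernel.org \
    v9/*.patch
```

The -v (--reroll-count) option also stamps the subjects as [PATCH v9 n/14], which keeps the new thread trivially findable from the old one by subject, as noted above.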
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full

From: Theodore Ts'o @ 2015-10-23 11:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Fri, Oct 23, 2015 at 11:04:59AM +0200, Peter Zijlstra wrote:
> > If you think this is actually the wrong thing, is it worth trying
> > to fix the git docs to deprecate this option?
>
> As said in the other email; git has different standards than lkml. By
> now we're just one of many many users of git.

Even git developers will create a new thread for a large (more than
2-3 patches) patch set. However, for a single patch, people have
chained the -v3 version of the draft --- not to the v2 version,
though, but to the review of the patch. And I've seen that behavior
on some LKML lists, and I'm certainly fine with it on linux-ext4.

But if you have a huge patch series, and you keep chaining it on to
the 8th, 10th, 22nd version, it certainly will get **very** annoying
for some MUAs.

The bottom line is that you should use common sense, and it can be
hard to distill every last bit of what should be "common sense" into
a rule that is followed by robots or a perl script. (Which is one of
the reasons why I'm not fond of the philosophy that every single last
checkpatch warning or error should result in a "cleanup" patch, but
that's another issue.)

Cheers,

- Ted